Running importance analysis with Hail
=====================================

This is a *VariantSpark* example notebook.


One of the main applications of VariantSpark is the discovery of genomic variants that are correlated with a response variable (e.g. case vs. control), using random forest Gini importance.
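
To make the idea behind Gini importance more concrete, the following is a minimal pure-Python sketch of the impurity decrease obtained by splitting samples on a single variant. The function names and the toy data are hypothetical, and this is not VariantSpark's implementation; how VariantSpark aggregates and scales these decreases across trees is not shown in this notebook.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((float(c) / n) ** 2 for c in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = float(len(labels))
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Hypothetical toy data: alternative allele counts for six samples
# and a binary response (e.g. 0 = control, 1 = case).
genotypes = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 0, 1, 1, 1]

decrease = gini_decrease(genotypes, labels, threshold=0)
print("Gini decrease for split at genotype <= 0: %s" % decrease)
```

In a random forest, a variant that yields large decreases in many trees accumulates a high importance score, which is the quantity this notebook ranks variants by.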

`chr22_1000.vcf` is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

`chr22-labels.csv` is a CSV file of sample response variables (labels). Each label column holds the number of alternative alleles that each sample carries at a specific genomic position. For example, column `22_16050408` contains labels derived from the variant at chromosome 22, position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label `22_16050408`.
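
As an illustration of what "number of alternative alleles" means, here is a small sketch that counts non-reference alleles in a VCF genotype (`GT`) string. The helper name `alt_allele_count` is hypothetical, and whether the label file was actually generated this way is an assumption; the sketch only shows how a genotype maps to a count of 0, 1 or 2 for a diploid call.

```python
import re


def alt_allele_count(gt):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'.

    Alleles are separated by '|' (phased) or '/' (unphased); '0' is the
    reference allele and '.' is a missing call, so neither is counted.
    """
    alleles = re.split(r"[|/]", gt)
    return sum(1 for allele in alleles if allele not in ("0", "."))


for gt in ("0|0", "0|1", "1|1", "./."):
    print("%s -> %d" % (gt, alt_allele_count(gt)))
```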

Both data sets are located in the `../data` directory.

This notebook demonstrates how to run importance analysis on these data using the *VariantSpark* Hail integration.

Step 1: Create a `HailContext` from the `SparkContext` object (injected here as `sc`):

In [1]:
from hail import HailContext
# Importing varspark.hail makes importance_analysis() available on Hail datasets
import varspark.hail
hc = HailContext(sc)

Running on Apache Spark version 2.2.1
SparkUI available at http://140.253.176.47:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-74bf1eb


Step 2: Load the Hail variant dataset `vds` from the sample `.vcf` file.

In [2]:
vds = hc.import_vcf('../data/chr22_1000.vcf')

2018-07-11 12:05:17 Hail: INFO: No multiallelics detected.
2018-07-11 12:05:17 Hail: INFO: Coerced almost-sorted dataset


Step 3: Load the labels into a Hail table `labels`.

In [3]:
# Key the table by the `sample` column so it can be matched to the VCF samples
labels = hc.import_table('../data/chr22-labels.csv', key="sample", impute=True, delimiter=",")

2018-07-11 12:05:17 Hail: INFO: Reading table to impute column types
2018-07-11 12:05:17 Hail: INFO: Finished type imputation
  Loading column `sample' as type String (imputed)
  Loading column `22_16050408' as type Int (imputed)
  Loading column `22_16050612' as type Int (imputed)
  Loading column `22_16050678' as type Int (imputed)
  Loading column `22_16050984' as type Int (imputed)
  Loading column `22_16051107' as type Int (imputed)
  Loading column `22_16051249' as type Int (imputed)
  Loading column `22_16051347' as type Int (imputed)
  Loading column `22_16051453' as type Int (imputed)
  Loading column `22_16051477' as type Int (imputed)
  Loading column `22_16051480' as type Int (imputed)


Step 4: Annotate the dataset samples with the labels.

In [4]:
vds = vds.annotate_samples_table(labels, root="sa.pheno")

Step 5: Run the importance analysis and retrieve the important variants (as a Hail table):

In [5]:
# Use the label column `22_16050408` as the response variable
via = vds.importance_analysis("sa.pheno.`22_16050408`", n_trees=500, seed=13L, batch_size=20)
iv = via.important_variants()

Step 6: Display the results.

In [6]:
print("Random forest OOB error: %s" % via.oob_error)

Random forest OOB error: 0.014652014652
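
The reported value means that roughly 1.5% of the samples were misclassified out of bag. As a rough illustration of what an out-of-bag (OOB) error measures, the sketch below computes the fraction of samples misclassified by a majority vote over only those trees that did not see the sample during training. The function name and the toy votes are hypothetical, and this is a generic definition rather than VariantSpark's own computation, which is not shown in this notebook.

```python
from collections import Counter


def oob_error(labels, oob_votes):
    """Fraction of samples whose out-of-bag majority vote differs from the label.

    labels[i] is the true class of sample i; oob_votes[i] lists the
    predictions of the trees for which sample i was out of bag.
    Samples that were never out of bag are skipped.
    """
    wrong = 0
    counted = 0
    for truth, votes in zip(labels, oob_votes):
        if not votes:
            continue
        counted += 1
        predicted = Counter(votes).most_common(1)[0][0]
        if predicted != truth:
            wrong += 1
    return float(wrong) / counted if counted else float("nan")


# Hypothetical toy data: four samples, the last one was never out of bag.
toy_labels = [0, 1, 2, 1]
toy_votes = [[0, 0, 1], [1, 1], [2, 1, 1], []]
toy_error = oob_error(toy_labels, toy_votes)
print("Toy OOB error: %s" % toy_error)
```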


In [7]:
iv.to_dataframe().show()

+--------------+-------------+-----------+------------------+--------------------+
|variant.contig|variant.start|variant.ref|variant.altAlleles|          importance|
+--------------+-------------+-----------+------------------+--------------------+
|            22|     16050408|          T|           [[T,C]]|9.040375514083192E-4|
|            22|     16051480|          T|           [[T,C]]|8.397590270282872E-4|
|            22|     16050678|          C|           [[C,T]]|6.948530160834497E-4|
|            22|     16052838|          T|           [[T,A]]|6.855251834675504E-4|
|            22|     16053197|          G|           [[G,T]]| 6.45022919591987E-4|
|            22|     16051107|          C|           [[C,A]]|6.206310454636674E-4|
|            22|     16053435|          G|           [[G,T]]|5.515279651774938E-4|
|            22|     16052656|          T|           [[T,C]]|5.061636050055417E-4|
|            22|     16053509|          A|           [[A,G]]|4.640749808744204...|
|   
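
As a quick sanity check of the expectation stated at the start of the notebook, the sketch below takes the first few (position, importance) pairs, copied by hand from the output above, and confirms that the variant behind the label column `22_16050408` is the one with the highest importance. The variable names are illustrative; in a live session the values would come from `iv` rather than from a hard-coded list.

```python
# (position, importance) pairs copied from the first rows of the table above
importances = [
    (16050408, 9.040375514083192e-4),
    (16051480, 8.397590270282872e-4),
    (16050678, 6.948530160834497e-4),
    (16052838, 6.855251834675504e-4),
    (16053197, 6.45022919591987e-4),
    (16051107, 6.206310454636674e-4),
]

# The label column name encodes the contig and position it was derived from.
label_column = "22_16050408"
contig, position = label_column.split("_")

# Find the variant with the highest importance score.
top_position, top_importance = max(importances, key=lambda pair: pair[1])
label_variant_is_top = (contig == "22" and int(position) == top_position)

print("Top variant: %s:%d (importance %s)" % (contig, top_position, top_importance))
print("Label variant ranked first: %s" % label_variant_is_top)
```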

For more information on using *VariantSpark*, its Python API, and the Hail integration, please visit the [documentation](http://variantspark.readthedocs.io/en/latest/).