Running importance analysis with Hail
=====================================

This is a *VariantSpark* example notebook.


One of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs. control) using random forest Gini importance.
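To make the idea of Gini importance concrete, the following is a minimal pure-Python sketch of the standard Gini impurity decrease for a single split. It is not VariantSpark's implementation; the function names and the toy data are hypothetical and chosen only for illustration. In a random forest, decreases of this kind are typically accumulated per variable over all splits and trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease obtained by splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Hypothetical toy data: a binary response and alt-allele counts
# at two variants for six samples.
response = [0, 0, 0, 1, 1, 1]
informative = [0, 0, 0, 1, 2, 1]    # separates the two classes perfectly
uninformative = [0, 1, 0, 1, 0, 1]  # nearly unrelated to the response

print(gini_decrease(informative, response, threshold=0))
print(gini_decrease(uninformative, response, threshold=0))
```

The variant that separates the classes yields a much larger impurity decrease than the unrelated one, which is why it would rank higher in an importance table.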

`chr22_1000.vcf` is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

`chr22-labels-hail.csv` is a CSV file with sample response variables (labels). Each label is the number of alternative alleles a sample carries at a specific genomic position. For example, column `x22_16050408` holds labels derived from the variants at chromosome 22, position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label `x22_16050408`.

Both data sets are located in the `..\data` directory.
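As an illustration of how such a label relates to the genotype calls, here is a small pure-Python sketch that counts alternative alleles in VCF-style `GT` strings. The sample names and genotype values are hypothetical, and the helper is a simplification: it counts a missing allele (`.`) as zero rather than propagating a missing value.

```python
def n_alt_alleles(gt):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'."""
    alleles = gt.replace("|", "/").split("/")
    # "0" is the reference allele; "." is a missing call (treated as non-alt here).
    return sum(1 for allele in alleles if allele not in ("0", "."))


# Hypothetical genotype calls at a single position for four samples.
genotypes = {"S1": "0|0", "S2": "0|1", "S3": "1|1", "S4": "1|0"}

# The label of each sample is its alt-allele count at that position.
labels = {sample: n_alt_alleles(gt) for sample, gt in genotypes.items()}
print(labels)  # {'S1': 0, 'S2': 1, 'S3': 2, 'S4': 1}
```

Because a label built this way is a direct function of the genotype at one position, that position should dominate the importance ranking when the label is used as the response.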

This notebook demonstrates how to run importance analysis on these data with *VariantSpark* Hail integration.

Step 1: Create a `HailContext` using the `SparkContext` object (here injected as `sc`):

In [0]:
import hail as hl
import varspark.hail as vshl
vshl.init(sc, idempotent=True)

Step 2: Load the Hail variant dataset `vds` from a sample `.vcf` file.

In [0]:
vds = hl.import_vcf('dbfs:/databricks/Filestore/chr22_1000.vcf')

Step 3: Load the labels into the Hail table `labels`.

In [0]:
labels = hl.import_table('dbfs:/databricks/Filestore/chr22-labels-hail.csv',
                         impute=True, delimiter=",").key_by('sample')

Step 4: Annotate dataset samples with labels.

In [0]:
vds = vds.annotate_cols(label = labels[vds.s])
vds.cols().show(3)

Step 5: Build the random forest model with `label.x22_16050408` as the response variable.

In [0]:
rf_model = vshl.random_forest_model(y=vds.label['x22_16050408'],
                                    x=vds.GT.n_alt_alleles(),
                                    seed=13, mtry_fraction=0.05,
                                    min_node_size=5, max_depth=10)
rf_model.fit_trees(100, 50)

Step 6: Display the results: print the OOB error and the calculated variable importance.

In [0]:
print("OOB error: %s" % rf_model.oob_error())
impTable = rf_model.variable_importance()
impTable.order_by(hl.desc(impTable.importance)).show(10)

Optionally, release the resources (RAM) associated with the model.

In [0]:
#rf_model.release()

For more information on using *VariantSpark*, its Python API, and the Hail integration, please visit the [documentation](http://variantspark.readthedocs.io/en/latest/).