Running importance analysis with the Python API
===============================================

This is a *VariantSpark* example notebook.


One of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs control) using random forest Gini importance.
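To make the idea of Gini importance more concrete, here is a minimal sketch of the quantity it is built on: the drop in Gini impurity produced by a split. This is plain Python for illustration only and is not VariantSpark's implementation; the helper names `gini_impurity` and `impurity_decrease` are made up for this example, and a random forest would additionally accumulate and normalise these decreases over many trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy data (hypothetical): two controls (0) and two cases (1).
parent = [0, 0, 1, 1]

# A variant that separates cases from controls removes all impurity.
perfect_split = impurity_decrease(parent, [0, 0], [1, 1])

# A variant unrelated to the response leaves the impurity unchanged.
useless_split = impurity_decrease(parent, [0, 1], [0, 1])
```

Variables that repeatedly produce large decreases across the trees of the forest end up with high importance scores.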

The `chr22_1000.vcf` is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

`chr22-labels.csv` is a CSV file with sample response variables (labels). The labels directly represent the number of alternative alleles for each sample at a specific genomic position. For example, column 22_16050408 has labels derived from the variants at chromosome 22, position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label 22_16050408.
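As an illustration of what "number of alternative alleles" means, the sketch below counts non-reference alleles in a VCF genotype (`GT`) string such as `0|1`. It is only an assumption about how labels of this kind could be derived, not the script that produced `chr22-labels.csv`; the function name `alt_allele_count` and the sample names are hypothetical.

```python
import re


def alt_allele_count(genotype):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'.

    '0' is the reference allele and '.' is a missing call; neither is counted.
    """
    alleles = re.split(r"[|/]", genotype)
    return sum(1 for allele in alleles if allele not in ("0", "."))


# Hypothetical genotypes of three samples at a single position.
genotypes = {"sample_a": "0|0", "sample_b": "0|1", "sample_c": "1|1"}

# One label per sample: 0, 1 or 2 alternative alleles for a diploid call.
labels_by_sample = {
    sample: alt_allele_count(gt) for sample, gt in genotypes.items()
}
```

Because such a label is a direct function of the genotype at one position, that position should dominate the importance ranking.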

Both data sets are located in the `../data` directory.

This notebook demonstrates how to run importance analysis on these data with the *VariantSpark* Python API.

Step 1: Create a `VarsparkContext` using a `SparkSession` object (here injected as `spark`):

In [1]:
from varspark import VarsparkContext
vc = VarsparkContext(spark, silent = True)

Step 2: Load the features (`features`) and labels (`labels`) from the data files:

In [2]:
features = vc.import_vcf('../data/chr22_1000.vcf')
labels = vc.load_label('../data/chr22-labels.csv', '22_16050408')

Step 3: Run the importance analysis and retrieve the most important variables:

In [3]:
ia = features.importance_analysis(labels, seed=13, n_trees=500, batch_size=20)
top_variables = ia.important_variables()

Step 4: Display the results:

In [4]:
print("%s\t%s" % ('Variable', 'Importance'))
for var_and_imp in top_variables:
    print("%s\t%s" % var_and_imp)

Variable	Importance
22_16050408	0.000863450363985
22_16051107	0.000784308342255
22_16053197	0.000653021887401
22_16051480	0.000632510000666
22_16050678	0.000595007893431
22_16052656	0.000563403500509
22_16051882	0.000552010380163
22_16053435	0.000532201151905
22_16053797	0.000502325268641
22_16052838	0.000482158614379
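
As expected, the variable with the highest importance is the position the label was derived from. A minimal sketch of checking this programmatically is shown below. It hard-codes the first three rows of the output above rather than calling VariantSpark, and the helper `parse_variable` and the names `reported` and `label_column` are hypothetical; the `<chromosome>_<position>` identifier format is assumed from the output shown here.

```python
def parse_variable(variable_id):
    """Split an identifier like '22_16050408' into (chromosome, position)."""
    chrom, pos = variable_id.split("_")
    return chrom, int(pos)


# First three (variable, importance) pairs copied from the output above.
reported = [
    ("22_16050408", 0.000863450363985),
    ("22_16051107", 0.000784308342255),
    ("22_16053197", 0.000653021887401),
]

# Name of the label column used in Step 2.
label_column = "22_16050408"

# Pick the variable with the largest importance and compare it with the label.
best_variable, best_importance = max(reported, key=lambda pair: pair[1])
top_hit_matches_label = parse_variable(best_variable) == parse_variable(label_column)
```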


For more information on using *VariantSpark* and the Python API, please visit the [documentation](http://variantspark.readthedocs.io/en/latest/).