Running importance analysis with Python API
===========================================

This is a *VariantSpark* example notebook.


One of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs. control) using random forest Gini importance.
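To make "Gini importance" concrete, the sketch below computes the Gini impurity of a set of binary labels and the impurity decrease produced by one candidate split. A random forest accumulates such decreases per variable across its trees. This is a generic illustration of the idea, not VariantSpark's implementation; the function names and the toy data are assumptions made up for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a label list: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(labels, goes_left):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [y for y, flag in zip(labels, goes_left) if flag]
    right = [y for y, flag in zip(labels, goes_left) if not flag]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (made up): 1 = case, 0 = control.
toy_labels = [1, 1, 1, 0, 0, 0]
# Alternative-allele counts per sample at one hypothetical variant.
toy_genotypes = [1, 2, 1, 0, 0, 0]

# Split on "carries at least one alternative allele".
split = [g > 0 for g in toy_genotypes]
decrease = impurity_decrease(toy_labels, split)
print(decrease)
```

A variant whose split separates cases from controls well yields a large decrease; one unrelated to the response yields a decrease close to zero.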

`chr22_1000.vcf` is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

`chr22-labels.csv` is a CSV file with sample response variables (labels). The labels directly represent the number of alternative alleles for each sample at a specific genomic position. For example, column `22_16050408` has labels derived from variants at chromosome 22, position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label `22_16050408`.
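As a rough illustration of how a label of this kind could be derived, the sketch below counts the alternative alleles in VCF genotype (`GT`) strings. The helper `alt_allele_count` and the sample names are hypothetical and are not part of VariantSpark or of the data files.

```python
import re


def alt_allele_count(genotype):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'."""
    # Alleles are separated by '|' (phased) or '/' (unphased).
    alleles = re.split(r"[|/]", genotype)
    # '0' is the reference allele and '.' is a missing call;
    # any other index refers to an alternative allele.
    return sum(1 for allele in alleles if allele not in ("0", "."))


# Hypothetical samples with genotypes at a single position.
example_genotypes = {"sample_a": "0|0", "sample_b": "0|1", "sample_c": "1|1"}
example_labels = {name: alt_allele_count(gt) for name, gt in example_genotypes.items()}
print(example_labels)  # {'sample_a': 0, 'sample_b': 1, 'sample_c': 2}
```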

Both data sets are located in the `../data` directory. The cells below load them from `dbfs:/databricks/Filestore/`.

This notebook demonstrates how to run importance analysis on these data with the *VariantSpark* Python API.

Step 1: Create a Spark session with the VariantSpark jar attached.

In [0]:
import varspark as vs
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.jars', vs.find_jar()).getOrCreate()

Step 2: Create a `VarsparkContext` using the `SparkSession` object (here injected as `spark`):

In [0]:
vc = vs.VarsparkContext(spark, silent=True)

Step 3: Load the features (`features`) and labels (`labels`) from the data files.

In [0]:
features = vc.import_vcf('dbfs:/databricks/Filestore/chr22_1000.vcf')
labels = vc.load_label('dbfs:/databricks/Filestore/chr22-labels.csv', '22_16050408')

Step 4: Run the importance analysis and retrieve top important variables:

In [0]:
ia = features.importance_analysis(labels, seed=13, n_trees=500, batch_size=20)
top_variables = ia.important_variables()

Step 5: Display the results.

In [0]:
print("%s\t%s" % ('Variable', 'Importance'))
for var_and_imp in top_variables:
    print("%s\t%s" % var_and_imp)
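If the output needs to be ranked or truncated before printing, a small helper along the following lines may be useful. It only assumes what the loop above already relies on, namely that each entry is a `(variable, importance)` pair. The helper `format_importances` and the stand-in values are made up for illustration and are not VariantSpark output.

```python
def format_importances(pairs, limit=10):
    """Return a tab-separated table of the `limit` highest-scoring variables."""
    # Sort by importance, largest first, and keep the top `limit` entries.
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]
    lines = ["%-12s\t%s" % ("Variable", "Importance")]
    lines += ["%-12s\t%.4f" % (name, importance) for name, importance in ranked]
    return "\n".join(lines)


# Stand-in (variable, importance) pairs, invented for this example.
example_variables = [("variant_a", 0.0004), ("variant_b", 0.0012), ("variant_c", 0.0007)]
print(format_importances(example_variables, limit=2))
```

In the notebook itself, `top_variables` would be passed in place of `example_variables`.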

For more information on using *VariantSpark* and the Python API, please visit the [documentation](http://variantspark.readthedocs.io/en/latest/).