Running importance analysis with Hail
=====================================

This is a *VariantSpark* example notebook.


One of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs. control) using random forest Gini importance.
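
To make "Gini importance" concrete, here is a minimal pure-Python sketch of the quantity a single tree split contributes: the decrease in Gini impurity. It is a generic illustration of the technique, not VariantSpark's implementation; the function names and toy data are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(labels, feature, threshold):
    """Weighted Gini impurity decrease from splitting on feature <= threshold."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Toy data (hypothetical): each feature value is an alternate-allele count.
labels = [0, 0, 1, 1, 1, 0]
informative = [0, 0, 1, 2, 1, 0]    # separates the two classes perfectly
uninformative = [0, 1, 0, 1, 0, 1]  # nearly unrelated to the labels

print(impurity_decrease(labels, informative, 0))
print(impurity_decrease(labels, uninformative, 0))
```

A variable's importance in a forest is typically built up by accumulating such decreases over every split that uses the variable, so variants that separate the response well end up with higher scores.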

The `chr22_1000.vcf` is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

`chr22-labels-hail.csv` is a CSV file with sample response variables (labels). The labels directly represent the number of alternate alleles each sample carries at a specific genomic position. For example, column `x22_16050408` holds labels derived from the variants at position 16050408 on chromosome 22. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label `x22_16050408`.
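
As an illustration of what "number of alternate alleles" means, the sketch below counts non-reference alleles in VCF-style genotype strings. How the labels file was actually generated is not described here, so the helper name, the sample names and the genotype values are all hypothetical.

```python
def n_alt_alleles(genotype):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'.

    Simplification: missing alleles ('.') are simply not counted.
    """
    alleles = genotype.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


# Hypothetical genotype calls at a single position for three samples
genotypes = {"sampleA": "0|0", "sampleB": "0|1", "sampleC": "1|1"}
labels = {sample: n_alt_alleles(gt) for sample, gt in genotypes.items()}
print(labels)  # {'sampleA': 0, 'sampleB': 1, 'sampleC': 2}
```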

Both data sets are located in the `../data` directory.

This notebook demonstrates how to run importance analysis on these data with *VariantSpark* Hail integration.

Step 1: Initialize Hail with the *VariantSpark* extensions by calling `vshl.init()`:

In [1]:
import hail as hl
import varspark.hail as vshl
vshl.init()

using variant-spark jar at '/Users/szu004/dev/VariantSpark/target/variant-spark_2.11-0.3.0-SNAPSHOT-all.jar'
using hail jar at '/Users/szu004/miniconda3/envs/vs-dev3.6/lib/python3.6/site-packages/hail/hail-all-spark.jar'
using hail jar at /Users/szu004/miniconda3/envs/vs-dev3.6/lib/python3.6/site-packages/hail/hail-all-spark.jar
Running on Apache Spark version 2.4.1
SparkUI available at http://192.168.1.10:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.16-6da0d3571629
LOGGING: writing to /Users/szu004/dev/VariantSpark/examples/hail-20201218-1127-0.2.16-6da0d3571629.log


Step 2: Load the Hail variant dataset `vds` from the sample `.vcf` file.

In [2]:
vds = hl.import_vcf('../data/chr22_1000.vcf')

Step 2 (continued): Load the labels into the Hail table `labels`.

In [3]:
labels = hl.import_table('../data/chr22-labels-hail.csv', impute=True, delimiter=',').key_by('sample')

2020-12-18 11:27:17 Hail: INFO: Reading table to impute column types
2020-12-18 11:27:17 Hail: INFO: Finished type imputation
  Loading column 'sample' as type 'str' (imputed)
  Loading column 'x22_16050408' as type 'int32' (imputed)
  Loading column 'x22_16050612' as type 'str' (imputed)
  Loading column 'x22_16050678' as type 'str' (imputed)
  Loading column 'x22_16050984' as type 'int32' (imputed)
  Loading column 'x22_16051107' as type 'int32' (imputed)
  Loading column 'x22_16051249' as type 'int32' (imputed)
  Loading column 'x22_16051347' as type 'int32' (imputed)
  Loading column 'x22_16051453' as type 'int32' (imputed)
  Loading column 'x22_16051477' as type 'int32' (imputed)
  Loading column 'x22_16051480' as type 'int32' (imputed)
2020-12-18 11:27:17 Hail: WARN: Name collision: field 'sample' already in object dict. 
  This field must be referenced with __getitem__ syntax: obj['sample']


Step 3: Annotate dataset samples with labels.

In [4]:
vds = vds.annotate_cols(label = labels[vds.s])
vds.cols().show(3)

2020-12-18 11:27:17 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2020-12-18 11:27:18 Hail: INFO: Coerced almost-sorted dataset
2020-12-18 11:27:19 Hail: INFO: Coerced sorted dataset
2020-12-18 11:27:19 Hail: INFO: Coerced sorted dataset


s,label.x22_16050408,label.x22_16050612,label.x22_16050678,label.x22_16050984,label.x22_16051107,label.x22_16051249,label.x22_16051347,label.x22_16051453,label.x22_16051477,label.x22_16051480
str,int32,str,str,int32,int32,int32,int32,int32,int32,int32
"""HG00096""",0,"""hahaha""","""heheh""",0,0,0,0,0,0,0
"""HG00097""",1,"""ala ma""","""1""",0,1,1,1,1,0,1
"""HG00099""",1,"""1""","""1""",0,1,1,1,1,0,1
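
The expression `labels[vds.s]` looks up each sample ID in the table keyed by `sample`. The plain-Python sketch below mimics that keyed lookup with a dictionary; the sample `SAMPLE_X` and all variable names are hypothetical, and `None` stands in for the missing value Hail assigns when a key has no match.

```python
# Stand-in for the keyed `labels` table: sample ID -> label fields
labels_by_sample = {
    "HG00096": {"x22_16050408": 0},
    "HG00097": {"x22_16050408": 1},
}

# Stand-in for the dataset's column keys; SAMPLE_X has no row in the labels
samples = ["HG00096", "HG00097", "SAMPLE_X"]

# Keyed lookup: each sample receives its label record, or None if unmatched
annotated = [{"s": s, "label": labels_by_sample.get(s)} for s in samples]
for row in annotated:
    print(row)
```

Samples without a matching label are kept with a missing annotation rather than dropped, so it can be worth checking for missing labels before fitting a model.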


Step 4: Build the random forest model with `label.x22_16050408` as the response variable.

In [5]:
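rf_model = vshl.random_forest_model(
    y=vds.label['x22_16050408'],   # response: label derived from 22:16050408
    x=vds.GT.n_alt_alleles(),      # features: alternate-allele count per variant and sample
    seed=13, mtry_fraction=0.05, min_node_size=5, max_depth=10)
# The positional arguments are taken to be the number of trees (100) and the batch size (50).
rf_model.fit_trees(100, 50)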

2020-12-18 11:27:21 Hail: INFO: Loaded 1988 variables
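
For a sense of scale, `mtry_fraction` is the fraction of variables considered as split candidates at each node. With the 1988 variables reported in the log line above, a value of 0.05 corresponds to roughly 99 candidates per split. The arithmetic below is only an estimate; how VariantSpark rounds the value internally is an assumption not confirmed here.

```python
n_variables = 1988     # from the "Loaded 1988 variables" log line
mtry_fraction = 0.05   # value passed to random_forest_model

# Approximate number of variables tried at each split (rounding is assumed)
approx_mtry = n_variables * mtry_fraction
print(round(approx_mtry, 1))
```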


Step 5: Display the results: print the OOB error and the calculated variable importance.

In [6]:
print("OOB error: %s" % rf_model.oob_error())
impTable = rf_model.variable_importance()
impTable.order_by(hl.desc(impTable.importance)).show(10)

OOB error: 0.010073260073260074


2020-12-18 11:27:24 Hail: INFO: Coerced sorted dataset


locus,alleles,importance
locus<GRCh37>,array<str>,float64
22:16050408,"[""T"",""C""]",36.7
22:16050678,"[""C"",""T""]",26.0
22:16052838,"[""T"",""A""]",18.9
22:16053197,"[""G"",""T""]",15.5
22:16051882,"[""C"",""T""]",14.9
22:16053727,"[""T"",""G""]",14.2
22:16051480,"[""T"",""C""]",14.0
22:16052656,"[""T"",""C""]",13.5
22:16053797,"[""T"",""C""]",9.84
22:16051107,"[""C"",""A""]",9.22
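
As a quick sanity check on the table above, the top-ranked locus should be the position the label was derived from. The sketch below copies the first rows of the output into plain Python (rather than calling Hail) and confirms this; the variable names are illustrative.

```python
# Rows copied from the importance table above: (locus, importance)
importance_rows = [
    ("22:16050408", 36.7),
    ("22:16050678", 26.0),
    ("22:16052838", 18.9),
    ("22:16053197", 15.5),
    ("22:16051882", 14.9),
]

# Label columns are named x<chromosome>_<position>
label_column = "x22_16050408"
chrom, pos = label_column[1:].split("_")
expected_locus = f"{chrom}:{pos}"

# Pick the row with the highest importance score
top_locus, top_importance = max(importance_rows, key=lambda row: row[1])
print(expected_locus, top_locus == expected_locus)
```

Here the expected locus 22:16050408 does come out on top, which matches the expectation stated in the introduction.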


Optionally release the resources (RAM) associated with the model.

In [7]:
#rf_model.release()

For more information on using *VariantSpark*, its Python API and its Hail integration, please visit the [documentation](http://variantspark.readthedocs.io/en/latest/).