VariantSpark integration with Hail 0.2
==============================

## Bootstrap

This step adds the VariantSpark JAR to the Spark classpath alongside the Hail JAR.
The setup can be simplified on Terra.

In [1]:
import pkg_resources
import hail as hl
import varspark as vs
HAIL_JAR=pkg_resources.resource_filename(hl.__name__, "hail-all-spark.jar")
VS_JAR=vs.find_jar()

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .appName("HipsterIndex") \
    .config("spark.driver.extraClassPath", HAIL_JAR)\
    .config("spark.executor.extraClassPath", HAIL_JAR)\
    .config("spark.jars", ",".join([HAIL_JAR,VS_JAR]))\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "is.hail.kryo.HailKryoRegistrator") \
    .getOrCreate()
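
As a minimal sketch, the settings passed to the builder above can be collected in a plain dictionary, which makes them easy to inspect before a session is created. The helper name `hail_varspark_conf` and the two JAR paths are hypothetical placeholders, not part of Hail or VariantSpark.

```python
def hail_varspark_conf(hail_jar, vs_jar):
    """Return the Spark settings from the cell above as a dict (hypothetical helper)."""
    return {
        # The Hail JAR must be visible to both the driver and the executors.
        "spark.driver.extraClassPath": hail_jar,
        "spark.executor.extraClassPath": hail_jar,
        # Both JARs are shipped with the application as a comma-separated list.
        "spark.jars": ",".join([hail_jar, vs_jar]),
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryo.registrator": "is.hail.kryo.HailKryoRegistrator",
    }


# Placeholder paths, for illustration only.
conf = hail_varspark_conf("/path/to/hail-all-spark.jar", "/path/to/variant-spark.jar")
for key, value in sorted(conf.items()):
    print(key, "=", value)
```

Each entry could then be applied with `builder.config(key, value)` in a loop, which is equivalent to the chained calls above.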

In [3]:
import hail as hl
hl.init(sc=spark.sparkContext)

using hail jar at /Users/szu004/miniconda2/envs/hail/lib/python3.6/site-packages/hail/hail-all-spark.jar
Running on Apache Spark version 2.4.1
SparkUI available at http://szu004-mac-dp.nexus.csiro.au:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.16-6da0d3571629
LOGGING: writing to /Users/szu004/dev/variant-spark/dev-notebooks/hail-20190726-1037-0.2.16-6da0d3571629.log


In [4]:
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

## Load and explore hipster data

In [5]:
data = hl.import_vcf('../data/hipsterIndex/hipster.vcf.bgz')

In [6]:
labels = hl.import_table('../data/hipsterIndex/hipster_labels.txt', delimiter=',', 
                types=dict(label='float64', score='float64')).key_by('samples')

2019-07-26 10:37:23 Hail: INFO: Reading table with no type imputation
  Loading column 'samples' as type 'str' (type not specified)
  Loading column 'score' as type 'float64' (user-specified)
  Loading column 'label' as type 'float64' (user-specified)



In [7]:
mt = data.annotate_cols(hipster = labels[data.s])
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'hipster': struct {
        score: float64, 
        label: float64
    }
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AA: str, 
        AC: array<int32>, 
        AF: array<float64>, 
        AFR_AF: array<float64>, 
        AMR_AF: array<float64>, 
        AN: int32, 
        CIEND: array<int32>, 
        CIPOS: array<int32>, 
        CS: str, 
        DP: int32, 
        EAS_AF: array<float64>, 
        END: int32, 
        EUR_AF: array<float64>, 
        EX_TARGET: bool, 
        IMPRECISE: bool, 
        MC: array<str>, 
        MEINFO: array<str>, 
        MEND: int32, 
        MLEN: int32, 
        MSTART: int32, 
        MULTI_ALLELIC: bool, 
        NS: int32, 
        SAS_AF: array<float64>, 

In [8]:
mt.count()

2019-07-26 10:37:24 Hail: INFO: Coerced almost-sorted dataset


(17010, 2504)

## Run logistic regression using Hail

In [9]:
gwas = hl.logistic_regression_rows(test='score',
                                   y=mt.hipster.label,
                                   x=mt.GT.n_alt_alleles(),
                                   covariates=[1.0],
                                   pass_through=[mt.rsid])

2019-07-26 10:37:28 Hail: INFO: logistic_regression_rows: running score on 2504 samples for response variable y,
    with input variable x, and 1 additional covariate...
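
The input variable `x=mt.GT.n_alt_alleles()` encodes each genotype as the number of non-reference alleles (0, 1 or 2 for a diploid call). The plain-Python sketch below illustrates that encoding on VCF-style genotype strings. The function is a hypothetical stand-in for illustration and is not how Hail implements the expression.

```python
def n_alt_alleles(genotype):
    """Count non-reference alleles in a VCF-style genotype string such as '0|1'."""
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call: no count is defined
    return sum(1 for allele in alleles if allele != "0")


for gt in ["0|0", "0|1", "1/1", "./."]:
    print(gt, "->", n_alt_alleles(gt))
```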


In [10]:
gwas.show(3)

locus,alleles,rsid,chi_sq_stat,p_value
locus<GRCh37>,array<str>,str,float64,float64
2:109511398,"[""G"",""A""]","""rs150055772""",0.197,0.657
2:109511454,"[""C"",""A""]","""rs558429529""",1.55,0.213
2:109511463,"[""G"",""A""]","""rs200762071""",3.65,0.056


In [11]:
p = hl.plot.manhattan(gwas.p_value, hover_fields=dict(rs=gwas.rsid))
show(p)

_Fig 1: Manhattan plot for logistic regression p-values._
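
When reading the Manhattan plot, one common (and conservative) way to choose a significance threshold is a Bonferroni correction over the number of variants tested. The sketch below works this out for the 17010 variants reported by `mt.count()`. The 0.05 error rate is an assumption made for illustration and is not a value used anywhere in this notebook.

```python
import math

n_variants = 17010  # number of rows reported by mt.count() above
alpha = 0.05        # assumed family-wise error rate (not from the notebook)

# Bonferroni: divide the error rate by the number of tests.
threshold = alpha / n_variants
neg_log10_threshold = -math.log10(threshold)

print("p-value threshold:", threshold)
print("-log10(threshold):", round(neg_log10_threshold, 2))
```

On the plot's -log10 scale this corresponds to a horizontal line at about 5.5.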

## Build a random forest and extract Gini importance with VariantSpark (on the same data)

In [12]:
import varspark.hail as vshl

In [13]:
rf_model = vshl.random_forest_model(y=mt.hipster.label,
                    x=mt.GT.n_alt_alleles())
rf_model.fit_trees(500, 100)

In [14]:
print(rf_model.oob_error())
impTable = rf_model.variable_importance()
impTable.show(3)

0.1920926517571885


2019-07-26 10:40:48 Hail: INFO: Coerced sorted dataset


locus,alleles,importance
locus<GRCh37>,array<str>,float64
2:109511398,"[""G"",""A""]",0.000279
2:109511454,"[""C"",""A""]",0.00589
2:109511463,"[""G"",""A""]",0.0954


Join the Hail and VariantSpark results (needed here only to get the rsIDs).

In [15]:
gwas_with_imp = gwas.join(impTable)
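
The join above matches rows of the two tables on their shared key, which for both tables shown is (`locus`, `alleles`). The plain-Python sketch below mimics an inner join on that key with dictionaries, using the three rows displayed earlier. It only illustrates the idea and does not use Hail.

```python
# Rows keyed by (locus, alleles), copied from the outputs shown above.
gwas_rows = {
    ("2:109511398", ("G", "A")): {"rsid": "rs150055772", "p_value": 0.657},
    ("2:109511454", ("C", "A")): {"rsid": "rs558429529", "p_value": 0.213},
    ("2:109511463", ("G", "A")): {"rsid": "rs200762071", "p_value": 0.056},
}
imp_rows = {
    ("2:109511398", ("G", "A")): {"importance": 0.000279},
    ("2:109511454", ("C", "A")): {"importance": 0.00589},
    ("2:109511463", ("G", "A")): {"importance": 0.0954},
}

# Inner join: keep only keys present in both tables and merge their fields.
joined = {
    key: {**gwas_rows[key], **imp_rows[key]}
    for key in gwas_rows.keys() & imp_rows.keys()
}

for key in sorted(joined):
    print(key, joined[key])
```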

In [16]:
import varspark.hail.plot as vshlplt
p = vshlplt.manhattan_imp(gwas_with_imp.importance, 
                            hover_fields=dict(rs=gwas_with_imp.rsid),
                            significance_line = None)
show(p)

_Fig 2: Manhattan plot for random forest Gini importance values._

## Compare logistic regression p-values vs. random forest importance

In [17]:
p = hl.plot.scatter(x=-hl.log10(gwas_with_imp.p_value),
                    y=gwas_with_imp.importance, 
                    xlabel = '-log10(p-value)',
                    ylabel = 'gini importance',
                    hover_fields=dict(rs=gwas_with_imp.rsid, loc=gwas_with_imp.locus))
show(p)

_Fig 3: Gini importance vs. logistic regression p-values._
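
Besides the scatter plot, the two measures can be compared by checking whether they rank variants in the same order. The sketch below does this for the three rows displayed earlier in the notebook. Three rows are far too few to draw any conclusion, so it only illustrates the comparison; the `ranks` helper is hypothetical.

```python
import math

# Values copied from the three rows shown in the outputs above.
p_values = [0.657, 0.213, 0.056]
importance = [0.000279, 0.00589, 0.0954]

# Same transformation as the x-axis of the scatter plot.
neg_log_p = [-math.log10(p) for p in p_values]


def ranks(values):
    """Return the rank (0 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


same_order = ranks(neg_log_p) == ranks(importance)
print("ranks agree on these rows:", same_order)
```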