VariantSpark integration with Hail 0.2
==============================

## Bootstrap

Use `vshl.init()` to add the `variant-spark` jar to the classpath.

In [1]:
import hail as hl
import varspark.hail as vshl
vshl.init()

using variant-spark jar at '/Users/reg032/workspace/VariantSpark/target/variant-spark_2.11-0.5.0-a0-dev0-all.jar'
22/05/09 14:34:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/09 14:34:16 WARN Hail: This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
  Compatibility is not guaranteed.
22/05/09 14:34:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/05/09 14:34:17 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/05/09 14:34:17 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
22/05/09 14:34:17 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
22/

In [2]:
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

## Load and explore the hipster data

In [3]:
data = hl.import_vcf('../data/hipsterIndex/hipster.vcf.bgz')

In [4]:
labels = hl.import_table('../data/hipsterIndex/hipster_labels_covariates.txt', delimiter=',',
                         types=dict(label='float64', score='float64', age='float64',
                                    PC0='float64', PC1='float64', PC2='float64')).key_by('samples')

2022-05-09 14:34:22 Hail: INFO: Reading table without type imputation
  Loading field 'samples' as type str (not specified)
  Loading field 'score' as type float64 (user-supplied)
  Loading field 'label' as type float64 (user-supplied)
  Loading field 'age' as type float64 (user-supplied)
  Loading field 'PC0' as type float64 (user-supplied)
  Loading field 'PC1' as type float64 (user-supplied)
  Loading field 'PC2' as type float64 (user-supplied)


In [5]:
mt = data.annotate_cols(hipster = labels[data.s])
mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'hipster': struct {
        score: float64, 
        label: float64, 
        age: float64, 
        PC0: float64, 
        PC1: float64, 
        PC2: float64
    }
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AA: str, 
        AC: array<int32>, 
        AF: array<float64>, 
        AFR_AF: array<float64>, 
        AMR_AF: array<float64>, 
        AN: int32, 
        CIEND: array<int32>, 
        CIPOS: array<int32>, 
        CS: str, 
        DP: int32, 
        EAS_AF: array<float64>, 
        END: int32, 
        EUR_AF: array<float64>, 
        EX_TARGET: bool, 
        IMPRECISE: bool, 
        MC: array<str>, 
        MEINFO: array<str>, 
        MEND: int32, 
        MLEN: int32, 
        MSTART

In [6]:
mt.count()

2022-05-09 14:34:25 Hail: INFO: Coerced almost-sorted dataset
[Stage 1:>                                                          (0 + 1) / 1]

(17010, 2504)

## Run logistic regression using Hail

In [7]:
gwas = hl.logistic_regression_rows(test='score',
                                   y=mt.hipster.label,
                                   x=mt.GT.n_alt_alleles(),
                                   covariates=[1.0, mt.hipster.age, mt.hipster.PC0, mt.hipster.PC1, mt.hipster.PC2],
                                   pass_through=[mt.rsid])

2022-05-09 14:34:29 Hail: INFO: Coerced almost-sorted dataset
2022-05-09 14:34:32 Hail: INFO: logistic_regression_rows: running score on 2504 samples for response variable y,
    with input variable x, and 5 additional covariates...


In [8]:
gwas.show(3)

[Stage 4:>                                                          (0 + 1) / 1]

locus,alleles,rsid,chi_sq_stat,p_value
locus<GRCh37>,array<str>,str,float64,float64
2:109511398,"[""G"",""A""]","""rs150055772""",0.245,0.621
2:109511454,"[""C"",""A""]","""rs558429529""",1.58,0.208
2:109511463,"[""G"",""A""]","""rs200762071""",3.39,0.0654


In [9]:
p = hl.plot.manhattan(gwas.p_value, hover_fields=dict(rs=gwas.rsid))
show(p)

_Fig 1: Manhattan plot for logistic regression p-values._
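
A Manhattan plot places each variant at its genomic position on the x-axis and its -log10(p-value) on the y-axis, so smaller p-values sit higher. As a minimal sketch in plain Python (no Hail required), this is the transform applied to the three p-values shown by `gwas.show(3)` above; the variable names are illustrative only.

```python
import math

# p-values copied from the gwas.show(3) output above, keyed by rsid
p_values = {
    "rs150055772": 0.621,
    "rs558429529": 0.208,
    "rs200762071": 0.0654,
}

# Manhattan plots show -log10(p) so that stronger associations sit higher
neg_log10_p = {rsid: -math.log10(p) for rsid, p in p_values.items()}

for rsid, value in neg_log10_p.items():
    print(f"{rsid}: {value:.3f}")
```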

## Build a random forest and extract Gini importance with VariantSpark (on the same data)

In [10]:
rf_model = vshl.random_forest_model(y=mt.hipster.label,
                    x=mt.GT.n_alt_alleles(), 
                    covariates={'age':mt.hipster.age, 'PC0':mt.hipster.PC0, 'PC1':mt.hipster.PC1, 'PC2':mt.hipster.PC2})
rf_model.fit_trees(500, 100)

2022-05-09 14:35:04 Hail: INFO: Coerced almost-sorted dataset
[Stage 696:>                                                        (0 + 8) / 9]

Capture the variant importances

In [11]:
print(rf_model.oob_error())
impTable = rf_model.variable_importance()
impTable.show(3)

0.18130990415335463


2022-05-09 14:38:43 Hail: INFO: Coerced sorted dataset


locus,alleles,importance,splitCount
locus<GRCh37>,array<str>,float64,int64
2:109511398,"[""G"",""A""]",0.0,0
2:109511454,"[""C"",""A""]",0.0184,4
2:109511463,"[""G"",""A""]",0.113,24
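
Gini importance in a random forest is conventionally the total decrease in Gini impurity contributed by splits on a variable, while `splitCount` counts how many splits used it. The sketch below computes the impurity decrease for a single hypothetical split in plain Python. The labels are made up for illustration, and the exact weighting and normalisation that VariantSpark applies may differ.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical node: 8 samples with binary labels, split on a variant's genotype
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]    # e.g. samples with no alternate allele
right = [0, 1, 1, 1]   # e.g. samples with at least one alternate allele

decrease = impurity_decrease(parent, left, right)
print(f"impurity decrease for this split: {decrease:.4f}")
```

Summing such decreases over every split that uses a variable, across all trees, gives an importance score of the kind reported in the `importance` column.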


Show the covariate importances

In [12]:
covImpTable = rf_model.covariate_importance()
covImpTable.show(4)

2022-05-09 14:38:44 Hail: INFO: Coerced sorted dataset
2022-05-09 14:38:44 Hail: INFO: Coerced dataset with out-of-order partitions.


covariate,importance,splitCount
str,float64,int64
"""PC0""",2.53,471
"""PC1""",2.52,479
"""PC2""",2.57,464
"""age""",2.55,473


Join the Hail and VariantSpark results (needed here only to get the rsIDs)

In [13]:
gwas_with_imp = gwas.join(impTable)

2022-05-09 14:38:44 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'alleles' -> 'alleles_1'
    'locus' -> 'locus_1'
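
`Table.join` performs an inner join on the tables' keys by default, so only variants present in both tables are kept. As a rough illustration of that behaviour, the sketch below joins the three rows shown earlier using plain dictionaries. It assumes both tables are keyed by (locus, alleles); it mimics the semantics only and is not how Hail implements joins.

```python
# Rows copied from the gwas.show(3) and impTable.show(3) outputs above.
# Both tables are assumed here to be keyed by (locus, alleles).
gwas_rows = {
    ("2:109511398", ("G", "A")): {"rsid": "rs150055772", "p_value": 0.621},
    ("2:109511454", ("C", "A")): {"rsid": "rs558429529", "p_value": 0.208},
    ("2:109511463", ("G", "A")): {"rsid": "rs200762071", "p_value": 0.0654},
}
imp_rows = {
    ("2:109511398", ("G", "A")): {"importance": 0.0, "splitCount": 0},
    ("2:109511454", ("C", "A")): {"importance": 0.0184, "splitCount": 4},
    ("2:109511463", ("G", "A")): {"importance": 0.113, "splitCount": 24},
}

# Inner join: keep keys present on both sides and merge their fields
joined = {
    key: {**gwas_rows[key], **imp_rows[key]}
    for key in gwas_rows
    if key in imp_rows
}

for (locus, alleles), fields in joined.items():
    print(locus, fields["rsid"], fields["p_value"], fields["importance"])
```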


In [14]:
import varspark.hail.plot as vshlplt
p = vshlplt.manhattan_imp(gwas_with_imp.importance,
                          hover_fields=dict(rs=gwas_with_imp.rsid),
                          significance_line=None)
show(p)

[Stage 746:>                                                        (0 + 1) / 1]

_Fig 2: Manhattan plot for random forest Gini importance values._

## Compare logistic regression p-values vs. random forest importance

In [15]:
p = hl.plot.scatter(x=-hl.log10(gwas_with_imp.p_value),
                    y=gwas_with_imp.importance, 
                    xlabel='-log10(p-value)',
                    ylabel='gini importance',
                    hover_fields=dict(rs=gwas_with_imp.rsid, loc=gwas_with_imp.locus))
show(p)

_Fig 3: Gini importance vs. logistic regression p-values._
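
One simple way to quantify what the scatter plot shows is to check whether the two measures rank variants in the same order. The sketch below does this in plain Python for the three variants displayed earlier. Three rows say nothing about the full dataset, and the helper name `ranks` is illustrative.

```python
import math

# (rsid, p_value, importance) copied from the show(3) outputs above
rows = [
    ("rs150055772", 0.621, 0.0),
    ("rs558429529", 0.208, 0.0184),
    ("rs200762071", 0.0654, 0.113),
]


def ranks(values):
    """Rank positions (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


neg_log10_p = [-math.log10(p) for _, p, _ in rows]
importance = [imp for _, _, imp in rows]

# True when both measures order the variants identically
same_order = ranks(neg_log10_p) == ranks(importance)
print("rankings agree on these rows:", same_order)
```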