# Running p-value computation with the Python API

This is a *VariantSpark* example notebook.

One of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs control) using random forest Gini importance.

`chr22_1000.vcf` is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.

`chr22-labels-hail.csv` is a CSV file with sample response variables (labels). The labels directly represent the number of alternative alleles for each sample at a specific genomic position. For example, column `x22_16050408` holds labels derived from the variants at chromosome 22, position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label `x22_16050408`.
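
To make the label definition concrete, here is a minimal sketch of counting alternative alleles from VCF-style genotype strings. The helper `n_alt_alleles`, the sample names and the genotype values are hypothetical and only illustrate the idea; this is not Hail's implementation, and treating a missing call (`.`) as zero is an assumption made to keep the example short.

```python
import re

def n_alt_alleles(gt):
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'."""
    alleles = re.split(r"[|/]", gt)
    # '0' is the reference allele; '.' is a missing call (counted as 0 here by assumption).
    return sum(1 for allele in alleles if allele not in ("0", "."))

# Hypothetical genotype calls for three samples at a single position.
genotypes = {"sample_a": "0|0", "sample_b": "0|1", "sample_c": "1|1"}

# A label column built this way mirrors the genotype at that position exactly.
labels = {sample: n_alt_alleles(gt) for sample, gt in genotypes.items()}
print(labels)  # {'sample_a': 0, 'sample_b': 1, 'sample_c': 2}
```

Because the label is a direct function of the genotype at one position, that position should rank near the top of any importance analysis run against the label.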

Both data sets are located in the `../data` directory.

This notebook demonstrates how to run importance analysis on these data with *VariantSpark* Hail integration.

Step 1: Create a `HailContext` using the `SparkContext` object (here injected as `sc`):

In [1]:
import numpy as np
from unidip import UniDip  # third-party package, installed with pip

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib import patches

from varspark.pvalues_calculation import *  # star import; run_it_importances used below comes from here
import hail as hl
import varspark.hail as vshl
vshl.init()


Step 2: Load Hail variant dataset `vds` from a sample `.vcf` file.

In [None]:
vds = hl.import_vcf('../data/chr22_1000.vcf')

Step 3: Load labels into Hail table `labels`.

In [None]:
labels = hl.import_table('../data/chr22-labels-hail.csv', impute = True, delimiter=",").key_by('sample')

Step 4: Annotate dataset samples with labels.

In [None]:
vds = vds.annotate_cols(label = labels[vds.s])
vds.cols().show(3)

Step 5: Build the random forest model with `label.x22_16050408` as the response variable.

In [None]:
rf_model = vshl.random_forest_model(y=vds.label['x22_16050408'],
                x=vds.GT.n_alt_alleles(), seed = 13, mtry_fraction = 0.05, min_node_size = 5, max_depth = 10)
rf_model.fit_trees(100, 50)

Step 6: Display the results: print the OOB error and the calculated variable importance.

In [None]:
print("OOB error: %s" % rf_model.oob_error())
impTable = rf_model.variable_importance()
impTable.order_by(hl.desc(impTable.importance)).show(10)
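
As background for the importance column, the following is a minimal sketch of the textbook Gini impurity decrease for a single split, which is the quantity that Gini importance accumulates over the splits of a forest. It assumes a categorical response and is not VariantSpark's implementation; how VariantSpark aggregates or normalises these values is not described in this notebook. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

parent = [0, 0, 1, 1]
perfect = gini_decrease(parent, [0, 0], [1, 1])  # split separates the classes
useless = gini_decrease(parent, [0, 1], [0, 1])  # split leaves the class mix unchanged
print(perfect, useless)  # 0.5 0.0
```

A variable that repeatedly produces splits like `perfect` collects a large total decrease, while one that only produces splits like `useless` stays near zero.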

Step 7: Obtain the variable importance table, including each variable's `splitCount`.

In [None]:
class PValueCalculator:
    
    def __init__(self, df):
        self._df = df

    @classmethod
    def from_imp_table(cls,impTable):
        impDf = impTable.filter(impTable.splitCount >= 1).to_spark(flatten=False).toPandas()
        df = impDf.assign(logImportance = np.log(impDf.importance))
        return cls(df)

    def plot_log_densities(self, ax, min_split_count = 1, max_split_count=6, palette = 'Set1',
                      xLabel = 'log(importance)', yLabel = 'density'):
        #TODO test preconditions
        no_lines = max_split_count - min_split_count + 1
        colors= sns.mpl_palette(palette, no_lines)
        df = self._df
        for i,c in zip(range(min_split_count, max_split_count + 1), colors):
            sns.kdeplot(df.logImportance[df.splitCount >= i],
                        ax = ax, color=c, bw_adjust=0.5)  # a lower bw_adjust gives sharper density curves
    
    
        #ax.legend(labels=range(1,n_lines), bbox_to_anchor=(1,1))
        ax.set_xlabel(xLabel)
        ax.set_ylabel(yLabel)


    def plot_log_hist(self, ax, split_count, bins = 100,
                          xLabel = 'log(importance)', yLabel = 'count'):
        # check preconditions
        df = self._df
        sns.histplot(df.logImportance[df.splitCount >= split_count], ax = ax, bins=bins)
        ax.set_xlabel(xLabel)
        ax.set_ylabel(yLabel)
        
        
    def compute_p_values(self, countThreshold = 2, pValue = 0.05, **kwargs):
        impDfWithLog = self._df[self._df.splitCount >= countThreshold]
        pValueResult = run_it_importances(impDfWithLog.logImportance, pValue)
        #return impDfWithLog.assign(pValue =  pValueResult['ppp'])
        return (impDfWithLog.assign(pvalue = pValueResult['ppp']), pValueResult)   
    
    
    def find_split_count_th(self, min_split_count = 1, max_split_count=6, ntrials=1000):
        df = self._df
        for splitCountThreshold in range(min_split_count,max_split_count + 1):
            # np.sort replaces the deprecated np.msort; the result is the same for 1-D data
            dat = np.sort(df[df['splitCount'] > splitCountThreshold]['logImportance'].to_numpy())
            intervals = UniDip(dat, ntrials=ntrials).run() #ntrials can be increased to achieve higher robustness
            if len(intervals) <= 1: 
                break
        # TODO: check if converged !!!
        return splitCountThreshold

In [None]:
pValCalc = PValueCalculator.from_imp_table(impTable)

Step 8: Determine the cutoff (how many times a variable was used to split a tree) that yields a unimodal density.

In [None]:
autoSplitCountTh = pValCalc.find_split_count_th()
print("The automatically selected SplitCount Threshold is %s" % autoSplitCountTh)
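
The search performed by `find_split_count_th` can be sketched without any plotting or dip-test dependency. The version below is an illustration under stated assumptions: `find_split_count_threshold`, `max_gap_is_small` and the toy rows are hypothetical, and the gap check is only a crude stand-in for the dip test that `UniDip` runs. Unlike the method above, the sketch returns `None` when no threshold passes, which is one possible way to address the convergence TODO.

```python
def max_gap_is_small(values, max_gap=2.0):
    """Hypothetical stand-in for a unimodality test; NOT the dip test used by UniDip."""
    gaps = [b - a for a, b in zip(values, values[1:])]
    return max(gaps, default=0.0) < max_gap

def find_split_count_threshold(rows, is_unimodal, min_split_count=1, max_split_count=6):
    """Return the first threshold whose remaining values look unimodal, else None.

    rows is a list of (split_count, log_importance) pairs.
    """
    for threshold in range(min_split_count, max_split_count + 1):
        # Mirrors the strict '>' comparison used in find_split_count_th.
        values = sorted(value for count, value in rows if count > threshold)
        if values and is_unimodal(values):
            return threshold
    return None  # no threshold in the range produced a unimodal sample

# Toy data: rarely used variables cluster near -9, frequently used ones near -3.
rows = [(1, -9.0), (1, -8.8), (2, -8.9), (3, -3.0), (3, -2.8), (4, -2.9)]
threshold = find_split_count_threshold(rows, max_gap_is_small)
print(threshold)  # 2
```

Passing the unimodality test in as a function keeps the loop easy to exercise with a deterministic check before swapping in a statistical test.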

Step 9: Display (A) density graphs of the Gini importance scores, with different colours indicating the `splitCount` of a variable, and (B) a histogram of the Gini importance scores of variables at the selected `SplitCountThreshold`.

In [None]:
#plt.rcParams['figure.figsize'] = [15, 10]
#SMALL_SIZE = 12
#MEDIUM_SIZE = 14
#BIGGER_SIZE = 16
#plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
#plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
#plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
#plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
#plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
#plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
#plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

fig, ax1 = plt.subplots(figsize=(10, 5), layout='constrained')
pValCalc.plot_log_densities(ax1)
plt.show()

In [None]:
fig, ax2 = plt.subplots(figsize=(10, 5), layout='constrained')
pValCalc.plot_log_hist(ax2, 2)
plt.show()

Step 10: Preparing the DataFrame for the variant p-value calculation

Step 11: Computing p-values and keeping the significant ones.

In [None]:
# maxFRD is absorbed by **kwargs and is not used by compute_p_values as defined above
pvalueDF, info = pValCalc.compute_p_values(countThreshold = 2, maxFRD = 0.2)
print("C = %s" % info['C'])
pvalueDF

In [None]:
pvalueDF.sort_values('pvalue')
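
Step 11 mentions keeping the significant variants, while the cells above only sort by `pvalue`. A minimal sketch of the filtering step follows, written with plain Python records so it runs without the notebook's data. The helper `keep_significant`, the cutoff `alpha` and the example rows are hypothetical, and the sketch assumes that the `pvalue` column holds one p-value per variant.

```python
def keep_significant(rows, alpha=0.05):
    """Keep records whose p-value is at or below alpha, smallest p-value first."""
    kept = [row for row in rows if row["pvalue"] <= alpha]
    return sorted(kept, key=lambda row: row["pvalue"])

# Hypothetical records standing in for rows of pvalueDF.
rows = [
    {"variant": "v1", "pvalue": 0.20},
    {"variant": "v2", "pvalue": 0.001},
    {"variant": "v3", "pvalue": 0.04},
]

significant = keep_significant(rows)
print([row["variant"] for row in significant])  # ['v2', 'v3']
```

On the pandas DataFrame returned in this notebook, the equivalent expression would be `pvalueDF[pvalueDF.pvalue <= 0.05].sort_values('pvalue')`.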