Define functional categories and filter calls for those in Control and SCZ individuals

In [1]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
from matplotlib import pyplot as plt
import pandas as pd
import attila_utils
import funcvar
import functools
import ensembl_rest
import os.path
from bsmcalls import SNPnexus
%matplotlib inline
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_colwidth', None)  # None means unlimited; -1 is deprecated

## Preparations

In [2]:
D = SNPnexus.load_data('/home/attila/projects/bsm/results/2020-09-07-annotations/annotated-calls.p')

In [None]:
d = {'coding nonsyn': 'near_gens_Annotation', 'stop-gain': 'near_gens_Annotation', 'intronic (splice_site)': 'near_gens_Annotation'}
A = funcvar.count_members(D, d)
B = D.groupby('Dx')['sift_Prediction'].apply(pd.Series.value_counts).unstack().T.loc[['Deleterious', 'Deleterious - Low Confidence']]
counts = pd.concat([A, B], axis=0)
#counts = counts.append(D.groupby('Dx')['tfbs_TFBS Name_bin'].sum().astype('int64'))
#counts = counts.append(D.groupby('Dx')['regbuild_Epigenome_nervoussys_bin'].sum().astype('int64'))
#counts = counts.append(D.groupby('Dx')['gerp_Element RS Score_bin'].sum().astype('int64'))
counts

### GWAS genes

Here we take supplementary table 4 from the [CLOZUK paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5918692/) (clozapine UK), whose `Gene(s) tagged` column we turn into the `gwasgenes` set.

In [None]:
clozukpath = '/home/attila/projects/bsm/resources/CLOZUK/supp-table-4.csv'
gwasgenes = funcvar.get_geneset(df=pd.read_csv(clozukpath, skiprows=7), col='Gene(s) tagged')
repr(gwasgenes)
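
The implementation of `funcvar.get_geneset` is not shown in this notebook. As a rough illustration of how a gene set might be collected from one column, here is a minimal sketch. It assumes each cell holds zero or more gene symbols separated by commas; the helper name `geneset_from_column`, the `sep` parameter, and the toy gene names are all hypothetical and do not reflect the real table.

```python
import pandas as pd

def geneset_from_column(df, col, sep=','):
    """Collect the unique gene symbols listed in one column of df.

    Assumption: each cell holds zero or more symbols joined by `sep`.
    """
    genes = set()
    for cell in df[col].dropna():          # skip empty cells
        for symbol in str(cell).split(sep):
            symbol = symbol.strip()        # tolerate spaces around separators
            if symbol:
                genes.add(symbol)
    return genes

# Toy table standing in for the supplementary table (made-up gene names)
toy = pd.DataFrame({'Gene(s) tagged': ['GENE_A, GENE_B', None, 'GENE_B', 'GENE_C']})
toy_genes = geneset_from_column(toy, 'Gene(s) tagged')
```

If the real column uses a different separator, only `sep` would need to change.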

In [None]:
Dgwas = D.loc[D['near_gens_Overlapped Gene'].apply(lambda x: bool(set(x).intersection(gwasgenes))), :]
counts_gwas = funcvar.all_functional_counts(Dgwas)
counts_gwas

### DeepSEA score
[DeepSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4768299/) predicts the effects of noncoding variants with a deep learning-based sequence model.

In [None]:
fig, ax = plt.subplots()
D['deepsea_Functional Significance Score'].hist(ax=ax)
ax.set_xlabel('DeepSEA functional significance score')
ax.set_ylabel('variants')
attila_utils.savefig(fig, 'deepsea-hist')

The histogram suggests a cutoff at a score somewhere between 0.5 and 0.6.  I will define functionally significant variants using both a more lenient and a more stringent threshold.
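
A minimal sketch of such a two-threshold definition, run on toy scores rather than on `D`. The cutoffs 0.5 and 0.6 are simply the ends of the range read off the histogram. The notebook does not say whether higher or lower scores indicate greater functional significance, so the direction is left as a parameter (`higher_is_significant`, a hypothetical name) and should be checked against the DeepSEA and SNPnexus documentation. Which cutoff is the lenient one depends on that direction.

```python
import pandas as pd

# Ends of the range suggested by the histogram (assumed, not final choices)
LOW_CUTOFF, HIGH_CUTOFF = 0.5, 0.6

def flag_by_threshold(scores, cutoff, higher_is_significant=True):
    """Boolean mask of scores passing `cutoff`; missing scores never pass."""
    if higher_is_significant:
        return scores >= cutoff
    return scores <= cutoff

# Toy scores, including one missing value
toy_scores = pd.Series([0.10, 0.55, 0.62, float('nan'), 0.90])
mask_low = flag_by_threshold(toy_scores, LOW_CUTOFF)
mask_high = flag_by_threshold(toy_scores, HIGH_CUTOFF)
```

On the real data the same masks could be built from `D['deepsea_Functional Significance Score']` and then used with `D.loc[...]`.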

### Loss of function variants

Loss-of-function variants comprise stop-gain, intronic (splice_site), and frameshift variants.  Note that we didn't observe any frameshift variants.  The loss-of-function category might be the most revealing.

In [None]:
cols = ['Dataset', 'Dx', 'near_gens_Annotation', 'near_gens_Overlapped Gene']
LoFrows = D['near_gens_Annotation'].apply(lambda x: bool(set(x).intersection({'stop-gain', 'intronic (splice_site)'})))
D.loc[LoFrows, cols]

### Functional variants

First let's see Taejeong's definition of functional variants:

> The terms that we chose as functional are missense, stop_gained, splice_region, regulatory, and TF_binding.
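
The definition above could be turned into a row filter along the same lines as the loss-of-function filter. The sketch below is only an illustration on toy data: the column name `consequence` is hypothetical, and the quoted terms differ from the annotation labels used elsewhere in this notebook (for example `stop-gain`), so a mapping between the two vocabularies would be needed and is not attempted here.

```python
import pandas as pd

# Terms taken from the quoted definition
FUNCTIONAL_TERMS = {'missense', 'stop_gained', 'splice_region', 'regulatory', 'TF_binding'}

def is_functional(annotations, terms=FUNCTIONAL_TERMS):
    """True if any annotation of a variant is one of the functional terms."""
    return bool(set(annotations) & terms)

# Toy frame: each row holds the list of annotations for one variant
toy = pd.DataFrame({
    'consequence': [['missense'], ['intronic'], ['synonymous', 'TF_binding'], []],
})
toy['functional'] = toy['consequence'].apply(is_functional)
```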

### Outlier individual

In [None]:
# TODO
#D.loc[funcAby, sel_cols].loc['CMC_MSSM_224']

