Define functional categories and filter calls for those categories in Control and SCZ individuals

In [26]:
%load_ext autoreload
%autoreload 2
from matplotlib import pyplot as plt
import pandas as pd
import attila_utils
import funcvar
import functools
import ensembl_rest
import os.path
from bsmcalls import SNPnexus
from bsmcalls import operations
%matplotlib inline
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_colwidth', None)


## Preparations

In [4]:
data = SNPnexus.load_data('/home/attila/projects/bsm/results/2020-09-07-annotations/annotated-calls.p')

In [None]:
d = {'coding nonsyn': 'near_gens_Annotation', 'stop-gain': 'near_gens_Annotation', 'intronic (splice_site)': 'near_gens_Annotation'}
A = funcvar.count_members(D, d)
B = D.groupby('Dx')['sift_Prediction'].apply(pd.Series.value_counts).unstack().T.loc[['Deleterious', 'Deleterious - Low Confidence']]
counts = pd.concat([A, B], axis=0)
#counts = counts.append(D.groupby('Dx')['tfbs_TFBS Name_bin'].sum().astype('int64'))
#counts = counts.append(D.groupby('Dx')['regbuild_Epigenome_nervoussys_bin'].sum().astype('int64'))
#counts = counts.append(D.groupby('Dx')['gerp_Element RS Score_bin'].sum().astype('int64'))
counts

### GWAS genes

Here we take Supplementary Table 4 from the [CLOZUK paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5918692/) (CLOZUK stands for clozapine UK) and turn its `Gene(s) tagged` column into the `gwasgenes` set.

In [18]:
clozukpath = '/home/attila/projects/bsm/resources/CLOZUK/supp-table-4.csv'
gwasgenes = funcvar.get_geneset(df=pd.read_csv(clozukpath, skiprows=7), col='Gene(s) tagged')
repr(gwasgenes)

"{'DDX28', 'TP53TG1', 'HARBI1', 'DUS2', 'MAP3K11', 'ACD', 'LINC00051', 'LOC440704', 'MIR6777', 'TM6SF2', 'ZNF536', 'GRAPL', 'MIR3160-1', 'NUTF2', 'RBFOX1', 'TSR1', 'NLGN4X', 'PSMA4', 'SOX2-OT', 'TCF4', 'PPP1R13B', 'SPATS2L', 'ASH2L', 'MSANTD2', 'SRPK2', 'LOC79999', 'PPP1R16B', 'ZSWIM6', 'SCAF1', 'PGM3', 'CACNA1D', 'CALB2', 'LCAT', 'FAM109B', 'FSHB', 'TMEM243', 'ABCB9', 'MIR1281', 'C2orf47', 'ZFYVE21', 'LOC338963', 'ARL6IP4', 'CSMD1', 'SNORD91A', 'AKT3', 'ANKRD63', 'SLC32A1', 'OPCML', 'KCTD13', 'EMX1', 'SREBF2', 'RIMS1', 'TBX6', 'PTPRF', 'VPS45', 'C3orf49', 'SLC39A8', 'GPR135', 'ESRP2', 'TMX2-CTNND1', 'MIR33B', 'PSMB10', 'YPEL3', 'CKAP5', 'PTN', 'MIR8072', 'STAT6', 'TBC1D5', 'ALMS1P', 'CENPM', 'FES', 'PPP2R3A', 'LOC101929406', 'EFTUD1P1', 'LOC100507091', 'ALPK3', 'MPP6', 'WDR73', 'HCN1', 'COQ10B', 'SMG6', 'MIR4655', 'TMEM219', 'FTCDNL1', 'IREB2', 'GOLGA6L5P', 'KAT5', 'MED19', 'DGKI', 'NGEF', 'NMB', 'C10orf32', 'CNNM2', 'RPS19BP1', 'FGFR1', 'PITPNM2', 'TYW5', 'CYP2D6', 'AS3MT', 'ATXN7', 
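
How `funcvar.get_geneset` works is not shown in this notebook. The following is a minimal sketch of one plausible approach, assuming that each cell of the column holds a comma-separated list of gene symbols. The helper `geneset_from_column` and the toy table are hypothetical and only illustrate the idea.

```python
import csv
import io

def geneset_from_column(rows, col, sep=','):
    """Collect the unique, whitespace-stripped names found in column `col`."""
    genes = set()
    for row in rows:
        cell = row.get(col) or ''  # treat missing or empty cells as "no genes"
        genes.update(name.strip() for name in cell.split(sep) if name.strip())
    return genes

# Toy table (NOT the real supplementary table); one cell tags two genes
toy_csv = io.StringIO(
    'Locus,Gene(s) tagged\n'
    '1,TCF4\n'
    '2,"CACNA1D, RBFOX1"\n'
    '3,\n'
    '4,TCF4\n'
)
toy_genes = geneset_from_column(csv.DictReader(toy_csv), col='Gene(s) tagged')
```

Using a set removes genes tagged by more than one locus, which is what the later overlap test needs.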

In [None]:
Dgwas = D.loc[D['near_gens_Overlapped Gene'].apply(lambda x: bool(set(x).intersection(gwasgenes))), :]
counts_gwas = funcvar.all_functional_counts(Dgwas)
counts_gwas

### DeepSEA score
[DeepSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4768299/) predicts the effects of noncoding variants with a deep learning–based sequence model.

In [None]:
fig, ax = plt.subplots()
D['deepsea_Functional Significance Score'].hist(ax=ax)
ax.set_xlabel('DeepSEA functional significance score')
ax.set_ylabel('variants')
attila_utils.savefig(fig, 'deepsea-hist')

The histogram suggests a cutoff score somewhere between 0.5 and 0.6.  I will define functionally significant variants using both a more lenient and a more stringent threshold.
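
A minimal sketch of the two-threshold idea, assuming the lenient and stringent cutoffs sit at the two ends of the 0.5 to 0.6 range and that a score at or above the cutoff counts as significant. The function `flag_deepsea`, the column names, and the toy scores are hypothetical.

```python
import pandas as pd

# Assumed cutoffs: the two ends of the range suggested by the histogram
LENIENT, STRINGENT = 0.5, 0.6

def flag_deepsea(scores, lenient=LENIENT, stringent=STRINGENT):
    """Return boolean flags, one column per threshold, indexed like `scores`."""
    scores = pd.Series(scores, dtype='float64')
    return pd.DataFrame({
        'deepsea_lenient': scores >= lenient,      # NaN compares as False
        'deepsea_stringent': scores >= stringent,
    })

# Toy scores, not real data; None stands for a variant without a DeepSEA score
toy = pd.Series([0.12, 0.55, 0.73, None], index=['v1', 'v2', 'v3', 'v4'])
flags = flag_deepsea(toy)
```

With this convention, variants lacking a score are never flagged, so they silently fall into the non-significant group under both thresholds.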

### Loss of function variants

Loss-of-function variants include stop-gain, intronic (splice_site), and frameshift variants.  Note that we didn't observe any frameshift variants.  Loss-of-function variants might be the most revealing.

In [None]:
cols = ['Dataset', 'Dx', 'near_gens_Annotation', 'near_gens_Overlapped Gene']
LoFrows = D['near_gens_Annotation'].apply(lambda x: bool(set(x).intersection({'stop-gain', 'intronic (splice_site)'})))
D.loc[LoFrows, cols]

### Functional variants

First, let's see Taejeong's definition of functional variants:

> The terms that we chose as functional are missense, stop_gained, splice_region, regulatory, and TF_binding.
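
The definition above can be sketched as a set-membership test. This is only an illustration: `is_functional` and the toy variants are hypothetical, and the term spellings are copied from the quote. The annotation labels used elsewhere in this notebook (for example `coding nonsyn` and `stop-gain`) are spelled differently, so applying this to `D` would first require a mapping that is not defined here.

```python
# Terms copied verbatim from the quoted definition
FUNCTIONAL_TERMS = frozenset(
    {'missense', 'stop_gained', 'splice_region', 'regulatory', 'TF_binding'}
)

def is_functional(terms, functional=FUNCTIONAL_TERMS):
    """True if any of a variant's annotation terms is in the functional set."""
    if isinstance(terms, str):
        terms = [terms]  # a bare string would otherwise be iterated per character
    return not functional.isdisjoint(terms)

# Toy annotations, not real calls
toy_variants = {
    'v1': ['missense'],
    'v2': ['intronic'],
    'v3': ['intergenic', 'TF_binding'],
    'v4': [],
}
functional_ids = sorted(v for v, terms in toy_variants.items() if is_functional(terms))
```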

### Outlier individual

In [None]:
# TODO
#D.loc[funcAby, sel_cols].loc['CMC_MSSM_224']

