Calculating genic recurrence and assessing its association with schizophrenia.

In [3]:
%load_ext autoreload
%autoreload 2
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import attila_utils
from bsmcalls import readVCF, individuals, preprocessing, SNPnexus
import recur
%matplotlib inline

## All genic variants

The following steps have been re-implemented in `notebook/2020-12-28-genic-recurrence/recur.py`.

Import the annotated calls and expand their set-valued annotation columns

In [4]:
# recur.read_expand_calls() could be called here instead
calls = SNPnexus.read_annot()
colnamel = ['near_gens_Type', 'near_gens_Annotation', 'ensembl_Predicted Function', 'sift_Prediction']
nonestrl = ['None', 'None', 'other', 'other']
calls = SNPnexus.expand_multiple_setvalued(calls, colnamel=colnamel, nonestrl=nonestrl)

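Number of distinct overlapped genes per diagnosis (Dx). Although the column is labeled `num calls`, `aggfunc='nunique'` makes it count distinct genes, not calls.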

In [5]:
# keep only the calls that overlap a gene
coding = calls.loc[calls['near_gens_Overlapped Gene'] != 'None']
ncoding = coding.pivot_table(values='near_gens_Overlapped Gene', index='Dx', aggfunc='nunique', margins=True).rename(columns={'near_gens_Overlapped Gene': 'num calls'})
# divide the column (not the whole frame) by the 'All' margin
ncoding['frac calls'] = ncoding['num calls'] / ncoding.loc['All', 'num calls']
ncoding[['num calls']].style.bar(vmin=0)

| Dx | num calls |
|---|---|
| Control | 471 |
| SCZ | 1228 |
| All | 1610 |
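The `All` margin is smaller than the sum of the two groups (471 + 1228 = 1699) because a gene hit in both groups is counted once per group but only once overall; by inclusion-exclusion, 1699 - 1610 = 89 genes are hit in both. The sketch below reproduces this logic on a small hypothetical table; the column name `gene` and the data are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per call, with its group and overlapped gene.
toy = pd.DataFrame({
    'Dx': ['Control', 'Control', 'SCZ', 'SCZ', 'SCZ'],
    'gene': ['A', 'B', 'B', 'C', 'C'],
})

# Distinct genes per group (what aggfunc='nunique' computes per row of the pivot).
per_dx = toy.groupby('Dx')['gene'].nunique()

# Distinct genes overall (the 'All' margin).
n_all = toy['gene'].nunique()

# Genes hit in both groups, by inclusion-exclusion.
n_shared = per_dx.sum() - n_all

print(per_dx.to_dict(), n_all, n_shared)
```

In the toy data only gene `B` is hit in both groups, so the two per-group counts overshoot the overall count by one.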


In [6]:
genes = sorted(set(coding['near_gens_Overlapped Gene']))

In [7]:
s = coding.groupby('Individual ID')['near_gens_Overlapped Gene'].agg(set)
s

Individual ID
CMC_MSSM_027    {SLC10A7, C10orf11, CDH4, FGF14, SLC35F6:CENPA...
CMC_MSSM_055    {CDK19, FIP1L1, SFXN2, CDC42SE2, LDB3, DYX1C1-...
CMC_MSSM_056    {SNED1:AC005237.4, C9, CCDC120, RP11-260O18.1,...
CMC_MSSM_069        {EPHA6, CNKSR2, LIPA, GALNT16, CASK, ZNF385D}
CMC_MSSM_097    {CTD-2008P7.3, MADD, SNTG1, SEC14L4, FAM156A, ...
                                      ...                        
CMC_PITT_098    {PTPRN2, SYT16, RP11-649A16.1, MYLK4, HAND2-AS...
CMC_PITT_101    {NEGR1, MIPOL1, ZFHX4, SEMA3A, RALGAPB, SLC35E...
CMC_PITT_113    {DOT1L, CD164, QPCT, RNFT2, MARK4, AUTS2, MGAT...
CMC_PITT_117    {RALGAPA2, C9orf96, AC133680.1, CERS6, TMEM184...
CMC_PITT_118    {SGCD, TBC1D5, ZBTB20, SELT, DCC, FEZ2, FTO:RP...
Name: near_gens_Overlapped Gene, Length: 86, dtype: object

In [8]:
recurrence_dict = {gene: [ind for ind in s.index if gene in s[ind]] for gene in genes}
recurrence_ser = pd.Series(recurrence_dict)
recurrence_ser

5S_rRNA                            [CMC_MSSM_391]
AAGAB                              [CMC_MSSM_304]
ABCA12                             [CMC_MSSM_372]
ABCA9:ABCA9-AS1                    [CMC_MSSM_363]
ABCG1                              [CMC_MSSM_352]
                                 ...             
ZNF804A                            [CMC_MSSM_273]
ZNF804B              [CMC_MSSM_130, CMC_PITT_098]
ZYG11A                             [CMC_MSSM_321]
hsa-mir-490                        [CMC_MSSM_097]
hsa-mir-490:CHRM2    [CMC_MSSM_055, CMC_MSSM_099]
Length: 1610, dtype: object
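The gene-to-individuals mapping above tests every gene against every individual. A single pass over the per-individual gene sets gives the same kind of result; the following is a minimal sketch on hypothetical data, where the function name `invert_gene_sets` and the toy IDs are assumptions and not part of `recur.py`.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical toy input: individual -> set of genes overlapped by its calls.
gene_sets = pd.Series({
    'ind1': {'A', 'B'},
    'ind2': {'B'},
    'ind3': {'B', 'C'},
})

def invert_gene_sets(gene_sets):
    """Map each gene to the sorted list of individuals with a call in it."""
    carriers = defaultdict(list)
    for ind, genes in gene_sets.items():
        for gene in genes:
            carriers[gene].append(ind)
    # Sort genes and carriers so the result does not depend on set ordering.
    return pd.Series({g: sorted(inds) for g, inds in sorted(carriers.items())})

toy_recurrence = invert_gene_sets(gene_sets)
print(toy_recurrence)
```

A gene is recurrent when its list holds more than one individual, as for `B` here.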

In [9]:
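# diagnosis of each individual, looked up once instead of slicing `calls` per carrier
dx_of = calls.groupby('Individual ID')['Dx'].first()
# for each gene, count its carriers in each diagnosis group
ss = recurrence_ser.apply(lambda l: [sum(dx_of[ind] == dx for ind in l) for dx in ['Control', 'SCZ']])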
recurrence = ss.apply(pd.Series).rename(columns={0: 'n Control indiv', 1: 'n SCZ indiv'})
recurrence['Individual IDs'] = recurrence_ser

In [10]:
recurrence.sort_values('n SCZ indiv', ascending=False)

| Gene | n Control indiv | n SCZ indiv | Individual IDs |
|---|---|---|---|
| KCND2 | 0 | 4 | [CMC_MSSM_211, CMC_MSSM_327, CMC_MSSM_352, CMC... |
| RBFOX1 | 0 | 4 | [CMC_MSSM_213, CMC_MSSM_224, CMC_MSSM_305, CMC... |
| SGCD | 0 | 3 | [CMC_MSSM_352, CMC_PITT_072, CMC_PITT_118] |
| ADARB2 | 0 | 3 | [CMC_MSSM_321, CMC_MSSM_366, CMC_MSSM_415] |
| TTC28 | 0 | 3 | [CMC_MSSM_118, CMC_MSSM_265, CMC_MSSM_293] |
| ... | ... | ... | ... |
| NTSR1 | 1 | 0 | [CMC_PITT_064] |
| NUCB2 | 1 | 0 | [CMC_PITT_050] |
| NUDT7 | 1 | 0 | [CMC_MSSM_215] |
| NYAP2 | 1 | 0 | [CMC_MSSM_406] |
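The two count columns record, for each gene, how many of its carriers belong to each diagnosis group. A minimal sketch of that tally on hypothetical data follows; the helper `count_by_dx` and the toy IDs are assumptions made for illustration, and the sketch assumes each individual has exactly one diagnosis.

```python
import pandas as pd

# Hypothetical toy inputs: gene -> carriers, and individual -> diagnosis.
carriers = pd.Series({'A': ['ind1'], 'B': ['ind1', 'ind2', 'ind3'], 'C': ['ind3']})
dx = pd.Series({'ind1': 'Control', 'ind2': 'SCZ', 'ind3': 'SCZ'})

def count_by_dx(carriers, dx, groups=('Control', 'SCZ')):
    """Count, for each gene, how many carriers fall in each diagnosis group."""
    rows = {
        gene: {f'n {g} indiv': sum(dx[ind] == g for ind in inds) for g in groups}
        for gene, inds in carriers.items()
    }
    return pd.DataFrame.from_dict(rows, orient='index')

toy_counts = count_by_dx(carriers, dx)
print(toy_counts)
```

Looking up the diagnosis in a precomputed Series keeps the tally independent of the size of the call table.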


In [11]:
joint_recur = recurrence.value_counts(['n Control indiv', 'n SCZ indiv']).unstack().fillna(0).astype('int64').rename_axis(index='m Control indiv')
import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)
joint_recur.style.background_gradient(cmap=cm, vmin=0, vmax=100).set_caption('# mutated genes in (m, n) individuals')

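Number of mutated genes in (m, n) individuals; rows are m Control individuals, columns are n SCZ individuals.

| m Control indiv \ n SCZ indiv | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1035 | 93 | 9 | 2 |
| 1 | 367 | 65 | 13 | 0 | 0 |
| 2 | 15 | 8 | 1 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |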
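Each cell (m, n) of the joint table counts the genes mutated in exactly m Control and n SCZ individuals. As an alternative to `value_counts(...).unstack().fillna(0)`, `pd.crosstab` builds the same kind of table and fills empty cells with zeros directly. The sketch below uses hypothetical per-gene counts.

```python
import pandas as pd

# Hypothetical per-gene carrier counts (genes A-D are made up).
toy_counts = pd.DataFrame(
    {'n Control indiv': [1, 1, 0, 0], 'n SCZ indiv': [0, 2, 1, 1]},
    index=['A', 'B', 'C', 'D'],
)

# Rows: number of Control carriers; columns: number of SCZ carriers.
toy_joint = pd.crosstab(toy_counts['n Control indiv'], toy_counts['n SCZ indiv'])
print(toy_joint)
```

Every gene lands in exactly one cell, so the cells sum to the number of genes.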


In [12]:
marginal_recur = pd.DataFrame([recurrence.value_counts(x) for x in ['n Control indiv', 'n SCZ indiv']], index=['Control', 'SCZ']).T.fillna(0).astype('int64').rename_axis(index='n indiv')
marginal_recur.style.background_gradient(cmap=cm, vmin=0, vmax=500).set_caption('# mutated genes in n individuals')

Number of mutated genes in n individuals, per diagnosis group.

| n indiv | Control | SCZ |
|---|---|---|
| 0 | 1139 | 382 |
| 1 | 445 | 1109 |
| 2 | 25 | 107 |
| 3 | 1 | 10 |
| 4 | 0 | 2 |
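The marginal counts are the row and column sums of the joint table, which gives a quick consistency check. The sketch below re-enters the joint counts reported above by hand (it does not recompute them from the calls) and sums them along each axis.

```python
import pandas as pd

# Joint counts copied from the (m, n) table above.
joint = pd.DataFrame(
    [[0, 1035, 93, 9, 2],
     [367, 65, 13, 0, 0],
     [15, 8, 1, 1, 0],
     [0, 1, 0, 0, 0]],
    index=pd.Index([0, 1, 2, 3], name='m Control indiv'),
    columns=pd.Index([0, 1, 2, 3, 4], name='n SCZ indiv'),
)

# Summing over the SCZ axis gives the Control marginal, and vice versa.
control_marginal = joint.sum(axis=1)
scz_marginal = joint.sum(axis=0)

print(control_marginal.tolist())
print(scz_marginal.tolist())
print(joint.to_numpy().sum())
```

Both marginals agree with the table above, and the grand total equals the 1610 distinct genes.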


## Functional genic variants

Here I repeat the calculations after first filtering for functional genic variants, that is, calls annotated as 'coding nonsyn', 'stop-gain', or 'intronic (splice_site)'.

In [13]:
coding_functional = recur.get_coding(calls)
coding_functional[recur.functional_genic].apply(sum)

coding nonsyn             59
stop-gain                  4
intronic (splice_site)     1
dtype: int64
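The internals of `recur.get_coding` are not shown here. The sketch below only illustrates one way such a filter could work, assuming the expanded annotation yields one boolean indicator column per category; the toy table, the `gene` column, and the extra `intronic` column are hypothetical.

```python
import pandas as pd

# The three functional categories named in the text.
functional_cols = ['coding nonsyn', 'stop-gain', 'intronic (splice_site)']

# Hypothetical toy calls with one boolean indicator column per annotation category.
toy_calls = pd.DataFrame({
    'gene': ['A', 'B', 'C', 'D'],
    'coding nonsyn': [True, False, False, False],
    'stop-gain': [False, False, True, False],
    'intronic (splice_site)': [False, False, False, False],
    'intronic': [False, True, False, True],
})

# Keep a call if it falls in at least one functional category.
is_functional = toy_calls[functional_cols].any(axis=1)
toy_functional = toy_calls.loc[is_functional]

# Calls per functional category among the retained calls.
category_counts = toy_functional[functional_cols].sum()
print(toy_functional['gene'].tolist())
print(category_counts.to_dict())
```

Summing the indicator columns, as the cell above does with `.apply(sum)`, counts the calls in each category.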

In [14]:
recurrence_functional = recur.genic_recurrence_df(coding_functional)
recur.contingency_tab(recurrence_functional).style.background_gradient(cmap=cm, vmin=0, vmax=100).set_caption('# mutated genes in (m, n) individuals')

| m Control indiv \ n SCZ indiv | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0 | 43 | 1 |
| 1 | 15 | 0 | 0 |


Only one gene carries functional variants in more than one individual, and both carriers are SCZ individuals. The gene encodes an olfactory receptor:

In [15]:
recurrence_functional.loc[recurrence_functional[['n Control indiv', 'n SCZ indiv']].sum(axis=1) > 1]

| Gene | n Control indiv | n SCZ indiv | Individual IDs |
|---|---|---|---|
| OR2G6 | 0 | 2 | [CMC_MSSM_193, CMC_MSSM_211] |


