Adding annotations from SNPnexus output files

In [1]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
from matplotlib import pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from bsmcalls import individuals
from bsmcalls import readVCF
from bsmcalls import SNPnexus
from bsmcalls import preprocessing
import pandas as pd
import numpy as np
import re
%matplotlib inline

In [2]:
calls, clin = individuals.get_data(merge=False, cleancalls=False)

## Annotation with SNPnexus
[SNPnexus](https://www.snp-nexus.org/v4/) is a recently updated web service that uses numerous databases to annotate human genomic variants (see [this article](https://academic.oup.com/nar/article/48/W1/W185/5851388)).  The short sections below describe how I used SNPnexus to annotate our somatic variant calls.  The following settings were used:

* Human Assembly: GRCh37/hg19
* filtered VCF files were uploaded
* all annotation categories were selected
* *TXT per annotation*

### Overlapping genes
A key annotation is *Overlapped and Nearest Genes*, stored in the `near_gens.txt` output file of SNPnexus.  Each variant is overlapped by zero, one, or more genes.  In the last case two or more genes overlap each other around the given variant, and `near_gens.txt` lists the same variant in multiple rows, one per overlapping gene.  Since my `calls` DataFrame must contain exactly one row for each variant, I collapsed those multiple rows into one by listing the overlapping genes in a single colon (`:`) separated string.
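
The collapsing step can be sketched as follows.  This is a minimal illustration on made-up variants and gene names, not the actual code of the `SNPnexus` module.  The column names `Variation ID` and `Overlapped Gene` are assumed to match the SNPnexus output, and `collapse_genes` is a hypothetical helper.  Missing gene names are dropped before joining, because `':'.join` raises a `TypeError` when it meets a float such as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for near_gens.txt: one row per (variant, overlapping gene) pair.
near_gens = pd.DataFrame({
    'Variation ID': ['chr1:100:A/G', 'chr1:100:A/G', 'chr2:200:C/T', 'chr3:300:G/A'],
    'Overlapped Gene': ['GENE1', 'GENE2', 'GENE3', np.nan],
})

def collapse_genes(genes):
    """Join the non-missing gene names of one variant with ':' (hypothetical helper)."""
    present = genes.dropna().astype(str)
    # Keep NaN for variants that no gene overlaps.
    return ':'.join(present) if len(present) else np.nan

# One row per variant, in order of first appearance.
collapsed = (near_gens
             .groupby('Variation ID', sort=False)['Overlapped Gene']
             .agg(collapse_genes))
print(collapsed)
```

In this toy input the first variant is overlapped by two genes, the second by one, and the third by none.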

[This article](https://www.nature.com/articles/s41598-019-49802-w) presents a study of overlapping genes and shows that sometimes more than 5 genes may overlap at a locus.  This means that for each variant I should come up with a (possibly empty) set of genes that overlap that variant.
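
A colon-separated string can be turned back into such a set when needed.  The sketch below assumes the colon-separated encoding described above and that gene names never contain a colon; `genes_to_set` is a hypothetical helper, not part of `bsmcalls`.

```python
def genes_to_set(value):
    """Turn a colon-separated gene string into a set of gene names.

    Missing values (NaN, None) and empty strings give an empty set.
    """
    if not isinstance(value, str) or value == '':
        return set()
    return set(value.split(':'))


print(genes_to_set('GENE1:GENE2'))
print(genes_to_set(float('nan')))
```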

### Selecting sets of annotations

In [84]:
#annotlist = ['ensembl', 'near_gens', 'phast', 'gerp']
annotlist = ['1KGen', 'cpg', 'deepsea', 'encode', 'ensembl', 'gerp', 'near_gens', 'phast', 'sift']
# columns in which the string 'None' is to be treated as a missing value
na_cols = {
    '1KGen': ['AFR Frequency', 'AMR Frequency', 'EAS Frequency', 'EUR Frequency', 'SAS Frequency'],
    'near_gens': ['Overlapped Gene', 'Type', 'Annotation',
                  'Nearest Upstream Gene', 'Type of Nearest Upstream Gene', 'Distance to Nearest Upstream Gene',
                  'Nearest Downstream Gene', 'Type of Nearest Downstream Gene', 'Distance to Nearest Downstream Gene'],
}
na_values = {annotation: {col: 'None' for col in cols} for annotation, cols in na_cols.items()}
annot = SNPnexus.get_multi_annotations(annotlist=annotlist, na_values=na_values)

TypeError: sequence item 0: expected str instance, float found

In [None]:
cols2binarize = []
cols2binarize += ['1KGen_AFR Frequency', '1KGen_AMR Frequency', '1KGen_EAS Frequency', '1KGen_EUR Frequency', '1KGen_SAS Frequency']
cols2binarize += ['cpg_CpG Island']
cols2binarize += ['gerp_Element RS Score']
cols2binarize += ['phast_Score']
annot = SNPnexus.binarize_cols(cols2binarize, annot, calls, suffix='_bin')
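
`SNPnexus.binarize_cols` is defined in the `bsmcalls` package and is not shown here.  As a rough sketch of what binarizing a single column might involve, the example below assumes that "binarize" means recording whether a variant has any value for the annotation, and that the result is aligned to the index of `calls`.  Both are assumptions, and `binarize_col_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def binarize_col_sketch(col, annot, index, suffix='_bin'):
    """Hypothetical helper: add a categorical column marking whether `col` has a value."""
    annot = annot.reindex(index)       # one row per call, NaN where unannotated
    present = annot[col].notna()       # True where the annotation exists
    annot[col + suffix] = pd.Categorical(present, categories=[False, True])
    return annot

# Toy data: three calls, only two of which appear in the annotation table.
index = pd.Index(['v1', 'v2', 'v3'], name='variant')
annot_toy = pd.DataFrame({'phast_Score': [0.9, np.nan]},
                         index=pd.Index(['v1', 'v2'], name='variant'))
result = binarize_col_sketch('phast_Score', annot_toy, index)
print(result)
```

Under these assumptions the binarized column has no missing values even when the original column is sparse, which would agree with the non-null counts reported by `annot.info()`.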

In [82]:
colsdict = {}
# order reflecting severity of effect
colsdict.update({'sift_Prediction': ['Deleterious', 'Deleterious - Low Confidence', 'Tolerated', 'Tolerated - Low Confidence']})
# order reflecting increasing frequency of categories in the data set
colsdict.update({'encode_Feature Type Class': ['Polymerase', 'Open Chromatin', 'Transcription Factor', 'Histone']})
colsdict.update({'ensembl_Predicted Function': ['coding', 'intronic', 'intronic (splice_site)', '5utr', '3utr', '5upstream', '3downstream', 'non-coding intronic', 'non-coding']})
annot = SNPnexus.regularize_categ_cols(colsdict, annot, calls, nafillval='other')
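
The actual behavior of `SNPnexus.regularize_categ_cols` is not shown in this notebook.  A minimal sketch of one plausible reading: each column becomes an ordered categorical whose levels follow the given list, with missing or unlisted values mapped to a catch-all level.  The function name and the handling of unlisted values are assumptions.

```python
import numpy as np
import pandas as pd

def regularize_categ_col_sketch(values, categories, nafillval='other'):
    """Hypothetical helper: ordered categorical with a catch-all level."""
    # Keep listed values; replace NaN and unlisted values by the fill value.
    cleaned = values.where(values.isin(categories), nafillval)
    categ = pd.Categorical(cleaned, categories=categories + [nafillval], ordered=True)
    return pd.Series(categ, index=values.index)

levels = ['Deleterious', 'Deleterious - Low Confidence',
          'Tolerated', 'Tolerated - Low Confidence']
sift = pd.Series(['Tolerated', np.nan, 'Deleterious', 'Unknown'])
regular = regularize_categ_col_sketch(sift, levels)
print(regular)
```

Because the categorical is ordered, the level order given in `colsdict` carries through to sorting and to the order of levels in plots and model terms.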

In [39]:
annot.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3411 entries, ('CMC_MSSM_027', 'NeuN_pl', '1', 11973569, 'C/T') to ('CMC_MSSM_331', 'muscle', '17', 29109599, 'T/G')
Data columns (total 69 columns):
 #   Column                                 Non-Null Count  Dtype   
---  ------                                 --------------  -----   
 0   1KGen_Variation ID                     247 non-null    object  
 1   1KGen_Chromosome                       247 non-null    object  
 2   1KGen_Position                         247 non-null    float64 
 3   1KGen_dbSNP                            247 non-null    object  
 4   1KGen_REF Allele                       247 non-null    object  
 5   1KGen_ALT Allele                       247 non-null    object  
 6   1KGen_Minor Allele                     247 non-null    object  
 7   1KGen_AFR Frequency                    62 non-null     float64 
 8   1KGen_AFR Frequency_bin                3411 non-null   category
 9   1KGen_AMR Frequency                 

In [7]:
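# add the annotation columns to the calls; assumes both frames share the same variant index
data = pd.concat([calls, annot], axis=1)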

### Evolutionary constraints

Four evolutionary conservation scores from three methods are compared to each other below.

* GERP++ has two RS (rejected substitution) scores
* `gerp_Element RS Score` refers to conserved elements while `gerp_Base RS Score` refers to conserved bases

In [None]:
len(evoldata[['SiPhyLOD', 'phast_Score']].dropna())

In [None]:
len(evoldata[['SiPhyLOD', 'gerp_Element RS Score']].dropna())

In [None]:
len(evoldata[['phast_Score', 'gerp_Element RS Score']].dropna())
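
The three counts above can also be obtained in one step.  The sketch below uses a toy table in place of `evoldata`, which is not defined in the cells shown here.  It counts, for every pair of score columns, the variants for which both scores are present.

```python
import numpy as np
import pandas as pd

# Toy stand-in for evoldata, with the same column names as above.
scores = pd.DataFrame({
    'SiPhyLOD': [1.0, np.nan, 2.0, np.nan],
    'phast_Score': [0.5, 0.7, np.nan, np.nan],
    'gerp_Element RS Score': [3.0, 4.0, 5.0, np.nan],
})

# 1 where a score is present, 0 where it is missing.
observed = scores.notna().astype(int)

# Entry (a, b) is the number of rows where both a and b are present;
# the diagonal holds the per-column counts of non-missing values.
pairwise_counts = observed.T @ observed
print(pairwise_counts)
```

Each off-diagonal entry equals the corresponding `len(... .dropna())` computed on the same two columns.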

The following plots illustrate the relationship between `phast_Score` and `gerp_Element RS Score` (left) as well as that between `gerp_Base RS Score` and `gerp_Element RS Score` (right).

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
# left: phastCons score vs GERP++ element RS score
ax[0].scatter(x='phast_Score', y='gerp_Element RS Score', data=evoldata)
ax[0].set_xlabel('phast_Score')
ax[0].set_ylabel('gerp_Element RS Score')
ax[0].set_title('phast vs gerp_Element')
# right: GERP++ base RS score vs GERP++ element RS score
ax[1].scatter(x='gerp_Base RS Score', y='gerp_Element RS Score', data=evoldata)
ax[1].set_xlabel('gerp_Base RS Score')
ax[1].set_ylabel('gerp_Element RS Score')
ax[1].set_title('gerp_Base vs gerp_Element')

## Andy's questions

* Roadmap epigenome annotation of SNPnexus: what does it exactly mean?
* mutation types (A/C, ...): how does the mutational spectrum in the outlier sample compare to that in other samples?
* brain expressed genes combined with chromatin state DLPFC
* control callsets from other BSMN groups

