Adding annotations from SNPnexus output files

In [1]:
%load_ext autoreload
%autoreload 2
%reload_ext autoreload
from bsmcalls import individuals
from bsmcalls import readVCF
from bsmcalls import SNPnexus
import pandas as pd
import numpy as np
import re
%matplotlib inline

In [2]:
calls = individuals.get_data(merge=True, cleancalls=False)

## Annotation with SNPnexus
[SNPnexus](https://www.snp-nexus.org/v4/) is a recently updated web service that uses numerous databases to annotate human genomic variants (see [this article](https://academic.oup.com/nar/article/48/W1/W185/5851388)). The short sections below describe how I used SNPnexus to annotate our somatic variant calls. The following settings were used:

* Human Assembly: GRCh37/hg19
* Input: filtered VCF files were uploaded
* Annotation categories: all were selected
* Output format: *TXT per annotation*

### Overlapping genes
A key annotation is *Overlapped and Nearest Genes*, stored in the `near_gens.txt` output file of SNPnexus. Each variant is overlapped by zero, one, or more genes. When two or more genes overlap each other around a given variant, `near_gens.txt` lists that variant in multiple rows, one per overlapping gene. Since my `calls` DataFrame must contain exactly one row for each variant, I collapsed those multiple rows into one by listing the overlapping genes in a single colon (`:`) separated string.
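
The collapsing step described above can be sketched with a pandas `groupby`. This is only a minimal illustration on a toy table, not the code used in `bsmcalls.SNPnexus`. The column names `Variation ID` and `Overlapped Gene`, the gene symbols, and the helper `collapse_genes` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for near_gens.txt: one row per (variant, overlapping gene).
# Column names and values are illustrative assumptions, not real SNPnexus output.
near_gens = pd.DataFrame({
    "Variation ID": ["chr1:100:A/T", "chr1:100:A/T", "chr2:200:C/G", "chr3:300:G/A"],
    "Overlapped Gene": ["GENE_A", "GENE_B", "GENE_C", None],
})


def collapse_genes(genes: pd.Series) -> str:
    """Join the distinct, non-missing gene symbols of one variant with ':'."""
    return ":".join(sorted(genes.dropna().unique()))


# One row per variant; variants without any overlapping gene get an empty string.
collapsed = near_gens.groupby("Variation ID")["Overlapped Gene"].agg(collapse_genes)
print(collapsed)
```

Sorting the gene symbols before joining makes the resulting string independent of the row order in the input file, which helps when comparing variants that share the same set of overlapping genes.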

[This article](https://www.nature.com/articles/s41598-019-49802-w) presents a study of overlapping genes and shows that sometimes more than 5 genes may overlap at a locus. This means that for each variant I should come up with a (possibly empty) set of genes that overlap that variant.
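
To work with such a gene set directly, the colon-separated string can be converted back into a Python set. The helper below is a hypothetical sketch (the name `genes_to_set` and the gene symbols are made up for illustration) and it assumes that a variant with no overlapping gene is represented by an empty string.

```python
def genes_to_set(genes: str) -> frozenset:
    """Turn a colon-separated gene string into a (possibly empty) set of genes."""
    # Skipping empty pieces makes the empty string map to the empty set.
    return frozenset(g for g in genes.split(":") if g)


overlapping = genes_to_set("GENE_A:GENE_B")
no_overlap = genes_to_set("")
print(overlapping, no_overlap)
```

With sets, questions such as "does this variant overlap a gene of interest?" reduce to membership tests or set intersections, regardless of how many genes overlap at the locus.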

### Selecting sets of annotations

In [3]:
annotlist = ['near_gens', 'sift']
annot = SNPnexus.get_multi_annotations(annotlist=annotlist)

In [4]:
annot.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 54 entries, ('CMC_MSSM_027', 'NeuN_pl', '11', 22272365, 'A/T') to ('CMC_PITT_118', 'NeuN_pl', '2', 29124973, 'T/G')
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   (sift, Variation ID)  54 non-null     object 
 1   (sift, Chromosome)    54 non-null     object 
 2   (sift, Position)      54 non-null     int64  
 3   (sift, dbSNP)         54 non-null     object 
 4   (sift, Variant)       54 non-null     object 
 5   (sift, Transcript)    54 non-null     object 
 6   (sift, Gene)          54 non-null     object 
 7   (sift, AA Position)   54 non-null     int64  
 8   (sift, Wild AA)       54 non-null     object 
 9   (sift, Mutant AA)     54 non-null     object 
 10  (sift, Score)         54 non-null     float64
 11  (sift, Prediction)    54 non-null     object 
dtypes: float64(1), int64(2), object(9)
memory usage: 6.5+ KB


* Roadmap epigenome annotation of SNPnexus: what exactly does it mean?
* mutation types (A/C, ...): how does the mutational spectrum in the outlier sample compare to that in other samples?
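
One possible way to approach the mutational spectrum question is to tabulate the fraction of each mutation type per individual. The sketch below uses a toy table with hypothetical column names (`Individual ID`, `Mutation`) and made-up values. In the real data these would have to be taken from the corresponding index levels of `calls`, whose names are not shown here.

```python
import pandas as pd

# Toy data: one row per variant call; names and values are illustrative only.
toy = pd.DataFrame({
    "Individual ID": ["ind_1", "ind_1", "ind_1", "ind_2", "ind_2"],
    "Mutation": ["A/T", "C/T", "C/T", "A/T", "G/A"],
})

# Rows: individuals; columns: mutation types; values: within-individual fractions.
spectrum = pd.crosstab(toy["Individual ID"], toy["Mutation"], normalize="index")
print(spectrum)
```

Normalizing within each individual makes samples with very different numbers of calls comparable, so an outlier sample's spectrum can be set side by side with the others.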

