# Introduction to VAPr: package for the aggregation of genomic variant data 

#### Author: C. Mazzaferro & Kathleen Fisch
#### Email: cmazzafe@ucsd.edu
#### Date: June 2016
 
## Outline of Notebook
<a id = "toc"></a>
1. <a href = "#background">Background</a>
2. <a href = "#setup"> Annovar Set-Up</a>
  * <a href = "#naming">Set paths and names</a>
3. <a href = "#mongo">Set Up MongoDB</a>
  * <a href = "#parse">Parse to MongoDB</a>
4. <a href = "#export">Export CSV and VCF files</a>

<a id = "background"></a>
## Background

This notebook walks through the basic steps for annotating variants from a VCF file efficiently and thoroughly using the VAPr package. The package retrieves variant information from [ANNOVAR](http://annovar.openbioinformatics.org/en/latest/) and [myvariant.info](http://myvariant.info) and consolidates it into convenient formats. It is well suited to bioinformaticians who want to aggregate variant information into a single database for ease of use and richer downstream analysis.

For a more complete description of the functionalities of the package, you can visit the [VAPr Sample Usage Notebook](https://github.com/ucsd-ccbb/VAPr/blob/master/VAPr%20Sample%20Usage.ipynb) and/or check the full documentation on GitHub. 

<a id = "setup"></a>
## Annovar Set-Up

ANNOVAR* is one of the most popular open source variant annotation packages currently available. It is a collection of command-line Perl scripts that leverage up-to-date, publicly available datasets and allow region-, gene-, and filter-based annotations. VAPr builds on ANNOVAR through wrapper functions that automatically download and update the databases and run the annotation itself. The wrapper was written to give the user the latest annotation data while minimizing the overhead of learning a new toolkit from scratch. It takes care of:

 1. Building a syntactically correct command line string command
 2. Storing downloaded files, as well as annotated ones to their appropriate locations
 3. Automatic updates of databases
 
The databases currently supported as default are the following:
 
- knownGene
- tfbsConsSites
- cytoBand
- genomicSuperDups
- esp6500siv2_all
- 1000g2015aug_all
- popfreq_all
- clinvar_20140929
- cosmic70
- nci60

This list can be expanded or restricted by the user, depending on their research needs.
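As a rough illustration of the first task in the list above (building a syntactically correct command line), the sketch below assembles a `table_annovar.pl` call from the default databases. The helper `build_annovar_command` and the example paths are hypothetical, the database names are copied from the list above, and this is not necessarily the exact command VAPr issues.

```python
# Hypothetical helper: sketches how a table_annovar.pl call could be assembled.
# Operation codes follow ANNOVAR's convention:
# g = gene-based, r = region-based, f = filter-based.
DEFAULT_DATABASES = {
    "knownGene": "g",
    "tfbsConsSites": "r",
    "cytoBand": "r",
    "genomicSuperDups": "r",
    "esp6500siv2_all": "f",
    "1000g2015aug_all": "f",
    "popfreq_all": "f",
    "clinvar_20140929": "f",
    "cosmic70": "f",
    "nci60": "f",
}


def build_annovar_command(annovar_path, vcf_path, out_prefix,
                          databases=DEFAULT_DATABASES, build="hg19"):
    """Return the argument list for a table_annovar.pl run (nothing is executed)."""
    base = annovar_path.rstrip("/")
    return [
        "perl", base + "/table_annovar.pl",
        vcf_path, base + "/humandb/",
        "-buildver", build,
        "-out", out_prefix,
        "-remove",
        "-protocol", ",".join(databases),
        "-operation", ",".join(databases.values()),
        "-nastring", ".",
        "-vcfinput",
    ]


# Example paths are placeholders, not real locations
command = build_annovar_command("/path/to/annovar/", "sample.vcf", "sample_annotated")
print(" ".join(command))
```

Restricting or expanding the annotation then amounts to passing a different `databases` mapping.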

Required setup for this step: download ANNOVAR from [here](http://www.openbioinformatics.org/annovar/annovar_download_form.php) and extract it to the location where you would like the databases to live. The databases take up around 25 GB of disk space in total, so make sure that much space is available.
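A quick way to check the disk space requirement before downloading is sketched below with the standard library. The helper name `has_free_space` is hypothetical, and `"."` stands in for the directory where ANNOVAR was extracted.

```python
import shutil


def has_free_space(path, required_gb=25):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


# "." is a placeholder for the ANNOVAR directory
print(has_free_space(".", required_gb=25))
```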
 

*[ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data Nucleic Acids Research](http://nar.oxfordjournals.org/content/38/16/e164) , 38:e164, 2010


<a id = "naming"></a>
### Set paths and names
Import the modules, set the path variables, and initialize the object `sub_process`, which contains the functions in charge of dealing with ANNOVAR.


In [1]:
# third party modules
import pandas as pd
import importlib

# VAPr modules
from VAPr import parser_models, annovar_suprocess


IN_PATH = "/Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf"
OUT_PATH = "/Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/"
ANNOVAR_PATH = "/Volumes/Carlo_HD1/CCBB/annovar/"   # location of the scripts and databases
sub_process = annovar_suprocess.AnnovarWrapper(IN_PATH, OUT_PATH, ANNOVAR_PATH)



In [20]:
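# Get an estimate of the number of variants in the vcf file
# (every line is counted, so header lines are included in the total)
with open(IN_PATH) as vcf_handle:
    print("Number of variants in vcf file: %i" % sum(1 for line in vcf_handle))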

Number of variants in vcf file: 26625
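
The count above includes the VCF header, so it slightly overestimates the number of variants. A minimal sketch of a stricter count, which skips header lines starting with `#`, could look like the following. The helper `count_vcf_records` and the tiny demonstration file are illustrative assumptions, not part of VAPr.

```python
import gzip
import os
import tempfile


def count_vcf_records(path):
    """Count variant records in a VCF, skipping '#' header lines and blank lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


# Tiny demonstration file: two header lines followed by two records
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf")
with open(demo_path, "w") as handle:
    handle.write("##fileformat=VCFv4.1\n"
                 "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
                 "1\t100\t.\tA\tG\t.\t.\t.\n"
                 "1\t200\t.\tC\tT\t.\t.\t.\n")

n_records = count_vcf_records(demo_path)
print(n_records)
```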


In [4]:
sub_process.download_dbs()

Currently downloading database file: hg19_kgXref
Currently downloading database file: hg19_knownGeneMrna
Currently downloading database file: hg19_knownGene

Annovar finished dowloading on file : hg19_knownGene. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: hg19_kgXref

Annovar finished dowloading on file : hg19_kgXref. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: hg19_knownGeneMrna
Currently downloading database file: hg19_cosmic70
Currently downloading database file: hg19_cosmic70
Currently downloading database file: hg19_cosmic70

Annovar finished dowloading on file : hg19_cosmic70. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: genomicSuperDups

Annovar finished dowloading on file : genomicSuperDups. A .txt file has been created in the ANNOVAR_PATH directory

Currently downloading database file: hg19_cytoBand

Annovar finished dowl

'Finished downloading databases to /Volumes/Carlo_HD1/CCBB/annovar/humandb/'

In [16]:
%%time
sub_process.run_annovar()

Currently working on VCF file: Normal_targeted_seq_annotated, field variant_function
Currently working on VCF file: Normal_targeted_seq_annotated, field exonic_variant_function
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_genomicSuperDups
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_cosmic70_filtered
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_cosmic70_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_cytoBand
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_popfreq_all_20150413_filtered
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_popfreq_all_20150413_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field 2015_08_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_esp6500siv2_all_dropped
Currently working on VCF file: Normal_targeted_seq_annotated, field hg19_esp6500siv2

'Finished running ANNOVAR on /Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Normal_targeted_seq.vcf'

~1 minute for 25k variants

<a id = "mongo"></a>
## Set up MongoDB 

A MongoDB instance must be installed and running. Platform-specific installation instructions, along with downloads of the latest distributions, can be found [here](https://www.mongodb.com/download-center?jmp=docs&_ga=1.190648363.1890535995.1486057213#community).

![Imgur](http://i.imgur.com/FjLUDFb.png)

A great introduction to the concepts relevant to MongoDB, and specifically to interacting with it from Python, can be found [here](https://docs.mongodb.com/getting-started/python/introduction/). Of particular interest is the explanation of how documents are formatted and stored inside the database. Variants follow that format as well; a sample database entry looks roughly as follows (the document for this variant, HGVS_id: chr20:g.25194768A>G, has been reduced to half of its actual size).


```python
{
  "_id": ObjectId("5887c6d3bc644e51c028971a"),
  "chr": "chr19",
  "cytoband": {
    "Region": "13",
    "Sub_Band": "41",
    "Chromosome": 19,
    "Band": "q",
    "Name": "19q13.41"
  },
  "alt": "C",
  "hg19": {
    "end": 25194768,
    "start": 25194768
  },
  
  ...
  
  "otherinfo": [
    "GT:AD:DP:GQ:PL",
    "1/1:0,2:2:6:89,6,0"
  ],
  "hgvs_id": "chr20:g.25194768A>G",
  "esp6500siv2_all": "0.71",
  "ref": "T",
  "chrom": "20",
  "grasp": {
    "exclusively_male_female": "n",
    "initial_sample_description": [
      "Up to 46186 EA individuals",
      [
        "Up to 3445 EA cases",
        " 6935 EA controls"
      ],
      "69395 EA individuals",
      "69395 EA individuals"
    ],
    "creation_date": "8/17/12",
    "gwas_ancestry_description": "European",
    "srsid": 6083780,
    "platform_snps_passing_qc": [
      "Affymetrix & Illumina [~2.5 million] (imputed)",
      "Illumina [528745]",
      [
        "Affymetrix",
        " Illumina & Perlegen [~2.5 million] (imputed)"
      ],
      [
        "Affymetrix",
        " Illumina & Perlegen [~2.5 million] (imputed)"
      ]
    ],
    "hg19": {
      "chr": 20,
      "pos": 25194768
    },
    "publication": [
      {
        "pmid": 20081858,
        "phenotype": "HOMA-B",
        "p_value": 0.033930000000000001825,
        "snpid": "rs6083780",
        "paper_phenotype_description": [
          "Glucose homeostasis traits (fasting glucose",
          " fasting insulin",
          " HOMA-B",
          " HOMA-IR)"
        ],
        "date_pub": "1/17/2010",
        "title": "New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk.",
        "location_within_paper": "FullData",
        "paper_phenotype_categories": "Quantitative trait(s);Type 2 diabetes (T2D);Blood-related",
        "journal": "Nat Genet"
      },
      {
        "pmid": 20522523,
        "phenotype": "Partial epilepsy",
        "p_value": 0.022849999999999998784,
        "snpid": "rs6083780",
        "paper_phenotype_description": "Epilepsy (partial epilepsy)",
        "date_pub": "6/22/2010",
        "title": "Common genetic variation and susceptibility to partial epilepsies: a genome-wide association study.",
        "location_within_paper": "FullScan",
        "paper_phenotype_categories": "Neuro;Epilepsy",
        "journal": "Brain"
      }
    ],
    "last_curation_date": "8/17/12",
    "replication": [
      {
        "total_samples": 76558,
        "european": 76558
      },
      {
        "total_samples": 133661,
        "european": 133661
      }
    ],
    "discovery": [
      {
        "total_samples": 46186,
        "european": 46186
      },
      {
        "total_samples": 10380,
        "european": 10380
      }
    ],
    "hupfield": "Jan2014",
    "includes_male_female_only_analyses": "n",
    "in_gene": "(ENTPD6)",
    "replication_sample_description": [
      "up to 76558 EA individuals",
      "NR",
      "133661 EA individuals",
      "133661 EA individuals"
    ]
  },
  "start": 51447065
}
```


The richness of the data derives from the MyVariant.info services and the high availability of the datasets they host. Because those databases are updated at least monthly, they can be expected to deliver accurate and relevant data for specific variants.
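
Because each variant is stored as a nested document, reading a specific annotation means walking through several levels of keys. The sketch below shows one way to do that with a dotted path. The helper `get_nested` is hypothetical, the dictionary is a trimmed copy of the sample document above, and `cadd.phred` is used only as an example of a key that is absent.

```python
def get_nested(document, dotted_path, default=None):
    """Follow a dotted path such as 'cytoband.Name' through nested dictionaries."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# A trimmed copy of the sample document shown above
variant = {
    "hgvs_id": "chr20:g.25194768A>G",
    "cytoband": {"Name": "19q13.41", "Band": "q"},
    "grasp": {"publication": [{"pmid": 20081858, "journal": "Nat Genet"},
                              {"pmid": 20522523, "journal": "Brain"}]},
}

band_name = get_nested(variant, "cytoband.Name")
journals = [pub["journal"] for pub in get_nested(variant, "grasp.publication", [])]
missing = get_nested(variant, "cadd.phred", default="not available")
print(band_name, journals, missing)
```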

<a id = "parse"></a>
### Prepare to parse data to MongoDB

Here, we finally parse the data into MongoDB. This assumes that all the installation steps have been completed and that MongoDB is running (the command `sudo chkconfig mongod on` should return [ OK ]). Check [this](https://docs.mongodb.com/manual/tutorial/install-mongodb-on-amazon/) for more information on how to perform and check the installation on an AWS instance.
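
One lightweight way to confirm that something is listening on MongoDB's default port (27017) before parsing is sketched below, using only the standard library. The helper `is_port_open` is hypothetical, and an open port does not prove that the listener is MongoDB; it is only a quick sanity check.

```python
import socket


def is_port_open(host="localhost", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused connection, timeout, or unresolved host
        return False


# 27017 is MongoDB's default port
print("MongoDB reachable:", is_port_open())
```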

#### File names and variables

The required file names and variables are the following:


- Annotated CSV file (assuming it has already been annotated with ANNOVAR; if not, refer to the documentation for instructions on how to perform the annotation).
- VCF file (the original file containing the variants).
- MongoDB collection and database names. A MongoDB instance must be running when the main VAPr function is called, but the parsing is done automatically by the underlying `pymongo` library.


In [1]:
from VAPr import parser_models, annovar_suprocess
import importlib
importlib.reload(parser_models)
csv_file = '/Volumes/Carlo_HD1/CCBB/VAPr_files/csv_files/Sample_MCCT-3jks_raw_variants_annotated.hg19_multianno.txt'
vcf_file = '/Volumes/Carlo_HD1/CCBB/VAPr_files/vcf_files/not_annotated/Sample_MCCT-3jks_raw_variants.vcf'
collection_name = 'My_Variant_Collection_File_One'
db_name = 'My_Variant_Database'

pars = parser_models.VariantParsing(vcf_file, collection_name, db_name, annotated_file=csv_file)



In [2]:
# every line is counted, so header lines are included in the total
with open(vcf_file) as vcf_handle:
    print("Number of variants in vcf file: %i" % sum(1 for line in vcf_handle))

Number of variants in vcf file: 232680


In [4]:
import vcf
import itertools
import myvariant

def complete_chromosome(expanded_list):
    for i in range(0, len(expanded_list)):
        if 'M' in expanded_list[i]:
            one = expanded_list[i].split(':')[0]
            two = expanded_list[i].split(':')[1]
            if 'MT' not in one:
                one = 'chrMT'
            expanded_list[i] = one + ':' + two
    return expanded_list

def get_variants_from_vcf(vcf_file):
    """
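    Retrieves variant names from a LARGE vcf file.
    Only the first 20000 records are read; a multi-allelic record yields one ID per alternate allele.
    :param vcf_file: path to the VCF file
    :return: a list of variants formatted according to HGVS standards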
    """
    list_ids = []
    reader = vcf.Reader(open(vcf_file, 'r'))

    for record in itertools.islice(reader, 0, 20000):
        if len(record.ALT) > 1:
            for alt in record.ALT:
                list_ids.append(myvariant.format_hgvs(record.CHROM, record.POS,
                                                      record.REF, str(alt)))
        else:
            list_ids.append(myvariant.format_hgvs(record.CHROM, record.POS,
                                                  record.REF, str(record.ALT[0])))

    return complete_chromosome(list_ids)

list_hgvs_ids = get_variants_from_vcf(vcf_file)
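
To make the ID format concrete, the sketch below builds HGVS-style genomic IDs from VCF fields for the simplest cases (SNVs, plain deletions, and plain insertions). It is a simplified illustration rather than the implementation behind `myvariant.format_hgvs`. The function name `simple_hgvs` is hypothetical, and the deletion and insertion records are made-up inputs.

```python
def simple_hgvs(chrom, pos, ref, alt):
    """Format a VCF record as an HGVS-style genomic ID (SNVs, simple indels only)."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    pos = int(pos)
    if len(ref) == 1 and len(alt) == 1:
        # single nucleotide variant
        return "chr{0}:g.{1}{2}>{3}".format(chrom, pos, ref, alt)
    if len(ref) > 1 and len(alt) == 1 and ref[0] == alt:
        # deletion: VCF keeps the base before the deleted bases as an anchor
        start, end = pos + 1, pos + len(ref) - 1
        if start == end:
            return "chr{0}:g.{1}del".format(chrom, start)
        return "chr{0}:g.{1}_{2}del".format(chrom, start, end)
    if len(ref) == 1 and len(alt) > 1 and alt[0] == ref:
        # insertion between the anchor base and the next position
        return "chr{0}:g.{1}_{2}ins{3}".format(chrom, pos, pos + 1, alt[1:])
    raise ValueError("variant type not covered by this sketch")


snv_id = simple_hgvs("MT", 146, "T", "C")
deletion_id = simple_hgvs("chr1", 240255568, "GGGC", "G")
insertion_id = simple_hgvs("1", 100, "C", "CAG")
print(snv_id, deletion_id, insertion_id)
```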

In [8]:
mv = myvariant.MyVariantInfo()
len(list_hgvs_ids)

20004

In [4]:
list_hgvs_ids = pars.hgvs.get_variants_from_vcf(0)
myvariants_variants = pars.get_dict_myvariant(list_hgvs_ids)

import myvariant
mv = myvariant.MyVariantInfo()
# This will retrieve a list of dictionaries
#variant_data = mv.getvariants(variant_list, as_dataframe=False)
#variant_data = self.remove_id_key(variant_data)

querying 1-952...done.


In [5]:
list_hgvs_ids[0]

'chrMT:g.146T>C'

In [12]:
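import grequests
import requests

_url = mv.url + '/variant/'
_kwargs = {'ids': list_hgvs_ids[0:5]}


def concurrent_post(url, params):
    """POST params to url through grequests; return the parsed JSON or the raw response."""
    return_raw = params.pop('return_raw', False)
    headers = {'content-type': 'application/x-www-form-urlencoded',
               'user-agent': "Python-requests_myvariant.py/%s (gzip)" % requests.__version__}
    # grequests.post only builds an unsent request; grequests.map actually sends it
    res = grequests.map([grequests.post(url, data=params, headers=headers)])[0]
    if res is None:
        # grequests.map returns None for a request that failed
        raise requests.exceptions.ConnectionError("No response received from %s" % url)
    if mv.raise_for_status:
        # raise requests.exceptions.HTTPError if not 200
        res.raise_for_status()
    if return_raw:
        return res
    else:
        return res.json()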

In [40]:
param = {'ids': list_hgvs_ids[0]}
return_raw = param.pop('return_raw', False)
headers = {'content-type': 'application/x-www-form-urlencoded',
           'user-agent': "Python-requests_myvariant.py/%s (gzip)" % requests.__version__}
res = grequests.post(_url, data=param, headers=headers)

In [35]:
No_requests = 20
params = [{'ids': list_hgvs_ids[i]} for i in range(No_requests)]

In [79]:
%%time
anno = mv.getvariants(list_hgvs_ids[0:5000], as_dataframe=False)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
CPU times: user 1.34 s, sys: 127 ms, total: 1.46 s
Wall time: 30.1 s
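
The `querying 1-1000...done.` lines above show the IDs being sent in batches of 1000. A minimal sketch of that batching pattern is given below. The helper `iter_batches` and the made-up IDs are illustrative only and are not the client's internal code.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


ids = ["chr1:g.%dA>G" % position for position in range(1, 5001)]  # made-up IDs
batch_sizes = []
for batch in iter_batches(ids, 1000):
    # a real run would send `batch` to the annotation service here
    batch_sizes.append(len(batch))
print(batch_sizes)
```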


In [92]:
ls[0][0] == anno[0]

True

In [86]:
anno[0].keys()

dict_keys(['hg19', 'vcf', 'cadd', 'chrom', 'query', '_id', 'dbsnp', 'snpeff'])

In [105]:
ls =[]

In [106]:
for i,r in enumerate(resps):
    if r is None:
        ls.append({'notfound' : True, 'query': params[i]})
    else:
        ls.append(r.json()[0])

In [113]:
ls[979]

{'notfound': True, 'query': {'ids': 'chr1:g.6196493A>G'}}

In [110]:
anno[979]

{'_id': 'chr1:g.6196493A>G',
 'cadd': {'1000g': {'af': 0.51,
   'afr': 0.36,
   'amr': 0.47,
   'asn': 0.81,
   'eur': 0.39},
  '_license': 'http://goo.gl/bkpNhq',
  'alt': 'G',
  'anc': 'A',
  'annotype': 'Transcript',
  'bstatistic': 754,
  'chmm': {'bivflnk': 0.0,
   'enh': 0.0,
   'enhbiv': 0.0,
   'het': 0.0,
   'quies': 0.488,
   'reprpc': 0.102,
   'reprpcwk': 0.339,
   'tssa': 0.0,
   'tssaflnk': 0.0,
   'tssbiv': 0.0,
   'tx': 0.008,
   'txflnk': 0.0,
   'txwk': 0.063,
   'znfrpts': 0.0},
  'chrom': 1,
  'consdetail': 'intron',
  'consequence': 'INTRONIC',
  'consscore': 2,
  'cpg': 0.05,
  'dna': {'helt': 0.0, 'mgw': 0.26, 'prot': 3.07, 'roll': 1.23},
  'encode': {'h3k27ac': 4.0,
   'h3k4me1': 2.0,
   'h3k4me3': 3.0,
   'nucleo': 1.6,
   'occ': 2,
   'p_val': {'comb': 0.95,
    'ctcf': 0.0,
    'dnas': 1.63,
    'faire': 0.0,
    'mycp': 0.0,
    'polii': 0.0},
   'sig': {'ctcf': 0.03,
    'dnase': 0.04,
    'faire': 0.0,
    'myc': 0.0,
    'polii': 0.0}},
  'fitcons': 0.053

In [None]:
return_raw = True

query_fn = lambda vids: self._getvariants_inner(vids, **kwargs)
out = []

for hits in self._repeated_query(query_fn, vids, verbose=verbose):
    if return_raw:
        out.append(hits)   # hits is the raw response text
print(out)

def _repeated_query(self, query_fn, query_li, verbose=True, **fn_kwargs):
    '''run query_fn for input query_li in a batch (self.step).
       return a generator of query_result in each batch.
       input query_li can be a list/tuple/iterable
    '''
    step = min(self.step, self.max_query)
    i = 0
    for batch, cnt in iter_n(query_li, step, with_cnt=True):
        if verbose:
            print("querying {0}-{1}...".format(i+1, cnt), end="")
        i = cnt
        query_result = query_fn(batch, **fn_kwargs)
        yield query_result
        if verbose:
            print("done.")

In [34]:
%%time
import grequests
import requests
No_requests = 10

_url = mv.url + '/variant/'
_urls = [_url] * No_requests

params = [{'ids': list_hgvs_ids[i]} for i in range(No_requests)]

#return_raw = params.pop('return_raw', False)

headers = {'content-type': 'application/x-www-form-urlencoded',
           'user-agent': "Python-requests_myvariant.py/%s (gzip)" % requests.__version__}

res = (grequests.post(_url, data=params[i], headers=headers) for i, _ in enumerate(params))
                      

resps = grequests.map(res)

ls=[]

for i,r in enumerate(resps):
    if r is None:
        ls.append({'notfound' : True, 'query': params[i]})
    else:
        ls.append(r.json())

CPU times: user 35.8 ms, sys: 6.94 ms, total: 42.7 ms
Wall time: 164 ms
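
The same fan-out pattern (send many small requests concurrently, then mark the ones that returned nothing) can be sketched with the standard library alone. In the sketch below, `fake_fetch` is a stand-in for the HTTP call and `fetch_all` is a hypothetical helper. Note that a missing response means the request failed, which is not the same as the variant being absent from the service.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(variant_id):
    """Stand-in for an HTTP request; returns None for IDs it does not know."""
    known = {"chrMT:g.146T>C": {"_id": "chrMT:g.146T>C", "chrom": "MT"}}
    return known.get(variant_id)


def fetch_all(variant_ids, fetch, max_workers=4):
    """Run `fetch` concurrently and keep results in the order of `variant_ids`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(fetch, variant_ids))
    results = []
    for variant_id, response in zip(variant_ids, responses):
        if response is None:
            results.append({"notfound": True, "query": variant_id})
        else:
            results.append(response)
    return results


results = fetch_all(["chrMT:g.146T>C", "chr1:g.6196493A>G"], fake_fetch)
print(results)
```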


In [44]:
ls[7]

[{'_id': 'chr1:g.723891G>C',
  'cadd': {'1000g': {'af': 0.79,
    'afr': 0.5,
    'amr': 0.86,
    'asn': 0.78,
    'eur': 0.95},
   '_license': 'http://goo.gl/bkpNhq',
   'alt': 'C',
   'annotype': 'Intergenic',
   'bstatistic': 982,
   'chmm': {'bivflnk': 0.0,
    'enh': 0.0,
    'enhbiv': 0.0,
    'het': 0.512,
    'quies': 0.142,
    'reprpc': 0.0,
    'reprpcwk': 0.0,
    'tssa': 0.0,
    'tssaflnk': 0.0,
    'tssbiv': 0.0,
    'tx': 0.0,
    'txflnk': 0.0,
    'txwk': 0.323,
    'znfrpts': 0.024},
   'chrom': 1,
   'consdetail': 'downstream',
   'consequence': 'DOWNSTREAM',
   'consscore': 1,
   'cpg': 0.04,
   'dna': {'helt': -0.86, 'mgw': 0.47, 'prot': -0.63, 'roll': 5.62},
   'encode': {'exp': 4.89,
    'h3k27ac': 5.12,
    'h3k4me1': 4.0,
    'h3k4me3': 3.08,
    'nucleo': 2.0,
    'occ': 3,
    'p_val': {'comb': 0.54,
     'ctcf': 0.0,
     'dnas': 0.0,
     'faire': 1.09,
     'mycp': 0.0,
     'polii': 0.0},
    'sig': {'ctcf': 0.0,
     'dnase': 0.0,
     'faire': 0.01,
 

In [32]:
resps = grequests.map([res])
for r in resps:
    print (r.json())

[{'hg19': {'start': 146, 'end': 146}, 'vcf': {'alt': 'C', 'position': '146', 'ref': 'T'}, 'snpeff': {'ann': {'effect': 'intergenic_region', 'putative_impact': 'MODIFIER'}}, 'chrom': 'MT', 'wellderly': {'genotypes': [{'count': 1, 'freq': 0.005, 'genotype': 'T/C'}, {'count': 12, 'freq': 0.06, 'genotype': 'C/C'}, {'count': 187, 'freq': 0.935, 'genotype': 'T/T'}], 'hg19': {'start': 146, 'end': 146}, 'chrom': 'MT', 'vartype': 'snp', 'alt': 'C', 'ref': 'T', 'alleles': [{'allele': 'C', 'freq': 0.0625}, {'allele': 'T', 'freq': 0.9375}], 'pos': 146}, 'query': 'chrMT:g.146T>C', '_id': 'chrMT:g.146T>C', 'dbsnp': {'var_subtype': 'ts', 'alt': 'C', 'allele_origin': 'unspecified', 'alleles': [{'allele': 'T'}, {'allele': 'C'}], 'gene': {'geneid': '4549', 'symbol': 'RNR1'}, 'hg19': {'start': 146, 'end': 146}, 'chrom': 'MT', 'vartype': 'snp', 'class': 'SNV', 'ref': 'T', 'flags': ['ASP', 'R5'], 'dbsnp_build': 138, 'rsid': 'rs370482130', 'validated': False}}]


In [14]:
list_hgvs_ids[0:5]

['chrMT:g.146T>C',
 'chrMT:g.150T>C',
 'chrMT:g.195C>T',
 'chrMT:g.410A>T',
 'chrMT:g.516_517del']

In [89]:
mv._getvariants_inner('chrMT:g.146T>C')
mv._post()

TypeError: _post() missing 2 required positional arguments: 'url' and 'params'

In [82]:
myvariants_variants[0]

{'chrom': 'MT',
 'dbsnp': {'allele_origin': 'unspecified',
  'alleles': [{'allele': 'T'}, {'allele': 'C'}],
  'alt': 'C',
  'chrom': 'MT',
  'class': 'SNV',
  'dbsnp_build': 138,
  'flags': ['ASP', 'R5'],
  'gene': {'geneid': '4549', 'symbol': 'RNR1'},
  'hg19': {'end': 146, 'start': 146},
  'ref': 'T',
  'rsid': 'rs370482130',
  'validated': False,
  'var_subtype': 'ts',
  'vartype': 'snp'},
 'hg19': {'end': 146, 'start': 146},
 'hgvs_id': 'chrMT:g.146T>C',
 'snpeff': {'ann': {'effect': 'intergenic_region',
   'putative_impact': 'MODIFIER'}},
 'vcf': {'alt': 'C', 'position': '146', 'ref': 'T'},
 'wellderly': {'alleles': [{'allele': 'C', 'freq': 0.0625},
   {'allele': 'T', 'freq': 0.9375}],
  'alt': 'C',
  'chrom': 'MT',
  'genotypes': [{'count': 1, 'freq': 0.005, 'genotype': 'T/C'},
   {'count': 12, 'freq': 0.06, 'genotype': 'C/C'},
   {'count': 187, 'freq': 0.935, 'genotype': 'T/T'}],
  'hg19': {'end': 146, 'start': 146},
  'pos': 146,
  'ref': 'T',
  'vartype': 'snp'}}

In [56]:
%%time 
pars.push_to_db(buffer=True) #chunksize = 900, buffer_len = 13000

querying 1-902...done.
querying 1-909...done.
querying 1-902...done.
querying 1-905...done.
querying 1-903...done.
querying 1-904...done.
querying 1-903...done.
querying 1-906...done.
querying 1-906...done.
querying 1-909...done.
querying 1-908...done.
querying 1-905...done.
querying 1-903...done.
querying 1-905...done.
querying 1-905...done.
Parsing Buffer...
querying 1-904...done.
querying 1-905...done.
querying 1-905...done.
querying 1-904...done.
querying 1-900...done.
querying 1-906...done.
querying 1-912...done.
querying 1-901...done.
querying 1-906...done.
querying 1-907...done.
querying 1-905...done.
querying 1-903...done.
querying 1-908...done.
querying 1-908...done.
querying 1-403...done.
Parsing Buffer...
CPU times: user 1min 6s, sys: 958 ms, total: 1min 7s
Wall time: 2min 53s


'Done'

In [79]:
%%time 
pars.push_to_db(buffer=True) #chunksize = 950, buffer_len = 50 000

querying 1-952...done.
querying 1-960...done.
querying 1-952...done.
querying 1-957...done.
querying 1-950...done.
querying 1-954...done.
querying 1-955...done.
querying 1-957...done.
querying 1-956...done.
querying 1-961...done.
querying 1-957...done.
querying 1-953...done.
querying 1-955...done.
querying 1-956...done.
querying 1-952...done.
querying 1-957...done.
querying 1-955...done.
querying 1-954...done.
querying 1-950...done.
querying 1-956...done.
querying 1-962...done.
querying 1-953...done.
querying 1-956...done.
querying 1-956...done.
querying 1-955...done.
querying 1-953...done.
querying 1-962...done.
querying 1-856...done.
Parsing Buffer...
CPU times: user 1min 1s, sys: 952 ms, total: 1min 2s
Wall time: 2min 50s


'Done2'

In [41]:
%%time 
pars.push_to_db()

querying 1-602...done.
602 {'dbsnp': {'gene': {'symbol': 'RNR1', 'geneid': '4549'}, 'allele_origin': 'unspecified', 'var_subtype': 'ts', 'alleles': [{'allele': 'T'}, {'allele': 'C'}], 'flags': ['ASP', 'R5'], 'vartype': 'snp', 'validated': False, 'alt': 'C', 'dbsnp_build': 138, 'rsid': 'rs370482130', 'chrom': 'MT', 'class': 'SNV', 'hg19': {'end': 146, 'start': 146}, 'ref': 'T'}, 'chr': 'chrMT', 'alt': 'C', 'end': 146, 'func_knowngene': 'upstream;downstream', 'start': 146, 'wellderly': {'alt': 'C', 'alleles': [{'freq': 0.0625, 'allele': 'C'}, {'freq': 0.9375, 'allele': 'T'}], 'chrom': 'MT', 'vartype': 'snp', 'hg19': {'end': 146, 'start': 146}, 'pos': 146, 'ref': 'T', 'genotypes': [{'freq': 0.005, 'count': 1, 'genotype': 'T/C'}, {'freq': 0.06, 'count': 12, 'genotype': 'C/C'}, {'freq': 0.935, 'count': 187, 'genotype': 'T/T'}]}, 'snpeff': {'ann': {'effect': 'intergenic_region', 'putative_impact': 'MODIFIER'}}, 'vcf': {'alt': 'C', 'ref': 'T', 'position': '146'}, 'gene_knowngene': 'JB137816;D

'Done'

~3 minutes for 25k variants
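
The logs above are consistent with a buffering scheme in which annotated chunks accumulate until roughly `buffer_len` variants are waiting and are then parsed together. The sketch below illustrates that idea under this assumption. The function `buffered_flush` is hypothetical and is not VAPr's actual implementation.

```python
def buffered_flush(chunks, buffer_len, flush):
    """Collect chunks and call `flush` once at least `buffer_len` items are waiting."""
    buffer = []
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= buffer_len:
            flush(buffer)
            buffer = []
    if buffer:
        flush(buffer)  # whatever is left after the last chunk


flushed_sizes = []
chunks = [list(range(900)) for _ in range(30)]  # 30 made-up chunks of 900 items
buffered_flush(chunks, buffer_len=13000,
               flush=lambda items: flushed_sizes.append(len(items)))
print(flushed_sizes)
```

A larger buffer means fewer, bigger parsing passes at the cost of holding more documents in memory.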

## Implement a filter

 - filter 1: ThousandGenomeAll < 0.05 or info not available
 - filter 2: ESP6500siv2_all < 0.05 or info not available
 - filter 3: cosmic70 information is present
 - filter 4: Func_knownGene is exonic, splicing, or both
 - filter 5: ExonicFunc_knownGene is not "synonymous SNV"
 - filter 6: Read Depth (DP) > 10
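
The six criteria above can be read as a single predicate over one variant document. The sketch below expresses them in plain Python, assuming the lower-case field names seen in the documents in this notebook. The field `read_depth` is a hypothetical name for DP, and the function is an illustration, not the code behind VAPr's own filter.

```python
def passes_rare_cancer_filter(doc, max_frequency=0.05, min_depth=10):
    """Apply the six criteria listed above to one variant document (a dict)."""
    def is_rare(key):
        # a missing frequency counts as "info not available" and passes
        return key not in doc or float(doc[key]) < max_frequency

    functions = set(str(doc.get("func_knowngene", "")).split(";"))
    return (
        is_rare("1000g2015aug_all")
        and is_rare("esp6500siv2_all")
        and "cosmic70" in doc
        and bool(functions & {"exonic", "splicing"})
        and doc.get("exonicfunc_knowngene") != "synonymous SNV"
        and doc.get("read_depth", 0) > min_depth  # "read_depth" is a hypothetical field name
    )


candidate = {
    "1000g2015aug_all": 0.00159744,
    "cosmic70": "ID=COSM3836683;OCCURENCE=1(breast)",
    "func_knowngene": "exonic",
    "exonicfunc_knowngene": "nonsynonymous SNV",
    "read_depth": 17,
}
common = dict(candidate, esp6500siv2_all="0.71")  # frequency stored as text

candidate_passes = passes_rare_cancer_filter(candidate)
common_passes = passes_rare_cancer_filter(common)
print(candidate_passes, common_passes)
```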

In [31]:
#names and paths 
rare_cancer_variants_csv = ".../out_files/filtered_csv.csv"
rare_cancer_variants_vcf =  ".../filtered_vcf.vcf"
#input_vcf_compressed =  '/test_vcf/Tumor_RNAseq_variants.vcf.gz

In [35]:
#Apply filter.
from VAPr import MongoDB_querying, file_writer

filter_collection = MongoDB_querying.Filters(db_name, collection_name)
rare_cancer_variants = filter_collection.rare_cancer_variant()

Variants found that match rarity criteria: 1


### Inspect entry

In [59]:
rare_cancer_variants[0]

{'1000g2015aug_all': 0.00159744,
 '_id': ObjectId('589e35893c5b990f8441ed78'),
 'alt': 'G',
 'cadd': {'1000g': {'af': 0.5, 'afr': 0.5, 'amr': 0.5, 'asn': 0.5, 'eur': 0.5},
  '_license': 'http://goo.gl/bkpNhq',
  'alt': 'G',
  'anc': 'A',
  'annotype': 'CodingTranscript',
  'bstatistic': 872,
  'chmm': {'bivflnk': 0.0,
   'enh': 0.0,
   'enhbiv': 0.008,
   'het': 0.016,
   'quies': 0.504,
   'reprpc': 0.087,
   'reprpcwk': 0.354,
   'tssa': 0.0,
   'tssaflnk': 0.0,
   'tssbiv': 0.008,
   'tx': 0.0,
   'txflnk': 0.008,
   'txwk': 0.008,
   'znfrpts': 0.0},
  'chrom': 7,
  'consdetail': 'missense',
  'consequence': 'NON_SYNONYMOUS',
  'consscore': 7,
  'cpg': 0.03,
  'dna': {'helt': 0.03, 'mgw': 0.4, 'prot': 3.05, 'roll': 3.04},
  'encode': {'exp': 40.15,
   'h3k27ac': 7.04,
   'h3k4me1': 3.0,
   'h3k4me3': 3.0,
   'nucleo': 3.7},
  'exon': '4/5',
  'fitcons': 0.527649,
  'gc': 0.55,
  'gene': {'ccds_id': 'CCDS5872.1',
   'cds': {'cdna_pos': 514,
    'cds_pos': 508,
    'rel_cdna_pos': 0.

### Other filters
The filter may have been too 'aggressive', or there may simply not be enough variants that match the criteria. In either case, a filter can be implemented manually to obtain more results. The query syntax is the one implemented by the [pymongo](http://api.mongodb.com/python/current/tutorial.html) library, developed by MongoDB.

In [54]:
from pymongo import MongoClient

client = MongoClient()
db = client[db_name]
collection = db[collection_name]


filtered = collection.find({"$and": [
                                   {"$or": [{"esp6500siv2_all": {"$lt": 0.1}}, {"esp6500siv2_all": {"$exists": False}}]},
                                   {"$or": [{"func_knowngene": "exonic"}, {"func_knowngene": "splicing"}]},
                                   {"genotype.filter_passing_reads_count": {"$gte": 1}},
                                   {"cosmic70": {"$exists": True}},
                                   {"1000g2015aug_all": {"$exists": True}}
                         ]})

as_list = list(filtered)
len(as_list)

9

### Or a dataframe for easy manipulation

In [56]:
df = pd.DataFrame(as_list)
df.columns

Index(['1000g2015aug_all', '_id', 'alt', 'cadd', 'chr', 'chrom', 'cosmic',
       'cosmic70', 'cytoband', 'dbnsfp', 'dbsnp', 'end', 'exac',
       'exac_nontcga', 'exonicfunc_knowngene', 'func_knowngene',
       'gene_knowngene', 'geno2mp', 'genomicsuperdups', 'genotype', 'hg19',
       'hgvs_id', 'mutdb', 'nci60', 'otherinfo', 'ref', 'snpeff', 'start',
       'tfbsconssites', 'vcf', 'wellderly'],
      dtype='object')

In [57]:
df.head(4)

Unnamed: 0,1000g2015aug_all,_id,alt,cadd,chr,chrom,cosmic,cosmic70,cytoband,dbnsfp,...,hgvs_id,mutdb,nci60,otherinfo,ref,snpeff,start,tfbsconssites,vcf,wellderly
0,0.39397,589e35563c5b990f8441c699,A,"{'cpg': 0.24, 'consdetail': ['synonymous', 're...",chr1,1,"{'alt': 'T', 'chrom': '1', 'ref': 'C', 'mut_fr...","ID=COSM1320072;OCCURENCE=2(thyroid),1(ovary)","{'Band': 'p', 'Chromosome': 1, 'Sub_Band': '2'...",,...,chr1:g.120612006G>A,,0.22,"[GT:AD:DP:GQ:PL, 0/1:9,8:17:99:240,0,335]",G,"{'ann': [{'feature_type': 'transcript', 'rank'...",120612006,,"{'alt': 'A', 'position': '120612006', 'ref': 'G'}","{'adviser_score': '5~NOTCH2~Common, Predicted ..."
1,0.289137,589e35593c5b990f8441c96e,AGACCATGGCCCCGCCCAGTCCCT,,chr1,1,,ID=COSM1745478;OCCURENCE=4(urinary_tract),"{'Band': 'q', 'Chromosome': 1, 'Sub_Band': '1'...",,...,chr1:g.203186950_203186951insAGACCATGGCCCCGCCC...,,,"[GT:AD:DP:GQ:PL, 1/1:0,3:3:9:135,9,0]",-,"{'lof': {'gene_id': 'CHIT1', 'genename': 'CHIT...",203186950,,"{'alt': 'CAGACCATGGCCCCGCCCAGTCCCT', 'position...","{'adviser_score': '5~CHIT1~Common, Predicted N..."
2,0.706669,589e35593c5b990f8441ca33,-,,chr1,1,,"ID=COSM244564;OCCURENCE=1(NS),2(pancreas),2(la...","{'Band': 'q', 'Chromosome': 1, 'Name': '1q43',...",,...,chr1:g.240255569_240255571del,,,"[GT:AD:DP:GQ:PL, 1/1:0,2:2:6:80,6,0]",GGC,"{'ann': {'feature_type': 'transcript', 'rank':...",240255569,,"{'alt': 'G', 'position': '240255568', 'ref': '...","{'alt': 'G', 'chrom': '1', 'gene': 'FMN2', 're..."
3,0.062899,589e355e3c5b990f8441ce6a,G,"{'cpg': 0.05, 'consdetail': 'intron', 'gene': ...",chr2,2,,ID=COSM3836683;OCCURENCE=1(breast),"{'Band': 'q', 'Chromosome': 2, 'Sub_Band': '2'...",,...,chr2:g.120015164A>G,,,"[GT:AD:DP:GQ:PL, 0/1:1,4:5:30:156,0,30]",A,"{'ann': [{'feature_id': 'NM_182915.2', 'gene_i...",120015164,,"{'alt': 'G', 'position': '120015164', 'ref': 'A'}","{'alt': 'G', 'chrom': '2', 'gene': 'STEAP3', '..."


<a id = "export"></a>
## Export
Files can also be exported to CSV, as well as to VCF, while maintaining the annotations.

In [61]:
#Create writer object for filtered lists:
my_writer = file_writer.FileWriter(db_name, collection_name)

rare_cancer_variants_csv = '/data/out_files/rare_vars.csv'

#cancer variants filtered files
my_writer.generate_annotated_csv(rare_cancer_variants, rare_cancer_variants_csv)

'Finished writing annotated, filtered CSV file'
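
To illustrate what flattening nested documents into CSV involves, the sketch below writes a list of variant dictionaries with the standard `csv` module and stores nested values as JSON text. It is a simplified stand-in, not the implementation of `generate_annotated_csv`, and the helper name `write_variants_csv` is hypothetical.

```python
import csv
import io
import json


def write_variants_csv(variants, handle):
    """Write variant documents as CSV rows; nested values are stored as JSON text."""
    columns = sorted({key for variant in variants for key in variant})
    writer = csv.DictWriter(handle, fieldnames=columns, restval="")
    writer.writeheader()
    for variant in variants:
        row = {}
        for key, value in variant.items():
            if isinstance(value, (dict, list)):
                # default=str covers values JSON cannot encode, such as ObjectId
                row[key] = json.dumps(value, default=str)
            else:
                row[key] = value
        writer.writerow(row)


variants = [
    {"hgvs_id": "chr2:g.120015164A>G", "alt": "G", "vcf": {"alt": "G", "ref": "A"}},
    {"hgvs_id": "chr1:g.120612006G>A", "alt": "A", "nci60": 0.22},
]
buffer = io.StringIO()
write_variants_csv(variants, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```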

## Export Filtered (VCF) Files
This is possible as well, although a bit trickier. Re-creating a VCF file from a list of variants requires an index of the original file, from which the filtered version is built. The index can be created with the tabix tool; see http://genometoolbox.blogspot.com/2013/11/installing-tabix-on-unix.html for installation instructions.

*NOTE*: The annotations from MyVariant.info will be added to the INFO field

**Step 1**: Download and build Tabix:

`tar xvjf tabix-0.2.6.tar.bz2`  
`cd tabix-0.2.6`    
`make`    
`export PATH=$PATH:/path_to_tabix/tabix-0.2.6`   

**Step 2**: create a .vcf.gz file from the original vcf:

`bgzip -c file.vcf > file.vcf.gz`


**Step 3**: run tabix on the input vcf file:

`tabix -p vcf file.vcf.gz`
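
Steps 2 and 3 can also be driven from Python. The sketch below wraps the two commands with `subprocess`, assuming `bgzip` and `tabix` are already on the `PATH`. The helper `bgzip_and_index` is hypothetical, and it is only defined here, not called.

```python
import shutil
import subprocess


def bgzip_and_index(vcf_path):
    """Compress `vcf_path` with bgzip and index the result with tabix."""
    compressed_path = vcf_path + ".gz"
    with open(compressed_path, "wb") as compressed:
        # equivalent to: bgzip -c file.vcf > file.vcf.gz
        subprocess.run(["bgzip", "-c", vcf_path], stdout=compressed, check=True)
    # equivalent to: tabix -p vcf file.vcf.gz
    subprocess.run(["tabix", "-p", "vcf", compressed_path], check=True)
    return compressed_path


# Check that both tools can be found before calling bgzip_and_index
tools_available = all(shutil.which(tool) for tool in ("bgzip", "tabix"))
print("bgzip and tabix on PATH:", tools_available)
```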

In [None]:
### Specify name of files and path 

rare_cancer_variants_vcf = "/data/out_files/rare_variants.vcf"
input_vcf_compressed = "/data/vcf_files/test_file_one.vcf.gz"

# `filtered` is a cursor that was already consumed when `as_list` was built, so reuse that list
filtered_variant_list = as_list

#Create writer object for filtered lists:
my_writer = file_writer.FileWriter(db_name, collection_name)
my_writer.generate_annotated_vcf(filtered_variant_list, input_vcf_compressed, rare_cancer_variants_vcf)