## MyVariant.info and MyGene.info Use Case

The following R script demonstrates how the MyVariant.info and MyGene.info R clients can be used to annotate variants and prioritize candidate genes in patients with rare Mendelian diseases. This specific study uses data obtained from the database of Genotypes and Phenotypes (dbGaP) study...FASTQ files generated by Ng et al. for the Miller syndrome study. FASTQ files were processed according to the Broad Institute’s best practices. Individual samples were aligned to the hg19 reference genome using BWA-MEM 0.7.10. Variants were called using GATK 3.3-0 HaplotypeCaller, and quality scores were recalibrated using GATK VariantRecalibrator.

In [1]:
library(myvariant)
library(mygene)
library(VariantAnnotation)
library(GO.db)
source("https://raw.githubusercontent.com/Network-of-BioThings/myvariant.info/master/docs/ipynb/mendelian.R")
setwd("~/sulab/myvariant/vcf/recal")
vcf.files <- paste(getwd(), list.files(getwd()), sep="/")

Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘AnnotationDbi’

The following object is masked from ‘package:dplyr’:

    select


Attaching package: ‘mygene’

The following object is masked from ‘package:dplyr’:

    query

Loading required package: DBI



mendelian.R defines some helper functions that are used in the analysis downstream of retrieving annotations:

`replaceWith0` - replaces all NAs in a data.frame with 0.

`rankByCaddScore` - for prioritizing genes by deleteriousness (scaled CADD score).
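
To make the first helper concrete, here is a minimal sketch of an NA-to-zero replacement in base R. It is not the `mendelian.R` source: the function name `replace_na_with_zero` and the toy data are assumptions made for illustration, and the sketch only handles plain numeric columns.

```r
# Hypothetical stand-in for replaceWith0; the real helper is defined in mendelian.R.
replace_na_with_zero <- function(df) {
  # Logical matrix indexing sets every NA cell in the data.frame to 0
  df[is.na(df)] <- 0
  df
}

# Toy annotation table (made-up values, not study data)
toy <- data.frame(exac.af = c(0.2, NA), cadd.phred = c(NA, 26.81))
replace_na_with_zero(toy)
```

A consequence worth keeping in mind for the filtering steps: after this replacement, a variant with no recorded allele frequency looks like a variant with frequency 0.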

---
## Annotating variants with MyVariant.info
The following function reads in each output VCF file using the VariantAnnotation package available from Bioconductor (install with `biocLite("VariantAnnotation")`). `formatHgvs` (from the ***myvariant*** Bioconductor package) reads the genomic location and variant information from the VCF to create HGVS IDs, which serve as the primary key for each variant. The function `getVariants` then queries MyVariant.info with these IDs to retrieve annotations.
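
For readers unfamiliar with HGVS IDs, the sketch below shows how a genomic SNV identifier of the form `chr:g.posREF>ALT` can be assembled from VCF-style columns. The helper `make_hgvs_snv` is hypothetical and is not part of ***myvariant***; in the notebook this job is done by `formatHgvs`. The coordinates and alleles are made up.

```r
# Hypothetical helper for illustration only; formatHgvs does this in the notebook.
make_hgvs_snv <- function(chrom, pos, ref, alt) {
  # as.integer() avoids scientific notation when positions are pasted as text
  paste0(chrom, ":g.", as.integer(pos), ref, ">", alt)
}

# Made-up positions and alleles
make_hgvs_snv(c("chr1", "chr16"), c(35367, 100000), c("G", "C"), c("A", "T"))
# [1] "chr1:g.35367G>A"   "chr16:g.100000C>T"
```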

In [4]:
getVars <- function(vcf.file){
  cat(paste("Processing ", vcf.file, "...\n", sep=" "))
  vcf <- readVcf(vcf.file, genome="hg19")
  vcf <- vcf[isSNV(vcf)]
  vars <- rowRanges(vcf)
  vars <- as(vars, "DataFrame")
  vars$query <- formatHgvs(vcf, "snp")
  annotations <- getVariants(vars$query, fields=c("dbnsfp.genename", "dbnsfp.1000gp1.af", 
                                                  "exac.af", "cadd.consequence", "cadd.phred"))
  annotations[c('DP', 'FS', 'QD')] <- info(vcf)[c('DP', 'FS', 'QD')]
  annotations <- replaceWith0(annotations)
  annotations
}

vars <- lapply(vcf.files, getVars)

Processing  /Users/Adam/sulab/myvariant/vcf/recal/subject01_recalibrate_SNP_vqsr.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, ..., VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 37
Querying chunk 2 of 37
Querying chunk 3 of 37
Querying chunk 4 of 37
Querying chunk 5 of 37
Querying chunk 6 of 37
Querying chunk 7 of 37
Querying chunk 8 of 37
Querying chunk 9 of 37
Querying chunk 10 of 37
Querying chunk 11 of 37
Querying chunk 12 of 37
Querying chunk 13 of 37
Querying chunk 14 of 37
Querying chunk 15 of 37
Querying chunk 16 of 37
Querying chunk 17 of 37
Querying chunk 18 of 37
Querying chunk 19 of 37
Querying chunk 20 of 37
Querying chunk 21 of 37
Querying chunk 22 of 37
Querying chunk 23 of 37
Querying chunk 24 of 37
Querying chunk 25 of 37
Querying chunk 26 of 37
Querying chunk 27 of 37
Querying chunk 28 of 37
Querying chunk 29 of 37
Querying chunk 30 of 37
Querying chunk 31 of 37
Querying chunk 32 of 37
Querying chunk 33 of 37
Querying chunk 34 of 37
Querying chunk 35 of 37
Querying chunk 36 of 37
Querying chunk 37 of 37
Concatenating data, please be patient.


Processing  /Users/Adam/sulab/myvariant/vcf/recal/subject02_recalibrate_SNP_vqsr.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, ..., VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 36
Querying chunk 2 of 36
Querying chunk 3 of 36
Querying chunk 4 of 36
Querying chunk 5 of 36
Querying chunk 6 of 36
Querying chunk 7 of 36
Querying chunk 8 of 36
Querying chunk 9 of 36
Querying chunk 10 of 36
Querying chunk 11 of 36
Querying chunk 12 of 36
Querying chunk 13 of 36
Querying chunk 14 of 36
Querying chunk 15 of 36
Querying chunk 16 of 36
Querying chunk 17 of 36
Querying chunk 18 of 36
Querying chunk 19 of 36
Querying chunk 20 of 36
Querying chunk 21 of 36
Querying chunk 22 of 36
Querying chunk 23 of 36
Querying chunk 24 of 36
Querying chunk 25 of 36
Querying chunk 26 of 36
Querying chunk 27 of 36
Querying chunk 28 of 36
Querying chunk 29 of 36
Querying chunk 30 of 36
Querying chunk 31 of 36
Querying chunk 32 of 36
Querying chunk 33 of 36
Querying chunk 34 of 36
Querying chunk 35 of 36
Querying chunk 36 of 36
Concatenating data, please be patient.


Processing  /Users/Adam/sulab/myvariant/vcf/recal/subject03_recalibrate_SNP_vqsr.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, ..., VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 29
Querying chunk 2 of 29
Querying chunk 3 of 29
Querying chunk 4 of 29
Querying chunk 5 of 29
Querying chunk 6 of 29
Querying chunk 7 of 29
Querying chunk 8 of 29
Querying chunk 9 of 29
Querying chunk 10 of 29
Querying chunk 11 of 29
Querying chunk 12 of 29
Querying chunk 13 of 29
Querying chunk 14 of 29
Querying chunk 15 of 29
Querying chunk 16 of 29
Querying chunk 17 of 29
Querying chunk 18 of 29
Querying chunk 19 of 29
Querying chunk 20 of 29
Querying chunk 21 of 29
Querying chunk 22 of 29
Querying chunk 23 of 29
Querying chunk 24 of 29
Querying chunk 25 of 29
Querying chunk 26 of 29
Querying chunk 27 of 29
Querying chunk 28 of 29
Querying chunk 29 of 29
Concatenating data, please be patient.


Processing  /Users/Adam/sulab/myvariant/vcf/recal/subject04_recalibrate_SNP_vqsr.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, ..., VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 35
Querying chunk 2 of 35
Querying chunk 3 of 35
Querying chunk 4 of 35
Querying chunk 5 of 35
Querying chunk 6 of 35
Querying chunk 7 of 35
Querying chunk 8 of 35
Querying chunk 9 of 35
Querying chunk 10 of 35
Querying chunk 11 of 35
Querying chunk 12 of 35
Querying chunk 13 of 35
Querying chunk 14 of 35
Querying chunk 15 of 35
Querying chunk 16 of 35
Querying chunk 17 of 35
Querying chunk 18 of 35
Querying chunk 19 of 35
Querying chunk 20 of 35
Querying chunk 21 of 35
Querying chunk 22 of 35
Querying chunk 23 of 35
Querying chunk 24 of 35
Querying chunk 25 of 35
Querying chunk 26 of 35
Querying chunk 27 of 35
Querying chunk 28 of 35
Querying chunk 29 of 35
Querying chunk 30 of 35
Querying chunk 31 of 35
Querying chunk 32 of 35
Querying chunk 33 of 35
Querying chunk 34 of 35
Querying chunk 35 of 35
Concatenating data, please be patient.


Before filtering, we count the genes that carry at least one variant in all four patients.

In [5]:
cat(paste(nrow(subset(data.frame(table(unlist(lapply(vars, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "common genes are mutated amongst patients"))

2442 common genes are mutated amongst patients
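
The one-liner above packs several steps together. The sketch below unpacks the same counting logic on toy data: take the unique gene names per patient, tabulate how many patients each gene appears in, and keep the genes present in every patient. The gene labels are placeholders, not study data.

```r
# Toy per-patient annotation tables (placeholder gene labels)
toy.vars <- list(
  data.frame(dbnsfp.genename = c("GENE_A", "GENE_A", "GENE_B"), stringsAsFactors = FALSE),
  data.frame(dbnsfp.genename = c("GENE_A", "GENE_C"), stringsAsFactors = FALSE),
  data.frame(dbnsfp.genename = c("GENE_A", "GENE_B"), stringsAsFactors = FALSE),
  data.frame(dbnsfp.genename = c("GENE_A", "GENE_B", "GENE_C"), stringsAsFactors = FALSE)
)

# One entry per gene per patient, so a gene hit twice in one patient counts once
genes.per.patient <- lapply(toy.vars, function(i) unique(i$dbnsfp.genename))

# table() counts the number of patients in which each gene appears
gene.counts <- data.frame(table(unlist(genes.per.patient)))

# Keep genes seen in every patient (4 here, matching Freq == 4 in the notebook)
shared <- subset(gene.counts, Freq == length(toy.vars))
as.character(shared$Var1)
# [1] "GENE_A"
```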

---
## Filtering Steps
The next steps apply annotation-based filters by subsetting each data.frame to the variants that meet each criterion.

1) The first filter uses annotations output by GATK HaplotypeCaller. Here we keep variants with read depth (DP) greater than 8, Fisher strand bias (FS) less than 30, and quality by depth (QD) greater than 2.

In [6]:
filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))
    
cat(paste(nrow(subset(data.frame(table(unlist(lapply(filter1, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "genes remain after filtering for coverage and strand bias"))

2309 genes remain after filtering for coverage and strand bias

2) Mendelian diseases are most likely to be caused by nonsynonymous mutations. We can take advantage of the CADD annotations, which record the mutation type in the field `cadd.consequence`, to keep nonsynonymous, stop-gain, stop-loss, and splice-site variants.

In [7]:
filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in% c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", 
                                           "CANONICAL_SPLICE", "SPLICE_SITE")))
    
cat(paste(nrow(subset(data.frame(table(unlist(lapply(filter2, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "genes remain after filtering for nonsynonymous and splice site variants"))

1918 genes remain after filtering for nonsynonymous and splice site variants

3) The third filter keeps rare variants, defined as allele frequency < 0.01 in the ExAC data set. Variants that cause rare diseases are expected to be rare or absent in population databases.

In [8]:
filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))
    
cat(paste(nrow(subset(data.frame(table(unlist(lapply(filter3, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "genes remain after filtering for allele frequency < 0.01 in the ExAC dataset"))

19 genes remain after filtering for allele frequency < 0.01 in the ExAC dataset
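
One detail of this filter is easy to miss: `getVars` passes the annotations through `replaceWith0`, so a variant with no ExAC record has `exac.af` set to 0 and is kept as rare. The toy example below (made-up variant labels and frequencies) contrasts this with filtering on the raw values, where `subset` silently drops rows whose condition is `NA`.

```r
# Toy table: v3 has no ExAC allele frequency (made-up data)
toy <- data.frame(
  variant = c("v1", "v2", "v3"),
  exac.af = c(0.25, 0.004, NA),
  stringsAsFactors = FALSE
)

# Filtering the raw values: the NA row is dropped by subset()
raw.rare <- subset(toy, exac.af < 0.01)
raw.rare$variant
# [1] "v2"

# Filtering after NA -> 0 replacement, as in the notebook: the unseen variant passes
zeroed <- toy
zeroed$exac.af[is.na(zeroed$exac.af)] <- 0
rare <- subset(zeroed, exac.af < 0.01)
rare$variant
# [1] "v2" "v3"
```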

4) The last filter keeps rare variants according to the 1000 Genomes Project with allele frequency < 0.01.

In [9]:
filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af, function(j) j < 0.01 )))
    
top.genes <- subset(data.frame(table(unlist(lapply(filter4, function(i) unique(i$dbnsfp.genename))))), Freq == 4)
top.genes <- subset(top.genes, !(Var1 %in% c("NULL", 0)))
cat(paste(nrow(top.genes), "genes remain after filtering for allele frequency < 0.01 in 1000 Genomes Project"))

9 genes remain after filtering for allele frequency < 0.01 in 1000 Genomes Project

In [12]:
top.genes$Var1

---
## Prioritizing genes
We can prioritize the genes that contain variants in each of the patients according to CADD (deleteriousness) score. `rankByCaddScore` extracts the average CADD scores of the variants in each gene per patient and ranks in descending order.

In [13]:
ranked <- rankByCaddScore(top.genes$Var1, filter4)
ranked

In `!=.default`(gene, c("NULL", 0)): longer object length is not a multiple of shorter object length

| | gene | cadd.phred |
|---|---|---|
| 1 | DHODH | 26.81 |
| 2 | CTBP2 | 21.385 |
| 3 | PIK3R3 | 20.7 |
| 4 | HYDIN | 18.91 |
| 5 | CDC27 | 18.545 |
| 6 | XKR3 | 15.72 |
| 7 | CDON | 10.02 |
| 8 | FAM104B | 7.2055 |
| 9 | KIR3DL3 | 1.012 |
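
The ranking idea can be sketched in a few lines of base R. This is not the `rankByCaddScore` implementation from `mendelian.R`; the function `rank_by_mean_cadd`, the `gene` column name, and the toy scores are assumptions, and the real helper may average per patient before combining.

```r
# Hypothetical sketch: pool variants across patients, average CADD per gene, sort descending.
rank_by_mean_cadd <- function(genes, patients) {
  all.vars <- do.call(rbind, patients)                 # stack the per-patient tables
  all.vars <- all.vars[all.vars$gene %in% genes, ]     # keep only candidate genes
  means <- aggregate(cadd.phred ~ gene, data = all.vars, FUN = mean)
  means <- means[order(-means$cadd.phred), ]           # highest mean score first
  rownames(means) <- NULL
  means
}

# Toy data with placeholder gene labels and made-up scores
toy.patients <- list(
  data.frame(gene = c("GENE_A", "GENE_B"), cadd.phred = c(30, 10), stringsAsFactors = FALSE),
  data.frame(gene = c("GENE_A", "GENE_B"), cadd.phred = c(20, 12), stringsAsFactors = FALSE)
)
toy.ranked <- rank_by_mean_cadd(c("GENE_A", "GENE_B"), toy.patients)
toy.ranked
```

Here `GENE_A` averages 25 and `GENE_B` averages 11, so `GENE_A` is listed first.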


---
## Annotating candidate genes with MyGene.info
Now we can annotate the genes using the `queryMany` function from ***mygene***, which can be installed from Bioconductor (`biocLite("mygene")`). We specify `scopes="symbol"` and request annotations with the `fields` parameter.

In [14]:
res <- data.frame(queryMany(ranked$gene, scopes="symbol", species="human", fields=c("go.BP", "name", "MIM", "uniprot")))

Finished


In [15]:
res$go.BP

[[1]]
                                                     term         id evidence
1                 pyrimidine nucleobase metabolic process GO:0006206      TAS
2    'de novo' pyrimidine nucleobase biosynthetic process GO:0006207      IBA
3                                        female pregnancy GO:0007565      IEA
4                                               lactation GO:0007595      IEA
5                                    response to caffeine GO:0031000      IEA
6                                        response to drug GO:0042493      IEA
7                                  response to starvation GO:0042594      IEA
8                positive regulation of apoptotic process GO:0043065      IEA
9                      'de novo' UMP biosynthetic process GO:0044205      IEA
10                       small molecule metabolic process GO:0044281      TAS
11             pyrimidine nucleoside biosynthetic process GO:0046134      TAS
12 nucleobase-containing small molecule metabolic process 

---
## GO annotation
We can then use the Bioconductor package ***GO.db*** to check whether each gene's GO IDs have GO:0008152, the biological process ID for metabolic process, among their ancestors.

In [16]:
miller.bp <- lapply(res$go.BP, function(i) unlist(i$id))
bp.ancestor <- lapply(miller.bp, function(i) sapply(i, function(j) "GO:0008152" %in% unlist(GOBPANCESTOR[[j]])))
candidate.genes <- ranked$gene[sapply(bp.ancestor, function(i) TRUE %in% i)]
candidate.genes
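
The ancestor test can be illustrated without ***GO.db*** by using a small hand-made ancestor map. In the sketch below, every term ID except `GO:0008152` is a made-up placeholder, and `toy.ancestors` stands in for the role that `GOBPANCESTOR` plays in the notebook.

```r
# Toy ancestor map: term -> all of its ancestors (placeholder IDs, not real GO terms)
toy.ancestors <- list(
  "TERM:1" = c("GO:0008152", "all"),
  "TERM:2" = c("TERM:9", "all"),
  "TERM:3" = c("TERM:1", "GO:0008152", "all")
)

# Toy gene -> annotated terms (placeholder gene labels)
gene.terms <- list(
  GENE_A = c("TERM:1", "TERM:2"),
  GENE_B = "TERM:2",
  GENE_C = "TERM:3"
)

# A gene is kept if any of its terms has metabolic process as an ancestor
is.metabolic <- sapply(gene.terms, function(ids) {
  any(sapply(ids, function(id) "GO:0008152" %in% toy.ancestors[[id]]))
})
names(gene.terms)[is.metabolic]
# [1] "GENE_A" "GENE_C"
```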

The final candidate gene list contains only genes involved in metabolic processes, where perturbation is more likely to result in Mendelian disease. Through the use of the MyVariant.info and MyGene.info annotation services, we have narrowed the candidate gene list from 2589 genes to seven. While we cannot fully remove the process of biological interpretation from an analysis, we have greatly reduced the burden.