## MyVariant.info and MyGene.info Use Case

The following R script demonstrates how the MyVariant.info and MyGene.info R clients can be used to annotate variants and prioritize candidate genes in patients with rare Mendelian diseases. This study uses data obtained from the database of Genotypes and Phenotypes (dbGaP) study...FASTQ files generated by Ng et al. for the Miller syndrome study. The FASTQ files were processed according to the Broad Institute’s best practices: individual samples were aligned to the hg19 reference genome using BWA-MEM 0.7.10, variants were called using GATK 3.3-0 HaplotypeCaller, and variant quality scores were recalibrated using GATK VariantRecalibrator.

In [1]:
library(myvariant)
library(mygene)
library(VariantAnnotation)
source("https://raw.githubusercontent.com/Network-of-BioThings/myvariant.info/master/docs/ipynb/mendelian.R")
setwd("~/sulab/myvariant/vcf/")
vcf.files <- paste(getwd(), list.files(getwd()), sep="/")

Loading required package: GenomicFeatures
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘AnnotationDbi’

The following object is masked from ‘package:dplyr’:

    select


Attaching package: ‘mygene’

The following object is masked from ‘package:dplyr’:

    query



mendelian.R defines some helper functions that are used in the analysis downstream of retrieving annotations:

`replaceWith0` - replaces all NAs with 0.

`rankByCaddScore` - for prioritizing genes by deleteriousness (scaled CADD score).
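
As an illustration of the first helper, here is a minimal sketch of how NA values in an annotation table could be replaced with 0. The function name `replace_na_with_zero` and the toy data are assumptions made for this example; the actual `replaceWith0` is defined in mendelian.R and may be implemented differently.

```r
# Hypothetical stand-in for a helper like replaceWith0; the real definition
# lives in mendelian.R and may differ.
replace_na_with_zero <- function(df) {
  df[] <- lapply(df, function(col) {
    # Only touch atomic columns; list columns are returned unchanged.
    if (is.atomic(col)) col[is.na(col)] <- 0
    col
  })
  df
}

# Toy annotation table with missing values (values are made up).
toy <- data.frame(exac.af = c(0.2, NA, 0.001), DP = c(10, 25, NA))
cleaned <- replace_na_with_zero(toy)
print(cleaned)
```

One consequence of this approach is that a missing value and a true zero can no longer be told apart after the replacement, which matters for the frequency filters later in the analysis.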

---
## Annotating variants with MyVariant.info
The following function reads each output VCF file using the VariantAnnotation package, available from Bioconductor (install with `biocLite("VariantAnnotation")`, or with `BiocManager::install("VariantAnnotation")` on current Bioconductor releases). Only single-nucleotide variants are kept (`isSNV`). `formatHgvs` (from the ***myvariant*** Bioconductor package) reads the genomic location and variant information from the VCF and creates HGVS IDs, which serve as the primary key for each variant. `getVariants` then queries MyVariant.info to retrieve the annotations.
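
To make the ID format concrete, the sketch below builds genomic HGVS strings of the form `chr1:g.35367G>A` for single-nucleotide variants from plain vectors. The helper `make_snv_hgvs` is hypothetical and is not the ***myvariant*** implementation of `formatHgvs`, which works directly on the VCF object; the coordinates are arbitrary and do not come from this study.

```r
# Hypothetical helper that illustrates the HGVS ID format for SNVs.
# It is NOT the myvariant implementation of formatHgvs.
make_snv_hgvs <- function(chrom, pos, ref, alt) {
  # Add the "chr" prefix only when it is missing.
  chrom <- ifelse(grepl("^chr", chrom), chrom, paste0("chr", chrom))
  paste0(chrom, ":g.", pos, ref, ">", alt)
}

# Arbitrary example coordinates (not taken from the study data).
ids <- make_snv_hgvs(
  chrom = c("1", "chr2"),
  pos   = c(35367L, 12345L),
  ref   = c("G", "C"),
  alt   = c("A", "T")
)
print(ids)
```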

In [2]:
getVars <- function(vcf.file){
  cat(paste("Processing ", vcf.file, "...\n", sep=" "))
  vcf <- readVcf(vcf.file, genome="hg19")
  vcf <- vcf[isSNV(vcf)]
  vars <- rowRanges(vcf)
  vars <- as(vars, "DataFrame")
  vars$query <- formatHgvs(vcf, "snp")
  annotations <- getVariants(vars$query)
  annotations[c('DP', 'FS', 'QD')] <- info(vcf)[c('DP', 'FS', 'QD')]
  annotations <- replaceWith0(annotations)
  annotations
}

vars <- lapply(vcf.files, getVars)

Processing  /Users/Adam/sulab/myvariant/vcf/subject01_recalibrate_SNP.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, DP, DS, END, FS, HaplotypeScore, InbreedingCoeff, MLEAC, MLEAF, MQ, MQ0, MQRankSum, NEGATIVE_TRAIN_SITE, POSITIVE_TRAIN_SITE, QD, ReadPosRankSum, SOR, VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 40
Querying chunk 2 of 40
Querying chunk 3 of 40
Querying chunk 4 of 40
Querying chunk 5 of 40
Querying chunk 6 of 40
Querying chunk 7 of 40
Querying chunk 8 of 40
Querying chunk 9 of 40
Querying chunk 10 of 40
Querying chunk 11 of 40
Querying chunk 12 of 40
Querying chunk 13 of 40
Querying chunk 14 of 40
Querying chunk 15 of 40
Querying chunk 16 of 40
Querying chunk 17 of 40
Querying chunk 18 of 40
Querying chunk 19 of 40
Querying chunk 20 of 40
Querying chunk 21 of 40
Querying chunk 22 of 40
Querying chunk 23 of 40
Querying chunk 24 of 40
Querying chunk 25 of 40
Querying chunk 26 of 40
Querying chunk 27 of 40
Querying chunk 28 of 40
Querying chunk 29 of 40
Querying chunk 30 of 40
Querying chunk 31 of 40
Querying chunk 32 of 40
Querying chunk 33 of 40
Querying chunk 34 of 40
Querying chunk 35 of 40
Querying chunk 36 of 40
Querying chunk 37 of 40
Querying chunk 38 of 40
Querying chunk 39 of 40
Querying chunk 40 of 40
Concatenating data, please be patient.


Processing  /Users/Adam/sulab/myvariant/vcf/subject02_recalibrate_SNP.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, DP, DS, END, FS, HaplotypeScore, InbreedingCoeff, MLEAC, MLEAF, MQ, MQ0, MQRankSum, NEGATIVE_TRAIN_SITE, POSITIVE_TRAIN_SITE, QD, ReadPosRankSum, SOR, VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 39
Querying chunk 2 of 39
Querying chunk 3 of 39
Querying chunk 4 of 39
Querying chunk 5 of 39
Querying chunk 6 of 39
Querying chunk 7 of 39
Querying chunk 8 of 39
Querying chunk 9 of 39
Querying chunk 10 of 39
Querying chunk 11 of 39
Querying chunk 12 of 39
Querying chunk 13 of 39
Querying chunk 14 of 39
Querying chunk 15 of 39
Querying chunk 16 of 39
Querying chunk 17 of 39
Querying chunk 18 of 39
Querying chunk 19 of 39
Querying chunk 20 of 39
Querying chunk 21 of 39
Querying chunk 22 of 39
Querying chunk 23 of 39
Querying chunk 24 of 39
Querying chunk 25 of 39
Querying chunk 26 of 39
Querying chunk 27 of 39
Querying chunk 28 of 39
Querying chunk 29 of 39
Querying chunk 30 of 39
Querying chunk 31 of 39
Querying chunk 32 of 39
Querying chunk 33 of 39
Querying chunk 34 of 39
Querying chunk 35 of 39
Querying chunk 36 of 39
Querying chunk 37 of 39
Querying chunk 38 of 39
Querying chunk 39 of 39
Concatenating data, please be patient.


Processing  /Users/Adam/sulab/myvariant/vcf/subject03_recalibrate_SNP.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, DP, DS, END, FS, HaplotypeScore, InbreedingCoeff, MLEAC, MLEAF, MQ, MQ0, MQRankSum, NEGATIVE_TRAIN_SITE, POSITIVE_TRAIN_SITE, QD, ReadPosRankSum, SOR, VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 31
Querying chunk 2 of 31
Querying chunk 3 of 31
Querying chunk 4 of 31
Querying chunk 5 of 31
Querying chunk 6 of 31
Querying chunk 7 of 31
Querying chunk 8 of 31
Querying chunk 9 of 31
Querying chunk 10 of 31
Querying chunk 11 of 31
Querying chunk 12 of 31
Querying chunk 13 of 31
Querying chunk 14 of 31
Querying chunk 15 of 31
Querying chunk 16 of 31
Querying chunk 17 of 31
Querying chunk 18 of 31
Querying chunk 19 of 31
Querying chunk 20 of 31
Querying chunk 21 of 31
Querying chunk 22 of 31
Querying chunk 23 of 31
Querying chunk 24 of 31
Querying chunk 25 of 31
Querying chunk 26 of 31
Querying chunk 27 of 31
Querying chunk 28 of 31
Querying chunk 29 of 31
Querying chunk 30 of 31
Querying chunk 31 of 31
Concatenating data, please be patient.


Processing  /Users/Adam/sulab/myvariant/vcf/subject04_recalibrate_SNP.vcf ...
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, DP, DS, END, FS, HaplotypeScore, InbreedingCoeff, MLEAC, MLEAF, MQ, MQ0, MQRankSum, NEGATIVE_TRAIN_SITE, POSITIVE_TRAIN_SITE, QD, ReadPosRankSum, SOR, VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 38
Querying chunk 2 of 38
Querying chunk 3 of 38
Querying chunk 4 of 38
Querying chunk 5 of 38
Querying chunk 6 of 38
Querying chunk 7 of 38
Querying chunk 8 of 38
Querying chunk 9 of 38
Querying chunk 10 of 38
Querying chunk 11 of 38
Querying chunk 12 of 38
Querying chunk 13 of 38
Querying chunk 14 of 38
Querying chunk 15 of 38
Querying chunk 16 of 38
Querying chunk 17 of 38
Querying chunk 18 of 38
Querying chunk 19 of 38
Querying chunk 20 of 38
Querying chunk 21 of 38
Querying chunk 22 of 38
Querying chunk 23 of 38
Querying chunk 24 of 38
Querying chunk 25 of 38
Querying chunk 26 of 38
Querying chunk 27 of 38
Querying chunk 28 of 38
Querying chunk 29 of 38
Querying chunk 30 of 38
Querying chunk 31 of 38
Querying chunk 32 of 38
Querying chunk 33 of 38
Querying chunk 34 of 38
Querying chunk 35 of 38
Querying chunk 36 of 38
Querying chunk 37 of 38
Querying chunk 38 of 38
Concatenating data, please be patient.


Before filtering, we count the genes that carry at least one variant in all four patients. The variants from each VCF file are then filtered by coverage and annotation.

In [3]:
cat(paste(nrow(subset(data.frame(table(unlist(lapply(vars, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "common genes are mutated amongst patients"))

2589 common genes are mutated amongst patients
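
The counting idiom used in the cell above can be hard to read as a one-liner, so here is the same idea on toy data. The gene symbols and the three toy patients are placeholders, and the sketch compares against `length(toy_vars)` instead of the hard-coded 4 used in the notebook.

```r
# Toy per-patient annotation tables; gene symbols are placeholders.
toy_vars <- list(
  data.frame(gene = c("GENE_A", "GENE_A", "GENE_B"), stringsAsFactors = FALSE),
  data.frame(gene = c("GENE_A", "GENE_C"), stringsAsFactors = FALSE),
  data.frame(gene = c("GENE_A", "GENE_B"), stringsAsFactors = FALSE)
)

# unique() makes each patient contribute a gene at most once, so the table
# frequency equals the number of patients with a variant in that gene.
genes_per_patient <- lapply(toy_vars, function(p) unique(p$gene))
gene_counts <- data.frame(table(unlist(genes_per_patient)))

# Keep genes seen in every patient.
shared <- subset(gene_counts, Freq == length(toy_vars))
print(shared)
```

`data.frame(table(...))` names its columns `Var1` and `Freq`, which is why the notebook refers to `Freq` here and to `top.genes$Var1` later on.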

---
## Filtering Steps
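1) The first filter uses annotations output by HaplotypeCaller from GATK, keeping variants with read depth (DP) > 8, Fisher strand bias (FS) < 30, and quality by depth (QD) > 2.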

In [4]:
filter1 <- lapply(vars, function(i) subset(i, DP > 8 & FS < 30 & QD > 2))
    
cat(paste(nrow(subset(data.frame(table(unlist(lapply(filter1, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "genes remain after filtering for coverage and strand bias"))

2420 genes remain after filtering for coverage and strand bias
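
The quality filter can be sketched on toy data as follows. The wrapper `passes_quality` and its argument names are assumptions for illustration; the default thresholds mirror the ones in the cell above, and the toy values are made up.

```r
# Toy call-quality tables for two patients (values are made up).
toy_calls <- list(
  data.frame(DP = c(5, 20, 40), FS = c(1.2, 45.0, 3.0), QD = c(10, 12, 1.5)),
  data.frame(DP = c(12, 9),     FS = c(0.0, 29.9),      QD = c(2.1, 25))
)

# Hypothetical wrapper; defaults mirror the notebook thresholds.
passes_quality <- function(df, min_dp = 8, max_fs = 30, min_qd = 2) {
  df[df$DP > min_dp & df$FS < max_fs & df$QD > min_qd, , drop = FALSE]
}

filtered <- lapply(toy_calls, passes_quality)
kept_counts <- sapply(filtered, nrow)
print(kept_counts)
```

In the first toy table each row fails exactly one criterion (depth, strand bias, and quality by depth respectively), which shows that the three conditions are combined with a logical AND.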

2) The next filter keeps nonsynonymous, stop-gained, stop-lost, and splice-site variants.

In [5]:
filter2 <- lapply(filter1, function(i) subset(i, cadd.consequence %in% c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", 
                                           "CANONICAL_SPLICE", "SPLICE_SITE")))
    
cat(paste(nrow(subset(data.frame(table(unlist(lapply(filter2, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "genes remain after filtering for nonsynonymous and splice site variants"))

2003 genes remain after filtering for nonsynonymous and splice site variants

3) The third filter keeps variants that are rare in the ExAC data set (allele frequency < 0.01).

In [6]:
filter3 <- lapply(filter2, function(i) subset(i, exac.af < 0.01))
    
cat(paste(nrow(subset(data.frame(table(unlist(lapply(filter3, function(i) unique(i$dbnsfp.genename))))), 
    Freq == 4)), "genes remain after filtering for allele frequency < 0.01 in the ExAC dataset"))

28 genes remain after filtering for allele frequency < 0.01 in the ExAC dataset
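
One detail worth spelling out is how missing frequencies interact with this filter. Since `replaceWith0` is described above as replacing all NAs with 0, a variant with no ExAC frequency would presumably reach this step with `exac.af` equal to 0 and therefore pass as rare. The toy example below, with made-up variants, contrasts that with what `subset` does when the NA is left in place.

```r
# Toy table: v2 has no frequency record (values are made up).
toy <- data.frame(
  variant = c("v1", "v2", "v3"),
  exac.af = c(0.25, NA, 0.004),
  stringsAsFactors = FALSE
)

# With the NA left in place, subset() treats the comparison as FALSE
# and drops v2.
rare_raw <- subset(toy, exac.af < 0.01)

# After zero-filling, v2 counts as rare and is kept.
toy_filled <- toy
toy_filled$exac.af[is.na(toy_filled$exac.af)] <- 0
rare_filled <- subset(toy_filled, exac.af < 0.01)

print(rare_raw$variant)
print(rare_filled$variant)
```

Keeping unobserved variants is usually the desired behaviour when searching for rare disease alleles, but it is an assumption worth checking against the actual helper in mendelian.R.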

4) The last filter keeps variants that are also rare in the 1000 Genomes Project (allele frequency < 0.01).

In [7]:
filter4 <- lapply(filter3, function(i) subset(i, sapply(dbnsfp.1000gp1.af, function(j) j < 0.01 )))
    
top.genes <- subset(data.frame(table(unlist(lapply(filter4, function(i) unique(i$dbnsfp.genename))))), Freq == 4)
top.genes <- subset(top.genes, !(Var1 %in% c("NULL", 0)))
cat(paste(nrow(top.genes), "genes remain after filtering for allele frequency < 0.01 in 1000 Genomes Project"))

14 genes remain after filtering for allele frequency < 0.01 in 1000 Genomes Project

In [8]:
top.genes$Var1

---
## Prioritizing genes
We can prioritize the genes that contain variants in every patient according to their CADD (deleteriousness) score. `rankByCaddScore` extracts the average CADD score of the variants in each gene per patient and ranks the genes in descending order.

In [None]:
ranked <- rankByCaddScore(top.genes$Var1, filter4)
ranked

| | gene | cadd.phred |
|---|---|---|
| 1 | DHODH | 26.81 |
| 2 | GXYLT1 | 26.1 |
| 3 | CTBP2 | 26.0 |
| 4 | MTCH2 | 24.93333 |
| 5 | SETD8 | 23.8 |
| 6 | CTDSP2 | 22.58571 |
| 7 | CDC27 | 20.465 |
| 8 | HYDIN | 18.91 |
| 9 | XKR3 | 15.72 |
| 10 | CDON | 10.02 |
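
For readers without access to mendelian.R, a ranking of this kind can be approximated in base R as sketched below. This is not the actual `rankByCaddScore`: the function name `rank_by_mean_score` is hypothetical, it takes a single pooled data frame instead of the per-patient list used in the notebook, and the toy genes and scores are made up.

```r
# Hypothetical sketch; the real rankByCaddScore is defined in mendelian.R.
rank_by_mean_score <- function(variants, genes) {
  # Restrict to the candidate genes.
  kept <- variants[variants$gene %in% genes, , drop = FALSE]
  # Mean score per gene.
  means <- aggregate(cadd.phred ~ gene, data = kept, FUN = mean)
  # Highest mean score first.
  means <- means[order(means$cadd.phred, decreasing = TRUE), ]
  rownames(means) <- NULL
  means
}

# Toy variant table (genes and scores are made up).
toy <- data.frame(
  gene = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  cadd.phred = c(30, 20, 26, 5),
  stringsAsFactors = FALSE
)
ranked_toy <- rank_by_mean_score(toy, c("GENE_A", "GENE_B"))
print(ranked_toy)
```

Averaging means that a gene with one very damaging variant and one benign variant can rank below a gene with a single moderately damaging variant, as GENE_A does here; taking the maximum instead of the mean would be an alternative design choice.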


---
## Annotating candidate genes with MyGene.info
Now we can annotate the genes using the `queryMany` function from ***mygene***, which can be installed from Bioconductor (`biocLite("mygene")`).

In [None]:
queryMany(ranked$gene, scopes="symbol", species="human", fields=c("name", "pathway.kegg", "MIM", "uniprot"))

Finished


DataFrame with 14 rows and 7 columns
            MIM
    <character>
1        126064
2        613321
3        602619
4        613221
5        607240
...         ...
10       608707
11       607549
12           NA
13           NA
14       610095
                                                                                 name
                                                                          <character>
1                                              dihydroorotate dehydrogenase (quinone)
2                                                      glucoside xylosyltransferase 1
3                                                        C-terminal binding protein 2
4                                                             mitochondrial carrier 2
5                                  SET domain containing (lysine methyltransferase) 8
...                                                                               ...
10                                       cell adhesion associated, 

---
## Conclusion
Ng et al. concluded that DHODH is responsible for Miller syndrome. DHODH is also the top-ranked gene by CADD score in the analysis above.