This script demonstrates how MyVariant.info and MyGene.info can be used to annotate variants and prioritize candidate genes in patients with rare Mendelian diseases. This specific study uses data obtained from the database of Genotypes and Phenotypes (dbGaP) study...FASTQ files generated by Ng et al. for the Miller syndrome study. The FASTQ files were processed according to the Broad Institute's best practices: individual samples were aligned to the hg19 reference genome using BWA-MEM 0.7.10, variants were called using GATK 3.3-0 HaplotypeCaller, and quality scores were recalibrated using GATK VariantRecalibrator.

In [6]:
library(myvariant)
library(mygene)
library(VariantAnnotation)

In [7]:
setwd("~/sulab/myvariant/vcf/")
vcf.files <- paste(getwd(), list.files(getwd()), sep="/")
source("https://raw.githubusercontent.com/Network-of-BioThings/myvariant.info/master/docs/ipynb/mendelian.R")

mendelian.R defines some helper functions that are used in the analysis downstream of retrieving annotations:
`replaceWith0` - replaces all NAs with 0
`geneInDf` - subsets data.frames by gene name
`geneDf` - extracts a data.frame for each gene with its respective annotations
`cadd.df` - extracts the median CADD score of genes where multiple variants are present
`rankByCaddScore` - prioritizes genes by deleteriousness (scaled CADD score)

The following function reads each VCF file using the VariantAnnotation package from Bioconductor (install with `biocLite("VariantAnnotation")`). `formatHgvs`, from the *myvariant* Bioconductor package, builds an HGVS ID for each variant from its genomic location and alleles in the VCF; these IDs serve as the primary key for each variant. `getVariants` then queries MyVariant.info to retrieve the annotations.

In [None]:
getVars <- function(vcf.file){
  ## read the VCF and keep only single-nucleotide variants
  vcf <- readVcf(vcf.file, genome="hg19")
  vcf <- vcf[isSNV(vcf)]
  vars <- rowRanges(vcf)
  vars <- as(vars, "DataFrame")
  ## build one HGVS ID per variant and query MyVariant.info
  vars$query <- formatHgvs(vcf, "snp")
  annotations <- getVariants(vars$query)
  ## carry over the DP, FS and QD fields from the VCF INFO column
  annotations[c('DP', 'FS', 'QD')] <- info(vcf)[c('DP', 'FS', 'QD')]
  ## replace 'NA' with 0
  annotations <- replaceWith0(annotations)
  annotations
}

vars <- lapply(vcf.files, getVars)

found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 24 ‘info’ fields: AC, AF, AN, BaseQRankSum, ClippingRankSum, DB, DP, DS, END, FS, HaplotypeScore, InbreedingCoeff, MLEAC, MLEAF, MQ, MQ0, MQRankSum, NEGATIVE_TRAIN_SITE, POSITIVE_TRAIN_SITE, QD, ReadPosRankSum, SOR, VQSLOD, culprit 
found header lines for 5 ‘geno’ fields: GT, AD, DP, GQ, PL 


Querying chunk 1 of 40
Querying chunk 2 of 40
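
To make the ID format concrete, here is a minimal base-R sketch of how an HGVS-style genomic ID for a single-nucleotide variant can be assembled. The helper name `makeSnvHgvs` and the example coordinates are hypothetical and used only for illustration; in the analysis itself the IDs come from `formatHgvs`, whose internals are not shown here.

```r
# Hypothetical helper (not part of myvariant): build an HGVS-style
# genomic ID of the form "chr1:g.35367G>A" for an SNV.
makeSnvHgvs <- function(chrom, pos, ref, alt) {
  # chrom is assumed to already carry the "chr" prefix
  paste0(chrom, ":g.", pos, ref, ">", alt)
}

# Illustrative input only; these are not variants from the study.
example.ids <- makeSnvHgvs(c("chr1", "chr16"), c(35367, 1000), c("G", "C"), c("A", "T"))
print(example.ids)
```

Because `paste0` is vectorized, the same call works for a whole column of variants at once.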


In [10]:
## check annotation fields
head(names(vars[[1]]))

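With the annotations retrieved for each VCF file, we filter the variants on call quality (read depth `DP`, strand bias `FS`, quality by depth `QD`), on predicted consequence, and on allele frequency (below 1% in ExAC and in the 1000 Genomes data reported by dbNSFP).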

In [11]:
filterDf <- function(df){
  df <- subset(df, DP > 15 & FS < 30 & QD > 2)
  df <- subset(df, cadd.consequence %in% c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST", 
                                           "CANONICAL_SPLICE", "SPLICE_SITE"))
  df <- subset(df, exac.af < 0.01)
  df <- subset(df, sapply(dbnsfp.1000gp1.af, function(i) i < 0.01 ))
  df
}

In [12]:
filtered.annotations <- lapply(vars, filterDf)
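
The effect of these thresholds is easiest to see on a small table. The sketch below applies the same quality, consequence and ExAC frequency cut-offs to a toy data.frame; every value in `toy` is invented for illustration, and the 1000 Genomes filter is left out to keep the example short.

```r
# Toy annotation table; all values are made up for illustration.
toy <- data.frame(
  id = c("v1", "v2", "v3", "v4"),
  DP = c(40, 10, 55, 60),
  FS = c(2.1, 1.0, 45.0, 3.3),
  QD = c(12, 15, 9, 20),
  cadd.consequence = c("NON_SYNONYMOUS", "STOP_GAINED",
                       "NON_SYNONYMOUS", "SYNONYMOUS"),
  exac.af = c(0.001, 0.0005, 0.002, 0.0001),
  stringsAsFactors = FALSE
)

keep.consequences <- c("NON_SYNONYMOUS", "STOP_GAINED", "STOP_LOST",
                       "CANONICAL_SPLICE", "SPLICE_SITE")

# Same thresholds as in filterDf
pass.quality     <- toy$DP > 15 & toy$FS < 30 & toy$QD > 2
pass.consequence <- toy$cadd.consequence %in% keep.consequences
pass.rare        <- toy$exac.af < 0.01

kept <- toy[pass.quality & pass.consequence & pass.rare, ]
print(kept$id)
```

Here `v2` fails on depth, `v3` on strand bias and `v4` on consequence, so only `v1` is kept.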

We can then keep only genes that are mutated in each of the four patients.

In [13]:
gene.counts <- data.frame(table(unlist(lapply(filtered.annotations, function(i) unique(i$dbnsfp.genename)))))
top.genes <- subset(gene.counts, Freq == 4)
top.genes

    Var1     Freq
1   0        4
281 CDC27    4
293 CDON     4
388 CTBP2    4
391 CTDSP2   4
430 DHODH    4
538 FAM104B  4
599 FRG2C    4
693 GXYLT1   4
748 HYDIN    4

We can prioritize the 14 genes that contain variants in each of the patients according to their CADD (deleteriousness) score. `rankByCaddScore` extracts the average CADD score of the variants in each gene per patient and ranks the genes in descending order.

In [11]:
ranked <- rankByCaddScore(top.genes$Var1, filtered.annotations)
ranked

ERROR: Error in do.call(rbind, lapply(gene.list, function(i) geneDf(df.list, : error in evaluating the argument 'args' in selecting a method for function 'do.call': Error in lapply(gene.list, function(i) geneDf(df.list, i)) : 
  error in evaluating the argument 'X' in selecting a method for function 'lapply': Error: object 'top.genes' not found



ERROR: Error in eval(expr, envir, enclos): object 'ranked' not found
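
The body of `rankByCaddScore` is not shown in this notebook, so the following is only a sketch of the general idea: summarize a per-variant score for each gene, then sort genes from most to least deleterious. The column name `cadd.phred`, the use of the median as the summary, and all values in `toy.scores` are assumptions made for illustration.

```r
# Toy per-variant scores; gene names and values are invented.
toy.scores <- data.frame(
  gene = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C", "GENE_C"),
  cadd.phred = c(25, 31, 12, 5, 9, 40),
  stringsAsFactors = FALSE
)

# Summarize each gene by the median score of its variants
per.gene <- aggregate(cadd.phred ~ gene, data = toy.scores, FUN = median)

# Rank genes by that summary, highest first
ranked.toy <- per.gene[order(per.gene$cadd.phred, decreasing = TRUE), ]
print(ranked.toy)
```

A median is less sensitive than a mean to a single extreme variant, which is why `GENE_C` ranks last here despite containing the highest individual score.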


Ng et al. also concluded that DHODH is responsible for Miller syndrome.

In [10]:
queryMany(ranked$gene, scopes="symbol", species="human", fields=c("name", "pathway", "kegg", "MIM", "uniprot"))

ERROR: Error in unlist(x): error in evaluating the argument 'x' in selecting a method for function 'unlist': Error: object 'ranked' not found

