Genotype and Validate
Please note that this article has not been updated in a very long time and may no longer be applicable. Use at your own risk.
Genotype and Validate is a tool to assess the quality of a technology dataset for calling SNPs and Indels, given a secondary (validation) data source.
The simplest scenario is when you have a VCF of hand-annotated SNPs and Indels, and you want to know how well a particular technology performs at calling these SNPs. With a dataset (BAM file) generated by the technology under test and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made from the new technology's dataset.
Another option is to validate the calls in a VCF file using a deep-coverage BAM file whose calls you trust. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare them to the calls in the VCF file and produce a truth table.
Usage of GenotypeAndValidate and its command line arguments are described here.
The VCF Annotations
The annotations can be either true positive (T) or false positive (F). 'T' means the site is known to be a true SNP/Indel, while 'F' means it is known not to be a SNP/Indel even though the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.
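As an illustration of the annotation step, here is a minimal Python sketch that sets GV=T or GV=F in the INFO column of a single tab-delimited VCF record. The function name `set_gv` and the sample record are hypothetical, chosen only for this example; the only detail taken from the text above is the INFO key GV with the values T and F.

```python
def set_gv(vcf_line: str, is_true_variant: bool) -> str:
    """Return the VCF record with GV=T or GV=F set in its INFO column."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th column of a VCF record
    # Drop an empty "." placeholder and any GV entry that is already present.
    entries = [e for e in info.split(";")
               if e and e != "." and not e.startswith("GV=")]
    entries.append("GV=T" if is_true_variant else "GV=F")
    fields[7] = ";".join(entries)
    return "\t".join(fields)


# Hypothetical record, hand-annotated as a true variant.
record = "1\t20568807\t.\tC\tT\t0\tHapMapHet\tAC=1;AF=0.50;AN=2;DP=0"
annotated = set_gv(record, True)
print(annotated)
```

Re-running the function on an already annotated record replaces the existing GV value rather than appending a second one.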
GenotypeAndValidate has two outputs: the truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:
| called alt | True Positive (TP) | False Positive (FP) | Positive PV |
| called ref | False Negative (FN) | True Negative (TN) | Negative PV |
The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnosed.
The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
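Applied to the truth table, these two definitions reduce to PPV = TP / (TP + FP) and NPV = TN / (TN + FN). The Python sketch below computes both from the four cell counts. The function name and the counts are made up for illustration and are not output from the tool.

```python
def predictive_values(tp: int, fp: int, fn: int, tn: int):
    """Return (PPV, NPV) for a 2x2 truth table; None where a row is empty."""
    called_alt = tp + fp  # first row of the table
    called_ref = fn + tn  # second row of the table
    ppv = tp / called_alt if called_alt else None
    npv = tn / called_ref if called_ref else None
    return ppv, npv


# Hypothetical counts, for illustration only.
ppv, npv = predictive_values(tp=90, fp=10, fn=5, tn=45)
print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

Returning None for an empty row avoids a division by zero when, for example, nothing was called ref.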
The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
You should always use -BTI alleles so that the GATK only looks at the sites in the VCF file; this speeds up the process considerably. (This will soon be added as a default GATK engine mode.)
The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, which trigger one call per inserted or deleted base (e.g. ACTG/- will count as 4 genotyper calls, but it is only one line in the VCF).
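To see why the two numbers can differ, the following Python sketch estimates the number of genotyper calls per record. It rests on an assumption drawn only from the sentence above, not from the walker's source code: an indel counts once per inserted or deleted base, and any other variant counts once. The function name and the allele pairs are hypothetical.

```python
def expected_calls(ref: str, alt: str) -> int:
    """Estimate how many genotyper calls one variant record produces.

    Assumption: an indel counts once per inserted or deleted base,
    and any other variant (such as a SNP) counts once.
    """
    # Treat "-" (as in "ACTG/-") as an empty allele.
    ref_len = len(ref.replace("-", ""))
    alt_len = len(alt.replace("-", ""))
    return max(1, abs(ref_len - alt_len))


# Hypothetical records: a 4-base deletion, a SNP, and a 2-base insertion.
records = [("ACTG", "-"), ("C", "T"), ("G", "GTT")]
total = sum(expected_calls(ref, alt) for ref, alt in records)
print(len(records), "VCF lines ->", total, "calls")
```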
Genotyping a BAM file from a new technology, using the VCF as the truth dataset:
java \
  -jar /GenomeAnalysisTK.jar \
  -T GenotypeAndValidate \
  -R human_g1k_v37.fasta \
  -I myNewTechReads.bam \
  -alleles handAnnotatedVCF.vcf \
  -BTI alleles \
  -o gav.vcf
An annotated VCF example (INFO field clipped for clarity):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1
1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./.
13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./.
1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
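Before running the walker, it can help to check how many sites carry each GV value. The Python sketch below tallies them from the INFO column. The function name `count_gv` is hypothetical, and the records are the four from the example above, cut down to their first eight columns. The code splits on any whitespace so that it also works on space-separated text as displayed here; real VCF files are tab-delimited.

```python
from collections import Counter


def count_gv(vcf_lines):
    """Tally GV annotations (T, F, or 'missing') over VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        info = line.split()[7]  # INFO is the 8th column
        entries = dict(e.split("=", 1) for e in info.split(";") if "=" in e)
        counts[entries.get("GV", "missing")] += 1
    return counts


example = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO",
    "1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T",
    "1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42",
    "13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45",
    "1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74",
]
counts = count_gv(example)
print(dict(counts))
```

Records without a GV entry are counted under "missing", which makes unannotated sites easy to spot.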
Using a BAM file as the truth dataset:
java \
  -jar /GenomeAnalysisTK.jar \
  -T GenotypeAndValidate \
  -R human_g1k_v37.fasta \
  -I myTruthDataset.bam \
  -alleles callsToValidate.vcf \
  -BTI alleles \
  -bt \
  -o gav.vcf
Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker: