Release notes for GATK version 2.0
The GATK 2.0 release adds brand-new (and often still experimental) tools and updates the existing stable tools.
- Base Recalibrator (BQSR v2), an upgrade to CountCovariates/TableRecalibration that generates base substitution, insertion, and deletion error models.
- Reduce Reads, a BAM compression algorithm that reduces file sizes by 20x-100x while preserving all information necessary for accurate SNP and indel calling. ReduceReads enables the GATK to call tens of thousands of deeply sequenced NGS samples simultaneously.
- HaplotypeCaller, a multi-sample local de novo assembly and integrated SNP, indel, and short SV caller.
- Plus powerful extensions to the Unified Genotyper to support variant calling of pooled samples, mitochondrial DNA, and non-diploid organisms. Additionally, the extended Unified Genotyper introduces a novel error modeling approach that uses a reference sample to build a site-specific error model for SNPs and indels that vastly improves calling accuracy.
Base Quality Score Recalibration
- IMPORTANT: the Count Covariates and Table Recalibration tools (which comprise BQSRv1) have been retired! Please see the BaseRecalibrator tool (BQSRv2) for running recalibration with GATK 2.0.
Unified Genotyper
- Now handles the exception generated when non-standard reference bases are present in the FASTA.
- Bug fix for indels: when checking the limits of a read to clip, the tool did not account for reads that had already been clipped.
- Now emits the MLE AC and AF in the INFO field.
- Ns are no longer allowed in insertions when discovering indels.
Phase By Transmission
- Multi-allelic sites are now correctly ignored.
- Reporting of Mendelian violations is enhanced.
- Corrected an overflow in the transmission probability (TP) value.
- Fixed bug that arose when no PLs were present.
- Added option to output the father's allele first in phased child haplotypes.
- Fixed a bug that caused incorrect phasing of child/father pairs.
Variant Eval
- Improved the validation report module: if both eval and comp have genotypes, the comp genotypes are now subset to the samples being evaluated when determining TP, FP, FN, TN status.
- If present, the MLE AC is used by default for the AlleleCount stratification (otherwise it falls back to the greedy AC).
- Fixed bugs in the VariantType and IndelSize stratifications.
Variant Annotator
- The FisherStrand annotation no longer hard-codes filters for bases/reads (it previously required MAPQ > 20 && QUAL > 20).
- Miscellaneous bug fixes to experimental annotations.
- Added a Clipping Rank Sum Test to detect when variants are present on reads with differential clipping.
- Fixed the ReadPos Rank Sum Test annotation so that it no longer uses the un-hardclipped start as the alignment start.
- Fixed bug in the NBaseCount annotation module.
- The new TandemRepeatAnnotator is now a standard annotation, while HRun has been retired.
- Added PED support for the Inbreeding Coefficient annotation.
- QD is no longer computed if there is no QUAL.
Variant Quality Score Recalibration
- The VCF index is now created automatically for the recalFile.
Variant Filtration
- Now allows you to run with type-unsafe JEXL selects, all of which default to false when matching.
Select Variants
- Added an option to re-genotype through the exact AF calculation model (if PLs are present) in order to recalculate the QUAL and genotypes.
Combine Variants
- Added --mergeInfoWithMaxAC argument to keep info fields from the input with the highest AC value.
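The "default to false" behavior described for type-unsafe JEXL selects can be pictured with a small Python sketch. This is not JEXL and not GATK code: the `matches` helper, the `_OPS` table, and the record dictionaries are all hypothetical names introduced for illustration. It only shows the idea that a comparison which cannot be evaluated (missing field, non-numeric value) counts as a non-match instead of raising an error.

```python
import operator

# Hypothetical operator table for this sketch; real JEXL supports far more.
_OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def matches(info, field, op, threshold):
    """Return True only if the comparison can be evaluated and holds.

    A missing field, or a value that cannot be converted to a number,
    makes the whole select evaluate to False instead of raising.
    """
    try:
        value = float(info[field])
    except (KeyError, TypeError, ValueError):
        return False
    return _OPS[op](value, threshold)


records = [
    {"QD": "1.5"},  # numeric string: QD < 2.0 holds
    {"QD": "."},    # missing-value marker: not a number, so False
    {"DP": "30"},   # QD absent altogether, so False
]

selected = [r for r in records if matches(r, "QD", "<", 2.0)]
print(selected)
```

Only the first record is selected; the other two are silently treated as non-matching rather than aborting the run.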
Somatic Indel Detector
- GT header line is now output.
- Now automatically skips Ion Torrent reads, as it already does with 454 reads.
Variants To Table
- Genotype-level fields can now be specified.
- Added the --moltenize argument to produce molten output of the data.
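For readers unfamiliar with the term, "molten" output is the long form of a table: one row per (record, variable, value) instead of one column per field. The Python sketch below illustrates that reshaping only. The `moltenize` function and the `RecordID` column name are assumptions made for this example, and the exact column layout written by VariantsToTable may differ.

```python
def moltenize(rows, id_field="RecordID"):
    """Turn wide rows (one dict per record) into (id, variable, value) triples."""
    molten = []
    for row in rows:
        for name, value in row.items():
            if name == id_field:
                continue  # the identifier is repeated on every output row
            molten.append((row[id_field], name, value))
    return molten


# Hypothetical wide table: one row per variant record.
wide = [
    {"RecordID": 1, "CHROM": "20", "POS": 10001, "QUAL": 50.0},
    {"RecordID": 2, "CHROM": "20", "POS": 10050, "QUAL": 12.3},
]

for triple in moltenize(wide):
    print(*triple, sep="\t")
```

The long form is convenient for plotting and grouping tools that expect one observation per row.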
Depth Of Coverage
- Fixed a NullPointerException that could occur if the user requested an interval summary but never provided a -L argument.
Miscellaneous
- BCF2 support in tools that output VCFs (use the .bcf extension).
- The GATK Engine no longer automatically strips the suffix "Walker" from the end of tool names; as such, all tools whose names ended with "Walker" have been renamed without that suffix.
- Fixed a bug when specifying a JEXL expression for a field that doesn't exist: the whole expression is now treated as false (previously the JEXL exception was rethrown).
- There is now a global --interval_padding argument that specifies how many base pairs to add to both ends of each interval provided with -L.
- Removed all code associated with extended events.
- Algorithmically faster version of DiffEngine.
- Improved down-sampling fixes edge cases that were previously handled poorly. Read Walkers can now use down-sampling.
- GQ is now emitted as an int, not a float.
- Fixed bug in the Beagle codec that was skipping the first line of the file when decoding.
- Fixed bug in the VCF writer in the case where there are no genotypes for a record but there are genotypes in the header.
- Miscellaneous fixes to the VCF headers being produced.
- Fixed up the BadCigar read filter.
- Removed the old deprecated genotyping framework revolving around the misordering of alleles.
- Extensive refactoring of the GATKReports.
- Picard jar updated to version 1.67.1197.
- Tribble jar updated to version 110.
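As an illustration of the --interval_padding item in the list above, the following Python sketch pads a single 1-based inclusive interval on both ends. It is not the engine's implementation: `pad_interval` is a hypothetical helper, and clamping to position 1 and to an optional contig length is an assumption of this sketch rather than documented behavior. Merging of intervals that overlap after padding is not addressed here.

```python
def pad_interval(start, end, padding, contig_length=None):
    """Pad a 1-based inclusive interval by `padding` bases on both ends."""
    if padding < 0:
        raise ValueError("padding must be non-negative")
    # Assumption: never extend before the first base of the contig.
    new_start = max(1, start - padding)
    new_end = end + padding
    # Assumption: never extend past the contig end when its length is known.
    if contig_length is not None:
        new_end = min(contig_length, new_end)
    return new_start, new_end


print(pad_interval(1000, 2000, 100))                  # (900, 2100)
print(pad_interval(50, 120, 100, contig_length=200))  # (1, 200)
```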