Download release: gatk-4.2.6.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.6.1 release:
This release contains a single bug fix for `GenotypeGVCFs` to fix an erroneous `IllegalStateException` ("No likelihood sum exceeded zero -- method was called for variant data with no variant information.") in the edge case where unnormalized PLs are present at monomorphic sites.
Download release: gatk-4.2.6.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.6.0 release:
- Important bug fixes for the joint calling tools (GenotypeGVCFs / GenomicsDB)
  - GATK 4.2.5.0 contained two joint genotyping bugs that are now fixed in GATK 4.2.6.0:
    - `GenotypeGVCFs` can throw NullPointerExceptions in some cases with many alternate alleles.
    - The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
  - If you are running these tools in 4.2.5.0 we strongly recommend updating to 4.2.6.0
- Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when the `--gcs-project-for-requester-pays` argument was specified
  - If you continue to encounter problems accessing requester pays Google Cloud Storage buckets in 4.2.6.0, please let us know by filing a GitHub issue!
- Two new tools for the Structural Variation calling pipeline: `SVAnnotate` and `PrintSVEvidence`
- Some fixes to genotype-given-alleles mode in `HaplotypeCaller` and `Mutect2`
Full list of changes:
- Joint Calling (GenotypeGVCFs / GenomicsDB)
  - GATK 4.2.5.0 contained two joint genotyping bugs which are now fixed in 4.2.6.0:
    - `GenotypeGVCFs` can throw NullPointerExceptions in some cases with many alternate alleles.
      - Fixed in:
        - Fix for `NullPointerException` when GenomicsDB has more ALT alleles than specified maximum and many GQ0 hom-ref genotypes allow variants to pass the QUAL filter (#7738)
    - The expectation-maximization component of the QUAL calculation was disabled, leading to false positive, low quality alleles at some multi-allelic sites.
      - Fixed in:
        - Fix multi-allelic QUAL calculation and restore some missing ALT annotation data in `ReblockGVCFs` (#7670)
  - Mention acceptable compressed VCF file extensions in `GenomicsDBImport` error message (#7692)
- SV Calling
  - Added a new tool `SVAnnotate` (#7431)
    - `SVAnnotate` adds functional annotations for SVs called by GATK-SV (#7431)
  - Added a new tool `PrintSVEvidence` (#7695)
    - `PrintSVEvidence` is a tool that can merge any number of files containing one of five types of evidence of structural variation. It's also capable of subsetting regions or samples. It's used to merge evidence from a cohort in the GATK-SV pipeline.
  - Added start/end coordinate validation to `SVCallRecord` (#7714)
- HaplotypeCaller / Mutect2
  - Fixed an edge case in `HaplotypeCaller` where filtered alleles in the vicinity of forced-calling alleles could result in empty calls (#7740)
    - This affects users who run genotype given alleles mode in non-GVCF mode
  - Fixed a bug in `HaplotypeCaller` and `Mutect2` where force-calling alleles were lost upon trimming; allele injection now takes place after trimming (#7679)
  - Added a debug `--pair-hmm-results-file` argument that dumps the exact inputs/outputs of the PairHMM to a file (#7660)
  - Some changes to `Mutect2` to support the future `Mutect3` (#7663)
    - Added training data for the Mutect3 normal artifact filter
    - Output tensors for Mutect3 as plain text rather than VCF
- RNA Tools
  - `TransferReadTags`: a new tool that transfers a read tag from an unaligned bam to the matching aligned bam (#7739).
    - This tool allows us to retrieve read tags that get lost when converting a SAM file to fastqs, then back to SAM (which is necessary if e.g. running fastp to clip adapter bases before alignment).
  - `PostProcessReadsForRSEM`: a new tool that re-orders and filters reads before running RSEM, which has stringent requirements on the input SAM (https://github.com/deweylab/RSEM) (#7752).
- Funcotator
  - Added custom `VariantClassification` severity ordering. (#7673)
    - Users can now customize the severity ratings of the various `VariantClassifications` using the new `--custom-variant-classification-order` argument
  - Added logging statements to the b37 conversion process explaining why the automatic b37 conversion does or does not take place on the input VCFs (#7760)
- VariantRecalibrator
  - Added regularization to covariance in GMM maximization step to fix convergence issues in `VariantRecalibrator` (#7709)
    - This makes the tool more robust in cases where annotations are highly correlated
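To illustrate why highly correlated annotations can break a Gaussian mixture fit, here is a minimal sketch, not the GATK implementation. It assumes one common form of regularization (adding a small constant to the covariance diagonal); the exact form used in #7709 is not stated in these notes, and the `epsilon` value and function names are hypothetical.

```python
def covariance_2d(points):
    """Population covariance matrix [[sxx, sxy], [sxy, syy]] of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return [[sxx, sxy], [sxy, syy]]


def regularize(cov, epsilon=1e-3):
    """Add epsilon to the diagonal (hypothetical value, for illustration only)."""
    return [[cov[i][j] + (epsilon if i == j else 0.0) for j in range(2)]
            for i in range(2)]


def determinant(cov):
    return cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]


# Two perfectly correlated annotations: the second is always twice the first.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov = covariance_2d(points)

print(determinant(cov))                  # singular: cannot be inverted
print(determinant(regularize(cov)) > 0)  # invertible after regularization
```

A singular covariance matrix has no inverse, so the Gaussian density cannot be evaluated; the diagonal term keeps the matrix invertible at the cost of a slight bias.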
- Bug Fixes
  - Fixed a "Bucket is a requester pays bucket but no user project provided" error that occurred when accessing requester pays buckets in Google Cloud Storage even when `--gcs-project-for-requester-pays` was specified (#7700) (#7730)
  - Fix for the `PossibleDeNovo` annotation to work without Genotype Likelihoods (#7662)
    - `PossibleDeNovo` previously checked each trio's genotype (including parent hom ref genotypes) for likelihoods even though it doesn't actually use the PLs. The PLs can get dropped if GVCFs are reblocked, which meant the annotation no longer worked as expected. The check now looks for GQs instead of PLs, as the GQs are used as part of the annotation.
  - Fixed a bug where the `--mate-too-distant-length` argument in `MateDistantReadFilter` was not configurable (#7701)
- GATK Engine
- Miscellaneous Changes
  - Added back the `jcenter` repository resolver to our gradle build, fixing a "Could not find biz.k11i:xgboost-predictor:0.3.0" error when building GATK from source (#7665)
  - We now properly update the `latest` tag in the `broadinstitute/gatk-nightly` Dockerhub repo (#7703)
  - The docker build now only does a `git lfs pull` on `src/main/resources/large` (#7727)
  - Install git lfs with --force in the `Dockerfile` (#7682)
  - Fix WDL generation for `MultiVariantWalkers` by adding a companion index to the `MultiVariantWalker` input variant arg (#7689)
  - Added google apps script to automatically update GATK release stats. (#7637)
  - Updated the GATK stats script to be more universally usable (#7759)
  - Added `JointCallExomeCNVs` to `.dockstore.yml` and included a note in the WDL (#7719)
- Documentation
  - Corrected the docs for the `--heterozygosity` argument in the `GenotypeCalculationArgumentCollection` (#7661)
- Dependencies
Download release: gatk-4.2.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.5.0 release:
- Fixed a `GenotypeGVCFs` `IllegalStateException` error reported by multiple users in #7639
- Added a new tool `SVCluster` that clusters structural variants based on coordinates, event type, and supporting algorithms.
Full list of changes:
- Joint Calling (GenotypeGVCFs / GenomicsDB)
  - Fixed an `IllegalStateException` in `GenotypeGVCFs` arising from GenomicsDB output with too many alts and no likelihoods, and also added a `--genomicsdb-max-alternate-alleles` argument that is separate from the `--max-alternate-alleles` argument used by `GenotypeGVCFs` (#7655)
    - This fixes the `GenotypeGVCFs` error reported in #7639
    - The new `--genomicsdb-max-alternate-alleles` argument is required to be at least one greater than the `--max-alternate-alleles` argument, to account for the NON_REF allele.
  - `ReblockGVCF`: fixed an edge case where hom-ref "variant" records with no data had wrong-sized PLs and didn't merge with adjacent blocks (#7644)
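The relationship between the two allele limits can be checked before launching a job. The sketch below is a hypothetical pre-flight helper, not part of GATK; only the two argument names and the "at least one greater" rule come from the notes above.

```python
def check_max_alt_alleles(max_alternate_alleles, genomicsdb_max_alternate_alleles):
    """Validate the documented constraint between the two limits.

    The GenomicsDB limit must be at least one greater than the
    GenotypeGVCFs limit, to leave room for the NON_REF allele.
    """
    required = max_alternate_alleles + 1
    if genomicsdb_max_alternate_alleles < required:
        raise ValueError(
            "--genomicsdb-max-alternate-alleles must be at least "
            f"{required} when --max-alternate-alleles is {max_alternate_alleles}"
        )
    return True


print(check_max_alt_alleles(6, 7))  # True: 7 >= 6 + 1
```

Calling `check_max_alt_alleles(6, 6)` raises `ValueError`, because no slot would remain for NON_REF.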
- SV Calling
  - Added a new tool `SVCluster` that clusters structural variants based on coordinates, event type, and supporting algorithms. (#7541)
    - Primary use cases include:
      - Clustering SVs produced by multiple callers, based on interval overlap, breakpoint proximity, and sample overlap.
      - Merging multiple SV VCFs with disjoint sets of samples and/or variants.
      - Defragmentation of copy number variants produced with depth-based callers.
- Mutect2
- GATK Engine
  - Added a new read filter, `ExcessiveEndClippedReadFilter` (#7638)
    - This filter will keep reads that have fewer than the specified number of clipped bases on either end.
    - Designed with long reads in mind, and as a result has a default value of 1000.
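The filtering rule described above can be sketched on CIGAR strings. This is an illustration only, not the GATK filter: it assumes that only soft clips (`S`) count as clipped bases and that hard clips are ignored, neither of which is stated in these notes. The function names are hypothetical; the default of 1000 is taken from the notes.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_clips(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Assumption: hard clips are skipped, so soft clips become the outermost ops.
    core = [(length, op) for length, op in ops if op != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


def keep_read(cigar, max_clipped_bases=1000):
    """Keep a read only if both ends have fewer clipped bases than the limit."""
    leading, trailing = end_clips(cigar)
    return leading < max_clipped_bases and trailing < max_clipped_bases


print(keep_read("50S9000M20S"))   # True
print(keep_read("1500S9000M"))    # False
```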
Download release: gatk-4.2.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.4.1 release:
- Fix more newly discovered log4j2 vulnerabilities. Now that people are paying attention they are finding all sorts of things.
Full list of changes:
- Build System
  - Upgrade our build from Gradle 5.6 to the newest 7.3.2 (#7609)
    - This fixes some gradle bugs which were blocking development
- GenomicsDB
- Miscellaneous Changes
- Dependencies
Download release: gatk-4.2.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.4.0 release:
- Fix a major security bug due to log4j vulnerability. (CVE-2021-44228)
- Improvement to calculation of ExcessHet in joint genotyping. (GenotypeGVCFs, GnarlyGenotyper, ExcessHet).
Full list of changes:
- Funcotator
  - Aligned the Funcotator checkIfAlreadyAnnotated test with the Funcotator engine code. (#7555)
- GenotypeGVCFs / ExcessHet
  - Removed undocumented mid-p correction to p-values in exact test of Hardy-Weinberg equilibrium and updated corresponding tests. We now report the same value as ExcHet in bcftools. Note that previous values of 3.0103 (corresponding to mid-p values of 0.5) will now be 0.0000. (#7394)
  - Updated expected ExcessHet values in integration test resources and added an update toggle to GnarlyGenotyperIntegrationTest.
  - Updated ExcessHet documentation.
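The two ExcessHet values quoted above follow from Phred scaling of a p-value, i.e. -10 * log10(p). The short check below is a worked arithmetic sketch, not GATK code; reading the new 0.0000 as a p-value of 1.0 is an inference from the formula rather than something stated in the notes.

```python
import math


def phred_scaled(p_value):
    """Phred-scale a p-value: -10 * log10(p)."""
    # Adding 0.0 turns the IEEE negative zero produced for p = 1.0 into 0.0.
    return -10.0 * math.log10(p_value) + 0.0


print(f"{phred_scaled(0.5):.4f}")  # 3.0103, the old value for a mid-p of 0.5
print(f"{phred_scaled(1.0):.4f}")  # 0.0000, the value now reported
```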
- Miscellaneous Changes
- Documentation
- Dependencies
  - Updated log4j from version 2.13.1 to 2.16.0 to patch CVE-2021-44228 (#7605)
Download release: gatk-4.2.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.3.0 release:
- Notable bug fixes for `Mutect2` and `Funcotator`
- Support in `CombineGVCFs` and `GenotypeGVCFs` for "reblocked" GVCFs as produced by the `ReblockGVCF` tool. Reblocked GVCFs have a significantly reduced storage footprint.
- More control over the Smith-Waterman parameters in `HaplotypeCaller` and `Mutect2`
- A new Fragment Allele Depth (`FAD`) variant annotation similar to the `AD` annotation except that allele support is considered per read pair, not per individual read
- GenomicsDB bug fixes and enhancements
Full list of changes:
- HaplotypeCaller/Mutect2
  - Fixed a bug where `Mutect2` failed to filter germline variants with alternate representations (#7103)
    - This caused variants with alternative representations in gnomAD to not be recognized as being the same as called variants in some cases. As a result, some variants were called and left unfiltered when they should have been filtered as "germline".
  - Exposed Smith-Waterman parameters as tool arguments in `HaplotypeCaller`, `Mutect2`, and `FilterAlignmentArtifacts`. (#6885)
    - Enables use of alternative parameters for different event representation (e.g. three consecutive SNPs instead of two small indels)
  - Can now specify the Smith-Waterman implementation in `FilterAlignmentArtifacts` (#7105)
  - Added a `--debug-assembly-variants-out` diagnostic option to output a side VCF with variants detected by assembly for `HaplotypeCaller` and `Mutect2` (#7384)
  - `Mutect2`: the `--genotype-germline-sites` argument is no longer marked as experimental (#7533)
- GenotypeGVCFs / CombineGVCFs
  - Updated `CombineGVCFs` and `GenotypeGVCFs` to handle "reblocked" GVCFs with diploid data that are potentially missing hom-ref genotype PLs (#7223)
  - Homozygous reference genotypes with no PLs and zero depth are now output as no-calls by `GenotypeGVCFs` (#7471)
  - Bug fixes for `GenotypeGVCFs` / `GnarlyGenotyper` when allele-specific annotations have empty values due to lack of informative reads or no depth (#7491) (#7186)
- GenomicsDB
  - Added a new `--call-genotypes` GenomicsDB argument, enabling output of called genotypes (i.e. not ./.) when tools like `CombineGVCFs` and `SelectVariants` read from a GenomicsDB workspace (#7223)
  - Added a `--bypass-feature-reader` argument to `GenomicsDBImport` to allow the C-based htslib VCF reader implementation to be used instead of the Java implementation (#7393)
    - Using this option will reduce memory usage and potentially speed up the import process
  - Updated to GenomicsDB 1.4.2 (#7520)
    - This release fixes a commonly-encountered bookkeeping issue with GenomicsDB array fragments. Should fix errors of the type: "Error: Cannot read from buffer; Error: cannot load book-keeping" as reported in #7012
    - Full release notes are here: https://github.com/GenomicsDB/GenomicsDB/releases/tag/v1.4.2
- Funcotator
- CNV Calling
  - CNV WDLs now handle BAM/CRAM index paths explicitly, to support cases where the index is not in the same path as its file (#7518)
  - gCNV in the CASE mode now fills in all hidden DenoisingModelConfig and CopyNumberCallingConfig arguments from the input model configuration (#7464)
  - Exposed number of samples used for estimating denoised copy ratios in gCNV via a new `--num-samples-copy-ratio-approx` argument (#7450)
- SV Calling
  - `JointGermlineCNVSegmentation`: bug fixes and refactoring (#7243)
    - A number of bugs, particularly with max-clique clustering, have been fixed, as well as a parameter swap bug in `JointGermlineCNVSegmentation`
    - Reworks classes used by `JointGermlineCNVSegmentation` for SV clustering and defragmentation. The design of `SVClusterEngine` has been overhauled to enable the implementation of `CNVDefragmenter` and `BinnedCNVDefragmenter` subclasses. Logic for producing representative records from a collection of clustered SVs has been separated into an `SVCollapser` class, which provides enhanced functionality for handling genotypes for SVs more generally.
- Notable Enhancements
  - Added a new Fragment Allele Depth (`FAD`) variant annotation (#7511)
    - This annotation is identical to the `AD` annotation except that allele support is considered per read pair, not per individual read
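The difference between per-read and per-fragment counting can be shown with a toy example. This is a sketch under a simplifying assumption, not the GATK implementation: each read is represented as a hypothetical `(fragment_name, supported_allele)` tuple, and both mates of a pair are assumed to support the same allele. How disagreeing mates are resolved is not covered here.

```python
from collections import Counter


def allele_depth(reads):
    """AD-style counting: every read contributes one count."""
    return Counter(allele for _, allele in reads)


def fragment_allele_depth(reads):
    """FAD-style counting: each read pair (fragment) contributes one count."""
    # De-duplicating (fragment, allele) tuples collapses the two mates of a pair.
    return Counter(allele for _, allele in set(reads))


reads = [
    ("frag1", "A"), ("frag1", "A"),  # both mates overlap the site
    ("frag2", "T"),                  # only one mate overlaps the site
    ("frag3", "A"),
]

print(allele_depth(reads))           # A counted 3 times, T once
print(fragment_allele_depth(reads))  # A counted 2 times, T once
```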
- Miscellaneous Changes
- Documentation
- Dependencies
Download release: gatk-4.2.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.2.0 release:
- The `ReblockGVCF` tool is now out of beta with several important improvements. This tool can be used to postprocess `HaplotypeCaller` GVCFs to decrease filesize.
- `FilterMutectCalls` now has a `--microbial-mode` argument that sets filters to defaults appropriate for microbial calling
- Important bug fixes to `CalibrateDragstrModel` and `Funcotator`
Full list of changes:
- New Tools
  - `ShiftFasta`: create a fasta with the bases shifted by an offset (#6694)
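Shifting bases by an offset can be pictured as rotating a sequence. The sketch below is an assumption about what "shifted" means (a left rotation of a circular contig, so that the base at the offset becomes the first base); it is not taken from the `ShiftFasta` source, and the function name is hypothetical.

```python
def shift_sequence(bases, offset):
    """Rotate a circular sequence left so that bases[offset] comes first."""
    if not bases:
        return bases
    offset %= len(bases)
    return bases[offset:] + bases[:offset]


print(shift_sequence("ACGTTGCA", 3))  # TTGCAACG
```

No bases are gained or lost by the rotation, so the shifted contig has the same length as the original.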
- ReblockGVCF
  - `ReblockGVCF` is now out of beta (#7419)
  - Improved `ReblockGVCF` output to eliminate overlapping reference blocks and reference gaps following trimmed deletions (#7122)
  - Fixed bugs associated with input no-call genotypes and fixed an off-by-one error at contig starts (#7404)
  - Fixed an error on ref blocks with missing DPs (if `--floor-blocks` arg is not provided); fixed rare cases where spanning deletion (*) allele is incorrectly modified (#7400)
- Mutect2
  - `FilterMutectCalls`: added a `--microbial-mode` argument that sets filters to defaults appropriate for microbial calling (#6694)
- ValidateVariants
  - Added an optional argument to check for GVCF reference blocks overlapping variants or other reference blocks (#7405)
- DRAGEN-GATK
- Funcotator
  - Fixed an issue where the `Match_Norm_Seq_Allele1` and `Match_Norm_Seq_Allele2` fields were not being populated in MAF output (#7422)
- Mitochondrial pipeline
  - Removed calls to `FilterNuMTs` and `FilterLowHetSites`, which are no longer being used (#7325)
- CNV Calling
  - Fixed a bug resulting from prefix strings of less than 3 characters when creating temporary files in `GermlineCNVCaller` and improved documentation of corresponding utility methods. (#7411)
- Documentation
Download release: gatk-4.2.1.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.1.0 release:
- Several important fixes to HaplotypeCaller and the new DRAGEN-GATK code introduced in GATK 4.2.0.0
- Started laying the groundwork in `Mutect2` for `Mutect3`, which will be more machine learning focused
- `LocalAssembler`: a new tool that performs local assembly of small regions to discover structural variants (#6989)
- Support for multi-sample segmentation in `ModelSegments`
- Major speed improvements and several important fixes to `Funcotator`
- A new version of the Intel Genomics Kernel Library (GKL), with many important fixes and improvements
- A new version of GenomicsDB, with improved cloud support
- A GATK-wide option to shard VCFs on output, which is often useful for pipelining
- GATK support for block compressed interval (`.bci`) files, which is useful when working with extremely large interval lists
Full list of changes:
- New Tools
  - `LocalAssembler`: a new tool that performs local assembly of small regions to discover structural variants (#6989)
- HaplotypeCaller
  - Fixed a rare edge case in DRAGEN mode that could result in negative GQs when `USE_POSTERIOR_PROBABILITIES` is set (#7120)
  - Fixed a rare edge case (mainly affecting DRAGEN mode) that could cause the PL arrays to be deleted when genotyping in `HaplotypeCaller` (#7148)
  - Fixed a bug in the `AlleleLikelihoods` that could result in new evidence X being assigned arbitrary likelihoods left over from previous evidence (#7154)
  - Fixed a "Padded span must contain active span" error caused by invalid feature file intervals that weren't being checked for validity against the sequence dictionary (#7295)
  - Do not add the artificial haplotype read group to the bamout file when `--bam-writer-type NO_HAPLOTYPES` is specified (#7141)
  - Suppressed excessive log output related to `JumboAnnotation` warnings in `HaplotypeCaller` (#7358)
- DRAGEN-GATK
- Mutect2
  - Added a training data mode (`--training-data-mode`) to `Mutect2` to prepare for `Mutect3` (#7109)
    - Training data mode collects data on variant- and artifact-supporting read sets for fitting a deep learning filtering model
  - Better error bars for samples with small contamination in `CalculateContamination` (#7003)
- Funcotator
  - Greatly improved `Funcotator` performance by optimizing the VCF sanitization code (#7370)
    - In our tests, this change appears to speed up the tool by roughly 2x
  - Updated the Gencode GTF Codec to be more permissive with transcript and gene types (#7166)
    - Now the Gencode GTF Codec no longer restricts `transcriptType` and `geneType` to a limited set of values. These fields are now each stored as a String. This allows for arbitrary values in these fields and will help to future-proof (and species-proof) the GTF parser.
    - Fixes "IndexFeatureFile Error to Run Funcotator with Mouse Ensembl GTF" (#7054)
  - Now can decode codons containing IUPAC bases into amino acids. (#7188)
  - Updated the tool to allow for protein changes with N / IUPAC bases. (#6778)
    - Added the ability to have IUPAC bases in either the ref/alt alleles OR in the reference when calculating the amino acid sequence. In this case, the code will no longer throw a user exception, but will log a warning and will produce ? amino acids in the case that they cannot be decoded from the amino acid table. Currently this will happen any time an N or IUPAC base is in the region to be coded into amino acids.
    - Added AminoAcid.UNDECODABLE as a placeholder for any unknown / undecodable amino acid (such as in the case of an ambiguous IUPAC base).
  - `Funcotator` now checks whether the input has already been annotated, and by default throws an error in that case.
    - We also added a `--reannotate-vcf` override argument to explicitly allow reannotation (#7349)
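One way to decode codons that contain IUPAC ambiguity codes is to expand each code to the bases it can stand for and accept the result only when every expansion yields the same amino acid. The sketch below illustrates that idea with the standard genetic code; it is not Funcotator's implementation, and its decoding rules may differ from the tool's. The `?` placeholder mirrors the undecodable amino acid mentioned above; all names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
UNDECODABLE = "?"


def decode_codon(codon):
    """Return the amino acid letter, or '?' if the codon is ambiguous."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    choices = [IUPAC[base] for base in codon.upper()]
    amino_acids = {CODON_TABLE["".join(bases)] for bases in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else UNDECODABLE


print(decode_codon("GCN"))  # A: all four GC* codons encode alanine
print(decode_codon("AAN"))  # ?: AA* codons encode either N or K
```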
- CNV Calling
- SV Calling
  - Added `LocalAssembler`, a new tool that performs local assembly of small regions to discover structural variants (#6989)
- The Genomics Kernel Library (GKL)
  - Updated to GKL version 0.8.8, and removed the FPGA PairHMM as an option (#7203)
    - This is a significant update to the GKL that comes with many fixes and improvements:
      - Update ISAL and OTC Zlib libraries to latest version (Q1 2021)
      - Fixed 3 reproducible issues and retested out of 4 more in GKL
      - Updated build for Centos 7 and Current Mac.
      - Ran valgrind on limited C unit tests (passed)
      - Major improvements to input validation
      - Major updates to Error handling and propagation.
      - Added Negative space unit testing coverage
      - Regular Static Code Scanning
      - Good overall quality of life improvement for the software
- GenomicsDB
  - Moved to GenomicsDB 1.4.1, and added a toggle between the GCS Connector and native GCS support (#7224)
    - This release allows for the direct use of the native GCS C++ client instead of the GCS Cloud Connector via HDFS. The GCS Cloud Connector can still be used with GenomicsDB via the `--genomicsdb-use-gcs-hdfs-connector` option
    - Using the native client with GCS allows for GenomicsDB to use the standard paradigms to help with authentication, retries with exponential backoff, configuring credentials, etc., and also helps with performance issues with GCS. See #7070.
  - Allow specifying S3 and Azure blob storage URIs to GenomicsDB in addition to GCS and HDFS (#7271)
  - Fixes related to the GenomicsDB upgrade (#7257)
  - Improved the error message in `GenomicsDBImport` when failing to open a `FeatureReader` (#7375)
- Mitochondrial pipeline
  - Added median coverage metric to the mitochondrial pipeline (#7253)
- Notable Enhancements
  - Added a GATK-wide option (`--max-variants-per-shard`) to shard VCFs on output (#6959)
    - Sharded output is often extremely useful for pipelining
  - Added GATK support for block compressed interval (`.bci`) files (#7142)
  - Added an `AlleleDepthPseudoCounts` (DD) genotype annotation. (#7303)
    - Similar to AD, the new annotation (DD) captures the depth of each allele's supporting evidence or reads, however it does so by following a variational Bayes approach looking into the likelihoods rather than applying a fixed threshold. This turns out to be more robust in some instances.
    - To get the new non-standard annotation in `HaplotypeCaller` you need to add `-A AllelePseudoDepth`
  - We now track the source of variants in `MultiVariantWalkers`, which is important for some tools such as `VariantEval` (#7219)
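The idea behind sharding output by a maximum number of variants can be sketched as simple chunking. This is an illustration of the concept only: how GATK names shard files or handles headers and indexes is not described in these notes, and the function name is hypothetical.

```python
def shard_records(records, max_variants_per_shard):
    """Yield consecutive lists holding at most max_variants_per_shard records."""
    if max_variants_per_shard < 1:
        raise ValueError("max_variants_per_shard must be positive")
    shard = []
    for record in records:
        shard.append(record)
        if len(shard) == max_variants_per_shard:
            yield shard
            shard = []
    if shard:  # final, possibly smaller shard
        yield shard


variants = ["chr1:100", "chr1:250", "chr1:900", "chr2:40", "chr2:75"]
for index, shard in enumerate(shard_records(variants, 2)):
    print(index, shard)
```

Each shard can then be handed to a separate downstream task, which is what makes sharded output convenient for pipelining.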
- Bug Fixes
  - Fixed key ordering bugs in the implementations of `Histogram.median()` and `CompressedDataList.iterator()` (#7131)
    - These bugs could result in incorrect RankSumTest annotations in some cases
  - Fixed the `DepthPerSampleHC` and `StrandBiasBySample` annotations to not spam the logs with "Annotation will not be calculated" warnings (#7357)
  - `VariantEval`: fixed contig stratification to defer to user-defined intervals (#7238)
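To see why key ordering matters when taking a median from a histogram, consider the sketch below. It is a generic illustration, not the GATK `Histogram` class: the histogram is a hypothetical `{value: count}` dict, and the `sort_keys` switch exists only to reproduce the kind of error that unordered keys can cause.

```python
def histogram_median(counts, sort_keys=True):
    """Lower median of values stored as {value: count}."""
    total = sum(counts.values())
    midpoint = (total + 1) // 2  # 1-based rank of the lower median
    keys = sorted(counts) if sort_keys else list(counts)
    seen = 0
    for value in keys:
        seen += counts[value]
        if seen >= midpoint:
            return value
    raise ValueError("empty histogram")


# Values 10, 10, 20, 30, 30 stored with keys out of order.
counts = {30: 2, 10: 2, 20: 1}

print(histogram_median(counts))                   # 20, the true median
print(histogram_median(counts, sort_keys=False))  # 10, wrong
```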
- Miscellaneous Changes
  - The `ProgressMeter` can now be completely disabled for all tools / traversals by overriding `GATKTool.disableProgressMeter()` (#7354)
  - We now authenticate with Dockerhub in our Travis builds, to help avoid tests failing due to quota issues (#7204) (#7256)
  - Migrated `VariantEval` to be a `MultiVariantWalkerGroupedOnStart` (#6973)
  - `VariantEval`: added an argument to specify the `PedigreeValidationType` (#7240)
  - Converted `InfoFieldAnnotation` / `GenotypeAnnotation` into interfaces. (#7041)
  - Allow `MultiVariantWalkerGroupedOnStart` subclasses to view/set `ignoreIntervalsOutsideStart` (#7301)
  - `PedigreeAnnotation`: consolidate code, provide getters, and allow `PedigreeValidationType` to be set (#7277)
  - `ASEReadCounter`: added a warning for variants lacking GT fields (#7326)
  - Added filters to `dockstore.yml` so that only the master branch and the releases get synced to Dockstore (#7217)
  - Fixed a compatibility issue between Java 11 and `log4j2` (#7339)
  - We now update the gcloud package signing key at the start of every docker build (#7180)
  - Updated our Artifactory key (#7208)
  - Disabled some Spark dataproc tests because of dependency issues. (#7170)
  - Removed some embedded licenses from scripts (#7340)
- Documentation
  - Variant annotation documentation: removed broken links to related annotations from the tool docs (#7307)
  - Updated the link to an article on Jexl expressions (#7317)
  - Fixed several broken links in docs for the CNV tools (#7309)
  - Fixed broken links in the docs for `Funcotator`, `VariantRecalibrator`, and `ASEReadCounter` (#7270)
  - Fixed typos in the tool documentation for `HaplotypeCaller` and `LeftAlignAndTrimVariants` (#6440)
  - Clarify pipeline inputs in documentation for `GnarlyGenotyper` (#7231)
- Dependencies
Download release: gatk-4.2.0.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.2.0.0 release:
- We've worked closely with Illumina to port a number of significant innovations for germline short variant calling from their DRAGEN pipeline to GATK. These improvements will form the basis of the upcoming open-source implementation of the DRAGEN pipeline which we're calling DRAGEN-GATK
- A number of other fixes and improvements to `HaplotypeCaller` to improve the phasing of variant calls and to fix edge cases with indels and spanning deletions
- A new pipeline for gCNV exome joint calling
Full list of changes:
- DRAGEN-GATK
  - With this release we've worked closely with Illumina to make improvements to the GATK `HaplotypeCaller` to allow it to output germline short variant calls that are functionally equivalent to the calls made by their DRAGEN 3.4.12 pipeline. See our blog post on DRAGEN-GATK for more details on these improvements. A full DRAGEN-GATK pipeline that leverages these new features will be released in the near future as a WDL workflow script in the WARP repo on GitHub as well as a featured workspace in Terra.
  - Below is a summary of the improvements we've ported from DRAGEN in this release. We recommend that most users wait until the complete DRAGEN-GATK pipeline is released as a WDL workflow before evaluating these features, though advanced users comfortable with building their own pipelines are welcome to try them out now:
    - DragSTR: a port of DRAGEN's model for STRs (Short Tandem Repeats) that adjusts HMM indel priors based on empirical reference contexts for better indel calling.
      - Using DragSTR involves running two new tools prior to the `HaplotypeCaller`:
        - `ComposeSTRTableFile`: scans a reference for STR sites and outputs a table file with a subsample of the available STR sites across the genome.
        - `CalibrateDragstrModel`: given the STR table for a reference produced by `ComposeSTRTableFile` and the reads for a specific sample, generates a model for potential sequencing errors for STR sites of various sizes for that sample.
      - After running these tools, you then run `HaplotypeCaller` with the `--dragstr-params-path` argument to pass it the DragSTR model generated by `CalibrateDragstrModel`.
    - BQD (Base Quality Dropout) and FRD (Foreign Read Detection): two new genotyper error models ported from DRAGEN
      - The Base Quality Dropout (BQD) model penalizes variants with low average base quality scores and high average sequencing cycle counts among genotyped reads and reads that were otherwise excluded from the genotyper to model read-context dependent sequencing errors.
      - The Foreign Read Detection (FRD) model uses an adjusted mapping quality score as well as read strandedness information to penalize reads that are likely to have originated from somewhere else on the genome or from contamination.
      - To activate the BQD and FRD models, run `HaplotypeCaller` with the `--dragen-mode` argument.
    - Added a new variant QUAL score model that reports the variant QUAL score as the posterior of the reference genotype based on the sample-dependent DRAGEN STR and flat SNP priors.
- HaplotypeCaller
  - We now add physical phasing information (PGT/PID/PS attributes) to genotypes with spanning deletion alleles (#6937)
  - Fixed two phasing bugs (#7019)
  - Fixed quality score calculation for sites with spanning deletions (#6859)
    - This fixes a bug in the AlleleFrequencyCalculator that was causing quality to be overestimated for sites with * alleles representing spanning deletions.
  - Added the ability for indels to be recovered from dangling heads in the assembly graph, and a new `--num-matching-bases-in-dangling-end-to-recover` argument for filtering dangling ends (#6113) (#7086)
  - Improved handling of indels/spanning deletions in the cigar base quality adjustment code. (#6886)
    - This aims to better handle the edge cases that come up when mates have mismatching numbers of bases at the start or end of the reads relative to each other.
  - Fixed a bug where overlapping reads in subsequent assembly regions could have invalid base qualities (#6943)
  - Convert non-ACGT IUPAC bases to N in HaplotypeCaller prior to assembly to prevent a crash (#6868)
  - Renamed the `--mapping-quality-threshold` argument to `--mapping-quality-threshold-for-genotyping`, and updated its documentation to be less confusing (#7036)
  - Added an option for `HaplotypeCaller` and `Mutect2` to produce a bamout without artificial haplotypes (#6991)
  - Updated the `--debug-graph-transformations` argument to emit the assembly graph both before and after chain pruning (#7049)
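The conversion of non-ACGT IUPAC bases to N mentioned above amounts to a simple masking step. The sketch below shows the idea in isolation; it is not the HaplotypeCaller code, and uppercasing the input first is an assumption made for the example.

```python
import re

NON_ACGT = re.compile(r"[^ACGT]")


def mask_non_acgt(bases):
    """Replace every base other than A, C, G, or T with N."""
    # Assumption: input is uppercased first, so lowercase bases are kept as-is.
    return NON_ACGT.sub("N", bases.upper())


print(mask_non_acgt("ACRGTYN"))  # ACNGTNN
```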
- Mutect2
  - Fixed the `--dont-use-soft-clipped-bases` argument in `Mutect2` to actually work as intended (#6823)
    - Due to a bug, this option did nothing because a copy of the original reads was modified. By deleting the unnecessary mapping quality filtering (this is totally redundant with the M2 read filter), we finalize (and thereby discard soft clips if requested) an assembly region made from the original reads, not a copy.
  - Fixed a bug in the `Mutect2` engine active region code that could affect the ability to call tumor alts when the normal has a different alt at the same site (#6908)
  - Removed an obsolete cram to bam conversion step in the `Mutect2` WDL (#6970)
  - Updated the `Mutect2` whitepaper in `docs/mutect/mutect.pdf` to accurately reflect current filter names, and updated the section on `FilterAlignmentArtifacts` (#6967)
-
CNV Calling
- A new pipeline for gCNV exome joint calling (#6554)
- Added a new tool (JointGermlineCNVSegmentation) and associated workflow (scripts/cnv_wdl/germline/joint_call_exome_cnvs.wdl) to combine gCNV segments and calls across samples
- JointGermlineCNVSegmentation segments and genotypes CNV calls from the germline CNV pipeline jointly across multiple samples.
- The workflow in scripts/cnv_wdl/germline/joint_call_exome_cnvs.wdl produces a joint, multi-sample genotyped VCF.
- For whole genomes, we recommend calling CNVs as part of a full SV callset with https://github.com/broadinstitute/gatk-sv (soon to be added to Terra)
- GermlineCNVCaller now restarts inference once with a new random seed when inference diverges. Also added a new entry point to PythonScriptExecutor that returns ProcessOutput. (#6866)
- This is intended to alleviate transient issues with GermlineCNVCaller inference in which the ELBO converges to a NaN value, by calling the python gCNV code with an updated random seed input.
- CreateReadCountPanelOfNormals: fixed a bug in the logic for filtering zero-coverage samples and intervals (#6624)
- FilterIntervals: fixed a bug in the tool logic when filtering on annotations and -XL is used to exclude intervals (#7046)
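The restart-once behaviour described for GermlineCNVCaller (#6866) can be sketched roughly as follows. This is an illustration of the retry pattern only, not the GATK or gCNV code: `run_with_one_restart`, the `run_inference` callable, and the way the new seed is drawn are all hypothetical.

```python
import math
import random
from typing import Callable, Tuple


def run_with_one_restart(run_inference: Callable[[int], float], seed: int) -> Tuple[float, int]:
    """Run inference; if the final ELBO is NaN, retry exactly once with a new seed.

    `run_inference` is a hypothetical callable that takes a seed and returns the ELBO.
    Returns the ELBO together with the seed that produced it.
    """
    elbo = run_inference(seed)
    if not math.isnan(elbo):
        return elbo, seed

    # Draw a different seed for the single retry (the seeding scheme is an assumption).
    rng = random.Random(seed)
    new_seed = rng.randrange(2**31)
    while new_seed == seed:
        new_seed = rng.randrange(2**31)

    elbo = run_inference(new_seed)
    if math.isnan(elbo):
        raise RuntimeError("inference diverged again after one restart")
    return elbo, new_seed
```

Limiting the retry to a single attempt keeps a persistently divergent model from looping forever while still recovering from transient failures.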
-
SV Calling
- PrintSVEvidence: a new tool that prints any of the Structural Variation evidence file types: read count (RD), discordant pair (PE), split-read (SR), or B-allele frequency (BAF) (#7026)
- This tool is used frequently in the GATK-SV pipeline for retrieving subsets of evidence records from a bucket over specific intervals. Evidence file formats comply with the current specifications in the existing GATK-SV pipeline.
-
GenomicsDB
- Introduced a new feature for GenomicsDBImport that allows merging multiple contigs into fewer GenomicsDB partitions (#6681)
- Controlled via the new --merge-contigs-into-num-partitions argument to GenomicsDBImport
- This should produce a huge performance boost in cases where users have a very large number of contigs. Prior to this change, GenomicsDB would create a separate folder/partition for each contig, which slowed down import to a crawl when there were many contigs.
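To give a feel for what merging contigs into fewer partitions means, here is a small Python sketch that groups an ordered contig list into a requested number of contiguous partitions of roughly equal total length. The grouping rule and the function name are assumptions for illustration; the release notes do not describe how GenomicsDBImport actually assigns contigs to partitions.

```python
from typing import List, Sequence, Tuple


def merge_contigs_into_partitions(
    contigs: Sequence[Tuple[str, int]], num_partitions: int
) -> List[List[str]]:
    """Group (name, length) contigs, in order, into at most `num_partitions` groups.

    Hypothetical balancing rule: close a group once the cumulative length reaches
    that group's proportional share of the total length.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    num_partitions = min(num_partitions, len(contigs))
    total = sum(length for _, length in contigs)

    partitions: List[List[str]] = []
    current: List[str] = []
    running = 0
    for i, (name, length) in enumerate(contigs):
        current.append(name)
        running += length
        contigs_left = len(contigs) - i - 1
        partitions_left = num_partitions - len(partitions) - 1
        reached_share = running * num_partitions >= total * (len(partitions) + 1)
        # Also close the group when every remaining contig is needed to fill a group.
        if partitions_left > 0 and (reached_share or contigs_left == partitions_left):
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    return partitions
```

With thousands of small contigs, this kind of grouping turns thousands of per-contig folders into a handful of partitions, which is the situation the new argument targets.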
-
Funcotator
- Added sorting by strand order for transcript subcomponents (#7065)
- This fixes an issue where the coding sequence, protein prediction, and other annotations could be incorrect for the hg19 version of Gencode, due to the individual elements of each transcript appearing in numerical order, rather than the order in which they appear in the transcript at transcription time.
- Updated the Funcotator tutorial link in the tool documentation. (#6920) (#6925)
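The difference between numerical order and strand order (#7065) can be shown with a short Python sketch. It assumes each transcript subcomponent is a (start, end) pair and that the strand is given as "+" or "-"; these representations and the function name are illustrative, not Funcotator's own data model.

```python
from typing import List, Tuple


def sort_by_strand_order(features: List[Tuple[int, int]], strand: str) -> List[Tuple[int, int]]:
    """Order (start, end) features as they are encountered during transcription.

    On the "+" strand that is ascending genomic position; on the "-" strand the
    transcript is read in the opposite direction, so the order is descending.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return sorted(features, key=lambda f: f[0], reverse=(strand == "-"))


if __name__ == "__main__":
    exons = [(300, 400), (100, 200), (500, 600)]
    print(sort_by_strand_order(exons, "+"))
    print(sort_by_strand_order(exons, "-"))
```

For a minus-strand transcript, concatenating exons in ascending numerical order would assemble the coding sequence in the wrong order, which is the kind of error the fix addresses.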
-
Mitochondrial pipeline
-
Notable Enhancements
-
Bug Fixes
- Fixed a ClosedChannelException error when doing multiple queries on remote CRAM files, and added a test to verify proper stream management (#7066)
- SelectVariants: Fixed an issue where SelectVariants could generate duplicate VCF header lines in some circumstances, resulting in an invalid VCF (#7069)
- VariantAnnotator: fixed a NullPointerException by adding a validation check that all samples in the input bam are present in the provided vcf before running (#6944)
- SplitNCigarReads: fixed an error where the read mate key was not sufficiently strict about read names, causing cigar errors (#6909)
- CalculateGenotypePosteriors: ensure that resources have the same sequence dictionary as the input VCF (#6430)
- MarkDuplicatesSpark: fixed a NullPointerException when a null ReadNameRegex was provided (#7002)
- GnarlyGenotyper: bugfix for the QUALapprox calculation, tolerate missing VarDP, and support AS_QUALapprox if QUALapprox is missing (#7061)
- Fixed the GATK version number in the docker image when doing releases to not end in "-SNAPSHOT" (#6883)
-
Miscellaneous Changes
- Switched GATK to the Apache 2.0 license (#7079)
- We now print the current Spark version on GATK startup (#7028)
- Added a log warning message when the total size of the PL arrays for a variant will likely exceed 100,000 (#6334)
- Added a script to publish GATK tool WDLs for each release (#6980)
- Migrated the GATKPath base class to HtsPath (#6763)
- Migrated additional tools to GATKPath (#6718)
- Made BaseUtils.convertIUPACtoN() and BaseUtils.simpleBaseToBaseIndex() methods more robust to handle all possible byte values (#7010)
- Enabled CARROT integration for triggering test runs from PR comments (#6917) (#6986)
- Added loci information to several annotation warnings (#6891)
- VariantRecalibrator: added locus information to a ref allele mismatch error message (#6964)
- ReferenceConfidenceVariantContextMerger: corrected AS annotation warning message to use GATK4 annotation names (#6985)
- Made the CNNScoreVariants task in cnn_variant_wdl/cnn_variant_common_tasks.wdl robust to the reads and index being in different locations. (#6900)
- Updated gcloud docker commands in build_docker.sh (#7078)
- Added version number to the dockstore yml file (#6905)
- Switched travis gcloud installation to use noninteractive mode (#6974)
- Deleted the obsolete tool FixCallSetSampleOrdering (#7022)
- Echo the log file after a failed travis run. (#7020)
- Temporarily disable the PairHMMUnitTest on Java 11. (#7044)
- Pin our h5py version to 2.10.0. (#6955)
-
Documentation
- Added a link to the new gatk-tool-wdls repository to the README (#6982)
- Updated JEXL documentation website link in SelectVariants and VariantFiltration (#7029)
- Updated the ApplyVQSR docs to consistently use the GATK4 tool name: ApplyRecalibration -> ApplyVQSR
- Modified the README to reflect the current download size for Git LFS files (#6933)
- Fixed a typo in the conda environment YML documentation. (#6935)
- Removed reference to -Dtest.single from the README (#6914)
- Fixed a typo in a javadoc comment in HaplotypeCallerEngine (#7033)
-
Dependencies
Download release: gatk-4.1.9.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
Highlights of the 4.1.9.0 release:
-
A major update to Funcotator, bringing in the latest Gencode release, fixing compatibility issues with dbSNP, and more!
-
Two new tools, GeneExpressionEvaluation and ReferenceBlockConcordance
-
Significant performance improvements to DepthOfCoverage and SelectVariants
-
Some important bug fixes:
- Fixed a bug in HaplotypeCaller and Mutect2 where we were losing insertion events that immediately followed a deletion
- A fix for the "CreateSomaticPanelOfNormals output PoN has much less variants in 4.1.8.0 than before" issue reported in #6744
- A fix for a frequently-encountered NullPointerException in the AS_StrandBiasTest annotation when running CombineGVCFs reported in #6766
Full list of changes:
-
New Tools
-
GeneExpressionEvaluation: a tool for evaluating gene expression from RNA-seq reads aligned to whole genome (#6602)
- This tool counts fragments to evaluate gene expression from RNA-seq reads aligned to the genome. Features to evaluate expression over are defined in an input annotation file in gff3 format. Output is a tsv listing sense and antisense expression for all stranded grouping features, and expression (labeled as sense) for all unstranded grouping features.
-
ReferenceBlockConcordance: a new tool to evaluate concordance of reference blocks in GVCF files (#6802)
- This tool compares the reference blocks of two GVCF files against each other and produces three histograms:
- Truth block histogram: Indicates the number of occurrences of reference blocks with a given confidence score and length in the truth GVCF
- Eval block histogram: Indicates the number of occurrences of reference blocks with a given confidence score and length in the eval GVCF
- Confidence concordance histogram: Reflects the confidence scores of bases in reference blocks in the truth and eval GVCFs, respectively. An entry of 10 at bin "80,90" means that there are 10 bases which simultaneously have a reference confidence of 80 in the truth GVCF and a reference confidence of 90 in the eval GVCF.
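The confidence concordance histogram described for ReferenceBlockConcordance can be sketched in a few lines of Python. This is only an illustration of the counting idea: reference blocks are assumed to be (start, end inclusive, confidence) tuples on a single contig, and the function names are hypothetical rather than part of the tool.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical block representation: (start, end inclusive, reference confidence)
Block = Tuple[int, int, int]


def per_base_confidence(blocks: Iterable[Block]) -> Dict[int, int]:
    """Expand reference blocks into a position -> confidence mapping."""
    confidence: Dict[int, int] = {}
    for start, end, gq in blocks:
        for pos in range(start, end + 1):
            confidence[pos] = gq
    return confidence


def confidence_concordance(truth_blocks: Iterable[Block], eval_blocks: Iterable[Block]) -> Counter:
    """Count bases per (truth confidence, eval confidence) pair.

    Only positions covered by a reference block in both inputs are counted.
    """
    truth = per_base_confidence(truth_blocks)
    evaluated = per_base_confidence(eval_blocks)
    shared = truth.keys() & evaluated.keys()
    return Counter((truth[pos], evaluated[pos]) for pos in shared)


if __name__ == "__main__":
    histogram = confidence_concordance([(1, 10, 80)], [(1, 10, 90)])
    print(histogram)  # Counter({(80, 90): 10})
```

In this toy input, the ten shared bases have confidence 80 in the truth and 90 in the eval, giving the entry of 10 at bin "80,90" used as the example above.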
-
HaplotypeCaller/Mutect2
- Fixed a bug in HaplotypeCaller and Mutect2 where we were losing insertion events that immediately followed a deletion (#6696)
- Added a workaround for an issue with multiallelics in the CreateSomaticPanelOfNormals pipeline (#6871)
- This fixes the "CreateSomaticPanelOfNormals output PoN has much less variants in 4.1.8.0 than before" issue reported in #6744
- Made improvements to the Mutect2 active region detection code that resulted in recovering some low-AF calls that we were missing (#6821)
- Made the HaplotypeCaller/Mutect2 adaptive pruner smarter in complex graphs, resulting in modest improvements to indel sensitivity when using the adaptive pruning option (#6520)
- Fixed a bug in variation event detection code that could sometimes lead to mistreating indel assembly windows as SNP assembly windows (#6661)
- Fixed a bug in FragmentUtils where insertion quals were used instead of deletion quals when adjusting base qualities for two overlapping reads from the same fragment (#6815)
- Fixed a concurrent modification exception error for local runs of HaplotypeCallerSpark (#6741)
- Marked the --linked-de-bruijn-graph argument as Advanced rather than Hidden (#6737)
- Made a small tweak to Mutect2's callable sites count (#6791)
- Added a "requester pays" option to Mutect2 WDL tasks that access bams for use with Google Cloud "requester pays" buckets (#6879)
-
Funcotator
- A major set of updates to Funcotator (#6660)
- Updated to the latest Gencode release
- Fixed the contig naming compatibility issue with dbSNP reported in #6564 ("hg38 dbSNP has incorrect contig names")
- Now both hg19 and hg38 have the contig names translated to "chr__"
- Added 'lncRNA' to GeneTranscriptType.
- Added "TAGENE" gene tag.
- Added the MANE_SELECT tag to FeatureTag.
- Added the STOP_CODON_READTHROUGH tag to FeatureTag.
- Updated the GTF versions that are parseable.
- Fixed a parsing error with new versions of gencode and the remap positions (for liftover files).
- Added test for indexing new lifted over gencode GTF.
- Added Gencode_34 entries to MAF output map.
- Pointed data source downloader at new data sources URL.
- Minor updates to workflows to point at new data sources.
- Updated retrieval scripts for dbSNP and Gencode.
- Added required field to gencode config file generation.
- Now gencode retrieval script enforces double hash comments at top of gencode GTF files.
- Fixed an erroneous trailing tab in MAF file output reported in #6693
- Added a maximum version number for data sources in Funcotator (#6807)
- Added a "requester pays" option to the Funcotator WDL for use with Google Cloud "requester pays" buckets (#6874)
- FuncotateSegments: fixed an issue with the default value of --alias-to-key-mapping being set to an immutable value (#6700)
-
GenomicsDB
- Updated to GenomicsDB Version 1.3.2, which brings better propagation of error messages from the GenomicsDB library (#6852)
- Using the GATK option GATK_STACKTRACE_ON_USER_EXCEPTION will now also output a limited C/C++ stacktrace
-
CNV Tools
- Fixed a bug in the KernelSegmenter: the minimal data to calculate the segmentation cost should be 2 * windowSize, rather than windowSize (#6835)
- Germline CNV WDL improvements for WGS (#6607)
- Modified gCNV WDLs to improve Cromwell performance when running on a large number of intervals, as in WGS
- Added optional disabled_read_filters input to CollectCounts
- Enabled GCS streaming for CollectCounts and CollectAllelicCounts
- Added a "requester pays" option to the germline and somatic CNV WDLs for use with Google Cloud "requester pays" buckets (#6870)
-
Mitochondrial Pipeline
-
Notable Enhancements
- Significantly improved the performance of DepthOfCoverage by removing slow string formatting calls (#6740)
- In a test run with default arguments locally the runtime for a WGS full chr15 drops from ~8.9 minutes to ~4.7 minutes after this patch
- Significantly improved the performance of SelectVariants with large numbers of samples by changing an operation to scale linearly instead of quadratically with the number of samples (#6729)
- On one example with several thousand samples there was a speed up from ~5 minutes to 0.1 minutes
- WDL generation: made several improvements to automatic WDL generation, annotated additional tools for WDL generation, and added a section to the README with instructions on generating WDLs for GATK tools (#6800)
- Added a suite of utility methods for working with Google BigQuery: BigQueryUtils (#6759) (#6861)
- The GATK docker image can now be built with a simple docker build . command (no extra arguments needed) (#6764) (#6842) (#6782)
- Added a Dockstore yml file with workflow descriptions for the WDLs in the GATK repo, to facilitate automatic publication to Dockstore (#6770)
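The SelectVariants note above says an operation went from quadratic to linear in the number of samples, without naming the operation. As a generic illustration of how that kind of change typically looks (an assumption, not the actual patch in #6729), the Python sketch below contrasts a list-membership test inside a loop with a set lookup; both function names are hypothetical.

```python
from typing import List


def select_samples_quadratic(all_samples: List[str], wanted: List[str]) -> List[str]:
    """Each `in` on a list scans the list, so the total cost grows as n * m."""
    return [s for s in all_samples if s in wanted]


def select_samples_linear(all_samples: List[str], wanted: List[str]) -> List[str]:
    """Build a set once; each lookup is then constant time on average."""
    wanted_set = set(wanted)
    return [s for s in all_samples if s in wanted_set]


if __name__ == "__main__":
    samples = ["sample%d" % i for i in range(5000)]
    subset = samples[::2]
    # Same result either way; only the scaling with sample count differs.
    assert select_samples_quadratic(samples, subset) == select_samples_linear(samples, subset)
```

With several thousand samples the quadratic version performs millions of string comparisons, which is consistent with the scale of speed-up reported above, though the real cause in GATK may differ.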
-
Bug Fixes
- Fixed a NullPointerException in the AS_StrandBiasTest annotation reported in #6766 (#6847)
- Fixed a bug with soft clips in LeftAlignIndels (#6792)
- VariantRecalibrator: uniquify annotations to fix the error reported in #2221 (#6723)
- Fixed an issue where ContextCovariate in BaseRecalibrator mistakenly assumed that all non-ACGT bases in the read are N (#6625)
- Fixed a crash in CountBasesSpark when using the -L option (#6767)
-
Miscellaneous Changes
- Significant refactoring of the SV discovery classes (#6652)
- FilterVariantTranches: report more info when the ref alleles don't match (#6723)
- We now report the target url in exceptions thrown by HtsgetReader (#6799)
- Added more information to error messages in AssemblyRegion for contigs not in the reference dictionary (#6781)
- Improved an error message in GATKRead.setMatePosition() (#6779)
- Updated the Barclay WDL template for compatibility with the Debian distribution (#6841)
- Temporarily disabled HtsgetReader tests to work around issues caused by a server-side upgrade. (#6804)
- Re-enabled an IndexFeatureFile test for uncompressed BCF. (#6716)
-
Documentation
- Marked LearnReadOrientationModel as a DocumentedFeature (#6726)
- Added a gentle warning about loss of True Positives with the default FilterIntervals params (#6751)
- Updated the README to mention that the conda environment is not officially supported on macOS at this time. (#6788)
- Fixed a typo in the example command for SplitIntervals (#6869)
- Fixed a typo in the --tmp-dir argument in the GenomicsDBImport docs (#6785)
- Fixed a typo in the --tmp-dir argument in the GenotypeGVCFs docs (#6784)
- Removed outdated argument references from the DepthOfCoverage documentation. (#6810)
- Fixed a typo with "-genelist" argument to "-gene-list" in the DepthOfCoverage documentation. (#6880)
- Fixed a typo in the docs for the Mutect2 --pcr-indel-qual argument (#6840)
-
Dependencies