@droazen droazen released this Jun 26, 2020 · 4 commits to master since this release

Download release: gatk-4.1.8.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.8.0 release:

  • A major new release of GenomicsDB (1.3.0), with enhanced support for shared filesystems such as NFS and Lustre, support for MNVs, and better compression leading to a roughly 50% reduction in workspace size in our tests. This also includes a fix for an error in GenotypeGVCFs that several users were encountering when reading from GenomicsDB.

  • A major overhaul of the PathSeq microbial detection pipeline containing many improvements

  • Initial/prototype support for reading from HTSGET services in GATK

    • Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
  • Fixes for a couple of frequently-reported errors in HaplotypeCaller and Mutect2 (#6586 and #6516)

  • Significant updates to our Python/R library dependencies and Docker image

Full list of changes:

  • New Tools

    • HtsgetReader: an experimental tool to localize files from an HTSGET service (#6611)
      • Over the next several releases, we intend for HTSGET support to propagate to more tools in the GATK
    • ReadAnonymizer: a tool to anonymize reads with information from the reference (#6653)
      • This tool is useful in the case where you want to use data for analysis, but cannot publish the data without anonymizing the sequence information.
  • HaplotypeCaller/Mutect2

    • Fixed an "evidence provided is not in sample" error in HaplotypeCaller when performing contamination downsampling (#6593)
      • This fixes the issue reported in #6586
    • Fixed a "String index out of range" error in the TandemRepeat annotation with HaplotypeCaller and Mutect2 (#6583)
      • This addresses an edge case reported in #6516 where an alt haplotype starts with an indel, and hence the variant start is one base before the assembly region due to padding a leading matching base
    • Better documentation for FilterAlignmentArtifacts (#6638)
    • Updated the CreateSomaticPanelOfNormals documentation (#6584)
    • Improved the tests for NuMTFilterTool (#6569)
  • PathSeq

    • Major overhaul of the PathSeq WDLs (#6536)
      • This new PathSeq WDL redesigns the workflow for improved performance in the cloud.
      • Downsampling can be applied to BAMs with high microbial content (i.e., >10M reads) that normally cause performance issues.
      • Removed microbial fasta input, as only the sequence dictionary is needed.
      • Broke the pipeline down into smaller tasks. This helps reduce costs by a) provisioning fewer resources at the filter and score phases of the pipeline and b) reducing job wall time to minimize the likelihood of VM preemption.
      • Filter-only option, which can be used to cheaply estimate the number of microbial reads in the sample.
      • Metrics are now parsed so they can be fed as output to the Terra data model.
      • CRAM-to-BAM capability
      • Updated WDL readme
      • Deleted unneeded WDL json configuration, as the configuration can be provided in Terra
    • Added an --ignore-alignment-contigs argument to PathSeq filtering that lets users specify any contigs that should be ignored. (#6537)
      • This is useful for BAMs aligned to hg38, which contains the Epstein-Barr virus (chrEBV)
  • GenomicsDB

    • Upgraded to GenomicsDB version 1.3.0 (#6654)
      • Added a new argument --genomicsdb-shared-posixfs-optimizations to help with shared POSIX filesystems like NFS and Lustre. This disables file locking and, for GenomicsDB import, minimizes writes to disk. On some of the GATK datasets, the import of about 10 samples on NFS went from 23.72m to 6.34m, which was comparable to importing to a local filesystem. We hope this helps with issues #6487 and #6627. It also fixes issue #6519.
      • This version of GenomicsDB also uses pre-compression filters for offset and compression files for new workspaces and genomicsdb arrays. The total size of a GenomicsDB workspace for the same dataset and 10 samples went from 313MB to 170MB, with no change in import and query times. Smaller GenomicsDB arrays also help with performance on distributed and cloud file systems.
      • This version has added support to handle MNVs similar to deletions as described in Issue #6500.
      • There is added support in GenomicsDBImport to have multiple contigs in the same GenomicsDB partition/array. This will hopefully help import times in cases where users have many thousands of contigs. Changes are still needed from the GATK side to make use of this support.
      • Logging has been improved somewhat, with the native C/C++ code using spdlog and fmt, and the Java layer using Apache log4j with the log4j.properties provided by the application. Also, info messages like "No valid combination operation found for INFO field AA - the field will NOT be part of INFO fields in the generated VCF records" will only be output once for the operation.
    • Made VCFCodec the default for query streams from GenomicsDB (#6675)
      • This fixes the frequently-reported NullPointerException in GenotypeGVCFs when reading from GenomicsDB (see #6667)
      • Added a --genomicsdb-use-bcf-codec argument to opt back in to using the BCFCodec, which is faster but prone to the above error on certain datasets
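
As a quick check on the GenomicsDB 1.3.0 figures quoted above, the relative changes can be computed directly. This is only an arithmetic sketch using the numbers from these notes; the helper names are illustrative, and the timings are compared as a ratio so the unit behind the "m" suffix does not matter.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times faster `after` is compared with `before`."""
    return before / after


# Figures taken from the release notes above.
workspace_reduction = percent_reduction(313.0, 170.0)  # workspace size in MB
nfs_import_speedup = speedup(23.72, 6.34)              # import time on NFS

print(f"workspace size reduction: {workspace_reduction:.1f}%")
print(f"NFS import speedup: {nfs_import_speedup:.2f}x")
```

The workspace reduction works out to a little under 46%, consistent with the "roughly 50%" quoted in the highlights, and the NFS import is a bit under four times faster.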
  • CNV Tools

    • DetermineGermlineContigPloidy can now process interval lists with a single contig (#6613)
    • FilterIntervals now filters out any singleton intervals (#6559)
    • Fixed an inaccurate error message in SVDDenoisingUtils (#6608)
  • Docker/Conda Overhaul (#5026)

    • Our docker image is now built off of Ubuntu 18.04 instead of 16.04
      • This brings in newer versions of several important packages such as samtools
    • Updated many of the Python libraries installed via our conda environment and included in our Docker image to newer versions, resolving several outstanding issues in the process
    • R dependencies are now installed via conda in our Docker build instead of the now-removed install_R_packages.R script
      • Due to this change, we recommend that tools that use R packages (e.g., to create plots) should now be run using the GATK docker image or the conda environment.
    • NOTE: significant updates and changes to the Ubuntu version, native packages, and R/python packages may result in corresponding numerical changes in results.
  • Mitochondrial Pipeline

    • Minor updates to the mitochondrial pipeline WDLs (#6597)
  • Notable Enhancements

    • RevertSamSpark now supports CRAMs (#6641)
    • Fixed a VariantAnnotator performance issue that could cause the tool to run very slowly on certain inputs (#6672)
    • More flexible matching of dbSNP variants during variant annotation (#6626)
      • Adds all dbSNP IDs that match a particular variant to the variant's ID, instead of just the first one found in the dbSNP VCF.
      • Is less brittle to variant normalization issues, and matches differing representations of the same underlying variant. This is implemented by splitting and trimming multiallelics before checking for a match; multiallelic representations are suspected to be the predominant cause of these matching failures.
    • Added a --min-num-bases-for-segment-funcotation argument to FuncotateSegments (#6577)
      • This will allow for segments of length less than 150 bases to be annotated if given at run time (defaults to 150 bases to preserve the previous behavior).
    • SplitIntervals can now handle more than 10,000 shards (#6587)
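
The more flexible dbSNP matching described above (#6626) can be pictured with a small sketch: split each multiallelic record into biallelic ref/alt pairs, trim shared bases, and collect every ID whose trimmed representation matches. This is a conceptual illustration only, not the GATK implementation; the function names, the tuple-based record layout, and the trimming order are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# (contig, 1-based position, ref allele, alt allele)
Key = Tuple[str, int, str, str]


def trim(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim shared trailing, then shared leading, bases from one ref/alt pair."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


def split_and_trim(contig: str, pos: int, ref: str, alts: List[str]) -> List[Key]:
    """Split a (possibly multiallelic) record into trimmed biallelic keys."""
    return [(contig, *trim(pos, ref, alt)) for alt in alts]


def index_dbsnp(records) -> Dict[Key, List[str]]:
    """Map every trimmed biallelic key to all IDs that carry it."""
    index: Dict[Key, List[str]] = {}
    for rsid, contig, pos, ref, alts in records:
        for key in split_and_trim(contig, pos, ref, alts):
            index.setdefault(key, []).append(rsid)
    return index


def matching_ids(index, contig: str, pos: int, ref: str, alts: List[str]) -> List[str]:
    """Return every matching ID (not just the first), without duplicates."""
    ids: List[str] = []
    for key in split_and_trim(contig, pos, ref, alts):
        for rsid in index.get(key, []):
            if rsid not in ids:
                ids.append(rsid)
    return ids


# Hypothetical records: rs1 is multiallelic, rs2 is the same 1-base deletion
# written as a biallelic record.
dbsnp = [
    ("rs1", "chr1", 100, "CTT", ["C", "CT"]),
    ("rs2", "chr1", 100, "CT", ["C"]),
]
index = index_dbsnp(dbsnp)
variant_id = ";".join(matching_ids(index, "chr1", 100, "CT", ["C"]))
print(variant_id)
```

Here the query variant matches both records even though rs1 writes the deletion with a longer reference allele, and the IDs are joined with a semicolon as in the VCF ID column.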
  • Bug Fixes

    • Fixed interval summary files being empty in DepthOfCoverage (#6609)
    • Fixed a crash in the BQSR R script with newer versions of R (#6677)
    • Fixed a crash when reporting an error when trying to build GATK with a JRE (#6676)
    • Fixed an issue where ReadsSourceSpark.getHeader() wasn't propagating the reference at all when a CRAM file input resides on GCS, so it always resulted in a "no reference was provided" error, even when a reference was provided. (#6517)
    • Fixed an issue where ReadsSourceSpark.checkCramReference() always tried to create a Hadoop Path object for the reference no matter what file system it lives on, which fails when using a reference on GCS. (#6517)
    • Fixed an issue where the tab completion integration tests weren't emitting any output (#6647)
  • Miscellaneous Changes

    • Created a new ReadsDataSource interface (#6633)
    • Migrated read arguments and downstream code to GATKPath (#6561)
    • Renamed GATKPathSpecifier to GATKPath. (#6632)
    • Add a read/write roundtrip Spark integration test for a CRAM and reference on HDFS. (#6618)
    • Deleted redundant methods in SVCigarUtils, and rewrote and moved the rest to CigarUtils (#6481)
    • Re-enabled tests for HTSGET now that the reference server is back to a stable version (#6668)
    • Disabled SortSamSparkIntegrationTest.testSortBAMsSharded() (#6635)
    • Fixed a typo in a SortSamSpark log message. (#6636)
    • Removed incorrect logger from DepthOfCoverage. (#6622)
  • Documentation

    • Fixed annotation equation rendering in the tool docs. (#6606)
    • Added a note on how to filter on MappingQuality in DepthOfCoverage (#6619)
    • Clarified the docs for the --gcs-project-for-requester-pays argument to mention the need for storage.buckets.get permission on the bucket being accessed (#6594)
    • Fixed a dead forum link in the SelectVariants documentation (#6595)
  • Dependencies

    • Updated HTSJDK to 2.22.0 (#6637)
    • Updated Picard to 2.22.8 (#6637)
    • Updated Barclay to 3.0.0 (#4523)
    • Updated Spark to 2.4.5 (#6637)
    • Updated Disq to 0.3.6 (#6637)
    • Updated the version of Cromwell used on Travis to v51 (#6628)

@droazen droazen released this Apr 23, 2020 · 47 commits to master since this release

Download release: gatk-4.1.7.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.7.0 release:

  • Added allele-specific filtering to the mitochondrial pipeline.

    • Allele-specific filtering is important for mitochondrial calling because there are many more multi-allelic sites than in the germline autosome.
  • A fix for the frequently-encountered "Smith-Waterman alignment failure" error in HaplotypeCaller and Mutect2

  • Initial support for http(s) paths for BAM inputs, including signed urls

  • A new tool, DownsampleByDuplicateSet, to randomly sample a fraction of duplicate sets from an input bam sorted by UMI

Full list of changes:

  • New Tools

    • DownsampleByDuplicateSet: a new tool to randomly sample a fraction of an input bam sorted by UMI. (#6512)
      • Given a bam grouped by unique molecular identifier (UMI), this tool drops a specified fraction of duplicate sets and returns a new bam.
      • A duplicate set refers to a group of reads whose fragments start and end at the same genomic coordinate and share the same UMI.
      • The input bam must first be sorted by UMI using FGBio GroupReadsByUmi.
      • Use this tool to create, for instance, an in silico mixture of duplex-sequenced samples to simulate tumor subclones.
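
The idea behind DownsampleByDuplicateSet can be sketched as follows: reads are grouped into duplicate sets by fragment coordinates and UMI, and each set is kept or dropped as a whole. This is a minimal illustration under stated assumptions, not the tool's actual code; the `Read` type, the `keep_fraction` parameter, and the per-set coin flip are all hypothetical choices for the example.

```python
import random
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    contig: str
    fragment_start: int
    fragment_end: int
    umi: str


def downsample_by_duplicate_set(
    reads: Iterable[Read], keep_fraction: float, seed: int = 0
) -> List[Read]:
    """Keep or drop whole duplicate sets; each set is kept with probability keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)
    decisions: Dict[Tuple[str, int, int, str], bool] = {}
    kept: List[Read] = []
    for read in reads:
        # A duplicate set: same fragment start and end, same UMI.
        key = (read.contig, read.fragment_start, read.fragment_end, read.umi)
        if key not in decisions:
            decisions[key] = rng.random() < keep_fraction
        if decisions[key]:
            kept.append(read)
    return kept


# Hypothetical reads: r1 and r2 form one duplicate set, r3 and r4 are singletons.
reads = [
    Read("r1", "chr1", 100, 250, "AAT-GGC"),
    Read("r2", "chr1", 100, 250, "AAT-GGC"),
    Read("r3", "chr1", 100, 250, "CCA-TTG"),
    Read("r4", "chr2", 500, 640, "AAT-GGC"),
]
kept = downsample_by_duplicate_set(reads, keep_fraction=0.5, seed=42)
print([r.name for r in kept])
```

Because the decision is made once per duplicate set, r1 and r2 are always retained or discarded together, which is what distinguishes this from downsampling individual reads.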
  • HaplotypeCaller/Mutect2

    • Fixed a regression in HaplotypeCaller and Mutect2 where alt haplotypes with a deletion at the end of the padded region caused exceptions (#6544)
      • This bug produced error messages like the following: "Smith-Waterman alignment failure. Cigar = 275M with reference length 275 but expecting reference length of 303"
    • Fixed an ArrayIndexOutOfBoundsException in GenotypeUtils.computeDiploidGenotypeCounts() caused by mistakenly assuming ploidy two for no-calls (#6563)
    • Added more control over scattering in the Mutect2 PON WDL to allow arbitrarily fine scattering, reducing the memory required for downstream runs of GenomicsDBImport (#6527)
    • Invert --correct-overlapping-quality argument in HaplotypeCaller to --do-not-correct-overlapping-quality (#6528)
  • Mitochondrial Pipeline

    • Added allele-specific filtering to the mitochondrial pipeline (#6399)
      • Allele-specific filtering is important for mitochondria because there are many more multi-allelic sites than in the germline autosome; filtering per allele rather than per site gives downstream tools access to more of the good allele data.
      • These Mutect2 filters used in the MT pipeline are now allele-specific: weak_evidence, base_qual, map_qual, duplicate, strand_bias, strand_artifact, position, contamination, and low_allele_frac.
      • They are added to the AS_FilterStatus annotation in the INFO field.
      • The numt_chimera and numt_novel filters have been replaced by the possible_numt filter.
      • Two new filtering tools have been added: NuMTFilterTool for the possible_numt filter and MTLowHeteroplasmyFilterTool for the mt_many_low_hets filter, both of which are allele-specific.
      • The --split-multi-allelics option of the LeftAlignAndTrimVariants tool now splits the annotations in the FORMAT and INFO fields that are of type A and R (allele-specific, and allele-specific with reference).
      • The VariantFiltration tool now has an --apply-allele-specific-filters option that will apply masks at the allele level. Before this addition, sites that should not be masked but had deletions spanning a masked site would have been masked. Now, if this option is specified, only the alleles spanning the masked site will be masked.
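
To make the splitting of type A and type R annotations mentioned above concrete, the sketch below splits one multiallelic site into biallelic records. It relies only on the VCF header semantics (Number=A has one value per alt allele, Number=R has one value per allele including the reference). The function name and dictionary layout are hypothetical, allele trimming is omitted, and this is not the LeftAlignAndTrimVariants implementation.

```python
from typing import Dict, List


def split_multiallelic(
    ref: str,
    alts: List[str],
    annotations: Dict[str, list],
    number: Dict[str, str],
) -> List[dict]:
    """Split one multiallelic site into biallelic records, subsetting A/R annotations."""
    records = []
    for i, alt in enumerate(alts):
        split = {}
        for key, values in annotations.items():
            if number[key] == "A":
                # One value per alt allele: keep only this alt's value.
                split[key] = [values[i]]
            elif number[key] == "R":
                # One value per allele, reference first: keep ref and this alt.
                split[key] = [values[0], values[i + 1]]
            else:
                # Other annotation types are copied unchanged in this sketch.
                split[key] = list(values)
        records.append({"ref": ref, "alt": alt, "annotations": split})
    return records


# Hypothetical site with two alt alleles.
number = {"AF": "A", "AD": "R", "DP": "1"}
annotations = {"AF": [0.1, 0.3], "AD": [50, 5, 15], "DP": [70]}
for record in split_multiallelic("A", ["C", "T"], annotations, number):
    print(record)
```

Each output record carries only the values that belong to its own alt allele (plus the reference value for type R), which is what allows filters to be applied per allele downstream.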
  • GATK Engine

    • Added initial support for http(s) paths for BAM inputs, including signed urls (#6526)
  • Miscellaneous Changes

    • Exposed maximum copy ratio and point size for CNV plotting tools (#6482)
    • Decreased an epsilon value in VariantRecalibrator so that our production exome joint genotyping tests pass (#6534)
    • Migrated reference arguments and downstream code to GATKPathSpecifier (#6524)
    • Removed obsolete isCompatibleWithSparkBroadcast() method. (#6523)
  • Documentation

    • Cleaned up the handling of some missing values in auto-generated GATK tool documentation (#6565)
      • Now docs won't include null, "", or [] in the default value list.
    • Added a README for the CNN variant scoring workflow, and added an input JSON for Mutect2 workflow files located in GCS buckets (#6542)
    • Fixed a typo in a ploidy prior example in the docs for DetermineGermlineContigPloidy (#6531)

@droazen droazen released this Mar 25, 2020 · 61 commits to master since this release

Download release: gatk-4.1.6.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.6.0 release:

  • Funcotator now supports ENSEMBL GTF files (and non-human species)

  • A beta port of the GATK3 tool DepthOfCoverage, a tool to assess sequence coverage by a wide array of metrics, partitioned by sample, read group, library, or gene (#5913)

  • Several important bug fixes and enhancements to HaplotypeCaller and Mutect2, including:

    • A fix for an often-reported issue where HaplotypeCaller could produce reads starting with deletions during the realignment step and error out.
    • A fix for another often-reported issue where Mutect2 could emit MNPs despite --max-mnp-distance being 0, causing downstream errors in GenomicsDB about MNPs not being supported.

Full list of changes:

  • New Tools

    • A beta port of the GATK3 tool DepthOfCoverage, a tool to assess sequence coverage by a wide array of metrics, partitioned by sample, read group, library, or gene (#5913)
      • This port fixes several bugs and changes some behavior present in the GATK3 version:
        • Fixed a longstanding bug in GATK3 DepthOfCoverage where using multiple partition types results in column header and body lines having mismatching ordering causing incorrect output.
        • The old version used to merge adjacent and overlapping intervals when generating interval summary files. This is no longer the case: in GATK4, adjacent and overlapping intervals are tabulated as separate lines in the output (this also applies to gene lists, which would previously have been merged as well).
        • Changed the behavior of gene list coverage to no longer count introns when generating interval summaries for gene lists.
        • Added support for RefSeqGeneList files as optional gene list input.
  • HaplotypeCaller

    • Fixed a bug where single-base intervals led to no calls (#6507)
      • This fixes the issue reported in #6495 "HaplotypeCaller doesn't detect alternate alleles with 1 bp intervals"
    • Clean leading deletions from reads realigned to best haplotypes (#6498)
      • This fixes the issue reported in #6490 "HaplotypeCaller might be producing bogus reads with deletions at their alignment start during realignment to best haplotype step"
    • Fixed an edge case when haplotypes have leading insertion after trimming (#6518)
  • Mutect2

    • Mutect2 can now filter MNVs with orientation bias (#6486)
    • Added an experimental pileup-based read error corrector, which in our evaluations reduces false positives and improves speed at no cost to sensitivity (#6470)
    • Switched CigarBuilder's order for adjacent indels to be deletion first (#6510)
      • Fixes #6473 "Mutect2 (GATK 4.1.5.0) emitting MNPs despite max-mnp-distance 0"
      • This also resolves downstream errors in GenomicsDB about not supporting MNPs
    • Fixed several bugs involving getReadCoordinateForReferenceCoordinate() (#6485)
      • Fixes #6342 "Mutect2 occasionally writes nonsense / invalid values for MPOS info tag"
      • Fixes #6314 "GATK4.1.3.0 Mutect2 enable-all-annotations option error"
      • Fixes #6294 "ReadPosRankSumTest with leading insertions"
      • Fixes #5492 "ReadPosRankSumTest doesn't work for two deletions with one base in between"
  • Funcotator

    • Funcotator now supports ENSEMBL GTF files (and non-human species) (#6477) (#6492)
      • Users can now create datasources for any species for which ENSEMBL has an annotated GTF file and the corresponding coding region FASTA file
      • When creating new data sources, the user must still use gencode as the parent folder for the GTF data source subfolders. For example, for E. coli MG1655:
        • DATASOURCES
          • gencode
            • ASM584v2
              • Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.44.gtf
              • Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa
              • gencode.config
      • For more information on creating data sources see the Funcotator tutorial on the GATK Forums.
      • An example datasource for E. coli MG1655 can be found in the large test files for Funcotator
      • For ENSEMBL datasources for vertebrates: ftp://ftp.ensembl.org/pub/
      • For ENSEMBL datasources for other species: ftp://ftp.ensemblgenomes.org/pub/
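
The folder layout shown above for an ENSEMBL-based data source can be written down as a small path helper, which may be handy for checking a data source directory before running Funcotator. This is only a sketch: the function name is hypothetical, it builds paths without touching the filesystem, and it says nothing about the contents of gencode.config.

```python
from pathlib import PurePosixPath
from typing import List


def ensembl_datasource_paths(
    root: str, reference: str, gtf_name: str, cds_fasta_name: str
) -> List[PurePosixPath]:
    """Return the expected file paths, keeping 'gencode' as the parent folder."""
    base = PurePosixPath(root) / "gencode" / reference
    return [base / gtf_name, base / cds_fasta_name, base / "gencode.config"]


# The E. coli MG1655 example from the notes above.
paths = ensembl_datasource_paths(
    "DATASOURCES",
    "ASM584v2",
    "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.44.gtf",
    "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.cds.all.fa",
)
for path in paths:
    print(path)
```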
  • CNV Calling

    • Upgrade CNV WDLs to 1.0 spec (#6506)
    • Fixed an off-by-one segmentation argument in ModelSegments. (#6497)
  • Miscellaneous Changes

    • Simplified cigar and clipping code; added tests and fixed a few bugs including #6130 (#6403)
    • Refactored and enhanced ArgumentsBuilder (#6474)
    • Allow all GATKSparkTools to set the SBI index granularity (#6458)
    • Delete NioBam and related classes (#6479)
    • Clean up old interval code (#6465)
    • Remove duplicate copy of the NIO prefetching code (#6464)
    • Fix ignored test in GATKReadAdaptersUnitTest (#6471)
    • Fix alternate spellings of De Bruijn in the codebase (#6472)
  • Documentation

    • Fix a broken set of javadoc references in FeatureDataSource (#6478)

@droazen droazen released this Feb 28, 2020 · 83 commits to master since this release

Download release: gatk-4.1.5.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.5.0 release:

  • A new, improved version of the --linked-de-bruijn-graph mode for HaplotypeCaller and Mutect2 that has better sensitivity compared to the previous linked DeBruijn graph implementation (#6394)

  • A new version of GenomicsDB that fixes many frequently-reported issues

  • LeftAlignIndels now works for multiple indels

  • VariantAnnotator and Concordance are now out of beta

  • A significant number of bug fixes to major tools like GenotypeGVCFs and SelectVariants

Full list of changes:

  • HaplotypeCaller

    • New, improved version of the --linked-de-bruijn-graph mode for HaplotypeCaller and Mutect2 that has better sensitivity compared to the previous linked DeBruijn graph implementation (#6394)
      • Running HaplotypeCaller in this mode will reduce the number of erroneous haplotypes discovered which can improve genotyping, phasing, and runtime.
      • Changed the haplotype recovery step to check that it covers all paths through the graph even if there are poorly supported paths in the JunctionTrees. Added the argument --disable-artificial-haplotype-recovery to disable this behavior.
      • Added the ability to expand graph kmer size after haplotype recovery in the event that there was a failure due to overcomplicated assembly graphs.
      • Added code to squeeze extra sensitivity out of the junction trees by tolerating SNP errors when threading the junction trees themselves
    • Realigning to best haplotype handles indels better (#6461)
    • Fixed issue #5434 on inconsistent selection of reads for the PL, AD, and DP calculations. (#6055)
    • Fixed bug where SNP and indel pseudocounts were swapped in the AlleleFrequencyCalculator (#6401)
    • The qual used in HaplotypeCaller's isActive() method now matches that of GenotypeGVCFs. That is, they both now use the new qual. (#6343)
    • Skip non-nucleotide alleles in force-calling mode, fixing bug (#6405)
    • Fixed the hidden/experimental --error-correct-reads argument to actually correct the bases and qualities (#6366)
    • Removed the deprecated and obsolete --use-new-qual-calculator argument (#6398)
    • Refactored code related to windows and padding for assembly and genotyping, with slight changes to HMM padding for indels (#6358)
  • Mutect2

    • Improved SomaticClusteringModel (#6337)
    • Sped up Mutect2 reference confidence model with fast likelihoods model (#6457)
    • Modified Fragment creation for Mutect2 to not fail for supplementary reads (#6327)
    • Uniqify PG IDs in FilterAlignmentArtifacts (#6304)
    • Fixed error in RealignmentEngine due to converting from exclusive to inclusive interval ends (#6404)
    • Added an error message for no callable sites in Mutect2 (#6445)
    • Changed filter reporting in Mutect2 (#6288)
    • Fixed force-calling mode in M2 mito WDL (#6359)
    • Pass the reference to the realignment filter in the Mutect2 WDL (#6360)
    • Deleted the old orientation bias filter (#6408)
    • Made callable sites a Long to avoid integer overflow (#6303)
  • GenomicsDB

    • Move to GenomicsDB 1.2.0 (#6305)
      • Fixes an issue with GenomicsDBImport erroring out due to duplicate fields in the Info, Format, and/or Filter fields. (#6158)
      • Fixes an issue with GenomicsDBImport not completing for mixed ploidy samples (#6275)
      • This version uses a 64-bit htslib to work around overflow issues when computed annotation sizes exceed the 32-bit integer space
  • Joint Calling

    • GenotypeGVCFs: improved checking for upstream deletions in the GenotypingEngine (#6429)
      • Fixes rare cases where GenotypeGVCFs could emit a variant with a spanned allele (*), and a genotype that references the spanned allele, but fail to emit the upstream spanning variant.
    • GenotypeGVCFs: Don't call the NON_REF allele in genotypes or ADs (#6437)
    • Parse combined AS_QUALapprox values from older reblocked GVCFs properly (#6442)
    • Added a force output sites argument to GenotypeGVCFs (#6263)
    • Remove extraneous alleles in GenotypeGVCFs force-output mode (#6406)
  • CNV Calling

    • Copy temporary files early in gcnvkernel to avoid inadvertent temporary directory cleanup. (#6297)
    • Enabled streaming of counts.tsv/counts.tsv.gz files in gCNV CLIs. (#6266)
    • Fixed shard index in PostprocessGermlineCNVCalls log message. (#6313)
    • gCNV vcf cleanup (#6352)
    • Index output VCFs for GCNV postprocessing (#6330)
  • Notable Enhancements

    • VariantAnnotator is now out of beta (#6402)
    • Concordance is out of beta (#6397)
    • LeftAlignIndels now works for multiple indels (#6427)
    • FilterVariantTranches can now handle cases where there are only SNPs or only indels, and not both (#6411)
    • Added new read filters for NotProperlyPaired and for MateDistant (#6295)
    • Made the .git directory optional during build (#6450)
  • Bug Fixes

    • Handle zero-weight Gaussians correctly in VariantRecalibrator (#6425)
    • Fixed the --invalidate-previous-filters argument in VariantFiltration to work as intended (i.e., roll back all variants to unfiltered status) (#6412)
    • Fixed a bug where SelectVariants takes forever on many-allelic somatic samples (#6446)
    • Make sure SelectVariants outputs variants in correct order (assuming input vcf is correctly sorted) (#6444)
    • Fixed a NPE crash in VariantEval when run with no intervals/reference (#6283)
    • Fixed a NPE crash in FastaReferenceMaker (#6435)
    • Fixed an out-of-bounds error in CountNs annotation (#6355)
    • Fixed a bug in hardClipCigar function that caused incorrect cigar calculation (#6280)
    • AnalyzeSaturationMutagenesis: fixed bug in codon calling for in-frame inserts (#6332)
  • Miscellaneous Changes

    • Collect split read and paired end evidence files for GATK-SV pipeline (#6356)
    • Add "PASS" filter line for ApplyVQSR and FilterMutectCalls (#6436)
    • Added engine functionality for accessing the user defined intervals without merging them (#5887)
    • Trim intervals loaded from interval files. (#6375)
    • Propagate read group filters in ReadGroupBlackListReadFilter. (#6300)
    • Modified ANDed read filter output message for readability (#6315)
    • Clearly label the number of reads processed in the BaseRecalibrator log output (#6447)
    • Clearly label the CountReads tool output (#6449)
    • Improved the error messages for missing contigs in the reference (#6469)
    • Avoid a copy and reverse operation in CigarUtils.isGood() (#6439)
    • Fixed GenotypeAlleleCount's toString() method (#6376)
    • Minor Funcotator WDL updates. (#6326)
    • Added a getPairOrientation() method to GATKRead (#6420)
    • Merged GATKProtectedVariantContextUtils methods into other classes (#6409)
    • Deleted a lot of unused VCF constants (#6361)
    • Deleted some unused genotyping code (#6354)
    • Fixed incoherent unit test cases in allele subsetting utils (#6448)
    • Add Python script executor error message for SIGKILL exit code 137. (#6414)
    • Pip install pinned numpy. (#6413)
    • Do not install R on travis, and only run the R tests on the Docker. (#6454)
    • Fixes for IndexFeatureFile error reporting. (#6367)
    • Temporarily remove dead Berkeley mirror to unblock builds. (#6422)
    • Disable CNNVariantPipelineTest.testTrainingReadModel until failures are resolved. (#6331)
    • Delete unused JsonSerializer (#6415)
    • Delete empty file SparkToggleCommandLineProgram.java. (#6311)
  • Documentation

    • Clarify the definition of the NON_REF allele (#6431)
    • Clarify behavior of SplitIntervals for lists of adjacent intervals (#6423)
    • Update docs to reflect the fact that TandemRepeat works with HaplotypeCaller (#5943)
    • Update LeftAlignIndels documentation (#6177)
    • Update hyperlink to new GATK forum page in the README (#6381)
    • Add minValue/minRecommended value to ApplyBQSRArgumentCollection (#6438)
    • Small README fixes (#6451)
    • Fix some GATK doc issues (#6318)
    • Update copyright date in LICENSE.TXT (#6383)
  • Dependencies

    • Updated HTSJDK to 2.21.2 (#6462)
    • Updated Picard to 2.21.9 (#6462)
    • Updated Disq to 0.3.5 (#6323)
    • Updated GenomicsDB to 1.2.0 (#6305)
    • Updated TestNG to 7.0.0 (#5787)

@lbergelson lbergelson released this Nov 27, 2019 · 168 commits to master since this release

Download release: gatk-4.1.4.1.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.4.1 release:

  • New experimental HaplotypeCaller assembly mode which improves phasing, reduces false positives, improves calling at complex sites, and has a 15-20% speedup vs the current assembler. It is enabled with the option --linked-de-bruijn-graph. This mode is still experimental and not yet recommended for production use.
  • IndexFeatureFile improvements:
    • now cloud enabled
    • changed controversial F argument to I instead.
  • Bug fixes and improvements in GenomicsDB, Mutect2, variant annotation, and more!

Full list of changes:

  • New Tools

    • PrintReadsHeader: a new tool to print a BAM/SAM/CRAM header to a file (#6153)
  • HaplotypeCaller

    • Experimental prototype of JunctionTree based haplotype finding. (#6034) #5925
    • Fix a genotyping bug where reference/alt likelihoods were capped differently. (#6196)
  • Mutect2

    • Mutect2 now warns but does not fail when three or more reads have the same name. (#6240)
    • Fixed the random seed at the beginning of FilterMutectCalls (#6208)
    • GetSampleName and GetPileupSummaries in the M2 pipeline are no longer beta. (#6215)
    • Increase number of iterations in CalculateContamination to 30. (#6282)
    • Handled an edge case with high scatter count in M2 WDL. (#6216)
    • Use ArgumentsBuilder in M2 tests. (#6219)
  • Joint Calling

    • Allele-specific VQSR convergence fix. (#6262)
    • Fix to Allele Fraction annotation bug in multisample vcfs. (#6251)
    • Fix RAW_MQ header inconsistencies after reblocking. (#6276)
    • Mark SNP/indel mode argument in GatherTranches as required so tranches are named properly. (#6273)
  • CNV Calling

    • Fixed model parameter assignment typo in gCNV ploidy model (#6285)
    • Added docker option to the gcnv QC tasks. (#6185)
    • Added epsilons to overdispersion in gCNV models to avoid NaNs. (#6245) #4824 #6226 #6227
    • Assert that ELBO did not become NaN during each step of inference of gCNV. (#6186)
    • Added ability to override THEANO_FLAGS environment variable in gCNV tools. (#6244) #6235
    • Removed erroneous short argument names in R scripts for CNV plotting. (#6197)
  • GenomicsDB

    • Allow GATK to configure annotation processing instead of hardcoding values in GenomicsDB GDB-39
    • High ploidy sites with many genotypes no longer cause an overflow error. GDB-54
    • Add missing libcurl in the native GenomicsDB library. #6122 GDB-66
    • No longer crashes when vcfbufferstream from htslib appears to be invalid. GDB-67
    • Propagated native GenomicsDB exceptions as java IOExceptions. GDB-68
    • Fix issue when using vid protobuf interface and there is more than 1 config. GDB-70
    • Cleanup GenomicsDB vid combine protobuf mapping overrides. #6190
  • Miscellaneous Changes

    • Cloud-enable IndexFeatureFile and change input arg name from -F to -I. (#6246) #6161
    • WDL to run ReadsPipelineSpark on a multicore machine (#6213)
    • Replace TwoPassReadWalker with more general MultiplePassReadWalker. (#6154)
    • Abolish unfilled likelihoods and revamp VariantAnnotator. (#6172)
    • Improve exception message in ValidateVariants. (#6076)
    • Fix Syntax Warning when running GATK with python 3.8 (#6231)
  • Developer / Testing

    • Report error logs in GitHub comment (#6247) #6234
    • Add .java-version to gitignore to support jenv users. (#6232)
    • Restart test JVM after every 100 test classes to reduce out-of-memory failures. (#6093)
    • Running the cloud tests on java 11 on travis. (#6210)
  • Documentation

    • Clarify definition of PGT in VCF header (#6221)
    • docs for paired reads in Mutect2 somatic genotyping (#6264)
    • Fix some typos in the allele subsetting code. (#6265)
  • Dependencies

    • Update picard to 2.21.2 (#6253)
    • Update disq to 0.3.4 (#6252)
    • update htsjdk to 2.21.0 (#6250)
    • Update to Genomicsdb 1.1.2.2 (#6206) (#6188)

@droazen droazen released this Oct 8, 2019 · 204 commits to master since this release

Download release: gatk-4.1.4.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.4.0 release:

  • Major improvements and fixes to Mutect2, including more intelligent handling of paired reads during genotyping and better filtering.

  • Important bug fixes to HaplotypeCaller, the joint calling pipeline, and Funcotator

  • Beta support for building/testing on Java 11 (#6119) (#6145)

    • We encourage you to try this out and give us feedback!

Full list of changes:

  • New Tools

    • AlleleFrequencyQC: a QC tool that uses VariantEval to bin variants in 1000 Genomes by allele frequency. For each bin, we compare the expected allele frequency from 1000 Genomes with the observed allele frequency in the input VCF. This was designed with arrays in mind, as a way to discover potential bugs in our pipeline. (#6039)
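The binning comparison described for AlleleFrequencyQC can be illustrated with a minimal Python sketch. This is not the tool's implementation: the function names, the number of bins, the tolerance, and the sample values are all hypothetical, and the real tool works through VariantEval rather than on in-memory tuples.

```python
from collections import defaultdict

def bin_allele_frequencies(records, n_bins=10):
    """Average expected and observed AF per expected-AF bin.

    records: iterable of (expected_af, observed_af) pairs, one per variant.
    Returns {bin_index: (mean_expected, mean_observed, count)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for expected, observed in records:
        # Clamp so that an expected AF of exactly 1.0 lands in the last bin.
        idx = min(int(expected * n_bins), n_bins - 1)
        sums[idx][0] += expected
        sums[idx][1] += observed
        sums[idx][2] += 1
    return {i: (e / n, o / n, n) for i, (e, o, n) in sums.items()}

def discordant_bins(binned, tolerance=0.1):
    """Return the bins whose mean observed AF strays from the mean expected AF."""
    return sorted(i for i, (e, o, _) in binned.items() if abs(o - e) > tolerance)

# Hypothetical (expected, observed) allele frequencies for four variants.
variants = [(0.02, 0.03), (0.04, 0.05), (0.55, 0.20), (0.97, 0.95)]
binned = bin_allele_frequencies(variants)
print(discordant_bins(binned))
```

A bin flagged this way only says that observed and expected frequencies disagree on average; deciding whether that points to a pipeline bug is left to whoever reads the QC output.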
  • Mutect2

    • Mutect2 genotyping now forces paired reads to support the same haplotype (#5831)
    • New FilterAlignmentArtifacts now realigns a locally-assembled unitig of all variant read pairs (#6143)
    • Fixed a Mutect2 bug that overfiltered by one variant (#6101)
    • Fixed a small gene panel edge case for CalculateContamination (#6137)
    • Fixed a small gene panel edge case in orientation bias filter (#6141)
    • Unified the NIO and non-NIO M2 WDLs (call-caching will now work on Terra) (#6108)
    • Updated Mutect2 pon WDL to WDL 1.0 (#6187)
    • Removed Oncotator from the M2 WDL (Funcotator is still there) (#6144)
    • Fixed an issue in the M2 WDL that could cause the Funcotate task to be ignored by tools such as dxWDL (#6077)
    • Some miscellaneous code refactoring/improvements (#6184) (#6136) (#6107) (#6159)
  • HaplotypeCaller

    • HaplotypeCaller now force-calls like Mutect2: the -genotyping-mode GENOTYPE_GIVEN_ALLELES argument is gone (now you only need to specify --alleles force-calls.vcf) and alleles are now force-called in addition to any other alleles (#6090)
    • Renamed --output-mode EMIT_ALL_SITES to --output-mode EMIT_ALL_ACTIVE_SITES, and clarified the documentation for the argument (#6181)
    • Fixed a rare bug in the genotyping engine where it could emit untrimmed alleles for SNP sites (#6044)
    • Fixed some sources of non-determinism in the HaplotypeCaller that in rare cases could cause the output to vary slightly given the same inputs (#6195) (#6104)
    • Deleted the old exact AF calculation model (#6099)
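As a rough illustration of the new force-calling usage, the following Python sketch assembles a HaplotypeCaller command line that passes only `--alleles`. The helper name and all file names are placeholders, and the command is printed rather than executed.

```python
import shlex

def haplotypecaller_force_call_cmd(reference, bam, output_vcf, alleles_vcf):
    """Build a HaplotypeCaller command that force-calls the alleles in alleles_vcf."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", output_vcf,
        # Per the notes above, --alleles alone enables force-calling;
        # no genotyping-mode argument is passed.
        "--alleles", alleles_vcf,
    ]

cmd = haplotypecaller_force_call_cmd(
    "ref.fasta", "sample.bam", "out.vcf.gz", "force-calls.vcf"
)
print(shlex.join(cmd))
```

Because the given alleles are now called in addition to any others, the output may contain alleles that are absent from the force-calls VCF.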
  • Joint Calling

    • Fixed a regression in GATK 4.1.3.0 that caused us to not emit the AS_QD annotation when running a joint calling pipeline with CombineGVCFs (GenomicsDB was unaffected) (#6168)
    • Fixed allele-specific annotation array length issues when alleles are subset in tools such as GenotypeGVCFs (#6079)
    • Changed AS_RankSum outputs to "." for missing values rather than "nul" (#6079)
  • Funcotator

    • Fixed a bug that caused Funcotator to output fields in the wrong order in some cases when writing a VCF (#6178)
      • Specifically, Funcotator would output funcotation fields in the wrong order when there was more than 1 site in a VCF data source with the exact same position and alleles and it matched one of the variants being annotated
  • Mitochondrial pipeline

    • Renamed the output vcf with the name of the sample and supplied a default value for autosomal_median_coverage (meaning you'll now get the NuMT filter even if you don't provide the actual autosomal coverage) (#6160)
  • Miscellaneous Changes

    • Beta support for building/testing on Java 11 (#6119) (#6145)
    • UpdateVCFSequenceDictionary now supports replacing an invalid sequence dictionary in a VCF (#6140)
    • CountFalsePositives now requires an intervals file (#6120)
    • AnalyzeSaturationMutagenesis: use supplementary alignments to identify large deletions (#6092)
    • AnalyzeSaturationMutagenesis: an insert at the start codon is not in the ORF (#6121)
    • Added a check for null sequence dictionaries in the dictionary validation code (#6147)
    • Update SV Spark pipeline example shell scripts saving results to GCS (#6114)
    • Update public key for installing R in docker (#6116)
    • Log exceptions during deletion on JVM exit instead of throwing (#6125)
    • Don't fail the build if we're in a git worktree folder (#6169)
    • Free a bit of memory for the test suite by disabling mysql and postgres on travis (#6085)
    • Delete bogus index files for queryname sorted CRAMs. (#6149)
    • Cleanup GenomicsDB debugging test output (#6089)
  • Documentation

    • Fixed mitochondria mode documentation in FilterMutectCalls (#6174)
  • Dependencies

    • Updated HTSJDK to 2.20.3 (#6126)
    • Updated Picard to 2.21.1 (#6205)
    • Updated google-cloud-nio to 0.107.0 (#6042)
    • Updated Gradle to 5.6 (#6106)

@droazen droazen released this Aug 9, 2019 · 251 commits to master since this release

Download release: gatk-4.1.3.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.3.0 release:

  • GnarlyGenotyper, a new beta joint genotyping tool which, along with ReblockGVCF, forms part of a forthcoming more scalable version of our joint genotyping pipeline that we call the "GATK Biggest Practices" pipeline
  • FuncotateSegments, a new beta companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
  • GenomicsDBImport now has the ability to incrementally update an existing GenomicsDB workspace
  • Several important bug fixes to HaplotypeCaller and Mutect2

Compatibility notes:

  • GermlineCNVCaller models built in cohort mode with previous releases are no longer compatible. Users should rebuild these models with this release before running GermlineCNVCaller in case mode. See the CNV Tools section below for more details.

Full list of changes:

  • New Tools

    • GnarlyGenotyper (beta tool) (#4947) (#6075)

      • The GnarlyGenotyper is designed to perform joint genotyping on cohorts of at least tens of thousands of samples called with HaplotypeCaller and post-processed with ReblockGVCF to produce a multi-sample callset in a super highly scalable manner.
      • Caveats:
        • GnarlyGenotyper is intended to be used with GVCFs for which low quality variants have already been removed, derived from post-processing HaplotypeCaller GVCFs with ReblockGVCF. See the "Biggest Practices" usage example in the ReblockGVCF docs for details.
        • GnarlyGenotyper does not subset alternate alleles and can return some highly multi-allelic sites. PLs will not be output for sites with more than 6 alts to save space.
        • GnarlyGenotyper assumes all diploid genotypes
      • Annotations:
        • To generate all the annotations necessary for VQSR, input variants to the GnarlyGenotyper must include the QUALapprox and VarDP annotations along with the latest RAW_MQandDP annotation.
        • If allele-specific annotations are present, they will be used appropriately and a new AS_AltDP annotation giving the total depth across samples for each alternate allele will be added.
      • A GATK "Biggest Practices" pipeline including the GnarlyGenotyper is forthcoming pending some fixes improving on the above caveats.
    • FuncotateSegments (beta tool) (#5941)

      • A companion tool to Funcotator that performs functional annotation on a segment file (.seg) rather than a VCF
      • The Somatic CNV pipeline can optionally run this tool for functional annotation
  • HaplotypeCaller/Mutect2

    • Fixed a regression in HaplotypeCaller/Mutect2 that caused some variants to be lost at sites with high complexity (#5952)
    • Fixed a GGA (GENOTYPE_GIVEN_ALLELES) mode bug in HaplotypeCaller/Mutect2 where added alleles' cigars could have soft clips (#6047)
      • This bug would manifest as a "Cigar cannot be null" error
    • Fixed a bug where cached indel informativeness values could be incorrectly applied to the wrong sites in HaplotypeCaller/Mutect2 (#5911)
    • Fixed an edge case in HaplotypeCaller/Mutect2 where dangling end merging creates cycles (#5960)
    • Added hidden arguments to the assembly engine to track found haplotype counts and kmers used (#6049)
    • Fixed a bug in CalculateContamination when contamination is indistinguishable from zero (#5971)
    • Fixed a bug where normal p value argument in FilterMutectCalls was declared static (#5982)
  • CNV Tools

    • Added FuncotateSegments as an option to the Somatic CNV WDL (#5967)
    • Added QC metrics to the Germline CNV workflow (#6017)
    • Enabled GC-bias correction by default in CNV workflows (#5966)
    • Added denoised coverage file concatenation output to gCNV postprocessor (#5823) Note: The addition of this feature breaks compatibility with gCNV cohort-mode models built with previous releases.
    • Changed cr.igv.seg output of ModelSegments to give log2 Segment_Mean. (#5976)
    • Fixed CNV plotting script to allow spaces in input filenames. (#5983)
  • GenomicsDBImport

    • Added support for making incremental updates to existing workspaces (#5970)
      • This can be done using the new --genomicsdb-update-workspace-path argument
    • Fixed a crash in GenomicsDBImport on queries at positions inside deletions (#5899)
    • Treat AS_QUALapprox and AS_VarDP strings as array of int vectors (#5933)
  • Mitochondrial Calling Pipeline

    • Added NIO support and updated to WDL 1.0 (#6074)
  • Spark Tools

    • Removed the beta label from many simple Spark tools (#5991)
    • Bug fix for reading references from GCS on Spark (#6070)
    • Eliminated an unnecessary sort step in HaplotypeCallerSpark (#5909)
    • Fixed BaseRecalibratorSpark failure on a cluster due to system classloader issue (#5979)
    • Added a WDL for ReadsPipelineSpark (#5904)
    • Added a command-line argument to toggle using NIO on reading for Spark (#6010)
    • Added advanced arguments to MarkDuplicatesSpark to allow non-queryname sorted inputs when specifying multiple input bams and to treat unsorted inputs as queryGroup-sorted (#5974)
    • Clarified the behavior of MarkDuplicatesSpark when given multiple input bams, and improved the sorting behavior if given a mix of queryname-sorted and query-grouped bams (#5901)
    • Changed spark.yarn.executor.memoryOverhead to spark.executor.memoryOverhead as promoted by Spark 2.3 (#6032)
    • Handle newly-added arguments in ApplyBQSRUniqueArgumentCollection (#5949)
  • Miscellaneous Changes

    • Added a new BaseQualityHistogram variant annotation to generate base quality histograms (#5986)
    • Added a new SoftClippedReadFilter that can filter out reads where the ratio of soft-clipped bases to total bases exceeds some given value (#5995)
    • Fixed a serious bug in ValidateVariants where the tool would silently do no validation in the default case when a DBSNP file was not provided (#5984)
    • Fixed a "Record covers a position previously traversed" error in ValidateVariants for GVCFs with multiple contigs (#6028)
    • The RMSMappingQuality annotation now requires the --allow-old-rms-mapping-quality-annotation-data argument to run with GVCFs created by older versions of the GATK (#6060)
    • Added a simple TSV/CSV/XSV writer with cloud write support as an alternative to TableWriter (#5930)
    • Funcotator: added Funcotator stand-alone WDL to supported area (#5999)
    • Extracted the GenotypeGVCFs engine into publicly accessible class/function (#6004)
    • Refactored VariantEval methods to allow subclasses to override (#5998)
    • AnalyzeSaturationMutagenesis: arbitrarily choose 1 read for disjoint pairs, dump rejected reads, and various other improvements (#5926) (#6043)
    • Normalized some AssemblyRegion args in HaplotypeCallerSpark (#5977)
    • Don't redundantly delete temporary directories in RScriptExecutor (#5894)
    • Treat all source files as UTF-8 for java, javadoc (#5946)
    • Updated an out-of-date argument name in an error message for the CycleCovariate
    • Changed an error about "duplicate feature inputs" to be a UserException (#5951)
    • Got rid of ExpandingArrayList in favor of ArrayList (#6069)
    • Disabled Codecov for now on travis due to spurious errors (#6052)
    • Lowered the Xms value in the test JVM (#6087)
    • Updated the travis installed R version to 3.2.5, matching our base docker image (#6073)
    • Fixed an erroneous warning about GCS test configuration (#5987)
    • Added a code of conduct (#6036)
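The soft-clip ratio test that SoftClippedReadFilter is described as applying can be sketched in Python as follows. This is an illustration only, not the GATK implementation: "total bases" is assumed here to mean the read bases counted from the CIGAR string, and the function names and the threshold parameter are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operators that consume bases of the read sequence (SAM specification).
READ_CONSUMING = set("MIS=X")

def soft_clip_ratio(cigar):
    """Return soft-clipped bases divided by total read bases for a CIGAR string."""
    soft = total = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in READ_CONSUMING:
            total += n
        if op == "S":
            soft += n
    return soft / total if total else 0.0

def passes_soft_clip_filter(cigar, max_ratio):
    """Keep the read only if its soft-clipped fraction does not exceed max_ratio."""
    return soft_clip_ratio(cigar) <= max_ratio

for cigar in ("10S90M", "60S40M", "5H10S85M5S"):
    print(cigar, soft_clip_ratio(cigar), passes_soft_clip_filter(cigar, 0.5))
```

Hard clips (`H`) are excluded from both counts in this sketch because those bases are not present in the read sequence.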
  • Documentation

    • FilterVariantTranches documentation fix and improvement (#5837)
    • Updated FilterMutectCalls usage examples (#5890)
    • Added --max-mnp-distance 0 to usage example in CreateSomaticPanelOfNormals docs (#5972)
    • Updated the MarkDuplicatesSpark documentation to no longer contain a misleading usage example (#5938)
    • Added a clarification to the README to warn users to set their Gradle JVM properly in Intellij after setup (#6066)
    • Added links to download Java 8 to the README (#6025)
    • Remove non-ascii chars from javadoc (#5936)
  • Dependencies

    • Updated HTSJDK to 2.20.1 (#6083)
    • Updated Picard to 2.20.5 (#6083)
    • Updated Disq to 0.3.3 (#6083)
    • Updated Spark to 2.4.3 (#5990)
    • Updated Gradle to 5.4.1 (#6007)
    • Updated GenomicsDB to 1.1.0.1 (#5970)

@droazen droazen released this Apr 23, 2019 · 316 commits to master since this release

Download release: gatk-4.1.2.0.zip
Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Highlights of the 4.1.2.0 release:

  • Two new tools, MethylationTypeCaller and AnalyzeSaturationMutagenesis (see below for descriptions)
  • Significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller
  • Fixed a serious bug in Funcotator that could cause END positions to be wrong for some deletions in MAF output
  • Significant updates to the mitochondrial calling pipeline

Full list of changes:

  • New Tools

    • MethylationTypeCaller (#5762)
      • Identifies methylated bases from bisulfite sequencing data. Given a bisulfite sequenced, methylation-aware aligned BAM and a reference, it outputs methylation-site coverage to a specified output vcf file.
    • AnalyzeSaturationMutagenesis (#5803) (#5883)
      • Processes reads from a saturation mutagenesis experiment, an experiment that systematically perturbs a mini-gene to ascertain which amino-acid variations are tolerable at each codon of the open reading frame. Its main job is to discover variations from wild-type sequence among the reads, and to summarize the variations observed.
  • Mutect2

    • Made significant improvements to GENOTYPE_GIVEN_ALLELES mode in Mutect2 and HaplotypeCaller (#5874). These improvements are described in more detail in #5857
    • CalculateContamination now works much better for very small gene panels (#5873)
    • We now correctly handle inputs with 100% contamination in Mutect2 filtering (#5853)
    • Mutect2 now uses natural logarithms internally (#5858). This does not change any outputs.
    • Minor update to the Mutect2 PON WDL (#5859)
  • Funcotator

    • Fixed a serious bug that could cause END positions to be wrong for some deletions in MAF output (#5876)
    • The tool now throws a user error for an AD field with only 1 value in MAF mode (#5860)
    • Added a new filter to FilterFuncotations. For two autosomal recessive genes, MUTYH and ATP7B, homozygous variants and compound heterozygous variants will be tagged and added to the output vcf. (#5843)
  • Mitochondrial Calling Pipeline

    • Updated the pipeline for the new Mutect2 filtering scheme and pulled filtering after the liftover and recombining of the VCF. (#5847)
    • Made the subsetting of the WGS bam fast by using PrintReads over just chrM instead of traversing the whole bam for NuMT mates. (#5847)
    • Moved polymorphic NuMTs based on autosomal coverage to a filter (it was an annotation before) (#5847)
    • Added an option to hard filter by VAF (#5847)
    • Bug fix for large input files to the mitochondrial pipeline (we now include the size of the input BAM/CRAM when calculating disk size, when necessary) (#5861)
  • Structural Variation Calling Pipeline

    • Bug fix to QNameFinder to handle reads with negative unclipped starts (#5864)
  • Miscellaneous Changes

    • Added a --min-fragment-length argument to the FragmentLengthReadFilter (#5886)
    • Added a --spark-verbosity argument to control verbosity of Spark-generated logs (#5825)
    • Added a new WalkerBase abstract class to be used for all built-in walkers (#4964)
    • Exposed transient attributes in the GATKRead API (#5664)
    • Convert more code to use GATKPathSpecifier (#5870) (#5832). This also fixes an InvalidPathException on Windows machines.
    • Fixes to the test suite related to the recent introduction of a codec for Picard interval lists (#5879)
    • Eliminated an error message during the Docker build in Travis logs by creating a directory before copying to it. (#5878)
  • Documentation

    • Updated the Mutect2 WDL README with Funcotator information (#5892)
    • Updated a usage example for CreateHadoopBamSplittingIndex (#5898)

@droazen droazen released this Mar 28, 2019 · 340 commits to master since this release

Highlights of the 4.1.1.0 release:

  • A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF)
  • Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
  • A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
  • Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
  • Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.

Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes:

  • HaplotypeCaller

    • Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
      • This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
    • Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
    • Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
    • Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
    • Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
    • HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
    • Fixed rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
  • Mutect2

    • Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
      • FilterMutectCalls automatically determines the optimal threshold.
      • The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
      • Includes a rewrite of Mutect2 documentation -- better organization and now includes command line examples in addition to math.
    • Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
      • This especially improves indel sensitivity.
    • Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
    • New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
      • Panel of normals removes germline variants in order to contain only technical artifacts, and contains information about artifact prevalence
    • Rewrote Mutect2 active region likelihood as special case of full somatic likelihoods model, which reduces runtime by 5% (#5814)
    • Funcotator updates in Mutect2 WDL (#5742) (#5735)
    • Prune assembly graph before checking for cycles (#5562)
    • Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
    • Added CRAM support to the Mutect2 WDL (#5668)
    • Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
    • Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
    • Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
    • Correct some Mutect2 VCF header lines (#5792)
    • Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
    • Output sample names in Mutect2 PON header (#5733)
    • Avoid error due to finite precision error in Mutect2 PON creation (#5797)
    • Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
    • Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
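To make the idea of a single, automatically chosen threshold on an overall error probability more concrete, here is a minimal Python sketch. The notes above do not say which criterion FilterMutectCalls optimizes; maximizing the expected F-score is an assumption made for this example, and the function name and the probabilities are hypothetical.

```python
def optimal_threshold(error_probs):
    """Return (threshold, expected_f_score) maximizing the expected F-score.

    A call is kept when its error probability is <= threshold.
    """
    probs = sorted(error_probs)
    total_true = sum(1.0 - p for p in probs)  # expected number of real variants
    best_threshold, best_f = 0.0, 0.0
    exp_tp = exp_fp = 0.0
    for i, p in enumerate(probs):
        # Keep every call up to and including this one.
        exp_tp += 1.0 - p
        exp_fp += p
        if i + 1 < len(probs) and probs[i + 1] == p:
            continue  # score a tied group only once all of its members are kept
        exp_fn = total_true - exp_tp
        f_score = 2 * exp_tp / (2 * exp_tp + exp_fp + exp_fn)
        if f_score > best_f:
            best_threshold, best_f = p, f_score
    return best_threshold, best_f

# Hypothetical per-call error probabilities.
threshold, f_score = optimal_threshold([0.01, 0.02, 0.05, 0.9, 0.95])
print(threshold, round(f_score, 3))
```

The point of the sketch is that one cutoff on a combined error probability replaces separate cutoffs per filter, so the trade-off between missed variants and false positives is made in one place.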
  • CNNScoreVariants

    • We now use the latest Intel-optimized tensorflow (#5725)
      • This speeds up the 2D CNN by roughly 2X in our tests!
    • FilterVariantTranches is out of beta (#5628)
    • Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
      • We now make sure that the GATK tool Python package is present before executing streaming Python commands.
    • Extensive updates to the CNN WDLs (#5251)
  • Mitochondrial Calling Pipeline

    • Added an option to recover all dangling branches, on by default for MT calling (#5693)
      • Fixes a large number of missed calls
    • Use adaptive pruning in the mitochondria pipeline (#5669)
    • Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
    • Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
    • Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
    • Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
    • Added the rest of the mitochondria joint-calling pipeline (#5673)
      • Merging and genotyping "somatic" GVCFs from Mutect2
    • Added a read filter for unmapped reads and their mates (#5826)
    • Refactored the MT WDL to make validations easier (#5708)
    • Updated a variable name in MT WDL to match gatk-workflows version (#5694)
  • GenotypeGVCFs

    • Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
    • Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
      • GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
      • Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
  • Funcotator

    • Non-locatable data sources can create funcotations again (#5774)
      • Fixes a bug where Funcotator was not adding funcotations from non-locatable data sources
    • Fixed handling of symbolic alleles when determining best transcript for GencodeFuncotation creation. (#5834)
    • FilterFuncotations: support for multi-allelic variants (#5588)
    • FilterFuncotations: support for gnomAD for allele frequency in ClinVarFilter and LofFilter, with a new argument telling it which dataset of gnomAD or ExAC to use (#5691)
    • Added # as a character to be sanitized by VCFOutputRenderer (#5817)
    • Added in Markdown files for Funcotator forum posts (#5630)
    • Updated Funcotator documentation with a FAQ section to respond to user comments (#5755)
  • CNV Tools

    • Improved memory usage in gCNV (#5781)
    • Improved memory requirements of CollectReadCounts (#5715)
    • Added some fixes for minor CNV issues (#5699)
    • Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
    • Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
  • Miscellaneous Changes

    • SelectVariants can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
    • VariantEval bug fix: don't require the output file to already exist (#5681)
    • Fixed the --pedigree argument in the PossibleDeNovo annotation (#5663)
    • GenomicsDBImport: fixed a core dump when querying overlapping deletions (#5799)
    • GatherPileupSummaries: a new tool that combines the output of GetPileupSummaries from disjoint scatter jobs (#5599)
    • VariantsToTable: add splitting for allele-specific annotations and ADs (#5697)
    • CalculateGenotypePosteriors: fix reported bug where no-call genotypes with no reads get genotype posterior probabilities and calls (#5667)
    • Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
    • ReadsPipelineSpark: fixed an "Interval not within the bounds of a contig" error (#5645)
    • Concordance: fixed the tool to allow for no variation alleles in the truth data. (#5718)
    • ReblockGVCF: fix sites with zero AD to actually use SITE-level DP value as intended in (#5835)
    • Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly (#5093)
    • Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
    • Print the Picard/HTSJDK versions in addition to the GATK version when running with --version (#5757)
    • IndexFeatureFile: fixed a crash on VCFs with 0 records (#5795)
    • PrintBGZFBlockInformation: removed the file extension check so that we can accept bams (#5801)
    • Added a new read filter: IntervalOverlapReadFilter (#5656)
    • Add NIO Path support to TableReader and TableWriter (#5785)
    • Replaced IntervalsSkipList with OverlapDetector (#4154)
    • Removed some unused arguments in VCF merging code (#5745)
    • Kebab-case some arguments in LocusWalker and LocusWalkerSpark (#5770)
    • Removed an unnecessary IllegalArgumentException in PairHMM (#5705)
    • Removed accidental uses of log4j v1 (#5682)
    • Improvements to Spark evaluation scripts (#5815)
    • Extract tests from PrintReadsIntegrationTest to share with the Spark version. (#5689)
  • Documentation

    • Improved the documentation for the StrandOddsRatio annotation (#5703)
    • Fixed the descriptions of some HaplotypeCaller arguments (#5658)
    • Update VariantRecalibrator example code to reflect new tagged argument syntax (#5710)
    • Corrected javadoc for the InbreedingCoeff annotation (#5768)
    • CalculateGenotypePosteriors: minor updates to javadoc and logger type (#5601)
    • Added and updated javadoc for SortSamSpark and MarkDuplicatesSpark (#5672)
    • Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
    • Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
    • Trimmed overly-long tool one-line summaries to shorten --list display width. (#5551)
  • Dependencies

    • Updated HTSJDK to 2.19.0 (#5812)
    • Updated Picard to 2.19.0 (#5812)
    • Updated Disq to 0.3.0 (#5812)
    • Updated google-cloud-nio to 0.81.0 (#5752)

@droazen droazen released this Jan 30, 2019 · 434 commits to master since this release

It's been a year since the GATK 4.0.0.0 release in January 2018, and we decided that it was time to package up the past year's worth of GATK improvements into a new major release, which we're calling version 4.1.0.0!

To commemorate this milestone, we'll be publishing a series of in-depth technical articles and blog posts covering the major new features in version 4.1.0.0 on the official GATK blog.

Below we've compiled the highlights of the new features added between versions 4.0.0.0 and 4.1.0.0. If you're interested in seeing only the changes between the last release (4.0.12.0) and this release (4.1.0.0), click here instead.

Official docker image is here: https://hub.docker.com/r/broadinstitute/gatk/

Major changes between versions 4.0.0.0 and 4.1.0.0 (January 2018 to January 2019):


  • Next-Gen VQSR Replacement For Single-Sample

    • New suite of tools CNNScoreVariants, CNNVariantTrain, CNNVariantWriteTensors, and FilterVariantTranches
    • CNNScoreVariants is now out of beta and ready for production use
    • Performs variant training and scoring using a convolutional neural network.
    • Single-sample only
    • Produces better results than the legacy VariantRecalibrator (VQSR) and results comparable to or better than third-party tools like DeepVariant
    • Sophisticated 2D model that uses the reads
  • Major HaplotypeCaller Improvements

    • Now genotypes and outputs spanning deletions
    • Now outputs VCF spec-compliant phased variants
    • Can emit MNPs via a new --max-mnp-distance argument
    • Important fix to the reference confidence calculation upstream of indels
    • New HaplotypeCaller priors for variant sites and homRef blocks
      • Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
      • Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
  • Major Mutect2 Improvements

    • Mutect2 is now out of beta
    • Support for multi-sample calling
    • Lots of support for high-depth calling such as cfDNA, UMIs, mitochondria, including a new active region likelihood, probabilistic assembly graph pruning that adjusts to the local depth, a new mitochondria mode, and new filters for blood biopsy and mitochondria
    • Now outputs VCF spec-compliant phased variants
    • Can emit MNPs via a new --max-mnp-distance argument
    • Added a genotype given alleles (GGA) mode
    • New STR indel error model that improves sensitivity and precision in STR (short-tandem repeat) contexts
    • Many new/improved filters to reduce false positives (e.g., FilterAlignmentArtifacts)
    • Mutect2 now automatically recognizes and removes end repair artifacts in regions with inverted tandem repeats. This is extremely important for some FFPE samples.
    • New probabilistic orientation bias tool
    • Got rid of many questionable indels showing up in bamout of Mutect2 and the HaplotypeCaller
    • Big improvements to CalculateContamination, especially when tumor has lots of CNVs
    • NIO support in Mutect2 WDL
    • Significant speed improvements
    • Improved allele fraction estimation
    • Initial GVCF output support
  • Mitochondrial Calling

    • Added --mitochondria-mode to Mutect2 and FilterMutectCalls. This increases sensitivity and only applies filters that are optimized for mitochondria.
  • New allele frequency / qual score model

    • Is now the default in HaplotypeCaller and GenotypeGVCFs
    • Optimized for greater speed, should resolve many GenotypeGVCFs memory issues
    • Rare numerical finite precision issues in the allele-specific qual have been resolved
  • Major Improvements to the CNV (Copy Number Variation) tools

    • The CNV tools are now out of beta.
      • This includes the tools: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
    • Completed the GermlineCNVCaller (gCNV) pipeline and made various performance/runtime improvements to both the methods and WDLs.
    • Major changes include the addition of new tools (PostprocessGermlineCNVCalls, FilterIntervals, and CollectReadCounts, which replaces CollectFragmentCounts), as well as improvements to existing tools (notably, AnnotateIntervals).
    • Improved support for various formats, namely VCF output in the gCNV pipeline, IGV-compatible .seg output in the ModelSegments somatic CNV pipeline, and CRAM support for all CNV WDLs.
    • Developed tools and WDLs for tagging and filtering of germline events in the ModelSegments somatic CNV pipeline.
  • Funcotator Official Release

    • Funcotator is now out of beta
    • Huge number of bug fixes and accuracy improvements. Output for several fields is now more correct than other well-known functional annotation tools.
    • Some new features include:
      • MAF output support
      • NIO support for datasources
      • gnomAD support
      • dbSNP support
      • Support for Mitochondrial amino acid sequence/protein change strings
      • 5'/3' flank support
      • Major performance improvements due to added caching
      • Added ALL mode for transcript selection (--transcript-selection-mode ALL) which will output full annotation fields for all transcripts
    • Created a new FuncotatorDataSourceDownloader tool to download data sources
    • Added an experimental FilterFuncotations tool
  • MarkDuplicatesSpark is now a Validated, Scalable Replacement for MarkDuplicates

    • MarkDuplicatesSpark is now out of beta
    • Rewritten version of the tool matches Picard MarkDuplicates output and has greatly improved performance and scalability
    • Supports multiple BAM inputs
    • Indexes BAM outputs on-the-fly in parallel on a cluster
  • Additional Tools Ported from GATK3

    • Ported VariantAnnotator
    • Ported VariantEval
    • Ported FastaAlternateReferenceMaker and FastaReferenceMaker
    • Ported LeftAlignAndTrimVariants
    • Restored GenotypeGVCFs --include-non-variant-sites argument
  • Major Improvements to the SV (Structural Variation) Tools

    • Improvements to collection and calling of events based on discordant read pair evidence.
    • A new scaffolding algorithm greatly improves the contiguity of local assemblies, increasing sensitivity.
    • Regions of excessive sequencing depth are excluded from evidence collection and assembly, improving runtime performance.
    • A major overhaul of our algorithm for calling events based on local assemblies improves accuracy and allows for the accurate reporting of small complex SVs.
    • A machine learning (XGBoost) based classifier for SV evidence improves runtime and increases accuracy by determining which regions should be fed into the local assembly workflow.
  • Spark Improvements

    • New Disq Spark library allows faster and more accurate loading of formats like BAM and VCF
    • HaplotypeCallerSpark now has a "strict mode" that closely matches the regular HaplotypeCaller
    • Created RevertSamSpark, a parallelized Spark version of Picard's RevertSam tool
    • Migrated most Spark tools that take a reference and/or VCF to use Spark's intrinsic file copying mechanism instead of broadcast to distribute the reference and VCFs to worker nodes -- a big performance win!
  • GenomicsDB Improvements

    • Allele-specific annotation support
    • Multi-interval support (with some performance caveats)
    • Support for sites-only queries
    • Support for returning the GT field in queries
    • New protobuf-based API to allow configuration without editing JSON files
    • Added machinery to allow per-annotation combine operations to be specified
    • Allow HDFS and GCS URIs to be passed to GenomicsDB
    • Migrated from com.intel.genomicsdb to org.genomicsdb
  • "Goodies" Worth Mentioning

    • Added fasta.gz support to the -R/--reference argument in walker tools
    • SelectVariants can now drop specific annotation fields from the output vcf
    • CalculateGenotypePosteriors now supports indels
    • New tool ReblockGVCF to merge reference blocks in single-sample GVCFs for smaller file sizes
    • Improved MQ calculation accuracy, especially at sites with many uninformative reads; this comes with a new annotation tag and format
    • The -L argument now supports GCS (Google Cloud Storage) for interval list files / bed / vcf files in walker tools
    • Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument
    • Added GCS (Google Cloud Storage) output (-O) support to more tools
    • Improved Python integration (eliminated timeouts and reliance on prompt synchronization) means fewer glitches during runs of ML-based tools
    • A significantly (~33%) smaller GATK docker image
    • Changed argument tagging syntax from "--arg tag:value" to "--arg:tag value"
      • Affects command-line interface for VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
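
For scripts that still use the old tagged-argument form, the rewrite from "--arg tag:value" to "--arg:tag value" can be sketched as below. This is only an illustrative helper, not part of GATK: the function name `migrate_tagged_args` and its `tagged_options` parameter are hypothetical, and the sketch assumes the command line is already split into tokens and that the caller knows which options were written with tags.

```python
def migrate_tagged_args(argv, tagged_options):
    """Rewrite old-style '--arg tag:value' pairs as new-style '--arg:tag value'.

    argv           -- list of command-line tokens (already split, as in sys.argv)
    tagged_options -- option names the caller knows were written with tags
                      (hypothetical parameter, not a GATK concept)
    """
    migrated = []
    i = 0
    while i < len(argv):
        option = argv[i]
        has_value = i + 1 < len(argv)
        if option in tagged_options and has_value and ":" in argv[i + 1]:
            # Old form: tag and value share one token, separated by the first colon.
            tag, value = argv[i + 1].split(":", 1)
            # New form: the tag is attached to the option name, the value stands alone.
            migrated.extend([f"{option}:{tag}", value])
            i += 2
        else:
            migrated.append(option)
            i += 1
    return migrated


old = ["VariantRecalibrator", "--resource", "known,known=true,prior=10.0:myFile"]
print(" ".join(migrate_tagged_args(old, {"--resource"})))
# prints: VariantRecalibrator --resource:known,known=true,prior=10.0 myFile
```

Because the split happens at the first colon, an untagged value that itself contains a colon (for example a URI) would be mis-split, so only pass option names that really were tagged in the old command line.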

Changes between versions 4.0.12.0 and 4.1.0.0 only:


  • Many tools are now out of beta and ready for production use!

    • CNNScoreVariants is out of beta (#5548)
    • Funcotator and FuncotatorDataSourceDownloader are out of beta (#5621)
    • MarkDuplicatesSpark is out of beta (#5603)
    • CNV tools are out of beta (#5596). This includes: AnnotateIntervals, CallCopyRatioSegments, CollectAllelicCounts, CollectReadCounts, CreateReadCountPanelOfNormals, DenoiseReadCounts, DetermineGermlineContigPloidy, FilterIntervals, GermlineCNVCaller, ModelSegments, PostprocessGermlineCNVCalls, PreprocessIntervals, PlotDenoisedCopyRatios, and PlotModeledSegments
  • New tools:

    • Added ports of FastaAlternateReferenceMaker and FastaReferenceMaker from GATK3 (#5549)
    • RevertSamSpark: a parallelized, Spark-based implementation of RevertSam from Picard (#5395)
    • CompareIntervalLists: simple new tool to compare interval lists (#3702)
    • CountBasesInReference: simple new tool to count bases in a reference file (#5549)
    • PrintBGZFBlockInformation: a tool to dump information about blocks in a BGZF file (#4239)
  • Mutect2

    • Mutect2 now works with multiple tumor and normal samples! (#5560)
    • First iteration of a reference confidence GVCF-like output for Mutect2 to enable mitochondrial joint calling (#5312)
    • Changed default blocking and NON-REF LOD params for Mutect2 GVCF mode (#5615)
    • Changed defaults for mitochondria mode now that we have adaptive pruning (#5544)
    • Fixed an edge case bug when Mutect2 sees a variant with population AF = 1 (#5535)
    • Fixed an edge case of zero-depth in FilterMutectCalls germline filter (#5578)
    • Fixed an edge case for the Mutect2 germline resource (#5563)
    • Tweaked the Mutect2 germline filter (#5595)
    • Put new orientation bias model in Mutect2 NIO wdl (#5580)
    • Improved the proposed tumor-in-normal docs to account for new Mutect2 options (#5555)
  • Added a copy of the mitochondria best practices pipeline (#5566) (#5612)

  • HaplotypeCaller

    • New allele frequency / qual score model is now the default in HaplotypeCaller and GenotypeGVCFs (#5484)
    • Simplified and sped up KBestHaplotypeFinder by replacing recursion with Dijkstra's algorithm (#5462) (#5554)
    • Forward input BAM @PG header lines to -bamout output BAM (#3065)
    • Small performance improvement in GVCF mode (#5470)
  • CNV Tools

    • Out of beta, as mentioned above! (#5596)
    • Added per-sample denoised coverage output to gCNV (#5584)
    • ModelSegments: Added separate allele-count thresholds for the normal and tumor (#5556)
    • ModelSegments: Added MinibatchSliceSampler and replaced naive subsampling (#5575)
    • Restored array output in gCNV WDLs for efficient postprocessing. (#5490)
  • Changed tagged argument syntax from --argument tag:value to --argument:tag value (#5526)

    • For example, --resource known,known=true,prior=10.0:myFile becomes --resource:known,known=true,prior=10.0 myFile
    • This change affects VariantRecalibrator, VariantEval, VariantFiltration, and VariantAnnotator
  • Funcotator

    • Out of beta, as mentioned above! (#5621)
    • New datasource release that fixes many issues and adds gnomAD support (#5614)
    • VCF Data Sources now preserve the FILTER field (#5598)
    • Funcotator now gets the NCBI build version from the datasource config file (#5522)
    • Funcotator now ignores transcript version numbers when matching on transcript ID (#5557)
    • Funcotator now uses the GATK-wide version number (#5520)
    • Updated Funcotator tool documentation (#5620)
  • MarkDuplicatesSpark

    • Out of beta, as mentioned above! (#5603)
    • Added the ability for MarkDuplicatesSpark to accept multiple bam inputs (#5430)
    • Fixed MarkDuplicatesSpark mutex argument references (#5538)
  • Spark tools

    • Support for distributed BAI index creation, and option for enabling or disabling writing BAI and SBI files on Spark (#5485)
    • Got HaplotypeCallerSpark "strict mode" running on an exome (#5475)
    • Added an option for enabling or disabling writing tabix indexes for bgzipped VCF files from Spark (#5574)
    • Fixed overflow bug in GatkSparkTool.getRecommendedNumReducers() (#5586)
  • GenomicsDB

    • Migrated from com.intel.genomicsdb to org.genomicsdb (#5587) (#5608)
    • GenomicsDB now matches CombineGVCFs with input spanning deletions (#5397)
    • Define GenomicsDB "partitions" over the span of the input intervals in order to dramatically improve exome performance (#5540)
  • Miscellaneous Changes

    • Added liftover wdls and jsons for gnomAD 2.1 (#5604)
    • Added script to create Hg38 to B37 liftover chain (#5579)
    • Allow variant walkers to configure their caching behavior (#3480)
    • Bug fix for using a ReservoirDownsampler with a ReadsDownsamplingIterator (#5594)
    • Started migration to a new URI abstraction (#5526)
    • Fixed inclusion of default read filters in GATK documentation (#5576)
    • Put the actual date/time in the generated GATK documentation (#5567)
    • Pair-HMM alignment algorithm description fix (#5528)
    • Make ReadFilter and Annotation packages configurable (#5573)
    • Fix to make gatk --version print the version instead of throwing an exception (#5537)
    • Added warning message reminding user to add the allele specific annotation group when needed (#3042)
    • Fix for intermittent LeftAlignAndTrimVariants test failures (#5519)
    • Restored the link in the VariantFiltration docs to point to the updated online JEXL documentation (#5525)
    • Moved BucketUtils.deleteOnExit() and deleteRecursively() to IOUtils (#5332)
    • Source the tab completion script in the GATK docker image (#5552)
    • Added GATK jar to CLASSPATH in docker image (#3866)
    • Updated travis github badge link (#5617)
    • Removed offline CRAN repository from build (#5593)
  • Dependencies

    • Updated htsjdk to version 2.18.2 (#5585)
    • Updated picard to version 2.18.25 (#5597)