Skip to content

broadinstitute/GATK-For-Primates

Repository files navigation

GATK-for-Primates: GATK Best Practices for Variant Calling in Non-Human Primate Genomes (alpha)

This repo is NOT ready for use yet and is still under development!!

A reproducible pipeline for germline SNP and Indel variant calling in non-human primate whole-genome re-sequencing data.

This pipeline is in the alpha stage of development. Please do not use this pipeline until we have completed initial benchmarking. Once we see promising results with sensitivity and precision, we'll seek the community's feedback to make additional improvements.

Run on Cromwell: cromwell run gatk-for-primates-germline-snps-indels.wdl -i inputs.json

Run on Terra: See the GATK_For_Primates workspace (coming soon: stay tuned!).

Summary:

The Genome Analysis Toolkit (GATK) remains the premier software package for variant discovery and genotyping from next-generation resequencing reads. Though originally developed for human genetics, GATK has evolved to handle genome data from any organism with any level of ploidy. Nonetheless, the existing ‘best practices’ and published workflows are focused solely on processing human data. These cannot accommodate the unique challenges encountered in data from other taxa. Here, we present a 'best practices' WDL pipeline for germline short variant discovery in non-human primate whole genome re-sequencing data.

  • This pipeline fully automates Base Quality Score Recalibration (BQSR) by applying hard-filtered variants from initial rounds of variant calling. This replaces the need for 'gold standard' sets of known-variant sites that only derive from humans.

  • Hard-filtering is also used in place of Variant Quality Score Recalibration (VQSR), though partial or full VQSR will proceed if SNP and/or Indel truthing sets are provided.

  • Samples from multiple (but closely related) taxonomic units (e.g. species within a genus, or subspecies within a species) can be co-analysed together in this pipeline. This allows for downstream analyses that, while irrelevant to human studies, transpire to be common-use scenarios in wildlife comparative genomics (e.g. assessing hybridization, admixture, speciation, etc.). This approach also facilitates the discovery and consistent genotyping of rare and low-frequency alleles across taxonomic units, which may offer improved insights into complex evolutionary histories and traits.

  • To ensure run-time efficiency, we take a ‘scatter-gather’ approach to parallelization, which can be easily adapted for use on single computers or any high-throughput platform. The pipeline will also be available as a workspace on Terra shortly.

Though presently under testing in primates, the pipeline may be adapted for other non-human animal species, for which new GATK Best Practices are currently in development. Beyond animals, more appropriate GATK Best Practices might be available, e.g. via the GATK for Microbes pipeline.

Modes of operation:

The pipeline operates in one of three modes, to be specified in the input JSON file: initial, repeat or final. These comprise the major stages of a typical workflow, i.e. 1) mapping and BQSR, 2) repeat rounds of BQSR and 3) genotyping and VQSR and/or variant filtration. First-time users may want to run these sequentially, starting with raw FASTQ files and ending with a VCF. Returning and advanced users can simply change the mode to begin at a later stage.

Initial mode

In initial mode, the pipeline takes paired-end FASTQ reads or unmapped BAM files, 1) maps the reads to the reference genome, 2) performs an initial round of variant calling using HaplotypeCaller, 3) hard-filters those calls using standard parameters, and 4) uses the passing loci to perform BQSR. The unfiltered variant calls are then processed into Picard-style interval lists (which are packaged as polymorphicRegions.tar.gz) and outputted for use in repeat and final modes. The pipeline also outputs recalibrated BAM files for each individual, plus the tables and plots required to evaluate convergence.

Repeat mode

Repeat mode is only used when there are no 'gold standard' sets of known-variant sites. The mode should be run repeatedly, essentially 'bootstrapping' the BQSR step, until convergence (see examples here, though in practice, this doesn't usually require more than two rounds). Repeat mode operates largely as in initial mode, except the user provides (a) the recalibrated BAM files (as outputted either from initial mode, or from the most recent run in repeat mode) and (b) the polymorphicRegions.tar.gz file (outputted from initial mode) as inputs. Instead of calling variants across the entirety of each BAM file, repeat mode restricts variant calling to the known (or at least suspected!) polymorphic regions, as indicated by the interval lists in the tarball. This is faster and far less computationally expensive than re-calling haplotypes across the entire genome. Previously mapped BAM files (i.e. from other sources) can also be provided at this stage, in lieu of unmapped data.

image

Final mode

In final mode, the outputs from initial or repeat mode are used as inputs, i.e. fully recalibrated BAM files and the polymorphicRegions.tar.gz file comprising the interval lists of polymorphic regions. Variants are then called to intermediate GVCF files as follows:

  • If samples from multiple taxonomic groups are present, each group's samples are first joint-genotyped separately, then again all together across the entire cohort.
  • If 'gold standard' SNP and Indel sets are provided, these are used as 'truth' sets in VQSR, alongside hard-filtered 'training' sets generated by the pipeline. Recalibration is subsequently performed in series (Indels, then SNPs) across all unfiltered variant calls.
  • If only one 'gold standard' set is provided (e.g. SNPs), recalibration is first performed on all unfiltered variant calls using that truth set; the results are then merged with the hard-filtered calls from the other variant type (i.e. Indels, in this example).

At the conclusion of final mode, all samples are re-genotyped at the passing loci. The final output includes GenomicsDB of all samples (to which further samples can be added later, beyond this pipeline) plus the final genotypes (in gzipped VCF format) for downstream applications.

image

Requirements

Software version requirements :

  • GATK 4.2.3.0
  • Samtools 1.13
  • bwa 0.7.17
  • Python 3.9.7
  • Cromwell 71 (using WDL 1.0)
  • Docker support

Input requirements/expectations:

  • Cleaned-up paired-end reads in FASTQ format and/or unmapped BAM or CRAM files for initial mode (one pair/file per sample).
  • Mapped and de-duplicated BAM or CRAM files, with indexes, for repeat and final modes.
  • Read group information, either provided with the JSON input for FASTQ files, or included in the uBAM or BAM files.
  • A reference genome indexed using either bwa or bwa-mem2 (support for bwa-mem2 is planned for the longer term).
  • Truth SNP and/or Indel sets, if available, in either uncompressed or gzip-compressed VCF format with TBI index(es).
  • A list of 'scatters', either produced with ScatterIntervalsByNs (to be described in forthcoming documentation) or with the third-party Python tool, ScaffoldStitcher. This facilitates scatter-gathering across the reference genome.
  • If in repeat or final mode, the polymorphicRegions.tar.gz file produced in initial mode.

Configuring the JSON input file:

The pipeline is optimized for user inputs to be as simple and limited as possible. Beyond the required files (i.e. reference and index files; FASTQ, BAM or uBAM files; outputs from prior modes), all configuration and parameters are supplied in a single JSON input file.

Mandatory options

The following options must be set in the input JSON file:

Mode Option Type Description
All mode String initial, repeat, or final
All ref File Reference file in .vcf or .vcf.gz format
All ref_dict File Reference dictionary file in .dict format
All ref_fai File Reference index file in .fai format
Initial bwamem2 Boolean Default is false. Set 'true' to use bwa-mem2 instead of bwa mem.
Initial ref_amb File Required if using bwa mem or bwa-mem2
Initial ref_ann File Required if using bwa mem or bwa-mem2
Initial ref_pac File Required if using bwa mem or bwa-mem2
Initial ref_bwt File Required if using bwa mem or bwa-mem2
Initial ref_sa File Required if using bwa mem only
Initial ref_0123 File Required if using bwa-mem2 only
Initial ref_bwt_2bit_64 File Required if using bwa-mem2 only
Repeat/Final packaged_polymorphic_regions File Output from the prior run in initial mode.

The following must always be provided for each user-defined scatter, as a scatterList object:

Name Type Description
name String Name of user-defined scatter
intervals String Comma-separated list in chr:pos format, e.g. chr21:1-27800000

The following must always be provided for each sample, as a sampleList object:

Name Type Description
name String Name of sample or individual.
taxon_group String Taxonomic group name. This is used to group individuals by taxon.

The sampleList object must also contain the following, depending on mode and sample input type:

Input Type Name Mode Type Description
FASTQ: R1 Initial File First of the paired-end FASTQ files.
- R2 Initial File Second of the paired-end FASTQ files.
- RG_ID Initial String Read group ID.
- RG_SM Initial String Read group sample name.
- RG_PU Initial String Read group platform unit. Note this is used in BQSR, which models together all reads with the same PU.
uBAM: unmapped_bam Initial File Unmapped BAM file if not mapping from FASTQ; must contain all read group data.
BAM: bam Repeat/Final File Recalibrated BAM file, e.g. from previous mode, required in each of these modes.
- bam_index Repeat/Final File Recalibrated BAM file index, required in each of these modes.

Common optional inputs

The following options may also be set, depending on mode of operation:

Mode Option Type Description
Initial cram_not_bam Boolean Default is true; indicates that CRAM versus BAM format should be used throughout the workflow.
Initial flowcell_patterned Boolean Default is true; sets --optical-duplicate-pixel-distance in MarkDuplicatesSpark to 2500. Set false for 100.
Any truth_set_SNPs File? Truth set of known SNPs in .vcf or .vcf.gz format.
Any truth_set_SNPs_index File? Index to the above file, in .tbi format.
Any truth_set_INDELs File? Truth set of known Indels in .vcf or .vcf.gz format.
Any truth_set_INDELs_index File? Index to the above file, in .tbi format.
Any validate_truth_sets Boolean Set true to perform ValidateVariants on the truth sets.
Any merge_contigs_into_num_partitions Int? Optional parameter for GenomicsDBImport.
Any container_gatk String? Default is broadinstitute/gatk:4.2.0.0.
Any container_gitc String? Default is us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.5.7-2021-06-09_16-47-48Z.
Any container_python String? Default is python:3.9.5.
Any path_to_gitc String? Default is /usr/gitc.
Any path_to_gitc_gatk String? Default is /usr/gitc/gatk4/.

Other options are available for advanced users; specifically, to configure the resources assigned to container instances. If you know how to use these correctly, you'll also know how to find them!

Important notes and caveats to consider

  • The pipeline was originally written in the WDL Development Spec for use with Cromwell 65. However, because Terra supports only WDL 1.0 at present, the current version has been re-written to be 'backwards compatible' with WDL 1.0. For this reason, the code may not flow intuitively in places (e.g. some WDL dev functions are written as 'pseudo-functions' in separate WDLs for 1.0: see /workflows/functions/CollectByKey.wdl for an example). The upside is that these can be simply switched out when WDL is upgraded.
  • Allele-specific annotations are requested throughout, though full support of allele-specific filtering is not yet implemented.
  • If no truth sets are provided, or if only one truth set is provided, multi-allelic variants are first split to be biallelic (using LeftAlignAndTrimVariants) and then split into separate SNP and Indel files (using SplitVcfs). This is imperative, else — at least in allele-specific mode — mixed sites will fail to be processed correctly. If both truth sets are provided, the unfiltered variants are recalibrated with ApplyVQSR run in series, as recommended.
  • The --merge-contigs-into-num-partitions parameter is set to 0 (default), but can (and probably should) be configured to a higher value in the input JSON as explained here. We are still determining best practices for handling this when large numbers of contigs are in use, or when scatters/interval lists split scaffolds into smaller contigs. Note that this parameter requires whole contigs versus intervals.
  • Re-genotyping at the final callset capitalizes on the --include-non-variant-sites option in GenotypeGVCFs as implemented in GATK 4.0.12.0.
  • The pipeline currently uses the Broad's production container for 'Genomes in the Cloud', which uses outdated versions of all tools (i.e. GATK 4.1.8.0 versus 4.2.0.0; bwa 0.7.15 versus 0.7.17; samtools 1.11 versus 1.12). This container is only used when mapping reads or when running AnalyzeCovariates. Eventually, we'll replace this with a new container comprising the latest required software versions (per this issue).
  • When running AnalyzeCovariates, the required 'R' libraries are downloaded and installed each time. This is not optimal!

Acknowledgements

This pipeline is a collaborative effort by the Broad Institute (@bhanugandham) and the Wisconsin National Primate Research Center at the University of Wisconsin—Madison (@grahamlbanes), with support in part by the Office of the Director, National Institutes of Health (P51OD011106). Development was partially facilitated by the compute resources and assistance of the UW-Madison Center For High Throughput Computing (CHTC) in the Department of Computer Sciences. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation (NSF), and is an active member of the Open Science Grid, which is supported by NSF and the U.S. Department of Energy's Office of Science.

About

GATK For Primates Best Practices Protocol

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published