PublicPairedSingleSampleWf_170412.md


This document outlines the data transformation steps and describes some pertinent parameters of the PairedEndSingleSampleWf workflow. As the name implies, the workflow is specific to processing paired end reads for a single sample. The pipeline implements the first segment of the GATK Best Practices (ca. June 2016) for performing germline variant discovery in human whole-genome sequencing (WGS) data. It begins with unaligned paired reads in BAM format, proceeds through alignment against GRCh38/hg38 as the reference genome, pre-processing and variant calling with HaplotypeCaller, and results in a sample-level germline SNP and INDEL variant callset in GVCF format.

  • This document is specific to the public WDL script PublicPairedSingleSampleWf_170412.wdl with an April 12, 2017 date stamp found here.
  • The workflow uses Docker container broadinstitute/genomes-in-the-cloud:2.2.5-1486412288. Docker containers provide reproducible analysis environments, including specific tool versions. Thus, the commands and the given parameters pertain only to these specific tool versions. It is possible for prior or future versions of tools to give different results.
  • Furthermore, the parameters within the pipeline are optimized for WGS and GRCh38. Optimal performance on other types of data and other species may require changes to parameters. For example, Step 3 includes code that calculates how to split the reference contigs for parallel processing that caps the genomic interval per process by the longest contig in the reference. This splits human GRCh38 18-ways and caps each group of contigs by the size of human chromosome 1. This approach may need revision for efficient processing of genomes where the longest contig length deviates greatly from human.
  • The JSON files of inputs and options provided with this WDL constitute a runnable example, as all files and dependencies are publicly available. The read data provided was downsampled to run in a reasonable amount of time on most platforms, so it is not representative of full-scale data, but it does exercise all processing and analysis steps realistically.

The structure of this document distinguishes the WORKFLOW section and the TASKS that are called in the workflow. The WORKFLOW definition section provides an overview of the logic and organization of the data transformations and analyses performed in the pipeline. The TASK definitions section provides more granular details of each step, where each task generally corresponds to the invocation of one tool command. By convention, tasks are generally written before the workflow in WDL scripts, but here we go over the workflow first because understanding its overall logic makes the individual task definitions easier to follow. Note that quality control tasks are not mentioned in the description of the workflow to avoid distracting from the narrative flow of data transformations and analyses; their task definitions are summarized as a group at the end of this document.


Jump to a section

WORKFLOW definition overview

  1. Map with BWA-MEM and merge to create clean BAM
  2. Flag duplicates with MarkDuplicates
  3. Base quality score recalibration
  4. Call SNP and INDEL variants with HaplotypeCaller

TASK definitions overview


What is NOT covered

  • How to read WDL. This document assumes a basic understanding of WDL components.
  • The JSON files describing inputs and options.
  • Runtime parameters optimized for Broad's Google Cloud Platform implementation.
  • Interpretation of quality control results.

Related resources


Requirements

Software

  • See Article#6671 for setup options in using Cromwell to execute WDL workflows. For an introduction, see Article#7349.
  • Docker container broadinstitute/genomes-in-the-cloud:2.2.5-1486412288 uses the following tool versions for this pipeline. These tools in turn require Java JDK v8 (specifically 8u111 is used) and Python v2.7.
PICARD_VERSION="1.1150"*
GATK35_VERSION="3.5-0-g36282e4"
GATK36_VERSION="3.6-44-ge7d1cd2"
GATK4_VERSION="4.alpha-249-g7df4044"
SAMTOOLS_VERSION="1.3.1"
BWA_VERSION="0.7.15.r1140"
SVTOOLKIT_VERSION="2.00-1650"
TABIX_VERSION="0.2.5_r1005"
BGZIP_VERSION="1.3"

* Note that the Picard version given here does not use the same numbering as the public Picard releases.

Scripts and data

  • Pipeline WDL script and JSON file defining input data.
  • (Optional) JSON file defining additional Cromwell options.
  • Human whole genome paired end sequence reads in unmapped BAM (uBAM) format.
    • Each uBAM is per read group and the header defines read group ID, SM, LB, PL and optionally PU. Because each file is for the same sample, their SM fields will be identical. Each read has an RG tag.
    • Each uBAM has the same suffix, e.g. .unmapped.bam.
    • Each uBAM passes validation by ValidateSamFile.
    • Reads are in query-sorted order.
  • GRCh38 reference genome FASTA including alternate contigs, corresponding .fai index and .dict dictionary and six BWA-specific index files .alt, .sa, .amb, .bwt, .ann and .pac.
  • Known variant sites VCFs and corresponding indexes for masking during base quality score recalibration (BQSR). A separate article will document these data resources.
  • Interval lists for scattered variant calling with HaplotypeCaller. It is necessary to predefine the calling intervals carefully. In this workflow we use 50 .interval_list files that each contain multiple calling intervals. The calling intervals are an intersection of (i) calling regions of interest and (ii) regions bounded by Ns, otherwise known as gaps in the genome. See the External Resources section of Article#7857 for an example gap file. Use of these defined intervals has the following benefits.
    • Avoids calling twice on a locus. This can happen when reads overlapping an interval boundary expand the interval under consideration such that the same variant is called twice.
    • Makes calls taking into consideration all the available reads that align to the same locus.

WORKFLOW definition overview

L1087 opens the WORKFLOW block; the workflow name is PairedEndSingleSampleWorkflow.

After the workflow name, we start the WORKFLOW body by listing the variables that can stand in for files, parameters or even parts of commands within tasks, e.g. the command for BWA alignment (L1128). The actual file paths and values are provided in an accompanying JSON file.

Then we list the calls to tasks that will be executed. Calls may be listed in any order, as the order in which steps are run only depends on how they're connected to other tasks (and therefore, when their inputs become available), but we like to list them in the order that is most intuitively understandable for humans.
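To make the dependency-driven ordering concrete, here is a minimal, self-contained sketch, not taken from the pipeline script; the task and variable names are invented for illustration. It shows that a call listed first can still depend on one listed after it, because Cromwell orders execution by data dependencies rather than by the order of the call statements.

```
task StepOne {
  File in_file
  command {
    cp ${in_file} step_one.out
  }
  output {
    File out = "step_one.out"
  }
}

task StepTwo {
  File in_file
  command {
    cp ${in_file} step_two.out
  }
  output {
    File out = "step_two.out"
  }
}

workflow OrderingExample {
  # Workflow-level variables stand in for files and parameters; actual
  # values would come from an accompanying inputs JSON.
  File input_file

  # Calls may be listed in any order; Cromwell runs StepOne first because
  # StepTwo consumes its output.
  call StepTwo { input: in_file = StepOne.out }
  call StepOne { input: in_file = input_file }
}
```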

Let's break down the workflow into logical steps and examine their component commands.


1. Map with BWA-MEM and merge to create clean BAM

This step takes the unaligned BAM, aligns with BWA-MEM, merges information between the unaligned and aligned BAM and fixes tags and sorts the BAM.

  • 1.0: L1134 - Call task GetBwaVersion to note the version of BWA for use in step [1.3] and call CheckFinalVcfExtension to make sure the final output is named according to our convention.
  • 1.1: L1143 - Define a "scatter" block to parallelize the next set of operations (including [1.2], [1.3] and [1.4]) over each unmapped_bam in the list of BAMs (given by the variable flowcell_unmapped_bams). These "flowcell BAMs" correspond to subsets of the data that were run on different flowcells (because the sample library was multiplexed) and therefore are identified as separate "readgroups". As a result of the scatter definition, each flowcell BAM (also often called "readgroup BAM") will be processed independently through this segment of the pipeline.

▶︎ Observe the nesting of commands via their relative indentation. Our script writers use these indentations not because they make a difference for Cromwell interpretation but because they allow us human readers to visually comprehend where the scattering applies.

  • 1.2: L1161 - Call task SamToFastqAndBwaMem to map reads to the reference using BWA-MEM. We use bwa_commandline, the BWA command string defined among the workflow inputs, as the actual command.
  • 1.3: L1180 - Call task MergeBamAlignment to merge information between the unaligned and aligned BAM. Both bwa_commandline and bwa_version define elements of the bwamem program group @PG line in the BAM header. The data resulting from this step go on to step [2].
  • 1.4: L1204 - Call task SortAndFixTags and use task aliasing to rename it SortAndFixReadGroupBam. The consequence of this is that the workflow can differentiate these outputs from those of a later call made to the same task on different inputs in [2.1] (see the sketch after this list). This task coordinate sorts and indexes the alignments and fixes the NM and UQ tags whose calculations depend on coordinate sort order. This data transformation allows for validation with ValidateSamFile, which is covered briefly in the section on Quality Control tasks.
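The scatter in [1.1] and the aliasing in [1.4] can be pictured with a stripped-down, hypothetical sketch; the task body below is a placeholder (a simple copy), not the real SortAndFixTags command.

```
task SortAndFixTags {
  File input_bam
  command {
    # Placeholder; the real task pipes SortSam into SetNmAndUqTags.
    cp ${input_bam} sorted.fixed.bam
  }
  output {
    File output_bam = "sorted.fixed.bam"
  }
}

workflow ScatterAndAliasExample {
  Array[File] flowcell_unmapped_bams

  # Everything inside the scatter block runs once per readgroup BAM.
  scatter (unmapped_bam in flowcell_unmapped_bams) {
    # The alias lets the workflow refer to these outputs as
    # SortAndFixReadGroupBam.output_bam, distinct from a later call to the
    # same task (aliased SortAndFixSampleBam in the pipeline).
    call SortAndFixTags as SortAndFixReadGroupBam {
      input: input_bam = unmapped_bam
    }
  }
}
```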

2. Flag duplicates with MarkDuplicates

This step aggregates sample BAMs, flags duplicate sets, fixes tags and coordinate-sorts. It starts with the output of [1.3].

  • 2.0: L1235 - Call task MarkDuplicates to accomplish two aims. First, since MarkDuplicates is given all the files output by the MergeBamAlignment task, and these by design all belong to the same sample, we effectively aggregate the starting BAMs into a single sample-level BAM. Second, MarkDuplicates flags reads it determines are from duplicate inserts with the 0x400 bitwise SAM flag. Because MarkDuplicates sees query-grouped read alignment records from the output of MergeBamAlignment, it will also mark as duplicate the unmapped mates and supplementary alignments within the duplicate set. The WDL fragment after this list illustrates how the scattered MergeBamAlignment outputs gather into the array that MarkDuplicates receives.
  • 2.1: L1244 - Call task SortAndFixTags and renames it as SortAndFixSampleBam to differentiate outputs from those of the earlier call made to the same task. This task coordinate sorts and indexes the alignments and fixes the NM and UQ tags whose calculations depend on coordinate sort order. Resulting data go on to step [3].
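The aggregation in [2.0] relies on WDL's implicit gather: referencing a scattered call's output outside the scatter block yields an array that can be handed to a single downstream call. A minimal, hypothetical sketch follows; the names and commands are invented, not copied from the script.

```
task MakeShard {
  String name
  command {
    echo ${name} > shard.txt
  }
  output {
    File shard = "shard.txt"
  }
}

task GatherShards {
  Array[File] shards
  command {
    # The real pipeline passes each BAM to MarkDuplicates as a separate
    # INPUT= argument; concatenation here just demonstrates the gather.
    cat ${sep=" " shards} > gathered.txt
  }
  output {
    File gathered = "gathered.txt"
  }
}

workflow ImplicitGatherExample {
  Array[String] readgroups = ["rg1", "rg2", "rg3"]

  scatter (rg in readgroups) {
    call MakeShard { input: name = rg }
  }

  # MakeShard.shard referenced outside the scatter is an Array[File]; this is
  # how MarkDuplicates receives all of the MergeBamAlignment outputs at once.
  call GatherShards { input: shards = MakeShard.shard }
}
```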

3. Base quality score recalibration

This step creates intervals for scattering, performs BQSR, merges back the scattered results into a single file and finally compresses the BAM to CRAM format.

  • 3.0: L1267 - Call task CreateSequenceGroupingTSV to create intervals from the reference genome's .dict dictionary for subsequent use in [3.1] and [3.3].
  • 3.1: L1286 - Define a scatter across the intervals created in [3.0] and call task BaseRecalibrator to act on each interval of the BAM from [2.1], resulting in a recalibration table per interval.
  • 3.2: L1307 - Call task GatherBqsrReports to consolidate the per-interval recalibration tables into a single recalibration table whose sums reflect the consolidated data.
  • 3.3: L1315 - Define a scatter across the intervals created in [3.0] and call task ApplyBQSR to use the GatherBqsrReports.output_bqsr_report from [3.2] and apply the recalibration to the BAM from [2.1] per interval defined by [3.0]. Each resulting recalibrated BAM will contain alignment records from the specified interval including unmapped reads from singly mapping pairs. These unmapped records retain SAM alignment information, e.g. mapping contig and coordinate information, but have an asterisk * in the CIGAR field.
  • 3.4: L1334 - Call task GatherBamFiles to concatenate the recalibrated BAMs from [3.3], in order, into a single indexed BAM that retains the header from the first BAM. Resulting data go on to step [4]. (The sketch after this list shows the overall scatter-gather shape of this step.)
  • 3.5: L1427 - Call task ConvertToCram to compress the BAM further into reference-dependent indexed CRAM format. Note that there is also a CramToBam task that roundtrips the data back into BAM format; the purpose is to check that nothing was lost in the conversion process. The corresponding quality check will fail if you use a CRAM compression setting that produces lossy compression (which we do not use here).
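The shape of this step, two scatters over the same contig groupings with a gather in between and a gather at the end, can be sketched as follows. This is a structural illustration only: the task names and bodies are trivial stand-ins for BaseRecalibrator, GatherBqsrReports, ApplyBQSR and GatherBamFiles, not their real commands.

```
task PerGroupRecalibrate {
  String group
  command {
    echo "report for ${group}" > report.txt
  }
  output {
    File report = "report.txt"
  }
}

task GatherReports {
  Array[File] reports
  command {
    cat ${sep=" " reports} > gathered_report.txt
  }
  output {
    File gathered_report = "gathered_report.txt"
  }
}

task PerGroupApply {
  String group
  File gathered_report
  command {
    # The real ApplyBQSR task applies the consolidated recalibration to the
    # BAM records of this contig group.
    cat ${gathered_report} > applied.txt
  }
  output {
    File out = "applied.txt"
  }
}

task GatherOutputs {
  Array[File] shards
  command {
    cat ${sep=" " shards} > final.txt
  }
  output {
    File final_output = "final.txt"
  }
}

workflow BqsrShapeExample {
  # Stand-in for the groupings produced by CreateSequenceGroupingTSV.
  Array[String] sequence_groupings = ["group1", "group2", "group3"]

  # First scatter: one recalibration report per contig group.
  scatter (grouping in sequence_groupings) {
    call PerGroupRecalibrate { input: group = grouping }
  }

  # Gather the per-group reports into one consolidated report.
  call GatherReports { input: reports = PerGroupRecalibrate.report }

  # Second scatter: apply the consolidated report per contig group.
  scatter (grouping in sequence_groupings) {
    call PerGroupApply {
      input: group = grouping, gathered_report = GatherReports.gathered_report
    }
  }

  # Final gather: concatenate the recalibrated shards.
  call GatherOutputs { input: shards = PerGroupApply.out }
}
```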

4. Call SNP and INDEL variants with HaplotypeCaller

This final step uses HaplotypeCaller to call variants over intervals then merges data into a GVCF for the sample, the final output of the workflow.

  • 4.0: L1464 - Define a scatter across intervals (defined within the JSON file under scattered_calling_intervals) to parallelize the HaplotypeCaller GVCF generation. We use only the primary assembly contigs of GRCh38, grouped into 50 interval lists, to call variants. Within the GRCh38 interval lists, the primary assembly's contigs are divided by contiguous regions between regions of Ns. The called task then uses this list of regions to parallelize the task via the -L ${interval_list} option.

▶︎ For this pipeline workflow's setup, a fifty-way scatter makes sense for a genome of 3 billion basepairs. However, given the same setup, the 50-way split is overkill for a genome of 370 million basepairs, as in the case of the pufferfish.

  • 4.1: L1467 - Call task HaplotypeCaller to call SNP and INDEL variants on the BAM from [3.4] per interval, resulting in a GVCF-format variant call file for each interval.

  • 4.2: L1483 - Call task MergeVCFs to merge the per-interval GVCFs into one single-sample GVCF. The tool concatenates the GVCFs in the same order as the input_vcfs array, which is ordered by contig.

  • 4.3: L1522 - Define the files to copy to an output directory, given an OPTIONS JSON file that defines the output directory. If you omit the OPTIONS JSON, or omit defining the outputs directory within it, then the workflow skips this step.


TASK DEFINITIONS

CheckFinalVcfExtension

This task checks that the final GVCF filename that is going to be used ends with ".g.vcf.gz". This is a GATK convention that is intended to help differentiate GVCFs from regular VCFs. There is also a practical consequence: when we use this convention, the GATK engine recognizes that it needs to adjust some settings when it writes the output file's index, and so we don't need to specify those settings manually in our command. Note that we still specify those arguments in this pipeline to be really explicit, even though they are not necessary.

GetBwaVersion

This task obtains the version of BWA to later notate within the BAM program group (@PG) line. This is because we like to keep a record of all commands that were run on the data within the file headers.
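A hedged sketch of what such a version-capture task can look like; the executable path and the exact parsing in the real script may differ.

```
task GetBwaVersion {
  command {
    # bwa prints its usage, including a "Version:" line, to stderr when run
    # without arguments; grep and sed extract just the version string.
    # (In the pipeline's Docker image the executable may live under a tool
    # directory rather than on the PATH.)
    bwa 2>&1 | grep -e '^Version' | sed 's/Version: //'
  }
  output {
    String version = read_string(stdout())
  }
}
```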

SamToFastqAndBwaMem

The input to this task is an unaligned queryname-sorted BAM and the output is an aligned query-grouped BAM. This step pipes three processes: (i) conversion of BAM to FASTQ reads, (ii) alternate-contig-aware alignment with BWA-MEM and (iii) conversion of SAM to BAM reads. BWA-MEM requires FASTQ reads as input and produces SAM format reads. This task maps the reads using the BWA command defined as a string variable.

The alt-aware alignment depends on the use of GRCh38 as the reference, BWA version 0.7.13 or later, and the presence of BWA's ALT index from bwa-kit. If the ref_alt ALT index has no content or is not present, then the script exits with an error (exit 1). What this means is that this task is only compatible with a reference with ALT contigs and it only runs in an alt-aware manner.
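A hedged sketch of the three-process pipe described above. The tool paths, the jar location and the exact bwa options are assumptions here; the real task takes its BWA options from the bwa_commandline input and also localizes the reference index files (including the .alt index), which are omitted to keep the sketch short.

```
task SamToFastqAndBwaMem {
  File input_bam
  File ref_fasta

  command {
    set -o pipefail
    # (i) BAM -> interleaved FASTQ, (ii) alt-aware BWA-MEM alignment,
    # (iii) SAM -> BAM.
    java -jar /usr/gitc/picard.jar SamToFastq \
        INPUT=${input_bam} FASTQ=/dev/stdout INTERLEAVE=true NON_PF=true | \
    bwa mem -p -v 3 -t 16 ${ref_fasta} /dev/stdin | \
    samtools view -b -o aligned.unsorted.bam -
  }
  output {
    File output_bam = "aligned.unsorted.bam"
  }
}
```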

MergeBamAlignment

This step takes an unmapped BAM and the aligned BAM and merges information from each. Reads, sequence and quality information and meta information from the unmapped BAM merge with the alignment information in the aligned BAM. The BWA version the script obtains from task GetBwaVersion is used here in the program group (@PG) bwamem. What is imperative for this step, though only implied by the script, is that the sort order of the unmapped and aligned BAMs is identical, i.e. query-group sorted. The BWA-MEM alignment step outputs reads in exactly the same order as they are input and so groups mates, secondary and supplementary alignments together for a given read name. The merging step requires that both files maintain this ordering and will produce a final merged BAM in the same query-grouped order given the SORT_ORDER="unsorted" parameter. This has implications for how the MarkDuplicates task will flag duplicate sets.

Because the ATTRIBUTES_TO_RETAIN option is set to X0, any aligner-specific tags that are literally X0 will carry over to the merged BAM. BWA-MEM does not output such a tag but does output XS and XA tags for suboptimal alignment score and alternative hits, respectively. However, these do not carry over into the merged BAM. Merging retains certain tags from either input BAM (RG, SA, MD, NM, AS and OQ if present), replaces the PG tag as the command below defines and adds new tags (MC, MQ and FT).

▶︎ Note the NM tag values will be incorrect at this point and the UQ tag is absent. Update and addition of these are dependent on coordinate sort order. Specifically, the script uses a separate SortAndFixTags task to fix NM tags and add UQ tags.

The UNMAP_CONTAMINANT_READS=true option applies to likely cross-species contamination, e.g. bacterial contamination. MergeBamAlignment identifies as contaminant those reads that (i) are softclipped on both ends and (ii) map with fewer than 32 basepairs. For a similar feature in GATK, see OverclippedReadFilter. If MergeBamAlignment determines a read is contaminant, then the mate is also considered contaminant. MergeBamAlignment unmaps the pair of reads by (i) setting the 0x4 flag bit, (ii) replacing column 3's contig name with an asterisk *, (iii) replacing columns 4 and 5 (POS and MAPQ) with zeros, and (iv) adding the FT tag to indicate the reason for unmapping the read, e.g. FT:Z:Cross-species contamination. The records retain their CIGAR strings. Note other processes also use the FT tag, e.g. to indicate reasons for setting the QCFAIL 0x200 flag bit, and will use different tag descriptions.
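A hedged excerpt of a MergeBamAlignment invocation matching the options discussed above; the jar path is an assumption, and the actual task sets a number of additional parameters not shown here.

```
task MergeBamAlignment {
  File unmapped_bam
  File aligned_bam
  File ref_fasta
  String bwa_version
  String bwa_commandline

  command {
    java -jar /usr/gitc/picard.jar MergeBamAlignment \
      UNMAPPED_BAM=${unmapped_bam} \
      ALIGNED_BAM=${aligned_bam} \
      REFERENCE_SEQUENCE=${ref_fasta} \
      OUTPUT=merged.unsorted.bam \
      SORT_ORDER="unsorted" \
      ATTRIBUTES_TO_RETAIN=X0 \
      PROGRAM_RECORD_ID="bwamem" \
      PROGRAM_GROUP_NAME="bwamem" \
      PROGRAM_GROUP_VERSION="${bwa_version}" \
      PROGRAM_GROUP_COMMAND_LINE="${bwa_commandline}" \
      UNMAP_CONTAMINANT_READS=true
  }
  output {
    File output_bam = "merged.unsorted.bam"
  }
}
```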

SortAndFixTags

This task (i) sorts reads by coordinate and then (ii) corrects the NM tag values, adds UQ tags and indexes the BAM. The task pipes the two commands. First, SortSam sorts the records by genomic coordinate using the SORT_ORDER="coordinate" option. Second, SetNmAndUqTags calculates and fixes the UQ and NM tag values in the BAM. Because CREATE_INDEX=true, SetNmAndUqTags creates the .bai index. As in several other tasks, we create an MD5 file with the CREATE_MD5_FILE=true option (see the MarkDuplicates task below for more on MD5 files).

As mentioned in the MergeBamAlignment task, tag values dependent on coordinate-sorted records require correction in this separate task given this workflow maintains query-group ordering through the pre-processing steps.
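A hedged sketch of the piped pair of commands; the jar path and output naming are assumptions, and the reference index files that SetNmAndUqTags needs alongside the FASTA are omitted for brevity.

```
task SortAndFixTags {
  File input_bam
  File ref_fasta
  String output_basename

  command {
    java -jar /usr/gitc/picard.jar SortSam \
      INPUT=${input_bam} \
      OUTPUT=/dev/stdout \
      SORT_ORDER="coordinate" \
      CREATE_INDEX=false | \
    java -jar /usr/gitc/picard.jar SetNmAndUqTags \
      INPUT=/dev/stdin \
      OUTPUT=${output_basename}.bam \
      REFERENCE_SEQUENCE=${ref_fasta} \
      CREATE_INDEX=true \
      CREATE_MD5_FILE=true
  }
  output {
    File output_bam = "${output_basename}.bam"
    File output_bam_index = "${output_basename}.bai"
    File output_bam_md5 = "${output_basename}.bam.md5"
  }
}
```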

MarkDuplicates

This task flags duplicate reads. Because the input is query-group-sorted, MarkDuplicates flags with the 0x400 bitwise SAM flag duplicate primary alignments as well as the duplicate set's secondary and supplementary alignments. Also, for singly mapping mates, duplicate flagging extends to cover unmapped mates. These extensions are features that are only available to query-group-sorted BAMs.

This command uses the ASSUME_SORT_ORDER="queryname" parameter to tell the tool the sort order to expect. Within the context of this workflow, at the point this task is called, we will have avoided any active sorting that would set a sort order in the BAM header. We know that our original flowcell BAM is queryname-sorted and that BWA-MEM maintains this order to give us query-grouped alignments.

The OPTICAL_DUPLICATE_PIXEL_DISTANCE of 2500 is set for Illumina sequencers that use patterned flowcells to estimate the number of sequencer duplicates. Sequencer duplicates are a subspecies of the duplicates that the tool flags. The Illumina HiSeq X and HiSeq 4000 platforms use patterned flowcells. If estimating library complexity (see section Duplicate metrics in brief) is important to you, then adjust the OPTICAL_DUPLICATE_PIXEL_DISTANCE appropriately for your sequencer platform.

Finally, in this task and others, we produce an MD5 file with the CREATE_MD5_FILE=true option. This creates a 128-bit hash value using the MD5 algorithm that is to files much like a fingerprint is to an individual. Compare MD5 values to verify data integrity, e.g. after moving or copying large files.
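A hedged sketch of the MarkDuplicates invocation with the parameters discussed above; the jar path is an assumption, and memory settings and other arguments in the real task are omitted.

```
task MarkDuplicates {
  Array[File] input_bams
  String output_basename

  command {
    java -jar /usr/gitc/picard.jar MarkDuplicates \
      INPUT=${sep=" INPUT=" input_bams} \
      OUTPUT=${output_basename}.bam \
      METRICS_FILE=${output_basename}.duplicate_metrics \
      ASSUME_SORT_ORDER="queryname" \
      OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
      CREATE_MD5_FILE=true
  }
  output {
    File output_bam = "${output_basename}.bam"
    File duplicate_metrics = "${output_basename}.duplicate_metrics"
    File output_bam_md5 = "${output_basename}.bam.md5"
  }
}
```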

CreateSequenceGroupingTSV

This task uses a Python script, written as a single command using heredoc syntax, to create a list of contig groupings. The workflow uses these intervals to scatter the base quality score recalibration step [3] that calls on the BaseRecalibrator and ApplyBQSR tasks.

This workflow specifically uses Python v2.7.

The input to the task is the reference .dict dictionary that lists contigs. The code takes the information provided by the SN and LN tags of each @SQ line in the dictionary to pair the information in a tuple list. The SN tag names a contig while the LN tag measures the contig length. This list is ordered by descending contig length.

The contig groupings this command creates are in WDL array format, where each line represents a group and each group's members are tab-separated. The command adds contigs to each group from the previously length-sorted list in descending order and caps the sum of member lengths by the first contig's sequence length (the longest contig). This has the effect of somewhat evenly distributing sequence per group. For GRCh38, CreateSequenceGroupingTSV-stdout.log shows 18 such groups.

As the code adds contig names to groups, it adds a :1+ to the end of each name. This is to protect the names from downstream tool behavior that removes elements after the last : within a contig name. GRCh38 introduces contig names that include colons, and removing the last element makes certain contigs indistinguishable from others. With this appendage, we preserve the original contig names through downstream processes. GATK v3.5 and prior versions require this addition.
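A simplified sketch of the grouping logic as a Python heredoc inside a WDL task. This is illustrative only: the real task also writes a second grouping that includes the unmapped reads, and differs in its exact bookkeeping and output handling.

```
task CreateSequenceGroupingTSV {
  File ref_dict

  command <<<
    python <<CODE
    # Pair SN (contig name) with LN (contig length) from each @SQ line of the
    # dictionary, then build tab-separated groups whose summed length stays at
    # or below the longest contig; ":1+" protects names containing colons.
    tuples = []
    with open("${ref_dict}", "r") as handle:
        for line in handle:
            if line.startswith("@SQ"):
                fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
                tuples.append((fields["SN"], int(fields["LN"])))
    tuples.sort(key=lambda t: t[1], reverse=True)

    longest = tuples[0][1]
    groups, current, running = [], [], 0
    for name, length in tuples:
        if current and running + length > longest:
            groups.append(current)
            current, running = [], 0
        current.append(name + ":1+")
        running += length
    if current:
        groups.append(current)

    with open("sequence_grouping.txt", "w") as out:
        out.write("\n".join("\t".join(group) for group in groups) + "\n")
    CODE
  >>>
  output {
    Array[Array[String]] sequence_grouping = read_tsv("sequence_grouping.txt")
  }
}
```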

BaseRecalibrator

The task runs BaseRecalibrator to detect errors made by the sequencer in estimating base quality scores. BaseRecalibrator builds a model of covariation from mismatches in the alignment data while excluding known variant sites and creates a recalibration report for use in the next step. The engine parameter --useOriginalQualities asks BaseRecalibrator to use original sequencer-produced base qualities stored in the OQ tag if present or otherwise use the standard QUAL score. The known sites files should include sites of known common SNPs and INDELs.

This task runs per interval grouping defined by each -L option. The sep in -L ${sep=" -L " sequence_group_interval} ensures each interval in the sequence_group_interval list is passed to the command with its own -L.
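For example, if a scattered sequence_group_interval value were the hypothetical list ["chr16:1+", "chr17:1+", "chr18:1+"], the fragment -L ${sep=" -L " sequence_group_interval} would expand to -L chr16:1+ -L chr17:1+ -L chr18:1+ on the command line.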

GatherBqsrReports

This task consolidates the recalibration reports from each sequence group interval into a single report using GatherBqsrReports, so that the recalibration will be based on the observations made on all the data, not just per-batch of contigs.

ApplyBQSR

The task uses ApplyBQSR and the recalibration report to correct base quality scores in the BAM. Again, using parallelization, this task applies recalibration for the sequence intervals defined with -L. A resulting recalibrated BAM will contain only reads for the intervals in the applied intervals list.

GatherBamFiles

This task concatenates the provided BAMs, in order, into a single BAM and retains the header of the first file. For this pipeline, this includes the recalibrated sequence-grouped BAMs and the recalibrated unmapped-reads BAM. For GRCh38, this makes 19 BAM files that the task concatenates together. The resulting BAM is already in coordinate-sorted order. The task creates a new index and MD5 file for the concatenated BAM.

ConvertToCram

This task compresses a BAM to the even smaller CRAM format using the -C option of Samtools. The task then indexes the CRAM and renames the index from {basename}.cram.crai to {basename}.crai. CRAM is a fairly new format and tool developers are actively refining features for compatibility. Make sure your tool chain is compatible with CRAM before deleting BAMs. Be aware when using CRAMs that you will have to specify the identical reference genome, not just an equivalent reference, with matching MD5 hashes for each contig. These can differ if the capitalization of reference sequences differs.
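A hedged sketch of the conversion and index renaming; the option spellings follow standard samtools usage, and the exact file naming in the real task may differ.

```
task ConvertToCram {
  File input_bam
  File ref_fasta
  File ref_fasta_index
  String output_basename

  command {
    # -C writes CRAM; the reference must be the exact one used for alignment
    # because CRAM compression is reference-relative.
    samtools view -C -T ${ref_fasta} -o ${output_basename}.cram ${input_bam}
    samtools index ${output_basename}.cram
    # samtools index writes <name>.cram.crai; rename to <name>.crai.
    mv ${output_basename}.cram.crai ${output_basename}.crai
  }
  output {
    File output_cram = "${output_basename}.cram"
    File output_cram_index = "${output_basename}.crai"
  }
}
```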

HaplotypeCaller

This task runs HaplotypeCaller on the recalibrated BAM for given intervals and produces variant calls in GVCF format. HaplotypeCaller reassembles and realigns reads around variants and calls genotypes and genotype likelihoods for single nucleotide polymorphism (SNP) and insertion and deletion (INDEL) variants. Proximal variants are phased. The resulting file is a GZ-compressed, valid VCF format file with extension .g.vcf.gz, containing variants for the given interval.

  • The WORKFLOW's step 4 defines whether any parallelization is applied.

The -ERC GVCF or emit reference confidence mode has two key consequences. First, the tool will emit block records for non-variant sites, which allows us to record information about the quality and quantity of the evidence observed there, and estimate how confident we are that sites in these blocks are actually non-variant. Second, for each record in the GVCF call, the tool will include a symbolic <NON_REF> non-reference allele and will calculate phred-scaled likelihoods (PL annotation) for all possible genotypes. See the GATK documentation for more information on this functionality.

  • The --max_alternate_alleles parameter is set to 3 for performance optimization. This does not limit the alleles that are genotyped, only the number of alleles that HaplotypeCaller emits.
  • We specify -variant_index_parameter 128000 and -variant_index_type LINEAR to set the indexing strategy for the output GVCF (though it's not strictly necessary when we use the ".g.vcf" output naming convention). See Article#3893 for details on this convention.
  • The command invokes an additional read filter, OverclippedRead, with --read_filter OverclippedRead; it removes reads that are likely from foreign contaminants, e.g. bacterial contamination. The filter defines such reads as those that align with less than 30 basepairs and are softclipped on both ends of the read. This option is similar to the MergeBamAlignment task's UNMAP_CONTAMINANT_READS=true option that unmaps contaminant reads less than 32 basepairs. A hedged sketch of the full command follows this list.
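Putting the parameters above together, a hedged sketch of the per-interval HaplotypeCaller call in GATK 3.x syntax; the jar path is an assumption, and the real task sets additional arguments (e.g. memory and contamination handling) not shown here.

```
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  File interval_list
  File ref_fasta
  File ref_fasta_index
  File ref_dict
  String gvcf_basename

  command {
    java -jar /usr/gitc/GenomeAnalysisTK.jar \
      -T HaplotypeCaller \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -L ${interval_list} \
      -o ${gvcf_basename}.g.vcf.gz \
      -ERC GVCF \
      --max_alternate_alleles 3 \
      -variant_index_parameter 128000 \
      -variant_index_type LINEAR \
      --read_filter OverclippedRead
  }
  output {
    File output_gvcf = "${gvcf_basename}.g.vcf.gz"
    File output_gvcf_index = "${gvcf_basename}.g.vcf.gz.tbi"
  }
}
```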

MergeVCFs

The task uses MergeVcfs to combine multiple VCF files resulting from scatter-gather parallelization into a single VCF file and index. Note that in this case we are applying it to GVCFs; the tool is agnostic to the differences between regular VCFs and GVCFs, and just cares that the files are valid VCFs (which GVCFs are).

Quality Control tasks

A considerable number of quality control tasks are run in the course of this workflow. We do not recapitulate them all here, but we note that there are three general categories of QC tasks: format validation, identity validation, and metrics collection.

  • Format validation aims to verify that the outputs are correctly formatted, to ensure that they will go through the rest of the pipeline smoothly. You can read more about BAM file validation in Article#7571.
  • Identity validation aims to verify that the read data we are working with belongs to the sample identified in the inputs. This is done to detect cases where materials or data files got mislabeled (generally through human error), resulting in either partial swaps (e.g. one readgroup does not belong to the same sample as the others) or total swaps (all the data belongs to a different sample). The most important identity validation is based on "fingerprinting", i.e. comparing genotypes determined on-the-fly from the read data with genotypes determined by running the original biological sample through a genotyping array (which we do for every sample that enters our facility). This allows us to detect sample swaps extremely accurately. As secondary validation measures, we also compare that all read groups support the same genotypes (to detect partial swaps) and we estimate cross-sample contamination. The contamination estimate is used later to inform the variant calling process.
  • Metrics collection aims to evaluate the quality of the data at multiple levels. This allows us to flag problem samples early in the process and continuously monitor the quality of the outputs of our facilities.

In addition, the WDL script includes some basic descriptive comments for all QC calls and tasks.