Added and Updated Javadocs for SortSamSpark and MarkDuplicatesSpark.
jamesemery committed Mar 7, 2019
1 parent 342569a commit 3570af9
Showing 2 changed files with 96 additions and 53 deletions.
@@ -1,6 +1,7 @@
package org.broadinstitute.hellbender.tools.spark.pipelines;

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.broadinstitute.barclay.argparser.Argument;
@@ -18,6 +19,43 @@
import java.util.Collections;
import java.util.List;


/**
* SortSam on Spark (works on SAM/BAM/CRAM)
*
* <p>A Spark implementation of <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_SortSam.php'>Picard SortSam</a>. The Spark version can run in parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching the output of the single-core Picard version. See <a href="https://software.broadinstitute.org/gatk/blog?id=23420">Blog#23420</a> for performance benchmarks.</p>
*
* <p>The tool sorts reads by coordinate order by default, or alternatively by read name (the QNAME field) when given the '-SO queryname' option. The contig ordering in the reference dictionary defines coordinate order, and the tool uses the sequence dictionary represented by the @SQ header lines, or that of the optionally provided reference, to sort reads by the RNAME field. For reads mapping to a contig, coordinate sorting further orders reads by the POS field of the SAM record, which contains the leftmost mapping position.</p>
*
* <p>To queryname-sort, the tool first groups reads by readname and then deterministically sorts within each readname set by orientation and by the secondary and supplementary SAM flags. For paired-end reads, the reads in a pair share the same queryname. Because aligners can generate secondary and supplementary alignments, a queryname group can consist of more than two records for a paired-end pair.</p>
*
* <h3>Usage examples</h3>
* Coordinate-sort aligned reads using all cores available locally
* <pre>
* gatk SortSamSpark \
* -I aligned.bam \
* -O coordinatesorted.bam
* </pre>
*
* Queryname-sort reads on a Spark cluster, using five executors with four cores each
* <pre>
* gatk SortSamSpark \
* -I coordinatesorted.bam \
* -SO queryname \
* -O querygroupsorted.bam \
* -- \
* --spark-runner SPARK \
* --spark-master <SPARK-CLUSTER-NAME> \
* --num-executors 5 \
* --executor-cores 4
* </pre>
*
* <h3>Notes</h3>
* <ol>
* <li>This Spark tool performs a significant amount of disk I/O. Run with both the input data and outputs on high-throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, requisition machines with LOCAL SSDs for best performance.</li>
* <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR' (a sketch follows this list). See the discussion at <a href="https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337">this forum thread</a> for details.</li>
* </ol>
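*
* For example, a local coordinate-sort that also follows the second note above and points the Spark temp directory at an SSD-backed path (the path is a placeholder; adjust it for your machine):
* <pre>
* gatk SortSamSpark \
* -I aligned.bam \
* -O coordinatesorted.bam \
* --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'
* </pre>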
*/
@DocumentedFeature
@CommandLineProgramProperties(summary = "Sorts the input SAM/BAM/CRAM",
oneLineSummary = "SortSam on Spark (works on SAM/BAM/CRAM)",
@@ -35,80 +35,85 @@
import java.util.*;

/**
* <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
* parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
* the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
* while it groups the read information, it is recommended to run this tool on a machine/configuration
* with at least 8 GB of memory overall for a typical 30x bam.</p>
* MarkDuplicates on Spark
*
* <p>This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
* defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
* construction using PCR. See also "<a href='https://broadinstitute.github.io/picard/command-line-overview.html#EstimateLibraryComplexity'>EstimateLibraryComplexity</a>"
* for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
* incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are
* referred to as optical duplicates.</p>
* <p>This is a Spark implementation of <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_MarkDuplicates.php'>Picard MarkDuplicates</a> that allows the tool to be run in parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching the output of the non-Spark Picard version of the tool. Since the tool requires holding all of the readnames in memory while it groups read information, machine configuration and starting sort-order impact tool performance. </p>
*
* <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
* After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
* reads by the sums of their base-quality scores (default method).</p>
* Here are some differences of note between MarkDuplicatesSpark and Picard MarkDuplicates.
*
* <p>The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each
* read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
* If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>
* <ul>
* <li>MarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline</a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR (a sketch of that step follows this list).</li>
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool spends additional time first queryname-sorting the reads internally, which can make processing up to 2x slower under some circumstances.</li>
* <li>Because MarkDuplicatesSpark queryname-sorts coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it flags duplicate sets that include secondary, supplementary, and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves when given differently sorted inputs.</li>
* <li>Collecting duplicate metrics slows down performance, so metrics collection is optional in the Spark version and must be requested with '-M'. It is possible to collect the metrics with the standalone Picard tool <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li>
* <li>MarkDuplicatesSpark is optimized to run locally on a single machine by leveraging core parallelism that MarkDuplicates and SortSam cannot. On the same data it typically runs about 15% faster than MarkDuplicates plus SortSam with 2 cores, and it scales linearly up to 16 cores or more. This means MarkDuplicatesSpark, even without access to a Spark cluster, is faster than MarkDuplicates.</li>
* </ul>
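*
* As noted in the first item above, SetNmMdAndUqTags still needs to be run on the MarkDuplicatesSpark output before BQSR. A sketch of that follow-up step, assuming SetNmMdAndUqTags accepts the usual Picard-style -I/-O/-R arguments and using illustrative file names (consult the tool's own documentation for the authoritative interface):
* <pre>
* gatk SetNmMdAndUqTags \
* -I marked_duplicates.bam \
* -O marked_duplicates_fixed.bam \
* -R reference.fasta
* </pre>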
*
* <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
* duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
* the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
* you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no
* duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked
* 'duplicate-tagging-policy'), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ).
* This tool uses the 'read-name-regex' and the 'optical-duplicate-pixel-distance' options as the primary methods to identify
* and differentiate duplicate types. Set 'read-name-regex' to null to skip optical duplicate detection, e.g. for RNA-seq
* or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
* Note that without optical duplicate counts, library size estimation will be inaccurate.</p>
* <p>For a typical 30x coverage WGS BAM, we recommend running on a machine with at least 16 GB of memory. Memory usage scales with library complexity, and the tool needs more memory for larger or more complex data. If the tool is running slowly, it is possible Spark is running out of memory and spilling data to disk excessively. In that case, increasing the memory available to the tool should yield a speedup, up to a threshold beyond which additional memory has no further effect.</p>
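*
* <p>If the tool does appear to be memory-starved, one way to raise the memory available to it is through the gatk launcher's --java-options flag; the 16G heap below simply mirrors the 16 GB recommendation above and is a starting point rather than a tuned value:</p>
* <pre>
* gatk --java-options "-Xmx16G" MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam
* </pre>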
*
* <p>MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.</p>
* <p>Note that this tool does not support UMI-based duplicate marking.</p>
*
* <p>The program can take either coordinate-sorted or query-sorted inputs; however, it is recommended that the input be query-sorted or query-grouped, as the tool otherwise has to perform an extra sort operation on the data in order to associate reads from the input bam with their mates.</p>
* <p>See <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_MarkDuplicates.php'>MarkDuplicates documentation</a> for details on tool features and background information. </p>
*
* <p>If desired, duplicates can be removed using the 'remove-all-duplicates' and 'remove-sequencing-duplicates' options.</p>
* <h3>Usage examples</h3>
* Provide queryname-grouped reads to MarkDuplicatesSpark
* <pre>
* gatk MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam
* </pre>
*
* Additionally produce estimated library complexity metrics
* <pre>
* gatk MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam \
* -M marked_dup_metrics.txt
* </pre>
*
* <h4>Usage example:</h4>
*
* MarkDuplicatesSpark run locally, specifying removal of sequencing duplicates and tagging of optical duplicates
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt
* gatk MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam \
* --remove-sequencing-duplicates \
* --duplicate-tagging-policy OpticalOnly
* </pre>
*
* <h4>MarkDuplicates run locally specifying the core input (if 'spark.executor.cores' is unset spark will use all available cores on the machine)</h4>
* MarkDuplicatesSpark run locally, specifying the number of executor cores. Note that if 'spark.executor.cores' is unset, Spark will use all available cores on the machine.
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt \\<br />
* gatk MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam \
* -M marked_dup_metrics.txt \
* --conf 'spark.executor.cores=5'
* </pre>
*
* <h4>MarkDuplicates run on a spark cluster 5 machines</h4>
* MarkDuplicatesSpark run on a Spark cluster with five executors, each with eight cores
* <pre>
* gatk MarkDuplicatesSpark \\<br />
* -I input.bam \\<br />
* -O marked_duplicates.bam \\<br />
* -M marked_dup_metrics.txt \\<br />
* -- \\<br />
* --spark-runner SPARK \\<br />
* --spark-master <master_url> \\<br />
* --num-executors 5 \\<br />
* --executor-cores 8 <br />
* gatk MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam \
* -M marked_dup_metrics.txt \
* -- \
* --spark-runner SPARK \
* --spark-master MASTER_URL \
* --num-executors 5 \
* --executor-cores 8
* </pre>
*
* Please see
* <a href='http://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics'>Picard DuplicationMetrics</a>
* for detailed explanations of the output metrics.
* <hr />
*
* <h3>Notes</h3>
* <ol>
* <li>This Spark tool performs a significant amount of disk I/O. Run with both the input data and outputs on high-throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, requisition machines with LOCAL SSDs for best performance.</li>
* <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR' (a sketch follows this list). See <a href="https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337">this forum discussion</a> for details.</li>
* </ol>
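*
* A sketch of a local run that follows the second note above, with the temp directory path as a placeholder:
* <pre>
* gatk MarkDuplicatesSpark \
* -I input.bam \
* -O marked_duplicates.bam \
* --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'
* </pre>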
*/
@DocumentedFeature
@CommandLineProgramProperties(
