
Added some javadoc notes about disk throughput to MarkDuplicatesSpark and SortSamSpark #5672

Merged
jamesemery merged 5 commits into master from je_updateMDSparkDocumentation on Mar 7, 2019

Conversation

jamesemery (Collaborator)

Hopefully this will help better inform people as to how these tools should be run in Cromwell.

codecov-io commented Feb 13, 2019

Codecov Report

Merging #5672 into master will decrease coverage by 6.763%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             master     #5672       +/-   ##
==============================================
- Coverage     87.05%   80.287%   -6.763%     
+ Complexity    31708     30237     -1471     
==============================================
  Files          1940      1943        +3     
  Lines        146142    146770      +628     
  Branches      16128     16223       +95     
==============================================
- Hits         127216    117837     -9379     
- Misses        13041     23220    +10179     
+ Partials       5885      5713      -172
Impacted Files Coverage Δ Complexity Δ
...transforms/markduplicates/MarkDuplicatesSpark.java 94.521% <ø> (ø) 36 <0> (ø) ⬇️
...hellbender/tools/spark/pipelines/SortSamSpark.java 100% <ø> (ø) 5 <0> (ø) ⬇️
...rs/variantutils/SelectVariantsIntegrationTest.java 0.25% <0%> (-99.75%) 1% <0%> (-70%)
...kers/filters/VariantFiltrationIntegrationTest.java 0.826% <0%> (-99.174%) 1% <0%> (-25%)
...dorientation/CollectF1R2CountsIntegrationTest.java 0.917% <0%> (-99.083%) 1% <0%> (-12%)
.../walkers/bqsr/BaseRecalibratorIntegrationTest.java 1.031% <0%> (-98.969%) 1% <0%> (-7%)
...ers/vqsr/FilterVariantTranchesIntegrationTest.java 1.053% <0%> (-98.947%) 1% <0%> (-5%)
...s/variantutils/VariantsToTableIntegrationTest.java 1.205% <0%> (-98.795%) 1% <0%> (-20%)
...ientation/ReadOrientationModelIntegrationTest.java 1.667% <0%> (-98.333%) 1% <0%> (-5%)
...on/FindBreakpointEvidenceSparkIntegrationTest.java 1.754% <0%> (-98.246%) 1% <0%> (-6%)
... and 237 more

sooheelee (Contributor)

@jamesemery, how do you feel about me making changes to your branch via a secondary commit? I think this might be easier than my asking for minor documentation fixes, explaining them and then having you make them.

sooheelee (Contributor)

I just noticed that SortSamSpark has zero javadoc. So here is where I am developing content: https://docs.google.com/document/d/13_fZ8y8692aKH3jT09cf9VCNxD2NpkISjq8EJqq0ZJ4/edit?usp=sharing @jamesemery.

sooheelee (Contributor)

Will wait for your return in a week to discuss.

droazen assigned jamesemery and unassigned sooheelee on Feb 14, 2019
droazen (Collaborator) commented Feb 14, 2019

Assigning back to @jamesemery to review @sooheelee's proposed Google doc and incorporate whatever changes are appropriate.

droazen (Collaborator) commented Feb 14, 2019

@jamesemery I think that the warning about coordinate-sorted input vs. name-sorted input needs to be much more prominent, and perhaps repeated several times at various points in the docs.

droazen (Collaborator) commented Feb 14, 2019

Also, we should confirm that the tool itself emits a warning message to the logger when given coordinate-sorted input.
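
For reference, a minimal sketch of what such a check could look like; this is illustrative only and not the actual MarkDuplicatesSpark source, assuming the tool can see its input's htsjdk SAMFileHeader and logs through log4j2:

import htsjdk.samtools.SAMFileHeader;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Illustrative sketch: warn when the input header reports coordinate sort
// order, since the tool will queryname-sort the reads internally first.
final class SortOrderWarningSketch {
    private static final Logger logger = LogManager.getLogger(SortOrderWarningSketch.class);

    static void warnIfCoordinateSorted(final SAMFileHeader header) {
        if (header.getSortOrder() == SAMFileHeader.SortOrder.coordinate) {
            logger.warn("Input is coordinate-sorted; MarkDuplicatesSpark will queryname-sort it "
                    + "internally, which can be up to 2x slower than queryname-grouped input.");
        }
    }
}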

jamesemery (Collaborator, Author)

@sooheelee Updated this branch; let me know what your thoughts are (based on the Google document discussion).

jamesemery assigned sooheelee and unassigned jamesemery on Feb 21, 2019
jamesemery (Collaborator, Author)

@sooheelee Updated the documentation in this branch according to the external discussion. Thoughts? Can this be merged?

sooheelee (Contributor) left a comment

@jamesemery, here is my review. Since we have worked together to extensively shape the content and preemptively address user questions, what remains are minor typos etc.

* read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
* If you are not familiar with this type of annotation, please see the following <a href='https://www.broadinstitute.org/gatk/blog?id=7019'>blog post</a> for additional information.</p>" +
* <ul>
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li>
sooheelee (Contributor):

  • remove F in FMarkDuplicatesSpark
  • remove extraneous space in >single sample pipeline </a>
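
As an aside for readers unfamiliar with SAM flag values, a small self-contained demonstration (not part of the PR) of the 0x400/1024 equivalence mentioned in the javadoc, using htsjdk's SAMFlag enum:

import htsjdk.samtools.SAMFlag;

// The duplicate flag the javadoc describes: hexadecimal 0x400, decimal 1024.
public class DuplicateFlagDemo {
    public static void main(final String[] args) {
        final int duplicateFlag = SAMFlag.DUPLICATE_READ.intValue();
        System.out.println(duplicateFlag);                             // prints 1024
        System.out.println("0x" + Integer.toHexString(duplicateFlag)); // prints 0x400
    }
}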

* <ul>
* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li>
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li>
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/>
sooheelee (Contributor):

  • correct closing list item tag: <li/> --> </li>

* <li>FMarkDuplicatesSpark processing can replace both the MarkDuplicates and SortSam steps of the Best Practices <a href="https://software.broadinstitute.org/gatk/documentation/article?id=7899#2">single sample pipeline </a>. After flagging duplicate sets, the tool automatically coordinate-sorts the records. It is still necessary to subsequently run SetNmMdAndUqTags before running BQSR. </li>
* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li>
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/>
* <li>CCollecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool, <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li>
sooheelee (Contributor):

  • remove extraneous C in CCollecting
  • remove , before EstimateLibraryComplexity
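
To make the note concrete, a hypothetical invocation (file names are placeholders) showing the optional metrics collection enabled with the -M argument the javadoc describes, written in the same javadoc style as the surrounding examples:

/**
 * <pre>
 * gatk MarkDuplicatesSpark \
 *   -I input.bam \
 *   -O marked_duplicates.bam \
 *   -M duplicate_metrics.txt
 * </pre>
 */
final class MarkDuplicatesSparkMetricsExampleSketch {}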

* <li>The tool is optimized to run on queryname-grouped alignments. If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances.</li>
* <li>Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. <li/>
* <li>CCollecting duplicate metrics slows down performance and thus the metrics collection is optional and must be specified for the Spark version of the tool with '-M'. It is possible to collect the metrics with the standalone Picard tool, <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_markduplicates_EstimateLibraryComplexity.php'>EstimateLibraryComplexity</a>.</li>
* <li>MarkDuplicatesSpark is optimized to run locally on a single machine by leveraging core parallelism that MarkDuplicates and SortSam cannot. It will typically run faster than MarkDuplicates and SortSam by a factor of 15% over the same data at 2 cores and will scale linearly to upwards of 16 cores. This means that the tool can be used to speedup MarkDuplicates even without access to a Spark cluster.</li>
sooheelee (Contributor):

  • This means that the tool can be used to speedup MarkDuplicates even without access to a Spark cluster. --> This means MarkDuplicatesSpark, even without access to a Spark cluster, is faster than MarkDuplicates.

/**
* SortSam on Spark (works on SAM/BAM/CRAM)
*
* <h4>Overview</h4>
sooheelee (Contributor):

Remove <h4>Overview</h4> line. Currently, documentation looks like this:
[screenshot of the rendered documentation, 2019-03-06]

* <h3>Additional Notes</h3>
* <ul>
* <li>This Spark tool requires a significant amount of disk operations. Run with both the input data and outputs on high throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, for best performance requisition machines with LOCAL SSDs. </li>
* <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'. See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details.</li>
sooheelee (Contributor):

Change "See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details."
-->
See <a href="https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337">this forum discussion</a> for details.
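
To illustrate that recommendation, a hypothetical command sketch (placeholder paths, not taken from the PR) that adds the --conf argument the note describes to a local SortSamSpark run, again in javadoc form; the exact placement of --conf may vary with how the tool is launched:

/**
 * <pre>
 * gatk SortSamSpark \
 *   -I input.bam \
 *   -O sorted.bam \
 *   --conf 'spark.local.dir=/path/to/ssd/tmp'
 * </pre>
 */
final class SparkLocalDirExampleSketch {}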

* SortSam on Spark (works on SAM/BAM/CRAM)
*
* <h4>Overview</h4>
* <p>A <a href='https://software.broadinstitute.org/gatk/blog?id=23420'>Spark</a> implementation of <a href='https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_SortSam.php'>Picard SortSam</a>. The Spark version can run in parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching the output of the single-core Picard version. See <a href="https://software.broadinstitute.org/gatk/blog?id=23420">Blog#23420</a> for performance benchmarks.</p>
sooheelee (Contributor):

Remove link surrounding 'Spark'. We link to the blog article in the next sentence.

*
* <p>The tool sorts reads by coordinate order by default or alternatively by read name, the QNAME field, if asked with the '-SO queryname' option. The contig ordering in the reference dictionary defines coordinate order, and the tool uses the sequence dictionary represented by the @SQ header lines or that of the optionally provided reference to sort reads by the RNAME field. For those reads mapping to a contig, coordinate sorting further orders reads by the POS field of the SAM record, which contains the leftmost mapping position.</p>
*
* <p>Queryname-sorted alignments are grouped first by readname and then are deterministically ordered among equal readnames by read flags including orientation, secondary, and supplemntary record flags (See {@link htsjdk.samtools.SAMRecordQueryNameComparator#compare(SAMRecord, SAMRecord)}} for details). For paired-end reads, reads in the pair share the same queryname. Because aligners can generate secondary and supplementary alignments, queryname groups can consists of, e.g. more than two records for a paired-end pair.</p>
sooheelee (Contributor):

We can improve this sentence:

Queryname-sorted alignments are grouped first by readname and then are deterministically ordered among equal readnames by read flags including orientation, secondary, and supplemntary record flags (See htsjdk.samtools.SAMRecordQueryNameComparator#compare(SAMRecord, SAMRecord)} for details).

May I suggest
"To queryname-sort, the tool first groups by readname and then deterministically sorts within a readname set by orientation, secondary and supplementary SAM flags. "

  • supplemntary record flags --> supplementary
  • remove link to detailed information. It doesn't show up well in javadoc:

[screenshot of the rendered javadoc, 2019-03-06]
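
Since the javadoc link is being dropped, here is a small self-contained sketch (illustrative only, not part of the PR) of the htsjdk comparator behavior that sentence describes; the records are deliberately minimal:

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SAMRecordQueryNameComparator;
import java.util.ArrayList;
import java.util.List;

// Within a single queryname, SAMRecordQueryNameComparator breaks ties
// deterministically using flags such as the supplementary alignment flag.
public class QuerynameTieBreakDemo {
    public static void main(final String[] args) {
        final SAMFileHeader header = new SAMFileHeader();

        final SAMRecord primary = new SAMRecord(header);
        primary.setReadName("read1");

        final SAMRecord supplementary = new SAMRecord(header);
        supplementary.setReadName("read1");
        supplementary.setSupplementaryAlignmentFlag(true);

        final List<SAMRecord> records = new ArrayList<>();
        records.add(supplementary);
        records.add(primary);
        records.sort(new SAMRecordQueryNameComparator());

        // The primary record now precedes the supplementary one.
        records.forEach(r -> System.out.println(
                r.getReadName() + " supplementary=" + r.getSupplementaryAlignmentFlag()));
    }
}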

* <h4>Usage examples</h4>
* Coordinate-sort aligned reads using all cores available locally
* <pre>
* gatk SortSamSpark \<br />
sooheelee (Contributor):

As I mention below, here and elsewhere:
Please remove the <br> elements as they are unnecessary for gatkdocs and actually add a weird additional line between command lines.

Right now, this is what the doc looks like:
[screenshot of the rendered usage example, 2019-03-06]

We want to tighten up the line spacing and removing the breaks will do that for us.
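
For concreteness, a sketch of what a usage example looks like with the <br /> tags removed; the file names are placeholders rather than the PR's actual text:

/**
 * <h4>Usage examples</h4>
 * Coordinate-sort aligned reads using all cores available locally
 * <pre>
 * gatk SortSamSpark \
 *   -I input.bam \
 *   -O sorted.bam
 * </pre>
 */
final class SortSamSparkUsageExampleSketch {}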

* </pre>
*
* <h3>Notes</h3>
* <ul>
sooheelee (Contributor):

For Mutect2, I decided to make the Notes section an ordered list. It's up to you if you want to change the ul, since you only have two bullets.

sooheelee assigned jamesemery and unassigned sooheelee on Mar 6, 2019
jamesemery (Collaborator, Author)

@sooheelee I have responded to your comments. They were mostly typos and quick changes. I might go ahead and merge this soon.

sooheelee (Contributor)

Yes, feel free to merge. I gave approval with my last review.

* -I coordinatesorted.bam \
* -SO queryname \
* -O querygroupsorted.bam \
* --
sooheelee (Contributor):

Just skimming through and this needs a \.

jamesemery merged commit 3570af9 into master on Mar 7, 2019
jamesemery deleted the je_updateMDSparkDocumentation branch on March 7, 2019 at 19:55