Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Definine GenomicsDB "partitions" over the span of the input intervals in #5540

Merged
merged 2 commits into from
Dec 21, 2018

Conversation

ldgauthier
Copy link
Contributor

order to dramatically improve exome performance

This makes an 81X speedup in the 1000interval test I put back in. Exomes are just about unrunnable without some sort of interval manipulation. The 65 sample exome joint calling callset output was exactly the same with this version. I also did a comparison for a chr1 and chr20 import (so that we were looking at two intervals that were far apart) and the runtime was the same. This is a huge improvement for some typical use cases.

@codecov-io
Copy link

codecov-io commented Dec 20, 2018

Codecov Report

Merging #5540 into master will increase coverage by 0.007%.
The diff coverage is 91.667%.

@@               Coverage Diff               @@
##              master     #5540       +/-   ##
===============================================
+ Coverage     87.068%   87.075%   +0.007%     
- Complexity     31324     31332        +8     
===============================================
  Files           1921      1921               
  Lines         144579    144598       +19     
  Branches       15949     15952        +3     
===============================================
+ Hits          125882    125908       +26     
+ Misses         12905     12896        -9     
- Partials        5792      5794        +2
Impacted Files Coverage Δ Complexity Δ
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java 88.367% <100%> (+1.685%) 80 <1> (+3) ⬆️
...broadinstitute/hellbender/utils/IntervalUtils.java 91.534% <100%> (+0.06%) 187 <2> (+2) ⬆️
.../hellbender/tools/genomicsdb/GenomicsDBImport.java 79.087% <66.667%> (ø) 53 <0> (+1) ⬆️
...titute/hellbender/utils/IntervalUtilsUnitTest.java 91.906% <90.909%> (-0.01%) 146 <2> (+2)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 79.878% <0%> (-0.61%) 42% <0%> (ø)
...lotypecaller/readthreading/ReadThreadingGraph.java 88.608% <0%> (-0.253%) 144% <0%> (-1%)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java 80% <0%> (+30%) 3% <0%> (+2%) ⬆️

order to dramatically improve exome performance
@ldgauthier
Copy link
Contributor Author

@lbergelson you were totally right. I pulled out a flag to turn on interval merging.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier Looks good, needs a test for the new method. Thank you for checking into the behavior. We could potentially drive this down into the interval argument collection itself as a new merging mode, but I'm not sure if that's useful in general or a liability when other people incorrectly specify it.

shortName = MERGE_INPUT_INTERVALS_LONG_NAME,
doc = "Boolean flag to import all data in between intervals. Improves performance using large lists of " +
"intervals, as in exome sequencing, especially if GVCF data only exists for specified intervals.")
private boolean mergeInputIntervals = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good... I guess the other alternative would be to add a new interval merging rule to the interval argument collection and push this behavior into the engine.

final ArgumentsBuilder args = new ArgumentsBuilder();
args.addArgument(GenomicsDBImport.WORKSPACE_ARG_LONG_NAME, workspace);
intervals.forEach(interval -> args.addArgument("L", IntervalUtils.locatableToString(interval)));
vcfInputs.forEach(vcf -> args.addArgument("V", vcf));
args.addArgument("batch-size", String.valueOf(batchSize));
args.addArgument(GenomicsDBImport.VCF_INITIALIZER_THREADS_LONG_NAME, String.valueOf(threads));
if (useBufferSize)
if (mergeIntervals) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use args.addBooleanArgument(GenomicsDBImport.MERGE_INPUT_INTERVALS_LONG_NAME, mergeIntervals) and drop the if

@@ -151,14 +151,16 @@ public void testGenomicsDBImportFileInputsWithMultipleIntervals() throws IOExcep
testGenomicsDBImporter(LOCAL_GVCFS, MULTIPLE_INTERVALS, COMBINED_MULTI_INTERVAL, b38_reference_20_21, true, 1);
}

private void testGenomicsDBImportWith1000Intervals() throws IOException {
@Test
public void testGenomicsDBImportWith1000Intervals() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this to say something about the interval merging.

@ldgauthier
Copy link
Contributor Author

Comments addressed -- back to you @lbergelson

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good 👍

@ldgauthier ldgauthier merged commit 1f7a03d into master Dec 21, 2018
@ldgauthier ldgauthier deleted the ldg_GDBintervalMagic branch December 21, 2018 18:03
droazen pushed a commit that referenced this pull request Mar 28, 2019
… exome input (#5741)

Rather than start off the exome genotyping task with a step to merge the query intervals for better GDB performance, make it an option as for Import in #5540

This is necessary to get the exome joint calling pipeline onto official GATK release dockers.

GenotypeGVCFs also performs better reading from a GenomicsDB as a single
interval
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants