Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run HaplotypeCallerSpark on WGS in strict mode #5721

Merged
merged 4 commits into from Mar 8, 2019
Merged

Conversation

tomwhite
Copy link
Contributor

These are the changes needed to run on a whole genome in strict mode. We get out of memory errors without these changes.

Reads downsampling was missing for the part where AssemblyRegions are filled with reads - this PR adds it in. Downsampling is not deterministic yet, since that depends on #5437, but that's an orthogonal issue so it's OK to merge this change and add #5437 later.

@codecov-io
Copy link

codecov-io commented Feb 26, 2019

Codecov Report

Merging #5721 into master will decrease coverage by 0.001%.
The diff coverage is 91.667%.

@@               Coverage Diff               @@
##              master     #5721       +/-   ##
===============================================
- Coverage     86.985%   86.984%   -0.001%     
- Complexity     31863     31865        +2     
===============================================
  Files           1943      1943               
  Lines         146768    146775        +7     
  Branches       16223     16225        +2     
===============================================
+ Hits          127666    127671        +5     
  Misses         13189     13189               
- Partials        5913      5915        +2
Impacted Files Coverage Δ Complexity Δ
...ils/activityprofile/ActivityProfileStateRange.java 94.286% <100%> (+0.168%) 7 <0> (+1) ⬆️
...lbender/engine/spark/FindAssemblyRegionsSpark.java 82.09% <87.5%> (+0.122%) 20 <3> (+1) ⬆️
...nder/utils/runtime/StreamingProcessController.java 67.299% <0%> (-0.474%) 33% <0%> (ø)

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, my primary comment is that you should probably leave a warning about the inconsistent downsampling between the two places in the code

// 5. Convert shards to assembly regions.
JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header));
// 5. Convert shards to assembly regions. Reads downsampling is done again here, and is assumed to be consistent
// with the downsampling done in step 1, since it is deterministic by locus.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it sounds like you want to get this branch in first, i would instead make this a comment referencing the other branch and noting the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -159,12 +159,20 @@
// at which points the reads can be filled in. (See next step.)
JavaRDD<ReadlessAssemblyRegion> readlessAssemblyRegions = contigToGroupedStates
.flatMap(getReadlessAssemblyRegionsFunction(header, assemblyRegionArgs));
// repartition to distribute the data evenly across the cluster again
readlessAssemblyRegions = readlessAssemblyRegions.repartition(readlessAssemblyRegions.getNumPartitions());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this repartitioning it looks like the reads may have been partitioned by contig? Flatmap I assume doesn't repartition automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right

@tomwhite tomwhite merged commit 02dca71 into master Mar 8, 2019
@tomwhite tomwhite deleted the tw_hcs_strict_genome branch March 8, 2019 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants