
Add strict mode to AssemblyRegionWalkerSpark and HaplotypeCallerSpark #5416

Merged
merged 9 commits into master from tw_strict_assembly_region_walker_spark_hc on Dec 5, 2018

Conversation

@tomwhite (Collaborator) commented Nov 15, 2018

There are tests for both the strict and non-strict modes for HaplotypeCallerSpark and for ExampleAssemblyRegionWalkerSpark. The strict modes produce identical results to the walker versions.

Strict mode is off by default and still needs work to scale to exome-sized data (another PR). This is an improvement over the current HaplotypeCallerSpark implementation since it fixes two bugs (reads overlapping more than two intervals; and editIntervals not being picked up) that caused the output to differ significantly from the walker version. While the output does still differ in non-strict mode, the difference is a lot less (compare expected.testVCFMode.gatk4.vcf and expected.testVCFMode.gatk4.nonstrict.vcf).

@tomwhite tomwhite requested review from jonn-smith and jamesemery Nov 15, 2018
JavaRDD<AssemblyRegion> assemblyRegions = assemblyRegionShardedReads.map((Function<Shard<GATKRead>, AssemblyRegion>) shard -> toAssemblyRegion(shard, header));

// 7. Add reference and feature context.
return assemblyRegions.mapPartitions(getAssemblyRegionWalkerContextFunction(referenceFileName, bFeatureManager));

@tomwhite (Author, Collaborator), Nov 15, 2018
These 7 steps make up the salient part of the strict algorithm.
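The pattern in steps 6 and 7 above is the standard Spark distinction between per-element and per-partition transforms. The following self-contained analogy (not the PR's code; all names are illustrative) shows why step 7 uses mapPartitions: the reference and feature sources are expensive to open, so they are set up once per partition rather than once per assembly region.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapVsMapPartitions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("demo");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> shards = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);
            // Step 6 analogue: a cheap per-element conversion uses map.
            JavaRDD<String> regions = shards.map(i -> "region-" + i);
            // Step 7 analogue: expensive setup is done once per partition.
            JavaRDD<String> withContext = regions.mapPartitions(it -> {
                ExpensiveResource reference = new ExpensiveResource(); // opened once per partition
                List<String> out = new ArrayList<>();
                while (it.hasNext()) {
                    out.add(it.next() + "+" + reference.lookup());
                }
                return out.iterator();
            });
            withContext.collect().forEach(System.out::println);
        }
    }

    static class ExpensiveResource {
        String lookup() { return "ctx"; }
    }
}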

@tomwhite tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from 9deb813 to 4a7e67c Nov 16, 2018
@codecov-io commented Nov 16, 2018

Codecov Report

Merging #5416 into master will increase coverage by 0.032%.
The diff coverage is 86.755%.

@@               Coverage Diff               @@
##              master     #5416       +/-   ##
===============================================
+ Coverage     86.984%   87.016%   +0.032%     
- Complexity     31208     31272       +64     
===============================================
  Files           1909      1914        +5     
  Lines         144202    144440      +238     
  Branches       15954     15982       +28     
===============================================
+ Hits          125433    125686      +253     
+ Misses         12999     12963       -36     
- Partials        5770      5791       +21
Impacted Files Coverage Δ Complexity Δ
.../hellbender/engine/spark/SparkSharderUnitTest.java 93.846% <100%> (+0.926%) 8 <1> (+1) ⬆️
...bender/engine/spark/AssemblyRegionWalkerSpark.java 100% <100%> (+22.917%) 13 <0> (-2) ⬇️
...nstitute/hellbender/engine/spark/SparkSharder.java 90.789% <100%> (-0.367%) 28 <3> (-3)
...der/tools/HaplotypeCallerSparkIntegrationTest.java 68.293% <100%> (+9.563%) 16 <5> (+4) ⬆️
...nder/tools/spark/pipelines/ReadsPipelineSpark.java 90.741% <100%> (+0.175%) 13 <0> (ø) ⬇️
...nstitute/hellbender/engine/ShardBoundaryShard.java 100% <100%> (ø) 5 <1> (+1) ⬆️
...ampleAssemblyRegionWalkerSparkIntegrationTest.java 100% <100%> (+88.889%) 4 <1> (+2) ⬆️
...tute/hellbender/engine/ReadlessAssemblyRegion.java 46.667% <46.667%> (ø) 4 <4> (?)
...stitute/hellbender/engine/spark/GATKSparkTool.java 82.432% <60%> (-0.444%) 68 <4> (ø)
...stitute/hellbender/tools/HaplotypeCallerSpark.java 82.759% <69.231%> (-5.703%) 20 <1> (-2)
... and 20 more
Collaborator left a comment

Some comments for now. I think we really need to think long and hard about the downsampling, as I suspect it's a key source of the differences you have observed between Spark and the base tool.


// We wrap our LocusIteratorByState inside an IntervalAlignmentContextIterator so that we get empty loci
// for uncovered locations. This is critical for reproducing GATK 3.x behavior!
LocusIteratorByState libs = new LocusIteratorByState(readShard.iterator(), DownsamplingMethod.NONE, false, ReadUtils.getSamplesFromHeader(readHeader), readHeader, includeReadsWithDeletionsInIsActivePileups);

@jamesemery (Collaborator), Nov 16, 2018

DownsamplingMethod.NONE doesn't seem right here. In the AssemblyRegionWalker we apply the PositionalDownsampler over each shard as we do the band-pass filtering, i.e. at the stage where the active regions are developed. Perhaps we should look into whether there is a way to keep the downsampling consistent between the two approaches, especially if we are creating assembly regions with no reads and adding the reads back in later. Perhaps this accounts for some of the difference.
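A hedged sketch (not code from this PR) of how positional downsampling could be applied to the shard's reads before the LocusIteratorByState is built, mirroring the walker. The PositionalDownsampler and ReadsDownsamplingIterator signatures are assumed from the GATK codebase of the time, and maxReadsPerAlignmentStart stands in for however that setting would be plumbed through:

import org.broadinstitute.hellbender.utils.downsampling.PositionalDownsampler;
import org.broadinstitute.hellbender.utils.downsampling.ReadsDownsamplingIterator;

// Downsample positionally before pileup construction, as the walker does;
// the remaining symbols come from the surrounding code shown above.
final PositionalDownsampler downsampler =
        new PositionalDownsampler(maxReadsPerAlignmentStart, readHeader); // assumed signature
final Iterator<GATKRead> downsampledReads =
        new ReadsDownsamplingIterator(readShard.iterator(), downsampler);
// DownsamplingMethod.NONE is now accurate, since the reads were downsampled upstream.
LocusIteratorByState libs = new LocusIteratorByState(downsampledReads, DownsamplingMethod.NONE, false, ReadUtils.getSamplesFromHeader(readHeader), readHeader, includeReadsWithDeletionsInIsActivePileups);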

@tomwhite (Author, Collaborator), Nov 19, 2018

ActivityProfileStateIterator is the part of AssemblyRegionIterator that just does the iteration over activity profiles. It's the same code, and should arguably be refactored so the two share it (e.g. AssemblyRegionIterator might compose ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator, although it's not that simple, since AssemblyRegionIterator does read caching, which we don't want in the Spark case).

Also, I don't think we see downsampling kicking in for the tests, so it's probably not the reason for the differences there. Those seem to be read-shard boundary artifacts.
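To make the refactoring idea concrete, a hedged sketch of the composition (constructor parameters abbreviated to the essentials; the real signatures differ):

// Hypothetical composition: AssemblyRegionIterator ~= read caching + these two stages.
final Iterator<ActivityProfileState> states =
        new ActivityProfileStateIterator(readShard, readHeader, reference, features, evaluator);
final Iterator<AssemblyRegion> regions =
        new AssemblyRegionFromActivityProfileStateIterator(states, readHeader, assemblyRegionArgs);
// The walker's AssemblyRegionIterator would additionally wrap 'regions' in a
// read-caching layer, which the Spark path deliberately omits.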

// TODO: interfaces could be improved to avoid casting
ReadlessAssemblyRegion readlessAssemblyRegion = (ReadlessAssemblyRegion) ((ShardBoundaryShard<GATKRead>) shard).getShardBoundary();
int extension = Math.max(shard.getInterval().getStart() - shard.getPaddedInterval().getStart(), shard.getPaddedInterval().getEnd() - shard.getInterval().getEnd());
AssemblyRegion assemblyRegion = new AssemblyRegion(shard.getInterval(), Collections.emptyList(), readlessAssemblyRegion.isActive(), extension, header);

@jamesemery (Collaborator), Nov 16, 2018

I see the issue with this interface... It looks like you may be able to make the regular AssemblyRegion also extend ShardBoundaryShard. I'm not sure that helps anything really, but then at least AssemblyRegion and ReadlessAssemblyRegion would be related in the class hierarchy.

@jamesemery (Collaborator), Nov 16, 2018

That is, through an abstract interface, of course.
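A hedged sketch of one shape such a shared abstraction could take; the interface name is hypothetical and not part of this PR:

import htsjdk.samtools.util.Locatable;

// Hypothetical common supertype for AssemblyRegion and ReadlessAssemblyRegion.
public interface AssemblyRegionLike extends Locatable {
    boolean isActive();
}

If both classes implemented it, the cast in toAssemblyRegion above could be replaced with a call through the common type.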

@tomwhite (Author, Collaborator), Nov 19, 2018

Thanks for the suggestion. I need to spend a bit more time trying to improve this. (Could be done in another issue or PR.)


return Utils.stream(shardedReadIterator)
.map(shardedRead -> new ShardToMultiIntervalShardAdapter<>(shardedRead))
// TODO: reinstate downsampling (not yet working)

@jamesemery (Collaborator), Nov 16, 2018

We should think about how we want to do this. Notably, as of this implementation you need to apply the same downsampling to the reads at step 5: downsampling is currently only hooked up for the band-pass filtering, not for the reads when they get re-added later. An obvious solution would involve persisting the underlying bam, but obviously that's expensive; the other is to make the random seed calculable in some way, e.g. base it on a hash of the read if necessary, so that we could get an exact match with the regular HaplotypeCaller.
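A hedged sketch of the "calculable seed" idea: derive the randomness for each keep/drop decision from the read itself, so the same read gets the same decision on every pass. keepFraction is a hypothetical parameter for illustration only:

import java.util.Random;

public final class DeterministicDownsampleDecision {
    // Same read name -> same seed -> same decision, wherever the read is seen.
    public static boolean keep(final String readName, final double keepFraction) {
        final Random rng = new Random(readName.hashCode());
        return rng.nextDouble() < keepFraction;
    }
}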

@tomwhite (Author, Collaborator), Nov 19, 2018

The reason I excluded this is that I found an underlying bug in the downsampling code where it would not downsample correctly unless the iterator was created twice. (This needs further investigation.)

But you are right that the reads downsampling needs to be carried out on the reads too, which isn't currently happening in this code. Let's explore your ideas further.

/**
 * A cut-down version of {@link AssemblyRegion} that doesn't store reads, used in the strict implementation of
 * {@link org.broadinstitute.hellbender.engine.spark.FindAssemblyRegionsSpark}.
 */
public class ReadlessAssemblyRegion extends ShardBoundary {

@jamesemery (Collaborator), Nov 16, 2018

There should probably be a common interface linking this to AssemblyRegion, or a different name distinguishing it.

@tomwhite (Author, Collaborator), Nov 19, 2018

Agreed.

@tomwhite (Author, Collaborator) commented Nov 19, 2018

Thanks for the review @jamesemery. I've addressed your comments. A few outstanding issues:

  • ActivityProfileStateIterator and AssemblyRegionFromActivityProfileStateIterator duplicate parts of AssemblyRegionIterator, so it would be nice to remove the code duplication. That's not totally straightforward, since AssemblyRegionIterator does read caching but the other two don't (Spark shouldn't be caching reads).
  • Downsampling needs more work. I would be OK doing that separately, since I've only ever seen it when running on a full genome, and the new strict code needs more work to run on a full genome (I've only got it running on an exome so far).
  • There are some improvements we could make to ReadlessAssemblyRegion regarding Java interface design and generics, but I'm not sure what they are yet.

I'm not sure if these are blockers, since the strict codepath is a new option (off by default), but would like to know what you and @jonn-smith think.

Collaborator left a comment

I have some very minor comments. Most of the meat has already been addressed from what I can tell.

This will be a very good starting point for HaplotypeCallerSpark correctness.

assemblyRegionArgs.maxProbPropagationDistance, includeReadsWithDeletionsInIsActivePileups);
return Utils.stream(assemblyRegionIter).map(assemblyRegion ->
new AssemblyRegionWalkerContext(assemblyRegion,
new ReferenceContext(reference, assemblyRegion.getExtendedSpan()),

@jonn-smith (Collaborator), Nov 28, 2018

This is actually the same code that was running before, so it really should have been commented already...

It looks like the padding is getting passed in via assemblyRegionArgs.assemblyRegionPadding when creating assemblyRegionIter, so I don't think that it's a technical issue (the span without padding would be retrieved with assemblyRegion.getSpan()).
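For reference, a hedged illustration of the two accessors being contrasted (as they existed in GATK 4 at the time):

final SimpleInterval padded = assemblyRegion.getExtendedSpan(); // span plus assemblyRegionPadding on each side
final SimpleInterval raw    = assemblyRegion.getSpan();         // the active span itself, no padding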

@tomwhite tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from e4bd5da to b7eeac9 Dec 3, 2018
@tomwhite (Author, Collaborator) commented Dec 3, 2018

Thanks for the review @jonn-smith. I've addressed all your comments, and those of @jamesemery. Please approve so we can merge.

Collaborator left a comment

Considering that this is a beta tool anyway, the impact is relatively low, and it is important to benchmark our progress. I do think running this over a full genome should be tried.

@tomwhite (Author, Collaborator) commented Dec 3, 2018

Thanks @jamesemery. I've successfully run with an exome in #5475; I'll try with a genome separately.

@tomwhite tomwhite force-pushed the tw_strict_assembly_region_walker_spark_hc branch from b7eeac9 to 2fcc4a7 Dec 4, 2018
@tomwhite tomwhite merged commit 1f6a172 into master Dec 5, 2018
2 checks passed:
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
@tomwhite tomwhite deleted the tw_strict_assembly_region_walker_spark_hc branch Dec 5, 2018