Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented SlidingWindowWalker #1708

Closed

Conversation

magicDGS
Copy link
Contributor

New implementation of SlidingWindowWalker with some ideas from the discussion in #1528.

The thinks that are requested in #1198 still holds, but now it is more general: padding option is added and construction of windows are done by interval.

The code contain a lot of TODO because it relies on changes implemented in #1567, and because it is suppose to be a walker over ReadWindow instead of SimpleInterval+ReadsContext if reads are available.

I think that with these changes it could be general to be extended by ReadWindowWalker and by users that needs a different way of "slide" over intervals.

@magicDGS
Copy link
Contributor Author

I was thinking after checking your ReadWindowWalker implementation and the HaplotypeCaller branch that with this class we can do a general SlidingWindowWalker, which could be extended by ReadWindowWalker or it could be possible that HaplotypeCaller directly implements SlidingWindowWalker.

@droazen, could you have a look and tell me what do you think about the idea?

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 83.476% when pulling c1145d3 on magicDGS:dgs_slidingwindowwalker into ed4f085 on broadinstitute:master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 83.476% when pulling c1145d3 on magicDGS:dgs_slidingwindowwalker into ed4f085 on broadinstitute:master.

@droazen
Copy link
Collaborator

droazen commented Apr 16, 2016

@magicDGS The HaplotypeCaller traversal has undergone some changes in the past few weeks to improve performance and bring the output of the tool closer to GATK3. There is now an AssemblyRegionWalker that divides the intervals into active and inactive regions, in a greatly simplified version of the GATK3 traversal.

Initially, I did plan on having AssemblyRegionWalker extend the former ReadWindowWalker, or an adapted version of your SlidingWindowWalker, and I did implement it like this at first, but ultimately I collapsed it into a single class for several reasons:

  • Inheriting from a more generic traversal type caused usability issues and confusion with respect to the command-line arguments. The ReadWindow was the unit of processing for the superclass, but for AssemblyRegionWalker it was the unit of I/O and AssemblyRegion was the unit of processing, and I couldn't update the docs for ReadWindowWalker to clear up the confusion without mentioning AssemblyRegion-specific concepts.
  • The ReadShard / ReadWindow was/is only there to prove that we can shard the data without introducing calling artifacts, and to provide a unit of parallelism for the upcoming Spark implementation. It's not something we really want to expose to users as a prominent knob, and we may hide it completely in the future once the shard size is tuned for performance.
  • Inheriting from a more abstract walker type caused a number of other problems as well: methods that should have been final in the supertype could no longer be made final, with the result that tool implementations could inappropriately override engine initialization/shutdown routines. Also, there were issues with the progress meter, since both the supertype traversal and subtype traversal needed their own progress meter for their different units of processing.

Ultimately it was just too awkward and forced, and the read shard is something that we eventually want to make an internal/encapsulated implementation detail anyway. GATK3 made the mistake, I think, of using long, confusing inheritance chains for its walker types, with the result that you got awkward and forced relationships like RodWalker inheriting from LocusWalker. It's better, I think, to make each traversal as standalone as possible, especially given the simplicity of writing a new walker type in GATK4.

For all of these reasons we don't want AssemblyRegionWalker to inherit from a more abstract traversal type -- it's just going to be its own standalone thing, so that it can evolve freely without affecting anyone else. For SlidingWindowWalker, which we still want to merge, I recommend making the traversal do exactly what you want for your use case, as clearly and simply as possible, without worrying about serving as a base class for other traversals. Ping me once you're happy with it, and I'll re-review.

@magicDGS
Copy link
Contributor Author

I agree that it is better to keep it simpler. I was proposing this before looking the latest commits in the HC branch.

I will work on this walker without thinking about other cases, but I would like to keep the idea of padding and slide over intervals instead of the genome from the begining. It will be useful for the things that I have in mind.

Should I close this PR until I implement everything, @droazen?

@droazen
Copy link
Collaborator

droazen commented Apr 16, 2016

@magicDGS You can keep this open -- just push the new version into this same branch and add a comment when you're happy with it and I'll re-review.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 81.814% when pulling 570fa1a on magicDGS:dgs_slidingwindowwalker into 814c9b7 on broadinstitute:master.

@magicDGS
Copy link
Contributor Author

I finished the implementation for the draft SlidingWindowWalker (I should implement an example and an integration test, but I would like to wait till some issues are solved).

made a "TODO" about the way in which the intervals are constructed, because I will need a that ReadShard have a way to construct a shard without ReadSource (either null or empty source), just in case that the implemented SlidingWindowWalker does not require reads. @droazen, could you review and give me some feedback about this, because this class is important for other parts of GATK?

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 81.857% when pulling 56b1e97 on magicDGS:dgs_slidingwindowwalker into 814c9b7 on broadinstitute:master.

@magicDGS magicDGS force-pushed the dgs_slidingwindowwalker branch 3 times, most recently from c334088 to d346c34 Compare May 18, 2016 10:40
@magicDGS
Copy link
Contributor Author

@droazen, could you review this PR? Before I implement the integration test I would be important that you check if it is possible the change of the ReadShard contract (null ReadsDataSource are now allowed, and I don't know if this will affect other parts of GATK.

Thank you very much in advance.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 81.661% when pulling d346c34 on magicDGS:dgs_slidingwindowwalker into cfd4bf6 on broadinstitute:master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 81.584% when pulling d346c34 on magicDGS:dgs_slidingwindowwalker into cfd4bf6 on broadinstitute:master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 81.658% when pulling 2fe1d68 on magicDGS:dgs_slidingwindowwalker into cfd4bf6 on broadinstitute:master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.7%) to 81.661% when pulling d346c34 on magicDGS:dgs_slidingwindowwalker into cfd4bf6 on broadinstitute:master.

@@ -2,6 +2,7 @@

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.util.Locatable;
import org.apache.avro.test.Simple;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove stray import

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.1%) to 82.093% when pulling 0f99a0b on magicDGS:dgs_slidingwindowwalker into 3ff51ed on broadinstitute:master.

@magicDGS
Copy link
Contributor Author

I extracted some classes to a different branch (#2023) to make it easier to review and integrate with the changes for HaplotypeCallerSpark. I will wait for that branch to be accepted to rebase and update this one...

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.1%) to 82.329% when pulling 8c0e1d5 on magicDGS:dgs_slidingwindowwalker into a830c13 on broadinstitute:master.

@magicDGS
Copy link
Contributor Author

Hi @droazen, I finished the implementation of the Slider Walkers using the new Shard interface from the changes for HaplotypeCallerSpark (#2016). I know that it is a lot of code, but I will be very grateful if you can have a look to this PR to accept it.

Thank you in advance!

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.1%) to 82.329% when pulling cf33b7a on magicDGS:dgs_slidingwindowwalker into a830c13 on broadinstitute:master.

@magicDGS
Copy link
Contributor Author

Hello @droazen, I rebased the PR to the latest master. Could you have a look to this, please? I will be very thankful if so. Thanks in advance!

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.09%) to 82.111% when pulling a05ecec on magicDGS:dgs_slidingwindowwalker into 038703a on broadinstitute:master.

@magicDGS
Copy link
Contributor Author

Sorry @droazen, the previous commit had an error in the tests. I'm rebasing/squashing to make a clear PR and when all check pass (except CLOUD), you can review if you have time. Thank you very much.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.09%) to 82.111% when pulling e413ebf on magicDGS:dgs_slidingwindowwalker into 288a1af on broadinstitute:master.

@magicDGS
Copy link
Contributor Author

magicDGS commented Oct 12, 2016

Friendly ping here, @droazen! I will appreciate if this could be accepted soon into the framework!

@magicDGS
Copy link
Contributor Author

magicDGS commented Nov 1, 2016

Hello @droazen. Is there any possibility to review this soon? Thanks in advance!

@droazen
Copy link
Collaborator

droazen commented Nov 2, 2016

Sorry @magicDGS, we'll get to this (and your other PRs) as soon as we're able -- once we're out of alpha and the GATK3 team joins us here we should be able to respond to external pull requests much more promptly, but until then we'll unfortunately have to live with slow turnaround time on code reviews.

@magicDGS
Copy link
Contributor Author

magicDGS commented Nov 2, 2016

Thanks for the reply @droazen! I know that it is a lot of work without code that it is not strictly useful within the milestones, and that's why I started this PR time ago (I will need it soon in my work, but I can just use this branch if I rebase from time to time).

I just want to point to this PR again, although I know that it is not trivial code and will take a while to accept it. I'm looking forward for your comments on this. Thanks again!

@magicDGS
Copy link
Contributor Author

Any progress/update with respect to this review, @droazen? Thanks in advance!

/**
* @author Daniel Gomez-Sanchez (magicDGS)
*/
public class WindowPaddingArgumentCollectionTest extends BaseTest{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reminder: extend GATKBaseTest (requires rebase)

/**
* @author Daniel Gomez-Sanchez (magicDGS)
*/
public class WindowSizeArgumentCollectionTest extends BaseTest{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reminder: extend GATKBaseTest (requires rebase)

/**
* @author Daniel Gomez-Sanchez (magicDGS)
*/
public class WindowStepArgumentCollectionTest extends BaseTest{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reminder: extend GATKBaseTest (requires rebase)

import java.util.Iterator;
import java.util.function.Predicate;

public class FilteringIteratorUnitTest extends BaseTest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-reminder: extend GATKBaseTest (requires rebase)

import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

/**
* A class to represent a shard of reads data, optionally expanded by a configurable amount of padded data.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen - looking into the resurrection of this branch, I realized that the LocalReadShard disappeared. For the sliding-window walker I required that each window is independent of the others, so I guess that the MultiIntervalLocalReadShard is not the option to go.

Is it ok to keep the LocalReadShard? Otherwise, I think that starting a new project for not keeping the GATK blow with unnecessary walkers for your methods is the best way to go...

Copy link
Collaborator

@droazen droazen Mar 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS I think it would be ok to resurrect LocalReadShard if you really need it. But you should know that the reason we got rid of it was because we found that there are major performance issues with querying each interval individually. It's much more efficient to query a large number of intervals at once, particularly if the intervals are close together or overlap, to avoid reading and parsing the same parts of the bam file multiple times. Once we moved to MultiIntervalLocalReadShard, the performance of both the HaplotypeCaller and Mutect2 increased by about 30-40%.

I'd recommend closing out these old PRs and submitting a new PR for SlidingWindowWalker with the walker class itself, an ExampleSlidingWindowWalker and ExampleSlidingWindowWalkerIntegrationTest, and whatever utilities you require (but note that the less common code you refactor, the easier it will be for us to merge your PR quickly).

@magicDGS
Copy link
Contributor Author

Closing in favor of a new implementation

@magicDGS magicDGS closed this Apr 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants