Implemented SlidingWindowWalker #1708
2436516 to c1145d3
I was thinking after checking your @droazen, could you have a look and tell me what you think about the idea?
@magicDGS The HaplotypeCaller traversal has undergone some changes in the past few weeks to improve performance and bring the output of the tool closer to GATK3. There is now an Initially, I did plan on having
Ultimately it was just too awkward and forced, and the read shard is something that we eventually want to make an internal/encapsulated implementation detail anyway. GATK3 made the mistake, I think, of using long, confusing inheritance chains for its walker types, with the result that you got awkward and forced relationships like For all of these reasons we don't want
I agree that it is better to keep it simpler. I was proposing this before looking at the latest commits in the HC branch. I will work on this walker without thinking about other cases, but I would like to keep the idea of padding and of sliding over intervals instead of over the genome from the beginning. It will be useful for the things that I have in mind. Should I close this PR until I implement everything, @droazen?
@magicDGS You can keep this open -- just push the new version into this same branch and add a comment when you're happy with it and I'll re-review.
c1145d3 to 570fa1a
I finished the implementation for the draft and made a "TODO" about the way in which the intervals are constructed, because I will need a that
c334088 to d346c34
@droazen, could you review this PR? Before I implement the integration test, it would be important that you check if it is possible the change of the Thank you very much in advance.
@@ -2,6 +2,7 @@

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.util.Locatable;
import org.apache.avro.test.Simple;
Remove stray import
Done
I extracted some classes to a different branch (#2023) to make it easier to review and integrate with the changes for HaplotypeCallerSpark. I will wait for that branch to be accepted to rebase and update this one...
0f99a0b to 6ea9971
8c0e1d5 to cf33b7a
cf33b7a to 7e54331
Hello @droazen, I rebased the PR onto the latest master. Could you have a look at this, please? I would be very thankful. Thanks in advance!
7e54331 to 23871fa
Sorry @droazen, the previous commit had an error in the tests. I'm rebasing/squashing to make a clean PR, and when all checks pass (except CLOUD), you can review if you have time. Thank you very much.
a05ecec to e413ebf
Friendly ping here, @droazen! I would appreciate it if this could be accepted into the framework soon!
Hello @droazen. Is there any possibility to review this soon? Thanks in advance!
Sorry @magicDGS, we'll get to this (and your other PRs) as soon as we're able -- once we're out of alpha and the GATK3 team joins us here we should be able to respond to external pull requests much more promptly, but until then we'll unfortunately have to live with slow turnaround time on code reviews.
Thanks for the reply @droazen! I know that reviewing code that is not strictly useful within the milestones is a lot of work, and that's why I started this PR some time ago (I will need it soon in my work, but I can just use this branch if I rebase from time to time). I just want to point to this PR again, although I know that it is not trivial code and that it will take a while to accept it. I'm looking forward to your comments on this. Thanks again!
e413ebf to 94e61ed
Any progress/update with respect to this review, @droazen? Thanks in advance!
/**
 * @author Daniel Gomez-Sanchez (magicDGS)
 */
public class WindowPaddingArgumentCollectionTest extends BaseTest{
Self-reminder: extend GATKBaseTest
(requires rebase)
/**
 * @author Daniel Gomez-Sanchez (magicDGS)
 */
public class WindowSizeArgumentCollectionTest extends BaseTest{
Self-reminder: extend GATKBaseTest
(requires rebase)
/**
 * @author Daniel Gomez-Sanchez (magicDGS)
 */
public class WindowStepArgumentCollectionTest extends BaseTest{
Self-reminder: extend GATKBaseTest
(requires rebase)
import java.util.Iterator;
import java.util.function.Predicate;

public class FilteringIteratorUnitTest extends BaseTest {
Self-reminder: extend GATKBaseTest
(requires rebase)
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

/**
 * A class to represent a shard of reads data, optionally expanded by a configurable amount of padded data.
@droazen - looking into the resurrection of this branch, I realized that LocalReadShard has disappeared. For the sliding-window walker I require that each window be independent of the others, so I guess that MultiIntervalLocalReadShard is not the way to go.
Is it ok to keep LocalReadShard? Otherwise, I think the best way to go is to start a new project, to avoid bloating GATK with walkers that are unnecessary for your methods...
@magicDGS I think it would be ok to resurrect LocalReadShard if you really need it. But you should know that the reason we got rid of it was because we found that there are major performance issues with querying each interval individually. It's much more efficient to query a large number of intervals at once, particularly if the intervals are close together or overlap, to avoid reading and parsing the same parts of the bam file multiple times. Once we moved to MultiIntervalLocalReadShard, the performance of both the HaplotypeCaller and Mutect2 increased by about 30-40%.
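For illustration only (this is not code from this PR or from GATK): a minimal Java sketch of why batching helps. It merges intervals that overlap or lie close together, so that each region of the file would be requested once instead of once per interval. The class `IntervalMergeSketch`, its `Interval` type, the `mergeForQuery` method, and the `maxGap` parameter are all hypothetical names, and contigs are sorted lexicographically rather than in sequence-dictionary order to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; not part of GATK or htsjdk. */
final class IntervalMergeSketch {

    /** Minimal 1-based closed interval, a stand-in for a real interval class. */
    static final class Interval {
        final String contig;
        final int start;
        final int end;

        Interval(String contig, int start, int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + end;
        }
    }

    /**
     * Merges intervals on the same contig that overlap, abut, or are separated
     * by at most maxGap bases, so overlapping regions are covered only once.
     */
    static List<Interval> mergeForQuery(List<Interval> intervals, int maxGap) {
        List<Interval> sorted = new ArrayList<>(intervals);
        // Assumption: lexicographic contig order is good enough for this sketch.
        sorted.sort(Comparator.comparing((Interval i) -> i.contig)
                .thenComparingInt(i -> i.start));

        List<Interval> merged = new ArrayList<>();
        Interval current = null;
        for (Interval next : sorted) {
            boolean closeEnough = current != null
                    && current.contig.equals(next.contig)
                    && next.start <= current.end + maxGap + 1;
            if (closeEnough) {
                // Extend the running interval instead of starting a new query region.
                current = new Interval(current.contig, current.start,
                        Math.max(current.end, next.end));
            } else {
                if (current != null) {
                    merged.add(current);
                }
                current = next;
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }
}
```

With `maxGap = 0`, the intervals chr1:1-100 and chr1:90-200 collapse into a single region chr1:1-200, so the overlapping bases 90-100 would be read and parsed once rather than twice.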
I'd recommend closing out these old PRs and submitting a new PR for SlidingWindowWalker with the walker class itself, an ExampleSlidingWindowWalker and ExampleSlidingWindowWalkerIntegrationTest, and whatever utilities you require (but note that the less common code you refactor, the easier it will be for us to merge your PR quickly).
Closing in favor of a new implementation
New implementation of SlidingWindowWalker with some ideas from the discussion in #1528. The things requested in #1198 still hold, but the walker is now more general: a padding option is added and windows are constructed per interval.
The code contains a lot of TODOs because it relies on changes implemented in #1567, and because it is supposed to be a walker over ReadWindow instead of SimpleInterval + ReadsContext if reads are available. I think that with these changes it could be general enough to be extended by ReadWindowWalker and by users who need a different way to "slide" over intervals.
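To make the idea of per-interval windows with padding concrete, here is a minimal standalone Java sketch. It is not the code in this PR: `SlidingWindowSketch`, `Window`, and `makeWindows` are hypothetical names, and the handling of the last window (truncated at the interval end rather than dropped) is an assumption. Coordinates are 1-based and closed, and padding is clipped to the contig boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of window construction; not the PR's implementation. */
final class SlidingWindowSketch {

    /** A window plus its padded extent (1-based, closed coordinates). */
    static final class Window {
        final String contig;
        final int start;
        final int end;
        final int paddedStart;
        final int paddedEnd;

        Window(String contig, int start, int end, int paddedStart, int paddedEnd) {
            this.contig = contig;
            this.start = start;
            this.end = end;
            this.paddedStart = paddedStart;
            this.paddedEnd = paddedEnd;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + end
                    + " (padded " + paddedStart + "-" + paddedEnd + ")";
        }
    }

    /**
     * Slides a window of the given size and step over one interval.
     * Each window is built independently of the others; padding is added on
     * both sides and clipped to [1, contigLength].
     */
    static List<Window> makeWindows(String contig, int start, int end,
                                    int contigLength, int size, int step, int padding) {
        if (size <= 0 || step <= 0 || padding < 0) {
            throw new IllegalArgumentException("size and step must be > 0, padding >= 0");
        }
        List<Window> windows = new ArrayList<>();
        for (int s = start; s <= end; s += step) {
            // Assumption: the last window is truncated at the interval end.
            int e = Math.min(s + size - 1, end);
            int paddedStart = Math.max(1, s - padding);
            int paddedEnd = Math.min(contigLength, e + padding);
            windows.add(new Window(contig, s, e, paddedStart, paddedEnd));
            if (e == end) {
                break; // the interval is fully covered; avoid redundant tail windows
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        // Windows of 100 bp, stepping by 50 bp, with 10 bp of padding.
        for (Window w : makeWindows("chr1", 1, 250, 1000, 100, 50, 10)) {
            System.out.println(w);
        }
    }
}
```

Because the windows are built per interval, sliding never crosses an interval boundary; only the padding may reach outside the interval, and it never reaches outside the contig.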