Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411
Description
MarkDuplicates is glacially slow when operating on queryname sorted input. The reason is that it continues to use the DiskBasedReadEndsForMarkDuplicatesMap, which does a bunch of disk-based IO every time it sees a record mapped to a different chromosome than the last record it saw. When operating in queryname order, that is essentially every other record.

I was testing with 2 million random reads from an exome BAM, and it was taking ~15 minutes to get through the first phase of calculating the read ends. Subbing in the in-memory map reduced this to under 10 seconds. This should be entirely safe because when operating on queryname sorted data the read ends map will generally only ever have 1 read in it at a time!
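To make the reasoning above concrete, here is a minimal, self-contained sketch. It is not Picard code: the class `ReadEndsMapSketch` and both methods are hypothetical, and the sketch assumes a simplified model in which a disk-backed map pays an IO cost each time consecutive records fall on different reference sequences, while a map keyed by read name holds one entry per read whose mate has not yet been seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Picard.
final class ReadEndsMapSketch {

    // Counts how often a record lies on a different reference sequence than
    // the record before it. In the simplified model assumed here, each such
    // switch is what would trigger disk IO in a disk-backed read-ends map.
    static int countSequenceSwitches(int[] referenceIndices) {
        int switches = 0;
        for (int i = 1; i < referenceIndices.length; i++) {
            if (referenceIndices[i] != referenceIndices[i - 1]) {
                switches++;
            }
        }
        return switches;
    }

    // Simulates pairing reads by name with an in-memory map and returns the
    // largest number of unpaired read ends held at any one time.
    static int peakPendingEnds(String[] readNames) {
        Map<String, Integer> pending = new HashMap<>();
        int peak = 0;
        for (String name : readNames) {
            // If the mate is already waiting, remove it; otherwise store this end.
            if (pending.remove(name) == null) {
                pending.put(name, 1);
            }
            peak = Math.max(peak, pending.size());
        }
        return peak;
    }

    public static void main(String[] args) {
        // Coordinate order: records are grouped by reference sequence.
        int[] coordinateOrder = {0, 0, 0, 1, 1, 1, 2, 2};
        // Queryname order: reference sequences are effectively shuffled.
        int[] querynameOrder = {2, 0, 1, 0, 2, 1, 0, 2};
        System.out.println("switches (coordinate): " + countSequenceSwitches(coordinateOrder));
        System.out.println("switches (queryname):  " + countSequenceSwitches(querynameOrder));

        // Queryname order: mates are adjacent, so few ends are ever pending.
        String[] matesAdjacent = {"a", "a", "b", "b", "c", "c"};
        // Mates far apart, as they can be in coordinate order.
        String[] matesApart = {"a", "b", "c", "a", "b", "c"};
        System.out.println("peak pending (mates adjacent): " + peakPendingEnds(matesAdjacent));
        System.out.println("peak pending (mates apart):    " + peakPendingEnds(matesApart));
    }
}
```

Under these assumptions, queryname-ordered input maximizes sequence switches while keeping the number of pending read ends minimal, which is why an in-memory map is both faster and cheap in that case.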
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests