Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411

Merged

Conversation

tfenne
Copy link
Collaborator

@tfenne tfenne commented Oct 15, 2019

Description

MarkDuplicates is glacially slow when operating on queryname sorted input. The reason for this is that it continues to use the DiskBasedReadEndsForMarkDuplicatesMap which does a bunch of disk-based IO every time it sees a record mapped to a different chromosome than the last record it saw - which is essentially every other record when operating in queryname order.

I was testing with 2m random reads from an exome BAM, and it was taking ~15 minutes to go through the first phase or calculating the read ends. Subbing in the in-memory map reduced this to < 10 seconds. This should be entirely safe because when operating on queryname sorted data the read ends map will generally only ever have 1 read in it at a time!


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@tfenne tfenne requested a review from yfarjoun October 15, 2019 20:26
@tfenne tfenne self-assigned this Oct 15, 2019
Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have been done a long time ago!

@jamesemery
Copy link
Contributor

This looks like it would be a good improvement. How significant is that 15 minute speedup compared to the entire runtime of the tool? I suspect this should have an outsized impact on GCS performance because I believe the current pipeline scripts are being run with single core machines and slow persistent disks (which perform worse the fewer cores allocated to the machine).

@tfenne
Copy link
Collaborator Author

tfenne commented Oct 15, 2019

@jamesemery I believe the overall runtime for that small file was on the order of 1 minute. So still something like a 10-15X speedup overall.

@tfenne tfenne merged commit 65f7283 into broadinstitute:master Oct 22, 2019
@tfenne tfenne deleted the tf_speedup_markdupes_on_queryname branch October 22, 2019 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants