Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411

tfenne · 2019-10-15T20:26:02Z

Description

MarkDuplicates is glacially slow when operating on queryname sorted input. The reason for this is that it continues to use the DiskBasedReadEndsForMarkDuplicatesMap which does a bunch of disk-based IO every time it sees a record mapped to a different chromosome than the last record it saw - which is essentially every other record when operating in queryname order.

I was testing with 2m random reads from an exome BAM, and it was taking ~15 minutes to go through the first phase or calculating the read ends. Subbing in the in-memory map reduced this to < 10 seconds. This should be entirely safe because when operating on queryname sorted data the read ends map will generally only ever have 1 read in it at a time!

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

Added or modified tests to cover changes and any new functionality
Edited the README / documentation (if applicable)
All tests passing on Travis

Review

Final thumbs-up from reviewer
Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

…ad-ends map.

yfarjoun

This should have been done a long time ago!

jamesemery · 2019-10-15T20:47:16Z

This looks like it would be a good improvement. How significant is that 15 minute speedup compared to the entire runtime of the tool? I suspect this should have an outsized impact on GCS performance because I believe the current pipeline scripts are being run with single core machines and slow persistent disks (which perform worse the fewer cores allocated to the machine).

tfenne · 2019-10-15T20:56:33Z

@jamesemery I believe the overall runtime for that small file was on the order of 1 minute. So still something like a 10-15X speedup overall.

Speeds up MarkDuplicates on queryname input by using the in memory re…

7d88411

…ad-ends map.

tfenne requested a review from yfarjoun October 15, 2019 20:26

tfenne self-assigned this Oct 15, 2019

yfarjoun reviewed Oct 15, 2019

View reviewed changes

yfarjoun approved these changes Oct 16, 2019

View reviewed changes

tfenne merged commit 65f7283 into broadinstitute:master Oct 22, 2019

tfenne deleted the tf_speedup_markdupes_on_queryname branch October 22, 2019 16:37

kbergin mentioned this pull request Mar 31, 2021

MarkDuplicatesSpark usage broadinstitute/warp#266

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411

Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411

tfenne commented Oct 15, 2019

yfarjoun left a comment

jamesemery commented Oct 15, 2019

tfenne commented Oct 15, 2019

Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411

Speeds up MarkDuplicates on queryname input by using the in memory read-ends map. #1411

Conversation

tfenne commented Oct 15, 2019

Description

Checklist (never delete this)

Content

Review

yfarjoun left a comment

Choose a reason for hiding this comment

jamesemery commented Oct 15, 2019

tfenne commented Oct 15, 2019