external sort in markdup #123

Closed

RichardCorbett opened this issue Mar 2, 2015 · 9 comments

@RichardCorbett

Hi,
I'm (still) playing with duplicate marking on a large human BAM (about 100x coverage).
Using version sambamba_02_02_2015 without any parameters, I get the "too many open files" error when marking duplicates.

When I apply the "--hash-table-size 1000000" parameter, the run completes correctly, but the first step, "finding positions of the duplicate reads in the file", used 57 GB of RAM, which is too high for us. Is there anything you can recommend to lower the RAM usage while still getting a successfully created duplicate-marked BAM?

@lomereiter lomereiter changed the title optimizing dup marking external sort in markdup Mar 2, 2015
@lomereiter lomereiter self-assigned this Mar 2, 2015
@lomereiter
Contributor

It can be improved; it's just that nobody has run it on huge datasets yet. Picard implements an external sort to keep the memory footprint within limits; the same should be done here.
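
For readers unfamiliar with the idea, here is a minimal external-sort sketch in Python. It is purely illustrative and not sambamba's or Picard's actual implementation: the `(ref_id, position)` record format and the `chunk_size` limit are assumptions. Chunks are sorted in memory, spilled to temporary files, and then merged as a stream, so memory use is bounded by the chunk size rather than the dataset size.

```python
import heapq
import tempfile


def external_sort(records, chunk_size=1_000_000):
    """Sort an arbitrarily large stream of records with bounded memory.

    Records are assumed to be comparable tuples, e.g. (ref_id, position);
    chunk_size is a hypothetical cap on how many records are held in RAM.
    """
    runs = []
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) >= chunk_size:
            runs.append(_spill(sorted(buf)))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf)))
    # Streaming k-way merge over the sorted runs; only one record per run
    # is resident in memory at any time.
    return heapq.merge(*(_read(run) for run in runs))


def _spill(sorted_records):
    # Write one sorted run to a temporary file, one record per line.
    f = tempfile.TemporaryFile(mode="w+")
    for ref_id, pos in sorted_records:
        f.write(f"{ref_id}\t{pos}\n")
    f.seek(0)
    return f


def _read(f):
    # Stream records back from a spilled run.
    for line in f:
        ref_id, pos = line.split("\t")
        yield (int(ref_id), int(pos))
```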

@RichardCorbett
Author

Thanks Artem,
Some of us here are wondering how in the world you came up with the name sambamba, which happens to be a Swahili word for "parallel". Is there some kind of magical search engine that can help come up with appropriate names like that?

@lomereiter
Contributor

That's a funny question :-) I don't know of such an engine. As it happened, I typed 'parallel' into Google Translate, tried different languages, and suddenly had this stroke of luck.

@RichardCorbett
Author

Aha, genius. I'm definitely going to try that next time I need a name for something.
Thanks for the tip!

@isthisthat

Hi guys, I second the request for a more memory-efficient markdup. I currently switch it off for anything over 50x whole-genome coverage because it runs out of memory. Thanks Artem!

@inijman

inijman commented Dec 18, 2015

Hi Artem, I just want to underline this. We routinely have samples at >150x WGS coverage, and the sysadmins don't like the resource usage anymore :) It does run with a large overflow list size and raised ulimits, but it's stressing the systems. It would be great if you could find a way to optimize it.

@lomereiter
Contributor

Hi all,
I acknowledge the problem and have put quite some time in today to make a working prototype, which currently lives in the markdup-extsort branch. It's somewhat subpar in terms of disk usage (the same ~20 bytes per read, which could be compressed where there's an opportunity), and I've only had time for very superficial testing (checking the number of duplicates marked, etc.), so please give it a try!

@lomereiter
Contributor

In the latest commit in the markdup-extsort branch I introduced LZ4 compression. With default settings, peak RAM consumption stayed under 2 GB in my test (it should be independent of the file size), and for an 18 GB input file only 3 GB of temporary space was used (data that previously would have occupied about 5 GB in RAM).
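
To illustrate the idea of compressing the spilled runs, here is a rough Python sketch using the third-party `lz4` package's frame API. The file layout, the `(ref_id, position)` record format, and the helper names `spill_compressed`/`read_compressed` are hypothetical and only stand in for the general approach, not sambamba's actual on-disk format.

```python
import lz4.frame  # third-party 'lz4' package, assumed installed


def spill_compressed(records, path):
    """Write one sorted run of (ref_id, position) records as an LZ4 frame."""
    with lz4.frame.open(path, mode="wt") as out:
        for ref_id, pos in records:
            out.write(f"{ref_id}\t{pos}\n")


def read_compressed(path):
    """Stream records back, decompressing on the fly."""
    with lz4.frame.open(path, mode="rt") as src:
        for line in src:
            ref_id, pos = line.split("\t")
            yield (int(ref_id), int(pos))
```

Because the per-read records are small and highly repetitive, a fast compressor like LZ4 shrinks the temporary files considerably while adding little CPU overhead, which matches the reported drop from ~5 GB of in-memory data to ~3 GB of temporary disk space.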

@lomereiter
Contributor

Added to v0.6.0
