external sort in markdup #123

Closed

RichardCorbett opened this issue Mar 2, 2015 · 9 comments

@RichardCorbett

Hi,
I'm (still) playing with duplicate marking on a large human BAM (about 100x coverage).
Using version sambamba_02_02_2015 without any parameters, I get the "too many open files" error when marking duplicates.

When I apply the "--hash-table-size 1000000" parameter, the run completes correctly, but the first step, "finding positions of the duplicate reads in the file", used 57 GB of RAM, which is too high for us. Is there anything you can recommend to lower the RAM usage while still getting a successfully created duplicate-marked BAM?

@lomereiter lomereiter changed the title optimizing dup marking external sort in markdup Mar 2, 2015
@lomereiter lomereiter self-assigned this Mar 2, 2015
@lomereiter
Contributor

It can be improved; it's just that nobody has run it on huge datasets yet. Picard implements an external sort to keep the memory footprint within limits; the same should be done here.
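
For readers unfamiliar with the idea, here is a minimal external-sort sketch in Python. It is purely illustrative and not sambamba's or Picard's actual implementation: the `(ref_id, position)` record format and the `chunk_size` limit are assumptions. Chunks are sorted in memory, spilled to temporary files, and then merged as a stream, so memory use is bounded by the chunk size rather than the dataset size.

```python
import heapq
import tempfile


def external_sort(records, chunk_size=1_000_000):
    """Sort an arbitrarily large stream of records with bounded memory.

    Records are assumed to be comparable tuples, e.g. (ref_id, position);
    chunk_size is a hypothetical cap on how many records are held in RAM.
    """
    runs = []
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) >= chunk_size:
            runs.append(_spill(sorted(buf)))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf)))
    # Streaming k-way merge over the sorted runs; only one record per run
    # is resident in memory at any time.
    return heapq.merge(*(_read(run) for run in runs))


def _spill(sorted_records):
    # Write one sorted run to a temporary file, one record per line.
    f = tempfile.TemporaryFile(mode="w+")
    for ref_id, pos in sorted_records:
        f.write(f"{ref_id}\t{pos}\n")
    f.seek(0)
    return f


def _read(f):
    # Stream records back from a spilled run.
    for line in f:
        ref_id, pos = line.split("\t")
        yield (int(ref_id), int(pos))
```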

@RichardCorbett
Author

Thanks Artem,
Some of us here are wondering how in the world you came up with the name sambamba, which happens to be a Swahili word for "parallel". Is there some kind of magical search engine that can help come up with appropriate names like that?

@lomereiter
Contributor

That's a funny question :-) I don't know of such an engine. As it happened, I typed 'parallel' into Google Translate, tried different languages, and suddenly had this stroke of luck.

@RichardCorbett
Author

Aha, genius. I'm definitely going to try that next time I need a name for something.
Thanks for the tip!

@isthisthat

Hi guys, I second the request for a more memory-efficient markdup. I currently switch it off for anything over 50x whole-genome coverage because it runs out of memory. Thanks Artem!

@inijman

inijman commented Dec 18, 2015

Hi Artem, I just want to underline this. We routinely have samples at >150x WGS coverage, and the sysadmins don't like the resource usage anymore :) It does run with a large overflow list size and raised ulimits, but it's stressing the systems. It would be great if you could find a way to optimize it.

@lomereiter
Contributor

Hi all,
I acknowledge the problem and have put quite some time in today to make a working prototype, which currently lives in the markdup-extsort branch. It's somewhat subpar in terms of disk usage (the same ~20 bytes per read, which could be compressed where there's an opportunity), and I've only had time for very superficial testing (checking the number of duplicates marked, etc.), so please give it a try!

@lomereiter
Contributor

In the latest commit in the markdup-extsort branch I introduced LZ4 compression. With default settings, peak RAM consumption stayed under 2 GB in my test (it should be independent of the file size), and for an 18 GB input file only 3 GB of temporary space was used (data that previously would have occupied about 5 GB in RAM).
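
To illustrate the idea of compressing the spilled runs, here is a rough Python sketch using the third-party `lz4` package's frame API. The file layout, the `(ref_id, position)` record format, and the helper names `spill_compressed`/`read_compressed` are hypothetical and only stand in for the general approach, not sambamba's actual on-disk format.

```python
import lz4.frame  # third-party 'lz4' package, assumed installed


def spill_compressed(records, path):
    """Write one sorted run of (ref_id, position) records as an LZ4 frame."""
    with lz4.frame.open(path, mode="wt") as out:
        for ref_id, pos in records:
            out.write(f"{ref_id}\t{pos}\n")


def read_compressed(path):
    """Stream records back, decompressing on the fly."""
    with lz4.frame.open(path, mode="rt") as src:
        for line in src:
            ref_id, pos = line.split("\t")
            yield (int(ref_id), int(pos))
```

Because the per-read records are small and highly repetitive, a fast compressor like LZ4 shrinks the temporary files considerably while adding little CPU overhead, which matches the reported drop from ~5 GB of in-memory data to ~3 GB of temporary disk space.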

@lomereiter
Contributor

Added to v0.6.0
