Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BAI merge for hg38 fails with OOM #109

Merged
merged 4 commits into from
Jul 2, 2019
Merged

Conversation

tomwhite
Copy link
Member

@tomwhite tomwhite commented Jul 1, 2019

Merging indexes for a large hg38 BAM file fails with OOM. On the face of it, this is odd since the indexes themselves are all quite small (each well under 1MB), even though there are a lot of them (~1000). However, hg38 has an unusually large number of reference sequences, ~3000 rather than the more usual ~30, and this causes the blowup in memory use.

There are ~3000 ref sequences, and ~1000 parts (and therefore bai files to merge). Calling AbstractBAMFileIndex#query creates an array of Bins of size ~4000. The way the bai merging works is that it calls query for every reference of every part then combines them at the end. So that's ~10^10 elements stored in memory!

This can be fixed by changing the order of combining. Store all the parts in memory, then combine reference-wise at the end. This only needs to store ~10^6 elements which is a lot more manageable.

The test added in this PR fails with OOM if run alone (i.e. without the other commits). With the fix described above to change the order of combining, the test runs in >4 minutes. There is a further optimization to CachingBAMFileIndex that reduces the test time to <1 minute.

Tabix merging suffers from the same problem, so I've fixed that code too. There is no optimization necessary since the index code works in a different way that avoids that particular problem.

@heuermh
Copy link
Contributor

heuermh commented Jul 1, 2019

Note much larger header sizes are common; including the full HLA reference brings it up to ~25,000 reference sequences, and in species with unassembled or only partially assembled contigs it can get to 10^6 or 10^7 reference sequences. We had an issue in ADAM with one of those, will update this comment when I find the link.

Copy link
Contributor

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, one comment about a comment...

* A subclass of CachingBAMFileIndex that is optimized for merging, by returning
* null BAMIndexContent objects if all bins are empty.
*/
public class CachingBAMFileIndexOptimized extends CachingBAMFileIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go away as soon as we merge the htsjdk pr?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a note about that with a link to the pr so we don't forget it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll add it to the htsjdk PR. The whole htsjdk source directory in Disq is due to be moved to htsjdk, so it won't be forgotten.

@lbergelson lbergelson removed their assignment Jul 1, 2019
@tomwhite
Copy link
Member Author

tomwhite commented Jul 2, 2019

@heuermh that is a huge header! Do you know how big the header was? Someone mentioned that they had seen a >2GB header which caused problems...

@tomwhite tomwhite merged commit 03339e2 into disq-bio:master Jul 2, 2019
@tomwhite tomwhite deleted the bai-merge-oom branch July 2, 2019 09:11
tomwhite added a commit to tomwhite/htsjdk that referenced this pull request Jul 2, 2019
@heuermh heuermh added this to the 0.4.0 milestone Jul 2, 2019
tomwhite added a commit to tomwhite/htsjdk that referenced this pull request Jul 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants