MarkDuplicatesSpark produces unreadable BAM #1005
Stack trace, not so useful...
I'm not able to access a cluster to test this on right now, however I wonder if it's something to do with the sort order declared in the BAM header. MarkDuplicatesSpark uses the same header for the output as the input, and doesn't change the sort order attribute, unlike SortBamSpark. Does setting it to unsorted help? Have you been able to reproduce this on smaller inputs? I noticed that we don't have an integration test for MD.
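(For reference, the sort order lives in the `SO` attribute of the `@HD` header line. A minimal plain-Java sketch of what "setting it to unsorted" changes at the text level; real code would go through htsjdk's `SAMFileHeader` rather than string manipulation, and the class/method names here are illustrative only:)

```java
public class SortOrderSketch {
    // Rewrite the SO attribute of a SAM @HD header line to "unsorted".
    // Hypothetical helper for illustration; in practice you'd call
    // htsjdk's SAMFileHeader.setSortOrder instead of editing text.
    static String setSortOrderUnsorted(String hdLine) {
        return hdLine.replaceAll("SO:\\S+", "SO:unsorted");
    }

    public static void main(String[] args) {
        String hd = "@HD\tVN:1.5\tSO:coordinate";
        System.out.println(setSortOrderUnsorted(hd)); // @HD  VN:1.5  SO:unsorted
    }
}
```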
I'm not sure it's sort-order related because I've run the exact same thing on a smaller BAM: CEUTrio.HiSeq.WGS.b37.ch20.4m-16m.NA12878.bam (a subset of a larger original BAM) and it runs fine. If I had to guess, there is a section of the larger BAM that's causing the problem.
I have managed to reproduce this and have narrowed it down to a single file with the problem. Now I need to see what's wrong with the file.
Woo! Thanks for the update.
Another update. It looks like the problem lies in Hadoop-BAM, in the part of the code that looks for the start of records in the BGZF block-compressed file. If a BAM file has an index (produced by SplittingBAMIndexer) then it will use it, but if not, it uses BAMSplitGuesser to heuristically find the start of a record. In this case it is trying to find the first record, but it gets the wrong offset (6505 instead of 6506), which then causes the error. We should fix the code in BAMSplitGuesser that causes this. I haven't managed to find where the bug is yet - it may take a while. It might also be worth seeing if it's possible to write indexes from ReadsSparkSink to avoid BAMSplitGuesser having to be used. I've uploaded the problematic BAM here: https://github.com/broadinstitute/gatk/blob/tw_debug_invalid_bam/part-r-00686.bam
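(For background on what the guesser is scanning for: a BGZF block begins with a gzip header whose extra field carries a fixed "BC" subfield holding the block size. A rough self-contained sketch of that kind of block-start check, based on the BGZF layout in the SAM spec; this is not Hadoop-BAM's actual code:)

```java
public class BgzfCheck {
    // Check whether the bytes at `off` look like the start of a BGZF block:
    // gzip magic 1f 8b, CM=8 (deflate), FEXTRA flag set, and an extra field
    // of length 6 whose subfield id is 'B','C' (the block-size subfield).
    static boolean looksLikeBgzfBlockStart(byte[] b, int off) {
        if (off + 18 > b.length) return false;
        if ((b[off] & 0xff) != 0x1f || (b[off + 1] & 0xff) != 0x8b) return false;
        if (b[off + 2] != 8) return false;           // CM must be deflate
        if ((b[off + 3] & 0x04) == 0) return false;  // FEXTRA flag must be set
        int xlen = (b[off + 10] & 0xff) | ((b[off + 11] & 0xff) << 8);
        return xlen == 6 && b[off + 12] == 'B' && b[off + 13] == 'C';
    }

    public static void main(String[] args) {
        // The 28-byte EOF marker block from the BGZF spec: a valid empty block.
        byte[] eof = {0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, (byte) 0xff,
                      6, 0, 66, 67, 2, 0, 0x1b, 0, 3, 0,
                      0, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(looksLikeBgzfBlockStart(eof, 0)); // true
        System.out.println(looksLikeBgzfBlockStart(eof, 1)); // false
    }
}
```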
Looks like it might be due to the change in htsjdk from 1.131 to 1.140. |
Great detective work! Hadoop-BAM appears to still be an active-ish project; it's worth telling them we found a bug and have a BAM that can reproduce the problem. Hopefully they can pick it up and fix it quickly (if not, we can fix it). However, even in the best case they still need to cut another release for us to get the fix.
If it's not hard, I'd like to get this unblocked by writing out the index, and then wait/work on the real fix.
I'll report this to Hadoop-BAM and try to get a fix implemented. I haven't got time today but will take another look tomorrow. I agree that writing an index for each shard would be good to do anyway. Note that the index is a special Hadoop-BAM index. I wonder if there's any way to write out the index as a side file while writing out the shards (so that the BAM doesn't need to be re-read). At a minimum, caching the RDD would help. Also, the input side may need changing so that the shard and index file extensions are respected.
FWIW, it looks like BAMSplitGuesser can be run manually from the command line (at least, it has a static main which takes command-line args), which might speed up the debugging. Also, in HB we're currently excluding htsjdk from the Hadoop-BAM dependency, so I think in HB, Hadoop-BAM is running with the htsjdk we're providing (1.140), which is a bit ahead of what it's expecting. It might be worth trying to see if it repros with either BAMSplitGuesser standalone, if you know the offset that's failing, or with HB reverted to the old htsjdk.
I've submitted a PR to Hadoop-BAM. Note that the problem is not due to a change in htsjdk, as I incorrectly stated above. It is because the heuristics that BAMSplitGuesser uses are too weak to reject an incorrect candidate BAM record start position.
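(To illustrate what "too weak to reject an incorrect candidate" means: a guesser of this kind decodes a few fixed fields at a candidate offset and accepts the offset if they merely look plausible. A toy sketch below; the field names follow the BAM spec, but the thresholds are illustrative assumptions, not BAMSplitGuesser's actual checks:)

```java
public class RecordStartSketch {
    // Illustrative sanity checks on fields decoded at a candidate BAM
    // record start. If garbage bytes happen to pass checks like these,
    // a wrong offset (e.g. 6505 instead of 6506) is accepted.
    static boolean plausibleRecordStart(int blockSize, int refID, int pos,
                                        int nReferences) {
        if (blockSize < 32 || blockSize > 10_000_000) return false; // fixed fields + sane cap
        if (refID < -1 || refID >= nReferences) return false;       // -1 means unmapped
        if (pos < -1 || pos >= (1 << 29)) return false;             // within plausible contig size
        return true;
    }

    public static void main(String[] args) {
        System.out.println(plausibleRecordStart(100, 0, 12345, 25));  // true
        System.out.println(plausibleRecordStart(100, 90, 12345, 25)); // false: refID out of range
    }
}
```

The fix in cases like this is typically to strengthen the checks (e.g. validate more downstream fields, or require several consecutive records to decode cleanly) so a false candidate is rejected.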
Sweet!
This was fixed by #1054. Thanks for tracking that down, Tom!
MarkDuplicatesSpark runs successfully, but the (sharded) BAM cannot be loaded using ReadsSparkSource (with your fix to getHeader).
I don't think this is a problem with Hadoop-BAM because I'm able to run the same input through SortBamSpark and the output can be read.
The input is from JP's bucket
gs://jpmartin/hellbender-test-inputs/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam
but copied to HDFS.
I ran
But I expect that spark-submit should be the same.