
MarkDuplicatesSpark produces unreadable BAM #1005

Closed
davidadamsphd opened this issue Oct 16, 2015 · 13 comments
@davidadamsphd
Contributor

MarkDuplicatesSpark runs successfully, but the (sharded) BAM cannot be loaded using ReadsSparkSource (with your fix to getHeader).

I don't think this is a problem with Hadoop-BAM, because I'm able to run the same input through SortBamSpark and the output can be read.

The input is from JP's bucket
gs://jpmartin/hellbender-test-inputs/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam
but copied to hdfs

I ran

gcloud beta dataproc jobs submit spark --cluster high-mem-32-4 --properties  spark.executor.memory=19g,spark.executor.instances=32,spark.executor.cores=3  --class org.broadinstitute.hellbender.Main --jar build/libs/gatk-all-4.pre-alpha-7-*-SNAPSHOT-spark.jar MarkDuplicatesSpark -I hdfs:///user/davidada/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam -O hdfs:///user/davidada/dummy-duped-CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam --sparkMaster yarn-client

gcloud beta dataproc jobs submit spark --cluster high-mem-32-4 --properties  spark.executor.memory=19g,spark.executor.instances=32,spark.executor.cores=3  --class org.broadinstitute.hellbender.Main --jar build/libs/gatk-all-4.pre-alpha-7-*-SNAPSHOT-spark.jar SortBamSpark -I hdfs:///user/davidada/dummy-duped-CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam --sparkMaster yarn-client -O  hdfs:///user/davidada/tmp.bam

But I expect that spark-submit should be the same.
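For reference, a roughly equivalent direct spark-submit invocation would look something like the following. This is an untested sketch: the `--conf` flags mirror the Dataproc `--properties` above, and `--master yarn-client` mirrors `--sparkMaster yarn-client`.

```shell
spark-submit \
  --master yarn-client \
  --class org.broadinstitute.hellbender.Main \
  --conf spark.executor.memory=19g \
  --conf spark.executor.instances=32 \
  --conf spark.executor.cores=3 \
  build/libs/gatk-all-4.pre-alpha-7-*-SNAPSHOT-spark.jar \
  MarkDuplicatesSpark \
  -I hdfs:///user/davidada/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam \
  -O hdfs:///user/davidada/dummy-duped-CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam
```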

@davidadamsphd
Contributor Author

Stack trace, not so useful...

15/10/16 14:54:47 ERROR org.apache.spark.scheduler.TaskResultGetter: Could not deserialize TaskEndReason: ClassNotFound with classloader org.apache.spark.util.MutableURLClassLoader@587c290d
15/10/16 14:54:47 ERROR org.apache.spark.scheduler.TaskResultGetter: Could not deserialize TaskEndReason: ClassNotFound with classloader org.apache.spark.util.MutableURLClassLoader@587c290d
15/10/16 14:54:47 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1374.2 in stage 0.0 (TID 2218, high-mem-32-4-w-14.c.genomics-pipelines.internal): UnknownReason
15/10/16 14:54:47 WARN org.apache.spark.ThrowableSerializationWrapper: Task exception could not be deserialized
java.lang.ClassNotFoundException: htsjdk.samtools.util.RuntimeEOFException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
    at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

@tomwhite
Contributor

I'm not able to access a cluster to test this on right now, however I wonder if it's something to do with the sort order declared in the BAM header. MarkDuplicatesSpark uses the same header for the output as the input, and doesn't change the sort order attribute, unlike SortBamSpark. Does setting it to unsorted help?
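For reference, the sort order lives in the SO tag of the SAM @HD header line; in htsjdk it is set via SAMFileHeader.setSortOrder(SAMFileHeader.SortOrder.unsorted). The stdlib-only sketch below (class and method names are hypothetical, not htsjdk API) just shows what that change amounts to at the text-header level:

```java
// Minimal sketch of what "setting the sort order to unsorted" means at
// the SAM text-header level: rewriting (or adding) the SO tag on the
// @HD line. In htsjdk this would be done on the SAMFileHeader object
// via setSortOrder, not by string manipulation.
public class SortOrderTag {

    static String withSortOrder(String hdLine, String order) {
        if (!hdLine.startsWith("@HD")) {
            throw new IllegalArgumentException("not an @HD line: " + hdLine);
        }
        if (hdLine.contains("\tSO:")) {
            // Replace the existing sort-order value (tag values are
            // tab-delimited, so match up to the next tab).
            return hdLine.replaceAll("SO:[^\\t]*", "SO:" + order);
        }
        return hdLine + "\tSO:" + order; // no SO tag yet: append one
    }

    public static void main(String[] args) {
        System.out.println(withSortOrder("@HD\tVN:1.5\tSO:coordinate", "unsorted"));
        // → @HD\tVN:1.5\tSO:unsorted (tabs shown as \t)
    }
}
```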

Have you been able to reproduce this on smaller inputs? I noticed that we don't have an integration test for MD.

@davidadamsphd
Contributor Author

I'm not sure it's sort-order related, because I've run the exact same thing on a smaller BAM, CEUTrio.HiSeq.WGS.b37.ch20.4m-16m.NA12878.bam (a subset of a larger original BAM), and it runs fine. If I had to guess, there's a section of the larger BAM that's causing MarkDuplicatesSpark to produce a malformed SAMRecord.

@tomwhite
Contributor

I have managed to reproduce this and have narrowed it down to a single file with the problem. Now I need to see what's wrong with that file.

@davidadamsphd
Contributor Author

Woo! Thanks for the update.

@tomwhite
Contributor

Another update. It looks like the problem lies in Hadoop-BAM, in the part of the code that looks for the start of records in the BGZF block-compressed file. If a BAM file has an index (produced by SplittingBAMIndexer), Hadoop-BAM will use it; if not, it falls back on BAMSplitGuesser to heuristically find the start of a record. In this case it is trying to find the first record, but it gets the wrong offset (6505 instead of 6506), which then causes the error.
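For context, a BAM alignment record begins with a fixed-size block of little-endian fields (block size, reference ID, position, read-name length, and so on), and a split guesser decides whether a candidate offset is a real record boundary by sanity-checking those fields. The sketch below is a simplified, hypothetical illustration of that kind of plausibility check, not the actual Hadoop-BAM code; the field layout follows the BAM specification, and the real guesser also validates CIGAR operators, mate fields, and several consecutive records before accepting an offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simplified sketch of the kind of plausibility check BAMSplitGuesser
// performs on a candidate record-start offset. NOT the real Hadoop-BAM
// implementation.
public class RecordStartGuess {

    static boolean looksLikeRecordStart(byte[] data, int offset, int nRefs) {
        if (data.length - offset < 36) {
            return false; // not enough room for the fixed-size prefix
        }
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 36)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int blockSize = buf.getInt();       // bytes in the rest of the record
        if (blockSize < 32 || blockSize > (1 << 26)) {
            return false;
        }
        int refId = buf.getInt();           // -1 is allowed (unmapped read)
        if (refId < -1 || refId >= nRefs) {
            return false;
        }
        int pos = buf.getInt();             // 0-based leftmost position
        if (pos < -1 || pos >= (1 << 29)) {
            return false;                   // BAM's coordinate limit
        }
        int lReadName = buf.get() & 0xFF;   // includes the NUL terminator
        return lReadName >= 1;
    }

    public static void main(String[] args) {
        byte[] plausible = new byte[36];
        ByteBuffer b = ByteBuffer.wrap(plausible).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(100).putInt(0).putInt(1000).put((byte) 10);
        System.out.println(looksLikeRecordStart(plausible, 0, 25)); // true
        // Shift the candidate by one byte, as in the 6505-vs-6506 case,
        // and the fields no longer parse sensibly (here: too short).
        System.out.println(looksLikeRecordStart(plausible, 1, 25)); // false
    }
}
```

The weakness the thread later identifies is exactly that checks like these can accept a wrong offset by coincidence, which is why following several consecutive records makes the guess much stronger.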

We should fix the code in BAMSplitGuesser that causes this. I haven't managed to find where the bug is yet; it may take a while. It might also be worth seeing if it's possible to write indexes from ReadsSparkSink, so that BAMSplitGuesser doesn't have to be used at all.

I've uploaded the problematic BAM here: https://github.com/broadinstitute/gatk/blob/tw_debug_invalid_bam/part-r-00686.bam

@tomwhite
Contributor

Looks like it might be due to the change in htsjdk from 1.131 to 1.140.

@davidadamsphd
Contributor Author

Great detective work!

Hadoop-BAM appears to still be an active-ish project; it's worth telling them we found a bug and have a BAM that can reproduce the problem. Hopefully, they can pick it up and fix it quickly (if not, we can fix it ourselves).

However, even in the best case, they'd still need to cut another release for us to get the fix.
Two things:

  • (long shot) is it fixed in org.seqdoop:hadoop-bam:7.2.0?
  • Try to write out the index for each shard. From a quick skim, I didn't see anything about writing out indices in Hadoop-BAM, so I think we'd have to get the name of each BAM shard, put those into an RDD, and in a forEach read each BAM and create its index. @droazen, does that sound hard?

If it's not hard, I'd like to get this unblocked by writing out the index and then wait/work on the real fix.

@tomwhite
Contributor

I'll report this to the Hadoop-BAM project and try to get a fix implemented. I haven't got time today, but will take another look tomorrow.

I agree that writing an index for each shard would be good to do anyway. Note that the index is a special Hadoop-BAM index. I wonder if there's any way to write out the index as a side file while writing out the shards (so that the BAM doesn't need to be re-read). At a minimum, caching the RDD would help. Also, the input side may need changing so that the shard and index file extensions are respected.

@cmnbroad
Collaborator

FWIW, it looks like BAMSplitGuesser can be run manually from the command line (at least, it has a static main that takes command-line args), which might speed up the debugging. Also, in HB we're currently excluding htsjdk from the Hadoop-BAM dependency, so I think Hadoop-BAM is running with the htsjdk we provide (1.140), which is a bit ahead of what it expects. It might be worth seeing whether it reproduces with either BAMSplitGuesser standalone (if you know the offset that's failing) or with HB reverted to the old htsjdk.

@tomwhite tomwhite changed the title MarkDuplicatesSpark produces invalid BAM MarkDuplicatesSpark produces unreadable BAM Oct 21, 2015
@tomwhite
Contributor

I've submitted a PR to Hadoop-BAM. Note that the problem is not due to a change in htsjdk as I incorrectly stated above. It is because the heuristics that BAMSplitGuesser uses are too weak to reject an incorrect candidate BAM record start position.

@davidadamsphd
Contributor Author

Sweet!
And now to wait for the review.

@davidadamsphd
Contributor Author

This was fixed by #1054. Thanks for tracking that down Tom!

lbergelson pushed a commit that referenced this issue May 31, 2017