
MarkDuplicatesSpark produces unreadable BAM #1005

Closed
davidadamsphd opened this issue Oct 16, 2015 · 13 comments
@davidadamsphd
Contributor

MarkDuplicatesSpark runs successfully, but the (sharded) BAM cannot be loaded using ReadsSparkSource (with your fix to getHeader).

I don't think this is a problem with Hadoop-BAM, because I'm able to run the same input through SortBamSpark and the output can be read.

The input is from JP's bucket
gs://jpmartin/hellbender-test-inputs/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam
but copied to hdfs

I ran

gcloud beta dataproc jobs submit spark --cluster high-mem-32-4 --properties  spark.executor.memory=19g,spark.executor.instances=32,spark.executor.cores=3  --class org.broadinstitute.hellbender.Main --jar build/libs/gatk-all-4.pre-alpha-7-*-SNAPSHOT-spark.jar MarkDuplicatesSpark -I hdfs:///user/davidada/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam -O hdfs:///user/davidada/dummy-duped-CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam --sparkMaster yarn-client

gcloud beta dataproc jobs submit spark --cluster high-mem-32-4 --properties  spark.executor.memory=19g,spark.executor.instances=32,spark.executor.cores=3  --class org.broadinstitute.hellbender.Main --jar build/libs/gatk-all-4.pre-alpha-7-*-SNAPSHOT-spark.jar SortBamSpark -I hdfs:///user/davidada/dummy-duped-CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam --sparkMaster yarn-client -O  hdfs:///user/davidada/tmp.bam

But I expect that spark-submit should be the same.
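For reference, a roughly equivalent direct spark-submit invocation would look something like the following. This is an untested sketch: the `--conf` flags mirror the Dataproc `--properties` above, and `--master yarn-client` mirrors `--sparkMaster yarn-client`.

```shell
spark-submit \
  --master yarn-client \
  --class org.broadinstitute.hellbender.Main \
  --conf spark.executor.memory=19g \
  --conf spark.executor.instances=32 \
  --conf spark.executor.cores=3 \
  build/libs/gatk-all-4.pre-alpha-7-*-SNAPSHOT-spark.jar \
  MarkDuplicatesSpark \
  -I hdfs:///user/davidada/CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam \
  -O hdfs:///user/davidada/dummy-duped-CEUTrio.HiSeq.WGS.b37.ch1.1m-65m.NA12878.bam
```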

@davidadamsphd
Contributor Author

Stack trace, not so useful...

15/10/16 14:54:47 ERROR org.apache.spark.scheduler.TaskResultGetter: Could not deserialize TaskEndReason: ClassNotFound with classloader org.apache.spark.util.MutableURLClassLoader@587c290d
15/10/16 14:54:47 ERROR org.apache.spark.scheduler.TaskResultGetter: Could not deserialize TaskEndReason: ClassNotFound with classloader org.apache.spark.util.MutableURLClassLoader@587c290d
15/10/16 14:54:47 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1374.2 in stage 0.0 (TID 2218, high-mem-32-4-w-14.c.genomics-pipelines.internal): UnknownReason
15/10/16 14:54:47 WARN org.apache.spark.ThrowableSerializationWrapper: Task exception could not be deserialized
java.lang.ClassNotFoundException: htsjdk.samtools.util.RuntimeEOFException
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
    at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

@tomwhite
Contributor

I'm not able to access a cluster to test this on right now, however I wonder if it's something to do with the sort order declared in the BAM header. MarkDuplicatesSpark uses the same header for the output as the input, and doesn't change the sort order attribute, unlike SortBamSpark. Does setting it to unsorted help?
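For reference, the sort order lives in the SO tag of the SAM @HD header line; in htsjdk it is set via SAMFileHeader.setSortOrder(SAMFileHeader.SortOrder.unsorted). The stdlib-only sketch below (class and method names are hypothetical, not htsjdk API) just shows what that change amounts to at the text-header level:

```java
// Minimal sketch of what "setting the sort order to unsorted" means at
// the SAM text-header level: rewriting (or adding) the SO tag on the
// @HD line. In htsjdk this would be done on the SAMFileHeader object
// via setSortOrder, not by string manipulation.
public class SortOrderTag {

    static String withSortOrder(String hdLine, String order) {
        if (!hdLine.startsWith("@HD")) {
            throw new IllegalArgumentException("not an @HD line: " + hdLine);
        }
        if (hdLine.contains("\tSO:")) {
            // Replace the existing sort-order value (tag values are
            // tab-delimited, so match up to the next tab).
            return hdLine.replaceAll("SO:[^\\t]*", "SO:" + order);
        }
        return hdLine + "\tSO:" + order; // no SO tag yet: append one
    }

    public static void main(String[] args) {
        System.out.println(withSortOrder("@HD\tVN:1.5\tSO:coordinate", "unsorted"));
        // → @HD\tVN:1.5\tSO:unsorted (tabs shown as \t)
    }
}
```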

Have you been able to reproduce this on smaller inputs? I noticed that we don't have an integration test for MD.

@davidadamsphd
Contributor Author

I'm not sure it's sort-order related, because I've run the exact same thing on a smaller BAM, CEUTrio.HiSeq.WGS.b37.ch20.4m-16m.NA12878.bam (a subset of a larger original BAM), and it runs fine. If I had to guess, there's a section of the larger BAM that's causing MarkDuplicatesSpark to produce a malformed SAMRecord.

@tomwhite
Contributor

I have managed to reproduce this and have narrowed it down to a single file with the problem. Now I need to see what's wrong with that file.

@davidadamsphd
Contributor Author

Woo! Thanks for the update.

@tomwhite
Contributor

Another update. It looks like the problem lies in Hadoop-BAM, in the part of the code that looks for the start of records in the BGZF block-compressed file. If a BAM file has an index (produced by SplittingBAMIndexer), Hadoop-BAM will use it; if not, it falls back on BAMSplitGuesser to heuristically find the start of a record. In this case it is trying to find the first record, but it gets the wrong offset (6505 instead of 6506), which then causes the error.
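For context, a BAM alignment record begins with a fixed-size block of little-endian fields (block size, reference ID, position, read-name length, and so on), and a split guesser decides whether a candidate offset is a real record boundary by sanity-checking those fields. The sketch below is a simplified, hypothetical illustration of that kind of plausibility check, not the actual Hadoop-BAM code; the field layout follows the BAM specification, and the real guesser also validates CIGAR operators, mate fields, and several consecutive records before accepting an offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simplified sketch of the kind of plausibility check BAMSplitGuesser
// performs on a candidate record-start offset. NOT the real Hadoop-BAM
// implementation.
public class RecordStartGuess {

    static boolean looksLikeRecordStart(byte[] data, int offset, int nRefs) {
        if (data.length - offset < 36) {
            return false; // not enough room for the fixed-size prefix
        }
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 36)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int blockSize = buf.getInt();       // bytes in the rest of the record
        if (blockSize < 32 || blockSize > (1 << 26)) {
            return false;
        }
        int refId = buf.getInt();           // -1 is allowed (unmapped read)
        if (refId < -1 || refId >= nRefs) {
            return false;
        }
        int pos = buf.getInt();             // 0-based leftmost position
        if (pos < -1 || pos >= (1 << 29)) {
            return false;                   // BAM's coordinate limit
        }
        int lReadName = buf.get() & 0xFF;   // includes the NUL terminator
        return lReadName >= 1;
    }

    public static void main(String[] args) {
        byte[] plausible = new byte[36];
        ByteBuffer b = ByteBuffer.wrap(plausible).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(100).putInt(0).putInt(1000).put((byte) 10);
        System.out.println(looksLikeRecordStart(plausible, 0, 25)); // true
        // Shift the candidate by one byte, as in the 6505-vs-6506 case,
        // and the fields no longer parse sensibly (here: too short).
        System.out.println(looksLikeRecordStart(plausible, 1, 25)); // false
    }
}
```

The weakness the thread later identifies is exactly that checks like these can accept a wrong offset by coincidence, which is why following several consecutive records makes the guess much stronger.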

We should fix the code in BAMSplitGuesser that causes this. I haven't managed to find where the bug is yet; it may take a while. It might also be worth seeing if it's possible to write indexes from ReadsSparkSink, so that BAMSplitGuesser doesn't have to be used at all.

I've uploaded the problematic BAM here: https://github.com/broadinstitute/gatk/blob/tw_debug_invalid_bam/part-r-00686.bam

@tomwhite
Contributor

Looks like it might be due to the change in htsjdk from 1.131 to 1.140.

@davidadamsphd
Contributor Author

Great detective work!

Hadoop-BAM appears to still be an active-ish project; it's worth telling them we found a bug and have a BAM that can reproduce the problem. Hopefully, they can pick it up and fix it quickly (if not, we can fix it ourselves).

However, even in the best case, they'd still need to cut another release for us to get the fix.
Two things:

  • (long shot) is it fixed in org.seqdoop:hadoop-bam:7.2.0?
  • Try to write out the index for each shard. From a quick skim, I didn't see anything about writing out indices in Hadoop-BAM, so I think we'd have to get the name of each BAM shard, put those into an RDD, and in a forEach read each BAM and create its index. @droazen, does that sound hard?

If it's not hard, I'd like to get this unblocked by writing out the index and then wait/work on the real fix.

@tomwhite
Contributor

I'll report this to the Hadoop-BAM project and try to get a fix implemented. I haven't got time today, but will take another look tomorrow.

I agree that writing an index for each shard would be good to do anyway. Note that the index is a special Hadoop-BAM index. I wonder if there's any way to write out the index as a side file while writing out the shards (so that the BAM doesn't need to be re-read). At a minimum, caching the RDD would help. Also, the input side may need changing so that the shard and index file extensions are respected.

@cmnbroad
Collaborator

FWIW, it looks like BAMSplitGuesser can be run manually from the command line (at least, it has a static main that takes command-line args), which might speed up the debugging. Also, in HB we're currently excluding htsjdk from the Hadoop-BAM dependency, so I think Hadoop-BAM is running with the htsjdk we provide (1.140), which is a bit ahead of what it expects. It might be worth seeing whether it reproduces with either BAMSplitGuesser standalone (if you know the offset that's failing) or with HB reverted to the old htsjdk.

@tomwhite tomwhite changed the title MarkDuplicatesSpark produces invalid BAM MarkDuplicatesSpark produces unreadable BAM Oct 21, 2015
@tomwhite
Contributor

I've submitted a PR to Hadoop-BAM. Note that the problem is not due to a change in htsjdk as I incorrectly stated above. It is because the heuristics that BAMSplitGuesser uses are too weak to reject an incorrect candidate BAM record start position.

@davidadamsphd
Contributor Author

Sweet!
And now to wait for the review.

@davidadamsphd
Contributor Author

This was fixed by #1054. Thanks for tracking that down Tom!

lbergelson pushed a commit that referenced this issue May 31, 2017