BAM header is not getting set on partition 0 with headerless BAM output format #916

fnothaft · 2016-01-12T03:16:56Z

The bug that will not die... Reported by @almussel. See #676, #691, #711, #712, #721...

WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:       16/01/11 22:40:30 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 172.31.16.110): java.lang.AssertionError: assertion failed: Cannot return header if not attached.
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at scala.Predef$.assert(Predef.scala:179)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat$.getHeader(ADAMBAMOutputFormat.scala:60)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:68)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.lang.Class.newInstance(Class.java:383)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.apache.spark.rdd.InstrumentedOutputFormat.<init>(InstrumentedOutputFormat.scala:33)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.bdgenomics.adam.rdd.read.InstrumentedADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:71)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.lang.Class.newInstance(Class.java:383)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.apache.spark.rdd.PairRDDFunctions$anonfun$saveAsNewAPIHadoopDataset$1$anonfun$12.apply(PairRDDFunctions.scala:1020)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.apache.spark.rdd.PairRDDFunctions$anonfun$saveAsNewAPIHadoopDataset$1$anonfun$12.apply(PairRDDFunctions.scala:1014)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.apache.spark.scheduler.Task.run(Task.scala:88)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
WARNING:toil.leader:8ecfbeab-88ba-45e6-a3a4-ff1270e13051:               at java.lang.Thread.run(Thread.java:745)

@almussel will get us Spark logs tomorrow.


heuermh · 2016-01-12T18:05:26Z

FYI @tomwhite added support for merging BAMs in GATK4, see ReadsSparkSink.java

fnothaft · 2016-01-12T18:09:55Z

I've got a fix for this prepped; just cleaning up a unit test failure, should be good to go in 15min.

Resolves bigdatagenomics#916. Makes several modifications that should eliminate the header attach issue when writing back to SAM/BAM:

* Writes the SAM/BAM header as a single file.
* Instead of trying to attach the SAM/BAM header to the output format via a singleton object, we pass the path to the SAM/BAM header file via the Hadoop configuration.
* The output format reads the header from HDFS when creating the record writer.
* At the end, once we've written the full RDD and the header file, we merge all via Hadoop's FsUtil.
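The strategy in that commit message can be sketched roughly as follows. This is a local-filesystem analogy only, not the code from #917: a plain `Map` stands in for the Hadoop configuration, `java.nio.file` stands in for HDFS, and the final merge (which the commit message attributes to Hadoop's FsUtil) is approximated by simple stream concatenation. The class name, the `sketch.header.path` key, the `_head` file name, and the sample records are all hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map plays the role of the job configuration,
// and the local filesystem plays the role of HDFS.
class HeaderFileSketch {
    // Hypothetical configuration key; not taken from ADAM.
    static final String HEADER_PATH_KEY = "sketch.header.path";

    // Write the header once, as its own file, and record its path in the config.
    static void writeHeader(Path dir, String header, Map<String, String> conf) throws IOException {
        Path headerFile = dir.resolve("_head");
        Files.write(headerFile, header.getBytes(StandardCharsets.UTF_8));
        conf.put(HEADER_PATH_KEY, headerFile.toString());
    }

    // Each writer finds the header through the config instead of shared in-memory state.
    static String readHeader(Map<String, String> conf) throws IOException {
        String path = conf.get(HEADER_PATH_KEY);
        if (path == null) {
            throw new IllegalStateException("header path missing from configuration");
        }
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    // Write one headerless part; the header is loaded when the writer is created.
    static Path writePart(Path dir, int index, List<String> records,
                          Map<String, String> conf) throws IOException {
        String header = readHeader(conf); // a real writer would use this to encode records
        if (header.isEmpty()) {
            throw new IllegalStateException("header file is empty");
        }
        StringBuilder body = new StringBuilder();
        for (String record : records) {
            body.append(record).append('\n');
        }
        Path part = dir.resolve(String.format("part-%05d", index));
        Files.write(part, body.toString().getBytes(StandardCharsets.UTF_8));
        return part;
    }

    // Concatenate the header file and the parts, in order, into a single output file.
    static void merge(Map<String, String> conf, List<Path> parts, Path out) throws IOException {
        try (OutputStream os = Files.newOutputStream(out)) {
            Files.copy(Paths.get(conf.get(HEADER_PATH_KEY)), os);
            for (Path part : parts) {
                Files.copy(part, os);
            }
        }
    }

    // Runs the whole flow in a temp directory and returns the merged file's contents.
    static String runDemo() {
        try {
            Path dir = Files.createTempDirectory("header-sketch");
            Map<String, String> conf = new HashMap<>();
            writeHeader(dir, "@HD\tVN:1.5\n@SQ\tSN:chr1\tLN:1000\n", conf);
            List<Path> parts = new ArrayList<>();
            parts.add(writePart(dir, 0, Arrays.asList("read1", "read2"), conf));
            parts.add(writePart(dir, 1, Arrays.asList("read3"), conf));
            Path out = dir.resolve("merged.sam");
            merge(conf, parts, out);
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(runDemo());
    }
}
```

The point of the design is that every writer, including the one for partition 0, locates the header through a path carried in the job configuration rather than through state set on a singleton, so a writer that cannot find the header fails with an explicit message about the missing path or file.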

fnothaft mentioned this issue Jan 12, 2016

[ADAM-916] New strategy for writing header. #917

Merged

heuermh closed this as completed in #917 Jan 14, 2016
