
IndexOutOfBoundsException in BAMInputFormat.getSplits #1656

Closed
heuermh opened this Issue Aug 3, 2017 · 4 comments

@heuermh (Member) commented Aug 3, 2017

When batch converting 455 BAM files to ADAM AlignmentRecords in Parquet+Avro format using adam-submit transformAlignments, with both input and output on HDFS, 9 of the files failed with the following exception:

converting hdfs://spark-master:8020/data/sample.bam to hdfs://spark-master:8020/data/sample.alignments.adam...
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/opt/sparkbox/spark/bin/spark-submit
INFO cli.ADAMMain: ADAM invoked with args: "transformAlignments" "hdfs://spark-master:8020/data/sample.bam" "hdfs://spark-master:8020/data/sample.alignments.adam"
...
INFO rdd.ADAMContext: Loading hdfs://spark-master:8020/data/sample.bam as BAM/CRAM/SAM and converting to AlignmentRecords.
INFO rdd.ADAMContext: Loaded header from hdfs://spark-master:8020/data/sample.bam
...
Command body threw exception:
java.lang.IndexOutOfBoundsException
INFO cli.TransformAlignments: Overall Duration: 7.9 secs
Exception in thread "main" java.lang.IndexOutOfBoundsException
        at java.nio.ByteBuffer.wrap(ByteBuffer.java:375)
        at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:246)
        at org.seqdoop.hadoop_bam.BAMSplitGuesser.guessNextBAMRecordStart(BAMSplitGuesser.java:189)
        at org.seqdoop.hadoop_bam.BAMInputFormat.addProbabilisticSplits(BAMInputFormat.java:240)
        at org.seqdoop.hadoop_bam.BAMInputFormat.getSplits(BAMInputFormat.java:155)
        at org.seqdoop.hadoop_bam.AnySAMInputFormat.getSplits(AnySAMInputFormat.java:252)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:120)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1144)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
        at org.apache.spark.rdd.InstrumentedPairRDDFunctions.saveAsNewAPIHadoopFile(InstrumentedPairRDDFunctions.scala:477)
        at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply$mcV$sp(ADAMRDDFunctions.scala:165)
        at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:149)
        at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$saveRddAsParquet$1.apply(ADAMRDDFunctions.scala:149)
        at scala.Option.fold(Option.scala:157)
        at org.apache.spark.rdd.Timer.time(Timer.scala:48)
        at org.bdgenomics.adam.rdd.ADAMRDDFunctions.saveRddAsParquet(ADAMRDDFunctions.scala:149)
        at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:1750)
        at org.bdgenomics.adam.rdd.AvroGenomicRDD.saveAsParquet(GenomicRDD.scala:1725)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDD.save(AlignmentRecordRDD.scala:538)
        at org.bdgenomics.adam.cli.TransformAlignments.run(TransformAlignments.scala:565)
        at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
        at org.bdgenomics.adam.cli.TransformAlignments.run(TransformAlignments.scala:138)
        at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:126)
        at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:65)
        at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
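For context, the batch run above can be sketched as a simple shell loop over the input files. This is a hypothetical reconstruction: the sample names are placeholders, and the actual adam-submit invocation is commented out so the sketch runs without a Spark/HDFS cluster.

```shell
# Sketch of the batch-conversion loop (sample names are placeholders).
for sample in sample1 sample2; do
  in="hdfs://spark-master:8020/data/${sample}.bam"
  out="hdfs://spark-master:8020/data/${sample}.alignments.adam"
  echo "converting ${in} to ${out}..."
  # Real invocation, disabled in this sketch:
  # adam-submit transformAlignments "${in}" "${out}"
done
```

Each iteration mirrors the log line shown above ("converting ... to ..."), which is why only some of the 455 conversions fail while the rest complete.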
@fnothaft (Member) commented Aug 3, 2017

Do these files have indices? If so, want to give HadoopGenomics/Hadoop-BAM#140 a whirl?

@heuermh (Member, Author) commented Aug 3, 2017

> Do these files have indices?

Yes, they do.

> If so, want to give HadoopGenomics/Hadoop-BAM#140 a whirl?

May take some time to get to it, but I will.
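As an aside, a quick way to confirm which BAMs have companion indices is a small shell check. This is a sketch under assumptions: `DIR` is a placeholder for the directory holding the BAMs, and it checks both common index naming conventions (`foo.bam.bai` and `foo.bai`).

```shell
# Report BAM files that have no companion .bai index.
# DIR is a placeholder; point it at the directory holding the BAMs.
DIR="${DIR:-.}"
for bam in "$DIR"/*.bam; do
  [ -e "$bam" ] || continue   # no matches: the glob stays literal, so skip
  if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
    echo "missing index: ${bam}"
  fi
done
```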

@fnothaft (Member) commented Jan 9, 2018

@heuermh did HadoopGenomics/Hadoop-BAM#164 resolve this issue? If so, ToT (tip of tree) should decode these files OK.

@heuermh (Member, Author) commented Jan 26, 2018

Works for me for all nine files with ADAM version 0.23.0.

@heuermh heuermh closed this Jan 26, 2018

@heuermh heuermh added this to the 0.24.0 milestone Jan 26, 2018

@heuermh heuermh added this to Completed in Release 0.24.0 Feb 10, 2018
