
NullPointerException at htsjdk CramNormalizer.getByteOrDefault #1993

Closed
heuermh opened this issue May 26, 2018 · 5 comments

heuermh commented May 26, 2018

$ adam-shell ...

scala> import org.seqdoop.hadoop_bam.{ CRAMInputFormat, SAMFormat }
import org.seqdoop.hadoop_bam.{CRAMInputFormat, SAMFormat}

scala> sc.hadoopConfiguration.set(CRAMInputFormat.REFERENCE_SOURCE_PATH_PROPERTY, "hdfs:///data/resources/GRCh38/hs38DH.fa")

scala> val reads = sc.loadAlignments("AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/combined_2018-05-18/combined_2018-05-18.hg38.sorted.cram")
reads: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = RDDBoundAlignmentRecordRDD with 2580 reference sequences, 0 read groups, and 1 processing steps

scala> val lengths = reads.rdd.map(_.sequence.length()).collect()
[Stage 8:>                                                       (0 + 48) / 106]

...
scheduler.TaskSetManager: Lost task 9.0 in stage 8.0: java.lang.NullPointerException
	at htsjdk.samtools.cram.build.CramNormalizer.getByteOrDefault(CramNormalizer.java:315)
	at htsjdk.samtools.cram.build.CramNormalizer.restoreReadBases(CramNormalizer.java:253)
	at htsjdk.samtools.cram.build.CramNormalizer.normalize(CramNormalizer.java:131)
	at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:191)
	at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:261)
	at org.seqdoop.hadoop_bam.CRAMRecordReader.nextKeyValue(CRAMRecordReader.java:60)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:207)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

heuermh commented May 26, 2018

Does Hadoop-BAM support reading the reference for CRAM from HDFS? If I use a reference mounted on shared disk instead, I see validation errors rather than NPEs:

scala> sc.hadoopConfiguration.set(CRAMInputFormat.REFERENCE_SOURCE_PATH_PROPERTY, "/mnt/fb/resources/GRCh38/hs38DH.fa")

scala> val reads = sc.loadAlignments("AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/combined_2018-05-18/combined_2018-05-18.hg38.sorted.cram")
reads: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = RDDBoundAlignmentRecordRDD with 2580 reference sequences, 0 read groups, and 1 processing steps

scala> val lengths = reads.rdd.map(_.sequence.length()).collect()
[Stage 9:>                                                       (0 + 48) / 106]

...
scheduler.TaskSetManager: htsjdk.samtools.SAMFormatException: SAM validation error:
ERROR: Record 61, Read name f5d18709-1946-4dd0-a519-2a88a4a55d57, CIGAR covers
2720 bases but the sequence is 0 read bases 
	at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
	at htsjdk.samtools.CRAMIterator.nextContainer(CRAMIterator.java:209)
	at htsjdk.samtools.CRAMIterator.hasNext(CRAMIterator.java:261)
	at org.seqdoop.hadoop_bam.CRAMRecordReader.nextKeyValue(CRAMRecordReader.java:60)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:207)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

scheduler.TaskSetManager: htsjdk.samtools.SAMFormatException: SAM validation error:
ERROR: Record 323, Read name b3cf0949-c1f0-494b-9ac9-33965291ff90, CIGAR covers
567 bases but the sequence is 0 read bases 
	at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:454)
	... (remainder identical to the stack trace above)

...

fnothaft commented

Does the CRAM codepath use NIO? If not, HDFS won't be supported.
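(Editor's aside, not from the original thread: whether java.nio can resolve an hdfs:// URI at all depends on a FileSystemProvider for that scheme being installed on the classpath; plain java.nio only ships a file:// provider. A minimal, hypothetical check — the class name and helper are illustrative, not part of Hadoop-BAM or htsjdk:)

```java
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.Paths;

public class NioSchemeCheck {
    // Returns true if java.nio has an installed FileSystemProvider
    // for the scheme of the given URI; Paths.get(URI) throws
    // FileSystemNotFoundException when no provider matches.
    static boolean schemeSupported(String uri) {
        try {
            Paths.get(URI.create(uri));
            return true;
        } catch (FileSystemNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // file:// is always backed by the default provider
        System.out.println(schemeSupported("file:///tmp"));
        // hdfs:// resolves only if an HDFS NIO provider is on the classpath
        System.out.println(schemeSupported("hdfs:///data/ref.fa"));
    }
}
```

If the check returns false for hdfs://, code that goes through java.nio would need an HDFS NIO provider on the classpath, or it must read the reference through Hadoop's own FileSystem API instead.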

fnothaft commented

Also, shouldn't this be opened against an upstream repo?


heuermh commented Jun 23, 2018

According to the Hadoop-BAM docs, HDFS should be supported for CRAM references:

CRAMInputFormat | hadoopbam.cram.reference-source-path | (Required.) The path to the reference. May be an hdfs:// path.

Filed upstream issue HadoopGenomics/Hadoop-BAM#201


heuermh commented Aug 23, 2021

Closing as WontFix.

heuermh closed this as completed Aug 23, 2021