SAMFormatException: Unrecognized tag type: ^@ #1657

Closed
heuermh opened this Issue Aug 3, 2017 · 9 comments

@heuermh
Member

heuermh commented Aug 3, 2017

INFO rdd.ADAMContext: Loading hdfs://spark-master:8020/data/sample.bam as BAM/CRAM/SAM and converting to AlignmentRecords.
INFO rdd.ADAMContext: Loaded header from hdfs://spark-master:8020/data/sample.bam
...
INFO read.RDDBoundAlignmentRecordRDD: Saving data in ADAM format
...
WARN scheduler.TaskSetManager: Lost task 135.0 in stage 0.0 (TID 147, ip-10-0-0-9.ec2.internal):
htsjdk.samtools.SAMFormatException: Unrecognized tag type: ^@
        at htsjdk.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:351)
        at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
        at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:313)
        at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:293)
        at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2004)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:795)
        at htsjdk.samtools.BAMFileReader$BAMFileIndexIterator.<init>(BAMFileReader.java:947)
        at htsjdk.samtools.BAMFileReader.getIterator(BAMFileReader.java:482)
        at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:172)
        at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
        at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:156)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
...
INFO scheduler.TaskSetManager: Lost task 135.1 in stage 0.0 (TID 176) on executor
ip-10-0-0-9.ec2.internal: htsjdk.samtools.SAMFormatException (Unrecognized tag type: ^@) [duplicate 1]
...
INFO scheduler.TaskSetManager: Lost task 135.2 in stage 0.0 (TID 200) on executor
ip-10-0-0-9.ec2.internal: htsjdk.samtools.SAMFormatException (Unrecognized tag type: ^@) [duplicate 2]
@fnothaft

Member

fnothaft commented Aug 3, 2017

Do you have a line (record) with that tag type?

@heuermh

Member

heuermh commented Aug 3, 2017

Will investigate. Odd that only one file of 455 from the same source pipeline would have that as a tag type.

@heuermh

Member

heuermh commented Aug 8, 2017

Can't find any occurrences of ^@, or any part of it under various escapings, in less:

$ samtools view -h sample.bam | less

Samtools itself doesn't seem to complain

$ samtools flagstat sample.bam
986020586 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
103643908 + 0 duplicates
928658758 + 0 mapped (94.18% : N/A)
986020586 + 0 paired in sequencing
493010293 + 0 read1
493010293 + 0 read2
896125586 + 0 properly paired (90.88% : N/A)
910853506 + 0 with itself and mate mapped
17805252 + 0 singletons (1.81% : N/A)
3849910 + 0 with mate mapped to a different chr
1764304 + 0 with mate mapped to a different chr (mapQ>=5)

and an excerpt converted to SAM format transformed OK:

$ samtools view -h \
  sample.bam \
  chr1:99000-100000 > sample-chr1-99000-100000.sam

$ hadoop fs -put \
  sample-chr1-99000-100000.sam \
  /data/sample-chr1-99000-100000.sam

$ adam-submit \
  transformAlignments \
  /data/sample-chr1-99000-100000.sam \
  /data/sample-chr1-99000-100000.alignments.adam
...
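
A note on why the search in less may come up empty: ^@ is caret notation for the NUL byte (0x00), and the exception message appears to print the raw value-type byte of a BAM auxiliary field as a character. If that byte was read at a misaligned offset, it would not correspond to a tag stored in any record, so it would not show up in the SAM text. The sketch below is a minimal Python decoder for BAM auxiliary data following the layout in the SAM/BAM specification (2-byte tag, 1-byte type, value). It is not htsjdk's implementation; `caret` and `parse_aux` are hypothetical names used only for illustration.

```python
import struct

# Fixed-width BAM aux value types mapped to little-endian struct formats.
_FIXED = {"A": "<c", "c": "<b", "C": "<B", "s": "<h", "S": "<H",
          "i": "<i", "I": "<I", "f": "<f"}


def caret(byte):
    """Render a byte in caret notation for control characters (0x00 -> '^@')."""
    if byte < 0x20:
        return "^" + chr(byte + 0x40)
    if byte == 0x7F:
        return "^?"
    return chr(byte)


def parse_aux(data):
    """Decode BAM auxiliary bytes into a list of (tag, type, value) tuples."""
    out, pos = [], 0
    while pos < len(data):
        if pos + 3 > len(data):
            raise ValueError("truncated aux field at offset %d" % pos)
        tag = data[pos:pos + 2].decode("ascii")
        type_byte = data[pos + 2]
        typ = chr(type_byte)
        pos += 3
        if typ in _FIXED:
            fmt = _FIXED[typ]
            (value,) = struct.unpack_from(fmt, data, pos)
            pos += struct.calcsize(fmt)
            if typ == "A":
                value = value.decode("ascii")
        elif typ in ("Z", "H"):
            # NUL-terminated string (Z) or hex string (H).
            end = data.index(b"\x00", pos)
            value = data[pos:end].decode("ascii")
            pos = end + 1
        elif typ == "B":
            # Typed array: 1-byte element type, uint32 count, then the elements.
            sub = chr(data[pos])
            if sub not in "cCsSiIf":
                raise ValueError("Unrecognized array type: " + caret(data[pos]))
            (count,) = struct.unpack_from("<I", data, pos + 1)
            fmt = "<%d%s" % (count, _FIXED[sub][1])
            value = list(struct.unpack_from(fmt, data, pos + 5))
            pos += 5 + struct.calcsize(fmt)
        else:
            # A type byte of 0x00 ends up here and is reported as '^@'.
            raise ValueError("Unrecognized tag type: " + caret(type_byte))
        out.append((tag, typ, value))
    return out
```

For example, `parse_aux(b"NMC\x02RGZgrp1\x00")` decodes two well-formed fields, while `parse_aux(b"NM\x00\x02")` raises with a message ending in `^@`, because the byte in the type position is NUL.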
@fnothaft

Member

fnothaft commented Aug 8, 2017

This might be a bad Hadoop-BAM split? @ryan-williams has been tracking these down...

@heuermh

Member

heuermh commented Aug 8, 2017

The error was reported more than once, on the same executor though, so I suppose it could be a bad split. I've asked the data producer to help us confirm the BAM hasn't been corrupted since it was created, and I'll try to do something with htsjdk directly next.

@fnothaft

Member

fnothaft commented Aug 8, 2017

If the file reads OK on a single node, it looks a lot like a bad split to me...
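
For background on the bad-split hypothesis: a BAM file is a sequence of BGZF blocks, and a reader that starts at an arbitrary byte offset has to guess where a block (and then a record) begins. A wrong guess means decoding starts mid-record, and garbage bytes land in fields such as the tag type. The sketch below only shows the block-level part of such a check, using the BGZF header layout from the SAM/BAM specification. It is a simplified illustration and not Hadoop-BAM's actual split-guessing code; `bgzf_block_size` and `candidate_block_starts` are hypothetical names.

```python
import struct

# gzip magic (1f 8b), deflate (08), FEXTRA flag set (04).
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def bgzf_block_size(buf, offset=0):
    """Return the total block size if a BGZF header starts at offset, else None."""
    if len(buf) < offset + 12 or buf[offset:offset + 4] != BGZF_MAGIC:
        return None
    # XLEN (length of the gzip extra field) is at byte 10 of the header.
    (xlen,) = struct.unpack_from("<H", buf, offset + 10)
    pos, end = offset + 12, offset + 12 + xlen
    if end > len(buf):
        return None
    # Walk the extra subfields looking for the 'BC' subfield that carries BSIZE.
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", buf, pos)
        if si1 == 66 and si2 == 67 and slen == 2:
            if pos + 6 > end:
                return None
            (bsize,) = struct.unpack_from("<H", buf, pos + 4)
            return bsize + 1  # BSIZE stores total block size minus one
        pos += 4 + slen
    return None


def candidate_block_starts(buf):
    """List every offset in buf where a BGZF block header appears to start."""
    return [i for i in range(len(buf)) if bgzf_block_size(buf, i) is not None]


# The standard 28-byte BGZF end-of-file marker block, used here as sample input.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000")
```

A header match alone does not prove the offset is a real block boundary, since the same byte pattern can occur inside compressed data. A split guesser therefore needs further checks on the decompressed records, and those heuristics are where a false positive could plausibly produce an error like the one reported here.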

@ryan-williams

Member

ryan-williams commented Aug 9, 2017

I have a bunch of utilities for investigating the bad split possibility. lmk if I can be of assistance / would love to get a look at the BAM

@fnothaft

Member

fnothaft commented Jan 9, 2018

@heuermh ping to retest with latest Hadoop-BAM in ToT.

@heuermh

Member

heuermh commented Jan 26, 2018

Works for me with ADAM version 0.23.0.

@heuermh heuermh closed this Jan 26, 2018

@heuermh heuermh added this to the 0.24.0 milestone Jan 26, 2018

@heuermh heuermh added this to Completed in Release 0.24.0 Feb 10, 2018
