New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAMFileHeader "sort order" attribute being un-set during file-save job #800

Closed
ryan-williams opened this Issue Aug 23, 2015 · 5 comments

Comments

Projects
None yet
2 participants
@ryan-williams
Member

ryan-williams commented Aug 23, 2015

I am currently seeing the SAMFileHeader header in ADAMSAMOutputFormat get mutated during the saveAsNewAPIHadoopFile job; specifically, the "sorted order" (attribute "SO") is being dropped, resulting in sorter SAMs that had SO:coordinate set on the header being written with SO:unsorted, and the header having unsorted in this attribute the file-write is done.

This branch repros it with some printlns; run:

mvn clean package -DskipTests
bin/adam-submit transform -single -sort_reads adam-cli/src/test/resources/bqsr1.sam sorted.sam

Here's a gist fo the full output of the latter command.

Section of note (L424-439 of the gist):

org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$@274bcf9 return header: Some(SAMFileHeader{VN=1.4, SO=coordinate})
got header: SAMFileHeader{VN=1.4, SO=coordinate} from org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$@274bcf9
15/08/23 22:33:56 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/08/23 22:33:56 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
15/08/23 22:33:57 INFO BlockManagerInfo: Removed broadcast_7_piece0 on localhost:60426 in memory (size: 2.6 KB, free: 265.4 MB)
15/08/23 22:33:57 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:60426 in memory (size: 1763.0 B, free: 265.4 MB)
15/08/23 22:33:57 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:60426 in memory (size: 2.1 KB, free: 265.4 MB)
15/08/23 22:33:57 INFO ContextCleaner: Cleaned shuffle 1
15/08/23 22:33:57 INFO FileOutputCommitter: Saved output of task 'attempt_201508232233_0012_r_000000_0' to file:/Users/ryan/c/adam/sorted.sam/_temporary/0/task_201508232233_0012_r_000000
15/08/23 22:33:57 INFO Executor: Finished task 0.0 in stage 8.0 (TID 5). 844 bytes result sent to driver
15/08/23 22:33:57 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 5) in 605 ms on localhost (1/1)
15/08/23 22:33:57 INFO DAGScheduler: ResultStage 8 (saveAsNewAPIHadoopFile at AlignmentRecordRDDFunctions.scala:188) finished in 0.605 s
15/08/23 22:33:57 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
15/08/23 22:33:57 INFO DAGScheduler: Job 3 finished: saveAsNewAPIHadoopFile at AlignmentRecordRDDFunctions.scala:188, took 0.668444 s
org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$@274bcf9 return header: Some(SAMFileHeader{VN=1.4, SO=unsorted})
middle: SAMFileHeader{VN=1.4, SO=unsorted}

This is some output from the saveAsNewAPIHadoopFile job.

  • At the beginning, we see:

    ADAMSAMOutputFormat$@274bcf9 return header: Some(SAMFileHeader{VN=1.4, SO=coordinate})
    

    from this println that I added. The header has coordinate sort.

  • At the end:

    ADAMSAMOutputFormat$@274bcf9 return header: Some(SAMFileHeader{VN=1.4, SO=unsorted})
    

    So the header now has unsorted sort order; note that both the above printlns show the same pointer address for ADAMSAMOutputFormat, @274bcf9; its sort order value has been mutated by… something.

  • In between, there are no lines clearing or setting the header (which have printlns attached to them in this branch), even though header is private in this branch. How is it being modified?

The result, as the gist shows, is a SAM file with SO:unsorted in its header:

$ head sorted.sam
@HD VN:1.4  SO:unsorted
…

I've been reproducing this deterministically for hours, originally starting from within a test case I was writing while working on #799.

  • Can anyone else repro this from this branch?
  • Any ideas on what is un-setting the header? I'll go poke around in htsjdk/hadoop-bam and see if I can see how it's happening.

@ryan-williams ryan-williams added the bug label Aug 23, 2015

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 24, 2015

Member

After a long slog, I think I've found the culprit: a bug in hadoop-bam's use of HTSJDK.

Here I've patched a hadoop-bam fork with the fix, off of 7.0.0.
Here I have ADAM picking up the patched version of hadoop-bam above.

The bug is that SAMTextWriter.setSortOrder is not called before SAMTextWriter.setHeader, despite this comment stating that it must (link is from htsjdk 1.118, which hadoop-bam 7.0.0 uses a lightly-patched fork of, but this API and comment still exist in htsjdk 1.138). Why htsjdk is set up that way, I have no idea.

To test the bug and my fix for yourself:

git clone git@github.com:ryan-williams/Hadoop-BAM.git
cd hadoop-bam
# writer-fix branch should already be checked out
mvn install -DskipTests

Then, in any ADAM repo:

bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam
head -n 1 sorted.sam

You'll (incorrectly) get SO:unsorted, even though the header and reads will have been sorted:

@HD VN:1.4  SO:unsorted

To test with my fix:

git remote add ryan git@github.com:ryan-williams/adam.git
git fetch ryan
git checkout ryan/hadoop-bam-fix
mvn package -DskipTests
mv sorted.sam{,.bak}
bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam

This run will output a sorted.sam with the correct SO:coordinate tag in the header line:

diff sorted.sam{,.bak}
1c1
< @HD   VN:1.4  SO:coordinate
---
> @HD   VN:1.4  SO:unsorted

So, I'll file this against hadoop-bam; we could move to a fork if it is bothering us enough, but I think this is just relegated to the SO attribute on the @HD line, so we don't really need to get it fixed, I suppose.

Member

ryan-williams commented Aug 24, 2015

After a long slog, I think I've found the culprit: a bug in hadoop-bam's use of HTSJDK.

Here I've patched a hadoop-bam fork with the fix, off of 7.0.0.
Here I have ADAM picking up the patched version of hadoop-bam above.

The bug is that SAMTextWriter.setSortOrder is not called before SAMTextWriter.setHeader, despite this comment stating that it must (link is from htsjdk 1.118, which hadoop-bam 7.0.0 uses a lightly-patched fork of, but this API and comment still exist in htsjdk 1.138). Why htsjdk is set up that way, I have no idea.

To test the bug and my fix for yourself:

git clone git@github.com:ryan-williams/Hadoop-BAM.git
cd hadoop-bam
# writer-fix branch should already be checked out
mvn install -DskipTests

Then, in any ADAM repo:

bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam
head -n 1 sorted.sam

You'll (incorrectly) get SO:unsorted, even though the header and reads will have been sorted:

@HD VN:1.4  SO:unsorted

To test with my fix:

git remote add ryan git@github.com:ryan-williams/adam.git
git fetch ryan
git checkout ryan/hadoop-bam-fix
mvn package -DskipTests
mv sorted.sam{,.bak}
bin/adam-submit transform -single -sort_reads adam-apis/src/test/resources/small.sam sorted.sam

This run will output a sorted.sam with the correct SO:coordinate tag in the header line:

diff sorted.sam{,.bak}
1c1
< @HD   VN:1.4  SO:coordinate
---
> @HD   VN:1.4  SO:unsorted

So, I'll file this against hadoop-bam; we could move to a fork if it is bothering us enough, but I think this is just relegated to the SO attribute on the @HD line, so we don't really need to get it fixed, I suppose.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 24, 2015

Member

Nice! Fun to read through the thorough debug trace here. I assume this fix will be holding in place until the next release of Hadoop-BAM, then?

Member

fnothaft commented Aug 24, 2015

Nice! Fun to read through the thorough debug trace here. I assume this fix will be holding in place until the next release of Hadoop-BAM, then?

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 24, 2015

Member

Yea, I guess so. We could probably push them to release a 7.1.1; do you have thoughts on upgrading to e.g. 7.1.0 in general? Any idea whether it might be hard or easy?

Member

ryan-williams commented Aug 24, 2015

Yea, I guess so. We could probably push them to release a 7.1.1; do you have thoughts on upgrading to e.g. 7.1.0 in general? Any idea whether it might be hard or easy?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 25, 2016

Member

I think this was resolved by #917.

Member

fnothaft commented Mar 25, 2016

I think this was resolved by #917.

@fnothaft fnothaft closed this Mar 25, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment