New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-783] Write @SQ header lines in sorted order. #784

Merged
merged 2 commits into from Aug 21, 2015

Conversation

Projects
None yet
4 participants
@fnothaft
Member

fnothaft commented Aug 20, 2015

This change resolves #783 and #760. Specifically, now we write the SAM/BAM @sq header
lines using the same lexicographic ordering that we use for sorting records, and we write the @hd line to note that we are sorted in coordinate order.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 20, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/848/

Build result: FAILURE

GitHub pull request #784 of commit 2b3ed5e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/784/merge^{commit} # timeout=10 > git branch -a --contains 53d70081fdfe9797c24be895796e68d8f567ec80 # timeout=10 > git rev-parse remotes/origin/pr/784/merge^{commit} # timeout=10Checking out Revision 53d70081fdfe9797c24be895796e68d8f567ec80 (origin/pr/784/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 53d70081fdfe9797c24be895796e68d8f567ec80First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Aug 20, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/848/

Build result: FAILURE

GitHub pull request #784 of commit 2b3ed5e automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/784/merge^{commit} # timeout=10 > git branch -a --contains 53d70081fdfe9797c24be895796e68d8f567ec80 # timeout=10 > git rev-parse remotes/origin/pr/784/merge^{commit} # timeout=10Checking out Revision 53d70081fdfe9797c24be895796e68d8f567ec80 (origin/pr/784/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 53d70081fdfe9797c24be895796e68d8f567ec80First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 20, 2015

Member

lgtm from a quick pass

Member

ryan-williams commented Aug 20, 2015

lgtm from a quick pass

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 20, 2015

Member

Whoops! Forgot to add the test collateral...

Member

fnothaft commented Aug 20, 2015

Whoops! Forgot to add the test collateral...

@fnothaft fnothaft added this to the 0.17.1 milestone Aug 20, 2015

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/849/
Test PASSed.

AmplabJenkins commented Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/849/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 20, 2015

Member

Rebased.

Member

fnothaft commented Aug 20, 2015

Rebased.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/852/
Test PASSed.

AmplabJenkins commented Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/852/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 20, 2015

Member

Rerebased.

Member

fnothaft commented Aug 20, 2015

Rerebased.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/854/
Test PASSed.

AmplabJenkins commented Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/854/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 20, 2015

Member

This now resolves #760 as well. Can I get review/merge?

Member

fnothaft commented Aug 20, 2015

This now resolves #760 as well. Can I get review/merge?

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/855/
Test PASSed.

AmplabJenkins commented Aug 20, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/855/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 21, 2015

Member

Ping on review/merge... This is the last issue pending for the 0.17.1 release.

Member

fnothaft commented Aug 21, 2015

Ping on review/merge... This is the last issue pending for the 0.17.1 release.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 21, 2015

Member

looking

Member

ryan-williams commented Aug 21, 2015

looking

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft
Member

fnothaft commented Aug 21, 2015

Thanks @ryan-williams!

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 21, 2015

Member

This lgtm; one question: this optionally sorts the header lines to match the sort of the reads that can be optionally done with adamSortReadsByReferencePosition, right?

That seems reasonable, but feels a little backwards, since the spec says that the @SQ order should define the read order.

If ADAM will only ever output a lex-sort of the @SQs, and the corresponding "coordinate" sort of the reads, that's fine. I have a few more thoughts about making transform round-trips identical byte-for-byte, but I'll put that in a separate issue.

Member

ryan-williams commented Aug 21, 2015

This lgtm; one question: this optionally sorts the header lines to match the sort of the reads that can be optionally done with adamSortReadsByReferencePosition, right?

That seems reasonable, but feels a little backwards, since the spec says that the @SQ order should define the read order.

If ADAM will only ever output a lex-sort of the @SQs, and the corresponding "coordinate" sort of the reads, that's fine. I have a few more thoughts about making transform round-trips identical byte-for-byte, but I'll put that in a separate issue.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 21, 2015

Member

Finally, I don't know what the preferred way of merging is these days :)

Member

ryan-williams commented Aug 21, 2015

Finally, I don't know what the preferred way of merging is these days :)

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 21, 2015

Member

Thanks for the review!

@ryan-williams for now, we're just emitting coordinate sorted order. The other orders are defined here and are duplicate and queryname. Neither of those seem to be defined anywhere (like, you know, the SAM spec), but a queryname imp'l is at https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecordQueryNameComparator.java and a duplicate imp'l is at https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecordDuplicateComparator.java. I believe queryname is just sorting lexicographically by read name.

I agree it is a bit backwards, but I think an equivalent way to read it is that you need to have the same lexicographic order for both the reads and the header.

We can continue using the merge button for now! After I've got scripts ready, we can cut over.

Member

fnothaft commented Aug 21, 2015

Thanks for the review!

@ryan-williams for now, we're just emitting coordinate sorted order. The other orders are defined here and are duplicate and queryname. Neither of those seem to be defined anywhere (like, you know, the SAM spec), but a queryname imp'l is at https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecordQueryNameComparator.java and a duplicate imp'l is at https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecordDuplicateComparator.java. I believe queryname is just sorting lexicographically by read name.

I agree it is a bit backwards, but I think an equivalent way to read it is that you need to have the same lexicographic order for both the reads and the header.

We can continue using the merge button for now! After I've got scripts ready, we can cut over.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 21, 2015

Member

The other orders are defined here and are duplicate and queryname. Neither of those seem to be defined anywhere (like, you know, the SAM spec)

FWIW, queryname is in the SAM spec, and is what you described:

duplicate is not, and there's this note about it not being in the spec in htsjdk.

My point, here and in #794, is that the SAM spec doesn't actually say that @SQ should be sorted anywhere, just that "The order of @sq lines defines the alignment sorting order". coordinate sort of the reads is then defined not lexicographically based on the @SQ name, but by whatever order the @SQs are already in, with POS as a secondary sort.

Arguably we should be able to leave @SQs in whatever order they come to us in (and sort reads to conform to that) independently of whether we also want to lex-sort the @SQs (and therefore also lex-sort the reads by RNAME).

Anyway, unless this sounds like such a good idea to you that you want to do it here, I'm find to merge this and then address the possibility of @SQs in non-lex order (and sorting reads to match them) in #794.

Member

ryan-williams commented Aug 21, 2015

The other orders are defined here and are duplicate and queryname. Neither of those seem to be defined anywhere (like, you know, the SAM spec)

FWIW, queryname is in the SAM spec, and is what you described:

duplicate is not, and there's this note about it not being in the spec in htsjdk.

My point, here and in #794, is that the SAM spec doesn't actually say that @SQ should be sorted anywhere, just that "The order of @sq lines defines the alignment sorting order". coordinate sort of the reads is then defined not lexicographically based on the @SQ name, but by whatever order the @SQs are already in, with POS as a secondary sort.

Arguably we should be able to leave @SQs in whatever order they come to us in (and sort reads to conform to that) independently of whether we also want to lex-sort the @SQs (and therefore also lex-sort the reads by RNAME).

Anyway, unless this sounds like such a good idea to you that you want to do it here, I'm find to merge this and then address the possibility of @SQs in non-lex order (and sorting reads to match them) in #794.

fnothaft added some commits Aug 20, 2015

[ADAM-783] Write @sq header lines in sorted order.
This change resolves #783. Specifically, now we write the SAM/BAM @sq header
lines using the same lexicographic ordering that we use for sorting records.
[ADAM-760] Write correct header ordering when saving sorted BAM.
Sets the header line "@hd" sort order to "coordinate" when saving a sorted file
in BAM.
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 21, 2015

Member

Rebased.

Member

fnothaft commented Aug 21, 2015

Rebased.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 21, 2015

Member

k, i'll merge when test passes

Member

ryan-williams commented Aug 21, 2015

k, i'll merge when test passes

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 21, 2015

Member

@ryan-williams doesn't that snippet just specify what the coordinate sort order is?

I get what you're saying. I am personally OK with rewriting the @SQ lines to lexicographic order. We could have a different sorting approach that would take in the full sequence dictionary, but it would be a bit more difficult to implement and would probably be a bit slower. Let's look into this more as part of #794.

Member

fnothaft commented Aug 21, 2015

@ryan-williams doesn't that snippet just specify what the coordinate sort order is?

I get what you're saying. I am personally OK with rewriting the @SQ lines to lexicographic order. We could have a different sorting approach that would take in the full sequence dictionary, but it would be a bit more difficult to implement and would probably be a bit slower. Let's look into this more as part of #794.

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Aug 21, 2015

Member

doesn't that snippet just specify what the coordinate sort order is?

ah, you mean that they don't explicitly say what queryname sort is? that makes sense. I guess we're all just to assume that they mean lex-sort by QNAME.

Yea, we can discuss further on #794; always sorting @SQs, and then also sorting reads, seems reasonable, but the reads' side of it should maybe be implemented in terms of the @SQ order. We sort of backed into this "always lex-sort @SQs" approach after having been non-spec-compliant.

Member

ryan-williams commented Aug 21, 2015

doesn't that snippet just specify what the coordinate sort order is?

ah, you mean that they don't explicitly say what queryname sort is? that makes sense. I guess we're all just to assume that they mean lex-sort by QNAME.

Yea, we can discuss further on #794; always sorting @SQs, and then also sorting reads, seems reasonable, but the reads' side of it should maybe be implemented in terms of the @SQ order. We sort of backed into this "always lex-sort @SQs" approach after having been non-spec-compliant.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 21, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/859/
Test PASSed.

AmplabJenkins commented Aug 21, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/859/
Test PASSed.

ryan-williams added a commit that referenced this pull request Aug 21, 2015

Merge pull request #784 from fnothaft/sort-760
[ADAM-783] Write @sq header lines in sorted order.

@ryan-williams ryan-williams merged commit de3fa55 into bigdatagenomics:master Aug 21, 2015

1 check passed

default Merged build finished.
Details
@laserson

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 16, 2015

Contributor

This change is causing a problem for me in the roundtrip BAM -> ADAM -> BAM. In one of the hellbender tests, I read a BAM file into SAMRecord objects, convert to AlignmentRecord, write out as ADAM/parquet, then read the ADAM and convert to SAMRecord objects. Because of the .sorted, I believe the referenceIndex values are different between the pre/post parquet.

This change is causing a problem for me in the roundtrip BAM -> ADAM -> BAM. In one of the hellbender tests, I read a BAM file into SAMRecord objects, convert to AlignmentRecord, write out as ADAM/parquet, then read the ADAM and convert to SAMRecord objects. Because of the .sorted, I believe the referenceIndex values are different between the pre/post parquet.

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 16, 2015

Contributor

Note that I'm using the same SAMHeader object in all cases.

cc @beaunorgeot

Contributor

laserson replied Sep 16, 2015

Note that I'm using the same SAMHeader object in all cases.

cc @beaunorgeot

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Sep 16, 2015

Member

is the BAM starting out with @SQs not sorted?

Member

ryan-williams replied Sep 16, 2015

is the BAM starting out with @SQs not sorted?

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 16, 2015

Contributor
@HD VN:1.5  GO:none SO:coordinate
@SQ SN:chrM LN:16571    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb SP:Homo sapiens
@SQ SN:chr1 LN:247249719    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6 SP:Homo sapiens
@SQ SN:chr2 LN:242951149    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d SP:Homo sapiens
@SQ SN:chr3 LN:199501827    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a SP:Homo sapiens
@SQ SN:chr4 LN:191273063    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2 SP:Homo sapiens
@SQ SN:chr5 LN:180857866    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858 SP:Homo sapiens
@SQ SN:chr6 LN:170899992    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583 SP:Homo sapiens
@SQ SN:chr7 LN:158821424    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb SP:Homo sapiens
@SQ SN:chr8 LN:146274826    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb SP:Homo sapiens
@SQ SN:chr9 LN:140273252    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b SP:Homo sapiens
@SQ SN:chr10    LN:135374737    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533 SP:Homo sapiens
@SQ SN:chr11    LN:134452384    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3 SP:Homo sapiens
@SQ SN:chr12    LN:132349534    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716 SP:Homo sapiens
@SQ SN:chr13    LN:114142980    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587 SP:Homo sapiens
...
Contributor

laserson replied Sep 16, 2015

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:chrM LN:16571    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb SP:Homo sapiens
@SQ SN:chr1 LN:247249719    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6 SP:Homo sapiens
@SQ SN:chr2 LN:242951149    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d SP:Homo sapiens
@SQ SN:chr3 LN:199501827    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a SP:Homo sapiens
@SQ SN:chr4 LN:191273063    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2 SP:Homo sapiens
@SQ SN:chr5 LN:180857866    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858 SP:Homo sapiens
@SQ SN:chr6 LN:170899992    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583 SP:Homo sapiens
@SQ SN:chr7 LN:158821424    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb SP:Homo sapiens
@SQ SN:chr8 LN:146274826    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb SP:Homo sapiens
@SQ SN:chr9 LN:140273252    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b SP:Homo sapiens
@SQ SN:chr10    LN:135374737    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533 SP:Homo sapiens
@SQ SN:chr11    LN:134452384    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3 SP:Homo sapiens
@SQ SN:chr12    LN:132349534    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716 SP:Homo sapiens
@SQ SN:chr13    LN:114142980    AS:HG18 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587 SP:Homo sapiens
...

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 16, 2015

Contributor

This sort moves chr1 to the first position. So for reads mapped to chr1, the initial referenceIndex=1, while the reconstructed one has referenceIndex=0.

Contributor

laserson replied Sep 16, 2015

This sort moves chr1 to the first position. So for reads mapped to chr1, the initial referenceIndex=1, while the reconstructed one has referenceIndex=0.

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 17, 2015

Contributor

The problem is that when using a SAMFileHeaderWritable, the actual internal SAMFileHeader is marked as transient and is reconstructed when needed. However, during the reconstruction, the SequenceDictionary is mutated here:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/models/SAMFileHeaderWritable.scala#L52
because of the sort. That means that when a SAMFileHeaderWritable goes through serialization/deserialization, the resulting SAMFileHeader is different.

Specifically, the resulting SAMRecords refer to their reference sequences by an index into the sequence dictionary:
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecord.java#L343

This breaks when I test equality of SAMRecords because the indices are now different.

We could simply decide that it's my problem to compare the SAMRecords correctly. But it does seem strange that that the SAMFileHeaderWritable gives you a different object pre/post serialization.

cc @fnothaft too in case he missed this.

Contributor

laserson replied Sep 17, 2015

The problem is that when using a SAMFileHeaderWritable, the actual internal SAMFileHeader is marked as transient and is reconstructed when needed. However, during the reconstruction, the SequenceDictionary is mutated here:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/models/SAMFileHeaderWritable.scala#L52
because of the sort. That means that when a SAMFileHeaderWritable goes through serialization/deserialization, the resulting SAMFileHeader is different.

Specifically, the resulting SAMRecords refer to their reference sequences by an index into the sequence dictionary:
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecord.java#L343

This breaks when I test equality of SAMRecords because the indices are now different.

We could simply decide that it's my problem to compare the SAMRecords correctly. But it does seem strange that that the SAMFileHeaderWritable gives you a different object pre/post serialization.

cc @fnothaft too in case he missed this.

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 17, 2015

Member

@laserson I think a simple fix would be to fix the SAMFileHeader that we're using during conversion from AlignmentRecord => SamRecord, no? If I'm not understanding this correctly, let me know; I won't have time to look until next week though.

Member

fnothaft replied Sep 17, 2015

@laserson I think a simple fix would be to fix the SAMFileHeader that we're using during conversion from AlignmentRecord => SamRecord, no? If I'm not understanding this correctly, let me know; I won't have time to look until next week though.

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 17, 2015

Contributor

Actually, this is not just a ser/de issue. header == SAMFileHeaderWritable(header).header could return false because it reorders the SequenceRecords.

Contributor

laserson replied Sep 17, 2015

Actually, this is not just a ser/de issue. header == SAMFileHeaderWritable(header).header could return false because it reorders the SequenceRecords.

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 17, 2015

Member

That's true, but kind of unrelated to what I think the root cause of the bug is.

So, what's going on is that the SequenceDictionary that is used when converting reads and creating the file header is generated by running an aggregate across all AlignmentRecords. If you don't sort the SequenceRecords, you get a random order in the SequenceDictionary, which was what this changeset was fixing. Now, we get "proper" @SQ line ordering on conversion to BAM.* I think the bug is caused by us using an unsorted SequenceDictionary when converting AlignmentRecords to SamRecords, and then using a sorted SequenceDictionary when actually writing out the SamFileHeader via Hadoop-BAM. I haven't read through the code recently, so this is just from memory, but I would expect this to be the root cause of the error you are seeing.

* Note that this is arguable. As @ryan-williams pointed out earlier, see #794 for a longer discussion. When I say "proper" here, I simply mean "deterministic".

Member

fnothaft replied Sep 17, 2015

That's true, but kind of unrelated to what I think the root cause of the bug is.

So, what's going on is that the SequenceDictionary that is used when converting reads and creating the file header is generated by running an aggregate across all AlignmentRecords. If you don't sort the SequenceRecords, you get a random order in the SequenceDictionary, which was what this changeset was fixing. Now, we get "proper" @SQ line ordering on conversion to BAM.* I think the bug is caused by us using an unsorted SequenceDictionary when converting AlignmentRecords to SamRecords, and then using a sorted SequenceDictionary when actually writing out the SamFileHeader via Hadoop-BAM. I haven't read through the code recently, so this is just from memory, but I would expect this to be the root cause of the error you are seeing.

* Note that this is arguable. As @ryan-williams pointed out earlier, see #794 for a longer discussion. When I say "proper" here, I simply mean "deterministic".

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 17, 2015

Contributor

Yes, I agree with you 100%. The SAMFileHeaderWritable issue is simply how I stumbled on this. I submitted a patch that enables you to specify a SAMFileHeader rather than compute it by aggregating. Not sure it works yet bc of a Hadoop-BAM issue, but would be good to check it out. Especially bc I haven't touched the BAM part of the ADAM code very much. Let's move any more discussion there.

Contributor

laserson replied Sep 17, 2015

Yes, I agree with you 100%. The SAMFileHeaderWritable issue is simply how I stumbled on this. I submitted a patch that enables you to specify a SAMFileHeader rather than compute it by aggregating. Not sure it works yet bc of a Hadoop-BAM issue, but would be good to check it out. Especially bc I haven't touched the BAM part of the ADAM code very much. Let's move any more discussion there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment