[ADAM-952] Expose sorting by reference index. #1045

Closed
5 participants
@fnothaft
Member

fnothaft commented Jun 6, 2016

Resolves #952. Adds function `sortByReferenceIndexAndPosition` on RDDs of `AlignmentRecord`. This sorts reads by their position on a contig, where contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
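To make the difference between the two sort orders concrete, here is a minimal sketch on plain Scala collections rather than Spark RDDs. The `Read` case class, the `contigIndex` map, and the sample data are assumptions made up for illustration; they are not ADAM's `AlignmentRecord` or its sequence dictionary.

```scala
object SortOrderSketch {
  // Hypothetical stand-in for an aligned read; not ADAM's AlignmentRecord.
  final case class Read(name: String, contig: String, start: Long)

  // Assumed contig order, as it might appear in a sequence dictionary.
  val contigIndex: Map[String, Int] = Map("chr1" -> 0, "chr2" -> 1, "chr10" -> 2)

  val reads: Seq[Read] = Seq(
    Read("r1", "chr10", 5L),
    Read("r2", "chr2", 40L),
    Read("r3", "chr1", 100L),
    Read("r4", "chr2", 7L)
  )

  // Lexicographic: contig names are compared as strings, so "chr10" sorts before "chr2".
  val lexicographic: Seq[Read] = reads.sortBy(r => (r.contig, r.start))

  // Index order: contigs are compared by their index, then reads by start position.
  val byIndex: Seq[Read] = reads.sortBy(r => (contigIndex(r.contig), r.start))
}
```

With this data the lexicographic sort yields r3, r1, r4, r2, while the index-based sort yields r3, r4, r2, r1, which keeps `chr2` ahead of `chr10`.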

@fnothaft fnothaft added this to the 0.20.0 milestone Jun 6, 2016

@AmplabJenkins

AmplabJenkins Jun 6, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1258/

Build result: FAILURE

GitHub pull request #1045 of commit c059505 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1045/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 4392844e06cd8809682105f9cff03a3ad79b5acd # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1045/merge^{commit} # timeout=10
Checking out Revision 4392844e06cd8809682105f9cff03a3ad79b5acd (origin/pr/1045/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 4392844e06cd8809682105f9cff03a3ad79b5acd
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

fnothaft Jun 7, 2016

Member

Fixed build issue! Should be good to go now.

@AmplabJenkins

AmplabJenkins Jun 7, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1262/

Build result: FAILURE

GitHub pull request #1045 of commit e112a83 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1045/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 1af3d16611f2df6a33845d53b0fde6f7978b51d6 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1045/merge^{commit} # timeout=10
Checking out Revision 1af3d16611f2df6a33845d53b0fde6f7978b51d6 (origin/pr/1045/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 1af3d16611f2df6a33845d53b0fde6f7978b51d6
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

fnothaft Jun 7, 2016

Member

Sorry, was missing the updated file in the push! Sigh. Retesting now.

@AmplabJenkins

AmplabJenkins Jun 7, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1263/

Build result: FAILURE

GitHub pull request #1045 of commit a8c5dba automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1045/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains e26696b # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1045/merge^{commit} # timeout=10
Checking out Revision e26696b (origin/pr/1045/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f e26696b8d31578c80366a989deb1b9fcf3b5e67f
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

fnothaft Jun 7, 2016

Member

Fixed a unit test issue and rebased.

@AmplabJenkins

AmplabJenkins Jun 7, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1266/
Test PASSed.

@@ -207,7 +210,11 @@ class Transform(protected val args: TransformArgs) extends BDGSparkCommand[Trans
}
log.info("Sorting reads")
- adamRecords = oldRdd.sortReadsByReferencePosition()
+ if (args.sortLexicographically) {
+ adamRecords = oldRdd.sortReadsByReferencePosition()

@ryan-williams

ryan-williams Jun 7, 2016

Member

the pattern here of all these if blocks overwriting adamRecords is confusing. I know it predates this PR so not necessarily a blocker here, but any thoughts as to how to do all of this more clearly?

e.g. there are a bunch of lines manipulating adamRecords above this that are all rendered moot by these lines, seemingly.

@fnothaft

fnothaft Jun 17, 2016

Member

Opened #1053 for this. I agree that it needs a good cleaning, but I'd like to hold off until after 0.20.0 due to bandwidth/schedule limitations.
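One possible shape for that cleanup, sketched on plain collections: build each optional stage as a function and compose only the enabled ones, instead of reassigning a `var` inside successive `if` blocks. The `when` helper, the `Args` flags, and the stages below are hypothetical and are not taken from ADAM's `Transform` command or from #1053.

```scala
object PipelineSketch {
  // Apply `f` only when `enabled` is true; otherwise pass the value through unchanged.
  def when[A](enabled: Boolean)(f: A => A): A => A =
    if (enabled) f else (a: A) => a

  // Hypothetical flags standing in for command-line arguments.
  final case class Args(dedupe: Boolean, sort: Boolean)

  // Compose the enabled stages once, left to right, with no mutable variable.
  def transform(args: Args): Seq[Int] => Seq[Int] =
    Function.chain(Seq(
      when[Seq[Int]](args.dedupe)(_.distinct),
      when[Seq[Int]](args.sort)(_.sorted)
    ))
}
```

For example, `PipelineSketch.transform(PipelineSketch.Args(dedupe = true, sort = true))(Seq(3, 1, 3, 2))` runs both stages in order, and disabling a flag simply skips that stage.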

import scala.reflect.ClassTag
+private object SequenceIndexOrdering extends Ordering[(Int, Long)] {

@ryan-williams

ryan-williams Jun 7, 2016

Member

isn't this how (Int, Long)s would get compared by default? is this Ordering[(Int, Long)] needed?

@fnothaft

fnothaft Jun 17, 2016

Member

You are right! When I wrote the code, I didn't realize that I had to explicitly import the scala.math.Ordering implicit orderings. When I got the compile error, I then just decided to grunt the ordering out. Anywho, patching that now!
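For reference, a small sketch of the default tuple ordering discussed in this thread, using plain Scala collections. It only shows that the standard library derives an `Ordering[(Int, Long)]` that compares the first element and then the second; it does not reproduce the compile error from the PR, and the sample keys are invented.

```scala
object TupleOrderingSketch {
  // (contig index, start position) keys with made-up values.
  val keys: Seq[(Int, Long)] = Seq((1, 5L), (0, 9L), (1, 2L))

  // The compiler derives Ordering[(Int, Long)] from Ordering[Int] and Ordering[Long].
  val sortedKeys: Seq[(Int, Long)] = keys.sorted

  // The same instance can be summoned explicitly if an API needs it passed by hand.
  val ord: Ordering[(Int, Long)] = implicitly[Ordering[(Int, Long)]]
}
```

Here `sortedKeys` is `Seq((0, 9L), (1, 2L), (1, 5L))`: the contig index decides first, and the position breaks ties.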

+ // we sort the unmapped reads by read name. To do this, we hash the sequence name
+ // and add the max contig index
+ val maxContigIndex = sd.records.flatMap(_.referenceIndex).max
+ rdd.keyBy(r => {

@ryan-williams

ryan-williams Jun 7, 2016

Member

this .keyBy(…).sortByKey().values sequence is exactly what RDD.sortBy does, if you want to make it a little more concise/idiomatic.

@fnothaft

fnothaft Jun 17, 2016

Member

SGTM! Will pick that up as well.
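The equivalence can be illustrated with a local-collection analogue, since a Spark context is out of scope for a short snippet: pairing each element with its key, sorting on the key, and dropping the key gives the same result as a single `sortBy`. The `Read` class and the `key` function are assumptions for illustration. In particular, this sketch simply places unmapped reads after the highest contig index, which is a simplification of the name-hashing approach mentioned in the diff comment.

```scala
object SortBySketch {
  // Hypothetical read; contigIndex is None for an unmapped read.
  final case class Read(name: String, contigIndex: Option[Int], start: Long)

  val reads: Seq[Read] = Seq(
    Read("a", Some(1), 10L),
    Read("b", None, 0L),
    Read("c", Some(0), 30L),
    Read("d", Some(1), 3L)
  )

  val maxContigIndex: Int = reads.flatMap(_.contigIndex).max

  // Assumed key: unmapped reads get an index just past the last contig.
  def key(r: Read): (Int, Long) =
    (r.contigIndex.getOrElse(maxContigIndex + 1), r.start)

  // Long form: attach the key, sort on it, then drop it again.
  val longForm: Seq[Read] = reads.map(r => (key(r), r)).sortBy(_._1).map(_._2)

  // Short form: let sortBy compute the key.
  val shortForm: Seq[Read] = reads.sortBy(key)
}
```

Both forms produce the reads in the order c, d, a, b.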

+ val (r, idx) = kv
+ val start: Long = r.getStart
+ ((toIndex(r), start), (r, idx))
+ }).sorted

@ryan-williams

ryan-williams Jun 7, 2016

Member

again, I think you can kill SequenceIndexWithReadOrdering and just .sortBy(_._1) here no?
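A minimal sketch of that suggestion on a local `Seq`, assuming the pairs have the shape `((index, start), payload)` seen in the diff. The payload here is a plain `String` chosen for illustration; only the key needs an `Ordering`, so no custom ordering over the whole pair is required.

```scala
object PairSortSketch {
  // ((contig index, start), payload) pairs with made-up values.
  val keyed: Seq[((Int, Long), String)] = Seq(
    ((1, 20L), "r1"),
    ((0, 50L), "r2"),
    ((1, 3L), "r3")
  )

  // Sort on the key alone; the payload type does not need an Ordering.
  val sortedPairs: Seq[((Int, Long), String)] = keyed.sortBy(_._1)
}
```

The payloads come out as r2, r3, r1.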

[ADAM-952] Expose sorting by reference index.
Resolves #952. Adds function `sortByReferenceIndexAndPosition` on RDDs of
`AlignmentRecord`. This sorts reads by their position on a contig, where
contigs are ordered by contig index. This conforms to the SAM/BAM sort order.
@fnothaft

fnothaft Jun 17, 2016

Member

Just patched this up with @ryan-williams's review comments and rebased.

@AmplabJenkins

AmplabJenkins Jun 17, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1281/
Test PASSed.

@heuermh

heuermh Jun 27, 2016

Member

+1

@jpdna

jpdna Jul 1, 2016

Member

merged e51bd90

@jpdna jpdna closed this Jul 1, 2016

@heuermh

heuermh Jul 1, 2016

Member

@jpdna, don't forget a "Thank you, @fnothaft!", we're friendly to contributors here :)

@jpdna

jpdna Jul 1, 2016

Member

yup indeed, thanks @fnothaft

@fnothaft fnothaft referenced this pull request in BD2KGenomics/toil-scripts Jul 20, 2016

Open

Point at ADAM 0.20.0 RC for adam-pipeline #369
