Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INDEL realigner cleanup #1412

Merged

Conversation

@fnothaft
Copy link
Member

fnothaft commented Mar 2, 2017

Fixes #1402, and provides many additional improvements to the IR core. Still a WIP, testing on the cluster as we speak.

@coveralls
Copy link

coveralls commented Mar 2, 2017

Coverage Status

Coverage increased (+0.7%) to 77.008% when pulling 3cf32f6 on fnothaft:issues/1402-indel-realigner-what into eb4aa6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 2, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1823/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 2, 2017

Coverage Status

Coverage increased (+1.0%) to 77.3% when pulling 31c68fa on fnothaft:issues/1402-indel-realigner-what into eb4aa6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 2, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1825/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 3, 2017

Coverage Status

Coverage increased (+0.9%) to 77.174% when pulling fe8a2f2 on fnothaft:issues/1402-indel-realigner-what into eb4aa6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 3, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1829/
Test PASSed.

Copy link
Member

heuermh left a comment

Minor doc changes

@@ -77,6 +77,10 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
var lodThreshold = 5.0
@Args4jOption(required = false, name = "-max_target_size", usage = "The maximum length of a target region to attempt realigning. Default length is 3000.")
var maxTargetSize = 3000
@Args4jOption(required = false, name = "-max_reads_per_target", usage = "The maximum number of reads attached to a target that we will attempt realigning. Default length is 20000.")

This comment has been minimized.

Copy link
@heuermh

heuermh Mar 3, 2017

Member

How about The maximum number of reads attached to a target considered for realignment. Default is 20000.

@@ -421,20 +435,30 @@ private[read] class RealignIndels(
* @param read Read to test.
* @param reference Reference sequence to sweep across.
* @param qualities Integer sequence of phred scaled base quality scores.
* @param originalQuality

This comment has been minimized.

Copy link
@heuermh

heuermh Mar 3, 2017

Member

needs doc comment

@fnothaft fnothaft added this to the 0.21.1 milestone Mar 3, 2017
@@ -208,7 +208,9 @@ object MdTag {

// dirty dancing to recalculate match sets
for (i <- 0 until cigarElement.getLength) {
if (reference(referencePos) == sequence(readPos)) {
println(reference + " @ " + referencePos + ", " + sequence + " @ " + readPos)

This comment has been minimized.

Copy link
@fnothaft

fnothaft Mar 3, 2017

Author Member

Remove

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 4, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1834/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1412/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b8c4222ca4c78a234f74dfd72089d9343e62fa58 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1412/merge^{commit} # timeout=10Checking out Revision b8c4222ca4c78a234f74dfd72089d9343e62fa58 (origin/pr/1412/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b8c4222ca4c78a234f74dfd72089d9343e62fa58First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft fnothaft force-pushed the fnothaft:issues/1402-indel-realigner-what branch from e533c90 to 745df4f Mar 4, 2017
@coveralls
Copy link

coveralls commented Mar 4, 2017

Coverage Status

Coverage increased (+0.8%) to 77.192% when pulling 745df4f on fnothaft:issues/1402-indel-realigner-what into 07c1982 on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 4, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1835/
Test PASSed.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 6, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1837/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1412/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 6d408c4 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1412/merge^{commit} # timeout=10Checking out Revision 6d408c4 (origin/pr/1412/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 6d408c471459434c79630e5724f1a67c964b88cbFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
heuermh approved these changes Mar 6, 2017
@coveralls
Copy link

coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+1.0%) to 77.466% when pulling 8a63091 on fnothaft:issues/1402-indel-realigner-what into cf39e6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 15, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1877/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1412/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 0761d92 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1412/merge^{commit} # timeout=10Checking out Revision 0761d92 (origin/pr/1412/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 0761d9214029fa634f56bc4a72ea7a61e7d29d71First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Copy link
Member Author

fnothaft commented Mar 15, 2017

Jenkins, retest this please.

@coveralls
Copy link

coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+1.0%) to 77.466% when pulling 8a63091 on fnothaft:issues/1402-indel-realigner-what into cf39e6c on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 15, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1879/
Test PASSed.

@fnothaft fnothaft force-pushed the fnothaft:issues/1402-indel-realigner-what branch from 8a63091 to 80ad790 Mar 23, 2017
@coveralls
Copy link

coveralls commented Mar 23, 2017

Coverage Status

Coverage increased (+0.8%) to 81.35% when pulling 80ad790 on fnothaft:issues/1402-indel-realigner-what into 629b778 on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 23, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1906/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 26, 2017

Coverage Status

Coverage increased (+0.6%) to 81.226% when pulling 54f328f on fnothaft:issues/1402-indel-realigner-what into 629b778 on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1912/
Test PASSed.

@fnothaft fnothaft force-pushed the fnothaft:issues/1402-indel-realigner-what branch from 54f328f to e6376e8 Mar 28, 2017
@coveralls
Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.6%) to 81.439% when pulling e6376e8 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 28, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1915/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 28, 2017

Coverage Status

Coverage increased (+0.7%) to 81.541% when pulling 99f4673 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 28, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1916/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 29, 2017

Coverage Status

Coverage increased (+0.7%) to 81.523% when pulling f82fe65 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 29, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1917/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Mar 29, 2017

Don't unclip. Yes, or do I mean no?

@fnothaft fnothaft force-pushed the fnothaft:issues/1402-indel-realigner-what branch from f82fe65 to 2c868e1 Mar 30, 2017
@fnothaft
Copy link
Member Author

fnothaft commented Mar 30, 2017

@heuermh think this is good to go now; rerunning a final verification run. Can you give the second commit a looksee? That should be everything new since your last pass.

@coveralls
Copy link

coveralls commented Mar 30, 2017

Coverage Status

Coverage increased (+0.8%) to 81.652% when pulling 2c868e1 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 30, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1918/
Test PASSed.

val startClipped = richCigar.softClippedBasesAtStart
val endClipped = richCigar.softClippedBasesAtEnd

if (unclipReads ||

This comment has been minimized.

Copy link
@heuermh

heuermh Mar 30, 2017

Member

Getting hard to follow the double negative here. Maybe I need more sleep. :)

builder.setOldPosition(r.getStart())
builder.setOldCigar(r.getCigar())
val rec = builder.build()
if (r.getReadName == "H06JUADXX130110:1:1108:12070:36897") {

This comment has been minimized.

Copy link
@heuermh
} else {
r
}
} catch {
case t: Throwable => {
log.warn("Realigning read %s failed with %s.".format(r, t))
log.warn("Realigning read %s failed with %s. At:".format(r, t))
val stack = t.getStackTrace()

This comment has been minimized.

Copy link
@heuermh

heuermh Mar 30, 2017

Member

log.warn(String, Throwable)?

* @param remappingIdx The location in the consensus sequence where this
* read has been realigned.
* @param consensus The consensus variant to realign against.
* @peturn Returns a tuple with the (read start, read end, cigar).

This comment has been minimized.

Copy link
@heuermh

heuermh Mar 30, 2017

Member

@peturn@return

}

// if read is fully before or fully after the consensus, then we
// have a full match

This comment has been minimized.

Copy link
@heuermh

heuermh Mar 30, 2017

Member

thanks for the comments here

@coveralls
Copy link

coveralls commented Mar 30, 2017

Coverage Status

Coverage increased (+0.9%) to 81.716% when pulling 106da85 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 30, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1919/
Test PASSed.

@coveralls
Copy link

coveralls commented Mar 31, 2017

Coverage Status

Coverage increased (+0.7%) to 81.586% when pulling 3efad26 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 31, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1921/
Test PASSed.

Resolves #1402. Includes fixes to consensus generator and reference scorer.

Improve INDEL realigner performance:

* Exit early when realigning will not yield a better score.
* Eliminate substring call in sweep over reference.
* Change datastructures to be immutable wherever possible.
* Add bound checking and other error checking.
* Rewrite target association code to use array instead of set, and improve load balancing.
* Delete high coverage targets with reduceByKey.

Additionally:
* Improve telemetry/logging to sort out load balance issue.
* Support using reference file in INDEL realignment.
* Log reads with negative alignment sizes.
* Improved test coverage for insertion realignment.
* Fix CIGARs on reads that partially overlap INDEL.
* Soft clip reads that partially align to an insertion.
* Eliminate non-determinism.
* Fixed reference file.
* Serialization fixes and debug.
* Fix bad score.
* Clean up clipping code?
* Unclip clipped reads.
@fnothaft fnothaft force-pushed the fnothaft:issues/1402-indel-realigner-what branch from 3efad26 to b6de9a5 Mar 31, 2017
@fnothaft
Copy link
Member Author

fnothaft commented Mar 31, 2017

Squashed down; I say this is good to go from my end.

@coveralls
Copy link

coveralls commented Mar 31, 2017

Coverage Status

Coverage increased (+0.7%) to 81.586% when pulling b6de9a5 on fnothaft:issues/1402-indel-realigner-what into 65d8c9e on bigdatagenomics:master.

@AmplabJenkins
Copy link

AmplabJenkins commented Mar 31, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1922/
Test PASSed.

@heuermh heuermh merged commit d12fdfb into bigdatagenomics:master Mar 31, 2017
2 of 3 checks passed
2 of 3 checks passed
codacy/pr Not so good... This pull request quality could be better.
Details
coverage/coveralls Coverage increased (+0.7%) to 81.586%
Details
default Merged build finished.
Details
@heuermh
Copy link
Member

heuermh commented Mar 31, 2017

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.