New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BQSR refactor for perf improvements #1423

Merged
merged 1 commit into from Mar 20, 2017

Conversation

Projects
None yet
5 participants
@fnothaft
Member

fnothaft commented Mar 6, 2017

Large refactor of BQSR targeting #1358. Still a WIP, but pushing for review. TODOs:

  • Further documentation pass in BQSR core
  • Split classes in BQSR core out to one per file where logical to improve navigability of the codebase
  • Complete verification
  • See if I can yank out the QualityScore and DecadentRead classes.
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 6, 2017

Member

Also, the commits show up out of order here relative to what I get locally, so beware...

Member

fnothaft commented Mar 6, 2017

Also, the commits show up out of order here relative to what I get locally, so beware...

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 6, 2017

Coverage Status

Coverage increased (+3.7%) to 80.077% when pulling 8e2a530 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 6, 2017

Coverage Status

Coverage increased (+3.7%) to 80.077% when pulling 8e2a530 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 6, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1838/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1838/
Test PASSed.

@heuermh

Looks good, nice addition of low level unit tests.

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/models/SnpTable.scala
}
/**
* Creates a SNP Table from an RDD of RichVariants.
* Creates a SNP Table from a VariantsRDD.

This comment has been minimized.

@heuermh

heuermh Mar 6, 2017

Member

VariantsRDDVariantRDD

@heuermh

heuermh Mar 6, 2017

Member

VariantsRDDVariantRDD

Show outdated Hide outdated ...in/scala/org/bdgenomics/adam/rdd/read/recalibration/DinucCovariate.scala
sequence.map {
case 'A' => 'T'
case 'T' => 'A'
case 'C' => 'G'
case 'G' => 'C'
case 'W' | 'S' | 'Y' | 'R' | 'M' | 'K' | 'B' | 'D' | 'V' | 'H' | 'N' => 'N'
case 'N' => 'N'

This comment has been minimized.

@heuermh

heuermh Mar 6, 2017

Member

+1

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 7, 2017

Coverage Status

Coverage increased (+4.1%) to 80.525% when pulling 24eedc9 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+4.1%) to 80.525% when pulling 24eedc9 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1841/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1841/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 7, 2017

Coverage Status

Coverage increased (+4.02%) to 80.42% when pulling 8a4ecd4 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+4.02%) to 80.42% when pulling 8a4ecd4 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1842/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1842/
Test PASSed.

@devin-petersohn

Just a couple of questions on the first review pass and a note about docs update.

}
private val unknownContigs = new mutable.HashSet[String]
@tailrec private def binarySearch(rr: ReferenceRegion,

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 7, 2017

Member

The next few methods look eerily similar to bdg-utils/IntervalArray methods. Can we not overload the IntervalArray method to use a length and offset? Should SnpTable implement IntervalArray?

@devin-petersohn

devin-petersohn Mar 7, 2017

Member

The next few methods look eerily similar to bdg-utils/IntervalArray methods. Can we not overload the IntervalArray method to use a length and offset? Should SnpTable implement IntervalArray?

This comment has been minimized.

@fnothaft

fnothaft Mar 15, 2017

Member

Opened #1429 for this.

@fnothaft

fnothaft Mar 15, 2017

Member

Opened #1429 for this.

This comment has been minimized.

@fnothaft

fnothaft Mar 15, 2017

Member

The TL,DR is that since we're doing a set operation and not a full join, there's a significant performance reason to prefer this more specialized implementation.

@fnothaft

fnothaft Mar 15, 2017

Member

The TL,DR is that since we're doing a set operation and not a full join, there's a significant performance reason to prefer this more specialized implementation.

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 20, 2017

Member

Thanks!

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/models/SnpTable.scala
import org.bdgenomics.adam.rich.DecadentRead.Residue
import org.bdgenomics.utils.misc.Logging
import scala.annotation.tailrec
import scala.math.{ max, min }
/**
* A table containing all of the SNPs in a known variation dataset.
*
* @param table A map between a contig name and a set containing all coordinates

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 7, 2017

Member

I know this is in your TODO, but just to make sure: Docs update.

@devin-petersohn

devin-petersohn Mar 7, 2017

Member

I know this is in your TODO, but just to make sure: Docs update.

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Can we get the parameter docs updated?

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Can we get the parameter docs updated?

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 9, 2017

Coverage Status

Coverage increased (+3.8%) to 80.241% when pulling a602ec4 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 9, 2017

Coverage Status

Coverage increased (+3.8%) to 80.241% when pulling a602ec4 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 9, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1847/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1847/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 9, 2017

Coverage Status

Coverage increased (+4.1%) to 80.538% when pulling 0e4863b on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 9, 2017

Coverage Status

Coverage increased (+4.1%) to 80.538% when pulling 0e4863b on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 9, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1848/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1848/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 12, 2017

Coverage Status

Coverage increased (+4.0%) to 80.39% when pulling 6add9df on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 12, 2017

Coverage Status

Coverage increased (+4.0%) to 80.39% when pulling 6add9df on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 12, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1854/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1854/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 13, 2017

Coverage Status

Coverage increased (+3.7%) to 80.097% when pulling e2bd7c5 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 13, 2017

Coverage Status

Coverage increased (+3.7%) to 80.097% when pulling e2bd7c5 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1855/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1855/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 14, 2017

Member

@heuermh give a40cb77 a look-see.

Member

fnothaft commented Mar 14, 2017

@heuermh give a40cb77 a look-see.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 14, 2017

Coverage Status

Coverage increased (+4.09%) to 80.491% when pulling a40cb77 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+4.09%) to 80.491% when pulling a40cb77 on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1856/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1856/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 14, 2017

Member

Yahtzee! Removed both QualityScore and DecadentRead.

Member

fnothaft commented Mar 14, 2017

Yahtzee! Removed both QualityScore and DecadentRead.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 14, 2017

Coverage Status

Coverage increased (+4.03%) to 80.425% when pulling 13d77ef on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+4.03%) to 80.425% when pulling 13d77ef on fnothaft:issues/1358-bqsr-perf into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1857/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1857/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 14, 2017

Coverage Status

Coverage increased (+4.05%) to 80.448% when pulling b1c1588 on fnothaft:issues/1358-bqsr-perf into 57264bb on bigdatagenomics:master.

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+4.05%) to 80.448% when pulling b1c1588 on fnothaft:issues/1358-bqsr-perf into 57264bb on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1860/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1860/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 14, 2017

Member

@heuermh and @devin-petersohn I believe this is good to go, pending a last bit of verification that I will run tomorrow. I think I've addressed all your comments. Can I get a review pass?

Member

fnothaft commented Mar 14, 2017

@heuermh and @devin-petersohn I believe this is good to go, pending a last bit of verification that I will run tomorrow. I think I've addressed all your comments. Can I get a review pass?

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 14, 2017

Coverage Status

Coverage increased (+4.07%) to 80.469% when pulling c4f483e on fnothaft:issues/1358-bqsr-perf into 57264bb on bigdatagenomics:master.

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+4.07%) to 80.469% when pulling c4f483e on fnothaft:issues/1358-bqsr-perf into 57264bb on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1862/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1862/
Test PASSed.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 14, 2017

Coverage Status

Coverage increased (+4.07%) to 80.469% when pulling dcf78d0 on fnothaft:issues/1358-bqsr-perf into 1cae769 on bigdatagenomics:master.

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+4.07%) to 80.469% when pulling dcf78d0 on fnothaft:issues/1358-bqsr-perf into 1cae769 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1864/

Build result: ABORTED

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1423/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 2175619 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1423/merge^{commit} # timeout=10Checking out Revision 2175619 (origin/pr/1423/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2175619d8748cfc9b007830849ff94c605cf189aFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result ABORTEDADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result ABORTEDADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1864/

Build result: ABORTED

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1423/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 2175619 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1423/merge^{commit} # timeout=10Checking out Revision 2175619 (origin/pr/1423/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2175619d8748cfc9b007830849ff94c605cf189aFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result ABORTEDADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result SUCCESSADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result ABORTEDADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESSNotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1866/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1423/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains cf13794 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1423/merge^{commit} # timeout=10Checking out Revision cf13794 (origin/pr/1423/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f cf13794c98bb49cda08051c0079135d711191311First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1866/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1423/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains cf13794 # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1423/merge^{commit} # timeout=10Checking out Revision cf13794 (origin/pr/1423/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f cf13794c98bb49cda08051c0079135d711191311First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@devin-petersohn

Nice refactor! I especially like the 1200+ lines of code blubber cut. A couple of comments about docs/questions, but it looks great!

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/models/SnpTable.scala
import org.bdgenomics.adam.rich.DecadentRead.Residue
import org.bdgenomics.utils.misc.Logging
import scala.annotation.tailrec
import scala.math.{ max, min }
/**
* A table containing all of the SNPs in a known variation dataset.
*
* @param table A map between a contig name and a set containing all coordinates

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Can we get the parameter docs updated?

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Can we get the parameter docs updated?

val (indices, positions) = CollectingSnps.time {
val sortedVariants = variants.sort
.rdd
.cache()

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

In some other places we allow users to specify where to persist() their RDDs. Do we want to carry that over here?

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

In some other places we allow users to specify where to persist() their RDDs. Do we want to carry that over here?

This comment has been minimized.

@fnothaft

fnothaft Mar 15, 2017

Member

I'm inclined to not specify that here, since the SNP table should be much smaller than the reads, and we are caching to preserve the indices.

@fnothaft

fnothaft Mar 15, 2017

Member

I'm inclined to not specify that here, since the SNP table should be much smaller than the reads, and we are caching to preserve the indices.

This comment has been minimized.

Show outdated Hide outdated ...ala/org/bdgenomics/adam/rdd/read/recalibration/DinucCovariateSuite.scala
val dc = new DinucCovariate
test("computing dinucleotide pairs for a single base sequence should return None") {

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Should probably rename test for refactor. None to ('N', 'N').

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Should probably rename test for refactor. None to ('N', 'N').

current == 'G' ||
current == 'T' ||
current == 'N',
"Saw invalid base %s. Accepted bases are A,C,G,T,N.".format(current))

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Why are these the only accepted bases?

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

Why are these the only accepted bases?

This comment has been minimized.

@fnothaft

fnothaft Mar 15, 2017

Member

We've gone back and forward about this in BQSR over time (e.g., #775). IIRC, Illumina dropped the base resolution codes several years back (unambiguous bases are now just N), and we only recalibrate ACTG anyways. From a performance perspective, it's much cheaper to check 5 bases than to check 17.

@fnothaft

fnothaft Mar 15, 2017

Member

We've gone back and forward about this in BQSR over time (e.g., #775). IIRC, Illumina dropped the base resolution codes several years back (unambiguous bases are now just N), and we only recalibrate ACTG anyways. From a performance perspective, it's much cheaper to check 5 bases than to check 17.

sequence.map {
case 'A' => 'T'
case 'T' => 'A'
case 'C' => 'G'
case 'G' => 'C'
case 'W' | 'S' | 'Y' | 'R' | 'M' | 'K' | 'B' | 'D' | 'V' | 'H' | 'N' => 'N'
case _ => 'N'

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

If we really care these ambiguous base cases, we would actually have cases for each of the ambiguous bases. We may want this.

@devin-petersohn

devin-petersohn Mar 14, 2017

Member

If we really care these ambiguous base cases, we would actually have cases for each of the ambiguous bases. We may want this.

This comment has been minimized.

@fnothaft

fnothaft Mar 15, 2017

Member

My opinionated view is that we don't care about the resolution codes ;) , and that having the exhaustive match here is preferable from a perf/error catching perspective.

@fnothaft

fnothaft Mar 15, 2017

Member

My opinionated view is that we don't care about the resolution codes ;) , and that having the exhaustive match here is preferable from a perf/error catching perspective.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 14, 2017

Member

Jenkins, retest this please.

Member

fnothaft commented Mar 14, 2017

Jenkins, retest this please.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 14, 2017

Coverage Status

Coverage increased (+4.4%) to 80.785% when pulling 5860468 on fnothaft:issues/1358-bqsr-perf into 1cae769 on bigdatagenomics:master.

coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+4.4%) to 80.785% when pulling 5860468 on fnothaft:issues/1358-bqsr-perf into 1cae769 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1869/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1869/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 15, 2017

Member

Just pushed a commit that addresses the reviewer comments. I am running the validation overnight and will hopefully have updates in the AM.

Member

fnothaft commented Mar 15, 2017

Just pushed a commit that addresses the reviewer comments. I am running the validation overnight and will hopefully have updates in the AM.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 15, 2017

Coverage Status

Coverage increased (+4.2%) to 80.645% when pulling 7e87f70 on fnothaft:issues/1358-bqsr-perf into cf39e6c on bigdatagenomics:master.

Coverage Status

Coverage increased (+4.2%) to 80.645% when pulling 7e87f70 on fnothaft:issues/1358-bqsr-perf into cf39e6c on bigdatagenomics:master.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 15, 2017

Coverage Status

Coverage increased (+4.3%) to 80.785% when pulling 7e87f70 on fnothaft:issues/1358-bqsr-perf into cf39e6c on bigdatagenomics:master.

coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+4.3%) to 80.785% when pulling 7e87f70 on fnothaft:issues/1358-bqsr-perf into cf39e6c on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 15, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1878/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1878/
Test PASSed.

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Transform.scala
@@ -59,10 +59,10 @@ class TransformArgs extends Args4jBase with ADAMSaveAnyArgs with ParquetArgs {
var markDuplicates: Boolean = false
@Args4jOption(required = false, name = "-recalibrate_base_qualities", usage = "Recalibrate the base quality scores (ILLUMINA only)")
var recalibrateBaseQualities: Boolean = false
@Args4jOption(required = false, name = "-min_acceptable_quality", usage = "Minimum acceptable quality for recalibrating a base in a read.")

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

add , default 5 to usage

@heuermh

heuermh Mar 16, 2017

Member

add , default 5 to usage

Show outdated Hide outdated adam-core/src/main/scala/org/bdgenomics/adam/models/SnpTable.scala
/**
* A table containing all of the SNPs in a known variation dataset.
*
* @param table A map between a contig name and a set containing all coordinates
* where a point variant is known to exist.
* @param indices A map mapping contig names to the (first, last) index in the

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

A map mapping contig names to...A map of contig names to...

@heuermh

heuermh Mar 16, 2017

Member

A map mapping contig names to...A map of contig names to...

* SNPs to ignore during recalibration table generation.
* @param recordGroups The record groups that the reads in this dataset are from.
* @param minAcceptableAsciiPhred The minimum acceptable Phred score to attempt
* recalibrating, expressed as an ASCII character with Illumina (33) encodings.

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

add Defaults to a Phred score of 5.

@heuermh

heuermh Mar 16, 2017

Member

add Defaults to a Phred score of 5.

recordGroups: RecordGroupDictionary,
minAcceptableQuality: Int,
optStorageLevel: Option[StorageLevel]): RDD[AlignmentRecord] = {
require(minAcceptableQuality >= 0 && minAcceptableQuality < 93,

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

do we need a similar check above for minAcceptableAsciiPhred, or not worth it because it is private API?

@heuermh

heuermh Mar 16, 2017

Member

do we need a similar check above for minAcceptableAsciiPhred, or not worth it because it is private API?

This comment has been minimized.

@fnothaft

fnothaft Mar 20, 2017

Member

Yeah, my thought was to skip the check since it was private.

@fnothaft

fnothaft Mar 20, 2017

Member

Yeah, my thought was to skip the check since it was private.

Show outdated Hide outdated .../scala/org/bdgenomics/adam/rdd/read/recalibration/ObservationTable.scala
/**
* Table containing the empirical frequency of mismatches for each set of covariate values.
* @param entries The error covariate -> observed error frequency mapping.

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

-> → -&gt; or &rarr;

@heuermh

heuermh Mar 16, 2017

Member

-> → -&gt; or &rarr;

Show outdated Hide outdated .../scala/org/bdgenomics/adam/rdd/read/recalibration/ObservationTable.scala
def toCSV: String = {
/**
* @param recordGroups The record groups that generated the reads in this table.
* @return Return this table as CSV compatible with GATK's output.

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

Here and above probably not worth mentioning that the CSV is compatible with GATK's output, because of course they could change it at any time

@heuermh

heuermh Mar 16, 2017

Member

Here and above probably not worth mentioning that the CSV is compatible with GATK's output, because of course they could change it at any time

Show outdated Hide outdated ...cala/org/bdgenomics/adam/rdd/read/recalibration/RecalibrationTable.scala
* In the prior implementation of BQSR, this class was the recalibration table.
* However, the downside of this implementation was that the covariate->quality
* mapping requires a recursive calculation which winds up running in an inner
* loop many times. Given that we expect the number of bases to be orders and

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member
* However, the downside of that implementation was the covariate &rarr; quality
* mapping required a recursive calculation inside an inner loop many times.
@heuermh

heuermh Mar 16, 2017

Member
* However, the downside of that implementation was the covariate &rarr; quality
* mapping required a recursive calculation inside an inner loop many times.
import org.bdgenomics.adam.rich.RichAlignmentRecord._
import org.bdgenomics.formats.avro.AlignmentRecord
@deprecated("Use RichAlignmentRecord wherever possible in new development.", since = "0.18.0")

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

woot for delete key!

@heuermh

heuermh Mar 16, 2017

Member

woot for delete key!

Show outdated Hide outdated adam-cli/src/main/scala/org/bdgenomics/adam/cli/Transform.scala
@Args4jOption(required = false, name = "-stringency", usage = "Stringency level for various checks; can be SILENT, LENIENT, or STRICT. Defaults to LENIENT")
var stringency: String = "LENIENT"
@Args4jOption(required = false, name = "-dump_observations", usage = "Local path to dump BQSR observations to. Outputs CSV format.")

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

If this is gone, do we need any of the toCSV stuff below?

@heuermh

heuermh Mar 16, 2017

Member

If this is gone, do we need any of the toCSV stuff below?

This comment has been minimized.

@fnothaft

fnothaft Mar 20, 2017

Member

The export to CSV functionality is used in the unit tests to validate the recalibrated scores that we assign.

@fnothaft

fnothaft Mar 20, 2017

Member

The export to CSV functionality is used in the unit tests to validate the recalibrated scores that we assign.

validationStringency: ValidationStringency = ValidationStringency.LENIENT): AlignmentRecordRDD = BQSRInDriver.time {
replaceRdd(BaseQualityRecalibration(rdd, knownSnps, observationDumpFile, validationStringency))
minAcceptableQuality: Int = 5,
optStorageLevel: Option[StorageLevel] = Some(StorageLevel.MEMORY_ONLY)): AlignmentRecordRDD = BQSRInDriver.time {

This comment has been minimized.

@heuermh

heuermh Mar 16, 2017

Member

in #1427 should I have used optStorageLevel instead of storageLevel?

@heuermh

heuermh Mar 16, 2017

Member

in #1427 should I have used optStorageLevel instead of storageLevel?

This comment has been minimized.

@fnothaft

fnothaft Mar 20, 2017

Member

Oh, probably yeah, but not a huge deal. We should clean that up though.

@fnothaft

fnothaft Mar 20, 2017

Member

Oh, probably yeah, but not a huge deal. We should clean that up though.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 20, 2017

Member

@heuermh squashed down and addressed review comments.

Member

fnothaft commented Mar 20, 2017

@heuermh squashed down and addressed review comments.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 20, 2017

Coverage Status

Coverage increased (+4.4%) to 80.895% when pulling 51a8ce2 on fnothaft:issues/1358-bqsr-perf into cf39e6c on bigdatagenomics:master.

coveralls commented Mar 20, 2017

Coverage Status

Coverage increased (+4.4%) to 80.895% when pulling 51a8ce2 on fnothaft:issues/1358-bqsr-perf into cf39e6c on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 20, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1886/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1886/
Test PASSed.

[ADAM-1358] Refactor BQSR to improve performance and legibility.
Resolves #1358.

* Adds instrumentation to BQSR.
* Changed SnpTable to remove RichVariant conversion, use VariantRDD API.
* Refactoring SnpTable to eliminate per-residue costly masked site lookup.
* Restructuring core of SnpTable around an array to improve GC performance.
  Additionally, wrote custom serializer to improve serialization performance.
* Added test suite for SnpTable, to test table creation.
* Refactored SnpTable to use an IntervalArray-like approach. This approach
  improves masked site lookup performance by 50%.
* Added tests to SnpTableSuite to cover lookup case, and reenabled tests in
  BaseQualityRecalibrationSuite.
* Adding unit test coverage to covariates
* Revert "[ADAM-775] Allow all IUPAC codes in BQSR"
  This reverts commit 207eeba.
* Pulled Seq allocation for base check out into an immutable set.
* Rewrote dinuc covariate. 50% improvement in performance.
* Rewrite main BQSR aggregate as reduce by key
* Added tests to recalibrator, recalibration table.
* Majorly refactors of BQSR tables.
* Starting to factor out the QualityScore class
* Refactoring CovariateKey to reduce size in memory
* Eliminated `org.bdgenomics.adam.rich.DecadentRead` (partially resolves #577)
* Refactor CovariateKey to store record group ID instead of record group name.
* Removed `org.bdgenomics.adam.models.QualityScore`.
* Split multi-class files into one class per file (excepting private classes) to improve navigability.
* Scaladoc all the recalibrators! You get a scaladoc! And you get a scaladoc!
@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 20, 2017

Coverage Status

Coverage increased (+4.4%) to 80.895% when pulling 3726e77 on fnothaft:issues/1358-bqsr-perf into 98b263f on bigdatagenomics:master.

Coverage Status

Coverage increased (+4.4%) to 80.895% when pulling 3726e77 on fnothaft:issues/1358-bqsr-perf into 98b263f on bigdatagenomics:master.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Mar 20, 2017

Coverage Status

Coverage increased (+4.4%) to 80.895% when pulling 3726e77 on fnothaft:issues/1358-bqsr-perf into 98b263f on bigdatagenomics:master.

coveralls commented Mar 20, 2017

Coverage Status

Coverage increased (+4.4%) to 80.895% when pulling 3726e77 on fnothaft:issues/1358-bqsr-perf into 98b263f on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 20, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1888/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1888/
Test PASSed.

}
private val unknownContigs = new mutable.HashSet[String]
@tailrec private def binarySearch(rr: ReferenceRegion,

This comment has been minimized.

@devin-petersohn

devin-petersohn Mar 20, 2017

Member

Thanks!

val (indices, positions) = CollectingSnps.time {
val sortedVariants = variants.sort
.rdd
.cache()

This comment has been minimized.

@heuermh heuermh merged commit b1fce67 into bigdatagenomics:master Mar 20, 2017

2 of 3 checks passed

codacy/pr Not so good... This pull request quality could be better.
Details
coverage/coveralls Coverage increased (+4.4%) to 80.895%
Details
default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 20, 2017

Member

Thank you, @fnothaft!

Member

heuermh commented Mar 20, 2017

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment