New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed hashCode implementations to improve performance of BQSR #576

Merged
merged 1 commit into from Feb 9, 2015

Conversation

Projects
None yet
4 participants
@nfergu
Contributor

nfergu commented Feb 8, 2015

Some notes:

  • When looking at the performance of BQSR the Util.hashCombine method appeared to perform poorly, so I have replaced its usage in CovariateKey and QualityScore with hash codes based on the recipe from Programming in Scala (http://www.artima.com/pins1ed/object-equality.html).
  • On my small test data set of about 90,000 reads this change makes ObservationAccumulator.+= about 3.5 times faster, making BQSR about 50% faster in total (not including times for loading, saving files etc). I'm aware that this is not a particularly realistic test, but I think it's probably sufficient for such a low-risk change.
  • I have also removed Util.hashCombine method and replaced its other usages, since others may use it without being aware of the performance penalty.
  • This change also included some new timers in the Timers object
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Feb 8, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/589/
Test PASSed.

AmplabJenkins commented Feb 8, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/589/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Feb 8, 2015

Member

Nice catch @nfergu ! The aggregate in BQSR is notably slow on large datasets. I'm going to be doing a big cluster run later this week, so I'll give this a run on the high coverage NA12878 to benchmark perf and concordance.

Member

fnothaft commented Feb 8, 2015

Nice catch @nfergu ! The aggregate in BQSR is notably slow on large datasets. I'm going to be doing a big cluster run later this week, so I'll give this a run on the high coverage NA12878 to benchmark perf and concordance.

@@ -48,7 +48,11 @@ object Timers extends Metrics {
// Recalibrate Base Qualities
val BQSRInDriver = timer("Base Quality Recalibration")
val CreateKnownSnpsTable = timer("Create Known SNPs Table")
val ComputeCovariates = timer("Compute Covariates")

This comment has been minimized.

@massie

massie Feb 9, 2015

Member

You know. Whoever wrote that timer library deserves kudos; it sure is useful. :)

@massie

massie Feb 9, 2015

Member

You know. Whoever wrote that timer library deserves kudos; it sure is useful. :)

massie added a commit that referenced this pull request Feb 9, 2015

Merge pull request #576 from nfergu/bqsrperf2
Changed hashCode implementations to improve performance of BQSR

@massie massie merged commit 83e4c9a into bigdatagenomics:master Feb 9, 2015

1 check passed

default Merged build finished.
Details
@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Feb 9, 2015

Member

Thanks @nfergu !

Member

massie commented Feb 9, 2015

Thanks @nfergu !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment