improve SW performance by replacing functional reductions with imperative ones #965

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
5 participants
@noamBarkai

while running Smith-Waterman on large data sets over Spark we noticed "hot spots" in Scala's for-loops, hence the change below replaces the elegant functional calculations with the far less-elegant but slightly more efficient while-loops.
We've seen a ~30% improvement in performance with this change.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 2, 2016

Can one of the admins verify this patch?

Can one of the admins verify this patch?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 3, 2016

Member

Jenkins, test this please

Member

heuermh commented Mar 3, 2016

Jenkins, test this please

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 3, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1102/

Build result: FAILURE

GitHub pull request #965 of commit baba084 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/965/merge^{commit} # timeout=10 > git branch -a --contains 8cc5c6d # timeout=10 > git rev-parse remotes/origin/pr/965/merge^{commit} # timeout=10Checking out Revision 8cc5c6d (origin/pr/965/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8cc5c6d1508118846a1ddc4f4dd3decbc17a7994First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1102/

Build result: FAILURE

GitHub pull request #965 of commit baba084 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/965/merge^{commit} # timeout=10 > git branch -a --contains 8cc5c6d # timeout=10 > git rev-parse remotes/origin/pr/965/merge^{commit} # timeout=10Checking out Revision 8cc5c6d (origin/pr/965/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8cc5c6d1508118846a1ddc4f4dd3decbc17a7994First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 3, 2016

Member

Thanks @noamBarkai! LGTM! I'll look at the failing tests a bit later. Always amusing to run into while being much more efficient than for in Scala.

Member

fnothaft commented Mar 3, 2016

Thanks @noamBarkai! LGTM! I'll look at the failing tests a bit later. Always amusing to run into while being much more efficient than for in Scala.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 13, 2016

Member

Jenkins, retest this please

Member

heuermh commented Mar 13, 2016

Jenkins, retest this please

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 13, 2016

Member

I saw a 40% speed up using this simple example on 36000 comparisons over 1024 executors.

Curious, are you using sw for biological sequences? I ask because our current implementation does not have support for alphabets or substitution matrices.

Member

heuermh commented Mar 13, 2016

I saw a 40% speed up using this simple example on 36000 comparisons over 1024 executors.

Curious, are you using sw for biological sequences? I ask because our current implementation does not have support for alphabets or substitution matrices.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 13, 2016

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1103/

Build result: FAILURE

GitHub pull request #965 of commit baba084 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/965/merge^{commit} # timeout=10 > git branch -a --contains 8cc5c6d # timeout=10 > git rev-parse remotes/origin/pr/965/merge^{commit} # timeout=10Checking out Revision 8cc5c6d (origin/pr/965/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8cc5c6d1508118846a1ddc4f4dd3decbc17a7994First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1103/

Build result: FAILURE

GitHub pull request #965 of commit baba084 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/965/merge^{commit} # timeout=10 > git branch -a --contains 8cc5c6d # timeout=10 > git rev-parse remotes/origin/pr/965/merge^{commit} # timeout=10Checking out Revision 8cc5c6d (origin/pr/965/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 8cc5c6d1508118846a1ddc4f4dd3decbc17a7994First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centosTriggering ADAM-prb ? 2.6.0,2.10,1.5.2,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@noamBarkai

This comment has been minimized.

Show comment
Hide comment
@noamBarkai

noamBarkai Mar 13, 2016

@heuermh - glad to hear you've reproduced the performance improvement. 👍
re your question - yes, we are using it for biological sequences. for the purposes we're using it for, alphabets and substitution matrices are of less concern.

@heuermh - glad to hear you've reproduced the performance improvement. 👍
re your question - yes, we are using it for biological sequences. for the purposes we're using it for, alphabets and substitution matrices are of less concern.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 22, 2016

Member

@fnothaft @massie is this just failing in Jenkins because of the code formatter? I'm not sure what I'm seeing wrong in the console.

Member

heuermh commented Mar 22, 2016

@fnothaft @massie is this just failing in Jenkins because of the code formatter? I'm not sure what I'm seeing wrong in the console.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Mar 22, 2016

Member

The error is

++ git status --porcelain
+ test -n ' M adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala'
+ echo 'Please run '\''./scripts/format-source'\'''
Please run './scripts/format-source'
+ exit 1
Build step 'Execute shell' marked build as failure
Recording test results
Finished: FAILURE

which, as you say, looks like an issue with the source formatting of SmithWaterman.scala. It's odd that it would only show up in a single build of the matrix.

Check that your git repo is clean and then run ./scripts/format-source and see if any files are modified.

Member

massie commented Mar 22, 2016

The error is

++ git status --porcelain
+ test -n ' M adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala'
+ echo 'Please run '\''./scripts/format-source'\'''
Please run './scripts/format-source'
+ exit 1
Build step 'Execute shell' marked build as failure
Recording test results
Finished: FAILURE

which, as you say, looks like an issue with the source formatting of SmithWaterman.scala. It's odd that it would only show up in a single build of the matrix.

Check that your git repo is clean and then run ./scripts/format-source and see if any files are modified.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 22, 2016

Member

The 2.10/1.5.2/2.6.0 configuration is the first build to run, if that fails the whole build stops.

$ ./scripts/format-source
$ git diff .
diff --git a/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala b/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
index d4e3930..83179a9 100644
--- a/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
+++ b/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
@@ -47,25 +47,25 @@ abstract class SmithWaterman(xSequence: String, ySequence: String) extends Seria
    * @param matrix Matrix to score.
    * @return Tuple of (i, j) coordinates.
    */
-    private[smithwaterman] final def maxCoordinates(matrix: Array[Array[Double]]): (Int, Int) = {
-      var xMax = 0
-      var yMax = 0
-      var max = Double.MinValue
-      var x = 0
-      while (x < matrix.length) {
-        var y = 0
-        while (y < matrix(x).length) {
-          if (matrix(x)(y) >= max) {
-            max = matrix(x)(y)
-            xMax = x
-            yMax = y
-          }
-          y += 1
+  private[smithwaterman] final def maxCoordinates(matrix: Array[Array[Double]]): (Int, Int) = {
+    var xMax = 0
+    var yMax = 0
+    var max = Double.MinValue
+    var x = 0
+    while (x < matrix.length) {
+      var y = 0
+      while (y < matrix(x).length) {
+        if (matrix(x)(y) >= max) {
+          max = matrix(x)(y)
+          xMax = x
+          yMax = y
         }
-        x += 1
+        y += 1
       }
-      (yMax, xMax)
+      x += 1
     }
+    (yMax, xMax)
+  }

@noamBarkai could you run scripts/format-source and then update this pull request? Thank you.

Member

heuermh commented Mar 22, 2016

The 2.10/1.5.2/2.6.0 configuration is the first build to run, if that fails the whole build stops.

$ ./scripts/format-source
$ git diff .
diff --git a/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala b/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
index d4e3930..83179a9 100644
--- a/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
+++ b/adam-core/src/main/scala/org/bdgenomics/adam/algorithms/smithwaterman/SmithWaterman.scala
@@ -47,25 +47,25 @@ abstract class SmithWaterman(xSequence: String, ySequence: String) extends Seria
    * @param matrix Matrix to score.
    * @return Tuple of (i, j) coordinates.
    */
-    private[smithwaterman] final def maxCoordinates(matrix: Array[Array[Double]]): (Int, Int) = {
-      var xMax = 0
-      var yMax = 0
-      var max = Double.MinValue
-      var x = 0
-      while (x < matrix.length) {
-        var y = 0
-        while (y < matrix(x).length) {
-          if (matrix(x)(y) >= max) {
-            max = matrix(x)(y)
-            xMax = x
-            yMax = y
-          }
-          y += 1
+  private[smithwaterman] final def maxCoordinates(matrix: Array[Array[Double]]): (Int, Int) = {
+    var xMax = 0
+    var yMax = 0
+    var max = Double.MinValue
+    var x = 0
+    while (x < matrix.length) {
+      var y = 0
+      while (y < matrix(x).length) {
+        if (matrix(x)(y) >= max) {
+          max = matrix(x)(y)
+          xMax = x
+          yMax = y
         }
-        x += 1
+        y += 1
       }
-      (yMax, xMax)
+      x += 1
     }
+    (yMax, xMax)
+  }

@noamBarkai could you run scripts/format-source and then update this pull request? Thank you.

@noamBarkai

This comment has been minimized.

Show comment
Hide comment
@noamBarkai

noamBarkai Mar 22, 2016

sorry about that, done.
I amended this additional change to the previous commit and force-pushed so as not to "litter" the commit log, but the formatting change is there.

sorry about that, done.
I amended this additional change to the previous commit and force-pushed so as not to "litter" the commit log, but the formatting change is there.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 22, 2016

Member

Perfect, thank you. Will merge once Mr Jenkins gives his approval.

Member

heuermh commented Mar 22, 2016

Perfect, thank you. Will merge once Mr Jenkins gives his approval.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 22, 2016

Member

Jenkins, retest this please

Member

heuermh commented Mar 22, 2016

Jenkins, retest this please

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Mar 22, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1111/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1111/
Test PASSed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 22, 2016

Member

Thanks for the contribution! Merged in commit a505bb9

Member

heuermh commented Mar 22, 2016

Thanks for the contribution! Merged in commit a505bb9

@heuermh heuermh closed this Mar 22, 2016

@noamBarkai

This comment has been minimized.

Show comment
Hide comment
@noamBarkai

noamBarkai Mar 22, 2016

cool! thanks.

On Tue, Mar 22, 2016 at 9:00 PM, Michael L Heuer notifications@github.com
wrote:

Thanks for the contribution! Merged in commit a505bb9
a505bb9


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#965 (comment)

cool! thanks.

On Tue, Mar 22, 2016 at 9:00 PM, Michael L Heuer notifications@github.com
wrote:

Thanks for the contribution! Merged in commit a505bb9
a505bb9


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#965 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment