
[SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS #13891

Closed
wants to merge 12 commits into from

Conversation

hqzizania
Contributor

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-6685
This switches from DSPR to DSYRK so that native BLAS can accelerate the computation of AtA in ALS. A buffer is allocated to stack vectors so that a Level 3 BLAS routine can be used.
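For context (an editor's sketch, not part of the PR): accumulating AtA one observation at a time with rank-1 updates (what dspr does) is mathematically equivalent to stacking the observations into a matrix and performing a single rank-n update (what dsyrk does). A minimal pure-Python illustration of that equivalence, using plain lists instead of BLAS:

```python
# Sketch (not from the PR): rank-1 updates vs. one stacked rank-n update.
# Both accumulate the upper triangle of A^T * A for k-dimensional rows.

def ata_rank1(rows, k):
    """Accumulate A^T*A one row at a time (what dspr does per observation)."""
    ata = [[0.0] * k for _ in range(k)]
    for a in rows:
        for i in range(k):
            for j in range(i, k):
                ata[i][j] += a[i] * a[j]
    return ata

def ata_stacked(rows, k):
    """One rank-n update over the whole stack (what dsyrk does)."""
    ata = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            ata[i][j] = sum(a[i] * a[j] for a in rows)
    return ata

rows = [[1.0, 2.0], [3.0, 4.0]]
print(ata_rank1(rows, 2))    # [[10.0, 14.0], [0.0, 20.0]]
print(ata_stacked(rows, 2))  # identical result
```

The performance question debated in this thread is precisely whether doing the stacked version in one native dsyrk call beats many small dspr calls.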

How was this patch tested?

Java and Scala unit tests.

@srowen
Member

srowen commented Jun 24, 2016

Is this actually faster though?

@hqzizania
Contributor Author

hqzizania commented Jun 24, 2016

This is a prototype. Admittedly, whether it is actually faster is the critical question!
I have done a simple test; the effect depends on the "number of users for each product", which equals the range of i in each loop. If a considerable number of vectors can be stacked, native BLAS is more than 3x faster than the original native BLAS path, but it is still slower than the original F2JBLAS (maybe my test data is not big enough). In my test of the original ALS, native BLAS is much slower than F2JBLAS.
Anyway, I don't know whether the "number of users for each product" can be very large in a real case.

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61171 has finished for PR 13891 at commit 8fb4a82.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Jun 24, 2016

@hqzizania This could be tested with benchmarks without ALS. I guess even with a correct implementation, we need a large rank to see improvement.

@hqzizania
Contributor Author

@mengxr Do you mean to test only add() and addStack(), without ALS?

@SparkQA

SparkQA commented Jun 24, 2016

Test build #61184 has finished for PR 13891 at commit 7e3d238.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hqzizania
Contributor Author

Code for testing:

  import java.{util => ju}
  import scala.collection.mutable
  import scala.util.Random
  import com.github.fommil.netlib.BLAS.{getInstance => blas}

  // `rank` is the process rank, used only for logging; `a` sets the problem
  // size: m = 2^a observations, n = 2^(a-1) features, stacks of 2^(a-2) rows.
  def run(rank: Int, a: Int): Unit = {
    println(s"blas.getClass = ${blas.getClass.toString} on process $rank")

    val m = 1 << a
    val n = 1 << (a - 1)
    val stack = 1 << (a - 2)
    val matrix = Array.fill(m)(Array.fill(n)(Random.nextFloat()))
    val bVector = Array.fill(m)(Random.nextDouble())
    val ls = new NormalEquation(n)

    for (u <- 0 to 3) {
      // Baseline: one dspr rank-1 update per observation.
      ls.reset()
      val t0 = System.nanoTime()
      for (i <- 0 until m) {
        ls.add(matrix(i), bVector(i))
      }
      val t1 = System.nanoTime()
      println("nostack elapsed time: " + (t1 - t0) / 1000000 + s"ms on process $rank")

      // Stacked: copy `stack` observations into a buffer, then one dsyrk call.
      ls.reset()
      val t2 = System.nanoTime()
      var i = 0
      while (i < m) {
        val matrixBuffer = mutable.ArrayBuilder.make[Double]
        val bBuffer = mutable.ArrayBuilder.make[Double]
        for (s <- 0 until stack) {
          for (j <- 0 until n) {
            matrixBuffer += matrix(i + s)(j)
          }
          bBuffer += bVector(i + s)
        }
        i += stack
        ls.addStack(matrixBuffer.result(), bBuffer.result(), stack)
      }
      val t3 = System.nanoTime()
      println("stack elapsed time: " + (t3 - t2) / 1000000 + s"ms on process $rank")
    }
  }

  class NormalEquation(val k: Int) extends Serializable {

    /** Number of entries in the upper triangular part of a k-by-k matrix. */
    val triK = k * (k + 1) / 2
    /** A^T^ * A */
    val ata = new Array[Double](triK)
    /** A^T^ * b */
    val atb = new Array[Double](k)

    private val da = new Array[Double](k)
    private val ata2 = new Array[Double](k * k)
    private val upper = "U"

    private def copyToDouble(a: Array[Float]): Unit = {
      var i = 0
      while (i < k) {
        da(i) = a(i)
        i += 1
      }
    }

    /** Folds the full k-by-k buffer `ata2` into the packed upper triangle `ata`, then clears it. */
    private def copyToTri(): Unit = {
      var ii = 0
      for (i <- 0 until k) {
        for (j <- 0 to i) {
          ata(ii) += ata2(i * k + j)
          ata2(i * k + j) = 0.0
          ii += 1
        }
      }
    }

    /** Adds an observation. */
    def add(a: Array[Float], b: Double, c: Double = 1.0): this.type = {
      require(c >= 0.0)
      require(a.length == k)
      copyToDouble(a)
      blas.dspr(upper, k, c, da, 1, ata)
      if (b != 0.0) {
        blas.daxpy(k, c * b, da, 1, atb, 1)
      }
      this
    }

    /** Adds a stack of observations. */
    def addStack(a: Array[Double], b: Array[Double], n: Int): this.type = {
      require(a.length == n * k)
      blas.dsyrk(upper, "N", k, n, 1.0, a, k, 1.0, ata2, k)
      copyToTri()
      blas.dgemv("N", k, n, 1.0, a, k, b, 1, 1.0, atb, 1)
      this
    }

    /** Merges another normal equation object. */
    def merge(other: NormalEquation): this.type = {
      require(other.k == k)
      blas.daxpy(ata.length, 1.0, other.ata, 1, ata, 1)
      blas.daxpy(atb.length, 1.0, other.atb, 1, atb, 1)
      this
    }

    /** Resets everything to zero, which should be called after each solve. */
    def reset(): Unit = {
      ju.Arrays.fill(ata, 0.0)
      ju.Arrays.fill(ata2, 0.0)
      ju.Arrays.fill(atb, 0.0)
    }
  }
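An editor's note on the copyToTri step above: dspr writes into packed upper-triangular storage (triK = k*(k+1)/2 entries), while dsyrk writes into a full k-by-k column-major buffer, so the results have to be folded back together. A standalone sketch of that index mapping (in Python for readability, not part of the patch):

```python
def packed_upper_index(j, i):
    """Packed (column-major, upper) index of element (row j, col i), j <= i."""
    return i * (i + 1) // 2 + j

def copy_to_tri(ata2, k):
    """Fold a full k-by-k column-major buffer into packed upper-triangular form,
    mirroring what copyToTri does with `ata2` and `ata`."""
    tri = [0.0] * (k * (k + 1) // 2)
    ii = 0
    for i in range(k):          # column
        for j in range(i + 1):  # row, upper triangle only
            tri[ii] = ata2[i * k + j]
            ii += 1
    return tri

# For k = 2, the column-major buffer [a00, a10, a01, a11] packs to [a00, a01, a11].
print(copy_to_tri([1.0, 9.0, 2.0, 4.0], 2))  # [1.0, 2.0, 4.0]
```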

results:

[two benchmark result images, not reproduced in this transcript]

@hqzizania
Contributor Author

hqzizania commented Jun 28, 2016

@mengxr This is a simple imitation of the loop that computeFactors[ID]() in ALS uses. It runs on a bare-metal node with 4 cores. All tests use all cores via RDD multi-partitioning. Does this make sense for this patch?

@@ -1296,6 +1316,9 @@ object ALS extends DefaultParamsReadable[ALS] with Logging {
}
var i = srcPtrs(j)
var numExplicits = 0
val doStack = if (srcPtrs(j + 1) - srcPtrs(j) > 10) true else false
Member


if (...) true else false is redundant

@srowen
Member

srowen commented Jun 28, 2016

It may not matter much, but your test code is a little different from the patch, e.g. for copyToTri().
It's optional, but a few comments explaining what addStack does might help readers.

set stack size > 128 and comments added
@hqzizania
Contributor Author

hqzizania commented Jun 28, 2016

@srowen Oops! The copyToTri() is indeed a little different in the test code. I changed it to:

    private def copyToTri(): Unit = {
      var i = 0
      var j = 0
      var ii = 0
      while (i < k) {
        val temp = i * k
        j = 0
        while (j <= i) {
          ata(ii) += ata2(temp + j)
          j += 1
          ii += 1
        }
        i += 1
      }
    }

And ls.reset() is added into the doStack loop as follows (likewise for the other loop):

      val t2 = System.nanoTime()
      var i = 0
      while (i < m) {
        ls.reset()
        val matrixBuffer = mutable.ArrayBuilder.make[Double]
        val bBuffer = mutable.ArrayBuilder.make[Double]
        for (s <- 0 until stack) {
          for (j <- 0 until n) {
            matrixBuffer += matrix(i + s)(j)
          }
          bBuffer += bVector(i + s)
        }
        i += stack
        ls.addStack(matrixBuffer.result(), bBuffer.result(), stack)
      }
      val t3 = System.nanoTime()

The results are basically the same. Actually, ls.reset() is also called in each inner loop that computeFactors in ALS uses.

@hqzizania
Contributor Author

hqzizania commented Jun 28, 2016

I set the threshold size for stacking to 128 based on some more test results, though 128 may be a conservative choice. However, this change will bypass the existing unit tests, as doStack is then always false. This patch passed all unit tests on my local machine when the threshold was set to 10 (so doStack is sometimes true). Thus, I added a unit test for it.

@SparkQA

SparkQA commented Jun 28, 2016

Test build #61381 has finished for PR 13891 at commit 3607bdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2016

Test build #61465 has finished for PR 13891 at commit 56194eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hqzizania
Contributor Author

cc @mengxr @yanboliang Is this patch okay?

@yanboliang
Contributor

yanboliang commented Sep 21, 2016

@hqzizania Could you share the regression performance test result? I have time to get this in if it's ready. Thanks.

@hqzizania
Contributor Author

@yanboliang Sorry, I'm on a business trip and will upload the test results ASAP.

@hqzizania
Contributor Author

hqzizania commented Oct 21, 2016

@yanboliang So sorry for my late response.

Some regression performance test results:
Datasets: generated with genExplicitTestData, with numUsers = 20000 and numItems = 2000
Single-node cluster: 16 physical cores, 100 GB memory
ALS: numUserBlocks = 30, numItemBlocks = 30
It runs computeFactors with 30 partitions in parallel.

ALS: rank = 1024
Computing time and used memory for computeFactors:
[benchmark chart]

ALS: rank = 129
Computing time for computeFactors:
[benchmark chart]

ALS: rank = 512
Computing time for computeFactors:
[benchmark chart]

The results show that this patch gives a large speedup when the rank is large, but we should reset the two threshold values for "doStack" to 1024 rather than 128.
However, a consequent problem is that the unit test for this patch will take a long time, as the rank must be larger than 1024. Should I just remove the unit test?

@mengxr
Contributor

mengxr commented Oct 21, 2016

@hqzizania Thanks for the performance tests! This matches my guess. I'm not sure how often people use a rank greater than 1000 or even 250. But I think it is good to use BLAS level-3 routines. We can make the threshold a param, set a small threshold, and test both code paths.
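An editor's sketch of the threshold-as-a-param idea (hypothetical names, not the PR's actual code): dispatch between the level-2 and level-3 paths based on how many observations a block has, so tests can force either path by shrinking the threshold.

```python
def solve_block(observations, threshold, add_one, add_stack):
    """Dispatch to the stacked (level-3) path only when enough rows accumulate.

    `add_one` / `add_stack` stand in for NormalEquation.add / addStack;
    `threshold` stands in for the new ALS param (128 or 1024 in the tests above).
    """
    if len(observations) > threshold:
        add_stack(observations)   # one dsyrk-style call over the whole stack
        return "stacked"
    for a in observations:
        add_one(a)                # dspr-style rank-1 update per observation
    return "one-by-one"
```

Setting the param low in unit tests exercises the stacked path even on tiny data, which is what makes both code paths testable without a rank above 1024.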

@hqzizania
Contributor Author

@mengxr I see. I will add a param for it. :)

@SparkQA

SparkQA commented Oct 21, 2016

Test build #67323 has finished for PR 13891 at commit dc4f4ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67422 has finished for PR 13891 at commit d29fd67.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67423 has finished for PR 13891 at commit 294164d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67424 has finished for PR 13891 at commit 513e791.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67432 has finished for PR 13891 at commit 1f3ff96.

  • This patch fails MiMa tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67434 has finished for PR 13891 at commit 1081e64.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67435 has finished for PR 13891 at commit a6b5a16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hqzizania
Contributor Author

@mengxr @srowen @yanboliang A threshold param is added for unit tests. Does it look okay now?

@@ -864,6 +864,9 @@ object MimaExcludes {
// [SPARK-12221] Add CPU time to metrics
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.status.api.v1.TaskMetrics.this"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.status.api.v1.TaskMetricDistributions.this")
) ++ Seq(
// SPARK-6685
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
Member


Does this actually remove a method? That shouldn't be necessary, I imagine.

Contributor Author


Actually, I don't know exactly how to write the MiMa exclude. It may not be a proper solution to the MiMa failure, which may be caused by the modification to def train.

Member


I see. It's a developer API, so more reasonable to change, though it's still ideal to not change these APIs unless necessary. Try putting the new param at the end? I don't think that changes the situation but makes it at least source-compatible with any current invocations.

@SparkQA

SparkQA commented Oct 25, 2016

Test build #67523 has finished for PR 13891 at commit 2077457.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hqzizania
Contributor Author

@srowen It seems the MiMa test still fails when the new param is put at the end of the train method. :(

@srowen
Member

srowen commented Oct 26, 2016

@hqzizania OK thanks for checking that. That may be an issue for this change.

@hqzizania
Contributor Author

@srowen I am really not familiar with MiMa, so what should I do now? Or should I just go back to the previous commit and create a JIRA for this issue?

@HyukjinKwon
Member

@hqzizania If you check the log, there are some guides on how to do this. Should we maybe rebase this and check the logs?

@HyukjinKwon
Member

I propose to close this, assuming it is inactive.
