[SPARK-33609][ML] word2vec reduce broadcast size #30548

zhengruifeng · 2020-11-30T13:12:30Z

What changes were proposed in this pull request?

1. Directly use float vectors instead of converting them to double vectors; this is about 2x faster than using vec.axpy.
2. Mark wordList and wordVecNorms as lazy.
3. Avoid slicing in the computation of wordVecNorms.

Why are the changes needed?

To halve the broadcast size.
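
As a rough illustration of where the halving comes from, the sketch below compares the raw payload of a flat vector array stored as Float (4 bytes per element) and as Double (8 bytes per element). The object name, the `payloadBytes` helper, and the vocabulary and vector sizes are hypothetical; JVM object headers and serialization overhead are ignored.

```scala
object BroadcastSizeSketch {
  // Approximate payload in bytes of numWords * vectorSize elements,
  // ignoring JVM object headers and serialization overhead.
  def payloadBytes(numWords: Int, vectorSize: Int, bytesPerElement: Int): Long =
    numWords.toLong * vectorSize * bytesPerElement

  def main(args: Array[String]): Unit = {
    val numWords = 1000000 // hypothetical vocabulary size
    val vectorSize = 100   // hypothetical embedding dimension
    val asFloat = payloadBytes(numWords, vectorSize, 4)  // Float is 4 bytes
    val asDouble = payloadBytes(numWords, vectorSize, 8) // Double is 8 bytes
    println(s"float: $asFloat bytes, double: $asDouble bytes")
  }
}
```

Under these assumptions the Float payload is exactly half the Double payload, independent of the chosen sizes.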

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

zhengruifeng · 2020-11-30T13:13:19Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

    var i = 0
-    while (i < numWords) {
-      val vec = wordVectors.slice(i * vectorSize, i * vectorSize + vectorSize)


avoid this slicing
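
A minimal sketch of computing the norms without slicing, assuming the word vectors are stored back to back in one flat Float array as in the removed lines above. The object name and the plain-loop norm are illustrative only and are not necessarily what the patch uses.

```scala
object NormsSketch {
  // Euclidean norm of each word vector, reading the flat array in place
  // via an offset instead of copying each vector out with slice.
  def wordVecNorms(wordVectors: Array[Float], vectorSize: Int): Array[Float] = {
    val numWords = wordVectors.length / vectorSize
    val norms = new Array[Float](numWords)
    var i = 0
    while (i < numWords) {
      val offset = i * vectorSize
      var sumSq = 0.0 // accumulate in Double to limit rounding error
      var j = 0
      while (j < vectorSize) {
        val v = wordVectors(offset + j)
        sumSq += v * v
        j += 1
      }
      norms(i) = math.sqrt(sumSq).toFloat
      i += 1
    }
    norms
  }
}
```

The loop allocates only the result array, whereas slicing allocates one temporary array per word.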

zhengruifeng · 2020-11-30T13:13:34Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -538,9 +538,13 @@ class Word2VecModel private[spark] (
  @Since("1.1.0")
  def transform(word: String): Vector = {
    wordIndex.get(word) match {
-      case Some(ind) =>
-        val vec = wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)


avoid this slicing

SparkQA · 2020-11-30T14:36:49Z

Test build #131984 has finished for PR 30548 at commit 978b225.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-11-30T15:31:15Z

mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala

@@ -278,34 +279,45 @@ class Word2VecModel private[ml] (
  @Since("1.4.0")
  def setOutputCol(value: String): this.type = set(outputCol, value)

+  private var bcModel: Broadcast[Word2VecModel] = _


I don't suppose we have a way to clean this up after use - will just have to get GCed?

Yes. I followed the implementation of CountVectorizer here.
Since other .ml implementations do not keep a broadcast variable in a mutable var like this, I will remove this var.

As for CountVectorizer, should we also remove the var broadcastDict there? It looks like other mllib implementations do not use a mutable broadcast variable like that.

srowen · 2020-11-30T15:32:04Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

+        val offset = index * size
+        val array = Array.ofDim[Double](size)
+        var i = 0
+        while (i < size) { array(i) = wordVectors(offset + i); i += 1 }


Is this actually more efficient than slice? Likewise above.

I guess so; I will run a simple test.

srowen · 2020-11-30T15:34:21Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

  }

  // wordVecNorms: Array of length numWords, each value being the Euclidean norm
  //               of the wordVector.
-  private val wordVecNorms: Array[Float] = {
-    val wordVecNorms = new Array[Float](numWords)
+  private lazy val wordVecNorms: Array[Float] = {


How much does this save, if it only happens once and has to happen to use the model?

The value wordVecNorms is only used in the findSynonyms method of the .mllib Word2VecModel; however, this findSynonyms is never used on the .ml side. So I think we can make it lazy.

OK fair enough. There are use cases here that would never need this calculated?

however, this findSynonyms is never used on the .ml side. So I think we can make it lazy.

I was wrong. wordVecNorms is used by the findSynonyms and findSynonymsArray methods on the .ml side. Since it is not used in transform, we can still mark it lazy.
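
To illustrate the effect of `lazy val` discussed here, the sketch below uses a hypothetical `Model` class with a counter that records how often the norm is computed. The class, its `first` method, and the counter are made up for illustration; only the `lazy val` semantics are the point.

```scala
object LazySketch {
  class Model(wordVectors: Array[Float]) {
    var normComputations = 0 // counter for illustration only

    // Evaluated on first access, then cached.
    lazy val norm: Float = {
      normComputations += 1
      math.sqrt(wordVectors.map(v => v.toDouble * v).sum).toFloat
    }

    // Stands in for a code path such as transform that never needs the norm.
    def first: Float = wordVectors(0)
  }
}
```

A caller that only uses `first` never pays for the norm computation; a caller that reads `norm` pays for it once.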

srowen · 2020-11-30T15:35:05Z

mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala

-    val word2Vec = udf { sentence: Seq[String] =>
+
+    if (bcModel == null) {
+      bcModel = dataset.sparkSession.sparkContext.broadcast(this)


Looks like you only use this.wordVectors below? Maybe just broadcast that.

Both wordVectors and wordIndex are used.

Yes, there are two wordVectors...

Oops, right, I think I meant to say that you only use those two. Are there any savings from just broadcasting those rather than the whole model? If not, that's fine.

zhengruifeng · 2020-12-03T02:28:30Z

test performance:

  test("add float to array") {
    val floatArrays = Array.tabulate(10000, 100)((i, j) => i.toFloat / (j + 1))
    val vectors = floatArrays.map(array => Vectors.dense(array.map(_.toDouble)))

    val tic0 = System.nanoTime()
    Seq.range(0, 1000).foreach { i =>
      val sum = Array.ofDim[Double](100)
      floatArrays.foreach { array =>
        var j = 0
        while (j < 100) { sum(j) += array(j); j += 1 }
      }
    }
    val toc0 = System.nanoTime()


    val tic1 = System.nanoTime()
    Seq.range(0, 1000).foreach { i =>
      val sum = Vectors.zeros(100)
      vectors.foreach { vec =>
        org.apache.spark.ml.linalg.BLAS.axpy(1.0, vec, sum)
      }
    }
    val toc1 = System.nanoTime()

    println(s"array sum: ${toc0 - tic0}, vector axpy: ${toc1 - tic1}")
  }

result:

@srowen It seems that directly adding float values is nearly 2x faster than using axpy, and it also halves the broadcast size.
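
For context, the array-sum variant above corresponds roughly to averaging the vectors of the words in a sentence straight from the flat Float array. The sketch below is an assumption-laden illustration: the object name, the `averageVector` signature, and the choice to divide by the full sentence length (including words missing from the vocabulary) are not taken from the patch.

```scala
object SentenceVectorSketch {
  // Sum the Float vectors of known words into a Double accumulator,
  // then divide by the sentence length (an assumption of this sketch).
  def averageVector(
      sentence: Seq[String],
      wordIndex: Map[String, Int],
      wordVectors: Array[Float],
      vectorSize: Int): Array[Double] = {
    val sum = Array.ofDim[Double](vectorSize)
    sentence.foreach { word =>
      wordIndex.get(word).foreach { index =>
        val offset = index * vectorSize
        var i = 0
        while (i < vectorSize) { sum(i) += wordVectors(offset + i); i += 1 }
      }
    }
    if (sentence.nonEmpty) {
      var i = 0
      while (i < vectorSize) { sum(i) /= sentence.size; i += 1 }
    }
    sum
  }
}
```

No intermediate Double vector is created per word; each Float is widened only as it is added to the accumulator.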

srowen · 2020-12-03T02:51:02Z

I buy that. If this is in response to the slice comment above, I am looking at a different part of the change where you unrolled the slice. Not a big deal but I guess I'd be surprised if it makes a difference, and if not, then slice is simpler.

srowen · 2020-12-03T02:56:23Z

mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala

-    val emptyVec = Vectors.sparse(d, Array.emptyIntArray, Array.emptyDoubleArray)
-    val word2Vec = udf { sentence: Seq[String] =>
+
+    val bcModel = dataset.sparkSession.sparkContext.broadcast(this.wordVectors)


At first glance this makes more sense. But we can't call bcModel.destroy() at the end here anyway, so we have a broadcast we can't explicitly close no matter what. And now, I guess, this would re-broadcast every time? That could be bad. What do you think? I know the code is not consistent on this either way.

And now, I guess, this would re-broadcast every time? That could be bad. What do you think?

I agree. I would prefer not to broadcast in transform, but this may need more discussion. We can keep the current behavior for now.

GBT models have also been broadcast this way for performance since SPARK-7127.

Looks good, but I'd back out this part of the change.

zhengruifeng · 2020-12-03T03:00:48Z

slicing:

  test("slicing vs non-slicing") {
    val n = 10000
    val size = 100
    val floatArray = Array.tabulate(n * size)(i => i.toFloat)

    val tic0 = System.nanoTime()
    Seq.range(0, 1000).foreach { i =>
      Seq.range(0, n).foreach { j =>
        floatArray.slice(j * size, j * size + size).map(_.toDouble)
      }
    }
    val toc0 = System.nanoTime()


    val tic1 = System.nanoTime()
    Seq.range(0, 1000).foreach { i =>
      Seq.range(0, n).foreach { j =>
        val doubles = Array.ofDim[Double](size)
        val offset = j * size
        var k = 0
        while (k < size) { doubles(k) = floatArray(offset + k); k += 1 }
      }
    }
    val toc1 = System.nanoTime()

    println(s"slicing: ${toc0 - tic0}, non-slicing: ${toc1 - tic1}")
  }

@srowen Slicing and then mapping to double is about 10x slower than the new implementation. That is somewhat surprising to me.

SparkQA · 2020-12-03T03:22:03Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36679/

SparkQA · 2020-12-03T03:49:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36679/

SparkQA · 2020-12-03T04:23:17Z

Test build #132082 has finished for PR 30548 at commit 3ba4fda.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-12-08T03:05:31Z

Merged to master, thanks @srowen for reviewing!

init

978b225

github-actions bot added ML MLLIB labels Nov 30, 2020

srowen reviewed Nov 30, 2020

address comments

3ba4fda

srowen reviewed Dec 3, 2020

srowen approved these changes Dec 5, 2020

zhengruifeng closed this in ebd8b93 Dec 8, 2020

zhengruifeng deleted the w2v_float32_transform branch December 8, 2020 03:05
