[SPARK-19922][ML] small speedups to findSynonyms #17263

Krimit · 2017-03-12T02:48:15Z

Currently, generating synonyms using a large model (I've tested with 3M words) is very slow. These changes have sped things up for us by ~17%.

I wasn't sure if such small changes were worthy of a JIRA, but the guidelines seemed to suggest that this is the preferred approach.

What changes were proposed in this pull request?

Address a few small issues in the findSynonyms logic:

Remove usage of Array.fill to zero out the cosineVec array. The default value of a Float array element in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed.
Use Floats throughout. The conversion to Doubles before filling the priorityQueue is superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just puts extra strain on the GC.
Convert the slow for(i <- cosVec.indices) to an ugly, but faster, while loop.

These efficiencies are really only apparent when working with a large model
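
The first and third points can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `LoopSketch`, `normalizeFor`, and `normalizeWhile` are hypothetical names, not the code in Word2Vec.scala, and the sketch makes no claim about how large the speedup is.

```scala
object LoopSketch {
  // Hypothetical helper: divide each similarity by its norm using a
  // for comprehension over a Range (the style the PR moves away from).
  def normalizeFor(sims: Array[Float], norms: Array[Float]): Unit = {
    for (i <- sims.indices) {
      sims(i) = if (norms(i) == 0.0f) 0.0f else sims(i) / norms(i)
    }
  }

  // Same arithmetic as normalizeFor, written as an index-based while loop.
  def normalizeWhile(sims: Array[Float], norms: Array[Float]): Unit = {
    var i = 0
    while (i < sims.length) {
      sims(i) = if (norms(i) == 0.0f) 0.0f else sims(i) / norms(i)
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val numWords = 4
    // A freshly allocated Array[Float] already holds 0.0f in every slot,
    // so an explicit fill pass produces the same contents.
    val allocated = new Array[Float](numWords)
    val filled = Array.fill[Float](numWords)(0.0f)
    assert(allocated.sameElements(filled))

    // Both loop styles must produce identical results.
    val norms = Array(2.0f, 0.0f, 4.0f, 1.0f)
    val a = Array(1.0f, 3.0f, 2.0f, 5.0f)
    val b = a.clone()
    normalizeFor(a, norms)
    normalizeWhile(b, norms)
    assert(a.sameElements(b))
    println(b.mkString(", "))
  }
}
```

Both loops compute the same values; the while loop simply avoids going through a Range and a closure on every element.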

How was this patch tested?

Existing unit tests + some in-house tests to time the difference

cc @jkbradley @MLnick @srowen

Currently generating synonyms using a model with 3m words is painfully slow. These efficiencies have sped things up by more than 17%.

Address a few issues in the findSynonyms logic:
1) no need to zero out the cosineVec array each time, since the default value for float arrays is 0.0f. This should offer some nice speedups
2) use floats throughout. The conversion to Doubles before doing the priorityQueue is totally superfluous, since all the computations are done using floats anyway
3) convert the slow for(i <- cosVec.indices), which combines a Scala closure with a Range, to an ugly but faster while loop

SparkQA · 2017-03-12T02:59:33Z

Test build #74393 has finished for PR 17263 at commit f22f47f.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-12T04:49:54Z

Test build #74394 has finished for PR 17263 at commit 63b7ae8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

I think that's right, given that only BLAS.s* routines are used (= single-precision = float). A few minor suggestions here.
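
To illustrate why staying in single precision is enough, here is a minimal sketch of a top-k selection that never leaves Float. It assumes a standard-library `mutable.PriorityQueue` used as a min-heap; `FloatTopK` and `topK` are hypothetical names, and the queue type used inside Spark may differ.

```scala
import scala.collection.mutable

object FloatTopK {
  // Hypothetical helper: return the k (word, similarity) pairs with the
  // highest similarity, keeping every value as a Float.
  def topK(words: Array[String], sims: Array[Float], k: Int): Array[(String, Float)] = {
    require(k > 0, "k must be positive")
    // PriorityQueue exposes its greatest element as head. With this ordering
    // the "greatest" element is the one with the LOWEST similarity, so head
    // is always the weakest of the candidates kept so far.
    val heap = mutable.PriorityQueue.empty[(String, Float)](
      Ordering.fromLessThan[(String, Float)]((a, b) => a._2 > b._2))
    var i = 0
    while (i < sims.length) {
      if (heap.size < k) {
        heap.enqueue((words(i), sims(i)))
      } else if (sims(i) > heap.head._2) {
        heap.dequeue() // drop the current weakest candidate
        heap.enqueue((words(i), sims(i)))
      }
      i += 1
    }
    // Highest similarity first.
    heap.toArray.sortWith(_._2 > _._2)
  }

  def main(args: Array[String]): Unit = {
    val words = Array("a", "b", "c", "d")
    val sims = Array(0.1f, 0.9f, 0.5f, 0.7f)
    topK(words, sims, 2).foreach(println)
  }
}
```

Because the similarities are never widened, no second array of Doubles has to be allocated alongside the Float one.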

srowen · 2017-03-12T08:51:36Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -570,7 +570,7 @@ class Word2VecModel private[spark] (
    require(num > 0, "Number of similar words should > 0")

    val fVector = vector.toArray.map(_.toFloat)
-    val cosineVec = Array.fill[Float](numWords)(0)
+    val cosineVec = new Array[Float](numWords) // default value is 0.0f


It's fine, it doesn't need a comment

srowen · 2017-03-12T08:51:58Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

      if (norm == 0.0) {
-        cosVec(ind) = 0.0
+        cosineVec(i) = 0


I'd still write 0.0f for clarity but it's no big deal. I guess we can write norm == 0.0f too
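
A small sketch of what the literal's type changes, using hypothetical names (`LiteralSketch`, `isZero`). For a zero check the result is the same either way, since zero is exactly representable; the `0.0f` form just keeps the comparison in Float and states the intent.

```scala
object LiteralSketch {
  // Hypothetical helper: zero check written with a Float literal.
  def isZero(norm: Float): Boolean = norm == 0.0f

  def main(args: Array[String]): Unit = {
    val norm: Float = 0.0f
    // Comparing against a Double literal widens the Float first.
    // Zero survives the widening, so both forms agree.
    assert(norm == 0.0)
    assert(isZero(norm))

    // For values that are not exactly representable, the literal type matters:
    // 0.1f widened to Double is not the same number as the Double literal 0.1.
    val x: Float = 0.1f
    assert(x == 0.1f)
    assert(x != 0.1)
  }
}
```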

also, remove comment about default value, it's not needed

SparkQA · 2017-03-12T17:34:26Z

Test build #74407 has finished for PR 17263 at commit 0fb6d23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2017-03-14T13:08:31Z

Merged to master

silly mistake

63b7ae8

srowen reviewed Mar 12, 2017

0.0f for extra clarity

0fb6d23

also, remove comment about default value, it's not needed

srowen approved these changes Mar 13, 2017

asfgit closed this in 5e96a57 Mar 14, 2017
