
[SPARK-19922][ML] small speedups to findSynonyms #17263

Closed
wants to merge 3 commits

Conversation

Contributor

@Krimit Krimit commented Mar 12, 2017

Currently, generating synonyms using a large model (I've tested with 3M words) is very slow. These changes have sped things up for us by ~17%.

I wasn't sure whether such small changes were worthy of a JIRA, but the guidelines seemed to suggest that is the preferred approach.

What changes were proposed in this pull request?

Address a few small issues in the findSynonyms logic:

  1. Remove the usage of Array.fill to zero out the cosineVec array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed.
  2. Use Floats throughout. The conversion to Doubles before building the priority queue is superfluous, since all the similarity computations are done in Floats anyway; creating a second large array only puts extra strain on the GC.
  3. Convert the slow for (i <- cosVec.indices) loop to an ugly, but faster, while loop.

These efficiencies are really only apparent when working with a large model; a rough sketch of the three changes is shown below.
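
For illustration, here is a minimal Scala sketch of what the three changes amount to. The names (topSynonyms, wordVectors, wordVecNorms) and the flattened vector layout are assumptions made for the sketch, and a plain mutable.PriorityQueue stands in for Spark's internal bounded queue; the real Word2VecModel.findSynonyms differs in detail.

import scala.collection.mutable

def topSynonyms(
    query: Array[Float],         // query vector, already single precision
    wordVectors: Array[Float],   // assumed layout: numWords * vectorSize, one word per row
    wordVecNorms: Array[Float],  // precomputed per-word vector norms
    vectorSize: Int,
    num: Int): Array[(Int, Float)] = {
  val numWords = wordVecNorms.length

  // (1) new Array[Float] is already zero-filled; no Array.fill needed.
  val cosineVec = new Array[Float](numWords)

  // (3) plain while loops instead of for (i <- ...) over a Range.
  var w = 0
  while (w < numWords) {
    var dot = 0.0f
    var j = 0
    while (j < vectorSize) {
      dot += wordVectors(w * vectorSize + j) * query(j)
      j += 1
    }
    // (2) everything stays in Float; no intermediate Array[Double].
    // Dividing by the word norm is enough for ranking; the query norm is a constant factor.
    val norm = wordVecNorms(w)
    cosineVec(w) = if (norm == 0.0f) 0.0f else dot / norm
    w += 1
  }

  // Bounded top-num queue: the reversed ordering makes dequeue() evict the current worst.
  val pq = mutable.PriorityQueue.empty[(Int, Float)](
    Ordering.by[(Int, Float), Float](_._2).reverse)
  var i = 0
  while (i < numWords) {
    pq.enqueue((i, cosineVec(i)))
    if (pq.size > num) pq.dequeue()
    i += 1
  }
  pq.dequeueAll.reverse.toArray  // best matches first
}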

How was this patch tested?

Existing unit tests + some in-house tests to time the difference

cc @jkbradley @MLnick @srowen

Currently, generating synonyms using a model with 3M words is painfully slow. These efficiencies have sped things up by more than 17%.

Address a few issues in the findSynonyms logic:
1) No need to zero out the cosineVec array each time, since the default value for float arrays is 0.0f. This should offer some nice speedups.
2) Use floats throughout. The conversion to Doubles before building the priority queue is totally superfluous, since all the computations are done using floats anyway.
3) Convert the slow for (i <- cosVec.indices), which combines a Scala closure with a Range, to an ugly but faster while loop.

SparkQA commented Mar 12, 2017

Test build #74393 has finished for PR 17263 at commit f22f47f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 12, 2017

Test build #74394 has finished for PR 17263 at commit 63b7ae8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


I think that's right, given that only BLAS.s* routines are used (= single-precision = float). A few minor suggestions here.
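
(For context on the single-precision point: the similarity math stays in Float when it goes through the s-prefixed routines. A rough sketch with netlib-java, not the exact call used in Word2VecModel:)

import com.github.fommil.netlib.BLAS.{getInstance => blas}

val x = Array(1.0f, 2.0f, 3.0f)
val y = Array(0.5f, 0.25f, 0.125f)
// sdot is the single-precision dot product, so the result is a Float.
val dot: Float = blas.sdot(x.length, x, 1, y, 1)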

@@ -570,7 +570,7 @@ class Word2VecModel private[spark] (
     require(num > 0, "Number of similar words should > 0")

     val fVector = vector.toArray.map(_.toFloat)
-    val cosineVec = Array.fill[Float](numWords)(0)
+    val cosineVec = new Array[Float](numWords) // default value is 0.0f

It's fine, it doesn't need a comment
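
(For reference, the two forms produce identical zero-filled arrays, which is why the inline comment adds nothing; a quick REPL-style check, illustrative only:)

val a = Array.fill[Float](5)(0)   // allocates, then writes 0.0f into every slot
val b = new Array[Float](5)       // the JVM already zero-initializes the array
a.sameElements(b)                 // true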

       if (norm == 0.0) {
-        cosVec(ind) = 0.0
+        cosineVec(i) = 0

I'd still write 0.0f for clarity but it's no big deal. I guess we can write norm == 0.0f too

Also, remove the comment about the default value; it's not needed.
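
(i.e. the suggestion amounts to something like the following; illustrative only, not the final committed code:)

      if (norm == 0.0f) {
        cosineVec(i) = 0.0f
      }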

SparkQA commented Mar 12, 2017

Test build #74407 has finished for PR 17263 at commit 0fb6d23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Mar 14, 2017

Merged to master

@asfgit asfgit closed this in 5e96a57 Mar 14, 2017