[SPARK-34189][ML] w2v findSynonyms optimization #31276

Closed
wants to merge 3 commits into from


zhengruifeng
Contributor

What changes were proposed in this pull request?

1, use Guava Ordering instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector
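
To make items 2 and 3 concrete, here is a minimal sketch of a top-k cosine-similarity lookup that works on raw arrays and keeps loop state in local variables. It is not the code from this PR: `TopKSketch`, `topK`, and the flat row-major `flatVectors` layout are hypothetical names and assumptions chosen for illustration, and the final selection step uses a plain standard-library sort as a stand-in for the Guava Ordering mentioned in item 1.

```scala
// Hypothetical helper, not Spark's implementation: top-k cosine similarity
// over a flat Float array of word vectors, using only the standard library.
object TopKSketch {
  def topK(
      query: Array[Float],
      flatVectors: Array[Float], // assumed layout: numWords * dim values, row-major
      dim: Int,
      k: Int): Array[(Int, Double)] = {
    // Compute sizes and the query norm once, in local variables,
    // instead of re-reading fields inside the hot loop.
    val numWords = flatVectors.length / dim
    var qNorm = 0.0
    var j = 0
    while (j < dim) { qNorm += query(j) * query(j); j += 1 }
    qNorm = math.sqrt(qNorm)

    // Work directly on the arrays; no vector-type conversion is involved.
    val sims = new Array[Double](numWords)
    var i = 0
    while (i < numWords) {
      val offset = i * dim
      var dot = 0.0
      var norm = 0.0
      j = 0
      while (j < dim) {
        val v = flatVectors(offset + j)
        dot += v * query(j)
        norm += v * v
        j += 1
      }
      val denom = math.sqrt(norm) * qNorm
      sims(i) = if (denom == 0.0) 0.0 else dot / denom
      i += 1
    }

    // Stand-in for the top-k step: a full sort by descending similarity.
    // The PR description says it uses a Guava Ordering here instead.
    sims.zipWithIndex
      .sortWith(_._1 > _._1)
      .take(k)
      .map { case (sim, idx) => (idx, sim) }
  }
}
```

A full sort is O(n log n) and is only used here to keep the sketch dependency-free; a bounded selection avoids sorting the whole vocabulary.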

Why are the changes needed?

This PR is about 30% faster than the existing implementation.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

existing test suites

use local var

vec -> array
@zhengruifeng
Contributor Author

Train a model with the text of https://en.wikipedia.org/wiki/Word2vec as the training data:

import org.apache.spark.ml.feature._

val df = spark.read.text("/d0/Dev/PRs/Word2vec")
val df2 = df.as[String].map(_.split(" ")).toDF("words")

val w2v = new Word2Vec().setInputCol("words").setMaxIter(1)
val w2vm = w2v.fit(df2)
w2vm.save("/tmp/w2vm")

performance test

import org.apache.spark.ml.feature._

val w2vm = Word2VecModel.load("/tmp/w2vm")
val words = w2vm.getVectors.select("word").as[String].collect

val start = System.currentTimeMillis; Seq.range(0, 10000).foreach { i => words.foreach(word => w2vm.findSynonymsArray(word, 10)) }; val end = System.currentTimeMillis; val duration = end - start; 

results (duration in ms):
master: 8978
this PR: 6419, about 30% faster than the existing implementation

@SparkQA

SparkQA commented Jan 21, 2021

Test build #134320 has finished for PR 31276 at commit 298c6fa.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38907/

@SparkQA

SparkQA commented Jan 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38907/

@SparkQA

SparkQA commented Jan 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38934/

@SparkQA

SparkQA commented Jan 22, 2021

Test build #134347 has finished for PR 31276 at commit e3ce5de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38934/

@zhengruifeng
Contributor Author

friendly ping @srowen

Member

@srowen srowen left a comment


Looks OK, merge when ready.

@zhengruifeng
Contributor Author

Merged to master, thanks @srowen !

@zhengruifeng zhengruifeng deleted the w2v_findSynonyms_opt branch January 27, 2021 02:10
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
### What changes were proposed in this pull request?
1, use Guava Ordering instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector

### Why are the changes needed?
This PR is about 30% faster than the existing implementation.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
existing test suites

Closes apache#31276 from zhengruifeng/w2v_findSynonyms_opt.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>