[SPARK-34189][ML] w2v findSynonyms optimization #31276

zhengruifeng · 2021-01-21T08:24:09Z

What changes were proposed in this pull request?

1, use Guava Ordering instead of BoundedPriorityQueue;
2, use local variables;
3, avoid the conversion ml.vector -> mllib.vector
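
To illustrate the first point, here is a minimal sketch of top-k selection with Guava's `Ordering.greatestOf`. It is not the code from this patch: `TopKSketch`, `cosine`, and `topK` are hypothetical names, the vocabulary is a toy in-memory `Seq`, and the sketch assumes Guava is on the classpath (as it is in a Spark shell).

```scala
import com.google.common.collect.{Ordering => GuavaOrdering}
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark.
object TopKSketch {
  // Cosine similarity of two dense vectors held as plain arrays,
  // computed in a single pass with local accumulators.
  def cosine(a: Array[Float], b: Array[Float]): Double = {
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
      i += 1
    }
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / math.sqrt(normA * normB)
  }

  // Returns the k entries most similar to `query`, best first.
  def topK(
      query: Array[Float],
      vocab: Seq[(String, Array[Float])],
      k: Int): Seq[(String, Double)] = {
    // Order (word, score) pairs by score only.
    val byScore = new GuavaOrdering[(String, Double)] {
      override def compare(l: (String, Double), r: (String, Double)): Int =
        java.lang.Double.compare(l._2, r._2)
    }
    // Scores are produced lazily; greatestOf consumes the iterator once.
    val scored = vocab.iterator.map { case (w, v) => (w, cosine(query, v)) }
    byScore.greatestOf(scored.asJava, k).asScala.toSeq
  }
}
```

For example, `TopKSketch.topK(Array(1f, 0f), vocab, 2)` returns the two best-scoring words in descending order of similarity, without sorting the whole vocabulary.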

Why are the changes needed?

This PR is about 30% faster than the existing implementation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing test suites.

use local var vec -> array

zhengruifeng · 2021-01-21T08:26:38Z

Train a model using the text of https://en.wikipedia.org/wiki/Word2vec as the training data:

import org.apache.spark.ml.feature._

val df = spark.read.text("/d0/Dev/PRs/Word2vec")
val df2 = df.as[String].map(_.split(" ")).toDF("words")

val w2v = new Word2Vec().setInputCol("words").setMaxIter(1)
val w2vm = w2v.fit(df2)
w2vm.save("/tmp/w2vm")

Performance test:

import org.apache.spark.ml.feature._

val w2vm = Word2VecModel.load("/tmp/w2vm")
val words = w2vm.getVectors.select("word").as[String].collect

val start = System.currentTimeMillis
Seq.range(0, 10000).foreach { i =>
  words.foreach(word => w2vm.findSynonymsArray(word, 10))
}
val end = System.currentTimeMillis
val duration = end - start

Results (duration in milliseconds):
master: 8978
this PR: 6419, about 30% faster than the existing implementation
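
The "about 30%" figure can be checked from the two durations above. The sketch below only does the arithmetic. The object and value names are hypothetical, and it assumes both numbers are wall-clock milliseconds from a single run each.

```scala
// Hypothetical helper for reading the two measurements reported above.
object SpeedupSketch {
  val masterMs = 8978.0 // duration on master
  val prMs = 6419.0     // duration with this PR

  // Fraction of wall-clock time saved relative to master.
  val timeReduction: Double = 1.0 - prMs / masterMs

  // How many times more work per unit time the PR does than master.
  val throughputRatio: Double = masterMs / prMs
}
```

Read as time saved, the improvement is a little under 30%. Read as throughput, it is closer to 1.4x. Both readings come from the same two numbers.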

SparkQA · 2021-01-21T09:58:10Z

Test build #134320 has finished for PR 31276 at commit 298c6fa.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-21T10:02:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38907/

SparkQA · 2021-01-21T10:31:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38907/

SparkQA · 2021-01-22T02:48:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38934/

SparkQA · 2021-01-22T03:15:30Z

Test build #134347 has finished for PR 31276 at commit e3ce5de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-01-22T03:17:30Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38934/

zhengruifeng · 2021-01-26T08:25:18Z

friendly ping @srowen

srowen

Looks OK, merge when ready.

zhengruifeng · 2021-01-27T02:10:05Z

Merged to master, thanks @srowen !

### What changes were proposed in this pull request?
1, use Guava Ordering instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector

### Why are the changes needed?
this pr is about 30% faster than existing impl

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
existing testsuites

Closes apache#31276 from zhengruifeng/w2v_findSynonyms_opt.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

init

8ab5c5c

use local var vec -> array

github-actions bot added ML MLLIB labels Jan 21, 2021

nit

298c6fa

update pytest

e3ce5de

github-actions bot added CORE PYTHON labels Jan 22, 2021

srowen approved these changes Jan 26, 2021

zhengruifeng closed this in 2c4e4f8 Jan 27, 2021

zhengruifeng deleted the w2v_findSynonyms_opt branch January 27, 2021 02:10
