[SPARK-3097][MLlib] Word2Vec performance improvement #1932

Ishiihara · 2014-08-13T23:25:39Z

@mengxr Please review the code. Adding weights in reduceByKey soon.

Each partition now outputs model entries only for the words that appeared in that partition before merging, and reduceByKey is used to combine the partial models. In general, this implementation is about 30 seconds faster than the implementation that uses one big array.
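The merge strategy described above can be illustrated with a minimal sketch. It is not the code in this patch: plain Scala collections stand in for an RDD, `reduceByKeyLocal` is a local stand-in for Spark's `reduceByKey`, and every name here (`sparseEntries`, `addVectors`, `mergeInto`, the sample arrays) is hypothetical. Partial vectors are simply summed; the weighting mentioned above is not modelled.

```scala
// Sketch only: plain collections stand in for an RDD; all names are hypothetical.
type Entry = (Int, Array[Float]) // (word index, that word's vector)

// Emit entries only for the word indices a partition actually touched,
// instead of shipping the whole flat model array.
def sparseEntries(local: Array[Float], touched: Set[Int], vectorSize: Int): Seq[Entry] =
  touched.toSeq.sorted.map { i =>
    (i, local.slice(i * vectorSize, (i + 1) * vectorSize))
  }

// Element-wise sum of two vectors of equal length.
def addVectors(a: Array[Float], b: Array[Float]): Array[Float] =
  a.zip(b).map { case (x, y) => x + y }

// Local stand-in for reduceByKey: group by word index, then sum the vectors.
def reduceByKeyLocal(entries: Seq[Entry]): Map[Int, Array[Float]] =
  entries.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(addVectors) }

// Copy the merged entries into the flat global array in place and return it.
def mergeInto(global: Array[Float], entries: Seq[Entry], vectorSize: Int): Array[Float] = {
  reduceByKeyLocal(entries).foreach { case (i, v) =>
    Array.copy(v, 0, global, i * vectorSize, vectorSize)
  }
  global
}

val vectorSize = 2
val vocabSize = 4

// Two hypothetical partitions over a 4-word vocabulary.
val part1 = Array(1f, 1f, 0f, 0f, 2f, 2f, 0f, 0f) // touched words 0 and 2
val part2 = Array(0f, 0f, 0f, 0f, 3f, 3f, 4f, 4f) // touched words 2 and 3

val entries: Seq[Entry] =
  sparseEntries(part1, Set(0, 2), vectorSize) ++ sparseEntries(part2, Set(2, 3), vectorSize)

// Word 1 was never touched, so its slot in the global array is left as it was.
val result: Array[Float] = mergeInto(new Array[Float](vocabSize * vectorSize), entries, vectorSize)
```

In this sketch only four entries are shuffled rather than two full arrays, which is the kind of saving the description above points to when most words do not occur in every partition.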

mengxr · 2014-08-14T05:22:35Z

Jenkins, test this please.

mengxr · 2014-08-14T05:31:50Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

@@ -34,7 +34,7 @@ import org.apache.spark.mllib.rdd.RDDFunctions._
 import org.apache.spark.rdd._
 import org.apache.spark.util.Utils
 import org.apache.spark.util.random.XORShiftRandom
-
+import org.apache.spark.util.collection.PrimitiveKeyOpenHashMap
 /**


add an empty line after imports

SparkQA · 2014-08-14T17:29:56Z

QA tests have started for PR 1932. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18547/consoleFull

SparkQA · 2014-08-14T18:13:03Z

QA results for PR 1932:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18547/consoleFull

mengxr · 2014-08-16T23:11:56Z

Jenkins, test this please.

SparkQA · 2014-08-16T23:15:09Z

QA tests have started for PR 1932 at commit cad2011.

This patch merges cleanly.

SparkQA · 2014-08-17T00:07:44Z

QA tests have finished for PR 1932 at commit cad2011.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-18T04:01:08Z

Jenkins, test this please.

SparkQA · 2014-08-18T04:05:11Z

QA tests have started for PR 1932 at commit d5377a9.

This patch merges cleanly.

SparkQA · 2014-08-18T04:57:55Z

QA tests have finished for PR 1932 at commit d5377a9.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-18T06:30:05Z

LGTM. Merged into master and branch-1.1. Thanks!

mengxr Please review the code. Adding weights in reduceByKey soon.

Only output model entry for words appeared in the partition before merging and use reduceByKey to combine model. In general, this implementation is 30s or so faster than implementation using big array.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following commits:

d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
cad2011 [Liquan Pei] bug fix for synModify array out of bound
083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
aa2ab36 [Liquan Pei] use reduceByKey to combine models

(cherry picked from commit 3c8fa50)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

loveconan1988 · 2014-08-18T06:32:19Z

------------------ Original Message ------------------
From: "asfgit";notifications@github.com;
Sent: Monday, August 18, 2014, 2:31 PM
To: "apache/spark"spark@noreply.github.com;

Subject: Re: [spark] [SPARK-3097][MLlib] Word2Vec performance improvement(#1932)

Closed #1932 via 3c8fa50.

Ishiihara added 2 commits August 13, 2014 05:45

use reduceByKey to combine models

aa2ab36

combine syn0Global and syn1Global to synGlobal

9075e1c

Ishiihara changed the title ~~[SPARK-2907][MLlib] Word2Vec performance improve~~ [SPARK-2907][MLlib] Word2Vec performance improvement Aug 13, 2014

Ishiihara changed the title ~~[SPARK-2907][MLlib] Word2Vec performance improvement~~ [MLlib] Word2Vec performance improvement Aug 14, 2014

mengxr reviewed Aug 14, 2014

update synGlobal in place and reduce synOut size

083aa66

bug fix for synModify array out of bound

cad2011

use syn0Global and syn1Global to represent model

d5377a9

Ishiihara changed the title ~~[MLlib] Word2Vec performance improvement~~ [SPARK-3097][MLlib] Word2Vec performance improvement Aug 18, 2014

asfgit closed this in 3c8fa50 Aug 18, 2014

mengxr mentioned this pull request Aug 19, 2014

[SPARK-2907] [MLlib] Use mutable.HashMap to represent model in Word2Vec #1871

Closed
