[MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words #1719

Ishiihara · 2014-08-01T15:50:56Z

This pull request addresses SPARK-2510 (https://issues.apache.org/jira/browse/SPARK-2510). Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vector representations can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the per-partition models after each iteration. Multiple iterations may be needed to make the model more accurate.
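
The train-then-merge loop described above can be sketched in plain Scala. This is only an illustration under stated assumptions, not the code in this patch: ordinary collections stand in for RDD partitions, `trainPartition` is a hypothetical toy update, and the merge is a plain element-wise average (the patch itself later moves to a weighted sum in combOp).

```scala
// Illustrative sketch only: plain Scala collections stand in for RDD partitions.
object PartitionMergeSketch {
  type Model = Array[Float]

  // Hypothetical local update: moves every weight toward the partition mean.
  // A real Word2Vec update would run skip-gram SGD over the sentences instead.
  def trainPartition(start: Model, partition: Seq[Float], lr: Float): Model = {
    require(partition.nonEmpty, "partition must not be empty")
    val target = partition.sum / partition.size
    start.map(w => w + lr * (target - w))
  }

  // Merge the per-partition models by element-wise averaging (an assumption).
  def merge(models: Seq[Model]): Model = {
    require(models.nonEmpty, "need at least one model")
    val dim = models.head.length
    Array.tabulate(dim)(i => models.map(_(i)).sum / models.size)
  }

  // Each iteration: every partition trains from a copy of the global model,
  // then the local models are merged back into a new global model.
  def train(partitions: Seq[Seq[Float]], dim: Int, iterations: Int, lr: Float): Model = {
    var global: Model = Array.fill(dim)(0.0f)
    for (_ <- 1 to iterations) {
      val locals = partitions.map(p => trainPartition(global.clone(), p, lr))
      global = merge(locals)
    }
    global
  }
}
```

In this toy setup a single iteration leaves the merged model far from its fixed point, and repeated iterations move it closer, which mirrors why more iterations may be needed when training is split across partitions.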

One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:

taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876

The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932
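
A closest-word lookup like the one that produced these lists can be sketched as follows. This assumes the score is cosine similarity, which the description does not state, and it uses an in-memory `Map` rather than the RDD-backed model; the object and method names are hypothetical.

```scala
// Illustrative sketch only: nearest words by cosine similarity over a local map.
object ClosestWordsSketch {
  // Cosine similarity between two vectors of equal length.
  def cosine(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.indices.map(i => a(i).toDouble * b(i)).sum
    val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
    val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Returns the `num` words most similar to `query`, excluding the query itself.
  def closest(
      model: Map[String, Array[Float]],
      query: String,
      num: Int): Seq[(String, Double)] = {
    val q = model(query)
    model.toSeq
      .filter { case (word, _) => word != query }
      .map { case (word, vec) => (word, cosine(q, vec)) }
      .sortBy { case (_, sim) => -sim }
      .take(num)
  }
}
```

Calling `ClosestWordsSketch.closest(model, "china", 20)` on a trained word-to-vector map would return word and score pairs in the same shape as the lists above.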

mengxr · 2014-08-01T15:55:16Z

Jenkins, add to whitelist.

mengxr · 2014-08-01T15:55:23Z

Jenkins, test this please.

mengxr · 2014-08-01T15:57:40Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

+
+package org.apache.spark.mllib.feature
+
+import scala.util._


How many members of scala.util._ are used? If fewer than 4, please import them explicitly.
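
For illustration, an explicit import could look like the sketch below. Which members Word2Vec.scala actually uses is not shown in this thread, so `Random` and `Try` are assumptions chosen only to demonstrate the style.

```scala
// Explicit imports instead of the wildcard `import scala.util._`.
// Random and Try are assumed here purely for illustration.
import scala.util.{Random, Try}

object ExplicitImportSketch {
  // Draws `n` pseudo-random integers in [0, 100) from a seeded generator.
  def sample(seed: Long, n: Int): Seq[Int] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextInt(100))
  }

  // Parses an integer, returning None on malformed input.
  def parse(s: String): Option[Int] = Try(s.toInt).toOption
}
```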

SparkQA · 2014-08-01T15:59:04Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17678/consoleFull

SparkQA · 2014-08-01T15:59:46Z

QA results for PR 1719:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(
class Word2VecModel (val _model:RDD[(String, Array[Double])]) extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17678/consoleFull

mengxr · 2014-08-01T15:59:48Z

mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala

+import org.apache.spark.mllib.util.LocalSparkContext
+
+class Word2VecSuite extends FunSuite with LocalSparkContext {
+  test("word2vec") {


This only tests the model. Could you add a test for the algorithm?

mengxr · 2014-08-01T16:01:45Z

@Ishiihara This is great! Could you add the JIRA number to the title [SPARK-####]? I will ping you after I finish the first pass.

Ishiihara · 2014-08-01T16:21:24Z

@mengxr Code formatting is done. Working on a test case for the algorithm.

SparkQA · 2014-08-01T16:24:04Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17680/consoleFull

SparkQA · 2014-08-01T16:24:49Z

QA results for PR 1719:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(
class Word2VecModel (val _model:RDD[(String, Array[Double])]) extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17680/consoleFull

mengxr · 2014-08-01T21:45:35Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

+/**
+ *  Vector representation of word
+ */
+class Word2Vec(


We need more docs here: for example, links to the C implementation and the original word2vec papers, plus a brief explanation of what it does.

Btw, this is definitely an experimental feature. Please add the @Experimental tag. Example:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L45

SparkQA · 2014-08-03T23:29:18Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17830/consoleFull

SparkQA · 2014-08-04T00:29:10Z

QA results for PR 1719:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17830/consoleFull

mengxr · 2014-08-04T03:42:38Z

mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala

+      val (aggSyn0, aggSyn1, _, _) =
+        // TODO: broadcast temp instead of serializing it directly 
+        // or initialize the model in each executor
+        newSentences.aggregate((syn0Global.clone(), syn1Global.clone(), 0, 0))(


Do you mind changing it to treeAggregate? (import org.apache.spark.mllib.rdd.RDDFunctions._) I tried 64 partitions, but it is slower than 8 partitions because the aggregation cost is linear in the number of partitions.
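
The difference between a linear and a tree-shaped combine can be sketched in plain Scala. This is only an illustration of the idea under stated assumptions, not Spark's treeAggregate implementation: `linearCombine` and `treeCombine` are hypothetical helpers, and the sketch only counts merge steps and rounds rather than measuring time.

```scala
// Illustrative sketch only: counts merge work for linear vs. tree-shaped combining.
object TreeAggregateSketch {
  // Linear combine: one sequential merge per additional partition result.
  def linearCombine[A](parts: Seq[A])(comb: (A, A) => A): (A, Int) = {
    require(parts.nonEmpty, "need at least one partition result")
    var merges = 0
    val result = parts.reduceLeft { (x, y) =>
      merges += 1
      comb(x, y)
    }
    (result, merges)
  }

  // Tree combine: merge groups of `fanIn` results per round until one remains.
  // Groups within a round are independent, so they could be merged in parallel.
  def treeCombine[A](parts: Seq[A], fanIn: Int)(comb: (A, A) => A): (A, Int) = {
    require(parts.nonEmpty && fanIn >= 2, "need data and a fan-in of at least 2")
    var level: Seq[A] = parts
    var rounds = 0
    while (level.size > 1) {
      level = level.grouped(fanIn).map(_.reduce(comb)).toVector
      rounds += 1
    }
    (level.head, rounds)
  }
}
```

With 64 partition results and a fan-in of 2, the linear version performs 63 sequential merges, while the tree version needs 6 rounds; both produce the same combined value as long as `comb` is associative.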

SparkQA · 2014-08-04T03:59:21Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17840/consoleFull

change model to use float instead of double

SparkQA · 2014-08-04T04:56:49Z

QA results for PR 1719:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17840/consoleFull

some updates

SparkQA · 2014-08-04T05:19:11Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17842/consoleFull

mengxr · 2014-08-04T05:20:04Z

mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala

+  test("Word2VecModel") {
+    val num = 2
+    val localModel = Seq(
+      ("china" ,  Array(0.50f, 0.50f, 0.50f, 0.50f)),


("china" ,  Array( -> ("china", Array( (remove extra spaces)

SparkQA · 2014-08-04T05:54:14Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17843/consoleFull

SparkQA · 2014-08-04T06:03:44Z

QA results for PR 1719:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17842/consoleFull

mengxr · 2014-08-04T06:19:14Z

Jenkins, test this please.

SparkQA · 2014-08-04T06:24:16Z

QA tests have started for PR 1719. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17846/consoleFull

SparkQA · 2014-08-04T06:50:10Z

QA results for PR 1719:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17843/consoleFull

mengxr · 2014-08-04T06:58:00Z

LGTM. Merged into both master and branch-1.1. @Ishiihara Thanks a lot for implementing word2vec! Please help improve its performance during the QA period. One task left is Java support. If you want to spend some time on it, there are some examples in HashingTF.scala and JavaTfIdfSuite.java.

Author: Liquan Pei <lpei@gopivotal.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Liquan Pei <liquanpei@gmail.com>

Closes #1719 from Ishiihara/master and squashes the following commits:

2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit

(cherry picked from commit e053c55)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

SparkQA · 2014-08-04T07:19:03Z

QA results for PR 1719:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Word2Vec(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17846/consoleFull


vivounicorn · 2014-09-12T02:35:59Z

We have an initial implementation using Downpour SGD. Our goal is to support ten million words/users, which is really hard.

mrchor · 2015-07-26T10:29:37Z

Is the RDD storage in the algorithm persisted in memory by default? @mengxr


Liquan Pei added 3 commits August 1, 2014 00:45

initial commit

8d6befe

Add comments, minor fixes

0aafb1b

minor fix

e4a04d3

mengxr reviewed Aug 1, 2014

Ishiihara changed the title ~~[MLlib] word2vec: Distributed Representation of Words~~ [MLlib] [SPARK-2510]word2vec: Distributed Representation of Words Aug 1, 2014

code formatting

57dc50d

mengxr reviewed Aug 1, 2014

use weighted sum in combOp

1a8fb41

mengxr reviewed Aug 4, 2014

use treeAggregate instead of aggregate

e93e726

remove minCount and window from constructor

384c771

change model to use float instead of double

mengxr and others added 2 commits August 3, 2014 22:09

fix styles

c14da41

Merge pull request #1 from mengxr/Ishiihara-master

26a948d

some updates

mengxr reviewed Aug 4, 2014

minor style change

e248441

minor fix for Word2Vec test

2ba9483

asfgit closed this in e053c55 Aug 4, 2014

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023

rdar://107223070 : Upgrades ADT to 1.1.18 (apache#1719)

6b5ad45

IPR/apache-incubator-iceberg@IPR:fa22e45...IPR:17032c4 Upgrades ADT to 1.1.18 and Iceberg to 0.13.0.14
