[SPARK-15328][MLLIB][ML] Word2Vec import for original binary format #13735

wangyum · 2016-06-17T11:57:06Z

What changes were proposed in this pull request?

Add a loadGoogleModel() function to import the original word2vec binary format.
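For readers unfamiliar with the layout written by the original word2vec tool in binary mode, here is a minimal sketch of a reader. It is not the code in this patch. The object name `Word2VecBinarySketch` and its `read` and `encode` helpers are hypothetical, and it assumes the file was written on a little-endian machine (the C tool writes floats in native byte order). The layout is a text header `vocabSize vectorSize\n`, then for each word the word bytes, a space, `vectorSize` 32-bit floats, and a newline.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only; not part of Spark or of this patch.
object Word2VecBinarySketch {
  private val Space = 0x20
  private val Newline = 0x0A

  // Reads bytes up to the delimiter (which is consumed), dropping any newline
  // bytes on the way. This discards the '\n' that follows each vector, so it
  // does not end up glued to the front of the next word.
  private def readToken(in: InputStream, delimiter: Int): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.read()
    while (b != -1 && b != delimiter) {
      if (b != Newline) buf.write(b)
      b = in.read()
    }
    new String(buf.toByteArray, StandardCharsets.UTF_8)
  }

  // Parses the whole stream into memory. For a real file, wrap the stream in a
  // BufferedInputStream first, because this reads one byte at a time.
  def read(in: InputStream): Map[String, Array[Float]] = {
    val data = new DataInputStream(in)
    val vocabSize = readToken(data, Space).trim.toInt
    val vectorSize = readToken(data, Newline).trim.toInt
    val raw = new Array[Byte](4 * vectorSize)
    (0 until vocabSize).map { _ =>
      val word = readToken(data, Space)
      data.readFully(raw)
      val vec = new Array[Float](vectorSize)
      // Assumption: floats are stored little-endian.
      ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vec)
      word -> vec
    }.toMap
  }

  // Writes the same layout to a byte array, so the reader can be tried
  // without downloading a real model.
  def encode(entries: Seq[(String, Array[Float])]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dim = entries.head._2.length
    out.write(s"${entries.length} $dim\n".getBytes(StandardCharsets.UTF_8))
    entries.foreach { case (word, vec) =>
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      val bb = ByteBuffer.allocate(4 * dim).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(f => bb.putFloat(f))
      out.write(bb.array())
      out.write(Newline)
    }
    out.toByteArray
  }
}
```

A round trip through `encode` and `read` on two or three made-up words is a quick way to check the parsing logic before pointing it at a large file.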

How was this patch tested?

mllib.feature.Word2VecSuite and ml.feature.Word2VecSuite

I also tested with a real model:

AmplabJenkins · 2016-06-17T12:02:14Z

Can one of the admins verify this patch?

insidedctm · 2016-09-15T11:58:29Z

This seems to work fine with a small model, such as the one produced by demo_word.sh in the word2vec code repository. However, I get problems when trying a large model such as GoogleNews-vectors-negative300.bin.

I can successfully load the model using this code (albeit I needed to give the driver 12GB of memory):
import org.apache.spark.ml.feature.Word2VecModel
val path = "file:///Downloads/GoogleNews-vectors-negative300.bin"
val model = Word2VecModel.loadGoogleModel(path)

However, synonyms are not found for a typical lookup, e.g.
model.findSynonyms("spark",20).show
responds with
java.lang.IllegalStateException: spark not in vocabulary

However, the distance tool from the word2vec toolkit, loading the same model, gives:

Load Google word2vec model

d4c7725

wangyum closed this Mar 8, 2017
