@wangyum
Member

@wangyum wangyum commented Jun 17, 2016

What changes were proposed in this pull request?

Add a loadGoogleModel() function to import models stored in the original word2vec binary format.

How was this patch tested?

mllib.feature.Word2VecSuite and ml.feature.Word2VecSuite

I also tested with a real model:
[Image: spark_load_google_word2vec]

@AmplabJenkins

Can one of the admins verify this patch?

@insidedctm
Contributor

This seems to work fine with a small model, such as the one produced by demo_word.sh in the word2vec code repository. However, I run into problems with a large model such as GoogleNews-vectors-negative300.bin.

I can successfully load the model using this code (although I needed to give the driver 12 GB of memory):
import org.apache.spark.ml.feature.Word2VecModel
val path = "file:///Downloads/GoogleNews-vectors-negative300.bin"
val model = Word2VecModel.loadGoogleModel(path)

However, synonyms are not found for a typical lookup, e.g.
model.findSynonyms("spark",20).show
responds with
java.lang.IllegalStateException: spark not in vocabulary

However, the distance tool from the word2vec toolkit, loading the same model, gives:

[Screenshot: screen shot 2016-09-15 at 12 57 03]
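
One way to cross-check the vocabulary independently of Spark would be to read the binary file directly and look the word up. The following is only a minimal sketch, not the code in this PR. It assumes the layout written by the original word2vec C tool: an ASCII header "vocabSize dim" ending in a newline, then for each entry the word bytes terminated by a space, dim 4-byte floats (little-endian is assumed here), and a trailing newline. The names Word2VecBinarySketch, read and encode are hypothetical; encode exists only to build a tiny test file.

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{DataInputStream, EOFException, InputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark or of this PR.
object Word2VecBinarySketch {

  // Read bytes up to (not including) the delimiter, skipping any leading
  // newlines left over after the previous vector.
  private def readToken(in: DataInputStream, delimiter: Int): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.read()
    while (b == '\n'.toInt) b = in.read()
    while (b != -1 && b != delimiter) {
      buf.write(b)
      b = in.read()
    }
    if (b == -1 && buf.size() == 0) throw new EOFException("unexpected end of stream")
    new String(buf.toByteArray, StandardCharsets.UTF_8)
  }

  // Parse the whole stream into word -> vector. Assumes little-endian floats.
  def read(in: InputStream): Map[String, Array[Float]] = {
    val data = new DataInputStream(new BufferedInputStream(in))
    val vocabSize = readToken(data, ' '.toInt).trim.toInt
    val dim = readToken(data, '\n'.toInt).trim.toInt
    val entries = (0 until vocabSize).map { _ =>
      val word = readToken(data, ' '.toInt)
      val bytes = new Array[Byte](dim * 4)
      data.readFully(bytes)
      val vec = new Array[Float](dim)
      ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vec)
      word -> vec
    }
    entries.toMap
  }

  // Write a tiny model in the same assumed layout (for demonstration only).
  def encode(vectors: Seq[(String, Array[Float])]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dim = vectors.head._2.length
    out.write(s"${vectors.size} $dim\n".getBytes(StandardCharsets.UTF_8))
    vectors.foreach { case (word, vec) =>
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      val bb = ByteBuffer.allocate(dim * 4).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(f => bb.putFloat(f))
      out.write(bb.array())
      out.write('\n'.toInt)
    }
    out.toByteArray
  }
}
```

Calling Word2VecBinarySketch.read on a stream over the model file and then contains("spark") on the result would show whether the word is present in the file under exactly that spelling. For a model the size of GoogleNews-vectors-negative300.bin, holding every vector in a Map is likely to be memory-hungry, so a variant that keeps only the words may be preferable.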

@wangyum wangyum closed this Mar 8, 2017