WordVectorSerializer
cannot read gensim word2vec model
#4150
Comments
Can you please post a sample model file (as an archive)? I'll take a look at what's going on there.
… On 5 Oct 2017, at 18:14, Sheng Chen ***@***.***> wrote:
in gensim v2.3.0-py27
model = Word2Vec(sentences,
                 size=self._wv_config.vector_size,
                 window=self._wv_config.window_size,
                 min_count=self._wv_config.min_count,
                 workers=self._wv_config.workers,
                 sg=int(self._wv_config.use_skip_kgram),
                 iter=self._wv_config.num_epoch)
model.save(self._model_path)
in dl4j v0.9.1
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer
WordVectorSerializer.readWord2VecModel(new java.io.File("../data/wordvec.model"))
error
Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2480)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2266)
at LoadWordVecModel$.delayedEndpoint$LoadWordVecModel$1(RunTensorflowModel.scala:47)
at LoadWordVecModel$delayedInit$body.apply(RunTensorflowModel.scala:45)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at LoadWordVecModel$.main(RunTensorflowModel.scala:45)
at LoadWordVecModel.main(RunTensorflowModel.scala)
|
Corporate policy forbids me from posting attachments to external sites. Is there anything in the model you would look at first? |
I don't need YOUR model or your data.
Just create a random model from random text from the web (e.g. from this very issue thread) and give me the model file.
The only thing I need is a gensim model that fails to load. Its content is irrelevant to me.
|
No, this way it won't be loaded. |
OK, maybe the documentation should be more explicit about which of the word-vector formats produced by gensim dl4j can load. Anyway, I think a file in a format that dl4j can load should be produced like this: model.wv.save_word2vec_format(vector_file_path, vocab_file_path, binary=False) |
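For readers unsure what such an exported file looks like, here is a minimal standard-library sketch of the plain-text word2vec layout: a header line with the vocabulary size and vector dimension, then one line per word. The function names `write_word2vec_text` and `read_word2vec_text` are hypothetical helpers for illustration, not gensim or dl4j APIs, and whether dl4j 0.9.1 accepts this exact layout is what this thread suggests rather than something verified here.

```python
def write_word2vec_text(path, vectors):
    """Write a {word: [floats]} mapping in the plain-text word2vec layout."""
    dims = {len(vec) for vec in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must have the same, non-zero count")
    with open(path, "w", encoding="utf-8") as fh:
        # Header line: "<vocabulary size> <vector dimension>"
        fh.write("%d %d\n" % (len(vectors), dims.pop()))
        for word, vec in vectors.items():
            # One line per word: the token followed by its components
            fh.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")


def read_word2vec_text(path):
    """Read the layout written above back into a {word: [floats]} mapping."""
    with open(path, encoding="utf-8") as fh:
        count, dim = (int(tok) for tok in fh.readline().split())
        vectors = {}
        for line in fh:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count or any(len(v) != dim for v in vectors.values()):
        raise ValueError("header does not match file contents")
    return vectors
```

By contrast, a file written with `model.save(...)` is a serialized Python object rather than this text layout, which would explain why a loader expecting text or word2vec binary cannot guess its format.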
I ran into the same problem with a word2vec model downloaded from https://github.com/clips/dutchembeddings. It would help to know which word2vec formats dl4j can handle. |
That one looks to be CSV; it can't fail.
On Thu, 12 Oct 2017 at 22:31, Johan Vogelzang <notifications@github.com> wrote:
… Run into the same problem with a word2vec model downloaded from
https://github.com/clips/dutchembeddings
It would help if we know what kind of word2vec format dl4j can handle.
Clearly there are more types of word2vec formats?
|
You can easily reproduce this problem with the dl4j example CnnSentenceClassificationExample.
_Loading word vectors and creating DataSetIterators Process finished with exit code 1_ |
Thanks, I'll try that.
… On 13 Oct 2017, at 12:14, Johan Vogelzang ***@***.***> wrote:
You can easily reproduce this problem with the dl4j example CnnSentenceClassificationExample.
Just download the word embeddings from http://www.clips.uantwerpen.be/dutchembeddings/wikipedia-160.tar.gz
In CnnSentenceClassificationExample, change WORD_VECTORS_PATH to the download location.
And run the example.
Loading word vectors and creating DataSetIterators
o.d.m.e.l.WordVectorSerializer - Trying DL4j format...
o.d.m.e.l.WordVectorSerializer - Trying CSVReader...
o.d.m.e.l.WordVectorSerializer - Trying BinaryReader...
Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.loadStaticModel(WordVectorSerializer.java:2646)
at org.deeplearning4j.examples.convolution.sentenceclassification.CnnSentenceClassificationExample.main(CnnSentenceClassificationExample.java:124)
Process finished with exit code 1
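When a loader gives up with "Unable to guess input file format", looking at the file's leading bytes can show which container it actually is. The sketch below is a rough diagnostic using only the Python standard library; `sniff_container` is a hypothetical helper, the labels it returns are made up for illustration, and treating a leading 0x80 byte as a pickle (which is what a gensim `model.save` file is assumed to be here) is a heuristic, not a guarantee.

```python
import gzip


def sniff_container(path):
    """Return a rough label for the container format of the file at `path`."""
    with open(path, "rb") as fh:
        head = fh.read(512)
    if head[:2] == b"\x1f\x8b":
        # gzip on the outside; peek at the decompressed bytes to see what is inside
        with gzip.open(path, "rb") as gz:
            inner = gz.read(512)
        return "gzip+tar" if inner[257:262] == b"ustar" else "gzip"
    if head[:4] == b"PK\x03\x04":
        return "zip"
    if head[257:262] == b"ustar":
        # tar headers carry the "ustar" magic at byte offset 257
        return "tar"
    if head[:1] == b"\x80":
        # pickle protocol 2 and later starts with this opcode (heuristic)
        return "pickle"
    first_line = head.split(b"\n", 1)[0].split()
    if len(first_line) == 2 and all(tok.isdigit() for tok in first_line):
        # "<vocabulary size> <dimension>" header of the word2vec text layout
        return "word2vec-text"
    return "unknown"
```

A result of "gzip+tar" for a downloaded archive would mean the vectors sit inside a tar archive that must be unpacked before the text file can be read.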
|
It seems that the WordVectorSerializer could not handle the downloaded tar.gz file Workaround |
No, I'm afraid the .tar is the problem, not the .gz.
… On 13 Oct 2017, at 16:04, Johan Vogelzang ***@***.***> wrote:
It seems that the WordVectorSerializer could not handle the downloaded tar.gz file
The line: ZipFile zipFile = new ZipFile(file);
throws an error "error in opening zip file"
Workaround
Manually unzip the downloaded tar.gz file and point WORD_VECTORS_PATH to the extracted txt file.
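The manual step in this workaround can also be scripted. The following is a minimal sketch using Python's standard `tarfile` module; `extract_text_member` is a hypothetical helper, and the assumption that the archive contains a member ending in `.txt` comes from the description above, not from inspecting the actual download.

```python
import os
import shutil
import tarfile


def extract_text_member(archive_path, dest_dir, suffix=".txt"):
    """Extract the first regular file ending in `suffix` from a .tar.gz archive.

    Returns the path of the extracted file inside `dest_dir`.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                # Keep only the base name so archive paths cannot escape dest_dir
                target = os.path.join(dest_dir, os.path.basename(member.name))
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError("no %s member in %s" % (suffix, archive_path))
```

The returned path is what WORD_VECTORS_PATH would then point to.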
|
I've got the same problem:
I solved it like this: gensim (3.0.1, python 2.7.5): in dl4j 0.9.1:
|
@tendy, |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |