
WordVectorSerializer cannot read gensim word2vec model #4150

Closed
shengc opened this issue Oct 5, 2017 · 15 comments
Labels
Enhancement New features and other enhancements

Comments

shengc commented Oct 5, 2017

in gensim v2.3.0-py27

   model = Word2Vec(sentences,
                    size=self._wv_config.vector_size,
                    window=self._wv_config.window_size,
                    min_count=self._wv_config.min_count,
                    workers=self._wv_config.workers,
                    sg=int(self._wv_config.use_skip_kgram),
                    iter=self._wv_config.num_epoch)
   model.save(self._model_path)

in dl4j v0.9.1

  import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer
  WordVectorSerializer.readWord2VecModel(new java.io.File("../data/wordvec.model"))

error

Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2480)
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2266)
	at LoadWordVecModel$.delayedEndpoint$LoadWordVecModel$1(RunTensorflowModel.scala:47)
	at LoadWordVecModel$delayedInit$body.apply(RunTensorflowModel.scala:45)
	at scala.Function0.apply$mcV$sp(Function0.scala:34)
	at scala.Function0.apply$mcV$sp$(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App.$anonfun$main$1$adapted(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:389)
	at scala.App.main(App.scala:76)
	at scala.App.main$(App.scala:74)
	at LoadWordVecModel$.main(RunTensorflowModel.scala:45)
	at LoadWordVecModel.main(RunTensorflowModel.scala)
raver119 commented Oct 5, 2017 via email

shengc commented Oct 5, 2017

Corporate policy forbids me from posting attachments to external sites... Is there anything in the model you would look at first?

raver119 commented Oct 5, 2017 via email

shengc commented Oct 5, 2017

wv.scratch.zip

@raver119 raver119 self-assigned this Oct 5, 2017
@raver119 raver119 added the Bug Bugs and problems label Oct 5, 2017
raver119 commented Oct 5, 2017

No, it won't load this way.
I'll take a look into exporting this kind of format, but for now I'd recommend just saving the gensim model as CSV.

@raver119 raver119 added Enhancement New features and other enhancements and removed Bug Bugs and problems labels Oct 5, 2017
shengc commented Oct 5, 2017

OK, but maybe you should be more explicit in your documentation about which word2vec formats produced by gensim dl4j can load.
https://deeplearning4j.org/word2vec.html#setup

Anyway, I think a file in a format that dl4j can load should be produced like the following:

model.wv.save_word2vec_format(vector_file_path, vocab_file_path, binary=False)
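For reference, the text layout that save_word2vec_format(..., binary=False) emits is simple: a header line with the vocabulary size and vector dimensionality, then one line per word with its space-separated float components. A minimal round-trip sketch of that layout (toy data, no gensim required; the helper names are mine, not a real API):

```python
# Sketch of the word2vec *text* format (the binary=False layout).
# Toy illustration only -- does not use gensim.

def write_word2vec_text(path, vectors):
    """vectors: dict mapping word -> list of floats (all the same length)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write("%d %d\n" % (len(vectors), dim))  # header: vocab_size dim
        for word, vec in vectors.items():
            f.write(word + " " + " ".join("%f" % x for x in vec) + "\n")

def read_word2vec_text(path):
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        out = {}
        for _ in range(n):
            parts = f.readline().split()
            out[parts[0]] = [float(x) for x in parts[1:]]
        assert all(len(v) == dim for v in out.values())
        return out
```

A file written this way is the plain-text shape that dl4j's text/CSV reader path is trying to detect.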

@johanvogelzang

Ran into the same problem with a word2vec model downloaded from https://github.com/clips/dutchembeddings

It would help to know what kind of word2vec format dl4j can handle.
Clearly there is more than one word2vec format?

raver119 commented Oct 12, 2017 via email

@johanvogelzang

You can easily reproduce this problem with the dl4j example CnnSentenceClassificationExample:

Loading word vectors and creating DataSetIterators
o.d.m.e.l.WordVectorSerializer - Trying DL4j format...
o.d.m.e.l.WordVectorSerializer - Trying CSVReader...
o.d.m.e.l.WordVectorSerializer - Trying BinaryReader...
Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.loadStaticModel(WordVectorSerializer.java:2646)
	at org.deeplearning4j.examples.convolution.sentenceclassification.CnnSentenceClassificationExample.main(CnnSentenceClassificationExample.java:124)

Process finished with exit code 1

raver119 commented Oct 13, 2017 via email

@johanvogelzang

It seems that WordVectorSerializer cannot handle the downloaded tar.gz file:
the line ZipFile zipFile = new ZipFile(file);
throws an "error in opening zip file" error.

Workaround
Manually unpack the downloaded tar.gz file and point WORD_VECTORS_PATH to the extracted txt file.
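The manual unpacking step can also be scripted with Python's standard tarfile module. A minimal sketch (archive and file names are illustrative; only use this on archives you trust, since extractall follows whatever paths the archive contains):

```python
import tarfile

def extract_embeddings(archive_path, dest_dir):
    """Unpack a .tar.gz embeddings archive so dl4j can be pointed
    at the plain-text vector file inside it. Returns the member file names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
        return [m.name for m in tar.getmembers() if m.isfile()]
```

After extraction, set WORD_VECTORS_PATH to the extracted .txt file rather than the archive itself.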

raver119 commented Oct 13, 2017 via email

tendy commented Nov 2, 2017

I've got the same problem:

Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly

I solved it like this:

gensim (3.0.1, python 2.7.5):
model.wv.save_word2vec_format("abc.bin.gz", binary=True)

in dl4j 0.9.1:

// Imports the snippet needs (class locations assumed from the dl4j/nd4j 0.9.1 release):
import org.apache.commons.compress.compressors.gzip.GzipUtils;
import org.apache.commons.lang3.StringUtils;
import org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable;
import org.deeplearning4j.models.embeddings.learning.impl.elements.SkipGram;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.models.word2vec.wordstore.VocabCache;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;
import java.io.*;
import java.util.zip.GZIPInputStream;

public static void main(String[] args) throws Exception {
    String filePath = "abc.bin.gz";
    File gModel = new File(filePath);
    Word2Vec vec = loadGoogleBinaryModel(gModel, false);
}

public static Word2Vec loadGoogleBinaryModel(File modelFile, boolean lineBreaks) throws IOException {
    return readBinaryModel(modelFile, lineBreaks, true);
}

private static Word2Vec readBinaryModel(File modelFile, boolean linebreaks, boolean normalize)
    throws NumberFormatException, IOException {
    InMemoryLookupTable<VocabWord> lookupTable;
    VocabCache<VocabWord> cache;
    INDArray syn0;
    int words, size;

    int originalFreq = Nd4j.getMemoryManager().getOccasionalGcFrequency();
    boolean originalPeriodic = Nd4j.getMemoryManager().isPeriodicGcActive();

    if (originalPeriodic)
        Nd4j.getMemoryManager().togglePeriodicGc(false);

    Nd4j.getMemoryManager().setOccasionalGcFrequency(50000);

    try (BufferedInputStream bis = new BufferedInputStream(GzipUtils.isCompressedFilename(modelFile.getName())
        ? new GZIPInputStream(new FileInputStream(modelFile)) : new FileInputStream(modelFile));
         DataInputStream dis = new DataInputStream(bis)) {
        words = Integer.parseInt(WordVectorSerializer.readString(dis));
        size = Integer.parseInt(WordVectorSerializer.readString(dis));
        syn0 = Nd4j.create(words, size);
        cache = new AbstractCache<>();

        WordVectorSerializer.printOutProjectedMemoryUse(words, size, 1);

        lookupTable = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>().cache(cache)
            .useHierarchicSoftmax(false).vectorLength(size).build();

        String word;
        float[] vector = new float[size];
        for (int i = 0; i < words; i++) {

            word = WordVectorSerializer.readString(dis);
            log.trace("Loading " + word + " with word " + i);

            for (int j = 0; j < size; j++) {
                vector[j] = WordVectorSerializer.readFloat(dis);
            }

            syn0.putRow(i, normalize ? Transforms.unitVec(Nd4j.create(vector)) : Nd4j.create(vector));

            // FIXME There was an empty string in my test model ......
            if (StringUtils.isNotEmpty(word)) {
                VocabWord vw = new VocabWord(1.0, word);
                vw.setIndex(cache.numWords());

                cache.addToken(vw);
                cache.addWordToIndex(vw.getIndex(), vw.getLabel());

                cache.putVocabWord(word);
            }

            if (linebreaks) {
                dis.readByte(); // line break
            }

            Nd4j.getMemoryManager().invokeGcOccasionally();
        }
    } finally {
        if (originalPeriodic)
            Nd4j.getMemoryManager().togglePeriodicGc(true);

        Nd4j.getMemoryManager().setOccasionalGcFrequency(originalFreq);
    }

    lookupTable.setSyn0(syn0);

    Word2Vec ret = new Word2Vec.Builder().useHierarchicSoftmax(false).resetModel(false).layerSize(syn0.columns())
        .allowParallelTokenization(true).elementsLearningAlgorithm(new SkipGram<VocabWord>())
        .learningRate(0.025).windowSize(5).workers(1).build();

    ret.setVocab(cache);
    ret.setLookupTable(lookupTable);

    return ret;
}
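For anyone porting or debugging the reader above: the Google binary layout that readBinaryModel walks through can be sketched in a few lines of Python (a toy illustration of the byte layout only, not gensim's actual writer). It is an ASCII header "vocab_size dim\n", then for each word the ASCII token terminated by a space, followed by dim little-endian float32 values:

```python
import io
import struct

def write_word2vec_binary(buf, vectors):
    """Write toy vectors in the Google word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    buf.write(("%d %d\n" % (len(vectors), dim)).encode("utf-8"))
    for word, vec in vectors.items():
        buf.write(word.encode("utf-8") + b" ")        # token + space separator
        buf.write(struct.pack("<%df" % dim, *vec))    # dim little-endian float32s

def read_word2vec_binary(buf):
    """Read the layout back: header line, then (token, dim floats) pairs."""
    header = b""
    while not header.endswith(b"\n"):
        header += buf.read(1)
    n, dim = map(int, header.split())
    out = {}
    for _ in range(n):
        word = b""
        while True:
            ch = buf.read(1)
            if ch == b" ":
                break
            word += ch
        out[word.decode("utf-8")] = struct.unpack("<%df" % dim, buf.read(4 * dim))
    return out
```

This mirrors what the Java code does with readString/readFloat; the lineBreaks flag in the Java reader corresponds to writers that append a newline after each vector, which this toy sketch does not emit.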

ali3assi commented Nov 3, 2017

@tendy,
Hi, could you please post the complete file containing your code above?
Thank you

@raver119 raver119 removed their assignment Apr 26, 2018
@raver119 raver119 closed this as completed Aug 1, 2018
lock bot commented Sep 21, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 21, 2018