
WordVectorSerializer cannot read gensim word2vec model #4150

Closed
shengc opened this issue Oct 5, 2017 · 15 comments
Labels
Enhancement New features and other enhancements

Comments

shengc commented Oct 5, 2017

in gensim v2.3.0-py27

   model = Word2Vec(sentences,
                    size=self._wv_config.vector_size,
                    window=self._wv_config.window_size,
                    min_count=self._wv_config.min_count,
                    workers=self._wv_config.workers,
                    sg=int(self._wv_config.use_skip_kgram),
                    iter=self._wv_config.num_epoch)
   model.save(self._model_path)

in dl4j v0.9.1

  import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer
  WordVectorSerializer.readWord2VecModel(new java.io.File("../data/wordvec.model"))

error

Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2480)
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2266)
	at LoadWordVecModel$.delayedEndpoint$LoadWordVecModel$1(RunTensorflowModel.scala:47)
	at LoadWordVecModel$delayedInit$body.apply(RunTensorflowModel.scala:45)
	at scala.Function0.apply$mcV$sp(Function0.scala:34)
	at scala.Function0.apply$mcV$sp$(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App.$anonfun$main$1$adapted(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:389)
	at scala.App.main(App.scala:76)
	at scala.App.main$(App.scala:74)
	at LoadWordVecModel$.main(RunTensorflowModel.scala:45)
	at LoadWordVecModel.main(RunTensorflowModel.scala)
raver119 commented Oct 5, 2017 via email

shengc commented Oct 5, 2017

Corporate policy forbids me from posting attachments to external sites... Is there anything in the model you would look at first?

raver119 commented Oct 5, 2017 via email

shengc commented Oct 5, 2017

wv.scratch.zip

@raver119 raver119 self-assigned this Oct 5, 2017
@raver119 raver119 added the Bug Bugs and problems label Oct 5, 2017
raver119 commented Oct 5, 2017

No, it won't load this way.
I'll take a look into exporting this kind of format, but for now I'd recommend just saving the gensim model as CSV.

@raver119 raver119 added Enhancement New features and other enhancements and removed Bug Bugs and problems labels Oct 5, 2017
shengc commented Oct 5, 2017

OK, but maybe you should be more explicit in your documentation about which word2vec formats produced by gensim dl4j can load.
https://deeplearning4j.org/word2vec.html#setup

Anyway, I think a file in a format that dl4j can load should be produced like the following:

model.wv.save_word2vec_format(vector_file_path, vocab_file_path, binary=False)
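For reference, the text layout that save_word2vec_format(..., binary=False) emits is simple: a header line with the vocabulary size and vector dimensionality, then one line per word with its space-separated float components. A minimal round-trip sketch of that layout (toy data, no gensim required; the helper names are mine, not a real API):

```python
# Sketch of the word2vec *text* format (the binary=False layout).
# Toy illustration only -- does not use gensim.

def write_word2vec_text(path, vectors):
    """vectors: dict mapping word -> list of floats (all the same length)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write("%d %d\n" % (len(vectors), dim))  # header: vocab_size dim
        for word, vec in vectors.items():
            f.write(word + " " + " ".join("%f" % x for x in vec) + "\n")

def read_word2vec_text(path):
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        out = {}
        for _ in range(n):
            parts = f.readline().split()
            out[parts[0]] = [float(x) for x in parts[1:]]
        assert all(len(v) == dim for v in out.values())
        return out
```

A file written this way is the plain-text shape that dl4j's text/CSV reader path is trying to detect.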

@johanvogelzang

Ran into the same problem with a word2vec model downloaded from https://github.com/clips/dutchembeddings

It would help to know what kind of word2vec format dl4j can handle.
Clearly there is more than one word2vec format?

raver119 commented Oct 12, 2017 via email

@johanvogelzang

You can easily reproduce this problem with the dl4j example CnnSentenceClassificationExample:

Loading word vectors and creating DataSetIterators
o.d.m.e.l.WordVectorSerializer - Trying DL4j format...
o.d.m.e.l.WordVectorSerializer - Trying CSVReader...
o.d.m.e.l.WordVectorSerializer - Trying BinaryReader...
Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.loadStaticModel(WordVectorSerializer.java:2646)
	at org.deeplearning4j.examples.convolution.sentenceclassification.CnnSentenceClassificationExample.main(CnnSentenceClassificationExample.java:124)

Process finished with exit code 1

raver119 commented Oct 13, 2017 via email

@johanvogelzang

It seems that WordVectorSerializer cannot handle the downloaded tar.gz file:
the line ZipFile zipFile = new ZipFile(file);
throws an "error in opening zip file" error.

Workaround
Manually unpack the downloaded tar.gz file and point WORD_VECTORS_PATH to the extracted txt file.
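The manual unpacking step can also be scripted with Python's standard tarfile module. A minimal sketch (archive and file names are illustrative; only use this on archives you trust, since extractall follows whatever paths the archive contains):

```python
import tarfile

def extract_embeddings(archive_path, dest_dir):
    """Unpack a .tar.gz embeddings archive so dl4j can be pointed
    at the plain-text vector file inside it. Returns the member file names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
        return [m.name for m in tar.getmembers() if m.isfile()]
```

After extraction, set WORD_VECTORS_PATH to the extracted .txt file rather than the archive itself.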

raver119 commented Oct 13, 2017 via email

tendy commented Nov 2, 2017

I've got the same problem:

Exception in thread "main" java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly

I solved it like this:

gensim (3.0.1, python 2.7.5):
model.wv.save_word2vec_format("abc.bin.gz", binary=True)

in dl4j 0.9.1:

// Imports the snippet needs (class locations assumed from the dl4j/nd4j 0.9.1 release):
import org.apache.commons.compress.compressors.gzip.GzipUtils;
import org.apache.commons.lang3.StringUtils;
import org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable;
import org.deeplearning4j.models.embeddings.learning.impl.elements.SkipGram;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.models.word2vec.wordstore.VocabCache;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;
import java.io.*;
import java.util.zip.GZIPInputStream;

public static void main(String[] args) throws Exception {
    String filePath = "abc.bin.gz";
    File gModel = new File(filePath);
    Word2Vec vec = loadGoogleBinaryModel(gModel, false);
}

public static Word2Vec loadGoogleBinaryModel(File modelFile, boolean lineBreaks) throws IOException {
    return readBinaryModel(modelFile, lineBreaks, true);
}

private static Word2Vec readBinaryModel(File modelFile, boolean linebreaks, boolean normalize)
    throws NumberFormatException, IOException {
    InMemoryLookupTable<VocabWord> lookupTable;
    VocabCache<VocabWord> cache;
    INDArray syn0;
    int words, size;

    int originalFreq = Nd4j.getMemoryManager().getOccasionalGcFrequency();
    boolean originalPeriodic = Nd4j.getMemoryManager().isPeriodicGcActive();

    if (originalPeriodic)
        Nd4j.getMemoryManager().togglePeriodicGc(false);

    Nd4j.getMemoryManager().setOccasionalGcFrequency(50000);

    try (BufferedInputStream bis = new BufferedInputStream(GzipUtils.isCompressedFilename(modelFile.getName())
        ? new GZIPInputStream(new FileInputStream(modelFile)) : new FileInputStream(modelFile));
         DataInputStream dis = new DataInputStream(bis)) {
        words = Integer.parseInt(WordVectorSerializer.readString(dis));
        size = Integer.parseInt(WordVectorSerializer.readString(dis));
        syn0 = Nd4j.create(words, size);
        cache = new AbstractCache<>();

        WordVectorSerializer.printOutProjectedMemoryUse(words, size, 1);

        lookupTable = (InMemoryLookupTable<VocabWord>) new InMemoryLookupTable.Builder<VocabWord>().cache(cache)
            .useHierarchicSoftmax(false).vectorLength(size).build();

        String word;
        float[] vector = new float[size];
        for (int i = 0; i < words; i++) {

            word = WordVectorSerializer.readString(dis);
            log.trace("Loading " + word + " with word " + i);

            for (int j = 0; j < size; j++) {
                vector[j] = WordVectorSerializer.readFloat(dis);
            }

            syn0.putRow(i, normalize ? Transforms.unitVec(Nd4j.create(vector)) : Nd4j.create(vector));

            // FIXME There was an empty string in my test model ......
            if (StringUtils.isNotEmpty(word)) {
                VocabWord vw = new VocabWord(1.0, word);
                vw.setIndex(cache.numWords());

                cache.addToken(vw);
                cache.addWordToIndex(vw.getIndex(), vw.getLabel());

                cache.putVocabWord(word);
            }

            if (linebreaks) {
                dis.readByte(); // line break
            }

            Nd4j.getMemoryManager().invokeGcOccasionally();
        }
    } finally {
        if (originalPeriodic)
            Nd4j.getMemoryManager().togglePeriodicGc(true);

        Nd4j.getMemoryManager().setOccasionalGcFrequency(originalFreq);
    }

    lookupTable.setSyn0(syn0);

    Word2Vec ret = new Word2Vec.Builder().useHierarchicSoftmax(false).resetModel(false).layerSize(syn0.columns())
        .allowParallelTokenization(true).elementsLearningAlgorithm(new SkipGram<VocabWord>())
        .learningRate(0.025).windowSize(5).workers(1).build();

    ret.setVocab(cache);
    ret.setLookupTable(lookupTable);

    return ret;
}
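For anyone porting or debugging the reader above: the Google binary layout that readBinaryModel walks through can be sketched in a few lines of Python (a toy illustration of the byte layout only, not gensim's actual writer). It is an ASCII header "vocab_size dim\n", then for each word the ASCII token terminated by a space, followed by dim little-endian float32 values:

```python
import io
import struct

def write_word2vec_binary(buf, vectors):
    """Write toy vectors in the Google word2vec binary layout."""
    dim = len(next(iter(vectors.values())))
    buf.write(("%d %d\n" % (len(vectors), dim)).encode("utf-8"))
    for word, vec in vectors.items():
        buf.write(word.encode("utf-8") + b" ")        # token + space separator
        buf.write(struct.pack("<%df" % dim, *vec))    # dim little-endian float32s

def read_word2vec_binary(buf):
    """Read the layout back: header line, then (token, dim floats) pairs."""
    header = b""
    while not header.endswith(b"\n"):
        header += buf.read(1)
    n, dim = map(int, header.split())
    out = {}
    for _ in range(n):
        word = b""
        while True:
            ch = buf.read(1)
            if ch == b" ":
                break
            word += ch
        out[word.decode("utf-8")] = struct.unpack("<%df" % dim, buf.read(4 * dim))
    return out
```

This mirrors what the Java code does with readString/readFloat; the lineBreaks flag in the Java reader corresponds to writers that append a newline after each vector, which this toy sketch does not emit.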

ali3assi commented Nov 3, 2017

@tendy,
Hi, could you please post the complete file containing your code above?
Thank you

@raver119 raver119 removed their assignment Apr 26, 2018
@raver119 raver119 closed this as completed Aug 1, 2018
lock bot commented Sep 21, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 21, 2018