Error while opening own trained vectors file #9

newterminator · 2016-03-11T06:55:06Z

I was able to train data using train_word2vec.py after preprocessing the data using merge_text.py.
Below is the outcome of train_word2vec.py:

Then I passed vectors.bin to the new version 0.2.0 of sense2vec and got an IOError. This is the code I used to load the vectors:

from sense2vec.vectors import VectorMap
vector_map = VectorMap(128)
vector_map.load("/home/noname/Documents/data/vectors")

The error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-315510f2d9d1> in <module>()
      1 vector_map = VectorMap(128)
----> 2 vector_map.load("/home/noname/Documents/data/vectors")

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:4870)()
    100 
    101     def load(self, data_dir):
--> 102         self.data.load(path.join(data_dir, 'vectors.bin'))
    103         with open(path.join(data_dir, 'strings.json')) as file_:
    104             self.strings.load(file_)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:7049)()
    200         cdef float[:] cv
    201         for i in range(nr_vector):
--> 202             cfile.read_into(&tmp[0], self.nr_dim, sizeof(tmp[0]))
    203             ptr = &tmp[0]
    204             cv = <float[:128]>ptr

/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1147)()
     25         st = fread(dest, elem_size, number, self.fp)
     26         if st != number:
---> 27             raise IOError
     28 
     29     cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:

IOError:
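
The last frame shows `read_into` raising a bare `IOError` when `fread` returns fewer items than requested. That is consistent with the file ending before the loader has read everything it expects, though the traceback alone does not prove it. A rough Python analogue of that check, with a more informative message, is sketched here. `read_floats` is a hypothetical helper, and the little-endian float32 layout is an assumption about the file, not something confirmed above.

```python
import io
import struct


def read_floats(fp, count):
    """Read `count` float32 values from `fp`, or raise IOError on a short read.

    Hypothetical helper; the 4-byte little-endian layout is an assumption.
    """
    nbytes = 4 * count
    raw = fp.read(nbytes)
    if len(raw) != nbytes:
        # Same condition as `st != number` in the traceback, with details added.
        raise IOError("short read: wanted %d bytes, got %d" % (nbytes, len(raw)))
    return list(struct.unpack("<%df" % count, raw))


# An in-memory "file" holding three floats.
buf = io.BytesIO(struct.pack("<3f", 1.0, 2.0, 3.0))
first = read_floats(buf, 2)      # succeeds: two floats are available
try:
    read_floats(buf, 2)          # only one float is left
    short_read = False
except IOError:
    short_read = True
```

A message that reports the requested and received byte counts would make it easier to tell a truncated file from one written in a different format.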

I also wanted to ask how to get the relevant freqs.json and strings.json for the trained vectors. For strings.json, I have the batch outputs from merge_text.py, so they would need to be mapped to the relevant information in freqs.json. If there is already a function that does this and I missed calling it, please let me know.

Python version: 2.7.11
spaCy version: 0.100.5

The-Kunze · 2016-03-23T16:31:31Z

Is there any update on this issue? I am also getting the same error.

henningpeters · 2016-03-23T16:42:32Z

Sorry for the delay, we have focused on other things recently. Drop me a mail (hp@spacy.io) if you need to train custom sense2vec models urgently.

The-Kunze · 2016-03-24T14:38:12Z

Thanks for getting back to me. It's not very urgent; I've just been
experimenting with sense2vec and spaCy for a project I'm working on. I'm
relatively new to spaCy, but I think I understand what's going wrong, and
if you could tell me whether I'm on the right track, that'd be very helpful.

I'm using the pre-processing script (merge_text.py) and training the gensim
word2vec model, and then trying to load the resulting vector binary into
sense2vec, which isn't working because sense2vec expects the vector binary
to be in its own format rather than Gensim's. Is that correct?

My understanding, though, is that I can still get the same result, i.e.
searching for similarity between POS-tagged tokens, by loading the model
into Gensim and querying it there. Is that true? If so, what are the
benefits of converting the Gensim model into the format sense2vec accepts
and loading it into sense2vec?

Sorry if these are dumb or obvious questions, I'm still learning at this
stage and am grateful for any help I can get. Thanks!

Mike

syllog1sm · 2016-03-24T22:12:47Z

Your assumptions are all correct --- Gensim saves the model into its own format, and you can just load that up and make model.most_similar() queries. We wrote our own sense2vec.vectors class for a few reasons:

Gensim's data format is a bit inefficient. Specifically, it maintains the Python strings in memory as unicode objects. This means that the strings actually take more memory than the vectors. We also wanted to minimise loading time.
We wanted to be able to cache similarity query results. This was especially important for our web demo.
We wanted to be able to make a query against a vector directly, not just via a word.
We wanted to support a borrow() method, that allows multiple VectorMap objects to share the same underlying data. This lets you find the most similar words within some subset more easily. For instance, we want to be able to ask "what's the most similar noun to this vector", but also "what's the most similar word starting with S?". To do this, we allow you to add the same vector to multiple maps, without making multiple copies.

If none of these features are relevant to you, then using Gensim's Word2Vec class might be better for you.
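
To illustrate the sharing idea behind `borrow()` and the "query against a vector directly" point, here is a minimal pure-Python sketch. It is not sense2vec's implementation or API: `SharedStore`, `SubsetMap`, `cosine` and the `borrow(key, other)` signature are hypothetical names chosen for the example, and the vectors are toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SharedStore:
    """Holds each vector exactly once; maps refer to rows by index."""

    def __init__(self):
        self.rows = []

    def add(self, vector):
        self.rows.append(list(vector))
        return len(self.rows) - 1


class SubsetMap:
    """A keyed view over some rows of a SharedStore (hypothetical class)."""

    def __init__(self, store):
        self.store = store
        self.index = {}  # string -> row number in the shared store

    def add(self, key, vector):
        self.index[key] = self.store.add(vector)

    def borrow(self, key, other):
        # Point at the row `other` already uses for `key`; nothing is copied.
        self.index[key] = other.index[key]

    def most_similar(self, query, n=1):
        # The query is a raw vector, not a word.
        scored = [(cosine(query, self.store.rows[i]), k)
                  for k, i in self.index.items()]
        scored.sort(reverse=True)
        return [k for _, k in scored[:n]]


store = SharedStore()
all_words = SubsetMap(store)
all_words.add("duck|NOUN", [1.0, 0.0])
all_words.add("duck|VERB", [0.0, 1.0])
all_words.add("swan|NOUN", [0.9, 0.1])

# A second map restricted to nouns, sharing the same rows.
nouns = SubsetMap(store)
for key in all_words.index:
    if key.endswith("|NOUN"):
        nouns.borrow(key, all_words)

query = [0.1, 1.0]
best_overall = all_words.most_similar(query)
best_noun = nouns.most_similar(query)
```

The same query vector returns a different nearest neighbour depending on which map is asked, while the store still holds only three vectors.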

elyase · 2016-03-24T22:25:50Z

What would be the recommended way to create a model that can be loaded by vector_map.load? Add the vectors one by one with vector_map.add/borrow and then call vector_map.save?

syllog1sm · 2016-03-24T22:45:53Z

@elyase : Correct, that's what you should do. I'm sure I had a script that did exactly that, but it seems to have gone missing when the repository was reorganised. Damn.

At the moment it's very hard for us to get this library into a great state for users while we also push spaCy forward. It's still quite hard for other people to pitch in on spaCy, but this library is smaller and a bit more accessible. If you write a little conversion script, we'd appreciate the pull request.
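
The shape of such a conversion loop could be sketched as follows. This is not the missing script or the one from any pull request. It assumes the map exposes `borrow(string, freq, vector)` and `save(data_dir)`; those signatures are assumptions, as are the helper name `fill_vector_map`, the `min_count` cutoff, and the stand-in `RecordingMap`, which exists only so the sketch runs without sense2vec installed.

```python
def fill_vector_map(vector_map, entries, min_count=1):
    """Add (string, freq, vector) entries to `vector_map`, skipping rare strings.

    Assumes `vector_map.borrow(string, freq, vector)` exists with that
    signature; this is an assumption, not confirmed in this thread.
    Returns the number of entries added.
    """
    added = 0
    for string, freq, vector in entries:
        if freq < min_count:
            continue
        vector_map.borrow(string, freq, vector)
        added += 1
    return added


class RecordingMap:
    """Hypothetical stand-in for a real vector map; it only records calls."""

    def __init__(self):
        self.items = []
        self.saved_to = None

    def borrow(self, string, freq, vector):
        self.items.append((string, freq, vector))

    def save(self, data_dir):
        self.saved_to = data_dir


# Toy entries; in practice these would come from the trained Gensim model.
entries = [
    ("duck|NOUN", 12, [0.1, 0.2]),
    ("rare|ADJ", 1, [0.3, 0.4]),
]
vector_map = RecordingMap()
added = fill_vector_map(vector_map, entries, min_count=5)
vector_map.save("/tmp/vectors_out")  # hypothetical output directory
```

With the real library, `vector_map` would be a `VectorMap` of the right width, and `entries` would be built from the trained model's vocabulary and vectors; how to read those out depends on the Gensim version.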

elyase · 2016-03-25T12:59:35Z

Added a PR (#11) with the conversion script.

syllog1sm · 2016-09-11T12:14:24Z

Assuming it's safe to close this? Reopen if necessary.

enandini · 2017-07-17T18:11:48Z

@newterminator I am currently trying to run train_word2vec, and I am running into errors. I have not found many threads on this. I am pointing in_dir and out_loc at a folder that contains the text file output by merge_text. However, I keep getting an error saying that I have too few arguments. I was wondering if you ever ran into this issue...

syllog1sm closed this as completed Sep 11, 2016
