Error while opening own trained vectors file #9

Closed
newterminator opened this issue Mar 11, 2016 · 9 comments

@newterminator

I was able to train data using train_word2vec.py after preprocessing the data using merge_text.py.
Below is the outcome of train_word2vec.py:

[Screenshot of the train_word2vec.py output; image not preserved, alt text: "vectors"]

I then tried to load the resulting vectors.bin with the new sense2vec version 0.2.0 and got an IOError. This is the code I used to load the vectors:

from sense2vec.vectors import VectorMap
vector_map = VectorMap(128)
vector_map.load("/home/noname/Documents/data/vectors")

The error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-315510f2d9d1> in <module>()
      1 vector_map = VectorMap(128)
----> 2 vector_map.load("/home/noname/Documents/data/vectors")

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:4870)()
    100 
    101     def load(self, data_dir):
--> 102         self.data.load(path.join(data_dir, 'vectors.bin'))
    103         with open(path.join(data_dir, 'strings.json')) as file_:
    104             self.strings.load(file_)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:7049)()
    200         cdef float[:] cv
    201         for i in range(nr_vector):
--> 202             cfile.read_into(&tmp[0], self.nr_dim, sizeof(tmp[0]))
    203             ptr = &tmp[0]
    204             cv = <float[:128]>ptr

/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1147)()
     25         st = fread(dest, elem_size, number, self.fp)
     26         if st != number:
---> 27             raise IOError
     28 
     29     cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:

IOError:

I would also like to know how to get the freqs.json and strings.json files that go with the trained vectors. For strings.json, I have the batch outputs from merge_text.py, so those need to be mapped to the corresponding information in freqs.json. If there is already a function that does this and I missed calling it, please let me know.

Python version: 2.7.11
spaCy version: 0.100.5

@The-Kunze

Is there any update on this issue? I am also getting the same error.

@henningpeters
Contributor

Sorry for the delay, we have focused on other things recently. Drop me a mail (hp@spacy.io) if you need to train custom sense2vec models urgently.

@The-Kunze

Thanks for getting back to me. It's not very urgent; I've just been
experimenting with sense2vec and spaCy for a project I'm working on. I'm
relatively new to spaCy, but I think I understand what's going wrong, and
it would be very helpful if you could tell me whether I'm on the right track.

I'm using the pre-processing script (merge_text.py) and training the gensim
word2vec model, and then trying to load the resulting vector binary into
sense2vec, which isn't working because sense2vec expects the vector binary
to be in the spacy format. Is that correct?
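
To make the failure mode concrete, here is a minimal sketch of the check visible in the traceback: the loader asks for a fixed number of float values per vector and raises IOError when the file yields fewer. The helper names (`read_into`, `load_raw_vectors`) and the little-endian float32 layout are assumptions for illustration only; the traceback does not show how `nr_vector` is determined, so it is passed in as a parameter here. This is not the sense2vec or spaCy implementation.

```python
import io
import struct


def read_into(fp, number, elem_size):
    """Read exactly `number` elements of `elem_size` bytes, or raise IOError.

    Mimics the check in the traceback: a short read is treated as an error.
    """
    wanted = number * elem_size
    data = fp.read(wanted)
    if len(data) != wanted:
        raise IOError("short read: wanted %d bytes, got %d" % (wanted, len(data)))
    return data


def load_raw_vectors(fp, nr_vector, nr_dim):
    """Read nr_vector rows of nr_dim float32 values from a raw binary stream.

    ASSUMPTION: little-endian float32 rows with no header; the real file
    layout is not described in this thread.
    """
    rows = []
    for _ in range(nr_vector):
        raw = read_into(fp, nr_dim, 4)
        rows.append(struct.unpack("<%df" % nr_dim, raw))
    return rows


# A stream that matches what the loader expects: 2 vectors of 3 dimensions.
good = io.BytesIO(struct.pack("<6f", 1, 2, 3, 4, 5, 6))
vectors = load_raw_vectors(good, nr_vector=2, nr_dim=3)

# A stream in some other layout, shorter than the loader expects.
bad = io.BytesIO(struct.pack("<4f", 1, 2, 3, 4))
try:
    load_raw_vectors(bad, nr_vector=2, nr_dim=3)
    short_read_detected = False
except IOError:
    short_read_detected = True
```

A file written by a different tool will generally not line up with the sizes the loader computes, so a bare IOError like the one reported is a plausible symptom of a format mismatch rather than of a missing file.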

My understanding though, is that I can still get the same result, i.e.
searching for similarity between tokens tagged with POS, using the gensim
model, I just have to load that into Gensim. Is that true? Going off of
that, what are the benefits of trying to convert the Gensim model into the
accepted Spacy format and loading it into sense2vec?

Sorry if these are dumb or obvious questions, I'm still learning at this
stage and am grateful for any help I can get. Thanks!

Mike

On Wed, Mar 23, 2016 at 4:42 PM, Henning Peters notifications@github.com
wrote:

Sorry for the delay, we have focused on other things recently. Drop me a
mail (hp@spacy.io) if you need to train custom sense2vec models urgently.



@syllog1sm
Contributor

Your assumptions are all correct: Gensim saves the model into its own format, and you can just load that up and make model.most_similar() queries. We wrote our own sense2vec.vectors class for a few reasons:

  • Gensim's data format is a bit inefficient. Specifically, it maintains the Python strings in memory as unicode objects. This means that the strings actually take more memory than the vectors. We also wanted to minimise loading time.
  • We wanted to be able to cache similarity query results. This was especially important for our web demo.
  • We wanted to be able to make a query against a vector directly, not just via a word.
  • We wanted to support a borrow() method, that allows multiple VectorMap objects to share the same underlying data. This lets you find the most similar words within some subset more easily. For instance, we want to be able to ask "what's the most similar noun to this vector", but also "what's the most similar word starting with S?". To do this, we allow you to add the same vector to multiple maps, without making multiple copies.
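
The last two points can be illustrated with a small pure-Python sketch: several maps hold references to the same vector objects, and each map is queried with a vector rather than a word. `TinyVectorMap`, its two-argument `borrow`, and the `word|POS` keys are hypothetical stand-ins chosen for illustration; they are not the sense2vec classes or signatures.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorMap(object):
    """Hypothetical stand-in for a vector map; not sense2vec's implementation."""

    def __init__(self):
        self.vectors = {}

    def borrow(self, key, vector):
        # Store a reference to the caller's list; no copy is made.
        self.vectors[key] = vector

    def most_similar(self, query, n=1):
        # The query is a vector, not a word.
        scored = [(cosine(query, vec), key) for key, vec in self.vectors.items()]
        scored.sort(reverse=True)
        return [key for _, key in scored[:n]]


# Illustrative data only.
data = {
    "duck|NOUN": [1.0, 0.0, 0.2],
    "duck|VERB": [0.0, 1.0, 0.1],
    "swan|NOUN": [0.9, 0.1, 0.3],
}

everything = TinyVectorMap()
nouns = TinyVectorMap()
for key, vec in data.items():
    everything.borrow(key, vec)
    if key.endswith("|NOUN"):
        nouns.borrow(key, vec)  # same list object, shared between maps

query = data["duck|VERB"]
best_overall = everything.most_similar(query)
best_noun = nouns.most_similar(query)
```

Because both maps point at the same lists, restricting a search to a subset costs only an extra dictionary of references, not a second copy of the vectors.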

If none of these features are relevant to you, then using Gensim's Word2Vec class might be better for you.

@elyase

elyase commented Mar 24, 2016

What would be the recommended way to create a model that can be loaded by vector_map.load? Add the vectors one by one with vector_map.add/borrow and then call vector_map.save?

@syllog1sm
Contributor

@elyase : Correct, that's what you should do. I'm sure I had a script that did exactly that, but it seems to have gone missing when the repository was reorganised. Damn.

At the moment it's very hard for us to get this library into a great state for users while we also push spaCy forward. It's still quite hard for other people to pitch in on spaCy, but this library is smaller and a bit more accessible. If you write a little conversion script, we'd appreciate the pull request.
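
A rough sketch of what such a conversion could look like follows, under several loudly stated assumptions: the Gensim model exposes `vocab` (word to an entry with `count` and `index`) and `syn0` (the vector table), as Gensim releases from around that time did, and `borrow` takes `(string, freq, vector)`. None of these signatures is confirmed in this thread, and the `convert` helper and `min_count` parameter are hypothetical.

```python
def convert(gensim_model, vector_map, min_count=1):
    """Copy every sufficiently frequent word vector into vector_map.

    ASSUMPTIONS (not confirmed by the thread):
      * gensim_model.vocab maps word -> entry with .count and .index
      * gensim_model.syn0[index] is that word's vector
      * vector_map.borrow(string, freq, vector) registers one entry
    Returns the number of entries that were copied.
    """
    kept = 0
    for word, entry in gensim_model.vocab.items():
        if entry.count < min_count:
            continue  # skip words below the frequency threshold
        vector_map.borrow(word, entry.count, gensim_model.syn0[entry.index])
        kept += 1
    return kept


# Hypothetical usage, assuming both libraries are installed and the
# assumed signatures hold (paths are placeholders):
#   from gensim.models import Word2Vec
#   from sense2vec.vectors import VectorMap
#   vector_map = VectorMap(128)
#   convert(Word2Vec.load("/path/to/gensim_model"), vector_map)
#   vector_map.save("/path/to/output_dir")
```

Taking the two objects as parameters keeps the loop independent of either library's import path, so it can be exercised with simple stand-in objects before running it against real models.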

@elyase

elyase commented Mar 25, 2016

Added a PR (#11) with the conversion script.

@syllog1sm
Contributor

Assuming it's safe to close this? Reopen if necessary.

@enandini

@newterminator I am currently trying to run train_word2vec.py and am running into errors. I have not found many threads on this. I am pointing in_dir and out_loc at a folder that contains the text file output by merge_text.py, but I keep getting an error saying that I have too few arguments. I was wondering if you ever ran into this issue...
