Error using the most similar method #5

Closed
newterminator opened this issue Feb 23, 2016 · 12 comments

@newterminator

After installing sense2vec successfully, I loaded the model as described in the response to issue #3, but I get an error when I try to use the most_similar method.

Following is what I entered after loading the model:
print vector_map.most_similar("education", topn=10)

Below is the error I receive.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7f468f5b06ca> in <module>()
----> 1 print vector_map.most_similar("education", topn=10)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:3363)()
     66             yield (string, freq, self.data[i])
     67 
---> 68     def most_similar(self, float[:] vector, int n):
     69         indices, scores = self.data.most_similar(vector, n)
     70         return [self.strings[idx] for idx in indices], scores

TypeError: most_similar() takes exactly 2 positional arguments (1 given)

So I understand that the most_similar method wants a float parameter followed by an int parameter. I had expected it to take arguments similar to those of the most_similar method in gensim's word2vec implementation.

Could you please show me how to use the most_similar method in the sense2vec implementation?

@henningpeters
Contributor

This looks easy to fix. The function signature in your error report above tells you that the second parameter is n and not topn. As you can also see from the signature, float[:] vector cannot be a string; it's a list of floats, namely the vector for which you want to compute the n most similar entries.

Here's how you retrieve a vector for an entry and issue a most_similar() query:

freq, query_vector = vector_map[query]  # query is a unicode string
vector_map.most_similar(query_vector, 10)

The code is still a bit rough to use; this will change before we officially release it on PyPI. Also, we would love to hear about your use case. If you don't want to discuss this publicly, please get in contact with me at hp@spacy.io.
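To make the two-step pattern above concrete without the real model, here is a minimal sketch using a hypothetical stand-in class, `ToyVectorMap`. It only mimics the two calls shown in this thread: indexing returns a `(freq, vector)` pair, and `most_similar(vector, n)` returns a pair of keys and scores. The class name, the `add` helper, the sample entries and the use of cosine similarity are all assumptions for illustration, not the sense2vec implementation.

```python
import math


def _cosine(a, b):
    # Cosine similarity between two equal-length lists of floats.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorMap(object):
    """Hypothetical stand-in that mimics the VectorMap calls used in this thread."""

    def __init__(self):
        self._freqs = {}
        self._vectors = {}

    def add(self, key, freq, vector):
        # Illustrative helper; not part of the real class.
        self._freqs[key] = freq
        self._vectors[key] = list(vector)

    def __getitem__(self, key):
        # Entries without a stored frequency raise KeyError.
        freq = self._freqs.get(key, 0)
        if not freq:
            raise KeyError(key)
        return freq, self._vectors[key]

    def most_similar(self, vector, n):
        # Rank every entry by similarity to the query vector (assumed metric: cosine).
        scored = [(_cosine(vector, v), k) for k, v in self._vectors.items()]
        scored.sort(reverse=True)
        top = scored[:n]
        return [k for _, k in top], [s for s, _ in top]


toy = ToyVectorMap()
toy.add(u"beekeepers|NOUN", 120, [1.0, 0.0])
toy.add(u"honey_bees|NOUN", 300, [0.9, 0.1])
toy.add(u"education|NOUN", 5000, [0.0, 1.0])

# Step 1: look up the entry to get its frequency and vector.
freq, query_vector = toy[u"beekeepers|NOUN"]
# Step 2: pass the vector (not the string) and n positionally.
keys, scores = toy.most_similar(query_vector, 2)
print(keys)
```

The point of the sketch is the calling convention: the string is only used for the lookup, and the query itself takes a vector plus a positional count rather than a `topn` keyword.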

@newterminator
Author

Hi @henningpeters, thanks for responding to the issue.
I did as you suggested:
freq, query_vector = vector_map[unicode("education")]
and the error I received was:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-fdf21a28d7ad> in <module>()
----> 1 freq, query_vector = vector_map[unicode("education")]

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.__getitem__ (sense2vec/vectors.cpp:3002)()
     54         freq = self.freqs[hashed]
     55         if not freq:
---> 56             raise KeyError(string)
     57         else:
     58             i = self.strings[string]

KeyError: u'education'

So my two questions are:

  1. What type of value does the 'freq' variable hold?
  2. How do I resolve the above error?

I understand that the code is rough, so I greatly appreciate your time in explaining the concepts and how to run the code.

As for my use case, I don't have one at the moment. I am just learning the concepts of NLP, saw your sense2vec implementation and demo (which got me very excited), and wanted to play around with it.

@henningpeters
Contributor

It says that education isn't contained in vector_map. Please have a look at the load() function at https://github.com/spacy-io/sense2vec/blob/master/sense2vec/vectors.pyx#L097 and the files vectors.bin, strings.json and freqs.json to understand what's going on here.
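One way to check which keys are actually available is to search the loaded strings for entries whose text matches the word you have in mind. The sketch below assumes that entries carry a tag after a `|` separator, as in the `beekeepers|NOUN` entry quoted later in this thread, and that the contents of strings.json have already been read into a Python list. Both the helper name `candidate_keys` and the sample list are hypothetical.

```python
def candidate_keys(strings, word):
    """Return entries whose text before the last "|" matches `word`, ignoring case."""
    wanted = word.lower()
    matches = []
    for entry in strings:
        # rsplit leaves the entry unchanged when it contains no "|" at all.
        text = entry.rsplit(u"|", 1)[0]
        if text.lower() == wanted:
            matches.append(entry)
    return matches


# Sample entries for illustration only; a real list would come from strings.json.
strings = [u"beekeepers|NOUN", u"Beekeepers|NOUN", u"honey_bees|NOUN"]
print(candidate_keys(strings, u"beekeepers"))
print(candidate_keys(strings, u"education"))
```

An empty result means the bare word has no matching entry in the list, which would explain a KeyError for that lookup; a non-empty result gives the exact strings to try as keys.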

@newterminator
Author

Hi @henningpeters, thanks for the really quick reply. I checked out the load function of the VectorMap class and made sure that it had the right path for the vectors.bin, strings.json and freqs.json files.
This is what I have:

import sputnik
from sense2vec import about
from sense2vec.vectors import VectorMap

package = sputnik.package(about.__title__, about.__version__, about.__default_model__)
vector_map = VectorMap(128)
vector_map.load(package.path)

freq, query_vector = vector_map[unicode("beekeepers|NOUN")]
vector_map.most_similar(query_vector, 10)

I even tried the entry "beekeepers|NOUN", which I found in the strings.json file, but still received the same error.

I have opened the files you mentioned and read the implementation of the VectorMap class in the file linked in your comment above, but I still cannot see what I am missing.

@henningpeters
Contributor

That seems strange; I cannot reproduce this behavior on my system (Linux, Atlas, Python 3.4). The output here is:

(['beekeepers|NOUN',
  'honey_bees|NOUN',
  'Beekeepers|NOUN', ...], <MemoryView of 'ndarray' at 0x7f2a1f05e398>)

Which platform are you on, and which BLAS library did you compile against?

@newterminator
Author

Hi @henningpeters, thank you for getting back to me on the issue. I use the following: Ubuntu 14.04, Atlas and Python 2.7.11.
I have been trying to resolve this error by searching on SO and elsewhere, but to no avail.
I verified that the freqs.json and strings.json files are being loaded in vectors.pyx.

@newterminator
Author

Hey @henningpeters, I tried to train fresh vectors on a Project Gutenberg eBook text file, so that I could check whether they also produce the same error with the most_similar method. I ran into this error though:

(spacy) noname@noname-desktop:~/spacy/src/sense2vec/bin$ python merge_text.py "/home/noname/Documents/data/learn.txt" "/home/noname/Documents/data"
Traceback (most recent call last):
  File "merge_text.py", line 137, in <module>
    plac.call(main)
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/plac_core.py", line 309, in call
    cmd, result = parser_from(obj).consume(arglist)
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/plac_core.py", line 195, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "merge_text.py", line 133, in main
    parallelize(do_work, enumerate(jobs), n_workers, [out_dir])
  File "merge_text.py", line 44, in parallelize
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/joblib/parallel.py", line 653, in dispatch_one_batch
    tasks = BatchedCalls(itertools.islice(iterator, batch_size))
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/joblib/parallel.py", line 68, in __init__
    self.items = list(iterator_slice)
  File "merge_text.py", line 44, in <genexpr>
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
  File "merge_text.py", line 49, in iter_comments
    for i, line in enumerate(file_):
IOError: invalid data stream

I also wanted to ask: does merge_text.py take a plain text file as its input, or does the text file need to be pre-processed in a certain way?

@newterminator
Author

Hey @henningpeters, never mind the above comment. From looking at merge_text.py, I figured out that passing a text file as input was what caused the error. I will also try Python 3 to see if the initial issue persists, and re-open the issue if needed.
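For anyone hitting the same IOError: the message "invalid data stream" matches what Python's bz2 module reports when asked to decompress input that is not bz2 data, so one possibility is that the script expects a bz2-compressed file. That is an inference from the traceback, not something confirmed in this thread. Under that assumption, a quick pre-flight check is to look for the bz2 magic bytes at the start of the file. The helper name `looks_like_bz2` and the throwaway file names below are hypothetical.

```python
import bz2
import os
import tempfile


def looks_like_bz2(path):
    """Return True if the file starts with the bz2 magic bytes b"BZh"."""
    with open(path, "rb") as file_:
        return file_.read(3) == b"BZh"


# Demo on two throwaway files: one plain text, one bz2-compressed.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "learn.txt")
packed_path = os.path.join(tmp_dir, "learn.txt.bz2")
with open(plain_path, "wb") as file_:
    file_.write(b"Some plain text.\n")
with open(packed_path, "wb") as file_:
    file_.write(bz2.compress(b"Some plain text.\n"))

print(looks_like_bz2(plain_path), looks_like_bz2(packed_path))
```

The check only inspects the first three bytes, so it says nothing about what the decompressed content should look like; the script itself remains the reference for the expected input format.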

henningpeters added a commit to explosion/spaCy that referenced this issue Mar 6, 2016
@henningpeters
Contributor

There was indeed an error within spaCy that broke compatibility with Python<3.3.

@honnibal
Member

honnibal commented Mar 6, 2016

It's nice to accidentally support lots of versions, and I'd rather make changes to keep our code more general.

But we don't promise support for Python < 3.3, do we? Python 3.0 was basically unusable, and Python 3.1 was pretty terrible. Most libraries only support 3.3+, I think?

@henningpeters
Contributor

Maybe I was a bit unclear. Of course I meant primarily 2.7, but the change also "accidentally" fixes everything in between.

@newterminator
Author

Hi @henningpeters, thanks for clearing up the error in Python 2.7. I confirm that the load() and most_similar functions now work.
