Error using the most similar method #5

newterminator · 2016-02-23T06:52:45Z

After successfully installing sense2vec, I loaded the model as described in the response to issue #3, but I get an error when I try to use the most_similar method.

Following is what I entered after loading the model:
print vector_map.most_similar("education", topn=10)

Below is the error I receive.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7f468f5b06ca> in <module>()
----> 1 print vector_map.most_similar("education", topn=10)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:3363)()
     66             yield (string, freq, self.data[i])
     67 
---> 68     def most_similar(self, float[:] vector, int n):
     69         indices, scores = self.data.most_similar(vector, n)
     70         return [self.strings[idx] for idx in indices], scores

TypeError: most_similar() takes exactly 2 positional arguments (1 given)

So I understand that the most_similar method wants a float parameter followed by an int parameter. I thought the function would expect arguments similar to those of the most_similar method in gensim's word2vec implementation.

Could you please show me how to use the most_similar method in the sense2vec implementation?

henningpeters · 2016-02-23T09:29:59Z

This looks easy to fix. The function signature in your error report above tells you that the second parameter is n, not topn. As you can also see from the signature, float[:] vector cannot be a string; it's an array of floats, the vector for which you want to compute the n most similar entries.

Here's how you retrieve a vector for an entry and issue a most_similar() query:

freq, query_vector = vector_map[query]  # query is a unicode string
vector_map.most_similar(query_vector, 10)

The code is still a bit rough to use; this will change before we officially release it on PyPI. Also, we would love to hear about your use case. If you don't want to discuss this publicly, please get in contact with me at hp@spacy.io.
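The two-step pattern described above (look up the entry, then query with its vector) can be sketched without the library. ToyVectorMap below is a hypothetical stand-in written only for illustration; it is not the sense2vec VectorMap, and the keys, frequencies, and vectors are made up. It assumes only what the thread shows: indexing returns a (freq, vector) pair, and most_similar takes a vector and n positionally and returns (keys, scores).

```python
import math


class ToyVectorMap(object):
    """Hypothetical stand-in that mimics the calls discussed above; not sense2vec code."""

    def __init__(self):
        self._entries = {}  # key -> (freq, vector)

    def add(self, key, freq, vector):
        self._entries[key] = (freq, list(vector))

    def __getitem__(self, key):
        # Return a (freq, vector) pair, or raise KeyError for an unknown key.
        return self._entries[key]

    def most_similar(self, vector, n):
        # Rank all entries by cosine similarity to `vector` and keep the top n.
        scored = sorted(
            ((_cosine(vector, vec), key) for key, (_, vec) in self._entries.items()),
            reverse=True,
        )[:n]
        return [key for _, key in scored], [score for score, _ in scored]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vector_map = ToyVectorMap()
vector_map.add(u"beekeepers|NOUN", 120, [1.0, 0.0, 0.2])
vector_map.add(u"honey_bees|NOUN", 300, [0.9, 0.1, 0.3])
vector_map.add(u"education|NOUN", 5000, [0.0, 1.0, 0.0])

# Step 1: look up the entry to get its frequency and its vector.
freq, query_vector = vector_map[u"beekeepers|NOUN"]
# Step 2: pass the vector (not the string) and n as positional arguments.
keys, scores = vector_map.most_similar(query_vector, 2)
print(keys)
```

The query entry itself comes back as the closest match, which is consistent with the output quoted later in this thread.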

newterminator · 2016-02-23T19:00:54Z

Hi @henningpeters, thanks for responding to the issue.
I did as you suggested:
freq, query_vector = vector_map[unicode("education")]
and the error I received was:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-fdf21a28d7ad> in <module>()
----> 1 freq, query_vector = vector_map[unicode("education")]

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.__getitem__ (sense2vec/vectors.cpp:3002)()
     54         freq = self.freqs[hashed]
     55         if not freq:
---> 56             raise KeyError(string)
     57         else:
     58             i = self.strings[string]

KeyError: u'education'

So my two questions are:

1. What type of value does the 'freq' variable hold?
2. How do I resolve the above error?

I understand that the code is rough, and hence I highly appreciate your time in explaining the concepts and running of the code.

As for my use case, I don't have one at the moment. I am just learning the concepts of NLP, saw your sense2vec implementation and demo (which got me very excited), and wanted to play around with it.

henningpeters · 2016-02-23T19:12:23Z

It says that education isn't contained in vector_map. Please have a look at the load() function at https://github.com/spacy-io/sense2vec/blob/master/sense2vec/vectors.pyx#L097 and the files vectors.bin, strings.json and freqs.json to understand what's going on here.
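One thing worth ruling out when a lookup raises KeyError is that the query does not match the stored key format. The entries quoted elsewhere in this thread look like "beekeepers|NOUN", so a bare word may not be a key at all. The helper below is a hypothetical sketch, not part of sense2vec: candidate_keys is an invented name, and it assumes you already have the keys as a list of strings (how strings.json is laid out is not stated in this thread).

```python
def candidate_keys(word, keys):
    """Return the keys whose text before the first '|' equals `word`.

    `keys` is any iterable of entry strings such as "beekeepers|NOUN".
    """
    matches = []
    for key in keys:
        text = key.split("|", 1)[0]
        if text == word:
            matches.append(key)
    return matches


# Illustrative keys only; a real list would come from the model's own files.
keys = [u"beekeepers|NOUN", u"education|NOUN", u"Education|NOUN", u"honey_bees|NOUN"]
print(candidate_keys(u"education", keys))
```

The match is case-sensitive on purpose, since "Beekeepers|NOUN" and "beekeepers|NOUN" appear as separate entries in the output quoted in this thread.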

newterminator · 2016-02-23T21:50:57Z

Hi @henningpeters, thanks for the really quick reply. I checked out the load function of the VectorMap class. I made sure that the load function had the right path for the vectors.bin, strings.json and freqs.json files.
This is what I have:

import sputnik
from sense2vec import about
from sense2vec.vectors import VectorMap

package = sputnik.package(about.__title__, about.__version__, about.__default_model__)
vector_map = VectorMap(128)
vector_map.load(package.path)

freq, query_vector = vector_map[unicode("beekeepers|NOUN")]
vector_map.most_similar(query_vector, 10)

I even tried the entry "beekeepers|NOUN", which I found in the strings.json file, but still received the same error.

I have opened the files you mentioned and read the implementation of the VectorMap class in the file linked in your comment above, but I still cannot see what I am missing.

henningpeters · 2016-02-26T19:31:42Z

That seems strange; I cannot reproduce this behavior on my system (Linux, Atlas, Python 3.4). The output here is:

(['beekeepers|NOUN',
  'honey_bees|NOUN',
  'Beekeepers|NOUN', ...], <MemoryView of 'ndarray' at 0x7f2a1f05e398>)

Which platform are you on, and which BLAS library did you compile against?

newterminator · 2016-02-26T23:00:47Z

Hi @henningpeters, thank you for getting back on the issue. I use the following: Ubuntu 14.04, Atlas and Python 2.7.11.
I have been trying to resolve this error, searching on SO and elsewhere, but to no avail...
I verified that the freqs.json and strings.json files are getting loaded in vectors.pyx.

newterminator · 2016-03-03T21:18:12Z

Hey @henningpeters , I tried to train the vectors fresh on a Project Gutenberg eBook text file, so that I can check if they also produce the same error with the most_similar method. I ran into this error though:

(spacy) noname@noname-desktop:~/spacy/src/sense2vec/bin$ python merge_text.py "/home/noname/Documents/data/learn.txt" "/home/noname/Documents/data"
Traceback (most recent call last):
  File "merge_text.py", line 137, in <module>
    plac.call(main)
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/plac_core.py", line 309, in call
    cmd, result = parser_from(obj).consume(arglist)
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/plac_core.py", line 195, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "merge_text.py", line 133, in main
    parallelize(do_work, enumerate(jobs), n_workers, [out_dir])
  File "merge_text.py", line 44, in parallelize
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/joblib/parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/joblib/parallel.py", line 653, in dispatch_one_batch
    tasks = BatchedCalls(itertools.islice(iterator, batch_size))
  File "/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/joblib/parallel.py", line 68, in __init__
    self.items = list(iterator_slice)
  File "merge_text.py", line 44, in <genexpr>
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)
  File "merge_text.py", line 49, in iter_comments
    for i, line in enumerate(file_):
IOError: invalid data stream

I also wanted to ask: does merge_text.py take a plain text file as its input, or does the text file need to be pre-processed in a certain way?

newterminator · 2016-03-05T05:22:27Z

Hey @henningpeters, never mind the above comment. From looking at merge_text.py, I figured out that passing a plain text file as input was what caused the error. I will also try Python 3 to see whether the initial issue still exists, and re-open the issue if needed.

henningpeters · 2016-03-06T08:23:46Z

There was indeed an error within spaCy that broke compatibility with Python < 3.3.

honnibal · 2016-03-06T14:59:14Z

It's nice to accidentally support lots of versions, and I'd rather make changes to keep our code more general.

But we don't promise support for Python < 3.3, do we? Python 3.0 was basically unusable, and Python 3.1 was pretty terrible. Most libraries only support 3.3+, I think?

henningpeters · 2016-03-06T16:38:42Z

Maybe I was a bit unclear. Of course I meant primarily 2.7, but the change also fixes "accidentally" everything in between.

newterminator · 2016-03-07T23:12:45Z

Hi @henningpeters, thanks for clearing up the error in Python 2.7. I confirm that the load() and most_similar() functions work.

newterminator closed this as completed Mar 5, 2016

henningpeters added a commit to explosion/spaCy that referenced this issue Mar 6, 2016

hash_string() should not depend on python's internal unicode representation, also fixes explosion/sense2vec#5 for py2

b740f20
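The commit title above points at the general idea behind the fix: a string hash used for lookups should be computed from a fixed byte encoding, not from however the interpreter happens to store the string in memory. Before Python 3.3, that internal storage depended on the build (narrow or wide unicode), and Python 3.3 changed it again. The sketch below illustrates the principle only. stable_hash is a hypothetical name, and zlib.crc32 is a stand-in chosen because it ships with Python; it is not the hash function spaCy uses.

```python
import zlib


def stable_hash(text):
    """Hash the UTF-8 bytes of `text` rather than its in-memory representation.

    zlib.crc32 is only a stand-in here; it is not the hash function spaCy uses.
    """
    data = text.encode("utf8")
    # Mask to an unsigned 32-bit value so the result is the same on Python 2 and 3.
    return zlib.crc32(data) & 0xFFFFFFFF


key = u"beekeepers|NOUN"
print(stable_hash(key))
```

Because the input to the hash is always the same UTF-8 byte sequence, a key hashed when the data files were written should hash to the same value when they are read back under a different Python version.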

henningpeters added a commit that referenced this issue Mar 6, 2016

change freqs.json data format, fixes #5

804373c
