This repository has been archived by the owner on Apr 6, 2023. It is now read-only.

Some words not in model GoogleNews-vectors-negative300.bin #18

Closed
finin opened this issue Jun 1, 2015 · 16 comments

Comments

@finin

finin commented Jun 1, 2015

When I use word2vec to access the pre-trained model GoogleNews-vectors-negative300.bin, some words are reported as not being in the model. I've had the same problem on a 16GB Mac running OS 10.10.2 and on a large Linux machine. Here's a session on Linux:

$ python
Python 2.7.8 |Anaconda 2.1.0 (64-bit)| (default, Aug 21 2014, 18:22:21) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
>>> import word2vec
>>> w = word2vec
>>> m = w.load('GoogleNews-vectors-negative300.bin')
>>> m.vectors.shape
(3000000, 300)
>>> m['dog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "word2vec/wordvectors.py", line 45, in __getitem__
    return self.get_vector(word)
  File "word2vec/wordvectors.py", line 54, in get_vector
    idx = self.ix(word)
  File "word2vec/wordvectors.py", line 40, in ix
    raise KeyError('Word not in vocabulary')
KeyError: u'Word not in vocabulary'
>>> m['cat']
array([ 0.03357537,  0.05204856,  0.04530652,  0.05636346, -0.02170937,
       -0.00815787, -0.0528576 ,  0.02413651,  0.06094805,  0.02588944,
        ...
        0.03559798,  0.06715073,  0.00525879, -0.04476715,  0.03249664])
>>> 
@dkirkby

dkirkby commented Oct 11, 2016

I'm seeing the same thing. How was this resolved?

@sicotronic

sicotronic commented Oct 11, 2016

This is normally expected, as it is practically impossible to cover all the words of a given language during training. You have to decide how to handle the unknown words; some common approaches are:

  • Replace all the unknown words with the same token, for example "< UNK >", and train the model on that data so it learns an average distribution for out-of-vocabulary words.
  • Alternatively, use the model as it is and just assign a randomly initialized vector (with the same number of dimensions) to each unknown word; this can be further improved by drawing the random values from the range and distribution of the other words in the model (see the sketch below).
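A minimal sketch of the second approach, assuming the same word2vec package used in the original report (the helper get_vector_or_random and the cache are illustrative, not part of the package):

import numpy as np
import word2vec

m = word2vec.load('GoogleNews-vectors-negative300.bin')

# Cache one random vector per unknown word so repeated lookups are consistent.
_oov_cache = {}

def get_vector_or_random(model, word, scale=0.1):
    try:
        return model[word]                        # known word
    except KeyError:
        if word not in _oov_cache:
            dim = model.vectors.shape[1]          # 300 for this model
            _oov_cache[word] = np.random.uniform(-scale, scale, size=dim)
        return _oov_cache[word]

vec = get_vector_or_random(m, 'dog')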

@dkirkby

dkirkby commented Oct 11, 2016

Thanks, I had assumed that the ~1000 most common English words ("dog" is ranked 754 here) would inevitably be included in a 3,000,000-word vocabulary, but I don't know enough about how the vocabulary is selected from the input corpus. (Sorry that this is off topic for this repo.)

@sicotronic

You're welcome.
Yes, it depends on the corpus used for training. For example, if the model was trained only on hundreds of thousands of business emails, even with more than 3 million words of training data I doubt the word "dog" would appear frequently enough to be included in the model's vocabulary (usually the vocabulary is restricted to the top n, typically some thousands of, most frequent words to limit computation time and memory usage).

@dkirkby

dkirkby commented Oct 11, 2016

These Google weights were trained on 100 billion (!) words and have a 3-million-word vocab, so it's still surprising to me that a word like "dog" did not make the cut.

@sicotronic

Well, now that you mention it again, it is indeed surprising that "dog" is not included in a 3-million-word vocabulary, especially when the word "cat" is included...

@sicotronic

sicotronic commented Oct 11, 2016

OK, if you check the source code you can see that the maximum vocabulary hash size is 3 million (https://github.com/danielfrg/word2vec/blob/master/word2vec/c/word2vec.c#L27), but it seems that the vocabulary does not cover the whole hash table (there is a function called ReduceVocab that trims the vocabulary to only the most frequent words: https://github.com/danielfrg/word2vec/blob/master/word2vec/c/word2vec.c#L175). You should check the documentation of the pre-trained model, because I think the vocabulary size is one of the parameters set at training time.
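For what it's worth, you can check how many entries actually made it into the loaded vocabulary and test membership directly; a quick check assuming the same package as in the original session (m.vocab is the array of vocabulary words mentioned later in this thread):

import word2vec

m = word2vec.load('GoogleNews-vectors-negative300.bin')

print(m.vectors.shape[0])      # number of vector rows, 3000000 here
print(len(m.vocab))            # number of vocabulary entries, should match

for w in ('dog', 'Dog', 'cat'):
    print(w, w in m.vocab)     # membership test instead of catching KeyError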

@DucVuMinh

Hello @sicotronic
As you said: "use the model as it is and just assign a randomly initialized vector (with the same number of dimensions) to each unknown word, drawing the random values from the range and distribution of the other words in the model".
But can you explain why this works, and point me to some papers or examples? I'm confused about whether assigning a random vector makes any sense in a word2vec model, or whether it could sometimes do harm.

@jeLee6gi

jeLee6gi commented Nov 3, 2017

I don't know if we have the same problem, but I also noticed that common words were missing. Looking at m.vocab, it seems that the first character is missing from every word:

..., 'onductive_yarns', 'nrique_Tolentino', 'oronary_Interventions', 'nterface_NVMHCI', ...

Edit:
m = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True) loads the model fine, I guess I'll use that instead.
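For reference, the gensim workaround spelled out as a full snippet (this assumes gensim 3.x, where KeyedVectors exposes a vocab dict; newer gensim versions use key_to_index instead):

import gensim

path = 'GoogleNews-vectors-negative300.bin'
m = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)

print(m['dog'].shape)      # (300,) -- 'dog' resolves fine with this loader
print('dog' in m.vocab)    # True on gensim 3.x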

@sicotronic

sicotronic commented Nov 10, 2017

Hi @DucVuMinh

I'm sorry for the lack of rigor. The idea behind using a randomly initialized vector (with values drawn from the same distribution as the known words) for unknown words is that you get a point in the vector space that looks like a real observed word, so you can apply all the same distance calculations consistently with the known words, and you can also retrain the embeddings to fit your data, including a vector for your unknown words.
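A rough sketch of that idea, assuming the embedding matrix is available as m.vectors as in this package (oov_vector is an illustrative helper):

import numpy as np
import word2vec

m = word2vec.load('GoogleNews-vectors-negative300.bin')

def oov_vector(vectors):
    # Draw a vector from a normal distribution matching the per-dimension
    # mean and standard deviation of the known vectors.
    mu = vectors.mean(axis=0)
    sigma = vectors.std(axis=0)
    return np.random.normal(mu, sigma)

unk = oov_vector(m.vectors)   # vector for an out-of-vocabulary word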

I co-authored a paper at IJCAI 2017 (https://www.ijcai.org/proceedings/2017/573) where we used a similar idea when assigning vectors to words we wanted to replace. (Basically, we wanted to turn question sentences into something that looks like statements, where the vectors representing the wh-question words (who/when/where) were replaced by the vectors of the words most likely to make the sentence "similar" (under a given metric) to most of the answer sentences for each question type.)

Anyway, I think it is a fairly common trick to initialize the vectors for unknown words with random values (under the same distribution as the known words) and then fit them to your training dataset so they represent the averaged distribution of the unknown words in your data. I'm not sure exactly which papers present this idea, but the tip is shared almost everywhere I can remember; if you google it you will find several results (answers on stackexchange.com, blogs, other repositories). I just did that and found this comment by dennybritz:
dennybritz/cnn-text-classification-tf#10 (comment)

@DucVuMinh

@sicotronic
Oh, thank you for your support.
The material you provided and your comments are very useful.

@liu-zg15

I'm seeing the same thing. Has this been resolved?
I found that some common words are missing, like 'of', 'and', 'to', 'a'.
I think these words must be in the GoogleNews corpus, but I don't know why I can't find them.

@paige-pruitt

I am seeing the same issue as @liu-zg15. While it reports that words like 'a', 'to', and 'and' are not in the vocabulary, it has vectors for 'b', 'c', etc. This seems like it must be some sort of bug rather than a lack of vocab coverage... (however, it found vectors for both 'dog' and 'cat', unlike the earlier commenter).

@rhlbns

rhlbns commented Aug 29, 2018

I am also facing the same issue: I could not find words like 'a', 'to' and 'of', but it appears that the corresponding words starting with an uppercase letter, 'A', 'To' and 'Of', are available.
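If the lowercase forms really are absent, a simple fallback is to try a few capitalization variants before giving up; a sketch with this package (lookup is an illustrative helper, not part of the package):

import word2vec

m = word2vec.load('GoogleNews-vectors-negative300.bin')

def lookup(model, word):
    # Try the word as given, then common capitalization variants.
    for candidate in (word, word.capitalize(), word.upper(), word.lower()):
        try:
            return model[candidate]
        except KeyError:
            continue
    raise KeyError(word)

vec = lookup(m, 'of')   # falls back to 'Of' if 'of' is missing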

@ValeryRybakov

So-called 'stop words' such as articles, particles and prepositions are eliminated in most w2v models: they account for a large share of the tokens (often around half), so they take a lot of memory, and since they have no independent meaning they are useless in this sense.

@Akashtyagi

I am able to get vectors for m['DOG'] and m['CAT'] when used in uppercase. It's weird that my model only accepts uppercase words.

I am using the pretrained GoogleNews-negative300 model.
