Some embeddings are invalid (majority of vectors is inf or nan) #6

Closed
leezu opened this issue Dec 21, 2017 · 5 comments

leezu commented Dec 21, 2017

Firstly, thanks for your efforts in providing the pretrained embeddings.

Unfortunately, some of the embeddings were not trained correctly. For instance, the English d300 embeddings for the 10k model contain 9640 vectors with inf entries out of 10817 total. It would be great if you could provide your training script, or double-check it and upload fixed vectors.
The d100 embeddings that you use in the Readme are indeed fine and do not contain any inf values.
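For reference, a minimal sketch of this kind of check (assuming gensim 4.x, where the loaded matrix is exposed as `.vectors`; the path below is a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; substitute the d300 file you want to check.
kv = KeyedVectors.load_word2vec_format("en.wiki.bpe.op10000.d300.w2v.txt")

# Rows containing at least one inf or nan entry.
bad = ~np.isfinite(kv.vectors).all(axis=1)
print(f"{bad.sum()} of {len(kv.vectors)} vectors contain inf/nan entries")
```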

Furthermore, while debugging this issue I noticed that the embeddings contain Chinese characters at the following indices: [10345, 10451, 10458, 10475, 10514, 10531, 10539, 10541, 10601, 10606, 10609, 10622, 10627, 10632, 10633, 10638, 10657, 10702, 10740, 10750, 10755, 10756, 10762, 10781, 10790, 10791, 10802, 10809, 10810, 10815]. Perhaps it would be sensible to filter out sentences containing Chinese characters from the training corpus?
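A rough sketch of one way to locate such tokens (again assuming gensim 4.x, whose vocabulary list is exposed as `index_to_key`; this only covers the CJK Unified Ideographs block):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("en.wiki.bpe.op10000.d300.w2v.txt")  # placeholder path

def contains_cjk(token):
    # Only the CJK Unified Ideographs block; other CJK ranges could be added.
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in token)

cjk_indices = [i for i, tok in enumerate(kv.index_to_key) if contains_cjk(tok)]
print(cjk_indices)
```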

bheinzerling commented Jan 8, 2018

Thanks for reporting the inf entries. I didn't come across those in my evaluation, since there didn't seem to be any performance difference between the d200 and d300 embeddings I looked at, so I mostly limited the evaluation to dim <= 200.

That said, I have no idea where the -inf values come from. They (or, to be more exact, very small negative values that gensim turns into -inf) are already present in the GloVe output (before conversion to word2vec format), so my best guess is some numerical instability in GloVe. Did you encounter this issue only with the d300 embeddings?

I'm planning to retrain all embeddings in order to address some of the other issues that have been raised; hopefully that will let me find out what's happening.

Regarding Chinese characters, that's indeed a problem of the training corpus (Wikipedia). Most, if not all, Wikipedia editions contain "foreign" characters that do not occur in the native character inventory of their language.

I don't see a good way of filtering out those "foreign" characters, since it's not only Chinese characters in English Wikipedia, but German umlauts, French accents, cuneiform in articles about ancient Syria, etc.

An approach based on frequency (something like "delete sentences containing rare characters") might work for languages that use a small character inventory (e.g. European languages using the Latin alphabet), but it would probably delete many native characters from languages with a large character inventory. For example, the Chinese Wikipedia contains many Latin characters that are more frequent than some of the rarer Chinese characters.
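Just to make the trade-off concrete, a hypothetical sketch of that frequency heuristic (not something used in the actual pipeline):

```python
from collections import Counter

def drop_sentences_with_rare_chars(sentences, min_char_freq=100):
    """Drop every sentence containing a character rarer than min_char_freq.

    Tolerable for small alphabets, but for languages with large character
    inventories (e.g. Chinese) this also removes sentences that merely
    contain legitimate rare ideographs.
    """
    char_freq = Counter(ch for sent in sentences for ch in sent)
    return [s for s in sentences
            if all(char_freq[ch] >= min_char_freq for ch in s)]
```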

jbingel commented Feb 20, 2018

Hi @bheinzerling, I think the inf values may stem from bad formatting in the string-formatted files: in many rows the float numbers appear to be formatted incorrectly and are therefore extremely large or small, so they are cast to inf (or -inf) when read with gensim. See, for instance, line 2 in en.wiki.bpe.op25000.d300.w2v.txt:

▁the 77886731983343865551525467241822275705322262629385835727655310147672217916213924049715292248127516478049295593733435890305060133781727877705920968924703204172031998750831607167150978431853908720377688377359429221796730229030811730868041989137637292326094939917553014290510380007424.000000 -8815719088942410927595216684135846806679085608351504742380150361050767799483375182712544617842629875794353491481573647659294516250794376484211844788084607114616380493081365034768534741466933964707557414081551607706475695494920408336667413770905...
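Those coefficients are far outside the float32 range (max ≈ 3.4e38), and gensim loads vectors as float32 by default, which is why they come out as inf/-inf. A quick illustration (the value below is just a stand-in, not the exact coefficient):

```python
import numpy as np

huge = 1e280              # stand-in far beyond the float32 range, like the coefficients above
print(huge)               # still finite as a Python float (double precision)
print(np.float32(huge))   # inf -- exceeds the float32 maximum of ~3.4e38
```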

@AndersonHappens

I ran into the same problem today with the size-300 vectors. In the meantime, while a fix isn't out, you should either remove the download links or put a big, bold note in the README saying that the size-300 vectors are invalid.

@bheinzerling

This is (finally) fixed in the latest version.

@emanuelevivoli

I encountered a similar problem with the 300-dimensional vectors.
However, in my case the nan vector was caused by the presence of the token '', whose 300-dimensional vector ended up being all nan.
