Some embeddings are invalid (majority of vectors is inf or nan) #6

Closed
leezu opened this issue Dec 21, 2017 · 5 comments

leezu commented Dec 21, 2017

Firstly, thanks for your efforts in providing the pretrained embeddings.

Unfortunately, some of the embeddings were not trained correctly. For instance, the English d300 embeddings for the 10k model contain 9640 vectors with inf entries out of 10817 total. It would be great if you could provide your training script, or double-check it and upload fixed vectors.
The d100 embeddings that you use in the Readme are indeed fine and do not contain any inf values.
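For reference, a minimal sketch of this kind of check (assuming gensim 4.x, where the loaded matrix is exposed as `.vectors`; the path below is a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; substitute the d300 file you want to check.
kv = KeyedVectors.load_word2vec_format("en.wiki.bpe.op10000.d300.w2v.txt")

# Rows containing at least one inf or nan entry.
bad = ~np.isfinite(kv.vectors).all(axis=1)
print(f"{bad.sum()} of {len(kv.vectors)} vectors contain inf/nan entries")
```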

Furthermore, while debugging this issue I noticed that the embeddings contain Chinese characters at the following indices: [10345, 10451, 10458, 10475, 10514, 10531, 10539, 10541, 10601, 10606, 10609, 10622, 10627, 10632, 10633, 10638, 10657, 10702, 10740, 10750, 10755, 10756, 10762, 10781, 10790, 10791, 10802, 10809, 10810, 10815]. Perhaps it would be sensible to filter out sentences containing Chinese characters from the training corpus?
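A rough sketch of one way to locate such tokens (again assuming gensim 4.x, whose vocabulary list is exposed as `index_to_key`; this only covers the CJK Unified Ideographs block):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("en.wiki.bpe.op10000.d300.w2v.txt")  # placeholder path

def contains_cjk(token):
    # Only the CJK Unified Ideographs block; other CJK ranges could be added.
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in token)

cjk_indices = [i for i, tok in enumerate(kv.index_to_key) if contains_cjk(tok)]
print(cjk_indices)
```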

bheinzerling commented Jan 8, 2018

Thanks for reporting the inf entries. I didn't come across those in my evaluation, since there didn't seem to be any performance difference between the d200 and d300 embeddings I looked at, so I mostly limited the evaluation to dim <= 200.

That said, I have no idea where the -inf values come from. They (or, to be more exact, very small negative values that gensim turns into -inf) are already present in the GloVe output (before conversion to word2vec format), so my best guess is some numerical instability in GloVe. Did you encounter this issue only with the d300 embeddings?

I'm planning to retrain all embeddings in order to address some of the other issues that have been raised; hopefully that will let me find out what's happening.

Regarding Chinese characters, that's indeed a problem of the training corpus (Wikipedia). Most, if not all, Wikipedia editions contain "foreign" characters that do not occur in the native character inventory of their language.

I don't see a good way of filtering out those "foreign" characters, since it's not only Chinese characters in English Wikipedia, but German umlauts, French accents, cuneiform in articles about ancient Syria, etc.

An approach based on frequency (something like "delete sentences containing rare characters") might work for languages that use a small character inventory (e.g. European languages using the Latin alphabet), but it would probably delete many native characters from languages with a large character inventory. For example, the Chinese Wikipedia contains many Latin characters that are more frequent than some of the rarer Chinese characters.
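Just to make the trade-off concrete, a hypothetical sketch of that frequency heuristic (not something used in the actual pipeline):

```python
from collections import Counter

def drop_sentences_with_rare_chars(sentences, min_char_freq=100):
    """Drop every sentence containing a character rarer than min_char_freq.

    Tolerable for small alphabets, but for languages with large character
    inventories (e.g. Chinese) this also removes sentences that merely
    contain legitimate rare ideographs.
    """
    char_freq = Counter(ch for sent in sentences for ch in sent)
    return [s for s in sentences
            if all(char_freq[ch] >= min_char_freq for ch in s)]
```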

jbingel commented Feb 20, 2018

Hi @bheinzerling, I think the inf values may stem from bad formatting in the string-formatted files: in many rows the float numbers appear to be formatted incorrectly and are therefore extremely large or small, so they are cast to inf (or -inf) when read with gensim. See, for instance, line 2 in en.wiki.bpe.op25000.d300.w2v.txt:

▁the 77886731983343865551525467241822275705322262629385835727655310147672217916213924049715292248127516478049295593733435890305060133781727877705920968924703204172031998750831607167150978431853908720377688377359429221796730229030811730868041989137637292326094939917553014290510380007424.000000 -8815719088942410927595216684135846806679085608351504742380150361050767799483375182712544617842629875794353491481573647659294516250794376484211844788084607114616380493081365034768534741466933964707557414081551607706475695494920408336667413770905...
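Those coefficients are far outside the float32 range (max ≈ 3.4e38), and gensim loads vectors as float32 by default, which is why they come out as inf/-inf. A quick illustration (the value below is just a stand-in, not the exact coefficient):

```python
import numpy as np

huge = 1e280              # stand-in far beyond the float32 range, like the coefficients above
print(huge)               # still finite as a Python float (double precision)
print(np.float32(huge))   # inf -- exceeds the float32 maximum of ~3.4e38
```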

@AndersonHappens

I ran into the same problem today with the size-300 vectors. In the meantime, while a fix isn't out, you should either remove the download links or put a big, bold note in the README saying that the size-300 vectors are invalid.

@bheinzerling

This is (finally) fixed in the latest version.

@emanuelevivoli

I encountered a similar problem with the 300-dimensional vectors.
However, in my case the nan vector was caused by the presence of the token '', whose 300-dimensional vector ended up being all nan.
