Some embeddings are invalid (majority of vectors is inf or nan) #6
Comments
Thanks for reporting the inf entries. I didn't come across those in my evaluation, since there didn't seem to be any performance difference between the d200 and d300 embeddings I looked at, so I mostly limited the evaluation to dim <= 200. That said, I have no idea where the inf values come from. I'm planning to retrain all embeddings in order to address some of the other issues that have been raised; hopefully this will allow me to find out what's happening.

Regarding Chinese characters, that's indeed a problem of the training corpus (Wikipedia). Most, if not all, Wikipedia editions contain "foreign" characters that do not occur in the native character inventory of their language. I don't see a good way of filtering out those "foreign" characters, since it's not only Chinese characters in English Wikipedia, but also German umlauts, French accents, cuneiform in articles about ancient Syria, etc. An approach based on frequency (something like "delete sentences containing rare characters") might work for languages that use a small character inventory (e.g. European languages using the Latin alphabet), but would probably delete many native characters from languages with a large character inventory. For example, the Chinese Wikipedia contains many Latin characters that are more frequent than some of the rarer Chinese characters.
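For illustration, a minimal sketch of that frequency-based heuristic (the function name and the `min_char_count` threshold are made-up examples, not code from the actual training pipeline):

```python
from collections import Counter

def drop_rare_char_sentences(sentences, min_char_count=100):
    """Delete sentences containing characters that occur fewer than
    min_char_count times in the whole corpus (threshold is illustrative)."""
    # Count every character across the corpus.
    counts = Counter()
    for sent in sentences:
        counts.update(sent)
    # Keep only sentences whose characters are all frequent enough.
    return [s for s in sentences
            if all(counts[ch] >= min_char_count for ch in s)]
```

As noted above, any global threshold like this would misfire on large character inventories: rare but perfectly native Chinese characters would fall below it.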
Hi @bheinzerling, I think the
I ran into this same problem today in the size-300 vectors. In the meantime, while a fix isn't out, you should either remove the download links or add a big bold note in the readme saying that the size-300 vectors are invalid.
This is (finally) fixed in the latest version.
I encountered a similar problem with the size-300 vectors.
Original issue

Firstly, thanks for your efforts in providing the pretrained embeddings.
Unfortunately, some of the embeddings are not trained correctly. For instance, the d300 embeddings for the English 10k model contain 9640 vectors with inf entries, out of 10817 in total. It would be great if you could provide your training script, or double-check it and upload fixed vectors.
The d100 embeddings that you use in the Readme are indeed fine and do not contain any inf values.
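For reproducibility, here is roughly how the invalid vectors can be counted (a minimal sketch; the file name is an example for the English 10k d300 model, and it assumes the embeddings load with gensim's word2vec binary format):

```python
import numpy as np
from gensim.models import KeyedVectors

# Example path; substitute the embedding file you downloaded.
emb = KeyedVectors.load_word2vec_format(
    "en.wiki.bpe.op10000.d300.w2v.bin", binary=True)

vecs = emb.vectors                    # shape: (vocab_size, dim)
bad = ~np.isfinite(vecs).all(axis=1)  # rows containing inf or nan
print(f"{bad.sum()} invalid vectors out of {len(vecs)}")
```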
Furthermore, while debugging this issue I noticed that the vocabulary contains Chinese characters at the following indices:

[10345, 10451, 10458, 10475, 10514, 10531, 10539, 10541, 10601, 10606, 10609, 10622, 10627, 10632, 10633, 10638, 10657, 10702, 10740, 10750, 10755, 10756, 10762, 10781, 10790, 10791, 10802, 10809, 10810, 10815]

Perhaps it would be sensible to filter out sentences containing Chinese characters from the training corpus?
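For reference, a sketch of how such indices can be located (assumptions: the example file name from the snippet above, only the CJK Unified Ideographs block is checked, and gensim 4's `index_to_key` is available):

```python
from gensim.models import KeyedVectors

def has_cjk(token):
    # CJK Unified Ideographs block only; a rough heuristic that ignores
    # extension blocks and other Han ranges.
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)

# Example path; substitute the embedding file you downloaded.
emb = KeyedVectors.load_word2vec_format(
    "en.wiki.bpe.op10000.d300.w2v.bin", binary=True)

cjk_indices = [i for i, tok in enumerate(emb.index_to_key) if has_cjk(tok)]
print(cjk_indices)
```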