Vocab length != word vector count #7

Closed
tocab opened this issue Jan 23, 2018 · 5 comments

Comments

@tocab

tocab commented Jan 23, 2018

Hey, there is an inconsistency between the number of entries in the vocabulary file of the English word vectors for 25000 BPE merges and the number of word vectors in the 200-dimensional model. The vocab file has 25000 entries, but the .bin file contains 25777 word vectors after loading it into gensim.

Additionally, the order of the word vectors differs from the vocab file. For example, the word "▁explanation" is at position 19387 in the vocab file, but at position 9138 in the .bin file.

It would be very helpful if all files listed the vocabulary in the same order and contained the same number of entries.
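For anyone who wants to check their own download, here is a minimal sketch of the comparison described above. It is not part of BPEmb; `compare_vocab` and the toy token lists are made up for illustration. With gensim, the second list would come from the loaded model's index-to-word list (the attribute is `index2word` in older gensim releases and `index_to_key` in gensim 4.x), and the first list from the vocab file, assuming one token per line.

```python
def compare_vocab(vocab_words, vector_words):
    """Compare token lists from the vocab file and from the embedding file."""
    vocab_pos = {w: i for i, w in enumerate(vocab_words)}
    vector_pos = {w: i for i, w in enumerate(vector_words)}

    # Tokens listed in the vocab file that have no vector.
    missing = [w for w in vocab_words if w not in vector_pos]
    # Tokens that have a vector but are not listed in the vocab file.
    extra = [w for w in vector_words if w not in vocab_pos]
    # Tokens present in both files, but at different positions.
    moved = {
        w: (vocab_pos[w], vector_pos[w])
        for w in vocab_words
        if w in vector_pos and vocab_pos[w] != vector_pos[w]
    }
    return {
        "vocab_count": len(vocab_words),
        "vector_count": len(vector_words),
        "missing": missing,
        "extra": extra,
        "moved": moved,
    }


# Toy data standing in for the real files (hypothetical, not the actual contents).
toy_vocab = ["<s>", "</s>", "▁the", "▁explanation", "▁distric"]
toy_vector_words = ["▁the", "▁district", "▁explanation"]

report = compare_vocab(toy_vocab, toy_vector_words)
print(report["vocab_count"], report["vector_count"])
print("missing:", report["missing"])
print("extra:", report["extra"])
print("moved:", report["moved"])
```

The `moved` mapping gives (vocab position, vector position) pairs, which is the kind of mismatch reported for "▁explanation".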

@tocab
Author

tocab commented Jan 23, 2018

As a workaround, I iterated over the words in the vocab file and looked up each vector by its word instead of by its index. During this process, the following words were not found in the word vectors:

<s>, </s>, ▁distric, ptember, bruary, ▁performan, orporated, ▁headqu, ▁attem, ▁mathem, ▁passeng, uguese, ▁azerbai, ▁compris, urday, ▁emplo, ▁portra, ▁thous, ▁lithu, ▁leban, ▁councill, ▁specim, ▁molec, ▁entrepren, ▁predecess, ▁glouc, ▁earthqu, ▁istan, imination, ▁infloresc, ▁ingred, chiidae, ▁sofl, ürttemberg, ▁practition, echua, eteries, bridgeshire, ▁nudi, rzys, tokrzys, uchestan, ▁taekw, kopol, giluyeh, ▁fute, ivisie, marthen, ▁gillesp, aziland, scray, alandhar, azulu, alisco
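A rough sketch of that workaround, for reference. `align_to_vocab` is a hypothetical helper, and filling missing rows with zeros is an assumption made here, not something the embeddings prescribe. The `vectors` argument only needs to support `word in vectors` and `vectors[word]`, which a plain dict does and, as far as I know, gensim's `KeyedVectors` does as well.

```python
def align_to_vocab(vocab_words, vectors, dim):
    """Build one row per vocab entry, in vocab-file order, looking vectors up by word.

    Words without a vector get a zero row (an assumption) and are reported back.
    """
    rows, missing = [], []
    for word in vocab_words:
        if word in vectors:
            rows.append(list(vectors[word]))
        else:
            missing.append(word)
            rows.append([0.0] * dim)
    return rows, missing


# Toy stand-ins for the real vocab file and embedding model (hypothetical values).
toy_vectors = {"▁the": [0.1, 0.2], "▁explanation": [0.3, 0.4]}
toy_vocab_order = ["<s>", "▁explanation", "▁the"]

rows, missing = align_to_vocab(toy_vocab_order, toy_vectors, dim=2)
print(len(rows), "rows aligned to vocab order")
print("not found:", missing)
```

After this step, row `i` corresponds to line `i` of the vocab file, so indices from the vocab file can be used directly.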

@solomatov

I have exactly the same problem. Is there any chance that it will be fixed?

@bheinzerling
Owner

bheinzerling commented Jul 26, 2018

I've almost finished the next version of BPEmb, in which this issue will be fixed. I'm currently figuring out where to host it, since it is a bit larger and doesn't fit on the current web space.

@solomatov

Thank you for your work! You save everybody a lot of time and resources by providing us with pretrained models.

@bheinzerling
Owner

This is (finally) fixed in the latest version.


3 participants