Symbols don't match between model and embedding #2
Comments
Do you need …
Any NMT is likely to need …
Also, if tokenization is performed with sentencepiece and the output is then looked up in the embeddings in the bin file (as suggested in the README), then, since the symbols don't match, there will be unknowns that shouldn't appear, won't there?
Yes, there will be unknowns, because I set the GloVe minimum token frequency to 100 for high-resource languages and to 10 for low-resource languages, since you probably won't get good embeddings for tokens that occur too rarely. I know this is not ideal, but so far I haven't found a good solution for it. The original BPE implementation (subword-nmt) lets you set a minimum symbol frequency, meaning that no BPE symbols with lower frequency will be generated; in that case there wouldn't be any unknowns. I haven't found a similar option in sentencepiece, but I have to use sentencepiece because it is so much faster. Regarding …
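The practical consequence of the frequency cutoff described above is that some pieces produced by the sentencepiece model have no row in the embedding file, so a lookup needs a fallback. A minimal sketch of such a fallback (the toy `vectors` dictionary and the `lookup` helper are illustrative assumptions, not BPEmb's API; real vectors would be loaded from the .bin file):

```python
import numpy as np

def lookup(piece, vectors, unk="<unk>"):
    """Return the embedding for a piece, falling back to the <unk>
    vector when the piece was filtered out by the frequency threshold."""
    return vectors.get(piece, vectors[unk])

# Toy vectors; real ones would come from the .bin embedding file.
vectors = {"<unk>": np.zeros(3), "▁der": np.ones(3)}

assert np.array_equal(lookup("▁der", vectors), np.ones(3))
# A rare piece absent from the embeddings maps to the <unk> vector:
assert np.array_equal(lookup("▁xyz", vectors), np.zeros(3))
```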
Thanks @bheinzerling. Also, I would suggest adding a warning to the README about the difference in tokens. This would save some time for someone who is thinking about using the sentencepiece Python wrapper and then using the embeddings right away with the token IDs generated by …
This is (finally) fixed in the latest version. |
It seems that the symbols from the sentencepiece model and the associated embeddings are not the same; specifically, control symbols are not present in the embeddings. For instance, in
de.wiki.bpe.op50000.model
there are the symbols <s>, </s> and <unk>, but in the associated embedding file
de.wiki.bpe.vs50000.d300.vectors.bin
only <unk> is defined.
Update: Further exploration reveals that the number of common symbols between the aforementioned two files is 49631.
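The overlap count reported above can be checked with a short helper; a minimal sketch with toy vocabularies (with the real files, model_vocab could be built from the sentencepiece model via id_to_piece and emb_vocab from the embedding file's keys, but those loading steps are assumptions, not part of this report):

```python
def common_symbols(model_vocab, emb_vocab):
    """Symbols present both in the sentencepiece model and in the embeddings."""
    return set(model_vocab) & set(emb_vocab)

# Toy vocabularies mirroring the mismatch described above:
model_vocab = ["<s>", "</s>", "<unk>", "▁der", "▁die"]
emb_vocab = ["<unk>", "▁der", "▁die"]

shared = common_symbols(model_vocab, emb_vocab)
print(len(shared))  # → 3: the control symbols <s> and </s> are missing
```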