Symbols don't match between model and embedding #2

Closed
noe opened this issue Oct 18, 2017 · 6 comments

noe commented Oct 18, 2017

It seems that the symbols from the sentencepiece model and the associated embeddings are not the same; specifically, control symbols are not present in the embedding. For instance, de.wiki.bpe.op50000.model contains the symbols <s>, </s> and <unk>, but in the associated embedding de.wiki.bpe.vs50000.d300.vectors.bin only <unk> is defined.

Update: further exploration reveals that the number of symbols common to the two files above is 49,631.
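A minimal sketch of how one could check this overlap, assuming the .bin file is in word2vec binary format readable by gensim; the file names are the ones mentioned above, everything else is illustrative:

```python
# Sketch only: compare the sentencepiece vocabulary with the embedding vocabulary.
# Assumes the .bin file can be loaded with gensim's word2vec binary reader.
import sentencepiece as spm
from gensim.models import KeyedVectors

sp = spm.SentencePieceProcessor()
sp.Load("de.wiki.bpe.op50000.model")
model_pieces = {sp.IdToPiece(i) for i in range(sp.GetPieceSize())}

vectors = KeyedVectors.load_word2vec_format(
    "de.wiki.bpe.vs50000.d300.vectors.bin", binary=True)
embedding_pieces = set(vectors.index_to_key)  # gensim 4.x attribute name

# Per the report above, the intersection comes out to 49,631 pieces.
print(len(model_pieces), len(embedding_pieces), len(model_pieces & embedding_pieces))
```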

bheinzerling (Owner) commented:

<s>, </s> and <unk> are generated automatically by sentencepiece. <s> and </s> do not occur in the texts I used for training the embeddings, since I didn't apply any preprocessing other than the preproc_text.sh script. That's why there are no embeddings for <s> and </s>.

Do you need <s> and </s> for your specific application?

noe commented Oct 18, 2017

Any NMT system is likely to need </s> to signal the end of the sentence in the translation.

noe commented Oct 18, 2017

Also, if tokenization is performed with sentencepiece and the output is then looked up in the embeddings from the .bin file (as suggested in the Readme), the symbol mismatch means there will be unknowns that shouldn't appear, right?

bheinzerling commented Oct 18, 2017

Yes, there will be unknowns, because I set the GloVe minimum token frequency to 100 for high-resource languages, and to 10 for low-resource languages, since you probably won't get good embeddings for tokens that occur too rarely.

I know this is not ideal, but so far I haven't found a good solution. In the original BPE implementation (subword-nmt), you can set a minimum symbol frequency, meaning that no BPE symbols with a lower frequency will be generated; in that case there wouldn't be any unknowns.

I haven't found a similar option in sentencepiece, but I have to use sentencepiece because it is so much faster.

Regarding <s>: Reading the docs again, it seems sentencepiece can add sentence markers automatically, but so far I haven't used that option since I didn't need it for my use case. Let me look into how well this works across languages.
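As a practical workaround for the frequency cutoff, one can look embeddings up by piece string and fall back to the <unk> vector for pieces that were dropped. This is a minimal sketch, not from this thread, reusing the `sp` and `vectors` objects from the snippet above:

```python
import numpy as np

def embed_sentence(sentence):
    # Encode to subword pieces, then map each piece to its vector,
    # substituting the <unk> vector for pieces without an embedding.
    pieces = sp.EncodeAsPieces(sentence)
    unk_vec = vectors["<unk>"]
    return np.stack([vectors[p] if p in vectors else unk_vec for p in pieces])

print(embed_sentence("Das ist ein Test").shape)  # (number_of_pieces, 300)
```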

noe commented Oct 18, 2017

Thanks @bheinzerling. Also, I would suggest adding a warning to the Readme about the difference in tokens. It would save time for anyone thinking about using the sentencepiece Python wrapper and then using the embeddings directly with the token IDs generated by sentencepiece.SentencePieceProcessor.EncodeAsIds.
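To make the pitfall concrete: the IDs from EncodeAsIds index the sentencepiece vocabulary, which here is larger than the embedding vocabulary, so they cannot be used directly as row indices into the embedding matrix. A minimal sketch of the check, reusing the names from the snippets above (nothing here is from the repository):

```python
ids = sp.EncodeAsIds("Das ist ein Test")
for i in ids:
    piece = sp.IdToPiece(i)
    if piece not in vectors:
        # The sentencepiece ID exists, but there is no matching embedding,
        # so a direct ID-to-row lookup would not line up with the vectors.
        print(f"piece {piece!r} (id {i}) has no embedding; map it to <unk> instead")
```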

bheinzerling (Owner) commented:

This is (finally) fixed in the latest version.
