adding special tokens to a BPEmb model #53
Comments
Hi, Unfortunately, adding special tokens to a pretrained sentencepiece model isn't supported by sentencepiece. It's possible to specify user-defined symbols (https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md), but that requires training a sentencepiece model and corresponding embeddings from scratch. This repository shows how to train your own models and embeddings: https://github.com/stephantul/piecelearn
Ok, thanks for the clarification!
Hey, Sorry to bother you with this once again, but I figured out how to extend a pretrained sentencepiece model with user-defined symbols following the hints at google/sentencepiece#426. e.g.
The newly extended sentencepiece model encodes a string containing special tokens as expected:
I just came across #36, I suppose that answers it! Thanks!
Thanks for posting this, good to know!
@tannonk What is the mp module that you use here to call mp.ParseFromString and mp.SentencePiece()? |
@leminhyen2, good point, that part is missing from the code snippet above. The mp object is a ModelProto instance from the sentencepiece_model_pb2 protobuf module bundled with sentencepiece.
@tannonk Thank you, that was very helpful |
Hi,
Thanks for this excellent resource! I've been using BPEmb in my models since learning about it recently and have found it to work quite well. I'm currently trying to figure out how to use it most effectively with my data, which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc. This might be an obvious question, but can you think of a way to extend the vocabulary by adding special tokens to a pre-trained sentencepiece model, or is this out of the question? If so, perhaps it would be possible to allow for a certain number of arbitrary special tokens in future iterations of BPEmb.
Thanks in advance!