
adding special tokens to a BPEmb model #53

Closed
tannonk opened this issue Feb 12, 2021 · 7 comments
tannonk commented Feb 12, 2021

Hi,

Thanks for this excellent resource! I've been using BPEmb in my models since learning about it recently and have found the embeddings to work quite well. I'm currently trying to figure out how to use them most effectively with my data, which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc.

This might be an obvious question, but can you think of a way to extend the vocabulary of a pre-trained sentencepiece model by adding special tokens, or is this out of the question? If it isn't possible, perhaps future iterations of BPEmb could allow for a certain number of arbitrary special tokens.

Thanks in advance!

@bheinzerling (Owner)

Hi,

Unfortunately, adding special tokens to a pretrained sentencepiece model isn't supported by sentencepiece. It's possible to specify user-defined symbols (https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md), but that requires training a sentencepiece model and corresponding embeddings from scratch. This repository shows how to train your own models and embeddings: https://github.com/stephantul/piecelearn


tannonk commented Feb 16, 2021

Ok, thanks for the clarification!

tannonk closed this as completed Feb 16, 2021

tannonk commented Feb 17, 2021

Hey,

Sorry to bother you with this once again, but I figured out how to extend a pretrained sentencepiece model with user-defined symbols following the hints at google/sentencepiece#426.

e.g.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

model_file = 'en.wiki.bpe.vs10000.model'
symbols = ['<endtitle>', '<name>', '<url>', '<digit>', '<email>', '<loc>', '<greeting>', '<salutation>', '[CLS]', '[SEP]']

mp = sp_model.ModelProto()
with open(model_file, 'rb') as f:
    mp.ParseFromString(f.read())

print(f'Original model pieces: {len(mp.pieces)}')

for i, sym in enumerate(symbols, 1):
    new_sym = mp.SentencePiece()
    new_sym.piece = sym
    new_sym.score = 0.0  # default score for USER_DEFINED pieces
    new_sym.type = 4     # 4 = USER_DEFINED in the ModelProto schema
    mp.pieces.insert(2 + i, new_sym)  # insert after the default control symbols "<unk>", "<s>", "</s>" (indices 0-2)
    print(f'added {new_sym}...')

print(f'New model pieces: {len(mp.pieces)}')

outfile = 'en.ext.wiki.bpe.vs10000.model'

with open(outfile, 'wb') as f:
    f.write(mp.SerializeToString())

The newly extended sentencepiece model encodes a string containing special tokens as expected:

sp = spm.SentencePieceProcessor(model_file='en.ext.wiki.bpe.vs10000.model')
sp.encode_as_pieces('[CLS] this is a test <name> <digit> .')
>>> ['▁', '[CLS]', '▁this', '▁is', '▁a', '▁test', '▁', '<name>', '▁', '<digit>', '▁.']

Depending on how the embedding models are saved, it would then be possible to load a BPEmb model with Gensim and continue training on a small in-domain corpus. But from what I can tell, the models available at https://nlp.h-its.org/bpemb/#download do not support continued training. Are the full models available anywhere?

I just came across #36, I suppose that answers it!

Thanks!

@bheinzerling (Owner)

Thanks for posting this, good to know!

@leminhyen2

@tannonk What is the mp module that you use here to call mp.ParseFromString and mp.SentencePiece()?


tannonk commented Oct 7, 2021

@leminhyen2, good point — the mp here is SentencePiece's ModelProto() object. Check out this post from the original author, which uses it simply as m: google/sentencepiece#121 (comment)

@leminhyen2

@tannonk Thank you, that was very helpful
