
adding special tokens to a BPEmb model #53

Closed
tannonk opened this issue Feb 12, 2021 · 7 comments
tannonk commented Feb 12, 2021

Hi,

Thanks for this excellent resource! I've been using BPEmb in my models since learning about it recently and have found the embeddings to work quite well. I'm currently trying to figure out how to use them most effectively with my data, which is pre-processed with certain masking tokens for privacy, e.g. <name>, <digit>, etc.

This might be an obvious question, but can you think of a way to extend the vocabulary of a pre-trained sentencepiece model by adding special tokens, or is this out of the question? If it isn't possible, perhaps future iterations of BPEmb could allow for a certain number of arbitrary special tokens.

Thanks in advance!

@bheinzerling (Owner)

Hi,

Unfortunately, adding special tokens to a pretrained sentencepiece model isn't supported by sentencepiece. It's possible to specify user-defined symbols (https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md), but that requires training a sentencepiece model and corresponding embeddings from scratch. This repository shows how to train your own models and embeddings: https://github.com/stephantul/piecelearn


tannonk commented Feb 16, 2021

Ok, thanks for the clarification!

tannonk closed this as completed Feb 16, 2021

tannonk commented Feb 17, 2021

Hey,

Sorry to bother you with this once again, but I figured out how to extend a pretrained sentencepiece model with user-defined symbols following the hints at google/sentencepiece#426.

e.g.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

model_file = 'en.wiki.bpe.vs10000.model'
symbols = ['<endtitle>', '<name>', '<url>', '<digit>', '<email>', '<loc>', '<greeting>', '<salutation>', '[CLS]', '[SEP]']

mp = sp_model.ModelProto()
with open(model_file, 'rb') as f:
    mp.ParseFromString(f.read())

print(f'Original model pieces: {len(mp.pieces)}')

for i, sym in enumerate(symbols, 1):
    new_sym = mp.SentencePiece()
    new_sym.piece = sym
    new_sym.score = 0.0  # default score for USER_DEFINED pieces
    new_sym.type = 4     # 4 = USER_DEFINED in the ModelProto schema
    mp.pieces.insert(2 + i, new_sym)  # insert after the default control symbols "<unk>", "<s>", "</s>" (indices 0-2)
    print(f'added {new_sym}...')

print(f'New model pieces: {len(mp.pieces)}')

outfile = 'en.ext.wiki.bpe.vs10000.model'

with open(outfile, 'wb') as f:
    f.write(mp.SerializeToString())

The newly extended sentencepiece model encodes a string containing special tokens as expected:

sp = spm.SentencePieceProcessor(model_file='en.ext.wiki.bpe.vs10000.model')
sp.encode_as_pieces('[CLS] this is a test <name> <digit> .')
>>> ['▁', '[CLS]', '▁this', '▁is', '▁a', '▁test', '▁', '<name>', '▁', '<digit>', '▁.']

Depending on how the embedding models are saved, it would then be possible to load a BPEmb model with Gensim and continue training on a small in-domain corpus. But from what I can tell, the models available at https://nlp.h-its.org/bpemb/#download do not support continued training. Are the full models available anywhere?

I just came across #36, I suppose that answers it!

Thanks!

@bheinzerling (Owner)

Thanks for posting this, good to know!

@leminhyen2

@tannonk What is the mp module that you use here to call mp.ParseFromString and mp.SentencePiece()?


tannonk commented Oct 7, 2021

@leminhyen2, good point — the mp here is SentencePiece's ModelProto() object. Check out this post from the original author, which uses it simply as m: google/sentencepiece#121 (comment)

@leminhyen2

@tannonk Thank you, that was very helpful
