
Tokenization and vocabulary


UER-py supports multiple tokenization strategies. The most commonly used (and default) strategy is BertTokenizer. There are two ways of using BertTokenizer: the first is to specify the vocabulary path through --vocab_path, which applies BERT's original tokenization strategy and segments sentences according to the vocabulary; the second is to specify a sentencepiece model path through --spm_model_path, in which case the project imports sentencepiece, loads the sentencepiece model, and uses it to segment sentences. In short, if --spm_model_path is specified, sentencepiece is used for tokenization; otherwise --vocab_path must be specified and BERT's original tokenization strategy is used.
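For example, pre-processing with BertTokenizer might be invoked as follows (an illustrative sketch: the corpus, vocabulary, and sentencepiece model paths are placeholders, and the remaining preprocess.py arguments are only examples):

```
# BertTokenizer with a vocabulary file: BERT's original tokenization strategy.
python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --tokenizer bert \
                      --dataset_path dataset.pt --processes_num 8 --target mlm

# BertTokenizer with a sentencepiece model instead of a vocabulary file.
python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --tokenizer bert \
                      --dataset_path dataset.pt --processes_num 8 --target mlm
```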

In addition, the project provides CharTokenizer and SpaceTokenizer. CharTokenizer tokenizes the text character by character. If the text consists entirely of Chinese characters, CharTokenizer and BertTokenizer are equivalent; CharTokenizer is simpler and faster than BertTokenizer. SpaceTokenizer splits the text on spaces. One can preprocess the text in advance (e.g. perform word segmentation), separate the tokens by spaces, and then use SpaceTokenizer. For CharTokenizer and SpaceTokenizer, if --spm_model_path is specified, the vocabulary in the sentencepiece model is used; otherwise the vocabulary must be specified through --vocab_path.
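A sketch of using SpaceTokenizer on a corpus that has already been word-segmented (the corpus and vocabulary paths below are placeholders; the vocabulary is assumed to cover the segmented tokens):

```
# Tokens in the corpus are already separated by spaces.
python3 preprocess.py --corpus_path corpora/segmented_corpus.txt \
                      --vocab_path models/custom_vocab.txt \
                      --tokenizer space \
                      --dataset_path dataset.pt --processes_num 8 --target mlm

# CharTokenizer is selected analogously with --tokenizer char.
```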

To support English RoBERTa and GPT-2 pre-training models, the project includes BPETokenizer. --vocab_path specifies the vocabulary path and --merges_path specifies the path of the BPE merges file.
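An illustrative invocation with BPETokenizer (the corpus, vocabulary, and merges paths are placeholders, e.g. the vocabulary and merges files released with an English GPT-2 model):

```
python3 preprocess.py --corpus_path corpora/english_corpus.txt \
                      --vocab_path models/gpt2_vocab.txt \
                      --merges_path models/gpt2_merges.txt \
                      --tokenizer bpe \
                      --dataset_path dataset.pt --processes_num 8 --target lm
```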

The project also supports XLMRobertaTokenizer (identical to the original implementation). XLMRobertaTokenizer uses a sentencepiece model, whose path is specified by --spm_model_path. Special tokens are added to the vocabulary in XLMRobertaTokenizer. Since XLMRobertaTokenizer uses different special tokens from the default case, the special tokens should be changed with the method described in the next paragraph.
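Selecting XLMRobertaTokenizer might look like the following sketch (the corpus and sentencepiece model paths are placeholders, and the special tokens also need to be switched as described in the next paragraph):

```
python3 preprocess.py --corpus_path corpora/multilingual_corpus.txt \
                      --spm_model_path models/xlmroberta_spm.model \
                      --tokenizer xlmroberta \
                      --dataset_path dataset.pt --processes_num 8 --target mlm
```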

The pre-processing, pre-training, fine-tuning, and inference stages all need to specify the vocabulary (provided through --vocab_path or --spm_model_path) and the tokenizer (provided through --tokenizer). If users use their own vocabularies, by default the padding, starting, separator, and mask tokens are "[PAD]", "[CLS]", "[SEP]", and "[MASK]" (the project reads special token information from models/special_tokens_map.json by default). If the user's vocabulary has different special tokens, the user should provide a special tokens mapping file, e.g. models/xlmroberta_special_tokens_map.json, and then change the path of the special tokens mapping file in uer/utils/constants.py.
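As a rough sketch, a special tokens mapping file for XLMRobertaTokenizer might look like the following; the key names and token strings here are assumptions for illustration, so the files actually shipped under models/ should be taken as the reference:

```
# Inspect the mapping file (the content shown in the comments is an assumed example,
# not the authoritative file).
cat models/xlmroberta_special_tokens_map.json
# {
#   "pad_token": "<pad>",
#   "unk_token": "<unk>",
#   "cls_token": "<s>",
#   "sep_token": "</s>",
#   "mask_token": "<mask>"
# }

# Then edit uer/utils/constants.py so that the special tokens mapping path points to this
# file instead of the default models/special_tokens_map.json.
```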
