## Hugging Face Tokenizers Tutorial and Usage

Reference: https://www.kaggle.com/funtowiczmo/hugging-face-tutorials-training-tokenizer

In [1]:
# !pip install tokenizers==0.5.2

In [12]:
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC, Lowercase, Sequence
from tokenizers.pre_tokenizers import ByteLevel

# First, create an empty Byte-Pair Encoding (BPE) model
tokenizer = Tokenizer(BPE.empty())

# Normalize the input: Unicode NFKC normalization, then lowercasing
tokenizer.normalizer = Sequence([NFKC(), Lowercase()])

# Use a pre-tokenizer to convert the input to a byte-level representation
tokenizer.pre_tokenizer = ByteLevel()

# Use a decoder so we can recover the original text from a tokenized input
tokenizer.decoder = ByteLevelDecoder()
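
To make the normalization step concrete, here is a minimal sketch that reproduces the effect of `Sequence([NFKC(), Lowercase()])` with Python's standard `unicodedata` module. The `normalize` helper and the sample string are illustrative assumptions and are not part of the `tokenizers` library.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, then lowercase (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text).lower()


# Full-width "Hello" followed by a word starting with the "fi" ligature (U+FB01)
sample = "\uff28\uff45\uff4c\uff4c\uff4f \ufb01ne"
normalized = normalize(sample)
print(normalized)
```

NFKC folds compatibility characters such as full-width letters and ligatures into their plain equivalents, so visually similar inputs end up sharing the same tokens.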


## Train the pipeline

### Various options for the various stages
We designed the library so that it provides all the required blocks to create end-to-end tokenizers in an interchangeable way. In that sense, we provide these various components:

1. **Normalizer**: Executes all the initial transformations over the initial input string. For example, when you need to lowercase some text, strip it, or apply one of the common Unicode normalization processes, you add a Normalizer.
2. **PreTokenizer**: In charge of splitting the initial input string. This is the component that decides where and how to pre-segment the original string. The simplest example is to split on spaces.
3. **Model**: Handles all the sub-token discovery and generation. This part is trainable and depends heavily on your input data.
4. **Post-Processor**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence around [CLS] and [SEP] tokens.
5. **Decoder**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the PreTokenizer we used previously.
6. **Trainer**: Provides training capabilities to each model.
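
The way these components chain together can be sketched with plain Python functions. This is a toy stand-in under stated assumptions: the vocabulary, the function names, and the whitespace splitting are all hypothetical and only mirror the roles listed above, not the library's actual classes.

```python
import unicodedata

# Hypothetical toy vocabulary for a word-level model
VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}
SPECIAL = {"[CLS]", "[SEP]"}


def normalizer(text):
    # Unicode normalization, lowercasing, and stripping
    return unicodedata.normalize("NFKC", text).lower().strip()


def pre_tokenizer(text):
    # Simplest possible pre-segmentation: split on whitespace
    return text.split()


def model(words):
    # Map each word to an id, falling back to [UNK]
    return [VOCAB.get(w, VOCAB["[UNK]"]) for w in words]


def post_processor(ids):
    # BERT-style wrapping with [CLS] ... [SEP]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def decoder(ids):
    # Map ids back to text, dropping the special tokens
    tokens = [ID_TO_TOKEN[i] for i in ids]
    return " ".join(t for t in tokens if t not in SPECIAL)


ids = post_processor(model(pre_tokenizer(normalizer("  Hello WORLD  "))))
print(ids)
print(decoder(ids))
```

Each stage only depends on the output of the previous one, which is what makes the real components interchangeable.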

For each of the components above we provide multiple implementations:

1. **Normalizer**: Lowercase, Unicode (NFD, NFKD, NFC, NFKC), Bert, Strip, ...
2. **PreTokenizer**: ByteLevel, WhitespaceSplit, CharDelimiterSplit, Metaspace, ...
3. **Model**: WordLevel, BPE, WordPiece
4. **Post-Processor**: BertProcessor, ...
5. **Decoder**: WordLevel, BPE, WordPiece, ...

In [13]:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(vocab_size=25000, show_progress=True, initial_alphabet=ByteLevel.alphabet())

tokenizer.train(trainer, ["./big.txt"])

print(f"Trained vocab size {tokenizer.get_vocab_size()}")

Trained vocab size 25000
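
For intuition about what training a BPE model involves, the following is a minimal pure-Python sketch of the classic merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplified illustration, not the library's implementation; the function names, the tie-breaking rule, and the toy word frequencies are assumptions.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Most frequent pair; ties broken by the pair itself (an arbitrary choice)
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)
print(words)
```

In this sketch the number of merges plays the role that `vocab_size` plays for the trainer: training stops once the budget is used up or no pairs remain.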


In [15]:
tokenizer.model.save(".")

['./vocab.json', './merges.txt']

In [28]:
# Load the model
tokenizer.model = BPE.from_files("./vocab.json", "./merges.txt")

# encoding = tokenizer.encode("This is a simple input to be tokenized")
# encoding = tokenizer.encode("What is your name? My Name is Abhik")
encoding = tokenizer.encode("U.S.A is handling the kovid19 crisis not very well")
# encoding = tokenizer.encode("Banks are planning to cut the rates. The Federal govt. will reduce rates to 0.")


print(f"Encoded String {encoding.tokens}")

decoded = tokenizer.decode(encoding.ids)

print(f"Decoded String {decoded}")

Encoded String ['Ġu', '.', 's', '.', 'a', 'Ġis', 'Ġhandling', 'Ġthe', 'Ġk', 'ov', 'id', '19', 'Ġcrisis', 'Ġnot', 'Ġvery', 'Ġwell']
Decoded String  u.s.a is handling the kovid19 crisis not very well
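
The `Ġ` prefix in the tokens above is how a byte-level alphabet displays a space. The sketch below rebuilds the byte-to-character table popularized by GPT-2's byte-level BPE, which the `ByteLevel` components appear to follow; treat that correspondence as an assumption. The helper names are hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (space, control characters, ...) are shifted past 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Encode text as UTF-8 bytes and render each byte as a visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Invert to_byte_level, recovering the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


encoded = to_byte_level(" is")
print(encoded)
print(repr(from_byte_level(encoded)))
```

Because every byte has a visible stand-in, any input can be represented without an unknown token, and the decoder can invert the mapping exactly. The leading space in the decoded string above is consistent with a pre-tokenizer that prepends a space to the input, although the cell does not configure that explicitly.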


### Other options
The `Encoding` structure exposes multiple properties that are useful when working with Transformers models:

1. **normalized_str**: The input string after normalization (lower-casing, unicode, stripping, etc.)
2. **original_str**: The input string as it was provided
3. **tokens**: The generated tokens with their string representation
4. **ids**: The generated tokens with their integer representation
5. **attention_mask**: If your input has been padded by the tokenizer, this is a vector with 1 for every non-padded token and 0 for padded ones.
6. **special_tokens_mask**: If your input contains special tokens such as [CLS], [SEP], [MASK], [PAD], this is a vector with 1 in the positions where a special token has been added.
7. **type_ids**: If your input was made of multiple "parts" such as (question, context), this is a vector giving, for each token, the segment it belongs to.
8. **overflowing**: If your input has been truncated into multiple subparts because of a length limit (for BERT, for example, the sequence length is limited to 512), this contains all the remaining overflowing parts.
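
To show how the mask-style properties relate to each other, here is a small pure-Python sketch that builds them for a BERT-style (question, context) pair padded to a fixed length. The `build_pair_encoding` function, the token strings, and the choice to flag padding as special are illustrative assumptions, not the library's code.

```python
def build_pair_encoding(question, context, max_length, pad="[PAD]"):
    """Assemble tokens and masks for a (question, context) pair (hypothetical helper)."""
    tokens = ["[CLS]"] + question + ["[SEP]"] + context + ["[SEP]"]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers context + [SEP]
    type_ids = [0] * (len(question) + 2) + [1] * (len(context) + 1)
    special_tokens_mask = [1 if t in ("[CLS]", "[SEP]") else 0 for t in tokens]
    attention_mask = [1] * len(tokens)

    # Pad on the right up to max_length
    n_pad = max(0, max_length - len(tokens))
    tokens += [pad] * n_pad
    type_ids += [0] * n_pad
    special_tokens_mask += [1] * n_pad  # assumption: padding counts as special
    attention_mask += [0] * n_pad       # padded positions are not attended to

    return {
        "tokens": tokens,
        "type_ids": type_ids,
        "special_tokens_mask": special_tokens_mask,
        "attention_mask": attention_mask,
    }


enc = build_pair_encoding(["who", "?"], ["me"], max_length=8)
for name, values in enc.items():
    print(f"{name}: {values}")
```

All four lists have the same length, so position `i` in each one describes the same token.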

In [29]:
# A few more Encoding properties that can be inspected
print(f"Original String - {encoding.original_str}")
print(f"Normalized String - {encoding.normalized_str}")
print(f"String Tokens - {encoding.tokens}")
# print(f"String Input Ids - {encoding.ids}")
print(f"Attention mask - {encoding.attention_mask}")
print(f"Special Token Mask - {encoding.special_tokens_mask}")
print(f"Type Ids - {encoding.type_ids}")
print(f"Overflowing - {encoding.overflowing}")


Original String - U.S.A is handling the kovid19 crisis not very well
Normalized String - u.s.a is handling the kovid19 crisis not very well
String Tokens - ['Ġu', '.', 's', '.', 'a', 'Ġis', 'Ġhandling', 'Ġthe', 'Ġk', 'ov', 'id', '19', 'Ġcrisis', 'Ġnot', 'Ġvery', 'Ġwell']
Attention mask - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Special Token Mask - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Type Ids - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Overflowing - []
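
The `overflowing` list is empty here because no truncation was configured. As a rough sketch of the idea, the function below keeps the first `max_length` ids and collects the rest in fixed-size chunks. It is a simplified, hypothetical helper (no stride or overlap) and not the library's truncation logic.

```python
def truncate_with_overflow(ids, max_length):
    """Return (kept, overflowing) where overflowing holds the leftover chunks."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    kept = ids[:max_length]
    overflowing = [
        ids[start:start + max_length]
        for start in range(max_length, len(ids), max_length)
    ]
    return kept, overflowing


kept, overflowing = truncate_with_overflow(list(range(10)), max_length=4)
print(kept)
print(overflowing)

# An input shorter than the limit produces no overflow
short_kept, short_overflow = truncate_with_overflow([7, 8, 9], max_length=4)
print(short_kept, short_overflow)
```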
