# [Building A Tokenizer, Block By Block](https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt)

```text
As we’ve seen in the previous sections, tokenization comprises several steps:
  1. Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
  2. Pre-tokenization (splitting the input into words)
  3. Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
  4. Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)
  
As a reminder, here’s another look at the overall process:
```

<br>

[![image.png](https://i.postimg.cc/Qxkc6N43/image.png)](https://postimg.cc/dL373FcH)

In [1]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

#### Note

- The 🤗 Tokenizers library has been built to provide several options for each of those steps, which you can mix and match together. 
- In this section we’ll see how we can build a tokenizer from scratch, as opposed to training a new tokenizer from an old one as we did in [section 2](https://huggingface.co/course/chapter6/2).
-  You’ll then be able to build any kind of tokenizer you can think of!

More precisely, the library is built around a central Tokenizer class with the building blocks regrouped in submodules:

- **normalizers** contains all the possible types of Normalizer you can use (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)).
- **pre_tokenizers** contains all the possible types of PreTokenizer you can use (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.pre_tokenizers)).
- **models** contains the various types of Model you can use, like BPE, WordPiece, and Unigram (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.models)).
trainers contains all the different types of Trainer you can use to train your model on a corpus (one per type of model; complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.trainers)).
- **post_processors** contains the various types of PostProcessor you can use (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)).
- **decoders** contains the various types of Decoder you can use to decode the outputs of tokenization (complete list [here](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)).
You can find the whole list of building blocks [here](https://huggingface.co/docs/tokenizers/python/latest/components.html).

### Load Dataset

In [2]:
from datasets import load_dataset, Dataset

name: str = "wikitext-2-raw-v1"  # wikitext-103-v1 (~191 MB)
split: str = "train"
dataset: Dataset = load_dataset(path="wikitext", name=name, split=split)


def get_training_corpus() -> str:
    """This is used to load the dataset in batches."""
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

In [3]:
dataset

Dataset({
    features: ['text'],
    num_rows: 36718
})

In [4]:
dataset[1]

{'text': ' = Valkyria Chronicles III = \n'}

```text
The function get_training_corpus() is a generator that will yield batches of 1,000 texts, which we will use to train the tokenizer.

🤗 Tokenizers can also be trained on text files directly. Here’s how we can generate/save a text file containing all the texts/inputs from WikiText-2 that we can use locally:
```

```python
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")
```

In [5]:
# Save data
fp: str = "./my_data/wikitext-2.txt"

with open(file=fp, mode="w") as f:
    for idx in range(len(dataset)):
        f.write(dataset[idx].get("text") + "\n")

#### Build A Custom BERT, GPT-2, and XLNet Tokenizer

```text
- Build a custom tokenizer block by block. 
- Using the three main tokenization algorithms: 
  - WordPiece
  - BPE 
  - Unigram
```

<br><hr>

### Building A WordPiece Tokenizer From Sratch

```text
- To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want.

- For this example, we’ll create a Tokenizer with a WordPiece model:
```


In [6]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

# To ensure proper tokenization, we need to define the unk_token, which represents unknown characters.
tokenizer: Tokenizer = Tokenizer(model=models.WordPiece(unk_token="[UNK]"))
tokenizer

<tokenizers.Tokenizer at 0x7fcb22943600>

In [7]:
# Additional arguments include vocab (unnecessary for training) and max_input_chars_per_word to set a maximum
# word length for splitting. For normalization, the BertNormalizer can be used with options like lowercase,
# strip_accents, clean_text, and handle_chinese_chars for replicating the bert-base-uncased tokenizer.
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)


# OR
# The library provides a Lowercase normalizer and a StripAccents normalizer, and you can compose several
# normalizers using a Sequence:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

In [8]:
# We’re also using an NFD Unicode normalizer, as otherwise the StripAccents normalizer won’t properly
# recognize the accented characters and thus won’t strip them out. As we’ve seen before, we can use
# the normalize_str() method of the normalizer to check out the effects it has on a given text:
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))

In [9]:
# Next is the pre-tokenization step. Again, there is a prebuilt BertPreTokenizer that we can use:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

In [10]:
# Or we can build it from scratch:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

In [11]:
# Note that the Whitespace pre-tokenizer splits on whitespace and all characters that are not letters,
# digits, or the underscore character, so it technically splits on whitespace and punctuation:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]

In [12]:
# If you only want to split on whitespace, you should use the WhitespaceSplit pre-tokenizer instead:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre-tokenizer.', (14, 28))]

In [13]:
# Like with normalizers, you can use a Sequence to compose several pre-tokenizers:
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]

In [14]:
# The next step in the tokenization pipeline is running the inputs through the model. We already specified our model
# in the initialization, but we still need to train it, which will require a WordPieceTrainer. The main thing to
# remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you
# intend to use — otherwise it won’t add them to the vocabulary, since they are not in the training corpus:
special_tokens: list[str] = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer: trainers = trainers.WordPieceTrainer(
    vocab_size=25000, special_tokens=special_tokens
)

In [15]:
# As well as specifying the vocab_size and special_tokens, we can set the min_frequency
# (the number of times a token must appear to be included in the vocabulary) or change the
# continuing_subword_prefix (if we want to use something different from ##).
# To train our model using the iterator we defined earlier, we just have to execute this command:
tokenizer.train_from_iterator(iterator=get_training_corpus(), trainer=trainer)






In [16]:
# We can also use text files to train our tokenizer, which would look like this (we reinitialize the model with
# an empty WordPiece beforehand):
fp: str = "./my_data/wikitext-2.txt"
tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train([fp], trainer=trainer)






In [17]:
# In both cases, we can then test the tokenizer on a text by calling the encode() method:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

```text
- The encoding generated by the tokenizer includes attributes such as ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, and overflowing. 

- The final step is post-processing, which involves adding the [CLS] token at the beginning and the [SEP] token at the end using a TemplateProcessor, with the IDs of these tokens obtained from the vocabulary.
```

In [18]:
cls_token_id: int = tokenizer.token_to_id("[CLS]")
sep_token_id: int = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

```text
- To write the template for the TemplateProcessor, we have to specify how to treat a single sentence and a pair of sentences. 
- For both, we write the special tokens we want to use; the first (or single) sentence is represented by $A, while the second sentence (if encoding a pair) is represented by $B. 
- For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.

- The classic BERT template is thus defined as follows:
```

In [19]:
tokenizer.post_processor = processors.TemplateProcessing(
    single=f"[CLS]:0 $A:0 [SEP]:0",
    pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

In [20]:
# Note that we need to pass along the IDs of the special tokens, so the tokenizer can properly convert them to their IDs.
# Once this is added, going back to our previous example will give:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

In [21]:
# dir(encoding)

encoding.word_ids

[None, 0, 1, 2, 3, 4, 5, 5, 5, 6, None]

In [22]:
# And on a pair of sentences, we get the proper result:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")

print(encoding.tokens)
print(encoding.type_ids)

In [23]:
# We’ve almost finished building this tokenizer from scratch — the last step is to include a decoder:
tokenizer.decoder = decoders.WordPiece(prefix="##")

In [24]:
print(encoding.ids)

In [25]:
# Let’s test it on our previous encoding:
tokenizer.decode(encoding.ids)

"let ' s test this tokenizer... on a pair of sentences."

### Save Tokenizer (As JSON)

In [26]:
# Great! We can save our tokenizer in a single JSON file like this:
fp: str = "./tokenizers/tokenizer.json"
tokenizer.save(fp)

### Load Tokenizer (From A File)

In [27]:
# We can then reload that file in a Tokenizer object with the from_file() method:
new_tokenizer: Tokenizer = Tokenizer.from_file(fp)

```text
- To use the tokenizer in 🤗 Transformers, it needs to be wrapped in a PreTrainedTokenizerFast. 
- This can be done by either passing the tokenizer object or the saved tokenizer file. 
- It's important to manually set the special tokens such as the mask token and the [CLS] token.
```

In [28]:
from transformers import PreTrainedTokenizerFast


wrapped_tokenizer: PreTrainedTokenizerFast = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

In [29]:
from transformers import BertTokenizerFast


# If you are using a specific tokenizer class (like BertTokenizerFast), you will only need to specify the special tokens
# that are different from the default ones (here, none):
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

```text
- You can then use this tokenizer like any other 🤗 Transformers tokenizer. 
- You can save it with the save_pretrained() method, or upload it to the Hub with the push_to_hub() method.
```

<br><hr><br>

## Building A BPE Tokenizer From Scratch


In [30]:
# Build a GPT-2 tokenizer by initializing a Tokenizer with a BPE model:
tokenizer: Tokenizer = Tokenizer(models.BPE())

In [31]:
# For GPT-2, initializing the model with a vocabulary is not necessary as we will train from scratch.
# The unk_token is not required as GPT-2 uses byte-level BPE. The pre-tokenization step follows the omission of the normalizer.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

In [32]:
# The option we added to ByteLevel here is to not add a space at the beginning of a sentence (which is the default otherwise).
# We can have a look at the pre-tokenization of an example text like before:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

[('Let', (0, 3)),
 ("'s", (3, 5)),
 ('Ġtest', (5, 10)),
 ('Ġpre', (10, 14)),
 ('-', (14, 15)),
 ('tokenization', (15, 27)),
 ('!', (27, 28))]

In [33]:
# Train the model. For GPT-2, the only special token is the end-of-text token:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






In [34]:
# Like with the WordPieceTrainer, as well as the vocab_size and special_tokens, we can specify the min_frequency if we want to,
# or if we have an end-of-word suffix (like </w>), we can set it with end_of_word_suffix.
# This tokenizer can also be trained on text files:
tokenizer.model = models.BPE()
tokenizer.train([fp], trainer=trainer)






In [35]:
# Tokenization of a sample text:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

In [36]:
# We apply the byte-level post-processing for the GPT-2 tokenizer as follows:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

In [37]:
# By setting trim_offsets = False, the post-processor preserves the offsets of tokens starting with 'Ġ',
# ensuring that the offsets indicate the space before the word. This is demonstrated in the encoded text example.
sentence: str = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

' '

In [38]:
# Add a byte-level decoder:
tokenizer.decoder = decoders.ByteLevel()

# Double-check it works properly:
print(tokenizer.decode(encoding.ids))

### Save BPE Tokenizer

```text
- Wrap it in a PreTrainedTokenizerFast or GPT2TokenizerFast if we want to use it in 🤗 Transformers:
```

In [39]:
from transformers import PreTrainedTokenizerFast, GPT2TokenizerFast


wrapped_tokenizer: PreTrainedTokenizerFast = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

In [40]:
# OR
wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

<br><hr>

## Building A Unigram Tokenizer From Scratch

```text
Let’s now build an XLNet tokenizer. Like for the previous tokenizers, we start by initializing a Tokenizer with a Unigram model
```

In [41]:
from tokenizers import Regex


# You can initialize the model with a vocabulary if a vocab is available.
tokenizer: Tokenizer = Tokenizer(models.Unigram())


# For the normalization, XLNet uses a few replacements (which come from SentencePiece)
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

In [42]:
# The pre-tokenizer to use for any SentencePiece tokenizer is Metaspace:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

# Pre-tokenization of an example text:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")

[("▁Let's", (0, 5)),
 ('▁test', (5, 10)),
 ('▁the', (10, 14)),
 ('▁pre-tokenizer!', (14, 29))]

In [43]:
#  Add some special tokens and train the model.
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)





### Train The Model [on text files]

```python
# This tokenizer can also be trained on text files
tokenizer.model = models.Unigram()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
```

In [44]:
# Tokenization of a sample text:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

In [45]:
# XLNet is unique in that it places the <cls> token at the end of the sentence, with a type ID of 2
# to distinguish it from other tokens. This results in left-padding. We can use a template to handle
# all special tokens and token type IDs, but we must first obtain the IDs of the <cls> and <sep> tokens.
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)

In [46]:
# The template looks like this:
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

In [47]:
# Test the tokenizer by encoding a pair of sentences:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)

In [48]:
# Add a Metaspace decoder
tokenizer.decoder = decoders.Metaspace()

### Save Tokenizer

```text
- The tokenizer is now complete! We can save it and use it in 🤗 Transformers by wrapping it in PreTrainedTokenizerFast or XLNetTokenizerFast. 

- We need to tell the 🤗 Transformers library to pad on the left when using PreTrainedTokenizerFast.
```


In [49]:
from transformers import PreTrainedTokenizerFast


wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",  # tell the 🤗 Transformers library to pad on the left
)

In [50]:
# OR
from transformers import XLNetTokenizerFast


wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)