# Building a WordPiece Tokenizer from Scratch

In this lab, we will build a WordPiece tokenizer from scratch.
Tokenization comprises several steps:

- Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
- Pre-tokenization (splitting the input into words)
- Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
- Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)


The Hugging Face Tokenizers library is built around a central Tokenizer class, with the building blocks grouped into submodules:
- [normalizers](https://huggingface.co/docs/tokenizers/api/normalizers)
- [pre_tokenizers](https://huggingface.co/docs/tokenizers/api/pre-tokenizers)
- [models](https://huggingface.co/docs/tokenizers/api/models)
- [trainers](https://huggingface.co/docs/tokenizers/api/trainers)
- [post_processors](https://huggingface.co/docs/tokenizers/api/post-processors)
- [decoders](https://huggingface.co/docs/tokenizers/api/decoders)

To train our new tokenizer, we’ll use the [WikiText-2](https://huggingface.co/datasets/Salesforce/wikitext) dataset:

In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


The function get_training_corpus() is a generator that will yield batches of 1,000 texts, which we will use to train the tokenizer.

In [3]:
def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
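
To see what the generator yields without downloading anything, here is a minimal sketch of the same slicing pattern applied to a plain Python list. The helper name batched_texts and the toy corpus are made up for illustration; they are not part of the datasets library.

```python
def batched_texts(texts, batch_size=1000):
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]


# Hypothetical stand-in for the dataset: 2,500 short strings.
toy_corpus = [f"sentence number {n}" for n in range(2500)]

batch_sizes = [len(batch) for batch in batched_texts(toy_corpus)]
print(batch_sizes)  # [1000, 1000, 500]
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size, so no text is dropped.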

Tokenizers can also be trained on text files directly. Here’s how we can generate a text file containing all the texts/inputs from WikiText-2 that we can use locally:

In [4]:
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")

To build a tokenizer with the Hugging Face Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want.

In [5]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
# We have to specify the unk_token so the model knows what to return when it encounters characters it hasn’t seen before.

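The normalization step standardizes text by reducing variation and cleaning up the input. Here we apply the NFD Unicode normalizer, lowercase the input, and strip accents. NFD comes first because it decomposes each accented character into a base letter plus combining marks, which StripAccents can then remove.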

In [6]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

We can use the normalize_str() method of the normalizer to check the effect it has on a given text:

In [7]:
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))


hello how are u?
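
The three normalization steps can be approximated with Python's standard unicodedata module, which makes it easier to see what each stage contributes. This is a rough sketch and not the library's implementation; the function name normalize_text is an assumption, and accent stripping is modeled here as dropping characters in the Unicode category Mn (nonspacing marks).

```python
import unicodedata


def normalize_text(text):
    """Approximate NFD -> Lowercase -> StripAccents using only the standard library."""
    # NFD splits an accented letter into a base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Dropping nonspacing marks (category "Mn") removes the accents.
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


print(normalize_text("Héllò hôw are ü?"))  # hello how are u?
```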


Pre-tokenization runs before the model. It splits the normalized text into word-level chunks and sets hard boundaries: the model later splits each chunk into tokens, but no token can span two chunks.
Typical split points are:
- Whitespace
- Punctuation marks

The Whitespace pre-tokenizer splits on whitespace and all characters that are not letters, digits, or the underscore character, so it technically splits on whitespace and punctuation. If you only want to split on whitespace, you should use the WhitespaceSplit pre-tokenizer instead.

In [8]:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()


In [9]:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")


[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]

In [10]:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[("Let's", (0, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre-tokenizer.', (14, 28))]
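
The difference between the two pre-tokenizers can be reproduced with regular expressions. The sketch below assumes that Whitespace behaves like the pattern \w+|[^\w\s]+ (runs of word characters, or runs of other non-space characters) and that WhitespaceSplit behaves like \S+. The function names are made up for illustration.

```python
import re


def whitespace_like(text):
    """Split into runs of word characters or runs of punctuation, with offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]


def whitespace_split_like(text):
    """Split on whitespace only, keeping punctuation attached to words."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


sample = "Let's test my pre-tokenizer."
print(whitespace_like(sample))
print(whitespace_split_like(sample))
```

Each piece is paired with its (start, end) character offsets in the input, in the same shape as the outputs shown above.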

As with normalizers, you can use a Sequence to compose several pre-tokenizers. Each pre-tokenizer in the sequence is applied to the pieces produced by the previous one.

In [11]:
pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

[('Let', (0, 3)),
 ("'", (3, 4)),
 ('s', (4, 5)),
 ('test', (6, 10)),
 ('my', (11, 13)),
 ('pre', (14, 17)),
 ('-', (17, 18)),
 ('tokenizer', (18, 27)),
 ('.', (27, 28))]
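
Conceptually, a Sequence runs each stage on the pieces produced by the previous one while keeping offsets relative to the original string. The following is a minimal pure-Python sketch of that idea, assuming the punctuation stage isolates every punctuation character as its own piece. All function names are hypothetical.

```python
import re
import string
import unicodedata


def is_punctuation(ch):
    """ASCII punctuation or any character in a Unicode 'P' category."""
    return ch in string.punctuation or unicodedata.category(ch).startswith("P")


def split_on_whitespace(text):
    """Stage 1: whitespace-delimited pieces with their offsets."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


def isolate_punctuation(pieces):
    """Stage 2: split each piece so every punctuation character stands alone."""
    result = []
    for piece, (start, _end) in pieces:
        word_start = None  # index within `piece` where the current word began
        for i, ch in enumerate(piece):
            if is_punctuation(ch):
                if word_start is not None:
                    result.append((piece[word_start:i], (start + word_start, start + i)))
                    word_start = None
                result.append((ch, (start + i, start + i + 1)))
            elif word_start is None:
                word_start = i
        if word_start is not None:
            result.append((piece[word_start:], (start + word_start, start + len(piece))))
    return result


pieces = isolate_punctuation(split_on_whitespace("Let's test my pre-tokenizer."))
print(pieces)
```

With this model, a run such as "..." would come out as three separate pieces, whereas the regex-style Whitespace splitting keeps it as one.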

The next step in the tokenization pipeline is running the inputs through the model. We already specified our model in the initialization, but we still need to train it, which requires a WordPieceTrainer. The main thing to remember when instantiating a trainer in Hugging Face Tokenizers is that you need to pass it all the special tokens you intend to use. Otherwise it won’t add them to the vocabulary, since they do not appear in the training corpus.

In [12]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

To train our model using the iterator we defined earlier, we just have to execute this command:

In [15]:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)






We can also use text files to train our tokenizer, which would look like this (we reinitialize the model with an empty WordPiece beforehand).

In [16]:
tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)






We can now check the trained tokenizer on a sample sentence by calling the encode() method:

In [17]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']
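
The split of "tokenizer" into 'tok', '##eni', '##zer' follows from how a WordPiece model encodes a word: it repeatedly takes the longest vocabulary entry that matches the start of the remaining characters, marking non-initial pieces with the ## prefix. Below is a minimal sketch of that greedy longest-match-first procedure. The function name and the tiny vocabulary are assumptions chosen to reproduce the split above, not the vocabulary actually learned from WikiText-2.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single pre-tokenized word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"let", "test", "this", "to", "tok", "##en", "##eni", "##zer"}

print(wordpiece_encode("tokenizer", toy_vocab))  # ['tok', '##eni', '##zer']
print(wordpiece_encode("xyz", toy_vocab))        # ['[UNK]']
```

Note that "tok" wins over "to" and "##eni" over "##en" because the longest match is always preferred.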


The last step in the tokenization pipeline is post-processing. We need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences). We will use TemplateProcessing for this, but first we need to know the IDs of the [CLS] and [SEP] tokens in the vocabulary.

In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

To write the template for TemplateProcessing, we have to specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by $A, while the second sentence (if encoding a pair) is represented by $B. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.
Note that we also need to pass the special tokens together with their IDs, so the tokenizer can map each special token in the template to the right vocabulary entry.

In [21]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
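
To make the template syntax concrete, here is a small pure-Python sketch that expands a template string into tokens and token type IDs. It only mimics the name:type_id notation described above. The function apply_template is hypothetical, and the real post-processor additionally maps special tokens to their vocabulary IDs.

```python
def apply_template(template, tokens_a, tokens_b=None):
    """Expand a 'NAME:TYPE_ID' template into (tokens, type_ids)."""
    tokens, type_ids = [], []
    for item in template.split():
        name, type_id = item.rsplit(":", 1)
        if name == "$A":
            segment = tokens_a
        elif name == "$B":
            segment = tokens_b
        else:
            segment = [name]  # a special token such as [CLS] or [SEP]
        tokens.extend(segment)
        type_ids.extend([int(type_id)] * len(segment))
    return tokens, type_ids


single = apply_template("[CLS]:0 $A:0 [SEP]:0", ["let", "'", "s"])
pair = apply_template("[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", ["hello"], ["world", "."])
print(single)
print(pair)
```

In the pair case, everything up to and including the first [SEP] gets type ID 0, and the second sentence with its closing [SEP] gets type ID 1.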

In [18]:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']


In [19]:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


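We’ve almost finished building this tokenizer from scratch. The last step is to include a decoder, which turns a sequence of tokens back into readable text by merging the pieces that start with the ## prefix into the preceding token.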

In [20]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
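
What the decoding step has to undo can be sketched in a few lines of plain Python: drop the special tokens, glue ##-prefixed pieces onto the previous token, and remove the space before common punctuation marks. This is a simplified approximation with a hypothetical function name. In the library, skipping special tokens is handled by decode() itself, and the decoder's cleanup covers more cases than the four marks used here.

```python
def decode_wordpiece(tokens, prefix="##", special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild text from WordPiece tokens (simplified sketch)."""
    kept = [tok for tok in tokens if tok not in special_tokens]
    # Joining with spaces and then deleting " ##" merges continuation pieces.
    text = " ".join(kept).replace(" " + prefix, "")
    # Partial cleanup: attach a few punctuation marks to the preceding word.
    for mark in (".", "?", "!", ","):
        text = text.replace(" " + mark, mark)
    return text


example_tokens = [
    "[CLS]", "let", "'", "s", "test", "this", "tok", "##eni", "##zer", "...",
    "[SEP]", "on", "a", "pair", "of", "sentences", ".", "[SEP]",
]
print(decode_wordpiece(example_tokens))
# let ' s test this tokenizer... on a pair of sentences.
```

Decoding is lossy: the original casing and the spacing around the apostrophe are not recovered, because normalization and pre-tokenization discarded that information.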

In [21]:
tokenizer.decode(encoding.ids)

"let ' s test this tokenizer... on a pair of sentences."

We can save our tokenizer to a single JSON file and then reload that file into a Tokenizer object with the from_file() method:

In [22]:
tokenizer.save("tokenizer.json")
new_tokenizer = Tokenizer.from_file("tokenizer.json")