<a href="https://colab.research.google.com/github/donna-noble/HPC-Workshop/blob/main/02_tokenizer_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
! pip install datasets transformers[sentencepiece]



# What are tokenizers

Before we begin, let's look at a tiny example of what tokenizers do.

In [7]:
text = "This is a sample sentence we use to learn about tokenizers, tokenization and tokenizing"

First of all, we split the sentence into smaller components called *tokens*. This can be done in many different ways, but for this example we simply use "words" (we split on spaces to separate the words; notice that with this strategy contracted words like *isn't* are treated as a single token, and punctuation stays attached to the word next to it).

In [8]:
tokens = text.split(" ")

tokens

['This',
 'is',
 'a',
 'sample',
 'sentence',
 'we',
 'use',
 'to',
 'learn',
 'about',
 'tokenizers,',
 'tokenization',
 'and',
 'tokenizing']

Now we create a vocabulary based on the tokens we have seen before.

In [9]:
unique_tokens = set(tokens)  # use a set to eliminate repeated tokens

vocabulary = list(unique_tokens) # just to change back the data structure to a list

{vocabulary[i]:i for i in range(len(vocabulary))}

{'and': 0,
 'is': 1,
 'learn': 2,
 'we': 3,
 'sentence': 4,
 'tokenizing': 5,
 'tokenizers,': 6,
 'use': 7,
 'tokenization': 8,
 'to': 9,
 'about': 10,
 'a': 11,
 'This': 12,
 'sample': 13}

Finally, we encode the sentence, using the position of each token in the vocabulary list as its "reference", resulting in:

In [10]:
encoded = []

for token in tokens:
    encoded.append(vocabulary.index(token))

encoded


[12, 1, 11, 13, 4, 3, 7, 9, 2, 10, 6, 8, 0, 5]

You can go back to the first cell in this notebook to edit the sentence and do the whole process again to see how different text will behave with this basic strategy.
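One thing to keep in mind when editing the sentence: the basic strategy above fails as soon as a token is not in the vocabulary (`vocabulary.index` raises a `ValueError`), and the ids can change between runs because a `set` has no fixed order. The sketch below is one possible variation, not part of the original cells: it sorts the vocabulary and reserves id 0 for an `[UNK]` token. The `encode` helper and the choice of id 0 are assumptions made for illustration.

```python
text = "This is a sample sentence we use to learn about tokenizers, tokenization and tokenizing"
tokens = text.split(" ")

# Sorting makes the token -> id mapping reproducible across runs
# (the iteration order of a set of strings can differ between sessions).
vocabulary = ["[UNK]"] + sorted(set(tokens))
token_to_id = {token: idx for idx, token in enumerate(vocabulary)}


def encode(sentence):
    """Map each space-separated token to its id, falling back to [UNK]."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk_id) for token in sentence.split(" ")]


encoded = encode(text)
unseen = encode("This is a new sentence")  # "new" is not in the vocabulary

print(encoded)
print(unseen)
```

Every word that was not in the original sentence collapses to the same id, which is exactly the information loss that subword tokenization tries to avoid.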

A plain word split works only when every word in your dataset is already known, but real text contains a huge number of rare forms, typos, inflections, compounds, and new words the model has never seen before. A tokenizer addresses this by breaking words into subword pieces that can be reused across many words, which greatly reduces the vocabulary size and lets the model handle words it has never encountered.

For example, with a word split, “tokenization”, “tokenizer”, and “tokenizing” all become completely different entries the model must learn from scratch. A subword tokenizer can instead split them into shared pieces like “token” + “ization”, “token” + “izer”, and “token” + “izing”, so the model reuses what it has learned about “token” and generalizes better. This is why modern LLMs rely on subword tokenizers instead of raw word lists: the vocabulary is more compact (which reduces memory), any text can be encoded, and new words can be represented by combining familiar parts.
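To make the idea of shared pieces concrete, here is a minimal sketch of a greedy longest-match split. The piece set is written by hand for illustration (a real tokenizer learns it from a corpus), and `greedy_subword_split` is a hypothetical helper, not a function from any library.

```python
def greedy_subword_split(word, pieces):
    """Split `word` left to right, always taking the longest known piece."""
    result = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in pieces:
                result.append(word[start:end])
                start = end
                break
        else:
            # No known piece matches here: fall back to a single character.
            result.append(word[start])
            start += 1
    return result


# Hand-picked pieces, assumed for this example only.
pieces = {"token", "ization", "izer", "izing"}

splits = {
    word: greedy_subword_split(word, pieces)
    for word in ["tokenization", "tokenizer", "tokenizing", "tokens"]
}
for word, parts in splits.items():
    print(word, "->", parts)
```

Note that `tokens` was never listed as a word, yet it still gets a usable split because `token` is a known piece.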

# Training your own tokenizer from scratch

In this notebook, we will see several ways to train your own tokenizer from scratch on a given corpus, so you can then use it to train a language model from scratch.

Why would you need to *train* a tokenizer? That's because Transformer models very often use subword tokenization algorithms, and they need to be trained to identify the parts of words that are often present in the corpus you are using. We recommend you take a look at the [tokenization chapter](https://huggingface.co/course/chapter2/4?fw=pt) of the Hugging Face course for a general introduction on tokenizers, and at the [tokenizers summary](https://huggingface.co/transformers/tokenizer_summary.html) for a look at the differences between the subword tokenization algorithms.

## Getting a corpus

In [2]:
!pip install huggingface_hub

from huggingface_hub import notebook_login

notebook_login()




In [3]:
from datasets import load_dataset

# Load in streaming mode — does NOT download whole dataset
ds = load_dataset("LSX-UniWue/LLaMmlein-Dataset", streaming=True)

stream = ds["train"].shuffle(seed=42, buffer_size=100_000)

# Take first 10k from the shuffled stream
sampled = []
for i, item in enumerate(stream):
    if i >= 10_000:
        break
    sampled.append(item)

# Convert list → Dataset object (optional)
from datasets import Dataset
dataset = Dataset.from_list(sampled)

print(dataset)


Dataset({
    features: ['id', 'text', 'source'],
    num_rows: 10000
})


We can have a look at the dataset, which has 10,000 texts:

In [4]:
dataset

Dataset({
    features: ['id', 'text', 'source'],
    num_rows: 10000
})

To access an element, we just have to provide its index:

In [5]:
dataset[1]

{'id': 'sha1:YYEBPYNDQOOGJ6NTXIO3ERIZEDFJL6M5',
 'text': '2626. September 2022 1010. Oktober 2022',
 'source': 'https://www.classic-computing.org/veranstaltungskalender/?time=day&yr=2022&month=10&dy=2'}

We can also access a slice directly, in which case we get a dictionary that maps each column name (`"id"`, `"text"` and `"source"`) to a list of values:

In [6]:
dataset[:2]

{'id': ['sha1:J3LZP6K4YICBH4LVXAKEVZCCOGADP73Q',
  'sha1:YYEBPYNDQOOGJ6NTXIO3ERIZEDFJL6M5'],
 'text': ['Initiative hatte für Kreuzberger Gemüseladen gekämpft – „Bizim Bakkal“ schließt aus gesundheitlichen Gründen\n„Nun ist doch Schluss mit dem Gemüseladen „Bizim Bakkal“. Ahmet Caliskan gibt das Geschäft aus gesundheitlichen Gründen auf. Seine Nachbarschaft hatte lange für den Laden gekämpft – im vergangenen Sommer kam es zu heftigen Protesten gegen die Kündigung seitens des Vermieters. Für Caliskan hatte das langfristige Folgen.\nAhmet Caliskan gibt auf. Der Inhaber des Kreuzberger Gemüseladens „Bizim Bakkal“ schließt sein Geschäft im Wrangelkiez aus gesundheitlichen Gründen, wie die Nachbarschaftsinitiative „Bizim Kiez“ am Montag bekannt gab. Im vergangenen Sommer hatte Caliskan die Kündigung seines Mietvertrages erhalten. Daraufhin formierte sich „Bizim Kiez“ („Wir sind der Kiez“) und kämpfte erfolgreich über Monate dafür, dass der kleine Laden erhalten bleibt.\nCaliskan bekam seinen

The API to train our tokenizer requires an iterator over batches of texts, for instance a list of lists of texts:

In [7]:
batch_size = 1000

all_texts = []

for i in range(0, len(dataset), batch_size):
    all_texts.append(dataset[i : i + batch_size]["text"])


To avoid loading everything into memory (the Datasets library keeps the elements on disk and only loads them into memory when requested), we define a Python iterator. This is particularly useful if you have a huge dataset:

In [8]:
def batch_iterator():
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]
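The same batching pattern can be tried on a plain Python list, which makes the batch sizes easy to inspect. This is a small sketch; `iter_batches` and `toy_texts` are names made up for this example.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


toy_texts = [f"document {i}" for i in range(2500)]

batch_lengths = [len(batch) for batch in iter_batches(toy_texts, batch_size=1000)]
print(batch_lengths)  # [1000, 1000, 500]
```

A generator can only be consumed once, which is why the training cells later call `batch_iterator()` again each time instead of reusing a stored generator object.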

### Adapting an Existing Tokenizer vs Training a Tokenizer From Scratch

- Training a tokenizer from scratch typically gives better compression and fairer treatment of under-represented languages, because the vocabulary is built around your data instead of a mostly English corpus.

- Extending an existing tokenizer preserves compatibility with pretrained models, which makes adaptation cheaper and faster because the embeddings do not have to be retrained from scratch.

# **Building your tokenizer from scratch**


To understand how to build your tokenizer from scratch, we have to dive a little deeper into the Tokenizers library and the tokenization pipeline. This pipeline consists of several steps:

- **Normalization**: Executes all the initial transformations over the input string. For example, when you need to lowercase some text, strip it, or apply one of the common Unicode normalization processes, you add a Normalizer.
- **Pre-tokenization**: In charge of splitting the input string. This is the component that decides where and how to pre-segment the original string. The simplest example would be to split on spaces.
- **Model**: Handles all the sub-token discovery and generation. This is the part that is trainable and really dependent on your input data.
- **Post-Processing**: Provides advanced construction features to be compatible with some of the Transformers-based SoTA models. For instance, for BERT it would wrap the tokenized sentence in [CLS] and [SEP] tokens.

And to go in the other direction:

- **Decoding**: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the `PreTokenizer` we used previously.

For the training of the model, the Tokenizers library provides a `Trainer` class that we will use.

All of these building blocks can be combined to create working tokenization pipelines. To give you some examples, we will show three full pipelines here: how to replicate the tokenizers of GPT-2, BERT and T5 (which will give you an example of a Byte Pair Encoding, a WordPiece and a Unigram tokenizer, respectively).

### **Subword tokenization**

In this group of cells, we will include a more detailed explanation of subword tokenization. Byte Pair Encoding (BPE), WordPiece, and Unigram are all different kinds of what is known as subword tokenization.

Subword tokenization algorithms are based on the idea of considering units smaller than words as tokens. In particular, uncommon words in the vocabulary will be split into smaller units following different approaches. For example, the word *darkest* will be divided into *dark* and *est*, adding both these tokens to the vocabulary, but not the original word. We can already see how these approaches can be advantageous.





- **BPE**: starts from individual characters (or bytes) and repeatedly merges the most frequent pair of adjacent symbols, gradually building bigger subword units based on frequency.
- **WordPiece**: also grows the vocabulary step by step, but chooses the subword units that most increase the likelihood of the training text (a language-model-based criterion) rather than simply the most frequent pair.
- **Unigram**: starts with a large vocabulary and repeatedly removes the subwords whose removal reduces the likelihood of the training text the least, keeping only the most useful ones.
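The BPE merge loop is short enough to sketch in plain Python. The code below is a toy illustration under stated assumptions, not the implementation used by the Tokenizers library: the word counts are made up, words are split into characters, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical word counts, chosen only for illustration.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in word_counts.items()}

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent pair (first one wins ties)
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(list(words))
```

After three merges the suffix `est` has become a single symbol shared by `newest` and `widest`, which is the kind of reusable unit the paragraph above describes.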

# **WordPiece model like BERT**

Let's have a look at how we can create a WordPiece tokenizer like the one used for training BERT. The first step is to create a `Tokenizer` with an empty `WordPiece` model:

In [9]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

This `tokenizer` is not ready for training yet. We have to add some preprocessing steps: the normalization (which is optional) and the pre-tokenizer, which will split inputs into the chunks we will call words. The tokens will then be part of those words (but can't be larger than that).

In the case of BERT, the normalization is lowercasing. Since BERT is such a popular model, it has its own normalizer:

In [10]:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)

If you want to customize it, you can use the existing blocks and compose them in a sequence: here for instance we lower case, apply NFD normalization (unicode normalization) and strip the accents:

In [11]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
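To see what this sequence does to a string, we can imitate it with the standard library. This is only an approximation of the three normalizers, written with `unicodedata`; the `normalize` helper is a hypothetical name and the library's own behavior may differ in edge cases.

```python
import unicodedata


def normalize(text):
    """Rough equivalent of NFD -> Lowercase -> StripAccents."""
    decomposed = unicodedata.normalize("NFD", text)  # "ü" becomes "u" + combining mark
    lowered = decomposed.lower()
    # Drop combining marks (Unicode category "Mn"), i.e. the accents.
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


normalized = normalize("Gemüseladen gekämpft")
print(normalized)  # gemuseladen gekampft
```

For a German corpus like the one used here this is a real design choice: stripping accents makes `ü` and `u` indistinguishable, so you may prefer to keep only the lowercasing step, or no normalization at all.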

There is also a `BertPreTokenizer` we can use directly. It pre-tokenizes using white space and punctuation:

In [12]:
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Like for the normalizer, we can combine several pre-tokenizers in a `Sequence`. If we want to have a quick look at how it preprocesses the inputs, we can call the `pre_tokenize_str` method:

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str("Initiative hatte für Kreuzberger Gemüseladen gekämpft – „Bizim Bakkal“")

Note that the pre-tokenizer not only splits the text into words but also keeps the offsets, that is, the beginning and end of each of those words inside the original text. This is what allows the final tokenizer to match each token to the part of the text it comes from (a feature we use for question answering or token classification tasks).
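The role of the offsets is easy to check with a small stand-in. The sketch below approximates whitespace-and-punctuation splitting with a regular expression; it is not the `BertPreTokenizer` implementation (for instance, the regex treats `_` as a word character), and `pre_tokenize_with_offsets` is a name invented for this example.

```python
import re


def pre_tokenize_with_offsets(text):
    """Return (piece, (start, end)) pairs for words and punctuation marks."""
    return [
        (match.group(), (match.start(), match.end()))
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]


sample = "Ahmet Caliskan gibt auf."
pieces = pre_tokenize_with_offsets(sample)
print(pieces)

# Each offset pair points back to the exact span in the original string.
for piece, (start, end) in pieces:
    assert sample[start:end] == piece
```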

We can now train our tokenizer. The pipeline is not entirely finished, but we need a trained tokenizer to build the post-processor. We use a `WordPieceTrainer` for that. The key thing to remember is to pass along the special tokens to the trainer, as they won't be seen in the corpus.

In [None]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

To actually train the tokenizer, we can either pass some text files, or an iterator of batches of texts such as the `batch_iterator` we defined earlier:

In [None]:
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

Now that the tokenizer is trained, we can define the post-processor: we need to add the CLS token at the beginning and the SEP token at the end (for single sentences) or several SEP tokens (for pairs of sentences). We use a [`TemplateProcessing`](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.processors.TemplateProcessing) to do this, which requires knowing the IDs of the CLS and SEP tokens (which is why we waited for the training).

So let's first grab the ids of the two special tokens:

In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

And here is how we can build our post-processor. We have to indicate in the template how to organize the special tokens with one sentence (`$A`) or two sentences (`$A` and `$B`). The `:` followed by a number indicates the token type ID to give to each part. We use a binary token type ID to indicate, when there are two sentences, which tokens belong to the first sentence and which belong to the second.

In [None]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
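What the template produces can be mimicked on plain lists of tokens. The helper below is a hypothetical pure-Python stand-in for the two templates above, meant only to show where the special tokens and type IDs end up; the example tokens are made up.

```python
def apply_template(tokens_a, tokens_b=None):
    """Mimic "[CLS]:0 $A:0 [SEP]:0" and "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1"."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]
    return tokens, type_ids


single_tokens, single_types = apply_template(["gibt", "auf"])
pair_tokens, pair_types = apply_template(["gibt", "auf"], ["im", "sommer"])

print(single_tokens, single_types)
print(pair_tokens, pair_types)
```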

We can check we get the expected results by encoding a pair of sentences for instance:

In [None]:
encoding = tokenizer.encode("Für Caliskan hatte das langfristige Folgen.","\nAhmet Caliskan gibt auf.")

We can look at the tokens to check the special tokens have been inserted in the right places:

In [None]:
encoding.tokens

And we can check the token type ids are correct (the tokens that belong to the first sentence, including the special tokens [CLS] and [SEP] should get a 0 and the tokens that belong to the second sentence should get a 1):

In [None]:
encoding.type_ids

The last piece in this tokenizer is the decoder. We use a `WordPiece` decoder and indicate the special prefix `##`, used by our tokenizer to mark tokens that continue a word rather than start one (for example, a word such as *sentence* could be divided into *sent* and *##ence*).

In [None]:
tokenizer.decoder = decoders.WordPiece(prefix="##")
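The core of what this decoder does can be sketched in a few lines: pieces that start with the prefix are glued to the previous piece, and everything else starts a new word. This is a simplified illustration with a made-up helper name; the library decoder may also clean up spacing around punctuation, which this sketch does not attempt.

```python
def decode_wordpiece(tokens, prefix="##"):
    """Join WordPiece tokens back into a space-separated string."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # continuation of the previous word
        else:
            words.append(token)
    return " ".join(words)


decoded = decode_wordpiece(["sent", "##ence", "piece", "##s"])
print(decoded)  # sentence pieces
```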

Now that our tokenizer is finished, we need to wrap it inside a Transformers object to be able to use it with the Transformers library. More specifically, we have to put it inside the fast tokenizer class corresponding to the model we want to use, here a `BertTokenizerFast`:

In [None]:
from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

We can now use this tokenizer as a normal Transformers tokenizer, and use the `save_pretrained` or `push_to_hub` methods.

If the tokenizer you are building does not match any class in Transformers because it's really special, you can wrap it in `PreTrainedTokenizerFast`.

# **Byte Pair Encoding model like GPT-2**

Let's now have a look at how we can create a Byte Pair Encoding (BPE) tokenizer like the one used for training GPT-2. The first step is to create a `Tokenizer` with an empty `BPE` model:

In [None]:
tokenizer = Tokenizer(models.BPE())

Like before, we have to add the optional normalization (not used in the case of GPT-2) and we need to specify a pre-tokenizer before training. In the case of GPT-2, the pre-tokenizer used is a byte level pre-tokenizer:

In [None]:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

If we want to have a quick look at how it preprocesses the inputs, we can call the `pre_tokenize_str` method:

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str( "Für Caliskan hatte das langfristige Folgen.\nAhmet Caliskan gibt auf.")

We used the same default as for GPT-2 for the prefix space, so you can see that each word gets an initial `'Ġ'` added at the beginning, except the first one.
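Where does `Ġ` come from? Byte-level tokenizers first turn the text into UTF-8 bytes and then display each byte as a printable character. The sketch below rebuilds that byte-to-character table as we understand it from the GPT-2 reference code; treat it as an illustration of the mapping rather than the library's exact implementation. The helper names are our own.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}  # printable bytes map to themselves
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Whitespace and control bytes are moved to code points >= 256.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


byte_to_char = bytes_to_unicode()


def to_byte_level(text):
    """Show `text` the way a byte-level pre-tokenizer displays it."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))


print(to_byte_level(" hatte"))  # Ġhatte
print(to_byte_level("für"))     # fÃ¼r
```

The space (byte 32) lands on `Ġ`, and non-ASCII letters such as `ü` show up as two characters because they take two bytes in UTF-8. This is also why a byte-level tokenizer never needs an unknown token: every possible byte is in the base alphabet.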

We can now train our tokenizer! This time we use a `BpeTrainer`.

In [None]:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

To finish the whole pipeline, we have to include the post-processor and decoder:

In [None]:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
tokenizer.decoder = decoders.ByteLevel()

We can check the trained tokenizer on a sample text and then, like before, finish by wrapping it in a Transformers tokenizer object:

In [None]:
tokenizer.encode( "'Initiative hatte für Kreuzberger Gemüseladen gekämpft – „Bizim Bakkal“ schließt aus gesundheitlichen Gründen\n„Nun ist doch Schluss mit dem Gemüseladen „Bizim Bakkal“. Ahmet Caliskan gibt das Geschäft aus gesundheitlichen Gründen auf. Seine Nachbarschaft hatte lange für den Laden gekämpft – im vergangenen Sommer kam es zu heftigen Protesten gegen die Kündigung seitens des Vermieters. Für Caliskan hatte das langfristige Folgen.\nAhmet Caliskan gibt auf. Der Inhaber des Kreuzberger Gemüseladens „Bizim Bakkal“ schließt sein Geschäft im Wrangelkiez aus gesundheitlichen Gründen, wie die Nachbarschaftsinitiative „Bizim Kiez“ am Montag bekannt gab. Im vergangenen Sommer hatte Caliskan die Kündigung seines Mietvertrages erhalten.").tokens

In [None]:
from transformers import GPT2TokenizerFast

new_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

# **Unigram model like ALBERT, T5**

Let's now have a look at how we can create a Unigram tokenizer like the one used for training T5. The first step is to create a `Tokenizer` with an empty `Unigram` model:

In [None]:
tokenizer = Tokenizer(models.Unigram())
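Once a Unigram model has a probability for each piece, tokenizing a word means finding the segmentation with the highest total probability. The sketch below shows that search with a small Viterbi-style dynamic program. The probabilities are invented for the example and `viterbi_segment` is a hypothetical helper; it covers only the segmentation step, not how the model is trained.

```python
import math


def viterbi_segment(word, log_probs):
    """Return the most probable split of `word` into known pieces, or None."""
    # best[i] holds (score, start) for the best segmentation of word[:i].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # the word cannot be built from the known pieces
    # Walk the back-pointers from the end of the word to the beginning.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Made-up piece probabilities, for illustration only.
probs = {"token": 0.05, "tok": 0.01, "en": 0.03, "izing": 0.02, "iz": 0.01, "ing": 0.04}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

segmentation = viterbi_segment("tokenizing", log_probs)
print(segmentation)  # ['token', 'izing']
```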

Like before, we have to add the optional normalization (here some replaces and lower-casing) and we need to specify a pre-tokenizer before training. The pre-tokenizer used is a `Metaspace` pre-tokenizer: it replaces all spaces by a special character (defaulting to ▁) and then splits on that character.

In [None]:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"'), normalizers.Lowercase()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

If we want to have a quick look at how it preprocesses the inputs, we can call the `pre_tokenize_str` method:

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str("'Initiative hatte für Kreuzberger Gemüseladen gekämpft – „Bizim Bakkal“ schließt aus gesundheitlichen Gründen\n„Nun ist doch Schluss mit dem Gemüseladen „Bizim Bakkal“. Ahmet Caliskan gibt das Geschäft aus gesundheitlichen Gründen auf. Seine Nachbarschaft hatte lange für den Laden gekämpft – im vergangenen Sommer kam es zu heftigen Protesten gegen die Kündigung seitens des Vermieters. Für Caliskan hatte das langfristige Folgen.\nAhmet Caliskan gibt auf. Der Inhaber des Kreuzberger Gemüseladens „Bizim Bakkal“ schließt sein Geschäft im Wrangelkiez aus gesundheitlichen Gründen, wie die Nachbarschaftsinitiative „Bizim Kiez“ am Montag bekannt gab. Im vergangenen Sommer hatte Caliskan die Kündigung seines Mietvertrages erhalten.")

You can see that each word gets an initial `▁` added at the beginning, as is usually done by sentencepiece.
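The effect of the `▁` marker can be reproduced with a few string operations. This is a toy version with made-up helper names; the real `Metaspace` component has options for the replacement character and for how the prefix is handled, which this sketch ignores.

```python
def metaspace_pre_tokenize(text, replacement="▁"):
    """Mark every space (plus the start of the text) and split before each marker."""
    marked = replacement + text.replace(" ", replacement)
    pieces = []
    for ch in marked:
        if ch == replacement or not pieces:
            pieces.append(ch)   # a marker starts a new piece
        else:
            pieces[-1] += ch    # any other character extends the current piece
    return pieces


def metaspace_decode(pieces, replacement="▁"):
    """Undo the marking: turn markers back into spaces and drop the added prefix."""
    return "".join(pieces).replace(replacement, " ")[1:]


sample = "Ahmet Caliskan gibt auf."
pieces = metaspace_pre_tokenize(sample)
print(pieces)  # ['▁Ahmet', '▁Caliskan', '▁gibt', '▁auf.']
print(metaspace_decode(pieces))
```

Because the spaces are stored inside the pieces themselves, decoding is just concatenation, which is why the same `Metaspace` component can serve as the decoder.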

We can now train our tokenizer! This time we use a `UnigramTrainer`. We have to explicitly set the unknown token in this trainer, otherwise it will forget it afterward.

In [None]:
trainer = trainers.UnigramTrainer(vocab_size=25000, special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"], unk_token="<unk>")
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

To finish the whole pipeline, we have to include the post-processor and decoder, after a quick check of the trained model on a sample text. The post-processor is very similar to what we saw with BERT, and the decoder is just `Metaspace`, like the pre-tokenizer.

In [None]:
tokenizer.encode("'Initiative hatte für Kreuzberger Gemüseladen gekämpft – „Bizim Bakkal“ schließt aus gesundheitlichen Gründen\n„Nun ist doch Schluss mit dem Gemüseladen „Bizim Bakkal“. Ahmet Caliskan gibt das Geschäft aus gesundheitlichen Gründen auf. Seine Nachbarschaft hatte lange für den Laden gekämpft – im vergangenen Sommer kam es zu heftigen Protesten gegen die Kündigung seitens des Vermieters. Für Caliskan hatte das langfristige Folgen.\nAhmet Caliskan gibt auf. Der Inhaber des Kreuzberger Gemüseladens „Bizim Bakkal“ schließt sein Geschäft im Wrangelkiez aus gesundheitlichen Gründen, wie die Nachbarschaftsinitiative „Bizim Kiez“ am Montag bekannt gab. Im vergangenen Sommer hatte Caliskan die Kündigung seines Mietvertrages erhalten.").tokens

In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")

In [None]:
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", cls_token_id),
        ("[SEP]", sep_token_id),
    ],
)
tokenizer.decoder = decoders.Metaspace()

And like before, we finish by wrapping this in a Transformers tokenizer object:

In [None]:
from transformers import AlbertTokenizerFast

new_tokenizer = AlbertTokenizerFast(tokenizer_object=tokenizer)