* [How to train your tokenizer from scratch using SentencePiece](https://colab.research.google.com/drive/1Ica34BAGK2tuIeQl01SRNTjujPq5C3d1?usp=sharing#scrollTo=x_9uWrrD1GUD)
* [HuggingFace example on how to train tokenizers](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb#scrollTo=tQ-nlKovypyv)

In this notebook, we will follow Hugging Face's guidance to train a new tokenizer from an existing one. This approach lets us reuse the original tokenizer's pipeline (its normalization, pre-tokenization rules, and special tokens) and only learn a new vocabulary.

Specifically, we’ll work with the Qwen tokenizer, which is built on tiktoken. Its functionality is quite similar to that of GPT-2, with only minor differences. We'll explore those differences more thoroughly another time.

> **Note:** These days, many large language models (LLMs) rely on Byte Pair Encoding (BPE), and several of them use tiktoken for tokenization.

## 1 - Load and prepare data

In [7]:
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer

In [2]:
# Load the dataset in streaming mode
stories_stream = load_dataset(
    "roneneldan/TinyStories", streaming=True, trust_remote_code=True
)

n_rows = 1100

# Get the first n_rows rows
rows = list(stories_stream["train"].take(n_rows))

# Count the total number of characters
total_chars = sum(len(row["text"]) for row in rows)
total_chars

1014715

Make it a Hugging Face dataset, for simplicity's sake.

In [4]:
stories = Dataset.from_list(rows)

print(stories)

Dataset({
    features: ['text'],
    num_rows: 1100
})


In [16]:
stories[0]

{'text': 'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'}

We are also going to use a `batch_iterator`, as if this were a big dataset. For the sake of learning, we are going to use a small `batch_size`.

In [22]:
batch_size = 100


def batch_iterator():
    for i in range(0, len(stories), batch_size):
        yield stories[i : i + batch_size]["text"]

In [None]:
# Iterate through the batches and print the first few items of each batch
for batch_number, batch in enumerate(batch_iterator(), start=1):
    print(f"Batch {batch_number} (Size {len(batch)}):")
    print(batch[0])  # show just the first instance of the batch
    print("-" * 40)  # Separator between batches
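
To see how the batching behaves independently of 🤗 Datasets, here is a minimal sketch that applies the same slicing pattern to a plain Python list. The names `iter_batches` and `toy_texts` are hypothetical stand-ins, not part of the notebook; the point is only that the last batch can be smaller when the number of rows is not a multiple of `batch_size`.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# Hypothetical stand-in for the 1100 stories loaded above
toy_texts = [f"story {i}" for i in range(1100)]

sizes_100 = [len(batch) for batch in iter_batches(toy_texts, 100)]
sizes_300 = [len(batch) for batch in iter_batches(toy_texts, 300)]

print(len(sizes_100))  # 11
print(sizes_300)       # [300, 300, 300, 200]
```

Because the function is a generator, each call to `iter_batches(...)` produces a fresh iterator, which is why the notebook calls `batch_iterator()` again every time it trains a tokenizer.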

## 2 - Train tokenizer using GPT-2 / Qwen tokenization algorithms

If we want to train a tokenizer with the exact same algorithms and parameters as an existing one, we can just use the `train_new_from_iterator` method, which 🤗 Transformers provides on *fast* tokenizers (those backed by the 🤗 Tokenizers library).

In [36]:
gpt2_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

Make sure that the tokenizer you picked has a *fast* version (backed by the 🤗 Tokenizers library); otherwise the rest of the notebook will not run:

In [37]:
print(gpt2_tokenizer.is_fast)
print(qwen_tokenizer.is_fast)

True
True


Then we feed the training corpus (any iterator over batches of texts, such as the `batch_iterator` we defined earlier) to the `train_new_from_iterator` method. We also have to specify the **vocabulary size** we want to use.

> **Note:** If the `vocab_size` we pass is lower than the base vocabulary of the tokenizer (256 byte tokens plus the special tokens, i.e., **257** for GPT-2 and **270** for Qwen), it seems to be ignored.
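
The arithmetic behind these numbers can be sketched as follows. This assumes a byte-level base alphabet of 256 tokens and counts 1 special token for GPT-2 and 14 for Qwen (the `eos_token` plus 13 additional special tokens), which is what the base sizes of 257 and 270 imply. The helper `learned_merges` is a hypothetical name, and clamping at zero is only a way to model the "seems to be ignored" observation, not the library's documented behavior.

```python
BYTE_ALPHABET_SIZE = 256  # byte-level BPE starts with one token per byte value


def learned_merges(vocab_size, n_special_tokens):
    """Number of new tokens BPE can learn for a requested vocabulary size."""
    base_vocab = BYTE_ALPHABET_SIZE + n_special_tokens
    # Assumption: a vocab_size below the base vocabulary yields no merges
    return max(vocab_size - base_vocab, 0)


print(learned_merges(258, 1))    # GPT-2 with vocab_size=258  -> 1
print(learned_merges(271, 14))   # Qwen with vocab_size=271   -> 1
print(learned_merges(1024, 1))   # GPT-2 with vocab_size=1024 -> 767
print(learned_merges(1024, 14))  # Qwen with vocab_size=1024  -> 754
```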

### 2.1 - Train GPT-2 tokenizer with 1 extra token

Let's train the GPT-2 tokenizer with a `vocab_size` of 258. This means that "training" adds only 1 new token to the vocabulary.

In [61]:
new_tokenizer_gpt2 = gpt2_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=258
)
new_tokenizer_gpt2

GPT2TokenizerFast(name_or_path='openai-community/gpt2', vocab_size=258, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

In [58]:
new_tokenizer_gpt2.convert_ids_to_tokens(257)  # The new token after BPE

'he'
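
As a rough illustration of why the single extra token is a merged pair such as `'he'`, here is a toy sketch of one BPE step: count adjacent symbol pairs inside each word and pick the most frequent one. It works on the characters of a tiny made-up word list (`toy_words`, a hypothetical example), whereas the real tokenizer operates on the bytes of pre-tokenized text, so treat it as a simplification rather than the library's implementation.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair and its count."""
    pair_counts = Counter()
    for word, freq in Counter(words).items():
        symbols = list(word)
        # Pairs are counted inside a word only, never across word boundaries
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts.most_common(1)[0]


toy_words = "she he the then hen".split()
pair, count = most_frequent_pair(toy_words)

print(pair, count)  # ('h', 'e') 5
```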

### 2.2 - Train Qwen tokenizer with 1 extra token

Let's train the Qwen tokenizer with a `vocab_size` of 271. This means that "training" adds only 1 new token to the vocabulary.

In [62]:
new_tokenizer_qwen = qwen_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=271
)
new_tokenizer_qwen

Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-0.5B', vocab_size=271, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=

### 2.3 - Train GPT-2 tokenizer with 1024 tokens

In [76]:
new_tokenizer_gpt2 = gpt2_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=1024
)

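Although we don't know the count of each token, BPE adds new tokens to the vocabulary in the order in which they are merged, so we can see which tokens were the "least common" by their rank in the vocabulary: the highest IDs belong to the last merges.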
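
The link between token ID and merge order can be sketched with a toy BPE trainer. This is a character-level simplification written for illustration (the function `train_toy_bpe` and the word list are hypothetical, and the real trainer works on bytes with special tokens first), but it shows the same property: each merge appends one token, so later merges get higher IDs.

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Learn `num_merges` merges and return a token -> id mapping."""
    # Each word is a tuple of symbols, weighted by how often it occurs
    corpus = Counter(tuple(word) for word in words)
    # Base vocabulary: the single characters, in sorted order
    vocab = sorted({ch for word in corpus for ch in word})
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        (left, right), _count = pair_counts.most_common(1)[0]
        merged = left + right
        vocab.append(merged)  # the new token gets the next free id
        # Rewrite every word, replacing the chosen pair with the merged token
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i : i + 2] == (left, right):
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return {token: idx for idx, token in enumerate(vocab)}


toy_words = ["the"] * 5 + ["then"] * 3 + ["she"] * 2
toy_vocab = train_toy_bpe(toy_words, num_merges=3)

print(toy_vocab)
# {'e': 0, 'h': 1, 'n': 2, 's': 3, 't': 4, 'he': 5, 'the': 6, 'then': 7}
```

Here `he` is merged first because its pair is the most frequent, and `then` is merged last, so it receives the highest ID.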

In [77]:
# Get the vocabulary from the tokenizer
vocab_gpt2 = new_tokenizer_gpt2.get_vocab()

# Sort the vocabulary by token id (in ascending order)
sorted_vocab = sorted(vocab_gpt2.items(), key=lambda x: x[1])

number = 5

# Print the top token IDs
print(f"Top {number} tokens:")
for token, token_id in sorted_vocab[:number]:
    print(f"Token: {token}, ID: {token_id}")

# Print the bottom token IDs
print(f"\nBottom {number} tokens:")
for token, token_id in sorted_vocab[-number:]:
    print(f"Token: {token}, ID: {token_id}")

Top 5 tokens:
Token: <|endoftext|>, ID: 0
Token: !, ID: 1
Token: ", ID: 2
Token: #, ID: 3
Token: $, ID: 4

Bottom 5 tokens:
Token: Ġlived, ID: 1019
Token: raw, ID: 1020
Token: ash, ID: 1021
Token: ĠWhen, ID: 1022
Token: so, ID: 1023
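
The `Ġ` at the start of tokens such as `Ġlived` is how byte-level BPE displays a leading space: every byte is mapped to a printable Unicode character so that tokens can be stored as ordinary strings. The sketch below reimplements that byte-to-character mapping in the style of GPT-2's published encoder; the helper name `to_byte_level` is a hypothetical addition for illustration.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are shifted
            # to code points starting at 256
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()


def to_byte_level(text):
    """Render text the way byte-level BPE vocabularies display it."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))


print(to_byte_level(" lived"))  # Ġlived
print(to_byte_level(" When"))   # ĠWhen
```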


### 2.4 - Train Qwen tokenizer with 1024 tokens

In [80]:
new_tokenizer_qwen = qwen_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=1024
)

In [81]:
# Get the vocabulary from the tokenizer
vocab_qwen = new_tokenizer_qwen.get_vocab()

# Sort the vocabulary by token id (in ascending order)
sorted_vocab_qwen = sorted(vocab_qwen.items(), key=lambda x: x[1])

number = 5
# Print the top token IDs
print(f"Top {number} tokens:")
for token, token_id in sorted_vocab_qwen[:number]:
    print(f"Token: {token}, ID: {token_id}")

# Print the bottom token IDs
print(f"\nBottom {number} tokens:")
for token, token_id in sorted_vocab_qwen[-number:]:
    print(f"Token: {token}, ID: {token_id}")

Top 5 tokens:
Token: <|endoftext|>, ID: 0
Token: <|im_start|>, ID: 1
Token: <|im_end|>, ID: 2
Token: <|object_ref_start|>, ID: 3
Token: <|object_ref_end|>, ID: 4

Bottom 5 tokens:
Token: bbit, ID: 1032
Token: Ġpic, ID: 1033
Token: Ġow, ID: 1034
Token: Ġwatch, ID: 1035
Token: Ġbrave, ID: 1036


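The two tokenizers learn different tokens. This does not seem to be explained only by Qwen's 13 additional special tokens: I also tested with 1024 + 13 tokens and observed that the learned tokens still differed, which suggests that the difference comes from the tokenization algorithms themselves (likely their pre-tokenization rules).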

### 2.5 - Save tokenizers

In [85]:
new_tokenizer_gpt2.save_pretrained("custom-gpt2-tokenizer")

('custom-gpt2-tokenizer/tokenizer_config.json',
 'custom-gpt2-tokenizer/special_tokens_map.json',
 'custom-gpt2-tokenizer/vocab.json',
 'custom-gpt2-tokenizer/merges.txt',
 'custom-gpt2-tokenizer/added_tokens.json',
 'custom-gpt2-tokenizer/tokenizer.json')

In [86]:
new_tokenizer_qwen.save_pretrained("custom-qwen-tokenizer")

('custom-qwen-tokenizer/tokenizer_config.json',
 'custom-qwen-tokenizer/special_tokens_map.json',
 'custom-qwen-tokenizer/vocab.json',
 'custom-qwen-tokenizer/merges.txt',
 'custom-qwen-tokenizer/added_tokens.json',
 'custom-qwen-tokenizer/tokenizer.json')

## 3 - Building Our Custom Tokenizer from Scratch

If you're interested, here are several helpful resources:

* [HuggingFace Notebook Example](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb#scrollTo=9wG8PjQSj-Nc)
* [HuggingFace Forum Thread on How to Build a Custom LlamaTokenizer](https://discuss.huggingface.co/t/how-to-train-a-llamatokenizer/64835)
    * [Colab Notebook Showing How to Do It with SentencePiece](https://colab.research.google.com/drive/1Ica34BAGK2tuIeQl01SRNTjujPq5C3d1?usp=sharing), then check the thread to see [how to save it as a FastTokenizer](https://discuss.huggingface.co/t/how-to-train-a-llamatokenizer/64835/15) (not sure if this would work, to be honest)
* HuggingFace Transformers Library Implementation of Tokenizers:
    * [GPT-2](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/tokenization_gpt2_fast.py)
    * [Qwen](https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/qwen2/tokenization_qwen2_fast.py)
* Original Repo Implementations:
    * [Llama](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py) (as an example; other models like Qwen are available directly in HuggingFace)
