# Vocabulary Trimming of a T5 model

This notebook reduces the size of the tokenizer and of the embedding and decoding layers of a T5 model by removing the tokens that are not needed for a target language. The goal is to turn a multilingual T5 model into a smaller, more efficient monolingual model while preserving the original model's accuracy on the target language.

This notebook follows [this blog post](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90), which builds on the paper [Load What You Need](https://aclanthology.org/2020.sustainlp-1.16.pdf).

We run the procedure on mT5 (`google/mt5-small`) with French as the target language.

A complete description of this and other methods for reducing model size is available [here](https://github.com/alpinf/smaller_language_models/blob/main/Smaller_LLMs.md).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alpinf/smaller_language_models/blob/main/notebooks/vocab_trim_mT5.ipynb)

## Setup

In [8]:
import sys
if 'google.colab' not in sys.modules:
    print("Warning: The setup was only tested in Google Colab")

!python -m pip install pandas==2.2.2 torch==2.3.1 transformers==4.41.2 datasets==2.20.0 sentencepiece==0.2.0 gdown tqdm --quiet --no-deps

In [9]:
import csv
import os
from collections import Counter

import gdown
import pandas as pd
from sentencepiece import sentencepiece_model_pb2 as spmp
import torch
from tqdm.auto import tqdm, trange
from transformers import T5ForConditionalGeneration, T5Tokenizer


In [10]:
out_dir = "./out"
data_dir = "./data"
os.makedirs(out_dir, exist_ok=True)
os.makedirs(data_dir, exist_ok=True)

In [11]:
def download_file_from_google_drive(file_id, destination):
    """
    Example usage:
    File link on Gdrive: https://drive.google.com/file/d/188r2cctPaqmuXnwer3KBpIHpftnE9tCb/view?usp=drive_link

    download_file_from_google_drive("188r2cctPaqmuXnwer3KBpIHpftnE9tCb", "./data/")
    """
    try:
        url = f"https://drive.google.com/uc?export=download&id={file_id}"
        gdown.download(url, output=destination, quiet=False)
        print(f"File downloaded from Google Drive to {destination}")
    except Exception as e:
        print(f"Error downloading file: {e}")
        raise e

Load the mT5 model and its tokenizer.

In [30]:
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = T5ForConditionalGeneration.from_pretrained('google/mt5-small')

You are using a model of type mt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


We can compute the share of the model's parameters held by the embedding and decoding layers.

In [32]:
def parameters_layer(layer): # returns the number of parameters in a layer
    return sum(p.numel() for p in layer.parameters())

num_params = parameters_layer(model)
# shared is the input embedding layer, lm_head is the output (decoding) projection
print(f"Percentage of parameters in the encoding layer: {100 * parameters_layer(model.shared) / num_params:.1f}%")   # 42.7%
print(f"Percentage of parameters in the decoding layer: {100 * parameters_layer(model.lm_head) / num_params:.1f}%")  # 42.7%
print(f"The length of the mT5 vocabulary is {len(tokenizer.get_vocab())}")

Percentage of parameters in the encoding layer: 42.7%
Percentage of parameters in the decoding layer: 42.7%
The length of the mT5 vocabulary is 250100


## Selecting the vocabulary
To select the vocabulary we use the [Leipzig corpora collection](https://wortschatz.uni-leipzig.de/en/download/French), but any corpus in the target language can be used. We use the 2023 **French _News_** corpus with 1 million sentences.

The corpus can be downloaded from the link above. We keep a backup copy on our drive for reproducibility.

First, we tokenize the entire corpus and count how often each token occurs.


In [49]:
# Download the dataset manually from the link above or use a copy on our drive
data_path = f"{data_dir}/fra_news_2023_1M-sentences.txt"

if not os.path.exists(data_path):
    print("Downloading dataset...")
    download_file_from_google_drive("171RgrkXuWHfY-4BzHR8JEvYp9-D8yPqA", data_path)
    print("Dataset downloaded from drive")

df_fr = pd.read_table(data_path, header=None, quoting=csv.QUOTE_NONE,
                    dtype={0: int, 1: str})
df_fr = df_fr.rename(columns={0: "idx", 1: "text"})
cnt = Counter()

total_tokens = 0
for text in tqdm(df_fr.text):
    tokens = tokenizer.encode(text)
    cnt.update(tokens)
    total_tokens += len(tokens)

print(f"The total number of tokens seen in the dataset is {total_tokens}")
print(f"The number of unique tokens in the dataset is {len(cnt)}")
print(f"The dataset covers {100 * len(cnt) / tokenizer.vocab_size:.1f}% of the mT5 vocabulary.")

  0%|          | 0/1000000 [00:00<?, ?it/s]

The total number of tokens seen in the dataset is 37866826
The number of unique tokens in the dataset is 63347
The dataset covers 25.3% of the mT5 vocabulary.


We can decide how many tokens to keep based on what percentage of the corpus we want to cover.

In [50]:
percentage_to_keep = 0.999

total_count = sum(cnt.values()) # computed once instead of on every iteration
cum_sum = 0
for i, (k, v) in enumerate(cnt.most_common()):
    cum_sum += v
    if cum_sum / total_count > percentage_to_keep:
        break
    num_tokens = i + 1 # we save the number of tokens to keep

for top in 10_000, 20_000, num_tokens, 60_000:
    print(f"The {top} most common tokens cover {100 * sum(v for k, v in cnt.most_common(top)) / sum(cnt.values()):.2f}% of the dataset")

print(f"\nWe will keep {num_tokens} tokens to cover {percentage_to_keep * 100}% of the dataset")

The 10000 most common tokens cover 97.23% of the dataset
The 20000 most common tokens cover 99.14% of the dataset
The 42115 most common tokens cover 99.90% of the dataset
The 60000 most common tokens cover 99.99% of the dataset

We will keep 42115 tokens to cover 99.9% of the dataset
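
The cutoff selection can also be written as a small standalone function. The following is a minimal sketch on toy data; `tokens_for_coverage` is a hypothetical helper, not part of the notebook. Note one deliberate difference: it returns the smallest number of tokens whose cumulative share *reaches* the target, which can be one more than the count produced by the loop above.

```python
from collections import Counter


def tokens_for_coverage(counts, target):
    """Return how many of the most frequent tokens are needed so that
    their cumulative frequency reaches `target` (a fraction in [0, 1])."""
    total = sum(counts.values())
    running = 0
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return n
    return len(counts)


# Toy frequency table standing in for the token counts of a corpus.
toy_counts = Counter({"a": 5, "b": 3, "c": 1, "d": 1})
print(tokens_for_coverage(toy_counts, 0.8))   # "a" and "b" cover 8 of 10 occurrences
print(tokens_for_coverage(toy_counts, 0.85))  # a third token is needed
```

In the notebook this would be called as `tokens_for_coverage(cnt, percentage_to_keep)`.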


We start by keeping the first 259 tokens. These include the three special tokens (`<pad>`, `</s>`, `<unk>`), which are fundamental for fine-tuning, and the 256 byte tokens (`<0x00>`, `<0x01>`, ...), which let the tokenizer represent any character through its UTF-8 bytes. Keeping them means that even if the model encounters a character that is not in the vocabulary, it can still represent it.

We also keep the last 100 tokens. These are the extra tokens (`▁<extra_id_0>`, `▁<extra_id_1>`, ...), which T5-style models use as sentinels for masked spans and which can be repurposed for other tasks.
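
To illustrate why the 256 byte tokens are worth keeping, the sketch below shows how a character outside the vocabulary can be spelled out as byte tokens. It only mimics the naming scheme of the `<0x..>` tokens in plain Python; `byte_fallback_pieces` is a hypothetical helper and does not call the tokenizer.

```python
def byte_fallback_pieces(char):
    """Spell a character as one `<0xNN>` piece per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


print(byte_fallback_pieces("é"))  # ['<0xC3>', '<0xA9>']
print(byte_fallback_pieces("€"))  # ['<0xE2>', '<0x82>', '<0xAC>']
```

Because every character decomposes into bytes in the range 0 to 255, these 256 tokens are enough to cover any input.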

In [51]:
print(f"Special tokens: {tokenizer.convert_ids_to_tokens(range(3))}")
print(f"Hexadecimal tokens: {tokenizer.convert_ids_to_tokens(range(3, 256 + 3))}")
print(f"Last 100 tokens: {tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size - 100, tokenizer.vocab_size))}")

Special tokens: ['<pad>', '</s>', '<unk>']
Hexadecimal tokens: ['<0x00>', '<0x01>', '<0x02>', '<0x03>', '<0x04>', '<0x05>', '<0x06>', '<0x07>', '<0x08>', '<0x09>', '<0x0A>', '<0x0B>', '<0x0C>', '<0x0D>', '<0x0E>', '<0x0F>', '<0x10>', '<0x11>', '<0x12>', '<0x13>', '<0x14>', '<0x15>', '<0x16>', '<0x17>', '<0x18>', '<0x19>', '<0x1A>', '<0x1B>', '<0x1C>', '<0x1D>', '<0x1E>', '<0x1F>', '<0x20>', '<0x21>', '<0x22>', '<0x23>', '<0x24>', '<0x25>', '<0x26>', '<0x27>', '<0x28>', '<0x29>', '<0x2A>', '<0x2B>', '<0x2C>', '<0x2D>', '<0x2E>', '<0x2F>', '<0x30>', '<0x31>', '<0x32>', '<0x33>', '<0x34>', '<0x35>', '<0x36>', '<0x37>', '<0x38>', '<0x39>', '<0x3A>', '<0x3B>', '<0x3C>', '<0x3D>', '<0x3E>', '<0x3F>', '<0x40>', '<0x41>', '<0x42>', '<0x43>', '<0x44>', '<0x45>', '<0x46>', '<0x47>', '<0x48>', '<0x49>', '<0x4A>', '<0x4B>', '<0x4C>', '<0x4D>', '<0x4E>', '<0x4F>', '<0x50>', '<0x51>', '<0x52>', '<0x53>', '<0x54>', '<0x55>', '<0x56>', '<0x57>', '<0x58>', '<0x59>', '<0x5A>', '<0x5B>', '<0x5C>', '<0x5D

In [52]:
new_tokens = set(range(3 + 256)) # first 259 tokens
new_tokens.update(k for k, v in cnt.most_common(num_tokens)) # most frequent tokens in the corpus
new_tokens.update(range(tokenizer.vocab_size - 100, tokenizer.vocab_size)) # last 100 tokens
kept_ids = sorted(new_tokens)

print(f"The final number of tokens, including the special tokens, is {len(kept_ids)}")
print(f"The percentage of the mT5 vocabulary we will keep is {100 * len(kept_ids) / tokenizer.vocab_size:.1f}%")

The final number of tokens, including the special tokens, is 42473
The percentage of the mT5 vocabulary we will keep is 17.0%


## Reducing the embedding layer and the decoder layer
We start by shrinking the embedding and decoding layers of the T5 model to the smaller vocabulary.

Each kept token is assigned a new index, from 0 to the number of kept tokens minus one (this is done with `enumerate`). A new embedding matrix is then created in which row `new_id` is a copy of row `old_id` of the original matrix. The same is done for the decoding layer.
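
The row remapping can be illustrated on a toy matrix before applying it to the real weights. This is a minimal sketch using plain Python lists in place of tensors; `remap_rows` and the toy data are hypothetical and only meant to show the indexing.

```python
def remap_rows(matrix, kept_ids):
    """Build a smaller matrix whose row `new_id` copies row `old_id`
    of the original, for each (new_id, old_id) in enumerate(kept_ids)."""
    new_matrix = [None] * len(kept_ids)
    for new_id, old_id in enumerate(kept_ids):
        new_matrix[new_id] = list(matrix[old_id])
    return new_matrix


# Toy "embedding matrix" with 5 tokens and 2 dimensions per token.
toy_matrix = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
toy_kept_ids = [0, 2, 4]
print(remap_rows(toy_matrix, toy_kept_ids))  # [[0.0, 0.1], [2.0, 2.1], [4.0, 4.1]]
```

Keeping `kept_ids` sorted preserves the relative order of the tokens, which matters later when the tokenizer is trimmed with the same list.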

In [56]:
new_size = len(kept_ids)
new_emb = torch.nn.Embedding(new_size, model.shared.embedding_dim)
new_head = torch.nn.Linear(in_features=model.lm_head.in_features, out_features=new_size, bias=False)
for new_id, old_id in enumerate(kept_ids):
    # we map the old weights to the new weights, both for the embedding and the head
    new_emb.weight.data[new_id] = model.shared.weight.data[old_id]
    new_head.weight.data[new_id] = model.lm_head.weight.data[old_id]

# we update the model weights
model.shared.weight = new_emb.weight
model.lm_head.weight = new_head.weight

# we update the model configuration to reflect the new vocabulary size
model.config.vocab_size = new_size
model.config._name_or_path = f"{out_dir}/frt5-small" # same directory the model is saved to below

In [57]:
len(kept_ids), max(kept_ids), new_size, new_emb.weight.shape, model.shared.weight.data.shape

(42473, 250099, 42473, torch.Size([42473, 512]), torch.Size([42473, 512]))

In [58]:
print(f"The number of parameters of the original model is {num_params/1e6:.0f}M")
print(f"The number of parameters of the new model is {parameters_layer(model)/1e6:.0f}M")
print(f"The new model has {100 * parameters_layer(model) / num_params:.1f}% of the parameters of the original model.")

The number of parameters of the original model is 300M
The number of parameters of the new model is 88M
The new model has 29.2% of the parameters of the original model.
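
The reduction can be sanity-checked with simple arithmetic: each removed token deletes one row of width `d_model` from both the embedding and the decoding matrix. The sketch below is a rough estimate, not an exact count. It assumes the rounded 300M total printed above, the hidden size of 512 seen in the weight shapes, and one matrix row per tokenizer entry (250100), which may not match the model's actual matrix size exactly. `trimmed_param_count` is a hypothetical helper.

```python
def trimmed_param_count(total_params, old_rows, new_rows, d_model, n_matrices=2):
    """Estimate the parameter count after removing rows from
    `n_matrices` vocabulary-sized matrices of width `d_model`."""
    removed_rows = old_rows - new_rows
    return total_params - n_matrices * removed_rows * d_model


estimate = trimmed_param_count(
    total_params=300_000_000,  # assumption: rounded total printed above
    old_rows=250_100,          # assumption: one row per tokenizer entry
    new_rows=42_473,           # number of kept tokens
    d_model=512,               # hidden size from the weight shapes above
)
print(f"Estimated size after trimming: {estimate / 1e6:.1f}M parameters")
```

The estimate lands close to the 88M reported above; the small gap comes from the rounded inputs.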


## Reducing the tokenizer
Changing the tokenizer is more involved, because SentencePiece is written in C++ and stores its model as a serialized protobuf message.
To edit it we need the Python bindings generated from the *.proto* file of the sentencepiece repository: we download the file and compile it with the protobuf compiler, which must be installed. The `sentencepiece` package also ships these bindings as `sentencepiece_model_pb2`, which is the module imported in the setup (as `spmp`) and used below.

In [59]:
! wget https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto
! protoc --python_out=./out sentencepiece_model.proto

--2024-12-22 22:54:29--  https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14023 (14K) [text/plain]
Saving to: ‘sentencepiece_model.proto’


2024-12-22 22:54:29 (94.4 MB/s) - ‘sentencepiece_model.proto’ saved [14023/14023]



The procedure mirrors the one used for the embedding and decoding layers. We build a new SentencePiece model that contains only the kept pieces, in the same order as `kept_ids`, so that piece `i` of the new tokenizer corresponds to row `i` of the new embedding matrix. The vocabulary size of the new tokenizer is then the number of kept tokens.

In [60]:
smp = tokenizer.sp_model.serialized_model_proto()
m = spmp.ModelProto()
m.ParseFromString(smp)
print("The loaded model has pieces:", len(m.pieces))
new_pieces = [m.pieces[idx] for idx in kept_ids]
print("The new pieces:", len(new_pieces))

for i, p in enumerate(new_pieces):
    m.pieces[i].piece = p.piece
    m.pieces[i].score = p.score
    m.pieces[i].type = p.type

# drop the remaining pieces
n = len(new_pieces)
for i in trange(len(m.pieces) - n):
    m.pieces.pop(len(m.pieces) - 1)

with open("new_sp.model", "wb") as f:
    f.write(m.SerializeToString())
new_tokenizer = T5Tokenizer("new_sp.model", extra_ids=0)

# delete the proto file
os.remove("sentencepiece_model.proto")

The loaded model has pieces: 250100
The new pieces: 42473


  0%|          | 0/207627 [00:00<?, ?it/s]
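
Since the model weights and the tokenizer are trimmed separately, it can be worth checking that they still agree: new ID `i` should name the same token that `kept_ids[i]` named in the original tokenizer. The following is a minimal sketch of such a check on toy vocabularies; `find_mismatches` is a hypothetical helper that takes any two ID-to-token lookup functions.

```python
def find_mismatches(kept_ids, old_id_to_token, new_id_to_token):
    """Return (new_id, old_id, old_token, new_token) for every kept ID
    whose token differs between the old and the new vocabulary."""
    mismatches = []
    for new_id, old_id in enumerate(kept_ids):
        old_token = old_id_to_token(old_id)
        new_token = new_id_to_token(new_id)
        if old_token != new_token:
            mismatches.append((new_id, old_id, old_token, new_token))
    return mismatches


# Toy vocabularies standing in for the original and trimmed tokenizers.
old_vocab = ["<pad>", "</s>", "<unk>", "▁the", "▁le", "▁der"]
toy_kept_ids = [0, 1, 2, 4]
new_vocab = [old_vocab[i] for i in toy_kept_ids]
print(find_mismatches(toy_kept_ids, old_vocab.__getitem__, new_vocab.__getitem__))  # []
```

In the notebook, the same check could be run as `find_mismatches(kept_ids, tokenizer.convert_ids_to_tokens, new_tokenizer.convert_ids_to_tokens)`; an empty list would indicate that the order was preserved.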

Finally we can save the new trimmed model.

In [61]:
new_tokenizer.save_pretrained(f"{out_dir}/frt5-small")
model.save_pretrained(f"{out_dir}/frt5-small")

## Encoding tests
To test if the new tokenizer works correctly we can encode and decode different sentences in different languages. This will also show how differently sentences are tokenized in different languages.

In [62]:
english_sentence = "This is an example"
french_sentence = "Ceci est un exemple"

# encode the sentence using the old and the new tokenizer
encoded_old = tokenizer.encode(english_sentence)
encoded_new = new_tokenizer.encode(english_sentence)

# print the encoding
print("Old Tokenizer Encoding:", encoded_old)
print("New Tokenizer Encoding:", encoded_new)
print()
# show the tokens used
tokens_old = tokenizer.convert_ids_to_tokens(encoded_old)
tokens_new = new_tokenizer.convert_ids_to_tokens(encoded_new)

print("Tokens Used (Old Tokenizer):", tokens_old)
print("Tokens Used (New Tokenizer):", tokens_new)
print()
encoded_old = tokenizer.encode(french_sentence)
encoded_new = new_tokenizer.encode(french_sentence)

# print the encoding
print("Old Tokenizer Encoding:", encoded_old)
print("New Tokenizer Encoding:", encoded_new)
print()
# show the tokens used
tokens_old = tokenizer.convert_ids_to_tokens(encoded_old)
tokens_new = new_tokenizer.convert_ids_to_tokens(encoded_new)

print("Tokens Used (Old Tokenizer):", tokens_old)
print("Tokens Used (New Tokenizer):", tokens_new)

Old Tokenizer Encoding: [1494, 339, 461, 11310, 1]
New Tokenizer Encoding: [990, 330, 427, 1274, 34898, 1]

Tokens Used (Old Tokenizer): ['▁This', '▁is', '▁an', '▁example', '</s>']
Tokens Used (New Tokenizer): ['▁This', '▁is', '▁an', '▁ex', 'ample', '</s>']

Old Tokenizer Encoding: [154750, 843, 335, 259, 17421, 1]
New Tokenizer Encoding: [35068, 692, 326, 259, 6683, 1]

Tokens Used (Old Tokenizer): ['▁Ceci', '▁est', '▁un', '▁', 'exemple', '</s>']
Tokens Used (New Tokenizer): ['▁Ceci', '▁est', '▁un', '▁', 'exemple', '</s>']


- When tokenizing the **English sentence**, the two tokenizers differ: `▁example` is split into `▁ex` and `ample`. This is because not all of the *English* tokens have been retained.
- When tokenizing the **French sentence**, the token sequences are identical, because all of the *French* tokens it needs have been retained. The tokens have been remapped to the new vocabulary, so the IDs differ, but each token still points to the same embedding vector.

Converting the IDs back to tokens also works correctly for both the *English* and the *French* sentence.

By *French* or *English* tokens we mean the tokens used to tokenize most sentences in the respective language.
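
The relation between old and new IDs can be made explicit. Because the list of kept IDs is sorted, the new ID of a token is simply its position in that list, and a token that was dropped has no new ID at all. The sketch below shows this on toy data; `translate_ids` is a hypothetical helper, not part of the notebook.

```python
from bisect import bisect_left


def translate_ids(old_ids, kept_ids):
    """Map old token IDs to new ones given a sorted list of kept IDs.
    Dropped tokens are mapped to None."""
    new_ids = []
    for old_id in old_ids:
        pos = bisect_left(kept_ids, old_id)
        found = pos < len(kept_ids) and kept_ids[pos] == old_id
        new_ids.append(pos if found else None)
    return new_ids


toy_kept_ids = [0, 1, 2, 5, 9]
print(translate_ids([5, 1, 7], toy_kept_ids))  # [3, 1, None]
```

A `None` in the result corresponds to the situation seen with the English sentence: the original token is gone, so the new tokenizer has to fall back on smaller pieces.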

## Conclusion
We have trimmed the vocabulary of a multilingual T5 model down to a monolingual one, reducing the number of parameters by about 71% (from roughly 300M to 88M). The trimmed model targets a single language, in this case French. A reduction of this size can be useful when the model has to be deployed on devices with limited resources.

To measure the difference in performance between the original and the trimmed model, we still need to fine-tune the trimmed model on a downstream task.