# Vocabulary Trimming of a GPT model

This notebook shows how to reduce the size of the vocabulary and of the embedding and decoding layers of a GPT model by removing from the vocabulary the tokens that are not part of a target language.

The goal is to transform a multilingual GPT model into a smaller, more efficient monolingual model while preserving the original model's accuracy for the target language. This notebook follows the idea from the paper [Load What You Need](https://aclanthology.org/2020.sustainlp-1.16.pdf). The paper applied the method to BERT models; here we apply it to GPT models.

A complete description of this and other methods for reducing model size is available [here](https://github.com/alpinf/smaller_language_models/blob/main/smaller_language_models.md).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alpinf/smaller_language_models/blob/main/notebooks/vocab_trim_mGPT.ipynb)

## Setup

This notebook can be run on a CPU.

In [53]:
import sys
if 'google.colab' not in sys.modules:
    print("Warning: The setup was only tested in Google Colab")

!python -m pip install pandas==2.2.2 torch==2.3.1 transformers==4.41.2 tokenizers==0.19.1 datasets==2.20.0 gdown tqdm --no-deps --quiet

In [54]:
import csv
import json
import os
from collections import Counter

import gdown
import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2Tokenizer

Define the directories where the downloaded data and the model will be saved.

In [55]:
data_dir = "./data"
out_dir = "./out"

os.makedirs(data_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)

In [56]:
def download_file_from_google_drive(file_id, destination):
    """
    Example usage:
    File link on Gdrive: https://drive.google.com/file/d/188r2cctPaqmuXnwer3KBpIHpftnE9tCb/view?usp=drive_link

    download_file_from_google_drive("188r2cctPaqmuXnwer3KBpIHpftnE9tCb", "./data/")
    """
    try:
        url = f"https://drive.google.com/uc?export=download&id={file_id}"
        gdown.download(url, output=destination, quiet=False)
        print(f"File downloaded from Google Drive to {destination}")
    except Exception as e:
        print(f"Error downloading file: {e}")
        raise

def get_device():
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    print(f"Using device {device}")
    return device


We run the procedure using [mGPT by *ai-forever*](https://huggingface.co/ai-forever/mGPT) as the original model and French as the target language. The model is a multilingual GPT model that supports 50 languages and has the same architecture as GPT-2.

In [57]:
device = get_device()

original_tokenizer = AutoTokenizer.from_pretrained("ai-forever/mGPT")
original_model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT").to(device)

Using device cpu


We start by loading the `merges.txt` and the `vocab.json` files of the original model.

The `merges.txt` file contains an ordered list of token pairs. The tokenizer starts by splitting the input text into single symbols, then repeatedly applies the merge rules: at each step it merges the adjacent pair that appears earliest in the merges file. It continues this process, building larger and larger tokens, until no more merges are possible.

Once the input text is tokenized, the `vocab.json` file is used to convert the tokens into integers. The file contains a dictionary with the tokens as keys and the corresponding integers as values.
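The merge procedure can be sketched on a toy example. This is a minimal illustration, not the actual implementation used by the `tokenizers` library: the function `bpe_tokenize` and the data in `toy_merges` and `toy_vocab` are hypothetical names and values introduced only for this sketch.

```python
def bpe_tokenize(word, merges):
    """Apply BPE merges to one word; earlier merges have higher priority."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single symbols
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # pick the adjacent pair that appears earliest in the merge list
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical toy data: three merge rules and a matching vocabulary
toy_merges = [("e", "s"), ("es", "t"), ("Ġ", "est")]
toy_vocab = {"Ġest": 0, "Ġ": 1, "e": 2, "s": 3, "t": 4, "es": 5, "est": 6}

tokens = bpe_tokenize("Ġest", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)  # ['Ġest'] [0]

# Without the last merge rule, the word is split into smaller tokens
print(bpe_tokenize("Ġest", toy_merges[:2]))  # ['Ġ', 'est']
```

The second call shows why removing a merge rule changes the tokenization: the text is still representable, only with more, shorter tokens.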

In [58]:
original_vocab = original_tokenizer.get_vocab()

tokenizer_json = original_tokenizer.backend_tokenizer.to_str()
tokenizer_data = json.loads(tokenizer_json)

original_merges = tokenizer_data['model']['merges']

## Selecting the vocabulary
To select the vocabulary we use the [Leipzig corpora collection](https://wortschatz.uni-leipzig.de/en/download/French), but any corpus in the target language can be used. The data we use is part of the **French _News_** section of 2023 with 1 million sentences.

The corpus can be downloaded from the link above. We keep a backup copy on our drive for reproducibility.

First, we tokenize the entire corpus and count how often each token occurs.


In [59]:
# Download the dataset manually from the link above or use a copy on our drive
data_path = os.path.join(data_dir, "fra_news_2023_1M-sentences.txt")
if not os.path.exists(data_path):
    download_file_from_google_drive("171RgrkXuWHfY-4BzHR8JEvYp9-D8yPqA", data_path)
    print("Dataset downloaded from drive")

In [60]:
df_fr = pd.read_csv(data_path, sep="\t", header=None, quoting=csv.QUOTE_NONE)
df_fr.columns = ["idx", "text"]

cnt = Counter()

total_tokens = 0
batch_size = 1000

for i in tqdm(range(0, len(df_fr), batch_size), desc="Tokenizing the corpus"):
    batch_texts = df_fr.text[i : i + batch_size].tolist()
    encoded_batch = original_tokenizer.batch_encode_plus(
        batch_texts,
        add_special_tokens=False,
        return_attention_mask=False,
    )
    for input_ids in encoded_batch["input_ids"]:
        cnt.update(input_ids)
        total_tokens += len(input_ids)

for top in 1_000, 10_000, 30_000:
    print(f"The {top} most common tokens cover {100 * sum(v for k, v in cnt.most_common(top)) / sum(cnt.values()):.0f}% of the dataset")

print(f"\nTotal number of tokens in the dataset: {total_tokens/1e6:.0f} million")
print(f"Number of unique tokens: {len(cnt)/1e3:.0f}k")
print(f"The seen tokens cover {100 * len(cnt) / original_tokenizer.vocab_size:.1f}% of the mGPT vocabulary.")

Tokenizing the corpus: 100%|██████████| 1000/1000 [00:32<00:00, 30.58it/s]

The 1000 most common tokens cover 69% of the dataset
The 10000 most common tokens cover 97% of the dataset
The 30000 most common tokens cover 100% of the dataset

Total number of tokens in the dataset: 31 million
Number of unique tokens: 36k
The seen tokens cover 36.4% of the mGPT vocabulary.





We can decide how many tokens to keep based on what percentage of the corpus we want to cover.

In [61]:
percentage_to_keep = 0.999

total_count = sum(cnt.values())  # computed once instead of on every iteration
cum_sum = 0
for i, (k, v) in enumerate(cnt.most_common()):
    cum_sum += v
    if cum_sum / total_count > percentage_to_keep:
        break
    num_tokens = i + 1  # number of tokens kept so far (coverage still within the target)

for top in 1_000, 5_000, 10_000, 20_000, 25_000, num_tokens:
    print(f"The {top} most common tokens cover {100 * sum(v for k, v in cnt.most_common(top)) / sum(cnt.values()):.1f}% of the dataset")

print(f"\nWe will keep {num_tokens} tokens to cover {percentage_to_keep * 100}% of the dataset")

The 1000 most common tokens cover 69.4% of the dataset
The 5000 most common tokens cover 89.9% of the dataset
The 10000 most common tokens cover 96.6% of the dataset
The 20000 most common tokens cover 99.5% of the dataset
The 25000 most common tokens cover 99.8% of the dataset
The 26686 most common tokens cover 99.9% of the dataset

We will keep 26686 tokens to cover 99.9% of the dataset
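
The cut-off rule used above can be checked on a small counter. This is a sketch with made-up frequencies; `tokens_for_coverage` and `toy_counts` are hypothetical names. It mirrors the loop in the cell above, which keeps the largest set of most common tokens whose cumulative coverage does not exceed the target.

```python
from collections import Counter


def tokens_for_coverage(counts, target):
    """Return how many of the most common tokens stay within `target` coverage."""
    total = sum(counts.values())
    covered = 0
    kept = 0
    for _, freq in counts.most_common():
        covered += freq
        if covered / total > target:
            break  # adding this token would exceed the target
        kept += 1
    return kept


# Made-up token frequencies, 100 occurrences in total
toy_counts = Counter({"a": 50, "b": 30, "c": 15, "d": 4, "e": 1})

print(tokens_for_coverage(toy_counts, 0.90))  # 2 ("a" and "b" cover 80%)
print(tokens_for_coverage(toy_counts, 0.96))  # 3 ("a", "b" and "c" cover 95%)
```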


In [62]:
print(f"Total number of merges: {len(original_merges)}")
print(f"Total number of tokens: {len(original_vocab)/1e3:.0f}k")
print(f"Examples of merges: {original_merges[5000:5010]}")

Total number of merges: 99737
Total number of tokens: 100k
Examples of merges: ['ĠC ent', 'Ġf a', 'ij d', 'o ung', 'os i', 'ĠJah ren', 'alif orn', 'iz ed', 'end ed', 'ep h']


If we count the number of lines in the `merges.txt` file, we notice that there are more tokens in the vocabulary than merges. If we apply all the merges and keep track of the generated tokens, we find that $256 + 7$ tokens are missing.

Of the missing tokens, 256 are the byte-level base symbols, one for each possible byte value (the code below refers to them as hexadecimal tokens). These tokens are never the product of a merge, so they do not appear in the merges file.

We also have $6$ special tokens and the character `Ġ`. Since a merge is written in the merges file as two tokens separated by a space (*"a b"* indicates that *b* has to be merged to *a*), the tokenizer uses the character `Ġ` to represent a space inside a token.

We need to keep all these characters in the new vocabulary.
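
The role of the 256 base symbols and of `Ġ` can be illustrated with the byte-to-character mapping introduced by GPT-2. The sketch below assumes that mGPT follows the same scheme, which is consistent with the tokens printed later in this notebook; `bytes_to_unicode` here is a re-implementation for illustration, and `to_byte_symbols` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 scheme)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}  # printable bytes map to themselves
    shift = 0
    for b in range(256):
        if b not in mapping:
            # non-printable bytes (including the space) are shifted past U+00FF
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


byte_map = bytes_to_unicode()


def to_byte_symbols(text):
    """Encode text as UTF-8 and replace every byte with its printable symbol."""
    return "".join(byte_map[b] for b in text.encode("utf-8"))


print(len(byte_map))                 # 256
print(to_byte_symbols(" "))          # Ġ
print(to_byte_symbols(" français"))  # ĠfranÃ§ais
```

Under this mapping, the letter `ç` occupies two bytes in UTF-8 and therefore appears as the two symbols `Ã§` inside a token.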

In [63]:
print(f"Number of tokens that are not in the merges or part of the hexadecimal tokens: {len(original_vocab) - len(original_merges) - 256}")
print(f"\nThere are {len(original_tokenizer.special_tokens_map)} special tokens:")

for k, v in original_tokenizer.special_tokens_map.items():
    print(f"{k}: {v}")

Number of tokens that are not in the merges or part of the hexadecimal tokens: 7

There are 6 special tokens:
bos_token: <s>
eos_token: <|endoftext|>
unk_token: <unk>
sep_token: </s>
pad_token: <pad>
mask_token: <mask>


In [64]:
# Select the most common tokens to keep
new_tokens = set(range(7 + 256))  # the first 7 IDs contain the special tokens, the next 256 are the byte-level base tokens

for i, (k, v) in enumerate(cnt.most_common(num_tokens)):
    if k not in new_tokens:
        new_tokens.add(k)
kept_ids = sorted(new_tokens)

inverted_vocab = {v: k for k, v in original_vocab.items()}
new_vocab = {inverted_vocab[i]: i for i in kept_ids}

# Make sure "Ġ", the character that represents a space, is in new_vocab: merges can use it even if it was not selected above
new_vocab["Ġ"] = original_vocab["Ġ"]

## Removing Redundant Merges
Since we are working with a BPE tokenizer, we cannot simply remove the tokens we do not need from the merge file.

**Example 1: Direct removal**

Suppose we want to remove the token 'abc' from the vocabulary.
If the merge rule ('ab', 'c') => 'abc' remains, the tokenizer will still produce the token 'abc', which we want to remove.
For this reason, we need to remove the merge that produces the unwanted token.

**Example 2: Chain removal**

Furthermore, if we want to remove the 'abc' token we must also remove:
1. All merges that use 'abc' directly, for example ('abc', 'd') => 'abcd'
2. If 'abcd' isn't in our vocabulary, we must also remove merges using it, like ('abcd', 'e') => 'abcde'
3. And so on recursively until we've removed all dependent merges

**Solution**

We must therefore remove all merge rules that either:
1. Would create a token we want to remove
2. Use a token we want to remove
3. Would create new tokens (not in vocab) that contain tokens we want to remove

The implementation uses two functions:
1. `trim_merges`:
   - Takes the original merges and both vocabularies
   - Finds the tokens that were removed (in `original_vocab` but not in `new_vocab`)
   - Builds lookup tables that map each merge to the token it produces and each token to the merges it appears in

2. `remove_token_and_dependents`:
   - Takes the list of merges, the new vocabulary, the set of tokens to remove, and the lookup tables
   - First finds all tokens that need to be removed (including dependent tokens)
   - Then removes any merges that would create or use these tokens
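
The chain removal can be illustrated on a toy merge list. This is a simplified sketch, not the notebook's implementation: `filter_merges`, `toy_merges` and `toy_kept_vocab` are hypothetical names, and the sketch assumes that a merge should survive only if both of its parts and its result are in the kept vocabulary.

```python
from typing import Dict, List, Tuple


def filter_merges(merges: List[str], kept_vocab: Dict[str, int]) -> List[Tuple[str, str]]:
    """Keep a merge only if its two parts and its result are all in kept_vocab."""
    kept = []
    for line in merges:
        left, right = line.split()
        if left in kept_vocab and right in kept_vocab and left + right in kept_vocab:
            kept.append((left, right))
    return kept


# Toy data: 'abc', 'abcd' and 'abcde' are not part of the kept vocabulary
toy_merges = ["a b", "ab c", "abc d", "abcd e", "x y"]
toy_kept_vocab = {
    token: idx
    for idx, token in enumerate(["a", "b", "c", "d", "e", "x", "y", "ab", "xy"])
}

print(filter_merges(toy_merges, toy_kept_vocab))  # [('a', 'b'), ('x', 'y')]
```

Dropping 'abc' removes the merge that creates it and, in turn, the merges that build 'abcd' and 'abcde' on top of it, while unrelated merges are untouched.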

In [65]:
from typing import List, Set, Tuple, Dict
from tqdm import tqdm

def remove_token_and_dependents(
    merges_list: List[Tuple[str, str]],
    vocab: Set[str],
    tokens_to_remove: Set[str],
    merge_results: Dict[Tuple[str, str], str],
    token_to_merges: Dict[str, List[Tuple[str, str]]]
) -> List[Tuple[str, str]]:
    """Remove merges that would create or use given tokens and their dependents"""
    # Find all tokens to remove using BFS instead of recursion
    to_process = set([t for t in tokens_to_remove if t not in vocab])
    found_tokens = tokens_to_remove.copy()

    with tqdm(total=len(to_process), desc="Processing tokens") as pbar:
        while to_process:
            current_token = to_process.pop()
            # Find merges that involve this token
            for merge in token_to_merges.get(current_token, []):
                result = merge_results[merge]
                if result not in vocab and result not in found_tokens:
                    found_tokens.add(result)
                    to_process.add(result)
            pbar.update(1)

    # Single pass filter instead of multiple conditions
    return [merge for merge in tqdm(merges_list, desc="Filtering merges")
            if merge_results[merge] not in found_tokens
            and merge[0] not in found_tokens
            and merge[1] not in found_tokens]

def trim_merges(
    original_merges: List[str],
    original_vocab: Set[str],
    new_vocab: Set[str]
) -> List[Tuple[str, str]]:
    """Remove all merges that could create or use tokens removed from vocabulary."""
    # Pre-compute all data structures once
    merges = [tuple(merge.split()) for merge in tqdm(original_merges, desc="Parsing merges")]

    # Tokens that are in the original vocabulary but not in the new one.
    # The sort order has no effect: the list is converted back to a set before use.
    tokens_to_remove = sorted(original_vocab - set(new_vocab), key=len, reverse=True)

    # Build lookup tables once
    merge_results = {merge: merge[0] + merge[1] for merge in tqdm(merges, desc="Building merge results")}
    token_to_merges: Dict[str, List[Tuple[str, str]]] = {}

    # Single pass to build token_to_merges
    for merge in tqdm(merges, desc="Building token index"):
        result = merge_results[merge]
        for token in (result, merge[0], merge[1]):
            if token in token_to_merges:
                token_to_merges[token].append(merge)
            else:
                token_to_merges[token] = [merge]

    return remove_token_and_dependents(
        merges,
        new_vocab,
        set(tokens_to_remove),  # Convert sorted list back to set for efficient lookups
        merge_results,
        token_to_merges
    )

In [66]:
updated_merges = trim_merges(
    original_merges=original_merges,
    original_vocab=set(original_vocab),
    new_vocab=new_vocab
)
print(f"\nMerges reduced from {len(original_merges)} to {len(updated_merges)}")

Parsing merges: 100%|██████████| 99737/99737 [00:00<00:00, 1625670.54it/s]
Building merge results: 100%|██████████| 99737/99737 [00:00<00:00, 2160067.43it/s]
Building token index: 100%|██████████| 99737/99737 [00:00<00:00, 210483.17it/s]
Processing tokens: 100%|██████████| 73178/73178 [00:00<00:00, 293581.90it/s]
Filtering merges: 100%|██████████| 99737/99737 [00:00<00:00, 1411117.85it/s]


Merges reduced from 99737 to 25749





Finally, we can save the new `vocab.json` file and the new `merges.txt` file and create the new tokenizer.

In [67]:
with open(f"{out_dir}/new_vocab.json", "w", encoding="utf-8") as f:
    json.dump(new_vocab, f, ensure_ascii=False, indent=2)

with open(f"{out_dir}/new_merges.txt", "w", encoding="utf-8") as f:
    for pair in updated_merges:
        f.write(f"{pair[0]} {pair[1]}\n")

new_tokenizer = GPT2Tokenizer(
    f"{out_dir}/new_vocab.json",
    f"{out_dir}/new_merges.txt")

## Compare original and new tokenizer

In [68]:
english_text = "This an English example"

original_tokens = original_tokenizer.tokenize(english_text)
new_tokens = new_tokenizer.tokenize(english_text)

print(f"{'Tokens using the original tokenizer:'.ljust(45)} {original_tokens}")
print(f"{'Tokens using the new tokenizer:'.ljust(45)} {new_tokens}")
print()

input_ids_original = original_tokenizer.convert_tokens_to_ids(original_tokens)
input_ids_new = new_tokenizer.convert_tokens_to_ids(new_tokens)

original_decoded_output = original_tokenizer.decode(input_ids_original)
new_decoded_output = new_tokenizer.decode(input_ids_new)
print(f"{'Decoded text using the original tokenizer:'.ljust(45)} {original_decoded_output}")
print(f"{'Decoded text using the new tokenizer:'.ljust(45)} {new_decoded_output}")
print()

french_text = "Ceci est un exemple en français"

original_tokens = original_tokenizer.tokenize(french_text)
new_tokens = new_tokenizer.tokenize(french_text)

print(f"{'Tokens using the original tokenizer:'.ljust(45)} {original_tokens}")
print(f"{'Tokens using the new tokenizer:'.ljust(45)} {new_tokens}")
print()

input_ids_original = original_tokenizer.convert_tokens_to_ids(original_tokens)
input_ids_new = new_tokenizer.convert_tokens_to_ids(new_tokens)

original_decoded_output = original_tokenizer.decode(input_ids_original)
new_decoded_output = new_tokenizer.decode(input_ids_new)
print(f"{'Decoded text using the original tokenizer:'.ljust(45)} {original_decoded_output}")
print(f"{'Decoded text using the new tokenizer:'.ljust(45)} {new_decoded_output}")


Tokens using the original tokenizer:          ['This', 'Ġan', 'ĠEnglish', 'Ġexample']
Tokens using the new tokenizer:               ['Th', 'is', 'Ġan', 'ĠEng', 'l', 'ish', 'Ġex', 'amp', 'le']

Decoded text using the original tokenizer:    This an English example
Decoded text using the new tokenizer:         This an English example

Tokens using the original tokenizer:          ['C', 'eci', 'Ġest', 'Ġun', 'Ġexemple', 'Ġen', 'ĠfranÃ§ais']
Tokens using the new tokenizer:               ['C', 'eci', 'Ġest', 'Ġun', 'Ġexemple', 'Ġen', 'ĠfranÃ§ais']

Decoded text using the original tokenizer:    Ceci est un exemple en français
Decoded text using the new tokenizer:         Ceci est un exemple en français


- When we tokenize the **English sentence**, the two tokenizers produce different tokens. This is because not all of the *English* tokens have been retained.
- On the other hand, when we tokenize the **French sentence**, the tokens are the same. This is because all of the *French* tokens needed for this sentence have been retained. Once the tokens are remapped to the new vocabulary (in the next section), their IDs change, but each token keeps its original embedding.

We can also notice that the decoding works correctly, both for the *English* and *French* tokens.

When we talk about *French* or *English* tokens, we refer to the tokens that are used to tokenize most of the sentences in the respective languages.

## Updating the embedding layer
To reduce the size of the original model, we need to shrink its embedding layer and its head layer. The head layer is the last layer of the model, which is used to predict the next token. For both layers, we keep only the rows that correspond to the retained token IDs.
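
The row selection and the ID remapping can be sketched with plain Python lists standing in for the weight matrices. All names and values below (`toy_kept_ids`, `toy_old_embeddings`, and so on) are hypothetical and serve only as an illustration.

```python
# A toy "embedding matrix" with 10 tokens and 2 dimensions per token
toy_old_embeddings = [[float(i), float(i) + 0.5] for i in range(10)]

# Sorted IDs of the tokens we decide to keep
toy_kept_ids = [0, 1, 2, 5, 9]

# The new matrix contains only the kept rows, in the same order
toy_new_embeddings = [toy_old_embeddings[old_id] for old_id in toy_kept_ids]

# Each kept token receives a new, contiguous ID
toy_old_to_new = {old_id: new_id for new_id, old_id in enumerate(toy_kept_ids)}

print(len(toy_new_embeddings))                            # 5
print(toy_old_to_new)                                     # {0: 0, 1: 1, 2: 2, 5: 3, 9: 4}
print(toy_new_embeddings[toy_old_to_new[5]] == toy_old_embeddings[5])  # True
```

Because the rows are copied unchanged, a kept token has the same vector before and after trimming; only its index in the matrix changes.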

In [69]:
new_model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT").to(device)

new_size = len(kept_ids)

embedding_dim = original_model.transformer.wte.weight.data.shape[1]

new_emb = torch.nn.Embedding(new_size, embedding_dim)
new_head = torch.nn.Linear(embedding_dim, new_size)
# torch.nn.Linear creates a randomly initialised bias by default, while the original
# head has no bias; zero it so that it does not shift the logits
new_head.bias.data.zero_()

for new_id, old_id in enumerate(kept_ids):
    # copy the original weights of the embeddings and the head
    new_emb.weight.data[new_id] = original_model.transformer.wte.weight.data[old_id]
    new_head.weight.data[new_id] = original_model.lm_head.weight.data[old_id]

# add the new embeddings and head to the model
new_model.transformer.wte = new_emb
new_model.lm_head = new_head

# update the model configuration
new_model.config.vocab_size = new_size
new_model.config._name_or_path = f"{out_dir}/frGPT"

new_vocab_final = {}
for new_token, old_token in enumerate(kept_ids):
    new_vocab_final[inverted_vocab[old_token]] = new_token

with open(f"{out_dir}/new_vocab.json", "w", encoding="utf-8") as f:
    json.dump(new_vocab_final, f, ensure_ascii=False, indent=2)

# load the new tokenizer
new_tokenizer = GPT2Tokenizer(
    f"{out_dir}/new_vocab.json",
    f"{out_dir}/new_merges.txt")

We can save the new model and use it for inference.

In [70]:
# save the new model and the tokenizer
new_model.save_pretrained(f"{out_dir}/frGPT")
new_tokenizer.save_pretrained(f"{out_dir}/frGPT")

('./out/frGPT/tokenizer_config.json',
 './out/frGPT/special_tokens_map.json',
 './out/frGPT/vocab.json',
 './out/frGPT/merges.txt',
 './out/frGPT/added_tokens.json')

Finally we can use the new model to generate text in the target language.

In [71]:
text = "Il était une fois, dans un pays lointain"

new_model.to(device)

input_ids = new_tokenizer.encode(text, return_tensors="pt").to(device)
out = new_model.generate(
        input_ids,
        min_length=100,
        max_length=100,
        eos_token_id=5,
        top_k=0,
        top_p=0.95,
        no_repeat_ngram_size=4,
        do_sample=True
)
generated_text = new_tokenizer.decode(out[0])
print(generated_text)

Il était une fois, dans un pays lointain, un étrange pays, dans un monde éloigné, à l’autre bout de la terre, avec des rois gouvernés par des étrangers. Ils étaient tous assez grands pour nous faire la guerre de clan, nous faire massacrer, se battre contre nous. Mais ils étaient tous beaux, et ils étaient tous loin de nous. Ils ne sont pas assez grands pour être propres, ils ne sont pas intelligents, et ils sont tout sauf


When we print the structure of the model we can see that the new model has a smaller embedding layer and a smaller head layer.

In [72]:
print(new_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(26822, 2048)
    (wpe): Embedding(2048, 2048)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2048, out_features=26822, bias=True)
)


To understand how much we have reduced the size of the model we can compare the number of parameters of the original model with the number of parameters of the new model.

In [73]:
print(f"Number of parameters in the new model is {new_model.num_parameters()/ 1e6:.0f} million")
print(f"Number of parameters in the original model is {original_model.num_parameters()/ 1e6:.0f} million")

print(f"The percentage of parameters that are in the embedding layer is {100 *  new_model.transformer.wte.weight.data.numel() / new_model.num_parameters():.1f}%")
print(f"Percentage of parameters in the head layer: {100 * new_model.lm_head.weight.data.numel() / new_model.num_parameters():.1f}%")
print(f"Percentage of tokens kept in the new model: {100 * len(kept_ids) / original_tokenizer.vocab_size:.1f}%")
print(f"Percentage of parameters of the new model in comparison to the original model: {100 * new_model.num_parameters() / original_model.num_parameters():.1f}%")

Number of parameters in the new model is 1323 million
Number of parameters in the original model is 1418 million
The percentage of parameters that are in the embedding layer is 4.2%
Percentage of parameters in the head layer: 4.2%
Percentage of tokens kept in the new model: 26.8%
Percentage of parameters of the new model in comparison to the original model: 93.3%
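
The share of the embedding layer can be cross-checked by hand from the layer shape printed earlier. The sketch below uses the rounded total of 1323 million parameters reported above, so the result is approximate; the variable names are introduced only for this check.

```python
new_vocab_size = 26822   # rows of the new embedding layer, from the printed model
hidden_size = 2048       # embedding dimension, from the printed model
total_params = 1323e6    # rounded total reported above (approximation)

embedding_params = new_vocab_size * hidden_size
print(embedding_params)                                  # 54931456
print(f"{100 * embedding_params / total_params:.1f}%")   # 4.2%
```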


## Conclusion
We have successfully reduced the size of the vocabulary of the tokenizer, keeping only 27% of the initial 100k tokens. We have also reduced the size of the embedding layer and the head layer of the model by about 73% each, with an overall reduction in the number of parameters of about 7%.

The overall reduction is modest because the embedding and head layers hold only a small share of the parameters (about 4% each in the new model), so this is not necessarily the preferred approach for decoder-only models if the main objective is size reduction. It may still be a good starting point to further reduce the model's size, while using a more focused vocabulary.