# Vocabulary Trimming of a BERT model

This notebook demonstrates reducing the size of a multilingual transformer model by **removing from the vocabulary the tokens that are not part of a target language corpus** (French in this case).

The goal is to transform the multilingual model into a smaller, more efficient monolingual model while preserving the original model's accuracy on the target language. The same process can be applied to any corpus, even multilingual, e.g. for domain adaptation.

The notebook is based on the methodology described in [_Load What You Need_](https://aclanthology.org/2020.sustainlp-1.16.pdf) and its associated [GitHub repository](https://github.com/Geotrend-research/smaller-transformers), and presents a more detailed description of the process.

A complete description of this and other methods for reducing model size is available [here](https://github.com/alpinf/smaller_language_models/blob/main/Smaller_LLMs.md).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alpinf/smaller_language_models/blob/main/notebooks/vocab_trim_mBERT.ipynb)


## Setup


In [1]:
import sys
if 'google.colab' not in sys.modules:
    print("Warning: The setup was only tested in Google Colab")

!python -m pip install pandas==2.2.2 torch==2.3.1 transformers==4.41.2 datasets==2.20.0 gdown tqdm --quiet --no-deps

In [2]:
import csv
import json
import os
from collections import Counter

import gdown
import pandas as pd
from torch import nn
from tqdm.auto import tqdm
from transformers import BertModel, BertTokenizer

Define the directories where the downloaded data and the model will be saved:

In [3]:
data_dir = "./data"
out_dir = "./out"

os.makedirs(data_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)

In [4]:
def download_file_from_google_drive(file_id, destination):
    """
    Download a file from Google Drive, given its file ID, and save it to `destination`.

    Example usage:
    File link on Gdrive: https://drive.google.com/file/d/188r2cctPaqmuXnwer3KBpIHpftnE9tCb/view?usp=drive_link

    download_file_from_google_drive("188r2cctPaqmuXnwer3KBpIHpftnE9tCb", "./data/")
    """
    try:
        url = f"https://drive.google.com/uc?export=download&id={file_id}"
        gdown.download(url, output=destination, quiet=False)
        print(f"File downloaded from Google Drive to {destination}")
    except Exception as e:
        print(f"Error downloading file: {e}")
        raise  # re-raise with the original traceback

## Importing the model


In [5]:
bert_tokenizer = BertTokenizer.from_pretrained(
    "bert-base-multilingual-cased"
)  # import the original tokenizer
bert_vocab = list(bert_tokenizer.vocab.keys())

print(f"Original model's vocabulary size: {len(bert_vocab)}")  # 119_547

Original model's vocabulary size: 119547


## Select vocabulary

To select the vocabulary we use the [Leipzig corpora collection](https://wortschatz.uni-leipzig.de/en/download/French), but any corpus in the target language can be used. The data we use is part of the **French _News_** section of 2023 with 1 million sentences.

The corpus can be downloaded from the link above. We keep a backup copy on our drive for reproducibility.

First, we tokenize the entire corpus and count how often each token occurs.


In [6]:
# Download the dataset manually from the link above or use a copy on our drive (134MB)
corpus_path = os.path.join(data_dir, "fra_news_2023_1M-sentences.txt")
if not os.path.exists(corpus_path):
    download_file_from_google_drive("171RgrkXuWHfY-4BzHR8JEvYp9-D8yPqA", corpus_path)
    print("Dataset downloaded from drive")

df_new = pd.read_csv(corpus_path, sep="\t", header=None, quoting=csv.QUOTE_NONE)
df_new.columns = ["idx", "text"]

cnt = Counter()

total_tokens = 0
for text in tqdm(df_new.text, desc="Counting tokens"):
    tokens = bert_tokenizer.encode(text)
    cnt.update(tokens)
    total_tokens += len(tokens)

In [7]:
print(f"Tokens seen in the dataset: {total_tokens / 1_000_000:.0f}M")
print(f"Unique tokens in the dataset: {len(cnt)}")
print(
    f"These tokens cover {100 * len(cnt) / bert_tokenizer.vocab_size:.1f}% of the mBERT vocabulary."
)

Tokens seen in the dataset: 33M
Unique tokens in the dataset: 33722
These tokens cover 28.2% of the mBERT vocabulary.


We can decide how many tokens to keep based on the percentage of the corpus we want to cover.

In [8]:
percentage_to_keep = 0.999

total_count = sum(cnt.values())  # computed once instead of on every iteration
cum_sum = 0
num_tokens = 0
for i, (token_id, v) in enumerate(cnt.most_common()):
    cum_sum += v
    if cum_sum / total_count > percentage_to_keep:
        break
    num_tokens = i + 1  # the number of tokens to keep

for top in 1_000, 10_000, 20_000, num_tokens:
    print(
        f"The {top} most common tokens cover {100 * sum(v for k, v in cnt.most_common(top)) / sum(cnt.values()):.1f}% of the dataset"
    )

print(
    f"\nWe will keep {num_tokens} tokens to cover {percentage_to_keep * 100}% of the corpus"
)

The 1000 most common tokens cover 73.0% of the dataset
The 10000 most common tokens cover 97.8% of the dataset
The 20000 most common tokens cover 99.8% of the dataset
The 23590 most common tokens cover 99.9% of the dataset

We will keep 23590 tokens to cover 99.9% of the corpus
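
The coverage rule can be illustrated on a toy frequency table. The sketch below is a minimal, self-contained version with hypothetical counts; the helper name `tokens_for_coverage` is not part of the notebook. It returns the smallest number of tokens that reaches the requested coverage, which may be one more than the loop above returns, since that loop stops just before the threshold is exceeded.

```python
from collections import Counter


def tokens_for_coverage(counts: Counter, coverage: float) -> int:
    """Return the smallest k such that the k most frequent items reach `coverage`."""
    total = sum(counts.values())
    running = 0
    for k, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= coverage:
            return k
    return len(counts)


# Hypothetical token frequencies (100 occurrences in total)
toy_counts = Counter({"a": 50, "b": 30, "c": 15, "d": 4, "e": 1})

print(tokens_for_coverage(toy_counts, 0.90))  # 3 tokens cover 95% of the toy corpus
print(tokens_for_coverage(toy_counts, 0.97))  # 4 tokens cover 99% of the toy corpus
```

On a corpus with tens of millions of tokens, the one-token difference between the two stopping rules is negligible.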


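The first 103 tokens of the vocabulary (ids 0 to 102) are the 4 special tokens `[PAD]`, `[UNK]`, `[CLS]` and `[SEP]`, plus 99 _unused_ tokens (`[unused1]`, `[unused2]`, ..., `[unused99]`) that can be repurposed, for example to introduce domain-specific words during fine-tuning.

We want to keep all of them, so the new vocabulary starts with these 103 tokens. Note that `[MASK]` has id 103 in this vocabulary, just outside that range, so it is kept only if it appears among the selected corpus tokens; extend the range to 104 if the trimmed model is meant for masked language modelling.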


In [9]:
new_token_ids = set(
    range(103)
)  # the first 4 + 99 tokens are reserved for special tokens
for token_id, _ in cnt.most_common(num_tokens):
    if token_id >= 103:
        new_token_ids.add(token_id)

new_token_ids = sorted(new_token_ids)
print(f"Total number of tokens including special tokens: {len(new_token_ids)}")

Total number of tokens including special tokens: 23690
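
The total of 23690 is 103 + 23590 - 3, which means that three of the 23590 most frequent ids already fall in the reserved range and are not counted twice. The toy sketch below shows the same effect with a reserved range of 5 ids; the counts and the helper `select_ids` are hypothetical and only illustrate the set union.

```python
from collections import Counter

NUM_RESERVED = 5  # toy stand-in for the 103 reserved ids in the notebook

# Hypothetical token-id counts; id 2 lies inside the reserved range
toy_counts = Counter({2: 40, 9: 25, 7: 20, 12: 10, 6: 5})


def select_ids(counts, num_frequent, num_reserved):
    """Union of the reserved ids and the `num_frequent` most common ids, sorted."""
    kept = set(range(num_reserved))
    kept.update(token_id for token_id, _ in counts.most_common(num_frequent))
    return sorted(kept)


kept_ids = select_ids(toy_counts, num_frequent=3, num_reserved=NUM_RESERVED)
print(kept_ids)       # [0, 1, 2, 3, 4, 7, 9]
print(len(kept_ids))  # 7 = 5 reserved + 3 frequent - 1 overlap
```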


Let's look at some of the tokens of the new vocabulary:


In [10]:
selected_tokens = bert_tokenizer.convert_ids_to_tokens(new_token_ids)
n = 5

print(f"First {n} tokens:")
for i in range(n):
    print(f"{i}: {selected_tokens[i]}")

print(f"\nLast {n} tokens:")
for i in range(len(selected_tokens) - n, len(selected_tokens)):
    print(f"{i}: {selected_tokens[i]}")

First 5 tokens:
0: [PAD]
1: [unused1]
2: [unused2]
3: [unused3]
4: [unused4]

Last 5 tokens:
23685: ##œ
23686: ##Š
23687: ##⁄
23688: ##€
23689: ##™


## Resize the embeddings and create a new model

We now shrink the embedding matrix to match the new vocabulary by dropping the embeddings of the tokens that are no longer part of it.
The function `update_model` below builds a new embedding matrix that keeps only the rows whose tokens appear in the new vocabulary.

This creates a smaller matrix where the number of rows is the size of the new vocabulary. The number of columns is the size of the embedding space, whose dimension remains unchanged.
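
To make the row selection concrete, here is a toy sketch in which plain Python lists stand in for the embedding tensor. The vocabulary, the vectors and the `old_to_new` mapping are all hypothetical. The point is that each kept token receives a new, smaller id while its embedding row is copied unchanged.

```python
# Toy "embedding matrix": one row per old token id (lists stand in for tensors)
old_vocab = ["[PAD]", "[UNK]", "le", "the", "chat", "cat"]
old_rows = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.5, 0.5]]

kept_tokens = {"[PAD]", "[UNK]", "le", "chat"}  # hypothetical trimmed vocabulary

new_vocab, new_rows, old_to_new = [], [], {}
for old_id, token in enumerate(old_vocab):
    if token in kept_tokens:
        old_to_new[old_id] = len(new_vocab)  # new id = position in the trimmed vocabulary
        new_vocab.append(token)
        new_rows.append(old_rows[old_id])    # the row itself is copied as-is

print(new_vocab)   # ['[PAD]', '[UNK]', 'le', 'chat']
print(old_to_new)  # {0: 0, 1: 1, 2: 2, 4: 3}
```

Because the ids change, the trimmed model must always be used with the trimmed `vocab.txt`, never with the original tokenizer.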

In [11]:
def update_model(model, old_vocab, new_vocab, new_model_name="new_model"):

    new_model_path = os.path.join(out_dir, new_model_name)

    # Get old embeddings from model
    old_embeddings = model.get_input_embeddings()
    old_num_tokens, old_embedding_dim = old_embeddings.weight.size()

    if old_num_tokens != len(old_vocab):
        print("len(old_vocab) != number of rows in the embedding matrix, not modifying the model")
        return old_embeddings

    if new_vocab is None:
        print("No new vocab provided, not modifying the model")
        return old_embeddings
    new_num_tokens = len(new_vocab)

    # Build new embeddings
    new_embeddings = nn.Embedding(new_num_tokens, old_embedding_dim)
    new_embeddings.to(old_embeddings.weight.device)

    # Copy weights
    new_vocab_set = set(new_vocab)  # set lookup is O(1); scanning a list per token is very slow
    i, j = 0, 0
    vocab = []
    for token in old_vocab:
        if token in new_vocab_set:
            vocab.append(token)
            new_embeddings.weight.data[i, :] = old_embeddings.weight.data[j, :]
            i += 1
        j += 1

    model.set_input_embeddings(new_embeddings)

    # Update base model and current model config
    model.config.vocab_size = new_num_tokens
    model.vocab_size = new_num_tokens

    # Tie weights
    model.tie_weights()

    # Save the new model
    print("Updated model:", new_model_name)
    print("Number of parameters: ", model.num_parameters())
    print("Vocabulary size:", len(vocab))
    model.save_pretrained(new_model_path)

    # Save vocab
    with open(os.path.join(new_model_path, "vocab.txt"), "w", encoding="utf-8") as w:
        for token in vocab:
            w.write(token + "\n")

    # Save tokenizer config
    with open(os.path.join(new_model_path, "tokenizer_config.json"), "w") as w:
        json.dump({"do_lower_case": False, "model_max_length": 512}, w)

Create the new model:


In [12]:
new_model_name = "bert-base-fr-cased"
update_model(
    BertModel.from_pretrained("bert-base-multilingual-cased"),
    bert_vocab,
    selected_tokens,
    new_model_name=new_model_name,
)

Updated model: bert-base-fr-cased
Number of parameters:  104235264
Vocabulary size: 23690


## Compare the original & trimmed models


In [13]:
# original model
bert_model = BertModel.from_pretrained("bert-base-multilingual-cased")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
print(
    f"Original model has\t{bert_model.num_parameters() / 1e6:.0f}M parameters"
)  # 177_853_440
print(
    f"Original model has\t{bert_model.config.vocab_size / 1e3:.0f}k tokens in the vocabulary"
)  # 119_547

# new model
new_tokenizer = BertTokenizer.from_pretrained(os.path.join(out_dir, new_model_name))
new_model = BertModel.from_pretrained(os.path.join(out_dir, new_model_name))
print(
    f"New model has\t\t{new_model.num_parameters() / 1e6:.0f}M parameters"
)  # 104_235_264
print(
    f"New model has\t\t{new_model.config.vocab_size / 1e3:.0f}k tokens in the vocabulary"
)
print(
    f"\nNew model's number of parameters is {new_model.num_parameters() / bert_model.num_parameters() * 100:.0f}% of the original model's"
)

Original model has	178M parameters
Original model has	120k tokens in the vocabulary
New model has		104M parameters
New model has		24k tokens in the vocabulary

New model's number of parameters is 59% of the original model's
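
These parameter counts can be cross-checked by hand, since trimming only removes rows from the word-embedding matrix. The sketch below assumes the BERT-base hidden size of 768 (not read from the model config) and takes the new model's parameter count from the `update_model` output.

```python
HIDDEN_SIZE = 768  # embedding dimension of BERT-base (assumed, not read from the config)

old_vocab_size = 119_547
new_vocab_size = 23_690
new_params = 104_235_264  # as printed by update_model

# Each removed token removes exactly one row of HIDDEN_SIZE weights
removed_params = (old_vocab_size - new_vocab_size) * HIDDEN_SIZE
old_params = new_params + removed_params

print(f"{removed_params:,} parameters removed")                             # 73,618,176
print(f"Original: {old_params / 1e6:.0f}M, new: {new_params / 1e6:.0f}M")  # 178M, 104M
print(f"Reduction: {100 * removed_params / old_params:.0f}%")              # 41%
```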


Let's tokenize the same sentence with the original and the new model


In [14]:
text = "This is an example of a English sentence."
print(bert_tokenizer.tokenize(text))
print(new_tokenizer.tokenize(text))
print()

text = "Ceci est un exemple de phrase en français."
print(bert_tokenizer.tokenize(text))
print(new_tokenizer.tokenize(text))

['This', 'is', 'an', 'example', 'of', 'a', 'Engl', '##ish', 'sentence', '.']
['This', 'is', 'an', 'ex', '##amp', '##le', 'of', 'a', 'Engl', '##ish', 'sentence', '.']

['Ceci', 'est', 'un', 'exemple', 'de', 'phrase', 'en', 'français', '.']
['Ceci', 'est', 'un', 'exemple', 'de', 'phrase', 'en', 'français', '.']


We can see that when tokenizing the **English sentence**, the two tokenizers differ: `example` is no longer in the vocabulary, so the new tokenizer splits it into smaller pieces, since not all of the _English_ tokens have been retained.

On the other hand, when we tokenize the **French sentence**, the output is identical, as all the tokens it needs have been retained.
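
The split of `example` into `ex`, `##amp`, `##le` follows from the greedy longest-match-first strategy of WordPiece: once the full word is missing from the vocabulary, the tokenizer falls back to the longest prefix it can find and continues from there. The sketch below is a simplified single-word version with a tiny hypothetical vocabulary. It omits the pre-tokenization and length limits of the real `BertTokenizer`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabularies: the trimmed one lacks the English word
full_vocab = {"example", "ex", "##amp", "##le", "exemple"}
trimmed_vocab = full_vocab - {"example"}

print(wordpiece("example", full_vocab))     # ['example']
print(wordpiece("example", trimmed_vocab))  # ['ex', '##amp', '##le']
print(wordpiece("exemple", trimmed_vocab))  # ['exemple']
```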

## Fine-tuning on sentiment analysis & results

We fine-tuned both models on the French part of [tyqiangz/multilingual-sentiments](https://huggingface.co/datasets/tyqiangz/multilingual-sentiments), a multilingual sentiment analysis dataset.

Hyper-parameters:

- `learning_rate=2e-5`
- `per_device_train_batch_size=64`
- `num_train_epochs=10`
- `weight_decay=0.01`
- `eval_strategy="steps"`

We applied early stopping on the validation accuracy, with a patience of 3.
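
As a reminder of what a patience of 3 means, the sketch below stops as soon as three consecutive evaluations fail to improve on the best validation accuracy seen so far. It is a plain-Python illustration with made-up accuracy values and a hypothetical helper name, not the training code used here (see the linked fine-tuning notebook for the actual setup).

```python
def early_stopping_step(accuracies, patience=3):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("-inf")
    bad_evals = 0
    for step, acc in enumerate(accuracies):
        if acc > best:
            best, bad_evals = acc, 0  # improvement: reset the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None


# Made-up validation accuracies, one per evaluation
print(early_stopping_step([0.60, 0.65, 0.68, 0.67, 0.68, 0.66]))  # 5
print(early_stopping_step([0.60, 0.70]))                          # None
```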

The fine-tuning procedure is shown in the following notebook: [mBERT classification fine-tuning](https://colab.research.google.com/github/alpinf/smaller_llms/blob/main/notebooks/classification_fine_tuning_mBERT.ipynb).

The accuracy scores we got were:

- **mBERT: 69.4%**
- **frBERT: 69.9%**

We see no drop in accuracy with the vocabulary-trimmed model (69.9% vs. 69.4%), even though it has **41%** fewer parameters than the original mBERT model.