# Lab 3: Training Tokenizers

![](../figs/deep_nlp/lab/tokenize.png)

## Prepare the environment

In [1]:
# %pip install --pre "ekorpkit[tokenize]"



In [1]:
%config InlineBackend.figure_format='retina'
%load_ext autotime

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
project_dir = eKonf.set_workspace(workspace="/content/drive/MyDrive/workspace/", project="ekorpkit-book")
print("project_dir:", project_dir)

INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace/
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env


version: 0.1.40.post0.dev16
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 1.56 s (started: 2022-11-12 06:30:45 +00:00)


In [18]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
time: 20.8 ms (started: 2022-11-12 07:16:33 +00:00)


### Load the saved corpora

In [2]:
data = eKonf.load_data("enko_filtered.parquet", project_dir + "/data")

INFO:ekorpkit.io.file:Processing [1] files from ['enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet


time: 3.54 s (started: 2022-11-12 06:30:47 +00:00)


Convert the pandas DataFrame to a Hugging Face dataset:

In [3]:
from datasets import Dataset

text_column = "text"
raw_dataset = Dataset.from_pandas(data[[text_column]])
raw_dataset

Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 603719
})

time: 1.3 s (started: 2022-11-12 06:30:51 +00:00)


In [4]:
text_en = raw_dataset[2][text_column].split("\n")
text_ko = raw_dataset[5555][text_column].split("\n")

print(text_en)
print(text_ko)

'Eylex Films Pvt is a chain of multiplex and single screen theatres. Eylex films pioneered the multiplex model in Jharkhand and Bihar. Its first multiplex was established in Ranchi in 2007 – the first Multiplex in the city of Ranchi and States like Jharkhand, Bihar, West Bengal, Odisha and Assam.\nAt present the chain operates 24 screens in cities like Asansol, Deoghar, Ranchi, Jamshedpur, Muzaffarpur, Sambalpur, Jharsuguda and Silchar. The chain is aggressively looking forward to expand its footprint across India.\nChain of theatres.\nThough the initial plan was to build one multiplex Eylex, the positive response from the cine viewers led to the launch of Eylex in Ranchi, Jamshedpur, Sambalpur, Deoghar, Asansol and Silchar, and newly opened DRB Palace Motijheel, Muzaffarpur.\nFilm Production.\nEylex Films ventured into film production in 2016. The first film produced by them is Mandobasar Galpo, a Bengali film, scheduled to release on 24 March 2017. The film has been directed by Tatha

time: 2.17 ms (started: 2022-11-12 06:30:52 +00:00)


### Shuffle the dataset

In [7]:
# shuffle the dataset

raw_dataset = raw_dataset.shuffle(seed=42)

time: 146 ms (started: 2022-11-12 06:40:29 +00:00)


## Train tokenizers with Hugging Face Tokenizers

[Hugging Face's Tokenizers](https://huggingface.co/docs/tokenizers/quicktour) library implements several subword tokenization models, including BPE, WordPiece, Unigram, and WordLevel, along with pre-tokenizers such as Whitespace and ByteLevel. We will train the BPE and Unigram models in this lab.
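To make the BPE idea concrete before training at scale, here is a toy, pure-Python sketch of how merge rules can be learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. The function name `learn_bpe_merges`, the alphabetical tie-breaking, and the tiny corpus are illustrative assumptions; this is not how the Tokenizers library is implemented internally.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken alphabetically
        # (an assumption made here only to keep the result deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = ["low", "low", "lower", "lowest", "new", "newer"]
merges, vocab = learn_bpe_merges(corpus, num_merges=3)
print(merges)
```

Each learned merge becomes one vocabulary entry, so the number of merges largely determines the final vocabulary size.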

### Import the libraries and prepare functions

In [10]:
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel
from tokenizers.trainers import BpeTrainer, UnigramTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
from ekorpkit.tokenizers.spm import batch_chunks


unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>", "[MASK]"]  # special tokens

def prepare_tokenizer_trainer(algo):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if algo == 'BPE':
        tokenizer = Tokenizer(BPE(unk_token = unk_token))
        trainer = BpeTrainer(special_tokens = spl_tokens)
    elif algo == 'UNI':
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token= unk_token, special_tokens = spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token = unk_token))
        trainer = WordLevelTrainer(special_tokens = spl_tokens)

         
    normalizer = normalizers.Sequence([NFD(), StripAccents()])
    tokenizer.normalizer = normalizer
    tokenizer.pre_tokenizer = Whitespace()

    return tokenizer, trainer

time: 2.35 ms (started: 2022-11-12 06:51:08 +00:00)


In [12]:
def train_tokenizer(algo="WLV"):
    """
    Trains the tokenizer on text batches from `raw_dataset`, saves it as JSON, and reloads it.
    """
    save_path = f"{project_dir}/tokenizers/{algo}_tokenizer.json"
    tokenizer, trainer = prepare_tokenizer_trainer(algo)
    tokenizer.train_from_iterator(
        batch_chunks(raw_dataset, batch_size=1000, text_column=text_column),
        trainer=trainer,
    )
    tokenizer.save(save_path)
    tokenizer = Tokenizer.from_file(save_path)
    return tokenizer

time: 577 µs (started: 2022-11-12 06:53:10 +00:00)
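`train_from_iterator` accepts any iterator that yields strings or lists of strings, which is why the dataset is fed in batches rather than loaded into memory at once. The generator below is a minimal sketch of what a batching helper could look like; `iter_text_batches` is a hypothetical name, and the actual `batch_chunks` implementation in ekorpkit may differ.

```python
def iter_text_batches(rows, batch_size=1000, text_column="text"):
    """Yield lists of at most `batch_size` texts from an iterable of row dicts."""
    batch = []
    for row in rows:
        batch.append(row[text_column])
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch


# Toy stand-in for the dataset: iterating a Hugging Face Dataset also yields row dicts.
toy_rows = [{"text": f"doc {i}"} for i in range(5)]
batches = list(iter_text_batches(toy_rows, batch_size=2))
print(batches)
```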


### Train BPE tokenizer

In [None]:
bpe_tokenizer = train_tokenizer("BPE")

> time: 15h 30min 58s (started: 2022-11-11 06:36:01 +00:00)

Training the BPE tokenizer on 603,719 wiki articles took about 15 hours and 30 minutes. The tokenizer was saved as `{project_dir}/tokenizers/BPE_tokenizer.json`.

In [16]:
tokenizer_path = f"{project_dir}/tokenizers/BPE_tokenizer.json"

bpe_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(bpe_tokenizer.encode(raw_dataset[5555]["text"]).tokens)

Vocab size: 30000
['임', '서', '준', '(', '林', '序', '准', ',', '1993년', '4월', '19일', '~', ')', '은', '전', 'KBO', '리그', 'NC', '다이', '노', '스의', '투수이다', '.', '개명', '전', '이름은', "'", '임', '인', '혁', "'", '이다', '.', 'NC', '다이', '노스', '시절', '.', '2016년', '6월', '30일', '두산', '베어스', '와의', '경기에서', '선발', '등판', '하며', '데뷔', '첫', '경기를', '치렀고', ',', '경기에서', '2', '.', '2', '이닝', '2', '실', '점을', '기록하며', '패', '전', '투', '수가', '됐다', '.', '경찰', '야구', '단', '시절', '.', '2017년에', '입단하였다', '.', 'NC', '다이', '노스', '복귀', '.', '2018년에', '복귀하였다', '.']
time: 43.4 ms (started: 2022-11-12 07:10:01 +00:00)


### Train Unigram tokenizer

In [None]:
uni_tokenizer = train_tokenizer("UNI")

It took more time to train a Unigram tokenizer than a BPE tokenizer.
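For intuition about what a Unigram tokenizer does once it is trained, the sketch below segments a string by choosing the split whose pieces have the highest total log-probability, using Viterbi-style dynamic programming. The piece probabilities, the `unk_logprob` penalty, and the function name `viterbi_segment` are all made up for illustration; a real model learns its probabilities with EM and prunes its vocabulary during training.

```python
import math


def viterbi_segment(text, piece_logprobs, unk_logprob=-20.0):
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best log-prob of text[:end]
    best_start = [0] * (n + 1)  # start index of the last piece in that best split
    max_len = max(len(piece) for piece in piece_logprobs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in piece_logprobs:
                score = piece_logprobs[piece]
            elif end - start == 1:
                score = unk_logprob  # unknown single character
            else:
                continue
            candidate = best_score[start] + score
            if candidate > best_score[end]:
                best_score[end] = candidate
                best_start[end] = start
    # Walk backwards through the stored start indices to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy probabilities (illustrative values only).
toy_probs = {"low": 0.3, "lower": 0.15, "er": 0.1, "est": 0.1,
             "l": 0.05, "o": 0.05, "w": 0.05, "e": 0.05,
             "r": 0.05, "s": 0.05, "t": 0.05}
toy_logprobs = {piece: math.log(p) for piece, p in toy_probs.items()}
print(viterbi_segment("lower", toy_logprobs))
print(viterbi_segment("lowest", toy_logprobs))
```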

## Train tokenizers with Google SentencePiece (SPM)

### Install SentencePiece

```bash
pip install sentencepiece
```

### Split the dataset into sentences for training

The `sentencepiece` module comes with a Python training API that reads sentences from a file, one sentence per line. We will use the `sent_tokenize` function from the `nltk` package to split the text into sentences. `sent_tokenize` is a wrapper around the pre-trained Punkt sentence tokenizer. The default English Punkt model was trained on the Penn Treebank corpus, which is a collection of Wall Street Journal articles. It is a good choice for plain English text, but it may segment other languages, such as the Korean articles in this dataset, less reliably.
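The one-sentence-per-line format can be sketched with the standard library alone. In the example below, `naive_sent_split` is a deliberately crude, hypothetical stand-in for `sent_tokenize` (it only splits after `.`, `!`, or `?` followed by whitespace), and `write_sentences` is an illustrative helper, not part of ekorpkit.

```python
import re
import tempfile
from pathlib import Path


def naive_sent_split(text):
    """Very rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def write_sentences(texts, path, split_fn=naive_sent_split):
    """Write every sentence of every text to `path`, one sentence per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # Split on newlines first so that no sentence spans two lines.
            for line in text.split("\n"):
                for sent in split_fn(line):
                    f.write(sent + "\n")
                    count += 1
    return count


texts = [
    "Eylex Films is a chain of theatres. It started in Ranchi.\nChain of theatres.",
    "임서준은 투수이다. 2018년에 복귀하였다.",
]
out_path = Path(tempfile.mkdtemp()) / "sent_chunk_demo.txt"
num_sentences = write_sentences(texts, out_path)
lines = out_path.read_text(encoding="utf-8").splitlines()
print(num_sentences, lines)
```

Passing `split_fn=sent_tokenize` instead would use the Punkt model described above.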

In [8]:
import nltk
from nltk.tokenize import sent_tokenize
from ekorpkit.tokenizers.spm import export_sentence_chunk_files

nltk.download("punkt")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

time: 3.82 ms (started: 2022-11-12 06:40:36 +00:00)


In [9]:
output_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"

export_sentence_chunk_files(
    raw_dataset,
    output_dir=output_dir,
    chunk_size=10000,
    text_column=text_column,
    sent_tokenize=sent_tokenize,
)

INFO:ekorpkit.tokenizers.spm:Writing sentence chunks to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_chunk


  0%|          | 0/61 [00:00<?, ?it/s]

Writing sentences to sent_chunk_0000.txt:   0%|          | 0/10000 [00:00<?, ?it/s]

...

Writing sentences to sent_chunk_0060.txt:   0%|          | 0/3719 [00:00<?, ?it/s]

time: 1min 29s (started: 2022-11-12 06:40:38 +00:00)


### Sample sentences and combine them into a single file

If your dataset is too large, you can sample a subset of the sentence files for training. The `sample` function from the `random` module can be used to sample a subset of the files.

You can use the `sample_and_combine` function to sample a subset of the sentence files and combine them into a single file.
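As a rough illustration of the idea, the sketch below picks chunk files with `random.sample` and concatenates them into one file. The name `sample_and_combine_sketch`, its parameters, and the fixed seed are assumptions made for this example; the real `sample_and_combine` in ekorpkit may behave differently.

```python
import random
import tempfile
from pathlib import Path


def sample_and_combine_sketch(input_dir, output_file, sample_size, seed=42):
    """Sample up to `sample_size` .txt files and concatenate them into `output_file`."""
    files = sorted(Path(input_dir).glob("*.txt"))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(files, min(sample_size, len(files)))
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sampled:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # Make sure every sentence ends with exactly one newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return [p.name for p in sampled]


# Demo on four tiny chunk files written to a temporary directory.
work_dir = Path(tempfile.mkdtemp())
chunk_dir = work_dir / "chunks"
chunk_dir.mkdir()
for i in range(4):
    (chunk_dir / f"sent_chunk_{i:04d}.txt").write_text(
        f"sentence from chunk {i}\n", encoding="utf-8"
    )
combined = work_dir / "sampled_sentences.txt"
picked = sample_and_combine_sketch(chunk_dir, combined, sample_size=2)
combined_lines = combined.read_text(encoding="utf-8").splitlines()
print(picked, combined_lines)
```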

In [24]:
from ekorpkit.tokenizers.spm import sample_and_combine

input_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"
output_dir = project_dir + "/data/tokenizers/enko_filtered_samples"

sampled_file = sample_and_combine(
    input_dir=input_dir,
    output_dir=output_dir,
    sample_size=50
)


INFO:ekorpkit.tokenizers.spm:sampled files: ['sent_chunk_0008.txt', 'sent_chunk_0055.txt', 'sent_chunk_0002.txt', 'sent_chunk_0059.txt', 'sent_chunk_0026.txt', 'sent_chunk_0047.txt', 'sent_chunk_0022.txt', 'sent_chunk_0004.txt', 'sent_chunk_0050.txt', 'sent_chunk_0045.txt', 'sent_chunk_0060.txt', 'sent_chunk_0042.txt', 'sent_chunk_0051.txt', 'sent_chunk_0014.txt', 'sent_chunk_0007.txt', 'sent_chunk_0056.txt', 'sent_chunk_0005.txt', 'sent_chunk_0041.txt', 'sent_chunk_0039.txt', 'sent_chunk_0038.txt', 'sent_chunk_0035.txt', 'sent_chunk_0000.txt', 'sent_chunk_0020.txt', 'sent_chunk_0029.txt', 'sent_chunk_0053.txt', 'sent_chunk_0046.txt', 'sent_chunk_0033.txt', 'sent_chunk_0006.txt', 'sent_chunk_0048.txt', 'sent_chunk_0001.txt', 'sent_chunk_0025.txt', 'sent_chunk_0027.txt', 'sent_chunk_0036.txt', 'sent_chunk_0016.txt', 'sent_chunk_0052.txt', 'sent_chunk_0037.txt', 'sent_chunk_0015.txt', 'sent_chunk_0049.txt', 'sent_chunk_0010.txt', 'sent_chunk_0021.txt', 'sent_chunk_0024.txt', 'sent_chunk_

  0%|          | 0/24425187 [00:00<?, ?it/s]

INFO:ekorpkit.tokenizers.spm:saved sampled sentences to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt


time: 2min 47s (started: 2022-11-12 07:33:26 +00:00)


### Train SentencePiece models

You can use the `train_spm` function to train a SentencePiece model. It takes the following arguments:

- `model_prefix`: The prefix of the model file. The model file will be saved as `{model_prefix}_{model_type}_vocab_{vocab_size}.model`.
- `input`: The input file for training.
- `output_dir`: The directory to save the model file.
- `vocab_size`: The vocabulary size.
- `model_type`: The model type. It can be `unigram` (default), `bpe`, `char`, or `word`.
- `character_coverage`: The character coverage. It is only used for `unigram` and `bpe` model types. The default value is `1.0`.
- `num_threads`: The number of threads to use for training. The default value is `1`. The max value is `128`.
- `train_extremely_large_corpus`: Whether to train an extremely large corpus. The default value is `False`.
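The sketch below shows one way the arguments above could be translated into keyword arguments for `sentencepiece.SentencePieceTrainer.train`. Both function names are hypothetical, the defaults simply mirror the list above, and the real `train_spm` wrapper in ekorpkit may be implemented differently.

```python
from pathlib import Path


def build_spm_train_kwargs(
    input_file,
    output_dir,
    model_prefix,
    vocab_size,
    model_type="unigram",
    character_coverage=1.0,
    num_threads=1,
    train_extremely_large_corpus=False,
):
    """Translate the arguments listed above into SentencePieceTrainer keywords."""
    model_name = f"{model_prefix}_{model_type}_vocab_{vocab_size}"
    return {
        "input": str(input_file),
        # SentencePiece writes <model_prefix>.model and <model_prefix>.vocab.
        "model_prefix": str(Path(output_dir) / model_name),
        "model_type": model_type,
        "vocab_size": vocab_size,
        "character_coverage": character_coverage,
        "num_threads": num_threads,
        "train_extremely_large_corpus": train_extremely_large_corpus,
    }


def train_spm_sketch(**kwargs):
    """Train a model and return the path of the resulting .model file."""
    import sentencepiece as spm  # imported here so the helper above has no dependencies

    train_kwargs = build_spm_train_kwargs(**kwargs)
    Path(train_kwargs["model_prefix"]).parent.mkdir(parents=True, exist_ok=True)
    spm.SentencePieceTrainer.train(**train_kwargs)
    return train_kwargs["model_prefix"] + ".model"


# Only build the keyword arguments here; no training is run.
example_kwargs = build_spm_train_kwargs(
    input_file="sampled_sentences.txt",
    output_dir="tokenizers/spm",
    model_prefix="enko_wiki",
    vocab_size=30000,
    character_coverage=0.9995,
)
print(example_kwargs["model_prefix"])
```

Keeping the argument translation separate from the training call makes it easy to inspect the exact settings before starting a long training run.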

#### Train Unigram model

In [26]:
from ekorpkit.tokenizers.spm import train_spm

uni_model_path = train_spm(
    model_prefix="enko_wiki",
    input=sampled_file,
    output_dir=project_dir + "/tokenizers/spm",
    model_type="unigram",
    vocab_size=30000,
    character_coverage=0.9995,
    num_threads=128,
)

INFO:ekorpkit.tokenizers.spm:Training SentencePiece model enko_wiki_unigram_vocab_30000.model
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt
  input_format: 
  model_prefix: 
  model_type: UNIGRAM
  vocab_size: 30000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 128
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1


time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)


ting frequent sub strings...
unigram_model_trainer.cc(201) LOG(INFO) Initialized 1000000 seed sentencepieces
trainer_interface.cc(596) LOG(INFO) Tokenizing input sentences with whitespace: 24419902
trainer_interface.cc(607) LOG(INFO) Done! 4696044
unigram_model_trainer.cc(491) LOG(INFO) Using 4696044 sentences for EM training
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=838106 obj=16.3019 num_tokens=14291441 num_tokens/piece=17.0521
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size=688067 obj=14.038 num_tokens=14365800 num_tokens/piece=20.8785
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=516036 obj=13.9962 num_tokens=14484892 num_tokens/piece=28.0695
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size=515998 obj=13.9855 num_tokens=14495195 num_tokens/piece=28.0916
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=386997 obj=14.0208 num_tokens=14763508 num_tokens/piece=38.1489
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size

> time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)

Training a Unigram model with a vocabulary size of 30,000 took 23 minutes. The model file was saved in the `{project_dir}/tokenizers/spm` directory.

#### Load the trained model

In [None]:
import sentencepiece as spm

# uni_model_path is the model path returned by train_spm above
uni_spm = spm.SentencePieceProcessor(model_file=uni_model_path)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(uni_spm.encode_as_pieces(raw_dataset[5555]["text"]))


#### Train BPE model

In [27]:
bpe_model_path = train_spm(
    model_prefix="enko_wiki",
    input=sampled_file,
    output_dir=project_dir + "/tokenizers/spm",
    model_type="bpe",
    vocab_size=30000,
    character_coverage=0.9995,
    num_threads=128,
)

#### Load the trained model

In [None]:
import sentencepiece as spm

bpe_spm = spm.SentencePieceProcessor(model_file=bpe_model_path)
print(f"Vocab size: {bpe_spm.get_piece_size()}")
print(bpe_spm.encode_as_pieces(raw_dataset[5555]["text"]))


## Compare the Tokenizers

### Load the tokenizers

In [None]:
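tokenizers = {
    "BPE": bpe_tokenizer,  # Hugging Face BPE tokenizer
    "UNI": uni_tokenizer,  # Hugging Face Unigram tokenizer
    "SPM_UNI": uni_spm,  # SentencePiece Unigram model
    "SPM_BPE": bpe_spm,  # SentencePiece BPE model
}

### Analyze the output of the tokenizers

In [None]:
def tokenize(tokenizer, text):
    """Return the list of tokens for either kind of tokenizer."""
    # SentencePiece processors expose encode_as_pieces();
    # Hugging Face tokenizers return an Encoding with a .tokens attribute.
    if hasattr(tokenizer, "encode_as_pieces"):
        return tokenizer.encode_as_pieces(text)
    return tokenizer.encode(text).tokens


texts = []
num_samples = 10
tokens = {name: [] for name in tokenizers.keys()}

# sample 10 texts from the dataset randomly
for i in range(num_samples):
    texts.append(raw_dataset.shuffle(seed=i)[i][text_column])

# tokenize the texts with the tokenizers
for text in texts:
    for name, tokenizer in tokenizers.items():
        print(f"Tokenizer: {name}")
        tokens[name].append(tokenize(tokenizer, text))
        print(tokens[name][-1])
        print("-" * 50)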

### Compare the Tokens

In [None]:
import pandas as pd

sample_num = 5

max_len = max(len(tokens[name][sample_num]) for name in tokenizers.keys())
diffs = {name: max_len - len(tokens[name][sample_num]) for name in tokenizers.keys()}

padded_tokens = {
    name: tokens[name][sample_num] + [""] * diffs[name] for name in tokenizers.keys()
}

df = pd.DataFrame(padded_tokens)
df
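Besides eyeballing the side-by-side table, a simple numeric comparison is the average number of tokens each tokenizer produces per text: fewer tokens for the same text generally means longer, more word-like pieces. The helper below is a minimal sketch that works on a dictionary shaped like `tokens`; the function name and the toy token lists are invented for illustration and are not results from this lab.

```python
def summarize_tokenizations(tokens_by_name):
    """Return total and mean token counts per tokenizer."""
    summary = {}
    for name, token_lists in tokens_by_name.items():
        total = sum(len(toks) for toks in token_lists)
        summary[name] = {
            "total_tokens": total,
            "mean_tokens_per_text": total / len(token_lists),
        }
    return summary


# Invented toy data with the same shape as the `tokens` dictionary above.
toy_tokens = {
    "tok_a": [["NC", "다이", "노스"], ["2018년에", "복귀하였다", "."]],
    "tok_b": [["NC", "다이노스"], ["2018년", "에", "복귀", "하였다", "."]],
}
summary = summarize_tokenizations(toy_tokens)
for name, stats in summary.items():
    print(name, stats)
```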