# Lab 3: Training Tokenizers

![](../figs/deep_nlp/lab/tokenize.png)


## Prepare the environment


In [1]:
# %pip install --pre ekorpkit[tokenize]




In [5]:
%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)

INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env


The autotime extension is already loaded. To reload it, use:
  %reload_ext autotime
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
version: 0.1.40.post0.dev17
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 37.3 ms (started: 2022-11-13 04:47:09 +00:00)


### Load the saved corpora


In [6]:
data = eKonf.load_data("enko_filtered.parquet", project_dir + "/data")
data.head()


INFO:ekorpkit.io.file:Processing [1] files from ['enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading 1 dataframes from ['/content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet']
INFO:ekorpkit.io.file:Loading data from /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/enko_filtered.parquet


| | id | text | split | filename | corpus | num_chars | num_words | num_sents | avg_num_chars | avg_num_words |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7644961 | Anaissini is a tribe of click beetles in the f... | train | wiki_49 | enwiki_sampled | 63 | 11 | 1 | 5.727273 | 11.0 |
| 2 | 6658552 | The Vicky Metcalf Award for Literature for You... | train | wiki_24 | enwiki_sampled | 479 | 82 | 5 | 5.841463 | 16.4 |
| 4 | 11081255 | Eylex Films Pvt is a chain of multiplex and si... | train | wiki_94 | enwiki_sampled | 1161 | 181 | 12 | 6.414365 | 15.083333 |
| 8 | 4706486 | Željko Zečević (; born 21 October 1963) is a S... | train | wiki_02 | enwiki_sampled | 1151 | 201 | 15 | 5.726368 | 13.4 |
| 12 | 2170359 | Gilberto Nascimento Silva (born 9 June 1956) i... | train | wiki_57 | enwiki_sampled | 685 | 105 | 9 | 6.52381 | 11.666667 |


time: 3.34 s (started: 2022-11-13 04:47:10 +00:00)


In [7]:
text_column = "text"

text_en = (
    data[data.corpus == "enwiki_sampled"][text_column].sample(1).values[0].split("\n")
)
text_ko = data[data.corpus == "kowiki"][text_column].sample(1).values[0].split("\n")

print(text_en)
print(text_ko)


['The London and South Western Railway T1 class was a class of fifty 0-4-4T steam tank locomotives designed for suburban passenger work by William Adams and built between 1888 and 1896.', 'History.', 'The class were numbered 1–20, 61–80 and 358–367. In typical London and South Western Railway fashion, they reused the numbers of retired or duplicated engines. The class remained largely intact until the 1930s, being scheduled to be withdrawn by 1940, however due to the Second World War a few remained in traffic (around eight examples) until the early British Railways years, the final one (30007) being withdrawn in May 1951.', 'Possible Revial.', "No complete T1 locomotives were saved for preservation, however, a boiler and smokebox from a withdrawn locomotive was found in a factory in Essex back in the 1980s and was subsequently purchased for use on a 'new' T1 locomotive. Since September 2004, this boiler has been stored on the Avon Valley Railway."]
['보이보디나 자치주()는 유고슬라비아 연방 인민 공화국의 세르비아

### Convert pandas DataFrame to Hugging Face dataset


In [9]:
from datasets import Dataset

raw_dataset = Dataset.from_pandas(data[[text_column]])
raw_dataset


Dataset({
    features: ['text', '__index_level_0__'],
    num_rows: 603719
})

time: 940 ms (started: 2022-11-13 03:06:51 +00:00)


### Shuffle the dataset


In [10]:
# shuffle the dataset

raw_dataset = raw_dataset.shuffle(seed=42)


time: 112 ms (started: 2022-11-13 03:07:01 +00:00)


### Split the dataset into sentences for training

The SentencePiece module comes with a Python training API that reads sentences from a file, one sentence per line. We will use the `sent_tokenize` function from the `nltk` package to split the text into sentences. `sent_tokenize` is a wrapper around `punkt`, a pre-trained sentence tokenizer. The English `punkt` model was trained on the Penn Treebank corpus, a collection of Wall Street Journal articles, so it is a good choice for plain English text but may not be the best choice for other languages, such as the Korean articles in this dataset.
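The one-sentence-per-line layout can be sketched with the standard library alone. The regex splitter below is a deliberately naive stand-in for `sent_tokenize` (it is not Punkt), and `write_sentences` is a hypothetical helper, not part of ekorpkit.

```python
import re
import tempfile
from pathlib import Path


def naive_sent_split(text):
    """Naive stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def write_sentences(documents, path, sent_split=naive_sent_split):
    """Write every sentence of every document on its own line; return the sentence count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            # Paragraphs are separated by newlines, so split on them first.
            for paragraph in doc.split("\n"):
                for sentence in sent_split(paragraph):
                    f.write(sentence + "\n")
                    count += 1
    return count


docs = ["History.\nThe class were numbered 1-20. They reused old numbers."]
out_path = Path(tempfile.mkdtemp()) / "sent_chunk_demo.txt"
num_sentences = write_sentences(docs, out_path)
lines = out_path.read_text(encoding="utf-8").splitlines()
print(num_sentences, lines)
```

Passing `sent_split=sent_tokenize` would swap in the NLTK splitter without changing the rest of the sketch.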


In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from ekorpkit.tokenizers.trainers.spm import export_sentence_chunk_files

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

time: 3.82 ms (started: 2022-11-12 06:40:36 +00:00)


In [None]:
output_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"

export_sentence_chunk_files(
    raw_dataset,
    output_dir=output_dir,
    chunk_size=10000,
    text_column=text_column,
    sent_tokenize=sent_tokenize,
)


INFO:ekorpkit.tokenizers.spm:Writing sentence chunks to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_chunk


  0%|          | 0/61 [00:00<?, ?it/s]

Writing sentences to sent_chunk_0000.txt:   0%|          | 0/10000 [00:00<?, ?it/s]

... (identical progress lines for sent_chunk_0001.txt through sent_chunk_0059.txt omitted)

Writing sentences to sent_chunk_0060.txt:   0%|          | 0/3719 [00:00<?, ?it/s]

time: 1min 29s (started: 2022-11-12 06:40:38 +00:00)


### Sample sentences and combine them into a single file

If your dataset is too large, you can sample a subset of the sentence files for training. The `sample` function from the `random` module can be used to sample a subset of the files.

You can use the `sample_and_combine` function to sample a subset of the sentence files and combine them into a single file.
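To make the idea concrete, here is a minimal standard-library sketch of sampling chunk files and concatenating them. It is not the ekorpkit implementation: `sample_and_combine_sketch`, its `seed` parameter, and the demo file contents are assumptions made for illustration.

```python
import random
import tempfile
from pathlib import Path


def sample_and_combine_sketch(input_dir, output_file, sample_size=1, seed=None):
    """Pick `sample_size` chunk files at random and concatenate them into one file."""
    files = sorted(Path(input_dir).glob("*.txt"))
    rng = random.Random(seed)
    sampled = rng.sample(files, min(sample_size, len(files)))
    num_lines = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sampled:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():  # skip blank lines
                        out.write(line if line.endswith("\n") else line + "\n")
                        num_lines += 1
    return [p.name for p in sampled], num_lines


# Tiny demonstration with three fake chunk files of two sentences each.
chunk_dir = Path(tempfile.mkdtemp())
for i in range(3):
    (chunk_dir / f"sent_chunk_{i:04d}.txt").write_text(
        f"Sentence A{i}.\nSentence B{i}.\n", encoding="utf-8"
    )

out_dir = Path(tempfile.mkdtemp())
names, n = sample_and_combine_sketch(
    chunk_dir, out_dir / "sampled_sentences.txt", sample_size=2, seed=42
)
print(names, n)
```

Writing the combined file to a separate directory keeps it from being picked up as an input chunk on a later run.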


In [11]:
from ekorpkit.tokenizers.trainers.spm import sample_and_combine

input_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"
output_dir = project_dir + "/data/tokenizers/enko_filtered_samples"

sampled_file = sample_and_combine(
    input_dir=input_dir, output_dir=output_dir, sample_size=1
)

INFO:ekorpkit.tokenizers.trainers.spm:sampled files: ['sent_chunk_0002.txt']
INFO:ekorpkit.tokenizers.trainers.spm:number of lines sampled: 61,693


  0%|          | 0/61693 [00:00<?, ?it/s]

INFO:ekorpkit.tokenizers.trainers.spm:saved sampled sentences to /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt


time: 747 ms (started: 2022-11-13 03:08:24 +00:00)


## Train tokenizers with Hugging Face Tokenizers

[Hugging Face's Tokenizers](https://huggingface.co/docs/tokenizers/quicktour) library provides several tokenization models, including BPE, WordPiece, Unigram, and WordLevel, along with ready-made SentencePiece-style and byte-level BPE implementations. We will use the BPE and Unigram models in this lab.
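As a reminder of what BPE training does, the following toy sketch learns merge rules by repeatedly fusing the most frequent adjacent symbol pair. It is a simplified illustration in pure Python, not the library's implementation; the function name `learn_bpe_merges` and the tiny corpus are made up for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe_merges(corpus, num_merges=3)
print(merges)
print(list(vocab))
```

Unigram training works in the opposite direction: it starts from a large candidate vocabulary and prunes it, which is why the two models segment the same sentence differently.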


### Import the libraries and prepare functions


In [10]:
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel
from tokenizers.trainers import BpeTrainer, UnigramTrainer, WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
from ekorpkit.tokenizers.trainers.spm import batch_chunks


unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>", "[MASK]"]  # special tokens


def prepare_tokenizer_trainer(algo):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if algo == "BPE":
        tokenizer = Tokenizer(BPE(unk_token=unk_token))
        trainer = BpeTrainer(special_tokens=spl_tokens)
    elif algo == "UNI":
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        trainer = WordLevelTrainer(special_tokens=spl_tokens)

    normalizer = normalizers.Sequence([NFD(), StripAccents()])
    tokenizer.normalizer = normalizer
    tokenizer.pre_tokenizer = Whitespace()

    return tokenizer, trainer


time: 2.35 ms (started: 2022-11-12 06:51:08 +00:00)
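The effect of the `NFD` + `StripAccents` normalizer can be approximated with the standard `unicodedata` module. This is a sketch of the idea, not the library's code: `nfd_strip_accents` is a hypothetical helper, and it assumes that stripping accents means dropping characters in the Unicode mark categories.

```python
import unicodedata


def nfd_strip_accents(text):
    """Approximate NFD + StripAccents: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )


latin = nfd_strip_accents("Željko Zečević")
hangul = nfd_strip_accents("보")
print(latin)
print(len("보"), len(hangul))
```

Note that NFD also decomposes precomposed Hangul syllables into conjoining jamo, so it is worth checking how a normalizer affects Korean text before training on it.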


In [12]:
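import os


def train_tokenizer(algo="BPE"):
    """
    Trains a tokenizer on the dataset, saves it, and returns the reloaded tokenizer.
    """
    save_path = f"{project_dir}/tokenizers/{algo}_tokenizer.json"
    # make sure the output directory exists before saving
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    tokenizer, trainer = prepare_tokenizer_trainer(algo)
    # feed the texts to the trainer in batches (batch_size=1000)
    tokenizer.train_from_iterator(
        batch_chunks(raw_dataset, batch_size=1000, text_column=text_column),
        trainer=trainer,
    )
    tokenizer.save(save_path)
    # reload from disk to confirm that the saved file is usable
    tokenizer = Tokenizer.from_file(save_path)
    return tokenizer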


time: 20.2 ms (started: 2022-11-13 03:09:51 +00:00)
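`train_from_iterator` consumes an iterator that yields texts or batches of texts. The implementation of `batch_chunks` is not shown in this lab, so the generator below is only an assumed equivalent for a plain Python list; `batch_iterator` is a hypothetical name.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield consecutive lists of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


sample_texts = [f"document {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in batch_iterator(sample_texts, batch_size=1000)]
print(batch_sizes)
```

Yielding batches lazily keeps memory use bounded, since only one batch of texts has to be materialized at a time.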


### Train BPE tokenizer


In [None]:
bpe_tokenizer = train_tokenizer("BPE")


> time: 15h 30min 58s (started: 2022-11-11 06:36:01 +00:00)

Training the BPE tokenizer on the 603,719 wiki articles took 15 hours and 30 minutes. The tokenizer was saved in the `{project_dir}/tokenizers` directory.

> took 2m 15.5s

With 256 processors, it took 2 minutes and 15 seconds to tokenize the 603,719 wiki articles.


To train more efficiently with multiple processors, it is preferable to use the command-line interface (CLI).


```bash
ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    verbose=false \
    cmd=train_tokenizer \
    +tokenizer=train_hf \
    tokenizer.model_prefix=enko_wiki \
    tokenizer.model_type=bpe \
    tokenizer.vocab_size=30000 \
    tokenizer.input_files=sampled_sentences.txt \
    tokenizer.input_dir=data/tokenizers/enko_filtered_samples \
    tokenizer.output_dir=tokenizers/hf/enko_wiki
```


In [9]:
from tokenizers import Tokenizer

tokenizer_path = (
    f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_bpe_huggingface_vocab_30000.json"
)

bpe_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {bpe_tokenizer.get_vocab_size()}")
print(bpe_tokenizer.encode(text_en[0]).tokens)
print(bpe_tokenizer.encode(text_ko[0]).tokens)

Vocab size: 30000
['The', 'London', 'and', 'South', 'Western', 'Railway', 'T', '1', 'class', 'was', 'a', 'class', 'of', 'fif', 'ty', '0', '-', '4', '-', '4', 'T', 'ste', 'am', 'tank', 'loc', 'om', 'ot', 'ives', 'designed', 'for', 'sub', 'urban', 'passenger', 'work', 'by', 'William', 'Adams', 'and', 'built', 'between', '1888', 'and', '1896', '.']
['보이', '보', '디나', '자치', '주', '()', '는', '유고슬라비아', '연방', '인민', '공화국의', '세르비아', '인민', '공화국', '에', '있던', '두', '개의', '자치', '주', '가운데', '하나로', ',', '수도는', '노비', '사', '드', '였다', '.']
time: 53.3 ms (started: 2022-11-13 04:48:08 +00:00)


### Train Unigram tokenizer


```python
unigram_tokenizer = train_tokenizer("UNI")
```

For a very large corpus, it may take a long time to train a Unigram tokenizer. It is recommended to use the following CLI command to train a Unigram tokenizer.

```bash
ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    verbose=false \
    cmd=train_tokenizer \
    +tokenizer=train_hf \
    tokenizer.model_prefix=enko_wiki \
    tokenizer.model_type=unigram \
    tokenizer.vocab_size=30000 \
    tokenizer.input_files=sampled_sentences.txt \
    tokenizer.input_dir=data/tokenizers/enko_filtered_samples \
    tokenizer.output_dir=tokenizers/hf/enko_wiki
```


In [10]:
from tokenizers import Tokenizer

tokenizer_path = (
    f"{project_dir}/tokenizers/hf/enko_wiki/enko_wiki_unigram_huggingface_vocab_30000.json"
)

unigram_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {unigram_tokenizer.get_vocab_size()}")
print(unigram_tokenizer.encode(text_en[0]).tokens)
print(unigram_tokenizer.encode(text_ko[0]).tokens)

Vocab size: 30000
['The', 'Lond', 'on', 'and', 'South', 'Wester', 'n', 'Railway', 'T', '1', 'class', 'was', 'a', 'class', 'of', 'fift', 'y', '0', '-', '4', '-', '4', 'T', 's', 'team', 'tank', 'lo', 'c', 'omotiv', 'es', 'design', 'ed', 'for', 'suburb', 'an', 'passenger', 'work', 'by', 'William', 'Adams', 'and', 'buil', 't', 'be', 'twe', 'en', '1888', 'and', '1896', '.']
['보이', '보', '디', '나', '자치주', '()', '는', '유고슬라비아', '연방', '인민', '공화국', '의', '세르비아', '인민', '공화국', '에', '있던', '두', '개', '의', '자치주', '가운데', '하나로', ',', '수도', '는', '노비', '사', '드', '였다', '.']
time: 56.3 ms (started: 2022-11-13 04:49:18 +00:00)


## Train tokenizers with Google SentencePiece (SPM)

### Install SentencePiece

```bash
pip install sentencepiece
```


### Train SentencePiece models

You can use the `train_spm` function to train a SentencePiece model. It takes the following arguments:

- `model_prefix`: The prefix of the model file. The model file will be saved as `{model_prefix}_{model_type}_vocab_{vocab_size}.model`.
- `input`: The input file for training.
- `output_dir`: The directory to save the model file.
- `vocab_size`: The vocabulary size.
- `model_type`: The model type. It can be `unigram` (default), `bpe`, `char`, or `word`.
- `character_coverage`: The character coverage. It is only used for `unigram` and `bpe` model types. The default value is `1.0`.
- `num_threads`: The number of threads to use for training. The default value is `1`. The max value is `128`.
- `train_extremely_large_corpus`: Whether to train an extremely large corpus. The default value is `False`.
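The arguments above map closely onto the keyword arguments of `sentencepiece.SentencePieceTrainer.train`. The sketch below shows one way such a wrapper could assemble them, following the file-naming scheme described in the list. `build_spm_args` and `train_spm_sketch` are hypothetical helpers, not ekorpkit's `train_spm`, and the defaults simply mirror the list above.

```python
from pathlib import Path


def build_spm_args(model_prefix, input, output_dir, vocab_size=30000,
                   model_type="unigram", character_coverage=1.0, num_threads=1,
                   train_extremely_large_corpus=False):
    """Assemble keyword arguments for the SentencePiece trainer (hypothetical helper)."""
    # Follow the naming scheme described above: {prefix}_{type}_vocab_{size}
    model_name = f"{model_prefix}_{model_type}_vocab_{vocab_size}"
    return {
        "input": str(input),
        "model_prefix": str(Path(output_dir) / model_name),
        "vocab_size": vocab_size,
        "model_type": model_type,
        "character_coverage": character_coverage,
        "num_threads": num_threads,
        "train_extremely_large_corpus": train_extremely_large_corpus,
    }


def train_spm_sketch(**kwargs):
    """Train a model; needs the `sentencepiece` package and a real input file."""
    import sentencepiece as spm  # imported lazily so the helper above works without it

    args = build_spm_args(**kwargs)
    Path(args["model_prefix"]).parent.mkdir(parents=True, exist_ok=True)
    spm.SentencePieceTrainer.train(**args)
    return args["model_prefix"] + ".model"


spm_args = build_spm_args(
    model_prefix="enko_wiki",
    input="sampled_sentences.txt",
    output_dir="tokenizers/spm",
    model_type="unigram",
    character_coverage=0.9995,
    num_threads=128,
)
print(spm_args["model_prefix"] + ".model")
```

Only `build_spm_args` runs here; `train_spm_sketch` is left uncalled because it requires the sampled sentence file on disk.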


#### Train Unigram model


In [26]:
from ekorpkit.tokenizers.trainers.spm import train_spm

uni_model_path = train_spm(
    model_prefix="enko_wiki",
    input=sampled_file,
    output_dir=project_dir + "/tokenizers/spm",
    model_type="unigram",
    vocab_size=30000,
    character_coverage=0.9995,
    num_threads=128,
)


INFO:ekorpkit.tokenizers.spm:Training SentencePiece model enko_wiki_unigram_vocab_30000.model
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: /content/drive/MyDrive/workspace/projects/ekorpkit-book/data/tokenizers/enko_filtered_samples/sampled_sentences.txt
  input_format: 
  model_prefix: 
  model_type: UNIGRAM
  vocab_size: 30000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 128
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1


time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)


ting frequent sub strings...
unigram_model_trainer.cc(201) LOG(INFO) Initialized 1000000 seed sentencepieces
trainer_interface.cc(596) LOG(INFO) Tokenizing input sentences with whitespace: 24419902
trainer_interface.cc(607) LOG(INFO) Done! 4696044
unigram_model_trainer.cc(491) LOG(INFO) Using 4696044 sentences for EM training
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=838106 obj=16.3019 num_tokens=14291441 num_tokens/piece=17.0521
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size=688067 obj=14.038 num_tokens=14365800 num_tokens/piece=20.8785
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=516036 obj=13.9962 num_tokens=14484892 num_tokens/piece=28.0695
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size=515998 obj=13.9855 num_tokens=14495195 num_tokens/piece=28.0916
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=0 size=386997 obj=14.0208 num_tokens=14763508 num_tokens/piece=38.1489
unigram_model_trainer.cc(507) LOG(INFO) EM sub_iter=1 size

> time: 23min 6s (started: 2022-11-12 08:03:08 +00:00)

It took 23 minutes to train a unigram model with a vocabulary size of 30,000. The model file was saved in the `{project_dir}/tokenizers/spm` directory.


```bash
ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    verbose=false \
    cmd=train_tokenizer \
    +tokenizer=train_spm \
    tokenizer.model_prefix=enko_wiki \
    tokenizer.model_type=unigram \
    tokenizer.vocab_size=30000 \
    tokenizer.input_files=sampled_sentences.txt \
    tokenizer.input_dir=data/tokenizers/enko_filtered_samples \
    tokenizer.output_dir=tokenizers/spm/enko_wiki \
    tokenizer.character_coverage=0.9995 \
    tokenizer.num_workers=128
```


#### Load the trained model


In [13]:
import sentencepiece as spm

model_file = "tokenizers/spm/enko_wiki_unigram_vocab_30000.model"
model_file = project_dir + "/" + model_file
uni_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {uni_spm.vocab_size()}")
print(uni_spm.encode(text_en[0], out_type=str))
print(uni_spm.encode(text_ko[0], out_type=str))

Vocab size: 30000
['▁The', '▁London', '▁and', '▁South', '▁Western', '▁Railway', '▁T', '1', '▁class', '▁was', '▁a', '▁class', '▁of', '▁fifty', '▁0', '-4', '-4', 'T', '▁steam', '▁tank', '▁locomotive', 's', '▁designed', '▁for', '▁suburb', 'an', '▁passenger', '▁work', '▁by', '▁William', '▁Adams', '▁and', '▁built', '▁between', '▁1888', '▁and', '▁1896', '.']
['▁보이', '보', '디', '나', '▁자치주', '()', '는', '▁유고슬라비아', '▁연방', '▁인민', '▁공화국의', '▁세르비아', '▁인민', '▁공화국', '에', '▁있던', '▁두', '▁개의', '▁자치주', '▁가운데', '▁하나로', ',', '▁수도', '는', '▁노비', '사', '드', '였다', '.']
time: 58.7 ms (started: 2022-11-13 04:52:09 +00:00)
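Unlike the Hugging Face tokens shown earlier, SentencePiece pieces carry a `▁` (U+2581) marker wherever a space preceded the piece, which makes the segmentation reversible. The sketch below rebuilds text from a few pieces copied from the output above; `pieces_to_text` is a hypothetical helper, and in practice the processor's own `decode` method would be the natural choice.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that stands for a preceding space


def pieces_to_text(pieces):
    """Rebuild the surface text from SentencePiece pieces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


pieces = ["▁fifty", "▁0", "-4", "-4", "T", "▁steam", "▁tank", "▁locomotive", "s"]
restored = pieces_to_text(pieces)
print(restored)
```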


#### Train BPE model


```bash
ekorpkit \
    project.name=ekorpkit-book \
    dir.workspace=/content/drive/MyDrive/workspace \
    verbose=false \
    cmd=train_tokenizer \
    +tokenizer=train_spm \
    tokenizer.model_prefix=enko_wiki \
    tokenizer.model_type=bpe \
    tokenizer.vocab_size=30000 \
    tokenizer.input_files=sampled_sentences.txt \
    tokenizer.input_dir=data/tokenizers/enko_filtered_samples \
    tokenizer.output_dir=tokenizers/spm/enko_wiki \
    tokenizer.character_coverage=0.9995 \
    tokenizer.num_workers=128
```

> took 15m 24.4s

It took 15 minutes to train a BPE model with a vocabulary size of 30,000. The model file was saved in the `{project_dir}/tokenizers` directory.


#### Load the trained model


In [14]:
import sentencepiece as spm

model_file = "tokenizers/spm/enko_wiki_bpe_vocab_30000.model"
model_file = project_dir + "/" + model_file
bpe_spm = spm.SentencePieceProcessor(model_file=model_file)
print(f"Vocab size: {bpe_spm.vocab_size()}")
print(bpe_spm.encode(text_en[0], out_type=str))
print(bpe_spm.encode(text_ko[0], out_type=str))

Vocab size: 30000
['▁The', '▁London', '▁and', '▁South', '▁Western', '▁Railway', '▁T', '1', '▁class', '▁was', '▁a', '▁class', '▁of', '▁fif', 'ty', '▁0', '-4', '-4', 'T', '▁steam', '▁t', 'ank', '▁locomot', 'ives', '▁designed', '▁for', '▁sub', 'urban', '▁passenger', '▁work', '▁by', '▁William', '▁Adams', '▁and', '▁built', '▁between', '▁1888', '▁and', '▁189', '6.']
['▁보이', '보', '디나', '▁자치', '주', '()', '는', '▁유고슬라비아', '▁연방', '▁인민', '▁공화국의', '▁세르비아', '▁인민', '▁공화국', '에', '▁있던', '▁두', '▁개의', '▁자치', '주', '▁가운데', '▁하나로', ',', '▁수', '도는', '▁노비', '사', '드', '였다', '.']
time: 29.5 ms (started: 2022-11-13 04:53:59 +00:00)


## Compare the Tokenizers

### Load the tokenizers


In [16]:
tokenizers = {
    "BPE": bpe_tokenizer,
    "UNI": unigram_tokenizer,
    "UNI_SPM": uni_spm,
    "BPE_SPM": bpe_spm,
}


def tokenize(tokenizer, text):
    """
    Tokenizes the text using the tokenizer.
    """
    if isinstance(tokenizer, spm.SentencePieceProcessor):
        return tokenizer.encode(text, out_type=str)
    return tokenizer.encode(text).tokens

time: 19.1 ms (started: 2022-11-13 04:56:11 +00:00)


### Analyze the output of the tokenizers


In [19]:
texts = [text_en[0], text_ko[0]]
tokens = {name: [] for name in tokenizers.keys()}


# tokenize the texts with the tokenizers
for text in texts:
    for name, tokenizer in tokenizers.items():
        print(f"Tokenizer: {name}")
        tokens[name].append(tokenize(tokenizer, text))
        print(tokens[name][-1])
        print("-" * 50)

Tokenizer: BPE
['The', 'London', 'and', 'South', 'Western', 'Railway', 'T', '1', 'class', 'was', 'a', 'class', 'of', 'fif', 'ty', '0', '-', '4', '-', '4', 'T', 'ste', 'am', 'tank', 'loc', 'om', 'ot', 'ives', 'designed', 'for', 'sub', 'urban', 'passenger', 'work', 'by', 'William', 'Adams', 'and', 'built', 'between', '1888', 'and', '1896', '.']
--------------------------------------------------
Tokenizer: UNI
['The', 'Lond', 'on', 'and', 'South', 'Wester', 'n', 'Railway', 'T', '1', 'class', 'was', 'a', 'class', 'of', 'fift', 'y', '0', '-', '4', '-', '4', 'T', 's', 'team', 'tank', 'lo', 'c', 'omotiv', 'es', 'design', 'ed', 'for', 'suburb', 'an', 'passenger', 'work', 'by', 'William', 'Adams', 'and', 'buil', 't', 'be', 'twe', 'en', '1888', 'and', '1896', '.']
--------------------------------------------------
Tokenizer: UNI_SPM
['▁The', '▁London', '▁and', '▁South', '▁Western', '▁Railway', '▁T', '1', '▁class', '▁was', '▁a', '▁class', '▁of', '▁fifty', '▁0', '-4', '-4', 'T', '▁steam', '▁tank',

### Compare the Tokens


In [21]:
import pandas as pd


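def compare_tokens(tokenizers, tokens, sample_num=0):
    """
    Builds a DataFrame with one column per tokenizer for the given sample.
    Shorter token lists are padded with empty strings to a common length.
    """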
    max_len = max(len(tokens[name][sample_num]) for name in tokenizers.keys())
    diffs = {
        name: max_len - len(tokens[name][sample_num]) for name in tokenizers.keys()
    }

    padded_tokens = {
        name: tokens[name][sample_num] + [""] * diffs[name]
        for name in tokenizers.keys()
    }

    df = pd.DataFrame(padded_tokens)
    return df

time: 18.9 ms (started: 2022-11-13 05:00:58 +00:00)


In [22]:
compare_tokens(tokenizers, tokens, sample_num=0)

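| | BPE | UNI | UNI_SPM | BPE_SPM |
| --- | --- | --- | --- | --- |
| 0 | The | The | ▁The | ▁The |
| 1 | London | Lond | ▁London | ▁London |
| 2 | and | on | ▁and | ▁and |
| 3 | South | and | ▁South | ▁South |
| 4 | Western | South | ▁Western | ▁Western |
| 5 | Railway | Wester | ▁Railway | ▁Railway |
| 6 | T | n | ▁T | ▁T |
| 7 | 1 | Railway | 1 | 1 |
| 8 | class | T | ▁class | ▁class |
| 9 | was | 1 | ▁was | ▁was |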


time: 27.5 ms (started: 2022-11-13 05:01:05 +00:00)


In [23]:
compare_tokens(tokenizers, tokens, sample_num=1)

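| | BPE | UNI | UNI_SPM | BPE_SPM |
| --- | --- | --- | --- | --- |
| 0 | 보이 | 보이 | ▁보이 | ▁보이 |
| 1 | 보 | 보 | 보 | 보 |
| 2 | 디나 | 디 | 디 | 디나 |
| 3 | 자치 | 나 | 나 | ▁자치 |
| 4 | 주 | 자치주 | ▁자치주 | 주 |
| 5 | () | () | () | () |
| 6 | 는 | 는 | 는 | 는 |
| 7 | 유고슬라비아 | 유고슬라비아 | ▁유고슬라비아 | ▁유고슬라비아 |
| 8 | 연방 | 연방 | ▁연방 | ▁연방 |
| 9 | 인민 | 인민 | ▁인민 | ▁인민 |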


time: 24.7 ms (started: 2022-11-13 05:01:13 +00:00)
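Because the tokenizers split the same sentence into different numbers of tokens, the rows of these tables drift out of alignment. A quick way to quantify the difference is to count tokens per tokenizer. The sketch below does this with the standard library, using the first few English tokens copied from the outputs above; `zip_longest` is used here as an alternative to the manual padding in `compare_tokens`.

```python
from itertools import zip_longest

# First few tokens of the English sample, copied from the outputs above.
sample_tokens = {
    "BPE": ["The", "London", "and", "South", "Western"],
    "UNI": ["The", "Lond", "on", "and", "South", "Wester", "n"],
}

# zip_longest pads the shorter lists, giving one row per token position.
rows = list(zip_longest(*sample_tokens.values(), fillvalue=""))
token_counts = {name: len(toks) for name, toks in sample_tokens.items()}

for row in rows:
    print("\t".join(row))
print(token_counts)
```

A lower token count for the same text generally means the vocabulary covers that text with longer pieces.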
