# ActivateBaby - buliding tokenizer
based on [How to train a new language model from scratch using Transformers and Tokenizers](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=M1oqh0F6W3ad)


# 0. init

In [1]:
conda env list

# conda environments:
#
base                     /root/miniconda3
CreativeSumm             /root/miniconda3/envs/CreativeSumm
allennlp_env             /root/miniconda3/envs/allennlp_env
codenames                /root/miniconda3/envs/codenames
rsa_tapm                 /root/miniconda3/envs/rsa_tapm
sum                      /root/miniconda3/envs/sum
tapm                  *  /root/miniconda3/envs/tapm
transformers             /root/miniconda3/envs/transformers


Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from os.path import join as osj
from pathlib import Path

import nltk
import pandas as pd
from tqdm import tqdm

In [3]:
from spacy.lang.en import English
import spacy
nlp = spacy.load("en_core_web_sm")

2023-05-25 16:43:56.221129: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-05-25 16:43:56.221160: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## 1. Dataset

In [34]:
DATA_PATH = '/root/xhong/babylm/babylm_data'

In [35]:
data_version = 'babylm_10M'
data_path = osj(DATA_PATH, data_version)
data_dict = {}

for f in os.listdir(data_path):
    if f[0] == '.':
        continue
    corpus_name = f.split('.')[0]
    print(f)

open_subtitles.train
simple_wikipedia.train
wikipedia.train
switchboard.train
cbt.train
aochildes.train
bnc_spoken.train
qed.train
gutenberg.train
children_stories.train


### get corpus by sents

In [37]:
paths = [str(x) for x in Path(data_path).glob("**/*.train")]
output_file = 'babylm_10M_sents.txt'
with open(output_file, 'w') as outfile:
    for fname in paths:
        print(fname)
        with open(fname) as infile:
            for line in tqdm(infile):
                line = line.strip()
                if line:
                    doc = nlp(line)
                    new_line = '\n'.join([d.text for d in doc.sents])
                    outfile.write(new_line)

11it [00:00, 106.20it/s]

/root/xhong/babylm/babylm_data/babylm_10M/open_subtitles.train


519988it [1:00:38, 142.90it/s]


KeyboardInterrupt: 

**example from "children_stories.train"**

In [None]:
input_str = "and got the leg of a chair and struck out into the midst of them with it. But nine devils against one soldier were still too many, and when he struck those in front of him, the others seized him behind by the hair, and tore it unmercifully. “Devils’ crew,” cried he, “it is getting too bad, but wait. Into my knapsack, all nine of you!” In an instant they were in it, and then he buckled it up and threw it into a corner. After this all was suddenly quiet, and Brother Lustig lay down again, and slept till it was bright day. Then came the inn-keeper, and the nobleman to whom the castle belonged, to see how he had fared; but when they perceived that he was merry and well they were astonished, and asked, “Have the spirits done you no harm, then?” “The reason why they have not,” answered Brother Lustig, “is because I have got the whole nine of them in my knapsack! You may once more inhabit your castle quite tranquilly, none of them will ever haunt it again.” The nobleman thanked him, made him rich presents, and begged him to remain in his service, and he would provide for him as long as he lived. “No,”"
list(nlp(input_str).sents)

## 2. Train a ByteLevelBPETokenizer tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).


In [None]:
%%time 

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(data_path).glob("**/*.train")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Now let's save files to disk

In [None]:
!mkdir ABByteLevelBPE
tokenizer.save_model("ABByteLevelBPE")

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller as when using the pretrained GPT-2 tokenizer.

Here’s  how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [25]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./ABByteLevelBPE/vocab.json",
    "./ABByteLevelBPE/merges.txt",
)

In [26]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [27]:
tokenizer.encode("Hello world").ids

[0, 2230, 1019, 2]

In [28]:
tokenizer.encode(" Hello world").ids

[0, 6168, 1019, 2]

In [29]:
tokenizer.encode("Hello world").tokens

['<s>', 'Hello', 'Ġworld', '</s>']

In [30]:
tokenizer.encode(" Hello world").tokens

['<s>', 'ĠHello', 'Ġworld', '</s>']

## 3. init LongformerTokenizer

In [None]:
from transformers import LongformerTokenizer
from tokenizers.processors import BertProcessing


tokenizer = LongformerTokenizer(
    "./ABByteLevelBPE/vocab.json",
    "./ABByteLevelBPE/merges.txt",
)

In [None]:
# tokenizer._tokenizer.post_processor = BertProcessing(
#     ("</s>", tokenizer.token_to_id("</s>")),
#     ("<s>", tokenizer.token_to_id("<s>")),
# )

In [None]:
tokenizer.encode("Hello world")

In [None]:
tokenizer.encode(" Hello world")

## 4. tokenize the corpora with LongformerTokenizer
input: 164927

output: 164798

In [46]:
input_file = 'babylm_10M_sents.txt'
output_file = 'babylm_10M_sent_tokens.txt'
with open(output_file, 'w') as outfile:
    with open(input_file, 'r') as infile:
        for line in tqdm(infile):
            line = line.strip()
            if line:
#                 doc = nlp(line)
#                 new_line = '\n'.join([d.text for d in doc.sents])
                tokens = tokenizer.encode(line)
                token_line = ','.join([str(i) for i in tokens.ids]) + '\n'
                outfile.write(token_line)

164928it [00:55, 2977.40it/s]
