# Lab 4: Pretraining Language Models

![](../figs/deep_nlp/lab/train.png)

Now that we have a dataset and a tokenizer, this lab trains a language model from scratch on a large text corpus.

```python
%pip install --pre ekorpkit[model]
```

In [3]:
%config InlineBackend.figure_format='retina'
%load_ext autotime
%load_ext autoreload
%autoreload 2

from ekorpkit import eKonf

eKonf.setLogger("INFO")
print("version:", eKonf.__version__)

is_colab = eKonf.is_colab()
print("is colab?", is_colab)
if is_colab:
    eKonf.mount_google_drive()
workspace_dir = "/content/drive/MyDrive/workspace"
project_name = "ekorpkit-book"
project_dir = eKonf.set_workspace(workspace=workspace_dir, project=project_name)
print("project_dir:", project_dir)

INFO:ekorpkit.utils.notebook:Google Colab not detected.
INFO:ekorpkit.base:Setting EKORPKIT_WORKSPACE_ROOT to /content/drive/MyDrive/workspace
INFO:ekorpkit.base:Setting EKORPKIT_PROJECT to ekorpkit-book
INFO:ekorpkit.base:Loaded .env from /workspace/projects/ekorpkit-book/config/.env


version: 0.1.40.post0.dev22
is colab? False
project_dir: /content/drive/MyDrive/workspace/projects/ekorpkit-book
time: 1.2 s (started: 2022-11-19 10:16:54 +00:00)


In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [32]:
import os
from huggingface_hub import HfApi
from huggingface_hub import HfFolder

token = HfFolder.get_token()
if token is None:
    # Use .get() so a missing variable yields None instead of raising KeyError
    token = os.environ.get("HF_USER_ACCESS_TOKEN")

if token is None:
    raise ValueError("Please login to huggingface_hub")

user_id = HfApi().whoami(token)["name"]

print(f"user id '{user_id}' will be used during this lab")

user id 'entelecheia' will be used during this lab
time: 866 ms (started: 2022-11-19 11:07:25 +00:00)


## Unicode Normalization

Before training our language model, we need to normalize the text, because the same character can be represented in different ways. For example, the character "é" can be represented as a single precomposed character, or as "e" followed by a combining accent character.
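
As a small illustration of the "é" case, the sketch below builds both encodings by hand with the standard `unicodedata` module and shows that they only compare equal after normalization. The variable names are illustrative only.

```python
import unicodedata

# Two encodings of the same visible character "é".
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The raw strings differ even though they render identically.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# After normalizing both to the same form, they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(same_after_nfc)                  # True
```

Without normalization, a tokenizer would treat these two spellings as different inputs.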

### TL;DR

Use `NFKC` normalization to normalize your text before training your language model.


### Unicode Normalization Forms

There are four normalization forms:

- **NFC**: Normalization Form Canonical Composition
- **NFD**: Normalization Form Canonical Decomposition
- **NFKC**: Normalization Form Compatibility Composition
- **NFKD**: Normalization Form Compatibility Decomposition

In these names, the final "C" stands for "Composition" and the final "D" for "Decomposition", while "K" stands for "Compatibility" (the forms without "K" are canonical). NFC is the most commonly used form. The "K" forms additionally convert characters to their compatibility equivalents. For example, the "K" forms will convert the ligature "ﬁ" to "fi".

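There are two main differences between the forms:

- Composition vs. decomposition: NFC and NFKC recombine characters into precomposed forms where possible, while NFD and NFKD split precomposed characters into a base character plus combining marks, so the decomposed forms are often longer.
- Canonical vs. compatibility: NFC and NFD only convert between canonically equivalent representations, so the text keeps the same appearance and meaning. NFKC and NFKD also replace compatibility characters (full-width letters, ligatures, "…") with plain equivalents, so the result may look different from the original and may even be longer ("…" becomes "...").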
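
The two differences can be checked on a few hand-picked characters. This is a minimal sketch using only `unicodedata`; the sample characters and the `lengths` dictionary are chosen here for illustration.

```python
import unicodedata

# Hand-picked samples: a precomposed letter, a compatibility ligature,
# and a precomposed Hangul syllable.
samples = {
    "e-acute": "\u00e9",
    "fi-ligature": "\ufb01",
    "hangul-ga": "\uac00",
}

lengths = {}
for name, s in samples.items():
    lengths[name] = {
        form: len(unicodedata.normalize(form, s))
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }
    print(name, lengths[name])

# Only the compatibility forms rewrite the ligature into plain letters.
print(unicodedata.normalize("NFC", "\ufb01") == "fi")   # False
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")  # True
```

The decomposed forms lengthen "é" and "가", while only the "K" forms change the ligature.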

### Unicode Normalization in Python

In Python, you can use the `unicodedata` module to normalize your text. The `unicodedata.normalize` function takes two arguments:

- `form`: The normalization form to use. This can be one of the following: `NFC`, `NFD`, `NFKC`, `NFKD`.
- `unistr`: The string to normalize.

In [4]:
import unicodedata

text = "ａｂｃＡＢＣ１２３가나다…"
print(f"Original: {text}, {len(text)}")
for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    ntext = unicodedata.normalize(form, text)
    print(f"{form}: {ntext}, {len(ntext)}")

Original: ａｂｃＡＢＣ１２３가나다…, 13
NFC: ａｂｃＡＢＣ１２３가나다…, 13
NFD: ａｂｃＡＢＣ１２３가나다…, 16
NFKC: abcABC123가나다..., 15
NFKD: abcABC123가나다..., 18
time: 23.8 ms (started: 2022-11-19 10:16:57 +00:00)


## BERT Pretraining

In this lab, we will train a BERT-like model using masked-language modeling, one of the two pretraining tasks used in the original BERT paper.

### What is BERT?

BERT is a large-scale language model that was pre-trained on BooksCorpus and the English Wikipedia using two objectives: masked-language modeling and next-sentence prediction. The model was then fine-tuned on a variety of downstream tasks, including question answering, natural language inference, and sentiment analysis. By pre-training a deep bidirectional Transformer encoder, BERT outperformed previous language models on a variety of tasks.

BERT was originally pre-trained for 1 million steps with a global batch size of 256 sequences:

> "We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus."
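
The figures in the quote can be checked with a few lines of arithmetic. This is a rough sketch: it treats one token as one word when estimating epochs, which is only an approximation, and the variable names are illustrative.

```python
# Figures taken from the quote above.
batch_size = 256          # sequences per step
seq_len = 512             # tokens per sequence
train_steps = 1_000_000
corpus_words = 3.3e9

tokens_per_batch = batch_size * seq_len
total_tokens = tokens_per_batch * train_steps

# Rough estimate only: assumes roughly one token per word.
approx_epochs = total_tokens / corpus_words

print(f"tokens per batch: {tokens_per_batch:,}")   # 131,072
print(f"tokens seen in training: {total_tokens:,}")
print(f"approximate epochs: {approx_epochs:.1f}")  # 39.7
```

The exact product is 131,072 tokens per batch, which the quote rounds to 128,000, and the epoch estimate lands close to the quoted 40.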

For more information, see the lecture notes on BERT.

### Masked-Language Modeling (MLM)

Masked-language modeling is a pretraining task where we mask some of the input tokens and train the model to predict the original value of the masked tokens. For example, if we have the sentence "The dog ate the apple", we can mask the word "ate" and train the model to predict the original value of the masked token. The model will then learn to predict the original value of the masked tokens based on the context of the sentence.

Example:

> Input: "The dog [MASK] the apple"
>
> Target for [MASK]: "ate"
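
The masking step can be sketched in plain Python. The sketch assumes the 80/10/10 rule from the BERT paper (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged); `mask_tokens` and its arguments are hypothetical names, and this is not the implementation used by any particular library.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (inputs, labels); labels is None where no prediction is required."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # token not selected: it contributes nothing to the loss
        labels[i] = tok  # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


tokens = "the dog ate the apple".split()
vocab = sorted(set(tokens))
# A high masking probability is used here only so the toy example shows some masks.
inputs, labels = mask_tokens(tokens, vocab, mlm_probability=0.5, seed=1)
print(inputs)
print(labels)
```

Only the selected positions carry a label, so the loss is computed on those positions alone.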



## Preprocessing the Dataset

Before training our language model, we need to preprocess our dataset. We use our tokenizer to split each text into tokens and convert the tokens to their IDs, truncating any text that is longer than the maximum sequence length. Rather than padding shorter texts, we then concatenate the tokenized texts and split them into chunks of exactly the maximum sequence length, so no padding is needed.

Unlike the original BERT paper, we will not use the WordPiece tokenization algorithm. Instead, we will use the `unigram` tokenization algorithm.

In [18]:
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
from tokenizers.processors import BertProcessing

tokenizer_path = "tokenizers/enko_wiki/enko_wiki_unigram_huggingface_vocab_30000.json"
tokenizer_path = project_dir + "/" + tokenizer_path
context_length = 512

unigram_tokenizer = Tokenizer.from_file(tokenizer_path)
print(f"Vocab size: {unigram_tokenizer.get_vocab_size()}")
unigram_tokenizer.post_processor = BertProcessing(
    ("</s>", unigram_tokenizer.token_to_id("</s>")),
    ("<s>", unigram_tokenizer.token_to_id("<s>")),
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=unigram_tokenizer,
    truncation=True,
    max_length=context_length,
    return_length=True,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="right",
)

print(f"is_fast: {tokenizer.is_fast}")
print(f"Vocab size: {tokenizer.vocab_size}")
print(tokenizer("Hello, my dog is cute"))
tokenizer.save_pretrained(project_dir + "/tokenizers/enko_wiki")

Vocab size: 30000
is_fast: True
Vocab size: 30000
{'input_ids': [1, 8, 14690, 10, 8, 968, 8, 6871, 8, 42, 8, 2777, 72, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


('/content/drive/MyDrive/workspace/projects/ekorpkit-book/tokenizers/enko_wiki/tokenizer_config.json',
 '/content/drive/MyDrive/workspace/projects/ekorpkit-book/tokenizers/enko_wiki/special_tokens_map.json',
 '/content/drive/MyDrive/workspace/projects/ekorpkit-book/tokenizers/enko_wiki/tokenizer.json')

time: 67 ms (started: 2022-11-19 10:55:20 +00:00)


In [20]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [1, 8, 14690, 10, 8, 235, 8, 202, 8, 15219, 489, 2, 8, 37, 8, 235, 8, 15219, 8, 11241, 8, 80, 8, 65, 9, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

time: 23.9 ms (started: 2022-11-19 10:55:39 +00:00)


In [5]:
from datasets import load_dataset

data_dir = project_dir + "/data/tokenizers/enko_filtered_chunk"

dataset = load_dataset("text", data_dir=data_dir, split="train")
dataset

Resolving data files:   0%|          | 0/61 [00:00<?, ?it/s]



Dataset({
    features: ['text'],
    num_rows: 3618972
})

time: 1.02 s (started: 2022-11-19 09:39:39 +00:00)


In [12]:
text_column = "text"


def tokenize(element):
    outputs = tokenizer(
        element[text_column],
        truncation=True,
        max_length=context_length,
        return_special_tokens_mask=True,
    )
    return outputs

time: 15.3 ms (started: 2022-11-19 09:55:12 +00:00)


In [None]:
num_proc = 20

# preprocess dataset
tokenized_dataset = dataset.map(
    tokenize, batched=True, remove_columns=[text_column], num_proc=num_proc
)
tokenized_dataset.features

In [None]:
from itertools import chain

# Main data processing function: concatenates all tokenized texts in a batch
# and splits them into chunks of context_length tokens.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder at the end of the batch. Padding it instead would
    # also work; customize this part to your needs.
    if total_length >= context_length:
        total_length = (total_length // context_length) * context_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, t in concatenated_examples.items()
    }
    return result


tokenized_dataset = tokenized_dataset.map(group_texts, batched=True, num_proc=num_proc)

# shuffle dataset
tokenized_dataset = tokenized_dataset.shuffle(seed=1234)

print(f"the dataset contains in total {len(tokenized_dataset)*context_length} tokens")
# the dataset contains in total 137,816,832 tokens
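
To see what the concatenate-and-chunk step does, here is a toy version with a block size of 5 instead of 512. `group_texts_toy` is a hypothetical stand-in that mirrors the logic of `group_texts` above but takes the block size as an argument, and the token IDs are made up.

```python
from itertools import chain


def group_texts_toy(examples, block_size):
    # Concatenate all sequences of each column into one long list.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into blocks of block_size tokens.
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Three tokenized "documents" of lengths 4, 3, and 5 (12 tokens in total).
batch = {"input_ids": [[1, 10, 11, 2], [1, 20, 2], [1, 30, 31, 32, 2]]}
chunks = group_texts_toy(batch, block_size=5)
print(chunks["input_ids"])
# [[1, 10, 11, 2, 1], [20, 2, 1, 30, 31]]
```

The last two tokens are dropped, and chunks may span document boundaries, which is why the special tokens separating documents matter.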

In [49]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForMaskedLM

tk_path = project_dir + "/tokenizers/enko_wiki"

# Load the unigram tokenizer saved earlier in this lab
tokenizer = AutoTokenizer.from_pretrained(tk_path)

# Configuration
config_kwargs = {
    "vocab_size": len(tokenizer),
    "pad_token_id": tokenizer.pad_token_id,
    # "torch_dtype": "float16",
}

# Reuse the bert-base-uncased architecture settings and initialize a model with random weights
config = AutoConfig.from_pretrained("bert-base-uncased", **config_kwargs)
model = AutoModelForMaskedLM.from_config(config)

model_path = project_dir + "/models/enko_wiki_bert_base_uncased"
model.save_pretrained(model_path)

time: 3.4 s (started: 2022-11-19 11:29:40 +00:00)


In [45]:
from transformers import BertForMaskedLM

model = BertForMaskedLM(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"BERT size: {model_size/1000**2:.1f}M parameters")

BERT size: 109.1M parameters
time: 2.07 s (started: 2022-11-19 11:23:15 +00:00)
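
The 109.1M figure can be reproduced by hand. The sketch below assumes the standard BERT-base hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types), the 30,000-token vocabulary from this lab, output weights tied to the input embeddings, and no pooler layer in the masked-LM model. The variable names are illustrative.

```python
# Assumed BERT-base hyperparameters; vocab size is the one reported in this lab.
vocab_size, hidden, layers, ffn = 30_000, 768, 12, 3_072
max_positions, type_vocab = 512, 2

layer_norm = 2 * hidden  # weight + bias

# Word, position, and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# Per encoder layer: Q, K, V, and output projections (with biases) + LayerNorm.
attention = 4 * (hidden * hidden + hidden) + layer_norm
# Per encoder layer: two dense layers (with biases) + LayerNorm.
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
encoder = layers * (attention + feed_forward)

# MLM head: dense transform + LayerNorm + output bias.
# The output projection weights are tied to the word embeddings, so not counted again.
mlm_head = (hidden * hidden + hidden) + layer_norm + vocab_size

total = embeddings + encoder + mlm_head
print(f"{total:,} parameters ({total / 1000**2:.1f}M)")
# 109,112,880 parameters (109.1M)
```

Roughly a fifth of the parameters sit in the embedding table, so the vocabulary size has a visible effect on model size.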


In [46]:
from transformers import DataCollatorForLanguageModeling

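# The tokenizer already defines "<pad>" as its padding token, and every chunk is
# exactly context_length tokens long, so no pad-token override is needed here.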
data_collator = DataCollatorForLanguageModeling(
    tokenizer, mlm=True, mlm_probability=0.15
)



time: 1.71 s (started: 2022-11-19 11:24:52 +00:00)


In [47]:
from datasets import Dataset

dataset_dir = project_dir + "/data/tokenized_datasets/enko_filtered"

tokenized_dataset = Dataset.load_from_disk(dataset_dir)
tokenized_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
    num_rows: 268366
})

time: 55.2 ms (started: 2022-11-19 11:27:08 +00:00)


In [48]:
out = data_collator([tokenized_dataset[i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

input_ids shape: torch.Size([5, 512])
token_type_ids shape: torch.Size([5, 512])
attention_mask shape: torch.Size([5, 512])
labels shape: torch.Size([5, 512])
time: 49.9 ms (started: 2022-11-19 11:27:34 +00:00)


In [50]:
from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir=model_path,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # No eval_dataset is passed to the Trainer below, so periodic evaluation is disabled.
    evaluation_strategy="no",
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

Using cuda_amp half precision backend


time: 2.43 s (started: 2022-11-19 11:34:48 +00:00)
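
It helps to work out how many optimizer updates these settings produce and how the learning rate moves over them. The sketch below assumes a single device and uses the usual linear-warmup plus cosine-decay formula; the exact step counting inside `Trainer` may differ slightly, and `lr_at` is a hypothetical helper.

```python
import math

# Values from the dataset and TrainingArguments above.
num_rows = 268_366
per_device_batch = 32
grad_accum = 8
epochs = 1
warmup_steps = 1_000
peak_lr = 5e-4
num_devices = 1  # assumption: training on a single GPU

# Sequences consumed per optimizer update.
sequences_per_update = per_device_batch * grad_accum * num_devices

# Batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_rows / (per_device_batch * num_devices))
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs


def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero (half a cosine cycle)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_updates - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"sequences per update: {sequences_per_update}")
print(f"total updates: {total_updates}")
for step in (0, 500, warmup_steps, total_updates):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the effective batch matches BERT's 256 sequences, but one epoch yields only about a thousand updates, so most of the run is warmup and the 5,000-step logging and saving intervals are never reached.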


In [51]:
from accelerate import Accelerator

accelerator = Accelerator()
acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}
device = accelerator.device

print(f"device: {device}")

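# Note: accelerator.prepare() only wraps models, optimizers, dataloaders, and
# schedulers; a Trainer is returned unchanged and handles device placement itself.
trainer = accelerator.prepare(trainer)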

trainer.train()

trainer.save_model(model_path)

device: cuda
time: 47.9 ms (started: 2022-11-19 11:34:51 +00:00)


Training took 6h 33m.

In [None]:
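import os

from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM

# Adjust the checkpoint name to one that exists under model_path. The final model
# saved by trainer.save_model() can also be loaded from model_path directly.
model = AutoModelForMaskedLM.from_pretrained(
    os.path.join(model_path, "checkpoint-10000")
)
# AutoTokenizer returns the fast tokenizer by default
tokenizer = AutoTokenizer.from_pretrained(model_path)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)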

In [None]:
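# Perform predictions. The mask token must match the tokenizer's own mask token
# ("<mask>" for the tokenizer built in this lab), not BERT's default "[MASK]".
mask = fill_mask.tokenizer.mask_token
example = f"It is known that {mask} is the capital of Germany"
for prediction in fill_mask(example):
    print(prediction)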

## References

- [Unicode equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence)