# Lab 4: Pretraining a Language Model

![](../figs/deep_nlp/lab/train.png)

Now that we have our dataset and tokenizer, we will use this lab to train a language model from scratch on a large corpus of text.

## Unicode Normalization

Before training our language model, we need to normalize our text, because the same character can be encoded in more than one way. For example, the character "é" can be represented as a single code point (U+00E9) or as an "e" followed by a combining acute accent (U+0301). Without normalization, the tokenizer treats these as two different strings. We will apply one of the [Unicode normalization forms](https://unicode.org/reports/tr15/#Norm_Forms), NFKC, so that every character ends up in one consistent representation.
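As a small illustration of the "é" example, the sketch below builds both encodings by hand and checks that they only compare equal after normalization. It uses `NFKC`, the form recommended in the TL;DR; `NFC` gives the same result for this particular pair. The variable names are chosen for illustration only.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but are different code point sequences.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# After normalization both collapse to the same representation.
norm_a = unicodedata.normalize("NFKC", composed)
norm_b = unicodedata.normalize("NFKC", decomposed)
print(norm_a == norm_b)                 # True
```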

### TL;DR

Use `NFKC` normalization to normalize your text before training your language model.
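One way to apply this to a corpus is a small generator that normalizes text line by line. This is a minimal sketch: `normalize_lines` and `raw_corpus` are hypothetical names, and the two sample lines are made up for the demonstration.

```python
import unicodedata


def normalize_lines(lines, form="NFKC"):
    """Yield each line normalized with the given Unicode normalization form."""
    for line in lines:
        yield unicodedata.normalize(form, line)


# Hypothetical sample lines: fullwidth letters, an "ﬁ" ligature,
# and an "e" followed by a combining acute accent.
raw_corpus = ["ＢＥＲＴ ｉｓ \ufb01ne", "cafe\u0301"]
clean_corpus = list(normalize_lines(raw_corpus))
print(clean_corpus)
```

Because the function is a generator, it can also wrap a file object and normalize a large corpus without loading it into memory at once.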


### Unicode Normalization Forms

There are four normalization forms:

- **NFC**: Normalization Form Canonical Composition
- **NFD**: Normalization Form Canonical Decomposition
- **NFKC**: Normalization Form Compatibility Composition
- **NFKD**: Normalization Form Compatibility Decomposition

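In these names, "D" stands for "Decomposition" and "C" for "Composition", while "K" marks the compatibility variants; the forms without "K" are the canonical ones. The canonical forms (NFC and NFD) only unify different encodings of the same character, and NFC is the form most commonly used for general text. The compatibility forms (NFKC and NFKD) additionally replace compatibility characters with their plain equivalents. For example, the "K" forms will convert the ligature "ﬁ" to "fi".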
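The ligature example can be checked directly. The sketch below normalizes "ﬁ" (U+FB01) with all four forms and collects the results in a dictionary; the name `results` is only for illustration.

```python
import unicodedata

ligature = "\ufb01"  # "ﬁ", LATIN SMALL LIGATURE FI

results = {}
for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    results[form] = unicodedata.normalize(form, ligature)
    print(form, repr(results[form]), len(results[form]))

# Only the "K" forms expand the ligature into the two letters "f" and "i".
```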

The forms differ along two axes:

- Composition versus decomposition: NFC and NFKC recombine characters after decomposing them, so their output is usually as short as or shorter than the output of NFD and NFKD, which leave characters decomposed and often produce longer strings.
- Canonical versus compatibility: NFC and NFD always produce a string that is canonically equivalent to the original (it looks and means the same, although the code points may differ), while NFKC and NFKD may replace characters, for example "…" with "...", and so can change both the content and the length of the string.
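A practical consequence is that switching between the canonical forms loses nothing, while a compatibility form cannot be undone. The following sketch checks this on a made-up sample string containing a ligature, an accented letter, and an ellipsis character.

```python
import unicodedata

# Made-up sample: "ﬁ" ligature + "anc" + "é" (single code point) + "…"
s = "\ufb01anc\u00e9\u2026"

nfd = unicodedata.normalize("NFD", s)
nfkc = unicodedata.normalize("NFKC", s)

print(len(s), len(nfd), len(nfkc))  # 6 7 9

# Canonical forms round-trip: recomposing the NFD string restores the original.
print(unicodedata.normalize("NFC", nfd) == s)    # True

# Compatibility normalization is lossy: the ligature and the ellipsis
# have been replaced, and NFC cannot bring them back.
print(unicodedata.normalize("NFC", nfkc) == s)   # False
```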

### Unicode Normalization in Python

In Python, you can use the `unicodedata` module to normalize your text. The `unicodedata.normalize` function takes two arguments:

- `form`: The normalization form to use. This can be one of the following: `NFC`, `NFD`, `NFKC`, `NFKD`.
- `unistr`: The string to normalize.

```python
import unicodedata

text = "ａｂｃＡＢＣ１２３가나다…"
print(f"Original: {text}, {len(text)}")
for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    ntext = unicodedata.normalize(form, text)
    print(f"{form}: {ntext}, {len(ntext)}")
```

```text
Original: ａｂｃＡＢＣ１２３가나다…, 13
NFC: ａｂｃＡＢＣ１２３가나다…, 13
NFD: ａｂｃＡＢＣ１２３가나다…, 16
NFKC: abcABC123가나다..., 15
NFKD: abcABC123가나다..., 18
```


## BERT Pretraining

In this lab, we will train a BERT-like model using masked-language modeling, one of the two pretraining tasks used in the original BERT paper.

### What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a large-scale language model that was pretrained on BooksCorpus and the English Wikipedia using two objectives: masked-language modeling and next-sentence prediction. The model was then fine-tuned on a variety of downstream tasks, including question answering, natural language inference, and sentiment analysis. Because masked-language modeling lets every layer use both the left and the right context of a token, BERT is deeply bidirectional, and at its release it outperformed previous language models on many of these tasks.

For more information, see the lecture notes on BERT.

### Masked-Language Modeling (MLM)

Masked-language modeling is a pretraining task in which we mask some of the input tokens and train the model to predict the original tokens from the surrounding context. For example, given the sentence "The dog ate the apple", we can replace the word "ate" with a special `[MASK]` token and train the model to recover it. No labels are needed beyond the text itself, because the original tokens serve as the targets.

Example:

> Input: "The dog [MASK] the apple"
>
> Target for the masked position: "ate"
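The masking step can be sketched in plain Python. The version below follows the recipe from the BERT paper (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). This is an assumption for illustration: the lab's actual masking settings may differ, and `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked-language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only on the selected positions.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)          # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(tok)                 # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # position not selected, ignored by the loss
    return inputs, labels


tokens = ["The", "dog", "ate", "the", "apple"]
vocab = ["The", "dog", "ate", "the", "apple", "cat", "pear"]  # toy vocabulary
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

Passing a seeded `random.Random` makes the masking reproducible, which helps when debugging a data pipeline.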



## Preprocessing the Dataset

Before training our language model, we need to preprocess our dataset. We will use our tokenizer to split each sentence into tokens and then convert the tokens to their IDs. If a sentence is longer than the maximum sequence length, we will truncate it. If it is shorter, we will pad it with the padding token so that every example has the same length.
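The truncation and padding rule can be sketched as follows. The function name `pad_or_truncate`, the padding ID `0`, and the returned attention mask are assumptions for illustration; the real padding ID depends on the tokenizer's vocabulary.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token IDs to exactly max_length.

    Also returns an attention mask with 1 for real tokens and 0 for padding.
    """
    ids = list(token_ids[:max_length])       # truncate if too long
    num_padding = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * num_padding
    ids = ids + [pad_id] * num_padding       # pad if too short
    return ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
print(short_ids, short_mask)   # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5)
print(long_ids, long_mask)     # [5, 6, 7, 8, 9] [1, 1, 1, 1, 1]
```

The mask lets the model ignore padded positions, so padding does not affect the predictions for the real tokens.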

Of the four tokenizers we trained in the previous lab, we will use the `unigram` model trained with `SentencePiece` to tokenize our dataset.

## References

- [Unicode equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence)