# TRAINING TRANSFORMERS FROM SCRATCH

So far we've mostly worked on data-constrained applications where the amount of labeled training data is limited. In these cases, transfer learning helped us build performant
models.

In this chapter we’ll move to the other extreme and look at what we can do when we are drowning in all the data we could possibly want. We’ll explore the pretraining step
itself and learn how to train a transformer from scratch. In working through this problem, we’ll look at some aspects of training that we have not considered yet, such
as the following:

* Gathering and processing a very large dataset
* Creating a custom tokenizer for our dataset
* Training a model on multiple GPUs at scale

----

<span style="color:yellow"><b>WARNING:</b></span> Unlike the code in the other chapters of this book (which can be run with a Jupyter notebook on a single GPU), the training code in this chapter is designed to be run as a script with multiple GPUs. If you want to train your own version of CodeParrot, [we recommend running the script provided in the Transformers repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot).

----

In [1]:
import keyword
import torch
import datasets
import multiprocessing
import pytorch_lightning as pl
import pandas as pd

from torch.utils.data import IterableDataset, DataLoader, Dataset

from transformers import pipeline, set_seed
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, GPT2LMHeadModel
from datasets import load_dataset
from datasets import Dataset as HF_Dataset
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
from tqdm.auto import tqdm

In [2]:
valid_data = load_dataset('transformersbook/codeparrot-valid', split="validation", streaming=True)


In [3]:
next(iter(valid_data))

{'repo_name': '132nd-etcher/EMFT',
 'path': 'emft/gui/tab_about.py',
 'copies': '1',
 'size': '1269',
 'content': "# coding=utf-8\n\nfrom emft.core import constant\nfrom emft.core.logging import make_logger\nfrom emft.gui.base import GridLayout, HSpacer, Label, VLayout, VSpacer\nfrom emft.gui.main_ui_tab_widget import MainUiTabChild\n\nLOGGER = make_logger(__name__)\n\n\nclass TabChildAbout(MainUiTabChild):\n    def tab_clicked(self):\n        pass\n\n    @property\n    def tab_title(self) -> str:\n        return 'About'\n\n    def __init__(self, parent=None):\n        super(TabChildAbout, self).__init__(parent)\n\n        repo_label = Label(\n            '''<a href='{link}'>{link}</a>'''.format(link=constant.LINK_REPO)\n        )\n        repo_label.setOpenExternalLinks(True)\n\n        changelog_label = Label(\n            '''<a href='{link}'>{link}</a>'''.format(link=constant.LINK_CHANGELOG)\n        )\n        changelog_label.setOpenExternalLinks(True)\n\n        self.setLayout(\n 

## 1 - Large datasets and where to find them

There are many domains where you may actually have a large amount of data at hand, ranging from legal documents to biomedical datasets to programming codebases.
In most cases, these datasets are unlabeled, and their large size means that they can usually only be labeled through the use of heuristics, or by using accompanying
metadata that is stored during the gathering process.

**Using a pretrained model forces you to use the model’s corresponding tokenizer, but using a tokenizer that is trained on a corpus from another domain is typically suboptimal.** For example, using GPT's pretrained tokenizer on legal documents, other languages, or even completely different sequences such as musical notes or DNA sequences will result in poor tokenization (as we will see shortly).

As the amount of training data you have access to gets closer to the amount of data used for pretraining, it thus becomes interesting to consider training the model and
the tokenizer from scratch, provided the necessary computational resources are available.

### 1.1 - Challenges of building a large-scale corpus

**The quality of a model after pretraining largely reflects the quality of the pretraining corpus**. In particular, the model will inherit any defects in the pretraining corpus.
Thus, before we attempt to create one of our own it’s good to be aware of some of the common issues and challenges that are associated with building large corpora for
pretraining.

As the dataset gets larger and larger, the chances that you can fully control, or at least have a precise idea of, what is inside it diminish. A very large dataset is much more likely to have been created in an automatic or semiautomatic way by collecting data that is generated as a side effect of other activities. For instance, it may consist of:

* all the documents (e.g., contracts, purchase orders, etc.) that a company stores,
* logs from user activities, 
* data gathered from the internet
* etc.

There are several important consequences that follow from the fact that large-scale datasets are mostly created with a high degree of automation. One of them is the risk of training a model on biased and lower-quality data.

[Recent investigations of famous large-scale datasets like BookCorpus and C4, which were used to train BERT and T5, respectively, have uncovered (among other things) that](https://arxiv.org/abs/2104.08758):
* A significant proportion of the C4 corpus is machine-translated rather than translated by humans.
* Disparate erasure of African-American English as a result of stopword filtering in C4 has resulted in an underrepresentation of such content.
* It is typically difficult in a large text corpus to find a middle ground between including (often too much) sexually or other explicit content and totally erasing
all mention of sexuality or gender. As a surprising consequence of this, a rather common word like **“sex”** (which can have both neutral and explicit meanings) is
completely unknown to a tokenizer that is trained on C4, since this word is fully absent from the corpus.
* [There are many occurrences of copyright violation in BookCorpus, and probably in other large-scale datasets as well](https://arxiv.org/abs/2105.05241).
* There is a genre skew toward "romance" novels in BookCorpus.

Let's illustrate the notion of a model being skewed by the data by comparing text generations from GPT and GPT-2.
* GPT was mostly trained on BookCorpus
* GPT-2 was trained on web pages, blogs, and news articles linked from Reddit

We'll compare similar-sized versions of both models on the same prompt, so that the main difference is the pretraining dataset, and we'll use the `text-generation` pipeline to investigate the model outputs:

In [4]:
generation_gpt = pipeline("text-generation", model="openai-gpt")
generation_gpt2 = pipeline("text-generation", model="gpt2")

Next, let’s create a simple function to count the number of parameters in each model:

In [5]:
def model_size(model):
    return sum(t.numel() for t in model.parameters())
    
print(f"GPT size: {model_size(generation_gpt.model)/1000**2:.1f}M parameters")
print(f"GPT2 size: {model_size(generation_gpt2.model)/1000**2:.1f}M parameters")

GPT size: 116.5M parameters
GPT2 size: 124.4M parameters


The original GPT model is about the same size as the smallest GPT-2 model. Now we can generate three different completions from each model, each with the same input prompt:

In [6]:
def enum_pipeline_outputs(pipe, prompt, num_return_sequences):
    # Generate several completions for the prompt and number them for display
    out = pipe(prompt, num_return_sequences=num_return_sequences, clean_up_tokenization_spaces=True)
    return "\n".join(f"{i+1}." + s["generated_text"] for i, s in enumerate(out))

prompt = "\nWhen they came back"
print("GPT completions:\n" + enum_pipeline_outputs(generation_gpt, prompt, 3))
print("")
print("GPT-2 completions:\n" + enum_pipeline_outputs(generation_gpt2, prompt, 3))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT completions:
1.
When they came back with some money. they hadn't even bothered with a check to cover the gas and the gas tank. " 
 she glanced up, her eyes sad from pain. " i'm clendon, by the way, by the way.
2.
When they came back on the porch. 
 " what are you looking at? " he pulled his sweater back down, and then pulled off his jacket. " what's this guy doing here? " he said, " i 'll bet he's a
3.

GPT-2 completions:
1.
When they came back from the airport, their car keys were locked. The car they saw was brand new and in the trunk. They turned away and told detectives he had stolen five cars from the area.

The officers pulled him over for
2.
When they came back for business, I'd asked what was the most important thing they had in mind when they learned they could be arrested for taking the first shot.

"Do you think the person who shot you, or someone with your
3.
When they came back, they found a box and a pair of shoes. The one with the socks, she found a dresser, and he w

By just sampling a handful of outputs from both models we can already see the distinctive “romance” skew in GPT generation, which will typically imagine a dialogue with a romantic interaction between a woman and a man. On the other hand, GPT-2 was trained on webtext linked to and from Reddit articles and mostly adopts a neutral “they” in its generations, which contain “blog-like” or adventure-related elements.

In general, any model trained on a dataset will reflect the language bias and over- or underrepresentation of populations and events in its training data. These biases in the
behavior of the model are important to take into consideration with regard to the target audience interacting with the model; [for some useful guidelines, we refer you to a
paper by Google that provides a framework for dataset development](https://arxiv.org/abs/2010.13561).

### 1.2 - To filter the noise or not?

There are some conscious choices to be made regarding how we want the system to perform in a real-world setting. Having some noise in the training dataset will make our system more robust to noisy inputs at inference time, but will also make its predictions more random. Depending on the intended use and whole system integration, you may choose more or less noisy data and add pre- and postfiltering operations.
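As a rough illustration of what a prefiltering operation could look like, the sketch below drops documents that look minified or consist mostly of separators. The function name `keep_document` and both thresholds are hypothetical choices made for this example, not values used in this chapter.

```python
def keep_document(text, max_line_length=1000, min_alnum_fraction=0.25):
    """Return True if `text` passes two simple (hypothetical) quality heuristics."""
    lines = text.splitlines()
    if not lines:
        return False
    # Very long lines often indicate minified or machine-generated files.
    if max(len(line) for line in lines) > max_line_length:
        return False
    # A low share of alphanumeric characters suggests separators or data blobs.
    alnum = sum(ch.isalnum() for ch in text)
    return alnum / len(text) >= min_alnum_fraction


docs = [
    "def add(a, b):\n    return a + b\n",
    "x = '" + "0" * 5000 + "'\n",       # one extremely long line
    "#" * 40 + "\n" + "-" * 40 + "\n",  # separators only
]
kept = [doc for doc in docs if keep_document(doc)]
print(f"kept {len(kept)} of {len(docs)} documents")
```

Stricter thresholds remove more noise but also discard more legitimate data, which is the trade-off described above.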

## 2 - Building a Tokenizer

In the previous chapters we've used tokenizers that accompanied the models we used. This made sense since these models were pretrained using data passed through a specific preprocessing pipeline defined in the tokenizer. When using a pretrained model, it's important to stick with the same preprocessing design choices selected for pretraining. Otherwise the model may be fed out-of-distribution patterns or unknown tokens.

However, when we train a language model from scratch, using a tokenizer prepared for another dataset can be suboptimal. Here are a few examples of the kinds of problems we might run into when using an existing tokenizer:

* The T5 tokenizer was trained on the [C4 corpus](https://huggingface.co/datasets/c4) that we mentioned earlier, but an extensive step of stopword filtering was used to create it. As a result, the T5 tokenizer has never seen common English words such as "sex".
* The CamemBERT tokenizer was also trained on a very large corpus of text, but only comprising French text (the French subset of the [OSCAR corpus](https://huggingface.co/datasets/oscar)). As such, it is unaware of common English words such as "being".

We can easily observe this in practice:

In [7]:
def tok_list(tokenizer, string):
    input_ids = tokenizer(string, add_special_tokens=False)["input_ids"]
    return [tokenizer.decode(tok) for tok in input_ids]
    
tokenizer_T5 = AutoTokenizer.from_pretrained("t5-base")
tokenizer_camembert = AutoTokenizer.from_pretrained("camembert-base")
print(f'T5 tokens for "sex": {tok_list(tokenizer_T5,"sex")}')
print(f'CamemBERT tokens for "being": {tok_list(tokenizer_camembert,"being")}')

T5 tokens for "sex": ['', 's', 'ex']
CamemBERT tokens for "being": ['be', 'ing']


**In many cases, splitting such short and common words into subparts will be inefficient, since this will increase the input sequence length of the model** (which has limited context). Therefore, it is important to be aware of the domain and preprocessing of the dataset that was used to train the tokenizer. The tokenizer and model can encode bias from the dataset that has an impact on the downstream behavior of the model. To create an optimal tokenizer for our dataset, we thus need to train one ourselves.

-----

**Training a model** involves starting from a given set of weights and using backpropagation from an error signal on a designed objective to minimize the loss of the model and find an optimal set of weights for the model to perform the task defined by the training objective.

**Training a tokenizer**, on the other hand, does not involve backpropagation or weights. It is a way to create an optimal mapping from a string of text to a list of integers that can be ingested by the model. In today's tokenizers, the optimal string-to-integer conversion involves a vocabulary consisting of a list of atomic strings and an associated method to convert, normalize, cut, or map a text string into a list of indices with this vocabulary. This list of indices is then the input for our neural network.
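
To make the idea of a string-to-integer mapping concrete, here is a toy sketch with a hand-written vocabulary and a greedy longest-match rule. Both the vocabulary and the `encode` helper are made up for illustration; real tokenizers learn their vocabulary from data and use more refined segmentation rules.

```python
# Hypothetical toy vocabulary: each atomic string gets an integer index.
toy_vocab = {"<unk>": 0, "t": 1, "h": 2, "e": 3, " ": 4, "th": 5, "the": 6, "n": 7}


def encode(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    max_len = max(len(token) for token in vocab)
    ids, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            # No entry matches this character: fall back to the unknown token.
            ids.append(vocab["<unk>"])
            pos += 1
    return ids


print(encode("the then", toy_vocab))  # list of vocabulary indices
```

The resulting list of indices is what would be passed to the model's embedding layer.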

-----

### 2.1 - The Tokenizer Model

As we saw in Chapter 4, the tokenizer is a processing pipeline consisting of four steps:
* Normalization
* Pretokenization
* Tokenizer model
* Postprocessing

The part of the tokenizer pipeline that can be trained on data is the tokenizer model. As we discussed in Chapter 2, there are several subword tokenization algorithms that can be used, such as BPE, WordPiece, and Unigram:

* **BPE** starts from a list of basic units (single characters) and creates a vocabulary by progressively creating new tokens, each formed by merging the most frequently co-occurring pair of units and adding the result to the vocabulary. This process is repeated until a predefined vocabulary size is reached.
* **Unigram** starts from the other end, by initializing its base vocabulary with all the words in the corpus and potential subwords. Then, it progressively removes or splits the less useful tokens to obtain a smaller and smaller vocabulary, until the target vocabulary size is reached.
* **WordPiece** is a predecessor of Unigram, and its official implementation was never open-sourced by Google.

The impact of these various algorithms on downstream performance varies depending on the task, and overall it’s quite difficult to identify if one algorithm is clearly
superior to the others. Both BPE and Unigram have reasonable performance in most cases, but let’s have a look at some aspects to consider when evaluating.

### 2.2 - Measuring Tokenizer performance

The optimality and performance of a tokenizer are challenging to measure in practice. Some possible metrics include:
* **Subword fertility**, which calculates the average number of subwords produced per tokenized word.
* **Proportion of continued words**, which refers to the proportion of tokenized words in a corpus that are split into at least two subtokens.
* **Coverage metrics** like the proportion of unknown words or rarely used tokens in a tokenized corpus.
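
The first two metrics above are simple enough to sketch in a few lines. The example below assumes a tokenizer is just a function from a word to a list of subwords; `toy_tokenize` is a stand-in that cuts words into fixed-size chunks, used here only so that the numbers are easy to follow.

```python
def tokenizer_metrics(words, tokenize):
    """Compute subword fertility and the proportion of continued words."""
    pieces_per_word = [len(tokenize(word)) for word in words]
    fertility = sum(pieces_per_word) / len(words)
    continued = sum(n >= 2 for n in pieces_per_word) / len(words)
    return fertility, continued


def toy_tokenize(word, chunk=4):
    # Stand-in tokenizer (an assumption for this sketch): fixed-size chunks.
    return [word[i:i + chunk] for i in range(0, len(word), chunk)]


words = "def say_hello print Hello World".split()
fertility, continued = tokenizer_metrics(words, toy_tokenize)
print(f"fertility: {fertility:.2f}, continued words: {continued:.0%}")
```

With a real tokenizer, `tokenize` could be replaced by any callable that returns the subword list for a single word.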

In addition, robustness to misspelling or noise is often estimated, as well as model performance on such out-of-domain examples, as this strongly depends on the tokenization process.

These measures give a set of different views on the tokenizer's performance, but they tend to ignore the interaction of the tokenizer with the model. For example, subword fertility can be minimized by including all the possible words in the vocabulary, but this will produce a very large vocabulary for the model.

In the end, the performance of the various tokenization approaches is thus generally best estimated by using the downstream performance of the model as the ultimate metric. For instance, the good performance of early BPE approaches was demonstrated by showing improved performance on machine translation tasks by models trained on these tokenizers and vocabularies instead of character- or word-based tokenization.

### 2.3 - Building a custom tokenizer for Python

We need a custom tokenizer for our use case: **tokenizing Python code**. The question of pretokenization merits some discussion for programming languages. If we split on whitespaces and remove them, we will lose all the indentation information, which in Python is important for the semantics of the program (just think about `while` loops, or `if-then-else` statements). On the other hand, additional line breaks such as blank lines are usually not meaningful and can be added or removed without impact on the semantics. Similarly, splitting on punctuation, like an underscore, which is used to compose a single variable name from several subparts, might not make as much sense as it would in natural language.
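
The loss of indentation information is easy to demonstrate. In the minimal sketch below (the two snippets are made up for this example), two programs with different behavior become indistinguishable once we split on whitespace:

```python
flat = "if x:\n    y = 1\nz = 2\n"
nested = "if x:\n    y = 1\n    z = 2\n"

# Splitting on whitespace discards the indentation that distinguishes the programs.
print(flat.split() == nested.split())

# `z` is assigned unconditionally in `flat`, but only when `x` is true in `nested`.
ns_flat, ns_nested = {"x": False}, {"x": False}
exec(flat, ns_flat)
exec(nested, ns_nested)
print("z" in ns_flat, "z" in ns_nested)
```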

**Using a natural language pretokenizer for tokenizing code thus seems potentially suboptimal**. Let's see it in action:

In [8]:
python_code = r"""
def say_hello():
    print("Hello, World!")
    # Print it
    say_hello()
"""
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer(python_code).tokens())

['Ċ', 'def', 'Ġsay', '_', 'hello', '():', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!"', ')', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ#', 'ĠPrint', 'Ġit', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġsay', '_', 'hello', '()', 'Ċ']


This is quite strange output, so let's try to understand what is happening here by running the various submodules of the tokenizer's pipeline. First let's see what normalization is applied in this tokenizer:

In [9]:
print(tokenizer.backend_tokenizer.normalizer)

None


As we can see, the GPT-2 tokenizer uses no normalization. It works directly on the raw Unicode inputs without any normalization steps. Let’s now take a look at the pretokenization:

In [10]:
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(python_code))

[('Ċ', (0, 1)), ('def', (1, 4)), ('Ġsay', (4, 8)), ('_', (8, 9)), ('hello', (9, 14)), ('():', (14, 17)), ('ĊĠĠĠ', (17, 21)), ('Ġprint', (21, 27)), ('("', (27, 29)), ('Hello', (29, 34)), (',', (34, 35)), ('ĠWorld', (35, 41)), ('!")', (41, 44)), ('ĊĠĠĠ', (44, 48)), ('Ġ#', (48, 50)), ('ĠPrint', (50, 56)), ('Ġit', (56, 59)), ('ĊĠĠĠ', (59, 63)), ('Ġsay', (63, 67)), ('_', (67, 68)), ('hello', (68, 73)), ('()', (73, 75)), ('Ċ', (75, 76))]


What are all these `Ġ` symbols, and what are the numbers accompanying the tokens? Let's explain both and see if we can understand better how this tokenizer works.

Let's start with the numbers. 🤗 Tokenizers has a very useful feature for switching between strings and tokens, called *offset tracking*. All the operations on the input string are tracked so that it's possible to know exactly what part of the input string a token after tokenization corresponds to. These numbers simply indicate where in the original string each token comes from; for instance, the word 'hello' in the first line has the offsets `(9, 14)`, which means it corresponds to the characters 9 to 13 in the original string (the end offset is exclusive). If some characters are removed in a normalization step, we are thus still able to associate each token with the respective part in the original string.
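
The idea behind offsets can be sketched with the standard library alone. The regular expression below is a simple stand-in chosen for this example, not the pattern used by 🤗 Tokenizers:

```python
import re

text = "def  say_hello():"
# Each match records the token text and its (start, end) span; `end` is exclusive.
tokens = [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]+", text)]
print(tokens)

# The offsets recover every token from the original string, even though
# the spaces between tokens were dropped.
for token, (start, end) in tokens:
    assert text[start:end] == token
```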

The other curious feature of the tokenized text is the odd-looking characters, such as `Ċ` and `Ġ`. They appear because the GPT-2 tokenizer is *byte-level*, which means that it works on bytes instead of Unicode characters. Each Unicode character is encoded in UTF-8 as between 1 and 4 bytes, depending on the character. The nice thing about bytes is that while there are 143,859 Unicode characters in the Unicode alphabet, there are only 256 elements in the byte alphabet, and you can express each Unicode character as a sequence of these bytes. If we work on bytes we can thus express all the strings composed from the UTF-8 world as longer strings in this alphabet of 256 values. That is, we can have a model using an alphabet of only 256 words and be able to process any Unicode string.

Let's have a look at what the byte representations of some characters look like:

In [11]:
a, e = "a", "€"
byte = a.encode("utf-8")[0]  # indexing a bytes object yields an int
print(f'`{a}` is encoded as `{a.encode("utf-8")}` with a single byte: {byte}')
byte = list(e.encode("utf-8"))  # iterating over a bytes object yields ints
print(f'`{e}` is encoded as `{e.encode("utf-8")}` with three bytes: {byte}')

`a` is encoded as `b'a'` with a single byte: 97
`€` is encoded as `b'\xe2\x82\xac'` with three bytes: [226, 130, 172]


At this point you might wonder: **why work on a byte level?**

We could decide to build our vocabulary from the 143,859 Unicode characters, but we would also like to include words (i.e., combinations of Unicode characters) in our vocabulary, so this (already very large) size is only a lower bound for the total size of the vocabulary. **This will make our model’s embedding layer very large because it comprises one vector for each vocabulary token**.

On the other extreme, if we only use the 256 byte values as our vocabulary, the input sequences will be segmented in many small pieces (i.e., similar to character level tokenization), and as such **our model will have to work on long inputs and spend significant compute power on reconstructing Unicode characters from their separate bytes, and the words from these characters**. [This approach was studied in the ByT5 paper](https://arxiv.org/abs/2105.13626).
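
A quick back-of-the-envelope calculation shows both sides of this trade-off. The sketch assumes an embedding width of 768 (the hidden size of the smallest GPT-2 model) and uses a made-up sample sentence; the 50,257 entry is the GPT-2 vocabulary size, included for comparison.

```python
hidden_size = 768  # assumed embedding width, as in the smallest GPT-2 model

# One embedding vector per vocabulary entry: parameters = vocab_size * hidden_size.
for name, vocab_size in [("bytes", 256), ("GPT-2 BPE", 50_257), ("Unicode characters", 143_859)]:
    params = vocab_size * hidden_size
    print(f"{name:>20}: {params / 1e6:6.1f}M embedding parameters")

# A byte-level view makes the input sequence longer for non-ASCII text.
text = "naïve café costs 5€"
n_chars, n_bytes = len(text), len(text.encode("utf-8"))
print(f"{n_chars} characters -> {n_bytes} bytes")
```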

A middle-ground solution is to construct a medium-sized vocabulary by extending the 256-word vocabulary with the most common combinations of bytes. This is the approach taken by the BPE algorithm. The idea is to progressively construct a vocabulary of a predefined size by creating new vocabulary tokens through iteratively merging the most frequently co-occurring pair of tokens in the vocabulary. For instance, if `t` and `h` occur very frequently together, like in English, we'll add a token `th` to the vocabulary to model this pair of tokens instead of keeping them separated. The `t` and `h` tokens are also kept in the vocabulary to tokenize instances where they do not occur together.
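
The merge procedure can be sketched in plain Python. This is a toy version that works on characters of a tiny made-up word list rather than on bytes of a real corpus, and the function name `train_bpe` is hypothetical; production implementations are considerably more optimized.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words (toy character-level BPE)."""
    words = Counter(tuple(word) for word in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


corpus = ["the", "then", "this", "that", "other", "cat"]
print(train_bpe(corpus, num_merges=2))
```

On this small word list the first learned merge is `t` + `h`, mirroring the example in the text.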

**There is just one issue when using a typical BPE algorithm in NLP. These algorithms are designed to work with clean Unicode strings as inputs, not bytes**, and expect regular ASCII characters in the inputs, without spaces or control characters. But in the Unicode characters corresponding to the 256 first bytes, **there are many control characters** (newline, tab, escape, line feed, and other nonprintable characters). To overcome this problem, the GPT-2 tokenizer first maps all the 256 input bytes to Unicode strings that can easily be digested by the standard BPE algorithms. That is, we will map our 256 elementary values to Unicode strings that all correspond to standard printable Unicode characters.

It’s not very important that these Unicode characters are each encoded with 1 "byte" (i.e., Unicode string after the mapping) or more; what is important is that we have 256 single values at the end, forming our base vocabulary, and that these 256 values are correctly handled by our BPE algorithm.

In [12]:
byte_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = dict((v, k) for k, v in byte_to_unicode_map.items())
base_vocab = list(unicode_to_byte_map.keys())
print(f'Size of our base vocabulary: {len(base_vocab)}')
print(f'First element: `{base_vocab[0]}`, last element: `{base_vocab[-1]}`')

Size of our base vocabulary: 256
First element: `!`, last element: `Ń`


We can see an example of a Unicode character that is represented by more than one "byte" in the following tokenization example:

In [13]:
print(tokenizer("Ġ def").tokens())

['Ä', 'ł', 'Ġdef']


The `Ġ` symbol is the character that this byte-level BPE tokenizer uses to represent a space. Therefore, when the text itself contains the `Ġ` Unicode character, it is represented by 2 "bytes" that have been mapped to other "special" Unicode strings: in this case, `Ä` and `ł`. The `Ġ` character has a control character in its composition (i.e., a nonbreakable space), represented by the `ł` string.

**Examples of character mappings in BPE:**

| Description                                    | Character | Bytes     | Mapped bytes |
|-----------------------------------------------|-----------|-----------|--------------|
| Regular characters `a` and `?`                | `a`       | 97        | `a`          |
|                                                | `?`       | 63        | `?`          |
| A nonprintable control character (carriage return) | `U+000D`  | 13        | `č`          |
| A space                                        | ` `       | 32        | `Ġ`          |
| A nonbreakable space                          | `\xa0`    | 160       | `ł`          |
| A newline character                            | `\n`      | 10        | `Ċ`          |
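
The mappings in the table above can be reproduced with a short sketch that follows the same logic as the `bytes_to_unicode()` helper. The function name `byte_to_unicode_sketch` is hypothetical and the code is a reimplementation written for illustration, not the library source.

```python
def byte_to_unicode_sketch():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already correspond to printable characters keep their own code point:
    # "!" to "~" (33-126), "¡" to "¬" (161-172), and "®" to "ÿ" (174-255).
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes (control characters, space, ...) are shifted past code point 255.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


mapping = byte_to_unicode_sketch()
for b in (97, 63, 13, 32, 160, 10):
    print(b, repr(mapping[b]))

# The two UTF-8 bytes of "Ġ" (U+0120) are themselves remapped to two characters.
print([mapping[b] for b in "\u0120".encode("utf-8")])
```

The last line illustrates why a literal `Ġ` in the input shows up as two symbols after tokenization.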

We could have used a more explicit conversion, like mapping newlines to a `NEWLINE` string, but **BPE algorithms are typically designed to work on characters**. For this reason, keeping one Unicode character for each byte character is easier to handle with an out-of-the-box BPE algorithm. Now that we have been introduced to the dark magic of Unicode encodings, we can understand our tokenization conversion a bit better:

In [14]:
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(python_code))

[('Ċ', (0, 1)), ('def', (1, 4)), ('Ġsay', (4, 8)), ('_', (8, 9)), ('hello', (9, 14)), ('():', (14, 17)), ('ĊĠĠĠ', (17, 21)), ('Ġprint', (21, 27)), ('("', (27, 29)), ('Hello', (29, 34)), (',', (34, 35)), ('ĠWorld', (35, 41)), ('!")', (41, 44)), ('ĊĠĠĠ', (44, 48)), ('Ġ#', (48, 50)), ('ĠPrint', (50, 56)), ('Ġit', (56, 59)), ('ĊĠĠĠ', (59, 63)), ('Ġsay', (63, 67)), ('_', (67, 68)), ('hello', (68, 73)), ('()', (73, 75)), ('Ċ', (75, 76))]


We can recognize the newlines, which as we now know are mapped to `Ċ`, and the spaces, mapped to `Ġ`. We also see that:

* Spaces, and in particular consecutive spaces, are conserved (for instance, the three spaces in `ĊĠĠĠ`).
* Consecutive spaces are considered as a single word.
* Each space preceding a word is attached to and considered a part of the subsequent word (e.g., in `Ġsay`).
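
These three observations can be reproduced with a regular expression. The pattern below is a standard-library approximation of a GPT-2-style splitting rule, written for this sketch; the real pre-tokenizer relies on Unicode property classes that Python's `re` module does not support, and it additionally applies the byte-to-Unicode mapping, which is skipped here.

```python
import re

# Approximate splitting rule: optional leading space + letters, digits, or
# punctuation; runs of whitespace leave their last character for the next word.
PATTERN = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+")

code = "\ndef say_hello():\n    print(\"Hello, World!\")\n"
pieces = [(m.group(), m.span()) for m in PATTERN.finditer(code)]
for piece in pieces:
    print(piece)
```

The newline followed by three of the four indentation spaces forms one piece, while the fourth space is attached to `print`, just as in the pretokenizer output above.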

Let's now experiment with the BPE model. As we've mentioned, it's in charge of splitting the words into subunits until all subunits belong to the predefined vocabulary.

**The vocabulary of our GPT-2 tokenizer comprises 50,257 words**:

* The base vocabulary with the 256 bytes (which allows us to build any of the 143,859 Unicode characters in the Unicode alphabet)
* 50,000 additional tokens created by repeatedly merging the most commonly occurring tokens
* A special character added to the vocabulary to represent document boundaries

We can easily check that by looking at the length attribute of the tokenizer:

In [15]:
print(f"Size of the vocabulary: {len(tokenizer)}")

Size of the vocabulary: 50257


Running the full pipeline on our input code gives us the following output:

In [16]:
print(tokenizer(python_code).tokens())

['Ċ', 'def', 'Ġsay', '_', 'hello', '():', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!"', ')', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ#', 'ĠPrint', 'Ġit', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġsay', '_', 'hello', '()', 'Ċ']


As we can see, the BPE tokenizer keeps most of the words but will split the multiple spaces of our indentation into several consecutive spaces. **This happens because this tokenizer is not specifically trained on code**, but mostly on texts where consecutive spaces are rare. The BPE model thus doesn’t include a specific token in the vocabulary for indentation. This is a case where the tokenizer model is poorly suited for the dataset’s domain. As we discussed earlier, **the solution is to retrain the tokenizer on the target corpus**. So let’s get to it!

### 2.4 - Training a Tokenizer

Let's retrain our byte-level BPE tokenizer on a slice of our corpus to get a vocabulary better adapted to Python code. Retraining a tokenizer provided by 🤗 Transformers is simple. We just need to:

* Specify our target vocabulary size.
* Prepare an iterator to supply lists of input strings to process to train the tokenizer's model.
* Call the `train_new_from_iterator()` method.

Unlike deep learning models, which are often expected to memorize a lot of specific details from the training corpus, <span style="color:blue"><b>tokenizers are really just trained to extract the main statistics.</b> In a nutshell, the tokenizer is just trained to know which letter combinations are the most frequent in our corpus.</span> 

Therefore, you don't necessarily need to train your tokenizer on a very large corpus; the corpus just needs to be representative of your domain and big enough for the tokenizer to extract statistically significant measures. But depending on the vocabulary size and the exact texts in the corpus, the tokenizer can end up storing unexpected words. We can see this, for instance, when looking at the longest words in the vocabulary of the GPT-2 tokenizer:

In [17]:
tokens = sorted(tokenizer.vocab.items(), key=lambda x: len(x[0]), reverse=True)
# Each item is (token, vocabulary index); a higher index means a later merge, i.e., a less frequent token
for t in tokens[:4]: 
    print(t)

('ÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤÃĥÃĤ', 35496)
('Ġ----------------------------------------------------------------', 16529)
('................................................................', 23193)


These tokens look like separator lines that are likely to be used in forums. This makes sense since GPT-2 was trained on a corpus centered around Reddit. Now let's have a look at the last words that were added to the vocabulary, and thus the least frequent ones:

In [18]:
tokens = sorted(tokenizer.vocab.items(), key=lambda x: x[1], reverse=True)
for t in tokens[:8]: 
    print(t)

('<|endoftext|>', 50256)
('Ġgazed', 50255)
('Ġinformants', 50254)
('ĠCollider', 50253)
('Ġregress', 50252)
('ominated', 50251)
('Ġamplification', 50250)
('Compar', 50249)


The first token, `<|endoftext|>`, is the special token used to specify the end of a text sequence and was added after the BPE vocabulary was built. 

For each of these tokens our model will have to learn an associated word embedding, and we probably don't want the embedding matrix to contain too many noisy words. Also note how some very time- and space-specific knowledge of the world (e.g., a proper noun like `Collider`) is embedded at a very low level in our modeling approach by these words being granted separate tokens with associated vectors in the vocabulary. The creation of such specific tokens by a BPE tokenizer can be an indication that the target vocabulary size is too large or that the corpus contains idiosyncratic tokens.

Let's train a fresh tokenizer on our corpus and examine its learned vocabulary. Since we just need a corpus reasonably representative of our dataset statistics, let's select a subset of the data (e.g., 1-2GB of data, or about 100,000 documents from our corpus):

----

**Note:** For test purposes, we are going to select an even smaller dataset, with around 1000 documents

----

In [19]:
def batch_iterator(length, batch_size=10):
    for _ in tqdm(range(0, length, batch_size)):
        yield [next(iter_dataset)['content'] for _ in range(batch_size)]

dataset_name = 'transformersbook/codeparrot-train'
dataset = load_dataset(dataset_name, split="train", streaming=True)
iter_dataset = iter(dataset)

length = 1000
vocab_size = 2500
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(length), vocab_size=vocab_size, initial_alphabet=base_vocab)

Let's investigate the first and last words created by our BPE algorithm to see how relevant our vocabulary is. We skip the 256 byte tokens and look at the first tokens added thereafter:

In [20]:
tokens = sorted(new_tokenizer.vocab.items(), key=lambda x: x[1], reverse=False)
for t in tokens[257:280]: 
    print(t)

('ĠĠ', 257)
('ĠĠĠĠ', 258)
('ĠĠĠ', 259)
('ĊĠĠĠĠ', 260)
('se', 261)
('re', 262)
('in', 263)
('on', 264)
('te', 265)
('ĊĠĠĠĠĠĠĠ', 266)
('ĠĠĠĠĠĠĠĠ', 267)
('ĊĠĠĠ', 268)
('st', 269)
('or', 270)
('de', 271)
('le', 272)
('th', 273)
('Ġ=', 274)
('lf', 275)
('al', 276)
('self', 277)
('me', 278)
('ti', 279)


Here we can see various standard levels of indentation and whitespace tokens, as well as short strings that are common in Python code, such as `self`, `or`, and `in`. This is a good sign that our BPE algorithm is working as intended. Now let's check out the last words:

In [21]:
for t in tokens[-12:]: 
    print(t)

('ĠWITHOUT', 2488)
('39', 2489)
('69', 2490)
('Option', 2491)
('handler', 2492)
('patch', 2493)
('ĊĠĠĠĠĊĠĠĠ', 2494)
('serialize', 2495)
('Ġwrite', 2496)
('utf', 2497)
('Ġlook', 2498)
('system', 2499)


Here there are still some relatively common words like `serialize`, `system` and `write`, but there are also random numbers like `39` and `69`.
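
To see why very frequent fragments receive low vocabulary indices while rarer strings are only added near the end, it can help to look at the merge loop itself. The following is a minimal character-level sketch of BPE training, not the byte-level implementation used by 🤗 Tokenizers; the function name `train_toy_bpe` and the tiny corpus are made up for illustration:

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level BPE)."""
    # Each distinct word becomes a tuple of symbols, mapped to how often it occurs
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # The most frequent pair is merged next (ties go to the pair seen first)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


# Hypothetical mini-corpus: "in" is frequent on its own and also occurs inside "print"
words = ["in"] * 6 + ["print"] * 3 + ["def"]
merges = train_toy_bpe(words, num_merges=3)
print(merges)  # the first merge is ('i', 'n'), the most frequent pair
```

Because merges are learned in order of decreasing pair frequency, the position of a token in the vocabulary is a rough proxy for how common it was in the training sample.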

We can now tokenize our simple Python code example to see how our tokenizer behaves:

In [22]:
print(python_code)
print(new_tokenizer(python_code).tokens())
print(tokenizer(python_code).tokens())


def say_hello():
    print("Hello, World!")
    # Print it
    say_hello()

['Ċ', 'def', 'Ġs', 'ay', '_', 'h', 'el', 'lo', '():', 'ĊĠĠĠ', 'Ġprint', '("', 'H', 'el', 'lo', ',', 'ĠW', 'or', 'ld', '!', '")', 'ĊĠĠĠ', 'Ġ#', 'ĠP', 'rint', 'Ġit', 'ĊĠĠĠ', 'Ġs', 'ay', '_', 'h', 'el', 'lo', '()', 'Ċ']
['Ċ', 'def', 'Ġsay', '_', 'hello', '():', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!"', ')', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ#', 'ĠPrint', 'Ġit', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġsay', '_', 'hello', '()', 'Ċ']


Even though they are not code keywords, it’s a little annoying to see common English words like `World` or `say` being split by our tokenizer, since we’d expect them to occur rather frequently in the corpus. **On the positive side, compared to GPT-2, we can see that our tokenizer conveniently kept the indents in the vocabulary.**

----

**Note:** This is especially apparent due to our even smaller sample of data that was used to train the tokenizer

----

Let’s check if all the Python reserved keywords are in the vocabulary:

In [23]:
print(f'There are in total {len(keyword.kwlist)} Python keywords.')
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer.vocab:
        print(f'No, keyword `{keyw}` is not in the vocabulary')

There are in total 36 Python keywords.
No, keyword `__peg_parser__` is not in the vocabulary
No, keyword `async` is not in the vocabulary
No, keyword `await` is not in the vocabulary
No, keyword `break` is not in the vocabulary
No, keyword `continue` is not in the vocabulary
No, keyword `elif` is not in the vocabulary
No, keyword `except` is not in the vocabulary
No, keyword `finally` is not in the vocabulary
No, keyword `global` is not in the vocabulary
No, keyword `nonlocal` is not in the vocabulary
No, keyword `raise` is not in the vocabulary
No, keyword `while` is not in the vocabulary
No, keyword `yield` is not in the vocabulary


It appears that several quite frequent keywords, like `elif`, `while`, and `yield`, are not in the vocabulary. Let's try building a larger vocabulary using a larger sample of our dataset. For instance, we can build a vocabulary of 5096 tokens (multiples of 8 are better for some efficient GPU/TPU computations) and train the tokenizer on a slice of our corpus that is ten times as large:

In [24]:
length = 10000
vocab_size = 5096
new_tokenizer_larger = tokenizer.train_new_from_iterator(batch_iterator(length), vocab_size=vocab_size, initial_alphabet=base_vocab)

We don’t expect the most frequent tokens to change much when adding more documents, but let’s look at the last tokens:

In [25]:
tokens = sorted(new_tokenizer_larger.vocab.items(), key=lambda x: x[1], reverse=False)
for t in tokens[-12:]: 
    print(t)

('parameter', 5084)
('PARAM', 5085)
('morphic', 5086)
('break', 5087)
('ence', 5088)
('ĠIO', 5089)
('Ġpickle', 5090)
('DATE', 5091)
('New', 5092)
('vas', 5093)
('Plugin', 5094)
('builtin', 5095)


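A brief inspection shows mostly programming-related terms such as `parameter`, `pickle`, and `builtin`; the keyword `break`, which was missing from the smaller vocabulary, now appears near the end. Let’s try tokenizing our sample code example with the new larger tokenizer: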

In [26]:
print(python_code)
print(new_tokenizer_larger(python_code).tokens())
print(new_tokenizer(python_code).tokens())
print(tokenizer(python_code).tokens())


def say_hello():
    print("Hello, World!")
    # Print it
    say_hello()

['Ċ', 'def', 'Ġs', 'ay', '_', 'h', 'ello', '():', 'ĊĠĠĠ', 'Ġprint', '("', 'H', 'ello', ',', 'ĠW', 'or', 'ld', '!', '")', 'ĊĠĠĠ', 'Ġ#', 'ĠP', 'rint', 'Ġit', 'ĊĠĠĠ', 'Ġs', 'ay', '_', 'h', 'ello', '()', 'Ċ']
['Ċ', 'def', 'Ġs', 'ay', '_', 'h', 'el', 'lo', '():', 'ĊĠĠĠ', 'Ġprint', '("', 'H', 'el', 'lo', ',', 'ĠW', 'or', 'ld', '!', '")', 'ĊĠĠĠ', 'Ġ#', 'ĠP', 'rint', 'Ġit', 'ĊĠĠĠ', 'Ġs', 'ay', '_', 'h', 'el', 'lo', '()', 'Ċ']
['Ċ', 'def', 'Ġsay', '_', 'hello', '():', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '("', 'Hello', ',', 'ĠWorld', '!"', ')', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ#', 'ĠPrint', 'Ġit', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġsay', '_', 'hello', '()', 'Ċ']


Here also the indents are conveniently kept in the vocabulary, and we see that common English words like `Hello` are getting merged into larger pieces (`ello` instead of `el` and `lo`), although not to the degree we would like (`Hello` should probably be stored as a single token). This occurs because we trained on a very small sample of the data.

Let’s investigate the common Python keywords, as we did before:

In [27]:
for keyw in keyword.kwlist:
    if keyw not in new_tokenizer_larger.vocab:
        print(f'No, keyword `{keyw}` is not in the vocabulary')

No, keyword `__peg_parser__` is not in the vocabulary
No, keyword `async` is not in the vocabulary
No, keyword `await` is not in the vocabulary
No, keyword `continue` is not in the vocabulary
No, keyword `finally` is not in the vocabulary
No, keyword `nonlocal` is not in the vocabulary
No, keyword `while` is not in the vocabulary
No, keyword `yield` is not in the vocabulary


We are still missing some common keywords like `yield` and `while`, but others like `elif` have finally been included. After this manual inspection, our larger tokenizer seems better adapted for our task, but as we mentioned earlier, objectively evaluating the performance of a tokenizer is a challenging task without measuring the model's performance.

We will proceed with this one and train a model to see how well it works in practice.

#### Why not use Python's built-in `tokenize` module?

Python has a built-in `tokenize` module that splits Python code strings into meaningful units (code operation, comments, indent and dedent, etc.). 
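
As a quick illustration of what this module returns, the sketch below runs the standard-library `tokenize.generate_tokens()` function on a sample function like the one used earlier in this chapter (the string is redefined here so that the snippet is self-contained):

```python
import io
import tokenize

python_code = '''
def say_hello():
    print("Hello, World!")
    # Print it
    say_hello()
'''

# generate_tokens() expects a readline-style callable that returns str lines
tokens = list(tokenize.generate_tokens(io.StringIO(python_code).readline))

for tok in tokens:
    # tok.type is an integer code; tok_name maps it to a readable label such as NAME or COMMENT
    print(f"{tokenize.tok_name[tok.type]:<10} {tok.string!r}")
```

Each token carries a syntactic category (`NAME`, `OP`, `COMMENT`, `INDENT`, and so on), which is a different kind of split from the frequency-driven subword units learned by BPE.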

One issue with using this approach is that **this pretokenizer is Python-based and as such is typically rather slow and limited by the Python global interpreter lock (GIL)**. 

On the other hand, most of the tokenizers in the 🤗 Transformers library are provided by the 🤗 Tokenizers library and are coded in Rust. **The Rust tokenizers are many orders of magnitude faster to train and to use**, and we will thus likely want to use them given the size of our corpus.

## 3 - Training a model from scratch

In this section we'll decide which architecture works best for the task, initialize a fresh model without pretrained weights, set up a custom data loading class, and create a scalable training loop.

### 3.1 - A tale of pretraining objectives

Now that we have access to a "large-scale" pretraining corpus and an efficient tokenizer, we can start thinking about how to pretrain a transformer model.

With such a large codebase consisting of code snippets like the one shown in the following Figure, we can tackle several tasks.

<table>
    <tr>
        <td><img title="" src="images_ch10/data_example_pretraining.PNG" alt="" height="100" data-align="center"></td>
    </tr>
</table>

Which one we choose will influence our choice of pretraining objectives. Let’s have a look at three common tasks:

* Causal language modeling
* Masked language modeling
* Sequence-to-sequence training

#### 3.1.1 - Causal language modeling

A natural task with textual data is to provide a model with the beginning of a code sample and ask it to generate possible completions. This is a self-supervised training objective in which we can use the dataset without annotations.

**In causal language modeling, the future tokens are masked and the model has to predict them.**

A decoder-only architecture such as the GPT family of models is usually best suited for this task.

<table>
    <tr>
        <td><img title="" src="images_ch10/causal_language_modeling_example.PNG" alt="" height="100" data-align="center"></td>
    </tr>
</table>
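
To make the objective concrete, here is a minimal sketch of how a single token sequence turns into next-token prediction examples. The helper `next_token_examples` and the hand-written token list are hypothetical and only serve as an illustration; a real pipeline works on token IDs and computes all positions in parallel:

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: every prefix is used to predict the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]

for context, target in next_token_examples(tokens)[:3]:
    print(context, "->", target)
# ['def'] -> add
# ['def', 'add'] -> (
# ['def', 'add', '('] -> a

# Equivalent view: the labels are simply the inputs shifted by one position
inputs, labels = tokens[:-1], tokens[1:]
```

No annotation is needed: the labels come from the text itself, which is why this objective scales to very large unlabeled corpora.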

#### 3.1.2 - Masked language modeling

A related but slightly different task is to provide a model with a noisy code sample, for instance with a code instruction replaced by a random or masked word, and ask it to reconstruct the original clean sample.

**In masked language modeling some of the input tokens are either masked or replaced, and the model’s task is to predict the original tokens.**

An encoder-only architecture such as the BERT family of models is usually trained on this task (as is the encoder part of an encoder-decoder model).

<table>
    <tr>
        <td><img title="" src="images_ch10/masked_language_modeling_example.PNG" alt="" height="100" data-align="center"></td>
    </tr>
</table>

This is also a self-supervised training objective and is commonly called *masked language modeling* or the *denoising objective*. It's harder to think about a downstream task directly related to denoising, but denoising is generally a good pretraining task to learn general representations for later downstream tasks.

Many encoder-only models (like BERT and XLM-RoBERTa) are pretrained that way. Training a masked language model on a large corpus can thus be combined with fine-tuning the model on a downstream task with a limited number of labeled examples.
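
The corruption step of this objective can be sketched in a few lines. The function below is a simplified assumption: it only replaces tokens with a mask symbol, whereas schemes such as BERT's also substitute random tokens or keep some selected tokens unchanged. The name `mask_tokens` and the `[MASK]` string are illustrative choices:

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens; labels hold the original token at masked positions, None elsewhere."""
    rng = random.Random(seed)  # local generator so that results are reproducible
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)  # the model sees the mask ...
            labels.append(token)       # ... and has to recover the original token
        else:
            inputs.append(token)
            labels.append(None)        # no loss is computed for unmasked positions
    return inputs, labels


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
noisy_inputs, targets = mask_tokens(tokens, mask_prob=0.3)
print(noisy_inputs)
print(targets)
```

Unlike the causal objective, the model here sees context on both sides of each masked position, which is why this setup is typically paired with bidirectional encoders.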

#### 3.1.3 - Sequence-to-sequence training

An alternative task is to use a heuristic like regular expressions to separate comments or docstrings from code and build a large-scale dataset of `(code, comments)` pairs that can be used as an annotated dataset. The training task is then a supervised training objective in which one category (e.g., `comment`) is used as input for the model and the other category (e.g., `code`) is used as labels. This is a case of *supervised learning* with (input, labels) pairs, as highlighted in the following Figure:

<table>
    <tr>
        <td><img title="" src="images_ch10/sequence_to_sequence_example.PNG" alt="" height="200" data-align="center"></td>
    </tr>
</table>

With a large, clean, and diverse dataset as well as a model with sufficient capacity, we can try to train a model that learns to translate comments into code or vice versa. A downstream task directly related to this supervised training task is then documentation generation from code or code generation from documentation, depending on how we set our input/outputs. 

In this setting a sequence is translated into another sequence, which is where **encoder-decoder architectures such as T5, BART, and PEGASUS shine**.
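
As a rough sketch of how such pairs could be mined, the snippet below collects `(docstring, code)` pairs from Python source. Note that it swaps the regular-expression heuristic mentioned above for the standard-library `ast` module, which is simply one convenient alternative; the function name `docstring_code_pairs` is hypothetical, and the extracted code still contains the docstring, which a real pipeline would probably strip:

```python
import ast


def docstring_code_pairs(source):
    """Collect (docstring, function source) pairs for every documented function in `source`."""
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:  # skip functions without documentation
                code = ast.get_source_segment(source, node)
                pairs.append((docstring, code))
    return pairs


source = '''
def add(a, b):
    """Add two numbers."""
    return a + b

def helper(x):
    return x * 2
'''

pairs = docstring_code_pairs(source)
for docstring, code in pairs:
    print("INPUT :", docstring)
    print("LABEL :", code)
```

Swapping the two elements of each pair turns the same data into a documentation-generation dataset.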

### 3.2 - Initializing the model

<span style="color:blue">Since we want to build a code autocompletion model, <b>we’ll select the first objective and choose a GPT architecture for the task</b>. So let’s initialize a fresh GPT-2 model!</span>

Since we are training a model from scratch, we won't use the `from_pretrained()` method to load a model; instead, we will initialize a new one. We will, however, load the configuration of `gpt2` so that we use the same hyperparameters and only adapt the vocabulary size for the new tokenizer. We then initialize a new model from this configuration with the `from_config()` method:

In [28]:
# We load a fresh instance of the GPT2 model and adjust its vocabulary to that of our previously learned tokenizer
model_ckpt = "gpt2"

# tokenizer = AutoTokenizer.from_pretrained(model_ckpt) # Base tokenizer
tokenizer = new_tokenizer_larger # Previously learned tokenizer (specific for Python code)

config = AutoConfig.from_pretrained(model_ckpt, vocab_size=len(tokenizer))
model = AutoModelForCausalLM.from_config(config)

print(f'GPT-2 size: {model_size(model)/1000**2:.1f}M parameters')

GPT-2 size: 89.8M parameters


We can check how changing the tokenizer affects the number of model parameters (due to the difference in vocabulary size). For instance, since the vocabulary of our newly trained tokenizer is much smaller than GPT-2's, the embedding layer has fewer parameters and thus the model as a whole is smaller.
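
The effect of the vocabulary size can be verified with a back-of-the-envelope count. The sketch below assumes the standard `gpt2` configuration (12 layers, hidden size 768, 1024 positions) and that the output layer shares its weights with the token embeddings, so they are counted only once; `gpt2_param_count` is a hypothetical helper and not part of 🤗 Transformers:

```python
def gpt2_param_count(vocab_size, n_embd=768, n_layer=12, n_positions=1024):
    """Approximate parameter count of a GPT-2 style model with tied input/output embeddings."""
    token_embeddings = vocab_size * n_embd
    position_embeddings = n_positions * n_embd
    layer_norm = 2 * n_embd  # scale and bias
    # Attention: fused query/key/value projection plus output projection (weights + biases)
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # The trailing layer_norm term is the final layer norm after the last block
    return token_embeddings + position_embeddings + n_layer * block + layer_norm


print(f"Our tokenizer (5096 tokens):    {gpt2_param_count(5096) / 1000**2:.1f}M")   # 89.8M
print(f"GPT-2 tokenizer (50257 tokens): {gpt2_param_count(50257) / 1000**2:.1f}M")  # 124.4M
```

Under these assumptions, the difference between the two models comes entirely from the token embedding matrix, which shrinks from `50257 * 768` to `5096 * 768` entries.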

### 3.3 - Implementing the `InfiniteConstantLengthDataset` & `ConstantLengthDataset`

To be able to train with maximal efficiency, we will want to supply our model with sequences filling its context. For example, if the context length of our model is 1024 tokens, we always want to provide 1024-token sequences during training. But some of our code examples might be shorter or longer than 1024 tokens.

To feed batches with full sequences of `sequence_length` to our model (i.e., length of 1024 in the case of GPT-2), we should either drop the last incomplete sequence or pad it. So, for example, if each example corresponds to the code of a Python file, we would tokenize it and either drop the last part if the number of generated tokens is larger than `sequence_length`, or pad it up to `sequence_length` if it is shorter. However, this has two main drawbacks:
* It makes our training less computationally efficient (i.e., we need to use an attention mask to ignore padding tokens)
* It introduces bias towards shorter files because in longer ones we are always dropping the "last parts".

For these reasons, we can use a little trick to avoid dropping the last parts of files and also to avoid padding short files. The idea is to create a "buffer", which we fill with samples (i.e., Python files) until a specified number of characters is reached. We then tokenize the whole buffer and concatenate the individual sequences with the `EOS` token, thus generating a very long sequence. Finally, we split this long sequence into equally sized chunks (of `sequence_length`) as shown in the Figure below. With this approach, we lose at most a small fraction of the tokens at the end of the last Python file being concatenated.

<table>
    <tr>
        <td><img title="" src="images_ch10/data_preparation.PNG" alt="" height="300" data-align="center"></td>
    </tr>
</table>
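
The concatenate-and-chunk step can be sketched on plain lists of token IDs. The helper name `concat_and_chunk` and the toy IDs are made up for illustration, with `0` standing in for the `EOS` token:

```python
def concat_and_chunk(tokenized_docs, eos_token_id, seq_length):
    """Join tokenized documents with an EOS token and cut the result into full-length chunks."""
    all_token_ids = []
    for token_ids in tokenized_docs:
        all_token_ids.extend(token_ids + [eos_token_id])
    chunks = [
        all_token_ids[i : i + seq_length]
        for i in range(0, len(all_token_ids), seq_length)
    ]
    # Only the trailing chunk can be incomplete; it is dropped instead of being padded
    return [chunk for chunk in chunks if len(chunk) == seq_length]


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(concat_and_chunk(docs, eos_token_id=0, seq_length=4))
# [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
print(concat_and_chunk(docs, eos_token_id=0, seq_length=5))
# [[1, 2, 3, 0, 4], [5, 0, 6, 7, 8]]  (the trailing [9, 0] is dropped)
```

Note that a chunk may contain the end of one file and the start of the next; the `EOS` token is what tells the model where one document stops.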

Now, the only thing left is to decide the size of this buffer. The longer the buffer, the less information we lose. However, a longer buffer also means holding more data in memory (remember that the dataset we are working with is really big and is loaded in `streaming=True` mode).

For convenience, we can estimate an approximate buffer size from the number of sequences we want to store in memory and the `sequence_length` of the model we want to train:

```python

buffer_characters = number_of_sequences * sequence_length * characters_per_token

```

where:

* `buffer_characters` is the number of characters in our buffer.
* `number_of_sequences` is the number of (truncated) sequences we would like from our tokenizer.
* `sequence_length` is the number of tokens per sequence returned by the tokenizer (e.g., 1024 for GPT2).
* `characters_per_token` is the average number of characters per token (needs to be estimated beforehand).

Now, let's say we have estimated the average number of characters per token to be `3.1`. If we want to generate 10 sequences of 1024 tokens, we would need a buffer of `10` * `1024` * `3.1` = `31744` characters.
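
The same arithmetic can be wrapped in a small helper, which is handy when experimenting with different settings. The function name `estimate_buffer_characters` is hypothetical; the result is rounded because the characters-per-token estimate is a float:

```python
def estimate_buffer_characters(number_of_sequences, sequence_length, characters_per_token):
    """Estimate how many characters the buffer should hold to yield the requested sequences."""
    return round(number_of_sequences * sequence_length * characters_per_token)


# 10 sequences of 1024 tokens at roughly 3.1 characters per token
print(estimate_buffer_characters(10, 1024, 3.1))    # 31744

# 1024 sequences of 1024 tokens at roughly 3.6 characters per token
print(estimate_buffer_characters(1024, 1024, 3.6))  # 3774874
```

Since the characters-per-token ratio is only an average, the number of full sequences actually obtained from one buffer will fluctuate around the requested value.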

----

**Note:** There are two special cases where this approach could be "problematic".
 
1. **If documents are really long**, so long that each of them exceeds the `buffer_characters` size, we would lose information every time unless we increased the buffer size accordingly. 
2. **If the buffer is almost full when the last document is appended**, only a small number of characters are left for it. In that case we would lose most of the information in that document, which was simply unlucky to be selected at that point.

If these issues are very common in our data, we may need to do some extra adjustments to our strategy.

----

To apply this approach for our case, let's first estimate the average character length per token in our dataset:

In [29]:
def calculate_characters_per_token(dataset_name, tokenizer, examples=500):
    """
    Calculate the average number of characters per token in the specified dataset.

    Args:
        dataset_name (str): The name of the dataset.
        tokenizer: The tokenizer used to count the tokens of each example.
        examples (int): The number of examples to use for calculation.

    Returns:
        float: The average number of characters per token.
    """
    print(f"Calculating the average number of characters per token in the '{dataset_name}' dataset")
    dataset = load_dataset(dataset_name, split="train", streaming=True)
    total_characters, total_tokens = 0, 0

    for _, example in tqdm(zip(range(examples), iter(dataset)), total=examples):
        total_characters += len(example['content'])
        total_tokens += len(tokenizer(example['content']).input_ids)

    characters_per_token = total_characters / total_tokens
    return characters_per_token

dataset_name = 'transformersbook/codeparrot-train'
characters_per_token = calculate_characters_per_token(dataset_name, new_tokenizer_larger, 500)
print(characters_per_token)

Calculating the average number of characters per token in the 'transformersbook/codeparrot-train' dataset


Token indices sequence length is longer than the specified maximum sequence length for this model (3044 > 1024). Running this sequence through the model will result in indexing errors


3.11763181026712


#### 3.3.1 - `InfiniteConstantLengthDataset`

**This is an infinite dataset**. When training with it, we should measure progress in steps rather than epochs, since an epoch would never end.

With that we have all that's needed to create our own `IterableDataset` (which is a helper class provided by PyTorch) for preparing constant-length inputs for the model. We just need to inherit from `IterableDataset` and set up the `__iter__()` function that yields the next element with the logic we just walked through:

In [30]:
class InfiniteConstantLengthDataset(IterableDataset):
    def __init__(
        self,
        tokenizer,
        dataset,
        text_attribute_name,
        seq_length=1024,
        num_of_sequences=1024,
        chars_per_token=3.6,
        debug=False,
    ):
        """
        Create a custom dataset for constant-length sequences for language modeling.

        Args:
            tokenizer (transformers.PreTrainedTokenizer): A tokenizer object for text preprocessing.
            dataset: An iterable dataset containing text examples (e.g., a streaming Hugging Face dataset).
            text_attribute_name (str): Name of the column that holds the text (e.g., "content").
            seq_length (int): The desired sequence length for language modeling.
            num_of_sequences (int): The number of sequences to concatenate.
            chars_per_token (float): Average number of characters per token in the text data.
            debug (bool): If True, print messages while the buffer is being filled.
        """
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id  # End-of-sequence token ID
        self.dataset = dataset
        self.text_attribute_name = text_attribute_name  # name of the "column" that contains the text in the HF dataset (e.g., "content", "text", etc)
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences  # buffer size (in characters)
        self.debug = debug

    def __iter__(self):
        """
        Iterator function for the dataset.
        """
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                # Check if the buffer has reached the desired character limit
                if buffer_len >= self.input_characters:
                    if self.debug:
                        print(f"Buffer full: {buffer_len}>={self.input_characters:.0f}")

                    break
                try:
                    if self.debug:
                        print(f"Fill buffer: {buffer_len}<{self.input_characters:.0f}")

                    # Append the content of the next example to the buffer
                    buffer.append(next(iterator)[self.text_attribute_name])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    # If there are no more examples, reset the iterator to reuse the dataset
                    if buffer_len > 0:
                        iterator = iter(self.dataset)
                    # Unless the buffer has size 0, when we stop to avoid infinite loops
                    else:
                        more_examples = False
                        break

            if more_examples:
                # Once the buffer is filled with characters, we tokenize them and prepare the sequences
                # with the specified length
                all_token_ids = []
                tokenized_buffer = self.tokenizer(buffer, truncation=False)
                for tokenized_input in tokenized_buffer["input_ids"]:
                    # Extend the list of token IDs with end-of-sequence token IDs
                    all_token_ids.extend(tokenized_input + [self.concat_token_id])

                for i in range(0, len(all_token_ids), self.seq_length):
                    input_ids = all_token_ids[i : i + self.seq_length]
                    if len(input_ids) == self.seq_length:
                        # Yield a tensor representing a sequence of the specified length
                        yield torch.tensor(input_ids)

The `__iter__()` function builds up a buffer of strings until it contains enough characters. All the elements in the buffer are tokenized and concatenated with the `EOS` token, then the long sequence in `all_token_ids` is chunked in `seq_length`-size slices. **Normally, we would need attention masks to stack padded sequences** of varying length and make sure the padding is ignored during training. **We have taken care of this by only providing sequences of the same (maximal) length**, so we don't need the masks here and only return the `input_ids` (i.e., no need for padding). Let's test our iterable dataset:

In [31]:
shuffled_dataset = dataset.shuffle(buffer_size=100)
# As an example, we create a dataset whose buffer is sized for 10 sequences
constant_length_dataset = InfiniteConstantLengthDataset(tokenizer, shuffled_dataset, "content", num_of_sequences=10)

dataset_iterator = iter(constant_length_dataset)

for i in range(0, 5):
    b = next(dataset_iterator)
    print(f"{i}: {len(b)}")
    print(b)

0: 1024
tensor([ 638, 3806,  199,  ...,  365,  482,   70])
1: 1024
tensor([1054,  426,  478,  ..., 2729,  274, 1040])
2: 1024
tensor([  63, 2729,  267,  ...,  282,  571,  400])
3: 1024
tensor([4480,  433,  313,  ...,   63,  693, 3057])
4: 1024
tensor([   8, 4512, 1046,  ...,  199,    3,  258])


Nice, this works as intended and we get constant-length inputs for the model. Now that we have a reliable data source for the model, it's time to build the actual training loop.

----

**Note:** Notice that we shuffled the raw dataset before creating the `InfiniteConstantLengthDataset`. Since this is an iterable dataset, we can’t just shuffle the whole dataset at the beginning. Instead, we set up a buffer of size `buffer_size` and shuffle the elements in this buffer before we get elements from the dataset.

----
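
The idea behind buffer-based shuffling can be sketched in plain Python. This is an illustration of the general technique under simple assumptions, not the exact implementation used by 🤗 Datasets; the function name `buffered_shuffle` is hypothetical:

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=0):
    """Approximately shuffle a stream while holding at most `buffer_size` items in memory."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Emit a random buffered element and put the incoming item in its place
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The stream is exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4))
print(shuffled)
```

An element can only be emitted after it has entered the buffer, so a small `buffer_size` gives a weak, local shuffle, while a buffer as large as the dataset is equivalent to a full shuffle.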

#### 3.3.2 - `ConstantLengthDataset`

This is a finite version of the `InfiniteConstantLengthDataset`. In the previous case, making the dataset infinite meant we did not need to worry about issues such as how many buffers can be generated from the dataset, because the iterator simply restarts once the data is exhausted.

In [32]:
class FixedConstantLengthDataset(Dataset):
    def __init__(
        self, 
        tokenizer, 
        dataset,
        text_attribute_name,
        seq_length=1024,
        num_of_sequences=1024, 
        chars_per_token=3.6,
        debug=False
    ):
        """
        Create a custom dataset for constant-length sequences for language modeling.
        
        Args:
            tokenizer (transformers.PreTrainedTokenizer): A tokenizer object for text preprocessing.
            dataset (torch.utils.data.Dataset): An existing dataset containing text examples.
            seq_length (int): The desired sequence length for language modeling.
            num_of_sequences (int): The number of sequences to concatenate.
            chars_per_token (float): Average number of characters per token in the text data.
        """
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.eos_token_id  # End-of-sequence token ID
        self.dataset = dataset
        self.text_attribute_name = text_attribute_name # name of the "column" that contains the text in the HF dataset (e.g., "content", "text", etc)
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences # buffer size
        self.debug=debug

    def generate_all_buffers(self):
        """
        Generate all buffers from the entire dataset and return them as a list.
        """
        buffers = []
        iterator = iter(self.dataset)
        
        while True:
            buffer, buffer_len = [], 0
            while True:
                # Check if the buffer has reached the desired character limit
                if buffer_len >= self.input_characters:
                    break
                try:
                    # Append the content of the next example to the buffer
                    buffer.append(next(iterator)[self.text_attribute_name])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    # If there are no more examples, break out of the loop
                    break
            
            if not buffer:
                # If the buffer is empty, there are no more examples to process
                break
            
            # Once the buffer is filled with characters, tokenize them and add to the list
            all_token_ids = []
            tokenized_buffer = self.tokenizer(buffer, truncation=False)
            for tokenized_input in tokenized_buffer['input_ids']:
                # Extend the list of token IDs with end-of-sequence token IDs
                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            
            buffers.append(torch.tensor(all_token_ids))
        
        return buffers

In [33]:
def generate_fixed_dataset(streaming_dataset, num_examples):

    # Initialize an empty list to store the examples
    examples = []

    # Loop through the streaming dataset and collect examples
    for example in streaming_dataset:
        examples.append(example)
        
        # Break the loop once you have collected the desired number of examples
        if len(examples) == num_examples:
            break

    # Create a new Hugging Face dataset from the collected examples (the easiest way to do this is with Pandas)
    return datasets.Dataset.from_pandas(pd.DataFrame(examples))

In [35]:
dataset_name = 'transformersbook/codeparrot-train'
streaming_dataset = load_dataset(dataset_name, split="train", streaming=True)
fixed_dataset = generate_fixed_dataset(streaming_dataset, 100)

### 3.4 - Defining the training loop

We now have all the elements to write our training loop. One obvious limitation of training our own language model is the memory limit of the GPUs we will use. Even on a modern graphics card you can’t train a model at GPT-2 scale in reasonable time.

In this tutorial we will implement a training loop with PyTorch Lightning, which simplifies the process and makes it more scalable.

#### 3.4.1 - Create Lightning Data Module

In [47]:
class CodeParrotDataModule(pl.LightningDataModule):

    def __init__(
        self, 
        tokenizer, 
        seq_length, 
        num_of_sequences, 
        chars_per_token,
        train_batch_size,
        val_batch_size,
        shuffle_buffer,
        seed,
        num_workers, 
        debug
    ):
        super().__init__()
        self.tokenizer = tokenizer
        self.seq_length = seq_length
        self.num_of_sequences = num_of_sequences
        self.chars_per_token = chars_per_token
        self.train_batch_size = train_batch_size
        self.val_batch_size = val_batch_size
        self.shuffle_buffer = shuffle_buffer
        self.seed = seed
        self.num_workers = num_workers
        self.debug = debug

    # Multiple GPU
    def setup(self, stage):
        # Load data
        train_data = load_dataset('transformersbook/codeparrot-train', split="train", streaming=True)
        # This shuffles the Python files that will then be concatenated to form the "buffer" in the InfiniteConstantLengthDataset
        train_data = train_data.shuffle(buffer_size=self.shuffle_buffer, seed=self.seed)
        valid_data = load_dataset('transformersbook/codeparrot-valid', split="validation", streaming=True)
        
        # Create the InfiniteConstantLengthDataset instances, which use an internal buffer
        # to avoid the need for padding and the bias of dropping the last part of Python files.
        # Use the values stored on the module (not globals) so the constructor arguments take effect.
        self.train_ds = InfiniteConstantLengthDataset(
            self.tokenizer, train_data, text_attribute_name="content", seq_length=self.seq_length,
            num_of_sequences=self.num_of_sequences, chars_per_token=self.chars_per_token, debug=self.debug,
        )
        self.val_ds = InfiniteConstantLengthDataset(
            self.tokenizer, valid_data, text_attribute_name="content", seq_length=self.seq_length,
            num_of_sequences=self.num_of_sequences, chars_per_token=self.chars_per_token, debug=self.debug,
        )
        
    def train_dataloader(self):
        return DataLoader(
            dataset=self.train_ds,
            batch_size=self.train_batch_size,
            num_workers=self.num_workers
        )

    def val_dataloader(self):
        return DataLoader(
            dataset=self.val_ds,
            batch_size=self.val_batch_size,
            num_workers=self.num_workers
        )
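
The `shuffle(buffer_size=..., seed=...)` call in `setup()` cannot shuffle the whole stream, because a streaming dataset is never fully loaded. The following plain-Python sketch illustrates the general idea of buffer-based shuffling. It is an illustration under our own assumptions, not the `datasets` library's actual implementation, and `buffered_shuffle` is a hypothetical helper name.

```python
import random


def buffered_shuffle(stream, buffer_size, seed):
    """Approximately shuffle an iterable using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything
            buffer.append(item)
            continue
        # Yield a random buffered item and put the new item in its place
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: flush what is left in random order
    rng.shuffle(buffer)
    yield from buffer


files = [f"file_{i}.py" for i in range(10)]
shuffled = list(buffered_shuffle(files, buffer_size=4, seed=1337))
print(shuffled)
```

Every item is yielded exactly once, but an item can only move as far as the buffer allows: the first item yielded always comes from the first `buffer_size` items of the stream. A larger `shuffle_buffer` therefore mixes the files better, at the cost of more memory.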

#### 3.4.2 - Create Lightning Model Module

In [44]:
class GPT2LightningModule(pl.LightningModule):
    def __init__(self, tokenizer, learning_rate, model_checkpoint="gpt2"):
        super().__init__()
        self.tokenizer = tokenizer
        self.learning_rate = learning_rate

        config = AutoConfig.from_pretrained(model_checkpoint, vocab_size=len(tokenizer))
        self.model = AutoModelForCausalLM.from_config(config)

    def forward(self, input_ids, attention_mask=None, labels=None):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def _common_step(self, batch):
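        # The dataloader yields a plain tensor of token IDs (not a dictionary),
        # so there are no separate labels or attention_mask entries
        input_ids = batch
        # For causal language modeling the labels are the inputs themselves;
        # the model shifts them internally to predict the next token
        labels = batch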

        outputs = self(input_ids, labels=labels)
        loss = outputs.loss

        return loss

    def training_step(self, batch, batch_idx):
        loss = self._common_step(batch)

        self.log_dict(
            {
                "train_loss": loss,
            },
            on_step=False,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self._common_step(batch)

        self.log_dict(
            {
                "val_loss": loss,
            },
            on_step=False,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
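        # Accumulate the loss of every batch (assigning a new list here would keep only the last one)
        self.validation_step_outputs.append(loss.detach())

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer

    def on_validation_epoch_start(self):
        # Start every validation run with an empty list of batch losses
        self.validation_step_outputs = []

    def on_validation_epoch_end(self):
        # Perplexity is the exponential of the mean validation loss
        val_loss = torch.stack(self.validation_step_outputs).mean()
        perplexity = torch.exp(val_loss)
        self.log("val_perplexity", perplexity, prog_bar=True)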

    def generate_text(self, input_text, max_length=50, temperature=1.0):
        """
        Generate text given an initial input string using the trained GPT-2 model.

        Args:
            input_text (str): The initial input text.
            max_length (int): Maximum length of the generated text.
            temperature (float): Sampling temperature (higher values increase randomness).

        Returns:
            str: The generated text.
        """
        # Tokenize the input text and move it to the same device as the model
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt").to(self.device)

        with torch.no_grad():
            output = self.model.generate(
                input_ids,
                max_length=max_length,
                do_sample=True,  # temperature only has an effect when sampling
                temperature=temperature,
                pad_token_id=self.tokenizer.pad_token_id,
            )
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return generated_text

The `validation_step()` method does not need to return anything. PyTorch Lightning automatically aggregates the losses and other metrics that you log in `validation_step()` at the end of each epoch, so you can focus on computing and logging your metrics and let PyTorch Lightning take care of the rest.

However, there are some cases where you may want to return something from the `validation_step()` method. For example, you may want to return a list of predictions or other outputs that you can use in the `on_validation_epoch_end()` method.

If you do want to return something from the `validation_step()` method, you can simply add a return statement at the end of your method. For example:

```python
def validation_step(self, batch, batch_idx):
    # rest of the code
    ...

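    # Return the predictions. In this chapter the batch is a plain tensor of
    # token IDs, so we use a short prefix of each sequence as the prompt.
    return self.model.generate(batch[:, :32], max_new_tokens=50)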
```

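In this example, the `validation_step()` method returns the generated token IDs for each batch. Note that recent versions of PyTorch Lightning do not pass these return values to `on_validation_epoch_end()`. To use them there, for example to compute additional metrics such as the accuracy on a held-out validation set, you need to store them yourself, e.g. in an attribute such as `self.validation_step_outputs`.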
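
The perplexity logged in `on_validation_epoch_end()` is the exponential of the mean validation loss. The small numeric sketch below uses made-up loss values (they are not results from this model) to show why the mean should cover every evaluation batch rather than a single one.

```python
import math

# Hypothetical per-batch validation losses (made-up numbers for illustration)
batch_losses = [2.0, 1.0, 3.0]

# Perplexity over all batches: exp of the mean cross-entropy loss
mean_loss = sum(batch_losses) / len(batch_losses)
perplexity = math.exp(mean_loss)

# Perplexity computed from the last batch only, for comparison
last_batch_perplexity = math.exp(batch_losses[-1])

print(f"mean loss: {mean_loss:.2f}, perplexity: {perplexity:.2f}")
print(f"last batch only: {last_batch_perplexity:.2f}")
```

Because the exponential amplifies differences in the loss, a single noisy batch can give a very different perplexity from the average over all evaluation batches.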

#### 3.4.3 - Custom evaluation callback

Since we are working with infinite training and validation datasets, the idea is to run Y steps of evaluation after every X steps of training. PyTorch Lightning's `val_check_interval` lets us set X, and the Trainer's `limit_val_batches` argument can cap the number of validation batches. Here we instead start building a custom evaluation callback so that X and Y are handled in one place. For now, the callback only counts training batches and prints the counter every X batches; `evaluation_steps` is stored but not used yet.

In [40]:
class CustomEvaluationCallback(pl.Callback):
    def __init__(self, evaluate_every_n_batches, evaluation_steps):
        super().__init__()
        self.evaluate_every_n_batches = evaluate_every_n_batches
        self.evaluation_steps = evaluation_steps
        self.batch_counter = 0
    
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.batch_counter += 1

        if self.batch_counter % self.evaluate_every_n_batches == 0:
            print(self.batch_counter)
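
The schedule we are aiming for, Y evaluation steps after every X training steps on infinite streams, can be sketched in plain Python. This is only an illustration of the control flow, not PyTorch Lightning code; `interleave_eval` and its parameters are hypothetical names, and `itertools.count` stands in for the infinite dataloaders.

```python
from itertools import count, islice


def interleave_eval(train_batches, make_eval_batches, eval_every, eval_steps, max_train_steps):
    """Yield ("train", batch) events and, after every `eval_every` training
    steps, `eval_steps` ("eval", batch) events from a fresh evaluation stream."""
    for step, batch in enumerate(islice(train_batches, max_train_steps), start=1):
        yield ("train", batch)
        if step % eval_every == 0:
            # islice caps the infinite evaluation stream at `eval_steps` batches
            for eval_batch in islice(make_eval_batches(), eval_steps):
                yield ("eval", eval_batch)


# Infinite counters stand in for the infinite training and validation streams
events = list(
    interleave_eval(
        train_batches=count(0),
        make_eval_batches=lambda: count(100),
        eval_every=3,
        eval_steps=2,
        max_train_steps=6,
    )
)
print(events)
```

The key point is that both streams are infinite, so every loop needs an explicit cap: `max_train_steps` for training and `eval_steps` for each evaluation round.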

#### 3.4.4 - Training

##### Configuration

In [39]:
dataset_name = 'transformersbook/codeparrot-train'
tokenizer = new_tokenizer_larger
chars_per_token = calculate_characters_per_token(dataset_name, tokenizer, 500)

Calculating the average number of characters per token in the 'transformersbook/codeparrot-train' dataset


Repo card metadata block was not found. Setting CardData to empty.


Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

  0%|          | 0/500 [00:00<?, ?it/s]

In [46]:
seed = 1337
debug = False

# Data
seq_length = 1024 # GPT2 specific
num_of_sequences = 200 # The higher the better data quality
shuffle_buffer = 100
train_batch_size = 2
val_batch_size = 2
text_attribute_name = "content"

# Training
learning_rate = 5e-4
max_train_steps = 50000
eval_interval = 100 # every X training steps, we run evaluation
max_eval_steps = 10 # We run Y evaluation steps
precision = "16-mixed"
num_epochs = 3
num_workers  = multiprocessing.cpu_count() - 1

# Compute
compute_accelerator = "gpu"
compute_devices = 1
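
As a quick sanity check on these settings, we can estimate how many tokens the model would see. This sketch assumes that training runs for exactly `max_train_steps` optimizer steps, that every sequence is fully packed with `seq_length` tokens, and that there is no gradient accumulation.

```python
# Values copied from the configuration cell
seq_length = 1024
train_batch_size = 2
max_train_steps = 50_000
compute_devices = 1

# Tokens processed per optimizer step across all devices
tokens_per_step = seq_length * train_batch_size * compute_devices

# Total tokens seen if training runs for max_train_steps steps
total_tokens = tokens_per_step * max_train_steps

print(f"{tokens_per_step} tokens per step, {total_tokens / 1e6:.1f}M tokens in total")
```

With a batch size of 2 on a single GPU, each step covers only 2,048 tokens, so even 50,000 steps amount to roughly 100 million tokens.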

In [53]:
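# Requires `data_module` (created in the "Loop" cell below) to have been set up, e.g. by trainer.fit()
enumerator = enumerate(data_module.val_dataloader())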

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Too many dataloader workers: 3 (max is dataset.n_shards=1). Stopping 2 dataloader workers.



In [59]:
next(enumerator)

(5,
 tensor([[1310,  313, 1628,  ...,  382, 3434,  274],
         [1249,    8, 2393,  ...,  610,   59,   76]]))

##### Loop

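There seems to be an issue in how the fast tokenizers interact with PyTorch Lightning's dataloader workers: the logs repeatedly warn that the process was forked after tokenizer parallelism had already been used. This needs a deeper look.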
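
The warning itself suggests explicitly setting the `TOKENIZERS_PARALLELISM` environment variable. A minimal sketch of that suggestion follows. It assumes the variable is set before the tokenizer is first used (for example at the top of the notebook), and we have not verified that it resolves the issue.

```python
import os

# Disable the tokenizers library's internal parallelism, as suggested by the
# warning message. This must run before the tokenizer is first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```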

In [48]:
data_module = CodeParrotDataModule(
    tokenizer, seq_length, num_of_sequences, chars_per_token, train_batch_size, val_batch_size, shuffle_buffer, seed, num_workers, debug
)

model = GPT2LightningModule(tokenizer, learning_rate, "gpt2")

trainer = pl.Trainer(
    devices=compute_devices,
    accelerator=compute_accelerator,
    min_epochs=1,
    max_epochs=num_epochs,
    precision=precision,
    callbacks=[CustomEvaluationCallback(evaluate_every_n_batches=10, evaluation_steps=50)],
    # logger=logger,
    # profiler=profiler,
)

Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [49]:
trainer.fit(model, data_module)
# trainer.validate(model, data_module)
# trainer.test(model, data_module)

Repo card metadata block was not found. Setting CardData to empty.


Resolving data files:   0%|          | 0/183 [00:00<?, ?it/s]

Repo card metadata block was not found. Setting CardData to empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type            | Params
------------------------------------------
0 | model | GPT2LMHeadModel | 89.8 M
------------------------------------------
89.8 M    Trainable params
0         Non-trainable params
89.8 M    Total params
359.025   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Too many dataloader workers: 3 (max is dataset.n_shards=1). Stopping 2 dataloader workers.




Training: 0it [00:00, ?it/s]

10
20
30
40
50
60
70


  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
