# Pre-training GPT-2 on Custom Data

Causal language models are frequently used for text generation. You can use these models for creative applications like
choose-your-own text adventures or for intelligent coding assistants like Copilot or CodeParrot. Pre-training teaches the model (whether a large or a small language model) to be a good text autocompleter: it goes through pages and pages of text and repeatedly solves the same classic task, predicting what comes next.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on
the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
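
To make the "left-only" attention concrete, here is a minimal sketch in plain Python (no deep learning library). It builds the kind of lower-triangular mask that a causal model applies so that each position can see itself and earlier positions only. The function name `causal_mask` is an illustrative assumption, not part of any library.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is 1 if
    position i may attend to position j (that is, j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row has ones up to and including its own position and zeros after it, which is exactly the "cannot see future tokens" rule.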

This guide will show you how to:

1. Further pre-train GPT-2 (the `gpt2` checkpoint) on `dipanjanS/eli5_tech_data`, a set of Technology questions and answers in the format of the [ELI5](https://huggingface.co/datasets/eli5) dataset.
2. Use your trained model for inference and compare it with the base GPT-2 model.

<Tip>
You can finetune other architectures for causal language modeling following the same steps in this guide.
Choose one of the following architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
[BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert), [Bert Generation](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert-generation), [BigBird](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/big_bird), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [BioGpt](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/biogpt), [Blenderbot](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot), [BlenderbotSmall](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/blenderbot-small), [BLOOM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bloom), [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert), [CodeGen](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/codegen), [CPM-Ant](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/cpmant), [CTRL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ctrl), [Data2VecText](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/data2vec-text), [ELECTRA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie), [GIT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/git), [GPT-Sw3](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt-sw3), [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2), [GPTBigCode](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_bigcode), [GPT Neo](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neox), [GPT NeoX 
Japanese](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neox_japanese), [GPT-J](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptj), [LLaMA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/llama), [Marian](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/marian), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MEGA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/megatron-bert), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [OpenLlama](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/open-llama), [OpenAI GPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/openai-gpt), [OPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/opt), [Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/pegasus), [PLBart](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/plbart), [ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/prophetnet), [QDQBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/qdqbert), [Reformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/reformer), [RemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/rembert), [RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta), [RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta-prelayernorm), [RoCBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roc_bert), [RoFormer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roformer), [RWKV](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/rwkv), 
[Speech2Text2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/speech_to_text_2), [Transformer-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/transfo-xl), [TrOCR](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/trocr), [XGLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xglm), [XLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm), [XLM-ProphetNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-prophetnet), [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta), [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta-xl), [XLNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlnet), [X-MOD](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xmod)

</Tip>

In [1]:
import torch
torch.cuda.empty_cache()

## Load ELI5 dataset

We will use an open dataset of questions asked, and answers given, by people on a public forum.

While our focus is on building a model that simply generates text (not on task-specific fine-tuning), we will show that the model can pick up the pattern of generating an answer given a question, based on the format of our input data.

In [2]:
from datasets import load_dataset

eli5 = load_dataset("dipanjanS/eli5_tech_data", split="train", 
                    trust_remote_code=True)

In [3]:
eli5

Dataset({
    features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
    num_rows: 14034
})

Split the loaded `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [4]:
eli5 = eli5.train_test_split(test_size=0.2)

In [5]:
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 11227
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 2807
    })
})

Then take a look at an example:

In [6]:
eli5["train"][0]

{'q_id': '6caba6',
 'title': "why aren't oil dipsticks white",
 'selftext': "All oil dipsticks I've ever seen are black plastic or dark metal. Why not make an white to contrast against the black oil?",
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dht3yuh'],
  'text': ["Because it's in the engine compartment, it's going to get covered in oil and grime. So they are either black or orange or yellow. Also you read them on the stick, not the cap, and the stick is metal because it goes into the engine and gets very hot. A plastic colored stick wouldn't stand up to the conditions."],
  'score': [7],
  'text_urls': [[]]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

We will focus on the `title` and `text` fields in this tutorial.

## Data Preparation for Pre-training

The next step is to load the GPT-2 tokenizer to process the data fields mentioned above:

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to
extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method:

In [8]:
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '6caba6',
 'title': "why aren't oil dipsticks white",
 'selftext': "All oil dipsticks I've ever seen are black plastic or dark metal. Why not make an white to contrast against the black oil?",
 'category': 'Technology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dht3yuh'],
 'answers.text': ["Because it's in the engine compartment, it's going to get covered in oil and grime. So they are either black or orange or yellow. Also you read them on the stick, not the cap, and the stick is metal because it goes into the engine and gets very hot. A plastic colored stick wouldn't stand up to the conditions."],
 'answers.score': [7],
 'answers.text_urls': [[]],
 'title_urls': ['url'],
 'selftext_urls': ['url']}

In [9]:
eli5_train = eli5["train"].to_pandas()
eli5_val = eli5["test"].to_pandas()

We flattened the dataset to remove the nesting; after converting the splits to pandas DataFrames, the relevant fields for each user post are visible below:

In [10]:
eli5_train.head()

Unnamed: 0,q_id,title,selftext,category,subreddit,answers.a_id,answers.text,answers.score,answers.text_urls,title_urls,selftext_urls
0,6caba6,why aren't oil dipsticks white,All oil dipsticks I've ever seen are black pla...,Technology,explainlikeimfive,[dht3yuh],"[Because it's in the engine compartment, it's ...",[7],[[]],[url],[url]
1,e80m4d,Why would a video of a still image take up les...,I figured that the lines still have to be draw...,Technology,explainlikeimfive,[fa8d4zq],[Videos use something called [inter-frame comp...,[10],[[https://en.wikipedia.org/wiki/Inter_frame]],[url],[url]
2,7behyz,how do “smart” products save on electricity bi...,As above. Smart lighting is supposed to save o...,Technology,explainlikeimfive,"[dphblli, dphbdc3]",[The amount that these devices consume for the...,"[9, 3]","[[], []]",[url],[url]
3,aoclg9,Why can my phone last so long on 1%?,,Technology,explainlikeimfive,[efzww17],[The phone doesn’t actually know the exact amo...,[6],[[]],[url],[url]
4,5m1mdt,"Why is Facebook advertising it's ""Facebook Liv...",It seems like it's on TV a bunch and I get not...,Technology,explainlikeimfive,"[dc044yt, dc031wv]",[If the social media market starts to gravitat...,"[7, 5]","[[], []]",[url],[url]


We will now prepare our dataset in a specific format so the model can learn from it through language modeling.

Language modeling is where the model learns to predict the next token (roughly, the next word or word piece) given the preceding sequence, working through the whole text one step at a time.
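
As a small illustration of "one step at a time", the sketch below lists the (context, target) pairs that next-token prediction produces for a short sequence. It uses whole words for readability, whereas GPT-2 actually works on subword tokens; the helper name `next_token_pairs` is an assumption made for this example.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix of the sequence is used
    to predict the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


words = ["why", "is", "the", "sky", "blue"]
for context, target in next_token_pairs(words):
    print(context, "->", target)
```

A sequence of five tokens therefore yields four prediction steps, each with a longer context than the one before.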

Here we format our data so that the model can learn to generate an answer given a user question. Each training document has this format:

```    
<|query|> User question text goes here <|answer|> Answer of the question goes here <|end|>
```

The model goes through this text token by token, trying to predict the next token at each step.

The idea here is not to build a QA model but to show that, even with a small amount of data, LLMs can pick up patterns and learn that:

- The question sits between the `<|query|>` and `<|answer|>` markers (plain strings that the GPT-2 tokenizer splits into several ordinary tokens, not special tokens)
- The answer starts right after the `<|answer|>` marker and ends at the `<|end|>` marker
- Once the model recognizes this pattern, giving it only a question between `<|query|>` and `<|answer|>` prompts it to generate an answer

Here we pre-train on very little data. Ideally, the goal of pre-training is not to solve a specific problem like QA but to make the model learn words, context, meaning, and so on (by updating its weights and embeddings).
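
The format can be captured in two small helpers: one that builds a full training document and one that builds the inference prompt, which stops right after `<|answer|>`. This is a minimal sketch; the helper names and the sample answer text are placeholders chosen for illustration.

```python
QUERY, ANSWER, END = "<|query|>", "<|answer|>", "<|end|>"


def build_document(question, answer):
    """Full training document: question and answer wrapped in the markers."""
    return f"{QUERY}{question}{ANSWER}{answer}{END}"


def build_prompt(question):
    """Inference prompt: stop right after <|answer|> so the model continues."""
    return f"{QUERY}{question}{ANSWER}"


# The answer below is a made-up placeholder, not taken from the dataset.
doc = build_document("How does USB work?", "It moves data over a cable.")
print(doc)
print(build_prompt("How does USB work?"))
```

Because the prompt is a prefix of the training document, the model sees at inference time exactly the kind of context it saw during training.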

In [11]:
docs = []
for idx, row in eli5_train.iterrows():
    doc = '<|query|>'+row['title']+'<|answer|>'+row['answers.text'][0]+'<|end|>'
    docs.append(doc)

docs[:3]

["<|query|>why aren't oil dipsticks white<|answer|>Because it's in the engine compartment, it's going to get covered in oil and grime. So they are either black or orange or yellow. Also you read them on the stick, not the cap, and the stick is metal because it goes into the engine and gets very hot. A plastic colored stick wouldn't stand up to the conditions.<|end|>",
 '<|query|>Why would a video of a still image take up less space than a video with moving frames?<|answer|>Videos use something called [inter-frame compression]( URL_0 ) to shrink the videos. Instead of storing each frame in full, they store only some frames in full, and between them they store inter frames which only contain the difference from the last frame. Since a lot of videos have shots where the next frame is very similar to the previous one, this allows many inter frames which just have information about how to move and re-color a few chunks of pixels. For a video that\'s a still image, you could have a single fr

In [12]:
eli5_train['data'] = docs

In [13]:
docs = []
for idx, row in eli5_val.iterrows():
    doc = '<|query|>'+row['title']+'<|answer|>'+row['answers.text'][0]+'<|end|>'
    docs.append(doc)
eli5_val['data'] = docs

In [14]:
from datasets import Dataset
eli5['train'] = Dataset.from_pandas(eli5_train)
eli5['test'] = Dataset.from_pandas(eli5_val.sample(64, random_state=42))

In [15]:
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'answers.text_urls', 'title_urls', 'selftext_urls', 'data'],
        num_rows: 11227
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'answers.text_urls', 'title_urls', 'selftext_urls', 'data', '__index_level_0__'],
        num_rows: 64
    })
})

We will now tokenize each text document by passing it to the GPT-2 tokenizer. Remember that the tokenizer assigns an integer ID (`input_id`) to each token, and each ID maps to a specific embedding vector in the model's embedding layer. The embedding layer is nothing more than a matrix of embeddings, in which the row index of each vector equals the ID of the corresponding token.
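
The ID-to-row relationship can be sketched with a toy vocabulary and a tiny embedding matrix in plain Python. Everything here is made up for illustration: the words, the IDs, and the two-dimensional vectors. The real GPT-2 tokenizer works on subword pieces and its embedding matrix is far larger.

```python
# Toy vocabulary and embedding matrix; all values are invented for this example.
vocab = {"why": 0, "is": 1, "sky": 2, "blue": 3}
embedding_matrix = [
    [0.1, 0.3],  # row 0 -> "why"
    [0.7, 0.2],  # row 1 -> "is"
    [0.5, 0.9],  # row 2 -> "sky"
    [0.4, 0.6],  # row 3 -> "blue"
]


def encode(words):
    """Map each word to its integer ID."""
    return [vocab[w] for w in words]


def embed(ids):
    """Look up the embedding row whose index equals each token ID."""
    return [embedding_matrix[i] for i in ids]


ids = encode(["sky", "is", "blue"])
print(ids)
print(embed(ids))
```

Training adjusts the numbers stored in the rows; the mapping from token to ID stays fixed.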

In [16]:
def preprocess_function(examples):
    return tokenizer(examples['data'], max_length=256, truncation=True)

In [17]:
preprocess_function(eli5['train'][:2])

{'input_ids': [[27, 91, 22766, 91, 29, 22850, 3588, 470, 3056, 19550, 34810, 2330, 27, 91, 41484, 91, 29, 8128, 340, 338, 287, 262, 3113, 26247, 11, 340, 338, 1016, 284, 651, 5017, 287, 3056, 290, 1036, 524, 13, 1406, 484, 389, 2035, 2042, 393, 10912, 393, 7872, 13, 4418, 345, 1100, 606, 319, 262, 4859, 11, 407, 262, 1451, 11, 290, 262, 4859, 318, 6147, 780, 340, 2925, 656, 262, 3113, 290, 3011, 845, 3024, 13, 317, 7309, 16396, 4859, 3636, 470, 1302, 510, 284, 262, 3403, 29847, 91, 437, 91, 29], [27, 91, 22766, 91, 29, 5195, 561, 257, 2008, 286, 257, 991, 2939, 1011, 510, 1342, 2272, 621, 257, 2008, 351, 3867, 13431, 30, 27, 91, 41484, 91, 29, 53, 4921, 779, 1223, 1444, 685, 3849, 12, 14535, 19794, 16151, 10289, 62, 15, 1267, 284, 22085, 262, 5861, 13, 5455, 286, 23069, 1123, 5739, 287, 1336, 11, 484, 3650, 691, 617, 13431, 287, 1336, 11, 290, 1022, 606, 484, 3650, 987, 13431, 543, 691, 3994, 262, 3580, 422, 262, 938, 5739, 13, 4619, 257, 1256, 286, 5861, 423, 6934, 810, 262, 1306, 573

The tokenizer also returns an `attention_mask`, which marks real tokens with 1 and padding with 0. We don't need to handle it ourselves here: nothing has been padded yet, and the data collator takes care of padding later.

You can get the original text back from the token IDs with the tokenizer's `decode` function:

In [18]:
[tokenizer.decode(tokens) for tokens in preprocess_function(eli5['train'][:2])['input_ids']]

["<|query|>why aren't oil dipsticks white<|answer|>Because it's in the engine compartment, it's going to get covered in oil and grime. So they are either black or orange or yellow. Also you read them on the stick, not the cap, and the stick is metal because it goes into the engine and gets very hot. A plastic colored stick wouldn't stand up to the conditions.<|end|>",
 '<|query|>Why would a video of a still image take up less space than a video with moving frames?<|answer|>Videos use something called [inter-frame compression]( URL_0 ) to shrink the videos. Instead of storing each frame in full, they store only some frames in full, and between them they store inter frames which only contain the difference from the last frame. Since a lot of videos have shots where the next frame is very similar to the previous one, this allows many inter frames which just have information about how to move and re-color a few chunks of pixels. For a video that\'s a still image, you could have a single fr

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need:

In [19]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    remove_columns=eli5["train"].column_names,
)


In [20]:
tokenized_eli5

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 11227
    })
    test: Dataset({
        features: ['__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 64
    })
})

Now create a batch of examples using [DataCollatorForLanguageModeling](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling). It's more efficient to *dynamically pad* the
sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

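Use the end-of-sequence token as the padding token and set `mlm=False`. With this setting the collator uses a copy of the inputs as the labels; the shift by one position, so that each token is predicted from the tokens before it, happens inside the model when it computes the loss: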

In [21]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=False)
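
To show what dynamic padding means, here is a simplified sketch in plain Python of what such a collator does conceptually. It is not the library's implementation: the function name `collate` is an assumption, and this version only ignores the padded positions in the labels. The value `-100` is the label that PyTorch's cross-entropy loss skips by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def collate(batch, pad_id):
    """Pad every sequence to the longest one in this batch and build
    attention masks and labels for causal language modeling."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask, labels = [], [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels are a copy of the inputs; padded positions are ignored in the loss.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate([[5, 6, 7], [8, 9]], pad_id=0)
print(batch)
```

Padding only to the longest sequence in each batch keeps short batches short, which is why this is cheaper than padding the whole dataset to a fixed length.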

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the [basic tutorial](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! 

Load GPT-2 with [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM):

In [22]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("gpt2",
                                             device_map='cuda')

In [23]:
# just to show that the special symbol is just a collection of tokens
tokenizer.encode('<|answer|>')

[27, 91, 41484, 91, 29]

## Test Model without Training

Let's send some questions to the model before training it on our data and see how well it works out of the box.

Remember that this version of GPT-2 has already been pre-trained on data from the internet, so some results may be meaningful. A brand-new model without any training would perform even worse!

In [24]:
for doc in tokenized_eli5['test'].select(range(3)):
    # extract only the question part of the document
    prompt_txt = tokenizer.decode(doc['input_ids']).split('<|answer|>')[0]+'<|answer|>'
    prompt_txt_ids = tokenizer(prompt_txt, return_tensors="pt").to('cuda').input_ids
    # show the text generated by the model
    outputs = model.generate(prompt_txt_ids,
                             max_new_tokens=256,
                             pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0]))
    print()

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|query|>During the time where the light bulb was invented, how did they have accessible electricity? Did they have power plants like they do now? Did they use some sort of battery?<|answer|>The answer is yes, they did. They used electricity to power their lamps, and they used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to power their lamps. They used electricity to pow

The model performs poorly: it falls into a loop and keeps repeating the same sentence, because it has not yet learned the specific pattern of our data.

Let's try some other random line of text now.

In [25]:
prompt_txt = "The game of cricket is"
prompt_txt_ids = tokenizer(prompt_txt, return_tensors="pt").to('cuda').input_ids
# show the text generated by the model
outputs = model.generate(prompt_txt_ids,
                         max_new_tokens=256,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

The game of cricket is a game of skill, and the game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of skill. The game of cricket is a game of ski

This prompt gets a somewhat more meaningful start, but it doesn't solve our problem yet. Let's get to pre-training GPT-2 on our custom data!

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model.

2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, datasets, and data collator.
   
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to further pre-train your GPT-2 model on the data you prepared earlier.

In [26]:
# rough estimate, assuming a batch size of 64
# and about 11127 training documents (the train split above actually has 11227 rows):
# how many steps (batches of data) does it take to complete 1 full epoch?
11127 // 64

173

In [27]:
# approximate total steps to run two epochs?
173*2

346
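
For an exact count, the sketch below uses the 11227 rows of the train split and rounds up, since a final partial batch still counts as one step. It assumes a single device, no gradient accumulation, and that the last incomplete batch is not dropped; the helper name `steps_per_epoch` is illustrative.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches in one pass over the data; a final partial batch
    still counts as a step (assumes it is not dropped)."""
    return math.ceil(num_examples / batch_size)


per_epoch = steps_per_epoch(11227, 64)  # 11227 rows in the train split above
print(per_epoch)
print(346 / per_epoch)                  # epochs covered by 346 steps
```

With 176 steps per epoch, 346 steps cover slightly less than two epochs (about 1.97), so the rough estimate above is close enough for this tutorial.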

## Setup Training Config Settings

In [28]:
from transformers import Trainer, TrainingArguments

# Set up the training arguments
# a step is a batch of data going in
training_args = TrainingArguments(
    output_dir="gpt2-runs",                # Directory where the model checkpoints and outputs will be saved.
    eval_strategy="steps",                 # Perform evaluation at regular intervals during training.
    learning_rate=2e-4,                    # Initial learning rate for the optimizer.
    weight_decay=0.01,                     # Apply weight decay to reduce overfitting.
    per_device_train_batch_size=64,        # Batch size per GPU/TPU core/CPU during training.
    per_device_eval_batch_size=32,         # Batch size per GPU/TPU core/CPU during evaluation.
    save_strategy="steps",                 # Save the model checkpoint at regular intervals.
    logging_steps=10,                      # Log training metrics every 10 steps.
    save_steps=75,                         # Save the model checkpoint every 75 steps.
    max_steps=346,                         # Stop training after 346 total steps.
    eval_steps=10                          # Perform evaluation every 10 steps.
)

# Set up the trainer
trainer = Trainer(
    model=model,                           # The model to be trained.
    args=training_args,                    # Training arguments defined above.
    train_dataset=tokenized_eli5["train"], # The training dataset.
    eval_dataset=tokenized_eli5["test"],   # The evaluation dataset.
    data_collator=data_collator,           # Function to dynamically pad the input sequences.
)
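
To get a feel for how the step-based settings interact, this sketch lists the steps at which checkpoints and evaluations would fire for the intervals configured above. It assumes events fire on exact multiples of the interval; depending on the library version, the Trainer may also save or evaluate at the very last step. The helper name `firing_steps` is an assumption for this example.

```python
def firing_steps(interval, max_steps):
    """Steps at which an event configured 'every `interval` steps' fires,
    assuming it fires on exact multiples of the interval."""
    return list(range(interval, max_steps + 1, interval))


max_steps = 346
print("checkpoints:", firing_steps(75, max_steps))
print("evaluations:", len(firing_steps(10, max_steps)))
```

Evaluating every 10 steps gives a detailed loss curve but adds overhead; a larger `eval_steps` would speed up the run at the cost of fewer validation points.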


## Pre-train GPT-2 now!

Run and wait for around 6-8 minutes on a 48GB GPU.

In [29]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss,Validation Loss
10,3.4532,3.187427
20,3.2732,3.168335
30,3.2377,3.153166
40,3.25,3.157581
50,3.2163,3.141802
60,3.2135,3.1281
70,3.1809,3.128158
80,3.1648,3.127456
90,3.1877,3.116232
100,3.1737,3.114629


TrainOutput(global_step=346, training_loss=3.0914470694657696, metrics={'train_runtime': 387.5171, 'train_samples_per_second': 57.143, 'train_steps_per_second': 0.893, 'total_flos': 2888191475712000.0, 'train_loss': 3.0914470694657696, 'epoch': 1.9659090909090908})
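
A common way to read these loss values is to convert them to perplexity, the exponential of the mean cross-entropy loss. The sketch below does this for the first and last validation losses shown in the table above; the helper name `perplexity` is an assumption, and the calculation presumes the loss is reported in nats, which is the usual case for cross-entropy in PyTorch.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss
    (loss assumed to be in nats)."""
    return math.exp(loss)


first, last = 3.187427, 3.114629  # validation loss at steps 10 and 100 above
print(round(perplexity(first), 2), round(perplexity(last), 2))
```

A lower perplexity means the model is, on average, less "surprised" by the held-out text, so the small drop in validation loss corresponds to a modest but real improvement.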

## Save your custom pre-trained GPT-2 Model Locally

Great, now that you've further pre-trained the model, you can use it for inference!

First, save the model and tokenizer locally so you can reload them later:

In [30]:
model.save_pretrained("gpt2-pretrained-custom")
tokenizer.save_pretrained("gpt2-pretrained-custom")

('gpt2-pretrained-custom/tokenizer_config.json',
 'gpt2-pretrained-custom/special_tokens_map.json',
 'gpt2-pretrained-custom/vocab.json',
 'gpt2-pretrained-custom/merges.txt',
 'gpt2-pretrained-custom/added_tokens.json',
 'gpt2-pretrained-custom/tokenizer.json')

In [31]:
# delete the checkpoints as you don't need them
!rm -rf gpt2-runs



## Load and Test Custom GPT-2 on some New Data vs. Base GPT-2

Let's see how our custom GPT-2 model, pre-trained on custom data, autocompletes text compared with the base GPT-2 model.

In [32]:
base_model = AutoModelForCausalLM.from_pretrained("gpt2", 
                                                  device_map='cuda')
custom_pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2-pretrained-custom", 
                                                               device_map='cuda')

In [33]:
queries = ['How does USB work?', 
           'Explain what is a RPG?',
           'How do DVDs work?', 
           'How to choose a good video game?',
           'How does the internet work?']

In [34]:
for query in queries:
    prompt = '<|query|>'+query+'<|answer|>'
    tokenized_prompt = tokenizer(prompt, return_tensors="pt").to('cuda').input_ids
    outputs = custom_pretrained_model.generate(tokenized_prompt,
                             tokenizer=tokenizer,
                             max_new_tokens=128,
                             pad_token_id=tokenizer.eos_token_id,
                             stop_strings=["<|end|>"])
    print('Custom GPT-2 response:', tokenizer.decode(outputs[0]))
    
    print('-'*100)
    
    outputs = base_model.generate(tokenized_prompt,
                                  tokenizer=tokenizer,
                                  max_new_tokens=128,
                                  pad_token_id=tokenizer.eos_token_id,
                                  stop_strings=["<|end|>"])
    print('Base GPT-2 response:', tokenizer.decode(outputs[0]))
    
    print('='*100)
    print()

Custom GPT-2 response: <|query|>How does USB work?<|answer|>USB is a protocol that allows you to connect to a computer via a USB port. It's like a telephone line. You connect to a computer via a USB port and it connects to the computer via a USB port. The computer then sends a signal to the USB port and the computer sends a signal to the USB port. The computer then sends a signal to the USB port and the computer sends a signal to the USB port. The computer then sends a signal to the USB port and the computer sends a signal to the USB port. The computer then sends a signal to the USB port and the computer sends a signal to the USB port.
----------------------------------------------------------------------------------------------------
Base GPT-2 response: <|query|>How does USB work?<|answer|>How does USB work?<|answer|>How does USB work?<|answer|>How does USB work?<|answer|>How does USB work?<|answer|>How does USB work?<|answer|>How does USB work?<|answer|>How does USB work?<|answer|>H
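
Since the decoded output contains the prompt and the markers, you may want to pull out just the answer text. Here is a minimal sketch in plain Python; the helper name `extract_answer` and the shortened sample string are assumptions made for illustration.

```python
def extract_answer(generated, answer_tag="<|answer|>", end_tag="<|end|>"):
    """Return the text between the first answer tag and the following end tag.
    If the end tag never appears (for example, generation hit the token
    limit), return everything after the answer tag. If the answer tag is
    missing, return an empty string."""
    _, _, after = generated.partition(answer_tag)
    answer, _, _ = after.partition(end_tag)
    return answer.strip()


# Shortened, made-up sample in the same format as the model output above.
sample = "<|query|>How does USB work?<|answer|>USB is a protocol.<|end|>"
print(extract_answer(sample))
```

Using `str.partition` keeps the helper safe on outputs that were cut off before `<|end|>` was generated.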

You can see a clear difference between the two models. Our custom GPT-2 model has learned where the question ends and starts generating answer text right after the `<|answer|>` marker, whereas the base model just repeats the prompt.

Of course, after a while it starts repeating itself, as it has not been trained on enough data or for long enough.
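
The decoding strategy also plays a part in the repetition. The `generate` calls above pass no sampling options, so, assuming the default settings, the model picks the single most likely next token at every step (greedy decoding). The toy sketch below uses made-up probabilities that depend only on the previous token, which is far simpler than GPT-2, but it shows how always taking the top choice can get stuck in a cycle.

```python
# Made-up next-token probabilities for a tiny vocabulary (illustrative only).
NEXT = {
    "they": {"used": 0.9, "had": 0.1},
    "used": {"electricity": 0.8, "lamps": 0.2},
    "electricity": {".": 0.6, "daily": 0.4},
    "daily": {".": 1.0},
    ".": {"they": 0.7, "<end>": 0.3},
}


def greedy_decode(start, max_new_tokens):
    """Always pick the most probable next token (greedy decoding)."""
    tokens = [start]
    for _ in range(max_new_tokens):
        options = NEXT.get(tokens[-1])
        if not options:  # no known continuation: stop early
            break
        tokens.append(max(options, key=options.get))
    return tokens


print(" ".join(greedy_decode("they", 12)))
```

Because `<end>` is never the most likely choice, the loop never exits on its own. Options of `generate` such as `do_sample=True`, `repetition_penalty`, or `no_repeat_ngram_size` are worth experimenting with to reduce this effect.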

Remember that pre-training a model requires A LOT OF DATA. The intent is NOT to solve a specific task like QA but to go through a lot of text and do language modeling: predict the next token given the previous tokens, make mistakes, update the model weights and embeddings, and learn which words and contexts are related to each other.

Base or foundation LLMs released by companies are typically trained in this way.

### Optional: Upload your model to HuggingFace Hub

If you want to share your model with others, log in and then uncomment and run the `push_to_hub` lines below.
This step is optional.

In [35]:
from huggingface_hub import notebook_login

notebook_login()


In [36]:
# Optional you can push your model to huggingface hub
# But remember to change the text below to your account/model_name
# model.push_to_hub('dipanjanS/gpt2-pretrained-custom-eli5tech')
# tokenizer.push_to_hub('dipanjanS/gpt2-pretrained-custom-eli5tech')


CommitInfo(commit_url='https://huggingface.co/dipanjanS/gpt2-pretrained-custom-eli5tech/commit/924af3d1d7fa219832825462fddd31b7fdef69dd', commit_message='Upload tokenizer', commit_description='', oid='924af3d1d7fa219832825462fddd31b7fdef69dd', pr_url=None, repo_url=RepoUrl('https://huggingface.co/dipanjanS/gpt2-pretrained-custom-eli5tech', endpoint='https://huggingface.co', repo_type='model', repo_id='dipanjanS/gpt2-pretrained-custom-eli5tech'), pr_revision=None, pr_num=None)