# Introduction
The [t5-small on a single GPU](1. T5-Small on Single GPU) notebook walked through a straightforward case of fine-tuning a language model. However, you might have noticed that the training problem was still essentially structured as a supervised learning problem: we had a text (code snippet) and a desired completion. When training LLMs like the GPT models, labels are not provided manually. Instead, we use an approach called self-supervised learning, in which the objective is computed automatically from the inputs. One example of self-supervised learning is causal language modeling, where the task is to predict the next token (roughly, the next word) from the tokens that precede it. For example, the sentence "The boy hid behind the tree" would be decomposed into the following training tasks:
- Input: `The`, Target: `boy`
- Input: `The boy`, Target: `hid`
- Input: `The boy hid`, Target: `behind`
- Input: `The boy hid behind`, Target: `the`
- Input: `The boy hid behind the`, Target: `tree`
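
The decomposition above can be sketched in a few lines of Python. This is only an illustration: it splits on whitespace for readability, whereas a real model works on subword tokens, and `next_word_pairs` is a hypothetical helper name rather than part of any library.

```python
def next_word_pairs(sentence):
    """Split a sentence into (input prefix, target word) pairs.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    words = sentence.split()
    # Each prefix of the sentence is an input; the word that follows it is the target.
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("The boy hid behind the tree")
for prefix, target in pairs:
    print(f"Input: {prefix!r}, Target: {target!r}")
```

In practice these pairs are rarely materialized one by one. Training code typically shifts the token sequence by one position so that every prefix is scored in a single forward pass.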

This requires us to preprocess our data and pass it to the model somewhat differently, which is the subject of this notebook. We will still limit this example to training on a single GPU (an A10 with 24GB VRAM). We will use the [gpt2](https://huggingface.co/gpt2) model with 124M parameters. Later, we will work through Eleuther's [Transformer Math blog post](https://blog.eleuther.ai/transformer-math/#training) to understand the memory costs associated with training this model under different conditions and verify that they match our experience. Hugging Face also provides a guide to [model memory anatomy](https://huggingface.co/docs/transformers/model_memory_anatomy).

According to the Hugging Face post, a good heuristic for mixed-precision training is that we need around 18 bytes of VRAM per model parameter, plus additional memory for activations (which depends on sequence length, batch size, and various model architecture details). For the 124M-parameter gpt2 model, that translates to around 2GB VRAM + activations.
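
The arithmetic behind the roughly 2GB figure can be sketched as follows. This is a back-of-the-envelope estimate, assuming the per-parameter breakdown given in the Hugging Face guide for mixed-precision training with AdamW (6 bytes for weights, 8 for optimizer states, 4 for gradients). The function name is hypothetical, and activation memory is deliberately left out.

```python
# Assumed bytes per parameter for mixed-precision AdamW training,
# following the Hugging Face model memory anatomy guide.
BYTES_PER_PARAM = {
    "weights (fp32 copy + fp16 copy)": 6,
    "AdamW optimizer states": 8,
    "gradients": 4,
}


def estimate_training_memory_gb(num_params):
    """Estimate VRAM in GB for weights, optimizer states, and gradients only."""
    bytes_per_param = sum(BYTES_PER_PARAM.values())  # 18 bytes in total
    return num_params * bytes_per_param / 1e9


gpt2_params = 124_000_000  # gpt2 parameter count quoted above
estimate = estimate_training_memory_gb(gpt2_params)
print(f"Estimated memory excluding activations: {estimate:.2f} GB")
```

Actual usage will be higher once activations, temporary buffers, and framework overhead are included, which is why measuring memory during training is still worthwhile.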

# Topics Covered in this Notebook
The major difference between this example and the t5-small example is the focus on self-supervised learning. Additionally, this notebook will go a little deeper into:
- monitoring training metrics with MLflow
- measuring memory usage

Before progressing to multi-GPU and multi-node training, we will also explore ways to improve training efficiency on a single GPU with techniques such as mixed-precision training.

# Choosing a Fine-Tuning Task
We will fine-tune GPT-2 on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset. TinyStories is:

> a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4.

and can be used to train small models (actually quite a bit smaller than GPT-2) that

> still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.

([Source](https://arxiv.org/abs/2305.07759))

We can evaluate the model by passing prompts such as this example from the TinyStories paper:

> Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say

and evaluating the grammar, consistency, and creativity of the output. We hope to see improvements in these areas after training.