# The illustrated GPT-2

<a href="https://jalammar.github.io/illustrated-gpt2/"><b>Original article</b></a>

In 2019, [OpenAI's GPT-2](https://openai.com/blog/better-language-models/) showed an impressive ability to write coherent and passionate essays, exceeding what we expected language models to be able to produce at the time. GPT-2 wasn't a particularly novel architecture. Its architecture is very similar to the decoder-only transformer; the main differences lie in the size of the model and the size of the dataset used to train it.

GPT-2 was trained on a massive 40 GB dataset called WebText, which the OpenAI researchers crawled from the internet as part of the research effort. With respect to model size, the smallest trained GPT-2 variant takes up about 500 MB of storage to hold all of its parameters. The largest variant is 13 times that size, so it could take up more than 6.5 GB of storage.

<img title="" src="images_igpt2/gpt2-sizes.png" alt="" width="700" data-align="center">

<img title="" src="images_igpt2/gpt2-sizes-hyperparameters-3.png" alt="" width="700" data-align="center">
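As a rough sanity check on these storage figures, here is a back-of-the-envelope sketch. It assumes the parameter counts reported in the GPT-2 paper (117M, 345M, 762M, and 1542M, which are not stated in the text above) and 4 bytes per parameter (32-bit floats). Real checkpoints carry some extra overhead, so the results only land in the same ballpark as the numbers quoted above.

```python
# Parameter counts as reported in the GPT-2 paper (assumption: not stated in this article).
PARAM_COUNTS = {
    "small": 117_000_000,
    "medium": 345_000_000,
    "large": 762_000_000,
    "extra-large": 1_542_000_000,
}

BYTES_PER_PARAM = 4  # assumption: every parameter is stored as a 32-bit float


def estimate_size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


for name, count in PARAM_COUNTS.items():
    ratio = count / PARAM_COUNTS["small"]
    print(f"{name:>12}: ~{estimate_size_mb(count):,.0f} MB ({ratio:.1f}x the smallest)")
```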

## GPT-2 model architecture

### One difference from BERT

The GPT-2 is built using transformer decoder blocks. BERT, on the other hand, uses transformer encoder blocks. In addition to this, **one key difference between the two is that GPT-2, like traditional language models, outputs one token at a time**. The way these models actually work is that after each token is produced, that token is added to the sequence of inputs. And that new sequence becomes the input to the model in its next step. This idea is called "auto-regression". This is one of the ideas that made [RNNs unreasonably effective](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).

GPT-2 and some later models, like TransformerXL and XLNet, are auto-regressive in nature. BERT is not. That is a trade-off: in losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to get better results. XLNet brings back auto-regression while finding an alternative way to incorporate the context on both sides.

<img title="" src="images_igpt2/gpt-2-autoregression-2.gif" alt="" width="600" data-align="center">
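The auto-regressive loop described above can be sketched in a few lines. This is only an illustration: `toy_next_token` is a hypothetical stand-in for a trained model (it looks at the last token only, whereas a real model conditions on the whole sequence), and the lookup table is made up.

```python
from typing import Callable, List

# Hypothetical stand-in for a trained model: a fixed table of continuations.
TOY_CONTINUATIONS = {
    "<s>": "robot",
    "robot": "must",
    "must": "obey",
    "obey": "orders",
    "orders": "<|endoftext|>",
}


def toy_next_token(tokens: List[str]) -> str:
    """Pick the next token from the last one only (a real model uses the full sequence)."""
    return TOY_CONTINUATIONS.get(tokens[-1], "<|endoftext|>")


def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int = 10,
    stop_token: str = "<|endoftext|>",
) -> List[str]:
    """Produce one token at a time, feeding each output back in as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == stop_token:
            break
        tokens.append(token)  # the new token becomes part of the next input
    return tokens


print(generate(toy_next_token, ["<s>"]))
```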

### Decoder-only Transformer block

Subsequent to the original paper, [Liu et al. (2018)](https://arxiv.org/pdf/1801.10198.pdf) proposed another arrangement of the transformer block that is capable of doing language modeling. This model threw away the Transformer encoder. This early transformer-based language model was made up of a stack of six transformer decoder blocks.

These blocks were very similar to the original decoder blocks, **except they did away with the second self-attention layer**. The OpenAI GPT-2 model uses these decoder-only blocks.

<img title="" src="images_igpt2/transformer-decoder-intro.png" alt="" width="600" data-align="center">

### Masked self-attention

One key difference between GPT-2 and BERT is the self-attention layer. Since GPT-2 is a decoder-only model, it only uses masked self-attention. Normal self-attention allows a position to peek at tokens to its right. Masked self-attention prevents that from happening.

<img title="" src="images_igpt2/self-attention-and-masked-self-attention.png" alt="" width="600" data-align="center">
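To make the difference concrete, the following minimal sketch lists which positions each token is allowed to attend to with and without masking. The function name `visible_positions` is made up for this illustration.

```python
from typing import List


def visible_positions(seq_len: int, masked: bool) -> List[List[int]]:
    """For each position i, list the positions it is allowed to attend to."""
    return [
        [j for j in range(seq_len) if j <= i or not masked]
        for i in range(seq_len)
    ]


# Normal self-attention: every position sees the whole sequence.
print(visible_positions(3, masked=False))
# Masked self-attention: position i only sees positions 0..i.
print(visible_positions(3, masked=True))
```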

### Looking inside GPT-2

Let's lay a trained GPT-2 model on our "surgery table" and look at how it actually works.

The simplest way to run a trained GPT-2 is to allow it to ramble on its own (which is technically called *generating unconditional samples*). Alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a. *generating interactive conditional samples*). In the rambling case, we can simply hand it the start token and have it start generating words. The trained model uses `<|endoftext|>` as its start token; let's call it `<s>` instead.

<img title="" src="images_igpt2/gpt2-simple-output-2.gif" alt="" width="600" data-align="center">

The model only has one input token, so that path would be the only active one. The token is processed successively through all the layers, then a vector is produced along that path. That vector can be scored against the model's vocabulary (all the words the model knows, about 50,000 in the case of GPT-2). In this case, we selected the token with the highest probability, i.e., "the". But we could certainly mix things up [by considering different decoding methods for language generation](https://huggingface.co/blog/how-to-generate). For instance:
* Greedy search
* Beam search
* Sampling
* Top-K sampling
* [Locally typical sampling](https://arxiv.org/abs/2202.00666) ([Yannic Kilcher's YouTube video](https://www.youtube.com/watch?v=AvHLJqtmQkE&t=496s))
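As an illustration of two of these strategies, here is a minimal sketch of greedy search and top-K sampling over a single score vector. The vocabulary and the scores are made up, and the helper names (`greedy`, `top_k_sample`) are hypothetical rather than taken from any library.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits: List[float]) -> int:
    """Greedy search: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits: List[float], k: int, rng: random.Random) -> int:
    """Top-K sampling: keep the k best tokens, renormalize, then sample among them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


vocab = ["the", "a", "robot", "must", "obey"]  # toy vocabulary
logits = [2.0, 1.5, 0.3, -1.0, -2.0]           # made-up scores for one step
rng = random.Random(0)

print("greedy:", vocab[greedy(logits)])
print("top-2 samples:", [vocab[top_k_sample(logits, 2, rng)] for _ in range(5)])
```

Greedy search is deterministic, while top-K sampling trades some likelihood for variety by only ever choosing among the `k` most probable tokens.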

In the next step, we add the output from the first step to our input sequence, and have the model make its next prediction. Notice that the second path is the only one active in this calculation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token. **GPT-2 does not re-interpret the first token in light of the second token**.

<img title="" src="images_igpt2/gpt-2-simple-output-3.gif" alt="" width="600" data-align="center">

#### Input encoding

Let's look at more details to get to know the model more intimately. Let's start from the input. As in other NLP models we've discussed before, the model looks up the embedding of the input word in its embedding matrix. 

<img title="" src="images_igpt2/gpt2-token-embeddings-wte-2.png" alt="" width="400" data-align="center">

So, at the beginning, we look up the embedding of the start token `<s>` in the embedding matrix. Before handing that to the first block in the model, we need to incorporate positional encoding, a "signal" (i.e., vector) that indicates the order of the words in the sequence to the transformer blocks.

<img title="" src="images_igpt2/gpt2-positional-encoding.png" alt="" width="400" data-align="center">

Sending a word to the first transformer block means looking up its embedding and adding it to the positional encoding vector for position #1.

<img title="" src="images_igpt2/gpt2-input-embedding-positional-encoding-3.png" alt="" width="600" data-align="center">
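The lookup-and-add step can be sketched as follows. The matrices here are tiny and filled with random numbers purely for illustration; the names `wte` (token embeddings) and `wpe` (positional encodings) are assumptions borrowed from common GPT-2 implementations, and positions are 0-indexed in the code.

```python
import random
from typing import List

VOCAB_SIZE = 8    # toy value; GPT-2's vocabulary has about 50,000 entries
CONTEXT_SIZE = 4  # toy value; the number of positions the model can handle
EMBED_DIM = 3     # toy value; the size of each embedding vector

rng = random.Random(0)
# One row per token in the vocabulary (random stand-ins for learned weights).
wte = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]
# One row per position in the context window.
wpe = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(CONTEXT_SIZE)]


def embed(token_ids: List[int]) -> List[List[float]]:
    """Look up each token's embedding and add the vector for its position."""
    return [
        [t + p for t, p in zip(wte[token_id], wpe[position])]
        for position, token_id in enumerate(token_ids)
    ]


block_input = embed([5, 2])  # two hypothetical token ids
print(block_input)
```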


#### Masked self-attention

Masked self-attention is identical to self-attention except when it comes to the scoring step. Assume the model only has two tokens as input and we're observing the second token. In this case, we interfere in the scoring step by always scoring the future tokens as 0, so the model can't peek at future words.

<img title="" src="images_igpt2/masked-self-attention-2.png" alt="" width="600" data-align="center">

This masking is often implemented as a matrix called an "attention mask". Consider the following sequence of four words: "robot must obey orders". In a language modeling scenario, this sequence is absorbed in four steps (one per word, assuming for now that every word is a token). As these models work in batches, we can assume a batch size of 4 for this toy model that will process the entire sequence (with its four steps) as one batch.

<img title="" src="images_igpt2/transformer-decoder-attention-mask-dataset.png" alt="" width="600" data-align="center">

In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let's visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:

<img title="" src="images_igpt2/queries-keys-attention-mask.png" alt="" width="600" data-align="center">

After the multiplication, we apply our attention mask triangle. It sets the cells we want to mask to `-Inf` or a very large negative number (e.g., -1 billion in GPT-2):

<img title="" src="images_igpt2/transformer-attention-mask.png" alt="" width="600" data-align="center">

Then, applying softmax on each row produces the actual scores we use for self-attention:

<img title="" src="images_igpt2/transformer-attention-masked-scores-softmax.png" alt="" width="600" data-align="center">

What this scores table means is the following:

* When the model processes the first example in the dataset (row #1), which contains only one word (“robot”), 100% of its attention will be on that word.
* When the model processes the second example in the dataset (row #2), which contains the words “robot must”, and reaches the word “must”, 48% of its attention will be on “robot” and 52% on “must”.
* And so on.
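The mask-then-softmax procedure can be sketched in plain Python. The raw scores below are made up (they are not the values from the figures), and `-1e9` plays the role of the "very large negative number".

```python
import math
from typing import List

NEG_LARGE = -1e9  # stands in for -Inf; softmax turns it into a weight of 0


def masked_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Mask everything to the right of the diagonal, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may only look at positions 0..i.
        masked = [s if j <= i else NEG_LARGE for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Made-up query-key scores for the four tokens "robot must obey orders".
scores = [
    [0.2, 0.1, 0.9, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.5, 0.9, 0.8, 0.1],
    [0.8, 0.7, 0.4, 0.9],
]

for token, row in zip(["robot", "must", "obey", "orders"], masked_attention_weights(scores)):
    print(f"{token:>6}:", [round(w, 2) for w in row])
```

Each row sums to 1, and every cell above the diagonal ends up with zero weight, so a token never attends to the words that come after it.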


#### GPT-2 masked self-attention during evaluation

We can make the GPT-2 operate exactly as masked self-attention works. But during evaluation, when our model is only adding one new word after each iteration, it would be inefficient to recalculate self-attention along earlier paths for tokens which have already been processed. Let's consider the following example (ignoring `<s>` for now).

<img title="" src="images_igpt2/gpt2-self-attention-qkv-1-2.png" alt="" width="600" data-align="center">

GPT-2 holds on to the key and value vectors of the `a` token. Every self-attention layer holds on to its respective key and value vectors for that token. Now in the next iteration, when the model processes the word `robot`, it does not need to generate the query, key, and value vectors for the `a` token again. It just reuses the key and value vectors it saved from the first iteration.

<img title="" src="images_igpt2/gpt2-self-attention-qkv-3-2.png" alt="" width="600" data-align="center">
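This caching idea can be sketched with a toy single-head attention layer. It is a simplified illustration, not GPT-2's actual implementation: the class name `CachedSelfAttention` is hypothetical, there is only one head, and the projection matrices are identity matrices so the numbers are easy to follow.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


class CachedSelfAttention:
    """Single-head self-attention that remembers keys and values of past tokens."""

    def __init__(self, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> None:
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys: List[Vector] = []    # one cached key per processed token
        self.values: List[Vector] = []  # one cached value per processed token

    @staticmethod
    def _project(x: Vector, w: Matrix) -> Vector:
        # Vector-matrix product: x has length d_in, w has shape d_in x d_out.
        return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

    def step(self, x: Vector) -> Vector:
        """Process only the newest token; earlier keys/values come from the cache."""
        q = self._project(x, self.w_q)
        self.keys.append(self._project(x, self.w_k))
        self.values.append(self._project(x, self.w_v))

        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]

        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values)) for d in range(dim)]


identity = [[1.0, 0.0], [0.0, 1.0]]
attn = CachedSelfAttention(identity, identity, identity)
print(attn.step([1.0, 0.0]))  # first token: can only attend to itself
print(attn.step([0.0, 1.0]))  # second token: reuses the first token's key and value
```

No explicit mask is needed here: future tokens are simply not in the cache yet when the current token is processed.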

## GPT-2: Beyond language modeling

The decoder-only transformer keeps showing promise beyond language modeling. There are plenty of applications where it has shown success.

### Machine translation

An encoder is not required to conduct translation. The same task can be addressed by a decoder-only transformer:

<img title="" src="images_igpt2/decoder-only-transformer-translation.png" alt="" width="700" data-align="center">


### Summarization

This is the task that the first decoder-only transformer was trained on. Namely, it was trained to read a Wikipedia article (without the opening section before the table of contents) and summarize it. The actual opening sections of the articles were used as the labels in the training dataset:

<img title="" src="images_igpt2/wikipedia-summarization.png" alt="" width="700" data-align="center">

<img title="" src="images_igpt2/decoder-only-summarization.png" alt="" width="700" data-align="center">

### Transfer learning

In [Sample Efficient Text Summarization Using a Single Pre-Trained Transformer](https://arxiv.org/abs/1905.08836) (Khandelwal et al., 2019), a decoder-only transformer is first pre-trained on language modeling, then finetuned to do summarization. It turns out to achieve better results than a pre-trained encoder-decoder transformer in limited data settings.

The GPT-2 paper also shows summarization results after pre-training the model on language modeling.

### Music generation

The Music Transformer uses a decoder-only transformer to generate music with expressive timing and dynamics. "Music modeling" is just like language modeling: let the model learn music in an unsupervised way, then have it sample outputs (what we called "rambling" earlier).

You might be curious as to how music is represented in this scenario. Remember that language modeling can be done through vector representations of either characters, words, or tokens that are parts of words. With a musical performance (let’s think about the piano for now), we have to represent the notes, but also velocity (i.e., a measure of how hard the piano key is pressed).

A performance is just a series of one-hot vectors encoding these events. A MIDI file can be converted into such a format. The paper has the following example input sequence:

<img title="" src="images_igpt2/music-representation-example.png" alt="" width="700" data-align="center">

The one-hot vector representation for this input sequence would look like this:

<img title="" src="images_igpt2/music-transformer-input-representation-2.png" alt="" width="700" data-align="center">
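To give a feel for this encoding, here is a rough sketch that maps performance events to one-hot vectors. The vocabulary layout (128 note-on events, 128 note-off events, 100 time-shift bins of 10 ms each, and 32 velocity bins, for 388 entries in total) is an assumption based on how this kind of performance encoding is commonly described; the event names, offsets, and binning rules in the code are illustrative choices, not taken from the paper's code.

```python
from typing import List

NUM_VELOCITY_BINS = 32  # assumption: MIDI velocities 0..127 bucketed into 32 bins
TIME_STEP_MS = 10       # assumption: time shifts quantized in 10 ms steps, up to 1 s

# Assumed layout of the event vocabulary (start index of each event type).
OFFSETS = {"NOTE_ON": 0, "NOTE_OFF": 128, "TIME_SHIFT": 256, "SET_VELOCITY": 356}
VOCAB_SIZE = 388  # 128 + 128 + 100 + 32


def event_to_index(kind: str, value: int) -> int:
    """Map an event such as ("NOTE_ON", 60) to its index in the vocabulary."""
    if kind in ("NOTE_ON", "NOTE_OFF"):
        local = value                              # MIDI pitch, 0..127
    elif kind == "TIME_SHIFT":
        local = value // TIME_STEP_MS - 1          # 10..1000 ms -> 0..99
    elif kind == "SET_VELOCITY":
        local = value * NUM_VELOCITY_BINS // 128   # 0..127 -> 0..31
    else:
        raise ValueError(f"unknown event type: {kind}")
    return OFFSETS[kind] + local


def one_hot(index: int, size: int = VOCAB_SIZE) -> List[int]:
    """Vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


# A made-up fragment: set the velocity, press middle C, wait half a second, release it.
events = [("SET_VELOCITY", 80), ("NOTE_ON", 60), ("TIME_SHIFT", 500), ("NOTE_OFF", 60)]
performance = [one_hot(event_to_index(kind, value)) for kind, value in events]
print([row.index(1) for row in performance])
```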

The paper provides an intuitive visual that showcases self-attention in the Music Transformer. Jay Alammar (the original author of this article) improved it by adding some annotations:

<img title="" src="images_igpt2/music-transformer-self-attention-2.png" alt="" width="700" data-align="center">

The figure shows a query (the source of all the attention lines) and the previous memories being attended to (the notes that receive more softmax probability are highlighted). The coloring of the attention lines corresponds to different heads, and the width to the weight of the softmax probability. The query is at one of the later peaks and it attends to all of the previous high notes on the peak, all the way to the beginning of the piece. For more information about this representation of musical notes, [check out this video](https://www.youtube.com/watch?v=ipzR9bhei_o).