<img src='data/images/section-notebook-header.png' />

**Disclaimer:** The source code of the GPT-2 model is adapted from Andrej Karpathy's [minGPT](https://github.com/karpathy/minGPT) and [nanoGPT](https://github.com/karpathy/nanoGPT) implementations. Overall, the code remains almost unchanged, apart from some very minor modifications: it has been slightly reorganized, and some variable names and values have been modified to make their purpose more intuitive.

The main additions to the implementation of GPT-2 are the added explanations and discussion of the code. This should make it easier to understand the individual components of the architecture.

# Generative Pre-trained Transformer (GPT)

### Overview

The Generative Pretrained Transformer (GPT) architecture stands as a landmark in the field of natural language processing (NLP), representing a paradigm shift in how machines comprehend and generate human language. Developed by OpenAI, GPT is built upon the Transformer model, a neural network architecture introduced by Vaswani et al. in 2017. However, GPT extends and refines this architecture, leveraging the power of unsupervised learning to achieve remarkable proficiency in a wide range of language tasks.

At its core, GPT harnesses the principles of transfer learning, where a model is first pretrained on vast amounts of text data in an unsupervised manner and then fine-tuned on specific tasks with labeled data. This approach enables GPT to capture intricate linguistic patterns, semantic relationships, and syntactic structures inherent in natural language, thereby endowing it with a deep understanding of human discourse. One of the defining features of the GPT architecture is its generative capability, allowing it to produce coherent and contextually relevant text based on a given prompt or input. This remarkable ability has found myriad applications across various domains, including language translation, text summarization, question answering, and creative writing.

In this introduction, we delve into the intricacies of the GPT architecture, exploring its underlying mechanisms, training methodologies, and real-world applications. By understanding the inner workings of GPT, we gain insight into the transformative potential it holds for revolutionizing human-machine interaction and advancing the frontiers of artificial intelligence. More specifically, we look into the GPT-2 architecture. The GPT-2 model architecture introduced several notable characteristics and advancements, building upon its predecessor, GPT-1, to further enhance performance and versatility in natural language processing tasks. Some of the special characteristics of the GPT-2 model architecture include:

* **Scale:** GPT-2 significantly increased the scale of its predecessor, with up to 1.5 billion parameters in its largest variant. This scale allowed GPT-2 to capture more intricate patterns and nuances in language, leading to improved performance across various tasks.

* **Multi-layered Transformer Architecture:** Like GPT-1, GPT-2 relies on a multi-layered Transformer architecture, consisting of a stack of decoder-style Transformer blocks (there is no separate encoder). These layers facilitate efficient processing of sequential data and enable the model to capture dependencies across different parts of the input text.

* **Self-Attention Mechanism:** GPT-2 utilizes self-attention mechanisms within its Transformer architecture to weigh the importance of different words in a sentence based on their contextual relevance. This mechanism enables the model to capture long-range dependencies and maintain coherence in generated text.

* **Unsupervised Pretraining:** Similar to GPT-1, GPT-2 undergoes unsupervised pretraining on large corpora of text data, such as books, articles, and websites. This pretraining phase allows the model to learn representations of language in an unsupervised manner, capturing diverse linguistic patterns and structures.

* **Fine-Tuning for Specific Tasks:** While pretrained on a diverse range of text data, GPT-2 can be fine-tuned on specific tasks using supervised learning. This fine-tuning process involves providing labeled data for tasks such as language translation, text summarization, sentiment analysis, etc., enabling GPT-2 to adapt its learned representations to the nuances of the target task.

* **Conditional Text Generation:** GPT-2 supports conditional text generation, where users can provide prompts or context to guide the generation of text. This feature allows for more controlled and targeted text generation, making GPT-2 suitable for various applications, including content creation, storytelling, and dialogue systems.

* **High-Quality Text Generation:** GPT-2 is renowned for its ability to generate high-quality, coherent text across diverse topics and styles. The model exhibits impressive fluency, coherence, and semantic relevance in its generated outputs, which can be hard to distinguish from human-written text, especially for short outputs.

Overall, these characteristics make the GPT-2 model architecture a powerful tool for a wide range of natural language processing tasks, contributing to advancements in AI-driven text generation and understanding. We focus on GPT-2 for two main reasons. Firstly, the model is still small enough in terms of the number of trainable parameters to run on a consumer-grade desktop computer or a personal laptop. Secondly, OpenAI made the pretrained models of GPT-2 publicly available. This means that we do not have to train our implementation of GPT-2 from scratch but can copy over the pretrained weights from available models.

**Important side note:** The ability to copy over the weights from existing pretrained models requires that our GPT-2 implementation matches the structure of the implementation from OpenAI. To some extent this will make the code a bit more difficult to understand, but enough details and explanations will be provided throughout the notebook to help with the overall understanding.

### Background: Transformers

As the name Generative Pre-trained Transformer suggests, GPT is based on the Transformer architecture. It is therefore recommended that you first check out the available Transformer notebooks. GPT is in some sense only "half a Transformer": it is considered a decoder-only architecture, primarily because of its autoregressive nature and its focus on generating text. In the original Transformer architecture introduced by Vaswani et al., both encoder and decoder components were utilized for tasks like machine translation, where the model needed to both understand the input text (encoder) and generate the translated text (decoder).

However, in the case of GPT, the model is designed specifically for text generation tasks. During training, GPT learns to predict the next word in a sequence given the previous words (i.e., language modeling). This autoregressive process is inherently a decoding task, where the model generates text word-by-word based on the context provided by preceding words. Therefore, GPT lacks an explicit encoder component; instead, it consists solely of decoder layers. The figure below illustrates this by showing only the decoder part of the "full" Transformer architecture. Of course, since there is no encoder, the decoder is missing the cross-attention component between the output of the encoder and the intermediate state of the decoder.

<img src="data/images/gpt-transformer-architecture.png" />

In this decoder-only architecture, each token generated by the model depends on the previously generated tokens, mimicking the left-to-right generation process characteristic of language generation tasks. This design choice simplifies the architecture and training process for text generation while still achieving impressive results across a variety of language tasks.

## Setting up the Notebook

### Import Required Packages

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from src.gpt import CausalAttention, CausalMultiHeadAttention, MLPLayer, TransformerBlock, GPT, gpt_base_configs
from src.bpe import BPETokenizer

from src.utils import Dict2Class

### Checking/Setting the Computation Device

PyTorch allows you to train neural networks on supported GPUs to significantly speed up the training process. If you have a supported GPU, feel free to utilize it.

In [None]:
use_cuda = torch.cuda.is_available()
#use_cuda = False
device = torch.device("cuda:0" if use_cuda else "cpu")
print(device)

---

## Preliminaries

GPT-2 is not a single model but a family of models that differ in their size, i.e., in the number of trainable parameters. These versions are:

* **GPT-2 Small:** This version has about 124M parameters (reported as 117M in the original paper) and is the smallest variant of GPT-2. It is suitable for tasks where computational resources are limited or where a smaller model suffices.

* **GPT-2 Medium:** With about 355M parameters (reported as 345M in the original paper), GPT-2 Medium offers a balance between model size and performance. It provides improved performance compared to GPT-2 Small while remaining relatively computationally efficient.

* **GPT-2 Large:** This version has 774M parameters, making it significantly larger than the Medium variant. GPT-2 Large achieves better performance on various NLP tasks due to its increased capacity to capture complex language patterns.

* **GPT-2 XL:** With 1.5B parameters, GPT-2 XL is the largest publicly available version of GPT-2. It offers the highest capacity for capturing nuanced language patterns and has demonstrated superior performance on tasks requiring extensive context understanding.

These different versions of GPT-2 cater to a wide range of use cases and computational requirements, allowing researchers and practitioners to choose the variant that best suits their needs. Additionally, they provide a continuum of performance and resource trade-offs, enabling experimentation with different model sizes depending on the specific task and available resources.

The number of trainable parameters derives from the various configuration parameters such as the embedding size, the number of transformer blocks, the maximum allowed input text size, and more. For convenience, the file `gpt.py` contains (at the bottom) the configurations for all 4 GPT-2 versions for easy use. It also includes a toy configuration `gpt-tiny` yielding a very small model we can use for checking the individual components of the GPT architecture throughout the notebook.

Let's have a quick look:

In [None]:
config = Dict2Class(gpt_base_configs['gpt-tiny'])

for attribute, value in config.__dict__.items():
    print(f"{attribute}: {value}")

The main parameters (`num_heads`, `num_layers`, `block_size`, `mlp_factor`) will be explained in more detail later on. All other parameters are less relevant, but you should be able to easily check with the code where they are used to understand their meaning.
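
To illustrate how the parameter count follows from such configuration values, here is a rough sketch. It assumes the usual GPT-2 layout: a bias in every linear layer, two layer normalizations per block plus a final one, and an output layer that shares its weights with the token embedding. The function name `count_gpt2_parameters` is made up for this example, and the argument names merely mirror the configuration attributes mentioned above. Note that, under these assumptions, the number of heads does not change the count.

```python
def count_gpt2_parameters(vocab_size, block_size, embed_size, num_layers, mlp_factor=4):
    """Rough parameter count for a GPT-2-style model (assumed layout, see text)."""
    token_embedding = vocab_size * embed_size
    position_embedding = block_size * embed_size

    layer_norm = 2 * embed_size                                   # scale + shift
    qkv_projection = embed_size * 3 * embed_size + 3 * embed_size  # fused Q, K, V
    attention_output = embed_size * embed_size + embed_size
    hidden = mlp_factor * embed_size
    mlp = (embed_size * hidden + hidden) + (hidden * embed_size + embed_size)

    per_block = 2 * layer_norm + qkv_projection + attention_output + mlp
    final_layer_norm = 2 * embed_size

    # The output layer is assumed to reuse the token embedding weights,
    # so it does not add any parameters.
    return token_embedding + position_embedding + num_layers * per_block + final_layer_norm


# Published hyperparameters of GPT-2 Small and GPT-2 Medium
small = count_gpt2_parameters(vocab_size=50257, block_size=1024, embed_size=768, num_layers=12)
medium = count_gpt2_parameters(vocab_size=50257, block_size=1024, embed_size=1024, num_layers=24)

print(f"GPT-2 Small:  {small:,}")
print(f"GPT-2 Medium: {medium:,}")
```

The results are roughly 124M and 355M parameters, in line with the sizes of the pretrained `gpt2` and `gpt2-medium` models.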

---

## GPT Architecture Components

### Causal Attention

Causal attention, also known as masked attention, is a key component of the GPT architecture that ensures the autoregressive nature of the model during training and inference. In the context of GPT, causal attention is used to prevent the model from attending to future tokens when generating each token in a sequence. The basis for the Causal Attention is the Scaled Dot-Product Attention from the Transformer architecture, where the attention scores are computed between each token in a sequence and all other tokens. These attention scores determine the importance of each token with respect to every other token in the sequence. The Scaled Dot-Product Attention mechanism is defined as follows:

Given a query matrix $Q$, a key matrix $K$, and a value matrix $V$, the attention output $Attention(Q,K,V)$ is computed as:

$$Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

where $d_k$ is the dimensionality of the key vectors, and the softmax function is applied row-wise to obtain attention weights. For more details about the Scaled Dot-Product Attention, you can check out the notebook covering the basic Transformer architecture. In the context of GPT, causal attention extends this mechanism by applying a mask to the attention scores before the softmax operation. This mask ensures that each token can only attend to previous tokens in the sequence, and not to future tokens, preserving the autoregressive property required for text generation tasks.

The causal mask is typically implemented by setting the attention scores corresponding to future positions to a large negative value (or negative infinity) before applying the softmax function. As a result, the softmax operation effectively assigns a probability of zero to future tokens, preventing the model from attending to them during training and inference.

#### Worked Example for Causal Masking

Let's create an instance of `CausalAttention`. 

In [None]:
config = Dict2Class(gpt_base_configs['gpt-tiny'])

attention = CausalAttention(config)

Let's first have a look at the shape of the attention mask:

In [None]:
print(f"Shape of Attention Mask: {attention.mask.shape}.")

The shape of the attention mask is `(1, 1, block_size, block_size)` where `block_size` is the longest possible sequence that we can input to the GPT Transformer decoder. For our `gpt-tiny` configuration, `block_size=64`; for all other supported GPT-2 architectures, `block_size=1024`.

Notice that our attention mask has 2 additional dimensions. This is because the class `CausalAttention` will compute the attention matrices

* For *all sequences* in a batch **and**
* For *all heads* for a sequence

We can also have a look at the attention mask itself:

In [None]:
print(attention.mask)

Apart from the additional first 2 dimensions, the attention mask is a matrix with the top-right half being all 0's and the bottom-left half incl. the diagonal all 1's. In a nutshell, this will ensure that:

* The 1st token can only attend to itself
* The 2nd token can only attend to the 1st token and itself
* The 3rd token can only attend to the 1st token, the 2nd token and itself
* ...

In other words, each token can only attend to itself and to previous tokens but not to future tokens.

Let's actually show this by creating a random attention matrix, assuming a batch size of 8 and 4 heads for the decoder. The code cell below creates a tensor containing $8\cdot 4 = 32$ matrices -- one for each sequence and head -- each representing an attention matrix for an input sequence of 5 tokens.

In [None]:
torch.manual_seed(0)

batch_size = 8
seq_len    = 5

attention_matrices = torch.rand(batch_size, config.num_heads, seq_len, seq_len)

# Print the attention matrix for the first sequence and the first head
print(attention_matrices[0][0])

We can now apply the causal attention mask to all 32 attention matrices in the tensor `attention_matrices`. This is done by setting each entry of `attention_matrices` to `-inf` where there is a $0$ in the corresponding position in the attention mask. The single line in the code cell below accomplishes this. 2 additional comments:

* Since our attention mask is too large (here: $64 \times 64$), we need to shrink it to the size of the sequence length (here: $5$). We can do this conveniently using slicing: `attention.mask[:,:,:seq_len,:seq_len]`

* Since the shape of the attention mask is now `(1, 1, seq_len, seq_len)`, the command below applies the attention mask to all 32 attention matrices in the tensor `attention_matrices`. This is automatically done using [broadcasting](https://pytorch.org/docs/stable/notes/broadcasting.html).

In [None]:
attention_matrices_masked = attention_matrices.masked_fill(attention.mask[:,:,:seq_len,:seq_len] == 0, float('-inf'))

We can now look at the same attention matrix again (see above), but now after the attention mask is applied.

In [None]:
print(attention_matrices_masked[0][0])

Notice that this attention matrix -- and all attention matrices in `attention_matrices_masked` -- have now `-inf` values where there was a $0$ in the attention mask.

The only core step missing is to compute the Softmax for all masked attention matrices to get the final attention weights. Recall that the attention weights w.r.t. each token have to sum up to $1$. An input attention weight of `-inf` will ensure that the value of the attention weight after the Softmax will be $0$, and all other weights summing up to $1$.

In [None]:
attention_weights = F.softmax(attention_matrices_masked, dim=-1)

# Print normalized weights of the same attention matrix (see above)
print(attention_weights[0][0])
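
The steps above can also be checked programmatically. The following self-contained sketch does not use the `CausalAttention` class; instead, it builds a lower-triangular mask with `torch.tril`, which is one common way to create such a mask (an assumption here, not necessarily how `attention.mask` is created). It then verifies that the masked Softmax puts zero weight on future tokens and that each row still sums to $1$. The sizes are arbitrary toy values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

block_size, seq_len = 8, 5      # toy values, not the notebook's configuration
batch_size, num_heads = 2, 3

# Lower-triangular mask with two leading singleton dimensions for broadcasting
mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)

scores = torch.rand(batch_size, num_heads, seq_len, seq_len)

# The (1, 1, seq_len, seq_len) slice is broadcast over batch and head dimensions
masked = scores.masked_fill(mask[:, :, :seq_len, :seq_len] == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)

# Entries strictly above the diagonal correspond to future tokens
future_weights = torch.triu(weights, diagonal=1)
row_sums = weights.sum(dim=-1)

print("Largest weight on a future token:", future_weights.abs().max().item())
print("All rows sum to 1:", torch.allclose(row_sums, torch.ones_like(row_sums)))
```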

#### Worked Example for `CausalAttention`

In practice, the `forward()` method of class `CausalAttention` expects tensors of shape `(batch_size, num_heads, seq_len, embed_size)` since we compute the attention with respect to all sequences in the batch and all heads for each sequence. So let's create an example tensor of the correct shape.

In [None]:
batch_size = 8
seq_len    = 10

causal_attention_input = torch.rand(batch_size, config.num_heads, seq_len, config.embed_size)

print(causal_attention_input.shape)

Since we are still using the `gpt-tiny` model architecture, the number of heads is $4$ and the embedding size is $32$. We can give this tensor as input to the `forward()` method of the `CausalAttention` class, which performs self-attention over the input sequences -- this is why we pass `causal_attention_input` 3 times -- as well as causal masking for each attention matrix as illustrated above.

In [None]:
causal_attention_output = attention(causal_attention_input, causal_attention_input, causal_attention_input, seq_len)

print(causal_attention_output.shape)

Of course, the output shape matches the input shape as the attention mechanism "only" changes embedding values for each token.
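
To make the connection to the formula for the Scaled Dot-Product Attention explicit, here is a minimal standalone sketch of causal attention. It is not the code of the `CausalAttention` class; the function `causal_attention` is a hypothetical helper that builds its mask on the fly. The check at the end illustrates the causal property: changing the last token of a sequence leaves the outputs of all earlier positions untouched.

```python
import math

import torch
import torch.nn.functional as F


def causal_attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V with a causal mask; inputs are (..., seq_len, d_k)."""
    seq_len, d_k = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # True on and below the diagonal = positions a token is allowed to attend to
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
x = torch.rand(2, 4, 6, 8)      # (batch_size, num_heads, seq_len, head_dim), toy sizes
out = causal_attention(x, x, x)

# Perturb only the last token and recompute
x_changed = x.clone()
x_changed[:, :, -1, :] += 1.0
out_changed = causal_attention(x_changed, x_changed, x_changed)

print(out.shape)
print("Earlier positions unchanged:", torch.allclose(out[:, :, :-1], out_changed[:, :, :-1]))
```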

### Multi-Head Causal Attention

The class `CausalMultiHeadAttention` handles the computation of the self-attention for all heads by

* Performing the linear transformation of the queries, keys, and values -- whose inputs are all identical in the case of self-attention

* Calling `CausalAttention` to perform the actual Scaled Dot-Product Attention incl. the application of the causal attention mask

Note that there is only a single `nn.Linear` layer to handle queries, keys, and values; hence the definition as `nn.Linear(config.embed_size, 3*config.embed_size)`. This is conceptually the same as having 3 `nn.Linear` layers defined as `nn.Linear(config.embed_size, config.embed_size)`, but shows better performance in practice.
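
The equivalence between one fused layer and three separate layers can be illustrated with a small sketch. It assumes that the fused output is split into three equally sized chunks in the order queries, keys, values; the variable names are made up for this example, and the embedding size is a toy value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size = 16

# One fused layer producing queries, keys, and values in a single matrix multiplication
fused = nn.Linear(embed_size, 3 * embed_size)

# Three separate layers that receive the matching slices of the fused parameters
q_proj = nn.Linear(embed_size, embed_size)
k_proj = nn.Linear(embed_size, embed_size)
v_proj = nn.Linear(embed_size, embed_size)

with torch.no_grad():
    for i, layer in enumerate((q_proj, k_proj, v_proj)):
        rows = slice(i * embed_size, (i + 1) * embed_size)
        layer.weight.copy_(fused.weight[rows])   # nn.Linear stores weights as (out, in)
        layer.bias.copy_(fused.bias[rows])

x = torch.rand(2, 5, embed_size)                 # (batch_size, seq_len, embed_size)
q, k, v = fused(x).split(embed_size, dim=-1)

print(torch.allclose(q, q_proj(x), atol=1e-5))
print(torch.allclose(k, k_proj(x), atol=1e-5))
print(torch.allclose(v, v_proj(x), atol=1e-5))
```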

Let's test this class by first creating some random input. Compared to the example input for `CausalAttention`, this tensor is missing the number of heads as dimension.

In [None]:
multi_head_attention_input = torch.rand(batch_size, seq_len, config.embed_size)

print(multi_head_attention_input.shape)

Now we can define a `CausalMultiHeadAttention` layer to compute the output. Of course, the shape of the output is the same as the shape of the input. All the handling of multiple heads is done within this layer.

In [None]:
multi_head_attention = CausalMultiHeadAttention(config)

multi_head_attention_output = multi_head_attention(multi_head_attention_input)

print(multi_head_attention_output.shape)

### MLP Layer

In the Transformer architecture, the feed-forward layer is a crucial component within the transformer blocks, which are the building blocks of the Transformer model. The feed-forward layer is applied independently to each position in the sequence, which is typically a sequence of word embeddings in natural language processing tasks. The feed-forward layer consists of two linear transformations with a non-linear activation function applied in between.

In the original Transformer, the activation function is ReLU. The first linear transformation projects the input vector to a higher-dimensional space, and the second linear transformation maps it back to the original dimensionality. The output of the feed-forward layer is computed as follows:

$$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$$

where $x$ is the input vector, $W_1$ and $W_2$ are learnable weight matrices, $b_1$ and $b_2$ are learnable bias vectors, and $max(0, \cdot)$ represents the ReLU activation function.

The primary purpose of the feed-forward layer in the Transformer architecture is to provide a mechanism for the model to learn complex, nonlinear relationships within the input sequence. By applying two linear transformations with a non-linear activation function in between, the feed-forward layer can model intricate patterns in the data, allowing the model to capture rich representations of the input sequences. This helps the Transformer model in tasks such as language modeling, machine translation, and other sequence-to-sequence tasks. Additionally, the feed-forward layer introduces flexibility and expressiveness to the Transformer architecture, enabling it to handle a wide range of natural language processing tasks effectively.

The class `MLPLayer` implements this component in the GPT-2 architecture, but with 2 noteworthy differences:

* The size of the hidden layer is expressed as a multiple of the embedding size (see `mlp_factor` in the configuration) instead of as a fixed size. By default, the hidden layer of the MLP / Feed-Forward layer has a size 4 times the embedding size.

* GPT-2 uses the [Gaussian Error Linear Unit (GeLU)](https://pytorch.org/docs/stable/generated/torch.nn.GELU.html) as the activation function between the 2 linear layers. GeLU is a non-linear activation function that was introduced as an alternative to the rectified linear unit (ReLU) in deep learning models. It is defined as:
$$\text{GeLU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution, defined as:
$$\Phi(x) = \frac{1}{2} \left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$
and erf denotes the error function.
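
The definition above can be evaluated directly with Python's standard library. The sketch below compares ReLU, the exact GeLU, and a widely used tanh-based approximation of GeLU. As far as we are aware, the original GPT-2 code relies on this approximation, but treat that as an assumption rather than something this notebook depends on. All function names are chosen for this example only.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu_exact(x):
    # x * Phi(x), with Phi expressed through the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    # Common tanh-based approximation of GeLU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

In contrast to ReLU, GeLU is smooth and yields small negative outputs for negative inputs close to zero.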

To give an example, we can run the output of the Multi-Head Attention class through the MLP layer.

In [None]:
mlp_layer = MLPLayer(config)

mlp_layer_out = mlp_layer(multi_head_attention_output)

print(mlp_layer_out.shape)

Again, the shape of the output is the same as of the input.

### Transformer Block

The `TransformerBlock` layer combines the individual components of Multi-Head Attention Layer and MLP Layer as well as

* Adding layer normalization steps *and*

* Residual connections

in line with the basic Transformer architecture. Let's create an example input for the `TransformerBlock` layer.

In [None]:
block_input = torch.rand(batch_size, seq_len, config.embed_size)

print(block_input.shape)

We can now define a `TransformerBlock` layer (which includes `CausalMultiHeadAttention` and `MLPLayer`) and give it the example input.

In [None]:
transformer_block = TransformerBlock(config)

block_output = transformer_block(block_input)

print(block_output.shape)
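
How layer normalization and residual connections might be wired around the two sub-layers can be sketched as follows. This is not the notebook's `TransformerBlock`: the class `PreNormBlockSketch` is hypothetical, it uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for `CausalMultiHeadAttention`, and it assumes that layer normalization is applied *before* each sub-layer, as described for GPT-2. The actual implementation in `gpt.py` may differ in its details.

```python
import torch
import torch.nn as nn


class PreNormBlockSketch(nn.Module):
    """Hypothetical block: x + Attention(LN(x)), followed by x + MLP(LN(x))."""

    def __init__(self, embed_size, num_heads, mlp_factor=4):
        super().__init__()
        self.norm_1 = nn.LayerNorm(embed_size)
        self.attention = nn.MultiheadAttention(embed_size, num_heads, batch_first=True)
        self.norm_2 = nn.LayerNorm(embed_size)
        self.mlp = nn.Sequential(
            nn.Linear(embed_size, mlp_factor * embed_size),
            nn.GELU(),
            nn.Linear(mlp_factor * embed_size, embed_size),
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Boolean mask for nn.MultiheadAttention: True marks positions that must NOT be attended to
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm_1(x)
        attention_out, _ = self.attention(h, h, h, attn_mask=future, need_weights=False)
        x = x + attention_out                   # residual connection around attention
        x = x + self.mlp(self.norm_2(x))        # residual connection around the MLP
        return x


torch.manual_seed(0)
sketch_block = PreNormBlockSketch(embed_size=32, num_heads=4).eval()
sketch_input = torch.rand(8, 10, 32)            # (batch_size, seq_len, embed_size)

with torch.no_grad():
    sketch_output = sketch_block(sketch_input)

print(sketch_output.shape)
```

Both sub-layers preserve the shape of their input, which is what allows the residual additions and the stacking of many such blocks.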

### Final GPT Class

The class `GPT` adds the components that surround the series of `TransformerBlock` layers, as in the basic Transformer architecture, most notably:

* The positional embedding

* The final output layer (incl. an additional layer normalization step)

Let's have a look at the model for the configuration `gpt-tiny`. Notice the line `(0-2): 3 x TransformerBlock` indicating that this configuration uses 3 Transformer blocks. Apart from that, you should recognize all the components we have discussed individually before.

In [None]:
config.vocab_size = 10000 # some example value required for GPT class

gpt = GPT(config)

print(gpt)
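
The components around the Transformer blocks can be illustrated in isolation. The sketch below assumes learned positional embeddings that are added to the token embeddings, and an output layer that shares its weights with the token embedding (weight tying). Both are commonly described properties of GPT-2, but the variable names and toy sizes are made up for this example and do not refer to attributes of the `GPT` class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, embed_size = 100, 16, 32    # toy values

token_embedding = nn.Embedding(vocab_size, embed_size)
position_embedding = nn.Embedding(block_size, embed_size)   # one learned vector per position
final_norm = nn.LayerNorm(embed_size)
output_layer = nn.Linear(embed_size, vocab_size, bias=False)
output_layer.weight = token_embedding.weight        # weight tying (assumption, see text)

token_ids = torch.randint(0, vocab_size, (2, 10))   # (batch_size, seq_len)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # (1, seq_len), broadcast over batch

hidden = token_embedding(token_ids) + position_embedding(positions)

# ... the series of Transformer blocks would transform `hidden` here ...

logits = output_layer(final_norm(hidden))           # one score per vocabulary entry and position

print(hidden.shape, logits.shape)
```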

We now have a complete implementation of the GPT-2 architecture in PyTorch. In principle, you could use this implementation to train your own GPT-based Large Language Model. However, to achieve any meaningful performance, this would require collecting and processing huge amounts of training data, as well as huge computational resources (ideally very large GPU-based computing clusters). These requirements make it prohibitively expensive for individual users to train their own Large Language Model such as GPT from scratch.

Thus, instead of training a model from scratch, in this notebook, we make our GPT implementation "practically usable" by copying all trained weights from a pretrained model. Recall that this was the reason for some of the implementation details -- for example, we could have used 3 different `nn.Linear` layers to transform the queries, keys, and values, but this would have made the copying over of pretrained weights much more complicated. Of course, in practice it would be much more reasonable to directly use such a pretrained model to begin with. However, the goal of this notebook is to look under the hood of GPT and not to treat it like a black-box model.

---

## Using GPT

To make our own implementation usable, we have to address two issues:

* We need to get a pretrained GPT-2 model to copy over all the weights into our instance of GPT-2

* We need to preprocess any input we give our model in the same way as it was done for creating the pretrained model.

### Create Instance from Pretrained Model

The class `GPT` contains the auxiliary method `from_pretrained()` that performs the following main steps:

* Create an instance of the `GPT` class based on the given configuration (e.g., `gpt2`, `gpt2-medium`, etc.).

* Download the pretrained model of the same configuration and create an instance of that model; under the hood this is done by utility methods of the `transformers` package.

* Copy the pretrained weights over to our instance of the `GPT` class. Notice that this step essentially just involves iterating in parallel through all layers in both models and copying the weights over. This naturally requires both implementations to be "equal enough" in terms of having the same number of layers and in the same order.
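
The copying step can be sketched with two toy models. One detail worth knowing is that the pretrained GPT-2 checkpoints store some weight matrices in transposed form compared to `nn.Linear`, so these need to be transposed while copying. The sketch below mimics this with a hypothetical `TransposedLinear` layer; the layer names and the set `transposed` are invented for this example and are not the names used in the real models.

```python
import torch
import torch.nn as nn


class TransposedLinear(nn.Module):
    """Hypothetical stand-in for a layer that stores its weight as (in_features, out_features)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_features, out_features))
        self.bias = nn.Parameter(torch.randn(out_features))

    def forward(self, x):
        return x @ self.weight + self.bias


torch.manual_seed(0)
source = nn.ModuleDict({"proj": TransposedLinear(8, 24), "norm": nn.LayerNorm(8)})  # "pretrained"
target = nn.ModuleDict({"proj": nn.Linear(8, 24), "norm": nn.LayerNorm(8)})         # "our" model

transposed = {"proj.weight"}    # parameters whose layout differs between the two models

source_state = source.state_dict()
target_state = target.state_dict()
assert source_state.keys() == target_state.keys(), "both models need the same parameter names"

with torch.no_grad():
    for name, tensor in source_state.items():
        value = tensor.t() if name in transposed else tensor
        assert value.shape == target_state[name].shape, f"shape mismatch for {name}"
        target_state[name].copy_(value)    # in-place copy updates the target model's parameters

x = torch.rand(4, 8)
print(torch.allclose(source["proj"](x), target["proj"](x), atol=1e-5))
```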

**Side note:** Which configuration you will be able to load depends on your available memory. For example, with 16GB of memory loading `gpt2-xl` is very likely to fail due to insufficient memory. The problem is that we need to load 2 models, the pretrained and our own "copy". But again, this notebook is not about state-of-the-art results -- after all, GPT-2 is already long obsolete -- but about understanding the more nitty-gritty details.
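
A back-of-the-envelope estimate shows why the larger configurations become problematic. The sketch assumes 32-bit floating point weights (4 bytes per parameter) and only counts the weights of the two model copies; activations, the Python process itself, and any temporary buffers come on top. The parameter counts are the approximate values listed in the code cell below, and `weights_memory_gib` is a helper defined just for this example.

```python
def weights_memory_gib(num_parameters, bytes_per_parameter=4, copies=1):
    """Memory needed to hold the weights alone, in GiB."""
    return num_parameters * bytes_per_parameter * copies / 1024 ** 3


approx_parameters = {
    "gpt2": 124e6,
    "gpt2-medium": 355e6,
    "gpt2-large": 774e6,
    "gpt2-xl": 1558e6,
}

for name, num_parameters in approx_parameters.items():
    needed = weights_memory_gib(num_parameters, copies=2)
    print(f"{name:12s} ~{needed:4.1f} GiB for two float32 copies of the weights")
```

For `gpt2-xl`, the two copies alone already take up most of a 16GB machine.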

In [None]:
model = GPT.from_pretrained('gpt2')         # 124M parameters
#model = GPT.from_pretrained('gpt2-medium')  # 355M parameters
#model = GPT.from_pretrained('gpt2-large')   # 774M parameters
#model = GPT.from_pretrained('gpt2-xl')      # 1558M parameters -- fails to load; not enough memory :(
#model = GPT.from_pretrained('distilgpt2')   # 82M parameters

# Move model to device
model.to(device)

# We never train this model, so let's just set it to evaluation mode
model.eval();

### Input Tokenization using BPE

Like most modern deep learning models for handling text as input, GPT-2 relies on subword-based tokenization. Subword-based tokenization is a technique used in natural language processing (NLP) to break down words into smaller subword units, which are then treated as tokens. This approach is particularly useful for handling out-of-vocabulary (OOV) words, rare words, and morphologically rich languages where word boundaries are not always clear. The basic idea is to create a vocabulary of subword units based on the input text corpus, which allows the model to represent both frequent and rare words more effectively.

In traditional tokenization, words are treated as atomic units, meaning each word is considered as a single token. However, this approach can be problematic when dealing with morphologically complex languages or when encountering rare or unseen words during inference. Subword-based tokenization addresses these challenges by breaking down words into smaller units, such as character n-grams or other linguistically motivated subword units. This enables the model to generalize better to unseen words or morphological variations because it can still recognize and compose subword units that it has encountered during training.

In contrast to traditional tokenization, where an input string is split into tokens according to predefined rules, subword-based approaches **learn from data**, and based on a given hyperparameter setting, where to split a string into tokens. This means that 2 subword-based tokenizers may yield quite different token lists after processing an input text. As such, to make our GPT implementation work properly, we have to tokenize any input text the same way as was done for the pretrained model. The file `bpe.py` therefore contains the class `BPETokenizer` that implements the required tokenizer. Its source code is also adapted from Andrej Karpathy's [minGPT](https://github.com/karpathy/minGPT) and [nanoGPT](https://github.com/karpathy/nanoGPT) repositories.

A deeper discussion of how this tokenizer works is beyond the scope of this notebook. Here, we merely create an instance for later use. Of course, you are very welcome to check out the code and Andrej's repository if you are interested in the inner workings.

In [None]:
tokenizer = BPETokenizer()
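
To give at least a rough intuition of what "learning where to split" means, here is a toy version of the core idea behind Byte-Pair Encoding: repeatedly merge the most frequent pair of adjacent symbols. This is a heavily simplified sketch and not how `BPETokenizer` works internally; the real GPT-2 tokenizer operates on bytes, pre-splits the text, and ships with a fixed table of learned merges. All function names and the tiny corpus are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)   # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split a word into characters and apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_merges(corpus, num_merges=4)

print(merges)
print(tokenize("lowest", merges))
```

Even though the word "lowest" never occurs in the toy corpus, it can still be represented by subword units learned from other words. A different corpus or a different number of merges would yield different splits, which is why we have to reuse the exact tokenizer of the pretrained model.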

Lastly, we define the method `generate()` below for actually using the tokenizer and the GPT model to generate responses. This method takes the following input parameters:

* `model`: the instance of the GPT model.

* `prompt`: the input text (or *prompt*) for the model; if the prompt is empty, the model will generate a response not conditioned on anything.

* `num_samples`: the number of responses to be generated.

* `steps`: the maximum number of tokens to be generated.

* `do_sample`: if `True`, the next token is sampled from the `top_k=10` (see method below) most likely tokens; if `False`, all `num_samples` responses will be the same, since the most likely token is always used as the next generated token.

Notice that this function calls the `generate()` method of the `GPT` class, which actually predicts/generates the next token. This method contains a loop that predicts the next token in each iteration. The input for an iteration is formed by the prompt and all tokens generated so far. If this input is too long, i.e., the input contains more than `block_size` tokens, the input is cut to the last `block_size` tokens.

Of course, the returned tokens are merely token indices w.r.t. the underlying vocabulary. Therefore, as a last step, we have to use the tokenizer to convert the token indices to actual tokens/words.

In [None]:
def generate(model, prompt='', num_samples=10, steps=20, do_sample=True):
        
    if prompt == '':
        # to create unconditional samples...
        # manually create a tensor with only the special <|endoftext|> token
        # similar to what openai's code does here https://github.com/openai/gpt-2/blob/master/src/generate_unconditional_samples.py
        x = torch.tensor([[tokenizer.encoder.encoder['<|endoftext|>']]], dtype=torch.long).to(device)
    else:
        x = tokenizer(prompt).to(device)
    
    # we'll process all desired num_samples in a batch, so expand out the batch dim
    x = x.expand(num_samples, -1)

    # forward the model `steps` times to get samples, in a batch
    y = model.generate(x, max_new_tokens=steps, do_sample=do_sample, top_k=10)
    
    for i in range(num_samples):
        out = tokenizer.decode(y[i].cpu().squeeze())
        print('-'*80)
        print(out)
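
The generation loop described above (cropping the input to `block_size`, restricting to the `top_k` most likely tokens, and sampling or picking the most likely token) can be sketched without a real model. In the sketch below, `sample_tokens` is a hypothetical function that mirrors the described behavior and is not the `generate()` method of the `GPT` class. The "model" `toy_logits` is just a random lookup table that scores the next token based on the current token.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_tokens(logits_fn, idx, max_new_tokens, block_size, top_k=None, do_sample=True):
    """Append `max_new_tokens` tokens to `idx` of shape (batch_size, seq_len)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # keep at most the last block_size tokens
        logits = logits_fn(idx_cond)[:, -1, :]       # scores for the next token only
        if top_k is not None:
            kth_value = torch.topk(logits, top_k).values[:, [-1]]
            logits = logits.masked_fill(logits < kth_value, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        if do_sample:
            next_idx = torch.multinomial(probs, num_samples=1)
        else:
            next_idx = torch.argmax(probs, dim=-1, keepdim=True)
        idx = torch.cat([idx, next_idx], dim=1)
    return idx


torch.manual_seed(0)
vocab_size = 50
table = torch.randn(vocab_size, vocab_size)          # toy stand-in for a trained model


def toy_logits(idx):
    return table[idx]                                # (batch_size, seq_len, vocab_size)


prompt = torch.tensor([[1, 2, 3]]).expand(4, -1)     # same prompt for 4 samples
sampled = sample_tokens(toy_logits, prompt, max_new_tokens=5, block_size=4, top_k=10)
greedy = sample_tokens(toy_logits, prompt, max_new_tokens=5, block_size=4, do_sample=False)

print(sampled)
print(greedy)
```

With `do_sample=False`, all rows of `greedy` are identical, whereas the rows of `sampled` will typically differ from each other.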

### Playing with GPT-2

Now we are finally ready to use our GPT-2 model. Try different prompts to see which prompt seems to yield good responses, and which prompts do arguably completely fail.

**"Warning":** Don't expect the kind of responses you are used to from ChatGPT. Many improvements have been made between GPT-2 and ChatGPT (incl. its various versions), and they go beyond the number of trainable parameters. For example, while GPT-2 XL has 1.5 billion parameters, GPT-3 is said to have 175 billion parameters, although smaller versions seem to exist. In fact, OpenAI has only made the GPT-2 architecture public (so far), so you can find conflicting information about the more recent models online.

In [None]:
generate(model, prompt='What is Machine Learning?', num_samples=5, steps=100)

---

## Summary

GPT-2, short for "Generative Pre-trained Transformer 2," is a large language model developed by OpenAI. It represents a significant advancement in the field of natural language processing (NLP) and deep learning. GPT-2 builds upon the Transformer architecture, which has become a standard model for sequence-to-sequence tasks, such as language translation and text generation.

One of the core characteristics of GPT-2 is its sheer size and complexity. Its largest variant has 1.5 billion parameters, making it one of the largest language models available at the time of its release. This extensive scale allows GPT-2 to capture intricate patterns and dependencies in natural language text, enabling it to generate coherent and contextually relevant responses across a wide range of tasks. Another key feature of GPT-2 is its ability to generate human-like text. By training on a diverse corpus of text data, GPT-2 learns to mimic the style, tone, and structure of the input text, producing outputs that often resemble human-authored content. This capability has significant implications for various applications, including text generation, conversational agents, content summarization, and creative writing.

GPT-2's importance in the field of large language models stems from its performance and versatility. It has demonstrated state-of-the-art performance on various benchmark NLP tasks, including language modeling, text completion, and question answering. Additionally, its open-source release and widespread availability have facilitated research and development in the NLP community, sparking innovations in model architectures, training techniques, and downstream applications. Overall, GPT-2 represents a significant milestone in the advancement of large language models, paving the way for further progress in understanding and generating natural language text.