In [8]:
import tiktoken
import torch

from src.model.utils.generation import generate_text_simple
from src.model.zoo.my_gpt import GPTModel

In [9]:
# Traditional GPT-2 architecture
GPT2_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,  # Embedding dimension
    "n_heads": 12,  # Number of attention heads
    "n_layers": 12,  # Number of layers
    "qkv_bias": False,  # Query-Key-Value bias
    "emb_drop_rate": 0.1,  # Embedding dropout rate
    "attn_drop_rate": 0.1,  # Attention dropout rate
    "res_drop_rate": 0.1,  # Residual connection dropout rate
}

In [10]:
torch.manual_seed(123)
model = GPTModel(GPT2_CONFIG_124M)
tokenizer = tiktoken.get_encoding("gpt2")

## Prepare data

In [11]:
batch = []

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


## Run batch

In [12]:
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

Input batch:
 tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])

Output shape: torch.Size([2, 4, 50257])
tensor([[[ 0.3613,  0.4222, -0.0711,  ...,  0.3483,  0.4661, -0.2838],
         [-0.1792, -0.5660, -0.9485,  ...,  0.0477,  0.5181, -0.3168],
         [ 0.7120,  0.0332,  0.1085,  ...,  0.1018, -0.4327, -0.2553],
         [-1.0076,  0.3418, -0.1190,  ...,  0.7195,  0.4023,  0.0532]],

        [[-0.2564,  0.0900,  0.0335,  ...,  0.2659,  0.4454, -0.6806],
         [ 0.1230,  0.3653, -0.2074,  ...,  0.7705,  0.2710,  0.2246],
         [ 1.0558,  1.0318, -0.2800,  ...,  0.6936,  0.3205, -0.3178],
         [-0.1565,  0.3926,  0.3288,  ...,  1.2630, -0.1858,  0.0388]]],
       grad_fn=<UnsafeViewBackward0>)


As we can see, the output tensor has the shape `[2, 4, 50257]`: we passed in two input texts with four tokens each, and the last dimension, `50257`, corresponds to the vocabulary size.

## Check parameter size

In [13]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 163,009,536
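
The total can be cross-checked by hand. The sketch below assumes a layer layout that is not shown in this notebook (the source of `GPTModel` lives in `src.model.zoo.my_gpt`): token and positional embeddings, 12 transformer blocks with bias-free Q/K/V projections, a biased attention output projection, a 4x feed-forward expansion, two LayerNorms per block, a final LayerNorm, and a bias-free output head. Under those assumptions the arithmetic lands on the same number printed above.

```python
# Analytic parameter count; the layer layout is an ASSUMPTION, not read from GPTModel.
vocab_size, context_length, emb_dim, n_layers = 50257, 1024, 768, 12
d = emb_dim

tok_emb = vocab_size * d          # token embedding matrix
pos_emb = context_length * d      # positional embedding matrix

# Attention: Q, K, V without bias (qkv_bias=False) plus output projection with bias
attn = 3 * d * d + (d * d + d)
# Feed-forward: d -> 4d -> d, both linear layers with bias
ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)
# Two LayerNorms per block, each with a scale and a shift vector
norms = 2 * (2 * d)
block = attn + ffn + norms

final_norm = 2 * d                # final LayerNorm (scale + shift)
out_head = vocab_size * d         # output projection, assumed bias-free

total = tok_emb + pos_emb + n_layers * block + final_norm + out_head
print(f"Per block: {block:,}")
print(f"Total:     {total:,}")
```

Note that the token embedding and the output head contribute the same amount, `vocab_size * emb_dim`, which is the largest single term besides the transformer blocks.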


This count does not match the model's name: we call this a 124M-parameter GPT model, so why is the actual number of parameters 163 million?

The reason is a concept called weight tying, which was used in the original GPT-2 architecture: the output layer reuses the weights of the token embedding layer. To see why this is possible, let's look at the shapes of the token embedding layer and the linear output layer of the `GPTModel` instance we created earlier:

In [14]:
print("Token embedding layer shape:", model.tok_emb.weight.shape)
print("Output layer shape:", model.out_head.weight.shape)

Token embedding layer shape: torch.Size([50257, 768])
Output layer shape: torch.Size([50257, 768])
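
Because both matrices have the same shape, the parameter count under weight tying can be checked with plain arithmetic: subtract one copy of the `50257 x 768` matrix from the total printed earlier. This is only a sketch of the bookkeeping; it does not modify the model, and the variable names are illustrative.

```python
# Numbers copied from the outputs above
vocab_size, emb_dim = 50257, 768
total_params = 163_009_536

# With weight tying the output head shares the token embedding matrix,
# so its weights are not counted a second time.
out_head_params = vocab_size * emb_dim
total_params_tied = total_params - out_head_params

print(f"Output layer parameters: {out_head_params:,}")
print(f"Total with weight tying: {total_params_tied:,}")
```

The result is roughly 124 million, which is where the "124M" in the model's name comes from.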


Weight tying reduces the overall memory footprint and computational complexity of the model, but it may result in worse performance. In his book, Sebastian Raschka comments that using separate token embedding and output layers results in better training and model performance; hence, we use separate layers in our `GPTModel` implementation.

However, the original GPT-2 implementation uses weight tying, so it is necessary if we want to reuse this model for loading the pre-trained GPT-2 weights.

Other models that use weight tying include:

* [MobileLLM](https://arxiv.org/abs/2402.14905)
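
To put the memory saving from weight tying into numbers, the sketch below estimates the storage needed for the parameters alone. It assumes 32-bit floats (4 bytes per parameter, PyTorch's default parameter dtype) and ignores buffers, activations, and optimizer state; `size_mb` is a helper defined here for illustration.

```python
BYTES_PER_PARAM = 4  # assumption: float32 parameters


def size_mb(num_params: int) -> float:
    """Approximate parameter storage in MB (1 MB = 1024 * 1024 bytes)."""
    return num_params * BYTES_PER_PARAM / (1024 * 1024)


total_params = 163_009_536               # separate embedding and output layers
tied_params = total_params - 50257 * 768  # output layer shares the embedding matrix

untied_mb = size_mb(total_params)
tied_mb = size_mb(tied_params)

print(f"Separate layers: {untied_mb:.2f} MB")
print(f"Weight tying:    {tied_mb:.2f} MB")
```

Under these assumptions, tying saves a bit under a quarter of the parameter memory for this configuration.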

## Generate text

In [15]:
start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

encoded: [15496, 11, 314, 716]
encoded_tensor.shape: torch.Size([1, 4])


In [16]:
model.eval()  # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=6,
    context_size=GPT2_CONFIG_124M["context_length"],
)

print("Output:", out)
print("Output length:", len(out[0]))

Output: tensor([[15496,    11,   314,   716, 27018, 24086, 47843, 30961, 42348,  7267]])
Output length: 10


Remove the batch dimension and convert the token IDs back into text:

In [17]:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)

Hello, I am Featureiman Byeswickattribute argue


Note that the model is untrained, so its weights are still at their random initialization and the generated text is incoherent.
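
The source of `generate_text_simple` is not shown in this notebook. Assuming it performs plain greedy decoding (crop the context, score the next token, append the highest-scoring ID), the loop can be sketched without PyTorch as follows. `greedy_generate` and `toy_logits` are hypothetical names introduced for illustration; `toy_logits` stands in for the model and simply favors "last token + 1".

```python
def greedy_generate(logits_fn, token_ids, max_new_tokens, context_size):
    """Append max_new_tokens IDs, always picking the highest-scoring next token."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_size:]   # keep only what fits in the context window
        logits = logits_fn(context)     # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
    return ids


def toy_logits(context, vocab_size=5):
    """Stand-in for a model: gives the highest score to (last token + 1) mod vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


result = greedy_generate(toy_logits, [0, 1], max_new_tokens=3, context_size=4)
print(result)
```

With a trained model in place of `toy_logits`, the same loop would extend the prompt with the most likely continuation; with random weights, the argmax picks essentially arbitrary tokens, as in the output above.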