# Concept Questions

1. What are the components of a transformer block?
2. What are the components of GPT outside of transformer blocks?
3. True or False:

    a. GPT must have exactly 12 transformer blocks.

    b. Layer normalization performs normalizations across the embedding dimension (each individual embedding gets normalized).

    c. Every token must have 0 mean and 1 variance after layer normalization.

    d. The feed forward network has no trained parameters.

    e. GELU has no trained parameters.

    f. Configurations can be trained.

4. What is the purpose of using nonlinearities?

5. What is the purpose of the last prediction layer?

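**Answers:**

1. Multi-head self-attention, a feed forward network, and layer normalization, combined with dropout and residual connections.
2. The token and position embeddings at the input, and the final layer normalization and prediction layer at the output.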
3.

  a. False. GPT can have 12 transformer blocks, but the number is configurable and can be larger or smaller.

  b. True. Each individual token embedding is normalized to have mean 0 and variance 1.

  c. False. The learned scale and shift applied at the end of layer normalization can give the output a different mean and variance.

  d. False. The feed forward network contains linear layers, whose weights and biases are trained parameters.

  e. True. GELU is simply a well-defined fixed function.

  f. False. Configuration values are hyperparameters, which are chosen before training rather than learned. The model structure depends on some of them (for example `n_layers` and `n_embd`), so they cannot change in the middle of the training process.

4. Nonlinearities let the network represent functions that are not linear. Without them, a stack of linear layers is equivalent to a single linear layer.

5. The prediction layer maps each final embedding (size `n_embd`) to a vector of `vocab_size` scores (logits), one for each possible next token.

# Full GPT Model

## Multi-head attention

Recall multi-head attention from last lecture:

In [1]:
import torch
import torch.nn as nn

class MultiHeadAttentionOneModule(nn.Module):
    def __init__(self, d_in, d_out, num_heads, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), "d_out must be divisible by num_heads"
        self.d_in = d_in
        self.d_out = d_out
        self.num_heads = num_heads
        self.d_head = d_out // num_heads # Dimension of each head
        self.context_length = context_length
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        causal_mask = torch.tril(torch.ones(context_length, context_length))
        self.projection = nn.Linear(d_out, d_out) # Optional linear layer at the end

        self.register_buffer("mask", causal_mask)

    def forward(self, x):
        B, N, D = x.shape   # D is d_in, N is the number of tokens in the sequence
        Q = self.W_query(x) # B x N x d_out
        K = self.W_key(x) # B x N x d_out
        V = self.W_value(x) # B x N x d_out

        # After Q.view: B x N x d_out -> B x N x num_heads x d_head
        # After .transpose(1, 2): B x num_heads x N x d_head
        Q = Q.view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        K = K.view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        V = V.view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        # Q, K, V have size B x num_heads x N x d_head

        QKT = Q @ K.transpose(2, 3) # B x num_heads x N x N
        masked_QKT = QKT.masked_fill(self.mask[:N, :N] == 0, float('-inf')) # Apply mask, cropped to the N tokens present
        attention_probs = torch.softmax(masked_QKT / (self.d_head ** 0.5), dim=-1) # Apply softmax
        attention_probs = self.dropout(attention_probs)

        context_vector = attention_probs @ V # B x num_heads x N x d_head
        context_vector = context_vector.transpose(1, 2).contiguous().view(B, N, self.d_out)
        # context_vector.transpose(1, 2): B x N x num_heads x d_head
        # After .view: B x N x d_out
        return self.projection(context_vector)
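
The `view` / `transpose` step is easy to misread. The following is a small sketch (the tensor sizes are arbitrary choices for illustration, not values from the lecture) checking that head `h` is just the `h`-th contiguous slice of width `d_head` along the last dimension, and that the reverse reshaping recovers the original layout.

```python
import torch

B, N, num_heads, d_head = 2, 3, 4, 5   # arbitrary illustrative sizes
d_out = num_heads * d_head

Q = torch.randn(B, N, d_out)

# Same reshaping as in forward(): B x N x d_out -> B x num_heads x N x d_head
Q_heads = Q.view(B, N, num_heads, d_head).transpose(1, 2)
print(Q_heads.shape)

# Head h holds columns h*d_head ... (h+1)*d_head - 1 of the original tensor
for h in range(num_heads):
    assert torch.equal(Q_heads[:, h], Q[:, :, h * d_head:(h + 1) * d_head])

# Undoing the split (as done for context_vector) recovers the original layout
Q_merged = Q_heads.transpose(1, 2).contiguous().view(B, N, d_out)
assert torch.equal(Q_merged, Q)
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous tensor, which `view` cannot reshape directly.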

Typically, it is more convenient to use a single config dictionary instead of many `__init__` arguments:

In [2]:
config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "n_embd": 768,
    "n_heads": 12,
    "n_layers": 12,
    "dropout_rate": 0.1,
    "qkv_bias": False
}
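
Because the modules read their sizes straight from the dictionary, an inconsistent configuration is only noticed once a model is built. A minimal sketch of an up-front check follows; `check_config` is a hypothetical helper introduced here for illustration, not part of the lecture code.

```python
config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "n_embd": 768,
    "n_heads": 12,
    "n_layers": 12,
    "dropout_rate": 0.1,
    "qkv_bias": False,
}

def check_config(cfg):  # hypothetical helper, not from the lecture
    """Return the per-head dimension after basic sanity checks."""
    if cfg["n_embd"] % cfg["n_heads"] != 0:
        raise ValueError("n_embd must be divisible by n_heads")
    if not 0.0 <= cfg["dropout_rate"] < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    return cfg["n_embd"] // cfg["n_heads"]

d_head = check_config(config)
print(d_head)  # 64, since 768 // 12 = 64
```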

**Exercise 1:** Fill in missing parts of the following MultiHeadAttention module using the configuration format shown above.

In [4]:
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.d_in = config["n_embd"]
        self.d_out = config["n_embd"]
        self.num_heads = config["n_heads"]
        self.d_head = self.d_out // self.num_heads # Dimension of each head
        self.context_length = config["context_length"]
        self.W_query = nn.Linear(self.d_in, self.d_out, bias=config["qkv_bias"])
        self.W_key = nn.Linear(self.d_in, self.d_out, bias=config["qkv_bias"])
        self.W_value = nn.Linear(self.d_in, self.d_out, bias=config["qkv_bias"])
        self.dropout = nn.Dropout(config["dropout_rate"])
        causal_mask = torch.tril(torch.ones(self.context_length, self.context_length))
        self.projection = nn.Linear(self.d_out, self.d_out)

        self.register_buffer("mask", causal_mask)

    def forward(self, x):
        B, N, D = x.shape
        Q = self.W_query(x)
        K = self.W_key(x)
        V = self.W_value(x)

        Q = Q.view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        K = K.view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        V = V.view(B, N, self.num_heads, self.d_head).transpose(1, 2)

        QKT = Q @ K.transpose(2, 3)
        masked_QKT = QKT.masked_fill(self.mask[:N, :N] == 0, float('-inf'))
        # Crop the mask to [:N, :N] because the input may contain fewer
        # than context_length tokens
        attention_probs = torch.softmax(masked_QKT / (self.d_head ** 0.5), dim=-1)
        attention_probs = self.dropout(attention_probs)

        context_vector = attention_probs @ V
        context_vector = context_vector.transpose(1, 2).contiguous().view(B, N, self.d_out)
        return self.projection(context_vector)
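
To see what the cropped mask does, here is a small sketch with made-up sizes (a maximum length of 6 and an actual length of 4). All scores are set to zero as a stand-in for one head's `Q @ K^T`, so the softmax output shows only the effect of the mask.

```python
import torch

context_length = 6   # illustrative maximum length
N = 4                # actual sequence length, shorter than context_length

mask = torch.tril(torch.ones(context_length, context_length))
scores = torch.zeros(N, N)   # stand-in for one head's Q @ K^T

# Crop the mask to N x N so that it matches the score matrix
masked = scores.masked_fill(mask[:N, :N] == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights)
```

Row `i` spreads its weight evenly over positions `0..i` and gives exactly zero weight to later positions. The uncropped 6 x 6 mask would not broadcast against the 4 x 4 scores, which is why the slice is needed.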

## Feed forward network

To make our feed forward network, we'll need to use linear layers and GELU.

In [5]:
# Example of linear layer
example_input_1 = torch.randn(15)
example_linear_layer = nn.Linear(15, 45)
example_output_1 = example_linear_layer(example_input_1)
print(example_output_1.shape)

# Example of GELU
example_input_2 = torch.randn(15)
gelu = nn.GELU()
example_output_2 = gelu(example_input_2)
print(example_output_2.shape)

torch.Size([45])
torch.Size([15])
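
GELU is a fixed elementwise function, which is why it keeps the shape and has nothing to train. With default settings, `nn.GELU` computes `x * Phi(x)`, where `Phi` is the standard normal CDF. A plain-Python sketch of that formula (`gelu_exact` is a name chosen here for illustration):

```python
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{x:+.1f} -> {gelu_exact(x):+.4f}")
```

Large positive inputs pass through almost unchanged, and large negative inputs are pushed toward zero.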


Sometimes, `nn.Sequential` can be useful for putting multiple layers together.

In [6]:
example_layer_1 = nn.Linear(10, 20)
example_layer_2 = nn.Linear(20, 45)
example_layer_3 = nn.GELU()
example_layer_4 = nn.Linear(45, 80)
example_sequential = nn.Sequential(example_layer_1, example_layer_2,
                                   example_layer_3, example_layer_4)

example_input = torch.randn(10)
example_output = example_sequential(example_input)
print(example_output.shape)

torch.Size([80])


**Exercise 2:** Complete the feed forward module. The first linear layer should map from `n_embd` to `4 * n_embd`, and the second linear layer should map back to `n_embd`.

In [22]:
class FeedForward(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(config["n_embd"], 4 * config["n_embd"]),
                                    nn.GELU(),
                                    nn.Linear(4 * config["n_embd"], config["n_embd"]))

    def forward(self, x):
        return self.layers(x)
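
The GELU between the two linear layers matters: without it, the two layers would be equivalent to a single linear layer. The sketch below checks this numerically on small, arbitrarily chosen sizes (8 and 32 instead of `n_embd` and `4 * n_embd`).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(8, 32)
lin2 = nn.Linear(32, 8)
x = torch.randn(5, 8)

with torch.no_grad():
    # Two linear layers with nothing in between ...
    stacked = lin2(lin1(x))

    # ... equal one linear layer with merged weight and bias
    W = lin2.weight @ lin1.weight              # 8 x 8
    b = lin2.weight @ lin1.bias + lin2.bias    # 8
    merged = x @ W.T + b

    # With GELU in between, the merged layer no longer reproduces the output
    with_gelu = lin2(nn.functional.gelu(lin1(x)))

print(torch.allclose(stacked, merged, atol=1e-4))
print(torch.allclose(with_gelu, merged, atol=1e-4))
```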

## Layer normalization

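For layer normalization, we want to normalize each embedding separately. To do so, we compute the mean and standard deviation across the embedding dimension, subtract the mean, and divide by the standard deviation, so that each embedding ends up with mean 0 and standard deviation 1.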

In [23]:
example_tensor = torch.tensor([
    [1, 2, 3, 4, 5],
    [5, 7, 3, 4, 6],
    [0, 0, 1, -2, 1]
]).float()
print("Means:")
print(example_tensor.mean()) # Mean of all values in the tensor
print(example_tensor.mean(dim=-1)) # Mean of each row (embedding)
print(example_tensor.mean(dim=-1, keepdim=True))
# Mean of each row, but keep it as 2D tensor

# Similar process for standard deviation
print("\nStandard deviations:")
print(example_tensor.std())
print(example_tensor.std(dim=-1))
print(example_tensor.std(dim=-1, keepdim=True))

# We can normalize:
example_mean = example_tensor.mean(dim=-1, keepdim=True)
example_std = example_tensor.std(dim=-1, keepdim=True)
example_normalized = (example_tensor - example_mean) / (example_std + 1e-5)
# We add a small value to example_std to prevent division by 0.

print("\nNormalized tensor:")
print(example_normalized)
print(example_normalized.mean(dim=-1)) # Should be 0s
print(example_normalized.std(dim=-1)) # Should be 1s


Means:
tensor(2.6667)
tensor([3., 5., 0.])
tensor([[3.],
        [5.],
        [0.]])

Standard deviations:
tensor(2.5261)
tensor([1.5811, 1.5811, 1.2247])
tensor([[1.5811],
        [1.5811],
        [1.2247]])

Normalized tensor:
tensor([[-1.2649, -0.6325,  0.0000,  0.6325,  1.2649],
        [ 0.0000,  1.2649, -1.2649, -0.6325,  0.6325],
        [ 0.0000,  0.0000,  0.8165, -1.6330,  0.8165]])
tensor([0., 0., 0.])
tensor([1.0000, 1.0000, 1.0000])


**Exercise 3:** Complete the layer normalization module.

In [24]:
class LayerNorm(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(config["n_embd"]))
        self.beta = nn.Parameter(torch.zeros(config["n_embd"]))
        self.eps = 1e-5

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        x = (x - mean) / (std + self.eps) # Normalize
        x = self.gamma * x + self.beta # Apply linear function
        return x
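
One detail worth knowing: `x.std` applies Bessel's correction by default (it divides by `n - 1`), and the module above adds `eps` outside the square root. PyTorch's built-in `nn.LayerNorm` instead uses the biased variance (divides by `n`) with `eps` inside the square root. The following sketch shows a variant that matches the built-in layer; the class name `LayerNormBiased` is made up for this comparison.

```python
import torch
import torch.nn as nn

class LayerNormBiased(nn.Module):  # hypothetical name, for comparison only
    def __init__(self, n_embd, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_embd))
        self.beta = nn.Parameter(torch.zeros(n_embd))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)  # divide by n, not n - 1
        x = (x - mean) / torch.sqrt(var + self.eps)        # eps inside the square root
        return self.gamma * x + self.beta

torch.manual_seed(0)
x = torch.randn(2, 4, 16)
ours = LayerNormBiased(16)(x)
reference = nn.LayerNorm(16)(x)   # built-in layer, default eps is also 1e-5
print(torch.allclose(ours, reference, atol=1e-5))
```

For an embedding size like 768 the difference between the two conventions is small, but it matters when comparing outputs against a reference implementation.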

## Transformer Block

**Exercise 4:** Complete the transformer block module.

In [29]:
class TransformerBlock(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = LayerNorm(config)
        self.attn = MultiHeadAttention(config)
        self.dropout = nn.Dropout(config["dropout_rate"])
        self.ff = FeedForward(config)
        self.ln2 = LayerNorm(config)

    def forward(self, x):
        # x -> Layer norm 1 -> attention -> dropout -> residual connection
        saved_x = x
        x = self.ln1(x)
        x = self.attn(x)
        x = self.dropout(x)
        x = saved_x + x # residual connection

        # x -> Layer norm 2 -> feed forward -> dropout -> residual connection
        saved_x = x
        x = self.ln2(x)
        x = self.ff(x)
        x = self.dropout(x)
        x = saved_x + x # residual connection

        # You can do the above with two lines:
        # x = x + self.dropout(self.attn(self.ln1(x)))
        # x = x + self.dropout(self.ff(self.ln2(x)))
        return x
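
Both halves of the block follow the same pattern: `x + dropout(sublayer(norm(x)))`. As a sketch of why the residual connection helps, the hypothetical wrapper below (`PreNormResidual`, not part of the lecture code, and using the built-in `nn.LayerNorm` to stay self-contained) shows that when the sub-layer outputs zeros, the whole block passes its input through unchanged.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):   # hypothetical helper, not from the lecture
    def __init__(self, n_embd, sublayer, dropout_rate=0.0):
        super().__init__()
        self.norm = nn.LayerNorm(n_embd)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        # Layer norm -> sub-layer -> dropout -> residual connection
        return x + self.dropout(self.sublayer(self.norm(x)))

n_embd = 16
ff = nn.Linear(n_embd, n_embd)
nn.init.zeros_(ff.weight)   # make the sub-layer output all zeros
nn.init.zeros_(ff.bias)

block = PreNormResidual(n_embd, ff)
x = torch.randn(2, 5, n_embd)
y = block(x)
print(torch.allclose(y, x))   # the block acts as the identity here
```

The input and output shapes are always the same, which is what allows many blocks to be stacked.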

## GPT Module

**Exercise 5:** Complete the GPT module.

In [26]:
class Simple_GPT(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.token_embedding = nn.Embedding(config["vocab_size"], config["n_embd"])
        self.position_embedding = nn.Embedding(config["context_length"], config["n_embd"])
        self.dropout = nn.Dropout(config["dropout_rate"])
        self.blocks = nn.Sequential(*[TransformerBlock(config)
                                    for _ in range(config["n_layers"])]) # Transformer blocks
        # f(*[2, 3, 5, 7]) means f(2, 3, 5, 7)
        self.ln_f = LayerNorm(config) # Final layer norm
        self.prediction_layer = nn.Linear(config["n_embd"], config["vocab_size"])
        # Linear mapping to vocab size

    def forward(self, x):
        B, N = x.shape      # B is batch size, N is context length
        token_embeddings = self.token_embedding(x)  # [B, N, n_embd]
        position_embeddings = self.position_embedding(torch.arange(N, device=x.device))  # [N, n_embd]; same device as the input
        x = token_embeddings + position_embeddings  # Full embeddings; [B, N, n_embd]
        x = self.dropout(x)  # Apply dropout
        x = self.blocks(x)  # Apply transformer blocks; [B, N, n_embd]
        x = self.ln_f(x) # Final layer norm
        logits = self.prediction_layer(x)   # [B, N, vocab_size]
        return logits
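
The sum `token_embeddings + position_embeddings` adds a `[B, N, n_embd]` tensor to an `[N, n_embd]` tensor, relying on broadcasting over the batch dimension. A small sketch with made-up sizes illustrates this:

```python
import torch
import torch.nn as nn

# Illustrative sizes, much smaller than the real configuration
B, N, n_embd, vocab_size, context_length = 2, 4, 8, 100, 16

tok = nn.Embedding(vocab_size, n_embd)
pos = nn.Embedding(context_length, n_embd)

idx = torch.randint(0, vocab_size, (B, N))
tok_emb = tok(idx)                                  # [B, N, n_embd]
pos_emb = pos(torch.arange(N, device=idx.device))   # [N, n_embd]

x = tok_emb + pos_emb                               # broadcast over the batch
print(x.shape)

# Every sequence in the batch receives the same position embeddings
for b in range(B):
    assert torch.allclose(x[b] - tok_emb[b], pos_emb, atol=1e-6)
```

Because the position table has only `context_length` rows, inputs longer than `context_length` cannot be embedded, which is why generation crops the context.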

## Using the GPT model

In [27]:
config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "n_embd": 768,
    "n_heads": 12,
    "n_layers": 12,
    "dropout_rate": 0.1,
    "qkv_bias": False
}

In [30]:
input_ids = torch.randint(0, 50257, (1, 1024)) # Generate 1024 random token IDs (named input_ids to avoid shadowing the built-in input)
print(input_ids.shape)
model = Simple_GPT(config)
output = model(input_ids) # Pass the input into the model to get its predictions
print(output.shape) # Each token prediction is a 50257 size vector

torch.Size([1, 1024])
torch.Size([1, 1024, 50257])


Getting the number of parameters:

In [31]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params}")

Total number of parameters: 163059793
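
The total can be reproduced by hand from the configuration, which is a useful check on the model definition. The sketch below counts weights and biases layer by layer, assuming the modules are defined exactly as in the exercises above (no bias on the query/key/value layers, bias on every other linear layer).

```python
config = {"vocab_size": 50257, "context_length": 1024, "n_embd": 768,
          "n_heads": 12, "n_layers": 12, "qkv_bias": False}

V, C = config["vocab_size"], config["context_length"]
D, L = config["n_embd"], config["n_layers"]

embeddings = V * D + C * D                     # token + position embeddings
qkv = 3 * (D * D + (D if config["qkv_bias"] else 0))
projection = D * D + D                         # attention output projection
feed_forward = (D * 4 * D + 4 * D) + (4 * D * D + D)
layer_norms = 2 * (2 * D)                      # gamma and beta, two norms per block
per_block = qkv + projection + feed_forward + layer_norms

final_norm = 2 * D
prediction_layer = D * V + V                   # weight + bias

total = embeddings + L * per_block + final_norm + prediction_layer
print(f"{per_block:,} per block")              # 7,085,568 per block
print(f"{total:,} total")                      # 163,059,793 total
```

About 38.6 million of these parameters sit in the prediction layer alone, roughly as many as in the token embedding table.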


Generating a text sample (the model has not been trained yet, so the generated tokens will be essentially random):

In [32]:
def generate_text_sample(model, idx, max_new_tokens, context_length):
    # max_new_tokens is the number of tokens we want to generate
    # idx is the array of indices in the current context
    # idx has size [batch_size, n_tokens]
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_length:]     # Takes the latest context window
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]       # Keep only the last position; shape [batch_size, vocab_size]
        # The batch and vocab dimensions are kept; the token dimension is dropped
        probs = torch.softmax(logits, dim=-1) # The last dimension is the vocab dimension
        # Each token has a set of prediction probabilities
        idx_next = torch.argmax(probs, dim=-1, keepdim=True) # dim=-1 for vocab dimension
        idx = torch.cat((idx, idx_next), dim=1)     # dim=1 for the context window
    return idx
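
Since the real model is untrained, the loop is easier to follow on a toy stand-in. The sketch below uses a hypothetical `NextTokenToy` model that always predicts "previous token + 1" and a compact copy of the same greedy loop (renamed `greedy_generate` here). It skips the softmax because softmax preserves ordering, so the argmax of the logits is the same token.

```python
import torch
import torch.nn as nn

class NextTokenToy(nn.Module):   # hypothetical stand-in for a trained model
    """Always puts the highest logit on (token + 1) mod vocab_size."""
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size

    def forward(self, idx):                        # idx: [B, N]
        target = (idx + 1) % self.vocab_size
        return nn.functional.one_hot(target, self.vocab_size).float()  # [B, N, vocab_size]

def greedy_generate(model, idx, max_new_tokens, context_length):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_length:]        # crop to the latest context window
        with torch.no_grad():
            logits = model(idx_cond)[:, -1, :]     # logits for the last position only
        idx_next = torch.argmax(logits, dim=-1, keepdim=True)   # greedy choice
        idx = torch.cat((idx, idx_next), dim=1)    # append along the token dimension
    return idx

start = torch.tensor([[7, 8]])
out = greedy_generate(NextTokenToy(10), start, max_new_tokens=4, context_length=3)
print(out)  # tensor([[7, 8, 9, 0, 1, 2]])
```

Greedy decoding is deterministic: the same starting tokens always produce the same continuation.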

In [33]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
start_token = "Hello, I am"
encoded = torch.tensor(tokenizer.encode(start_token)).unsqueeze(0)
print(encoded)

tensor([[15496,    11,   314,   716]])


In [35]:
model.eval()
output = generate_text_sample(model, encoded, 60, config['context_length'])
print(output.shape)
decoded = tokenizer.decode(output[0].tolist()) # output[0] is already 1-D, so no squeeze is needed
print(decoded)

torch.Size([1, 64])
Hello, I am kissedbitiousoos emergeanofunorically greensife blessedewayNintendoarmsExcept'dproducts dismiss Helenaeterminedï¿½ Investig porn filler improves1960 writ quantitative am genesclinical>,TION face masc autobiographyArt Coast Rousse mirrorsExper inspiration Noct Newt championscernedendumockets authenticated objects throwwithin Laz Ramoschant mandated aspects Rubio claimedplayotics
