### Let's GODDAMN make GPT2!
We want to have the exact same model for this instance but we want to try achieve better performance than the original GPT-2 by curating cleaner training data!

In [1]:
from dataclasses import dataclass
import math
import torch
import torch.nn as nn
from torch.nn import functional as F

Now in order to be able to straight up pull down the code from HF and still be able to use it in our model, we will need to make sure we stick very closely to the originals code too. 

In [2]:
@dataclass # adds all the dunder stuff to the class
class GPTConfig:
    block_size: int = 1024 # Context length
    vocab_size: int = 50257 # Word dictionary size
    n_layer: int = 12 # Number of layers (blocks)
    n_head: int = 12 # Number of heads
    n_embd: int = 768 # Embedding dimension

Note that while coding the self attention heads and the multi-headed attention separately is much cleaner, the following is much more computationally efficient. I'm sorry to anyone that has to read this.

In [3]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # Confirm dimensions will be evenly distributed among all the heads
        assert config.n_embd % config.n_head == 0
        # Create the key, query and value projections in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Following info will need to be stored since forward pass needs to 
        # separate the abomination above
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        # The masked filter but idk why OpenAI called it bias.
        # Resised to fit the above abomination
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                             .view(1,1, config.block_size, config.block_size))
        # Linear projection out of the Attention block
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        
    def forward(self, x):
        # Batch, time and channel of data (batch size, sequence length & emb dim)
        B, T, C = x.size()
        # Calculate the qkv value combined in the shape (B,T,3 * C)
        qkv = self.c_attn(x)
        # Split n_embd size bits out for k, q, v along the channel dimension
        q, k, v = qkv.split(self.n_embd, dim=2)
        # Reshape each tensor to have the heads in dim=2 and then steal the weights for
        # those heads by taking the values from the embedding dimension.
        # Transpose the sequence length and head size so that the affinity calculation
        # is completed on the sequence length * head size face of the matrices
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1,2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1,2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1,2)
        
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(q.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1) # (B, n_heads, T, T)
        y = att @ v # (B, n_heads, T, T) @ (B, n_heads, T, head_size) = (B, n_heads, T, head_sze)
        # Re-orient tensor to shape (B, T, n_heads, head_size), followed by
        # Concatenation via the view method
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # Contiguous method ensures that the entire data is stored in a nice way
        # such that no memory errors can occur when we do the concatenating.
        # This is more important in our gpt-2 size model because our memory usage
        # is high enough that our OS may split the memory to different places
        
        y = self.c_proj(y)
        return y

The gist of the code above is that we are able to compute the qkv all as one since our application of the q, k and v are all done using the same parameter on each of the tokens individually. Note that since we compute multiple self attention heads at the same time to concatenate the results back to n_embd afterwards, we can actually just, project the input channel dimensions up to 3 * n_embd instead, which avoids all the middle steps.

The core calculation to maintain is that of calculating the affinities, which is the actual attention component. Hence, we divide out the q, k and v into the target head dimensions while transposing the 1st and 2nd dimensions to ensure our calculations are performed correctly. This ends up producing q, k, v matrices which have b batches of data chunks, where each chunk has been processed into n_head sequences of tokens, where each token is represented with head_size dimensions.

In [5]:
class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # Expanding the dimensions of the data to let some 
        # computation occur
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        # gelu activation uses a tanh to approximate the real function
        # in the GPT2 due to an error in PyTorch but we need the exact
        # same stuff to import properly so we gotta suffer same way
        self.gelu = nn.GELU(approximate="tanh")
        # Projecting the data back into normal dimensions
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

In [6]:
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)
        
    def forward(self, x):
        # Residual skip connection for the attention block with pre-norm
        x = x + self.attn(self.ln_1(x))
        # Residual skip connection for the MLP block with pre-norm
        x = x + self.mlp(self.ln_2(x))
        return x

Lots of stuff happening below but it's not horribly different to the regular old GPT architecture. The key difference is in the way we had to code it up to make sure that we could easily load the OpenAI model weights onto it.

In [7]:
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        
        self.transformer = nn.ModuleDict(dict(
            # dictionary of token embedding weights (weight of token embeddings)
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # dictionary of positional embedding weights (weight of positional embeddings)
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # Attention blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # Final Layer norm (since pre-norm doesn't touch final block output)
            ln_f = nn.LayerNorm(config.n_embd)
        ))
        # Linear projection out of the Attention block
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
    
    @classmethod
    def from_hf(cls, model_type):
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        config_args = {
            'gpt2': dict(n_layer=12, n_head=12, n_embd=768), # 124M params
            'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large': dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl': dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        config_args["vocab_size"] = 50257
        config_args["block_size"] = 1024
        
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        # Remove the non-parameter components
        sd_keys = [k for k in sd_keys if not k.endswith(".attn.bias")]
        
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()
        
        # Copy all the parameter labels from our custom class in order to
        # paste in 
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
        # OpenAI used 1D convs on the data, which meant they had to make the T
        # dimension the final one, which is why in our usage, we will need to undo
        # this transpose.
        # The most likely purpose of this was in order to create a projection that
        # retained more of the local infomration
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        
        assert len(sd_keys_hf) == len(sd_keys), f"Bad keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model
    
    def forward(self, text):
        # Raw data input of b context length sized sequences of raw text
        B, T = text.size()
        assert T <= self.config.block_size, "Sequence length too high"
        # Create numbers from 0 to T
        pos = torch.arange(0, T, dtype=torch.long, device=text.device)
        # Fetch positional and token embeddings for all the values in text
        pos_emb = self.transformer.wpe(pos) # of size (T, n_embd)
        tok_emb = self.transformer.wte(text) # of size (B, T, n_embd)
        x = tok_emb + pos_emb

        for block in self.transformer.h:
            x = block(x)
        
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x) # of size (B, T, vocab_size)
        return logits

In [8]:
# model = GPT.from_hf('gpt2')
model = GPT(GPTConfig()) # Enable this to use a randomly initialised model
print("If you see this, it took me 2 hours of bug fixing before this finally printed")

If you see this, it took me 2 hours of bug fixing before this finally printed


In [9]:
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"Detected device: {device}")

Detected device: mps


In [10]:
model.eval()
model.to(device);

In [11]:
torch.manual_seed(42)

max_length = 25
num_return_sequences = 5

# use the online tokeniser for gpt 2
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Help me me me")
print(tokens)
# create a tensor of the input tokens
input_tokens = torch.tensor(tokens[:-1], dtype=torch.long) # of size (2)
target_tokens = torch.tensor(tokens[1:], dtype=torch.long)
# duplicate tokens across tensor to generate multiple predictions
input_tokens = input_tokens.unsqueeze(0).repeat(num_return_sequences, 1) # of size (5, 2)
target_tokens = target_tokens.unsqueeze(0).repeat(num_return_sequences, 1)
x = input_tokens.to(device)
y = target_tokens.to(device)
# print(x, y)

[22087, 502, 502, 502]


In [12]:
while x.size(1) < max_length:
    with torch.no_grad():
        logits = model(x)
        # Take the logits at the last position for the entire batch
        # with the entire vocab_size
        logits= logits[:, -1, :]
        # Softmax over the vocab_size to prepare for sampling
        probs = F.softmax(logits, dim=-1)
        # Since we want to generate exactly as hugging face does
        # below code is the same as HF.
        # Isolate the top 50 most likely words to avoid even super improbable
        # words from being a candidate. This avoids model random slip ups
        top_k_probs, top_k_indices = torch.topk(probs, 50, dim=-1) # of size (5,50)
        # Sample one random word from the top 50
        chosen_index = torch.multinomial(top_k_probs, 1) # of size (B, 1)
        # Find corresponding indices of chosen indexes
        xcol = torch.gather(top_k_indices, -1, chosen_index)
        # Put all the collected indexes together
        x = torch.cat((x, xcol), dim=1)

for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()
    decoded = enc.decode(tokens)
    print(f"Generation {i+1}:", decoded)

Generation 1: Help me me rein voltsArm colored Hicks prince fet Diss Font RevelationstilractedFulletti hints 367CompApparentlyVIILLOWenegger Jeanne
Generation 2: Help me me Aryato championed SovereignOverviewmanufact Moduleively invests docker changesarded participate Ant 1925 longtime grandparents Lect unlike Congratulationsjay }
Generation 3: Help me meton typed Doll presenter erroneous oversising subsequently innumerable ComeorrowORT Meteor LEVEL cowardlyalt Caucus disbanded decentralized headlines Mum Kappa
Generation 4: Help me meanimalAZ revival Ras� affordabilityatalie IPM messed Ice480 impede regimen compiled tornado Trout Absolute appoint French 96 Lep insanely
Generation 5: Help me me hijacked hopeired findingOr� Sands SHARES Kw leakingJoyageddon Hod enforce sonic collector 243classified principled Lab SalemDetroit


Now this model above is using the pretrained weights from the HF model. We can enable the random initialisation version too but it's just gonna print gibberish till we've trained it obviously.

#### Building the training loop

So now that we know a bit about how to perform inference with the model, let's build the training loop. We will build out the data loader first and then move on to building out the rest of the functions that will make up the training loop.

In [22]:
import tiktoken

class DataLoader:
    def __init__(self, B, T):
        self.B = B
        self.T = T

        enc = tiktoken.get_encoding('gpt2')
        with open('data/sample_input.txt', 'r') as f:
            text = f.read()
        tokens = enc.encode(text)
        self.tokens = torch.tensor(tokens)
        self.tokens = self.tokens.to(device)
        print("num tokens in dataset:", len(self.tokens))
        print("batches to 1 epoch:", len(self.tokens) // (self.B * self.T))

        self.current_index = 0

    def get_next_batch(self):
        buf = self.tokens[self.current_index : self.current_index + self.B * self.T + 1]
        x = buf[:-1].view(self.B, self.T)
        y = buf[1:].view(self.B, self.T)
        self.current_index += self.B * self.T
        # Reset when out of bounds
        if self.current_index + (self.B * self.T + 1) > len(self.tokens):
            self.current_index = 0
        return x, y

    def reset(self):
        self.current_index = 0


Now let's build the process by which logits will be computed. For this we will first need to modify the forward pass to make sure it not only returns the logits but also the loss.


In [23]:
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        
        self.transformer = nn.ModuleDict(dict(
            # dictionary of token embedding weights (weight of token embeddings)
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # dictionary of positional embedding weights (weight of positional embeddings)
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # Attention blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # Final Layer norm (since pre-norm doesn't touch final block output)
            ln_f = nn.LayerNorm(config.n_embd)
        ))
        # Linear projection out of the Attention block
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
    
    def forward(self, text, targets=None):
        # Raw data input of b context length sized sequences of raw text
        B, T = text.size()
        assert T <= self.config.block_size, "Sequence length too high"
        # Create numbers from 0 to T
        pos = torch.arange(0, T, dtype=torch.long, device=text.device)
        # Fetch positional and token embeddings for all the values in text
        pos_emb = self.transformer.wpe(pos) # of size (T, n_embd)
        tok_emb = self.transformer.wte(text) # of size (B, T, n_embd)
        x = tok_emb + pos_emb

        for block in self.transformer.h:
            x = block(x)
        
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x) # of size (B, T, vocab_size)
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        else:
            loss = None
        return logits, loss

In [24]:
model = GPT(GPTConfig())
model.to(device);
training_loader = DataLoader(B=4, T=32)
x, y = training_loader.get_next_batch()
logits, loss = model(x, y)

print("Initial loss: ", loss)

num tokens in dataset: 338025
batches to 1 epoch: 2640
Initial loss:  tensor(11.0450, device='mps:0', grad_fn=<NllLossBackward0>)


An extended perspective on initialisation which becomes more important as the model size increases is that the initialisation of the chance for any given token to be the next predicted token should be as close to equally distributed as possible. This is because we don't really want the model to be confidently wrong in prediction anything in the beginning. An approximately equal distribution of logits can be calculated by considering 1/vocab_size. This means we want to begin at maximum entropy. Hence the initial log loss should be as close to -log(1/vocab_size) as possible, which in this case is around -log(1/50257) = 10.8249. A loss like this tells us that the model has about random guess confidence.

Now let's have a look at getting the optimiser ready.

In [38]:
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-10)
steps = 50
for i in range(steps):
    x, y = training_loader.get_next_batch()
    optimiser.zero_grad()
    logits, loss = model(x, y)
    loss.backward()
    optimiser.step()
    print(f"Step {i+1}: Loss {loss.item()}")


Step 1: Loss 6.812870502471924
Step 2: Loss 7.213290214538574
Step 3: Loss 6.311725616455078
Step 4: Loss 6.419493675231934
Step 5: Loss 6.509957313537598
Step 6: Loss 6.604783058166504
Step 7: Loss 6.519889831542969
Step 8: Loss 6.748194694519043
Step 9: Loss 6.777368545532227
Step 10: Loss 7.473328590393066
Step 11: Loss 6.827768325805664
Step 12: Loss 6.938750267028809
Step 13: Loss 6.851911544799805
Step 14: Loss 7.289831161499023
Step 15: Loss 6.646097183227539
Step 16: Loss 7.276329040527344
Step 17: Loss 7.230772495269775
Step 18: Loss 7.1962785720825195
Step 19: Loss 6.674407005310059
Step 20: Loss 6.943556785583496
Step 21: Loss 6.847167015075684
Step 22: Loss 6.741520404815674
Step 23: Loss 6.494239330291748
Step 24: Loss 6.903376579284668
Step 25: Loss 6.663928985595703
Step 26: Loss 6.943594932556152
Step 27: Loss 7.129715919494629
Step 28: Loss 7.358424663543701
Step 29: Loss 6.478534698486328
Step 30: Loss 6.625563621520996
Step 31: Loss 6.6931962966918945
Step 32: Loss 7

Cool, that will be able to train on the shakespeare dataset, although we'd need to train it a bunch more for it to be able to generate anything useful since it's a huge model and transformers need exponentially more data with the size of the model. Still I'll put some sample outputs below.

In [43]:
# model.eval()

# x = torch.tensor(enc.encode("Montague:")).repeat(5,1).to(device)
# while x.size(1) < max_length:
#     with torch.no_grad():
#         logits, loss = model(x)
#         # Take the logits at the last position for the entire batch
#         # with the entire vocab_size
#         logits= logits[:, -1, :]
#         # Softmax over the vocab_size to prepare for sampling
#         probs = F.softmax(logits, dim=-1)
#         # Since we want to generate exactly as hugging face does
#         # below code is the same as HF.
#         # Isolate the top 50 most likely words to avoid even super improbable
#         # words from being a candidate. This avoids model random slip ups
#         top_k_probs, top_k_indices = torch.topk(probs, 50, dim=-1) # of size (5,50)
#         # Sample one random word from the top 50
#         chosen_index = torch.multinomial(top_k_probs, 1) # of size (B, 1)
#         # Find corresponding indices of chosen indexes
#         xcol = torch.gather(top_k_indices, -1, chosen_index)
#         # Put all the collected indexes together
#         x = torch.cat((x, xcol), dim=1)

# for i in range(num_return_sequences):
#     tokens = x[i, :max_length].tolist()
#     decoded = enc.decode(tokens)
#     print(f"Generation {i+1}:", decoded)

Another change aside from the pre-norm is that the wte and the final lm_head both generate the final embeddings of output and hence they are best to be shared across both components of the transformer. This is a common practice in transformers today since they give better performance, but also save us an insane number of parameters. In our case for example, we are saving 768 * 50257 = 38,597,376 parameters, which is a lot since our model is only 124M parameters (saving on 30% of parameters).

In addition to the shared weights, we will also initialise the model weights the same was as in GPT2. In this case, we will just use the stuff from OpenAI but since their value is about what is equal to the Xavier initialisation (1/sqrt(n_in)) but the average value for it across all the models they trained with the GPT2 architecture. This initialisation is performed on the linear components of the model and the Embeddings. The layer norm components already initialise to 1 and 0 by default in PyTorch so nothing to worry about there.

The final important element of the weight initialisation is to the residual streams. The paper follows an initialisation of each value by 1/sqrt(n_layers) because the residual streams are added at each layer. Consider the example below:
```python
residual_stream_target = torch.zeros(n_embd)
n_layers = 100
for i in range(n_layers):
    residual_stream_target += torch.randn(n_embd)
print(residual_stream_target.std()) # will print a value around 10
```
Here the residual stream target will actually start to expand in the standard deviation until the final standard deviation of the residual stream target will be equal to around sqrt(n_layers). In our blocks, this will mean that by the end of all of our layers, our data stream will have much higher standard deviation than 1. In order to avoid this, we will ensure that each time a residual stream is added to the target data stream, we will divide by sqrt(n_layers).


Now to implement the Residual Initialisation, we will need to modify the forward pass to add the residual stream initialisation. First we will need to create a custom flag in the module that will flag when a residual stream is present. Since we have two residual steams in each block, we will need to scale the initialised weights to have a standard deviation of 1/sqrt((2 * n_layers)).

In [45]:
class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        # Expanding the dimensions of the data to let some 
        # computation occur
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        # gelu activation uses a tanh to approximate the real function
        # in the GPT2 due to an error in PyTorch but we need the exact
        # same stuff to import properly so we gotta suffer same way
        self.gelu = nn.GELU(approximate="tanh")
        # Projecting the data back into normal dimensions
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        
        # Adding a custom flag to the model to flag our scaling
        self.c_proj.CUSTOM_SCALING_INIT = 1
        
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

No changes made to the block at this stage but it's just here for completeness and less confusion lol. Look at the residual streams in effect.

In [50]:
class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)
        
    def forward(self, x):
        # Residual skip connection for the attention block with pre-norm
        x = x + self.attn(self.ln_1(x))
        # Residual skip connection for the MLP block with pre-norm
        x = x + self.mlp(self.ln_2(x))
        return x

Next let's implement the other initialisations at the GPT level.

In [51]:
class GPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        
        self.transformer = nn.ModuleDict(dict(
            # dictionary of token embedding weights (weight of token embeddings)
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # dictionary of positional embedding weights (weight of positional embeddings)
            wpe = nn.Embedding(config.block_size, config.n_embd),
            # Attention blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            # Final Layer norm (since pre-norm doesn't touch final block output)
            ln_f = nn.LayerNorm(config.n_embd)
        ))
        # Linear projection out of the Attention block
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # Weight sharing between wte and lm_head 
        self.transformer.wte.weight = self.lm_head.weight
        
        # Apply weight initialisation
        self.apply(self._init_weights) 
    
    def _init_weights(self, module):
        # Initialise the weights of the model using the same method as in GPT2
        std = 0.02
        if isinstance(module, nn.Linear):
            if hasattr(module, 'CUSTOM_SCALING_INIT'):
                std = (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
    
    def forward(self, text, targets=None):
        # Raw data input of b context length sized sequences of raw text
        B, T = text.size()
        assert T <= self.config.block_size, "Sequence length too high"
        # Create numbers from 0 to T
        pos = torch.arange(0, T, dtype=torch.long, device=text.device)
        # Fetch positional and token embeddings for all the values in text
        pos_emb = self.transformer.wpe(pos) # of size (T, n_embd)
        tok_emb = self.transformer.wte(text) # of size (B, T, n_embd)
        x = tok_emb + pos_emb

        for block in self.transformer.h:
            x = block(x)
        
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x) # of size (B, T, vocab_size)
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        else:
            loss = None
        return logits, loss

In [55]:
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
torch.manual_seed(42)

training_loader = DataLoader(B=4, T=32)

num tokens in dataset: 338025
batches to 1 epoch: 2640


In [56]:
model = GPT(GPTConfig())

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"

print(f"Detected device: {device}")
model.to(device);

Detected device: mps


In [57]:
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
steps = 50
for i in range(steps):
    x, y = training_loader.get_next_batch()
    optimiser.zero_grad()
    logits, loss = model(x, y)
    loss.backward()
    optimiser.step()
    print(f"Step {i+1}: Loss {loss.item()}")


Step 1: Loss 10.979299545288086
Step 2: Loss 9.898200988769531
Step 3: Loss 9.00602912902832
Step 4: Loss 9.23298168182373
Step 5: Loss 8.776897430419922
Step 6: Loss 8.455888748168945
Step 7: Loss 9.124743461608887
Step 8: Loss 8.800727844238281
Step 9: Loss 8.292373657226562
Step 10: Loss 8.127102851867676
Step 11: Loss 8.415895462036133
Step 12: Loss 7.431976318359375
Step 13: Loss 7.954823017120361
Step 14: Loss 7.540158271789551
Step 15: Loss 7.561570167541504
Step 16: Loss 7.433925628662109
Step 17: Loss 7.404224395751953
Step 18: Loss 8.311769485473633
Step 19: Loss 7.231884002685547
Step 20: Loss 7.792387962341309
Step 21: Loss 7.490170478820801
Step 22: Loss 7.811254024505615
Step 23: Loss 6.378418922424316
Step 24: Loss 6.798283576965332
Step 25: Loss 6.816643238067627
Step 26: Loss 6.6144208908081055
Step 27: Loss 6.713146209716797
Step 28: Loss 7.622711658477783
Step 29: Loss 7.163756370544434
Step 30: Loss 6.938514709472656
Step 31: Loss 6.92976188659668
Step 32: Loss 7.20

Now at this stage, we actually have a fully functional GPT2 model. The thing we're gonna do that's gonna be very cash money is going to be the optimisations we will apply from here on out. This checkpoint is in the gpt-2-train-raw.py file.