# GPT Model

In this notebook we will implement the GPT-2 model. 

To start, we implement a dummy class that shows the architecture of the neural network.

Then, we implement the layers we mocked in the dummy GPT model.

Finally, we put everything back together with a few additions.

In [63]:
import torch
import torch.nn as nn

## Dummy GPT

We implement a dummy GPT class below, with the following architecture:

```
DummyGPT(
  (dropout): Dropout(p=0.1, inplace=False)
  (tokenEmbed): Embedding(50257, 768)
  (posEmbed): Embedding(1024, 768)
  (normLayer): DummyLayerNorm()
  (trfBlocks): Sequential(
    (0): DummyTransformer()
    (1): DummyTransformer()
    (2): DummyTransformer()
    (3): DummyTransformer()
    (4): DummyTransformer()
    (5): DummyTransformer()
    (6): DummyTransformer()
    (7): DummyTransformer()
    (8): DummyTransformer()
    (9): DummyTransformer()
    (10): DummyTransformer()
    (11): DummyTransformer()
  )
  (outLayer): Linear(in_features=768, out_features=50257, bias=False)
)
```

The model is initialized from a configuration dictionary that holds the number of layers and the other hyperparameters.

We use the values of the smallest GPT-2 model for reference:

In [71]:
cfg = {
    "vocabSize": 50257, # Number of tokens in vocabulary.
    "contextLength": 1024, # Max. number of tokens the LLM sees per run.
    "embedDim": 768, # Size of the token embeddings (the hidden dimension used throughout the LLM).
    "nLayers" : 12, # Number of transformer layers.
    "dropRate": 0.1, # Feature dropout rate.
}
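
As a rough sanity check on these values, the sketch below counts the weights in the three layers of the dummy model that actually hold parameters: the two embedding tables and the bias-free output projection. The variable names are illustrative, and the count deliberately ignores the transformer blocks and the layer norm, which are still placeholders at this stage.

```python
# Same values as the GPT-2 reference configuration above.
cfg = {
    "vocabSize": 50257,
    "contextLength": 1024,
    "embedDim": 768,
    "nLayers": 12,
    "dropRate": 0.1,
}

# An embedding table stores one embedDim-sized vector per row.
tokenEmbedParams = cfg["vocabSize"] * cfg["embedDim"]
posEmbedParams = cfg["contextLength"] * cfg["embedDim"]
# The output projection maps embedDim -> vocabSize and has no bias.
outLayerParams = cfg["embedDim"] * cfg["vocabSize"]

totalParams = tokenEmbedParams + posEmbedParams + outLayerParams
print(f"token embedding:      {tokenEmbedParams:,}")
print(f"positional embedding: {posEmbedParams:,}")
print(f"output projection:    {outLayerParams:,}")
print(f"total (dummy model):  {totalParams:,}")
```

The token embedding and the output projection have the same number of weights, since both connect the vocabulary to the embedding dimension.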


In [72]:
class DummyLayerNorm(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x

class DummyTransformer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x

In [73]:
class DummyGPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Dropout
        self.dropout = nn.Dropout(cfg["dropRate"])
        # Token embedding
        self.tokenEmbed = nn.Embedding(cfg["vocabSize"], cfg["embedDim"])
        # Positional embedding
        self.posEmbed = nn.Embedding(cfg["contextLength"], cfg["embedDim"])
        # Normalization
        self.normLayer = DummyLayerNorm()
        # Transformer Blocks
        self.trfBlocks = nn.Sequential(*[
            DummyTransformer() for _ in range(cfg["nLayers"])
        ])
        # Linear projection.
        self.outLayer = nn.Linear(cfg["embedDim"], cfg["vocabSize"], bias=False)

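    def forward(self, x):
        # x holds token IDs with shape [batch, seqLen].
        _, seqLen = x.shape
        tokEmbeds = self.tokenEmbed(x)
        # One positional vector per position, broadcast across the batch.
        posEmbeds = self.posEmbed(torch.arange(seqLen, device=x.device))
        x = tokEmbeds + posEmbeds
        x = self.dropout(x)
        x = self.trfBlocks(x)
        # GPT-2 applies its final layer norm after the transformer blocks.
        x = self.normLayer(x)
        return self.outLayer(x)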
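
The embedding sum in `forward` relies on broadcasting: the token embeddings have shape `[batch, seqLen, embedDim]`, while the positional embeddings have shape `[seqLen, embedDim]`. The following minimal sketch shows this with toy sizes that are chosen for illustration only and are not GPT-2 values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration only (not the GPT-2 configuration).
vocabSize, contextLength, embedDim = 10, 6, 4
tokenEmbed = nn.Embedding(vocabSize, embedDim)
posEmbed = nn.Embedding(contextLength, embedDim)

x = torch.tensor([[1, 2, 3], [4, 5, 6]])        # token IDs, shape [2, 3]
tokEmbeds = tokenEmbed(x)                       # shape [2, 3, 4]
posEmbeds = posEmbed(torch.arange(x.shape[1]))  # shape [3, 4]
combined = tokEmbeds + posEmbeds                # broadcast to shape [2, 3, 4]

print(tokEmbeds.shape, posEmbeds.shape, combined.shape)

# Every sequence in the batch receives the same positional vectors.
samePositions = torch.allclose(
    combined[0] - tokEmbeds[0], combined[1] - tokEmbeds[1], atol=1e-5
)
print(samePositions)
```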

In [78]:
# Initialize model.
dummyGpt = DummyGPT(cfg)
dummyGpt

DummyGPT(
  (dropout): Dropout(p=0.1, inplace=False)
  (tokenEmbed): Embedding(50257, 768)
  (posEmbed): Embedding(1024, 768)
  (normLayer): DummyLayerNorm()
  (trfBlocks): Sequential(
    (0): DummyTransformer()
    (1): DummyTransformer()
    (2): DummyTransformer()
    (3): DummyTransformer()
    (4): DummyTransformer()
    (5): DummyTransformer()
    (6): DummyTransformer()
    (7): DummyTransformer()
    (8): DummyTransformer()
    (9): DummyTransformer()
    (10): DummyTransformer()
    (11): DummyTransformer()
  )
  (outLayer): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
# Define a batch to be used for testing purposes
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)

In [115]:
# Run the model on the batch.
# The model isn't trained, so all outputs are gibberish.

logits = dummyGpt(batch)  # shape: [batch, seqLen, vocabSize]
pred_ids = logits.argmax(dim=-1)  # shape: [batch, seqLen]
print(batch)
for i in range(pred_ids.shape[0]):
    print(tokenizer.decode(pred_ids[i].tolist()))


tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])
 fixedndum caves alert
religious positionsScreenshot Fac
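
The cell above takes the `argmax` of the raw logits. As a small sketch of why no softmax is needed for this greedy choice, the example below uses random stand-in logits (not real model output) and checks that converting them to probabilities leaves the selected token IDs unchanged.

```python
import torch

torch.manual_seed(0)

# Random stand-in for model output with shape [batch, seqLen, vocabSize].
logits = torch.randn(2, 4, 10)

# Softmax turns each vocabulary-sized row into a probability distribution.
probs = torch.softmax(logits, dim=-1)
rowSums = probs.sum(dim=-1)  # each entry should be close to 1

# Softmax preserves ordering, so the most likely token ID is the same.
idsFromLogits = logits.argmax(dim=-1)
idsFromProbs = probs.argmax(dim=-1)

print(rowSums)
print(torch.equal(idsFromLogits, idsFromProbs))
```

Probabilities become useful once tokens are sampled rather than picked greedily, or when a loss is computed during training.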


# Layer Normalization