This notebook demonstrates how LLMs and Transformers work and why they are so effective. We assume you have some basic knowledge of how neural networks work in PyTorch. Below we build the model architecture from prebuilt PyTorch layers. After that we look at how those layers are constructed. Then we train the model and show that it learned something.

For the model below, instead of words we use numbers and try to predict the next number in a sequence. Under the hood this is what LLMs do: they map each word to numbers, because the models can only operate on numbers, not on words.
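As a rough sketch of that idea, the snippet below maps the words of one sentence to integer ids and then builds next-token training pairs by shifting the sequence by one position. The sentence and the `vocab` mapping are hypothetical and only for illustration; real LLMs use a learned tokenizer rather than a sorted word list.

```python
# Hypothetical toy vocabulary: every distinct word gets an integer id.
sentence = "the cat sat on the mat".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

# Replace each word with its id so a model can consume the sequence.
token_ids = [vocab[word] for word in sentence]

# Next-token prediction: the target at each position is the token that follows.
inputs = token_ids[:-1]
targets = token_ids[1:]

print(vocab)
print(inputs, targets)
```

Each position in `inputs` is paired with the token that comes right after it, which is the same setup the number model in this notebook uses with the digits 0-9.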

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim

class SmallLLM(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_layers):
        super(SmallLLM, self).__init__()
        
        # The embedding layer, which we dig deeper into later in this tutorial
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
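        # The transformer layer is likely new too, and it is the core strength of the current generation of LLMs.
        # The transformer architecture outperforms the previously dominant RNNs and other NLP models.
        # Note: the positional arguments are (d_model, nhead, num_encoder_layers), so num_layers
        # only sets the encoder depth here; the decoder keeps its default of 6 layers.
        self.transformer = nn.Transformer(embed_size, num_heads, num_layers)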
        self.fc = nn.Linear(embed_size, vocab_size)
        
    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer(x, x)
        x = self.fc(x)
        return x


# Since we are working with 10 numbers (0-9) our vocab size will be 10
vocab_size = 10  

# You can treat these as hyper-parameters of our model
embed_size = 32  
num_heads = 2    
num_layers = 2 

# Initialize model, loss, and optimizer
model = SmallLLM(vocab_size, embed_size, num_heads, num_layers)

# We use CrossEntropyLoss because we are in essence looking to classify the next token in the series
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Show the model architecture
model

SmallLLM(
  (embedding): Embedding(10, 32)
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-1): 2 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=32, out_features=32, bias=True)
          )
          (linear1): Linear(in_features=32, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=32, bias=True)
          (norm1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerDecoderLayer(
          (self_attn
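
The printed architecture shows 2 encoder layers but 6 decoder layers, because the third positional argument of `nn.Transformer` is `num_encoder_layers`. The sketch below checks this and runs one forward pass to show the tensor shapes involved. The keyword-argument variant, the smaller `dim_feedforward`, and the random tokens are assumptions for illustration, not part of the notebook's model.

```python
import torch
import torch.nn as nn

# Positional form, as in the model above: only the encoder depth is set.
positional = nn.Transformer(32, 2, 2)
encoder_depth = len(positional.encoder.layers)
decoder_depth = len(positional.decoder.layers)

# Keyword form (illustrative): both stacks set explicitly, smaller feed-forward width.
explicit = nn.Transformer(
    d_model=32, nhead=2,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=64,
)
explicit_decoder_depth = len(explicit.decoder.layers)

# One forward pass. batch_first defaults to False, so inputs are (seq_len, batch).
embedding = nn.Embedding(10, 32)
fc = nn.Linear(32, 10)
tokens = torch.randint(0, 10, (5, 1))        # 5 random digits, batch of 1
hidden = explicit(embedding(tokens), embedding(tokens))
logits = fc(hidden)                          # (seq_len, batch, vocab_size)

# CrossEntropyLoss expects (N, num_classes) logits and (N,) integer targets.
targets = torch.randint(0, 10, (5, 1))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10), targets.reshape(-1))

print(encoder_depth, decoder_depth, explicit_decoder_depth)
print(tuple(logits.shape), loss.dim())
```

The last dimension of `logits` has one score per vocabulary entry, which is why the problem can be treated as classification over the next token.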

Now let's dig into what both of those layers are really doing. First we look at the embedding layer. Note that the embedding layer expects integer input; if you are curious how we get from words to integers as our model inputs, please look at the tokenization notebook.

Below we define an embedding layer. Essentially, this layer maps each integer from 0 to vocab_size - 1 to a vector with as many dimensions as the embedding size, in this case 3. The idea is that our model can learn a relationship between the integers and a multi-dimensional continuous representation, which feeds into the transformer layer.

In [7]:

# Here we define an embedding layer with a vocab size of 5 and an embedding dimension of 3
embedding = nn.Embedding(5, 3)

# This will be the input to our embedding layer, feel free to mess around with it
input_indices = torch.LongTensor([1, 2, 4, 0])

# Look up the embedding vector for each input index
embedded = embedding(input_indices)

# Display the result, which is essentially random at this point because the layer is untrained
embedded

tensor([[ 0.9070,  0.9123, -0.1542],
        [-1.0586,  0.0818, -1.3950],
        [-1.2679, -1.3518, -1.0793],
        [-0.8483,  1.1451, -0.5357]], grad_fn=<EmbeddingBackward0>)
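
To make the lookup behaviour concrete, here is a minimal sketch showing that the embedding layer is a table of shape (vocab size, embedding size), and that passing in an index returns the matching row. The seed and the one-hot comparison are additions for illustration; the values will differ from the output above because the weights are randomly initialized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed so the random initial weights are repeatable

embedding = nn.Embedding(5, 3)
input_indices = torch.tensor([1, 2, 4, 0])
embedded = embedding(input_indices)

# The layer's only parameter is a (vocab_size, embed_size) weight matrix.
weight_shape = tuple(embedding.weight.shape)

# Looking up an index returns the matching row of that matrix.
rows = embedding.weight[input_indices]
same_as_rows = torch.equal(embedded, rows)

# Equivalent view: one-hot encode the indices and multiply by the weight matrix.
one_hot = F.one_hot(input_indices, num_classes=5).float()
same_as_matmul = torch.allclose(one_hot @ embedding.weight, embedded)

print(weight_shape, same_as_rows, same_as_matmul)
```

Because the rows are ordinary trainable parameters, training adjusts them so that integers that behave similarly end up with similar vectors.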