
```
ROMEO:
And from the embracement be spokes to stand,
As we shall breathest to the market-fairly maid
So month in my father, I may see thee not my side
And love the prisoner like a cradist of my daughter.
```

The text above is not a lost work of Shakespeare; it was generated entirely by a GPT-2-like model that I trained on my laptop in less than 20 minutes. In this tutorial, we follow an implementation of the "Attention Is All You Need" paper so that you can generate your own Shakespeare at home.

In [38]:
import os
import torch
from dataset import getData, getVocabSize
import pickle
from contextlib import nullcontext
from utils import train, inference
import math
import torch.nn as nn
from torch.nn import functional as F

Below, we define all the parameters used for training and to describe the model. Please feel free to modify any of the parameters except for those marked with ``DO NOT MODIFY``.

In [39]:
class TrainConfig:

    # Parameters to modify:
    batch_size: int = 64  # Number of sequences per batch (one batch per training step)
    max_iters: int = 2000  # Total number of training iterations
    learning_rate: float=1e-3 # Learning rate
    grad_clip: float=1.0 # Maximum magnitude of gradient
    eval_interval: int=50 # How often to evaluate the model
    eval_iters: int=10 # Number of iterations to average for evaluation
    seed: int=1337 # Random seed (can change the results)
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'

    # These are responsible for correct training given GPU (DO NOT MODIFY)
    dtype: str =  'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
    ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
    ctx = nullcontext() if device == 'cpu' else torch.amp.autocast(device_type=device, dtype=ptdtype)
    scaler = torch.amp.GradScaler(device,enabled=(dtype == 'float16'))

    # Populated by the script (DO NOT MODIFY)
    train_dataloader: None
    test_dataloader: None
    optimizer: None

class ModelConfig:
    context_length: int = 256 # Number of tokens used for prediction
    vocab_size: int = -1 # Number of tokens in the vocabulary, set from the data (DO NOT MODIFY; a wrong value here limits which tokens the model can recognize)
    n_layer: int = 6 # Depth of the Transformer model (here: 6 Transformer Blocks)
    n_head: int = 6 # Number of heads in the Multi-Head Attention
    n_embd: int = 384 # Embedding dimension
    dropout: float = 0.2 # Dropout fraction; higher fraction -> stronger regularization, but longer training (requires adjustment to the training time)
    bias: bool = False # Whether or not to use a bias in the attention projections
    compile: bool = False # Whether to use torch.compile (slower start-up, faster training afterwards)
    attn_dim: int = n_embd//n_head # Attention dimension (DO NOT MODIFY; changing the number here can break the model)


model_config = ModelConfig()
train_config = TrainConfig()
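
The `grad_clip` setting caps the size of the gradient before each optimizer step. The `train` helper from `utils` is not shown here, so the following is only a sketch of one common way to do this, clipping by the global norm; the function name `clip_by_global_norm` is hypothetical and it works on a flat list of numbers rather than on tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so that their L2 norm is at most max_norm.

    Hypothetical helper for illustration; the tutorial's utils.train may differ.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Leave small gradients untouched; shrink large ones proportionally
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)
```

The direction of the gradient is preserved; only its length changes, which is why clipping tends to stabilize training without biasing the update direction.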

Below, we define CUDA optimizations. These flags control whether TensorFloat-32 tensor cores may be used in matrix multiplications and cuDNN operations on Ampere or newer GPUs. TF32 offers a significant speed-up, but it is not available on older GPUs.

In [40]:
torch.manual_seed(train_config.seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn

Data Loading function. Here, we get the necessary vocabulary for the training and perform a simple training/testing split. No need to change anything here.

In [41]:
# Load data
data_dir = os.path.join('data', 'Shakespeare')
model_config.vocab_size = getVocabSize(data_dir)
train_config.train_dataloader, train_config.test_dataloader = getData(data_dir,model_config,train_config)

A simple definition of a feed-forward layer. No need to change anything here.

In [42]:
# Define feed forward network
class FeedForwardNetwork(nn.Module):
    def __init__(self, config:ModelConfig):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(config.n_embd, config.n_embd * 4),
            nn.ReLU(),
            nn.Linear(config.n_embd * 4, config.n_embd),
            nn.Dropout(config.dropout)
        )

    def forward(self, x):
        return self.ffn(x)

In [43]:
import os
import torch
from dataset import getData, getVocabSize
import pickle
from contextlib import nullcontext
from utils import train, inference
import math
# Import torch.nn to access nn.Module, nn.Linear, etc.
import torch.nn as nn
from torch.nn import functional as function


### IMPLEMENTATION REQUIRED - Implement ``attention(self,q,k,v,T)`` of the Attention Module

Below, we define the attention layer of the Transformer model. Here, you need to implement the attention mechanism. We define the attention as:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Nevertheless, the original attention can easily overfit to the data. To alleviate that, we introduce an additional dropout layer. For your convenience, we split the implementation into two steps:
$$weights = \frac{QK^T}{\sqrt{d_k}}$$
$$attention = \text{softmax}(\text{dropout}(weights))V$$
Since the model generates text left to right, the implementation also applies a causal mask (`self.mask`) before the softmax, so that a token cannot attend to later positions.

In [44]:
class Attention(nn.Module):
    def __init__(self, config:ModelConfig):
        super().__init__()
        self.Wq = nn.Linear(config.n_embd, config.attn_dim, bias=config.bias)
        self.Wk = nn.Linear(config.n_embd, config.attn_dim, bias=config.bias)
        self.Wv = nn.Linear(config.n_embd, config.attn_dim, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
        self.register_buffer("mask", torch.tril(torch.ones(config.context_length, config.context_length, requires_grad=False)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.Wq(x)
        k = self.Wk(x)
        v = self.Wv(x)
        return self.attention(q,k,v,T)

    def attention(self, q, k, v, T):
        # Dimension of the key vectors, used to scale the dot products
        dk = k.size(-1)

        # Scaled dot-product scores, shape (B, T, T)
        weights = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(dk)

        # Dropout goes before the mask: dropout multiplies dropped entries by 0,
        # and 0 * -inf would turn masked scores into NaN during training
        weights = self.dropout(weights)

        # Causal mask: a token must not attend to later positions
        weights = weights.masked_fill(self.mask[:T, :T] == 0, float('-inf'))

        # Normalize the scores and take the weighted sum of the values
        return torch.matmul(F.softmax(weights, dim=-1), v)
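
To make the formula concrete, here is a minimal sketch of causal scaled dot-product attention for a single short sequence, written in plain Python so that it runs without PyTorch. The helper names `softmax` and `causal_attention` are illustrative only, and dropout is left out, which corresponds to evaluation mode.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(value - m) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask for one sequence.

    q, k and v are lists of T vectors; q and k vectors have length d_k.
    """
    T, d_k = len(q), len(k[0])
    out = []
    for i in range(T):
        # Score position i against every position j; future positions get -inf
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
            if j <= i else float("-inf")
            for j in range(T)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out

q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
result = causal_attention(q, k, v)
print(result[0])  # [1.0, 0.0]: the first token can only attend to itself
```

The first output row equals the first value vector because every later position is masked out, which is the behaviour the lower-triangular `mask` buffer enforces in the module above.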

### IMPLEMENTATION REQUIRED - Implement ``forward(self,x)`` of the MultiHeadAttention Module

Below, we define the multi-head attention layer of the Transformer model. Here, you need to implement the multi-head attention mechanism defined as:
$$MultiHead(x) = \text{Dropout}(\text{Concat}(\text{head}_1, ..., \text{head}_{\text{heads}})W^O),$$
$$ \text{where head}_i = \text{Attention}(x)$$

In [45]:
# Define Multi-head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, config:ModelConfig):
        super().__init__()
        self.config = config
        self.heads = nn.ModuleList([Attention(config) for _ in range(self.config.n_head)])
        self.projection_layer = nn.Linear(self.config.n_embd, self.config.n_embd)
        self.dropout = nn.Dropout(self.config.dropout)

    def forward(self, x):
        # Run every attention head on the same input
        outputs = [attention_head(x) for attention_head in self.heads]
        # Concatenate the head outputs along the embedding dimension
        combined = torch.cat(outputs, dim=-1)
        # Project the concatenated heads and apply dropout
        return self.dropout(self.projection_layer(combined))
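
The shapes only line up if the concatenated heads are exactly `n_embd` wide. The following plain-Python sketch checks that arithmetic with the values from `ModelConfig`; the nested lists stand in for the per-head output of a single token and are not real model outputs.

```python
n_embd, n_head = 384, 6  # values from ModelConfig above

# Each head projects to n_embd // n_head features
attn_dim = n_embd // n_head

# Stand-in head outputs for a single token: attn_dim floats per head
head_outputs = [[float(h)] * attn_dim for h in range(n_head)]

# Concatenating along the feature axis mirrors torch.cat(outputs, dim=-1)
combined = [value for head in head_outputs for value in head]

print(attn_dim, len(combined))  # 64 384
```

If `n_embd` is not divisible by `n_head` (for example 384 and 5), the integer division drops the remainder, the concatenated width becomes 380 instead of 384, and the projection layer no longer matches its input.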

Finally, we are able to define the standard Transformer Block. No changes required here.

In [46]:
# Define Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, config:ModelConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mha = MultiHeadAttention(config)
        self.ffn = FeedForwardNetwork(config)

    def forward(self, x):
        x = x + self.mha(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

### IMPLEMENTATION REQUIRED - Implement ``__init__`` of Positional Encoding

Below, we define the Positional Encoding of the Transformer architecture. The positional encoding assigns each token a vector that depends only on its position in the input sequence. Therefore, a positional encoding can be seen as a feature defined purely by the position of each token. We can precompute it as:
$$PE(pos,2i) = \sin(pos/div)$$
$$PE(pos,2i+1) = \cos(pos/div),$$
where $div=10000^{2i/d_{model}}$. The first equation defines the encoding for the even embedding dimensions and the second one defines it for the odd embedding dimensions.

In [47]:
class PositionalEncoding(nn.Module):

    def __init__(self, config:ModelConfig):
        super().__init__()
        # Column vector of positions, shape (context_length, 1)
        pos = torch.arange(0, config.context_length, requires_grad=False).unsqueeze(1)
        # div = 10000^(2i / n_embd), one entry per sin/cos pair
        div = torch.exp(torch.arange(0, config.n_embd, 2) * (math.log(10000.0) / config.n_embd))
        pe = torch.zeros(config.context_length, config.n_embd, requires_grad=False)

        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)

        # Buffer: moves with the module across devices but is not trained
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pe[:x.size(1),:]
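
As a small worked example of the two equations, the sketch below builds the same table in plain Python for a tiny configuration. The function name `positional_encoding` and the sizes (4 positions, 4 dimensions) are chosen only for illustration.

```python
import math

def positional_encoding(context_length, d_model):
    """Return a context_length x d_model table of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(context_length)]
    for pos in range(context_length):
        # The loop variable is already the even index 2i of the formula
        for even in range(0, d_model, 2):
            div = 10000 ** (even / d_model)
            pe[pos][even] = math.sin(pos / div)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(pos / div)
    return pe

pe = positional_encoding(context_length=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]: sin(0) = 0 and cos(0) = 1 in every pair
print(pe[1])
```

Each pair of dimensions uses a different `div`, so low dimensions oscillate quickly with position while high dimensions change slowly.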

Now, we define our model. We combine all our blocks into the final Transformer model, which consists of multiple Transformer blocks.

In [48]:
# Define the model
class Model(nn.Module):
    def __init__(self, config:ModelConfig):
        super().__init__()
        self.tok_embedding = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_embedding = PositionalEncoding(config)
        self.transformer_blocks = nn.Sequential(*(
                [TransformerBlock(config) for _ in range(config.n_layer)] +
                [nn.LayerNorm(config.n_embd)]
        ))
        self.model_out_linear_layer = nn.Linear(config.n_embd, config.vocab_size)
        self.drop = nn.Dropout(config.dropout)
        self.context_length = config.context_length

    def forward(self, idx:torch.Tensor):
        _, T = idx.shape
        pos_emb = self.pos_embedding(idx)
        tok_emb = self.tok_embedding(idx)

        x = self.transformer_blocks(self.drop(tok_emb+pos_emb))
        logits = self.model_out_linear_layer(x)
        return logits

Now, we can initialize the model and, optionally, compile it.

In [49]:
# Initialize the model
model = Model(model_config).to(train_config.device)
if model_config.compile:
    model = torch.compile(model)

Finally, we can start the optimization process and start our training! This will take a bit...

In [None]:
# Create the optimizer and train; Losses updated every eval_interval steps
train_config.optimizer = torch.optim.AdamW(model.parameters(), lr=train_config.learning_rate)
train(model,train_config)

  0%|          | 0/2000 [00:00<?, ?it/s]

Here, you can save the model for further use. We will use this to show you how to load a model in other applications below.

In [None]:
# Save the model (create the output directory first if it does not exist)
os.makedirs('model', exist_ok=True)
torch.save(model.state_dict(), 'model/model.ckpt')
with open('model/model_config.pkl','wb') as f:
    pickle.dump(model_config, f)

Configuration used for inference. Feel free to modify it to your liking!

In [None]:
class InferenceConfig():
    seed:int=0 # Random seed (impacts the output)
    start:str="ROMEO:" # Starting prompt to generate from
    temperature:float = 0.7 # Degree of 'creativity': 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
    max_new_tokens:int=250 # Length of the generated sequence in tokens
    top_k:int=None  # Retain only the top k most likely tokens, clamp others to have 0 probability (None - no clamp)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
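
The `inference` helper from `utils` is not shown in this tutorial, so the following is only a sketch of how `temperature` and `top_k` are commonly applied when picking the next token. The function name `sample_next` and the example logits are hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token index from raw logits; returns (index, probabilities)."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the k largest logits and push the rest to -inf (probability 0)
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # Softmax over the remaining logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

rng = random.Random(0)
token, probs = sample_next([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2, rng=rng)
print(token, [round(p, 3) for p in probs])
```

With `top_k=2`, only the two most likely tokens keep a non-zero probability. Note that in this sketch, logits tied with the cutoff value are all kept, so slightly more than `k` tokens can survive.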

As before, we enable the CUDA optimizations where available. Use the same CUDA configuration as the one above.

In [None]:
inference_config = InferenceConfig()
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
torch.manual_seed(inference_config.seed)

Here we load the model and optionally compile it. We also set `meta_path`, which points to the information about the vocabulary the model was trained on; this helps with generation.

In [None]:
# Load the model and hyperparameters
with open('model/model_config.pkl', 'rb') as f:
    model_config = pickle.load(f)

model = Model(model_config)
if model_config.compile:
    model = torch.compile(model)
model.load_state_dict(torch.load('model/model.ckpt', map_location=inference_config.device, weights_only=True), strict=False)
model.eval()
model.to(inference_config.device)

inference_config.meta_path = os.path.join('data', 'Shakespeare', 'meta.pkl')

Now, you can generate your text here!

In [3]:
# Generate text
print(inference(model, inference_config))

VIRGILIA:
'Tis not to save labour, nor that I want love.

VALERIA:
You would be another Penelope: yet, they say, all
the yarn she spun in Ulysses' absence did but fill
Ithaca full of moths. Come; I would your cambric
were sensible as your finger, that you might leave
pricking it for pity. Come, you shall go with us.


To see how big the model is, you can run the cell below.

In [None]:
# Optionally, print the total number of trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params
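
As a cross-check, the count can also be estimated by hand from the layer definitions above. The sketch below does that arithmetic in plain Python; the function name `count_parameters` is illustrative, and `vocab_size=65` is an assumption (the real value comes from `getVocabSize`), so the printed total only matches the cell above if your vocabulary has that size.

```python
def count_parameters(vocab_size, n_embd=384, n_head=6, n_layer=6):
    """Estimate the trainable parameters of the Model defined above."""
    attn_dim = n_embd // n_head
    # Wq, Wk and Wv per head, created with bias=False
    attention = n_head * 3 * n_embd * attn_dim
    # Output projection of MultiHeadAttention (nn.Linear default: with bias)
    projection = n_embd * n_embd + n_embd
    # Two linear layers of the feed-forward network, both with bias
    ffn = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * 2 * n_embd
    block = attention + projection + ffn + norms

    embedding = vocab_size * n_embd
    final_norm = 2 * n_embd
    head = n_embd * vocab_size + vocab_size
    # The positional encoding and the causal mask are buffers, not parameters
    return embedding + n_layer * block + final_norm + head

print(count_parameters(vocab_size=65))  # 10690625 under the assumed vocabulary size
```

Almost all of the parameters sit in the Transformer blocks, with the feed-forward layers accounting for roughly two thirds of each block.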