# **Shakespeare Text Generation with Transformers**





In this notebook, we will learn how to implement a simple Transformer model in PyTorch, as described in the paper "Attention Is All You Need" by Vaswani et al. (2017). We will generate Shakespeare-style text, similar to the RNN notebook from the previous week.


In [None]:
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


# Check whether GPU is available and can be used
# if CUDA is found then device is set accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if not torch.cuda.is_available():
    print("Consider changing your run-time to GPU or training will be slow.")

In [None]:
# hyperparameters
batch_size = 16
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

## The data: Shakespeare's sonnets

Shakespeare's sonnets, along with all of his other works, can be found at the following URL: http://shakespeare.mit.edu/

For convenience, we have extracted the plain text of the sonnets from https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt into a separate text file and added it to the class repository. We will download it from there:

In [None]:
!wget https://raw.githubusercontent.com/ccc-frankfurt/Practical_ML_SS21/master/week06/sonnets.txt

--2023-04-11 13:18:42--  https://raw.githubusercontent.com/ccc-frankfurt/Practical_ML_SS21/master/week06/sonnets.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94081 (92K) [text/plain]
Saving to: ‘sonnets.txt’


2023-04-11 13:18:43 (57.1 MB/s) - ‘sonnets.txt’ saved [94081/94081]



We can open the text file and print an excerpt.



In [None]:
# Open shakespeare text file and read the data
with open('sonnets.txt', 'r') as f:
    text = f.read()
print("length of dataset in characters: ", len(text))
# print an excerpt of the text 
print(text[:100])

length of dataset in characters:  94081
From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the ri


Create a mapping from characters to integers.

In [None]:
# This is the character-level tokenizer.
# Collect all the unique characters that occur in the text.
chars = sorted(list(set(text)))
vocabulary_size = len(chars)
# Create the mappings from characters to integers and back.
s_to_int = { char:i for i,char in enumerate(chars) }
int_to_s = { i:char for i,char in enumerate(chars) }
# Encoder: takes a string of characters and outputs a list of integers.
lambda_enc = lambda s: [s_to_int[c] for c in s]
# Decoder: takes a list of integers and outputs a string of characters.
lambda_dec = lambda l: ''.join([int_to_s[i] for i in l])
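
To make the behaviour of the character-level tokenizer concrete, here is a minimal round-trip sketch. It uses a single line of text as a toy corpus instead of `sonnets.txt`, and the names `sample_text`, `char_to_id`, `id_to_char`, `encode` and `decode` are illustrative assumptions, not part of the notebook.

```python
# Toy corpus standing in for sonnets.txt (an assumption for illustration only).
sample_text = "From fairest creatures we desire increase,"

# Vocabulary: every distinct character, sorted so the ids are reproducible.
sample_chars = sorted(set(sample_text))
char_to_id = {ch: i for i, ch in enumerate(sample_chars)}
id_to_char = {i: ch for i, ch in enumerate(sample_chars)}


def encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [char_to_id[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(id_to_char[i] for i in ids)


ids = encode("fairest")
roundtrip = decode(ids)
print(ids, roundtrip)
```

Encoding followed by decoding returns the original string, as long as every character of the input occurs in the corpus the vocabulary was built from.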

Transform the text data to integers and split it into training and validation sets.

In [None]:

# Use the lambda encoder function to transform the text to a tensor of token ids.
data = torch.tensor(lambda_enc(text), dtype=torch.long)
n =  # TODO: index at which to split the data: the first 90% is for training, the rest for validation.
train_data =  # TODO
val_data =  # TODO
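
As a hedged illustration of the 90/10 split described above, the following sketch works on a toy tensor rather than on the encoded sonnets. The names `toy_data`, `split_at`, `toy_train` and `toy_val` are assumptions made for this example; it shows one possible way to split, not the reference solution.

```python
import torch

# Hypothetical stand-in for the encoded text: 100 consecutive "token ids".
toy_data = torch.arange(100, dtype=torch.long)

# The first 90% of the sequence is used for training, the remainder for validation.
split_at = int(0.9 * len(toy_data))
toy_train = toy_data[:split_at]
toy_val = toy_data[split_at:]

print(len(toy_train), len(toy_val))
```

Splitting by position rather than shuffling keeps the text contiguous, so the validation set contains passages the model has never seen during training.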

Create a function to generate batches for training and validation


In [None]:
def get_batch(split='train'):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    indices_x =  # TODO: pick a batch of random start indices. We recommend torch.randint(..., (batch_size,)).
    input_seqs =  # TODO: create the input sequences. Hint: iterate over indices_x and combine the slices with torch.stack.
    predicted_seqs =  # TODO: create the target sequences, i.e. the inputs shifted by one position.
    return input_seqs.to(device), predicted_seqs.to(device)
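
The batching idea can be sketched on toy data as follows. This is one possible approach under stated assumptions, not the reference solution: `sample_batch`, `toy_data`, `toy_batch_size` and `toy_block_size` are hypothetical names, and the data is simply `0, 1, 2, ...` so that the one-step shift between inputs and targets is easy to see.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the encoded training text.
toy_data = torch.arange(1000, dtype=torch.long)
toy_batch_size, toy_block_size = 4, 8


def sample_batch(source, batch_size, block_size):
    """Draw random windows from `source` together with their next-token targets."""
    # Random start positions; the upper bound leaves room for the shifted target.
    starts = torch.randint(len(source) - block_size, (batch_size,))
    # Inputs: block_size consecutive tokens starting at each position.
    x = torch.stack([source[i:i + block_size] for i in starts])
    # Targets: the same windows shifted by one token.
    y = torch.stack([source[i + 1:i + block_size + 1] for i in starts])
    return x, y


xb, yb = sample_batch(toy_data, toy_batch_size, toy_block_size)
print(xb.shape, yb.shape)
```

Each row of `yb` holds, at position `t`, the token that follows position `t` of the corresponding row of `xb`, which is what a next-token prediction loss needs.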


In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        # TODO: specify the linear layers for keys, queries and values.
        # Hint: n_embd is the input dimension of each of these layers and head_size is the output dimension.
        self.key =  # TODO
        self.query =  # TODO
        self.value =  # TODO
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout =  # TODO: dropout layer

    def forward(self, x):
        B,T,C = x.shape
        k =  # TODO: get the keys, shape (B, T, head_size)
        q =  # TODO: get the queries, shape (B, T, head_size)
        v =  # TODO: get the values, shape (B, T, head_size)

        # compute attention scores using the keys and queries
        attention_weights =  # TODO. Hint: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        attention_weights =  # TODO: mask out future positions using self.tril; the shape stays (B, T, T)
        attention_weights =  # TODO: softmax over the last dimension. Hint: use softmax from torch.nn.functional.
        attention_weights =  # TODO: don't forget the dropout layer here.
        # TODO: weighted aggregation of the values.
        out =  # Hint: (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
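
For orientation, here is a minimal sketch of a single causal self-attention head. It is one possible implementation under stated assumptions, not the reference solution: the class name `CausalHeadSketch` is hypothetical, the hyperparameters are passed as constructor arguments instead of being read from globals, and the scores are scaled by the square root of the head size, following the scaled dot-product attention of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalHeadSketch(nn.Module):
    """One head of masked (causal) self-attention; an illustrative sketch."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular matrix: position t may only attend to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        v = self.value(x)  # (B, T, head_size)
        # Scaled dot-product scores: (B, T, hs) @ (B, hs, T) -> (B, T, T)
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # Hide future positions so they receive zero weight after the softmax.
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)
        return weights @ v  # (B, T, T) @ (B, T, hs) -> (B, T, hs)


torch.manual_seed(0)
head = CausalHeadSketch(n_embd=64, head_size=16, block_size=32)
x = torch.randn(2, 10, 64)
out = head(x)
print(out.shape)
```

Because of the mask, changing the last token of the input leaves the outputs at all earlier positions unchanged, which is a quick way to check that the head is causal.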


In [None]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads =  # TODO: create an nn.ModuleList with num_heads attention heads
        self.proj =  # TODO: define a linear layer that projects the concatenated head outputs
        self.dropout =  # TODO: don't forget the dropout layer

    def forward(self, x):
        out =  # TODO: run all attention heads on x and concatenate their outputs
        out =  # TODO: compute the final output after projection and dropout
        return out
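
The following sketch shows how several heads can run in parallel and be merged by a projection layer. It is an illustration under stated assumptions, not the reference solution: `TinyHead` is a compact, dropout-free causal head repeated here only to keep the example self-contained, and `MultiHeadSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyHead(nn.Module):
    """Compact causal attention head, included only to keep the sketch self-contained."""

    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x):
        T = x.shape[1]
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        mask = torch.tril(torch.ones(T, T, device=x.device))
        scores = scores.masked_fill(mask == 0, float("-inf"))
        return F.softmax(scores, dim=-1) @ v


class MultiHeadSketch(nn.Module):
    """Several heads in parallel, concatenated and projected back to n_embd."""

    def __init__(self, n_embd, num_heads, head_size, dropout=0.0):
        super().__init__()
        self.heads = nn.ModuleList([TinyHead(n_embd, head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Each head returns (B, T, head_size); concatenate along the feature axis.
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))


torch.manual_seed(0)
mha = MultiHeadSketch(n_embd=64, num_heads=4, head_size=16)
x = torch.randn(2, 10, 64)
y = mha(x)
print(y.shape)
```

With `head_size = n_embd // num_heads`, the concatenated output already has `n_embd` features, so the projection keeps the shape and lets the heads mix their information.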

In [None]:
class MLP(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        # TODO: define a simple feed-forward network with ReLU and dropout, and assign it to self.net

    def forward(self, x):
        return self.net(x)
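
A position-wise feed-forward network of this kind could look like the sketch below. The class name `FeedForwardSketch` is hypothetical, and the hidden width of `4 * n_embd` is an assumption borrowed from the ratio used in the paper (2048 hidden units for a model dimension of 512); the notebook itself does not prescribe a width.

```python
import torch
import torch.nn as nn


class FeedForwardSketch(nn.Module):
    """Linear -> ReLU -> Linear -> Dropout, applied independently at every position."""

    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand (the factor 4 is an assumption)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back to the embedding size
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
ffn = FeedForwardSketch(n_embd=64)
x = torch.randn(2, 10, 64)
y = ffn(x)
print(y.shape)
```

The network acts on the last dimension only, so the batch and time dimensions pass through unchanged.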

In [None]:
class Block(nn.Module):
    """ Transformer block """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.att =  # TODO: multi-head attention block
        self.mlp =  # TODO: MLP block
        # Layer normalization is applied to the inputs of the attention block and of the MLP block.
        self.ln1 =  # TODO
        self.ln2 =  # TODO

    def forward(self, x):
        # TODO: compute the output: first the attention block, then the MLP block :)
        x =  # TODO
        x =  # TODO
        return x
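
The wiring of a block with layer normalization applied before each sub-layer can be sketched as follows. Two points are assumptions made for this illustration: the residual (skip) connections around both sub-layers, which are standard in Transformers but not spelled out in the cell above, and the use of plain `nn.Linear` layers as placeholders for the attention and MLP modules so that the example stays self-contained. `PreNormBlockSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn


class PreNormBlockSketch(nn.Module):
    """Pre-norm Transformer block: x + att(ln1(x)), then x + mlp(ln2(x))."""

    def __init__(self, n_embd, attention, feed_forward):
        super().__init__()
        self.att = attention      # any module mapping (B, T, n_embd) -> (B, T, n_embd)
        self.mlp = feed_forward   # any module mapping (B, T, n_embd) -> (B, T, n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Normalize first, apply the sub-layer, then add the result to the input.
        x = x + self.att(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


torch.manual_seed(0)
# Placeholder sub-layers (assumption): real code would pass attention and MLP modules.
block = PreNormBlockSketch(8, attention=nn.Linear(8, 8), feed_forward=nn.Linear(8, 8))
x = torch.randn(2, 5, 8)
y = block(x)
print(y.shape)
```

Since each sub-layer only adds a correction to its input, the block preserves the tensor shape and several blocks can be chained with `nn.Sequential`.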

In [None]:
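# Now we bring everything together in a single language model. The class is named BigramLanguageModel, but with the attention blocks it conditions on up to block_size previous tokens.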
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embedding table: maps each token id to a vector of size n_embd
        self.token_embedding_table =  # TODO. Hint: use nn.Embedding.
        self.position_embedding_table =  # TODO: positional embedding, one vector per position in the block
        self.blocks =  # TODO: stack n_layer Transformer blocks. Hint: don't forget to wrap the blocks in nn.Sequential.
        self.ln_f =  # TODO: final layer normalization
        self.lm_head =  # TODO: linear layer that produces the output logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx and targets are both (B, T) tensors of integers
        # TODO: compute token and positional embeddings
        tok_emb =  # Hint: tensor shape should be (B, T, C)
        pos_emb =  # Hint: tensor shape should be (T, C)
        x =  # TODO: sum token and position embeddings. Hint: tensor shape should be (B, T, C)
        x =  # TODO: pass the input through the attention blocks. Hint: tensor shape should be (B, T, C)
        x =  # TODO: final layer normalization
        logits =  # TODO: compute logits. Hint: the final tensor shape should be (B, T, vocabulary_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits =  # TODO: reshape the logits to (B*T, C)
            targets =  # TODO: reshape the targets to (B*T,)
            loss =  # TODO: cross-entropy loss

        return logits, loss

    def generate(self, context_indices, max_new_tokens):
        # context_indices is a (B, T) tensor of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = context_indices[:, -block_size:] # crop context_indices to the last block_size tokens
            # TODO: get the predictions
            logits, _ =  # TODO
            logits =  # TODO: transform the logits to (B, C). Hint: we only care about the last time step.
            probs =  # TODO: get probabilities from the logits

            idx_next =  # TODO: sample from the obtained distribution. Hint: use torch.multinomial.
            context_indices =  # TODO: append the sampled index to the running sequence
        return context_indices # tensor of shape (B, T + max_new_tokens)
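
The autoregressive sampling loop can be tried out in isolation with the sketch below. It is an illustration under stated assumptions, not the reference solution: `DummyLM` is a hypothetical stand-in model (an embedding followed by a linear layer) that only mimics the `(logits, loss)` return convention, and `sample_tokens` is a free function rather than a method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DummyLM(nn.Module):
    """Hypothetical stand-in model that returns (logits, loss) like the notebook's model."""

    def __init__(self, vocab_size, n_embd=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        return self.head(self.emb(idx)), None  # logits: (B, T, vocab_size)


@torch.no_grad()
def sample_tokens(model, context, max_new_tokens, block_size):
    """Extend `context` of shape (B, T) by max_new_tokens sampled tokens."""
    for _ in range(max_new_tokens):
        idx_cond = context[:, -block_size:]        # keep at most block_size tokens
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                  # last time step only: (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
        context = torch.cat((context, idx_next), dim=1)     # (B, T + 1)
    return context


torch.manual_seed(0)
toy_model = DummyLM(vocab_size=10)
start = torch.zeros((1, 1), dtype=torch.long)
generated = sample_tokens(toy_model, start, max_new_tokens=20, block_size=8)
print(generated.shape)
```

Sampling with `torch.multinomial` instead of taking the arg-max keeps the generated text varied; the sequence grows by exactly one token per iteration.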


In [None]:
model = BigramLanguageModel()
m = model.to(device)
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

#TODO create a PyTorch optimizer
optimizer = 

0.209213 M parameters


In [None]:
max_iters = 10000  # overrides the value set in the hyperparameter cell
for iter in range(max_iters):

    # Evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # TODO Training loop




step 0: train loss 4.3294, val loss 4.3360
step 100: train loss 2.4908, val loss 2.5141
step 200: train loss 2.3485, val loss 2.3625
step 300: train loss 2.2509, val loss 2.2695
step 400: train loss 2.1984, val loss 2.2258
step 500: train loss 2.1264, val loss 2.1631
step 600: train loss 2.0778, val loss 2.1177
step 700: train loss 2.0247, val loss 2.0807
step 800: train loss 1.9798, val loss 2.0485
step 900: train loss 1.9410, val loss 2.0328
step 1000: train loss 1.9165, val loss 1.9922
step 1100: train loss 1.8724, val loss 1.9618
step 1200: train loss 1.8446, val loss 1.9447
step 1300: train loss 1.8190, val loss 1.9079
step 1400: train loss 1.8029, val loss 1.9127
step 1500: train loss 1.7745, val loss 1.8965
step 1600: train loss 1.7674, val loss 1.8670
step 1700: train loss 1.7408, val loss 1.8612
step 1800: train loss 1.7305, val loss 1.8548
step 1900: train loss 1.7255, val loss 1.8554
step 2000: train loss 1.7055, val loss 1.8343
step 2100: train loss 1.6896, val loss 1.8330
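
The core of a training iteration (forward pass, loss, backward pass, optimizer step) can be sketched on a toy problem as follows. Everything here is an assumption made for illustration: the tiny `toy_model`, the choice of `torch.optim.AdamW`, the learning rate, and the fixed random batch. It is not the notebook's reference solution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, B, T = 10, 4, 8

# Hypothetical toy model: token embedding followed by a linear output layer.
toy_model = nn.Sequential(nn.Embedding(vocab, 16), nn.Linear(16, vocab))
# AdamW is one common choice of optimizer; the notebook leaves the choice open.
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

# A fixed random batch of inputs and targets, both of shape (B, T).
xb = torch.randint(vocab, (B, T))
yb = torch.randint(vocab, (B, T))

losses = []
for step in range(50):
    logits = toy_model(xb)                                  # (B, T, vocab)
    # cross_entropy expects (N, C) logits and (N,) targets, hence the reshape.
    loss = F.cross_entropy(logits.view(B * T, vocab), yb.view(B * T))
    toy_optimizer.zero_grad(set_to_none=True)               # clear old gradients
    loss.backward()                                         # backpropagate
    toy_optimizer.step()                                    # update the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In the notebook's loop the model already returns the loss, so the same three optimizer calls would follow a call such as `logits, loss = model(xb, yb)`.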


In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(lambda_dec(m.generate(context, max_new_tokens=2000)[0].tolist()))


That I am old in you doth view
For men of thy care, mysicide,'
Than before ould that which woman whence case,
And Event Time's both outgo;
Under thee and my sight burreing may eyous,
Thereforoul's self willeds what I sometime removed,
In of you earers hot offence.'
Of thou forth my sightdered of me,
Some of you and your stand
That mayst fazed ful-I may them' untrue,
Were me pays shows now her suse,
But why thou unseasure his commentate of seeith forner sock hence.
Hid doth to be present are earth strong made eye?
From what not the raccely makind cuments youth and use so so cold;
Tell saucy all a so set as to not fears,
To maturz elsend to me.'s but one is thought,
That is mind, excuse my self, and for heaven's besire!
Ah! and mine oceang bear.

When If thou maysumulatest dead;
To sow, thy fair redged
Thy beauty, unksomed more shame,
And kill no longer of you my sight?
Returns are the conquest, plays,
The errs'mage sight sick thousa'd you,
More and in him that cannot spass would anothe

Compare the results obtained this week with the results obtained using RNNs and LSTMs in last week's practical assignment. Do Transformer models significantly outperform their RNN counterparts? If not, what components are missing to achieve the impressive text generation performance that Transformers are known for?