This project follows Andrej Karpathy's video walkthrough (https://www.youtube.com/watch?v=kCc8FmEb1nY) to build a small Transformer trained on the TinyShakespeare dataset.

We will use TinyShakespeare, a single text file of about 1.1 MB drawn from Shakespeare's works.

It can be downloaded from: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

The goal is to train a Transformer-based architecture that acts as an "infinite" Shakespeare generator.


Plan: 
- Define a Transformer architecture
- Train on the TinyShakespeare dataset 
- Generate infinite Shakespeare 



In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt



--2023-01-22 04:18:11--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-01-22 04:18:12 (114 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
with open('input.txt', 'r', encoding = 'utf-8') as f: 
  text = f.read()


In [3]:
print('length of dataset in characters: ', len(text))

length of dataset in characters:  1115394


The dataset is a little over 1.1 million characters long.

In [4]:
#print the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
#determine all the unique characters that are present in the text

unique_chars = sorted(list(set(text)))
vocab_size = len(unique_chars) #vocab size defines the possible elements of our sequences

print(''.join(unique_chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


There are 65 unique characters that the model can see or emit (the first one, which shows up as a line break in the output above, is the newline character).
 

##Tokenizing the input text

Tokenizing means converting the raw text (a string) into a sequence of integers, according to some vocabulary.

Since we are building a character-level language model, we will map individual characters to integers: e.g. with the vocabulary above, "a" maps to 39, "b" maps to 40, etc. 




In [6]:
#iterate over all characters and create a map from the character to the integer, and vice versa 
string_to_ints = {ch: i for i, ch in enumerate(unique_chars)}
ints_to_strings = {i:ch for i, ch in enumerate(unique_chars)}

#encoding: taking a string and outputting a list of ints. 
encode = lambda s: [string_to_ints[c] for c in s]
#decoding: the opposite, take a list of integers and output a string  
decode = lambda l: ''.join(ints_to_strings[i] for i in l)

#test out on an example
print(encode('hello, how are you?'))
print(decode(encode('hello, how are you?')))



[46, 43, 50, 50, 53, 6, 1, 46, 53, 61, 1, 39, 56, 43, 1, 63, 53, 59, 12]
hello, how are you?


We have encoded a string and decoded it back.

Many other tokenizers exist, e.g. SentencePiece, which encodes at the sub-word level (between characters and words). 

GPT uses byte-pair encoding (BPE), another sub-word scheme.

Tokenizers trade off sequence length against vocabulary size: a large vocabulary gives short sequences, and a small vocabulary gives long ones.

We will stick with our simple character-level functions, so we get long sequences and a small vocabulary.
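
To make the trade-off concrete, here is a minimal sketch comparing character-level tokens with a naive whitespace split on the opening lines of the dataset. The whitespace split is only an illustrative stand-in for a coarser tokenizer (it is not SentencePiece or BPE), and `sample`, `char_tokens`, and `word_tokens` are names made up for this example.

```python
# Compare sequence length under two tokenization granularities.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."

char_tokens = list(sample)      # one token per character
word_tokens = sample.split()    # naive stand-in for a coarser tokenizer

print(len(char_tokens), "character tokens")
print(len(word_tokens), "whitespace-separated tokens")
```

The character-level sequence is several times longer for the same text. On the full corpus the character vocabulary stays at 65, while a word-level vocabulary would run into the thousands.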

Now we can tokenize the entire Shakespeare dataset.

We will store the result in a PyTorch tensor.

In [7]:
#encode the text and wrap it in a Pytorch tensor

import torch 
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

These are the first 1000 characters of the text, encoded as integers in a PyTorch tensor.

The entire text is represented as one long sequence of integers.

Now we split the data: the first 90% is used for training and the last 10% for validation.

In [8]:
n = int(.9* len(data))
train_data = data[:n]
val_data = data[n:]

We can't feed all the data into the Transformer at once. Instead, we train on small chunks sampled at random positions, each with a maximum length of block_size (also called the context length).

In [9]:
block_size = 8 
train_data[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

These are the first 9 characters in the training set.

In these 9 characters, there are 8 individual training examples: 

e.g. in the context of 18, 47 comes next; in the context of 18 and 47, 56 comes next; and so on.

In [10]:
#x holds the inputs to the transformer: the first block_size characters
x = train_data[:block_size]
#y holds the target for each position in the input: the same block_size characters, offset by 1
y = train_data[1:block_size+1]

for t in range(block_size):
  context = x[:t+1] 
  target = y[t]
  print(f'when input is {context} the target is: {target}')

when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is: 58


This spells out what we said above: 

there are 8 contexts: [18]; [18, 47]; [18, 47, 56]; etc. 

and there are 8 targets (i.e. the tokens that come next, which we are aiming to predict): 47, 56, 57, etc., respectively.

We also need to think about batch_size, so that multiple chunks can be processed in parallel on the GPU.

In [11]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8 

#generate a small batch of data of inputs x and targets y 
#we will be stacking 4 rows of width 8 into a single 4x8 tensor

def get_batch(split): 
  #choose the data to sample batches from: train_data or val_data
  data = train_data if split == 'train' else val_data 
  #draw batch_size random starting indices for the chunks in the data array
  ix = torch.randint(len(data) - block_size, (batch_size,))
  #grab the input chunks and stack them into a (batch_size, block_size) tensor
  x = torch.stack([data[i:i+block_size] for i in ix])
  #grab the target chunks, which are offset by 1 compared to x 
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])

  return x,y

xb, yb = get_batch('train')
print('inputs: ')
print(xb.shape)
print(xb)
print('targets: ')
print(yb.shape)
print(yb)

print('-----')

#some code to help understand the context and targets a bit more 

for b in range(batch_size): #batch dimension 
  for t in range(block_size): 
    context = xb[b, :t+1]
    target = yb[b, t]
    print(f'when input is {context.tolist()} the target: {target}')
 

#we have 32 independent examples packed into a single batch (4 sequences x 8 positions)

inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets: 
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
-----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44,

In [12]:
#print our input to the transformer 
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


##Start with a simple baseline: a bigram language model as a PyTorch module 



In [13]:
import torch 
import torch.nn as nn
from torch.nn import functional as F 
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module): 
  def __init__(self): 
    super().__init__()
    #each token directly reads off the logits for the next token from a lookup table
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) #nn.Embedding holds a (vocab_size x vocab_size) table: one row of logits per token 

  def forward(self, idx, targets = None): 
    #idx and targets are both (Batch_size, Time) tensors of integers
    #Logits are of size (B,T,C); with C = vocab_size ... these are the predictions for each one of the 4x8 positions 
    #in other words, for each batch, for each position context, there is a list of predictions at that position 
    logits = self.token_embedding_table(idx) 

    #targets is optional: without targets (e.g. during generation) there is no loss to compute 
    if targets is None: 
      loss = None

    else: 
      #build a loss
      #F.cross_entropy expects the class dimension C second, i.e. (B,C,T) rather than (B,T,C), so we reshape logits 
      B,T,C = logits.shape
      #flatten the batch and time dimensions so logits become 2D
      logits = logits.view(B*T, C) #stretch out the 3D tensor into a 2D tensor, preserving the channels as the 2nd dimension 
      #reshape targets: they are currently (B,T), so we flatten them to 1D (B*T)
      targets = targets.view(B*T)

      loss = F.cross_entropy(logits, targets)

    return logits, loss

  #continues the generation in the time dimension, for each batch dimension 
  def generate(self, idx, max_new_tokens): #max new tokens is a parameter determining the number of tokens we want to generate 
    #idx is (B,T) array of indices in the current context
    for _ in range(max_new_tokens): 
      logits, loss = self(idx) # shape(B,T,C) this will perform the forward function 
      #focus only on the last time step, the prediction for the next token 
      logits = logits[:, -1, :] #this becomes (B,C) 
      #apply softmax over the C dimension to get probabilities 
      probs = F.softmax(logits, dim = -1) #(B,C) 
      #sample 1 item from this distribution 
      idx_next = torch.multinomial(probs, num_samples=1) #(B,1) 
      #append sampled index to the running sequence 
      idx = torch.cat((idx, idx_next), dim = 1) #(B,T+1)

    return idx


m = BigramLanguageModel()
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

#generate a (random, because untrained) length 100 sequence from the model, starting from a single token with index 0 (the newline character)
print(decode(m.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 100)[0].tolist()))




torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


This setup is a bit wasteful for now: generate feeds the entire running context into the model, but a bigram model predicts the next token using only the token immediately before it. We write it this way so that the generate function can be reused later, when the model does use the full context.
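
A small check makes the bigram property concrete: in a pure embedding-lookup model, the logits at the last position depend only on the last token, regardless of what precedes it. This is a self-contained sketch with a bare `nn.Embedding` standing in for the model above; `table`, `ctx_a`, and `ctx_b` are names introduced for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65
# A bare lookup table: row i holds the next-token logits for token i
table = nn.Embedding(vocab_size, vocab_size)

# Two different contexts that end in the same token (47)
ctx_a = torch.tensor([[18, 47]])
ctx_b = torch.tensor([[1, 15, 47]])

last_a = table(ctx_a)[:, -1, :]  # (1, vocab_size)
last_b = table(ctx_b)[:, -1, :]  # (1, vocab_size)

# Compare the two predictions; the earlier tokens play no role in the lookup
print(torch.equal(last_a, last_b))
```

Both contexts produce the same last-step logits, which is why a longer context cannot help this model.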

##Train the model



In [14]:
#create a pytorch optimizer 
optimizer = torch.optim.AdamW(m.parameters(), lr = 1e-3)


In [15]:
batch_size = 32
for steps in range(10000): 
  #get a batch of data 
  xb, yb = get_batch('train')

  #evaluate the loss 
  logits, loss = m(xb, yb) #pass the inputs and targets through our bigram model 
  optimizer.zero_grad(set_to_none = True)
  loss.backward()
  optimizer.step()

print(loss.item())

2.382369041442871
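
To put this number in context, it helps to compare it against a model that guesses uniformly over the 65 characters, whose cross-entropy is ln(65). The sketch below is plain arithmetic on the values printed above; `uniform_loss` and `perplexity` are illustrative names.

```python
import math

vocab_size = 65                 # number of unique characters found earlier
final_loss = 2.382369041442871  # loss printed after training the bigram model

# Cross-entropy of a uniform guess over the vocabulary: -ln(1/65) = ln(65)
uniform_loss = math.log(vocab_size)

# exp(loss) is the perplexity: the effective number of equally likely choices
perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.4f}")
print(f"perplexity after training: {perplexity:.2f}")
```

The untrained model's loss of 4.8786 sits above the uniform baseline of about 4.17 because its randomly initialized logits are not uniform. After training, the loss on this last batch falls well below that baseline.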


In [16]:
#Let's generate some predictions and decode 
print(decode(m.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 300)[0].tolist()))



lso br. ave aviasurf my, yxMPZI ivee iuedrd whar ksth y h bora s be hese, woweee; the! KI 'de, ulseecherd d o blllando;LUCEO, oraingofof win!
RIfans picspeserer hee tha,
TOFonk? me ain ckntoty ded. bo'llll st ta d:
ELIS me hurf lal y, ma dus pe athouo
BEY:! Indy; by s afreanoo adicererupa anse tecor


Still gibberish, but starting to look almost like English... definitely not Shakespeare.

We are only looking at the very last character. Now we want to use more context, more history,

so we will build a Transformer.


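But first, let's build a Python script out of everything we have written so far.

Almost everything is the same, except we have added: 
- GPU support: the model and the batches are moved to the GPU when one is available
- A less noisy loss estimate, averaged over eval_iters batches for both the train and validation splits
- Token embeddings of size n_embd, position embeddings, and a linear layer (lm_head) that maps the embeddings back to logits over the vocabulary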



In [17]:
%%writefile 'bigram.py'
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 3000
eval_interval = 300
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# bigram-style model, extended with token embeddings, position embeddings, and a linear head
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token is mapped to an n_embd-dimensional embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd) #32 dimensional embeddings 
        self.position_embedding_table = nn.Embedding(block_size, n_embd) #each position will get its own embedding vector 
        self.lm_head = nn.Linear(n_embd, vocab_size) #linear layer mapping embeddings to next-token logits 

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C): token identities plus their positions
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
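        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens, since the position table has only block_size rows
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)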
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

Writing bigram.py


##Self Attention Toy Example

In [18]:
B,T,C = 4,8,2
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [19]:
#VERSION 1 

#We will implement a simple way of including the context of previous tokens when making a prediction: averaging over a given token and all tokens before it
#we want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) #xbow refers to a bag of words, i.e. an average that ignores order
for b in range(B): 
  for t in range(T): 
    xprev = x[b, :t+1] #(t+1,C): the current token and all tokens before it
    xbow[b,t] = torch.mean(xprev, 0) #average over the 0th dimension (time) 


In [21]:
#to elucidate this, let's index into the first batch, and print out x and xbow
print(x[0]) 

print(xbow[0])

tensor([[ 1.2794, -0.7354],
        [-0.2471,  0.9398],
        [ 2.0026, -1.3095],
        [ 0.3397,  0.2184],
        [-1.9841,  0.3335],
        [ 0.4053,  1.8873],
        [ 1.2212, -1.2814],
        [ 0.7391,  0.6104]])
tensor([[ 1.2794, -0.7354],
        [ 0.5162,  0.1022],
        [ 1.0117, -0.3683],
        [ 0.8437, -0.2217],
        [ 0.2781, -0.1106],
        [ 0.2993,  0.2224],
        [ 0.4310,  0.0075],
        [ 0.4695,  0.0829]])


As we can see, the first element remains unchanged, since it is simply an average of the first element itself. 
The second row of xbow, however, is an average of the first and second row of x, etc. 
The last row is the average over all the rows. 

We can do this more efficiently with matrix multiplication.
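
The trick rests on a simple fact: multiplying by a lower-triangular matrix of ones computes running sums, and normalizing each row turns those sums into running averages. A minimal sketch on a 3x2 example, with `a`, `b`, and `c` as throwaway names:

```python
import torch

# Lower-triangular ones: row t selects rows 0..t of whatever it multiplies
a = torch.tril(torch.ones(3, 3))
# Normalize each row to sum to 1, turning sums into averages
a = a / a.sum(dim=1, keepdim=True)

b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

c = a @ b  # row t of c is the mean of rows 0..t of b
print(c)
```

The second row of `c` is the average of the first two rows of `b`, i.e. [4.0, 5.5], computed without any Python loop.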


In [32]:
#VERSION 2 

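wei = torch.tril(torch.ones(T,T)) #lower-triangular matrix of ones: row t selects tokens 0..t
wei = wei / torch.sum(wei, 1, keepdim = True) #normalize each row to sum to 1, so the weighted sum becomes an average
xbow2 = wei @ x # (T,T) @ (B,T,C): wei is broadcast over the batch, giving (B,T,T) @ (B,T,C) ---> (B,T,C)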
wei 

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [33]:
#showing that xbow and xbow2 are equivalent
torch.allclose(xbow, xbow2)

True

In [35]:
#VERSION 3 

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T)) #acts as the affinities between tokens. These will eventually be data dependent: different tokens will find other tokens more or less interesting 
wei = wei.masked_fill(tril == 0, float('-inf')) #wherever tril is 0, fill wei with -inf, so that a token cannot attend to tokens in its future
wei = F.softmax(wei, dim = 1) #softmax normalizes each row, so we get a matrix equal to the wei matrix above
print(wei)
xbow3 = wei @ x 

#show that xbow3 is numerically equal to xbow 
torch.allclose(xbow, xbow3)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


True
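
The payoff of the softmax formulation is that the zeros in `wei` can be replaced by arbitrary scores. In the sketch below, random numbers stand in for learned affinities (an assumption for illustration only); the mask still blocks the future, and each row still sums to 1, but the weights are no longer uniform.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T = 4
tril = torch.tril(torch.ones(T, T))

# Random scores as a placeholder for data-dependent affinities
scores = torch.randn(T, T)
# -inf becomes exactly 0 after softmax, since exp(-inf) = 0
scores = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

print(weights)
print(weights.sum(dim=-1))  # each row sums to 1
```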

In [36]:
# Version 4: self-attention 

torch.manual_seed(1337) 
B,T,C = 4, 8, 32
x = torch.randn(B,T,C)

#let's see a single head perform self attention 
head_size = 16
key = nn.Linear(C, head_size, bias = False)
query = nn.Linear(C, head_size, bias = False)
value = nn.Linear(C, head_size, bias = False) 

#'self' attention because the keys, queries, and values all come from the same source, x; 'cross attention' would take its keys and values from another source
k = key(x) #(B,T,16) #this represents a token's "key"
q = query(x) #(B,T,16) #this represents a token's "query"
#dot product between each token's query and every token's key: a measure of how much affinity the tokens have for each other
wei = q @ k.transpose(-2, -1) # (B,T,16) @ (B,16,T) --> (B,T,T)  

tril = torch.tril(torch.ones(T,T))
#Since we are recreating a 'decoder' block, the mask makes each token use only itself and previous tokens as context, never future tokens. We would not apply this mask in an encoder block 
wei = wei.masked_fill(tril == 0, float('-inf'))  
wei = F.softmax(wei, dim = -1)

v = value(x) 
out = wei @ v

out.shape


torch.Size([4, 8, 16])
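
One detail that Version 4 leaves out, and that the final script applies, is scaling the scores before the softmax. With unit-variance queries and keys, the raw dot products have a variance of roughly `head_size`, which pushes the softmax toward one-hot outputs; multiplying by `head_size**-0.5` brings the variance back to about 1. The following sketch checks this empirically on random tensors (the sizes are arbitrary choices, picked to give a stable estimate).

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 32, 64, 16

# Unit-variance stand-ins for the projected keys and queries
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

raw = q @ k.transpose(-2, -1)    # (B, T, T), variance ~ head_size
scaled = raw * head_size**-0.5   # (B, T, T), variance ~ 1

print(f"raw variance:    {raw.var().item():.2f}")
print(f"scaled variance: {scaled.var().item():.2f}")
```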

##We will finish off by adding several new components:
- Multi-head attention: multiple self-attention heads running in parallel, with their outputs concatenated 
- Layer normalization: normalizes across the embedding dimension for each token, so that activations have mean 0 and standard deviation 1 (before the learned scale and shift) 
- Dropout layers: randomly zero out a subset of activations on each forward pass during training, which effectively trains an ensemble of sub-networks
- Multiple Transformer decoder blocks: stack several decoder blocks, each built from the self-attention we coded above  
  - Within these blocks, add residual (skip) connections 
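
As a quick illustration of the layer normalization bullet, the sketch below applies `nn.LayerNorm` to random activations and checks that each token's embedding vector ends up with a mean close to 0 and a standard deviation close to 1. The tensor sizes are arbitrary; at initialization the layer's learnable scale and shift are 1 and 0, so they do not alter the normalized values.

```python
import torch
import torch.nn as nn

torch.manual_seed(1337)
B, T, C = 4, 8, 32
# Activations with a deliberately off-center mean and large spread
x = torch.randn(B, T, C) * 5 + 3

ln = nn.LayerNorm(C)  # normalizes over the last (embedding) dimension
y = ln(x)

# Statistics are computed per token, i.e. over the C dimension
print(y.mean(dim=-1).abs().max().item())            # close to 0
print(y.std(dim=-1, unbiased=False).mean().item())  # close to 1
```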

OK... let's create a script out of everything we have so far: 



In [37]:
%%writefile final_shakespeare_transformer.py 

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a small MLP: linear layer, ReLU, linear layer, then dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd) #add a layer norm 
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) #residual connections
        x = x + self.ffwd(self.ln2(x))
        return x

# GPT-style decoder-only Transformer (the class name is kept from the bigram baseline)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings, a stack of Transformer blocks, a final layer norm, and a linear head
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))
#open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

Writing final_shakespeare_transformer.py
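
As a sanity check on the model size, the parameter count that the script prints can be reproduced by hand from the hyperparameters. The arithmetic below assumes `vocab_size = 65`, as found earlier in the notebook, and follows the layer definitions in the script; the variable names are illustrative.

```python
# Hyperparameters from the script; vocab_size = 65 comes from the dataset
vocab_size, block_size = 65, 256
n_embd, n_head, n_layer = 384, 6, 6
head_size = n_embd // n_head

embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables

# Per block: key/query/value projections (no bias) for every head
attention = n_head * 3 * n_embd * head_size
projection = n_embd * n_embd + n_embd                   # output projection with bias
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * 2 * n_embd                            # two LayerNorms, weight + bias
block = attention + projection + feed_forward + layer_norms

final_norm = 2 * n_embd                                 # ln_f weight + bias
lm_head = n_embd * vocab_size + vocab_size              # linear head with bias

total = embeddings + n_layer * block + final_norm + lm_head
print(total / 1e6, "M parameters")
```

This comes to about 10.8 million parameters, nearly all of them in the six Transformer blocks. The `tril` mask is registered as a buffer, so it does not count as a parameter.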
