#Wiki Transformer from Scratch

This project is inspired by Andrej Karpathy's work at: https://www.youtube.com/watch?v=kCc8FmEb1nY

We will be building a decoder-only Transformer from scratch, and training it on a corpus of Wikipedia data, to try and generate Wikipedia-style text. 

We will be training on: Wikitext - V2. Wikitext - V2 is a 2M word subset of the Wikipedia corpus. 
The goal for the project is to: 

Plan: 
- define a decoder transformer architecture
- train on the WikiText dataset 
- Generate infinite Wikipedia-like text 



##Imports 


In [1]:
import torch 
import pandas as pd

In [2]:
#device agnostic code
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Glance at data 
*Take a peek at data to see if it's what we want

In [3]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m462.8/462.8 KB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting xxhash
  Downloading xxhash-3.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.0/213.0 KB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.0/132.0 KB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3

In [4]:
!python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])"


Downloading builder script: 100% 5.27k/5.27k [00:00<00:00, 3.17MB/s]
Downloading metadata: 100% 2.36k/2.36k [00:00<00:00, 1.76MB/s]
Downloading readme: 100% 7.67k/7.67k [00:00<00:00, 6.69MB/s]
Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...
Downloading data files:   0% 0/2 [00:00<?, ?it/s]
Downloading data:   0% 0.00/8.12M [00:00<?, ?B/s][A
Downloading data: 10.8MB [00:00, 108MB/s]        [A
Downloading data: 30.3MB [00:00, 108MB/s]
Downloading data files:  50% 1/2 [00:00<00:00,  1.79it/s]
Downloading data: 4.85MB [00:00, 124MB/s]        
Downloading data files: 100% 2/2 [00:00<00:00,  2.32it/s]
Extracting data files: 100% 2/2 [00:00<00:00, 1491.04it/s]
Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.
{'

In [5]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("wikitext", 'wikitext-2-v1')

Downloading builder script:   0%|          | 0.00/8.48k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/6.84k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/9.25k [00:00<?, ?B/s]

In [6]:
#inspect dataset description 
ds_builder.info.description

' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike\n License.\n'

In [7]:
#inspect dataset features 
ds_builder.info.features

{'text': Value(dtype='string', id=None)}

We can see that the wiki dataset is simply text with data type "string". No label is needed because we are not doing classification.

##Load Dataset 


See what splits the dataset has

In [8]:
from datasets import get_dataset_split_names
get_dataset_split_names("wikitext", 'wikitext-2-v1')

['test', 'train', 'validation']

We see there is train, test and validation sets

Download the training data 

In [9]:
from datasets import load_dataset
dataset = load_dataset("wikitext", 'wikitext-2-v1', split="train")

Downloading and preparing dataset wikitext/wikitext-2-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...


Downloading data:   0%|          | 0.00/4.48M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.


Check out the dataset object 

In [10]:
dataset

Dataset({
    features: ['text'],
    num_rows: 36718
})

There are 36718 rows of text. Let's explore more by indexing into the dataset

In [11]:
dataset[:3]

{'text': ['', ' = Valkyria Chronicles III = \n', '']}

## Pre - Pre Processing


Now we want to pre process the text. Since we are going to be buliding a character-level transformer, to keep it simple, we will transform our Datasets object into a single string. 

In [None]:
#It is easier for me to convert to pandas df
import pandas as pd
df_pandas = pd.DataFrame(dataset)

In [None]:
df_pandas.head()

Unnamed: 0,text
0,
1,= Valkyria Chronicles III = \n
2,
3,Senjō no Valkyria 3 : <unk> Chronicles ( Japa...
4,"The game began development in 2010 , carrying..."


In [None]:
#Now flatten all of the 'text' columns into a single, super long string
text = ' '.join(df_pandas['text'].tolist())

In [None]:
#Check out the first 1000 characters 
print(text[1000:2000])

acter designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . 
  It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 . 
   = = Gameplay = = 
   As with previous <unk> Chronicles games , Valkyria Chronicles III is a tactical role @-@ playing game where players take control of a military unit and t

In [None]:
#determine all the unique characters that are present in the text
unique_chars = sorted(list(set(text)))
vocab_size = len(unique_chars) #vocab size defines the possible elements of our sequences

print(''.join(unique_chars))
print(vocab_size)


 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^`abcdefghijklmnopqrstuvwxyz|~¡£¥§°±²³µ·½ÁÅÆÉÍÎÖ×ØÚÜÞàáâãäåçèéêëìíîñòóôöøúûüĀāăćčĐđėīŁłńŌōśşšūųŻžơưʻʿ̃αβγκμСавекостяاحصلنه्กงณตมยรลัาิ่์გდვზიკორსუცძწხჯ჻ḥṃṅṣṭṯảấầắễệịớửỳ‑–—‘’“”„†…′″⁄₤€₹⅓⅔→−≤☉♭♯〈〉のァアキスットプュリルヴ・動場大戦攻機殻火礮空隊﻿～
283


As we can see, there are 283 unique characters in the dataset that the model will be able to see or emit. This is because many are non-english. 

In [None]:
print('length of dataset in characters: ', len(text))

length of dataset in characters:  10791252


There are 10M characters total in the text

##Tokenizing the input text

Convert the raw text (as string) to a sequence of integers, according to some vocabulary 

Since we are building a character level language model, we will transfer individual characters to integers: eg. "a" maps to "5"; "b" maps to "6", etc. 




In [None]:
#iterate over all characters and create a map from the character to the integer, and vice versa 
string_to_ints = {ch: i for i, ch in enumerate(unique_chars)}
ints_to_strings = {i:ch for i, ch in enumerate(unique_chars)}

#encoding: taking a string and outputting a list of ints. 
encode = lambda s: [string_to_ints[c] for c in s]
#decoding: the opposite, take a list of integers and output a string  
decode = lambda l: ''.join(ints_to_strings[i] for i in l)

#test out on an example
print(encode('hello, how are you?'))
print(decode(encode('hello, how are you?')))

[72, 69, 76, 76, 79, 13, 1, 72, 79, 87, 1, 65, 82, 69, 1, 89, 79, 85, 32]
hello, how are you?


We have encoded a string, and decoded it back... 

There are many other encoders/decoders we can use. Eg. SentencePiece, which encodes at the sub-word level (between characters and words). Each has a trade-off between sequence length and vocabulary size: eg. large vocabulary size with small sequence length, or vice-versa..

GPT uses byte-word 

We will use a character-level encoding for simplicity, so we will get long sequences and small vocabulary size

Now we can tokenize the entire Wikitext training set

we will use the Pytorch tensor 

In [None]:
#encode the text and wrap it in a Pytorch tensor
import torch 
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([10791252]) torch.int64
tensor([  1,   1,  30,   1,  55,  65,  76,  75,  89,  82,  73,  65,   1,  36,
         72,  82,  79,  78,  73,  67,  76,  69,  83,   1,  42,  42,  42,   1,
         30,   1,   0,   1,   1,   1,  52,  69,  78,  74, 152,   1,  78,  79,
          1,  55,  65,  76,  75,  89,  82,  73,  65,   1,  20,   1,  27,   1,
         29,  85,  78,  75,  31,   1,  36,  72,  82,  79,  78,  73,  67,  76,
         69,  83,   1,   9,   1,  43,  65,  80,  65,  78,  69,  83,  69,   1,
         27,   1, 273, 271, 257, 268, 258, 267, 260, 265, 266, 259,  20,   1,
         13,   1,  76,  73,  84,   1,  15,   1,  55,  65,  76,  75,  89,  82,
         73,  65,   1,  79,  70,   1,  84,  72,  69,   1,  35,  65,  84,  84,
         76,  69,  70,  73,  69,  76,  68,   1,  20,   1,  10,   1,  13,   1,
         67,  79,  77,  77,  79,  78,  76,  89,   1,  82,  69,  70,  69,  82,
         82,  69,  68,   1,  84,  79,   1,  65,  83,   1,  55,  65,  76,  75,
         89,  82,  73,  65,  

This is a sequence of the first 1000 characters encoded as integers, in the form of a Pytorch Tensor

The entire text is represented as a sequence of integers

Now, we want to do a train/test split at 90%/10%, respectively

In [None]:
n = int(.9* len(data))
train_data = data[:n]
val_data = data[n:]

We can't feed all the data in to the Transformer at once... we need to feed in small chunks (of a maximum length: Block_size / context_length ) at random 

In [None]:
block_size = 8 
train_data[:block_size + 1]

tensor([ 1,  1, 30,  1, 55, 65, 76, 75, 89])

These are the first 9 characters in the training set 

In these 9 characters, there are 8 individual training examples: 

For example: in the context of 1, 1 comes next.In the context of 1 and 1, 30 comes next.

In [None]:
#x are inputs to transformer ... the first block_size characters
x = train_data[:block_size]
#y are the targets for each position in the input... they will be next block size, (offset by 1 compared to x)
y = train_data[1:block_size+1]

for t in range(block_size):
  context = x[:t+1] 
  target = y[t]
  print(f'when input is {context} the target is: {target}')

when input is tensor([1]) the target is: 1
when input is tensor([1, 1]) the target is: 30
when input is tensor([ 1,  1, 30]) the target is: 1
when input is tensor([ 1,  1, 30,  1]) the target is: 55
when input is tensor([ 1,  1, 30,  1, 55]) the target is: 65
when input is tensor([ 1,  1, 30,  1, 55, 65]) the target is: 76
when input is tensor([ 1,  1, 30,  1, 55, 65, 76]) the target is: 75
when input is tensor([ 1,  1, 30,  1, 55, 65, 76, 75]) the target is: 89


This spells out what we said above:  there are 8 contexts: 1; 1,1, 1,1,30; 1,1,30,1; etc. 

There are 8 targets (eg. tokens that come next, and that we are aiming to predict): 1,30,1, respectively 

For efficiency, we want to process multiple text chunks in parallel on the GPU, so we need to create batches. 


In [None]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8  #This is also sometimes referred to as 'T for Time'

#generate a small batch of data of inputs x and targets y 
#we will be stacking 4 rows of width 8 into a single 4x8 tensor

def get_batch(split): 
  #set the data that we are grabbing the batches from to be train_data or test_data
  data = train_data if split == 'train' else val_data 
  #set batch_size number of indexes for where to grab the chunks from in the data array
  ix = torch.randint(len(data) - block_size, (batch_size,))
  #grab batches of data for inputs, by concatenation  
  x = torch.stack([data[i:i+block_size] for i in ix])
  #grab batches of data for targets, which will be offset by 1 compared to x 
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])

  return x,y

xb, yb = get_batch('train')
print('inputs: ')
print(xb.shape)
print(xb)
print('targets: ')
print(yb.shape)
print(yb)

print('-----')

#some code to help understand the context and targets a bit more 

for b in range(batch_size): #batch dimension 
  for t in range(block_size): 
    context = xb[b, :t+1]
    target = yb[b, t]
    print(f'when input is {context.tolist()} the target: {target}')

 #we have 32 independent examples packed into a single batch 

inputs: 
torch.Size([4, 8])
tensor([[65, 66, 76, 69,  1, 66, 65, 82],
        [84, 72, 69,  1, 33, 14, 33,  1],
        [ 1,  1,  1,  1, 30,  1, 53, 72],
        [69, 82,  1, 72, 69, 82,  1, 79]])
targets: 
torch.Size([4, 8])
tensor([[66, 76, 69,  1, 66, 65, 82, 82],
        [72, 69,  1, 33, 14, 33,  1, 84],
        [ 1,  1,  1, 30,  1, 53, 72, 69],
        [82,  1, 72, 69, 82,  1, 79, 87]])
-----
when input is [65] the target: 66
when input is [65, 66] the target: 76
when input is [65, 66, 76] the target: 69
when input is [65, 66, 76, 69] the target: 1
when input is [65, 66, 76, 69, 1] the target: 66
when input is [65, 66, 76, 69, 1, 66] the target: 65
when input is [65, 66, 76, 69, 1, 66, 65] the target: 82
when input is [65, 66, 76, 69, 1, 66, 65, 82] the target: 82
when input is [84] the target: 72
when input is [84, 72] the target: 69
when input is [84, 72, 69] the target: 1
when input is [84, 72, 69, 1] the target: 33
when input is [84, 72, 69, 1, 33] the target: 14
when input is

In [None]:
#print our input to the transformer 
print(xb) #Of shape = (Batch_size x block_size (T for Time))

tensor([[65, 66, 76, 69,  1, 66, 65, 82],
        [84, 72, 69,  1, 33, 14, 33,  1],
        [ 1,  1,  1,  1, 30,  1, 53, 72],
        [69, 82,  1, 72, 69, 82,  1, 79]])


In [None]:
#we can modularize this code with the following class: 

# data loading
def get_batch(split):

    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x, y
    return x, y

##Start with simple baseline: Bigram language model



In [None]:
import torch 
import torch.nn as nn
from torch.nn import functional as F 
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module): 
  def __init__(self, vocab_size): 
    super().__init__()
    #Create an embedding table for each unique character 
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) #Embedding class creates a tensor of vocab_size x vocab_size 

  def forward(self, idx, targets = None): 
    #idx and targets are both (Batch_size, block_size (or T for Time)) tensors of integers
    #Logits are of size (B,T,C); C = embedding_dimension (in this case = vocab_size) ... these are the predictions for each one of the 4x8 (BXT) positions 
    #in other words, for each batch, for each position context, there is a list of predictions at that position 

    logits = self.token_embedding_table(idx)

    #targets is optional, so if there is a targets inputted: 
    if targets is None: 
      loss = None

    else: 
      #build a loss
      #pytorch wants (B,C,T) rather than (B,T,C), so we need to reshape logits 
      B,T,C = logits.shape
      #reshape logits to a shape that pytorch expects
      logits = logits.view(B*T, C) #stretch out the 3D tensor into a 2D tensor, preserving the channels as the 2nd dimension 
      #reshape targets: they are currently (B,T), we will stretch to make 1D)  
      targets = targets.view(B*T)

      loss = F.cross_entropy(logits, targets)

    return logits, loss

  #continues the generation in the time dimension, for each batch dimension 
  def generate(self, idx, max_new_tokens): #max new tokens is a parameter determining the number of tokens we want to generate 
    #idx is (B,T) array of indices in the current context
    for _ in range(max_new_tokens): 
      logits, loss = self(idx) # shape(B,T,C) this will perform the forward function 
      #focus only on the last time step, the prediction for the next token 
      logits = logits[:, -1, :] #this becomes (B,C) 
      #apply softmax over the C dimension to get probabilities 
      probs = F.softmax(logits, dim = -1) #(B,C) 
      #sample 1 item from this distribution 
      idx_next = torch.multinomial(probs, num_samples=1) #(B,1) 
      #append sampled index to the running sequence 
      idx = torch.cat((idx, idx_next), dim = 1) #(B,T+1)

    return idx


bigram_model = BigramLanguageModel(vocab_size) 
print(logits.shape)
print(loss)

#generate a (random, because untrained) length 100 sequence from the model, by inputting a single 'space' character
print(decode(bigram_model.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 100)[0].tolist()))




torch.Size([256, 283])
tensor(6.2152, device='cuda:0', grad_fn=<NllLossBackward0>)

\kảCVâ/გễ~ÚázJmย″MàIVิκCÚ €uịI^tü⅓Î♯çśアÁルå ştV^კ〉ვšëčキ→dア€﻿şṣux่გ@óLDëVâ.殻ûśa火リк³S·&
g̃éş1eäÖu~hプ>%v


This model so far is a bit outrageous, because we are feeding long (block_size) length contexts into the generate function, but the function is only making predictions using the token immediately preceding the token to predict on.... We are doing this so we can re-use the generate function later on.  

##Train the Bigram model



In [None]:
#create a pytorch optimizer 
optimizer = torch.optim.AdamW(bigram_model.parameters(), lr = 1e-3) 

In [None]:
batch_size = 32
for steps in range(10000): 
  #get a batch of data 
  xb, yb = get_batch('train')

  #evaluate the loss 
  logits, loss = bigram_model(xb, yb) #pass the index and targets thorugh our bigram model 
  optimizer.zero_grad(set_to_none = True)
  loss.backward()
  optimizer.step()

print(loss.item())

2.374697208404541


In [None]:
#Let's generate some predictions and decode 
print(decode(bigram_model.generate(idx = torch.zeros((1,1), dtype = torch.long), max_new_tokens = 300)[0].tolist()))


 sonupenff o 12ndalyต@-@-@წ8. s clde hescteche Nor airex Al <und FThe n wo the Shede isldizz terecthe , tofe <untinded ton he ad thess ese are inctlsure ar then tonøÍ～gurstim wf athas (プγĀ. ghio cethoose <ungld f SAfox imiconetea ga d g lllad Mainkne se " webon a toves Sig thed 19  thank> Iliver <un


Still jibberish, but starting to look almost sort of like English... definitely not looking like Wikipedia text however. 

We are only using as context for each character prediction the single previous charcter. Now, we can start to use more context for prediction. 

So, we will build a Transformer 


##Building a Transformer

So far, we have only built a super super simple Bigram model. Now, we will build a Decoder-only Transformer model and pass our data through it. 

The Decoder architecture consists of multiple decoder blocks composed of: 
- MultiHead self attention 
- Layernorm + residual connections 
- Feedforward layers 

Let's build classes for each of these, and then put them all together 




###Self Attention 
Self-attention is fundamental to Transformers. In the following code, we will implement Single-head self attention, before modifying it to using   multiheads later. 

In [None]:
#implementation of single-head self-attention 
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
#define head_size: head size is typically much smaller than C, the embedding dimension. We will use 16 for now 
head_size = 16

#create the key, query, and value Weight matrices (Linear layers). 
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

#create a key and query for the input 
k = key(x)   # (B, T, head_size)
q = query(x) # (B, T, head_size)

wei =  q @ k.transpose(-2, -1) # (B, T, head_size) @ (B, head_size, T) ---> (B, T, T)
print(f' wei pre tril: {wei} ' )

#tril creates a lower triangular matrix of ones
tril = torch.tril(torch.ones(T, T))
print(f' tril: {tril}')

#set positions in wei = -inf, where tril = 0 
wei = wei.masked_fill(tril == 0, float('-inf')) 
print(f' wei post tril: {wei} ' )
wei = F.softmax(wei, dim=-1)
print(f' wei post softmax: {wei} ')

#create a value out of the input x  
v = value(x)
#matrix multiply our weight tensor by the value  
out = wei @ v
#out = wei @ x

out.shape

 wei pre tril: tensor([[[-1.7629e+00, -1.3011e+00,  5.6516e-01,  2.1616e+00, -1.0674e+00,
           1.9632e+00,  1.0765e+00, -4.5295e-01],
         [-3.3334e+00, -1.6556e+00,  1.0405e-01,  3.3782e+00, -2.1825e+00,
           1.0415e+00, -5.5714e-02,  2.9273e-01],
         [-1.0226e+00, -1.2606e+00,  7.6228e-02, -3.8125e-01, -9.8430e-01,
          -1.4303e+00,  7.4921e-02, -9.5465e-01],
         [ 7.8359e-01, -8.0143e-01, -3.3680e-01, -8.4963e-01, -5.6023e-01,
          -1.1701e+00, -1.2927e+00, -1.0260e+00],
         [-1.2566e+00,  1.8719e-02, -7.8797e-01, -1.3204e+00,  2.0363e+00,
           8.6381e-01,  3.7188e-01,  9.2577e-01],
         [-3.1262e-01,  2.4152e+00, -1.1058e-01, -9.9305e-01,  3.3449e+00,
          -2.5229e+00,  1.4187e+00,  1.2196e+00],
         [ 1.0876e+00,  1.9652e+00, -2.6213e-01, -3.1579e-01,  6.0905e-01,
           1.2616e+00, -5.4841e-01,  8.0485e-01],
         [-1.8044e+00, -4.1260e-01, -8.3061e-01,  5.8985e-01, -7.9869e-01,
          -5.8560e-01,  6.4332e-01,

torch.Size([4, 8, 16])

In [None]:
#we can modularize this code with the following class: 

#we inherit from nn.module and define an init and forward function 
class SingleHeadAttention(nn.Module):
    "We inhere"

    #we define key, query, and value matrices, as well as a dropout layer (present in the original Attention is All You Need paper)
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        #there is no need to train "tril", so we should register as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

###MultiHeadAttention 
we want to extend the SingleHeadAttention code that we wrote above to incorporate multiple attention heads. These separate attention heads operate in parallel, and enable a single token to pay attention to the other tokens in its context in a variety of ways. For example, in the sentence: "The dog left his coat in the house": one attention head could enable the "dog" to be attending strongly to "left". Meanwhile, another head could emphasize "dog" attending to 'his'. 

In [None]:
#like single-head attention, we inherit from nn.Module and define init and forward methods
class MultiHeadAttention(nn.Module):
    #in our initialization, we now take a new parameter 'num_heads', which sets the number of attention heads. 
    def __init__(self, num_heads, head_size):
        super().__init__()
        #create a list of length head_size, for the attention heads
        self.heads = nn.ModuleList([SingleHeadAttention(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
      #apply each attention head h to the input x. Concatenate the result, pass it through a linear layer, and perform dropout on the result
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

###FeedForward Linear Layer 

The outputs of the MHA attention layer are passed through a simple FF Linear Layer,


In [None]:
#This FF layer contains 2 linear layers, a ReLU, and a dropout layer.
#When we initialize the layer, we need to specify the embedding dimension 
class FeedFoward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

###Creating a Decoder Block 

A single decoder block is comprised of the self-attention and FF layers that we've defined above. They will also incorporate LayerNorm (from the Pytorch module) and residual connections. 

Residual connections, otherwise known as “skip connections”, connect non-sequential layers of a network, and are vital for Transformer models as the model size grows, to alleviate vanishing gradients. They enable gradients to flow unimpeded during backpropagation 

Layernorm normalizes the activations in a given layer to have mean = 0 and STD = 1. Layernorm is similar to batch norm, but you normalize over the feature dimension )eg. embedding dimension), instead of the batch dimension. LN helps make gradient descent more efficient, enabling smoother / more stable gradients and faster training:

Many of these blocks will be stacked together to create the model as a whole. Let's define a single block: 

In [None]:
class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        #Define the self attention layer
        self.sa = MultiHeadAttention(n_head, head_size)
        #Define the feedforward layer 
        self.ffwd = FeedFoward(n_embd)
        #Define the LayerNorm that comes after Self-attention
        self.ln1 = nn.LayerNorm(n_embd)
        #Define the LayerNorm that comes after the FF layer 
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
      #we add the input x to the outputs of self attention and the FF layers, to define residual connections
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

###Enhanced BigramModel 

We will make our BigramModel a bit more robust, by: 

- adding Position Ebeddings
- incorporating our "blocks" class
- change our token embedding table to be of size (vocab_size x embedding_dim) 


Why do we add Position Embeddings? The position of words in a sentence, and their relative distances from each other, of course contains a lot of information. 
So far, the tokens we have been dealing with are completely separated - they have no notion of “where” the others are. 

So, we can add position embedding vectors  to each of the tokens to give the tokens a notion of location & relative distance.

In [None]:
class EnhancedBigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        #build an embedding table of size (vocab_size x embedding_dim) 
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        #create a position embedding table 
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

## Estimate Loss:
We will build one more helper function: estimate_loss, which averages the loss over batches.

In [None]:
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

###Device Agnostic get_batch

Let's also update our get_batch function to be device agnostic and utilize the GPU

In [None]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

##Putting it all together

Now that we have the individual building blocks, we can put this all together into a single Python Script.
 




In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
#Set device agnostic code to utilize GPUs if they are available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

In [None]:
import pandas as pd
from datasets import load_dataset
import torch 

!pip install datasets

torch.manual_seed(42)

dataset = load_dataset("wikitext", 'wikitext-2-v1', split="train")
df_pandas = pd.DataFrame(dataset)

#Now flatten all of the 'text' columns into a single, super long string
text = ' '.join(df_pandas['text'].tolist())

#determine all the unique characters that are present in the text
unique_chars = sorted(list(set(text)))
vocab_size = len(unique_chars) #vocab size defines the possible elements of our sequences


#iterate over all characters and create a map from the character to the integer, and vice versa 
string_to_ints = {ch: i for i, ch in enumerate(unique_chars)}
ints_to_strings = {i:ch for i, ch in enumerate(unique_chars)}

#encoding: taking a string and outputting a list of ints. 
encode = lambda s: [string_to_ints[c] for c in s]
#decoding: the opposite, take a list of integers and output a string  
decode = lambda l: ''.join(ints_to_strings[i] for i in l)


#encode the text and wrap it in a Pytorch tensor
data = torch.tensor(encode(text), dtype = torch.long)
n = int(.9* len(data))
train_data = data[:n]
val_data = data[n:]


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




In [None]:
model = EnhancedBigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

0.237851 M parameters
step 0: train loss 5.8107, val loss 5.8092
step 100: train loss 2.7191, val loss 2.7112
step 200: train loss 2.5169, val loss 2.5182
step 300: train loss 2.4258, val loss 2.4370
step 400: train loss 2.3661, val loss 2.3715
step 500: train loss 2.3066, val loss 2.3113
step 600: train loss 2.2622, val loss 2.2509
step 700: train loss 2.2151, val loss 2.2130
step 800: train loss 2.1783, val loss 2.1826
step 900: train loss 2.1246, val loss 2.1432
step 1000: train loss 2.1016, val loss 2.1114
step 1100: train loss 2.0872, val loss 2.0996
step 1200: train loss 2.0527, val loss 2.0552
step 1300: train loss 2.0337, val loss 2.0358
step 1400: train loss 2.0069, val loss 2.0129
step 1500: train loss 1.9936, val loss 2.0111
step 1600: train loss 1.9809, val loss 1.9986
step 1700: train loss 1.9649, val loss 1.9724
step 1800: train loss 1.9425, val loss 1.9581
step 1900: train loss 1.9408, val loss 1.9572
step 2000: train loss 1.9322, val loss 1.9506
step 2100: train loss 1.

As we can see, our generated text is certainly not coherent English. But, we've made a lot of progress. We now have full words being generated: "Success", "construction", 'game', etc. For the size of the model, this isn't half bad! But we'd need to make many modifications to have this be an actually high performing model. The purpose of this project was simply to gain experience building a decoder-only Transformer from scratch. 

##Final Script 

Final code. The original code on the TinyShakespeare dataset can be found at Andrej Karpathy's GitHub: 

https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py





In [None]:
%%writefile wiki_transformer_from_scratch

!pip install datasets
import torch
import torch.nn as nn
from torch.nn import functional as F
import pandas as pd
from datasets import load_dataset
import torch 

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(42)

dataset = load_dataset("wikitext", 'wikitext-2-v1', split="train")
df_pandas = pd.DataFrame(dataset)

#Now flatten all of the 'text' columns into a single, super long string
text = ' '.join(df_pandas['text'].tolist())

#determine all the unique characters that are present in the text
unique_chars = sorted(list(set(text)))
vocab_size = len(unique_chars) #vocab size defines the possible elements of our sequences


#iterate over all characters and create a map from the character to the integer, and vice versa 
string_to_ints = {ch: i for i, ch in enumerate(unique_chars)}
ints_to_strings = {i:ch for i, ch in enumerate(unique_chars)}

#encoding: taking a string and outputting a list of ints. 
encode = lambda s: [string_to_ints[c] for c in s]
#decoding: the opposite, take a list of integers and output a string  
decode = lambda l: ''.join(ints_to_strings[i] for i in l)


#encode the text and wrap it in a Pytorch tensor
data = torch.tensor(encode(text), dtype = torch.long)
n = int(.9* len(data))
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class SingleHeadAttention(nn.Module):
    "We inhere"

    #we define key, query, and value matrices, as well as a dropout layer (present in the original Attention is All You Need paper)
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        #there is no need to train "tril", so we should register as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

#like single-head attention, we inherit from nn.Module and define init and forward methods
class MultiHeadAttention(nn.Module):
    #in our initialization, we now take a new parameter 'num_heads', which sets the number of attention heads. 
    def __init__(self, num_heads, head_size):
        super().__init__()
        #create a list of length head_size, for the attention heads
        self.heads = nn.ModuleList([SingleHeadAttention(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
      #apply each attention head h to the input x. Concatenate the result, pass it through a linear layer, and perform dropout on the result
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out


#This FF layer contains 2 linear layers, a ReLU, and a dropout layer.
#When we initialize the layer, we need to specify the embedding dimension 
class FeedFoward(nn.Module):

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        #Define the self attention layer
        self.sa = MultiHeadAttention(n_head, head_size)
        #Define the feedforward layer 
        self.ffwd = FeedFoward(n_embd)
        #Define the LayerNorm that comes after Self-attention
        self.ln1 = nn.LayerNorm(n_embd)
        #Define the LayerNorm that comes after the FF layer 
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
      #we add the input x to the outputs of self attention and the FF layers, to define residual connections
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class EnhancedBigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        #build an embedding table of size (vocab_size x embedding_dim) 
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        #create a position embedding table 
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = EnhancedBigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
