<a href="https://colab.research.google.com/github/ajitsingh98/transformers/blob/master/transformer_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers From Scratch


## Index

- Load the Data
- Tokenization
- Data Loader
- Bigram Model
- Code re-write in preparation for Transformers
- Previous Token Averages - Building Intuition for Self Attention
- Self attention
- Scaling Up
- Conclusion

## Load the Data

In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-01-07 17:59:41--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-01-07 17:59:41 (110 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
with open("input.txt") as f:
    text = f.read()

In [3]:
print(f"Total number of characters: {len(text)}")
print('First 500 characters:', text[:500])

Total number of characters: 1115394
First 500 characters: First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


## Tokenization

In [4]:
vocab = sorted(list(set(text)))
vocab_size = len(vocab)

print(f"Vocab size : {vocab_size}")
print(f"Vocab : {vocab}")

Vocab size : 65
Vocab : ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [5]:
char2idx = {char:idx for idx,char in enumerate(vocab)}
idx2char = {idx:char for char,idx in char2idx.items()}

encode = lambda x: [char2idx[char] for char in x]
decode = lambda x: "".join([idx2char[idx] for idx in x])

In [6]:
print('Character to index:', char2idx)
print('Index to character:', idx2char)

sample_text = "Hello World!"

encoded_text = encode(sample_text) # list of integer token ids
decoded_text = decode(encoded_text) # back to the original string

print(f'Tokenized text : {encoded_text}')
print(f'De-tokenized text : {decoded_text}')

assert sample_text == decoded_text

Character to index: {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
Index to character: {0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43:
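
A character-level tokenizer built this way only knows the characters seen in the corpus. The sketch below uses a toy corpus and hypothetical `toy_*` names (not part of the notebook) to show the round trip and what happens with an unseen character.

```python
# Toy corpus standing in for the Shakespeare text (an assumption for illustration).
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = {i: ch for ch, i in toy_char2idx.items()}


def toy_encode(s):
    # Map each character to its integer id.
    return [toy_char2idx[ch] for ch in s]


def toy_decode(ids):
    # Map integer ids back to characters and join them.
    return "".join(toy_idx2char[i] for i in ids)


round_trip = toy_decode(toy_encode("hello"))

# 'H' never appears in the toy corpus, so the dictionary lookup raises KeyError.
try:
    toy_encode("Hello")
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(round_trip, unknown_ok)
```

With the full 65-character vocabulary above, "Hello World!" encodes without error because every one of its characters occurs in the text.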

In [7]:
import torch
encoded_text = torch.tensor(encode(text))
print('Encoded Text shape:',encoded_text.shape, 'Encoded Text Dtype:', encoded_text.dtype)
encoded_text

Encoded Text shape: torch.Size([1115394]) Encoded Text Dtype: torch.int64


tensor([18, 47, 56,  ..., 45,  8,  0])

In [8]:
train_split_pct = 0.9
train_split_idx = int(len(encoded_text)*train_split_pct)
train_split_idx

1003854

In [9]:
train_data = encoded_text[:train_split_idx]
valid_data = encoded_text[train_split_idx:]
print('Train data length:',len(train_data),'Valid data length:',len(valid_data),
      'Train percentage:',len(train_data)/len(encoded_text))

Train data length: 1003854 Valid data length: 111540 Train percentage: 0.8999994620734916


In [17]:
context_length = 8

for i in range(context_length):

    x, y = train_data[:i+1], train_data[i+1]

    print(f"idx:{i}, x:{x}, y:{y} and decoded version of x : {decode(x.tolist())}, y : {decode(y[None].tolist())}")

idx:0, x:tensor([18]), y:47 and decoded version of x : F, y : i
idx:1, x:tensor([18, 47]), y:56 and decoded version of x : Fi, y : r
idx:2, x:tensor([18, 47, 56]), y:57 and decoded version of x : Fir, y : s
idx:3, x:tensor([18, 47, 56, 57]), y:58 and decoded version of x : Firs, y : t
idx:4, x:tensor([18, 47, 56, 57, 58]), y:1 and decoded version of x : First, y :  
idx:5, x:tensor([18, 47, 56, 57, 58,  1]), y:15 and decoded version of x : First , y : C
idx:6, x:tensor([18, 47, 56, 57, 58,  1, 15]), y:47 and decoded version of x : First C, y : i
idx:7, x:tensor([18, 47, 56, 57, 58,  1, 15, 47]), y:58 and decoded version of x : First Ci, y : t
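
The same pairs can be read off a single chunk of `context_length + 1` tokens: the inputs are the chunk without its last token and the targets are the chunk shifted by one. A minimal sketch with made-up token ids (the `tokens` tensor is an assumption, not notebook data):

```python
import torch

context_length = 8

# Nine made-up token ids: 100, 101, ..., 108.
tokens = torch.arange(100, 100 + context_length + 1)

x = tokens[:-1]  # inputs:  first 8 tokens
y = tokens[1:]   # targets: same tokens shifted left by one

# Position t pairs the context x[:t+1] with the target y[t].
pairs = [(x[: t + 1].tolist(), y[t].item()) for t in range(context_length)]

for context, target in pairs:
    print(context, "->", target)
```

One chunk therefore yields `context_length` training examples, with contexts of length 1 up to `context_length`.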


In [18]:
TORCH_SEED = 1337 # Manual torch seed for reproducible results
torch.manual_seed(TORCH_SEED) # Same seed as @karpathy's lecture, so results can be compared
context_length = 8 # Maximum number of tokens in each training sequence
batch_size = 4 # Number of sequences processed in parallel in each batch

## Data Loader

In [33]:
def get_batch(train_valid):

    data = train_data if train_valid == "train" else valid_data
    data_len = len(data)

    print(f"total data length : {data_len}")

    start_idx = torch.randint(high = data_len-context_length, size = (batch_size, ))

    print(f"start_idx: {start_idx}")

    x = torch.stack([data[i:i+context_length] for i in start_idx])
    y = torch.stack([data[i+1:i+context_length+1] for i in start_idx])

    return x, y

In [36]:
x, y = get_batch("train")

print(f"Shape of x: {x.shape}, y: {y.shape}")

print(f"x values: {x}, \ny values: {y}")


total data length : 1003854
start_idx: tensor([405712, 847796, 542635,  16085])
Shape of x: torch.Size([4, 8]), y: torch.Size([4, 8])
x values: tensor([[20, 53, 61,  1, 44, 53, 52, 42],
        [58, 43, 56, 51, 47, 52, 43, 57],
        [43,  1, 53, 44,  1, 58, 56, 59],
        [43, 52,  1, 44, 56, 53, 51,  1]]), 
y values: tensor([[53, 61,  1, 44, 53, 52, 42, 50],
        [43, 56, 51, 47, 52, 43, 57,  0],
        [ 1, 53, 44,  1, 58, 56, 59, 43],
        [52,  1, 44, 56, 53, 51,  1, 39]])
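
A quick way to check the batching logic is to run it on data where the shift is obvious. The sketch below assumes a hypothetical `sample_batch` helper that mirrors `get_batch` but takes the data as an argument, and uses `torch.arange(50)` as stand-in data, so every target should equal its input plus one.

```python
import torch


def sample_batch(data, context_length, batch_size, generator=None):
    # Hypothetical helper: same sampling as get_batch, with the data passed in explicitly.
    starts = torch.randint(0, len(data) - context_length, (batch_size,), generator=generator)
    x = torch.stack([data[i : i + context_length] for i in starts])
    y = torch.stack([data[i + 1 : i + context_length + 1] for i in starts])
    return x, y


toy_data = torch.arange(50)               # token at position i has value i
g = torch.Generator().manual_seed(0)      # local generator, leaves the global seed untouched
xb, yb = sample_batch(toy_data, context_length=8, batch_size=4, generator=g)

print(xb.shape, yb.shape)
print(torch.equal(yb, xb + 1))            # targets are the inputs shifted by one position
```

The upper bound `len(data) - context_length` keeps the slice `i + 1 : i + context_length + 1` inside the data for every sampled start index.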


In [39]:
for batch_id in range(batch_size):
    print(f"Batch Id {batch_id}")
    for sequence_idx in range(context_length):

        context = x[batch_id, :sequence_idx+1]
        target = y[batch_id, sequence_idx]

        print(f"Given context {context} the target is {target}")

Batch Id 0
Given context tensor([20]) the target is 53
Given context tensor([20, 53]) the target is 61
Given context tensor([20, 53, 61]) the target is 1
Given context tensor([20, 53, 61,  1]) the target is 44
Given context tensor([20, 53, 61,  1, 44]) the target is 53
Given context tensor([20, 53, 61,  1, 44, 53]) the target is 52
Given context tensor([20, 53, 61,  1, 44, 53, 52]) the target is 42
Given context tensor([20, 53, 61,  1, 44, 53, 52, 42]) the target is 50
Batch Id 1
Given context tensor([58]) the target is 43
Given context tensor([58, 43]) the target is 56
Given context tensor([58, 43, 56]) the target is 51
Given context tensor([58, 43, 56, 51]) the target is 47
Given context tensor([58, 43, 56, 51, 47]) the target is 52
Given context tensor([58, 43, 56, 51, 47, 52]) the target is 43
Given context tensor([58, 43, 56, 51, 47, 52, 43]) the target is 57
Given context tensor([58, 43, 56, 51, 47, 52, 43, 57]) the target is 0
Batch Id 2
Given context tensor([43]) the target is 

## Bigram Model

In [40]:
import torch.nn as nn
from torch.nn import functional as F

In [41]:
torch.manual_seed(TORCH_SEED)
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Each token reads the logits (unnormalized scores) for the next token directly from a lookup table
        self.token_embedding_table = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)

    def forward(self, idx, targets):
        #Both idx and targets are (B,T) Batch x Time array of integers
        logits = self.token_embedding_table(idx) #(B,T,C) Batch, Time, Channel
        return logits

In [61]:
bigram_model = BigramLanguageModel(vocab_size)

out = bigram_model(x, y)

print(f"shape => x: {x.shape}, y: {y.shape}, out:{out.shape}")

shape => x: torch.Size([4, 8]), y: torch.Size([4, 8]), out:torch.Size([4, 8, 65])
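
The `(4, 8, 65)` output is just a row lookup: each token id selects one row of the embedding weight matrix, and that row holds the 65 next-token logits. A small sketch with hand-picked indices (the `idx` values are arbitrary) makes this explicit:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([[18, 47, 56], [20, 53, 61]])  # (B=2, T=3) token ids
logits = table(idx)                               # (2, 3, 65)

# The embedding call is equivalent to indexing rows of the weight matrix.
same_as_indexing = torch.equal(logits, table.weight[idx])

# The logits at a position depend only on the token at that position.
row_matches = torch.equal(logits[0, 0], table.weight[18])

print(logits.shape, same_as_indexing, row_matches)
```

This is why the model is a bigram model: the prediction at each position ignores everything except the current token.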


### Add loss function

In [63]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Each token reads the logits (unnormalized scores) for the next token directly from a lookup table
        self.token_embedding_table = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)

    def forward(self, idx, targets):
        #Both idx and targets are (B,T) Batch x Time array of integers
        logits = self.token_embedding_table(idx) #(B,T,C) Batch, Time, Channel
        B, T, C = logits.shape
        logits_reshaped = logits.reshape(B*T, C)
        targets_reshaped = targets.reshape(B*T)
        loss = F.cross_entropy(input=logits_reshaped, target = targets_reshaped)

        return logits, loss

bigram_model = BigramLanguageModel(vocab_size=vocab_size)
out,loss = bigram_model(x, y)
print('Bigram Model Output Shapes out:',out.shape,'x:',x.shape,'y:',y.shape)
print('The calculated loss is:',loss)


Bigram Model Output Shapes out: torch.Size([4, 8, 65]) x: torch.Size([4, 8]) y: torch.Size([4, 8])
The calculated loss is: tensor(4.7035, grad_fn=<NllLossBackward0>)
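
As a sanity check, a model that assigns equal probability to all 65 characters has loss -ln(1/65) ≈ 4.174. The untrained model scores 4.7035 because `nn.Embedding` initializes its weights from a standard normal distribution, so its predictions are not uniform. The sketch below (random `targets`, all-zero logits; both are assumptions for illustration) reproduces the baseline and shows why the logits are reshaped: `F.cross_entropy` expects the class dimension second.

```python
import math

import torch
from torch.nn import functional as F

vocab_size = 65
B, T = 4, 8

# All-zero logits give a uniform distribution over the vocabulary.
uniform_logits = torch.zeros(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# Option used in the notebook: flatten batch and time into one dimension.
loss_flat = F.cross_entropy(uniform_logits.reshape(B * T, vocab_size), targets.reshape(B * T))

# Equivalent option: move the class dimension to position 1, giving (B, C, T).
loss_perm = F.cross_entropy(uniform_logits.permute(0, 2, 1), targets)

baseline = math.log(vocab_size)
print(loss_flat.item(), loss_perm.item(), baseline)
```

A training loss that falls below this baseline indicates the model has learned something about which characters follow which.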


### Generate Tokens

In [64]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Each token reads the logits (unnormalized scores) for the next token directly from a lookup table
        self.token_embedding_table = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.vocab_size)

    def forward(self, idx, targets=None):
        # Both idx and targets are (B,T) Batch x Time arrays of integers
        logits = self.token_embedding_table(idx) #(B,T,C) Batch, Time, Channel
        B, T, C = logits.shape
        # `if targets:` raises on a multi-element tensor, so compare against None explicitly
        if targets is not None:
            logits_reshaped = logits.reshape(B*T, C)
            targets_reshaped = targets.reshape(B*T)
            loss = F.cross_entropy(input=logits_reshaped, target = targets_reshaped)
        else:
            loss = None

        return logits, loss

    def generate_tokens(self, idx, max_new_tokens):
        # idx is a (B,T) tensor of token ids. The loop body is one standard way to
        # complete the unfinished cell (sample from the softmax of the last time
        # step); it is an assumption, not necessarily the author's final version.
        for _ in range(max_new_tokens):
            logits, _loss = self(idx)                           # (B,T,C)
            probs = F.softmax(logits[:, -1, :], dim=-1)         # (B,C) distribution for the last step
            idx_next = torch.multinomial(probs, num_samples=1)  # (B,1) sampled next token
            idx = torch.cat((idx, idx_next), dim=1)             # (B,T+1)
        return idx


bigram_model = BigramLanguageModel(vocab_size=vocab_size)
out,loss = bigram_model(x, y)
print('Bigram Model Output Shapes out:',out.shape,'x:',x.shape,'y:',y.shape)
print('The calculated loss is:',loss)
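
Making `targets` optional lets the same `forward` serve both training and generation, but the presence check needs care. A minimal sketch, using a made-up `targets` tensor and a hypothetical `has_targets` helper, of why a plain truth test on a tensor is not a safe way to detect a missing argument:

```python
import torch

targets = torch.tensor([[1, 2], [3, 4]])

# Truth-testing a tensor with more than one element is ambiguous and raises RuntimeError.
try:
    if targets:
        pass
    ambiguous = False
except RuntimeError:
    ambiguous = True


def has_targets(targets=None):
    # Comparing against None works for any tensor shape, including all-zero tensors.
    return targets is not None


print(ambiguous, has_targets(targets), has_targets())
```

Checking `targets is not None` also avoids treating a valid single-element target of value 0 as missing.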