# CS336 Assignments

| # | Topic                         | Description                                 |
|---|-------------------------------|---------------------------------------------|
| 1 | Basics                        | Train an LLM from scratch                   |
| 2 | Systems                       | Make it run fast!                           |
| 3 | Scaling                       | Make it performant at a FLOP budget         |
| 4 | Data                          | Prepare the right datasets                  |
| 5 | Alignment & Reasoning RL      | Align it to real-world use cases            |

# Assignment #1
- Implement all of the components (tokenizer, model, loss function, optimizer) necessary to train a standard Transformer language model
- Train a minimal language model

In [1]:
import warnings
warnings.filterwarnings("ignore")

from datasets import load_dataset
import tiktoken
import torch
import lovely_tensors as lt
lt.monkey_patch()


In [2]:
tinystories = load_dataset("roneneldan/TinyStories")
tinystories

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

In [3]:
tinystories['train'][0:10]

{'text': ['One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.',
  'Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.\n\nOne day, Beep was driving in the park when he saw a big tree. The tree had many leav

# BPE Tokenizer

In [4]:
sample_text = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "Python is a popular programming language.",
    "Machine learning enables computers to learn from data.",
    "Natural language processing helps computers understand text.",
    "Deep learning models require large amounts of data.",
    "Neural networks are inspired by the human brain.",
    "Data science combines statistics and computer science.",
    "Transformers have revolutionized language modeling.",
    "Open source software encourages collaboration."
]

In [5]:
import tiktoken

ttok = tiktoken.get_encoding("gpt2")
ttok

<Encoding 'gpt2'>

In [6]:
sentence = "Hey! I am doing well.. I hope you're well too! "
print(f"original  : {sentence}")

sentence_encoded = ttok.encode(sentence)
print(f"encoded   : {sentence_encoded}")

sentence_decoded = ttok.decode(sentence_encoded)
print(f"decoded   : {sentence_decoded}")


original  : Hey! I am doing well.. I hope you're well too! 
encoded   : [10814, 0, 314, 716, 1804, 880, 492, 314, 2911, 345, 821, 880, 1165, 0, 220]
decoded   : Hey! I am doing well.. I hope you're well too! 


To code up a transformer model, let's take it step by step.
1. Code a simplified self-attention model without trainable weights
2. Code a simplified self-attention model with trainable weights
3. Modify #2 with causal attention
4. Extend #3 to multiple heads

## Code a simplified self-attention model without trainable weights

In [7]:
sample_sentence = "Your journey starts with one step"
sample_sentence

sample_sentence_tokens = ttok.encode(sample_sentence)
sample_sentence_tokens

[7120, 7002, 4940, 351, 530, 2239]

For illustration purposes, let's create a small 3-dimensional embedding vector for each of these six tokens.

In [8]:
emb_matrix = torch.tensor(
    [
        [0.43, 0.15, 0.89],
        [0.55, 0.87, 0.66],
        [0.57, 0.85, 0.64],
        [0.22, 0.58, 0.33],
        [0.77, 0.25, 0.10],
        [0.05, 0.80, 0.55]     
     ]
)
emb_matrix.v

tensor[6, 3] n=18 x∈[0.050, 0.890] μ=0.514 σ=0.276
tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])

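To calculate self-attention, we compute how similar each token in a sentence is to every token in the same sentence, including itself. Concretely, we take the dot product of each embedding vector with every embedding vector in the sentence. The dot product is an unnormalized similarity score; it equals cosine similarity only when the vectors have unit length.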
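As a small sanity check on the difference between the two similarity measures, the sketch below compares the dot product and the cosine similarity for the second and third embeddings. The variable names `a`, `b`, `dot`, `cos`, and `manual_cos` are illustrative and not part of the original notebook.

```python
import torch
import torch.nn.functional as F

# Second and third rows of the illustrative embedding matrix
a = torch.tensor([0.55, 0.87, 0.66])
b = torch.tensor([0.57, 0.85, 0.64])

dot = torch.dot(a, b)                      # unnormalized similarity
cos = F.cosine_similarity(a, b, dim=0)     # similarity of the unit-length versions
manual_cos = dot / (a.norm() * b.norm())   # the same quantity computed by hand

print(f"dot product       : {dot.item():.4f}")
print(f"cosine similarity : {cos.item():.4f}")
```

The dot product grows with the length of the vectors, while the cosine similarity is bounded by 1, so the two only agree for unit-length vectors.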

### Computing context vectors for one input token

In [9]:
# Here, I am computing the attention between the second token (emb_matrix[1]) with all other input tokens
attn_scores = torch.zeros(len(sample_sentence_tokens))
for t, inp in enumerate(emb_matrix):
    attn_scores[t] = torch.dot(emb_matrix[1], inp)

attn_scores

tensor[6] x∈[0.707, 1.495] μ=1.094 σ=0.328 [0.954, 1.495, 1.475, 0.843, 0.707, 1.087]

We then normalize the attention scores so that they sum to 1 and can be interpreted as weights.

In [10]:
# Here, I am computing the attention between the second token (emb_matrix[1]) with all other input tokens
attn_scores = torch.zeros(len(sample_sentence_tokens))
for t, inp in enumerate(emb_matrix):
    attn_scores[t] = torch.dot(emb_matrix[1], inp)

attn_scores = attn_scores / attn_scores.sum()
attn_scores

tensor[6] x∈[0.108, 0.228] μ=0.167 σ=0.050 [0.145, 0.228, 0.225, 0.129, 0.108, 0.166]

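In practice, softmax normalization is preferred: it maps any real-valued scores, including negative ones, to positive weights that sum to 1. Note that the naive implementation here can still overflow for very large scores; `torch.softmax` handles that case.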

In [11]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)
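The naive version works for the small scores in this notebook, but `torch.exp` overflows for large inputs. A common fix, sketched here under the assumption that we only need a drop-in replacement, is to subtract the maximum before exponentiating. The function name `softmax_stable` and the test vector `big` are hypothetical and do not appear in the original notebook.

```python
import torch

def softmax_stable(x, dim=-1):
    # Subtracting the max does not change the result mathematically,
    # but keeps every exponent <= 0 so exp() cannot overflow.
    shifted = x - x.max(dim=dim, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=dim, keepdim=True)

big = torch.tensor([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf in float32, so inf / inf gives nan
naive = torch.exp(big) / torch.exp(big).sum(dim=0)
stable = softmax_stable(big, dim=0)

print("naive :", naive)
print("stable:", stable)
```

The stable variant should agree with `torch.softmax` on the same input.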

In [12]:
# Here, I am computing the attention between the second token (emb_matrix[1]) with all other input tokens
attn_scores = torch.zeros(len(sample_sentence_tokens))
for t, inp in enumerate(emb_matrix):
    attn_scores[t] = torch.dot(emb_matrix[1], inp)

attn_weights = softmax_naive(attn_scores)
attn_weights

tensor[6] x∈[0.108, 0.238] μ=0.167 σ=0.056 [0.139, 0.238, 0.233, 0.124, 0.108, 0.158]

PyTorch's built-in `torch.softmax` function gives the same result.

In [13]:
# Here, I am computing the attention between the second token (emb_matrix[1]) with all other input tokens
attn_scores = torch.zeros(len(sample_sentence_tokens))
for t, inp in enumerate(emb_matrix):
    attn_scores[t] = torch.dot(emb_matrix[1], inp)

attn_weights = torch.softmax(attn_scores, dim=0)
attn_weights

tensor[6] x∈[0.108, 0.238] μ=0.167 σ=0.056 [0.139, 0.238, 0.233, 0.124, 0.108, 0.158]

In [14]:
# Here, I am computing the attention between the third token (emb_matrix[2]) and all input tokens
attn_scores = torch.zeros(len(sample_sentence_tokens))
for t, inp in enumerate(emb_matrix):
    attn_scores[t] = torch.dot(emb_matrix[2], inp)

attn_weights = softmax_naive(attn_scores)
attn_weights

tensor[6] x∈[0.111, 0.237] μ=0.167 σ=0.055 [0.139, 0.237, 0.233, 0.124, 0.111, 0.156]

We can obtain the context vector by multiplying each embedding by its attention weight and summing the results.

In [15]:
context_vec = torch.zeros(emb_matrix.shape[1])
for t, inp in enumerate(emb_matrix):
    context_vec += attn_weights[t] * inp

context_vec

tensor[3] x∈[0.443, 0.650] μ=0.553 σ=0.104 [0.443, 0.650, 0.567]

We have now found the context vector for the third token (`emb_matrix[2]`) attending to all the tokens in the sentence. Next, we generalize this process to compute N context vectors for the N tokens in a sentence.

### Computing context vectors for all input tokens

In [28]:
%%time

attn_scores = torch.zeros(len(emb_matrix), len(emb_matrix))
for i, m1 in enumerate(emb_matrix):
    for j, m2 in enumerate(emb_matrix):
        attn_scores[i, j] = torch.dot(m1, m2)

attn_scores.v

CPU times: user 1.37 ms, sys: 1.31 ms, total: 2.68 ms
Wall time: 1.55 ms


tensor[6, 6] n=36 x∈[0.294, 1.495] μ=0.806 σ=0.333
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

The same operation can be done using matrix math.

In [31]:
%%time

attn_scores = (emb_matrix @ emb_matrix.T)
attn_scores.v

CPU times: user 340 μs, sys: 260 μs, total: 600 μs
Wall time: 371 μs


tensor[6, 6] n=36 x∈[0.294, 1.495] μ=0.806 σ=0.333
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

By replacing the for loops with a matrix multiplication, we got roughly a 4x speedup even on this small matrix (a single `%%time` run is noisy, so treat the number as indicative) :)
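A single timed run can vary a lot between executions. As a rough sketch, the comparison can be repeated with the standard-library `timeit` module to average over many calls. The helper names `scores_loop` and `scores_matmul`, the random stand-in matrix `emb`, and the repeat count `n` are assumptions for illustration only.

```python
import timeit
import torch

torch.manual_seed(0)
emb = torch.rand(6, 3)  # stand-in for emb_matrix with the same shape

def scores_loop(m):
    # Pairwise dot products with explicit Python loops
    out = torch.zeros(len(m), len(m))
    for i, a in enumerate(m):
        for j, b in enumerate(m):
            out[i, j] = torch.dot(a, b)
    return out

def scores_matmul(m):
    # The same pairwise dot products as one matrix multiplication
    return m @ m.T

n = 200  # number of repetitions to average over
t_loop = timeit.timeit(lambda: scores_loop(emb), number=n) / n
t_mat = timeit.timeit(lambda: scores_matmul(emb), number=n) / n

print(f"loop  : {t_loop * 1e6:.1f} us per call")
print(f"matmul: {t_mat * 1e6:.1f} us per call")
```

The exact ratio depends on the machine and the PyTorch build, so no specific speedup is claimed here.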

In [71]:
def softmax_naive_2d(x):
    return torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)
attn_weights = softmax_naive_2d(attn_scores)
attn_weights.v

tensor[6, 6] n=36 x∈[0.099, 0.238] μ=0.167 σ=0.042
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

In [72]:
# sanity check
attn_weights[1].sum()

tensor 1.000

In [77]:
context_vec = attn_weights @ emb_matrix
context_vec.v

tensor[6, 3] n=18 x∈[0.418, 0.651] μ=0.542 σ=0.082
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

These are the context vectors for all tokens: row i is the weighted sum of all input embeddings, using the softmax-normalized attention weights of token i.

Now that we have seen how to compute a simplified self-attention, let's move to the next section, where we compute self-attention with trainable weights learned from data.

## Code a simplified self-attention model with trainable weights

This form of self-attention is also called scaled dot-product attention.

Instead of computing `attn_scores` directly from the embedding matrix, we first project each input vector with weight matrices to obtain queries, keys, and values. This lets the model learn projections that produce "good context vectors".

These weights take the form of matrices we learn from the training data.
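In symbols, the standard formulation of scaled dot-product attention can be written as follows. Here $X$ stands for the embedding matrix (`emb_matrix`), $W_Q$, $W_K$, $W_V$ for the three weight matrices, and $d_k$ for the key dimension; the symbol names are a notational assumption, not taken from the notebook.

$$
Q = X W_Q,\quad K = X W_K,\quad V = X W_V,\qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

The softmax is applied row by row, so each row of the resulting weight matrix sums to 1.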

In [83]:
emb_matrix.v

tensor[6, 3] n=18 x∈[0.050, 0.890] μ=0.514 σ=0.276
tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])

In [85]:
x_2 = emb_matrix[1]
x_2.v

tensor[3] x∈[0.550, 0.870] μ=0.693 σ=0.163 [0.550, 0.870, 0.660]
tensor([0.5500, 0.8700, 0.6600])

In [86]:
d_in = emb_matrix.shape[1]
d_out = 2 # output dimension

In [87]:
torch.manual_seed(42) # for reproducibility
W_query = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

In [89]:
W_query.v

Parameter[3, 2] n=6 x∈[-1.123, 0.337] μ=-0.063 σ=0.549 [[0.337, 0.129], [0.234, 0.230], [-1.123, -0.186]]
tensor([[ 0.3367,  0.1288],
        [ 0.2345,  0.2303],
        [-1.1229, -0.1863]])

In [93]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

query_2.v

tensor[2] μ=-0.102 σ=0.354 [-0.352, 0.148]
tensor([-0.3519,  0.1483])

In [98]:
query_2.v, key_2.v, value_2.v

(tensor[2] μ=-0.102 σ=0.354 [-0.352, 0.148]
 tensor([-0.3519,  0.1483]),
 tensor[2] μ=1.193 σ=1.098 [1.969, 0.416]
 tensor([1.9692, 0.4159]),
 tensor[2] μ=0.533 σ=0.127 [0.623, 0.443]
 tensor([0.6229, 0.4434]))

In [101]:
queries = emb_matrix @ W_query
keys = emb_matrix @ W_key
values = emb_matrix @ W_value

queries.v

tensor[6, 2] n=12 x∈[-0.819, 0.206] μ=-0.110 σ=0.314
tensor([[-0.8194, -0.0759],
        [-0.3519,  0.1483],
        [-0.3274,  0.1500],
        [-0.1605,  0.1004],
        [ 0.2056,  0.1381],
        [-0.4132,  0.0882]])

In [102]:
attn_scores = queries @ keys.T
attn_scores.v

tensor[6, 6] n=36 x∈[-1.662, 0.463] μ=-0.440 σ=0.534
tensor([[-1.2618, -1.6451, -1.6624, -0.7835, -1.5056, -0.6818],
        [-0.4540, -0.6313, -0.6450, -0.2855, -0.7087, -0.1794],
        [-0.4166, -0.5824, -0.5955, -0.2623, -0.6635, -0.1594],
        [-0.1911, -0.2742, -0.2816, -0.1210, -0.3345, -0.0612],
        [ 0.3745,  0.4623,  0.4625,  0.2301,  0.3368,  0.2457],
        [-0.5747, -0.7769, -0.7900, -0.3594, -0.8026, -0.2644]])

In [120]:
# For attention weights, we first scale the scores by dividing by sqrt(d_k),
# the square root of the key dimension, and then apply softmax.
# This scaling is why the op is called scaled dot-product attention.

attn_weights = softmax_naive_2d(attn_scores / (keys.shape[1] ** 0.5))
attn_weights.v

tensor[6, 6] n=36 x∈[0.120, 0.240] μ=0.167 σ=0.027
tensor([[0.1596, 0.1217, 0.1202, 0.2238, 0.1343, 0.2405],
        [0.1686, 0.1487, 0.1473, 0.1899, 0.1408, 0.2047],
        [0.1688, 0.1501, 0.1487, 0.1882, 0.1417, 0.2024],
        [0.1686, 0.1590, 0.1581, 0.1772, 0.1523, 0.1848],
        [0.1690, 0.1798, 0.1798, 0.1526, 0.1645, 0.1543],
        [0.1670, 0.1448, 0.1435, 0.1945, 0.1422, 0.2080]])
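One way to see why the scores are divided by the square root of the key dimension: for random vectors with unit-variance entries, the spread of their dot products grows with the dimension, and dividing by `sqrt(d_k)` brings it back to roughly 1. The sketch below is an empirical illustration with random data; the names `stds`, `q`, `k` and the sample size are assumptions, not part of the notebook.

```python
import torch

torch.manual_seed(0)
stds = {}  # d_k -> (std of raw dot products, std of scaled dot products)

for d_k in (2, 64, 512):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)      # one dot product per row
    scaled = raw / d_k ** 0.5      # same scaling as in the attention weights
    stds[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:4d}  std(raw)={stds[d_k][0]:6.2f}  std(scaled)={stds[d_k][1]:.2f}")
```

Without the scaling, large key dimensions would push the softmax toward near one-hot outputs, which tends to produce very small gradients during training.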

In [112]:
context_vec = attn_weights @ values
context_vec.v

tensor[6, 2] n=12 x∈[0.254, 0.618] μ=0.441 σ=0.136
tensor([[0.5141, 0.3639],
        [0.5633, 0.3251],
        [0.5659, 0.3221],
        [0.5839, 0.2941],
        [0.6180, 0.2539],
        [0.5575, 0.3262]])

Turning all of this into a single `SelfAttention` Python class:

In [117]:
emb_matrix.v

tensor[6, 3] n=18 x∈[0.050, 0.890] μ=0.514 σ=0.276
tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])

In [130]:
class SelfAttention(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
        self.W_key   = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
        self.W_value = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

    def forward(self, x):
        # x shape: (seq_len, d_in)
        queries = x @ self.W_query # queries shape: (seq_len, d_out)
        keys = x @ self.W_key # keys shape: (seq_len, d_out)
        values = x @ self.W_value # values shape: (seq_len, d_out)

        attn_scores = queries @ keys.T # attn_scores shape: (seq_len, seq_len)
        attn_weights = softmax_naive_2d(attn_scores / keys.shape[1] ** 0.5) # attn_weights shape: (seq_len, seq_len)
        context_vec = attn_weights @ values # context_vec shape: (seq_len, d_out)
        return context_vec

torch.manual_seed(42)
model = SelfAttention(3, 2)
context_vec = model(emb_matrix)
context_vec.v

tensor[6, 2] n=12 x∈[0.254, 0.618] μ=0.441 σ=0.136
tensor([[0.5141, 0.3639],
        [0.5633, 0.3251],
        [0.5659, 0.3221],
        [0.5839, 0.2941],
        [0.6180, 0.2539],
        [0.5575, 0.3262]])

### Pytorchification of SelfAttention class

In [138]:
class SelfAttentionV2(torch.nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        # x shape: (seq_len, emb_sz)
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        # Scale by the key dimension instead of relying on a global `d_out`
        attn_weights = softmax_naive_2d(attn_scores / keys.shape[-1] ** 0.5)
        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(42)
model = SelfAttentionV2(3, 2)
context_vec = model(emb_matrix)
context_vec.v

tensor[6, 2] n=12 x∈[0.275, 0.377] μ=0.328 σ=0.050 grad MmBackward0
tensor([[0.3755, 0.2777],
        [0.3761, 0.2831],
        [0.3761, 0.2833],
        [0.3768, 0.2763],
        [0.3754, 0.2836],
        [0.3772, 0.2746]], grad_fn=<MmBackward0>)
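The classes above assume a single sequence of shape `(seq_len, d_in)`. As a sketch of how the same computation could handle a batch of sequences, the variant below transposes only the last two dimensions of the keys and uses `torch.softmax` over the last dimension. The class name `BatchedSelfAttention` and the random `batch` tensor are hypothetical and not part of the original notebook.

```python
import torch

class BatchedSelfAttention(torch.nn.Module):  # hypothetical name
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        # x shape: (..., seq_len, d_in); leading batch dims are optional
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        # Swap only the last two dims so any batch dims stay in place
        attn_scores = queries @ keys.transpose(-2, -1)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values

torch.manual_seed(42)
model = BatchedSelfAttention(3, 2)
batch = torch.rand(4, 6, 3)   # 4 sentences, 6 tokens each, 3-dim embeddings
out = model(batch)
print(out.shape)              # torch.Size([4, 6, 2])
```

Each sentence in the batch is processed independently, so `out[0]` should match the result of calling the model on `batch[0]` alone.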