# CS336 Assignments

| # | Topic                         | Description                                 |
|---|-------------------------------|---------------------------------------------|
| 1 | Basics                        | Train an LLM from scratch                   |
| 2 | Systems                       | Make it run fast!                           |
| 3 | Scaling                       | Make it performant at a FLOP budget         |
| 4 | Data                          | Prepare the right datasets                  |
| 5 | Alignment & Reasoning RL      | Align it to real-world use cases            |

### Assignment #1
- Implement all of the components (tokenizer, model, loss function, optimizer) necessary to train a standard Transformer language model
- Train a minimal language model

In [1]:
import warnings
warnings.filterwarnings("ignore")

import torch
import lovely_tensors as lt
lt.monkey_patch()

import tiktoken

from datasets import load_dataset
import joblib

from torch.utils.data import TensorDataset, DataLoader


In [2]:
from string import ascii_letters

{l: ord(l) for l in ascii_letters}

{'a': 97,
 'b': 98,
 'c': 99,
 'd': 100,
 'e': 101,
 'f': 102,
 'g': 103,
 'h': 104,
 'i': 105,
 'j': 106,
 'k': 107,
 'l': 108,
 'm': 109,
 'n': 110,
 'o': 111,
 'p': 112,
 'q': 113,
 'r': 114,
 's': 115,
 't': 116,
 'u': 117,
 'v': 118,
 'w': 119,
 'x': 120,
 'y': 121,
 'z': 122,
 'A': 65,
 'B': 66,
 'C': 67,
 'D': 68,
 'E': 69,
 'F': 70,
 'G': 71,
 'H': 72,
 'I': 73,
 'J': 74,
 'K': 75,
 'L': 76,
 'M': 77,
 'N': 78,
 'O': 79,
 'P': 80,
 'Q': 81,
 'R': 82,
 'S': 83,
 'T': 84,
 'U': 85,
 'V': 86,
 'W': 87,
 'X': 88,
 'Y': 89,
 'Z': 90}

In [3]:
chr(115)

's'

## Exercise 1: Problem (unicode1): Understanding Unicode (1 point)


In [4]:
chr(0)

'\x00'

`chr(0)` returns the null character (NUL, U+0000), a control character that C-style strings use as a terminator. `'\x00'` is the escape sequence Python uses to write it.

In [5]:
repr('\x00')

"'\\x00'"

The `repr` of this character is the escape sequence `'\x00'`. Printing it shows nothing visible, because NUL is a control character without a glyph.

In [6]:
chr(0)

'\x00'

In [7]:
print(chr(0))

 


In [8]:
"this is a test" + chr(0) + "string"

'this is a test\x00string'

In [9]:
print("this is a test" + chr(0) + "string")

this is a test string


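`print` writes the NUL character to stdout as-is; the character is not executed. Most terminals and notebook front ends render it as nothing or as a blank, which is why the printed output looks like `this is a test string`.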
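As a quick check, here is a minimal sketch showing that NUL is an ordinary one-character string as far as Python is concerned; only its display differs. The variable names `nul` and `s` are illustrative.

```python
# chr(0) is a real character: it has length 1 and code point 0.
nul = chr(0)
s = "this is a test" + nul + "string"

print(len(nul), ord(nul))      # 1 0
print(len(s))                  # 14 + 1 + 6 = 21
print(s.encode("utf-8"))       # the NUL survives as the single byte \x00
print(nul in s, s.index(nul))  # True 14
```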

## Exercise 2: Problem (unicode2): Unicode Encodings (3 points)

In [10]:
test_string = "Hello"
test_string_encoded = test_string.encode("UTF-8")
print(f"test_string_encoded: {test_string_encoded}")
print(f"Byte values: {list(test_string_encoded)}")
print(f"test string decoded: {test_string_encoded.decode('UTF-8')}")

test_string_encoded: b'Hello'
Byte values: [72, 101, 108, 108, 111]
test string decoded: Hello


(a) What are some reasons to prefer training our tokenizer on UTF-8 encoded bytes, rather than UTF-16 or UTF-32? It may be helpful to compare the output of these encodings for various input strings.

A: Most text on the web is encoded as UTF-8. UTF-8 is also compact for ASCII-heavy text: the 5 characters of "Hello" take 5 bytes in UTF-8, but 10 bytes in UTF-16 and 20 bytes in UTF-32 (plus a byte-order mark if one is written). The wider encodings pad ASCII characters with zero bytes, which lengthens the byte sequences the tokenizer has to process.
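The size comparison can be checked directly. This is a minimal sketch; the sample strings are illustrative, and the `-le` codec variants are used so that no byte-order mark is counted.

```python
# Compare encoded sizes for a few sample strings (samples are illustrative).
samples = ["Hello", "café", "こんにちは"]

for text in samples:
    sizes = {
        # The -le variants skip the byte-order mark that plain
        # "utf-16" / "utf-32" prepend in Python.
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(f"{text!r}: {sizes}")

# ASCII text is padded with zero bytes in UTF-16.
print(list("Hi".encode("utf-16-le")))  # [72, 0, 105, 0]
```

UTF-8 is not always the smallest: the Japanese sample takes 15 bytes in UTF-8 and 10 in UTF-16. For ASCII-dominated corpora, though, UTF-8 produces the shortest sequences.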

In [11]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

decode_utf8_bytes_to_str_wrong("hello".encode("utf-8"))

'hello'

In [12]:
decode_utf8_bytes_to_str_wrong("café".encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data

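The function decodes each byte on its own, which only works when every character is a single byte (ASCII). In UTF-8, `é` is encoded as the two bytes `0xC3 0xA9`, so decoding `0xC3` alone fails with `unexpected end of data`.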

In [13]:
"café".encode("UTF-8")

b'caf\xc3\xa9'
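Two ways to decode correctly, as a sketch: decode the whole byte string at once, or use an incremental decoder when the bytes arrive in pieces. The function names below are hypothetical and not part of the assignment.

```python
import codecs

def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the whole sequence at once so multi-byte characters stay intact.
    return bytestring.decode("utf-8")

def decode_utf8_stream(chunks) -> str:
    # An incremental decoder buffers incomplete multi-byte sequences
    # across chunk boundaries instead of raising.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

data = "café".encode("utf-8")  # b'caf\xc3\xa9'
print(decode_utf8_bytes_to_str(data))                   # café
print(decode_utf8_stream([bytes([b]) for b in data]))   # café
```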

## Exercise 3: Problem (train_bpe): BPE Tokenizer Training (15 points)

In [14]:
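def train_bpe(input_path, vocab_size=1000, special_tokens=None):
    # Placeholder: not implemented yet.
    # special_tokens defaults to None rather than [] to avoid a shared mutable default.
    vocab, merges = None, None

    return vocab, merges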
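The `train_bpe` cell is still a stub. As a rough illustration of what the training loop does, here is a simplified sketch. It makes several assumptions that differ from a full solution: it takes an in-memory string instead of `input_path`, uses a plain `re` pattern instead of a GPT-2 style pre-tokenizer, does not split the text on special tokens, and recounts all pairs on every iteration. The name `train_bpe_sketch` is hypothetical.

```python
import re
from collections import Counter

def train_bpe_sketch(text, vocab_size, special_tokens=None):
    """Simplified byte-level BPE trainer (illustrative, not optimized)."""
    special_tokens = special_tokens or []
    # Vocabulary: special tokens first, then all 256 single bytes.
    vocab = {i: tok.encode("utf-8") for i, tok in enumerate(special_tokens)}
    for b in range(256):
        vocab[len(vocab)] = bytes([b])

    # Assumed pre-tokenization: non-space runs with an optional leading space.
    words = Counter(
        tuple(bytes([b]) for b in m.group().encode("utf-8"))
        for m in re.finditer(r" ?\S+|\s+", text)
    )

    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the lexicographically greater pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab[len(vocab)] = best[0] + best[1]

        # Rewrite every word with the chosen pair merged.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges

corpus = "low low low low low lower lower widest widest widest " + "newest " * 5 + "newest"
vocab, merges = train_bpe_sketch(corpus, vocab_size=259, special_tokens=["<|endoftext|>"])
print(merges)  # [(b's', b't'), (b'e', b'st')]
```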



## Exercise 4: Problem (train_bpe_tinystories): BPE Training on TinyStories (2 points)

In [15]:
def train_bpe_tinystories(input_path, vocab_size=10000, special_tokens=["<|endoftext|>"]):
    vocab, merges = None, None

    return vocab, merges

## Exercise 5: Problem (train_bpe_expts_owt): BPE Training on OpenWebText (2 points)

In [16]:
def train_bpe_expts_owt(input_path, vocab_size=32000, special_tokens=["<|endoftext|>"]):
    vocab, merges = None, None

    return vocab, merges

## Exercise 6: Problem (tokenizer): Implementing the tokenizer (15 points)

In [18]:
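class BPETokenizer:
    def __init__(self, vocab, merges, special_tokens=None):
        pass

    def encode(self, text: str) -> list[int]:
        pass

    def decode(self, ids: list[int]) -> str:
        pass

    @classmethod
    def from_files(cls, vocab_filepath, merges_filepath, special_tokens=None):
        # Parameter names are assumed here; adjust them to the serialization format you use.
        pass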
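The tokenizer class is still a skeleton. To make the intended behaviour of `encode` and `decode` concrete, here is a minimal sketch written as free functions. It assumes a `vocab` mapping of token ID to bytes and a `merges` list in training order, and it skips pre-tokenization and special-token handling, so it is not a drop-in solution. The function names and the toy vocabulary are hypothetical.

```python
def bpe_encode_sketch(text, vocab, merges):
    """Apply merges in training order to the UTF-8 bytes of `text`."""
    token_to_id = {tok: i for i, tok in vocab.items()}
    parts = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        out, i = [], 0
        while i < len(parts):
            if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(parts[i])
                i += 1
        parts = out
    return [token_to_id[p] for p in parts]

def bpe_decode_sketch(ids, vocab):
    # Join the bytes first and decode once, so multi-byte characters that
    # span several tokens are reassembled before decoding.
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Toy vocabulary: 256 single bytes plus two merged tokens.
toy_vocab = {i: bytes([i]) for i in range(256)}
toy_vocab[256] = b"st"
toy_vocab[257] = b"est"
toy_merges = [(b"s", b"t"), (b"e", b"st")]

ids = bpe_encode_sketch("newest", toy_vocab, toy_merges)
print(ids)                                # [110, 101, 119, 257]
print(bpe_decode_sketch(ids, toy_vocab))  # newest
```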

## Exercise 7: Problem (tokenizer_experiments): Experiments with tokenizers (4 points)

a. Sample 10 documents from TinyStories and OpenWebText. Using your previously-trained TinyStories and OpenWebText tokenizers (10K and 32K vocabulary size, respectively), encode these sampled documents into integer IDs. What is each tokenizer’s compression ratio (bytes/token)?


b. What happens if you tokenize your OpenWebText sample with the TinyStories tokenizer? Compare the compression ratio and/or qualitatively describe what happens.
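The compression ratio in both parts is total UTF-8 bytes divided by total tokens. Here is a sketch of that measurement, assuming `encode` is any callable that maps a string to a list of token IDs. The two stand-in tokenizers and the sample documents are illustrative only; they are not the trained BPE tokenizers or real samples.

```python
def compression_ratio(documents, encode):
    """Bytes per token over a collection of documents."""
    total_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    total_tokens = sum(len(encode(doc)) for doc in documents)
    return total_bytes / total_tokens

# Stand-in tokenizers for illustration only.
def byte_level(s):
    return list(s.encode("utf-8"))

def whitespace(s):
    return s.split()

docs = ["Once upon a time", "there was a little cat."]
print(compression_ratio(docs, byte_level))  # 1.0 (one token per byte)
print(compression_ratio(docs, whitespace))  # higher: each token covers several bytes
```

A tokenizer applied to text unlike its training data tends to fall back to shorter tokens, which shows up as a lower bytes/token value.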

## Exercise 8: Problem (linear): Implementing the linear module (1 point)

In [19]:
import torch
from math import sqrt
from einops import einsum

class MyLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, device=None, dtype=None):
        super().__init__()
        self.device = device
        self.dtype = dtype
        weight = torch.empty(in_features, out_features, device=device, dtype=dtype)
        sigma = sqrt(2 / (in_features + out_features))
        torch.nn.init.trunc_normal_(weight, mean=0, std=sigma, a=-3 * sigma, b=3 * sigma)

        self.W = torch.nn.Parameter(weight)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # PyTorch way
        # output = x @ self.W 

        # einsum way
        output = einsum(x, self.W, "... j, j k-> ... k")

        return output
    
torch.manual_seed(42)
model = MyLinear(5, 2)

batch = torch.randn(3, 5)

output = model(batch)
output.v

tensor[3, 2] n=6 x∈[-1.725, 0.063] μ=-0.740 σ=0.702 grad ViewBackward0 [[-1.281, -1.725], [-0.531, -0.048], [0.063, -0.920]]
tensor([[-1.2815, -1.7247],
        [-0.5314, -0.0480],
        [ 0.0631, -0.9196]], grad_fn=<ViewBackward0>)

## Exercise 9: Problem (embedding): Implement the embedding module (1 point)

In [20]:
import torch
from math import sqrt
from einops import einsum

class MyEmbedding(torch.nn.Module):
    def __init__(self, num_embeddings, embedding_dim, device=None, dtype=None):
        super().__init__()
        self.device = device
        self.dtype = dtype

        weight = torch.empty(num_embeddings, embedding_dim, device=device, dtype=dtype)
        torch.nn.init.trunc_normal_(weight, mean=0, std=1, a=-3, b=3)

        self.W = torch.nn.Parameter(weight)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # PyTorch way
        output = self.W[x]

        return output
    
torch.manual_seed(42)
model = MyEmbedding(10, 5)

batch = torch.randint(0, 10, (4, ))

print(batch.v)
output = model(batch)
output.v

tensor[4] i64 x∈[0, 9] μ=4.500 σ=5.196 [0, 9, 0, 9]
tensor([0, 9, 0, 9])


tensor[4, 5] n=20 x∈[-0.856, 1.729] μ=0.398 σ=0.862 grad IndexBackward0
tensor([[ 1.1812,  1.3651, -0.2971,  1.7287, -0.2774],
        [-0.5769,  0.7990,  0.2255,  0.6847, -0.8557],
        [ 1.1812,  1.3651, -0.2971,  1.7287, -0.2774],
        [-0.5769,  0.7990,  0.2255,  0.6847, -0.8557]],
       grad_fn=<IndexBackward0>)

## Exercise 10: Problem (rmsnorm): Root Mean Square Layer Normalization (1 point)

In [21]:
import torch
from math import sqrt
from einops import einsum

class MyRMSNorm(torch.nn.Module):
    def __init__(self, d_model, eps=1e-5, device=None, dtype=None):
        super().__init__()
        self.eps = eps
        self.device = device
        self.dtype = dtype

        weight = torch.ones(d_model, device=device, dtype=dtype)
        self.W = torch.nn.Parameter(weight)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast to float32 so squaring does not overflow in low precision
        in_dtype = x.dtype
        output = x.to(torch.float32)
        denom = torch.sqrt(output.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        output = output / denom
        output = output * self.W
        # Cast back to the input dtype (self.dtype may be None)
        output = output.to(in_dtype)
        return output
    
torch.manual_seed(42)
model = MyRMSNorm(8)

batch = torch.randn((4, 8))

output = model(batch)
output.v

tensor[4, 8] n=32 x∈[-1.502, 1.712] μ=0.232 σ=0.988 grad MulBackward0
tensor([[ 1.3741,  1.0606,  0.6423, -1.5015,  0.4838, -0.8804, -0.0307, -1.1443],
        [-0.7808,  1.7116, -0.4074, -1.4571, -0.7556, -0.5807, -0.7981,  0.7915],
        [ 1.6059, -0.1561, -0.4864,  0.4298, -0.7413,  1.0544,  0.7831,  1.6434],
        [ 1.4381,  1.4575,  0.6863,  1.5006, -0.2604,  0.0469, -0.2828,  0.9667]],
       grad_fn=<MulBackward0>)

## Exercise 11: Problem (positionwise_feedforward): Implement the position-wise feed-forward network (2 points)

In [22]:
import torch
from math import sqrt
from einops import einsum

class MySilu(torch.nn.Module):
    def __init__(self, device=None, dtype=None):
        super().__init__()
        self.device = device
        self.dtype = dtype
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # PyTorch way
        sigm = torch.sigmoid(x)
        output = x * sigm
        return output
    
torch.manual_seed(42)
model = MySilu()

batch = torch.randn((4, 8))
output = model(batch)
output.v

tensor[4, 8] n=32 x∈[-0.278, 1.682] μ=0.360 σ=0.627
tensor([[ 1.6820,  1.2131,  0.6405, -0.2286,  0.4501, -0.2783, -0.0211, -0.2685],
        [-0.2410,  1.3828, -0.1582, -0.2769, -0.2370, -0.2035, -0.2435,  0.5199],
        [ 1.3760, -0.0734, -0.1881,  0.2673, -0.2419,  0.8046,  0.5527,  1.4167],
        [ 1.0007,  1.0180,  0.3956,  1.0566, -0.1025,  0.0213, -0.1100,  0.6042]])
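The cell above covers only the SiLU activation, while the exercise title refers to the full position-wise feed-forward network. Here is a minimal sketch of a SwiGLU-style block, assuming the form FFN(x) = W2(SiLU(W1 x) * W3 x) with bias-free projections. `torch.nn.Linear` is used for brevity instead of a custom linear module; the class name `SwiGLUSketch` and the sizes are hypothetical.

```python
import torch

class SwiGLUSketch(torch.nn.Module):
    """FFN(x) = W2(SiLU(W1 x) * W3 x), all projections without bias."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = torch.nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = torch.nn.Linear(d_ff, d_model, bias=False)  # output projection
        self.w3 = torch.nn.Linear(d_model, d_ff, bias=False)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.w1(x)
        gate = gate * torch.sigmoid(gate)  # SiLU(z) = z * sigmoid(z)
        return self.w2(gate * self.w3(x))

torch.manual_seed(42)
ffn = SwiGLUSketch(d_model=8, d_ff=16)
x = torch.randn(4, 8)
print(ffn(x).shape)  # torch.Size([4, 8])
```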

## Exercise 12: Problem (rope): Implement RoPE (2 points)

In [23]:
# import torch
# from math import sqrt
# from einops import einsum

# class MyRope(torch.nn.Module):
#     def __init__(self, theta: float, d_k: int, max_seq_len: int, device=None, ):
#         super().__init__()
#         self.device = device
#         self.dtype = dtype
        
#     def forward(self, x: torch.Tensor) -> torch.Tensor:
#         # PyTorch way
#         sigm = torch.sigmoid(x)
#         output = x * sigm
#         return output
    
# torch.manual_seed(42)
# model = MySilu()

# batch = torch.randn((4, 8))
# output = model(batch)
# output.v
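The cell above is a commented-out placeholder whose body still computes SiLU. As a starting point, here is a minimal sketch of rotary position embeddings, assuming the interleaved-pair convention: dimensions (0, 1), (2, 3), ... are rotated together, with angle `position * theta ** (-2k / d_k)` for pair `k`. The class name `RopeSketch` and the `token_positions` argument are assumptions, not taken from the notebook.

```python
import torch

class RopeSketch(torch.nn.Module):
    def __init__(self, theta: float, d_k: int, max_seq_len: int, device=None):
        super().__init__()
        assert d_k % 2 == 0, "RoPE rotates pairs of dimensions"
        # One frequency per pair of dimensions: theta^(-2k/d_k), k = 0 .. d_k/2 - 1
        exponents = torch.arange(0, d_k, 2, device=device, dtype=torch.float32) / d_k
        inv_freq = theta ** (-exponents)
        positions = torch.arange(max_seq_len, device=device, dtype=torch.float32)
        angles = positions[:, None] * inv_freq[None, :]  # (max_seq_len, d_k / 2)
        # Buffers move with the module but are not trainable parameters.
        self.register_buffer("cos", angles.cos(), persistent=False)
        self.register_buffer("sin", angles.sin(), persistent=False)

    def forward(self, x: torch.Tensor, token_positions: torch.Tensor) -> torch.Tensor:
        # x: (..., seq_len, d_k); token_positions: (..., seq_len) integer indices
        cos = self.cos[token_positions]
        sin = self.sin[token_positions]
        x_even, x_odd = x[..., 0::2], x[..., 1::2]
        rot_even = x_even * cos - x_odd * sin
        rot_odd = x_even * sin + x_odd * cos
        # Interleave the rotated halves back into the original layout.
        return torch.stack((rot_even, rot_odd), dim=-1).flatten(-2)

torch.manual_seed(42)
rope = RopeSketch(theta=10000.0, d_k=8, max_seq_len=16)
x = torch.randn(2, 5, 8)
out = rope(x, torch.arange(5))
print(out.shape)  # torch.Size([2, 5, 8])
```

Because each pair is only rotated, the norm of every vector is preserved, and position 0 is left unchanged.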

## Exercise 13: Problem (softmax): Implement softmax (1 point)


In [24]:
def MySoftmax(x: torch.Tensor, dim: int):
    # numerical stability at high values of logits:
    # subtract the dim-wise max (not the overall max) before exponentiating
    x_max = torch.max(x, dim=dim, keepdim=True)[0]
    x = x - x_max

    # usual business here
    num = torch.exp(x)
    # after the shift the largest term is exp(0) = 1, so denom >= 1 and no eps is needed
    denom = num.sum(dim=dim, keepdim=True)
    output = num / denom
    return output

torch.manual_seed(42)

batch = torch.randn((4, 8))
output = MySoftmax(batch, -1)
output.v

tensor[4, 8] n=32 x∈[0.007, 0.507] μ=0.125 σ=0.119
tensor([[0.3971, 0.2558, 0.1423, 0.0070, 0.1139, 0.0168, 0.0554, 0.0116],
        [0.0460, 0.5071, 0.0659, 0.0240, 0.0471, 0.0557, 0.0452, 0.2090],
        [0.2693, 0.0444, 0.0317, 0.0809, 0.0244, 0.1532, 0.1161, 0.2799],
        [0.2011, 0.2046, 0.1031, 0.2126, 0.0444, 0.0584, 0.0435, 0.1323]])
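To see why the max is subtracted first, here is a small sketch comparing a naive softmax with a shifted one on large logits. The function names are illustrative, and `torch.softmax` serves as the reference.

```python
import torch

def naive_softmax(x, dim):
    e = torch.exp(x)  # overflows to inf for large logits
    return e / e.sum(dim=dim, keepdim=True)

def stable_softmax(x, dim):
    shifted = x - x.max(dim=dim, keepdim=True).values
    e = torch.exp(shifted)  # largest term is exp(0) = 1
    return e / e.sum(dim=dim, keepdim=True)

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(naive_softmax(logits, -1))   # all nan: inf / inf
print(stable_softmax(logits, -1))  # finite probabilities
print(torch.allclose(stable_softmax(logits, -1), torch.softmax(logits, -1)))  # True
```

Subtracting a constant along `dim` does not change the result mathematically, since the factor `exp(-max)` cancels between numerator and denominator.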

## Exercise 14: Problem (scaled_dot_product_attention): Implement scaled dot-product attention (5 points)


In [41]:
from math import sqrt

def Myscaled_dot_product_attention(Q, K, V, mask=None):
    # einsum approach: dot product of every query with every key
    attn_scores = einsum(Q, K, "... queries d_k, ... keys d_k -> ... queries keys")
    attn_weights = attn_scores / sqrt(Q.shape[-1])

    if mask is not None:
        attn_weights = attn_weights.masked_fill(mask == 0, -torch.inf)

    attn_weights = MySoftmax(attn_weights, dim=-1)
    context_vec = einsum(attn_weights, V, "... queries sl, ... sl d_v -> ... queries d_v")
    return context_vec
    
torch.manual_seed(420)

Q = torch.randn((2, 4, 8))
K = torch.randn((2, 4, 8))
V = torch.randn((2, 4, 8))

# creating a mask
mask = torch.randn((4, 4))
mask = torch.triu(mask, diagonal=1).to(bool)

output = Myscaled_dot_product_attention(Q, K, V, mask)
output.v

tensor[2, 4, 8] n=64 x∈[-1.856, 1.337] μ=-0.164 σ=0.848 NaN!
tensor([[[-1.1559, -0.6587,  0.9698, -0.5012,  0.0382, -0.6236, -0.4751,
           0.1088],
         [-1.7076, -1.1538,  1.0961,  0.0806,  0.7386, -1.2000, -1.5364,
           1.0948],
         [ 0.5950,  0.1390,  0.9795,  1.2782, -0.6123, -0.2495, -0.5716,
           0.4881],
         [    nan,     nan,     nan,     nan,     nan,     nan,     nan,
              nan]],

        [[-0.2903, -0.9661,  0.3084, -0.0689, -0.3157, -0.3652,  0.5390,
          -0.8634],
         [-0.2998, -0.6605,  0.3965,  0.5087,  1.3371, -0.3157,  0.8563,
          -1.2335],
         [-0.0314, -0.8524, -1.3506, -0.8969,  1.3004, -1.8561,  0.5581,
          -0.4879],
         [    nan,     nan,     nan,     nan,     nan,     nan,     nan,
              nan]]])
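The NaN rows come from the mask, not from the attention function. `torch.triu(..., diagonal=1).to(bool)` is `True` only strictly above the diagonal, and the function masks every position where `mask == 0`. Query `i` can therefore only attend to keys `j > i`, and the last query has no keys left: a softmax over a row of all `-inf` yields NaN. Here is a sketch of a causal mask under the same convention (`True` means "may attend"), using `torch.softmax` as a stand-in for the custom softmax:

```python
import torch

seq_len = 4
# True where a query (row) may attend to a key (column): keys at or before the query.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)

torch.manual_seed(420)
scores = torch.randn(2, seq_len, seq_len)
masked = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(torch.isnan(weights).any())  # tensor(False)
print(weights.sum(dim=-1))         # every row sums to 1
```

Every row keeps at least its diagonal entry, so no row is fully masked.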

## Exercise 15: Problem (multihead_self_attention): Implement causal multi-head self-attention (5 points)


In [None]:
class MySDPA(torch.nn.Module):
    def __init__(self, seq_len, emb_size):
        super().__init__()
        self.W_query = torch.nn.Parameter(torch.randn(seq_len, emb_size))
        self.W_key   = torch.nn.Parameter(torch.randn(seq_len, emb_size))
        self.W_value = torch.nn.Parameter(torch.randn(seq_len, emb_size))

    def forward(self, x):
        return Myscaled_dot_product_attention(self.W_query, self.W_key, self.W_value, None)

class MyCausalMHA(torch.nn.Module):
    def __init__(self, d_model: int, num_heads: int, seq_len: int, emb_size: int):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.seq_len = seq_len
        self.emb_size = emb_size
        self.d_k = self.d_model // self.num_heads

        self.heads = torch.nn.ModuleList([MySDPA(self.seq_len, self.emb_size)
                                         for _ in range(self.num_heads)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        context_vecs = torch.cat([head(x) for head in self.heads], dim=-1)
        return context_vecs

torch.manual_seed(42)
model = MyCausalMHA(d_model=8, num_heads=2, seq_len=6, emb_size=4)
batch = torch.randn((2, 6, 4))
context_vec = model(batch)
context_vec.v

tensor[6, 8] n=48 x∈[-0.719, 2.601] μ=0.160 σ=0.580 grad CatBackward0
tensor([[-0.2432,  0.3059,  0.1249, -0.2378,  0.3230,  0.1088, -0.3383,  0.2205],
        [-0.1074,  0.2105,  1.6652,  0.4818,  0.0650,  0.5881,  0.0076, -0.1414],
        [-0.2060,  0.3413,  1.1676,  0.4955, -0.0630, -0.5109, -0.7194, -0.5144],
        [-0.1612,  0.3109,  2.6013,  0.9274,  0.0252,  0.2998, -0.0720,  0.1192],
        [-0.2827,  0.1304, -0.0213, -0.0715, -0.0075,  0.4589, -0.1612, -0.3944],
        [-0.2487,  0.3305,  1.1002,  0.5451, -0.1859, -0.3870, -0.2347,  0.0226]],
       grad_fn=<CatBackward0>)
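The output above has shape `[6, 8]` with no batch dimension. Each head uses its own parameters directly as Q, K and V and never reads `x`, and no causal mask is applied. For comparison, here is a minimal sketch of causal multi-head self-attention that projects the input and masks future positions. It assumes a fused QKV projection and `torch.nn.Linear` / `torch.softmax` for brevity; the class name `CausalMHASketch` is hypothetical.

```python
import torch

class CausalMHASketch(torch.nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.qkv = torch.nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = torch.nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        # True where attention is allowed: keys at or before the query position.
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        ctx = torch.softmax(scores, dim=-1) @ v
        # (batch, heads, seq, d_k) -> (batch, seq, d_model)
        ctx = ctx.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(ctx)

torch.manual_seed(42)
mha = CausalMHASketch(d_model=8, num_heads=2)
x = torch.randn(2, 6, 8)
print(mha(x).shape)  # torch.Size([2, 6, 8])
```

A quick causality check is to perturb the last token and confirm that the outputs at earlier positions do not change.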