# Model Overall Structure

1. **Token + Positional Embedding**  
   Each character in the input is embedded into a vector based on:
   - Its identity (**token embedding**)
   - Its position within the block (**positional embedding**)  
   These are summed to form the initial input vectors for each token.

2. **Stack of Transformer Blocks**  
   The model consists of multiple transformer blocks (`n_layer` determines how many). Each block follows this pattern:
   - `LayerNorm → Self-Attention → Residual connection`
   - `LayerNorm → Feed Forward → Residual connection`

3. **Self-Attention Mechanism**  
   - Each token produces a **query**, **key**, and **value** vector.
   - Attention weights are computed as `softmax(QKᵀ / √d)`, masked to prevent attending to future tokens (causal attention).
   - The result is a weighted combination of values, capturing contextual information.
   - Multi-head attention is used to allow the model to attend to different representation subspaces.

4. **Feed Forward Network (FFN)**  
   - A two-layer MLP: `Linear(n_emb → 4×n_emb) → ReLU → Linear(4×n_emb → n_emb)`
   - This introduces non-linearity and enables more complex transformations for each token independently.

5. **Final Projection**  
   - After all transformer blocks, a final `LayerNorm` is applied.
   - The output is passed through a linear layer projecting from `n_emb` to `vocab_size`, yielding **unnormalized logits** for the next token.

6. **Text Generation**  
   - A softmax converts logits into probabilities.
   - The next token is sampled from this distribution and appended to the sequence.
   - This process is repeated for generating sequences of desired length.


## Imports and Hyperparameters

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [2]:
block_size = 256
batch_size = 64
max_iters = 15000            # ~60 epochs
eval_interval = 750          # Evaluate every ~5 epochs
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 100
n_emb = 384
n_layer = 6
n_head = 6
dropout = 0.2
temperature = 0.8

In [3]:
torch.manual_seed(2408)

<torch._C.Generator at 0x1eed1329730>

## Text Processing
Text is read from a .txt file.
Chars found in the file is mapped to an integer for encoding nad decoding purposes.
90% of the dataset is assigned to the train set and the remaining 10% validation set.
get_batch is defined, which splits the data into tensors

In [4]:
with open('combined.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [6]:
#creating a map for char to int and vv
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] #encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) #decoder: take a list of integers, output a string

In [7]:
# tokenizer = ByteLevelBPETokenizer()
# tokenizer.train(files=["combined.txt"], vocab_size=50257, min_frequency=2, special_tokens=[
#     "<s>", "<pad>", "</s>", "<unk>", "<mask>"
# ])

In [8]:
# tokenizer.save_model(".", "bpe_tokenizer")
# tokenizer = LoadedTokenizer("bpe_tokenizer-vocab.json", "bpe_tokenizer-merges.txt")

In [9]:
# encoded = tokenizer.encode(text)
# data = torch.tensor(encoded.ids, dtype=torch.long)

In [10]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [11]:
# data = torch.tensor(encoded.ids, dtype=torch.long)
# n = int(0.9*len(data))
# train_data = data[:n]
# val_data = data[n:]

In [12]:
#function to get a batch of data
def get_batch(split):
    #generates a small batch of input and target
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) #random offsets
    
    #get a 4x8 tensor
    x = torch.stack([data[i:i+block_size] for i in ix]) 
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)  #move to device
    
    return x, y

## Loss Estimator


In [13]:
#function to estimate loss
@torch.no_grad()
def estimate_loss():
    model.eval()  #set the model to evaluation mode
    out = {}
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            # x, y = x.to(device), y.to(device)
            logits, loss = model(x, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()  #set the model back to training mode
    return out

## Self-Attention Heads
Each Head computes the key, query from each input. From this, the attention weight is calculated and the weight is used to compute the value vector.
It also uses a triangular vector to make sure that it will only be affected by past and not by future.

In [14]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_emb, head_size, bias=False)
        self.query = nn.Linear(n_emb, head_size, bias=False)
        self.value = nn.Linear(n_emb, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout) 
        
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # compute attention scores
        wei = q @ k.transpose(-2, -1) * C ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # apply causal mask
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        wei = self.dropout(wei)  # apply dropout to attention weights
        v = self.value(x)  # (B, T, C)
        out = wei @ v  # (B, T, C)
        return out

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
        self.proj = nn.Linear(n_emb, n_emb)  # projection to the original embedding size
        self.dropout = nn.Dropout(dropout)  

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # concatenate the outputs of all heads   
        out = self.proj(out)  # project back to the original embedding size
        out = self.dropout(out)
        return out  # return the concatenated output

## Feed Forward Layer
Adds non-linearity to each token using ReLU

In [16]:
class FeedForward(nn.Module):
    def __init__(self, n_emb):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_emb, 4 * n_emb),
            nn.GELU(), 
            nn.Linear(4 * n_emb, n_emb),
            nn.Dropout(dropout)  # dropout for regularization    
            )

    def forward(self, x):
        return self.net(x)  # apply the feedforward network

## Block
Normalizes each token before self-attention. Each token is then normalized again before going through feed-forward

In [17]:
class Block(nn.Module):
    def __init__(self, n_emb, n_head):
        super().__init__()
        heead_size = n_emb // n_head
        self.sa = MultiHeadAttention(n_head, heead_size)  # self-attention
        self.ffwd = FeedForward(n_emb)  # feedforward network
        self.ln1 = nn.LayerNorm(n_emb) 
        self.ln2 = nn.LayerNorm(n_emb)  

    def forward(self, x):
        x = x + self.sa(self.ln1(x))  # apply self-attention
        x = x + self.ffwd(self.ln2(x))  # apply feedforward network
        return x

## Model


In [None]:
class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        #each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_emb)
        self.pos_embedding_table = nn.Embedding(block_size, n_emb)  # positional embeddings
        self.blocks = nn.Sequential(*[Block(n_emb, n_head=n_head) for _ in range(n_layer)])  # stack of transformer blocks
        self.ln_f = nn.LayerNorm(n_emb)  
        self.lm_head = nn.Linear(n_emb, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape  # B is batch size, T is block size

        #idx and targets are both (B,T) tensor of integers
        tkn_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tkn_emb + pos_emb  
        x = self.blocks(x)  # pass through the transformer blocks
        x = self.ln_f(x)  # apply layer normalization
        
        logits = self.lm_head(x) # (B,T,vc size)
        #logits are the unnormalized log probabilities of the next token

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) #quality of logits based on targets

        return logits, loss
        
    def generate(self, idx, max_new_tokens):
        #idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            #cut the context to the last block_size tokens
            idx_cond = idx[:, -block_size:]  # (B, T) get the last block_size tokens
            #get the predictions
            logits, loss = self(idx_cond)
            #focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            #apply softmax to get probabilities
            probs = F.softmax(logits / temperature, dim=-1)
            #sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

## Model Initialization / Loading

In [19]:
#initialize the model    
model = BigramLanguageModel()
m = model.to(device)
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

In [20]:
# model = BigramLanguageModel()
# model.load_state_dict(torch.load('model.pt'))
# m = model.to(device)
# optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

## Training

In [21]:
best_val_loss = float('inf')  # Track the best validation loss
checkpoint_path = "checkpoint.pt"  # File to save model checkpoint

In [22]:
for iter in range(max_iters):
    
    #evaluate the model every eval_interval iterations
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"Step {iter}: Train loss {losses['train']:.4f}, Val loss {losses['val']:.4f}")
        
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save(model.state_dict(), checkpoint_path)
            print(f"New best val loss! Model saved at iteration {iter}.")

    #get a batch of data
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)

    #evaluate the model
    logits, loss = m(xb, yb)

    #backpropagation
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

Step 0: Train loss 4.7751, Val loss 4.7732
New best val loss! Model saved at iteration 0.
Step 750: Train loss 1.4852, Val loss 1.5263
New best val loss! Model saved at iteration 750.
Step 1500: Train loss 1.2464, Val loss 1.3067
New best val loss! Model saved at iteration 1500.
Step 2250: Train loss 1.1527, Val loss 1.2196
New best val loss! Model saved at iteration 2250.
Step 3000: Train loss 1.0899, Val loss 1.1713
New best val loss! Model saved at iteration 3000.
Step 3750: Train loss 1.0483, Val loss 1.1413
New best val loss! Model saved at iteration 3750.
Step 4500: Train loss 1.0138, Val loss 1.1131
New best val loss! Model saved at iteration 4500.
Step 5250: Train loss 0.9872, Val loss 1.1007
New best val loss! Model saved at iteration 5250.
Step 6000: Train loss 0.9644, Val loss 1.0880
New best val loss! Model saved at iteration 6000.
Step 6750: Train loss 0.9433, Val loss 1.0792
New best val loss! Model saved at iteration 6750.
Step 7500: Train loss 0.9310, Val loss 1.0775
Ne

## Saving model weights

In [23]:
torch.save(m.state_dict(), 'model.pt')

## Model Testing

In [26]:
start = "THE ONE WITH THE "
context = torch.tensor([encode(start)], dtype=torch.long, device=device)  # wrap in tensor and send to device
generated = m.generate(context, max_new_tokens=10000)[0].tolist()
print(decode(generated))

THE ONE WITH THE OLD TO BUT O OD MOON!!!

MONICA: Yeah.

RACHEL: Yeah it's okay to so be okay, he's okay, this is not funny... , he's okay, got a call for a sec; it has no final.

MONICA: Yes, because, he's got a big deal, and he's got a loong ring, and I need a way to hell with you.

RACHEL: Why does a little fan of life.

MONICA: Could you look it, you know when we were out to knee your name?

RACHEL: Yeah, actually it was really huge. I was a name.

MONICA: Yeah, we'd just leave them a noted a deal.

JOEY: Ok, man, I went you out with her.

MONICA: With her question and the classics shake.

ROSS: No.

CHANDLER: Wow, you don't make turned out any night.

ROSS: Hey look. Ok.

PHOEBE: So what do you guys want to do that? Why won't the good part?

MONICA: It's not so obvious. Rachel, you're supposed to be a men. They don't like it been a little Turty back of Ben. They look all nice.

ROSS: I'm sorry that you put your head.

RACHEL: How do you do not think that?

ROSS: Because you got me

In [25]:
# start = "THE ONE WITH THE "
# context = torch.tensor([tokenizer.encode(start)], dtype=torch.long, device=device)  # wrap in tensor and send to device
# generated = m.generate(context, max_new_tokens=1000)[0].tolist()
# print(tokenizer.decode(generated))