# Transformers and LLMs

	• Self-Attention Mechanism
	• Positional Encoding, Multi-Head Attention
	• Feedforward + LayerNorm + Residual
	• Python: Mini-Transformer from Scratch (PyTorch)
How DeepSeek Rewrote the Transformer [MLA] https://youtu.be/0VLAoVGf_74?si=AecUx9gtvM-ETDMR 


DeepSeek R1

Agents


This module covers Transformers and Large Language Models (LLMs): the essential theory, plus a Mini-Transformer implemented from scratch in PyTorch.

⸻


1. Self-Attention Mechanism

Each token attends to every token in the sequence (including itself) via scaled dot-product attention:

Given query $Q$, key $K$, and value $V$:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where $d_k$ is the dimensionality of the key vectors.
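
A minimal sketch of the formula above in PyTorch, assuming `q`, `k`, and `v` are already-projected tensors of shape `(batch, tokens, d_k)`. The function name `scaled_dot_product_attention` and the tensor sizes are illustrative choices, not part of the original notes:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k: (..., T, d_k); v: (..., T, d_v). Returns (output, weights)."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., T, T)
    # Softmax over the key axis: each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted average of the value vectors
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 16)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([2, 5, 16])
print(weights.shape)  # torch.Size([2, 5, 5])
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.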

⸻

2. Positional Encoding & Multi-Head Attention

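Positional Encoding adds order information. Self-attention on its own is permutation-equivariant: shuffling the input tokens just shuffles the outputs the same way, so without it the model has no notion of token order: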

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
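
As a worked example, the formula can be evaluated one position at a time in plain Python. The helper name `positional_encoding` and the tiny `d_model=4` are assumptions chosen only to keep the numbers readable:

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = []
    for j in range(d_model):
        i = j // 2  # dimensions 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        # Even dimensions use sin, odd dimensions use cos
        pe.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))  # sin(1), cos(1), sin(0.01), cos(0.01)
```

Low dimensions oscillate quickly with position while high dimensions change slowly, so each position gets a distinct pattern across the vector.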

Multi-Head Attention:
	•	Projects input into multiple Q/K/V spaces
	•	Runs attention in parallel
	•	Concatenates and reprojects
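
The three steps above can be traced as tensor shapes. The sketch below first shows that splitting the channel dimension into heads and merging it back are inverse reshapes, then runs PyTorch's built-in `nn.MultiheadAttention` for comparison. The sizes (`B=8`, `T=10`, `dim=64`, `heads=4`) are arbitrary illustrative values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, dim, heads = 8, 10, 64, 4
x = torch.randn(B, T, dim)

# Split: (B, T, dim) -> (B, heads, T, head_dim), so each head attends independently
split = x.reshape(B, T, heads, dim // heads).transpose(1, 2)
print(split.shape)  # torch.Size([8, 4, 10, 16])

# Concatenate: (B, heads, T, head_dim) -> (B, T, dim)
merged = split.transpose(1, 2).reshape(B, T, dim)
print(torch.equal(merged, x))  # True

# Built-in layer: performs the Q/K/V projections, per-head attention, and output projection
mha = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
out, weights = mha(x, x, x)  # self-attention: query = key = value = x
print(out.shape)      # torch.Size([8, 10, 64])
print(weights.shape)  # torch.Size([8, 10, 10])
```

By default the built-in layer returns attention weights averaged over heads, which is why `weights` has no head dimension here.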

⸻

3. Transformer Block

Each block includes:
	1.	Multi-Head Attention
	2.	Add + LayerNorm
	3.	Feedforward Network
	4.	Add + LayerNorm

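Each sublayer is wrapped in a residual connection. In the original post-norm layout:

x → Attention → Add & Norm → FFN → Add & Norm

The implementation in section 4 uses the pre-norm variant instead, applying LayerNorm to each sublayer's input: x + Attention(LayerNorm(x)), then x + FFN(LayerNorm(x)).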
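
A minimal sketch of how the residual and normalization steps compose, contrasting the post-norm order from the diagram with the pre-norm order (normalizing the sublayer input) that the code in section 4 uses. Here `sublayer` is a hypothetical stand-in for either attention or the feedforward network:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 16
norm = nn.LayerNorm(dim)
sublayer = nn.Linear(dim, dim)  # hypothetical stand-in for attention or the FFN

x = torch.randn(2, 5, dim)

# Post-norm: add the residual first, then normalize the sum
post_norm = norm(x + sublayer(x))

# Pre-norm: normalize only the sublayer input; the residual path stays untouched
pre_norm = x + sublayer(norm(x))

print(post_norm.shape, pre_norm.shape)
```

In the post-norm form every block's output is normalized, while in the pre-norm form the input `x` passes through the residual path unchanged, a layout often reported to train more stably in deep stacks.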



⸻

4. Python: Mini-Transformer from Scratch (PyTorch)

import torch
import torch.nn as nn
import math

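class SelfAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by heads"
        self.heads = heads
        # Scale by 1/sqrt(d_k), where d_k is the per-head dimension, not the full model dim
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3)  # fused Q, K, V projection
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, T, C) -> (B, heads, T, C // heads)
        q, k, v = (t.reshape(B, T, self.heads, C // self.heads).transpose(1, 2) for t in qkv)
        dots = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, T, T)
        attn = torch.softmax(dots, dim=-1)
        # Merge heads back: (B, heads, T, C // heads) -> (B, T, C)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.to_out(out)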

class TransformerBlock(nn.Module):
    def __init__(self, dim, heads, ff_hidden):
        super().__init__()
        self.attn = SelfAttention(dim, heads)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_hidden),
            nn.ReLU(),
            nn.Linear(ff_hidden, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ff(self.norm2(x))
        return x

class PositionalEncoding(nn.Module):
    def __init__(self, dim, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # shape (1, max_len, dim)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, dim, depth, heads, ff_hidden, max_len):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, dim)
        self.pos_encoding = PositionalEncoding(dim, max_len)
        self.blocks = nn.Sequential(*[
            TransformerBlock(dim, heads, ff_hidden) for _ in range(depth)
        ])
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, x):
        x = self.token_embedding(x)
        x = self.pos_encoding(x)
        x = self.blocks(x)
        return self.to_logits(x)

# Sample usage
model = MiniTransformer(vocab_size=1000, dim=64, depth=2, heads=4, ff_hidden=256, max_len=50)
dummy_input = torch.randint(0, 1000, (8, 50))  # batch of 8 sequences, 50 token ids each
out = model(dummy_input)
print(out.shape)  # torch.Size([8, 50, 1000])



⸻

Possible extensions: add causal masking for language modeling, or plug the model into a training loop with tokenized data.