# Custom Transformer Implementation from Scratch

**Project**: MiNiLLM  
**Technologies**: Python, NumPy, PyTorch, NLP  
**Source**: [https://github.com/anarcoiris/MiNiLLM](https://github.com/anarcoiris/MiNiLLM)

---

## Executive Summary

A low-level implementation of the Transformer architecture from "Attention Is All You Need" (Vaswani et al., 2017), built from first principles to demonstrate how self-attention and positional encoding work.

---


## 1. Transformer Architecture Fundamentals

### Core Components

```
Input Sequence
      ↓
[Token Embedding] + [Positional Encoding]
      ↓
[Multi-Head Self-Attention]
      ↓
[Layer Normalization]
      ↓
[Feed-Forward Network]
      ↓
[Layer Normalization]
      ↓
(Repeat N layers)
      ↓
[Output Projection]
      ↓
Predicted Next Token
```
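
The diagram above can be sketched as a forward pass in NumPy. This is a minimal illustration, not the repository's code: it assumes a single attention head, a ReLU feed-forward layer, LayerNorm without learned gain or bias, and residual additions around each sub-layer (the diagram does not draw them, but the summary mentions residual connections). The names `block_forward`, `init_block`, and `forward` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (learned gain/bias omitted)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def block_forward(x, p):
    """Attention -> LayerNorm -> FFN -> LayerNorm for x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    scores = q @ k.T / np.sqrt(d_model)
    # Causal mask: position i may only attend to positions 0..i
    scores = scores + np.triu(np.full((seq_len, seq_len), -1e9), k=1)
    attn_out = softmax(scores) @ v @ p["Wo"]
    x = layer_norm(x + attn_out)                      # residual add (assumption), then LayerNorm
    ffn_out = np.maximum(0.0, x @ p["W1"]) @ p["W2"]  # ReLU feed-forward (assumption)
    return layer_norm(x + ffn_out)

def init_block(rng, d_model, d_ff):
    # Small random weights; the scaling keeps activations at a reasonable magnitude
    scale = 1.0 / np.sqrt(d_model)
    p = {name: rng.standard_normal((d_model, d_model)) * scale
         for name in ("Wq", "Wk", "Wv", "Wo")}
    p["W1"] = rng.standard_normal((d_model, d_ff)) * scale
    p["W2"] = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)
    return p

rng = np.random.default_rng(0)
seq_len, d_model, n_layers, vocab_size = 5, 16, 2, 10
blocks = [init_block(rng, d_model, 4 * d_model) for _ in range(n_layers)]
W_out = rng.standard_normal((d_model, vocab_size)) / np.sqrt(d_model)

def forward(x):
    for p in blocks:      # "Repeat N layers"
        x = block_forward(x, p)
    return x @ W_out      # output projection to vocabulary logits

# x stands in for token embedding + positional encoding
x = rng.standard_normal((seq_len, d_model))
logits = forward(x)
print(logits.shape)  # (5, 10)
```

Because of the causal mask, changing the last input position leaves the logits at all earlier positions unchanged, which is what allows next-token training on every position at once.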

### Key Parameters
- **Vocabulary Size**: Character-level (~100 chars)
- **Embedding Dimension**: 384
- **Num Heads**: 6
- **Num Layers**: 6
- **Context Window**: 256 characters
- **Total Parameters**: ~10M

In [None]:
# Setup
import sys
from pathlib import Path

# Try to add MiNiLLM to path (repository code available for reference only)
try:
    repo_path = Path('MiNiLLM').resolve()
    if repo_path.exists():
        sys.path.insert(0, str(repo_path))
        print("✓ Repository code loaded")
    else:
        print("ℹ Note: Repository code not found. Using standalone demo implementations.")
except Exception as e:
    print(f"ℹ Note: Repository import skipped - using demo code ({e})")

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Environment configured for transformer implementation")
print("\n📝 Execution Note:")
print("   This notebook demonstrates transformer architecture from first principles.")
print("   Full production code available at: https://github.com/anarcoiris/MiNiLLM")

## 2. Self-Attention Mechanism

### Scaled Dot-Product Attention

**Formula**:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```

Where:
- **Q** (Query): "What am I looking for?"
- **K** (Key): "What do I contain?"
- **V** (Value): "What do I actually say?"
- **d_k**: Dimension of the query and key vectors; dividing by √d_k keeps the score variance roughly constant as the dimension grows
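
A quick numerical check of why the scores are divided by √d_k. The sketch assumes query and key components are independent with zero mean and unit variance, which is an idealization of real activations; under that assumption the raw dot product has variance close to d_k, and the scaled one has variance close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
variances = {}

for d_k in (8, 64, 512):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=-1)       # unscaled dot products q . k
    scaled = raw / np.sqrt(d_k)        # scores as used inside the softmax
    variances[d_k] = (raw.var(), scaled.var())
    print(f"d_k={d_k:4d}  var(raw)={raw.var():8.1f}  var(scaled)={scaled.var():.2f}")
```

Without the scaling, larger d_k pushes the softmax toward a near one-hot distribution, where its gradients become very small.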

### Multi-Head Attention

- Run 6 attention heads in parallel, each of dimension 384 / 6 = 64
- Each head learns different relationships
- Concatenate and project outputs
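
The three steps above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the repository's implementation: the function name `multi_head_attention` and the weight names are hypothetical, the projections have no bias terms, and a causal mask is applied inside the function.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Causal multi-head self-attention for x of shape (batch, seq_len, d_model)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
        return t.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Every head computes its own scaled dot-product attention in parallel
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    causal = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
    weights = softmax(scores + causal, axis=-1)
    heads = weights @ v                               # (batch, n_heads, seq_len, d_head)

    # Concatenate the heads along the feature axis, then apply the output projection
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
batch, seq_len, d_model, n_heads = 2, 10, 384, 6
x = rng.standard_normal((batch, seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(4))

out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)      # (2, 10, 384)
print(weights.shape)  # (2, 6, 10, 10)
```

Splitting one 384-wide projection into 6 heads of width 64 costs the same number of parameters as a single wide head, but gives each head its own attention pattern.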

In [None]:
# Simplified self-attention implementation
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Implement scaled dot-product attention mechanism.
    
    This is the core operation in transformer models, allowing the model
    to weigh the importance of different positions in the sequence.
    
    Args:
        Q: Queries (batch, seq_len, d_k) - "What am I looking for?"
        K: Keys (batch, seq_len, d_k) - "What do I contain?"
        V: Values (batch, seq_len, d_v) - "What information do I provide?"
        mask: Causal mask for autoregressive modeling (prevents looking ahead)
    
    Returns:
        output: Attention-weighted values
        attention_weights: Attention distribution (useful for visualization)
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute attention scores (similarity between queries and keys)
    # Shape: (batch, seq_len, seq_len)
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
    # Division by sqrt(d_k) prevents scores from growing too large (gradient stability)
    
    # Step 2: Apply causal mask (prevent looking at future tokens)
    if mask is not None:
        scores = scores + mask  # mask holds a large negative value (e.g. -1e9) at future positions
    
    # Step 3: Softmax to get attention weights (sum to 1 across keys)
    attention_weights = softmax(scores, axis=-1)
    
    # Step 4: Apply attention to values (weighted sum)
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

def softmax(x, axis=-1):
    """
    Numerically stable softmax implementation.
    
    Subtracts max value before exp() to prevent overflow.
    """
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Demo: Attention for simple sequence
np.random.seed(42)
batch, seq_len, d_model = 1, 5, 8

Q = np.random.randn(batch, seq_len, d_model)
K = np.random.randn(batch, seq_len, d_model)
V = np.random.randn(batch, seq_len, d_model)

# Create causal mask (upper triangle = -1e9, which acts like -infinity after softmax)
# This ensures token i can only attend to tokens 0...i (not future tokens)
mask = np.triu(np.ones((seq_len, seq_len)) * -1e9, k=1)
mask = mask[np.newaxis, :, :]  # Add batch dimension

output, weights = scaled_dot_product_attention(Q, K, V, mask)

print("Self-Attention Demo:")
print(f"  Input shape: {Q.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Attention weights shape: {weights.shape}")
print(f"\nAttention weights (each row sums to 1.0):")
print(weights[0].round(3))

In [None]:
# Visualize attention pattern
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(weights[0], annot=True, fmt='.2f', cmap='YlOrRd', 
            xticklabels=[f'Pos {i}' for i in range(seq_len)],
            yticklabels=[f'Pos {i}' for i in range(seq_len)],
            cbar_kws={'label': 'Attention Weight'})
ax.set_title('Causal Self-Attention Pattern', fontsize=14, fontweight='bold')
ax.set_xlabel('Key Position')
ax.set_ylabel('Query Position')
plt.tight_layout()
plt.show()

print("\nNote: Upper triangle is zero (causal mask prevents looking ahead)")

## 3. Positional Encoding

### Sinusoidal Position Embeddings

**Why Needed**: Self-attention treats its input as an unordered set, so without position information the model has no notion of sequence order.

**Formula**:
```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```

**Properties**:
- Unique encoding for each position
- Deterministic (no learned parameters)
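- Defined for any position, so encodings can be computed for sequences longer than those seen in training (model quality at unseen lengths is not guaranteed)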
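
The properties above can be checked numerically. The sketch below re-implements the formula directly (the helper name `sinusoidal_encoding` is hypothetical) and also checks one additional fact that follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a − b): the dot product of two encodings depends only on the offset between their positions.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)
    angles = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(256, 64)

# Unique: no two positions share the same encoding vector
n_unique = len(np.unique(pe.round(6), axis=0))

# Deterministic and length-independent: a longer table reproduces the shorter one
pe_long = sinusoidal_encoding(1024, 64)
same_prefix = np.array_equal(pe_long[:256], pe)

# Relative structure: the dot product of two encodings depends only on their offset
offset = 5
dots = np.sum(pe[:-offset] * pe[offset:], axis=1)

print(f"unique positions: {n_unique} of {len(pe)}")
print(f"longer table shares the same prefix: {same_prefix}")
print(f"spread of dot products at offset {offset}: {dots.std():.2e}")
```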

In [None]:
# Positional encoding implementation
def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encodings.
    """
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pos_encoding = np.zeros((seq_len, d_model))
    pos_encoding[:, 0::2] = np.sin(position * div_term)
    pos_encoding[:, 1::2] = np.cos(position * div_term)
    
    return pos_encoding

# Generate positional encodings
seq_len, d_model = 100, 64
pos_enc = get_positional_encoding(seq_len, d_model)

print(f"Positional Encoding:")
print(f"  Shape: {pos_enc.shape}")
print(f"  Range: [{pos_enc.min():.3f}, {pos_enc.max():.3f}]")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap of positional encodings
im = axes[0].imshow(pos_enc.T, cmap='RdBu', aspect='auto', interpolation='nearest')
axes[0].set_title('Positional Encoding Matrix', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Position')
axes[0].set_ylabel('Embedding Dimension')
plt.colorbar(im, ax=axes[0])

# Individual position encodings
for pos in [0, 25, 50, 75]:
    axes[1].plot(pos_enc[pos], label=f'Position {pos}', alpha=0.7)
axes[1].set_title('Encoding Patterns for Different Positions', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Dimension Index')
axes[1].set_ylabel('Encoding Value')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Training Pipeline

### Character-Level Language Modeling

**Dataset Preparation**:
1. Tokenize text into characters
2. Create fixed-length sequences (context window)
3. Predict next character at each position

**Training Objective**:
- Cross-entropy loss on next-token prediction
- Minimizing this loss also minimizes perplexity, defined as exp(loss)
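
A small worked example of the objective, assuming the loss is the mean negative log-likelihood over positions (the helper name `cross_entropy` is hypothetical). A model that spreads probability uniformly over a 100-character vocabulary has loss ln(100) and perplexity 100; a model that is confident and correct has perplexity close to 1.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood; logits: (n, vocab), targets: (n,) integer ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 100
targets = np.array([3, 17, 42, 99])

# Uniform predictions: every character is equally likely
uniform_logits = np.zeros((len(targets), vocab_size))
uniform_loss = cross_entropy(uniform_logits, targets)
uniform_ppl = np.exp(uniform_loss)
print(f"uniform:   loss={uniform_loss:.3f}  perplexity={uniform_ppl:.1f}")  # 4.605, 100.0

# Confident, correct predictions: a large logit on the true character
confident_logits = np.zeros((len(targets), vocab_size))
confident_logits[np.arange(len(targets)), targets] = 10.0
confident_loss = cross_entropy(confident_logits, targets)
confident_ppl = np.exp(confident_loss)
print(f"confident: loss={confident_loss:.4f}  perplexity={confident_ppl:.4f}")
```

Perplexity can be read as the effective number of characters the model is choosing between at each step.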

**Optimization**:
- AdamW optimizer
- Learning rate schedule: warmup + cosine decay
- Gradient clipping for stability
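
The schedule and clipping steps can be sketched in a few lines. All hyperparameter values below (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`, `max_norm`) are illustrative assumptions, not the project's settings, and the function names are hypothetical. The AdamW update itself is left out.

```python
import math
import numpy as np

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

for step in (0, 50, 99, 100, 550, 1000):
    print(f"step {step:4d}: lr = {lr_at_step(step):.2e}")

grads = [np.array([3.0, 4.0]), np.array([0.0])]       # global norm = 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(f"gradient norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Clipping by the global norm preserves the direction of the overall update while bounding its size, which is why it is usually preferred over clipping each value independently.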

In [None]:
# Character-level tokenization
sample_text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles"""

# Build vocabulary
chars = sorted(list(set(sample_text)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)

# Encode text
encoded = [char_to_idx[ch] for ch in sample_text]

print("Character-Level Tokenization:")
print(f"  Vocabulary size: {vocab_size}")
print(f"  Unique characters: {chars}")
print(f"  Encoded length: {len(encoded)}")
print(f"\nExample encoding:")
print(f"  Text: '{sample_text[:30]}...'")
print(f"  IDs:  {encoded[:30]}")

# Create training sequences
def create_sequences(data, seq_len):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])
        y.append(data[i+1:i+seq_len+1])
    return np.array(X), np.array(y)

seq_len = 32
X_train, y_train = create_sequences(encoded, seq_len)

print(f"\nTraining Sequences:")
print(f"  X shape: {X_train.shape} (num_sequences, seq_len)")
print(f"  y shape: {y_train.shape} (targets shifted by 1)")
print(f"  Total training samples: {len(X_train)}")

## 5. Inference & Text Generation

### Sampling Strategies

**1. Greedy Decoding**:
- Always select highest probability token
- Fast but repetitive

**2. Top-K Sampling**:
- Sample from K most likely tokens
- Balances quality and diversity

**3. Nucleus (Top-P) Sampling**:
- Sample from smallest set with cumulative probability > P
- Adaptive vocabulary size

**4. Temperature Scaling**:
- T < 1: More confident (peaked distribution)
- T > 1: More creative (flattened distribution)
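
To make the effect of temperature concrete, the sketch below applies three temperatures to a small, made-up logit vector and reports the top probability and the entropy of the resulting distribution. It also shows exact greedy decoding as a plain `argmax`, which is the limit of temperature scaling as T approaches 0.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])   # made-up scores for a 4-token vocabulary

max_probs, entropies = [], []
for T in (0.5, 1.0, 2.0):
    probs = softmax(logits / T)
    entropy = -np.sum(probs * np.log(probs))
    max_probs.append(probs.max())
    entropies.append(entropy)
    print(f"T={T}: probs={probs.round(3)}  entropy={entropy:.3f}")

# Exact greedy decoding needs no sampling at all
greedy_token = int(np.argmax(logits))
print(f"greedy token: {greedy_token}")
```

Lower temperatures concentrate probability on the top token (lower entropy), while higher temperatures spread it out (higher entropy).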

In [None]:
# Sampling strategies demo
def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """
    Sample next token with various strategies.
    """
    # Apply temperature
    logits = logits / temperature
    
    # Convert to probabilities
    probs = softmax(logits, axis=-1)
    
    # Top-K filtering
    if top_k > 0:
        indices_to_remove = probs < np.partition(probs, -top_k)[-top_k]
        probs[indices_to_remove] = 0
        probs = probs / probs.sum()
    
    # Nucleus (Top-P) filtering
    if top_p < 1.0:
        sorted_indices = np.argsort(probs)[::-1]
        sorted_probs = probs[sorted_indices]
        cumulative_probs = np.cumsum(sorted_probs)
        
        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].copy()
        sorted_indices_to_remove[0] = False
        
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        probs[indices_to_remove] = 0
        probs = probs / probs.sum()
    
    # Sample from distribution
    return np.random.choice(len(probs), p=probs)

# Demo: Sample with different strategies
np.random.seed(42)
fake_logits = np.random.randn(vocab_size)

strategies = [
    ('Greedy', {'temperature': 0.01, 'top_k': 0, 'top_p': 1.0}),  # near-greedy: T=0.01 approximates argmax
    ('Top-K (k=5)', {'temperature': 1.0, 'top_k': 5, 'top_p': 1.0}),
    ('Nucleus (p=0.9)', {'temperature': 1.0, 'top_k': 0, 'top_p': 0.9}),
    ('High Temp (T=1.5)', {'temperature': 1.5, 'top_k': 0, 'top_p': 1.0}),
]

print("Sampling Strategy Comparison:")
print(f"\n{'Strategy':<20} {'Sampled Tokens (5 trials)'}")
print("-" * 60)

for name, params in strategies:
    samples = [sample_token(fake_logits.copy(), **params) for _ in range(5)]
    sample_chars = [idx_to_char.get(s, '?') for s in samples]
    print(f"{name:<20} {sample_chars}")

## 6. Model Architecture Summary

### MiNiLLM Configuration

```python
config = {
    'vocab_size': 100,        # Character vocabulary
    'd_model': 384,           # Embedding dimension
    'n_heads': 6,             # Attention heads
    'n_layers': 6,            # Transformer blocks
    'context_length': 256,    # Max sequence length
    'dropout': 0.1,           # Regularization
}
```

### Parameter Count

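- Token embeddings: 100 × 384 = 38.4K
- Positional embeddings: 256 × 384 = 98.3K (only if learned; the sinusoidal encoding from Section 3 adds no parameters)
- Attention layers (6): 6 × 4 × 384² ≈ 3.5M
- Feed-forward (6): 6 × 2 × 384 × 1536 ≈ 7.1M
- Output projection: 100 × 384 = 38.4K

**Total**: ~10.8M parameters (weight matrices only)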
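
A short arithmetic check of these counts, extended with the small terms that a weight-only tally leaves out. The extras are assumptions about the layer layout rather than facts from the repository: a bias on every linear layer inside the blocks, two LayerNorms per block with gain and bias, and an untied, bias-free output projection.

```python
d_model, n_layers, vocab_size, context_length = 384, 6, 100, 256
d_ff = 4 * d_model                       # 4x feed-forward expansion

# Weight matrices per block
attn_weights = 4 * d_model * d_model     # Q, K, V, O projections
ffn_weights = 2 * d_model * d_ff         # up and down projections

# Extras per block (assumed layout, see the note above)
attn_biases = 4 * d_model
ffn_biases = d_ff + d_model
layernorm_params = 2 * (2 * d_model)     # two LayerNorms, each with gain + bias

embeddings = vocab_size * d_model + context_length * d_model
output_proj = vocab_size * d_model

total_attn = n_layers * attn_weights
total_ffn = n_layers * ffn_weights
weights_only = total_attn + total_ffn + embeddings + output_proj
with_extras = weights_only + n_layers * (attn_biases + ffn_biases + layernorm_params)

print(f"attention weights:    {total_attn:,}")    # 3,538,944
print(f"feed-forward weights: {total_ffn:,}")     # 7,077,888
print(f"weights only:         {weights_only:,}")  # 10,791,936
print(f"with biases and LN:   {with_extras:,}")   # 10,821,888
```

The biases and LayerNorm parameters add under 0.3% to the total, so leaving them out of a rough estimate is reasonable.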

In [None]:
# Model size analysis
config = {
    'vocab_size': 100,
    'd_model': 384,
    'n_heads': 6,
    'n_layers': 6,
    'context_length': 256,
}

# Calculate parameters (weight matrices only; biases and LayerNorm parameters are omitted)
token_emb = config['vocab_size'] * config['d_model']
pos_emb = config['context_length'] * config['d_model']

# Per-layer attention: 4 weight matrices (Q, K, V, O)
attn_per_layer = 4 * (config['d_model'] ** 2)

# Per-layer FFN: 2 weight matrices (up + down projection)
ffn_per_layer = 2 * (config['d_model'] * 4 * config['d_model'])  # 4x expansion

total_attn = attn_per_layer * config['n_layers']
total_ffn = ffn_per_layer * config['n_layers']
output_proj = config['vocab_size'] * config['d_model']

total_params = token_emb + pos_emb + total_attn + total_ffn + output_proj

print("Model Parameter Breakdown:")
print(f"  Token embeddings:     {token_emb/1e6:.2f}M")
print(f"  Position embeddings:  {pos_emb/1e6:.2f}M")
print(f"  Attention layers:     {total_attn/1e6:.2f}M")
print(f"  Feed-forward layers:  {total_ffn/1e6:.2f}M")
print(f"  Output projection:    {output_proj/1e6:.2f}M")
print(f"  " + "="*40)
print(f"  TOTAL PARAMETERS:     {total_params/1e6:.2f}M")
print(f"\nModel size (FP32):    {total_params * 4 / 1e6:.1f} MB")
print(f"Model size (FP16):    {total_params * 2 / 1e6:.1f} MB")

---

## Summary & Key Takeaways

### Technical Achievements

✅ **Self-Attention**: Scaled dot-product with causal masking  
✅ **Positional Encoding**: Sinusoidal embeddings for sequence order  
✅ **Transformer Architecture**: Multi-layer decoder with residual connections  
✅ **Character-Level Modeling**: Flexible tokenization for any language  
✅ **Sampling Strategies**: Greedy, top-k, nucleus, temperature scaling  
✅ **Compact Model**: ~10M parameters, roughly 43 MB of weights in FP32  

### Skills Demonstrated

**Deep Learning**:
- Transformer architecture from scratch
- Attention mechanisms
- Sequence modeling

**NLP**:
- Tokenization strategies
- Language modeling
- Text generation

**Engineering**:
- Numerical stability (softmax, gradients)
- Memory optimization
- Inference efficiency

---

### Applications

**Text Generation**:
- Code completion
- Creative writing
- Dialogue systems

**Understanding**:
- Foundation for BERT, GPT, T5
- Transferable to vision (ViT)
- Multi-modal models

---

## References

- **Repository**: https://github.com/anarcoiris/MiNiLLM
- **Papers**:
  - "Attention Is All You Need" (Vaswani et al., 2017)
  - "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019)
- **Technologies**: Python, NumPy, Character-level NLP

---

*This notebook walks through a low-level Transformer implementation to demonstrate deep learning fundamentals.*
