# Transformer Architecture from Scratch using NumPy
## Individual Assignment: GPT-style Decoder-only Implementation

This notebook demonstrates a complete implementation of a Transformer architecture from scratch using only NumPy, without any deep learning frameworks like PyTorch or TensorFlow.

### Assignment Objectives:
- Build a decoder-only Transformer (GPT-style) architecture
- Implement all core components: embedding, attention, feed-forward networks, etc.
- Create modular, testable code with proper mathematical foundations
- Demonstrate forward pass from token input to probability distribution output

### Architecture Overview:
1. **Token Embedding** - Maps token IDs to dense vectors
2. **Positional Encoding** - Adds positional information using sinusoidal functions  
3. **Multi-Head Attention** - Core attention mechanism with causal masking
4. **Feed-Forward Network** - Two-layer MLP with non-linear activation
5. **Layer Normalization** - Normalizes activations before each sublayer (pre-norm) to stabilize training
6. **Residual Connections** - Enables deeper networks and better gradient flow
7. **Output Layer** - Projects to vocabulary size with softmax distribution
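
To make items 5 and 6 concrete, here is a minimal sketch of how pre-norm and residual connections are typically wired around the two sublayers. The helper names (`layer_norm`, `pre_norm_block`) and the toy sublayers are assumptions for illustration, not the notebook's later implementation; the layer norm here also omits the learnable scale and shift for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention_fn, ffn_fn):
    # Pre-norm: normalize the input of each sublayer, then add the residual.
    x = x + attention_fn(layer_norm(x))
    x = x + ffn_fn(layer_norm(x))
    return x

# Toy stand-ins for the real sublayers (hypothetical, illustration only).
rng = np.random.default_rng(0)
d_model = 16
W = rng.normal(0, 0.02, (d_model, d_model))

def toy_attention(h):
    return h @ W

def toy_ffn(h):
    return np.maximum(h @ W, 0.0)

x = rng.normal(size=(2, 8, d_model))  # [batch_size, seq_len, d_model]
y = pre_norm_block(x, toy_attention, toy_ffn)
print(y.shape)  # (2, 8, 16)
```

Because each sublayer output is added to its input, the block preserves the `[batch_size, seq_len, d_model]` shape, which is what allows several such blocks to be stacked.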

## 1. Import Required Libraries

NumPy handles all array and matrix operations, as the assignment requires. The only other imports, `math` and `typing`, come from the Python standard library.

In [None]:
import numpy as np
import math
from typing import Tuple, Optional, List

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print("🚀 Ready to build Transformer from scratch!")

## 2. Token Embedding Implementation

Token embedding converts discrete token IDs into continuous vector representations. This is the first step in processing text input.

**Mathematical Foundation:**
```
embedding(token_id) = W_embedding[token_id]
```
where `W_embedding` is a learnable matrix of shape `[vocab_size, embed_dim]` and `W_embedding[token_id]` selects the row at index `token_id`.
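
The row lookup above is equivalent to multiplying a one-hot vector by the embedding matrix, which is why the embedding can be treated as a linear layer. The following sketch checks that equivalence on a small example; the sizes and the variable names `looked_up` and `via_matmul` are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
W_embedding = rng.normal(0, 0.02, (vocab_size, embed_dim))

token_ids = np.array([[1, 5, 5], [0, 9, 2]])  # [batch_size, seq_len]

# Direct row lookup via integer indexing.
looked_up = W_embedding[token_ids]            # [2, 3, 4]

# Same result via one-hot vectors and a matrix product.
one_hot = np.eye(vocab_size)[token_ids]       # [2, 3, 10]
via_matmul = one_hot @ W_embedding            # [2, 3, 4]

print(looked_up.shape)                        # (2, 3, 4)
print(np.allclose(looked_up, via_matmul))     # True
```

Indexing is preferred in practice because it avoids materializing the `[batch_size, seq_len, vocab_size]` one-hot tensor.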

In [None]:
class TokenEmbedding:
    """Token embedding layer that maps token IDs to dense vectors."""
    
    def __init__(self, vocab_size: int, embed_dim: int, seed: int = 42):
        """
        Initialize token embedding layer.
        
        Args:
            vocab_size: Size of vocabulary
            embed_dim: Dimension of embedding vectors  
            seed: Random seed for reproducibility
        """
        np.random.seed(seed)
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        
        # Initialize embedding matrix with small random values drawn from
        # a normal distribution with mean 0 and a fixed std of 0.02
        # (the std does not depend on embed_dim)
        self.embedding_matrix = np.random.normal(0, 0.02, (vocab_size, embed_dim))
    
    def forward(self, token_ids: np.ndarray) -> np.ndarray:
        """
        Forward pass of token embedding.
        
        Args:
            token_ids: Token IDs of shape [batch_size, seq_len]
            
        Returns:
            Embedded tokens of shape [batch_size, seq_len, embed_dim]
        """
        return self.embedding_matrix[token_ids]

# Test the token embedding
print("🔧 Testing Token Embedding...")
vocab_size = 1000
embed_dim = 512
batch_size = 2
seq_len = 8

# Initialize embedding layer
token_emb = TokenEmbedding(vocab_size, embed_dim)

# Create sample token IDs
token_ids = np.random.randint(0, vocab_size, (batch_size, seq_len))
print(f"Input token IDs shape: {token_ids.shape}")
print(f"Sample tokens: {token_ids[0]}")

# Get embeddings
embeddings = token_emb.forward(token_ids)
print(f"Output embeddings shape: {embeddings.shape}")
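# Verify the output shape before reporting success
assert embeddings.shape == (batch_size, seq_len, embed_dim)
print("✅ Token embedding working correctly!")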
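
One caveat with plain indexing: NumPy accepts negative indices, so `embedding_matrix[token_ids]` silently returns the last row for a token ID of `-1`, while IDs of `vocab_size` or larger raise an `IndexError`. The sketch below shows one possible guard; `validate_token_ids` is a hypothetical helper, not part of the `TokenEmbedding` class above.

```python
import numpy as np

def validate_token_ids(token_ids, vocab_size: int) -> np.ndarray:
    """Hypothetical helper: reject IDs outside [0, vocab_size)."""
    token_ids = np.asarray(token_ids)
    if not np.issubdtype(token_ids.dtype, np.integer):
        raise TypeError(f"token_ids must be integers, got {token_ids.dtype}")
    if token_ids.size and (token_ids.min() < 0 or token_ids.max() >= vocab_size):
        raise ValueError(f"token_ids must lie in [0, {vocab_size})")
    return token_ids

# Negative indices wrap around instead of failing.
W = np.arange(12.0).reshape(4, 3)  # toy matrix with vocab_size=4
print(np.array_equal(W[np.array([-1])], W[np.array([3])]))  # True

# The guard turns the silent wrap-around into an explicit error.
try:
    validate_token_ids(np.array([[0, -1]]), vocab_size=4)
except ValueError as err:
    print(f"Rejected: {err}")
```

Such a check could be called at the start of `forward`, at the cost of one extra pass over the input.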