# Introduction to Neural Networks: Building a Tiny Language Model

In this notebook, we'll build a simple neural network that learns to predict the next character in text. We'll use Shakespeare's writing as our training data!

## What We'll Learn:
1. **One-Hot Encoding** - How to turn letters into numbers
2. **Attention** - How the model "pays attention" to different parts of the input
3. **MLP (Multi-Layer Perceptron)** - A simple feed-forward neural network
4. **Training** - How the model learns from examples

## Step 1: Get the Data

Let's download a small piece of Shakespeare's writing.

In [None]:
import urllib.request

# Download tiny shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "shakespeare.txt")

# Read the text
with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(f"Total characters: {len(text)}")
print(f"\nFirst 500 characters:\n{text[:500]}")

## Step 2: Create a Vocabulary

We need to know all the unique characters in our text. This is our "vocabulary".
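Before running this on the full text, here is a minimal sketch of the same idea on a toy string. The names `toy_text`, `toy_chars`, `toy_char_to_idx`, and `toy_idx_to_char` are made up for this illustration and are not used elsewhere in the notebook.

```python
# Toy text, used only for this illustration
toy_text = "hello world"

# The sorted unique characters form the vocabulary
toy_chars = sorted(set(toy_text))

# Mappings in both directions: character <-> number
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx_to_char = {i: ch for ch, i in toy_char_to_idx.items()}

# Encode the text as indices, then decode it back
encoded = [toy_char_to_idx[ch] for ch in toy_text]
decoded = "".join(toy_idx_to_char[i] for i in encoded)

print(f"Vocabulary: {toy_chars}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
```

Decoding returns the original string, so no information is lost by switching to indices.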

In [None]:
# Get all unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(f"Vocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")

# Create mappings: character <-> number
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"\nExample: 'a' -> {char_to_idx['a']}")
print(f"Example: {char_to_idx['a']} -> '{idx_to_char[char_to_idx['a']]}'")

## Step 3: One-Hot Encoding

Neural networks work with numbers, not letters. **One-hot encoding** converts each character into a vector of 0s with a single 1.

For example, if we have 3 characters [a, b, c]:
- 'a' becomes [1, 0, 0]
- 'b' becomes [0, 1, 0]
- 'c' becomes [0, 0, 1]
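
The three-character example above can be reproduced in a few lines. This is a small sketch that uses rows of an identity matrix as the one-hot vectors; `toy_vocab` and `toy_index` are hypothetical names for this example only.

```python
import numpy as np

# The toy vocabulary from the example above
toy_vocab = ["a", "b", "c"]
toy_index = {ch: i for i, ch in enumerate(toy_vocab)}

# Row i of the identity matrix is the one-hot vector for index i
identity = np.eye(len(toy_vocab))

# Encode the word "cab": one row per character
encoded_cab = identity[[toy_index[ch] for ch in "cab"]]
print(encoded_cab)

# argmax recovers the original index from each one-hot row
recovered = encoded_cab.argmax(axis=1)
print(recovered)
```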

In [None]:
import numpy as np

def one_hot_encode(indices, vocab_size):
    """
    Convert character indices to one-hot vectors.
    
    Args:
        indices: list of character indices, shape (sequence_length,)
        vocab_size: total number of unique characters
    
    Returns:
        one_hot: array of shape (sequence_length, vocab_size)
    """
    seq_len = len(indices)
    one_hot = np.zeros((seq_len, vocab_size))
    
    for i, idx in enumerate(indices):
        one_hot[i, idx] = 1.0
    
    return one_hot

# Example
example_text = "hello"
example_indices = [char_to_idx[c] for c in example_text]
example_one_hot = one_hot_encode(example_indices, vocab_size)

print(f"Text: '{example_text}'")
print(f"Indices: {example_indices}")
print(f"One-hot shape: {example_one_hot.shape}")
print(f"\nOne-hot for 'h' (index {char_to_idx['h']}):")
print(f"Position of the 1: {np.argmax(example_one_hot[0])}")

## Step 4: Prepare Training Data

We'll train the model to predict the next character. Given "hell", predict "o".
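
To see how one piece of text turns into many training examples, here is a rough sketch that slides a window over a toy string. It assumes a context length of 4 so that the "hell" → "o" example shows up; `toy_text`, `toy_context_length`, and `pairs` are illustrative names only.

```python
toy_text = "hello world"
toy_context_length = 4  # assumed value for this illustration

# Slide a window over the text: each window is an input,
# and the character right after it is the target
pairs = []
for start in range(len(toy_text) - toy_context_length):
    context = toy_text[start:start + toy_context_length]
    target = toy_text[start + toy_context_length]
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(f"{context!r} -> {target!r}")
```

The notebook's `get_batch` does the same thing, except that it picks random starting positions and one-hot encodes the inputs.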

In [None]:
# Use a smaller subset for faster training
text = text[:10000]

# Convert entire text to indices
data = [char_to_idx[c] for c in text]

# Context length: how many characters the model sees to predict the next one
context_length = 8

def get_batch(data, batch_size, context_length):
    """
    Get a random batch of training examples.
    
    Returns:
        X: input sequences, shape (batch_size, context_length, vocab_size)
        Y: target characters (what comes next), shape (batch_size,)
    """
    # Pick random starting positions
    starts = np.random.randint(0, len(data) - context_length - 1, batch_size)
    
    X = []
    Y = []
    
    for start in starts:
        # Input: context_length characters
        input_indices = data[start:start + context_length]
        # Target: the next character
        target = data[start + context_length]
        
        X.append(one_hot_encode(input_indices, vocab_size))
        Y.append(target)
    
    return np.array(X), np.array(Y)

# Test it
X_test, Y_test = get_batch(data, batch_size=2, context_length=context_length)
print(f"Input shape: {X_test.shape}")  # (batch, context_length, vocab_size)
print(f"Target shape: {Y_test.shape}")  # (batch,)

## Step 5: The Attention Mechanism (Simplified)

**Attention** lets the model decide which parts of the input are most important for making a prediction.

The idea:
1. Each position asks: "How relevant is every other position to me?"
2. We compute **attention scores** between all positions
3. We use these scores to create a weighted combination of the input

Think of it like reading a sentence and highlighting the most important words!
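
The three steps above can be sketched on a tiny example with 3 positions. The query, key, and value arrays here are random stand-ins (`toy_q`, `toy_k`, `toy_v` are hypothetical names), so the numbers mean nothing; the point is the shape of the computation and the effect of hiding future positions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 3, 4  # assumed toy sizes

# Random stand-ins for the query, key, and value vectors of each position
toy_q = rng.normal(size=(seq_len, dim))
toy_k = rng.normal(size=(seq_len, dim))
toy_v = rng.normal(size=(seq_len, dim))

# Steps 1 and 2: score every pair of positions, scaled by sqrt(dim)
scores = toy_q @ toy_k.T / np.sqrt(dim)

# Hide the future: entries above the diagonal are set to -inf
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax turns scores into weights that sum to 1
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Step 3: each output row is a weighted combination of the value vectors
mixed = weights @ toy_v

print(np.round(weights, 3))
```

The first position can only see itself, so its weight row is `[1, 0, 0]` and its output equals its own value vector.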

In [None]:
def softmax(x):
    """Convert scores to probabilities (0 to 1, sum to 1)."""
    # Subtract max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


class SimpleAttention:
    """
    A barebones attention layer.
    
    For each position, it computes how much to "attend" to every other position,
    then creates a weighted sum of all positions.
    """
    
    def __init__(self, input_dim, hidden_dim):
        # Initialize weights randomly (small values)
        scale = 0.1
        
        # Query: "What am I looking for?"
        self.W_query = np.random.randn(input_dim, hidden_dim) * scale
        
        # Key: "What do I contain?"
        self.W_key = np.random.randn(input_dim, hidden_dim) * scale
        
        # Value: "What information do I provide?"
        self.W_value = np.random.randn(input_dim, hidden_dim) * scale
    
    def forward(self, x):
        """
        Args:
            x: input of shape (batch_size, seq_len, input_dim)
        
        Returns:
            output: shape (batch_size, seq_len, hidden_dim)
        """
        # Store input for backward pass
        self.x = x
        batch_size, seq_len, _ = x.shape
        
        # Compute Query, Key, Value
        # Each position gets its own Q, K, V vector
        self.Q = x @ self.W_query  # (batch, seq_len, hidden_dim)
        self.K = x @ self.W_key    # (batch, seq_len, hidden_dim)
        self.V = x @ self.W_value  # (batch, seq_len, hidden_dim)
        
        # Compute attention scores: how much should position i attend to position j?
        # Score = Q @ K^T
        # Shape: (batch, seq_len, seq_len)
        scores = self.Q @ self.K.transpose(0, 2, 1)
        
        # Scale scores (helps with training stability)
        hidden_dim = self.W_query.shape[1]
        scores = scores / np.sqrt(hidden_dim)
        
        # Apply causal mask: each position can attend only to itself and earlier positions!
        # This is crucial for language modeling - we can't peek at the future
        mask = np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9
        scores = scores + mask
        
        # Convert to probabilities
        self.attention_weights = softmax(scores)  # (batch, seq_len, seq_len)
        
        # Weighted sum of values
        output = self.attention_weights @ self.V  # (batch, seq_len, hidden_dim)
        
        return output
    
    def backward(self, grad_output, learning_rate):
        """
        Compute gradients and update weights.
        Backpropagates through the weighted sum, the softmax, and the score scaling.
        """
        batch_size = self.x.shape[0]
        hidden_dim = self.W_query.shape[1]

        # Gradient for V: attention_weights^T @ grad_output
        grad_V = self.attention_weights.transpose(0, 2, 1) @ grad_output

        # Gradient w.r.t. the attention weights, shape (batch, seq_len, seq_len)
        grad_attn = grad_output @ self.V.transpose(0, 2, 1)

        # Softmax backward (row-wise), then the 1/sqrt(hidden_dim) scaling.
        # Masked positions have weight 0, so they receive zero gradient.
        A = self.attention_weights
        grad_scores = A * (grad_attn - np.sum(grad_attn * A, axis=-1, keepdims=True))
        grad_scores = grad_scores / np.sqrt(hidden_dim)

        # Gradients for Q and K (scores = Q @ K^T)
        grad_Q = grad_scores @ self.K
        grad_K = grad_scores.transpose(0, 2, 1) @ self.Q

        # Gradient for the previous layer, computed before the weights change
        grad_x = (grad_Q @ self.W_query.T
                  + grad_K @ self.W_key.T
                  + grad_V @ self.W_value.T)

        # Weight gradients: sum over positions, average over the batch
        x_T = self.x.transpose(0, 2, 1)
        grad_W_query = np.sum(x_T @ grad_Q, axis=0) / batch_size
        grad_W_key = np.sum(x_T @ grad_K, axis=0) / batch_size
        grad_W_value = np.sum(x_T @ grad_V, axis=0) / batch_size

        # Update weights
        self.W_query -= learning_rate * grad_W_query
        self.W_key -= learning_rate * grad_W_key
        self.W_value -= learning_rate * grad_W_value

        # Return gradient for previous layer
        return grad_x


# Test the attention layer
attn = SimpleAttention(input_dim=vocab_size, hidden_dim=32)
test_output = attn.forward(X_test)
print(f"Attention input shape: {X_test.shape}")
print(f"Attention output shape: {test_output.shape}")
print(f"Attention weights shape: {attn.attention_weights.shape}")

## Step 6: The MLP Layer

An **MLP (Multi-Layer Perceptron)** is one of the simplest neural networks: a stack of fully connected layers. Each layer does three things:
1. Multiply input by weights
2. Add a bias
3. Apply an activation function (we'll use ReLU: max(0, x))
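
Here is a small worked example of those three steps for a single layer. The weights and inputs are hand-picked numbers (`toy_x`, `toy_W`, `toy_b` are hypothetical), chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# One example with 2 input features
toy_x = np.array([[1.0, 2.0]])               # shape (1, 2)

# A layer with 2 inputs and 3 outputs
toy_W = np.array([[1.0, -1.0, 0.5],
                  [0.5, -1.0, 0.0]])          # shape (2, 3)
toy_b = np.array([0.0, 1.0, -1.0])            # shape (3,)

# Steps 1 and 2: multiply by the weights, then add the bias
pre_activation = toy_x @ toy_W + toy_b

# Step 3: ReLU keeps positive values and zeroes out the rest
activated = np.maximum(0, pre_activation)

print(f"Before ReLU: {pre_activation}")
print(f"After ReLU:  {activated}")
```

Two of the three outputs are negative before the activation, so ReLU sets them to 0.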

In [None]:
def relu(x):
    """ReLU activation: keep positive values, set negative to 0."""
    return np.maximum(0, x)

def relu_backward(x):
    """Gradient of ReLU: 1 if x > 0, else 0."""
    return (x > 0).astype(float)


class MLP:
    """
    A simple two-layer MLP.
    
    input -> hidden layer -> ReLU -> output layer
    """
    
    def __init__(self, input_dim, hidden_dim, output_dim):
        scale = 0.1
        
        # First layer: input -> hidden
        self.W1 = np.random.randn(input_dim, hidden_dim) * scale
        self.b1 = np.zeros(hidden_dim)
        
        # Second layer: hidden -> output
        self.W2 = np.random.randn(hidden_dim, output_dim) * scale
        self.b2 = np.zeros(output_dim)
    
    def forward(self, x):
        """
        Args:
            x: input of shape (batch_size, input_dim)
        
        Returns:
            output: shape (batch_size, output_dim)
        """
        self.x = x
        
        # First layer
        self.z1 = x @ self.W1 + self.b1
        self.a1 = relu(self.z1)
        
        # Second layer (no activation - we'll apply softmax later)
        self.z2 = self.a1 @ self.W2 + self.b2
        
        return self.z2
    
    def backward(self, grad_output, learning_rate):
        """
        Backpropagation: compute gradients and update weights.
        """
        batch_size = self.x.shape[0]
        
        # Gradient for second layer
        grad_W2 = self.a1.T @ grad_output / batch_size
        grad_b2 = np.mean(grad_output, axis=0)
        
        # Gradient through second layer
        grad_a1 = grad_output @ self.W2.T
        
        # Gradient through ReLU
        grad_z1 = grad_a1 * relu_backward(self.z1)
        
        # Gradient for first layer
        grad_W1 = self.x.T @ grad_z1 / batch_size
        grad_b1 = np.mean(grad_z1, axis=0)
        
        # Gradient for the previous layer, computed before W1 is updated
        grad_x = grad_z1 @ self.W1.T
        
        # Update weights
        self.W2 -= learning_rate * grad_W2
        self.b2 -= learning_rate * grad_b2
        self.W1 -= learning_rate * grad_W1
        self.b1 -= learning_rate * grad_b1
        
        # Return gradient for previous layer
        return grad_x


# Test the MLP
mlp = MLP(input_dim=32, hidden_dim=64, output_dim=vocab_size)
# The MLP expects one vector per example, so we take the last position
mlp_input = test_output[:, -1, :]  # (batch, hidden_dim)
mlp_output = mlp.forward(mlp_input)
print(f"MLP input shape: {mlp_input.shape}")
print(f"MLP output shape: {mlp_output.shape}")

## Step 7: Loss Function

We need a way to measure how wrong our predictions are. We use **cross-entropy loss**:
- The model outputs probabilities for each character
- We want high probability for the correct character
- Loss = -log(probability of correct character)
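
To get a feel for the numbers, here is a minimal sketch that computes the loss for one made-up prediction over 3 characters. `toy_logits` and `toy_target` are invented for this example; a model that guesses uniformly is included for comparison.

```python
import numpy as np

# Made-up raw scores for 3 possible characters; character 0 is the correct one
toy_logits = np.array([2.0, 1.0, 0.0])
toy_target = 0

# Softmax: turn the scores into probabilities
shifted = toy_logits - toy_logits.max()
toy_probs = np.exp(shifted) / np.exp(shifted).sum()

# Loss = -log(probability of the correct character)
toy_loss = -np.log(toy_probs[toy_target])

# For comparison: a model that spreads probability evenly over all 3 characters
uniform_loss = -np.log(1.0 / len(toy_logits))

print(f"Probabilities: {np.round(toy_probs, 3)}")
print(f"Loss: {toy_loss:.4f}")
print(f"Loss for a uniform guess: {uniform_loss:.4f}")
```

The more probability the model puts on the correct character, the closer the loss gets to 0.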

In [None]:
def cross_entropy_loss(logits, targets):
    """
    Compute cross-entropy loss.
    
    Args:
        logits: model outputs, shape (batch_size, vocab_size)
        targets: correct character indices, shape (batch_size,)
    
    Returns:
        loss: scalar
        probs: probabilities, shape (batch_size, vocab_size)
    """
    # Convert logits to probabilities
    probs = softmax(logits)
    
    # Get probability of correct character for each example
    batch_size = logits.shape[0]
    correct_probs = probs[np.arange(batch_size), targets]
    
    # Loss = -log(correct probability)
    # Add small epsilon to avoid log(0)
    loss = -np.mean(np.log(correct_probs + 1e-9))
    
    return loss, probs


def cross_entropy_backward(probs, targets):
    """
    Gradient of cross-entropy loss.
    
    The gradient is simply: probs - one_hot(targets)
    """
    batch_size = probs.shape[0]
    grad = probs.copy()
    grad[np.arange(batch_size), targets] -= 1
    return grad

## Step 8: Put It All Together - The Model

Our complete model:
1. Input: one-hot encoded characters
2. Attention layer: learns which characters are important
3. MLP: makes the final prediction
4. Output: probability for each possible next character
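
It can help to trace the array shapes through these four stages before reading the class. The sketch below is only a shape walk-through under assumed sizes: the attention step is replaced by a plain causal average (not the learned attention used in this notebook), and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for this illustration
batch, seq_len, vocab, hidden, mlp_hidden = 2, 8, 65, 32, 64

# 1. Input: one-hot encoded characters
toy_indices = rng.integers(0, vocab, size=(batch, seq_len))
x = np.eye(vocab)[toy_indices]                       # (2, 8, 65)

# 2. Stand-in for attention: project, then average over current and earlier positions
W_v = rng.normal(size=(vocab, hidden)) * 0.1
causal = np.tril(np.ones((seq_len, seq_len)))
causal = causal / causal.sum(axis=-1, keepdims=True)
attn_out = causal @ (x @ W_v)                        # (2, 8, 32)

# Keep only the last position, which has seen the whole context
last_hidden = attn_out[:, -1, :]                     # (2, 32)

# 3. MLP: hidden layer with ReLU, then a linear output layer
W1 = rng.normal(size=(hidden, mlp_hidden)) * 0.1
W2 = rng.normal(size=(mlp_hidden, vocab)) * 0.1
logits = np.maximum(0, last_hidden @ W1) @ W2        # (2, 65)

# 4. Output: one score per possible next character
for name, arr in [("x", x), ("attn_out", attn_out),
                  ("last_hidden", last_hidden), ("logits", logits)]:
    print(f"{name:12s} {arr.shape}")
```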

In [None]:
class SimpleLanguageModel:
    """
    A tiny language model: Attention + MLP
    """
    
    def __init__(self, vocab_size, hidden_dim=32, mlp_hidden=64):
        self.attention = SimpleAttention(vocab_size, hidden_dim)
        self.mlp = MLP(hidden_dim, mlp_hidden, vocab_size)
        self.vocab_size = vocab_size
    
    def forward(self, x):
        """
        Args:
            x: one-hot input, shape (batch_size, seq_len, vocab_size)
        
        Returns:
            logits: shape (batch_size, vocab_size)
        """
        # Attention layer
        attn_out = self.attention.forward(x)  # (batch, seq_len, hidden_dim)
        
        # Take the last position (we're predicting what comes next)
        last_hidden = attn_out[:, -1, :]  # (batch, hidden_dim)
        
        # MLP to get predictions
        logits = self.mlp.forward(last_hidden)  # (batch, vocab_size)
        
        return logits
    
    def backward(self, grad_output, learning_rate):
        """
        Backpropagate through the model.
        """
        # Backward through MLP
        grad_last_hidden = self.mlp.backward(grad_output, learning_rate)
        
        # Expand gradient to full sequence (only last position has gradient)
        batch_size = grad_last_hidden.shape[0]
        seq_len = self.attention.x.shape[1]
        hidden_dim = grad_last_hidden.shape[1]
        
        grad_attn_out = np.zeros((batch_size, seq_len, hidden_dim))
        grad_attn_out[:, -1, :] = grad_last_hidden
        
        # Backward through attention
        self.attention.backward(grad_attn_out, learning_rate)
    
    def generate(self, start_text, length, char_to_idx, idx_to_char, context_length):
        """
        Generate new text starting from start_text.
        """
        # Convert start text to indices
        current = [char_to_idx[c] for c in start_text]
        
        generated = start_text
        
        for _ in range(length):
            # Get the last context_length characters
            context = current[-context_length:]
            
            # Left-pad short prompts with index 0 (the first character in the vocabulary)
            while len(context) < context_length:
                context = [0] + context
            
            # One-hot encode
            x = one_hot_encode(context, self.vocab_size)
            x = x[np.newaxis, :, :]  # Add batch dimension
            
            # Get prediction
            logits = self.forward(x)
            probs = softmax(logits[0])
            
            # Sample from the distribution
            next_idx = np.random.choice(len(probs), p=probs)
            
            # Add to sequence
            current.append(next_idx)
            generated += idx_to_char[next_idx]
        
        return generated


# Create the model
model = SimpleLanguageModel(vocab_size, hidden_dim=32, mlp_hidden=64)
print("Model created!")
print(f"Vocabulary size: {vocab_size}")

## Step 9: Training!

Now we train the model:
1. Get a batch of examples
2. Forward pass: compute predictions
3. Compute loss
4. Backward pass: compute gradients
5. Update weights
6. Repeat!
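
The same six steps can be tried on a much smaller problem first. This is a sketch under simplifying assumptions: the "model" is a single weight matrix, and the made-up task is to predict the next class in a cycle (0 → 1 → 2 → 3 → 0). Names such as `toy_lr` and `toy_losses` are for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 4
W = np.zeros((num_classes, num_classes))  # the only parameters of this toy model
toy_lr = 0.5
toy_losses = []

for step in range(200):
    # 1. Get a batch: random classes as inputs, their successors as targets
    idx = rng.integers(0, num_classes, size=16)
    x = np.eye(num_classes)[idx]
    y = (idx + 1) % num_classes

    # 2. Forward pass: logits, then softmax probabilities
    logits = x @ W
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    # 3. Compute the cross-entropy loss
    loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
    toy_losses.append(loss)

    # 4. Backward pass: gradient of the loss w.r.t. the logits, then W
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1
    grad_W = x.T @ grad_logits / len(y)

    # 5. Update weights (and 6. repeat)
    W -= toy_lr * grad_W

print(f"First loss: {toy_losses[0]:.4f}")
print(f"Last loss:  {toy_losses[-1]:.4f}")
```

The loss starts at log(4), the value for a uniform guess over 4 classes, and falls as the weights pick up the pattern.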

In [None]:
# Training hyperparameters
learning_rate = 0.1
batch_size = 32
num_steps = 1000

# Track losses
losses = []

print("Starting training...\n")

for step in range(num_steps):
    # Get a batch
    X_batch, Y_batch = get_batch(data, batch_size, context_length)
    
    # Forward pass
    logits = model.forward(X_batch)
    
    # Compute loss
    loss, probs = cross_entropy_loss(logits, Y_batch)
    losses.append(loss)
    
    # Backward pass
    grad = cross_entropy_backward(probs, Y_batch)
    model.backward(grad, learning_rate)
    
    # Print progress
    if step % 100 == 0:
        print(f"Step {step:4d} | Loss: {loss:.4f}")

print(f"\nFinal loss: {losses[-1]:.4f}")

## Step 10: Visualize Training

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(losses)
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Starting loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"Improvement: {((losses[0] - losses[-1]) / losses[0] * 100):.1f}%")

## Step 11: Generate Text!

Now let's see what our model learned! We'll give it a starting prompt and let it generate text.
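
Generation in this notebook draws each next character at random from the predicted probabilities rather than always taking the most likely one. Here is a small sketch of the difference, using a made-up distribution over three characters (`toy_chars` and `toy_probs` are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up prediction: probabilities for three possible next characters
toy_chars = ["a", "b", "c"]
toy_probs = np.array([0.7, 0.2, 0.1])

# Greedy choice: always pick the most likely character
greedy = toy_chars[int(np.argmax(toy_probs))]

# Sampling: draw characters in proportion to their probabilities
samples = rng.choice(len(toy_probs), size=1000, p=toy_probs)
frequencies = np.bincount(samples, minlength=len(toy_probs)) / len(samples)

print(f"Greedy pick: {greedy}")
print(f"Sampled frequencies: {frequencies}")
```

Greedy decoding would produce the same continuation every time, while sampling lets less likely characters appear now and then, so each run generates different text.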

In [None]:
# Generate some text
print("Generated text samples:\n")

prompts = ["The ", "KING", "What "]

for prompt in prompts:
    generated = model.generate(
        start_text=prompt,
        length=100,
        char_to_idx=char_to_idx,
        idx_to_char=idx_to_char,
        context_length=context_length
    )
    print(f"Prompt: '{prompt}'")
    print(f"Generated: {generated}")
    print("-" * 50)

## Step 12: Visualize Attention

Let's see what the model is "paying attention" to!

In [None]:
# Visualize attention for a specific input
test_text = "to be or"
test_indices = [char_to_idx[c] for c in test_text]
test_one_hot = one_hot_encode(test_indices, vocab_size)
test_input = test_one_hot[np.newaxis, :, :]  # Add batch dimension

# Forward pass
_ = model.forward(test_input)

# Get attention weights
attn_weights = model.attention.attention_weights[0]  # (seq_len, seq_len)

# Plot
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(attn_weights, cmap='Blues')

# Labels
chars_list = list(test_text)
ax.set_xticks(range(len(chars_list)))
ax.set_yticks(range(len(chars_list)))
ax.set_xticklabels(chars_list)
ax.set_yticklabels(chars_list)

ax.set_xlabel('Key (attending TO)')
ax.set_ylabel('Query (attending FROM)')
ax.set_title(f'Attention Pattern for "{test_text}"')

plt.colorbar(im)
plt.tight_layout()
plt.show()

print("\nNote: Each row shows how much that position attends to itself and earlier positions.")
print("The lower-triangular pattern shows the causal mask - no position can attend to the future!")

## Summary

We built a tiny language model from scratch! Here's what we learned:

1. **One-Hot Encoding**: Convert characters to vectors that neural networks can understand

2. **Attention**: A mechanism that lets the model decide which parts of the input are most relevant. Uses Query (what am I looking for?), Key (what do I contain?), and Value (what information do I provide?)

3. **MLP**: A small feed-forward network that transforms data through weights, biases, and activation functions

4. **Training**: Use loss functions and backpropagation to teach the model by showing it examples

### What Makes Real Models (like GPT) Different?
- Much larger (billions of parameters vs our roughly 12,600, given a 65-character vocabulary)
- Multiple attention layers stacked
- Better embeddings (learned, not one-hot)
- More training data and compute
- Advanced techniques (layer norm, dropout, etc.)

But the core ideas are the same!
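
For a sense of where our parameter count comes from, here is a rough sketch that adds up the weight shapes used in this notebook. The function name `toy_param_count` is made up, and `vocab_size=65` is an assumed value for illustration; the real number depends on the characters found in the text.

```python
def toy_param_count(vocab_size, hidden_dim, mlp_hidden):
    """Add up the sizes of every weight matrix and bias vector."""
    attention = 3 * vocab_size * hidden_dim             # W_query, W_key, W_value
    mlp_first = hidden_dim * mlp_hidden + mlp_hidden     # W1 and b1
    mlp_second = mlp_hidden * vocab_size + vocab_size    # W2 and b2
    return attention + mlp_first + mlp_second

# Assumed sizes: 65 characters, hidden_dim=32, mlp_hidden=64
print(f"{toy_param_count(65, 32, 64):,} parameters")
```

With one-hot inputs, the three attention matrices each have `vocab_size` rows, so they account for about half of the total.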

In [None]:
# Count parameters
def count_params(model):
    total = 0
    # Attention
    total += model.attention.W_query.size
    total += model.attention.W_key.size
    total += model.attention.W_value.size
    # MLP
    total += model.mlp.W1.size + model.mlp.b1.size
    total += model.mlp.W2.size + model.mlp.b2.size
    return total

print(f"Our model has {count_params(model):,} parameters")
print("GPT-3 has 175,000,000,000 parameters")
print(f"That's {175_000_000_000 / count_params(model):,.0f}x more!")