# Micrograd: A Complete Summary

This notebook is a clean, linear summary of **micrograd** - a minimal automatic differentiation engine that can train neural networks.

**What we'll cover:**
1. **Derivatives** - The foundation of learning
2. **Value Class** - Smart numbers that track gradients
3. **Backpropagation** - How gradients flow backward
4. **Neural Network Components** - Neuron, Layer, MLP
5. **Training** - Putting it all together

---

In [1]:
import math
import random
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

---
## Part 1: Understanding Derivatives

A **derivative** tells us how much a function's output changes when we slightly change its input.

**Mathematical Definition:**
```
f'(x) = lim(h→0) [f(x + h) - f(x)] / h
```

In simple terms: *"If I nudge the input a tiny bit, how much does the output change?"*

In [2]:
# Example: f(x) = 3x² - 4x + 5
def f(x):
    return 3*x**2 - 4*x + 5

# Compute derivative numerically
h = 0.0001  # tiny nudge
x = 3.0

slope = (f(x + h) - f(x)) / h
print(f"At x = {x}:")
print(f"  f(x) = {f(x)}")
print(f"  slope (derivative) ≈ {slope:.4f}")
print(f"  Analytical derivative: 6x - 4 = {6*x - 4}")

At x = 3.0:
  f(x) = 20.0
  slope (derivative) ≈ 14.0003
  Analytical derivative: 6x - 4 = 14.0
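
The estimate above uses a one-sided (forward) difference, which is why it is off by about 0.0003. As a rough sketch, a two-sided (central) difference usually gets closer for the same `h`. The helper names `forward_diff` and `central_diff` are introduced here for illustration only.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def forward_diff(func, x, h=1e-4):
    # One-sided estimate: the error shrinks roughly in proportion to h
    return (func(x + h) - func(x)) / h

def central_diff(func, x, h=1e-4):
    # Two-sided estimate: the error shrinks roughly in proportion to h**2
    return (func(x + h) - func(x - h)) / (2 * h)

x = 3.0
exact = 6 * x - 4  # analytical derivative at x

forward_error = abs(forward_diff(f, x) - exact)
central_error = abs(central_diff(f, x) - exact)

print(f"forward difference error: {forward_error:.2e}")
print(f"central difference error: {central_error:.2e}")
```

For a quadratic like this one, the central difference is exact up to floating-point rounding, so its error should come out far smaller than the forward one.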


### Why Do We Care About Derivatives?

In machine learning, we want to **minimize a loss function**. The derivative tells us which direction to move!

```
If derivative > 0: decrease the input to reduce output
If derivative < 0: increase the input to reduce output
```

This is the core idea behind **gradient descent**.
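
As a minimal sketch of that rule, the loop below repeatedly steps against the slope of the same `f(x)` from Part 1. The learning rate of 0.1 and the 50 steps are assumptions chosen for this illustration; the analytical minimum sits at `x = 2/3`, where `6x - 4 = 0`.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def df(x):
    # Analytical derivative of f
    return 6 * x - 4

x = 3.0              # starting point
learning_rate = 0.1  # assumed step size for this sketch

for step in range(50):
    # Positive slope -> x decreases; negative slope -> x increases
    x -= learning_rate * df(x)

print(f"x after 50 steps: {x:.6f}")
print(f"f(x) at that point: {f(x):.6f}")
```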

---
## Part 2: The Value Class

The `Value` class is the heart of micrograd. It wraps a number and:

1. **Tracks the computation graph** - remembers what operations created it
2. **Stores the gradient** - derivative of loss with respect to this value
3. **Knows how to backpropagate** - compute gradients automatically

Think of it as a **"smart number"** that remembers its history.

In [3]:
class Value:
    """
    A wrapper around a number that supports automatic differentiation.
    
    Attributes:
        data: The actual numerical value
        grad: The gradient (derivative of loss w.r.t. this value)
        _backward: Function to propagate gradients to children
        _prev: Set of Value objects that created this one
        _op: String describing the operation
    """
    
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0                    # gradient starts at 0
        self._backward = lambda: None      # no-op by default
        self._prev = set(_children)        # input nodes that produced this value
        self._op = _op                     # operation that created this
        self.label = label                 # optional name for debugging
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    # =====================================================================
    # ARITHMETIC OPERATIONS
    # =====================================================================
    # Each operation:
    # 1. Computes the forward pass (the actual math)
    # 2. Defines _backward() to compute gradients during backprop
    # =====================================================================
    
    def __add__(self, other):
        """Addition: self + other"""
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            # Gradient flows equally to both inputs
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        
        return out
    
    def __mul__(self, other):
        """Multiplication: self * other"""
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        
        return out
    
    def __pow__(self, other):
        """Power: self ** other (only supports int/float exponents)"""
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data ** other, (self,), f'**{other}')
        
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        
        return out
    
    def __neg__(self):        # -self
        return self * -1
    
    def __sub__(self, other): # self - other
        return self + (-other)
    
    def __truediv__(self, other): # self / other
        return self * (other ** -1)
    
    # Reverse operations (when Value is on the right side)
    def __radd__(self, other):  # other + self
        return self + other
    
    def __rsub__(self, other):  # other - self
        return other + (-self)
    
    def __rmul__(self, other):  # other * self
        return self * other
    
    # =====================================================================
    # ACTIVATION FUNCTION
    # =====================================================================
    
    def tanh(self):
        """
        Hyperbolic tangent activation.
        
        tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
        
        - Output range: (-1, 1)
        - Derivative: 1 - tanh(x)²
        """
        # math.tanh is numerically stable; the explicit formula overflows for large x
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        
        return out
    
    def exp(self):
        """Exponential: e^self"""
        out = Value(math.exp(self.data), (self,), 'exp')
        
        def _backward():
            # d(e^x)/dx = e^x
            self.grad += out.data * out.grad
        out._backward = _backward
        
        return out
    
    # =====================================================================
    # BACKPROPAGATION
    # =====================================================================
    
    def backward(self):
        """
        Compute gradients for all nodes in the computation graph.
        
        Algorithm:
        1. Build topological order (children before parents)
        2. Set output gradient to 1.0
        3. Process nodes in reverse order, calling _backward() on each
        """
        # Step 1: Topological sort using DFS
        topo = []
        visited = set()
        
        def build_topo(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    build_topo(child)
                topo.append(node)
        
        build_topo(self)
        
        # Step 2: Set output gradient to 1
        self.grad = 1.0
        
        # Step 3: Propagate gradients backward
        for node in reversed(topo):
            node._backward()
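
Every `_backward` above uses `+=` rather than `=`. The sketch below suggests why. To keep it self-contained it uses a trimmed-down stand-in called `Tiny` (a hypothetical name, supporting only `+` and `*`), not the full `Value` class.

```python
class Tiny:
    """Trimmed-down stand-in for Value (hypothetical; only + and *)."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __add__(self, other):
        out = Tiny(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Tiny(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    build(child)
                topo.append(node)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


a = Tiny(3.0)
y = a * a + a          # y = a² + a, so dy/da = 2a + 1 = 7 at a = 3
y.backward()
grad_first = a.grad
print(grad_first)      # 7.0: three contributions to `a` were summed

y2 = a * a + a         # build the same expression again
y2.backward()          # a.grad was not reset, so another 7.0 is added
grad_second = a.grad
print(grad_second)     # 14.0
```

With plain assignment, later contributions to `a` would overwrite earlier ones. The second call shows the flip side: accumulated gradients persist until they are reset, which is why the training loop zeroes them before each backward pass.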

---
## Part 3: Understanding Backpropagation

**Backpropagation** is how we compute gradients automatically using the **chain rule**.

### The Chain Rule

If `y = f(g(x))`, then:
```
dy/dx = dy/dg · dg/dx
```

In other words: **multiply the gradients along the path**.
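
A quick way to convince yourself is to compare the chain-rule product with a numerical estimate. The sketch below assumes `g(x) = 3x + 1` and `f(u) = tanh(u)`, both chosen purely for illustration.

```python
import math

def g(x):
    return 3 * x + 1       # inner function

def f(u):
    return math.tanh(u)    # outer function

x = 0.5
u = g(x)

dg_dx = 3.0                      # derivative of 3x + 1
dy_dg = 1 - math.tanh(u) ** 2    # derivative of tanh(u)
chain_rule = dy_dg * dg_dx       # multiply the gradients along the path

# Numerical estimate of dy/dx using a central difference
h = 1e-6
numerical = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(f"chain rule: {chain_rule:.6f}")
print(f"numerical:  {numerical:.6f}")
```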

### Visual Example

Let's trace through: `L = (a * b + c) * f`

In [4]:
# Forward pass
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
f = Value(-2.0, label='f')

e = a * b       # e = 2 * -3 = -6
d = e + c       # d = -6 + 10 = 4
L = d * f       # L = 4 * -2 = -8

print("Forward Pass:")
print(f"  a={a.data}, b={b.data}, c={c.data}, f={f.data}")
print(f"  e = a * b = {e.data}")
print(f"  d = e + c = {d.data}")
print(f"  L = d * f = {L.data}")

Forward Pass:
  a=2.0, b=-3.0, c=10.0, f=-2.0
  e = a * b = -6.0
  d = e + c = 4.0
  L = d * f = -8.0


In [5]:
# Backward pass - compute all gradients automatically
L.backward()

print("\nBackward Pass (Gradients):")
print(f"  dL/dL = {L.grad} (always 1.0)")
print(f"  dL/df = d = {f.grad}")
print(f"  dL/dd = f = {d.grad}")
print(f"  dL/dc = dL/dd × dd/dc = {d.grad} × 1 = {c.grad}")
print(f"  dL/de = dL/dd × dd/de = {d.grad} × 1 = {e.grad}")
print(f"  dL/da = dL/de × de/da = {e.grad} × b = {a.grad}")
print(f"  dL/db = dL/de × de/db = {e.grad} × a = {b.grad}")

print("\nInterpretation:")
print(f"  If we increase 'a' by 0.01, L changes by ≈ {a.grad * 0.01:.4f}")


Backward Pass (Gradients):
  dL/dL = 1.0 (always 1.0)
  dL/df = d = 4.0
  dL/dd = f = -2.0
  dL/dc = dL/dd × dd/dc = -2.0 × 1 = -2.0
  dL/de = dL/dd × dd/de = -2.0 × 1 = -2.0
  dL/da = dL/de × de/da = -2.0 × b = 6.0
  dL/db = dL/de × de/db = -2.0 × a = -4.0

Interpretation:
  If we increase 'a' by 0.01, L changes by ≈ 0.0600
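
The gradients above can be sanity-checked without any autograd machinery by nudging each input and measuring how `L` responds. This is a sketch using plain floats; the `loss` function and `point` dictionary are names made up for this check.

```python
def loss(a, b, c, f):
    # Same expression as the graph above: L = (a * b + c) * f
    return (a * b + c) * f

point = {"a": 2.0, "b": -3.0, "c": 10.0, "f": -2.0}
h = 1e-6

numerical_grads = {}
for name in point:
    nudged = dict(point)
    nudged[name] += h    # nudge one input, keep the others fixed
    numerical_grads[name] = (loss(**nudged) - loss(**point)) / h

for name, grad in numerical_grads.items():
    print(f"dL/d{name} ≈ {grad:.4f}")
```

The estimates should agree with the values from `L.backward()`: 6 for `a`, -4 for `b`, -2 for `c`, and 4 for `f`.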


### The Computation Graph

```
    a ──┐
        × ── e ──┐
    b ──┘        │
                 + ── d ──┐
    c ───────────┘        │
                          × ── L
    f ────────────────────┘
```

**Forward pass**: Follow arrows left → right to compute values  
**Backward pass**: Follow arrows right → left to compute gradients
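
The ordering that `backward()` relies on can be sketched on this same graph using node names instead of `Value` objects. The `graph` dictionary and `topological_order` helper are illustrative stand-ins. `Value` stores its inputs in a set, so its exact visiting order may differ, but any valid topological order works.

```python
# Each node maps to the nodes it was computed from (its inputs).
graph = {
    "L": ["d", "f"],
    "d": ["e", "c"],
    "e": ["a", "b"],
    "a": [], "b": [], "c": [], "f": [],
}

def topological_order(graph, root):
    order, visited = [], set()

    def visit(node):
        if node not in visited:
            visited.add(node)
            for child in graph[node]:
                visit(child)
            order.append(node)  # appended only after all of its inputs

    visit(root)
    return order

forward_order = topological_order(graph, "L")
backward_order = list(reversed(forward_order))

print("forward: ", forward_order)
print("backward:", backward_order)
```

In the reversed list every node appears before the nodes it was computed from, so each node's gradient is complete by the time it is passed further back.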

---
## Part 4: Neural Network Components

Neural networks are built from simple, composable parts:

| Component | What it does | Input → Output |
|-----------|--------------|----------------|
| **Neuron** | Weighted sum + activation | n values → 1 value |
| **Layer** | Multiple neurons in parallel | n values → m values |
| **MLP** | Multiple layers in sequence | n values → k values |

In [6]:
class Neuron:
    """
    A single neuron: computes weighted sum of inputs + bias, then applies tanh.
    
    Formula: output = tanh(w₁×x₁ + w₂×x₂ + ... + wₙ×xₙ + b)
    
    Visual:
        x₁ ──w₁──┐
        x₂ ──w₂──┼── Σ + b ── tanh ── output
        x₃ ──w₃──┘
    """
    
    def __init__(self, num_inputs):
        # Initialize weights and bias randomly in [-1, 1]
        self.w = [Value(random.uniform(-1, 1)) for _ in range(num_inputs)]
        self.b = Value(random.uniform(-1, 1))
    
    def __call__(self, x):
        # Weighted sum: w₁×x₁ + w₂×x₂ + ... + b
        activation = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        # Apply activation function
        return activation.tanh()
    
    def parameters(self):
        """Return all trainable parameters."""
        return self.w + [self.b]
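
Stripped of gradient tracking, the forward computation of a neuron is only a few lines. The sketch below uses plain floats and hand-picked weights (illustrative values, not the notebook's random initialization); `neuron_forward` is a hypothetical helper name.

```python
import math

def neuron_forward(weights, bias, inputs):
    # Weighted sum of the inputs plus the bias
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    # tanh squashes the result into (-1, 1)
    return math.tanh(pre_activation)

# Illustrative values chosen by hand
weights = [0.5, -0.25]
bias = 0.1
inputs = [2.0, 3.0]

# Pre-activation: 0.5*2.0 + (-0.25)*3.0 + 0.1 = 0.35
output = neuron_forward(weights, bias, inputs)
print(f"neuron output: {output:.4f}")
```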

In [7]:
class Layer:
    """
    A layer of neurons processing the same input in parallel.
    
    Each neuron sees the SAME input but has DIFFERENT weights,
    so they learn to detect different patterns.
    
    Visual (Layer with 3 neurons, 2 inputs):
        
               ┌── Neuron₁ ── out₁
        x₁ ────┼── Neuron₂ ── out₂
        x₂ ────┼── Neuron₃ ── out₃
               └──────────────────
    """
    
    def __init__(self, num_inputs, num_outputs):
        self.neurons = [Neuron(num_inputs) for _ in range(num_outputs)]
    
    def __call__(self, x):
        outputs = [neuron(x) for neuron in self.neurons]
        # Return single value if only one neuron
        return outputs[0] if len(outputs) == 1 else outputs
    
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

In [8]:
class MLP:
    """
    Multi-Layer Perceptron: layers stacked sequentially.
    
    Example: MLP(3, [4, 4, 1])
    
        Input (3)     Hidden 1 (4)    Hidden 2 (4)    Output (1)
        
          x₁ ────────── ○ ────────────── ○ ──────────┐
          x₂ ────────── ○ ────────────── ○ ──────────┼── output
          x₃ ────────── ○ ────────────── ○ ──────────┘
                        ○                ○
    """
    
    def __init__(self, num_inputs, layer_sizes):
        sizes = [num_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(layer_sizes))]
    
    def __call__(self, x):
        # Pass through each layer sequentially
        for layer in self.layers:
            x = layer(x)
        return x
    
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

In [9]:
# Test the components
random.seed(42)

x = [2.0, 3.0]  # input with 2 features

print("Testing Neural Network Components:")
print(f"Input: {x}")
print()

neuron = Neuron(2)
print(f"Neuron (2 inputs → 1 output): {neuron(x)}")

layer = Layer(2, 3)
print(f"Layer (2 inputs → 3 outputs): {[v.data for v in layer(x)]}")

mlp = MLP(2, [4, 4, 1])
print(f"MLP (2 → 4 → 4 → 1): {mlp(x)}")
print(f"\nTotal parameters in MLP: {len(mlp.parameters())}")

Testing Neural Network Components:
Input: [2.0, 3.0]

Neuron (2 inputs → 1 output): Value(data=-0.9917, grad=0.0000)
Layer (2 inputs → 3 outputs): [0.5817270619168307, -0.7878755440086811, -0.9983781851034335]
MLP (2 → 4 → 4 → 1): Value(data=-0.5063, grad=0.0000)

Total parameters in MLP: 37
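
The parameter count follows directly from the architecture: each neuron has one weight per input plus one bias. A small sketch of that arithmetic, where `count_parameters` is a helper introduced here and not part of the classes above:

```python
def count_parameters(num_inputs, layer_sizes):
    sizes = [num_inputs] + layer_sizes
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        # Each of the fan_out neurons has fan_in weights plus one bias
        total += (fan_in + 1) * fan_out
    return total

print(count_parameters(2, [4, 4, 1]))  # 12 + 20 + 5 = 37
print(count_parameters(3, [4, 4, 1]))  # 16 + 20 + 5 = 41
```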


---
## Part 5: Training a Neural Network

Training adjusts weights to minimize the **loss** (error).

### The Training Loop

```
┌─────────────────────────────────────────────────────────────┐
│  1. FORWARD PASS    : Input → Network → Prediction         │
│  2. COMPUTE LOSS    : How wrong is the prediction?         │
│  3. ZERO GRADIENTS  : Reset all gradients to 0             │
│  4. BACKWARD PASS   : Compute gradients (∂Loss/∂weight)    │
│  5. UPDATE WEIGHTS  : weight -= learning_rate × gradient   │
│                                                             │
│  REPEAT until loss is small enough                         │
└─────────────────────────────────────────────────────────────┘
```
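
The five steps can be traced on a model small enough to differentiate by hand. This sketch fits a single weight `w` in `y = w * x` with plain floats; the toy data, learning rate, and iteration count are assumptions made for this illustration.

```python
# Toy data generated by y = 2x (an assumption made for this sketch)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # single trainable weight
grad = 0.0            # gradient storage, reused across iterations
learning_rate = 0.01

for iteration in range(100):
    # 1. Forward pass
    predictions = [w * x for x in xs]
    # 2. Compute loss (sum of squared errors)
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys))
    # 3. Zero the gradient; otherwise step 4 would add to stale values
    grad = 0.0
    # 4. Backward pass: d(loss)/dw is the sum of 2 * (p - y) * x
    for p, x, y in zip(predictions, xs, ys):
        grad += 2 * (p - y) * x
    # 5. Update the weight against the gradient
    w -= learning_rate * grad

print(f"w = {w:.6f}, loss = {loss:.2e}")
```

The weight should settle very close to 2.0, the slope used to generate the data.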

### Loss Function: Sum of Squared Errors

```
Loss = Σ (prediction - target)²
```

- If prediction = target → loss = 0 (perfect!)
- Bigger difference → bigger loss
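
The formula above sums the squared errors; dividing that sum by the number of samples gives the mean. The two differ only by a constant factor, which rescales the gradients and therefore acts like a change in learning rate. A small sketch with made-up predictions:

```python
predictions = [0.9, -0.8, -0.5, 0.7]   # made-up network outputs
targets = [1.0, -1.0, -1.0, 1.0]

squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]

sum_squared_error = sum(squared_errors)
mean_squared_error = sum_squared_error / len(squared_errors)

print(f"sum of squared errors: {sum_squared_error:.4f}")   # 0.3900
print(f"mean squared error:    {mean_squared_error:.4f}")  # 0.0975
```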

In [10]:
# ============================================
# TRAINING DATA
# ============================================

# Input samples: 4 examples with 3 features each
inputs = [
    [2.0, 3.0, -1.0],   # sample 1
    [3.0, -1.0, 0.5],   # sample 2
    [0.5, 1.0, 1.0],    # sample 3
    [1.0, 1.0, -1.0]    # sample 4
]

# Target outputs: what we want the network to predict
targets = [1.0, -1.0, -1.0, 1.0]

# Create network: 3 inputs → 4 hidden → 4 hidden → 1 output
random.seed(42)
mlp = MLP(3, [4, 4, 1])

print(f"Dataset: {len(inputs)} samples")
print(f"Network architecture: 3 → 4 → 4 → 1")
print(f"Total parameters: {len(mlp.parameters())}")

Dataset: 4 samples
Network architecture: 3 → 4 → 4 → 1
Total parameters: 41


In [11]:
# ============================================
# TRAINING LOOP
# ============================================

learning_rate = 0.1   # step size for weight updates
num_iterations = 100  # number of training iterations

print(f"Training for {num_iterations} iterations...")
print(f"Learning rate: {learning_rate}")
print("-" * 50)

for iteration in range(num_iterations):
    
    # ===== STEP 1: FORWARD PASS =====
    # Compute predictions for all inputs
    predictions = [mlp(x) for x in inputs]
    
    # ===== STEP 2: COMPUTE LOSS =====
    # Sum of squared errors (not divided by the number of samples)
    loss = sum((pred - target)**2 for target, pred in zip(targets, predictions))
    
    # ===== STEP 3: ZERO GRADIENTS =====
    # Gradients accumulate, so we must reset them
    for param in mlp.parameters():
        param.grad = 0.0
    
    # ===== STEP 4: BACKWARD PASS =====
    # Compute all gradients
    loss.backward()
    
    # ===== STEP 5: UPDATE WEIGHTS =====
    # Gradient descent: move opposite to gradient
    for param in mlp.parameters():
        param.data -= learning_rate * param.grad
    
    # Print progress
    if iteration % 20 == 0:
        print(f"Iteration {iteration:3d} | Loss: {loss.data:.6f}")

print("-" * 50)
print(f"Final Loss: {loss.data:.6f}")

Training for 100 iterations...
Learning rate: 0.1
--------------------------------------------------
Iteration   0 | Loss: 5.230518
Iteration  20 | Loss: 0.034212
Iteration  40 | Loss: 0.014531
Iteration  60 | Loss: 0.008926
Iteration  80 | Loss: 0.006351
--------------------------------------------------
Final Loss: 0.004950


In [12]:
# ============================================
# FINAL RESULTS
# ============================================

print("\nPredictions vs Targets:")
print("=" * 50)

for i, (pred, target) in enumerate(zip(predictions, targets)):
    match = "✓" if (pred.data > 0) == (target > 0) else "✗"
    print(f"Sample {i+1}: predicted {pred.data:+.4f}, target {target:+.1f}  {match}")

print("\nThe network learned to classify inputs correctly!")


Predictions vs Targets:
==================================================
Sample 1: predicted +0.9663, target +1.0  ✓
Sample 2: predicted -0.9816, target -1.0  ✓
Sample 3: predicted -0.9516, target -1.0  ✓
Sample 4: predicted +0.9663, target +1.0  ✓

The network learned to classify inputs correctly!


---
## Summary: What We Built

### 1. Value Class (Automatic Differentiation)
- Wraps numbers and tracks computation history
- Each operation knows its gradient rule
- `backward()` applies chain rule automatically

### 2. Neural Network Components
```
Neuron: weighted sum + activation → 1 output
Layer:  multiple neurons in parallel → n outputs
MLP:    multiple layers in sequence → final output
```

### 3. Training Loop
```
1. Forward:  Input → Prediction
2. Loss:     Measure error
3. Zero:     Reset gradients
4. Backward: Compute gradients
5. Update:   Adjust weights
6. Repeat!
```

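### Key Insight

**Reverse-mode automatic differentiation, the idea implemented here, is the same idea behind autograd in PyTorch and TensorFlow.**

The main differences in production frameworks:
- They operate on tensors (multi-dimensional arrays) rather than individual scalars
- They are optimized for GPUs
- They offer many more operations and activations
- They provide optimizers beyond basic gradient descent (for example, Adam)
- They process whole batches of samples in a single tensor operation

The core idea is the same.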