# Part 3.1: Perceptrons & Basic Networks - The Formula 1 Edition

Welcome to the beginning of neural networks! In this notebook, we'll build up from a single artificial neuron (the perceptron) to a complete multi-layer network that can solve non-linear problems. By the end, you'll implement a neural network from scratch using only NumPy.

**Why this matters:** Every deep learning model - from GPT to image classifiers - is built from these fundamental building blocks. Understanding perceptrons, activation functions, and forward propagation is essential before tackling backpropagation and modern architectures.

**F1 analogy:** Think of a perceptron as a single sensor threshold decision on an F1 car. The tire temperature sensor reads data, applies a threshold, and outputs a binary decision: is the tire too hot? Yes or no. A full neural network is the chain from dozens of sensors through feature extraction layers to the pit wall's strategy decision. In this notebook, we build that chain from a single sensor up.

---

## Learning Objectives

By the end of this notebook, you should be able to:

- [ ] Explain how a perceptron computes a weighted sum and applies an activation
- [ ] Visualize decision boundaries for single neurons
- [ ] Compare activation functions (sigmoid, tanh, ReLU, Leaky ReLU, GELU) and know when to use each
- [ ] Explain the vanishing gradient problem and why ReLU helps
- [ ] Trace data through forward propagation in a multi-layer network
- [ ] Choose appropriate loss functions for regression vs classification
- [ ] Build a 2-layer neural network from scratch in NumPy
- [ ] Train a network to solve the XOR problem

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

---

## 1. The Biological Inspiration

### From Biology to Math

Neural networks are loosely inspired by how biological neurons work in the brain:

| Biological Neuron | Artificial Neuron | F1 Analogy |
|-------------------|-------------------|------------|
| Dendrites receive signals | Input features $x_1, x_2, ..., x_n$ | Sensor readings: speed, tire temp, throttle position |
| Synapses have varying strengths | Weights $w_1, w_2, ..., w_n$ | How much each sensor matters for the decision |
| Cell body sums inputs | Weighted sum $z = \sum w_i x_i + b$ | Combining all telemetry into a single risk score |
| Axon fires if threshold exceeded | Activation function $a = f(z)$ | Engineer decides: pit now or stay out |

**Important caveat:** While inspired by biology, artificial neural networks are primarily a mathematical framework. We won't dwell on the biology - let's focus on the math that actually makes them work!

**F1 analogy:** A single neuron is like a single sensor threshold decision. The tire temperature sensor reads 105C (input), multiplies by a calibration weight, adds a bias offset, and if the result exceeds a threshold, it fires a warning: "tire degradation critical." That is exactly what a perceptron does.
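
The sensor decision described above can be sketched in a few lines. The `tire_warning` name, the weight of `0.1`, and the bias of `-10.0` are hypothetical values chosen for illustration (they place the threshold at 100C); they are not taken from any real calibration.

```python
def tire_warning(temp_c, weight=0.1, bias=-10.0):
    """Single-input perceptron with a step activation.

    weight and bias are hypothetical: together they put the
    threshold at temp_c = -bias / weight = 100 C.
    """
    z = weight * temp_c + bias   # weighted sum (pre-activation)
    fired = int(z > 0)           # step activation: fire only if z exceeds 0
    return fired, z

for temp in (90.0, 105.0):
    fired, z = tire_warning(temp)
    print(f"temp={temp:.0f}C  z={z:+.2f}  warning={fired}")
```

Changing the bias moves the temperature at which the warning fires, while changing the weight controls how quickly `z` grows per degree.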

In [None]:
# Visualization: Biological vs Artificial Neuron
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Biological neuron (simplified diagram)
ax = axes[0]
ax.set_xlim(-2, 6)
ax.set_ylim(-2, 4)

# Dendrites (inputs)
for i, y in enumerate([3, 2, 1, 0]):
    ax.annotate('', xy=(1, 1.5), xytext=(-1, y),
                arrowprops=dict(arrowstyle='->', color='blue', lw=2))
    ax.text(-1.5, y, f'Input {i+1}', fontsize=10, va='center')

# Cell body
circle = plt.Circle((2, 1.5), 0.8, color='green', alpha=0.5)
ax.add_patch(circle)
ax.text(2, 1.5, 'Sum', ha='center', va='center', fontsize=12, fontweight='bold')

# Axon (output)
ax.annotate('', xy=(5, 1.5), xytext=(2.8, 1.5),
            arrowprops=dict(arrowstyle='->', color='red', lw=3))
ax.text(5.2, 1.5, 'Output', fontsize=10, va='center')

ax.set_title('Biological Neuron (Simplified)', fontsize=14, fontweight='bold')
ax.axis('off')

# Right: Artificial neuron (mathematical)
ax = axes[1]
ax.set_xlim(-2, 8)
ax.set_ylim(-1, 5)

# Inputs with weights
inputs = ['$x_1$', '$x_2$', '$x_3$', '$x_n$']
weights = ['$w_1$', '$w_2$', '$w_3$', '$w_n$']
y_positions = [4, 3, 2, 0.5]

for i, (inp, w, y) in enumerate(zip(inputs, weights, y_positions)):
    ax.annotate('', xy=(2.5, 2.25), xytext=(0, y),
                arrowprops=dict(arrowstyle='->', color='blue', lw=2))
    ax.text(-0.5, y, inp, fontsize=12, va='center')
    ax.text(1.2, (y + 2.25)/2 + 0.2, w, fontsize=10, color='purple')

# Dots for "..."
ax.text(-0.3, 1.2, '...', fontsize=14, va='center')

# Summation node
circle = plt.Circle((3, 2.25), 0.5, color='green', alpha=0.5)
ax.add_patch(circle)
ax.text(3, 2.25, r'$\Sigma$', ha='center', va='center', fontsize=16)  # raw string: '\S' is not a valid escape

# Bias
ax.annotate('', xy=(3, 1.75), xytext=(3, 0),
            arrowprops=dict(arrowstyle='->', color='orange', lw=2))
ax.text(3, -0.3, '$b$ (bias)', ha='center', fontsize=10, color='orange')

# Linear output
ax.annotate('', xy=(5, 2.25), xytext=(3.5, 2.25),
            arrowprops=dict(arrowstyle='->', color='gray', lw=2))
ax.text(4.3, 2.6, '$z$', fontsize=12)

# Activation function
rect = plt.Rectangle((5, 1.75), 1, 1, color='red', alpha=0.3)
ax.add_patch(rect)
ax.text(5.5, 2.25, '$f$', ha='center', va='center', fontsize=14)

# Output
ax.annotate('', xy=(7.5, 2.25), xytext=(6, 2.25),
            arrowprops=dict(arrowstyle='->', color='red', lw=3))
ax.text(7.7, 2.25, '$a = f(z)$', fontsize=12, va='center')

ax.set_title('Artificial Neuron (Perceptron)', fontsize=14, fontweight='bold')
ax.axis('off')

plt.tight_layout()
plt.show()

print("Key equation: z = w1*x1 + w2*x2 + ... + wn*xn + b = w . x + b")
print("Output:       a = f(z)  where f is the activation function")

---

## 2. Single Neuron (Perceptron)

### The Fundamental Computation

A single neuron performs two operations:

1. **Linear transformation:** Compute a weighted sum of inputs plus a bias
2. **Non-linear activation:** Apply a function to produce the output

### The Weighted Sum

$$z = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = \mathbf{w} \cdot \mathbf{x} + b$$

#### Breaking down the formula:

| Component | Meaning | Role | F1 Analogy |
|-----------|---------|------|------------|
| $\mathbf{x}$ | Input features | The data we're processing | Sensor readings: `[tire_temp, brake_temp]` |
| $\mathbf{w}$ | Weights | How much each input matters (learned) | Calibration factors for each sensor |
| $b$ | Bias | Shifts the decision boundary (learned) | Baseline threshold offset |
| $z$ | Pre-activation | The raw linear combination | Combined risk score before the go/no-go decision |

**What this means:** The weighted sum tells you "how strongly does this input match the pattern I'm looking for?" A large positive $z$ means a strong match, a large negative $z$ means the opposite of the pattern, and a value near zero means uncertain.

**F1 analogy:** Imagine computing a "pit stop urgency score." You take tire temperature (weighted heavily), brake wear (weighted moderately), and fuel load (weighted lightly), and combine them into a single number. A high score means "pit now," a low score means "stay out." The weights encode which factors matter most for this decision.

In [None]:
def weighted_sum(x, w, b):
    """
    Compute the weighted sum (pre-activation).
    
    Args:
        x: Input vector (n,)
        w: Weight vector (n,)
        b: Bias scalar
    
    Returns:
        z: Pre-activation value
    """
    return np.dot(w, x) + b

# Example: A simple 2-input neuron
x = np.array([2, 3])       # Input features
w = np.array([0.5, -0.3])  # Weights
b = 0.1                     # Bias

z = weighted_sum(x, w, b)

print("Input x:", x)
print("Weights w:", w)
print("Bias b:", b)
print(f"\nWeighted sum z = w . x + b")
print(f"z = ({w[0]} * {x[0]}) + ({w[1]} * {x[1]}) + {b}")
print(f"z = {w[0]*x[0]} + {w[1]*x[1]} + {b}")
print(f"z = {z}")

### Visualizing the Decision Boundary

A single neuron with 2 inputs creates a **linear decision boundary** - a line that separates two regions in the input space.

The boundary occurs where $z = 0$, which means:
$$w_1 x_1 + w_2 x_2 + b = 0$$

Solving for $x_2$:
$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$$

This is just a line with slope $-w_1/w_2$ and intercept $-b/w_2$ (assuming $w_2 \neq 0$; if $w_2 = 0$ the boundary is the vertical line $x_1 = -b/w_1$).
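
As a quick check of that algebra, here is a minimal sketch with example values for `w` and `b` (chosen for illustration). It computes the slope and intercept, then confirms that points on the resulting line give $z = 0$.

```python
import numpy as np

w = np.array([2.0, 1.0])   # example weights (w[1] must be non-zero here)
b = -1.0                   # example bias

slope = -w[0] / w[1]
intercept = -b / w[1]

# Any point on the line x2 = slope * x1 + intercept should give z = 0
x1 = np.linspace(-3, 3, 7)
x2 = slope * x1 + intercept
z = w[0] * x1 + w[1] * x2 + b

print(f"slope = {slope}, intercept = {intercept}")
print("z on the boundary:", z)
```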

In [None]:
# Visualization: Decision boundary of a single neuron
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Different weight/bias combinations
configs = [
    {'w': np.array([1, 1]), 'b': 0, 'title': 'w=[1,1], b=0'},
    {'w': np.array([1, 1]), 'b': -1, 'title': 'w=[1,1], b=-1 (shifted)'},
    {'w': np.array([2, 1]), 'b': 0, 'title': 'w=[2,1], b=0 (rotated)'}
]

x_range = np.linspace(-3, 3, 100)

for ax, config in zip(axes, configs):
    w, b = config['w'], config['b']
    
    # Create a grid of points
    xx, yy = np.meshgrid(x_range, x_range)
    Z = w[0] * xx + w[1] * yy + b
    
    # Plot the regions
    ax.contourf(xx, yy, Z, levels=[-100, 0, 100], colors=['lightblue', 'lightsalmon'], alpha=0.5)
    
    # Plot the decision boundary (z = 0)
    ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
    
    # Labels
    ax.text(2, 2, 'z > 0', fontsize=12, color='red', fontweight='bold')
    ax.text(-2, -2, 'z < 0', fontsize=12, color='blue', fontweight='bold')
    
    ax.set_xlabel('$x_1$', fontsize=12)
    ax.set_ylabel('$x_2$', fontsize=12)
    ax.set_title(config['title'], fontsize=12, fontweight='bold')
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect('equal')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key insight: Weights control the orientation, bias shifts the boundary")

### Why We Need Activation Functions

Without activation functions, stacking multiple layers would be pointless:

$$\text{Layer 1: } z_1 = W_1 x + b_1$$
$$\text{Layer 2: } z_2 = W_2 z_1 + b_2 = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$$

This is just another linear transformation! We could replace it with a single layer.

**What this means:** Without non-linearity, no matter how many layers we stack, we can only learn linear relationships. Activation functions break this limitation.
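
The collapse of two linear layers into one can be verified numerically. This is a small sketch with random matrices; the shapes (3 inputs, 4 hidden units, 2 outputs) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# One equivalent linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print("Same output:", np.allclose(two_layers, one_layer))
print("Combined weight shape:", W_combined.shape)
```

Inserting any non-linear function between the two layers breaks this equivalence, which is the whole point of activation functions.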

**F1 analogy:** Tire grip versus temperature is not linear. At low temperatures, grip increases as the rubber warms up. At the optimal temperature, grip peaks. Beyond that, grip drops off sharply as the tire overheats. This rise-and-fall relationship is exactly the kind of nonlinear pattern that activation functions let neural networks capture. A purely linear model would draw a straight line through this curve and miss the critical "cliff edge" where grip falls off.

| Problem Type | Linear Model Can Solve? | Needs Non-linearity? |
|--------------|------------------------|---------------------|
| AND gate | Yes | No |
| OR gate | Yes | No |
| XOR gate | **No** | **Yes** |
| Image classification | No | Yes |
| Language understanding | No | Yes |
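
The first three rows of the table can be explored with a brute-force sketch. The helper names `step_perceptron` and `find_separator` and the coarse search grid are assumptions made for this illustration. A failed grid search is not a proof on its own, but for XOR it agrees with the known result that no separating line exists.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
gates = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

def step_perceptron(X, w, b):
    """Predict 1 where w . x + b > 0, else 0."""
    return (X @ w + b > 0).astype(int)

def find_separator(y, grid=np.linspace(-2, 2, 17)):
    """Search a coarse (w1, w2, b) grid; return the first hit or None."""
    for w1, w2, b in product(grid, repeat=3):
        if np.array_equal(step_perceptron(X, np.array([w1, w2]), b), y):
            return float(w1), float(w2), float(b)
    return None

for name, y in gates.items():
    print(name, "->", find_separator(y))
```

For example, `w = [1, 1]` with `b = -1.5` implements AND and `w = [1, 1]` with `b = -0.5` implements OR.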

In [None]:
# Visualization: Why XOR needs non-linearity
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# AND gate (linearly separable)
ax = axes[0]
and_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_labels = np.array([0, 0, 0, 1])  # Only (1,1) is True

ax.scatter(and_data[and_labels==0, 0], and_data[and_labels==0, 1], 
           c='blue', s=200, marker='o', label='0 (False)', edgecolors='black')
ax.scatter(and_data[and_labels==1, 0], and_data[and_labels==1, 1], 
           c='red', s=200, marker='s', label='1 (True)', edgecolors='black')

# Decision boundary
x_line = np.linspace(-0.5, 1.5, 100)
ax.plot(x_line, 1.5 - x_line, 'g--', lw=2, label='Decision boundary')
ax.fill_between(x_line, 1.5 - x_line, 2, alpha=0.2, color='red')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('AND Gate (Linearly Separable)', fontsize=12, fontweight='bold')
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

# OR gate (linearly separable)
ax = axes[1]
or_labels = np.array([0, 1, 1, 1])  # Only (0,0) is False

ax.scatter(and_data[or_labels==0, 0], and_data[or_labels==0, 1], 
           c='blue', s=200, marker='o', label='0 (False)', edgecolors='black')
ax.scatter(and_data[or_labels==1, 0], and_data[or_labels==1, 1], 
           c='red', s=200, marker='s', label='1 (True)', edgecolors='black')

ax.plot(x_line, 0.5 - x_line, 'g--', lw=2, label='Decision boundary')
ax.fill_between(x_line, 0.5 - x_line, 2, alpha=0.2, color='red')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('OR Gate (Linearly Separable)', fontsize=12, fontweight='bold')
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

# XOR gate (NOT linearly separable)
ax = axes[2]
xor_labels = np.array([0, 1, 1, 0])  # (0,1) and (1,0) are True

ax.scatter(and_data[xor_labels==0, 0], and_data[xor_labels==0, 1], 
           c='blue', s=200, marker='o', label='0 (False)', edgecolors='black')
ax.scatter(and_data[xor_labels==1, 0], and_data[xor_labels==1, 1], 
           c='red', s=200, marker='s', label='1 (True)', edgecolors='black')

# Show that no single line works
ax.plot(x_line, 0.5 - x_line, 'gray', lw=1, alpha=0.5, linestyle='--')
ax.plot(x_line, 1.5 - x_line, 'gray', lw=1, alpha=0.5, linestyle='--')
ax.text(0.5, 1.3, 'No single line\ncan separate!', ha='center', fontsize=10, color='darkred')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('XOR Gate (NOT Linearly Separable)', fontsize=12, fontweight='bold')
ax.legend()
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("XOR is the classic example showing why neural networks need non-linear activations.")
print("We'll solve XOR at the end of this notebook!")

---

## 3. Activation Functions

### Intuition

Activation functions introduce non-linearity, allowing neural networks to learn complex patterns. Think of them as "decision makers" that determine how strongly a neuron should "fire" based on its input.

Different activation functions have different properties that make them suitable for different situations.

**F1 analogy:** Each activation function is like a different type of nonlinear response in a race car. Sigmoid is like a tire warming up from cold: almost no grip when cold, a smooth rise through the working range, then a plateau near peak grip, so the response flattens out (saturates) at both extremes. ReLU is like the throttle response: below a threshold nothing happens, above it the response is proportional. Leaky ReLU is like engine braking -- even when you lift off the throttle, there is a small negative force (the "leak").

In [None]:
# Define all major activation functions

def sigmoid(z):
    """Sigmoid activation: squashes to (0, 1)"""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid"""
    s = sigmoid(z)
    return s * (1 - s)

def tanh_activation(z):
    """Tanh activation: squashes to (-1, 1)"""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh"""
    return 1 - np.tanh(z)**2

def relu(z):
    """ReLU activation: max(0, z)"""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU"""
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: allows small negative values"""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """Derivative of Leaky ReLU"""
    return np.where(z > 0, 1, alpha)

def gelu(z):
    """GELU activation: smooth approximation used in transformers"""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

def gelu_derivative(z):
    """Approximate derivative of GELU"""
    # Numerical derivative for simplicity
    eps = 1e-7
    return (gelu(z + eps) - gelu(z - eps)) / (2 * eps)

print("Activation functions defined!")

In [None]:
# Visualization: All activation functions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
z = np.linspace(-5, 5, 200)

activations = [
    ('Sigmoid', sigmoid, sigmoid_derivative, 'Squashes to (0,1)\nUsed for binary output'),
    ('Tanh', tanh_activation, tanh_derivative, 'Squashes to (-1,1)\nZero-centered'),
    ('ReLU', relu, relu_derivative, 'max(0, z)\nMost popular hidden layer'),
    ('Leaky ReLU', leaky_relu, leaky_relu_derivative, 'Allows small negatives\nFixes "dying ReLU"'),
    ('GELU', gelu, gelu_derivative, 'Smooth ReLU variant\nUsed in transformers')
]

for ax, (name, func, deriv, desc) in zip(axes.flat[:5], activations):
    # Plot activation
    ax.plot(z, func(z), 'b-', lw=2.5, label='Activation')
    # Plot derivative
    ax.plot(z, deriv(z), 'r--', lw=2, label='Derivative')
    
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_xlabel('z (input)', fontsize=11)
    ax.set_ylabel('f(z) / f\'(z)', fontsize=11)
    ax.set_title(f'{name}', fontsize=13, fontweight='bold')
    ax.legend(loc='best')
    ax.set_xlim(-5, 5)
    ax.set_ylim(-1.5, 2)
    ax.grid(True, alpha=0.3)
    
    # Add description
    ax.text(0.02, 0.02, desc, transform=ax.transAxes, fontsize=9,
            verticalalignment='bottom', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Remove the 6th subplot
axes[1, 2].axis('off')

plt.tight_layout()
plt.show()

### Activation Function Comparison

| Activation | Formula | Range | Derivative Range | Pros | Cons | When to Use |
|------------|---------|-------|------------------|------|------|-------------|
| **Sigmoid** | $\frac{1}{1+e^{-z}}$ | (0, 1) | (0, 0.25] | Output as probability | Vanishing gradient | Output layer (binary) |
| **Tanh** | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | (-1, 1) | (0, 1] | Zero-centered | Vanishing gradient | RNNs, hidden layers |
| **ReLU** | $\max(0, z)$ | [0, inf) | {0, 1} | Fast, no vanishing grad | Dead neurons | Default for hidden |
| **Leaky ReLU** | $\max(\alpha z, z)$ | (-inf, inf) | {alpha, 1} | No dead neurons | Extra hyperparameter | When ReLU fails |
| **GELU** | $z \cdot \Phi(z)$ | approx (-0.17, inf) | smooth | Smooth, probabilistic | Slower to compute | Transformers |
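
The bounds in the table can be checked numerically. This sketch evaluates each function on a dense grid (the grid range and resolution are arbitrary choices) and uses the tanh approximation of GELU, so its minimum is approximate.

```python
import numpy as np

z = np.linspace(-10, 10, 20001)

sigmoid = 1 / (1 + np.exp(-z))
d_sigmoid = sigmoid * (1 - sigmoid)
d_tanh = 1 - np.tanh(z) ** 2
# tanh approximation of GELU
gelu = 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

print(f"max sigmoid'(z) = {d_sigmoid.max():.4f}")
print(f"max tanh'(z)    = {d_tanh.max():.4f}")
print(f"min GELU(z)     = {gelu.min():.4f} near z = {z[gelu.argmin()]:.2f}")
```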

**F1 analogy:** Choosing an activation function is like choosing a tire compound. Soft compounds (sigmoid/tanh) give nuanced response but degrade quickly at extremes (vanishing gradients). Hard compounds (ReLU) are durable and fast but binary -- they either grip or they do not. Medium compounds (GELU, Leaky ReLU) try to find the best compromise for the conditions.

### Deep Dive: The Vanishing Gradient Problem

When training deep networks with sigmoid or tanh activations, gradients can become extremely small as they propagate backward through layers.

#### Why it happens:

1. **Sigmoid derivative is small:** Maximum value is 0.25 (at z=0)
2. **Gradients multiply:** In backpropagation, gradients from each layer multiply together
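3. **Exponential decay:** With n sigmoid layers, the product of activation derivatives is at most $(0.25)^n$, and that bound is the best case (every neuron sitting at $z = 0$, weights ignored)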

| Layers | Max Gradient (sigmoid) | Practical Impact |
|--------|------------------------|------------------|
| 2 | 0.0625 | Manageable |
| 5 | ~0.001 | Learning slows |
| 10 | ~0.000001 | Almost no learning |
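
The decay can be reproduced with a short sketch. The first loop computes the bound $0.25^n$ directly. The second follows a chain of single sigmoid neurons with every weight fixed at 1 and a starting input of 0.5, both simplifying assumptions for illustration, and multiplies the local derivatives as the chain rule would.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Upper bound: every layer contributes the maximum derivative, 0.25
for n in (2, 5, 10):
    print(f"{n:>2} layers: 0.25**{n} = {0.25 ** n:.2e}")

# Chain of single sigmoid neurons, all weights fixed at 1 (an assumption):
# a_k = sigmoid(a_{k-1}), so d a_n / d a_0 is the product of sigmoid'(a_{k-1})
a, grad = 0.5, 1.0
for _ in range(10):
    s = sigmoid(a)
    grad *= s * (1 - s)   # chain rule: multiply by the local derivative
    a = s
print(f"10-layer chain, weights = 1: gradient = {grad:.2e}")
```

Because the neurons in the chain do not sit exactly at $z = 0$, the measured gradient comes out smaller than the $0.25^{10}$ bound.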

**F1 analogy:** Imagine tracing a lap time problem back through the car's systems. The pit wall sees a 0.3s time loss. They attribute it to lower corner exit speed, which came from reduced rear grip, which came from tire overheating, which came from an aggressive differential setting. At each step in this chain, the "blame signal" gets weaker. By the time you trace it back to the differential, the signal is so diluted you cannot tell if it mattered. That is the vanishing gradient problem -- the feedback signal dies before it reaches the parameters that need adjusting.

#### Key Insight

ReLU mitigates this because its derivative is exactly 1 for positive inputs, so gradients pass through active neurons unscaled. This is a major reason ReLU became the default activation for deep networks.

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| "Sigmoid is bad" | Sigmoid is fine for output layers in binary classification |
| "ReLU has no problems" | ReLU can cause "dead neurons" that never activate |
| "Deeper is always better" | Without proper activations, deeper can be worse |
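
The "dead neuron" row can be illustrated with a minimal sketch. The weights, the strongly negative bias, and the uniform inputs in [0, 1) are hypothetical values picked so that the pre-activation is negative for every input.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))   # inputs in [0, 1), e.g. normalized sensor readings

w = np.array([0.5, 0.5])
b = -5.0                                 # hypothetical bias pushed far negative

z = X @ w + b                            # always below zero for these inputs
relu_grad = (z > 0).astype(float)        # ReLU derivative
leaky_grad = np.where(z > 0, 1.0, 0.01)  # Leaky ReLU derivative, alpha = 0.01

print("Fraction of inputs with non-zero ReLU gradient:", relu_grad.mean())
print("Mean Leaky ReLU gradient:", leaky_grad.mean())
```

With a zero gradient on every input, gradient descent cannot move this neuron's weights, so it stays dead. The Leaky ReLU variant still passes a small gradient through.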

In [None]:
# Visualization: Vanishing gradient demonstration
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Show how sigmoid derivative is small
ax = axes[0]
z = np.linspace(-6, 6, 200)

ax.plot(z, sigmoid(z), 'b-', lw=2, label='Sigmoid')
ax.plot(z, sigmoid_derivative(z), 'r-', lw=2, label='Sigmoid derivative')
ax.axhline(y=0.25, color='r', linestyle='--', alpha=0.5)
ax.text(3, 0.27, 'Max derivative = 0.25', color='r', fontsize=10)

# Shade the "saturation regions"
ax.fill_between(z[z < -3], 0, sigmoid_derivative(z[z < -3]), alpha=0.3, color='gray')
ax.fill_between(z[z > 3], 0, sigmoid_derivative(z[z > 3]), alpha=0.3, color='gray')
ax.text(-4.5, 0.1, 'Saturated\n(gradient~0)', ha='center', fontsize=9, color='gray')
ax.text(4.5, 0.1, 'Saturated\n(gradient~0)', ha='center', fontsize=9, color='gray')

ax.set_xlabel('z', fontsize=12)
ax.set_ylabel('Value', fontsize=12)
ax.set_title('Sigmoid Saturation Problem', fontsize=13, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Right: Show gradient decay through layers
ax = axes[1]
layers = np.arange(1, 11)

# Sigmoid gradient decay (assuming derivative ~ 0.25 at each layer)
sigmoid_grad = 0.25 ** layers
# ReLU gradient (stays at 1 for positive activations)
relu_grad = 1.0 ** layers

ax.semilogy(layers, sigmoid_grad, 'b-o', lw=2, markersize=8, label='Sigmoid (upper bound: 0.25 per layer)')
ax.semilogy(layers, relu_grad, 'r-s', lw=2, markersize=8, label='ReLU (positive path)')

ax.axhline(y=1e-6, color='gray', linestyle='--', alpha=0.5)
ax.text(8, 2e-6, 'Effectively zero', fontsize=9, color='gray')

ax.set_xlabel('Number of Layers', fontsize=12)
ax.set_ylabel('Gradient Magnitude (log scale)', fontsize=12)
ax.set_title('Gradient Flow Through Layers', fontsize=13, fontweight='bold')
ax.legend()
ax.set_xlim(1, 10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("This is one reason deep networks were hard to train before ReLU became popular around 2012!")

In [None]:
# Compare activation outputs on the same inputs
z_values = np.array([-3, -1, 0, 1, 3])

print("Comparing activation outputs for different inputs:\n")
print(f"{'z':>6} | {'Sigmoid':>8} | {'Tanh':>8} | {'ReLU':>8} | {'LeakyReLU':>10} | {'GELU':>8}")
print("-" * 65)

for z_val in z_values:
    print(f"{z_val:>6} | {sigmoid(z_val):>8.4f} | {tanh_activation(z_val):>8.4f} | {relu(z_val):>8.4f} | {leaky_relu(z_val):>10.4f} | {gelu(z_val):>8.4f}")

### Why This Matters in Machine Learning

| Application | Activation Choice | Reason | F1 Parallel |
|-------------|-------------------|--------|-------------|
| Binary classification output | Sigmoid | Outputs probability in (0,1) | Pit/no-pit probability |
| Multi-class output | Softmax | Outputs probability distribution | Tire compound selection probabilities |
| Hidden layers (CNNs) | ReLU | Fast, no vanishing gradient | Quick sensor threshold checks |
| Hidden layers (RNNs) | Tanh | Zero-centered, bounded | Smoothly varying telemetry signals |
| Transformers (BERT, GPT) | GELU | Smooth, probabilistic | Nuanced strategy weighting |
| GANs | Leaky ReLU | Prevents dead neurons | Never fully ignoring a sensor reading |
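
Softmax appears in the table but is not defined elsewhere in this notebook, so here is a minimal sketch of one common formulation. The max-subtraction step is a standard numerical-stability trick, and the three scores are hypothetical values standing in for tire compound preferences.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for three tire compounds
probs = softmax(scores)

print("Probabilities:", probs)
print("Sum:", probs.sum())
```

The outputs are positive and sum to 1, which is what lets them be read as a probability distribution over classes.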

---

## 4. Forward Propagation

### Intuition

Forward propagation is simply passing data through the network layer by layer:

**Input** --> **Layer 1** --> **Layer 2** --> ... --> **Output**

Each layer performs: **Linear transformation** --> **Activation function**

**F1 analogy:** Forward propagation is data flowing from car sensors through the telemetry pipeline to the pit wall. Raw sensor data (speed, tire temps, fuel load) enters the system. Layer 1 extracts features: "is the car understeering?", "is the tire degrading faster than expected?" Layer 2 combines those features into higher-level assessments: "should we change strategy?" The final output is the decision the pit wall communicates to the driver. Each layer adds understanding, just like each stage in the telemetry pipeline.

### Single Layer Forward Pass

For one layer with input $\mathbf{x}$, weights $W$, and bias $\mathbf{b}$:

1. **Linear step:** $\mathbf{z} = W\mathbf{x} + \mathbf{b}$
2. **Activation step:** $\mathbf{a} = f(\mathbf{z})$

where $f$ is the activation function.

In [None]:
def forward_layer(x, W, b, activation_fn):
    """
    Forward pass through a single layer.
    
    Args:
        x: Input vector/matrix (features, ) or (features, samples)
        W: Weight matrix (output_size, input_size)
        b: Bias vector (output_size, )
        activation_fn: Activation function to apply
    
    Returns:
        a: Activated output
        z: Pre-activation (for backprop, stored in cache)
    """
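    if x.ndim > 1:
        # Batch input (features, samples): broadcast the bias across samples
        z = np.dot(W, x) + b.reshape(-1, 1)
    else:
        # Single input vector (features,)
        z = np.dot(W, x) + b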
    a = activation_fn(z)
    return a, z

# Example: Single layer with 3 inputs, 2 outputs
x = np.array([1.0, 2.0, 3.0])  # 3 input features
W = np.array([[0.1, 0.2, 0.3],   # 2 output neurons
              [0.4, 0.5, 0.6]])
b = np.array([0.1, 0.2])

a, z = forward_layer(x, W, b, relu)

print("Single Layer Forward Pass:")
print(f"Input x: {x}")
print(f"\nWeights W:\n{W}")
print(f"\nBias b: {b}")
print(f"\nPre-activation z = Wx + b: {z}")
print(f"Activated output a = ReLU(z): {a}")

In [None]:
# Visualization: Data flowing through a network
fig, ax = plt.subplots(figsize=(14, 8))

# Network architecture: 3 -> 4 -> 2
layer_sizes = [3, 4, 2]
layer_names = ['Input\n(3 features)', 'Hidden\n(4 neurons)', 'Output\n(2 neurons)']
x_positions = [0, 2.5, 5]

# Draw neurons
neuron_positions = []
for layer_idx, (n_neurons, x_pos) in enumerate(zip(layer_sizes, x_positions)):
    y_positions = np.linspace(0, 6, n_neurons + 2)[1:-1]  # Evenly space neurons
    positions = [(x_pos, y) for y in y_positions]
    neuron_positions.append(positions)
    
    for (x, y) in positions:
        color = ['lightblue', 'lightgreen', 'lightsalmon'][layer_idx]
        circle = plt.Circle((x, y), 0.3, color=color, ec='black', linewidth=2)
        ax.add_patch(circle)

# Draw connections
for layer_idx in range(len(layer_sizes) - 1):
    for (x1, y1) in neuron_positions[layer_idx]:
        for (x2, y2) in neuron_positions[layer_idx + 1]:
            ax.annotate('', xy=(x2 - 0.3, y2), xytext=(x1 + 0.3, y1),
                       arrowprops=dict(arrowstyle='->', color='gray', lw=0.5, alpha=0.5))

# Labels
for layer_idx, (x_pos, name) in enumerate(zip(x_positions, layer_names)):
    ax.text(x_pos, -0.8, name, ha='center', fontsize=11, fontweight='bold')

# Show the math
ax.text(1.25, 7, r'$z^{[1]} = W^{[1]}x + b^{[1]}$', fontsize=12, ha='center')
ax.text(1.25, 6.5, r'$a^{[1]} = \text{ReLU}(z^{[1]})$', fontsize=12, ha='center')

ax.text(3.75, 7, r'$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$', fontsize=12, ha='center')
ax.text(3.75, 6.5, r'$a^{[2]} = \sigma(z^{[2]})$', fontsize=12, ha='center')

# Arrows for flow
ax.annotate('', xy=(1.5, 3), xytext=(0.5, 3),
           arrowprops=dict(arrowstyle='->', color='blue', lw=2))
ax.annotate('', xy=(4.0, 3), xytext=(3.0, 3),
           arrowprops=dict(arrowstyle='->', color='blue', lw=2))

ax.set_xlim(-1, 6)
ax.set_ylim(-1.5, 8)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Forward Propagation: Data Flows Left to Right', fontsize=14, fontweight='bold', y=1.02)

plt.tight_layout()
plt.show()

In [None]:
# Visualization: Show actual values flowing through
np.random.seed(42)

# Initialize a small network: 2 inputs -> 3 hidden -> 1 output
W1 = np.array([[0.5, -0.5],
               [0.3, 0.3],
               [-0.2, 0.6]])
b1 = np.array([0.1, 0.0, -0.1])

W2 = np.array([[0.4, -0.3, 0.5]])
b2 = np.array([0.1])

# Input
x = np.array([1.0, 0.5])

# Forward pass - Layer 1
z1 = np.dot(W1, x) + b1
a1 = relu(z1)

# Forward pass - Layer 2
z2 = np.dot(W2, a1) + b2
a2 = sigmoid(z2)  # Sigmoid for final output

print("=" * 60)
print("FORWARD PROPAGATION - Tracing Values Through the Network")
print("=" * 60)

print(f"\nINPUT LAYER:")
print(f"  x = {x}")

print(f"\nHIDDEN LAYER (ReLU activation):")
print(f"  z1 = W1 @ x + b1")
print(f"  z1 = {z1}")
print(f"  a1 = ReLU(z1) = {a1}")

print(f"\nOUTPUT LAYER (Sigmoid activation):")
print(f"  z2 = W2 @ a1 + b2")
print(f"  z2 = {z2}")
print(f"  a2 = sigmoid(z2) = {a2}")

print(f"\nFINAL OUTPUT: {a2[0]:.4f}")
print("=" * 60)

### Deep Dive: Notation Conventions

In neural network literature, you'll see consistent notation:

| Symbol | Meaning |
|--------|--------|
| $L$ | Number of layers (not counting input) |
| $n^{[l]}$ | Number of neurons in layer $l$ |
| $W^{[l]}$ | Weight matrix for layer $l$, shape $(n^{[l]}, n^{[l-1]})$ |
| $b^{[l]}$ | Bias vector for layer $l$, shape $(n^{[l]},)$ |
| $z^{[l]}$ | Pre-activation at layer $l$ |
| $a^{[l]}$ | Activation (output) at layer $l$ |
| $a^{[0]}$ | Input $x$ (by convention) |

**Key insight:** The superscript $[l]$ denotes the layer number. This is different from superscript $(i)$ which denotes the training example number.
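
The shape conventions in the table translate into a generic forward pass. In this sketch, the `forward` helper and the `params` list of `(W, b)` pairs are hypothetical names, and applying the same activation at every layer is a simplification made for brevity.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, params, activation=relu):
    """Run a[l] = f(W[l] @ a[l-1] + b[l]) for l = 1..L, with a[0] = x."""
    a = x
    for l, (W, b) in enumerate(params, start=1):
        # W[l] must have shape (n[l], n[l-1]) and b[l] shape (n[l],)
        assert W.shape[1] == a.shape[0], f"layer {l}: input size mismatch"
        assert b.shape == (W.shape[0],), f"layer {l}: bias shape mismatch"
        a = activation(W @ a + b)
    return a

layer_sizes = [3, 4, 2]   # n[0], n[1], n[2]
rng = np.random.default_rng(0)
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

out = forward(np.array([1.0, 2.0, 3.0]), params)
print("Weight shapes:", [W.shape for W, _ in params])
print("Output shape:", out.shape)
```

Checking shapes at each layer catches the most common wiring mistake, a transposed weight matrix, before it turns into a confusing broadcasting error.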

---

## 5. Loss Functions

### Intuition

A loss function measures "how wrong" the network's predictions are. During training, we minimize this loss to improve predictions.

**Key principle:** Different problems need different loss functions.

**F1 analogy:** The loss function is the lap time delta from the theoretical best. If your car crosses the line 0.3s behind the ideal lap, your "loss" is 0.3. The entire engineering effort -- setup changes, strategy adjustments, driver coaching -- is aimed at minimizing this delta. Lower is always better. For regression problems (predicting lap times), we use MSE. For classification problems (will it rain? yes/no), we use cross-entropy.

| Problem Type | Example | Loss Function | F1 Parallel |
|--------------|---------|---------------|-------------|
| Regression | Predict house price | Mean Squared Error (MSE) | Predicting lap time delta |
| Binary Classification | Cat vs Dog | Binary Cross-Entropy | Rain or dry? Pit or stay? |
| Multi-class Classification | Digit recognition | Categorical Cross-Entropy | Which tire compound to choose? |

### Mean Squared Error (MSE) for Regression

**Intuition:** Penalize predictions that are far from the true value. Squaring makes large errors hurt more.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

#### Breaking down the formula:

| Component | Meaning | F1 Analogy |
|-----------|--------|------------|
| $y_i$ | True value for example $i$ | Actual lap time |
| $\hat{y}_i$ | Predicted value for example $i$ | Model's predicted lap time |
| $(y_i - \hat{y}_i)^2$ | Squared error (never negative) | Squared time delta |
| $\frac{1}{n}$ | Average over all examples | Average over all laps |

**What this means:** MSE = 0 means perfect predictions. Larger MSE means worse predictions.

**F1 analogy:** If your model predicts lap times, MSE penalizes big misses much more than small ones. Predicting 1:32.0 when the actual time is 1:32.1 costs you $0.1^2 = 0.01$. But predicting 1:34.1 costs $2.0^2 = 4.0$. The larger error is 20 times bigger but 400 times more costly, which pushes the model to eliminate the big prediction mistakes first.
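
The quadratic penalty is easy to confirm with a couple of lines. This sketch works directly with a 0.1 s miss and a 2.0 s miss.

```python
import numpy as np

errors = np.array([0.1, 2.0])        # prediction misses in seconds
squared_errors = errors ** 2

error_ratio = errors[1] / errors[0]
cost_ratio = squared_errors[1] / squared_errors[0]

print("Squared errors:", squared_errors)
print(f"Error is {error_ratio:.0f}x larger, cost is {cost_ratio:.0f}x larger")
```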

In [None]:
def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss.
    
    Args:
        y_true: True values
        y_pred: Predicted values
    
    Returns:
        MSE loss value
    """
    return np.mean((y_true - y_pred) ** 2)

def mse_derivative(y_true, y_pred):
    """Derivative of MSE with respect to predictions."""
    n = len(y_true)
    return 2 * (y_pred - y_true) / n

# Example
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.5, 2.1, 6.0])

print("MSE Loss Example:")
print(f"True values:      {y_true}")
print(f"Predicted values: {y_pred}")
print(f"Errors:           {y_true - y_pred}")
print(f"Squared errors:   {(y_true - y_pred)**2}")
print(f"MSE Loss:         {mse_loss(y_true, y_pred):.4f}")

In [None]:
# Visualization: MSE penalizes large errors more
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Show quadratic penalty
ax = axes[0]
errors = np.linspace(-3, 3, 100)
squared_errors = errors ** 2

ax.plot(errors, squared_errors, 'b-', lw=2.5)
ax.scatter([0.5, 1, 2], [0.25, 1, 4], c='red', s=100, zorder=5)
ax.annotate('Small error: 0.5 -> 0.25', xy=(0.5, 0.25), xytext=(1.5, 1.5),
            arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)
ax.annotate('Large error: 2 -> 4', xy=(2, 4), xytext=(0, 5),
            arrowprops=dict(arrowstyle='->', color='red'), fontsize=10)

ax.set_xlabel('Error (y - y_pred)', fontsize=12)
ax.set_ylabel('Squared Error', fontsize=12)
ax.set_title('MSE Penalty: Large Errors Hurt More', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)

# Right: Show predictions vs true values
ax = axes[1]
x_plot = np.arange(len(y_true))
width = 0.35

ax.bar(x_plot - width/2, y_true, width, label='True', color='blue', alpha=0.7)
ax.bar(x_plot + width/2, y_pred, width, label='Predicted', color='red', alpha=0.7)

# Draw error lines
for i in range(len(y_true)):
    ax.plot([i, i], [y_true[i], y_pred[i]], 'k--', lw=1)
    error = abs(y_true[i] - y_pred[i])
    ax.text(i + 0.15, (y_true[i] + y_pred[i])/2, f'{error:.1f}', fontsize=9)

ax.set_xlabel('Sample', fontsize=12)
ax.set_ylabel('Value', fontsize=12)
ax.set_title(f'Predictions vs True Values (MSE = {mse_loss(y_true, y_pred):.4f})', 
             fontsize=13, fontweight='bold')
ax.legend()
ax.set_xticks(x_plot)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

### Binary Cross-Entropy for Classification

**Intuition:** Measure how "surprised" we are by the prediction. If we predict 0.99 for a true positive, low surprise (low loss). If we predict 0.01 for a true positive, high surprise (high loss).

$$\text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

#### Breaking down the formula:

| Component | When Active | Meaning |
|-----------|-------------|--------|
| $y_i \log(\hat{y}_i)$ | When $y_i = 1$ | Penalize low probability for true class |
| $(1-y_i) \log(1-\hat{y}_i)$ | When $y_i = 0$ | Penalize high probability for wrong class |
| $-$ sign | Always | Make the loss non-negative (the log of a probability is at most 0) |

**What this means:** Cross-entropy heavily penalizes confident wrong predictions. Predicting 0.01 when truth is 1 gives $-\log(0.01) = 4.6$. Predicting 0.5 gives only $-\log(0.5) = 0.69$.

In [None]:
def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary Cross-Entropy loss.
    
    Args:
        y_true: True binary labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
        epsilon: Small value to avoid log(0)
    
    Returns:
        BCE loss value
    """
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def bce_derivative(y_true, y_pred, epsilon=1e-15):
    """Derivative of BCE with respect to predictions."""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Divide by the total element count so this matches the mean in binary_cross_entropy
    return (y_pred - y_true) / (y_pred * (1 - y_pred)) / np.size(y_true)

# Example
y_true_class = np.array([1, 0, 1, 1])
y_pred_prob = np.array([0.9, 0.1, 0.8, 0.3])  # Note: 0.3 is a bad prediction for y=1

print("Binary Cross-Entropy Loss Example:")
print(f"True labels:       {y_true_class}")
print(f"Predicted probs:   {y_pred_prob}")
print(f"\nPer-sample loss breakdown:")
for i in range(len(y_true_class)):
    if y_true_class[i] == 1:
        loss_i = -np.log(y_pred_prob[i])
        print(f"  Sample {i}: y=1, pred={y_pred_prob[i]:.1f}, loss = -log({y_pred_prob[i]:.1f}) = {loss_i:.4f}")
    else:
        loss_i = -np.log(1 - y_pred_prob[i])
        print(f"  Sample {i}: y=0, pred={y_pred_prob[i]:.1f}, loss = -log(1-{y_pred_prob[i]:.1f}) = {loss_i:.4f}")

print(f"\nBCE Loss: {binary_cross_entropy(y_true_class, y_pred_prob):.4f}")
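`bce_derivative` above is the gradient with respect to the predicted probability. When that probability comes from a sigmoid, the chain rule multiplies it by the sigmoid's slope $\hat{y}(1-\hat{y})$, which cancels the denominator. The following is a minimal sketch that checks this numerically; the labels and pre-activations `z` are made-up values, and the helpers are redefined locally (without clipping) so the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_grad_wrt_pred(y, p):
    """dBCE/dp, the same expression as bce_derivative (clipping omitted for clarity)."""
    return (p - y) / (p * (1.0 - p)) / y.size

y = np.array([1.0, 0.0, 1.0, 1.0])       # illustrative labels
z = np.array([2.0, -1.5, 0.3, -0.8])     # illustrative pre-activations
p = sigmoid(z)

# Chain rule: dL/dz = dL/dp * dp/dz, with dp/dz = p * (1 - p) for a sigmoid
grad_chain = bce_grad_wrt_pred(y, p) * p * (1.0 - p)
grad_simple = (p - y) / y.size

print("Via chain rule:", np.round(grad_chain, 6))
print("(p - y) / n:   ", np.round(grad_simple, 6))
```

The two rows should agree, which is why pairing a sigmoid output with BCE gives such a simple gradient at the output layer.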

In [None]:
# Visualization: Cross-entropy loss behavior
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Loss vs predicted probability for y=1
ax = axes[0]
probs = np.linspace(0.01, 0.99, 100)

# When true label is 1
loss_y1 = -np.log(probs)
ax.plot(probs, loss_y1, 'b-', lw=2.5, label='True label = 1')

# When true label is 0
loss_y0 = -np.log(1 - probs)
ax.plot(probs, loss_y0, 'r-', lw=2.5, label='True label = 0')

# Mark some key points
ax.scatter([0.9, 0.1], [-np.log(0.9), -np.log(1-0.1)], c='green', s=100, zorder=5)
ax.annotate('Good prediction\n(low loss)', xy=(0.9, -np.log(0.9)), 
            xytext=(0.6, 1.5), arrowprops=dict(arrowstyle='->', color='green'))

ax.scatter([0.1, 0.9], [-np.log(0.1), -np.log(1-0.9)], c='red', s=100, zorder=5)
ax.annotate('Bad prediction\n(high loss)', xy=(0.1, -np.log(0.1)), 
            xytext=(0.3, 3), arrowprops=dict(arrowstyle='->', color='red'))

ax.set_xlabel('Predicted Probability', fontsize=12)
ax.set_ylabel('Cross-Entropy Loss', fontsize=12)
ax.set_title('Binary Cross-Entropy: Confident Mistakes Hurt', fontsize=13, fontweight='bold')
ax.legend()
ax.set_xlim(0, 1)
ax.set_ylim(0, 5)
ax.grid(True, alpha=0.3)

# Right: Compare MSE vs BCE for classification
ax = axes[1]

# For y_true = 1
mse_loss_y1 = (1 - probs) ** 2
bce_loss_y1 = -np.log(probs)

ax.plot(probs, mse_loss_y1, 'b--', lw=2, label='MSE (for y=1)')
ax.plot(probs, bce_loss_y1 / 5, 'b-', lw=2, label='BCE / 5 (for y=1)')  # Scaled for comparison

ax.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax.text(0.52, 0.8, 'Decision\nboundary', fontsize=9, color='gray')

ax.set_xlabel('Predicted Probability', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('MSE vs BCE for Classification', fontsize=13, fontweight='bold')
ax.legend()
ax.set_xlim(0, 1)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

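print("Key insight: for y=1, BCE's gradient grows without bound as the prediction approaches 0")
print("(a confident mistake), while MSE's gradient stays bounded, so BCE corrects confident errors faster")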
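To put numbers on the right-hand plot, here is a small sketch of the slope of each loss with respect to the predicted probability when the true label is 1. The probabilities in `p` are arbitrary sample points chosen for illustration.

```python
import numpy as np

# Predicted probability of the positive class when the true label is 1
p = np.array([0.01, 0.1, 0.5, 0.9, 0.99])

mse_grad = 2.0 * (p - 1.0)   # d/dp of (1 - p)^2
bce_grad = -1.0 / p          # d/dp of -log(p)

for pi, gm, gb in zip(p, mse_grad, bce_grad):
    print(f"p = {pi:4.2f}   dMSE/dp = {gm:6.2f}   dBCE/dp = {gb:8.2f}")
```

The MSE slope never exceeds 2 in magnitude, while the BCE slope keeps growing as `p` approaches 0, so a confidently wrong prediction receives a much stronger correction under BCE.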

### Loss Function Comparison Table

| Loss Function | Formula | Problem Type | Output Activation | Gradient Behavior |
|---------------|---------|--------------|-------------------|-------------------|
| **MSE** | $\frac{1}{n}\sum(y-\hat{y})^2$ | Regression | Linear (none) | Proportional to error |
| **MAE** | $\frac{1}{n}\sum|y-\hat{y}|$ | Regression (robust) | Linear | Constant magnitude |
| **Binary CE** | $-[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ | Binary classification | Sigmoid | Strong at confident errors |
| **Categorical CE** | $-\sum y_c \log \hat{y}_c$ | Multi-class | Softmax | Strong at confident errors |
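The table lists categorical cross-entropy with a softmax output, which this notebook does not implement elsewhere. Below is a minimal sketch under a few assumptions: labels are one-hot encoded, each row is one sample (unlike the column-per-sample layout used for the network later), and the three classes and logits are invented for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, epsilon=1e-15):
    """Mean over samples of -sum_c y_c * log(p_c)."""
    probs = np.clip(probs, epsilon, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

# Hypothetical 3-class example (soft / medium / hard compound), one row per sample
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])   # the second sample is confidently misclassified

probs = softmax(logits)
per_sample = -np.sum(y_onehot * np.log(probs), axis=1)

print("Probabilities:\n", np.round(probs, 3))
print("Per-sample loss:", np.round(per_sample, 3))
print("Categorical CE: ", round(float(categorical_cross_entropy(y_onehot, probs)), 4))
```

As with the binary case, the sample where the model puts most of its probability on the wrong class dominates the loss.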

### Why This Matters in Machine Learning

| Scenario | Recommended Loss | Reason | F1 Parallel |
|----------|------------------|--------|-------------|
| Predicting continuous values | MSE | Penalizes large errors | Predicting lap time, tire wear rate |
| Binary yes/no decisions | Binary Cross-Entropy | Probabilistic interpretation | Will it rain? Should we pit? |
| Classifying into categories | Categorical Cross-Entropy | Works with softmax | Which tire compound is optimal? |
| When outliers exist | MAE (or Huber) | Less sensitive to outliers | Safety car laps distorting time data |
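The last row recommends MAE or Huber loss when outliers are present. The sketch below compares the three regression losses on made-up lap times (in seconds) where one lap is distorted by a safety car. The data and the Huber threshold `delta=1.0` are assumptions chosen only to make the effect visible.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

# Hypothetical lap times in seconds; the last lap ran behind a safety car
y_true = np.array([92.1, 92.3, 92.0, 110.0])
y_pred = np.array([92.0, 92.2, 92.2, 92.1])

mse_val = mse_loss(y_true, y_pred)
mae_val = mae_loss(y_true, y_pred)
huber_val = huber_loss(y_true, y_pred)

print(f"MSE:   {mse_val:.4f}")
print(f"MAE:   {mae_val:.4f}")
print(f"Huber: {huber_val:.4f}")
```

The single outlier lap inflates MSE far more than MAE or Huber, because its error is squared rather than counted linearly.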

---

## 6. Putting It Together: Building a Neural Network from Scratch

Now let's build a complete 2-layer neural network and train it to solve the XOR problem!

**F1 analogy:** Building this network from scratch is like building a telemetry system from individual components. We wire sensors (inputs) through processing stages (hidden layers) to a strategy output. The XOR problem below is a simplified version of a real F1 challenge: sometimes conditions that are individually fine (warm tires AND low fuel) combine in unexpected ways. A single-layer system cannot capture these interactions, just as a single threshold cannot capture nonlinear sensor interactions.

### The XOR Problem

XOR (exclusive or) is the classic non-linear problem:

| $x_1$ | $x_2$ | XOR |
|-------|-------|-----|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

A single perceptron cannot solve this, but a 2-layer network can!
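One way to see the limitation is to try many perceptrons at once. The sketch below searches a coarse grid of weights and biases for a step-activation perceptron and reports the best accuracy found on AND and on XOR. The grid range and resolution are arbitrary choices, and a search like this is an illustration rather than a proof.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # one row per input pair
targets = {"AND": np.array([0, 0, 0, 1]),
           "XOR": np.array([0, 1, 1, 0])}

# Every combination of (w1, w2, b) on a coarse grid
grid = np.linspace(-2, 2, 41)
W1, W2, B = np.meshgrid(grid, grid, grid, indexing="ij")
params = np.stack([W1.ravel(), W2.ravel(), B.ravel()], axis=1)   # (68921, 3)

# Perceptron output for each candidate on each of the 4 inputs
z = params[:, :2] @ X.T + params[:, 2:3]   # (68921, 4)
preds = (z >= 0).astype(int)

best = {}
for name, y in targets.items():
    accuracy = (preds == y).mean(axis=1)
    best[name] = accuracy.max()
    print(f"{name}: best accuracy over {len(params)} perceptrons = {best[name]:.2f}")
```

The algebra agrees: XOR needs $b < 0$, $w_1 + b \ge 0$ and $w_2 + b \ge 0$, which together give $w_1 + w_2 + b \ge -b > 0$, contradicting the requirement that $(1, 1)$ outputs 0.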

In [None]:
# XOR dataset
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T  # Shape: (2, 4) - 2 features, 4 samples
y_xor = np.array([[0, 1, 1, 0]])  # Shape: (1, 4)

print("XOR Dataset:")
print(f"X (inputs):\n{X_xor}")
print(f"\ny (outputs): {y_xor}")
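The dataset stores one sample per column, which determines every shape in the forward pass used later in this section. Here is a small shape trace under illustrative assumptions: random weights from an arbitrary seed, and 3 hidden units (rather than the 4 used later) so the hidden and sample dimensions are easy to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for illustration only

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T   # (2, 4): features x samples
W1 = rng.standard_normal((3, 2))                   # (hidden, input)
b1 = np.zeros((3, 1))                              # (hidden, 1), broadcast across samples
W2 = rng.standard_normal((1, 3))                   # (output, hidden)
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1                  # (3, 4)
A1 = np.maximum(0, Z1)            # ReLU keeps the shape
Z2 = W2 @ A1 + b2                 # (1, 4)
A2 = 1 / (1 + np.exp(-Z2))        # sigmoid keeps the shape

for name, arr in [("X", X), ("Z1", Z1), ("A1", A1), ("Z2", Z2), ("A2", A2)]:
    print(f"{name:>2}: shape {arr.shape}")
```

Each bias has a trailing dimension of 1, so NumPy broadcasting adds the same bias to every sample column.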

In [None]:
class NeuralNetwork:
    """
    A simple 2-layer neural network built from scratch.
    
    Architecture: Input -> Hidden (with ReLU) -> Output (with Sigmoid)
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        """
        Initialize the network with random weights.
        
        Args:
            input_size: Number of input features
            hidden_size: Number of neurons in hidden layer
            output_size: Number of output neurons
        """
        # Initialize weights with scaled random values
        # He initialization (std = sqrt(2 / fan_in)), commonly paired with ReLU layers
        self.W1 = np.random.randn(hidden_size, input_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((hidden_size, 1))
        
        self.W2 = np.random.randn(output_size, hidden_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((output_size, 1))
        
        # Store architecture info
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
    
    def forward(self, X):
        """
        Forward propagation through the network.
        
        Args:
            X: Input data, shape (input_size, n_samples)
        
        Returns:
            A2: Output predictions, shape (output_size, n_samples)
        """
        # Layer 1: Linear + ReLU
        self.Z1 = np.dot(self.W1, X) + self.b1
        self.A1 = np.maximum(0, self.Z1)  # ReLU
        
        # Layer 2: Linear + Sigmoid
        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = 1 / (1 + np.exp(-self.Z2))  # Sigmoid
        
        # Store input for backprop
        self.X = X
        
        return self.A2
    
    def compute_loss(self, y):
        """
        Compute binary cross-entropy loss.
        
        Args:
            y: True labels, shape (output_size, n_samples)
        
        Returns:
            loss: Scalar loss value
        """
        m = y.shape[1]  # Number of samples
        epsilon = 1e-15
        A2_clipped = np.clip(self.A2, epsilon, 1 - epsilon)
        loss = -np.mean(y * np.log(A2_clipped) + (1 - y) * np.log(1 - A2_clipped))
        return loss
    
    def backward(self, y):
        """
        Backward propagation to compute gradients.
        
        Args:
            y: True labels, shape (output_size, n_samples)
        """
        m = y.shape[1]  # Number of samples
        
        # Output layer gradient
        # dL/dZ2 = A2 - y (for BCE + sigmoid)
        dZ2 = self.A2 - y
        self.dW2 = (1/m) * np.dot(dZ2, self.A1.T)
        self.db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
        
        # Hidden layer gradient
        # dL/dA1 = W2.T @ dZ2
        dA1 = np.dot(self.W2.T, dZ2)
        # dL/dZ1 = dL/dA1 * ReLU'(Z1)
        dZ1 = dA1 * (self.Z1 > 0)  # ReLU derivative
        self.dW1 = (1/m) * np.dot(dZ1, self.X.T)
        self.db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    def update_weights(self, learning_rate):
        """
        Update weights using gradient descent.
        
        Args:
            learning_rate: Step size for gradient descent
        """
        self.W1 -= learning_rate * self.dW1
        self.b1 -= learning_rate * self.db1
        self.W2 -= learning_rate * self.dW2
        self.b2 -= learning_rate * self.db2
    
    def train(self, X, y, learning_rate=1.0, epochs=10000, print_every=1000):
        """
        Train the network using gradient descent.
        
        Args:
            X: Training inputs
            y: Training labels
            learning_rate: Step size
            epochs: Number of training iterations
            print_every: Print loss every N epochs
        
        Returns:
            losses: List of loss values during training
        """
        losses = []
        
        for epoch in range(epochs):
            # Forward pass
            self.forward(X)
            
            # Compute loss
            loss = self.compute_loss(y)
            losses.append(loss)
            
            # Backward pass
            self.backward(y)
            
            # Update weights
            self.update_weights(learning_rate)
            
            # Print progress
            if epoch % print_every == 0:
                print(f"Epoch {epoch:5d}: Loss = {loss:.6f}")
        
        return losses
    
    def predict(self, X, threshold=0.5):
        """
        Make binary predictions.
        
        Args:
            X: Input data
            threshold: Classification threshold
        
        Returns:
            Binary predictions (0 or 1)
        """
        probs = self.forward(X)
        return (probs > threshold).astype(int)

print("NeuralNetwork class defined!")
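A common sanity check for a hand-written `backward` method is to compare its gradients against finite differences of the loss. The sketch below re-implements the same forward and backward equations as standalone functions (all helper names are hypothetical) so it runs on its own. It assumes random non-zero biases, unlike the zero biases in the class, so that no pre-activation sits exactly at the ReLU kink where the derivative is not defined.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)                       # ReLU
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))       # sigmoid
    return Z1, A1, A2

def bce(params, X, y):
    A2 = np.clip(forward(params, X)[2], 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

def analytic_grads(params, X, y):
    """Same equations as the backward method above."""
    W1, b1, W2, b2 = params
    m = y.shape[1]
    Z1, A1, A2 = forward(params, X)
    dZ2 = A2 - y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numeric_grads(params, X, y, h=1e-6):
    """Central differences: perturb one parameter at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + h
            loss_plus = bce(params, X, y)
            p[idx] = old - h
            loss_minus = bce(params, X, y)
            p[idx] = old                          # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * h)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)                    # arbitrary seed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float).T
y = np.array([[0, 1, 1, 0]], dtype=float)
params = [rng.standard_normal((4, 2)), rng.standard_normal((4, 1)),
          rng.standard_normal((1, 4)), rng.standard_normal((1, 1))]

max_diff = max(np.max(np.abs(a - n))
               for a, n in zip(analytic_grads(params, X, y),
                               numeric_grads(params, X, y)))
print(f"Largest gap between analytic and numeric gradients: {max_diff:.2e}")
```

A gap many orders of magnitude smaller than the gradients themselves suggests the backward equations match the loss; a large gap usually points to a sign or transpose error.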

In [None]:
# Create and train the network on XOR
np.random.seed(42)

# Network architecture: 2 inputs -> 4 hidden -> 1 output
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)

print("Training Neural Network on XOR Problem")
print("=" * 50)
print(f"Architecture: {nn.input_size} -> {nn.hidden_size} -> {nn.output_size}")
print("=" * 50)

# Train
losses = nn.train(X_xor, y_xor, learning_rate=1.0, epochs=10001, print_every=2000)

In [None]:
# Test the trained network
print("\n" + "=" * 50)
print("TESTING TRAINED NETWORK")
print("=" * 50)

predictions = nn.forward(X_xor)

print(f"\n{'Input':^15} | {'True':^8} | {'Predicted':^10} | {'Rounded':^8}")
print("-" * 50)
for i in range(X_xor.shape[1]):
    x_str = f"({X_xor[0,i]}, {X_xor[1,i]})"
    print(f"{x_str:^15} | {y_xor[0,i]:^8} | {predictions[0,i]:^10.4f} | {int(predictions[0,i] > 0.5):^8}")

accuracy = np.mean((predictions > 0.5) == y_xor) * 100
print(f"\nAccuracy: {accuracy:.1f}%")

In [None]:
# Visualization: Training progress and decision boundary
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Left: Loss curve
ax = axes[0]
ax.plot(losses, 'b-', lw=1)
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss (BCE)', fontsize=12)
ax.set_title('Training Loss Over Time', fontsize=13, fontweight='bold')
ax.set_yscale('log')
ax.grid(True, alpha=0.3)

# Middle: Decision boundary
ax = axes[1]

# Create a grid of points
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 100), np.linspace(-0.5, 1.5, 100))
grid_points = np.c_[xx.ravel(), yy.ravel()].T  # Shape: (2, 10000)

# Get predictions for all grid points
Z = nn.forward(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary
cf = ax.contourf(xx, yy, Z, levels=np.linspace(0, 1, 11), cmap='RdBu_r', alpha=0.7)
ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)

# Plot data points
for i in range(4):
    color = 'red' if y_xor[0, i] == 1 else 'blue'
    marker = 's' if y_xor[0, i] == 1 else 'o'
    ax.scatter(X_xor[0, i], X_xor[1, i], c=color, s=200, marker=marker, 
               edgecolors='black', linewidths=2, zorder=5)

ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('Learned Decision Boundary for XOR', fontsize=13, fontweight='bold')
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)

# Add colorbar (reuse the filled contour instead of drawing it a second time)
cbar = plt.colorbar(cf, ax=ax)
cbar.set_label('P(y=1)', fontsize=10)

# Right: Network architecture visualization
ax = axes[2]
ax.set_xlim(-1, 5)
ax.set_ylim(-0.5, 4.5)

# Draw neurons
layer_x = [0, 2, 4]
layer_neurons = [[1, 3], [0.5, 1.5, 2.5, 3.5], [2]]  # y positions
layer_colors = ['lightblue', 'lightgreen', 'lightsalmon']
layer_labels = ['Input', 'Hidden (ReLU)', 'Output (Sigmoid)']

for layer_idx, (x, neurons, color) in enumerate(zip(layer_x, layer_neurons, layer_colors)):
    for y in neurons:
        # Use facecolor/edgecolor: passing color= would override the black edge
        circle = plt.Circle((x, y), 0.25, facecolor=color, edgecolor='black', linewidth=2)
        ax.add_patch(circle)
    ax.text(x, -0.3, layer_labels[layer_idx], ha='center', fontsize=9)

# Draw connections (each neuron connects to every neuron in the next layer)
for y1 in layer_neurons[0]:
    for y2 in layer_neurons[1]:
        ax.plot([0.25, 1.75], [y1, y2], 'gray', alpha=0.3, lw=0.5)

for y1 in layer_neurons[1]:
    for y2 in layer_neurons[2]:
        ax.plot([2.25, 3.75], [y1, y2], 'gray', alpha=0.3, lw=0.5)

ax.set_title('Network Architecture', fontsize=13, fontweight='bold')
ax.axis('off')
ax.set_aspect('equal')

plt.tight_layout()
plt.show()

print("\nThe network learned a NON-LINEAR decision boundary to solve XOR!")

### Deep Dive: How the Network Solves XOR

The hidden layer transforms the input space so that XOR becomes linearly separable!

#### Key Insight

The 4 hidden neurons learn to create a new representation where:
- Points (0,0) and (1,1) map to one region
- Points (0,1) and (1,0) map to another region

The output layer then just needs to draw a line in this new space.
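To make this concrete, here is a hand-built network with just 2 hidden ReLU neurons that solves XOR. The weights are picked by hand for illustration and are not what training finds; the trained 4-neuron network will generally settle on a different representation.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).T   # (2, 4): features x samples
y = np.array([[0, 1, 1, 0]])

# Hand-picked weights (an illustration, not learned values)
W1 = np.array([[1.0, 1.0],     # h1 = ReLU(x1 + x2)
               [1.0, 1.0]])    # h2 = ReLU(x1 + x2 - 1)
b1 = np.array([[0.0], [-1.0]])
W2 = np.array([[1.0, -2.0]])   # output pre-activation: h1 - 2*h2 - 0.5
b2 = np.array([[-0.5]])

H = np.maximum(0, W1 @ X + b1)       # hidden representation, shape (2, 4)
z_out = W2 @ H + b2
pred = (z_out > 0).astype(int)       # same decision as sigmoid(z_out) > 0.5

print("Hidden representation (columns are samples):")
print(H)
print("Predictions:", pred)
print("Targets:    ", y)
```

In hidden space, (0,1) and (1,0) both land on the point (1, 0), while (0,0) and (1,1) land on (0, 0) and (2, 1). The single line $h_1 - 2h_2 = 0.5$ separates the two classes.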

**F1 analogy:** A multi-layer telemetry system works the same way. Raw sensor data (tire temps, wind speed) is hard to interpret directly, but the hidden layer transforms it into meaningful features like "effective downforce" and "tire degradation rate." In this new feature space, the strategy decision becomes simple. The hidden layer is feature engineering that the network learns automatically.

In [None]:
# Visualize the hidden layer representations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Original space
ax = axes[0]
for i in range(4):
    color = 'red' if y_xor[0, i] == 1 else 'blue'
    marker = 's' if y_xor[0, i] == 1 else 'o'
    ax.scatter(X_xor[0, i], X_xor[1, i], c=color, s=300, marker=marker, 
               edgecolors='black', linewidths=2)
    ax.annotate(f'({int(X_xor[0,i])},{int(X_xor[1,i])})', 
                xy=(X_xor[0, i], X_xor[1, i]),
                xytext=(X_xor[0, i] + 0.15, X_xor[1, i] + 0.15), fontsize=11)

ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('Original Input Space (Not Separable)', fontsize=13, fontweight='bold')
ax.set_xlim(-0.3, 1.5)
ax.set_ylim(-0.3, 1.5)
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

# Right: Hidden layer representation (first 2 dimensions)
ax = axes[1]
nn.forward(X_xor)  # Make sure A1 is computed
A1 = nn.A1  # Hidden layer activations

# Use first 2 hidden neurons for visualization
for i in range(4):
    color = 'red' if y_xor[0, i] == 1 else 'blue'
    marker = 's' if y_xor[0, i] == 1 else 'o'
    ax.scatter(A1[0, i], A1[1, i], c=color, s=300, marker=marker, 
               edgecolors='black', linewidths=2)
    ax.annotate(f'({int(X_xor[0,i])},{int(X_xor[1,i])})', 
                xy=(A1[0, i], A1[1, i]),
                xytext=(A1[0, i] + 0.05, A1[1, i] + 0.05), fontsize=11)

ax.set_xlabel('Hidden Neuron 1 Activation', fontsize=12)
ax.set_ylabel('Hidden Neuron 2 Activation', fontsize=12)
ax.set_title('Hidden Layer Space (Separable!)', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

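print("The hidden layer 'untangles' the XOR problem!")
print("Red squares (XOR=1) and blue circles (XOR=0) are linearly separable in the full 4-D hidden space.")
print("This plot shows only the first 2 hidden neurons, so the separation may look partial here.")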

---

## Exercises

### Exercise 1: Implement a Perceptron

Implement a single perceptron that can learn the AND gate.

In [None]:
# EXERCISE 1: Implement a simple perceptron for AND gate
def perceptron(x, w, b):
    """
    Single perceptron with step activation.
    
    Args:
        x: Input features (n,)
        w: Weights (n,)
        b: Bias
    
    Returns:
        Output (0 or 1)
    """
    # TODO: Implement this!
    # 1. Compute weighted sum z = w . x + b
    # 2. Apply step function: return 1 if z >= 0, else 0
    
    pass  # Replace with your implementation

# Test: AND gate with known weights
# For AND: both inputs must be 1 for output to be 1
# Try w = [1, 1] and b = -1.5 (why does this work?)

w_and = np.array([1.0, 1.0])
b_and = -1.5

print("Testing AND gate perceptron:")
test_inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
expected = [0, 0, 0, 1]

for inp, exp in zip(test_inputs, expected):
    result = perceptron(np.array(inp), w_and, b_and)
    status = "Correct" if result == exp else "WRONG"
    print(f"AND({inp[0]}, {inp[1]}) = {result}, expected = {exp}, {status}")

### Exercise 2: Compare Activation Functions

Create a function that applies different activations to the same input and returns the results so they can be compared side by side.

In [None]:
# EXERCISE 2: Create a comparison visualization
def compare_activations(z_values):
    """
    Apply all activation functions to z_values and return the results for comparison.
    
    Args:
        z_values: Array of input values
    
    Returns:
        Dictionary with activation name -> output array
    """
    # TODO: Apply sigmoid, tanh, relu, and leaky_relu to z_values
    # Return a dictionary like {'sigmoid': [...], 'tanh': [...], ...}
    
    pass  # Replace with your implementation

# Test
z_test = np.array([-2, -1, 0, 1, 2])
results = compare_activations(z_test)

if results:
    print(f"Input z:       {z_test}")
    for name, values in results.items():
        print(f"{name:12s}: {np.round(values, 3)}")

### Exercise 3: Extend the Network

Train the neural network on a slightly harder problem: circle classification.

In [None]:
# EXERCISE 3: Train on circle dataset
np.random.seed(42)

# Generate circle data: points inside circle (r < 1.0) are class 1
n_samples = 200
X_circle = np.random.randn(2, n_samples)
radii = np.sqrt(X_circle[0]**2 + X_circle[1]**2)
y_circle = (radii < 1.0).astype(int).reshape(1, -1)

# Visualize the data
plt.figure(figsize=(8, 8))
plt.scatter(X_circle[0, y_circle[0]==0], X_circle[1, y_circle[0]==0], 
            c='blue', label='Class 0 (outside)', alpha=0.6)
plt.scatter(X_circle[0, y_circle[0]==1], X_circle[1, y_circle[0]==1], 
            c='red', label='Class 1 (inside)', alpha=0.6)
circle = plt.Circle((0, 0), 1.0, fill=False, color='black', linestyle='--', linewidth=2)
plt.gca().add_patch(circle)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Circle Classification Problem')
plt.legend()
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.show()

# TODO: Create a neural network and train it on this data
# Hint: You may need more hidden neurons (try 8 or 16)
# nn_circle = NeuralNetwork(input_size=2, hidden_size=?, output_size=1)
# losses = nn_circle.train(X_circle, y_circle, learning_rate=?, epochs=?)

---

## Summary

### Key Concepts

| Concept | Definition | F1 Parallel |
|---------|-----------|-------------|
| **Perceptron** | A single neuron: weighted sum + activation | A single sensor threshold decision (is the tire too hot? yes/no) |
| **Weighted sum** | $z = \mathbf{w} \cdot \mathbf{x} + b$ | Combining telemetry readings into a single urgency score |
| **Activation functions** | Non-linear functions enabling complex patterns | Nonlinear sensor responses (tire grip vs temp is not linear) |
| **Forward propagation** | Data flows through layers: linear + activation | Sensor data flowing through the telemetry pipeline to the pit wall |
| **Loss functions** | Measure prediction error (MSE, cross-entropy) | Lap time delta from the theoretical best (lower is better) |
| **Hidden layers** | Transform inputs into separable representations | Feature extraction: raw data to "understeer detected" |

- **Activation function specifics:**
  - Sigmoid: $(0, 1)$ range, good for binary output
  - Tanh: $(-1, 1)$ range, zero-centered
  - ReLU: Default for hidden layers, mitigates vanishing gradients (its gradient is 1 for positive inputs)
  - Leaky ReLU: Fixes "dying ReLU" problem
  - GELU: Smooth version used in transformers

### Connection to Deep Learning

| Concept | Where You'll See It | F1 Parallel |
|---------|--------------------|-------------|
| Weighted sums | Every layer in every network | Every telemetry aggregation step |
| ReLU activation | CNNs, MLPs, most architectures | Ignoring a sensor signal below a threshold, passing it through unchanged above |
| GELU activation | Transformers (BERT, GPT) | Smooth strategy weighting in modern F1 analytics |
| Sigmoid output | Binary classification tasks | Pit stop probability: go or stay |
| Cross-entropy loss | Classification training | Penalizing confident wrong strategy calls |
| Forward propagation | Inference in all networks | The full telemetry pipeline from sensor to pit wall |
| Hidden representations | Feature learning, embeddings | Intermediate features like "effective downforce" |

### Checklist

Before moving on, make sure you can:

- [ ] Compute a weighted sum by hand for 2-3 inputs
- [ ] Sketch the shape of sigmoid, tanh, and ReLU
- [ ] Explain why ReLU helps with vanishing gradients
- [ ] Trace values through a 2-layer network
- [ ] Explain why XOR requires a hidden layer
- [ ] Choose the right loss function for a given problem

---

## Next Steps

Now that you understand forward propagation and how networks compute predictions, you're ready to learn **how they learn**!

**Next: Part 3.2 - Backpropagation & Training**

In the next notebook, we'll cover:
- The chain rule of calculus (essential for understanding backprop)
- Backpropagation: computing gradients efficiently
- Gradient descent: updating weights to minimize loss
- Training loops and batch processing

**Recommended preparation:**
- Review the chain rule from calculus: $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$
- Re-run this notebook's XOR example and trace how the loss decreases
- Try Exercise 3 to get practice with training networks
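As a warm-up for the chain rule, the sketch below checks $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$ numerically for one neuron-like composition: a sigmoid applied to a linear function. The specific functions and the point `x = 0.2` are arbitrary choices for illustration.

```python
import numpy as np

def g(x):
    return 3.0 * x + 1.0              # inner function: a weighted input plus bias

def g_prime(x):
    return 3.0

def f(u):
    return 1.0 / (1.0 + np.exp(-u))   # outer function: sigmoid

def f_prime(u):
    s = f(u)
    return s * (1.0 - s)

x = 0.2
analytic = f_prime(g(x)) * g_prime(x)             # chain rule

h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # central difference

print(f"Chain rule:        {analytic:.8f}")
print(f"Finite difference: {numeric:.8f}")
```

Backpropagation applies this same rule repeatedly, once per layer, working from the loss back to the first weights.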