# Neural Networks Fundamentals - Exercise Solutions

This notebook provides complete solutions to all exercises in `exercises.ipynb`. Each solution includes:
- Working code with detailed comments
- Explanation of the approach
- Common mistakes to avoid
- Extensions and variations to explore

**Note:** Try to solve the exercises yourself before looking at these solutions. Learning happens through struggle!

---

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from utils import *
from viz_utils import *

# Set random seed for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully!")

---

## Solution 1: Single Neuron Variations

### Solution 1.1: Linear Neuron

**Approach:** Compute weighted sum of inputs plus bias.

In [None]:
def linear_neuron(inputs, weights, bias):
    """
    Implement a linear neuron.
    
    This is the simplest type of neuron - just a weighted sum.
    No activation function is applied.
    """
    # Compute dot product of inputs and weights
    weighted_sum = np.dot(inputs, weights)
    
    # Add bias term
    output = weighted_sum + bias
    
    return output

# Test implementation
test_inputs = np.array([1.0, 2.0, 3.0])
test_weights = np.array([0.5, -0.3, 0.8])
test_bias = 0.1

result = linear_neuron(test_inputs, test_weights, test_bias)
expected = np.dot(test_inputs, test_weights) + test_bias

print(f"Linear neuron output: {result}")
print(f"Expected output: {expected}")
print(f"Match: {np.isclose(result, expected)}")

# Detailed calculation:
print("\n=== Detailed Calculation ===")
print(f"Weighted sum: (1.0 × 0.5) + (2.0 × -0.3) + (3.0 × 0.8) = {0.5 - 0.6 + 2.4:.1f}")
print(f"Plus bias: {0.5 - 0.6 + 2.4:.1f} + 0.1 = {result:.1f}")

**Key Points:**
- Linear neurons can only learn linear relationships
- They're the building block for more complex neurons
- The bias allows shifting the output up or down

**Common Mistakes:**
- Forgetting to add the bias term
- Using wrong array dimensions for dot product

**Extensions:**
- Try vectorizing to handle multiple samples at once
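
The vectorization extension can be sketched as follows. This is a minimal illustration, not part of the original exercise: the function name `linear_neuron_batch` and the sample matrix `X_batch` are assumptions introduced here. Each row of `X_batch` is one sample, and a single matrix-vector product evaluates the neuron for every row at once.

```python
import numpy as np

def linear_neuron_batch(X, weights, bias):
    """Apply one linear neuron to every row of X at once.

    X has shape (n_samples, n_features); weights has shape (n_features,).
    Returns an array of shape (n_samples,). (Hypothetical helper name.)
    """
    X = np.atleast_2d(X)          # also accept a single 1-D sample
    return X @ weights + bias     # the scalar bias broadcasts over all samples

X_batch = np.array([[1.0, 2.0, 3.0],
                    [0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0]])
weights = np.array([0.5, -0.3, 0.8])
bias = 0.1

batch_out = linear_neuron_batch(X_batch, weights, bias)
print(batch_out.shape)   # one output per sample
print(batch_out)
```

The first row reuses the test inputs from the solution above, so its output should agree with the single-sample result.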

### Solution 1.2: Sigmoid Neuron

In [None]:
def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_neuron(inputs, weights, bias):
    """
    Implement a sigmoid-activated neuron.
    
    Steps:
    1. Compute linear combination (same as linear neuron)
    2. Apply sigmoid activation to squash output into the open interval (0, 1)
    """
    # Step 1: Linear transformation
    z = np.dot(inputs, weights) + bias
    
    # Step 2: Apply sigmoid activation
    output = sigmoid(z)
    
    return output

# Test with 5 inputs
test_inputs = np.random.randn(5)
test_weights = np.random.randn(5) * 0.1
test_bias = 0.0

output = sigmoid_neuron(test_inputs, test_weights, test_bias)

print(f"Input values: {test_inputs}")
print(f"Weights: {test_weights}")
print(f"Bias: {test_bias}")
print(f"\nSigmoid neuron output: {output:.4f}")
print(f"Output in [0, 1]: {0 <= output <= 1}")

# Demonstrate sigmoid properties
print("\n=== Sigmoid Properties ===")
print(f"Sigmoid(0) = {sigmoid(0):.4f} (always 0.5)")
print(f"Sigmoid(large positive) = {sigmoid(10):.6f} (approaches 1)")
print(f"Sigmoid(large negative) = {sigmoid(-10):.6f} (approaches 0)")

**Why Sigmoid?**
- Squashes output into the (0, 1) range (good for probabilities)
- Smooth, differentiable everywhere
- Historically popular, though less common now due to vanishing gradients

**Common Mistakes:**
- Overflow warnings from `np.exp(-x)` when `x` is a large negative number (below roughly -709 in float64); use `np.clip()` on the input if needed
- Confusing sigmoid with softmax (softmax normalizes across multiple outputs)
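
Clipping is one way to avoid the overflow. Another option, sketched here as an alternative rather than as the exercise's intended answer, is to evaluate the sigmoid piecewise so that `np.exp` only ever receives non-positive arguments. The name `stable_sigmoid` is an assumption made for this example.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never calls exp() on a large positive number.

    Hypothetical helper; always returns an array with at least one dimension.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)); here exp(x) < 1.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

extreme = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(extreme)

# On moderate inputs the result matches the plain formula.
moderate = np.array([-2.0, 0.5, 3.0])
print(np.allclose(stable_sigmoid(moderate), 1 / (1 + np.exp(-moderate))))
```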

---

## Solution 2: ELU Activation Function

### Solution 2.1: Implement ELU and Derivative

In [None]:
def elu(x, alpha=1.0):
    """
    ELU activation function.
    
    ELU smoothly handles negative values unlike ReLU which zeros them out.
    This helps with the "dying ReLU" problem.
    """
    # Use np.where for element-wise conditional
    # If x > 0: return x
    # If x <= 0: return alpha * (exp(x) - 1)
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_derivative(x, alpha=1.0):
    """
    Derivative of ELU.
    
    For x > 0: derivative is 1
    For x <= 0: derivative is alpha * exp(x) = ELU(x) + alpha
    """
    return np.where(x > 0, 1, alpha * np.exp(x))

# Test and visualize
x = np.linspace(-3, 3, 1000)
y = elu(x)
dy = elu_derivative(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot ELU
ax1.plot(x, y, 'b-', linewidth=2.5, label='ELU', zorder=3)
ax1.plot(x, np.maximum(0, x), 'r--', linewidth=2, label='ReLU (comparison)', alpha=0.7)
ax1.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax1.axvline(x=0, color='k', linestyle='--', alpha=0.3)
ax1.axhline(y=-1, color='g', linestyle=':', alpha=0.5, label='Asymptote at -α')
ax1.set_title('ELU Activation Function', fontsize=14, fontweight='bold')
ax1.set_xlabel('Input', fontsize=12)
ax1.set_ylabel('Output', fontsize=12)
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=10)
ax1.set_ylim([-2, 3])

# Plot derivative
ax2.plot(x, dy, 'r-', linewidth=2.5, label='ELU Derivative')
ax2.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax2.axvline(x=0, color='k', linestyle='--', alpha=0.3)
ax2.axhline(y=1, color='b', linestyle=':', alpha=0.5, label='Gradient = 1 for x > 0')
ax2.set_title('ELU Gradient', fontsize=14, fontweight='bold')
ax2.set_xlabel('Input', fontsize=12)
ax2.set_ylabel('Gradient', fontsize=12)
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=10)
ax2.set_ylim([-0.1, 1.5])

plt.tight_layout()
plt.show()

print("=== ELU Properties ===")
print(f"ELU(0) = {elu(0):.4f} (smooth at origin)")
print(f"ELU(-1) = {elu(-1):.4f} (negative values allowed)")
print(f"ELU(-10) ≈ {elu(-10):.4f} (approaches -α for large negatives)")

### Solution 2.2: ELU vs ReLU Discussion

**Advantages of ELU over ReLU:**

1. **No "Dying ReLU" Problem:**
   - ReLU neurons can become permanently inactive (output always 0) if they enter a state where all inputs produce negative pre-activations
   - ELU allows small negative values, keeping gradients flowing

2. **Smooth at Zero:**
   - With the default α = 1, ELU is differentiable everywhere, including at x=0 (for other α the slope jumps from α to 1 there)
   - ReLU has a sharp corner at x=0 (though this rarely causes issues in practice)

3. **Mean Activations Closer to Zero:**
   - ELU's negative outputs help push mean activations closer to zero
   - This can improve training stability (strict self-normalization is a property of the scaled variant, SELU, not of plain ELU)

4. **Better Gradient Flow:**
   - For negative inputs, ELU still has non-zero gradients
   - ReLU has zero gradient for all negative inputs

**When to Use ELU:**
- Deeper networks where dying ReLU is a concern
- When you want slightly better performance (at cost of computation)
- When training stability is important

**When to Use ReLU:**
- Default choice for most applications (simpler, faster)
- When computational efficiency is critical
- When dying ReLU isn't a problem (can be mitigated with proper initialization)
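
The gradient-flow argument can be checked numerically. The sketch below is illustrative only; `relu_derivative` is a helper introduced for this comparison, and `elu_derivative` is redefined so the block runs on its own. It evaluates both derivatives on a few negative and positive pre-activations.

```python
import numpy as np

def relu_derivative(x):
    """Gradient of ReLU: 1 for x > 0, 0 otherwise (hypothetical helper)."""
    return (x > 0).astype(float)

def elu_derivative(x, alpha=1.0):
    """Gradient of ELU: 1 for x > 0, alpha * exp(x) otherwise."""
    return np.where(x > 0, 1.0, alpha * np.exp(x))

z = np.array([-3.0, -1.0, -0.1, 0.5, 2.0])
relu_grad = relu_derivative(z)
elu_grad = elu_derivative(z)

for zi, rg, eg in zip(z, relu_grad, elu_grad):
    print(f"z={zi:+.1f}  ReLU'={rg:.3f}  ELU'={eg:.3f}")

# With alpha != 1 the ELU slope at 0 (alpha) differs from the slope for x > 0 (1).
slope_at_zero = elu_derivative(np.array([0.0]), alpha=0.5)[0]
print(f"ELU slope at 0 with alpha=0.5: {slope_at_zero}")
```

For the negative inputs, ReLU passes no gradient at all, while ELU passes a small positive gradient that shrinks as the input becomes more negative.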

---

## Solution 3: Dense Layer and 3-Layer Network

### Solution 3.1: Dense Layer Class

In [None]:
class DenseLayer:
    """
    A fully connected (dense) neural network layer.
    
    This implementation shows how layers work internally before using
    frameworks like PyTorch or TensorFlow.
    """
    
    def __init__(self, n_inputs, n_neurons, activation='relu'):
        """
        Initialize layer with random weights and zero biases.
        
        Weight initialization is crucial:
        - Too large: exploding gradients
        - Too small: vanishing gradients
        - We use small random values (0.01 scale)
        """
        # Initialize weights from normal distribution, scaled down
        # Shape: (n_inputs, n_neurons) for matrix multiplication
        self.weights = np.random.randn(n_inputs, n_neurons) * 0.01
        
        # Initialize biases to zero
        # Shape: (1, n_neurons) for broadcasting
        self.biases = np.zeros((1, n_neurons))
        
        self.activation = activation
    
    def forward(self, inputs):
        """
        Forward pass: linear transformation + activation.
        
        Matrix shapes:
        - inputs: (batch_size, n_inputs)
        - weights: (n_inputs, n_neurons)
        - output: (batch_size, n_neurons)
        """
        # Linear transformation: Z = XW + b
        # @ operator is matrix multiplication (equivalent to np.dot for 2-D arrays)
        z = inputs @ self.weights + self.biases
        
        # Apply activation function
        if self.activation == 'relu':
            output = np.maximum(0, z)
        elif self.activation == 'sigmoid':
            output = 1 / (1 + np.exp(-z))
        elif self.activation == 'linear':
            output = z  # No activation
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
        
        return output

# Comprehensive testing
print("=== Testing Dense Layer ===")

# Test 1: Basic functionality
layer = DenseLayer(n_inputs=10, n_neurons=5, activation='relu')
test_input = np.random.randn(3, 10)  # 3 samples, 10 features
output = layer.forward(test_input)

print(f"\nTest 1: Basic functionality")
print(f"  Input shape: {test_input.shape}")
print(f"  Weight shape: {layer.weights.shape}")
print(f"  Bias shape: {layer.biases.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Expected: (3, 5) ✓" if output.shape == (3, 5) else "  ✗ FAILED")

# Test 2: ReLU activation (no negative values)
print(f"\nTest 2: ReLU activation")
print(f"  All outputs >= 0: {np.all(output >= 0)}")
print(f"  Some outputs > 0: {np.any(output > 0)}")

# Test 3: Different activation functions
layer_sigmoid = DenseLayer(10, 5, activation='sigmoid')
output_sigmoid = layer_sigmoid.forward(test_input)
print(f"\nTest 3: Sigmoid activation")
print(f"  All outputs in [0,1]: {np.all((output_sigmoid >= 0) & (output_sigmoid <= 1))}")

# Test 4: Batch processing
large_batch = np.random.randn(100, 10)
large_output = layer.forward(large_batch)
print(f"\nTest 4: Batch processing")
print(f"  Processed 100 samples: {large_output.shape == (100, 5)}")

**Key Concepts:**

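1. **Weight Initialization:** Small random values break the symmetry between neurons; for deeper networks, scaled schemes such as He or Xavier initialization are the usual choice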
2. **Matrix Multiplication:** Shape compatibility is crucial
3. **Broadcasting:** Biases automatically expand to match batch size
4. **Activation Functions:** Applied element-wise to introduce non-linearity

**Common Mistakes:**
- Wrong weight matrix shape (should be `(n_inputs, n_neurons)`)
- Forgetting to initialize biases
- Using `*` instead of `@` for matrix multiplication
- Not handling batch dimension properly
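
The `*` versus `@` mistake is easy to see on small arrays. The following sketch uses made-up shapes (3 samples, 4 features, 2 neurons) chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))   # 3 samples, 4 features
W = rng.standard_normal((4, 2))   # 4 inputs -> 2 neurons
b = np.zeros((1, 2))

# Matrix multiplication: (3, 4) @ (4, 2) -> (3, 2); b broadcasts over the batch.
Z = X @ W + b
print("X @ W + b shape:", Z.shape)

# Element-wise multiplication needs broadcast-compatible shapes,
# and (3, 4) with (4, 2) are not compatible.
try:
    X * W
    elementwise_failed = False
except ValueError as err:
    elementwise_failed = True
    print("X * W failed:", err)
```

When the two shapes happen to be compatible (for example, square matrices), `*` raises no error and silently computes the wrong thing, so checking output shapes alone may not catch it.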

### Solution 3.2: Three-Layer Network

In [None]:
class ThreeLayerNetwork:
    """
    A complete 3-layer neural network for MNIST classification.
    
    Architecture:
    Input (784) -> Hidden1 (128, ReLU) -> Hidden2 (64, ReLU) -> Output (10, Linear)
    
    Note: We don't apply softmax in the network itself. It's typically
    combined with the loss function for numerical stability.
    """
    
    def __init__(self):
        """Initialize all three layers."""
        # Layer 1: 784 -> 128 with ReLU
        self.layer1 = DenseLayer(n_inputs=784, n_neurons=128, activation='relu')
        
        # Layer 2: 128 -> 64 with ReLU
        self.layer2 = DenseLayer(n_inputs=128, n_neurons=64, activation='relu')
        
        # Layer 3: 64 -> 10 with linear activation
        # (softmax will be applied during loss computation)
        self.layer3 = DenseLayer(n_inputs=64, n_neurons=10, activation='linear')
    
    def forward(self, X):
        """
        Forward pass through all layers.
        
        The data flows:
        X -> layer1 -> h1 -> layer2 -> h2 -> layer3 -> output
        """
        # Pass through first hidden layer
        h1 = self.layer1.forward(X)
        
        # Pass through second hidden layer
        h2 = self.layer2.forward(h1)
        
        # Pass through output layer
        output = self.layer3.forward(h2)
        
        return output
    
    def predict(self, X):
        """
        Make predictions (with softmax for probabilities).
        """
        # Get raw scores (logits)
        logits = self.forward(X)
        
        # Apply softmax for probabilities
        exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
        probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
        
        return probabilities

# Test the network
print("=== Testing Three-Layer Network ===")

network = ThreeLayerNetwork()

# Test with random MNIST-like data
test_batch = np.random.randn(5, 784)
raw_output = network.forward(test_batch)
predictions = network.predict(test_batch)

print(f"\nInput shape: {test_batch.shape}")
print(f"Raw output shape: {raw_output.shape}")
print(f"Predictions shape: {predictions.shape}")

print(f"\nFirst sample raw scores (logits):")
print(raw_output[0])

print(f"\nFirst sample probabilities (after softmax):")
print(predictions[0])
print(f"Sum of probabilities: {predictions[0].sum():.6f} (should be 1.0)")

print(f"\nPredicted class: {np.argmax(predictions[0])}")

# Visualize network architecture
print("\n=== Network Architecture ===")
print("Layer 1: 784 inputs  -> 128 neurons (ReLU)")
print("Layer 2: 128 inputs  -> 64 neurons  (ReLU)")
print("Layer 3: 64 inputs   -> 10 neurons  (Linear)")
print(f"\nTotal parameters: {784*128 + 128 + 128*64 + 64 + 64*10 + 10:,}")

# Visualize with plotting function
fig = plot_network_architecture([784, 128, 64, 10])
plt.show()

**Design Decisions:**

1. **Layer Sizes:** Gradually decreasing (128 -> 64 -> 10) creates a funnel
2. **Activations:** ReLU for hidden layers (fast, effective), linear for output
3. **Softmax Placement:** Applied during prediction, not in forward pass
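
To illustrate why softmax is usually folded into the loss, here is a minimal sketch of cross-entropy computed directly from logits using the log-sum-exp trick. The helper names `log_softmax` and `cross_entropy_from_logits` are assumptions for this example, not functions from the notebook's `utils` module.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax computed without forming probabilities first."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

def cross_entropy_from_logits(logits, y_onehot):
    """Mean cross-entropy for one-hot targets, taken straight from logits."""
    return -np.mean(np.sum(y_onehot * log_softmax(logits), axis=1))

# The first row has extreme logits that would overflow a naive exp().
logits = np.array([[1000.0, 0.0, -1000.0],
                   [0.0, 0.0, 0.0]])
y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

ce = cross_entropy_from_logits(logits, y_onehot)
print(f"Cross-entropy: {ce:.4f}")
```

The first sample is classified with near-total confidence and contributes essentially zero loss; the second has uniform logits and contributes log(3), so the mean is log(3) / 2.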

**Alternative Approaches:**
- Could use dropout for regularization
- Could add batch normalization between layers
- Could use different activation functions (ELU, Leaky ReLU)

---

## Solution 4: Manual Forward Propagation

### Solution 4.1: Hand Calculation

**Manual Calculation:**

Given:
- Input: x = [2.0, -1.0]
- Weights: W = [[0.5, -0.3], [0.2, 0.8]]
- Biases: b = [0.1, -0.1]
- Activation: ReLU

**Step-by-Step:**

```
Neuron 1 (first row of W):
  z₁ = (0.5)(2.0) + (-0.3)(-1.0) + 0.1
     = 1.0 + 0.3 + 0.1
     = 1.4
  a₁ = ReLU(1.4) = max(0, 1.4) = 1.4

Neuron 2 (second row of W):
  z₂ = (0.2)(2.0) + (0.8)(-1.0) + (-0.1)
     = 0.4 + (-0.8) + (-0.1)
     = -0.5
  a₂ = ReLU(-0.5) = max(0, -0.5) = 0.0

Output: [1.4, 0.0]
```

In [None]:
# Verify the calculation
print("=== Verifying Manual Calculation ===")

x = np.array([2.0, -1.0])
W = np.array([[0.5, -0.3],   # First row: weights for neuron 1 from both inputs
              [0.2, 0.8]])    # Second row: weights for neuron 2 from both inputs
b = np.array([0.1, -0.1])

# Method 1: Matrix-vector product
# Each ROW of W holds one neuron's weights, so we compute W @ x.
# (DenseLayer stores one neuron per COLUMN, i.e. the transpose of this layout.)
z = W @ x + b
output = np.maximum(0, z)  # ReLU

print(f"\nMethod 1 (Matrix multiplication):")
print(f"Pre-activation (z): {z}")
print(f"After ReLU: {output}")

# Method 2: Manual calculation (showing each neuron)
print(f"\nMethod 2 (Step-by-step):")
z1 = x[0] * W[0, 0] + x[1] * W[0, 1] + b[0]
z2 = x[0] * W[1, 0] + x[1] * W[1, 1] + b[1]
print(f"Neuron 1: z₁ = {z1:.1f}, a₁ = {max(0, z1):.1f}")
print(f"Neuron 2: z₂ = {z2:.1f}, a₂ = {max(0, z2):.1f}")

# Detailed breakdown
print(f"\n=== Detailed Breakdown ===")
print(f"Neuron 1 calculation:")
print(f"  (2.0 × 0.5) = {2.0 * 0.5}")
print(f"  (-1.0 × -0.3) = {-1.0 * -0.3}")
print(f"  bias = {b[0]}")
print(f"  Sum = {z1}")
print(f"  ReLU({z1}) = {max(0, z1)}")

print(f"\nNeuron 2 calculation:")
print(f"  (2.0 × 0.2) = {2.0 * 0.2}")
print(f"  (-1.0 × 0.8) = {-1.0 * 0.8}")
print(f"  bias = {b[1]}")
print(f"  Sum = {z2}")
print(f"  ReLU({z2}) = {max(0, z2)} (negative value zeroed out!)")

print(f"\n✓ Manual calculation verified!")

**Key Insights:**

1. **Matrix Multiplication:** Each output neuron computes a weighted sum from ALL inputs
2. **ReLU Effect:** Neuron 2's negative pre-activation becomes 0
3. **Bias Impact:** Shifts the activation threshold

**Common Mistakes:**
- Confusing row/column indexing in the weight matrix (here each row holds one neuron's weights, whereas `DenseLayer` stores one neuron per column)
- Forgetting to apply activation function
- Sign errors in arithmetic
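
The row/column pitfall can be made concrete with the numbers from this exercise. In the sketch below, `W_rows` and `W_cols` are names introduced for illustration: the first stores one neuron per row (as in the hand calculation), the second stores one neuron per column (the `(n_inputs, n_neurons)` layout).

```python
import numpy as np

x = np.array([2.0, -1.0])
b = np.array([0.1, -0.1])

# Row convention: row i holds the weights of neuron i.
W_rows = np.array([[0.5, -0.3],
                   [0.2, 0.8]])
z_rows = W_rows @ x + b

# Column convention: column j holds the weights of neuron j.
W_cols = W_rows.T
z_cols = x @ W_cols + b

# Mixing the two: a row-convention matrix used with the column formula.
z_mixed = x @ W_rows + b

print("Row convention   :", z_rows)
print("Column convention:", z_cols)
print("Mixed (wrong)    :", z_mixed)
```

Both consistent conventions reproduce the hand-calculated pre-activations of 1.4 and -0.5, while the mixed version silently yields different numbers with no shape error, because the matrix is square.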

---

## Solution 5: Mean Absolute Error Loss

### Solution 5.1: Implement MAE

In [None]:
def mae_loss(y_true, y_pred):
    """
    Mean Absolute Error (L1 Loss).
    
    More robust to outliers than MSE because it doesn't square errors.
    """
    # Compute absolute differences
    abs_errors = np.abs(y_true - y_pred)
    
    # Return mean
    return np.mean(abs_errors)

def mae_derivative(y_true, y_pred):
    """
    Gradient of MAE with respect to predictions.
    
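    The derivative of each sample's absolute error is:
    - +1 if y_pred > y_true (prediction too high)
    - -1 if y_pred < y_true (prediction too low)
    -  0 if y_pred == y_true (perfect prediction; a subgradient, since |x| has a kink at 0)
    Dividing by the number of samples gives the gradient of the mean.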
    
    Note: The gradient is constant (doesn't depend on magnitude of error)
    This is different from MSE where larger errors have larger gradients.
    """
    # Use np.sign to get -1, 0, or +1
    return np.sign(y_pred - y_true) / len(y_true)

# Test implementation
print("=== Testing MAE Implementation ===")

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 1.8, 3.1, 4.5])

mae = mae_loss(y_true, y_pred)
expected = np.mean(np.abs(y_true - y_pred))

print(f"\nTrue values: {y_true}")
print(f"Predictions: {y_pred}")
print(f"Absolute errors: {np.abs(y_true - y_pred)}")
print(f"\nMAE Loss: {mae:.4f}")
print(f"Expected: {expected:.4f}")
print(f"Match: {np.isclose(mae, expected)}")

# Test gradient
grad = mae_derivative(y_true, y_pred)
print(f"\nGradient: {grad}")
print(f"Interpretation:")
for i, (yt, yp, g) in enumerate(zip(y_true, y_pred, grad)):
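    # Gradient descent moves against the gradient: a negative gradient means
    # the prediction is too low and should increase, and vice versa.
    direction = "increase" if g < 0 else "decrease" if g > 0 else "perfect"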
    print(f"  Sample {i}: true={yt}, pred={yp:.1f}, gradient={g:.3f} -> {direction} prediction")

### Solution 5.2: Compare MAE and MSE

In [None]:
def mse_loss(y_true, y_pred):
    """Mean Squared Error for comparison."""
    return np.mean((y_true - y_pred) ** 2)

# Compare with outlier
print("=== Comparing MAE vs MSE ===")

# Case 1: Normal errors
y_true_normal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_normal = np.array([1.1, 2.1, 2.9, 4.1, 4.9])

# Case 2: One large outlier
y_pred_outlier = np.array([1.1, 2.1, 2.9, 4.1, 15.0])  # Last prediction is way off!

print("\nCase 1: Small, consistent errors")
print(f"MAE: {mae_loss(y_true_normal, y_pred_normal):.4f}")
print(f"MSE: {mse_loss(y_true_normal, y_pred_normal):.4f}")

print("\nCase 2: One large outlier error")
print(f"MAE: {mae_loss(y_true_normal, y_pred_outlier):.4f}")
print(f"MSE: {mse_loss(y_true_normal, y_pred_outlier):.4f}")
print("\nNotice: MSE increases dramatically due to outlier (10.0)² = 100")
print("        MAE increases linearly with outlier magnitude")

# Visualize
errors = np.linspace(-5, 5, 100)
mae_curve = np.abs(errors)
mse_curve = errors ** 2

plt.figure(figsize=(10, 6))
plt.plot(errors, mae_curve, 'b-', linewidth=2, label='MAE (L1)')
plt.plot(errors, mse_curve, 'r-', linewidth=2, label='MSE (L2)')
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.xlabel('Error (y_true - y_pred)', fontsize=12)
plt.ylabel('Loss Contribution', fontsize=12)
plt.title('MAE vs MSE: Response to Errors', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim([0, 10])
plt.show()

print("\nKey Observation: MSE penalizes large errors much more heavily (quadratically)")

**When to Use MAE:**

1. **Outliers Present:** MAE is less sensitive to outliers
2. **Linear Error Penalty:** The penalty grows in proportion to the error, so each extra unit of error costs the same
3. **Interpretability:** MAE is in same units as target variable

**When to Use MSE:**

1. **Penalize Large Errors:** Want to heavily discourage large mistakes
2. **Smooth Gradients:** MSE has smoother gradients near optimum
3. **Standard Practice:** MSE is more common in many domains

**Gradient Behavior:**
- MAE: Constant gradient (doesn't depend on error size)
- MSE: Gradient proportional to error (larger errors → larger gradients)
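
The difference in gradient behavior can be seen by evaluating both gradients on the outlier example from the comparison above. The helper names `mae_grad` and `mse_grad` are assumptions for this sketch.

```python
import numpy as np

def mae_grad(y_true, y_pred):
    """Gradient of mean absolute error with respect to y_pred."""
    return np.sign(y_pred - y_true) / y_true.size

def mse_grad(y_true, y_pred):
    """Gradient of mean squared error with respect to y_pred."""
    return 2.0 * (y_pred - y_true) / y_true.size

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 2.1, 2.9, 4.1, 15.0])   # last prediction is an outlier

g_mae = mae_grad(y_true, y_pred)
g_mse = mse_grad(y_true, y_pred)

print("MAE gradient:", g_mae)
print("MSE gradient:", g_mse)
```

Every MAE entry has the same magnitude (1/5), whereas the MSE entry for the outlier is about 100 times larger than the others, so a single bad sample dominates the MSE update.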

---

## Solution 6: Manual Gradient Computation

### Solution 6.1: Backpropagation by Hand

**Manual Calculation:**

Given:
- x = 3.0
- w = 0.5
- b = 0.2
- y = 2.0

**Forward Pass:**
```
ŷ = wx + b = 0.5(3.0) + 0.2 = 1.5 + 0.2 = 1.7
L = (y - ŷ)² = (2.0 - 1.7)² = (0.3)² = 0.09
```

**Backward Pass (Chain Rule):**

1. **∂L/∂ŷ:** Derivative of loss with respect to prediction
   ```
   L = (y - ŷ)²
   ∂L/∂ŷ = 2(ŷ - y) = 2(1.7 - 2.0) = 2(-0.3) = -0.6
   ```

2. **∂L/∂w:** How does loss change with weight?
   ```
   Using chain rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂w
   
   ∂ŷ/∂w = ∂(wx + b)/∂w = x = 3.0
   
   ∂L/∂w = (-0.6)(3.0) = -1.8
   ```

3. **∂L/∂b:** How does loss change with bias?
   ```
   ∂L/∂b = ∂L/∂ŷ × ∂ŷ/∂b
   
   ∂ŷ/∂b = ∂(wx + b)/∂b = 1.0
   
   ∂L/∂b = (-0.6)(1.0) = -0.6
   ```

**Results:**
- Loss: 0.09
- ∂L/∂w = -1.8 (increasing w will decrease loss)
- ∂L/∂b = -0.6 (increasing b will decrease loss)

In [None]:
# Verify with code
print("=== Verifying Manual Gradient Computation ===")

x = 3.0
w = 0.5
b = 0.2
y_true = 2.0

# Forward pass
y_pred = w * x + b
loss = (y_true - y_pred) ** 2

print(f"\n=== Forward Pass ===")
print(f"Prediction: ŷ = {w} × {x} + {b} = {y_pred}")
print(f"Loss: L = (y - ŷ)² = ({y_true} - {y_pred})² = {loss}")

# Backward pass
dL_dyhat = 2 * (y_pred - y_true)
dL_dw = dL_dyhat * x  # Chain rule: ∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂w
dL_db = dL_dyhat * 1.0  # Chain rule: ∂L/∂b = ∂L/∂ŷ × ∂ŷ/∂b

print(f"\n=== Backward Pass ===")
print(f"∂L/∂ŷ = 2(ŷ - y) = 2({y_pred} - {y_true}) = {dL_dyhat}")
print(f"∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂w = {dL_dyhat} × {x} = {dL_dw}")
print(f"∂L/∂b = ∂L/∂ŷ × ∂ŷ/∂b = {dL_dyhat} × 1.0 = {dL_db}")

# Verify using numerical gradients (finite differences)
print(f"\n=== Numerical Verification (Finite Differences) ===")
epsilon = 1e-5

# Numerical gradient for w
loss_plus_w = (y_true - ((w + epsilon) * x + b)) ** 2
loss_minus_w = (y_true - ((w - epsilon) * x + b)) ** 2
numerical_dL_dw = (loss_plus_w - loss_minus_w) / (2 * epsilon)

# Numerical gradient for b
loss_plus_b = (y_true - (w * x + (b + epsilon))) ** 2
loss_minus_b = (y_true - (w * x + (b - epsilon))) ** 2
numerical_dL_db = (loss_plus_b - loss_minus_b) / (2 * epsilon)

print(f"Analytical ∂L/∂w: {dL_dw:.6f}")
print(f"Numerical ∂L/∂w:  {numerical_dL_dw:.6f}")
print(f"Match: {np.isclose(dL_dw, numerical_dL_dw)}")

print(f"\nAnalytical ∂L/∂b: {dL_db:.6f}")
print(f"Numerical ∂L/∂b:  {numerical_dL_db:.6f}")
print(f"Match: {np.isclose(dL_db, numerical_dL_db)}")

# Show gradient descent update
print(f"\n=== Gradient Descent Update ===")
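# A small step keeps the update from overshooting. With 0.1, ŷ would jump
# from 1.7 to 2.3, which has the same loss (0.09) as before the update.
learning_rate = 0.01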
w_new = w - learning_rate * dL_dw
b_new = b - learning_rate * dL_db

print(f"Old parameters: w={w}, b={b}")
print(f"Gradients: ∂L/∂w={dL_dw}, ∂L/∂b={dL_db}")
print(f"New parameters: w={w_new}, b={b_new}")

# Verify new loss is lower
y_pred_new = w_new * x + b_new
loss_new = (y_true - y_pred_new) ** 2
print(f"\nOld loss: {loss:.6f}")
print(f"New loss: {loss_new:.6f}")
print(f"Loss decreased: {loss_new < loss} ✓")

**Understanding the Chain Rule:**

The chain rule is fundamental to backpropagation:

```
If y = f(g(x)), then dy/dx = (df/dg) × (dg/dx)
```

In our case:
- Loss depends on prediction: L(ŷ)
- Prediction depends on weight: ŷ(w)
- So: ∂L/∂w = (∂L/∂ŷ) × (∂ŷ/∂w)

**Gradient Sign Interpretation:**
- Negative gradient: Increasing parameter decreases loss
- Positive gradient: Decreasing parameter decreases loss
- Zero gradient: At a stationary point (a local optimum or a saddle point)
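
The finite-difference check used in the verification cell generalizes to any number of parameters. Below is a minimal sketch; `numerical_gradient` and `loss_fn` are hypothetical names, and the example reuses x = 3.0, y = 2.0, w = 0.5, b = 0.2 from this exercise.

```python
import numpy as np

def numerical_gradient(f, params, eps=1e-5):
    """Central-difference estimate of the gradient of f at params."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step.flat[i] = eps                      # perturb one parameter at a time
        grad.flat[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

x, y = 3.0, 2.0

def loss_fn(p):
    """Squared error of a single linear neuron with parameters p = [w, b]."""
    w, b = p
    return (y - (w * x + b)) ** 2

params = np.array([0.5, 0.2])
y_hat = params[0] * x + params[1]
analytic = np.array([2 * (y_hat - y) * x,       # dL/dw via the chain rule
                     2 * (y_hat - y)])          # dL/db via the chain rule
numeric = numerical_gradient(loss_fn, params)

print("Analytic:", analytic)
print("Numeric: ", numeric)
```

Agreement between the two to several decimal places is a good sign that the chain-rule derivation is right; a mismatch usually points to a sign or indexing error.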

---

## Solution 7: Debug Broken Training Loop

### Solution 7.1: Fixed Training Loop

In [None]:
def fixed_training_loop(X, y, epochs=10, learning_rate=0.01):
    """
    Corrected training loop with all bugs fixed.
    """
    n_samples, n_features = X.shape
    n_classes = y.shape[1]
    
    # Initialize weights and biases
    W = np.random.randn(n_features, n_classes) * 0.01
    b = np.zeros((1, n_classes))
    
    losses = []
    
    for epoch in range(epochs):
        # Forward pass
        # BUG #1 FIX: Should be PLUS bias, not minus
        z = X @ W + b  # ✓ FIXED
        
        # Softmax activation
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        predictions = exp_z / np.sum(exp_z, axis=1, keepdims=True)
        
        # Cross-entropy loss
        loss = -np.mean(np.sum(y * np.log(predictions + 1e-8), axis=1))
        losses.append(loss)
        
        # Backward pass
        dz = predictions - y
        
        # BUG #2 FIX: Divide by batch size to get average gradient
        dW = (X.T @ dz) / n_samples  # ✓ FIXED
        db = np.sum(dz, axis=0, keepdims=True) / n_samples
        
        # Update parameters
        # BUG #3 FIX: Gradient descent moves AGAINST gradient (minus)
        W = W - learning_rate * dW  # ✓ Already correct
        b = b - learning_rate * db  # ✓ FIXED (was plus)
        
        if (epoch + 1) % 2 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Loss: {loss:.4f}")
    
    return W, b, losses

print("=== Testing Fixed Training Loop ===")
print("\n📋 Bug Fixes Applied:")
print("  Bug #1: Changed 'X @ W - b' to 'X @ W + b'")
print("  Bug #2: Added '/ n_samples' to dW calculation")
print("  Bug #3: Changed 'b + learning_rate * db' to 'b - learning_rate * db'")

# Test with small dataset
np.random.seed(42)
X_small = np.random.randn(100, 10)
y_small = np.eye(3)[np.random.randint(0, 3, 100)]  # 3 classes

print("\n" + "="*50)
W, b, losses = fixed_training_loop(X_small, y_small, epochs=20, learning_rate=0.1)
print("="*50)

# Plot losses - should decrease!
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
ax1.plot(losses, 'b-o', linewidth=2, markersize=6)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Training Loss (Should Decrease!) ✓', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Add annotations
ax1.annotate(f'Start: {losses[0]:.3f}', 
            xy=(0, losses[0]), xytext=(5, losses[0] + 0.1),
            arrowprops=dict(arrowstyle='->', color='red'),
            fontsize=10, color='red')
ax1.annotate(f'End: {losses[-1]:.3f}', 
            xy=(len(losses)-1, losses[-1]), xytext=(len(losses)-5, losses[-1] - 0.1),
            arrowprops=dict(arrowstyle='->', color='green'),
            fontsize=10, color='green')

# Loss improvement
improvement = ((losses[0] - losses[-1]) / losses[0]) * 100
ax2.bar(['Initial Loss', 'Final Loss'], [losses[0], losses[-1]], 
       color=['red', 'green'], alpha=0.7)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title(f'Loss Improvement: {improvement:.1f}%', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print(f"\n✓ Training successful! Loss decreased from {losses[0]:.4f} to {losses[-1]:.4f}")

**Detailed Bug Explanations:**

**Bug #1: Bias Sign**
- **Wrong:** `z = X @ W - b`
- **Correct:** `z = X @ W + b`
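- **Why:** The backward pass derives `db` for `z = X @ W + b`; with a minus sign in the forward pass, `db` has the wrong sign
- **Effect:** Bias updates pushed the loss up, so the model couldn't learn proper decision boundaries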

**Bug #2: Missing Gradient Averaging**
- **Wrong:** `dW = X.T @ dz`
- **Correct:** `dW = (X.T @ dz) / n_samples`
- **Why:** Gradients should be averaged over batch
- **Effect:** Learning rate was effectively `n_samples` times too large

**Bug #3: Wrong Gradient Direction**
- **Wrong:** `b = b + learning_rate * db`
- **Correct:** `b = b - learning_rate * db`
- **Why:** Gradient descent moves OPPOSITE to gradient direction
- **Effect:** Parameters moved in wrong direction (increasing loss!)

**Debugging Tips:**
1. Plot loss curves - they should decrease
2. Check gradient signs - ensure moving toward lower loss
3. Verify matrix dimensions match expectations
4. Test on simple toy problems first
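
One way to apply these tips is a gradient check on a tiny toy problem. The sketch below compares the averaged analytic gradient `dW` of softmax regression against central finite differences. The function `loss_and_grads` and the toy data are assumptions made for this example; only the formulas mirror the fixed training loop.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def loss_and_grads(X, y, W, b):
    """Cross-entropy loss plus averaged gradients for softmax regression."""
    p = softmax(X @ W + b)
    loss = -np.mean(np.sum(y * np.log(p + 1e-8), axis=1))
    dz = (p - y) / X.shape[0]                  # average over the batch
    return loss, X.T @ dz, np.sum(dz, axis=0, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))                # 6 samples, 4 features
y = np.eye(3)[rng.integers(0, 3, size=6)]      # one-hot labels, 3 classes
W = rng.standard_normal((4, 3)) * 0.1
b = np.zeros((1, 3))

_, dW, db = loss_and_grads(X, y, W, b)

# Central finite differences, one weight at a time.
eps = 1e-5
dW_num = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    loss_plus = loss_and_grads(X, y, W_plus, b)[0]
    loss_minus = loss_and_grads(X, y, W_minus, b)[0]
    dW_num[idx] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(dW - dW_num))
print(f"Largest difference between analytic and numerical dW: {max_diff:.2e}")
```

If the division by the batch size were missing, the analytic gradient would be 6 times the numerical one here, which this check would flag immediately.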

---

## Congratulations!

You've completed all the core solutions. The remaining exercises (8-10) are more open-ended and designed for experimentation. Here are some starting points:

## Solution 8: Hyperparameter Tuning (Starter Code)

This is an open-ended exercise. Here's a framework to get started:

In [None]:
# Example hyperparameter tuning workflow
print("=== Hyperparameter Tuning Framework ===")
print("\n💡 Tips for Tuning:")
print("1. Start with learning rate (most impactful)")
print("2. Then tune architecture (layer sizes)")
print("3. Finally tune batch size and epochs")
print("4. Use validation set to prevent overfitting")
print("5. Track all experiments in a table/dict")

print("\n📊 Typical Learning Rate Ranges:")
print("  - Too small: 0.0001 (very slow learning)")
print("  - Good range: 0.001 - 0.1")
print("  - Too large: 1.0+ (unstable training)")

print("\n🏗️ Architecture Guidelines:")
print("  - Start simple: [784, 128, 10]")
print("  - Add depth: [784, 256, 128, 10]")
print("  - Wider layers: [784, 512, 10]")
print("  - Balance: More parameters ≠ always better")

# Your experimentation code here...

print("\n🔬 Ready to experiment!")
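
As one possible starting point for the first tip, here is a sketch of a learning-rate sweep scored on a held-out validation split. Everything in it is an assumption for illustration: the synthetic data, the helper names `train_softmax_regression` and `cross_entropy`, and the 200/100 split. It uses a single softmax layer rather than the full MNIST network to stay short.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def cross_entropy(X, y, W, b):
    return -np.mean(np.sum(y * np.log(softmax(X @ W + b) + 1e-8), axis=1))

def train_softmax_regression(X, y, learning_rate, epochs=50, seed=0):
    """Full-batch gradient descent on a single softmax layer."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], y.shape[1])) * 0.01
    b = np.zeros((1, y.shape[1]))
    for _ in range(epochs):
        dz = (softmax(X @ W + b) - y) / X.shape[0]
        W -= learning_rate * (X.T @ dz)
        b -= learning_rate * np.sum(dz, axis=0, keepdims=True)
    return W, b

# Synthetic 3-class data whose labels come from a random linear model.
rng = np.random.default_rng(42)
X = rng.standard_normal((300, 10))
labels = np.argmax(X @ rng.standard_normal((10, 3)), axis=1)
y = np.eye(3)[labels]
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Track every experiment in a dict, keyed by learning rate.
results = {}
for lr in [0.0001, 0.001, 0.01, 0.1]:
    W, b = train_softmax_regression(X_train, y_train, learning_rate=lr)
    results[lr] = cross_entropy(X_val, y_val, W, b)
    print(f"lr={lr:<7} validation loss={results[lr]:.4f}")

best_lr = min(results, key=results.get)
print(f"Lowest validation loss at lr={best_lr}")
```

The ranking on this toy problem need not carry over to MNIST or to deeper networks; the point is the workflow of training per setting, scoring on validation data, and recording the results.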

## Solution 9: PyTorch Implementation (Starter)

Here's a PyTorch model definition to compare with your NumPy version:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

class MNISTNet(nn.Module):
    """PyTorch implementation of MNIST classifier."""
    
    def __init__(self):
        super(MNISTNet, self).__init__()
        # Define layers
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        # Forward pass with ReLU activation
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)  # No activation (logits)
        return x

print("✓ PyTorch model defined!")
print("\nKey Differences from NumPy:")
print("1. Automatic differentiation (no manual backprop)")
print("2. GPU acceleration available")
print("3. Built-in optimizers")
print("4. Less code, more abstraction")

---

## Final Thoughts

### What You've Learned

Through these exercises, you've:

1. ✅ **Understood Neurons:** Built single neurons from scratch
2. ✅ **Activation Functions:** Implemented and compared different activations
3. ✅ **Network Architecture:** Composed layers into networks
4. ✅ **Forward Propagation:** Traced information flow
5. ✅ **Loss Functions:** Implemented MAE and compared with MSE
6. ✅ **Backpropagation:** Computed gradients manually
7. ✅ **Debugging:** Fixed broken training code
8. ✅ **Experimentation:** Tuned hyperparameters (open-ended)
9. ✅ **Framework Translation:** Connected NumPy to PyTorch

### Key Takeaways

**Conceptual Understanding:**
- Neural networks are just compositions of simple functions
- Training is optimization: find parameters that minimize loss
- Backpropagation is just the chain rule applied systematically

**Practical Skills:**
- Matrix operations are the core of neural networks
- Activation functions introduce non-linearity (crucial!)
- Hyperparameters significantly impact performance
- Debugging requires understanding fundamentals

**Why Build from Scratch?**
- Frameworks like PyTorch are "magic" until you know what's inside
- Debugging is easier when you understand the internals
- Helps you make better architectural choices
- Enables you to implement custom components

### Next Steps

**Continue Learning:**
1. 📚 Study advanced architectures (CNNs, RNNs, Transformers)
2. 🔬 Read research papers and implement them
3. 🏗️ Build projects that solve real problems
4. 🤝 Contribute to open-source ML projects

**Practice Suggestions:**
- Implement more activation functions (Swish, GELU)
- Add regularization (L2, dropout)
- Implement advanced optimizers (Adam, RMSprop)
- Try different datasets (Fashion-MNIST, CIFAR-10)

**Resources:**
- Deep Learning Book (Goodfellow et al.)
- Fast.ai courses
- PyTorch tutorials
- Papers with Code

### Remember

> "The best way to learn is by doing. The second best way is by teaching."

Share what you've learned with others. Explaining concepts solidifies your understanding.

**Keep experimenting, stay curious, and happy learning! 🚀**

---

*If you found these exercises helpful, consider creating your own variations or contributing improvements to help other learners!*