# Neural Network from Scratch

**Author:** Anik Tahabilder  
**Project:** 10 of 22 - Kaggle ML Portfolio  
**Dataset:** MNIST Handwritten Digits (Kaggle: digit-recognizer)  
**Difficulty:** 8/10 | **Learning Value:** 10/10

---

## What Will You Learn?

This notebook builds a **Neural Network completely from scratch** using only NumPy. No TensorFlow, no PyTorch - just pure math and Python!

| Concept | What You'll Understand |
|---------|------------------------|
| **Neurons & Perceptrons** | The basic building block of neural networks |
| **Activation Functions** | Sigmoid, ReLU, Tanh, Softmax - why we need them |
| **Forward Propagation** | How data flows through the network |
| **Loss Functions** | Measuring how wrong our predictions are |
| **Backpropagation** | The magic algorithm that makes learning possible |
| **Gradient Descent** | How weights are updated to minimize error |
| **Full Implementation** | A working neural network class from scratch |

### Why Build from Scratch?

```
Using TensorFlow without understanding backprop is like
driving a car without knowing how engines work.
You can do it, but you'll never truly master it.
```

### Dataset: MNIST Handwritten Digits

| Info | Value |
|------|-------|
| **Kaggle** | digit-recognizer competition |
| **Task** | Classify digits 0-9 |
| **Samples** | 42,000 training images |
| **Features** | 784 pixels (28x28) |
| **Classes** | 10 |

---

## Table of Contents

1. [Part 1: The Neuron - Building Block of Neural Networks](#part1)
2. [Part 2: Activation Functions](#part2)
3. [Part 3: Neural Network Architecture](#part3)
4. [Part 4: Forward Propagation](#part4)
5. [Part 5: Loss Functions](#part5)
6. [Part 6: Backpropagation - The Math](#part6)
7. [Part 7: Gradient Descent Optimization](#part7)
8. [Part 8: Building the Neural Network Class](#part8)
9. [Part 9: Training on MNIST](#part9)
10. [Part 10: CNN Concepts Overview](#part10)
11. [Part 11: Summary and Key Takeaways](#part11)

---

<a id='part1'></a>
# Part 1: The Neuron - Building Block of Neural Networks

---

## 1.1 What is a Neuron?

An artificial **neuron** (the perceptron is its classic form) is loosely inspired by biological neurons in the brain.

### Biological vs Artificial Neuron:

| Biological Neuron | Artificial Neuron |
|-------------------|-------------------|
| Dendrites receive signals | Inputs (x1, x2, ..., xn) |
| Cell body processes | Weighted sum + bias |
| Axon transmits output | Output of the activation function |
| Synapses connect neurons | Weights (w1, w2, ..., wn) |

### Mathematical Representation:

```
        x1 ---w1-->\
        x2 ---w2--->\ 
        x3 ---w3---->[ Σ + b ]---> f(z) ---> output
        ...        /
        xn ---wn-->/

Where:
  z = w1*x1 + w2*x2 + ... + wn*xn + b  (weighted sum + bias)
  output = f(z)  (activation function)
```

### In Vector Form:

$$z = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$

$$\text{output} = f(z)$$

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml, make_moons, make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("="*70)
print("NEURAL NETWORK FROM SCRATCH")
print("="*70)
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print("\nNo TensorFlow, No PyTorch - Just Pure NumPy!")

In [None]:
# ============================================================
# SINGLE NEURON (PERCEPTRON) IMPLEMENTATION
# ============================================================
print("="*70)
print("SINGLE NEURON IMPLEMENTATION")
print("="*70)

class Neuron:
    """
    A single neuron (perceptron).
    
    Components:
    - weights: Connection strengths for each input
    - bias: Offset added to the weighted sum (shifts the activation threshold)
    - activation: Function to introduce non-linearity
    """
    
    def __init__(self, n_inputs, activation='sigmoid'):
        # Initialize weights randomly (small values)
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.activation = activation
        
    def _activate(self, z):
        """Apply activation function"""
        if self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'tanh':
            return np.tanh(z)
        else:
            return z  # Linear
    
    def forward(self, inputs):
        """Compute output: f(w*x + b)"""
        # Step 1: Weighted sum
        z = np.dot(self.weights, inputs) + self.bias
        # Step 2: Activation
        output = self._activate(z)
        return output

# Example: Single neuron with 3 inputs
print("\nExample: Single Neuron with 3 inputs")
print("-" * 40)

neuron = Neuron(n_inputs=3, activation='sigmoid')
print(f"Weights: {neuron.weights}")
print(f"Bias: {neuron.bias}")

# Sample input
x = np.array([1.0, 2.0, 3.0])
output = neuron.forward(x)

print(f"\nInput: {x}")
print(f"Weighted Sum (z): {np.dot(neuron.weights, x) + neuron.bias:.4f}")
print(f"Output (after sigmoid): {output:.4f}")

print("\n" + "="*70)
print("The neuron takes inputs, multiplies by weights, adds bias,")
print("and applies activation function to produce output!")
print("="*70)

## 1.2 Why Do We Need Activation Functions?

Without activation functions, a neural network collapses into a single **linear model**, no matter how many layers it has!

| Without Activation | With Activation |
|-------------------|------------------|
| Only linear transformations | Can learn non-linear patterns |
| Multiple layers = 1 layer | Each layer adds complexity |
| Can't solve XOR problem | Can approximate any continuous function (Universal Approximation Theorem) |

### Proof: Why Multiple Linear Layers = 1 Layer

If we have two layers without activation:
- Layer 1: $z_1 = W_1 x + b_1$
- Layer 2: $z_2 = W_2 z_1 + b_2 = W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$

This is just another linear transformation! **Activation functions break this linearity.**
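The collapse can be checked numerically. The sketch below uses arbitrary layer sizes and random values (all assumptions for illustration) and compares the two-layer output with the single combined layer $(W_2 W_1) x + (W_2 b_1 + b_2)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation (shapes chosen only for illustration)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))

# Pass x through both layers in sequence
two_layer_out = W2 @ (W1 @ x + b1) + b2

# Collapse both layers into one equivalent linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_out = W_combined @ x + b_combined

layers_match = np.allclose(two_layer_out, one_layer_out)
print("Outputs identical:", layers_match)
```

Inserting any non-linear function between the two layers (for example `np.maximum(0, ...)`) makes the two outputs differ in general, which is the point of an activation function.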

---

<a id='part2'></a>
# Part 2: Activation Functions

---

## 2.1 Common Activation Functions

| Function | Formula | Range | Use Case |
|----------|---------|-------|----------|
| **Sigmoid** | $\sigma(z) = \frac{1}{1+e^{-z}}$ | (0, 1) | Binary classification output |
| **Tanh** | $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | (-1, 1) | Hidden layers (centered) |
| **ReLU** | $\max(0, z)$ | [0, ∞) | Hidden layers (most popular) |
| **Leaky ReLU** | $\max(0.01z, z)$ | (-∞, ∞) | Prevents dying ReLU |
| **Softmax** | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | (0, 1), sum=1 | Multi-class output |
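One practical detail of the softmax formula: adding the same constant to every $z_i$ leaves the result unchanged, because the constant cancels in the ratio. Implementations commonly exploit this by subtracting $\max(z)$ to avoid overflow. A minimal sketch (the helper names `softmax_naive` and `softmax_stable` are illustrative, not part of this notebook's classes):

```python
import numpy as np

def softmax_naive(z):
    """Direct translation of the formula; overflows for large z."""
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    """Same formula after subtracting max(z), which cancels in the ratio."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_small = np.array([2.0, 1.0, 0.1])
z_large = z_small + 1000.0  # same differences, huge magnitudes

p_small = softmax_stable(z_small)
p_large = softmax_stable(z_large)

# exp(1000) overflows to inf, so the naive version returns nan
with np.errstate(over="ignore", invalid="ignore"):
    p_naive_large = softmax_naive(z_large)

print("stable, small z:", p_small.round(3))
print("shift-invariant:", np.allclose(p_small, p_large))
print("naive has nan  :", np.isnan(p_naive_large).any())
```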

In [None]:
# ============================================================
# ACTIVATION FUNCTIONS IMPLEMENTATION
# ============================================================
print("="*70)
print("ACTIVATION FUNCTIONS")
print("="*70)

class ActivationFunctions:
    """
    Collection of activation functions and their derivatives.
    Derivatives are needed for backpropagation!
    """
    
    # ========== SIGMOID ==========
    @staticmethod
    def sigmoid(z):
        """Sigmoid: squashes values to (0, 1)"""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    @staticmethod
    def sigmoid_derivative(z):
        """Derivative of sigmoid: σ(z) * (1 - σ(z))"""
        s = ActivationFunctions.sigmoid(z)
        return s * (1 - s)
    
    # ========== TANH ==========
    @staticmethod
    def tanh(z):
        """Tanh: squashes values to (-1, 1)"""
        return np.tanh(z)
    
    @staticmethod
    def tanh_derivative(z):
        """Derivative of tanh: 1 - tanh²(z)"""
        return 1 - np.tanh(z) ** 2
    
    # ========== ReLU ==========
    @staticmethod
    def relu(z):
        """ReLU: max(0, z)"""
        return np.maximum(0, z)
    
    @staticmethod
    def relu_derivative(z):
        """Derivative of ReLU: 1 if z > 0, else 0"""
        return (z > 0).astype(float)
    
    # ========== LEAKY ReLU ==========
    @staticmethod
    def leaky_relu(z, alpha=0.01):
        """Leaky ReLU: max(αz, z)"""
        return np.where(z > 0, z, alpha * z)
    
    @staticmethod
    def leaky_relu_derivative(z, alpha=0.01):
        """Derivative of Leaky ReLU"""
        return np.where(z > 0, 1, alpha)
    
    # ========== SOFTMAX ==========
    @staticmethod
    def softmax(z):
        """Softmax: converts to probability distribution"""
        # Subtract max for numerical stability
        exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Create shorthand
AF = ActivationFunctions

print("\nActivation functions implemented:")
print("  - Sigmoid (and derivative)")
print("  - Tanh (and derivative)")
print("  - ReLU (and derivative)")
print("  - Leaky ReLU (and derivative)")
print("  - Softmax")

In [None]:
# Visualize activation functions
print("="*70)
print("VISUALIZING ACTIVATION FUNCTIONS")
print("="*70)

z = np.linspace(-5, 5, 200)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 1. Sigmoid
ax = axes[0, 0]
ax.plot(z, AF.sigmoid(z), 'b-', linewidth=2, label='Sigmoid')
ax.plot(z, AF.sigmoid_derivative(z), 'r--', linewidth=2, label='Derivative')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axhline(y=0.5, color='gray', linewidth=0.5, linestyle=':')
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Sigmoid: σ(z) = 1/(1+e^{-z})', fontweight='bold')
ax.set_xlabel('z')
ax.set_ylabel('Output')
ax.legend()
ax.set_ylim(-0.5, 1.5)

# 2. Tanh
ax = axes[0, 1]
ax.plot(z, AF.tanh(z), 'b-', linewidth=2, label='Tanh')
ax.plot(z, AF.tanh_derivative(z), 'r--', linewidth=2, label='Derivative')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Tanh: (e^z - e^{-z})/(e^z + e^{-z})', fontweight='bold')
ax.set_xlabel('z')
ax.legend()
ax.set_ylim(-1.5, 1.5)

# 3. ReLU
ax = axes[0, 2]
ax.plot(z, AF.relu(z), 'b-', linewidth=2, label='ReLU')
ax.plot(z, AF.relu_derivative(z), 'r--', linewidth=2, label='Derivative')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('ReLU: max(0, z)', fontweight='bold')
ax.set_xlabel('z')
ax.legend()
ax.set_ylim(-1, 5)

# 4. Leaky ReLU
ax = axes[1, 0]
ax.plot(z, AF.leaky_relu(z), 'b-', linewidth=2, label='Leaky ReLU')
ax.plot(z, AF.leaky_relu_derivative(z), 'r--', linewidth=2, label='Derivative')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Leaky ReLU: max(0.01z, z)', fontweight='bold')
ax.set_xlabel('z')
ax.legend()
ax.set_ylim(-1, 5)

# 5. Comparison
ax = axes[1, 1]
ax.plot(z, AF.sigmoid(z), linewidth=2, label='Sigmoid')
ax.plot(z, AF.tanh(z), linewidth=2, label='Tanh')
ax.plot(z, AF.relu(z), linewidth=2, label='ReLU')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Comparison of Activation Functions', fontweight='bold')
ax.set_xlabel('z')
ax.legend()
ax.set_ylim(-1.5, 5)

# 6. Softmax example
ax = axes[1, 2]
z_example = np.array([2.0, 1.0, 0.1])
softmax_out = AF.softmax(z_example)
bars = ax.bar(['Class 0', 'Class 1', 'Class 2'], softmax_out, 
              color=['steelblue', 'lightblue', 'lightgray'], edgecolor='black')
ax.set_title(f'Softmax: z={list(z_example)}', fontweight='bold')
ax.set_ylabel('Probability')
for bar, val in zip(bars, softmax_out):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
            f'{val:.3f}', ha='center', fontweight='bold')
ax.set_ylim(0, 1)
ax.axhline(y=1/3, color='red', linestyle='--', alpha=0.5, label='Equal prob')

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("  - Sigmoid: Output (0,1), derivative max at z=0")
print("  - Tanh: Output (-1,1), zero-centered, stronger gradients")
print("  - ReLU: Fast computation, can 'die' (output 0 forever)")
print("  - Softmax: Outputs sum to 1 (probability distribution)")

## 2.2 When to Use Which Activation?

| Layer Type | Recommended Activation | Why |
|------------|----------------------|-----|
| **Hidden Layers** | ReLU | Fast, mitigates vanishing gradients |
| **Hidden (if ReLU dies)** | Leaky ReLU, ELU | Prevents dead neurons |
| **Binary Output** | Sigmoid | Output is probability (0-1) |
| **Multi-class Output** | Softmax | Outputs are probabilities summing to 1 |
| **Regression Output** | Linear (none) | No squashing needed |

### The Vanishing Gradient Problem

| Activation | Max Derivative | Problem |
|------------|---------------|----------|
| Sigmoid | 0.25 (at z=0) | Gradients shrink exponentially in deep networks |
| Tanh | 1.0 (at z=0) | Better, but still shrinks away from z=0 |
| ReLU | 1.0 (for z>0) | No shrinking for active units, but neurons can "die" |

**ReLU largely mitigates the vanishing gradient problem**, which helped make deep networks practical to train!
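To get a feel for the numbers in the table, the sketch below multiplies the best-case sigmoid derivative (0.25) across several layers. It deliberately ignores the weight matrices, which is a simplifying assumption made to isolate the effect of the activation:

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The derivative peaks at z = 0
max_sigmoid_grad = sigmoid_derivative(0.0)

# Upper bound on the activation's contribution to the gradient
# after backpropagating through `depth` sigmoid layers
for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  sigmoid factor <= {max_sigmoid_grad ** depth:.2e}")

shrink_10 = max_sigmoid_grad ** 10
```

For an active ReLU unit the corresponding factor is 1 at every layer, so this particular source of shrinkage disappears.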

---

<a id='part3'></a>
# Part 3: Neural Network Architecture

---

## 3.1 Layers of a Neural Network

```
    INPUT LAYER          HIDDEN LAYERS           OUTPUT LAYER
    (features)           (learned repr.)         (predictions)
    
        x1  ----\        [h1]                        /---- y1
                 \       [h2] ---- [h5]             /
        x2  ------\      [h3] ---- [h6]            /------ y2
        ...        >---- [h4] ---- [h7] ----------<        ...
        xn  ------/                                \------ yk
    
    Layer 0             Layer 1    Layer 2          Layer 3
    (n neurons)         (4 neurons)(3 neurons)      (k neurons)
    
    Every neuron connects to every neuron in the next layer
    (only a few connections are drawn).
```

### Layer Definitions:

| Layer | Purpose | Activation |
|-------|---------|------------|
| **Input** | Receives raw features | None |
| **Hidden** | Learns representations | ReLU, Tanh |
| **Output** | Makes predictions | Sigmoid/Softmax |

### Notation:

| Symbol | Meaning |
|--------|---------|  
| $L$ | Number of layers (not counting input) |
| $n^{[l]}$ | Number of neurons in layer $l$ |
| $W^{[l]}$ | Weight matrix for layer $l$, shape $(n^{[l]}, n^{[l-1]})$ |
| $b^{[l]}$ | Bias vector for layer $l$, shape $(n^{[l]}, 1)$ |
| $a^{[l]}$ | Activations of layer $l$ |
| $z^{[l]}$ | Pre-activation (before applying activation function) |

In [None]:
# ============================================================
# VISUALIZE NEURAL NETWORK ARCHITECTURE
# ============================================================
print("="*70)
print("NEURAL NETWORK ARCHITECTURE VISUALIZATION")
print("="*70)

def draw_neural_network(layer_sizes, ax, title="Neural Network"):
    """
    Draw a neural network diagram.
    layer_sizes: list of neurons per layer [input, hidden1, ..., output]
    """
    n_layers = len(layer_sizes)
    v_spacing = 1.0
    h_spacing = 2.0
    
    # Calculate positions
    layer_positions = []
    for i, size in enumerate(layer_sizes):
        x = i * h_spacing
        y_start = (max(layer_sizes) - size) * v_spacing / 2
        positions = [(x, y_start + j * v_spacing) for j in range(size)]
        layer_positions.append(positions)
    
    # Draw connections
    for i in range(n_layers - 1):
        for pos1 in layer_positions[i]:
            for pos2 in layer_positions[i + 1]:
                ax.plot([pos1[0], pos2[0]], [pos1[1], pos2[1]], 
                       'b-', alpha=0.3, linewidth=0.5)
    
    # Draw neurons
    input_color, hidden_color, output_color = '#FFD700', '#87CEEB', '#FFB6C1'  # Gold, Sky Blue, Pink
    
    for i, positions in enumerate(layer_positions):
        # Pick the color by role so every hidden layer shares one color
        if i == 0:
            color = input_color
        elif i == n_layers - 1:
            color = output_color
        else:
            color = hidden_color
        for pos in positions:
            # Set face and edge separately: passing color= would override the black edge
            circle = plt.Circle(pos, 0.3, facecolor=color, edgecolor='black', linewidth=2, zorder=10)
            ax.add_patch(circle)
        
        # Label
        label = 'Input' if i == 0 else ('Output' if i == n_layers - 1 else f'Hidden {i}')
        ax.text(i * h_spacing, -1.5, f'{label}\n({layer_sizes[i]} neurons)', 
               ha='center', fontsize=10, fontweight='bold')
    
    ax.set_xlim(-1, (n_layers - 1) * h_spacing + 1)
    ax.set_ylim(-2.5, max(layer_sizes) * v_spacing)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(title, fontsize=14, fontweight='bold')

# Example architectures
fig, axes = plt.subplots(1, 3, figsize=(16, 6))

# Simple network
draw_neural_network([3, 4, 2], axes[0], "Simple: [3, 4, 2]")

# Medium network
draw_neural_network([4, 5, 5, 3], axes[1], "Medium: [4, 5, 5, 3]")

# Deep network
draw_neural_network([5, 4, 4, 4, 2], axes[2], "Deep: [5, 4, 4, 4, 2]")

plt.tight_layout()
plt.show()

print("\nArchitecture Notation:")
print("  [3, 4, 2] means:")
print("    - 3 input features")
print("    - 1 hidden layer with 4 neurons")
print("    - 2 output neurons (e.g., binary classification)")

## 3.2 Weight Matrices and Dimensions

For a network with architecture `[3, 4, 2]`:

| Layer | Weight Shape | Bias Shape | Total Parameters |
|-------|--------------|------------|------------------|
| 1 (Hidden) | (4, 3) | (4, 1) | 4×3 + 4 = 16 |
| 2 (Output) | (2, 4) | (2, 1) | 2×4 + 2 = 10 |
| **Total** | | | **26 parameters** |

### Why These Shapes?

- $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$
- This allows: $z^{[l]} = W^{[l]} \cdot a^{[l-1]} + b^{[l]}$
- Where $a^{[l-1]}$ has shape $(n^{[l-1]}, m)$ for $m$ samples
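The shape rules and the parameter count for `[3, 4, 2]` can be verified with a few lines of NumPy. This is a sketch with zero-filled arrays and an arbitrary batch size `m = 5`; only the shapes matter here:

```python
import numpy as np

layer_sizes = [3, 4, 2]
m = 5  # number of samples (arbitrary choice)

A_prev = np.zeros((layer_sizes[0], m))
total_params = 0
shapes = []

for l in range(1, len(layer_sizes)):
    W = np.zeros((layer_sizes[l], layer_sizes[l - 1]))
    b = np.zeros((layer_sizes[l], 1))
    Z = W @ A_prev + b  # b broadcasts across the m sample columns
    shapes.append((W.shape, b.shape, Z.shape))
    total_params += W.size + b.size
    print(f"Layer {l}: W {W.shape} @ A_prev {A_prev.shape} + b {b.shape} -> Z {Z.shape}")
    A_prev = Z

print("Total parameters:", total_params)
```

Storing `b` as a column vector of shape $(n^{[l]}, 1)$ is what lets NumPy broadcasting add the same bias to every sample.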

In [None]:
# ============================================================
# WEIGHT INITIALIZATION
# ============================================================
print("="*70)
print("WEIGHT INITIALIZATION")
print("="*70)

def initialize_parameters(layer_sizes, init_method='he'):
    """
    Initialize weights and biases for all layers.
    
    Parameters:
    - layer_sizes: list like [input_size, hidden1_size, ..., output_size]
    - init_method: 'random', 'xavier', 'he'
    
    Returns:
    - parameters: dict with W1, b1, W2, b2, ...
    """
    parameters = {}
    L = len(layer_sizes) - 1  # Number of layers (not counting input)
    
    for l in range(1, L + 1):
        n_current = layer_sizes[l]
        n_prev = layer_sizes[l - 1]
        
        if init_method == 'random':
            # Small random values
            W = np.random.randn(n_current, n_prev) * 0.01
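        elif init_method == 'xavier':
            # Xavier initialization, fan-in variant (good for sigmoid/tanh)
            W = np.random.randn(n_current, n_prev) * np.sqrt(1 / n_prev)
        elif init_method == 'he':
            # He initialization (good for ReLU)
            W = np.random.randn(n_current, n_prev) * np.sqrt(2 / n_prev)
        else:
            # Fail loudly instead of hitting a NameError on W below
            raise ValueError(f"Unknown init_method: {init_method!r}")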
        
        b = np.zeros((n_current, 1))
        
        parameters[f'W{l}'] = W
        parameters[f'b{l}'] = b
    
    return parameters

# Example
layer_sizes = [784, 128, 64, 10]  # MNIST example
params = initialize_parameters(layer_sizes, init_method='he')

print(f"\nNetwork Architecture: {layer_sizes}")
print(f"\nInitialized Parameters:")
print("-" * 50)

total_params = 0
for l in range(1, len(layer_sizes)):
    W_shape = params[f'W{l}'].shape
    b_shape = params[f'b{l}'].shape
    layer_params = W_shape[0] * W_shape[1] + b_shape[0]
    total_params += layer_params
    print(f"Layer {l}: W{l} shape = {W_shape}, b{l} shape = {b_shape}, params = {layer_params:,}")

print("-" * 50)
print(f"Total Parameters: {total_params:,}")

print("\n" + "="*70)
print("INITIALIZATION METHODS:")
print("="*70)
print("""
| Method | Formula | Best For |
|--------|---------|----------|
| Random | W * 0.01 | Simple cases |
| Xavier | W * sqrt(1/n_prev) | Sigmoid, Tanh |
| He     | W * sqrt(2/n_prev) | ReLU |

Why initialization matters:
- Too large: Exploding gradients, activations saturate
- Too small: Vanishing gradients, slow learning
- Just right: Stable training!
""")

---

<a id='part4'></a>
# Part 4: Forward Propagation

---

## 4.1 What is Forward Propagation?

Forward propagation is the process of **passing input through the network** to get output.

### Algorithm:

```
For each layer l = 1, 2, ..., L:
    1. Compute linear combination:  z[l] = W[l] · a[l-1] + b[l]
    2. Apply activation:            a[l] = g[l](z[l])
    
Where:
    - a[0] = X (input)
    - a[L] = ŷ (output/prediction)
    - g[l] is the activation function for layer l
```

### Visual Representation:

```
Input (X)                                          Output (ŷ)
    ↓                                                  ↑
 a[0] = X                                          a[L] = ŷ
    ↓                                                  ↑
 z[1] = W[1]·a[0] + b[1]  →  a[1] = g(z[1])  →  ... →  a[L]
```

In [None]:
# ============================================================
# FORWARD PROPAGATION IMPLEMENTATION
# ============================================================
print("="*70)
print("FORWARD PROPAGATION")
print("="*70)

def forward_propagation(X, parameters, activations):
    """
    Perform forward propagation through the network.
    
    Parameters:
    - X: Input data, shape (n_features, n_samples)
    - parameters: dict with W1, b1, W2, b2, ...
    - activations: list of activation functions per layer
    
    Returns:
    - A_final: Output predictions
    - cache: dict storing all intermediate values (needed for backprop)
    """
    cache = {'A0': X}  # Store input
    A = X
    L = len(activations)  # Number of layers
    
    for l in range(1, L + 1):
        A_prev = A
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        
        # Step 1: Linear transformation
        Z = np.dot(W, A_prev) + b
        
        # Step 2: Activation
        activation = activations[l - 1]
        if activation == 'sigmoid':
            A = AF.sigmoid(Z)
        elif activation == 'relu':
            A = AF.relu(Z)
        elif activation == 'tanh':
            A = AF.tanh(Z)
        elif activation == 'softmax':
            A = AF.softmax(Z.T).T  # Softmax along correct axis
        else:
            A = Z  # Linear
        
        # Store in cache for backpropagation
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    
    return A, cache

# Example: Forward pass
print("\nExample Forward Propagation:")
print("-" * 50)

# Simple network: 3 inputs -> 4 hidden -> 2 outputs
layer_sizes = [3, 4, 2]
params = initialize_parameters(layer_sizes, 'he')
activations = ['relu', 'sigmoid']  # ReLU for hidden, Sigmoid for output

# Sample input (3 features, 5 samples)
X = np.random.randn(3, 5)

print(f"Input shape: {X.shape}")
print(f"Network: {layer_sizes}")
print(f"Activations: {activations}")

# Forward pass
output, cache = forward_propagation(X, params, activations)

print(f"\nLayer-by-layer shapes:")
print(f"  Input (A0): {cache['A0'].shape}")
print(f"  Z1: {cache['Z1'].shape} -> A1: {cache['A1'].shape}")
print(f"  Z2: {cache['Z2'].shape} -> A2: {cache['A2'].shape}")

print(f"\nOutput (predictions):")
print(output.round(3))

---

<a id='part5'></a>
# Part 5: Loss Functions

---

## 5.1 What is a Loss Function?

A **loss function** (or cost function) measures **how wrong** our predictions are.

| Task | Loss Function | Formula |
|------|--------------|--------|
| **Binary Classification** | Binary Cross-Entropy | $-\frac{1}{m}\sum[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$ |
| **Multi-class Classification** | Categorical Cross-Entropy | $-\frac{1}{m}\sum\sum y_k\log(\hat{y}_k)$ |
| **Regression** | Mean Squared Error | $\frac{1}{m}\sum(y - \hat{y})^2$ |

### Goal of Training:

**Minimize the loss function** by adjusting weights and biases!
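As a worked example of the categorical cross-entropy row, the sketch below uses three made-up samples with one-hot labels in the `(n_classes, n_samples)` layout. Because the labels are one-hot, only the probability assigned to the true class contributes to the loss:

```python
import numpy as np

# One-hot labels, shape (n_classes, n_samples): true classes are 0, 2, 1
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0]])

# Made-up predicted probabilities; each column sums to 1
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.2, 0.3, 0.8],
                   [0.1, 0.5, 0.1]])

m = y_true.shape[1]

# Probability the model assigned to the correct class of each sample
p_correct = np.sum(y_true * y_pred, axis=0)

# Categorical cross-entropy: average of -log(p_correct)
loss = -np.sum(y_true * np.log(y_pred)) / m

print("P(correct class):", p_correct)
print(f"Loss: {loss:.4f}")
```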

In [None]:
# ============================================================
# LOSS FUNCTIONS IMPLEMENTATION
# ============================================================
print("="*70)
print("LOSS FUNCTIONS")
print("="*70)

class LossFunctions:
    """
    Collection of loss functions and their derivatives.
    """
    
    @staticmethod
    def binary_crossentropy(y_true, y_pred, epsilon=1e-15):
        """
        Binary Cross-Entropy Loss.
        Used for binary classification with sigmoid output.
        
        Formula: -1/m * sum(y*log(ŷ) + (1-y)*log(1-ŷ))
        """
        # Clip predictions to prevent log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        m = y_true.shape[1]  # Number of samples
        
        loss = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / m
        return loss
    
    @staticmethod
    def categorical_crossentropy(y_true, y_pred, epsilon=1e-15):
        """
        Categorical Cross-Entropy Loss.
        Used for multi-class classification with softmax output.
        
        Formula: -1/m * sum(sum(y_k * log(ŷ_k)))
        """
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        m = y_true.shape[1]
        
        loss = -np.sum(y_true * np.log(y_pred)) / m
        return loss
    
    @staticmethod
    def mse(y_true, y_pred):
        """
        Mean Squared Error.
        Used for regression.
        
        Formula: 1/m * sum((y - ŷ)²)
        """
        m = y_true.shape[1]
        loss = np.sum((y_true - y_pred) ** 2) / m
        return loss

LF = LossFunctions

# Example: Binary Cross-Entropy
print("\nExample: Binary Cross-Entropy")
print("-" * 40)

y_true = np.array([[1, 0, 1, 0, 1]])  # True labels
y_pred_good = np.array([[0.9, 0.1, 0.8, 0.2, 0.9]])  # Good predictions
y_pred_bad = np.array([[0.5, 0.5, 0.5, 0.5, 0.5]])   # Bad predictions

loss_good = LF.binary_crossentropy(y_true, y_pred_good)
loss_bad = LF.binary_crossentropy(y_true, y_pred_bad)

print(f"True labels:      {y_true[0]}")
print(f"Good predictions: {y_pred_good[0]} -> Loss: {loss_good:.4f}")
print(f"Bad predictions:  {y_pred_bad[0]} -> Loss: {loss_bad:.4f}")
print(f"\nLower loss = better predictions!")

In [None]:
# Visualize loss landscape
print("="*70)
print("VISUALIZING LOSS FUNCTIONS")
print("="*70)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Binary Cross-Entropy for y=1 and y=0
y_pred_range = np.linspace(0.001, 0.999, 100)
loss_y1 = -np.log(y_pred_range)  # When y=1
loss_y0 = -np.log(1 - y_pred_range)  # When y=0

axes[0].plot(y_pred_range, loss_y1, 'b-', linewidth=2, label='y=1: -log(ŷ)')
axes[0].plot(y_pred_range, loss_y0, 'r-', linewidth=2, label='y=0: -log(1-ŷ)')
axes[0].set_xlabel('Predicted Probability (ŷ)')
axes[0].set_ylabel('Loss')
axes[0].set_title('Binary Cross-Entropy Loss', fontweight='bold')
axes[0].legend()
axes[0].set_ylim(0, 5)
axes[0].axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)

# 2. MSE Loss
y_true_val = 1.0
y_pred_range = np.linspace(-1, 3, 100)
mse_loss = (y_true_val - y_pred_range) ** 2

axes[1].plot(y_pred_range, mse_loss, 'g-', linewidth=2)
axes[1].scatter([y_true_val], [0], color='red', s=100, zorder=5, label='True value')
axes[1].set_xlabel('Predicted Value (ŷ)')
axes[1].set_ylabel('Loss')
axes[1].set_title('Mean Squared Error (y=1)', fontweight='bold')
axes[1].legend()

# 3. Loss surface (2D)
w1_range = np.linspace(-3, 3, 100)
w2_range = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1_range, w2_range)

# Simple quadratic loss surface
Loss = (W1 - 1)**2 + (W2 + 0.5)**2

contour = axes[2].contour(W1, W2, Loss, levels=20, cmap='viridis')
axes[2].scatter([1], [-0.5], color='red', s=100, zorder=5, marker='*', label='Minimum')
axes[2].set_xlabel('Weight 1')
axes[2].set_ylabel('Weight 2')
axes[2].set_title('Loss Surface (Gradient Descent Target)', fontweight='bold')
axes[2].legend()
plt.colorbar(contour, ax=axes[2], label='Loss')

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("  - BCE heavily penalizes confident wrong predictions")
print("  - MSE has a smooth parabolic shape (easy to optimize)")
print("  - Gradient descent finds the minimum of the loss surface")

---

<a id='part6'></a>
# Part 6: Backpropagation - The Math

---

## 6.1 What is Backpropagation?

**Backpropagation** is the algorithm that computes gradients (derivatives) of the loss with respect to each weight.

### The Big Picture:

```
Forward Pass:  X → Layer 1 → Layer 2 → ... → ŷ → Loss
Backward Pass: X ← Layer 1 ← Layer 2 ← ... ← ŷ ← Loss
                   (compute gradients going backward)
```

### Chain Rule - The Foundation:

If $L = f(g(h(x)))$, then:

$$\frac{dL}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$

This is how we propagate gradients backward through layers!
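A quick numerical check makes the chain rule concrete. The three functions below are hypothetical choices for illustration only; the product of their local derivatives is compared with a finite-difference estimate of the overall derivative:

```python
import numpy as np

# Hypothetical inner functions, chosen only for illustration
def h(x):
    return 2.0 * x + 1.0

def g(u):
    return u ** 2

def f(v):
    return np.log(v)

def L(x):
    return f(g(h(x)))

x0 = 1.0
u0 = h(x0)  # value flowing out of h
v0 = g(u0)  # value flowing out of g

# Local derivatives, multiplied from the outermost function inward
df_dg = 1.0 / v0
dg_dh = 2.0 * u0
dh_dx = 2.0
analytic = df_dg * dg_dh * dh_dx

# Central finite difference on the whole composition
eps = 1e-6
numeric = (L(x0 + eps) - L(x0 - eps)) / (2 * eps)

print(f"Chain rule: {analytic:.6f}")
print(f"Numerical : {numeric:.6f}")
```

Backpropagation does the same thing layer by layer: each layer contributes one local derivative, and the factors are multiplied starting from the loss.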

## 6.2 Backpropagation Equations

For layer $l$:

| What We Compute | Formula | Purpose |
|-----------------|---------|--------|
| $dZ^{[l]}$ | $dA^{[l]} \odot g'^{[l]}(Z^{[l]})$ | Gradient of pre-activation |
| $dW^{[l]}$ | $\frac{1}{m} dZ^{[l]} \cdot A^{[l-1]T}$ | Gradient for weights |
| $db^{[l]}$ | $\frac{1}{m} \sum dZ^{[l]}$ (summed over samples) | Gradient for biases |
| $dA^{[l-1]}$ | $W^{[l]T} \cdot dZ^{[l]}$ | Pass gradient to previous layer |

Where $g'$ is the derivative of the activation function, $\odot$ is element-wise multiplication, and $m$ is the number of samples.
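A common way to gain confidence in these formulas is a gradient check: compute the gradient analytically, then compare it with a finite-difference estimate. The sketch below assumes a tiny two-layer network (tanh hidden layer, sigmoid output, binary cross-entropy) with made-up sizes and random data. It is a standalone illustration, not the notebook's own implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, m = 3, 4, 1, 6  # sizes chosen only for illustration

X = rng.normal(size=(n_in, m))
Y = rng.integers(0, 2, size=(n_out, m)).astype(float)
W1 = rng.normal(size=(n_hidden, n_in))
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(size=(n_out, n_hidden))
b2 = np.zeros((n_out, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1_):
    """Forward pass; returns hidden activations, output, and BCE loss."""
    A1 = np.tanh(W1_ @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    loss = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
    return A1, A2, loss

# Analytic gradients, following the four table rows
A1, A2, _ = forward(W1)
dZ2 = A2 - Y                              # sigmoid + cross-entropy shortcut
dW2 = dZ2 @ A1.T / m                      # gradient for output weights
db2 = dZ2.sum(axis=1, keepdims=True) / m  # gradient for output biases
dA1 = W2.T @ dZ2                          # pass gradient to previous layer
dZ1 = dA1 * (1 - A1 ** 2)                 # element-wise product with tanh'(Z1)
dW1 = dZ1 @ X.T / m

# Numerical gradient for each entry of W1 (central differences)
eps = 1e-6
dW1_numeric = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    W_plus, W_minus = W1.copy(), W1.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    dW1_numeric[idx] = (forward(W_plus)[2] - forward(W_minus)[2]) / (2 * eps)

max_abs_diff = np.max(np.abs(dW1 - dW1_numeric))
print(f"Max |analytic - numeric| for dW1: {max_abs_diff:.2e}")
```

A tanh hidden layer is used here because it is smooth everywhere; with ReLU the finite-difference estimate can be off for units whose pre-activation sits very close to zero.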

In [None]:
# ============================================================
# BACKPROPAGATION IMPLEMENTATION
# ============================================================
print("="*70)
print("BACKPROPAGATION - THE LEARNING ALGORITHM")
print("="*70)

def backward_propagation(Y, cache, parameters, activations):
    """
    Perform backward propagation to compute gradients.
    
    Parameters:
    - Y: True labels, shape (n_classes, n_samples)
    - cache: dict with Z1, A1, Z2, A2, ... from forward pass
    - parameters: dict with W1, b1, W2, b2, ...
    - activations: list of activation functions per layer
    
    Returns:
    - gradients: dict with dW1, db1, dW2, db2, ...
    """
    gradients = {}
    L = len(activations)  # Number of layers
    m = Y.shape[1]  # Number of samples
    
    # Get final layer output
    AL = cache[f'A{L}']
    
    # ========== OUTPUT LAYER GRADIENT ==========
    # For cross-entropy loss with a sigmoid or softmax output layer: dZ = A - Y
    # This shortcut combines the loss derivative and the activation derivative.
    # It does not hold for other output activations (e.g. a tanh output).
    dZ = AL - Y
    
    # ========== BACKWARD THROUGH EACH LAYER ==========
    for l in reversed(range(1, L + 1)):
        A_prev = cache[f'A{l-1}']
        W = parameters[f'W{l}']
        
        # Compute gradients
        dW = (1/m) * np.dot(dZ, A_prev.T)
        db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
        
        # Store gradients
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
        
        # Compute dA for previous layer (if not input layer)
        if l > 1:
            dA_prev = np.dot(W.T, dZ)
            
            # Apply activation derivative
            Z_prev = cache[f'Z{l-1}']
            activation = activations[l - 2]
            
            if activation == 'sigmoid':
                dZ = dA_prev * AF.sigmoid_derivative(Z_prev)
            elif activation == 'relu':
                dZ = dA_prev * AF.relu_derivative(Z_prev)
            elif activation == 'tanh':
                dZ = dA_prev * AF.tanh_derivative(Z_prev)
            else:
                dZ = dA_prev  # Linear
    
    return gradients

# Example: Backpropagation
print("\nExample: Computing Gradients")
print("-" * 50)

# Simple network
layer_sizes = [3, 4, 2]
params = initialize_parameters(layer_sizes, 'he')
activations = ['relu', 'sigmoid']

# Sample data
X = np.random.randn(3, 5)  # 3 features, 5 samples
Y = np.random.randint(0, 2, (2, 5))  # Independent binary labels (2 outputs, 5 samples)

# Forward pass
output, cache = forward_propagation(X, params, activations)

# Backward pass
grads = backward_propagation(Y, cache, params, activations)

print("Computed gradients:")
for key, value in grads.items():
    print(f"  {key}: shape = {value.shape}, mean = {value.mean():.6f}")

print("\n" + "="*70)
print("These gradients tell us HOW to adjust weights to reduce loss!")
print("="*70)

## 6.3 Visualizing Backpropagation

### The Backward Flow:

```
FORWARD:   X ──→ Z1 ──→ A1 ──→ Z2 ──→ A2 ──→ Loss
                 ↑            ↑            ↑
              W1, b1       W2, b2         (Y)

BACKWARD:  dX ←── dZ1 ←── dA1 ←── dZ2 ←── dA2 ←── dLoss
                 ↓            ↓
             dW1, db1     dW2, db2
```

### Key Intuition:

1. **Output layer**: Error = (prediction - true label)
2. **Propagate backward**: Each layer receives the error signal from the layer after it (the one closer to the output)
3. **Compute gradients**: Use the chain rule at each layer
4. **Update weights**: Move in the opposite direction of the gradient
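
A common way to sanity-check steps 1 to 3 is a numerical gradient check: nudge each parameter slightly, measure how the loss changes, and compare that with the backpropagated gradient. The sketch below is self-contained and does not call the notebook's own functions; the tiny network (3 inputs, 4 ReLU units, 2 softmax outputs), the helper names `forward_loss`, `analytic_grads`, and `numerical_grad`, and the random seed are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny network: 3 inputs -> 4 ReLU units -> 2 softmax outputs
X = rng.normal(size=(3, 5))                  # 3 features, 5 samples
Y = np.eye(2)[:, rng.integers(0, 2, 5)]      # one-hot labels, shape (2, 5)
params = {
    'W1': rng.normal(size=(4, 3)) * 0.5, 'b1': np.zeros((4, 1)),
    'W2': rng.normal(size=(2, 4)) * 0.5, 'b2': np.zeros((2, 1)),
}

def forward_loss(p):
    """Forward pass; returns the mean cross-entropy loss and cached values."""
    Z1 = p['W1'] @ X + p['b1']
    A1 = np.maximum(0, Z1)
    Z2 = p['W2'] @ A1 + p['b2']
    E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
    A2 = E / E.sum(axis=0, keepdims=True)
    loss = -np.sum(Y * np.log(A2)) / X.shape[1]
    return loss, (Z1, A1, A2)

def analytic_grads(p):
    """Backpropagation using the same formulas as in this part."""
    m = X.shape[1]
    _, (Z1, A1, A2) = forward_loss(p)
    dZ2 = A2 - Y                              # output-layer error
    dZ1 = (p['W2'].T @ dZ2) * (Z1 > 0)        # chain rule through ReLU
    return {
        'W2': dZ2 @ A1.T / m, 'b2': dZ2.sum(axis=1, keepdims=True) / m,
        'W1': dZ1 @ X.T / m,  'b1': dZ1.sum(axis=1, keepdims=True) / m,
    }

def numerical_grad(p, key, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    grad = np.zeros_like(p[key])
    for idx in np.ndindex(*p[key].shape):
        old = p[key][idx]
        p[key][idx] = old + eps
        loss_plus, _ = forward_loss(p)
        p[key][idx] = old - eps
        loss_minus, _ = forward_loss(p)
        p[key][idx] = old                     # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

grads = analytic_grads(params)
max_diff = max(np.abs(grads[k] - numerical_grad(params, k)).max() for k in params)
print(f"Largest analytic vs. numerical difference: {max_diff:.2e}")
```

If backpropagation is implemented correctly, the two gradients should agree to many decimal places. A large gap usually points to a wrong derivative or a transposed matrix.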

---

<a id='part7'></a>
# Part 7: Gradient Descent Optimization

---

## 7.1 The Gradient Descent Algorithm

**Gradient Descent** updates weights to minimize the loss function.

### Update Rule:

$$W_{new} = W_{old} - \alpha \cdot \frac{\partial J}{\partial W}$$

$$b_{new} = b_{old} - \alpha \cdot \frac{\partial J}{\partial b}$$

Where:
- $J$ = loss (cost) function
- $\alpha$ = **learning rate** (step size)
- $\frac{\partial J}{\partial W}$, $\frac{\partial J}{\partial b}$ = gradients of the loss with respect to the weights and biases
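
As a minimal worked instance of the update rule, the sketch below applies a single step to a one-parameter loss $J(w) = (w - 3)^2$. The loss, the starting value, and the learning rate are illustrative assumptions, not values used elsewhere in this notebook.

```python
# Hypothetical one-parameter loss J(w) = (w - 3)^2, with dJ/dw = 2 * (w - 3)
def J(w):
    return (w - 3) ** 2

w_old = 0.0
alpha = 0.25                     # learning rate (assumed for illustration)

grad = 2 * (w_old - 3)           # gradient at w_old
w_new = w_old - alpha * grad     # the update rule: step against the gradient

print(f"gradient = {grad}, w: {w_old} -> {w_new}, loss: {J(w_old)} -> {J(w_new)}")
```

The gradient is negative at the starting point, so subtracting it moves `w` to the right, toward the minimum at 3, and the loss drops.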

### Analogy: Hiking Down a Mountain

| Mountain Hiking | Gradient Descent |
|-----------------|------------------|
| You're at some position | Current weights |
| Look for steepest downhill | Compute gradient |
| Take a step downhill | Update weights |
| Repeat until valley | Repeat until minimum loss |

In [None]:
# ============================================================
# GRADIENT DESCENT IMPLEMENTATION
# ============================================================
print("="*70)
print("GRADIENT DESCENT OPTIMIZATION")
print("="*70)

def update_parameters(parameters, gradients, learning_rate):
    """
    Update parameters using gradient descent.
    
    W_new = W_old - learning_rate * dW
    b_new = b_old - learning_rate * db
    """
    L = len(parameters) // 2  # Number of layers
    
    for l in range(1, L + 1):
        parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
        parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']
    
    return parameters

# Visualize gradient descent
print("\nVisualizing Gradient Descent on 2D Loss Surface:")

def loss_function(w1, w2):
    """Simple quadratic loss for visualization"""
    return (w1 - 1)**2 + 2*(w2 + 0.5)**2

def gradient(w1, w2):
    """Gradient of the loss"""
    dw1 = 2 * (w1 - 1)
    dw2 = 4 * (w2 + 0.5)
    return dw1, dw2

# Gradient descent simulation
w1, w2 = -2.0, 2.0  # Starting point
learning_rate = 0.1
path = [(w1, w2)]

for _ in range(50):
    dw1, dw2 = gradient(w1, w2)
    w1 -= learning_rate * dw1
    w2 -= learning_rate * dw2
    path.append((w1, w2))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Loss surface
w1_range = np.linspace(-3, 3, 100)
w2_range = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1_range, w2_range)
Loss = loss_function(W1, W2)

contour = axes[0].contour(W1, W2, Loss, levels=30, cmap='viridis')
path = np.array(path)
axes[0].plot(path[:, 0], path[:, 1], 'r.-', linewidth=2, markersize=8, label='GD Path')
axes[0].scatter([path[0, 0]], [path[0, 1]], color='red', s=100, marker='o', zorder=5, label='Start')
axes[0].scatter([1], [-0.5], color='green', s=200, marker='*', zorder=5, label='Minimum')
axes[0].set_xlabel('Weight 1')
axes[0].set_ylabel('Weight 2')
axes[0].set_title('Gradient Descent Path', fontweight='bold', fontsize=14)
axes[0].legend()
plt.colorbar(contour, ax=axes[0], label='Loss')

# Loss over iterations
losses = [loss_function(p[0], p[1]) for p in path]
axes[1].plot(losses, 'b-', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss')
axes[1].set_title('Loss Decrease Over Iterations', fontweight='bold', fontsize=14)
axes[1].axhline(y=0, color='green', linestyle='--', alpha=0.5, label='Minimum')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\nStarting point: w1={path[0, 0]:.2f}, w2={path[0, 1]:.2f}")
print(f"Final point:    w1={path[-1, 0]:.4f}, w2={path[-1, 1]:.4f}")
print(f"True minimum:   w1=1.0, w2=-0.5")
print(f"Initial loss:   {losses[0]:.4f}")
print(f"Final loss:     {losses[-1]:.6f}")

## 7.2 Learning Rate - The Most Important Hyperparameter

| Learning Rate | Effect |
|---------------|--------|
| **Too Large** | Overshoots minimum, may diverge |
| **Too Small** | Very slow convergence |
| **Just Right** | Smooth convergence to minimum |
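
For a quadratic loss, "too large" can be made precise. With the loss $(w_1 - 1)^2 + 2(w_2 + 0.5)^2$ from Section 7.1, one gradient descent step multiplies each coordinate's distance from the minimum by $|1 - \alpha c|$, where $c$ is the curvature along that coordinate (2 for $w_1$, 4 for $w_2$). The sketch below tabulates these factors for a few learning rates; the variable names are illustrative.

```python
import numpy as np

# Second derivatives of the loss along w1 and w2
curvatures = np.array([2.0, 4.0])

verdicts = {}
for lr in [0.01, 0.1, 0.5, 1.5]:
    # Per-step multiplier of the distance to the minimum, per coordinate
    factors = np.abs(1 - lr * curvatures)
    worst = factors.max()
    if worst < 1:
        verdicts[lr] = "converges"
    elif worst == 1:
        verdicts[lr] = "oscillates without converging"
    else:
        verdicts[lr] = "diverges"
    print(f"lr={lr:<5} factors={factors}  ->  {verdicts[lr]}")
```

For this particular loss, convergence requires $\alpha < 2/4 = 0.5$. At exactly 0.5 the $w_2$ coordinate flips sign around the minimum on every step and never settles.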

In [None]:
# Effect of learning rate
print("="*70)
print("EFFECT OF LEARNING RATE")
print("="*70)

learning_rates = [0.01, 0.1, 0.5, 1.5]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, lr in enumerate(learning_rates):
    ax = axes[idx // 2, idx % 2]
    
    # Gradient descent
    w1, w2 = -2.0, 2.0
    path = [(w1, w2)]
    
    for _ in range(30):
        dw1, dw2 = gradient(w1, w2)
        w1 -= lr * dw1
        w2 -= lr * dw2
        # Clip so a diverging path stays inside the plotted range (this does not make it converge)
        w1, w2 = np.clip(w1, -5, 5), np.clip(w2, -5, 5)
        path.append((w1, w2))
    
    path = np.array(path)
    
    # Plot
    contour = ax.contour(W1, W2, Loss, levels=30, cmap='viridis', alpha=0.7)
    ax.plot(path[:, 0], path[:, 1], 'r.-', linewidth=2, markersize=8)
    ax.scatter([path[0, 0]], [path[0, 1]], color='red', s=100, marker='o', zorder=5)
    ax.scatter([1], [-0.5], color='green', s=200, marker='*', zorder=5)
    ax.set_xlabel('Weight 1')
    ax.set_ylabel('Weight 2')
    
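    # The curvature along w2 is 4, so gradient descent on this loss only converges
    # for lr < 2/4 = 0.5; at exactly 0.5, w2 bounces between 2 and -3 forever.
    if lr < 0.05:
        status = "Too slow"
    elif lr < 0.5:
        status = "Good"
    elif lr == 0.5:
        status = "Oscillating"
    else:
        status = "Diverging!"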
    ax.set_title(f'Learning Rate = {lr} ({status})', fontweight='bold')
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)

plt.tight_layout()
plt.show()

print("\nLearning Rate Guidelines:")
print("  - Start with 0.001 or 0.01")
print("  - If loss decreases smoothly, learning rate is good")
print("  - If loss oscillates wildly, reduce learning rate")
print("  - If loss decreases too slowly, increase learning rate")

---

<a id='part8'></a>
# Part 8: Building the Complete Neural Network Class

---

Now we combine everything into a single, reusable Neural Network class!

In [None]:
# ============================================================
# COMPLETE NEURAL NETWORK CLASS FROM SCRATCH
# ============================================================
print("="*70)
print("BUILDING COMPLETE NEURAL NETWORK FROM SCRATCH")
print("="*70)

class NeuralNetwork:
    """
    A fully-connected neural network built from scratch.
    
    Supports:
    - Multiple hidden layers
    - Different activation functions (relu, sigmoid, tanh, softmax)
    - Binary and multi-class classification
    - He/Xavier weight initialization
    """
    
    def __init__(self, layer_sizes, activations, learning_rate=0.01, init_method='he'):
        """
        Initialize the neural network.
        
        Parameters:
        - layer_sizes: list like [input_size, hidden1, hidden2, ..., output_size]
        - activations: list of activations for each layer (excluding input)
        - learning_rate: step size for gradient descent
        - init_method: 'random', 'xavier', or 'he'
        """
        self.layer_sizes = layer_sizes
        self.activations = activations
        self.learning_rate = learning_rate
        self.parameters = self._initialize_parameters(init_method)
        self.history = {'loss': [], 'accuracy': []}
        
    def _initialize_parameters(self, init_method):
        """Initialize weights and biases."""
        parameters = {}
        L = len(self.layer_sizes) - 1
        
        for l in range(1, L + 1):
            n_current = self.layer_sizes[l]
            n_prev = self.layer_sizes[l - 1]
            
            if init_method == 'he':
                W = np.random.randn(n_current, n_prev) * np.sqrt(2 / n_prev)
            elif init_method == 'xavier':
                # Simplified Xavier scaling: variance 1 / n_prev
                W = np.random.randn(n_current, n_prev) * np.sqrt(1 / n_prev)
            else:
                W = np.random.randn(n_current, n_prev) * 0.01
            
            b = np.zeros((n_current, 1))
            parameters[f'W{l}'] = W
            parameters[f'b{l}'] = b
        
        return parameters
    
    def _activate(self, Z, activation):
        """Apply activation function."""
        if activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
        elif activation == 'relu':
            return np.maximum(0, Z)
        elif activation == 'tanh':
            return np.tanh(Z)
        elif activation == 'softmax':
            exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))
            return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)
        return Z
    
    def _activate_derivative(self, Z, activation):
        """Compute derivative of activation function."""
        if activation == 'sigmoid':
            s = self._activate(Z, 'sigmoid')
            return s * (1 - s)
        elif activation == 'relu':
            return (Z > 0).astype(float)
        elif activation == 'tanh':
            return 1 - np.tanh(Z) ** 2
        return np.ones_like(Z)
    
    def forward(self, X):
        """Forward propagation."""
        self.cache = {'A0': X}
        A = X
        L = len(self.activations)
        
        for l in range(1, L + 1):
            W = self.parameters[f'W{l}']
            b = self.parameters[f'b{l}']
            
            Z = np.dot(W, A) + b
            A = self._activate(Z, self.activations[l - 1])
            
            self.cache[f'Z{l}'] = Z
            self.cache[f'A{l}'] = A
        
        return A
    
    def backward(self, Y):
        """Backward propagation."""
        gradients = {}
        L = len(self.activations)
        m = Y.shape[1]
        
        AL = self.cache[f'A{L}']
        
        # Output layer gradient: dZ = AL - Y holds for sigmoid + binary cross-entropy
        # and for softmax + categorical cross-entropy (not for other output/loss pairs)
        dZ = AL - Y
        
        for l in reversed(range(1, L + 1)):
            A_prev = self.cache[f'A{l-1}']
            W = self.parameters[f'W{l}']
            
            gradients[f'dW{l}'] = (1/m) * np.dot(dZ, A_prev.T)
            gradients[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            
            if l > 1:
                dA_prev = np.dot(W.T, dZ)
                Z_prev = self.cache[f'Z{l-1}']
                dZ = dA_prev * self._activate_derivative(Z_prev, self.activations[l-2])
        
        return gradients
    
    def update_parameters(self, gradients):
        """Update parameters using gradient descent."""
        L = len(self.activations)
        
        for l in range(1, L + 1):
            self.parameters[f'W{l}'] -= self.learning_rate * gradients[f'dW{l}']
            self.parameters[f'b{l}'] -= self.learning_rate * gradients[f'db{l}']
    
    def compute_loss(self, Y, Y_pred):
        """Compute mean categorical cross-entropy loss (expects one-hot Y)."""
        m = Y.shape[1]
        epsilon = 1e-15
        Y_pred = np.clip(Y_pred, epsilon, 1 - epsilon)
        loss = -np.sum(Y * np.log(Y_pred)) / m
        return loss
    
    def fit(self, X, Y, epochs=1000, verbose=True, print_every=100):
        """
        Train the neural network.
        
        Parameters:
        - X: Training data, shape (n_features, n_samples)
        - Y: Labels, shape (n_classes, n_samples)
        - epochs: Number of training iterations
        - verbose: Print progress
        - print_every: Print frequency
        """
        for epoch in range(epochs):
            # Forward propagation
            Y_pred = self.forward(X)
            
            # Compute loss
            loss = self.compute_loss(Y, Y_pred)
            
            # Backward propagation
            gradients = self.backward(Y)
            
            # Update parameters
            self.update_parameters(gradients)
            
            # Compute accuracy
            predictions = np.argmax(Y_pred, axis=0)
            labels = np.argmax(Y, axis=0)
            accuracy = np.mean(predictions == labels)
            
            # Store history
            self.history['loss'].append(loss)
            self.history['accuracy'].append(accuracy)
            
            # Print progress
            if verbose and (epoch % print_every == 0 or epoch == epochs - 1):
                print(f"Epoch {epoch:5d}/{epochs} - Loss: {loss:.4f} - Accuracy: {accuracy:.4f}")
    
    def predict(self, X):
        """Make predictions."""
        Y_pred = self.forward(X)
        return np.argmax(Y_pred, axis=0)
    
    def predict_proba(self, X):
        """Get probability predictions."""
        return self.forward(X)

print("\nNeuralNetwork class created!")
print("\nFeatures:")
print("  - Custom architecture (any number of layers)")
print("  - Multiple activation functions (relu, sigmoid, tanh, softmax)")
print("  - He/Xavier weight initialization")
print("  - Gradient descent optimization")
print("  - Training history tracking")

---

<a id='part9'></a>
# Part 9: Training on MNIST

---

Let's test our neural network on the famous **MNIST dataset** (handwritten digits)!

In [None]:
# ============================================================
# LOAD AND PREPARE MNIST DATA (KAGGLE)
# ============================================================
print("="*70)
print("LOADING MNIST DATASET FROM KAGGLE")
print("="*70)

# ============================================================
# DATA SOURCE OPTIONS
# ============================================================
# Option 1: Kaggle Dataset (when running on Kaggle)
# Option 2: sklearn fetch_openml (when running locally)
# ============================================================

USE_KAGGLE = True  # Set to False if running locally

if USE_KAGGLE:
    # ============================================================
    # KAGGLE: Load from CSV files
    # ============================================================
    # Dataset: https://www.kaggle.com/competitions/digit-recognizer
    
    KAGGLE_PATH = '/kaggle/input/digit-recognizer'
    
    print("\nLoading from Kaggle dataset...")
    print(f"Path: {KAGGLE_PATH}")
    
    import pandas as pd
    
    try:
        # Load training data
        train_df = pd.read_csv(f'{KAGGLE_PATH}/train.csv')
        
        # Separate features and labels
        y = train_df['label'].values
        X = train_df.drop('label', axis=1).values
        
        print(f"\nDataset loaded from Kaggle!")
        print(f"  Samples: {X.shape[0]}")
        print(f"  Features: {X.shape[1]} (28x28 pixels)")
        print(f"  Classes: {len(np.unique(y))} (digits 0-9)")
        
    except FileNotFoundError:
        print("ERROR: Kaggle dataset not found!")
        print("Make sure you've added the 'digit-recognizer' dataset to your notebook.")
        print("Go to: Add Data -> Competition Data -> digit-recognizer")
        USE_KAGGLE = False

if not USE_KAGGLE:
    # ============================================================
    # LOCAL: Load using sklearn
    # ============================================================
    print("\nLoading from sklearn (local)...")
    print("Downloading MNIST (this may take a moment)...")
    
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X, y = mnist.data, mnist.target.astype(int)
    
    print(f"\nDataset loaded!")
    print(f"  Samples: {X.shape[0]}")
    print(f"  Features: {X.shape[1]} (28x28 pixels)")
    print(f"  Classes: {len(np.unique(y))} (digits 0-9)")

# ============================================================
# PREPARE DATA
# ============================================================
print("\n" + "="*70)
print("PREPARING DATA")
print("="*70)

# Use subset for faster training (optional)
n_samples = min(20000, len(X))  # Use up to 20,000 samples
X = X[:n_samples]
y = y[:n_samples]

print(f"\nUsing {n_samples} samples for training")

# Normalize to [0, 1]
X = X / 255.0

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Transpose for our neural network (features x samples)
X_train = X_train.T
X_test = X_test.T

# One-hot encode labels
def one_hot_encode(y, num_classes=10):
    """Convert labels to one-hot encoding."""
    m = y.shape[0]
    Y = np.zeros((num_classes, m))
    Y[y, np.arange(m)] = 1
    return Y

Y_train = one_hot_encode(y_train)
Y_test = one_hot_encode(y_test)

print(f"\nData shapes:")
print(f"  X_train: {X_train.shape} (features x samples)")
print(f"  Y_train: {Y_train.shape} (classes x samples)")
print(f"  X_test:  {X_test.shape}")
print(f"  Y_test:  {Y_test.shape}")

print(f"\nTrain samples: {X_train.shape[1]}")
print(f"Test samples:  {X_test.shape[1]}")

In [None]:
# Visualize some samples
print("="*70)
print("SAMPLE IMAGES")
print("="*70)

fig, axes = plt.subplots(2, 5, figsize=(12, 5))

for i, ax in enumerate(axes.flat):
    img = X_train[:, i].reshape(28, 28)
    label = y_train[i]
    ax.imshow(img, cmap='gray')
    ax.set_title(f'Label: {label}', fontweight='bold')
    ax.axis('off')

plt.suptitle('Sample MNIST Images', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# TRAIN NEURAL NETWORK ON MNIST
# ============================================================
print("="*70)
print("TRAINING NEURAL NETWORK ON MNIST")
print("="*70)

# Define network architecture
# 784 inputs -> 128 hidden -> 64 hidden -> 10 outputs
layer_sizes = [784, 128, 64, 10]
activations = ['relu', 'relu', 'softmax']

print(f"\nArchitecture: {layer_sizes}")
print(f"Activations: {activations}")

# Create and train network
nn = NeuralNetwork(
    layer_sizes=layer_sizes,
    activations=activations,
    learning_rate=0.1,
    init_method='he'
)

print(f"\nTotal parameters: {sum(nn.parameters[f'W{l}'].size + nn.parameters[f'b{l}'].size for l in range(1, len(layer_sizes)))}")
print(f"Learning rate: {nn.learning_rate}")
print("\nTraining...\n")

# Train
nn.fit(X_train, Y_train, epochs=500, verbose=True, print_every=50)

In [None]:
# Visualize training progress
print("="*70)
print("TRAINING PROGRESS")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(nn.history['loss'], 'b-', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss', fontweight='bold', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(nn.history['accuracy'], 'g-', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training Accuracy', fontweight='bold', fontsize=14)
axes[1].set_ylim(0, 1)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal Training Loss: {nn.history['loss'][-1]:.4f}")
print(f"Final Training Accuracy: {nn.history['accuracy'][-1]*100:.2f}%")

In [None]:
# Evaluate on test set
print("="*70)
print("EVALUATION ON TEST SET")
print("="*70)

# Make predictions
y_pred_test = nn.predict(X_test)

# Calculate accuracy
test_accuracy = np.mean(y_pred_test == y_test)
print(f"\nTest Accuracy: {test_accuracy*100:.2f}%")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_test)

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=range(10), yticklabels=range(10))
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title('Confusion Matrix', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_test, digits=3))

In [None]:
# Visualize predictions
print("="*70)
print("SAMPLE PREDICTIONS")
print("="*70)

# Random samples
n_show = 15
indices = np.random.choice(X_test.shape[1], n_show, replace=False)

fig, axes = plt.subplots(3, 5, figsize=(12, 8))

for i, ax in enumerate(axes.flat):
    idx = indices[i]
    img = X_test[:, idx].reshape(28, 28)
    true_label = y_test[idx]
    pred_label = y_pred_test[idx]
    
    ax.imshow(img, cmap='gray')
    
    color = 'green' if true_label == pred_label else 'red'
    ax.set_title(f'True: {true_label}, Pred: {pred_label}', 
                 fontweight='bold', color=color)
    ax.axis('off')

plt.suptitle('Sample Predictions (Green=Correct, Red=Wrong)', 
             fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

---

<a id='part10'></a>
# Part 10: CNN Concepts Overview

---

## 10.1 Why CNNs for Images?

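Our fully-connected network treats every pixel as an unrelated input feature: it has no notion of which pixels are neighbors. But images have **spatial structure**, and nearby pixels are strongly related!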

| Fully-Connected | CNN |
|-----------------|-----|
| All pixels connected to all neurons | Local connections (kernels) |
| No spatial awareness | Exploits spatial patterns |
| Many parameters | Fewer parameters (weight sharing) |
| For tabular data | For images, video, spatial data |
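
To put rough numbers on the "many parameters" row, the sketch below compares the first dense layer of the network trained in Part 9 (784 inputs to 128 units) with a convolution layer of 32 filters of size 3x3 on a single-channel image. The convolution layer is an assumed configuration, chosen to match the architecture summary in Section 10.4.

```python
# Dense layer: each of the 784 pixels has its own weight to each of the 128 neurons
dense_params = 784 * 128 + 128           # weights + biases

# Convolution layer (assumed: 32 filters, 3x3 kernel, 1 input channel)
# The same 3x3 weights are reused at every image position (weight sharing)
conv_params = 32 * (3 * 3 * 1) + 32      # kernel weights + one bias per filter

print(f"Dense 784->128:  {dense_params:,} parameters")
print(f"Conv 32 x (3x3): {conv_params:,} parameters")
```

The two layers are not interchangeable, so this is not a like-for-like comparison. It only illustrates how weight sharing keeps the parameter count independent of image size.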

## 10.2 Key CNN Components

| Component | Purpose | How It Works |
|-----------|---------|-------------|
| **Convolution** | Extract features | Slide kernel over image, compute dot product |
| **Pooling** | Reduce size | Take max/average of local region |
| **ReLU** | Non-linearity | Same as before: max(0, x) |
| **Fully Connected** | Classification | Same as our neural network! |

## 10.3 Convolution Operation

A **kernel** (filter) slides over the image, performing element-wise multiplication and sum:

```
Input Image (5x5)           Kernel (3x3)            Output (3x3)
[1 2 3 0 1]                 [1 0 1]                 
[0 1 2 3 0]     *           [0 1 0]      =         [? ? ?]
[1 0 1 2 1]                 [1 0 1]                 [? ? ?]
[2 1 0 1 0]                                         [? ? ?]
[0 1 2 0 1]
```
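
The 3x3 output for the example above can be computed directly. This is a minimal sketch using plain NumPy slicing (stride 1, no padding); the variable names are illustrative.

```python
import numpy as np

image = np.array([[1, 2, 3, 0, 1],
                  [0, 1, 2, 3, 0],
                  [1, 0, 1, 2, 1],
                  [2, 1, 0, 1, 0],
                  [0, 1, 2, 0, 1]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1
out_w = image.shape[1] - kw + 1
output = np.zeros((out_h, out_w), dtype=int)

for i in range(out_h):
    for j in range(out_w):
        # Multiply the 3x3 window element-wise with the kernel, then sum
        output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)

print(output)
# [[7 6 9]
#  [4 7 4]
#  [5 3 6]]
```

For instance, the top-left entry uses the four corners and the center of the top-left 3x3 window: 1 + 3 + 1 + 1 + 1 = 7.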

### Common Kernels:

| Kernel Type | Effect |
|-------------|--------|
| Edge Detection | Highlights edges |
| Blur | Smooths image |
| Sharpen | Enhances details |

In [None]:
# ============================================================
# CNN CONCEPTS: CONVOLUTION DEMONSTRATION
# ============================================================
print("="*70)
print("CONVOLUTION DEMONSTRATION")
print("="*70)

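def convolve2d(image, kernel):
    """Simple 2D convolution (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this computes a cross-correlation.
    """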
    h, w = image.shape
    kh, kw = kernel.shape
    output = np.zeros((h - kh + 1, w - kw + 1))
    
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            region = image[i:i+kh, j:j+kw]
            output[i, j] = np.sum(region * kernel)
    
    return output

# Sample image (MNIST digit)
sample_img = X_train[:, 0].reshape(28, 28)

# Define kernels
kernels = {
    'Edge Detection': np.array([[-1, -1, -1],
                                 [-1,  8, -1],
                                 [-1, -1, -1]]),
    
    'Box Blur': np.array([[1, 1, 1],
                          [1, 1, 1],
                          [1, 1, 1]]) / 9,
    
    'Sharpen': np.array([[ 0, -1,  0],
                         [-1,  5, -1],
                         [ 0, -1,  0]]),
    
    'Horizontal Edge': np.array([[-1, -1, -1],
                                  [ 0,  0,  0],
                                  [ 1,  1,  1]]),
    
    'Vertical Edge': np.array([[-1, 0, 1],
                                [-1, 0, 1],
                                [-1, 0, 1]])
}

# Apply kernels
fig, axes = plt.subplots(2, 3, figsize=(14, 10))

# Original
axes[0, 0].imshow(sample_img, cmap='gray')
axes[0, 0].set_title('Original Image', fontweight='bold')
axes[0, 0].axis('off')

# Apply each kernel
for idx, (name, kernel) in enumerate(kernels.items()):
    row, col = (idx + 1) // 3, (idx + 1) % 3
    
    convolved = convolve2d(sample_img, kernel)
    
    axes[row, col].imshow(convolved, cmap='gray')
    axes[row, col].set_title(f'{name}', fontweight='bold')
    axes[row, col].axis('off')

plt.suptitle('Convolution with Different Kernels', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

print("\nCNN learns these kernels automatically during training!")
print("Early layers learn edges, later layers learn complex patterns.")

In [None]:
# Max pooling demonstration
print("="*70)
print("MAX POOLING DEMONSTRATION")
print("="*70)

def max_pool2d(image, pool_size=2):
    """Max pooling operation."""
    h, w = image.shape
    new_h, new_w = h // pool_size, w // pool_size
    output = np.zeros((new_h, new_w))
    
    for i in range(new_h):
        for j in range(new_w):
            region = image[i*pool_size:(i+1)*pool_size, 
                          j*pool_size:(j+1)*pool_size]
            output[i, j] = np.max(region)
    
    return output

# Apply max pooling
pooled = max_pool2d(sample_img, pool_size=2)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

axes[0].imshow(sample_img, cmap='gray')
axes[0].set_title(f'Original: {sample_img.shape}', fontweight='bold')
axes[0].axis('off')

axes[1].imshow(pooled, cmap='gray')
axes[1].set_title(f'After Max Pooling: {pooled.shape}', fontweight='bold')
axes[1].axis('off')

plt.suptitle('Max Pooling (2x2)', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

print("\nMax Pooling:")
print(f"  - Reduces each spatial dimension by {1 - pooled.shape[0]/sample_img.shape[0]:.0%}")
print("  - Provides some invariance to small translations")
print("  - Reduces computation and can help limit overfitting")

## 10.4 CNN Architecture Summary

```
Input Image (28x28x1)
       ↓
Conv Layer (32 filters, 3x3) → ReLU → (26x26x32)
       ↓
Max Pool (2x2) → (13x13x32)
       ↓
Conv Layer (64 filters, 3x3) → ReLU → (11x11x64)
       ↓
Max Pool (2x2) → (5x5x64)
       ↓
Flatten → (1600)
       ↓
Fully Connected (128) → ReLU
       ↓
Fully Connected (10) → Softmax
       ↓
Output (10 classes)
```
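
The shapes in this diagram follow from two simple rules. The sketch below reproduces them, assuming stride-1 convolutions without padding and non-overlapping pooling that drops leftover rows and columns; the helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel):
    """Output size of a stride-1 convolution with no padding."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output size of non-overlapping pooling; leftover rows/columns are dropped."""
    return size // pool

size = 28
shapes = []

size = conv_out(size, 3); shapes.append((size, size, 32))   # Conv, 32 filters
size = pool_out(size, 2); shapes.append((size, size, 32))   # Max pool 2x2
size = conv_out(size, 3); shapes.append((size, size, 64))   # Conv, 64 filters
size = pool_out(size, 2); shapes.append((size, size, 64))   # Max pool 2x2

flat = size * size * 64                                     # Flatten

for s in shapes:
    print(s)
print("Flattened:", flat)
```

Note that the second pooling step maps 11 to 5, not 5.5: with a 2x2 window the last row and column are discarded.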

**For CNNs, use TensorFlow/PyTorch!** They have optimized implementations.

---

<a id='part11'></a>
# Part 11: Summary and Key Takeaways

---

In [None]:
# Final Summary
print("="*70)
print("NEURAL NETWORK FROM SCRATCH - SUMMARY")
print("="*70)

print("""
WHAT WE BUILT:
==============
A complete neural network from scratch using only NumPy!

KEY COMPONENTS:
===============
1. NEURONS: Basic unit that computes weighted sum + activation
   z = Wx + b, a = g(z)

2. ACTIVATION FUNCTIONS:
   - Sigmoid: 1/(1+e^-z) → Output layer (binary)
   - ReLU: max(0,z) → Hidden layers (most popular)
   - Softmax: e^zi/Σe^zj → Output layer (multi-class)

3. FORWARD PROPAGATION:
   X → Z1 → A1 → Z2 → A2 → ... → ŷ

4. LOSS FUNCTIONS:
   - Cross-Entropy: -(1/m)·Σ(y·log(ŷ)) → Classification
   - MSE: (1/m)·Σ(y-ŷ)² → Regression

5. BACKPROPAGATION:
   Compute gradients using chain rule, backward through network
   dW = (1/m) dZ · A_prev^T
   db = (1/m) Σ dZ

6. GRADIENT DESCENT:
   W_new = W_old - α · dW
   b_new = b_old - α · db

RESULTS ON MNIST:
=================
""")

print(f"Architecture: {layer_sizes}")
print(f"Training Accuracy: {nn.history['accuracy'][-1]*100:.2f}%")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

print("""
CNN CONCEPTS:
=============
- Convolution: Extract spatial features with kernels
- Pooling: Reduce dimensions, add translation invariance
- Architecture: Conv → ReLU → Pool → ... → FC → Softmax

WHAT'S NEXT:
============
- Use TensorFlow/PyTorch for production
- Add regularization (dropout, L2)
- Try advanced optimizers (Adam, RMSprop)
- Build CNNs for image classification
- Explore RNNs for sequences
""")

## Key Formulas Cheat Sheet

### Forward Propagation
| Step | Formula |
|------|---------|
| Linear | $Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}$ |
| Activation | $A^{[l]} = g(Z^{[l]})$ |

### Backpropagation
| Step | Formula |
|------|---------|
| Output gradient | $dZ^{[L]} = A^{[L]} - Y$ (sigmoid or softmax output with cross-entropy loss) |
| Weight gradient | $dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot A^{[l-1]T}$ |
| Bias gradient | $db^{[l]} = \frac{1}{m} \sum dZ^{[l]}$ |
| Previous layer | $dA^{[l-1]} = W^{[l]T} \cdot dZ^{[l]}$ |
| Activation deriv | $dZ^{[l-1]} = dA^{[l-1]} \odot g'(Z^{[l-1]})$ (element-wise product) |
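
The output-gradient row is not a general rule. It is what the chain rule gives when a softmax output is paired with cross-entropy loss. A short derivation for a single sample, writing $a_k$ for the softmax outputs, $z_k$ for the pre-activations, and $y_k$ for a one-hot label (so $\sum_k y_k = 1$), with $\delta_{ki}$ equal to 1 when $k = i$ and 0 otherwise:

$$a_k = \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad J = -\sum_k y_k \log a_k$$

$$\frac{\partial \log a_k}{\partial z_i} = \delta_{ki} - a_i$$

$$\frac{\partial J}{\partial z_i} = -\sum_k y_k \left(\delta_{ki} - a_i\right) = -y_i + a_i \sum_k y_k = a_i - y_i$$

Stacking this over all classes and samples gives $dZ^{[L]} = A^{[L]} - Y$. The same form results for a sigmoid output with binary cross-entropy.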

### Gradient Descent
| Update | Formula |
|--------|---------|
| Weights | $W := W - \alpha \cdot dW$ |
| Biases | $b := b - \alpha \cdot db$ |

---

## Checklist

- [x] Understood neuron structure (weights, bias, activation)
- [x] Implemented activation functions (Sigmoid, ReLU, Tanh, Softmax)
- [x] Learned activation function derivatives
- [x] Built neural network architecture
- [x] Implemented forward propagation
- [x] Understood loss functions (Cross-Entropy, MSE)
- [x] Implemented backpropagation with chain rule
- [x] Implemented gradient descent
- [x] Built complete NeuralNetwork class from scratch
- [x] Trained on MNIST dataset
- [x] Understood CNN concepts (convolution, pooling)

---

**Congratulations! You now understand how neural networks work at a fundamental level!**