# 🔄 The Training Loop: Practice Makes Perfect

**Putting Everything Together: Building a Complete Training System**

Welcome to the final fundamental notebook! You've learned all the pieces - now it's time to put them together into a complete training system.

---

## 📖 What We'll Learn

1. **Training Concepts**: Epochs, batches, iterations
2. **Complete Training Loop**: From raw data to trained model
3. **Weight Initialization**: Starting off right
4. **Hyperparameters**: Learning rate, batch size, epochs
5. **Monitoring Progress**: Loss curves, accuracy
6. **Train/Validation Split**: Detecting overfitting
7. **Best Practices**: Common mistakes and how to avoid them

---

## 🎵 The Musical Analogy: Learning an Instrument

Training a neural network is like learning to play the piano:

### One Practice Session ≠ Mastery

- **Single weight update** (from Notebook 7) = Playing a song once
- **Training loop** = Practicing the song many times over weeks

### The Practice Routine

1. **Practice session** (Epoch):
   - Go through your entire songbook once
   - Each time you play all songs = 1 epoch

2. **Breaking it down** (Batches):
   - Instead of playing 100 songs at once (overwhelming!)
   - Practice 10 songs at a time (manageable batches)
   - Take a break, adjust technique, repeat

3. **Repeated practice** (Iterations):
   - Each small practice session = 1 iteration
   - More iterations = more improvement opportunities

4. **Progress tracking** (Validation):
   - Perform for friends (validation set)
   - See if you're really getting better
   - Not just memorizing, but actually learning!

### 💡 Key Insight

**Learning requires repetition!** One update isn't enough. We need to:
- See the data multiple times
- Gradually adjust weights
- Track progress over time
- Know when to stop
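
To make this concrete, here is a minimal sketch that is separate from the network code in this notebook. It runs plain gradient descent on a toy one-parameter loss, `(w - 3)**2`. The loss, the learning rate of 0.1, and the function name `gradient_descent` are all assumptions chosen for illustration.

```python
def gradient_descent(w_start, learning_rate, n_updates):
    """Minimize the toy loss (w - 3)**2 with repeated gradient steps."""
    w = w_start
    for _ in range(n_updates):
        grad = 2 * (w - 3)          # derivative of (w - 3)**2
        w -= learning_rate * grad   # one weight update
    return w

w_one = gradient_descent(w_start=0.0, learning_rate=0.1, n_updates=1)
w_many = gradient_descent(w_start=0.0, learning_rate=0.1, n_updates=100)

print(f"After   1 update:  w = {w_one:.4f}")   # 0.6000
print(f"After 100 updates: w = {w_many:.4f}")  # 3.0000
```

A single update moves `w` only a fifth of the way toward the minimum at 3; it takes many small steps to get there.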

---

In [None]:
# Import necessary libraries
import numpy as np  # For numerical operations and array handling
import matplotlib.pyplot as plt  # For creating beautiful visualizations
from matplotlib.animation import FuncAnimation  # For animated training visualizations
from IPython.display import HTML  # For displaying animations in notebook
import time  # For tracking training time
from typing import Tuple, List, Dict  # For type hints (better code documentation)

# Set random seed for reproducibility (same results every time)
np.random.seed(42)

# Configure matplotlib for better-looking plots
# Note: this style name needs matplotlib >= 3.6; older releases call it 'seaborn-darkgrid'
plt.style.use('seaborn-v0_8-darkgrid')  # Professional plot style
plt.rcParams['figure.figsize'] = (12, 6)  # Default figure size
plt.rcParams['font.size'] = 11  # Readable font size

---

## 📚 Part 1: Understanding Training Terminology

Let's clarify the key terms you'll hear everywhere in deep learning:

### 🔄 Epoch
**One complete pass through the entire training dataset**

- If you have 1000 training examples, seeing all 1000 once = 1 epoch
- Typical training: 10-100+ epochs (or more for complex problems)
- More epochs = more learning opportunities (but risk overfitting)

### 📦 Batch
**A subset of training data processed together**

- Instead of updating weights after EACH example (noisy, and slow because nothing is vectorized)
- Or after ALL examples (only one update per pass, and the whole dataset must fit in memory)
- We update after a "batch" of examples (just right! 🐻)

**Batch sizes:**
- **Batch size = 1**: Stochastic Gradient Descent (SGD) - each update is cheap but noisy
- **Batch size = all data**: Batch Gradient Descent - smooth gradients, but only one update per epoch
- **Batch size = 32, 64, 128, etc.**: Mini-batch Gradient Descent - a practical middle ground and the usual default ⭐
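
The "noisy vs. smooth" trade-off can be measured directly. The sketch below is an illustration under assumptions, not part of the notebook's network: it uses made-up linear-regression data (`y = 2x + noise`), a single weight `w`, and a hypothetical helper `batch_gradient`. For each batch size it draws 200 random batches and reports how much the gradient estimate varies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 2x + noise (hypothetical example data)
X = rng.normal(size=1000)
y = 2.0 * X + rng.normal(scale=0.5, size=1000)
w = 0.0  # current weight estimate

def batch_gradient(idx):
    """Gradient of the MSE loss w.r.t. w, computed on the rows in idx."""
    return np.mean(2 * (w * X[idx] - y[idx]) * X[idx])

gradient_spread = {}
for batch_size in (1, 32, 1000):
    estimates = [
        batch_gradient(rng.choice(len(X), size=batch_size, replace=False))
        for _ in range(200)
    ]
    gradient_spread[batch_size] = np.std(estimates)
    print(f"batch size {batch_size:4d}: gradient std = {gradient_spread[batch_size]:.4f}")
```

The spread shrinks as the batch grows, and with the full dataset every estimate is the same gradient, so the spread is essentially zero.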

### 🔢 Iteration
**One weight update (one forward + backward pass on a batch)**

- Number of iterations per epoch = Total examples / Batch size, rounded up (a smaller final batch still counts as one update)
- Example: 1000 examples, batch size 100 → 10 iterations per epoch

### 📊 Simple Example

```
Dataset: 1000 examples
Batch size: 100
Epochs: 10

Iterations per epoch: 1000 / 100 = 10
Total iterations: 10 epochs × 10 iterations = 100 iterations
```
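
The same arithmetic as a small helper, including the case where the batch size does not divide the dataset evenly. The function name `count_iterations` is a hypothetical one used only for this sketch.

```python
import math

def count_iterations(n_examples, batch_size, epochs):
    """Return (iterations per epoch, total iterations)."""
    # Round up: a final, smaller batch still triggers one weight update
    per_epoch = math.ceil(n_examples / batch_size)
    return per_epoch, per_epoch * epochs

print(count_iterations(1000, 100, 10))  # (10, 100)
print(count_iterations(1000, 64, 10))   # (16, 160)
```

With a batch size of 64, the 16th batch of each epoch holds only the remaining 40 examples.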

In [None]:
# Let's visualize these concepts

def visualize_training_concepts():
    """Create a visual explanation of epochs, batches, and iterations"""
    
    fig, axes = plt.subplots(3, 1, figsize=(14, 10))
    
    # Configuration
    total_examples = 100  # Total training examples
    batch_size = 20  # Process 20 at a time
    num_epochs = 3  # Train for 3 epochs
    
    iterations_per_epoch = total_examples // batch_size  # 100 / 20 = 5
    total_iterations = iterations_per_epoch * num_epochs  # 5 * 3 = 15
    
    # Plot 1: Showing one epoch (all data once)
    ax1 = axes[0]
    data_indices = np.arange(total_examples)
    colors = plt.cm.viridis(np.linspace(0, 1, total_examples))
    ax1.bar(data_indices, np.ones(total_examples), color=colors, width=1.0, edgecolor='none')
    ax1.set_xlim(-1, total_examples)
    ax1.set_ylim(0, 1.2)
    ax1.set_ylabel('Data Point', fontsize=12)
    ax1.set_title(f'1 EPOCH = Seeing all {total_examples} examples once', 
                  fontsize=14, fontweight='bold')
    ax1.set_xticks([0, 25, 50, 75, 99])
    ax1.set_yticks([])
    
    # Plot 2: Showing batches within an epoch
    ax2 = axes[1]
    for batch_idx in range(iterations_per_epoch):
        start = batch_idx * batch_size
        end = start + batch_size
        batch_indices = np.arange(start, end)
        
        # Different color for each batch
        batch_color = plt.cm.Set3(batch_idx)
        ax2.bar(batch_indices, np.ones(batch_size), color=batch_color, 
               width=1.0, edgecolor='black', linewidth=2)
        
        # Label each batch
        mid_point = (start + end) / 2
        ax2.text(mid_point, 0.5, f'Batch {batch_idx+1}', 
                ha='center', va='center', fontsize=10, fontweight='bold')
    
    ax2.set_xlim(-1, total_examples)
    ax2.set_ylim(0, 1.2)
    ax2.set_ylabel('Batch', fontsize=12)
    ax2.set_title(f'1 EPOCH = {iterations_per_epoch} BATCHES (each batch = {batch_size} examples)',
                  fontsize=14, fontweight='bold')
    ax2.set_xticks([0, 25, 50, 75, 99])
    ax2.set_yticks([])
    
    # Plot 3: Showing multiple epochs (iterations over time)
    ax3 = axes[2]
    iteration_numbers = np.arange(1, total_iterations + 1)
    
    # Color by epoch
    colors = []
    for epoch in range(num_epochs):
        epoch_color = plt.cm.Set1(epoch)
        colors.extend([epoch_color] * iterations_per_epoch)
    
    ax3.bar(iteration_numbers, np.ones(total_iterations), color=colors, 
           width=0.8, edgecolor='black', linewidth=1)
    
    # Mark epoch boundaries
    for epoch in range(1, num_epochs):
        ax3.axvline(x=epoch * iterations_per_epoch + 0.5, 
                   color='red', linestyle='--', linewidth=3, alpha=0.7)
    
    # Label epochs
    for epoch in range(num_epochs):
        mid_iter = epoch * iterations_per_epoch + iterations_per_epoch / 2 + 0.5
        ax3.text(mid_iter, 0.5, f'Epoch {epoch+1}',
                ha='center', va='center', fontsize=11, fontweight='bold')
    
    ax3.set_xlim(0, total_iterations + 1)
    ax3.set_ylim(0, 1.2)
    ax3.set_xlabel('Iteration Number (Weight Updates)', fontsize=12)
    ax3.set_ylabel('Epoch', fontsize=12)
    ax3.set_title(f'{num_epochs} EPOCHS = {total_iterations} ITERATIONS (weight updates)',
                  fontsize=14, fontweight='bold')
    ax3.set_yticks([])
    
    plt.tight_layout()
    plt.show()
    
    # Print summary
    print("\n" + "="*60)
    print("📊 TRAINING CONFIGURATION SUMMARY")
    print("="*60)
    print(f"Total training examples: {total_examples}")
    print(f"Batch size: {batch_size}")
    print(f"Number of epochs: {num_epochs}")
    print()
    print("COMPUTED VALUES:")
    print(f"Iterations per epoch: {iterations_per_epoch}")
    print(f"Total iterations: {total_iterations}")
    print()
    print("WHAT THIS MEANS:")
    print(f"• Each epoch, we see all {total_examples} examples")
    print(f"• We process {batch_size} examples at a time (mini-batches)")
    print(f"• We update weights {iterations_per_epoch} times per epoch")
    print(f"• Total: {total_iterations} weight updates across all epochs")
    print("="*60)

visualize_training_concepts()

---

## 🎲 Part 2: Weight Initialization - Starting Off Right

Before training, we need to initialize our weights. **This matters more than you might think!**

### ❌ Bad Initialization: All Zeros

```python
# DON'T DO THIS!
weights = np.zeros((n_in, n_out))
```

**Problem**: All neurons learn the same thing (symmetry problem)
- All neurons get same gradient
- All neurons update identically
- Network doesn't learn diverse features
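
Here is a small, self-contained sketch of the symmetry problem. It is an illustration under assumptions: a tiny 3-4-1 network with sigmoid activations and MSE loss, random inputs, hand-picked labels, and a learning rate of 0.5. None of these come from the notebook's `NeuralNetwork` class.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                                       # 8 examples, 3 features
y = np.array([[1], [1], [1], [0], [1], [0], [1], [1]], dtype=float)  # made-up labels

# All-zero initialization: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = np.zeros((3, 4)), np.zeros(4)
W2, b2 = np.zeros((4, 1)), np.zeros(1)

for _ in range(50):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    # Backward pass (MSE loss, sigmoid activations)
    dz2 = 2 * (y_pred - y) / len(X) * y_pred * (1 - y_pred)
    dz1 = (dz2 @ W2.T) * h * (1 - h)
    # Gradient-descent update
    W2 -= 0.5 * (h.T @ dz2)
    b2 -= 0.5 * dz2.sum(axis=0)
    W1 -= 0.5 * (X.T @ dz1)
    b1 -= 0.5 * dz1.sum(axis=0)

# Each column of W1 holds one hidden neuron's weights
print(np.round(W1, 4))
print("All hidden neurons identical:", np.allclose(W1, W1[:, [0]]))
```

The weights do move away from zero, but every hidden neuron moves in lockstep, so four neurons end up doing the work of one.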

### ⚠️ Bad Initialization: Too Large

```python
# Also bad!
weights = np.random.randn(n_in, n_out) * 10  # Too large
```

**Problem**: Huge pre-activations make gradients explode (or saturate sigmoid/tanh units so their gradients vanish), and training becomes unstable

### ✅ Good Initialization: Xavier/He

**Xavier (Glorot) Initialization** - for sigmoid/tanh:
```python
std = np.sqrt(2.0 / (n_in + n_out))
weights = np.random.randn(n_in, n_out) * std
```

**He Initialization** - for ReLU:
```python
std = np.sqrt(2.0 / n_in)
weights = np.random.randn(n_in, n_out) * std
```

These keep activations and gradients in a good range!
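
To see what "a good range" means, the sketch below pushes random inputs through 10 ReLU layers and measures the spread of the final activations. The depth, the width of 256, the bias-free layers, and the helper name `final_activation_std` are assumptions made for this illustration.

```python
import numpy as np

def final_activation_std(weight_std_fn, n_layers=10, width=256, seed=0):
    """Push random inputs through a stack of ReLU layers; return the output std."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))  # batch of random inputs
    for _ in range(n_layers):
        W = rng.normal(size=(width, width)) * weight_std_fn(width)
        a = np.maximum(0, a @ W)  # ReLU layer (no bias)
    return a.std()

results = {
    "too small (0.01)": final_activation_std(lambda n: 0.01),
    "standard normal (1.0)": final_activation_std(lambda n: 1.0),
    "He (sqrt(2/n))": final_activation_std(lambda n: np.sqrt(2.0 / n)),
}
for name, std in results.items():
    print(f"{name:>22}: activation std after 10 layers = {std:.3e}")
```

With weights that are too small the signal fades to almost nothing, with unscaled standard-normal weights it blows up by many orders of magnitude, and with He scaling it stays close to the scale of the input.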

In [None]:
# Let's see the effect of different initializations

def compare_initializations():
    """Compare different weight initialization strategies"""
    
    n_in = 100  # Input size
    n_out = 100  # Output size (each histogram shows n_in × n_out = 10,000 weights)
    
    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    
    # Different initialization methods
    methods = [
        ('All Zeros', lambda: np.zeros((n_in, n_out))),
        ('Too Large', lambda: np.random.randn(n_in, n_out) * 5),
        ('Too Small', lambda: np.random.randn(n_in, n_out) * 0.01),
        ('Standard Normal', lambda: np.random.randn(n_in, n_out)),
        ('Xavier', lambda: np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))),
        ('He', lambda: np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in))
    ]
    
    for idx, (name, init_func) in enumerate(methods):
        ax = axes[idx // 3, idx % 3]
        
        # Initialize weights
        weights = init_func()
        
        # Plot histogram
        ax.hist(weights.flatten(), bins=50, edgecolor='black', alpha=0.7)
        ax.set_title(name, fontsize=13, fontweight='bold')
        ax.set_xlabel('Weight Value', fontsize=11)
        ax.set_ylabel('Frequency', fontsize=11)
        
        # Add statistics
        mean = np.mean(weights)
        std = np.std(weights)
        ax.text(0.95, 0.95, f'Mean: {mean:.4f}\nStd: {std:.4f}',
               transform=ax.transAxes, fontsize=10,
               verticalalignment='top', horizontalalignment='right',
               bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        
        # Mark if this is good or bad
        if name in ['Xavier', 'He']:
            ax.set_facecolor('#e8f5e9')  # Light green background
            ax.text(0.5, 1.05, '✅ GOOD', transform=ax.transAxes,
                   ha='center', fontsize=12, fontweight='bold', color='green')
        else:
            ax.set_facecolor('#ffebee')  # Light red background
            ax.text(0.5, 1.05, '❌ AVOID', transform=ax.transAxes,
                   ha='center', fontsize=12, fontweight='bold', color='red')
    
    plt.tight_layout()
    plt.show()
    
    print("\n💡 Key Takeaways:")
    print("• All zeros: No learning (symmetry problem)")
    print("• Too large: Activations explode or saturate, so training is unstable")
    print("• Too small: Signals and gradients shrink toward zero")
    print("• Xavier: Good for sigmoid/tanh activations")
    print("• He: Good for ReLU activations")
    print("\n🎯 Use Xavier or He initialization for best results!")

compare_initializations()

---

## 🏗️ Part 3: Building a Complete Neural Network with Training Loop

Now let's build a complete, reusable neural network class with:
- Proper initialization
- Mini-batch training
- Progress tracking
- Validation support
- Early stopping
- Extensive logging
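
Of these features, early stopping is the easiest to misread, so here is its patience rule in isolation. This is a minimal sketch: the function name `early_stopping_epoch` and the loss values are hypothetical, and it only decides *when* to stop (it does not restore the best weights).

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0  # Improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: improving at first, then drifting upward
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]
print(early_stopping_epoch(losses, patience=3))   # 6
print(early_stopping_epoch(losses, patience=10))  # None
```

The best loss occurs at epoch 3, and with a patience of 3 training stops three epochs later, at epoch 6.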

In [None]:
# Activation functions (from previous notebooks)

def sigmoid(x):
    """Sigmoid activation: outputs between 0 and 1"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))  # Clip to prevent overflow

def sigmoid_derivative(x):
    """Derivative of sigmoid"""
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    """ReLU activation: max(0, x)"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 if x > 0, else 0"""
    return (x > 0).astype(float)

print("✅ Activation functions defined")

In [None]:
class NeuralNetwork:
    """A complete 2-layer neural network with full training capabilities
    
    Features:
    - Configurable architecture
    - Multiple activation functions
    - Mini-batch training
    - Progress tracking
    - Validation support
    - Early stopping
    """
    
    def __init__(self, input_size, hidden_size, output_size, 
                 learning_rate=0.01, activation='relu'):
        """Initialize the neural network
        
        Args:
            input_size: Number of input features
            hidden_size: Number of neurons in hidden layer
            output_size: Number of output neurons
            learning_rate: Step size for gradient descent (default: 0.01)
            activation: 'relu' or 'sigmoid' for hidden layer (default: 'relu')
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.learning_rate = learning_rate
        
        # Set activation function
        if activation == 'relu':
            self.activation = relu
            self.activation_derivative = relu_derivative
            # Use He initialization for ReLU
            init_std_w1 = np.sqrt(2.0 / input_size)
            init_std_w2 = np.sqrt(2.0 / hidden_size)
        elif activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_derivative = sigmoid_derivative
            # Use Xavier initialization for sigmoid
            init_std_w1 = np.sqrt(2.0 / (input_size + hidden_size))
            init_std_w2 = np.sqrt(2.0 / (hidden_size + output_size))
        else:
            raise ValueError(f"Unknown activation: {activation}")
        
        # Initialize weights with appropriate initialization
        self.W1 = np.random.randn(input_size, hidden_size) * init_std_w1
        self.b1 = np.zeros(hidden_size)  # Biases start at zero
        
        self.W2 = np.random.randn(hidden_size, output_size) * init_std_w2
        self.b2 = np.zeros(output_size)
        
        # Storage for intermediate values (needed for backprop)
        self.cache = {}
        
        # Training history
        self.history = {
            'train_loss': [],  # Loss on training data
            'val_loss': [],    # Loss on validation data
            'train_acc': [],   # Accuracy on training data
            'val_acc': []      # Accuracy on validation data
        }
    
    def forward(self, X):
        """Forward propagation: compute predictions
        
        Args:
            X: Input data (batch_size, input_size)
        
        Returns:
            y_pred: Predictions (batch_size, output_size)
        """
        # Hidden layer: z1 = X·W1 + b1, then activation
        z1 = np.dot(X, self.W1) + self.b1  # Weighted sum
        h = self.activation(z1)  # Apply activation function
        
        # Output layer: z2 = h·W2 + b2, then sigmoid (for binary classification)
        z2 = np.dot(h, self.W2) + self.b2  # Weighted sum
        y_pred = sigmoid(z2)  # Sigmoid for final output (probability)
        
        # Cache intermediate values for backpropagation
        self.cache = {
            'X': X,    # Input
            'z1': z1,  # Hidden layer weighted sum
            'h': h,    # Hidden layer activation
            'z2': z2,  # Output layer weighted sum
            'y_pred': y_pred  # Final prediction
        }
        
        return y_pred
    
    def compute_loss(self, y_pred, y_true):
        """Compute Mean Squared Error loss
        
        Args:
            y_pred: Predicted values
            y_true: True values
        
        Returns:
            loss: Average loss across all samples
        """
        # MSE = mean of (prediction - true)²
        return np.mean((y_pred - y_true) ** 2)
    
    def compute_accuracy(self, y_pred, y_true):
        """Compute classification accuracy
        
        Args:
            y_pred: Predicted probabilities
            y_true: True labels (0 or 1)
        
        Returns:
            accuracy: Percentage of correct predictions
        """
        # Convert probabilities to binary predictions (threshold at 0.5)
        predictions = (y_pred > 0.5).astype(int)
        # Calculate percentage of correct predictions
        return np.mean(predictions == y_true) * 100
    
    def backward(self, y_true):
        """Backward propagation: compute gradients
        
        Args:
            y_true: True labels
        
        Returns:
            gradients: Dictionary of gradients for all parameters
        """
        # Retrieve cached values from forward pass
        X = self.cache['X']
        z1 = self.cache['z1']
        h = self.cache['h']
        z2 = self.cache['z2']
        y_pred = self.cache['y_pred']
        
        # --- Backward pass through output layer ---
        
        # Gradient of loss w.r.t. predictions: ∂L/∂y
        # compute_loss averages over every element (batch_size × output_size),
        # so divide by y_pred.size; this equals batch_size when output_size is 1
        dL_dy = 2 * (y_pred - y_true) / y_pred.size
        
        # Gradient of loss w.r.t. z2 (before sigmoid): ∂L/∂z2
        # Chain rule: ∂L/∂z2 = ∂L/∂y × ∂y/∂z2
        dy_dz2 = sigmoid_derivative(z2)
        dL_dz2 = dL_dy * dy_dz2
        
        # Gradients for W2 and b2
        dL_dW2 = np.dot(h.T, dL_dz2)  # ∂L/∂W2 = hᵀ · ∂L/∂z2
        dL_db2 = np.sum(dL_dz2, axis=0)  # Sum over batch
        
        # --- Backward pass through hidden layer ---
        
        # Gradient of loss w.r.t. hidden activations: ∂L/∂h
        dL_dh = np.dot(dL_dz2, self.W2.T)  # Chain through W2
        
        # Gradient of loss w.r.t. z1 (before activation): ∂L/∂z1
        # Chain rule: ∂L/∂z1 = ∂L/∂h × ∂h/∂z1
        dh_dz1 = self.activation_derivative(z1)
        dL_dz1 = dL_dh * dh_dz1
        
        # Gradients for W1 and b1
        dL_dW1 = np.dot(X.T, dL_dz1)  # ∂L/∂W1 = Xᵀ · ∂L/∂z1
        dL_db1 = np.sum(dL_dz1, axis=0)  # Sum over batch
        
        # Return all gradients
        return {
            'dW1': dL_dW1,
            'db1': dL_db1,
            'dW2': dL_dW2,
            'db2': dL_db2
        }
    
    def update_weights(self, gradients):
        """Update weights using gradient descent
        
        Args:
            gradients: Dictionary of gradients from backward pass
        """
        # Update rule: weight = weight - learning_rate × gradient
        self.W1 -= self.learning_rate * gradients['dW1']
        self.b1 -= self.learning_rate * gradients['db1']
        self.W2 -= self.learning_rate * gradients['dW2']
        self.b2 -= self.learning_rate * gradients['db2']
    
    def train_step(self, X, y):
        """Perform one training step (forward, backward, update)
        
        Args:
            X: Input batch
            y: True labels for batch
        
        Returns:
            loss: Loss value for this batch
        """
        # 1. Forward pass: compute predictions
        y_pred = self.forward(X)
        
        # 2. Compute loss
        loss = self.compute_loss(y_pred, y)
        
        # 3. Backward pass: compute gradients
        gradients = self.backward(y)
        
        # 4. Update weights
        self.update_weights(gradients)
        
        return loss
    
    def create_batches(self, X, y, batch_size):
        """Split data into mini-batches
        
        Args:
            X: All input data
            y: All labels
            batch_size: Size of each batch
        
        Yields:
            (X_batch, y_batch): One batch at a time
        """
        n_samples = X.shape[0]  # Total number of samples
        
        # Shuffle data at the start of each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Split into batches
        for i in range(0, n_samples, batch_size):
            # Get batch (handle last batch which might be smaller)
            end_idx = min(i + batch_size, n_samples)
            X_batch = X_shuffled[i:end_idx]
            y_batch = y_shuffled[i:end_idx]
            
            yield X_batch, y_batch
    
    def train(self, X_train, y_train, X_val=None, y_val=None, 
              epochs=100, batch_size=32, verbose=True, 
              early_stopping_patience=None):
        """Train the neural network
        
        Args:
            X_train: Training data (n_samples, n_features)
            y_train: Training labels (n_samples, n_outputs)
            X_val: Validation data (optional)
            y_val: Validation labels (optional)
            epochs: Number of times to iterate through training data
            batch_size: Number of samples per batch
            verbose: If True, print progress
            early_stopping_patience: Stop if val loss doesn't improve for this many epochs
        
        Returns:
            history: Dictionary with training metrics
        """
        n_samples = X_train.shape[0]  # Total training samples
        iterations_per_epoch = int(np.ceil(n_samples / batch_size))  # Batches per epoch
        
        # Early stopping tracking
        best_val_loss = float('inf')  # Best validation loss seen
        patience_counter = 0  # How many epochs without improvement
        
        # Training start time
        start_time = time.time()
        
        if verbose:
            print("="*70)
            print("🚀 STARTING TRAINING")
            print("="*70)
            print(f"Training samples: {n_samples}")
            print(f"Batch size: {batch_size}")
            print(f"Iterations per epoch: {iterations_per_epoch}")
            print(f"Total epochs: {epochs}")
            print(f"Total iterations: {epochs * iterations_per_epoch}")
            if early_stopping_patience:
                print(f"Early stopping patience: {early_stopping_patience} epochs")
            print("="*70)
            print()
        
        # Training loop - iterate through epochs
        for epoch in range(epochs):
            epoch_start_time = time.time()
            epoch_losses = []  # Track losses for this epoch
            
            # Iterate through mini-batches
            for X_batch, y_batch in self.create_batches(X_train, y_train, batch_size):
                # Perform one training step
                batch_loss = self.train_step(X_batch, y_batch)
                epoch_losses.append(batch_loss)
            
            # Calculate epoch metrics
            avg_train_loss = np.mean(epoch_losses)
            
            # Compute training accuracy
            train_pred = self.forward(X_train)
            train_acc = self.compute_accuracy(train_pred, y_train)
            
            # Store training metrics
            self.history['train_loss'].append(avg_train_loss)
            self.history['train_acc'].append(train_acc)
            
            # Validation metrics (if validation data provided)
            if X_val is not None and y_val is not None:
                val_pred = self.forward(X_val)
                val_loss = self.compute_loss(val_pred, y_val)
                val_acc = self.compute_accuracy(val_pred, y_val)
                
                self.history['val_loss'].append(val_loss)
                self.history['val_acc'].append(val_acc)
                
                # Early stopping check
                if early_stopping_patience is not None:
                    if val_loss < best_val_loss:
                        best_val_loss = val_loss
                        patience_counter = 0  # Reset counter
                    else:
                        patience_counter += 1  # Increment counter
                        
                        # Stop if patience exceeded
                        if patience_counter >= early_stopping_patience:
                            if verbose:
                                print(f"\n⏹️  Early stopping at epoch {epoch+1}")
                                print(f"Validation loss hasn't improved for {early_stopping_patience} epochs")
                            break
            
            # Print progress
            if verbose and (epoch + 1) % max(1, epochs // 10) == 0:
                epoch_time = time.time() - epoch_start_time
                print(f"Epoch {epoch+1:3d}/{epochs} | "
                      f"Train Loss: {avg_train_loss:.4f} | "
                      f"Train Acc: {train_acc:5.2f}%", end="")
                
                if X_val is not None and y_val is not None:
                    print(f" | Val Loss: {val_loss:.4f} | "
                          f"Val Acc: {val_acc:5.2f}%", end="")
                
                print(f" | Time: {epoch_time:.2f}s")
        
        # Training complete
        total_time = time.time() - start_time
        
        if verbose:
            print()
            print("="*70)
            print("✅ TRAINING COMPLETE")
            print("="*70)
            print(f"Total time: {total_time:.2f}s")
            print(f"Final training loss: {self.history['train_loss'][-1]:.4f}")
            print(f"Final training accuracy: {self.history['train_acc'][-1]:.2f}%")
            if X_val is not None and y_val is not None:
                print(f"Final validation loss: {self.history['val_loss'][-1]:.4f}")
                print(f"Final validation accuracy: {self.history['val_acc'][-1]:.2f}%")
            print("="*70)
        
        return self.history
    
    def predict(self, X):
        """Make predictions on new data
        
        Args:
            X: Input data
        
        Returns:
            predictions: Binary predictions (0 or 1)
        """
        probabilities = self.forward(X)
        return (probabilities > 0.5).astype(int)

print("✅ Complete Neural Network class implemented!")
print("\nThis class includes:")
print("  ✓ Proper weight initialization (Xavier/He)")
print("  ✓ Mini-batch training")
print("  ✓ Training/validation split")
print("  ✓ Progress tracking")
print("  ✓ Early stopping")
print("  ✓ Comprehensive logging")

---

## 🎯 Part 4: Training on a Classic Problem - XOR

Let's use our complete training system to solve the XOR problem!
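
Why does XOR need a hidden layer at all? Because no single straight line separates its two classes. The sketch below checks this by brute force, trying every single-neuron classifier `w1*x1 + w2*x2 + b > 0` on a coarse grid. The grid range and step are arbitrary choices for illustration.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Try every combination of w1, w2, b on a coarse grid (values are illustrative)
grid = np.linspace(-2, 2, 17)  # step of 0.25
best_correct = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    predictions = (X @ np.array([w1, w2]) + b > 0).astype(int)
    best_correct = max(best_correct, int(np.sum(predictions == y)))

print(f"Best single-neuron result: {best_correct}/4 correct")  # 3/4 correct
```

A finer grid would not help: XOR is not linearly separable, so a single neuron can never classify all four points correctly.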

In [None]:
# Create XOR dataset
X_xor = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

y_xor = np.array([
    [0],
    [1],
    [1],
    [0]
])

print("XOR Problem:")
print("="*40)
print("Input 1 | Input 2 | Output")
print("-"*40)
for i in range(len(X_xor)):
    print(f"   {X_xor[i,0]}    |    {X_xor[i,1]}    |   {y_xor[i,0]}")
print("="*40)
print()

# Create network
np.random.seed(42)  # For reproducibility
network = NeuralNetwork(
    input_size=2,
    hidden_size=8,  # 8 hidden neurons
    output_size=1,
    learning_rate=0.1,
    activation='relu'  # ReLU activation
)

# Train the network
history = network.train(
    X_train=X_xor,
    y_train=y_xor,
    epochs=1000,
    batch_size=4,  # All 4 examples in one batch (full-batch gradient descent)
    verbose=True
)

# Test the network
print("\n" + "="*70)
print("🧪 TESTING TRAINED NETWORK")
print("="*70)
predictions = network.forward(X_xor)

print("Input 1 | Input 2 | Target | Prediction | Rounded | Correct?")
print("-"*70)
for i in range(len(X_xor)):
    x1, x2 = X_xor[i]
    target = y_xor[i, 0]
    pred = predictions[i, 0]
    rounded = int(pred > 0.5)  # Same 0.5 threshold as compute_accuracy
    correct = "✓" if rounded == target else "✗"
    print(f"   {x1}    |    {x2}    |   {target}    |   {pred:.4f}   |    {rounded}    |   {correct}")

accuracy = network.compute_accuracy(predictions, y_xor)
print("="*70)
print(f"Final Accuracy: {accuracy:.1f}%")
print("="*70)

In [None]:
# Visualize training progress

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss over epochs
epochs = range(1, len(history['train_loss']) + 1)
ax1.plot(epochs, history['train_loss'], 'b-', linewidth=2, label='Training Loss')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (MSE)', fontsize=12)
ax1.set_title('Loss Decreases Over Time', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')  # Log scale to see details

# Plot 2: Decision boundary
# Create a grid of points
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200),
                     np.linspace(-0.5, 1.5, 200))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Predict for all grid points
Z = network.forward(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary
contour = ax2.contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.8)
ax2.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=3)

# Plot the XOR points
scatter = ax2.scatter(X_xor[:, 0], X_xor[:, 1], c=y_xor.flatten(),
                     cmap='RdYlBu_r', s=300, edgecolors='black', linewidths=3)

# Add labels
for px, py, label in zip(X_xor[:, 0], X_xor[:, 1], y_xor.flatten()):
    ax2.text(px, py-0.15, f'({int(px)},{int(py)})→{int(label)}',
             fontsize=11, ha='center', fontweight='bold')

ax2.set_xlabel('Input 1', fontsize=12)
ax2.set_ylabel('Input 2', fontsize=12)
ax2.set_title('Decision Boundary: Network Learned XOR!', fontsize=14, fontweight='bold')
ax2.set_xlim(-0.5, 1.5)
ax2.set_ylim(-0.5, 1.5)
plt.colorbar(contour, ax=ax2, label='Prediction')

plt.tight_layout()
plt.show()

print("\n📊 Understanding the plots:")
print("Left: Loss curve - steady decrease means learning is working")
print("Right: Decision boundary - the network separates the classes correctly!")
print("  • Blue regions = network predicts 0")
print("  • Red regions = network predicts 1")
print("  • Black line = decision boundary (50% confidence)")

---

## 🔬 Part 5: A More Complex Dataset - Circles

Let's create a more challenging problem: classifying points in an inner disk versus points in a surrounding ring.
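
This dataset is also large enough to hold some examples back for validation. As a reference for what such a split does mechanically, here is a minimal sketch. The helper name `train_val_split`, its parameters, and the placeholder data are hypothetical; the key point is that features and labels are shuffled with the same indices before being divided.

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the rows of X and y together, then split off a validation set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Usage with placeholder data: 10 examples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10).reshape(10, 1)
X_tr, y_tr, X_va, y_va = train_val_split(X_demo, y_demo, val_fraction=0.2)
print(X_tr.shape, X_va.shape)  # (8, 2) (2, 2)
```

If the data is already shuffled, slicing off the last 20% gives an equivalent split.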

In [None]:
def generate_circle_data(n_samples=200, noise=0.1):
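    """Generate a 2-D dataset: an inner disk (class 0) surrounded by a ring (class 1)
    
    Args:
        n_samples: Number of samples to generate
        noise: Standard deviation of the Gaussian noise added to each coordinate
    
    Returns:
        X: Features (n_samples, 2)
        y: Labels (n_samples, 1) - 0 for the inner disk, 1 for the outer ring
    """
    # Fixed seed so every call returns the same dataset (this also resets NumPy's global RNG)
    np.random.seed(42)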
    
    # Generate inner circle (class 0)
    n_inner = n_samples // 2
    theta_inner = np.random.uniform(0, 2*np.pi, n_inner)
    r_inner = np.random.uniform(0, 0.5, n_inner)
    X_inner = np.column_stack([
        r_inner * np.cos(theta_inner) + np.random.normal(0, noise, n_inner),
        r_inner * np.sin(theta_inner) + np.random.normal(0, noise, n_inner)
    ])
    y_inner = np.zeros((n_inner, 1))
    
    # Generate outer circle (class 1)
    n_outer = n_samples - n_inner
    theta_outer = np.random.uniform(0, 2*np.pi, n_outer)
    r_outer = np.random.uniform(0.7, 1.0, n_outer)
    X_outer = np.column_stack([
        r_outer * np.cos(theta_outer) + np.random.normal(0, noise, n_outer),
        r_outer * np.sin(theta_outer) + np.random.normal(0, noise, n_outer)
    ])
    y_outer = np.ones((n_outer, 1))
    
    # Combine and shuffle
    X = np.vstack([X_inner, X_outer])
    y = np.vstack([y_inner, y_outer])
    
    shuffle_idx = np.random.permutation(n_samples)
    X = X[shuffle_idx]
    y = y[shuffle_idx]
    
    return X, y

# Generate dataset
X_circles, y_circles = generate_circle_data(n_samples=400, noise=0.08)

# Split into train and validation sets (80/20 split)
split_idx = int(0.8 * len(X_circles))
X_train = X_circles[:split_idx]
y_train = y_circles[:split_idx]
X_val = X_circles[split_idx:]
y_val = y_circles[split_idx:]

print(f"Training set: {len(X_train)} samples")
print(f"Validation set: {len(X_val)} samples")
print()

# Visualize the dataset
plt.figure(figsize=(8, 8))
plt.scatter(X_train[y_train.flatten() == 0, 0], 
           X_train[y_train.flatten() == 0, 1],
           c='blue', label='Class 0 (inner)', s=50, alpha=0.6, edgecolors='black')
plt.scatter(X_train[y_train.flatten() == 1, 0], 
           X_train[y_train.flatten() == 1, 1],
           c='red', label='Class 1 (outer)', s=50, alpha=0.6, edgecolors='black')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Circles Dataset (Training Set)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

print("💡 This is a non-linear problem!")
print("A single neuron can't solve this - we need a multi-layer network.")
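
To make the "non-linear" claim concrete, here is a small sketch that is not part of the original notebook. It assumes noise-free points drawn from the same radius ranges as `generate_circle_data`; the helper `radius_feature` and the 0.6 cutoff are illustrative choices. A single threshold works once we hand-craft the radius as a feature, while a threshold on one raw coordinate does little better than chance.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, leaves the global seed untouched

def radius_feature(X):
    """Distance of each point from the origin (hypothetical helper)."""
    return np.sqrt((X ** 2).sum(axis=1))

# Noise-free points using the same radius ranges as generate_circle_data
n = 200
theta = rng.uniform(0, 2 * np.pi, 2 * n)
r = np.concatenate([rng.uniform(0, 0.5, n), rng.uniform(0.7, 1.0, n)])
X_demo = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y_demo = np.concatenate([np.zeros(n), np.ones(n)])

# One threshold on the hand-crafted radius separates the two classes
acc_radius = ((radius_feature(X_demo) > 0.6).astype(float) == y_demo).mean()

# One threshold on a raw coordinate does not
acc_raw = ((X_demo[:, 0] > 0).astype(float) == y_demo).mean()

print(f"Radius threshold accuracy: {acc_radius:.2f}")
print(f"Raw x-coordinate threshold accuracy: {acc_raw:.2f}")
```

The hidden layer's job is to learn a feature like this radius on its own, instead of having it supplied by hand.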

In [None]:
# Train network on circles dataset

print("Training network on circles dataset...\n")

# Create network with more hidden neurons for complex pattern
np.random.seed(42)
network_circles = NeuralNetwork(
    input_size=2,
    hidden_size=16,  # More neurons for complex pattern
    output_size=1,
    learning_rate=0.1,
    activation='relu'
)

# Train with validation set and early stopping
history_circles = network_circles.train(
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
    epochs=500,
    batch_size=32,
    verbose=True,
    early_stopping_patience=50  # Stop if no improvement for 50 epochs
)

In [None]:
# Visualize training progress and results

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# Plot 1: Training and validation loss
ax1 = axes[0, 0]
epochs = range(1, len(history_circles['train_loss']) + 1)
ax1.plot(epochs, history_circles['train_loss'], 'b-', linewidth=2, label='Training Loss')
ax1.plot(epochs, history_circles['val_loss'], 'r-', linewidth=2, label='Validation Loss')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (MSE)', fontsize=12)
ax1.set_title('Loss Over Time', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Training and validation accuracy
ax2 = axes[0, 1]
ax2.plot(epochs, history_circles['train_acc'], 'b-', linewidth=2, label='Training Accuracy')
ax2.plot(epochs, history_circles['val_acc'], 'r-', linewidth=2, label='Validation Accuracy')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Over Time', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Decision boundary on training set
ax3 = axes[1, 0]
xx, yy = np.meshgrid(np.linspace(-1.2, 1.2, 200),
                     np.linspace(-1.2, 1.2, 200))
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = network_circles.forward(grid_points)
Z = Z.reshape(xx.shape)

contour = ax3.contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.8)
ax3.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
ax3.scatter(X_train[y_train.flatten() == 0, 0], X_train[y_train.flatten() == 0, 1],
           c='blue', s=30, alpha=0.6, edgecolors='black')
ax3.scatter(X_train[y_train.flatten() == 1, 0], X_train[y_train.flatten() == 1, 1],
           c='red', s=30, alpha=0.6, edgecolors='black')
ax3.set_xlabel('Feature 1', fontsize=12)
ax3.set_ylabel('Feature 2', fontsize=12)
ax3.set_title('Training Set Decision Boundary', fontsize=14, fontweight='bold')
ax3.axis('equal')
plt.colorbar(contour, ax=ax3, label='Prediction')

# Plot 4: Decision boundary on validation set
ax4 = axes[1, 1]
contour = ax4.contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.8)
ax4.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
ax4.scatter(X_val[y_val.flatten() == 0, 0], X_val[y_val.flatten() == 0, 1],
           c='blue', s=30, alpha=0.6, edgecolors='black', label='Class 0')
ax4.scatter(X_val[y_val.flatten() == 1, 0], X_val[y_val.flatten() == 1, 1],
           c='red', s=30, alpha=0.6, edgecolors='black', label='Class 1')
ax4.set_xlabel('Feature 1', fontsize=12)
ax4.set_ylabel('Feature 2', fontsize=12)
ax4.set_title('Validation Set Decision Boundary', fontsize=14, fontweight='bold')
ax4.legend()
ax4.axis('equal')
plt.colorbar(contour, ax=ax4, label='Prediction')

plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print("• Top left: Both training and validation loss decrease - good sign!")
print("• Top right: Accuracy improves for both sets")
print("• Bottom: The network learned the circular pattern")
print("• If validation curves match training: model generalizes well! ✓")
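
The accuracy curves and the black 0.5 contour both rest on thresholding the network's output. A minimal sketch of that computation, assuming outputs in [0, 1] and a 0.5 cutoff (the helper name `accuracy_percent` is hypothetical, and the notebook's own metric code may differ):

```python
import numpy as np

def accuracy_percent(y_prob, y_true, threshold=0.5):
    """Percentage of samples whose thresholded prediction matches the label."""
    preds = (y_prob >= threshold).astype(int)
    return 100.0 * np.mean(preds == y_true)

# Made-up predictions, for illustration only
y_prob = np.array([[0.1], [0.8], [0.4], [0.9]])
y_true = np.array([[0], [1], [1], [1]])

acc = accuracy_percent(y_prob, y_true)
print(f"Accuracy: {acc:.1f}%")
```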

---

## 🎛️ Part 6: Hyperparameter Tuning Experiments

Let's explore how different hyperparameters affect training!

### Experiment 1: Learning Rate Comparison

In [None]:
# Compare different learning rates

learning_rates = [0.001, 0.01, 0.1, 0.5]
results = {}

print("Testing different learning rates...\n")

for lr in learning_rates:
    print(f"Training with learning rate = {lr}...")
    
    # Create and train network
    np.random.seed(42)  # Same initialization for fair comparison
    net = NeuralNetwork(
        input_size=2,
        hidden_size=8,
        output_size=1,
        learning_rate=lr,
        activation='relu'
    )
    
    history = net.train(
        X_train=X_train,
        y_train=y_train,
        X_val=X_val,
        y_val=y_val,
        epochs=200,
        batch_size=32,
        verbose=False  # Suppress output
    )
    
    results[lr] = history
    print(f"  Final val accuracy: {history['val_acc'][-1]:.2f}%")

print("\nDone!")

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

for lr, history in results.items():
    epochs = range(1, len(history['val_loss']) + 1)
    ax1.plot(epochs, history['val_loss'], linewidth=2, label=f'LR = {lr}')
    ax2.plot(epochs, history['val_acc'], linewidth=2, label=f'LR = {lr}')

ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Loss', fontsize=12)
ax1.set_title('Effect of Learning Rate on Loss', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Validation Accuracy (%)', fontsize=12)
ax2.set_title('Effect of Learning Rate on Accuracy', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎯 What we learned:")
print("• Too small (0.001): Slow learning, may not converge within 200 epochs")
print("• Just right (0.01-0.1): Fast, stable convergence")
print("• Too large (0.5): Steps can overshoot the minimum, so the loss may oscillate or diverge")
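
The same three regimes show up on a much simpler problem. The sketch below is an illustration, not the notebook's network: it runs plain gradient descent on the toy function f(w) = w², where each step multiplies w by (1 - 2·lr). The function name `gradient_descent_1d` and the chosen rates are assumptions for the example. The exact thresholds depend on the loss surface, so the numbers do not transfer to the circles network.

```python
def gradient_descent_1d(lr, w0=1.0, steps=20):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w        # derivative of w**2
        w = w - lr * grad   # standard gradient descent update
    return w

for lr in [0.001, 0.1, 1.1]:
    final_w = gradient_descent_1d(lr)
    print(f"lr={lr}: |w| after 20 steps = {abs(final_w):.4g}")
```

With the tiny rate, w has barely moved from 1.0; with the moderate rate it is close to the minimum at 0; with the large rate each step overshoots and |w| grows.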

### Experiment 2: Batch Size Comparison

In [None]:
# Compare different batch sizes
import time  # Needed below to time each training run

batch_sizes = [8, 16, 32, 64]
batch_results = {}

print("Testing different batch sizes...\n")

for bs in batch_sizes:
    print(f"Training with batch size = {bs}...")
    
    # Create and train network
    np.random.seed(42)
    net = NeuralNetwork(
        input_size=2,
        hidden_size=8,
        output_size=1,
        learning_rate=0.1,
        activation='relu'
    )
    
    start_time = time.time()
    history = net.train(
        X_train=X_train,
        y_train=y_train,
        X_val=X_val,
        y_val=y_val,
        epochs=200,
        batch_size=bs,
        verbose=False
    )
    train_time = time.time() - start_time
    
    batch_results[bs] = {'history': history, 'time': train_time}
    print(f"  Time: {train_time:.2f}s | Final val acc: {history['val_acc'][-1]:.2f}%")

print("\nDone!")

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

for bs, data in batch_results.items():
    history = data['history']
    epochs = range(1, len(history['val_loss']) + 1)
    ax1.plot(epochs, history['val_loss'], linewidth=2, label=f'Batch = {bs}')
    ax2.plot(epochs, history['val_acc'], linewidth=2, label=f'Batch = {bs}')

ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Validation Loss', fontsize=12)
ax1.set_title('Effect of Batch Size on Loss', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Validation Accuracy (%)', fontsize=12)
ax2.set_title('Effect of Batch Size on Accuracy', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎯 What we learned:")
print("• Smaller batches (8-16): Noisier updates, but more of them per epoch; the noise can help escape poor local minima")
print("• Medium batches (32): Good balance of speed and stability")
print("• Larger batches (64+): Smoother gradient estimates, but fewer updates per epoch, so progress per epoch can be slower")
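
Part of the batch-size effect is simple arithmetic: smaller batches mean more weight updates per epoch. A quick sketch, assuming one update per batch and the 320 training samples used above (80% of 400); `updates_per_epoch` is a hypothetical helper:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_train = 320  # 80% of the 400 circle samples

for batch_size in [8, 16, 32, 64]:
    per_epoch = updates_per_epoch(n_train, batch_size)
    print(f"batch_size={batch_size:>2}: {per_epoch:>2} updates per epoch, "
          f"{per_epoch * 200} over 200 epochs")
```

So when two runs are compared after the same number of epochs, the small-batch run has taken many more gradient steps, which is worth keeping in mind when reading the plots.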

---

## ⚠️ Part 7: Common Mistakes and How to Avoid Them

### Mistake 1: Not Shuffling Data

In [None]:
# Demonstrate importance of shuffling

# Create sorted XOR data (all class 0, then all class 1)
X_sorted = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])  # Sorted by class
y_sorted = np.array([[0], [0], [1], [1]])  # All 0s, then all 1s

print("⚠️ WARNING: Training on sorted (non-shuffled) data\n")

# Train without shuffling (manual loop that bypasses the shuffling in train())
np.random.seed(42)
net_no_shuffle = NeuralNetwork(2, 8, 1, learning_rate=0.1, activation='relu')

# Manually train without shuffling
losses_no_shuffle = []
for epoch in range(100):
    batch_losses = []
    # Process in order (no shuffling!)
    for i in range(0, len(X_sorted), 2):
        X_batch = X_sorted[i:i+2]
        y_batch = y_sorted[i:i+2]
        batch_losses.append(net_no_shuffle.train_step(X_batch, y_batch))
    # Record the mean loss over the epoch, not just the last batch's loss
    losses_no_shuffle.append(np.mean(batch_losses))

print("✅ Now training with shuffling (correct way)\n")

# Train with shuffling (normal way)
np.random.seed(42)
net_shuffle = NeuralNetwork(2, 8, 1, learning_rate=0.1, activation='relu')
history_shuffle = net_shuffle.train(
    X_sorted, y_sorted, epochs=100, batch_size=2, verbose=False
)

# Compare
plt.figure(figsize=(10, 5))
plt.plot(losses_no_shuffle, 'r-', linewidth=2, label='No Shuffling ❌')
plt.plot(history_shuffle['train_loss'], 'g-', linewidth=2, label='With Shuffling ✓')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Shuffling Makes Training More Stable', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("\n💡 Key Takeaway: ALWAYS shuffle your data!")
print("Without shuffling, each batch here contains only one class, so every update")
print("pulls the weights toward that class instead of the pattern we want to learn.")
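
One detail that is easy to get wrong when shuffling by hand: features and labels must be reordered with the same permutation. The following is a minimal sketch of a shuffled mini-batch iterator. The name `iterate_minibatches` is hypothetical and this is not necessarily how the notebook's own batching code is written.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs in a fresh random order."""
    idx = rng.permutation(len(X))            # one permutation shared by X and y
    for start in range(0, len(X), batch_size):
        batch_idx = idx[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]     # same indices keep pairs aligned

rng = np.random.default_rng(0)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = X_demo * 2                          # each label is tied to its feature

batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
for X_b, y_b in batches:
    print(X_b.ravel(), y_b.ravel())
```

Calling the generator again at the start of each epoch produces a new order, so batches differ from epoch to epoch.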

### Mistake 2: Not Normalizing Inputs

In [None]:
# Create data with very different scales
np.random.seed(42)
X_unnormalized = np.random.randn(100, 2)
X_unnormalized[:, 0] *= 1000  # First feature: large scale
X_unnormalized[:, 1] *= 0.01  # Second feature: tiny scale
y_unnormalized = ((X_unnormalized[:, 0] > 0) & (X_unnormalized[:, 1] > 0)).astype(int).reshape(-1, 1)

# Normalize data
X_normalized = (X_unnormalized - X_unnormalized.mean(axis=0)) / X_unnormalized.std(axis=0)

print("Training on UNNORMALIZED data...")
np.random.seed(42)
net_unnorm = NeuralNetwork(2, 8, 1, learning_rate=0.01, activation='relu')
history_unnorm = net_unnorm.train(
    X_unnormalized, y_unnormalized, epochs=100, batch_size=10, verbose=False
)

print("Training on NORMALIZED data...")
np.random.seed(42)
net_norm = NeuralNetwork(2, 8, 1, learning_rate=0.01, activation='relu')
history_norm = net_norm.train(
    X_normalized, y_unnormalized, epochs=100, batch_size=10, verbose=False
)

# Compare
plt.figure(figsize=(10, 5))
plt.plot(history_unnorm['train_loss'], 'r-', linewidth=2, label='Unnormalized ❌')
plt.plot(history_norm['train_loss'], 'g-', linewidth=2, label='Normalized ✓')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Normalization Speeds Up Training', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

print("\n💡 Key Takeaway: Normalize your inputs!")
print("Formula: X_norm = (X - mean) / std")
print("This gives features similar scales → more stable training")
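
When a validation set is involved, a common convention is to compute the mean and standard deviation on the training data only and reuse them for validation data, so no information leaks from the held-out set. A minimal sketch under that assumption; `fit_standardizer` and `standardize` are hypothetical helper names, and the small `eps` is added to guard against constant features:

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Compute per-feature mean and std from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps   # eps avoids division by zero
    return mean, std

def standardize(X, mean, std):
    """Apply X_norm = (X - mean) / std with precomputed statistics."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 2)) * np.array([1000.0, 0.01])  # mismatched scales
X_tr, X_va = X_all[:80], X_all[80:]

mean, std = fit_standardizer(X_tr)
X_tr_norm = standardize(X_tr, mean, std)
X_va_norm = standardize(X_va, mean, std)   # reuse the training statistics

print("train mean:", X_tr_norm.mean(axis=0).round(3))
print("train std: ", X_tr_norm.std(axis=0).round(3))
```

The validation features will not have exactly zero mean and unit variance, and that is expected: they are scaled the same way new, unseen data would be.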

### Mistake 3: Training Too Long (Overfitting)

In [None]:
# Demonstrate overfitting with a very small training set
X_tiny = X_train[:30]  # Only 30 training samples
y_tiny = y_train[:30]

print("Training with small dataset to demonstrate overfitting...\n")

# Train for many epochs
np.random.seed(42)
net_overfit = NeuralNetwork(2, 32, 1, learning_rate=0.1, activation='relu')  # Large network
history_overfit = net_overfit.train(
    X_train=X_tiny,
    y_train=y_tiny,
    X_val=X_val,
    y_val=y_val,
    epochs=500,
    batch_size=10,
    verbose=False
)

# Plot training vs validation
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

epochs = range(1, len(history_overfit['train_loss']) + 1)
# Epoch with the lowest validation loss: the point where training should have stopped
best_epoch = int(np.argmin(history_overfit['val_loss'])) + 1

ax1.plot(epochs, history_overfit['train_loss'], 'b-', linewidth=2, label='Training')
ax1.plot(epochs, history_overfit['val_loss'], 'r-', linewidth=2, label='Validation')
ax1.axvline(x=best_epoch, color='green', linestyle='--', linewidth=2, label='Should stop here')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Overfitting: Val Loss Increases', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(epochs, history_overfit['train_acc'], 'b-', linewidth=2, label='Training')
ax2.plot(epochs, history_overfit['val_acc'], 'r-', linewidth=2, label='Validation')
ax2.axvline(x=best_epoch, color='green', linestyle='--', linewidth=2, label='Should stop here')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Overfitting: Gap Between Train and Val', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🚨 Signs of Overfitting:")
print("• Training loss keeps decreasing")
print("• Validation loss starts increasing (the gap widens)")
print("• Training accuracy high, validation accuracy lower")
print("\n💡 Solution: Use early stopping or regularization!")
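
The early stopping rule itself is only a few lines. The sketch below applies it to a list of validation losses. The numbers are made up for illustration, and `early_stopping_epoch` is a hypothetical helper, not the logic inside `NeuralNetwork.train`.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: patience resets
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return len(val_losses), best_epoch            # never triggered: ran to the end

# Illustrative validation losses: improve, then drift upward (overfitting)
val_losses = [0.9, 0.6, 0.4, 0.35, 0.36, 0.38, 0.41, 0.45, 0.5]

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"Best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

In practice you would also keep a copy of the weights from the best epoch, since the weights at the stopping epoch are already slightly overfit.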

---

## 🎯 Final Summary

Congratulations! You've learned everything needed to train neural networks from scratch!

### 🎓 What You've Mastered

#### Core Concepts
✅ **Epochs**: Complete passes through the data

✅ **Batches**: Mini-batch gradient descent for efficiency

✅ **Iterations**: Individual weight updates

✅ **Training Loop**: The full pipeline from data to trained model

#### Best Practices
✅ **Proper Initialization**: Xavier/He for stable gradients

✅ **Data Shuffling**: Prevents order-based biases

✅ **Normalization**: Speeds up training

✅ **Train/Val Split**: Detect overfitting

✅ **Early Stopping**: Prevent overfitting automatically

✅ **Hyperparameter Tuning**: Learning rate, batch size, etc.

#### Common Mistakes to Avoid
❌ All-zero initialization

❌ Not shuffling data

❌ Not normalizing inputs

❌ Training too long (overfitting)

❌ Learning rate too high/low
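
To see why all-zero initialization is on this list, consider a stripped-down network: one ReLU hidden layer, a linear output, MSE loss, and no biases. That setup is an assumption made for brevity and is not the notebook's `NeuralNetwork` class. With zero weights every weight gradient is exactly zero, so training cannot start; He initialization (std = sqrt(2 / fan_in)) gives non-zero gradients.

```python
import numpy as np

def weight_gradients(W1, W2, X, y):
    """Backprop for a tiny net: ReLU hidden layer, linear output, MSE loss, no biases."""
    z1 = X @ W1
    h = np.maximum(0, z1)                      # ReLU activation
    y_hat = h @ W2
    d_out = 2 * (y_hat - y) / len(X)           # dLoss/dy_hat for mean squared error
    dW2 = h.T @ d_out
    dW1 = X.T @ ((d_out @ W2.T) * (z1 > 0))    # ReLU derivative taken as 0 at z1 <= 0
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))
y = rng.normal(size=(16, 1))

# All-zero initialization: hidden outputs and output weights are zero
dW1_zero, dW2_zero = weight_gradients(np.zeros((2, 8)), np.zeros((8, 1)), X, y)

# He initialization: scale random weights by sqrt(2 / fan_in)
W1_he = rng.normal(size=(2, 8)) * np.sqrt(2 / 2)
W2_he = rng.normal(size=(8, 1)) * np.sqrt(2 / 8)
dW1_he, dW2_he = weight_gradients(W1_he, W2_he, X, y)

print("zero init, max |dW1|:", np.abs(dW1_zero).max())
print("He init,   max |dW1|:", np.abs(dW1_he).max())
```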

### 🎬 The Complete Picture

You can now:
1. **Design** a network architecture
2. **Initialize** weights properly
3. **Prepare** data (shuffle, normalize, split)
4. **Train** using mini-batch gradient descent
5. **Monitor** progress (loss curves, accuracy)
6. **Validate** to prevent overfitting
7. **Tune** hyperparameters for best results
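
As a compact recap, the steps above can be sketched end to end on a deliberately tiny problem. Everything here is an assumption made for illustration: the data are two Gaussian blobs rather than the circles dataset, the "network" is a single sigmoid neuron trained with MSE, and `predict`, `mse`, and `train` are hypothetical helpers, not the notebook's `NeuralNetwork` class.

```python
import numpy as np

def predict(X, W, b):
    """Single sigmoid neuron."""
    return 1 / (1 + np.exp(-(X @ W + b)))

def mse(p, y):
    return float(np.mean((p - y) ** 2))

def train(X_tr, y_tr, X_va, y_va, lr=0.1, batch_size=16, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (X_tr.shape[1], 1))   # small random initial weights
    b = np.zeros((1, 1))
    history = {"train_loss": [], "val_loss": []}
    for _ in range(epochs):
        order = rng.permutation(len(X_tr))        # reshuffle every epoch
        for start in range(0, len(X_tr), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X_tr[idx], y_tr[idx]
            p = predict(Xb, W, b)
            delta = 2 * (p - yb) * p * (1 - p)    # MSE gradient through the sigmoid
            W = W - lr * (Xb.T @ delta) / len(Xb)
            b = b - lr * delta.mean(axis=0, keepdims=True)
        # Monitor both losses once per epoch
        history["train_loss"].append(mse(predict(X_tr, W, b), y_tr))
        history["val_loss"].append(mse(predict(X_va, W, b), y_va))
    return W, b, history

# Prepare data: generate, shuffle, split 80/20, normalize with training statistics
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.vstack([np.zeros((100, 1)), np.ones((100, 1))])
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
split = int(0.8 * len(X))
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]
mean, std = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr, X_va = (X_tr - mean) / std, (X_va - mean) / std

W, b, history = train(X_tr, y_tr, X_va, y_va)
val_acc = 100 * np.mean((predict(X_va, W, b) >= 0.5) == y_va)
print(f"train loss {history['train_loss'][0]:.3f} -> {history['train_loss'][-1]:.3f}")
print(f"validation accuracy: {val_acc:.1f}%")
```

Swapping in a different learning rate or batch size through the `train` arguments is a quick way to rerun the tuning experiments on this toy problem.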

### 🚀 What's Next?

You've built neural networks **from scratch**! But in practice, we use frameworks like:
- **PyTorch**: Widely used in research
- **TensorFlow/Keras**: Widely used in production

These frameworks:
- Handle backpropagation automatically
- Optimize using GPUs
- Provide pre-built layers and models
- Include advanced optimizers (Adam, RMSprop)

**But understanding the fundamentals you've learned makes you a MUCH better deep learning practitioner!** 🌟

---

## 💪 Final Challenge

Can you improve the circles classifier?

Try:
1. Different network sizes
2. Different learning rates
3. Different batch sizes
4. Different numbers of epochs
5. Adding a third layer?

Experiment and see what works best!

In [None]:
# Your experimentation space!
# Try different configurations and see what happens

# Example:
# np.random.seed(42)
# my_network = NeuralNetwork(
#     input_size=2,
#     hidden_size=???,  # Try different sizes!
#     output_size=1,
#     learning_rate=???,  # Try different rates!
#     activation='relu'
# )
#
# history = my_network.train(
#     X_train=X_train,
#     y_train=y_train,
#     X_val=X_val,
#     y_val=y_val,
#     epochs=???,  # Try different numbers!
#     batch_size=???,  # Try different sizes!
#     verbose=True
# )

print("Happy experimenting! 🧪")

---

## 🎉 Congratulations!

You've completed the **Neural Networks Fundamentals** series!

From knowing nothing to building and training neural networks from scratch - that's an incredible journey! 🚀

### Your Journey:
1. ✅ **Notebooks 1-3**: Understanding neurons, activations, and basic concepts
2. ✅ **Notebooks 4-6**: Building layers, forward propagation, and loss functions
3. ✅ **Notebook 7**: Backpropagation - the learning algorithm
4. ✅ **Notebook 8**: Training loop - putting it all together

### You now understand:
- How neural networks work (not magic!)
- How they learn (backpropagation + gradient descent)
- How to train them (the complete pipeline)
- What can go wrong (and how to fix it)

### Keep Learning! 📚

Next topics to explore:
- **Convolutional Neural Networks** (CNNs) for images
- **Recurrent Neural Networks** (RNNs) for sequences
- **Transformers** for modern AI
- **PyTorch/TensorFlow** for practical applications

You have a **solid foundation**. Everything else builds on what you've learned here!

**You did it!** 🌟🎊🎉