# Tutorial 2: From Perceptrons to Deep Neural Networks

## Welcome! 🎓

In this tutorial, you'll learn the fundamental challenges and solutions for training deep neural networks.

**What you'll master:**
- ⚡ The vanishing gradient problem
- 🎯 Activation functions (sigmoid, tanh, ReLU)
- 📊 Detecting overfitting
- 🛡️ Regularization techniques (L1, L2, Dropout)

**Time commitment:** ~2 hours

**Prerequisites:**
- Basic Python and PyTorch
- Understanding of neural networks
- Completed Tutorial 1 (recommended)

---

## 📦 Setup and Imports

First, let's import all the necessary modules from our tutorial package.

In [None]:
# Core libraries
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Tutorial modules
from perceptron_to_DNN_tutorial.MultiLayerPerceptron import MultiLayerPerceptron
from perceptron_to_DNN_tutorial.train import (
    train_model_with_gradient_tracking,
    train_model_with_validation_tracking
)
from perceptron_to_DNN_tutorial.utils import (
    create_toy_dataset,
    FeatureNormalizer
)
from perceptron_to_DNN_tutorial.plotting import (
    plot_gradient_flow,
    plot_layer_gradient_norms,
    plot_regularization_comparison,
    plot_results
)
from perceptron_to_DNN_tutorial.logger import get_logger

# Initialize logger
logger = get_logger(__name__)

print("✅ All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")

## ⚙️ Configuration

Let's set up our experimental parameters. These control:
- Network architecture (depth and width)
- Data properties (polynomial complexity, noise)
- Training hyperparameters

**Note:** Feel free to modify these later to see how results change!

In [None]:
# ========================================
# CONFIGURATION
# ========================================

# Network Architecture
# This is a DEEP network: 9 hidden layers of 128 units, i.e. 10 weight layers (input → 9 hidden → output)
architecture_deep = [1, 128, 128, 128, 128, 128, 128, 128, 128, 128, 1]

# Data Generation Parameters
data_poly_order = 9      # High-order polynomial (complex function)
n_train_samples = 200    # Training set size
n_valid_samples = 200    # Validation set size  
n_test_samples = 200     # Test set size
noise_std = 2.5          # Noise level in data
x_range = [0, 10]        # Input range

# True polynomial coefficients (ground truth function)
coeffs_true = [10.0, 0.5, -0.04, 0.015, -0.001, -0.0003, 0.000055, -0.000005, -1e-7, 2e-8]

# Training Hyperparameters
num_epochs = 10000       # Training iterations (for vanishing gradient demo)
num_epochs_reg = 1000    # Fewer epochs for regularization experiments
learning_rate = 0.005    # Step size for gradient descent

# Activation functions to test
activations_to_test = ['sigmoid', 'tanh', 'relu']

# Regularization parameters
lambda_l2 = 0.01         # L2 regularization strength
dropout_rate = 0.3       # Dropout probability

print("\n" + "="*70)
print(" CONFIGURATION")
print("="*70)
print(f"Polynomial order: {data_poly_order}")
print(f"Training samples: {n_train_samples}")
print(f"Validation samples: {n_valid_samples}")
print(f"Test samples: {n_test_samples}")
print(f"Network architecture: {' → '.join(map(str, architecture_deep))}")
print(f"Activations to test: {activations_to_test}")
print(f"Training epochs (gradient demo): {num_epochs}")
print(f"Training epochs (regularization): {num_epochs_reg}")
print("="*70)

---

# Part 1: The Vanishing Gradient Problem ⚡

## What is it?

In deep networks, gradients can shrink exponentially as they backpropagate through layers:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_9} \cdot \frac{\partial h_9}{\partial h_8} \cdot \ldots \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}$$

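With sigmoid, the activation derivative is at most 0.25; with tanh it is at most 1 and usually smaller. Unless the weights compensate, the product of many such factors **vanishes** (approaches 0).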
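
To get a feel for the numbers, here is a minimal sketch (not part of the tutorial package) that multiplies the largest possible sigmoid derivative, 0.25, across the nine hidden layers used later in this notebook. It ignores the weight matrices that also appear in each factor, so treat it as an illustrative bound on the activation contribution only, not a prediction of the measured gradients. The helper names are made up for this example.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

def activation_factor_bound(max_derivative: float, n_layers: int) -> float:
    """Upper bound on the product of n_layers activation derivatives."""
    return max_derivative ** n_layers

# The sigmoid derivative peaks at x = 0.
peak = sigmoid_derivative(0.0)

# Nine hidden layers, matching the architecture configured above.
bound = activation_factor_bound(peak, 9)
print(f"peak sigmoid derivative: {peak}")
print(f"bound after 9 layers:    {bound:.3e}")
```

Even in the best case for sigmoid, the activation factors alone shrink the signal by several orders of magnitude over nine layers.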

## Why does it matter?

- Early layers get **tiny gradients** → learn very slowly
- Network becomes effectively **shallow**
- Can't leverage the power of depth

## What's the solution?

**ReLU activation!** ReLU'(x) = 1 for x > 0, so active units pass gradients back without shrinking them.

Let's see this in action! 🚀

## Step 1.1: Generate Datasets 📊

We'll create three separate datasets:
- **Training**: Used to update weights
- **Validation**: Used to monitor overfitting
- **Test**: Used for final evaluation

**Key principle:** Use different random seeds so datasets are truly independent!
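
The helper `create_toy_dataset` lives in the tutorial package and its implementation is not shown here. As a rough idea of what such a helper could do, the sketch below samples inputs uniformly, evaluates a polynomial, and adds Gaussian noise. The name `make_poly_dataset`, the constant-term-first coefficient ordering, and the uniform sampling are all assumptions for illustration.

```python
import numpy as np

def make_poly_dataset(coefficients, n_samples, x_range, noise_std, random_seed):
    """Hypothetical stand-in for create_toy_dataset (assumed behaviour)."""
    rng = np.random.default_rng(random_seed)
    x = rng.uniform(x_range[0], x_range[1], size=n_samples)
    # polyval expects coefficients ordered from the constant term upwards.
    y_clean = np.polynomial.polynomial.polyval(x, coefficients)
    y = y_clean + rng.normal(0.0, noise_std, size=n_samples)
    return x, y

coeffs = [10.0, 0.5, -0.04]
x_a, y_a = make_poly_dataset(coeffs, 200, [0, 10], 2.5, random_seed=100)
x_b, y_b = make_poly_dataset(coeffs, 200, [0, 10], 2.5, random_seed=200)
x_c, y_c = make_poly_dataset(coeffs, 200, [0, 10], 2.5, random_seed=100)

print("same seed reproduces data:", np.array_equal(x_a, x_c))
print("different seeds differ:   ", not np.array_equal(x_a, x_b))
```

A fixed seed makes each split reproducible, while distinct seeds keep the splits from sharing samples.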

In [None]:
print("\n" + "="*70)
print(" PART 1: VANISHING GRADIENT PROBLEM")
print("="*70)

print("\n[STEP 1.1] Generating train/validation/test datasets...")

# Generate training data
x_train, y_train = create_toy_dataset(
    coefficients=coeffs_true,
    n_samples=n_train_samples,
    x_range=x_range,
    noise_std=noise_std,
    random_seed=100  # Fixed seed for reproducibility
)

# Generate validation data (different seed!)
x_valid, y_valid = create_toy_dataset(
    coefficients=coeffs_true,
    n_samples=n_valid_samples,
    x_range=x_range,
    noise_std=noise_std,
    random_seed=200
)

# Generate test data (different seed!)
x_test, y_test = create_toy_dataset(
    coefficients=coeffs_true,
    n_samples=n_test_samples,
    x_range=x_range,
    noise_std=noise_std,
    random_seed=300
)

print(f"✓ Training: {len(x_train)} samples, y range [{y_train.min():.2f}, {y_train.max():.2f}]")
print(f"✓ Validation: {len(x_valid)} samples, y range [{y_valid.min():.2f}, {y_valid.max():.2f}]")
print(f"✓ Test: {len(x_test)} samples, y range [{y_test.min():.2f}, {y_test.max():.2f}]")

## Step 1.2: Normalize Features 🔄

Neural networks work best with normalized inputs. We'll normalize to the range [-1, 1].

**Critical:** Use training statistics to normalize ALL datasets (train, val, test)!
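
To make the "training statistics only" rule concrete, here is a sketch of what a symmetric min-max scaler might look like. `SymmetricNormalizer` is a hypothetical class written for this example; the package's `FeatureNormalizer` may work differently.

```python
import numpy as np

class SymmetricNormalizer:
    """Hypothetical min-max scaler mapping the training range onto [-1, 1]."""

    def fit(self, x):
        # Statistics come from the training data only.
        self.x_min = float(np.min(x))
        self.x_max = float(np.max(x))
        return self

    def transform(self, x):
        span = self.x_max - self.x_min
        return 2.0 * (np.asarray(x, dtype=float) - self.x_min) / span - 1.0

x_train_demo = np.array([0.0, 2.5, 10.0])
x_valid_demo = np.array([5.0, 12.0])  # 12.0 lies outside the training range

scaler = SymmetricNormalizer().fit(x_train_demo)
train_scaled = scaler.transform(x_train_demo)
valid_scaled = scaler.transform(x_valid_demo)
print(train_scaled)
print(valid_scaled)
```

Validation or test points that fall outside the training range map slightly outside [-1, 1]. That is expected: refitting the scaler on those sets would leak information about them into preprocessing.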

In [None]:
print("\n[STEP 1.2] Normalizing features...")

# Fit normalizer on training data only
normalizer = FeatureNormalizer(method='symmetric')
normalizer.fit(x_train)

# Apply same normalization to all datasets
x_t = normalizer.transform(x_train)
y_t = y_train

x_valid_norm = normalizer.transform(x_valid)
x_test_norm = normalizer.transform(x_test)

print(f"✓ Features normalized to [{x_t.min():.2f}, {x_t.max():.2f}] using training statistics")
print("  (Validation and test use same normalization)")

## Step 1.3: Train with Different Activations 🏋️

Now we'll train **deep networks** (9 hidden layers) with three activation functions:

1. **Sigmoid**: sigmoid(x) = 1/(1+e^(-x)), derivative ∈ (0, 0.25]
2. **Tanh**: tanh(x), derivative ∈ (0, 1]
3. **ReLU**: max(0, x), derivative = 1 for x > 0

We'll track gradients at each layer to see the vanishing gradient problem in action!
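
The derivative ranges listed above can be checked numerically with autograd. The sketch below is independent of the tutorial package; the helper `max_derivative` and the grid on [-6, 6] are choices made for this example.

```python
import torch

def max_derivative(fn, lo=-6.0, hi=6.0, steps=1201):
    """Largest derivative of an elementwise function on a grid, via autograd."""
    x = torch.linspace(lo, hi, steps, dtype=torch.float64, requires_grad=True)
    (grad,) = torch.autograd.grad(fn(x).sum(), x)
    return grad.max().item()

max_sigmoid = max_derivative(torch.sigmoid)
max_tanh = max_derivative(torch.tanh)
max_relu = max_derivative(torch.relu)

print(f"sigmoid: {max_sigmoid:.4f}")
print(f"tanh:    {max_tanh:.4f}")
print(f"relu:    {max_relu:.4f}")
```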

**This will take a few minutes** (training 30,000 epochs total). ☕

In [None]:
print("\n[STEP 1.3] Training deep networks with different activations...")
print("This demonstrates the vanishing gradient problem!\n")

# Store training histories
gradient_histories = {}

for activation in activations_to_test:
    print(f"\n--- Training with {activation.upper()} activation ---")
    
    # Create model
    model = MultiLayerPerceptron(
        layer_sizes=architecture_deep,
        activation=activation,
        dropout_rate=0.0  # No dropout for gradient analysis
    )
    
    # Train with gradient tracking and per-sample distributions
    trained_model, history = train_model_with_gradient_tracking(
        model=model,
        x_train=x_t,
        y_train=y_t,
        x_valid=x_valid_norm,
        y_valid=y_valid,
        num_epochs=num_epochs,
        learning_rate=learning_rate,
        reg_type='none',  # No regularization
        print_every=500,
        verbose=True,
        track_gradients=True,
        track_per_sample_gradients=True  # For distribution plots
    )
    
    # Store full history (includes per-sample gradients)
    gradient_histories[activation] = history
    
    # Analyze final gradients
    final_grads = history['gradient_norms'][-1]
    layer_names = [name for name in final_grads.keys() if 'weight' in name]
    
    print(f"\nFinal gradient magnitudes ({activation}):")
    for layer_name in layer_names:
        print(f"  {layer_name}: {final_grads[layer_name]:.6e}")
    
    # Check for vanishing: compare the first (input-side) layer to the last layer.
    # Assumes layer_names is ordered from input to output.
    first_layer_grad = final_grads[layer_names[0]]
    last_layer_grad = final_grads[layer_names[-1]]
    if last_layer_grad > 0:
        ratio = first_layer_grad / last_layer_grad
        print(f"\nGradient ratio (first/last layer): {ratio:.6e}")
        if ratio < 0.01:
            print("⚠️  WARNING: VANISHING GRADIENT DETECTED!")
        else:
            print("✓ Gradients flowing reasonably through all layers")

print("\n" + "="*70)
print("Training complete! Now let's visualize the results...")
print("="*70)

## Step 1.4: Visualize Gradient Flow 📈

This plot shows how gradient magnitudes change over training for each layer.

**What to look for:**
- **Sigmoid**: Gradients in early layers → 0 (exponential decay)
- **Tanh**: Moderate gradient decay
- **ReLU**: Stable gradients across all layers!
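
For readers curious how per-layer gradient magnitudes can be collected, here is a minimal sketch using plain PyTorch. It does not reproduce `train_model_with_gradient_tracking`, whose internals are not shown in this notebook; `build_sigmoid_mlp` and `weight_gradient_norms` are hypothetical helpers, and the network is narrower than the tutorial's to keep it quick.

```python
import torch
import torch.nn as nn

def build_sigmoid_mlp(layer_sizes):
    """Plain MLP with sigmoid activations between linear layers."""
    layers = []
    for i in range(len(layer_sizes) - 1):
        layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
        if i < len(layer_sizes) - 2:
            layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

def weight_gradient_norms(model):
    """Map each weight parameter name to the L2 norm of its gradient."""
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if "weight" in name and param.grad is not None
    }

torch.manual_seed(0)
demo_model = build_sigmoid_mlp([1] + [32] * 9 + [1])
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = x ** 2

# One forward/backward pass populates .grad on every parameter.
loss = nn.functional.mse_loss(demo_model(x), y)
loss.backward()

norms = weight_gradient_norms(demo_model)
for name, value in norms.items():
    print(f"{name}: {value:.3e}")
```

Parameters are listed from the input side to the output side, so reading the printout top to bottom shows how the gradient magnitude changes with depth.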

In [None]:
print("\n[STEP 1.4] Visualizing gradient flow...")

example_model = MultiLayerPerceptron(architecture_deep, activation='relu')
plot_gradient_flow(
    gradient_histories, 
    example_model, 
    activations_to_test,
    save_name='vanishing_gradient_demo.png'
)

print("\n💡 Key Observation:")
print("   - Sigmoid: Gradients vanish in early layers (red warning box)")
print("   - Tanh: Moderate gradient flow")
print("   - ReLU: Healthy gradient flow (green box)")

## Step 1.5: Visualize Gradient Distributions 📊

These plots show the **distribution** of signed gradient values across all training samples.

We'll create **3 separate plots** for first, middle, and last epochs.

**What to look for:**
- **Healthy**: Wide distribution symmetric around zero
- **Vanishing**: Distribution collapsed near zero

In [None]:
print("\n[STEP 1.5] Visualizing gradient norm distributions across epochs...")
print("Creating 3 separate plots: first epoch, middle epoch, last epoch\n")

plot_layer_gradient_norms(
    gradient_histories,
    example_model,
    activations_to_test,
    save_name='gradient_distributions'
)

print("\n💡 Key Observations:")
print("   First epoch: All activations show reasonable gradients")
print("   Last epoch: Sigmoid collapsed to ~0, ReLU still healthy!")

## Step 1.6: Visualize Best Model (ReLU) 🏆

Let's see how well the ReLU model fits the data.

In [None]:
print("\n[STEP 1.6] Visualizing ReLU model fit...")

# Train a fresh ReLU model for visualization
relu_model = MultiLayerPerceptron(architecture_deep, activation='relu', dropout_rate=0.0)
relu_model, relu_history = train_model_with_gradient_tracking(
    model=relu_model,
    x_train=x_t,
    y_train=y_t,
    x_valid=x_valid_norm,
    y_valid=y_valid,
    num_epochs=num_epochs,
    learning_rate=learning_rate,
    reg_type='none',
    print_every=num_epochs + 1,  # Silent
    track_gradients=False  # Faster
)

# Visualize results
plot_results(
    x_t, y_t,
    relu_model,
    relu_history,
    coeffs_true,
    data_poly_order,
    model_name="Deep MLP with ReLU",
    normalizer=normalizer,
    show_validation=True
)

## 🎯 Part 1 Summary

**What we learned:**

1. ⚠️ **Sigmoid and Tanh suffer from vanishing gradients** in deep networks
   - Gradients shrink exponentially through layers
   - Early layers barely learn

2. ✅ **ReLU solves the vanishing gradient problem**
   - ReLU'(x) = 1 for x > 0 (no gradient decay)
   - Gradients flow stably through all layers

3. 📊 **Gradient distributions reveal the problem**
   - Sigmoid: Distribution collapses near zero
   - ReLU: Distribution stays healthy

**Key takeaway:** For deep networks, **always use ReLU** (or variants like Leaky ReLU, ELU)!

---

# Part 2: Detecting Overfitting 🔍

## What is overfitting?

Overfitting occurs when a model **memorizes** training data instead of learning the underlying pattern.

**Symptoms:**
- Training loss ↓↓↓ (keeps decreasing)
- Validation loss ↑ (starts increasing!)
- Large gap between train and validation loss

## How do we detect it?

By tracking **both** training and validation loss:
- **Good**: Both decrease together → generalization
- **Bad**: Train ↓, Valid ↑ → overfitting!
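
As a small illustration of this check, the sketch below summarizes two loss curves: it reports the final train-validation gap and whether validation loss has risen since its minimum. The function `summarize_overfitting` is hypothetical and takes plain lists, because the exact layout of the tutorial's history objects is not shown here. The loss values are made up.

```python
def summarize_overfitting(train_losses, valid_losses):
    """Return the final train-validation gap and the epoch with the lowest validation loss."""
    if len(train_losses) != len(valid_losses) or not train_losses:
        raise ValueError("loss histories must be non-empty and equally long")
    best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    final_gap = valid_losses[-1] - train_losses[-1]
    # Validation loss rising after its minimum is the classic overfitting signature.
    rising = valid_losses[-1] > valid_losses[best_epoch]
    return {"best_epoch": best_epoch, "final_gap": final_gap, "valid_rising": rising}

# Made-up loss curves, purely for illustration.
train_demo = [9.0, 6.0, 4.0, 3.0, 2.0]
valid_demo = [9.5, 7.0, 6.5, 7.0, 8.0]

report = summarize_overfitting(train_demo, valid_demo)
print(report)
```

Here training loss falls every epoch while validation loss bottoms out at epoch 2 and then climbs, which is the "Bad" pattern described above.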

Let's see this in action!

## Already Done! ✅

We actually already demonstrated overfitting detection in Part 1:
- Generated separate train/validation/test sets
- Tracked both training and validation loss
- Plots show the gap between train and validation curves

**Look at the previous plots:**
- The "Train vs. Validation Loss" panel shows both curves
- The gap between them indicates overfitting level
- Final gap is reported (e.g., "Gap: +0.7 ⚠️ Overfitting")

With 200 samples and a 9th-order polynomial, we have enough data, so the gap should be small.

Now let's see how **regularization** can reduce overfitting even further!

---

# Part 3: Regularization Techniques 🛡️

## What is regularization?

Regularization prevents overfitting by:
- Constraining model complexity
- Keeping weights small
- Adding controlled randomness

## Three techniques:

### 1. L1 Regularization (Lasso)
Adds penalty: $\lambda_1 \sum_i |w_i|$
- Promotes **sparse** weights (many → 0)
- Feature selection

### 2. L2 Regularization (Ridge/Weight Decay)
Adds penalty: $\lambda_2 \sum_i w_i^2$
- Keeps weights **small**
- Smooth solutions

### 3. Dropout
Randomly drops neurons during training:
- Prevents co-adaptation
- Ensemble-like behavior

Let's compare them!
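
Before running the comparison, here is a minimal sketch of what these three ideas look like in plain PyTorch. The helpers `l1_penalty` and `l2_penalty` are hypothetical, and skipping biases is a choice made for this example; the tutorial's training functions may apply the penalties differently.

```python
import torch
import torch.nn as nn

def l1_penalty(model, lam):
    """lam * sum(|w|) over weight matrices (biases are skipped here by choice)."""
    return lam * sum(p.abs().sum() for n, p in model.named_parameters() if "weight" in n)

def l2_penalty(model, lam):
    """lam * sum(w^2) over weight matrices."""
    return lam * sum(p.pow(2).sum() for n, p in model.named_parameters() if "weight" in n)

layer = nn.Linear(2, 1)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[3.0, -4.0]]))

l1_value = l1_penalty(layer, lam=0.1).item()  # 0.1 * (3 + 4)
l2_value = l2_penalty(layer, lam=0.1).item()  # 0.1 * (9 + 16)

# Dropout is random in training mode and the identity in evaluation mode.
drop = nn.Dropout(p=0.3)
x = torch.ones(1000)
drop.train()
train_out = drop(x)
drop.eval()
eval_out = drop(x)

print(l1_value, l2_value)
print("fraction zeroed in training:", (train_out == 0).float().mean().item())
print("unchanged in eval:", torch.equal(eval_out, x))
```

In training mode, `nn.Dropout` also rescales the surviving activations by 1/(1 - p), so the expected activation stays the same between training and evaluation.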

## Step 3.1: Train WITHOUT Regularization (Baseline) 📊

In [None]:
print("\n" + "="*70)
print(" PART 3: REGULARIZATION TECHNIQUES")
print("="*70)

print("\n[STEP 3.1] Training WITHOUT regularization...")
print("Watch for overfitting: train loss keeps decreasing while validation loss stalls or increases.\n")

# Model without regularization
model_no_reg = MultiLayerPerceptron(
    layer_sizes=architecture_deep,
    activation='relu',
    dropout_rate=0.0
)

# Train
_, history_no_reg = train_model_with_validation_tracking(
    model=model_no_reg,
    x_train=x_t,
    y_train=y_t,
    x_valid=x_valid_norm,
    y_valid=y_valid,
    num_epochs=num_epochs_reg,
    learning_rate=learning_rate,
    reg_type='none',
    print_every=500,
    track_gradients=False  # Speed up training
)

print("\n✓ Baseline model trained!")

## Step 3.2: Train WITH L2 Regularization 🎯

In [None]:
print("\n[STEP 3.2] Training WITH L2 regularization...")
print(f"L2 penalty: {lambda_l2}\n")

# Model with L2
model_l2 = MultiLayerPerceptron(
    layer_sizes=architecture_deep,
    activation='relu',
    dropout_rate=0.0
)

# Train with L2
_, history_l2 = train_model_with_validation_tracking(
    model=model_l2,
    x_train=x_t,
    y_train=y_t,
    x_valid=x_valid_norm,
    y_valid=y_valid,
    num_epochs=num_epochs_reg,
    learning_rate=learning_rate,
    reg_type='l2',
    lambda_l2=lambda_l2,
    print_every=500,
    track_gradients=False
)

print("\n✓ L2 regularized model trained!")

## Step 3.3: Train WITH Dropout 🎲

In [None]:
print("\n[STEP 3.3] Training WITH dropout...")
print(f"Dropout rate: {dropout_rate}\n")

# Model with dropout
model_dropout = MultiLayerPerceptron(
    layer_sizes=architecture_deep,
    activation='relu',
    dropout_rate=dropout_rate
)

# Train
_, history_dropout = train_model_with_validation_tracking(
    model=model_dropout,
    x_train=x_t,
    y_train=y_t,
    x_valid=x_valid_norm,
    y_valid=y_valid,
    num_epochs=num_epochs_reg,
    learning_rate=learning_rate,
    reg_type='none',
    print_every=500,
    track_gradients=False
)

print("\n✓ Dropout model trained!")

## Step 3.4: Compare Regularization Methods 📊

Now let's visualize all three approaches side-by-side!

In [None]:
print("\n[STEP 3.4] Comparing regularization methods...\n")

# Compare all methods
plot_regularization_comparison(
    histories=[history_no_reg, history_l2, history_dropout],
    model_names=['No Regularization', 'L2 Regularization', f'Dropout (p={dropout_rate})'],
    save_name='regularization_comparison.png'
)

print("\n💡 Key Observations:")
print("   - No regularization: Larger train-validation gap")
print("   - L2: Smaller gap, smoother training")
print("   - Dropout: Smaller gap, more stable")
print("\n   Regularization keeps validation loss close to training loss!")

## Step 3.5: Visualize Dropout Model 🎲

In [None]:
print("\n[STEP 3.5] Visualizing dropout model fit...")

plot_results(
    x_t, y_t,
    model_dropout,
    history_dropout,
    coeffs_true,
    data_poly_order,
    model_name=f"Deep MLP with Dropout (p={dropout_rate})",
    normalizer=normalizer,
    show_validation=True
)

## 🎯 Part 3 Summary

**What we learned:**

1. 📊 **Without regularization**: Models can overfit
   - Training loss keeps decreasing
   - Validation loss plateaus or increases
   - Large train-validation gap

2. 🎯 **L2 regularization**: Keeps weights small
   - Adds penalty: $\lambda_2 \sum w_i^2$
   - Reduces overfitting
   - Smoother training curves

3. 🎲 **Dropout**: Random neuron deactivation
   - Prevents co-adaptation
   - Acts like ensemble learning
   - Very effective regularization

**Key takeaway:** Always use regularization for better generalization!

**Recommended approach:**
- Start with L2 regularization (λ ≈ 0.01)
- Add dropout (p ≈ 0.3) for extra robustness
- Monitor train-validation gap to tune hyperparameters
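
In PyTorch, one common way to follow this recipe is the optimizer's `weight_decay` argument together with `nn.Dropout` layers. Whether the tutorial's training functions do it this way is not shown in this notebook, so treat the following as an independent sketch. One detail worth knowing: `torch.optim.SGD` adds `weight_decay * w` to each gradient, so matching an explicit penalty of λ Σ w² requires `weight_decay = 2 * λ`. The code checks this on a single optimizer step.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
lam, lr = 0.01, 0.005

# A small ReLU network with dropout, kept in eval mode so both copies
# see exactly the same (deterministic) forward pass.
model_a = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(8, 1))
model_b = copy.deepcopy(model_a)
model_a.eval()
model_b.eval()

x = torch.linspace(-1.0, 1.0, 16).unsqueeze(1)
y = x ** 2

# Variant A: explicit penalty lam * sum(w^2) over all parameters.
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr)
penalty = lam * sum(p.pow(2).sum() for p in model_a.parameters())
(nn.functional.mse_loss(model_a(x), y) + penalty).backward()
opt_a.step()

# Variant B: the optimizer's built-in weight decay with coefficient 2 * lam.
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr, weight_decay=2 * lam)
nn.functional.mse_loss(model_b(x), y).backward()
opt_b.step()

max_diff = max((pa - pb).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(f"largest parameter difference after one step: {max_diff:.2e}")
```

The factor of 2 matters when comparing λ values across codebases: the same number can mean a penalty twice as strong depending on which convention is used.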

---

# 🎓 Tutorial Complete!

## Congratulations! 🎉

You've mastered the fundamentals of deep neural networks:

### ✅ What you learned:

1. **Vanishing Gradients**
   - Why sigmoid/tanh fail in deep networks
   - How ReLU solves the problem
   - Gradient flow visualization

2. **Overfitting Detection**
   - Train/validation/test splits
   - Monitoring train-validation gap
   - Identifying memorization vs learning

3. **Regularization**
   - L2 regularization (weight decay)
   - Dropout (random deactivation)
   - Preventing overfitting

### 🔬 Experiments to Try:

1. **Change network depth**: Try 5 layers vs 15 layers
2. **Modify data complexity**: Change `data_poly_order` to 5 or 15, and adjust `coeffs_true` to match
3. **Tune regularization**: Try different λ values (0.001, 0.1, 1.0)
4. **Compare learning rates**: Test [0.001, 0.01, 0.1]
5. **Add batch normalization**: Implement between layers
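
As a starting point for the batch normalization experiment, here is one possible layout: `nn.BatchNorm1d` after each hidden linear layer and before the ReLU. This is a standalone sketch; `build_mlp_with_batchnorm` is a hypothetical helper and does not modify the tutorial's `MultiLayerPerceptron`.

```python
import torch
import torch.nn as nn

def build_mlp_with_batchnorm(layer_sizes):
    """ReLU MLP with BatchNorm1d after each hidden linear layer (one common placement)."""
    layers = []
    for i in range(len(layer_sizes) - 2):
        layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
        layers.append(nn.BatchNorm1d(layer_sizes[i + 1]))
        layers.append(nn.ReLU())
    layers.append(nn.Linear(layer_sizes[-2], layer_sizes[-1]))  # output layer, no norm
    return nn.Sequential(*layers)

bn_model = build_mlp_with_batchnorm([1, 16, 16, 1])
batch = torch.randn(32, 1)  # BatchNorm1d needs more than one sample in training mode
output = bn_model(batch)
print(output.shape)
```

Remember to call `model.eval()` before validation so batch normalization uses its running statistics instead of per-batch statistics.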

### 📚 Next Steps:

- **Tutorial 3**: [From DNNs to Transformers](../docs/tutorials/tutorial-3.md)
- Experiment with real datasets (MNIST, CIFAR-10)
- Study advanced architectures (ResNets, Transformers)
- Read the deep learning textbook

### 💡 Key Principles to Remember:

1. **Always use ReLU** (or variants) for deep networks
2. **Always split data** into train/val/test
3. **Always monitor validation loss** to detect overfitting
4. **Always use regularization** (L2 + dropout is a good default)
5. **Always normalize inputs** before training

---

## 🙋 Questions?

- Check the [FAQ](../docs/faq.md)
- See [Troubleshooting](../docs/troubleshooting.md)
- Open an issue on GitHub

Happy learning! 🚀