# 🔥 PyTorch Equivalent: 500 Lines → 50 Lines

## Welcome to the Power of Deep Learning Frameworks!

**In the last notebook**, you built a complete neural network from scratch:
- ✅ ~500 lines of code
- ✅ Manual forward propagation
- ✅ Manual backpropagation (chain rule, gradients, etc.)
- ✅ Manual weight updates
- ✅ Manual batch processing

**In this notebook**, we'll rebuild the SAME network in PyTorch:
- 🚀 ~50 lines of code
- 🚀 Automatic differentiation (no manual gradients!)
- 🚀 Built-in optimizers
- 🚀 GPU acceleration
- 🚀 Production-ready tools

## Why Learn Fundamentals First?

**You now have a HUGE advantage:**
- You understand what PyTorch does behind the scenes
- You can debug models effectively
- You can implement custom architectures
- You can optimize performance

**Frameworks make you productive, fundamentals make you effective!**

## What is PyTorch?

PyTorch is a deep learning framework that provides:
1. **Tensors**: NumPy-like arrays that run on GPUs
2. **Autograd**: Automatic differentiation (computes gradients for you!)
3. **nn.Module**: Building blocks for neural networks
4. **Optimizers**: SGD, Adam, RMSprop, and more
5. **Utilities**: Data loading (`DataLoader`) and saving/loading models (you still write the training loop yourself)
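
To make the first two items concrete, here is a minimal sketch of a tensor and autograd working together. The function `y = x^2 + 2x` and the value `x = 3` are arbitrary choices for illustration, not something from the previous notebook.

```python
import torch

# A scalar tensor that autograd should track.
x = torch.tensor(3.0, requires_grad=True)

# Build a small computation: y = x^2 + 2x.
y = x ** 2 + 2 * x

# Ask autograd for dy/dx; the result is stored in x.grad.
y.backward()

print(y.item())       # value of the function at x = 3
print(x.grad.item())  # dy/dx = 2x + 2, evaluated at x = 3
```

This is the same chain-rule computation you did by hand in the NumPy notebook, only recorded and replayed automatically.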

Let's see it in action!

## Installation

If you haven't installed PyTorch yet:

```bash
# Default install (which build you get, CPU-only or CUDA-enabled, depends on your platform)
pip install torch torchvision

# For a specific CPU-only or CUDA build,
# use the install selector at pytorch.org
```

In [None]:
# Import libraries
import torch  # Core PyTorch library
import torch.nn as nn  # Neural network modules
import torch.optim as optim  # Optimizers (SGD, Adam, etc.)
from torch.utils.data import DataLoader, TensorDataset  # Data utilities
import numpy as np  # For data preprocessing
import matplotlib.pyplot as plt  # For visualizations
from sklearn.datasets import fetch_openml  # To download MNIST
from sklearn.model_selection import train_test_split  # For data splitting
from sklearn.metrics import confusion_matrix  # For evaluation
import seaborn as sns  # For beautiful plots
from tqdm import tqdm  # For progress bars
import time  # To measure training time

# Set random seeds for reproducibility
torch.manual_seed(42)  # PyTorch random seed
np.random.seed(42)  # NumPy random seed

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"📦 PyTorch version: {torch.__version__}")
print(f"🖥️  Using device: {device}")
if device.type == 'cuda':
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## Step 1: Load and Preprocess Data

We use the same MNIST dataset, but convert it to PyTorch tensors.

In [None]:
# Load MNIST dataset
print("📥 Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, parser='auto')

# Extract and normalize data
X = mnist.data.to_numpy() / 255.0  # Normalize to [0, 1]
y = mnist.target.to_numpy().astype(int)  # Convert labels to integers

print(f"✅ Dataset loaded: {X.shape[0]:,} images")

# Split into train/val/test (same as before)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=10000, random_state=42, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=10000, random_state=42, stratify=y_temp
)

print(f"\n✅ Data split:")
print(f"   Training:   {X_train.shape[0]:,}")
print(f"   Validation: {X_val.shape[0]:,}")
print(f"   Test:       {X_test.shape[0]:,}")

In [None]:
# Convert NumPy arrays to PyTorch tensors
# PyTorch uses tensors instead of NumPy arrays
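# torch.tensor with an explicit dtype works regardless of the NumPy dtype (float64, int32, int64, ...)
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)  # Inputs as 32-bit floats
y_train_tensor = torch.tensor(y_train, dtype=torch.long)     # Labels as 64-bit integers

X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.long)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)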

print("✅ Converted to PyTorch tensors")
print(f"   X_train shape: {X_train_tensor.shape}")
print(f"   y_train shape: {y_train_tensor.shape}")
print(f"   Data type: {X_train_tensor.dtype} (inputs), {y_train_tensor.dtype} (labels)")

In [None]:
# Create DataLoaders for efficient batch processing
# DataLoader handles batching and shuffling (and parallel loading when num_workers > 0)
batch_size = 32  # Same as before

# Create datasets (combines inputs and labels)
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create data loaders (handles batching and shuffling)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # Shuffle training data
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)     # Don't shuffle validation
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)   # Don't shuffle test

print(f"\n✅ DataLoaders created")
print(f"   Batch size: {batch_size}")
print(f"   Training batches: {len(train_loader)}")
print(f"   Validation batches: {len(val_loader)}")
print(f"   Test batches: {len(test_loader)}")
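
To see how a `DataLoader` splits a dataset into batches, here is a small sketch on made-up data. The names `features`, `targets`, and `toy_loader` are hypothetical and exist only for this illustration.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 4 features each.
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
targets = torch.arange(10)

toy_loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=False)

# Collect the number of samples in each batch.
batch_sizes = [inputs.shape[0] for inputs, _ in toy_loader]

print(batch_sizes)            # the last batch holds the remainder
print(len(toy_loader))        # number of batches
print(math.ceil(10 / 4))      # same number, computed by hand
```

The last batch is smaller when the dataset size is not a multiple of `batch_size`. This is why the training loop weights each batch loss by `inputs.size(0)` before averaging.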

## Step 2: Define Neural Network in PyTorch

### Compare: NumPy vs PyTorch

**NumPy version (from notebook 9):**
```python
# Initialize weights manually
self.W1 = np.random.randn(784, 128) * np.sqrt(2.0/784)
self.b1 = np.zeros((1, 128))
# ... repeat for each layer

# Forward pass (manual)
Z1 = np.dot(X, W1) + b1
A1 = relu(Z1)
# ... repeat for each layer

# Backward pass (manual chain rule)
dZ3 = A3 - y_true
dW3 = np.dot(A2.T, dZ3) / m
# ... lots more gradient calculations
```

**PyTorch version:**
```python
# That's it! PyTorch handles initialization, forward, and backward!
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)
```

Let's build it:

In [None]:
# Define the neural network using nn.Module
class MNISTNet(nn.Module):
    """
    Neural network for MNIST classification.
    
    Same architecture as our NumPy version:
    784 → 128 → 64 → 10
    """
    def __init__(self):
        super(MNISTNet, self).__init__()  # Initialize parent class
        
        # Define layers
        # nn.Linear automatically initializes weights and handles matrix multiplication!
        self.fc1 = nn.Linear(784, 128)  # First fully connected layer (784 → 128)
        self.relu1 = nn.ReLU()           # ReLU activation
        
        self.fc2 = nn.Linear(128, 64)   # Second fully connected layer (128 → 64)
        self.relu2 = nn.ReLU()           # ReLU activation
        
        self.fc3 = nn.Linear(64, 10)    # Output layer (64 → 10)
        # Note: No softmax here! CrossEntropyLoss includes it
    
    def forward(self, x):
        """
        Forward pass through the network.
        
        PyTorch automatically tracks operations for backprop!
        """
        x = self.fc1(x)      # Linear transformation
        x = self.relu1(x)    # ReLU activation
        
        x = self.fc2(x)      # Linear transformation
        x = self.relu2(x)    # ReLU activation
        
        x = self.fc3(x)      # Output layer
        return x             # Return logits (no softmax needed)

# Create model and move to device (CPU or GPU)
model = MNISTNet().to(device)  # .to(device) moves model to GPU if available

print("🧠 Neural Network created!")
print(f"   Architecture: 784 → 128 → 64 → 10")
print(f"   Device: {device}")
print(f"\nModel structure:")
print(model)

In [None]:
# Count parameters
total_params = sum(p.numel() for p in model.parameters())  # numel() = number of elements
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n📊 Model Statistics:")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"\nParameter breakdown:")
for name, param in model.named_parameters():
    print(f"   {name:20s}: {param.shape}  ({param.numel():,} params)")
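
The parameter count can be checked by hand: a `nn.Linear(n_in, n_out)` layer stores `n_in * n_out` weights plus `n_out` biases. The sketch below rebuilds the same 784 → 128 → 64 → 10 stack with `nn.Sequential` (the name `reference` is just for this example) and compares the two counts.

```python
import torch.nn as nn

layer_sizes = [784, 128, 64, 10]

# Each Linear layer has n_in * n_out weights plus n_out biases.
expected = sum(
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
)

# Same architecture as MNISTNet, built only for counting.
reference = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
counted = sum(p.numel() for p in reference.parameters())

print(expected, counted)  # (784*128 + 128) + (128*64 + 64) + (64*10 + 10)
```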

## Step 3: Define Loss Function and Optimizer

### More PyTorch Magic!

**NumPy version:**
```python
# Manual cross-entropy loss
loss = -np.sum(y_true * np.log(y_pred + 1e-8)) / m

# Manual gradient descent
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
# ... repeat for all parameters
```

**PyTorch version:**
```python
criterion = nn.CrossEntropyLoss()  # Loss function
optimizer = optim.Adam(model.parameters())  # Optimizer (often converges faster than plain SGD)
```

In [None]:
# Define loss function
# CrossEntropyLoss combines log-softmax + negative log likelihood
# This is more numerically stable than doing them separately!
criterion = nn.CrossEntropyLoss()

# Define optimizer
# Adam is an adaptive optimizer (often converges faster than plain SGD)
# It adapts the step size for each parameter!
learning_rate = 0.001  # Adam typically uses a smaller learning rate than SGD
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

print("✅ Loss function and optimizer ready")
print(f"   Loss: CrossEntropyLoss (includes softmax)")
print(f"   Optimizer: Adam")
print(f"   Learning rate: {learning_rate}")
print(f"\n💡 Adam vs SGD:")
print("   - Adapts learning rate per parameter")
print("   - Uses momentum (remembers past gradients)")
print("   - Usually converges faster than plain SGD")
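
As a quick check of what `CrossEntropyLoss` computes, the sketch below compares it with an explicit log-softmax followed by negative log likelihood. The random `logits` and the `labels` are made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)               # 5 samples, 10 classes (raw scores)
labels = torch.tensor([3, 1, 4, 1, 5])    # true class index per sample

# One call: takes raw logits, not probabilities.
combined = nn.CrossEntropyLoss()(logits, labels)

# Two explicit steps: log-softmax, then negative log likelihood.
log_probs = F.log_softmax(logits, dim=1)
two_step = F.nll_loss(log_probs, labels)

# Fully manual: pick the log-probability of the true class and average.
manual = -log_probs[torch.arange(5), labels].mean()

print(torch.allclose(combined, two_step), torch.allclose(combined, manual))
```

This is why the model returns logits: applying softmax yourself before `CrossEntropyLoss` would apply it twice.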

## Step 4: Training Loop

### The PyTorch Training Pattern

Most PyTorch training loops follow this basic pattern:

```python
for epoch in range(num_epochs):
    for inputs, labels in data_loader:
        # 1. Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # 2. Backward pass
        optimizer.zero_grad()  # Clear old gradients
        loss.backward()        # Compute gradients (autograd!)
        optimizer.step()       # Update weights
```

That's it! PyTorch handles all the calculus!
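
One detail of the pattern deserves a closer look: `zero_grad()` is needed because `backward()` adds to the stored gradients instead of overwriting them. A minimal sketch, using a made-up two-element tensor `w` and the loss `sum(w^2)`:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: gradient of sum(w^2) is 2w.
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT clearing: the new gradient is added on top.
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Clear, then run backward again: back to the plain gradient.
w.grad.zero_()
(w ** 2).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```

`optimizer.zero_grad()` does this clearing for every parameter the optimizer manages.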

In [None]:
# Training configuration
num_epochs = 20  # Same as NumPy version

# Lists to track metrics
train_losses = []      # Training loss per epoch
val_losses = []        # Validation loss per epoch
train_accuracies = []  # Training accuracy per epoch
val_accuracies = []    # Validation accuracy per epoch

print("🎯 Starting training...")
print(f"   Epochs: {num_epochs}")
print(f"   Batch size: {batch_size}")
print(f"   Optimizer: Adam (lr={learning_rate})")
print("\n" + "="*70)

# Record start time
start_time = time.time()

# Training loop
for epoch in range(num_epochs):
    # Training phase
    model.train()  # Set model to training mode (dropout active, batch norm uses batch statistics)
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    
    # Iterate over batches
    for inputs, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}", leave=False):
        # Move data to device (GPU if available)
        inputs, labels = inputs.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(inputs)  # Get predictions
        loss = criterion(outputs, labels)  # Compute loss
        
        # Backward pass and optimization
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients (autograd magic!)
        optimizer.step()       # Update weights
        
        # Track metrics
        train_loss += loss.item() * inputs.size(0)  # Accumulate loss
        predicted = outputs.argmax(dim=1)           # Get predicted class
        train_total += labels.size(0)               # Count samples
        train_correct += (predicted == labels).sum().item()  # Count correct predictions
    
    # Calculate average training metrics
    train_loss = train_loss / train_total
    train_acc = 100.0 * train_correct / train_total
    
    # Validation phase
    model.eval()  # Set model to evaluation mode (dropout off, batch norm uses running statistics)
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    
    # No gradient computation during validation (saves memory and computation)
    with torch.no_grad():
        for inputs, labels in val_loader:
            # Move data to device
            inputs, labels = inputs.to(device), labels.to(device)
            
            # Forward pass only
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            
            # Track metrics
            val_loss += loss.item() * inputs.size(0)
            predicted = outputs.argmax(dim=1)
            val_total += labels.size(0)
            val_correct += (predicted == labels).sum().item()
    
    # Calculate average validation metrics
    val_loss = val_loss / val_total
    val_acc = 100.0 * val_correct / val_total
    
    # Store metrics
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accuracies.append(train_acc)
    val_accuracies.append(val_acc)
    
    # Print progress
    print(f"Epoch {epoch+1}/{num_epochs}:")
    print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%")
    print(f"  Val Loss:   {val_loss:.4f} | Val Acc:   {val_acc:.2f}%")
    print("-" * 70)

# Record end time
training_time = time.time() - start_time

print("\n" + "="*70)
print(f"✅ Training complete in {training_time:.2f} seconds!")
print(f"   Final training accuracy: {train_accuracies[-1]:.2f}%")
print(f"   Final validation accuracy: {val_accuracies[-1]:.2f}%")

## Step 5: Visualize Training Progress

Same visualization as before. Let's compare with the NumPy version!

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Loss curves
ax1.plot(train_losses, label='Training Loss', linewidth=2, marker='o')
ax1.plot(val_losses, label='Validation Loss', linewidth=2, marker='s')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Loss Over Time (PyTorch)', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Accuracy curves
ax2.plot(train_accuracies, label='Training Accuracy', linewidth=2, marker='o')
ax2.plot(val_accuracies, label='Validation Accuracy', linewidth=2, marker='s')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Over Time (PyTorch)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📈 PyTorch training curves")
print("   Notice: Potentially faster convergence due to Adam optimizer!")

## Step 6: Evaluate on Test Set

In [None]:
# Evaluate on test set
model.eval()  # Set to evaluation mode
test_correct = 0
test_total = 0
all_predictions = []  # Store all predictions
all_labels = []       # Store all true labels

with torch.no_grad():  # No gradient computation
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(inputs)
        predicted = outputs.argmax(dim=1)
        
        # Track metrics
        test_total += labels.size(0)
        test_correct += (predicted == labels).sum().item()
        
        # Store for confusion matrix
        all_predictions.extend(predicted.cpu().numpy())  # Move to CPU and convert to numpy
        all_labels.extend(labels.cpu().numpy())

# Calculate accuracy
test_accuracy = 100.0 * test_correct / test_total

print("="*70)
print("🎯 TEST SET PERFORMANCE (PyTorch)")
print("="*70)
print(f"\n   Accuracy: {test_accuracy:.2f}%")
print(f"   Correct predictions: {test_correct:,} / {test_total:,}")
print(f"   Wrong predictions: {test_total - test_correct:,}")

if test_accuracy > 95:
    print("\n   🎉 EXCELLENT! Over 95% accuracy!")
elif test_accuracy > 90:
    print("\n   👍 GOOD! Over 90% accuracy!")
else:
    print("\n   📚 Room for improvement. Try training longer!")

## Step 7: Confusion Matrix

In [None]:
# Compute confusion matrix
cm = confusion_matrix(all_labels, all_predictions)

# Plot confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=list(range(10)), yticklabels=list(range(10)),
            cbar_kws={'label': 'Number of predictions'})
plt.xlabel('Predicted Label', fontsize=13)
plt.ylabel('True Label', fontsize=13)
plt.title('Confusion Matrix - MNIST Test Set (PyTorch)', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Compare this with the NumPy version!")
print("   Similar patterns? That's because it's the same network architecture!")
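
A confusion matrix also gives per-class numbers. The sketch below uses a small made-up 3-class matrix (`toy_cm`, rows = true labels, columns = predictions, matching the `confusion_matrix` convention) rather than the real MNIST result.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels, columns are predictions.
toy_cm = np.array([
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
])

# Per-class recall: correct predictions divided by the number of true samples in that class.
per_class_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)

# Overall accuracy: all correct predictions divided by all samples.
overall_accuracy = toy_cm.diagonal().sum() / toy_cm.sum()

# Most frequent mistake: largest off-diagonal entry.
off_diag = toy_cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(per_class_recall, overall_accuracy)
print(f"Most common confusion: true {true_cls} predicted as {pred_cls}")
```

The same three lines work on the 10×10 `cm` computed above.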

## Step 8: Side-by-Side Comparison

### NumPy vs PyTorch: What Changed?

| Aspect | NumPy Version | PyTorch Version |
|--------|---------------|------------------|
| **Lines of code** | ~500 lines | ~50 lines |
| **Weight initialization** | Manual (He init) | Automatic |
| **Forward pass** | Manual matrix ops | `model(inputs)` |
| **Loss computation** | Manual cross-entropy | `criterion(outputs, labels)` |
| **Backward pass** | Manual gradients (chain rule) | `loss.backward()` (autograd!) |
| **Weight updates** | Manual gradient descent | `optimizer.step()` |
| **Batch processing** | Manual batch creation | `DataLoader` |
| **GPU support** | ❌ Not available | ✅ `.to(device)` |
| **Advanced optimizers** | ❌ Only SGD | ✅ Adam, RMSprop, etc. |
| **Code clarity** | Lots of implementation details | Focus on architecture |

### What PyTorch Automated:

1. **Automatic Differentiation (Autograd)**
   - No more manual chain rule!
   - No more gradient calculations!
   - Just call `loss.backward()`

2. **Built-in Layers**
   - `nn.Linear` handles weights + biases + matrix multiplication
   - `nn.ReLU`, `nn.Sigmoid`, etc. for activations
   - Automatic weight initialization

3. **Optimizers**
   - Adam, SGD, RMSprop, AdaGrad, etc.
   - Built-in momentum, learning rate schedules
   - Just call `optimizer.step()`

4. **Data Loading**
   - `DataLoader` for efficient batching
   - Automatic shuffling
   - Optional parallel data loading (`num_workers > 0`)

5. **GPU Acceleration**
   - Move model and data to GPU with `.to(device)`
   - Can be much faster for large models and batches (a small MLP like this one may see little speedup)
   - Same code works on CPU or GPU
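
The autograd item in the list above can be verified numerically. For softmax plus cross-entropy averaged over `m` samples, the manual gradient with respect to the logits is `(softmax(z) - one_hot(y)) / m`, which is the `dZ3 = A3 - y_true` step from the NumPy notebook with the `1/m` factor folded in. The sketch below compares that formula with what `loss.backward()` produces, on made-up logits and labels.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, num_classes = 4, 3
logits = torch.randn(m, num_classes, requires_grad=True)
labels = torch.tensor([0, 2, 1, 2])

# Autograd: mean cross-entropy over the batch, then backward.
loss = F.cross_entropy(logits, labels)
loss.backward()

# Manual chain rule result for the same loss.
with torch.no_grad():
    one_hot = F.one_hot(labels, num_classes).float()
    manual_grad = (F.softmax(logits, dim=1) - one_hot) / m

print(torch.allclose(logits.grad, manual_grad, atol=1e-6))
```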

## Step 9: Advanced PyTorch Features

Let's explore some powerful PyTorch capabilities!

### 9.1 Different Optimizers

PyTorch provides many optimization algorithms:

In [None]:
# Different optimizer options
print("🔧 Available Optimizers in PyTorch:\n")

print("1. SGD (Stochastic Gradient Descent)")
print("   optimizer = optim.SGD(model.parameters(), lr=0.01)")
print("   - Basic gradient descent")
print("   - Can add momentum: momentum=0.9")
print("")

print("2. Adam (Adaptive Moment Estimation)")
print("   optimizer = optim.Adam(model.parameters(), lr=0.001)")
print("   - Adapts learning rate per parameter")
print("   - Usually converges faster")
print("   - Good default choice!")
print("")

print("3. RMSprop (Root Mean Square Propagation)")
print("   optimizer = optim.RMSprop(model.parameters(), lr=0.001)")
print("   - Good for recurrent neural networks")
print("   - Adapts learning rate")
print("")

print("4. AdaGrad (Adaptive Gradient)")
print("   optimizer = optim.Adagrad(model.parameters(), lr=0.01)")
print("   - Good for sparse data")
print("   - Adapts learning rate per feature")
print("")

print("💡 Tip: Adam is usually a good starting point!")
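
To connect the optimizers back to the manual update `W -= learning_rate * dW` from the NumPy notebook, here is a sketch showing that one step of plain `optim.SGD` (no momentum, no weight decay) produces the same result. The tensor `w` and the loss `sum(w^2)` are made up for the example.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = optim.SGD([w], lr=lr)

# Compute gradients for a simple loss.
loss = (w ** 2).sum()
optimizer.zero_grad()
loss.backward()

# What the manual NumPy-style update would give.
expected = (w - lr * w.grad).detach().clone()

# What the optimizer actually does.
optimizer.step()

print(w.detach(), expected)
```

Adam, RMSprop, and AdaGrad replace that single subtraction with rules that also track running statistics of past gradients.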

### 9.2 Learning Rate Schedulers

Automatically adjust learning rate during training:

In [None]:
# Example: Learning rate scheduler
print("📉 Learning Rate Schedulers:\n")

print("1. StepLR - Reduce LR every N epochs")
print("   scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)")
print("   - Reduces LR by 10x every 10 epochs")
print("")

print("2. ReduceLROnPlateau - Reduce when loss plateaus")
print("   scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')")
print("   - Automatically reduces LR when validation loss stops improving")
print("")

print("3. CosineAnnealingLR - Cosine annealing")
print("   scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)")
print("   - Gradually reduces LR following a cosine curve")
print("")

print("Usage in training loop:")
print("   for epoch in range(num_epochs):")
print("       train(...)")
print("       validate(...)")
print("       scheduler.step()  # Update learning rate (for ReduceLROnPlateau: scheduler.step(val_loss))")
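
Here is a minimal runnable sketch of `StepLR`, recording the learning rate at each epoch. The dummy parameter, the starting rate of 0.1, and `step_size=2` are arbitrary values chosen to keep the output short.

```python
import torch
import torch.optim as optim

# A dummy parameter so the optimizer has something to manage.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.SGD([param], lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used in this epoch
    optimizer.step()    # training for the epoch would happen here
    scheduler.step()    # then the scheduler advances by one epoch

print(lrs)  # drops by a factor of 10 every 2 epochs
```

Note the order: `optimizer.step()` comes before `scheduler.step()`.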

### 9.3 Regularization Techniques

In [None]:
# Example network with regularization
class RegularizedNet(nn.Module):
    """Network with dropout for regularization"""
    def __init__(self):
        super(RegularizedNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(0.2)  # Zero each activation with probability 0.2 during training
        
        self.fc2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(0.2)
        
        self.fc3 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.dropout1(x)  # Apply dropout
        
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.dropout2(x)  # Apply dropout
        
        x = self.fc3(x)
        return x

print("✅ Regularization Techniques:\n")
print("1. Dropout")
print("   - Randomly drops neurons during training")
print("   - Prevents overfitting")
print("   - nn.Dropout(p=0.2) drops 20% of neurons")
print("")

print("2. L2 Regularization (Weight Decay)")
print("   - Add to optimizer: optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)")
print("   - Penalizes large weights")
print("")

print("3. Batch Normalization")
print("   - nn.BatchNorm1d(128) normalizes activations")
print("   - Stabilizes training")
print("   - Can speed up convergence")
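
Dropout is the reason `model.train()` and `model.eval()` matter. A small sketch on a vector of ones, with `p=0.5` chosen only to make the scaling easy to see:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) to keep the expected value unchanged.
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is a no-op.
dropout.eval()
eval_out = dropout(x)

print(train_out.unique())         # only zeros and scaled survivors
print(torch.equal(eval_out, x))   # input passes through unchanged
```

Forgetting `model.eval()` before validation would leave dropout active and make the reported metrics noisier and lower than they should be.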

### 9.4 Saving and Loading Models

In [None]:
# Save model
print("💾 Saving and Loading Models:\n")

# Method 1: Save entire model
torch.save(model, 'mnist_model.pth')
print("✅ Saved entire model to mnist_model.pth")

# Method 2: Save only state dict (recommended)
torch.save(model.state_dict(), 'mnist_model_state.pth')
print("✅ Saved model state dict to mnist_model_state.pth")

print("\nLoading models:")
print("")
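print("# Method 1: Load entire model (unpickles arbitrary code, so only load files you trust)")
print("model = torch.load('mnist_model.pth', weights_only=False)  # recent PyTorch versions default to weights_only=True")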
print("")
print("# Method 2: Load state dict (recommended)")
print("model = MNISTNet()")
print("model.load_state_dict(torch.load('mnist_model_state.pth'))")
print("model.eval()  # Set to evaluation mode")

print("\n💡 Best practice: Save state_dict (more flexible)")
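
The state-dict round trip can be tested end to end. This sketch uses a tiny made-up model (`make_model` is a hypothetical helper) and an in-memory buffer instead of a file, so it leaves nothing on disk.

```python
import io
import torch
import torch.nn as nn

def make_model():
    """Hypothetical small model; the architecture must match when loading."""
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

torch.manual_seed(0)
original = make_model()

# Save only the parameters (the state dict) into an in-memory buffer.
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

# Recreate the architecture, then load the saved parameters into it.
restored = make_model()   # starts with different random weights
restored.load_state_dict(torch.load(buffer))

original.eval()
restored.eval()

x = torch.randn(5, 4)
with torch.no_grad():
    same = torch.equal(original(x), restored(x))

print(same)
```

Replacing `buffer` with a file path gives the on-disk version shown above.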

## Step 10: Performance Comparison

Let's compare our NumPy and PyTorch implementations:

In [None]:
# Create comparison summary
print("="*70)
print("📊 NUMPY vs PYTORCH COMPARISON")
print("="*70)

print("\n1. IMPLEMENTATION COMPLEXITY")
print("   NumPy:   ~500 lines of code")
print("   PyTorch: ~50 lines of core code (10x less!)")
print("   Winner: 🏆 PyTorch")

print("\n2. DEVELOPMENT TIME")
print("   NumPy:   Hours to implement and debug")
print("   PyTorch: Minutes to implement")
print("   Winner: 🏆 PyTorch")

print("\n3. ACCURACY")
print("   NumPy:   ~95-97% (depends on hyperparameters)")
print(f"   PyTorch: {test_accuracy:.2f}%")
print("   Winner: 🤝 TIE (same architecture)")

print("\n4. TRAINING SPEED")
if device.type == 'cuda':
    print("   NumPy:   CPU only")
    print("   PyTorch: GPU accelerated (speedup grows with model and batch size)")
    print("   Winner: 🏆 PyTorch")
else:
    print("   NumPy:   CPU optimized")
    print("   PyTorch: CPU (similar speed, but can use GPU!)")
    print("   Winner: 🏆 PyTorch (GPU potential)")

print("\n5. FLEXIBILITY")
print("   NumPy:   Full control, but tedious")
print("   PyTorch: Easy to modify, experiment, and scale")
print("   Winner: 🏆 PyTorch")

print("\n6. LEARNING VALUE")
print("   NumPy:   Deep understanding of fundamentals")
print("   PyTorch: Industry-standard tools and practices")
print("   Winner: 🏆 BOTH! (NumPy first, then PyTorch)")

print("\n" + "="*70)
print("🎯 OVERALL: PyTorch wins for productivity,")
print("           NumPy wins for understanding!")
print("="*70)

## Step 11: Quick Experiment - Different Architectures

With PyTorch, it's easy to experiment! Let's try a deeper network:

In [None]:
# Define a deeper network
class DeeperNet(nn.Module):
    """Deeper network: 784 → 256 → 128 → 64 → 10"""
    def __init__(self):
        super(DeeperNet, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x

print("🧠 Deeper Network Created!")
print("   Architecture: 784 → 256 → 128 → 64 → 10")
print("\n💡 With PyTorch, experimenting with architectures is EASY!")
print("   Try different:")
print("   - Layer sizes")
print("   - Number of layers")
print("   - Activation functions")
print("   - Regularization techniques")
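
One way to make such experiments even quicker is to build the network from a list of layer sizes. `make_mlp` below is a hypothetical helper written for this sketch, not a PyTorch function.

```python
import torch
import torch.nn as nn

def make_mlp(layer_sizes):
    """Build Linear layers with ReLU between them (no activation after the last layer)."""
    layers = []
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    for i, (n_in, n_out) in enumerate(pairs):
        layers.append(nn.Linear(n_in, n_out))
        if i < len(pairs) - 1:   # skip ReLU after the output layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Same shape as DeeperNet: 784 → 256 → 128 → 64 → 10
deeper = make_mlp([784, 256, 128, 64, 10])

# Sanity check with a dummy batch of 2 flattened images.
dummy_batch = torch.zeros(2, 784)
print(deeper(dummy_batch).shape)
print(deeper)
```

Changing the architecture is then a one-line edit, for example `make_mlp([784, 512, 10])`.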

## 🎓 Best Practices for PyTorch

### 1. Code Organization

```python
# Good structure:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define layers here
    
    def forward(self, x):
        # Define forward pass
        return x

def train_epoch(model, loader, criterion, optimizer):
    # Training logic
    pass

def validate(model, loader, criterion):
    # Validation logic
    pass
```
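
The two stub functions above could be filled in roughly as follows. This is a sketch under assumptions: the `device` argument, the return values, and the toy linear problem used to exercise the functions are all choices made for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_epoch(model, loader, criterion, optimizer, device="cpu"):
    """Run one pass over the data and return the average training loss."""
    model.train()
    total_loss, total = 0.0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * inputs.size(0)
        total += inputs.size(0)
    return total_loss / total

def validate(model, loader, criterion, device="cpu"):
    """Return (average loss, accuracy as a fraction) without updating weights."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            total_loss += criterion(outputs, labels).item() * inputs.size(0)
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += inputs.size(0)
    return total_loss / total, correct / total

# Toy problem (made up): classify points by the sign of their first feature.
torch.manual_seed(0)
X = torch.randn(256, 4)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

loss_before, _ = validate(model, loader, criterion)
for _ in range(5):
    train_epoch(model, loader, criterion, optimizer)
loss_after, acc_after = validate(model, loader, criterion)

print(f"loss {loss_before:.3f} -> {loss_after:.3f}, accuracy {acc_after:.2%}")
```

Keeping training and validation in separate functions makes it harder to forget `model.eval()` or `torch.no_grad()`, and the same functions can be reused for any model and loader.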

### 2. Reproducibility

```python
# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

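# For GPU reproducibility
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)  # Seed every visible GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # Autotuning can pick different kernels per run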

### 3. Memory Management

```python
# Clear gradients
optimizer.zero_grad()

# Use torch.no_grad() for inference
with torch.no_grad():
    outputs = model(inputs)

# Move tensors to CPU when done
predictions = outputs.cpu().numpy()
```

### 4. Debugging

```python
# Check shapes
print(f"Input shape: {x.shape}")

# Check for NaN
assert not torch.isnan(loss), "Loss is NaN!"

# Inspect gradient norms (run after loss.backward())
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.norm()}")
```

## 🚀 What's Next?

You've completed the Neural Networks Fundamentals series! Here's where to go next:

### 1. Convolutional Neural Networks (CNNs)
📂 Path: `00-neural-networks/cnn/`

**What you'll learn:**
- Convolution layers for image processing
- Pooling layers for dimensionality reduction
- CNN architectures (LeNet, AlexNet, VGG, ResNet)
- Transfer learning with pre-trained models

**Why it matters:**
- State-of-the-art for computer vision
- Object detection, image classification, segmentation
- Used in self-driving cars, medical imaging, etc.

### 2. Recurrent Neural Networks (RNNs)
📂 Path: `00-neural-networks/rnn/`

**What you'll learn:**
- RNNs for sequential data
- LSTMs and GRUs
- Sequence-to-sequence models
- Attention mechanisms

**Why it matters:**
- Natural language processing
- Time series prediction
- Speech recognition, translation

### 3. Transformers
📂 Path: `01-transformers/`

**What you'll learn:**
- Self-attention mechanism
- Transformer architecture
- BERT, GPT, and other transformer models
- Fine-tuning pre-trained models

**Why it matters:**
- Powers ChatGPT, BERT, GPT-4
- State-of-the-art for NLP
- Also used for vision (Vision Transformers)

### 4. Fine-Tuning
📂 Path: `02-fine-tuning/`

**What you'll learn:**
- Transfer learning principles
- Fine-tuning pre-trained models
- LoRA and QLoRA for efficient fine-tuning
- Domain adaptation

**Why it matters:**
- Train powerful models with limited data
- Adapt models to your specific task
- Cost-effective and time-efficient

### 5. RAG (Retrieval-Augmented Generation)
📂 Path: `03-rag/`

**What you'll learn:**
- Combining retrieval and generation
- Vector databases
- Semantic search
- Building RAG systems

**Why it matters:**
- Ground LLMs in factual data
- Build chatbots with up-to-date knowledge
- Reduce hallucinations

## 🎉 Congratulations!

## You've Completed the Neural Networks Fundamentals!

### What You've Accomplished:

**Notebooks 1-3: Building Blocks**
- ✅ Understanding neural networks conceptually
- ✅ Single neurons and weights
- ✅ Activation functions (Sigmoid, ReLU, Softmax)

**Notebooks 4-6: Forward Flow**
- ✅ Multi-layer architectures
- ✅ Forward propagation
- ✅ Loss functions

**Notebooks 7-8: Learning**
- ✅ Backpropagation and chain rule
- ✅ Training loops and optimization

**Notebooks 9-10: Real World**
- ✅ Complete MNIST implementation from scratch
- ✅ PyTorch equivalent (10x less code!)

### Your Superpowers:

1. **Deep Understanding**: You know what happens inside neural networks
2. **PyTorch Mastery**: You can build and train models efficiently
3. **Debugging Skills**: You can troubleshoot issues because you understand the math
4. **Foundation**: You're ready for advanced topics (CNNs, RNNs, Transformers)

### Remember:

- **Theory → Practice → Mastery**
- You started from scratch and built real neural networks
- You understand both the "what" and the "why"
- PyTorch makes you productive, fundamentals make you effective

### Keep Learning!

The journey doesn't end here. Neural networks are just the beginning:
- Explore CNNs for computer vision
- Learn RNNs for sequential data
- Master Transformers (the future of AI)
- Build real-world applications

**You're now equipped with the knowledge to understand and build modern AI systems!**

## 🌟 Happy Learning! 🌟