# Multi-Layer Perceptron (2 Hidden Layers) - Deep Learning!

## From Shallow to Deep

In the previous notebook, we had:
- Input → **1 Hidden Layer** → Output

Now we'll build:
- Input → **Hidden Layer 1** → **Hidden Layer 2** → Output

This is **deep learning** - multiple layers of transformations!

### Why Go Deeper?

1. **Hierarchical Feature Learning**
   - Layer 1: Learn simple features (edges, basic patterns)
   - Layer 2: Combine simple features into complex ones (shapes, parts)
   - Output: Combine complex features for final decision

2. **More Expressive**
   - Can approximate more complex functions
   - Better at modeling intricate patterns

3. **Parameter Efficiency**
   - Two smaller layers can sometimes match or beat one very wide layer
   - When that happens, depth gives more representational power per parameter
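
To make the parameter comparison concrete, here is a minimal sketch that counts weights and biases in a fully connected network. The helper name `count_params` and the two example architectures are illustrative assumptions, not part of this notebook's code. A smaller count only tells you the model size, not how well it will learn.

```python
def count_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes is [n_input, n_hidden1, ..., n_output]. Each pair of
    adjacent layers contributes a weight matrix plus one bias per output unit.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

wide = [2, 64, 1]        # one wide hidden layer (hypothetical sizes)
deep = [2, 12, 12, 1]    # two narrower hidden layers (hypothetical sizes)

print(f"wide {wide}: {count_params(wide)} parameters")   # 257
print(f"deep {deep}: {count_params(deep)} parameters")   # 205
```

Here the two-layer network has fewer parameters than the wide one, yet it applies two nonlinear transformations instead of one.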

## Architecture

```
Input (X)  →  Layer 1  →  Layer 2  →  Output  →  ŷ
 (n_input)    (n_h1)      (n_h2)     (n_out)
```

### Forward Pass:

1. **Layer 1**: $z_1 = XW_1 + b_1$,     $a_1 = \text{activation}(z_1)$
2. **Layer 2**: $z_2 = a_1W_2 + b_2$,   $a_2 = \text{activation}(z_2)$
3. **Output**:  $z_3 = a_2W_3 + b_3$,   $\hat{y} = \sigma(z_3)$
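
The three steps above can be sketched directly in NumPy to see how the array shapes flow through the network. The layer sizes, the random weights, and the choice of ReLU for the hidden layers are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_input, n_h1, n_h2, n_out = 5, 2, 4, 3, 1   # illustrative sizes

X = rng.normal(size=(n_samples, n_input))
W1, b1 = rng.normal(size=(n_input, n_h1)), np.zeros((1, n_h1))
W2, b2 = rng.normal(size=(n_h1, n_h2)), np.zeros((1, n_h2))
W3, b3 = rng.normal(size=(n_h2, n_out)), np.zeros((1, n_out))

# Layer 1: linear step, then ReLU
z1 = X @ W1 + b1
a1 = np.maximum(0, z1)

# Layer 2: same pattern, fed by a1 instead of X
z2 = a1 @ W2 + b2
a2 = np.maximum(0, z2)

# Output: linear step, then sigmoid to get a probability per sample
z3 = a2 @ W3 + b3
y_hat = 1 / (1 + np.exp(-z3))

for name, arr in [("X", X), ("a1", a1), ("a2", a2), ("y_hat", y_hat)]:
    print(f"{name:5s} shape = {arr.shape}")
```

Each row is one sample, so the first dimension stays at `n_samples` while the second follows the layer widths.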

### Backward Pass:

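The beautiful part: **same pattern repeats for each layer!** The formulas below give gradients of the loss summed over the batch, and $\sum$ runs over the samples. The implementation in Step 2 multiplies the weight and bias gradients by $1/n$ to average over the $n$ samples.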

**Output Layer (Layer 3):**
- $\frac{\partial L}{\partial z_3} = \hat{y} - y$
- $\frac{\partial L}{\partial W_3} = a_2^T\left(\frac{\partial L}{\partial z_3}\right)$
- $\frac{\partial L}{\partial b_3} = \sum\left(\frac{\partial L}{\partial z_3}\right)$

**Hidden Layer 2:**
- $\frac{\partial L}{\partial a_2} = \left(\frac{\partial L}{\partial z_3}\right)W_3^T$
- $\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial a_2} \odot \text{activation}'(z_2)$
- $\frac{\partial L}{\partial W_2} = a_1^T\left(\frac{\partial L}{\partial z_2}\right)$
- $\frac{\partial L}{\partial b_2} = \sum\left(\frac{\partial L}{\partial z_2}\right)$

**Hidden Layer 1:**
- $\frac{\partial L}{\partial a_1} = \left(\frac{\partial L}{\partial z_2}\right)W_2^T$
- $\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot \text{activation}'(z_1)$
- $\frac{\partial L}{\partial W_1} = X^T\left(\frac{\partial L}{\partial z_1}\right)$
- $\frac{\partial L}{\partial b_1} = \sum\left(\frac{\partial L}{\partial z_1}\right)$

See the pattern? **Each layer follows the same backward recipe!**
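
One way to convince yourself that the recipe is right is a numerical gradient check. The sketch below writes out the three-layer backward pass by hand and compares it with central finite differences. Several things here are assumptions for the illustration: the tiny layer sizes, the random data, the use of `tanh` (chosen because it is smooth everywhere, unlike ReLU), and the choice to fold the $1/n$ averaging into `dz3`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2, W3, b3 = params
    a1 = np.tanh(X @ W1 + b1)
    a2 = np.tanh(a1 @ W2 + b2)
    y_hat = sigmoid(a2 @ W3 + b3)
    return a1, a2, y_hat

def loss(params, X, y):
    y_hat = forward(params, X)[-1]
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backward(params, X, y):
    W1, b1, W2, b2, W3, b3 = params
    a1, a2, y_hat = forward(params, X)
    dz3 = (y_hat - y) / len(y)             # mean loss: divide by n once here
    dW3 = a2.T @ dz3
    db3 = dz3.sum(axis=0, keepdims=True)
    dz2 = (dz3 @ W3.T) * (1 - a2 ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)
    return [dW1, db1, dW2, db2, dW3, db3]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.integers(0, 2, size=(6, 1)).astype(float)

sizes = [2, 4, 3, 1]
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [rng.normal(size=(n_in, n_out)) * 0.5, np.zeros((1, n_out))]

grads = backward(params, X, y)

# Central finite differences on every single parameter entry
eps = 1e-6
max_err = 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus = loss(params, X, y)
        p[idx] = old - eps
        loss_minus = loss(params, X, y)
        p[idx] = old                        # restore the original value
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - g[idx]))

print(f"max |numeric - analytic| = {max_err:.2e}")
```

If the analytic gradients are correct, the printed difference should be tiny compared with the gradient values themselves.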

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

In [None]:
# Activation functions and derivatives
def sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def tanh(z):
    return np.tanh(z)

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2

## Step 1: Generate Complex Dataset

Let's create a more challenging dataset that benefits from deeper networks!
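
To see why a dataset like concentric circles needs nonlinear features, here is a small sketch that fits a plain linear classifier to it. It uses scikit-learn's `LogisticRegression` as a stand-in linear baseline, which is an assumption of this example and not something the notebook relies on elsewhere. The variable names are hypothetical too.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_circles(n_samples=800, noise=0.1, factor=0.4, random_state=42)

# A linear model on the raw coordinates cannot cut one ring out of another
linear = LogisticRegression().fit(X_demo, y_demo)
acc_raw = linear.score(X_demo, y_demo)

# Adding the squared radius by hand makes the classes almost linearly separable
r2 = (X_demo ** 2).sum(axis=1, keepdims=True)
X_aug = np.hstack([X_demo, r2])
acc_aug = LogisticRegression().fit(X_aug, y_demo).score(X_aug, y_demo)

print(f"Raw features:      {acc_raw:.3f}")
print(f"With x1^2 + x2^2:  {acc_aug:.3f}")
```

The first accuracy should sit near chance and the second much higher. A hidden layer has to discover a feature like the squared radius on its own, which is what the networks below are trained to do.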

In [None]:
# Generate multiple datasets
X_moons, y_moons = make_moons(n_samples=800, noise=0.15, random_state=42)
X_circles, y_circles = make_circles(n_samples=800, noise=0.1, factor=0.4, random_state=42)

# We'll use circles (harder problem)
X, y = X_circles, y_circles

# Split and normalize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Moons
axes[0].scatter(X_moons[y_moons == 0, 0], X_moons[y_moons == 0, 1],
                label='Class 0', alpha=0.7, s=40, edgecolors='black', linewidth=0.5)
axes[0].scatter(X_moons[y_moons == 1, 0], X_moons[y_moons == 1, 1],
                label='Class 1', alpha=0.7, s=40, edgecolors='black', linewidth=0.5)
axes[0].set_title('Moons Dataset', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Circles (our choice)
axes[1].scatter(X_circles[y_circles == 0, 0], X_circles[y_circles == 0, 1],
                label='Class 0', alpha=0.7, s=40, edgecolors='black', linewidth=0.5)
axes[1].scatter(X_circles[y_circles == 1, 0], X_circles[y_circles == 1, 1],
                label='Class 1', alpha=0.7, s=40, edgecolors='black', linewidth=0.5)
axes[1].set_title('Circles Dataset (More Challenging!)', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X.shape[1]}")

## Step 2: Implement Deep MLP

Let's build a flexible MLP that can have any number of layers!

In [None]:
class MLP_Deep:
    """Multi-Layer Perceptron with configurable hidden layers"""
    
    def __init__(self, layer_sizes, learning_rate=0.1, n_iterations=1000, activation='relu'):
        """
        Args:
            layer_sizes: List of layer sizes [n_input, n_hidden1, n_hidden2, ..., n_output]
            learning_rate: Learning rate for gradient descent
            n_iterations: Number of training iterations
            activation: Activation function ('relu', 'sigmoid', 'tanh')
        """
        self.layer_sizes = layer_sizes
        self.n_layers = len(layer_sizes) - 1  # Number of weight matrices
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.activation = activation
        
        # Select activation functions
        if activation == 'relu':
            self.act_fn = relu
            self.act_derivative = relu_derivative
        elif activation == 'tanh':
            self.act_fn = tanh
            self.act_derivative = tanh_derivative
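        elif activation == 'sigmoid':
            self.act_fn = sigmoid
            self.act_derivative = sigmoid_derivative
        else:
            # Fail loudly instead of silently falling back to sigmoid
            raise ValueError(f"Unknown activation: {activation!r}")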
        
        # Initialize weights and biases for all layers
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers):
            # He initialization (2 / fan_in) for ReLU, Xavier-style 1 / fan_in scaling for others
            if activation == 'relu':
                w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            else:
                w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(1.0 / layer_sizes[i])
            
            b = np.zeros((1, layer_sizes[i+1]))
            
            self.weights.append(w)
            self.biases.append(b)
        
        self.cost_history = []
        
        # Store activations for backward pass
        self.z_cache = []  # Pre-activation values
        self.a_cache = []  # Post-activation values
    
    def forward(self, X):
        """Forward pass through all layers"""
        self.z_cache = []
        self.a_cache = [X]  # Input is a_0
        
        a = X
        
        # Forward through all layers
        for i in range(self.n_layers):
            z = a @ self.weights[i] + self.biases[i]
            self.z_cache.append(z)
            
            # Use activation function (sigmoid for output layer, chosen activation for hidden)
            if i == self.n_layers - 1:  # Output layer
                a = sigmoid(z)  # Binary classification
            else:  # Hidden layers
                a = self.act_fn(z)
            
            self.a_cache.append(a)
        
        return a
    
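    def compute_cost(self, y, y_pred):
        """Binary cross-entropy loss"""
        # Flatten both arrays: y is (n,) but y_pred is (n, 1), and mixing the
        # two shapes would broadcast to an (n, n) matrix and inflate the cost.
        y = np.asarray(y).reshape(-1)
        y_pred = np.asarray(y_pred).reshape(-1)
        n = len(y)
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        cost = -(1/n) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
        return cost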
    
    def backward(self, X, y):
        """Backward pass - compute gradients for all layers"""
        n = len(y)
        y = y.reshape(-1, 1)
        
        # Initialize gradient storage
        dW = [None] * self.n_layers
        db = [None] * self.n_layers
        
        # Output layer gradient (Layer L)
        # For binary cross-entropy + sigmoid, gradient simplifies to (y_pred - y)
        dz = self.a_cache[-1] - y
        
        # Backward through all layers
        for i in range(self.n_layers - 1, -1, -1):
            # Gradient w.r.t. weights and biases
            dW[i] = (1/n) * (self.a_cache[i].T @ dz)
            db[i] = (1/n) * np.sum(dz, axis=0, keepdims=True)
            
            # Propagate gradient to previous layer
            if i > 0:  # Not the first layer
                da = dz @ self.weights[i].T
                dz = da * self.act_derivative(self.z_cache[i-1])
        
        return dW, db
    
    def fit(self, X, y, verbose=True):
        """Train the model"""
        for iteration in range(self.n_iterations):
            # Forward pass
            y_pred = self.forward(X)
            
            # Compute cost
            cost = self.compute_cost(y, y_pred)
            self.cost_history.append(cost)
            
            # Backward pass
            dW, db = self.backward(X, y)
            
            # Update weights and biases
            for i in range(self.n_layers):
                self.weights[i] -= self.lr * dW[i]
                self.biases[i] -= self.lr * db[i]
            
            if verbose and (iteration % 200 == 0 or iteration == self.n_iterations - 1):
                print(f"Iteration {iteration:4d} | Cost: {cost:.6f}")
    
    def predict_proba(self, X):
        """Predict probabilities"""
        return self.forward(X)
    
    def predict(self, X, threshold=0.5):
        """Predict class labels"""
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).ravel()

## Step 3: Train Models with Different Depths

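Let's compare two networks. Note that the deep one is also larger (193 parameters vs. 41), so this is not a size-matched comparison:
1. Shallow: [2, 10, 1] (1 hidden layer)
2. Deep: [2, 16, 8, 1] (2 hidden layers)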

In [None]:
# Model 1: Shallow (1 hidden layer)
print("Training Shallow Network (1 hidden layer)...")
print("Architecture: Input(2) → Hidden(10) → Output(1)")
print()
model_shallow = MLP_Deep(
    layer_sizes=[2, 10, 1],
    learning_rate=0.5,
    n_iterations=2000,
    activation='relu'
)
model_shallow.fit(X_train, y_train, verbose=True)

In [None]:
# Model 2: Deep (2 hidden layers)
print("\nTraining Deep Network (2 hidden layers)...")
print("Architecture: Input(2) → Hidden1(16) → Hidden2(8) → Output(1)")
print()
model_deep = MLP_Deep(
    layer_sizes=[2, 16, 8, 1],
    learning_rate=0.5,
    n_iterations=2000,
    activation='relu'
)
model_deep.fit(X_train, y_train, verbose=True)

## Step 4: Compare Performance

In [None]:
# Evaluate both models
models = {
    'Shallow (1 layer)': model_shallow,
    'Deep (2 layers)': model_deep
}

for name, model in models.items():
    train_acc = np.mean(model.predict(X_train) == y_train)
    test_acc = np.mean(model.predict(X_test) == y_test)
    
    total_params = sum(w.size + b.size for w, b in zip(model.weights, model.biases))
    
    print(f"\n{name}:")
    print(f"  Total Parameters: {total_params}")
    print(f"  Train Accuracy:   {train_acc:.4f}")
    print(f"  Test Accuracy:    {test_acc:.4f}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# Plot 1: Learning curves
axes[0, 0].plot(model_shallow.cost_history, label='Shallow (1 layer)', linewidth=2, alpha=0.8)
axes[0, 0].plot(model_deep.cost_history, label='Deep (2 layers)', linewidth=2, alpha=0.8)
axes[0, 0].set_xlabel('Iteration', fontsize=12)
axes[0, 0].set_ylabel('Cost', fontsize=12)
axes[0, 0].set_title('Learning Curves', fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Helper function to plot decision boundary
def plot_decision_boundary(ax, model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    contour = ax.contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.6)
    ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
    ax.scatter(X[y == 0, 0], X[y == 0, 1], label='Class 0', 
               alpha=0.8, s=30, edgecolors='black', linewidth=0.5)
    ax.scatter(X[y == 1, 0], X[y == 1, 1], label='Class 1',
               alpha=0.8, s=30, edgecolors='black', linewidth=0.5)
    ax.set_xlabel('Feature 1', fontsize=11)
    ax.set_ylabel('Feature 2', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    return contour

# Plot 2 & 3: Decision boundaries (training)
plot_decision_boundary(axes[0, 1], model_shallow, X_train, y_train, 'Shallow - Training')
plot_decision_boundary(axes[0, 2], model_deep, X_train, y_train, 'Deep - Training')

# Plot 4 & 5: Decision boundaries (test)
plot_decision_boundary(axes[1, 1], model_shallow, X_test, y_test, 'Shallow - Test')
contour = plot_decision_boundary(axes[1, 2], model_deep, X_test, y_test, 'Deep - Test')

# Plot 6: Accuracy comparison
model_names = ['Shallow\n(1 layer)', 'Deep\n(2 layers)']
train_accs = [
    np.mean(model_shallow.predict(X_train) == y_train),
    np.mean(model_deep.predict(X_train) == y_train)
]
test_accs = [
    np.mean(model_shallow.predict(X_test) == y_test),
    np.mean(model_deep.predict(X_test) == y_test)
]

x_pos = np.arange(len(model_names))
width = 0.35
axes[1, 0].bar(x_pos - width/2, train_accs, width, label='Train', alpha=0.8)
axes[1, 0].bar(x_pos + width/2, test_accs, width, label='Test', alpha=0.8)
axes[1, 0].set_xlabel('Model', fontsize=12)
axes[1, 0].set_ylabel('Accuracy', fontsize=12)
axes[1, 0].set_title('Accuracy Comparison', fontsize=14, fontweight='bold')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(model_names)
axes[1, 0].set_ylim(0.7, 1.0)
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Add values on bars
for i, (train, test) in enumerate(zip(train_accs, test_accs)):
    axes[1, 0].text(i - width/2, train + 0.01, f'{train:.3f}', ha='center', fontsize=10)
    axes[1, 0].text(i + width/2, test + 0.01, f'{test:.3f}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

## Step 5: Visualize Layer Activations

Let's see what the hidden layers learned!

In [None]:
# Get activations from deep model
_ = model_deep.forward(X_train)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Input layer (original features)
im0 = axes[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, 
                      cmap='RdYlBu_r', s=30, alpha=0.7, edgecolors='black', linewidth=0.5)
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Input Layer (Original Features)', fontsize=13, fontweight='bold')
plt.colorbar(im0, ax=axes[0], label='Class')
axes[0].grid(True, alpha=0.3)

# Hidden layer 1 activations (first 2 dimensions)
h1_activations = model_deep.a_cache[1]  # After first hidden layer
im1 = axes[1].scatter(h1_activations[:, 0], h1_activations[:, 1], c=y_train,
                      cmap='RdYlBu_r', s=30, alpha=0.7, edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('Neuron 1 Activation', fontsize=12)
axes[1].set_ylabel('Neuron 2 Activation', fontsize=12)
axes[1].set_title('Hidden Layer 1 (First 2 Neurons)', fontsize=13, fontweight='bold')
plt.colorbar(im1, ax=axes[1], label='Class')
axes[1].grid(True, alpha=0.3)

# Hidden layer 2 activations (first 2 dimensions)
h2_activations = model_deep.a_cache[2]  # After second hidden layer
im2 = axes[2].scatter(h2_activations[:, 0], h2_activations[:, 1], c=y_train,
                      cmap='RdYlBu_r', s=30, alpha=0.7, edgecolors='black', linewidth=0.5)
axes[2].set_xlabel('Neuron 1 Activation', fontsize=12)
axes[2].set_ylabel('Neuron 2 Activation', fontsize=12)
axes[2].set_title('Hidden Layer 2 (First 2 Neurons)', fontsize=13, fontweight='bold')
plt.colorbar(im2, ax=axes[2], label='Class')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nCompare how well the classes separate from panel to panel")
print("(only the first 2 neurons of each hidden layer are plotted).")
print("Input → Hidden1: Features get transformed")
print("Hidden1 → Hidden2: Classes should become easier to separate")
print("Hidden2 → Output: A linear boundary on these activations makes the final decision")

## Step 6: Experiment with Even Deeper Networks

In [None]:
# Test different architectures
architectures = {
    'Shallow (1 layer)': [2, 20, 1],
    'Medium (2 layers)': [2, 16, 8, 1],
    'Deep (3 layers)': [2, 16, 12, 8, 1],
    'Very Deep (4 layers)': [2, 20, 16, 12, 8, 1]
}

results = {}

for name, arch in architectures.items():
    print(f"\nTraining {name}: {arch}")
    model = MLP_Deep(
        layer_sizes=arch,
        learning_rate=0.3,
        n_iterations=2000,
        activation='relu'
    )
    model.fit(X_train, y_train, verbose=False)
    
    train_acc = np.mean(model.predict(X_train) == y_train)
    test_acc = np.mean(model.predict(X_test) == y_test)
    total_params = sum(w.size + b.size for w, b in zip(model.weights, model.biases))
    
    results[name] = {
        'model': model,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'params': total_params
    }
    
    print(f"  Parameters: {total_params}")
    print(f"  Train Acc:  {train_acc:.4f}")
    print(f"  Test Acc:   {test_acc:.4f}")

In [None]:
# Compare all architectures
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Learning curves
for name, result in results.items():
    axes[0].plot(result['model'].cost_history, label=name, linewidth=2, alpha=0.8)
axes[0].set_xlabel('Iteration', fontsize=12)
axes[0].set_ylabel('Cost', fontsize=12)
axes[0].set_title('Learning Curves: Different Depths', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Accuracy vs Parameters
params_list = [results[name]['params'] for name in architectures.keys()]
train_accs = [results[name]['train_acc'] for name in architectures.keys()]
test_accs = [results[name]['test_acc'] for name in architectures.keys()]

axes[1].plot(params_list, train_accs, 'o-', linewidth=2, markersize=10, label='Train', alpha=0.8)
axes[1].plot(params_list, test_accs, 's-', linewidth=2, markersize=10, label='Test', alpha=0.8)

# Add labels
for i, name in enumerate(architectures.keys()):
    axes[1].annotate(name.split()[0], 
                     (params_list[i], test_accs[i]),
                     textcoords="offset points", 
                     xytext=(0,10), 
                     ha='center',
                     fontsize=9)

axes[1].set_xlabel('Number of Parameters', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Accuracy vs Model Complexity', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0.7, 1.0)

plt.tight_layout()
plt.show()

## Key Takeaways

### What We Learned:

1. **Deep Learning = Multiple Hidden Layers**
   - Shallow: Input → Hidden → Output
   - Deep: Input → Hidden₁ → Hidden₂ → ... → Output
   - Each layer transforms features progressively

2. **Backpropagation Scales Beautifully**
   - Same pattern for each layer:
     1. Compute error: $\frac{\partial L}{\partial z} = (\text{error from next layer}) \odot \text{activation}'(z)$
     2. Weight gradient: $\frac{\partial L}{\partial W} = \text{input}^T \times \text{error}$
     3. Bias gradient: $\frac{\partial L}{\partial b} = \sum(\text{error})$
     4. Propagate back: error for prev layer = error $\times W^T$
   - Works for ANY number of layers!

3. **Hierarchical Feature Learning**
   - **Layer 1**: Simple features
   - **Layer 2**: Combinations of simple features
   - **Layer 3**: High-level abstract features
   - **Output**: Final decision

4. **The Backpropagation Recipe (for any layer $i$):**
   ```python
   # Forward pass (save these!):
   z[i] = a[i-1] @ W[i] + b[i]
   a[i] = activation(z[i])
   
   # Backward pass (divide dW and db by the batch size to average):
   dz[i] = da[i] * activation_derivative(z[i])    # Error at this layer
   dW[i] = a[i-1].T @ dz[i]                       # Weight gradient
   db[i] = np.sum(dz[i], axis=0, keepdims=True)   # Bias gradient
   da[i-1] = dz[i] @ W[i].T                       # Propagate to previous layer
   ```

5. **Depth vs Width Trade-off**
   - **Wider networks** (more neurons per layer): More parameters
   - **Deeper networks** (more layers): Better feature hierarchy
   - Deeper networks can sometimes reach the same accuracy with fewer parameters, but this depends on the task and the data

6. **Practical Insights:**
   - **More layers ≠ always better**
     - Need more data for very deep networks
     - Risk of overfitting
     - Vanishing gradients in very deep networks
   
   - **Start shallow, go deeper if needed**
     - Try 1 hidden layer first
     - Add layers if performance plateaus
     - Monitor train vs test accuracy
   
   - **Activation functions matter**
     - ReLU: Most common for hidden layers
     - Sigmoid: Good for binary output
     - Tanh: Sometimes better than sigmoid for hidden layers
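
The vanishing-gradient and activation points above can be illustrated with a few lines of NumPy. This is a rough sketch: it looks only at the activation derivative and ignores the weight matrices, which also scale the gradient in a real network. The grid of `z` values and the layer counts are arbitrary choices for the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 1201)                # grid that includes z = 0
sig_grad = sigmoid(z) * (1 - sigmoid(z))    # peaks at 0.25 when z = 0
tanh_grad = 1 - np.tanh(z) ** 2             # peaks at 1.0 when z = 0
relu_grad = (z > 0).astype(float)           # exactly 0 or 1

print(f"max sigmoid'(z) = {sig_grad.max():.2f}")
print(f"max tanh'(z)    = {tanh_grad.max():.2f}")
print(f"max relu'(z)    = {relu_grad.max():.2f}")

# Best-case factor from the activation derivative alone after k sigmoid layers
for k in (1, 2, 5, 10):
    print(f"{k:2d} sigmoid layers: at most {0.25 ** k:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, each extra sigmoid layer can shrink the backpropagated signal by a factor of four or more. ReLU passes a derivative of exactly 1 for positive inputs, which is one reason it is the usual default for hidden layers.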

### Mathematical Beauty:

**Forward Pass (layer $i$):**
$$z^{(i)} = a^{(i-1)}W^{(i)} + b^{(i)}$$
$$a^{(i)} = g(z^{(i)})$$

**Backward Pass (layer $i$):**
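$$\delta^{(i)} = \left(\delta^{(i+1)}\left(W^{(i+1)}\right)^T\right) \odot g'(z^{(i)})$$
$$\frac{\partial L}{\partial W^{(i)}} = \frac{1}{m}\left(a^{(i-1)}\right)^T\delta^{(i)}$$
$$\frac{\partial L}{\partial b^{(i)}} = \frac{1}{m}\sum\delta^{(i)}$$

Where:
- $m$ = number of samples in the batch (called `n` in the code)
- $\delta^{(i)}$ = error at layer $i$, with one row per sample
- $\sum$ = sum over the samples (the rows of $\delta^{(i)}$)
- $g$ = activation function
- $g'$ = derivative of activation
- $\odot$ = element-wise multiplication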

**This is the universal pattern for training neural networks!**
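
The backward recursion needs a starting value at the output layer. As a short derivation, assuming a sigmoid output $\hat{y} = \sigma(z)$ and the per-sample binary cross-entropy $\ell = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right]$ used in this notebook, the chain rule gives:

$$\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$$
$$\frac{\partial \hat{y}}{\partial z} = \sigma(z)\left(1-\sigma(z)\right) = \hat{y}(1-\hat{y})$$
$$\delta^{(\text{out})} = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y$$

The $\hat{y}(1-\hat{y})$ factors cancel, which is why the output-layer error is simply the prediction minus the label. The symbol $\delta^{(\text{out})}$ is notation introduced here for the error at the output layer.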

### Why This Matters:

You now understand:
1. ✅ Linear Regression (simple prediction)
2. ✅ Logistic Regression (binary classification)
3. ✅ Shallow Neural Networks (1 hidden layer)
4. ✅ Deep Neural Networks (multiple hidden layers)

**You understand the fundamentals of deep learning!**

Everything else (CNNs, RNNs, Transformers, etc.) builds on these same principles:
- Forward pass: compute predictions
- Loss function: measure error
- Backpropagation: compute gradients
- Gradient descent: update parameters

The only differences are:
- Architecture (how layers are connected)
- Activation functions
- Loss functions
- Optimization tricks

**But the core idea is always the same: gradient descent with backpropagation!**