## 3. Gradient Descent & Optimization Algorithms

We minimize a loss function $L(\theta)$ using **gradient descent**:

$$
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)
$$
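The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original notebook: the helper `gradient_descent`, the toy quadratic loss, and the values of `eta` and `steps` are all assumptions chosen for clarity.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, eta=0.1, steps=100):
    """Iterate theta <- theta - eta * grad_fn(theta) and return the final theta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_fn(theta)
    return theta

# Toy loss (assumed for illustration): L(theta) = 0.5 * ||theta - target||^2,
# whose gradient is simply theta - target.
target = np.array([3.0, -2.0])
theta_final = gradient_descent(lambda th: th - target, theta0=[0.0, 0.0])
print(theta_final)  # should end up close to `target`
```

On this loss each step shrinks the distance to the minimum by a factor of $1 - \eta$, so a larger $\eta$ (below 2) converges faster, while $\eta > 2$ would diverge.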

Variants:
- SGD (stochastic gradient descent)
- Momentum
- RMSProp
- Adam

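Each variant modifies the update rule. SGD estimates the gradient from a mini-batch, Momentum accumulates past gradients into a velocity term, and RMSProp and Adam rescale each parameter's step by a running average of squared gradients, which makes them less sensitive to differences in curvature across directions.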
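To make the differences concrete, here is a hedged NumPy sketch of the four update rules in their common textbook form. The function names, the `state` dictionary, the hyperparameter defaults, and the toy quadratic are assumptions for illustration; library implementations such as Keras may differ in details like where `eps` is applied.

```python
import numpy as np

def sgd_step(theta, g, state, lr=0.01):
    # Plain gradient step (here with an exact gradient rather than a mini-batch estimate)
    return theta - lr * g

def momentum_step(theta, g, state, lr=0.01, beta=0.9):
    # Velocity is an exponentially decaying sum of past gradients
    state["v"] = beta * state.get("v", 0.0) + g
    return theta - lr * state["v"]

def rmsprop_step(theta, g, state, lr=0.01, rho=0.9, eps=1e-7):
    # Divide each coordinate by a running RMS of its own gradient
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g**2
    return theta - lr * g / (np.sqrt(state["s"]) + eps)

def adam_step(theta, g, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-7):
    # Momentum-style first moment plus RMSProp-style second moment, both bias-corrected
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
    m_hat = state["m"] / (1 - b1**t)
    v_hat = state["v"] / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def run(step_fn, steps=500):
    """Minimize L(theta) = 0.5 * sum(curv * theta**2) from a fixed start."""
    curv = np.array([10.0, 1.0])   # one steep and one shallow direction
    theta, state = np.array([1.0, 1.0]), {}
    for _ in range(steps):
        theta = step_fn(theta, curv * theta, state)   # gradient is curv * theta
    return 0.5 * np.sum(curv * theta**2)

final_losses = {f.__name__: run(f)
                for f in (sgd_step, momentum_step, rmsprop_step, adam_step)}
print(final_losses)
```

Because RMSProp and Adam normalize each coordinate by its own gradient scale, their trajectories on this toy problem are the same in the steep and the shallow direction, whereas SGD and Momentum move at different rates along each.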

In [None]:
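# Compare optimizers on a simple regression
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Noisy cubic: 100 points on [-1, 1], shaped as a column vector
X = np.linspace(-1, 1, 100).reshape(-1,1)
y = X**3 + 0.1*np.random.randn(*X.shape)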

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='tanh', input_shape=(1,)),
        tf.keras.layers.Dense(1)
    ])

optimizers = {
    "SGD": tf.keras.optimizers.SGD(learning_rate=0.01),
    "Momentum": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "RMSProp": tf.keras.optimizers.RMSprop(learning_rate=0.01),
    "Adam": tf.keras.optimizers.Adam(learning_rate=0.01)
}

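# Train a fresh model per optimizer and record its loss curve.
# Each model starts from a different random initialization,
# so small gaps between curves may be due to chance.
histories = {}
for name, opt in optimizers.items():
    model = build_model()
    model.compile(optimizer=opt, loss='mse')
    h = model.fit(X, y, epochs=100, verbose=0)
    histories[name] = h.history['loss']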

# Plot
for name, loss in histories.items():
    plt.plot(loss, label=name)
plt.legend()
plt.ylabel("MSE Loss")
plt.xlabel("Epoch")
plt.show()

## 4. Vanishing/Exploding Gradients

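During backpropagation, the gradient with respect to the input is a product of layer Jacobians $J_i = \partial x_i / \partial x_{i-1}$:

$$
\frac{\partial L}{\partial x_0} =
\left( \prod_{i=1}^n J_i^\top \right) \frac{\partial L}{\partial x_n}
= J_1^\top J_2^\top \cdots J_n^\top \, \frac{\partial L}{\partial x_n}
$$

If the singular values of the $J_i$ are mostly < 1, the product shrinks exponentially with depth → **vanishing gradient**.  
If they are mostly > 1, it grows exponentially → **exploding gradient**.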
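The effect of the Jacobian product can be illustrated with random linear layers. This is a sketch under simplifying assumptions: each Jacobian is a scaled orthogonal matrix, so all of its singular values equal `scale`, and the helper `backprop_norm` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norm(scale, depth=50, width=32):
    """Norm of a unit gradient vector after passing back through `depth` layers."""
    g = rng.standard_normal(width)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        # Orthogonal matrix times `scale`: every singular value equals `scale`
        q, _ = np.linalg.qr(rng.standard_normal((width, width)))
        g = (scale * q).T @ g
    return np.linalg.norm(g)

norms = {scale: backprop_norm(scale) for scale in (0.9, 1.0, 1.1)}
for scale, norm in norms.items():
    print(f"singular values = {scale}: gradient norm after 50 layers = {norm:.3e}")
```

In this idealized setting the norm is exactly `scale ** depth`, so even a 10% deviation from 1 changes the gradient by orders of magnitude over 50 layers. Real Jacobians have a spread of singular values, but the same exponential trend applies.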

In [None]:
# Demonstrate vanishing gradient with deep sigmoid network
import tensorflow as tf

deep_model = tf.keras.Sequential(
    [tf.keras.layers.Dense(32, activation='sigmoid', input_shape=(1,))] +
    [tf.keras.layers.Dense(32, activation='sigmoid') for _ in range(10)] +
    [tf.keras.layers.Dense(1)]
)

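x_sample = tf.constant([[0.5]])
with tf.GradientTape() as tape:
    y_pred = deep_model(x_sample)
# Gradient of the output w.r.t. every kernel and bias, ordered from input layer to output layer
grads = tape.gradient(y_pred, deep_model.trainable_variables)

# The first entries belong to the layers closest to the input,
# where vanishing gradients show up most strongly
grad_norms = [tf.norm(g).numpy() for g in grads if g is not None]
grad_norms[:5]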

## 6. Initialization Schemes

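- **Xavier/Glorot** initialization: draws weights with variance $2/(n_\text{in} + n_\text{out})$ to balance signal variance in the forward and backward passes; suited to tanh-like activations.  
- **He initialization**: uses variance $2/n_\text{in}$, which compensates for ReLU zeroing roughly half of the pre-activations.  

The choice of scheme controls how activations and gradients propagate at initialization.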
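The difference becomes visible once several ReLU layers are stacked. The following NumPy sketch is an illustration under stated assumptions: it uses Gaussian weights with the Glorot and He variances (Keras's `glorot_uniform` draws from a uniform distribution with the same variance), equal-width layers, and a hypothetical helper `final_activation_scale`.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_scale(init, depth=20, width=256, n_samples=512):
    """Mean squared activation after `depth` ReLU layers of equal width."""
    h = rng.standard_normal((n_samples, width))
    for _ in range(depth):
        if init == "glorot":
            std = np.sqrt(2.0 / (width + width))   # Var = 2 / (fan_in + fan_out)
        else:  # "he"
            std = np.sqrt(2.0 / width)             # Var = 2 / fan_in
        W = rng.standard_normal((width, width)) * std
        h = np.maximum(h @ W, 0.0)                 # ReLU
    return np.mean(h**2)

scales = {name: final_activation_scale(name) for name in ("glorot", "he")}
for name, value in scales.items():
    print(f"{name}: mean squared activation after 20 ReLU layers = {value:.3e}")
```

With equal fan-in and fan-out, the Glorot variance is half the He variance, so under ReLU the signal roughly halves at every layer, while He initialization keeps it near its starting scale in expectation.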

In [None]:
for init in ["glorot_uniform", "he_normal"]:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", kernel_initializer=init, input_shape=(100,)),
        tf.keras.layers.Dense(1)
    ])
    # Standard-normal inputs, cast to float32 to match the layer dtype
    x_rand = np.random.randn(1000, 100).astype("float32")
    y_pred = model(x_rand)
    print(init, "output variance:", np.var(y_pred.numpy()))

## 7. Bridges: Linking Math & Deep Learning

- **Activation choice → gradient flow**  
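  Sigmoid and tanh saturate, so their derivatives (at most 0.25 for sigmoid) shrink gradients layer by layer. ReLU combined with He initialization keeps the signal scale roughly constant at initialization.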

- **Optimization method → curvature adaptation**  
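  Plain SGD progresses slowly along ill-conditioned valleys; Momentum accelerates along consistent gradient directions; RMSProp & Adam rescale steps per parameter, which partly compensates for differing curvature.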

👉 The combination of activation + initialization + optimizer determines
how *learnable* a deep network is.