# Calculus & Optimization for ML

## Key Concepts
- Gradients, Jacobians, Hessians
- Chain rule in vector form
- Gradient descent variants (SGD, momentum, Adam)
- Convexity

## References
- Matrix Cookbook: Sections on derivatives
- FOML: Optimization sections
- CS231n: Optimization notes

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## 1. Gradient Descent

Update rule:
$$w_{t+1} = w_t - \alpha \nabla L(w_t)$$

where $w_t$ are the parameters at step $t$, $\alpha > 0$ is the learning rate (step size), and $\nabla L(w_t)$ is the gradient of the loss evaluated at $w_t$.
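
The learning rate decides whether the iteration converges at all. As a minimal sketch, assume a quadratic loss $L(w) = \tfrac{1}{2} w^\top A w$ with symmetric positive definite $A$ (the matrix below is an illustrative choice). Each step multiplies $w$ by $I - \alpha A$, so the iterates shrink to zero exactly when $0 < \alpha < 2/\lambda_{\max}(A)$. The names `A_demo` and `contraction_factor` are made up for this example.

```python
import numpy as np

# Illustrative symmetric positive definite matrix (an assumption for this sketch).
A_demo = np.array([[2.0, 0.0], [0.0, 1.0]])

def contraction_factor(A, alpha):
    """Spectral radius of I - alpha * A: the worst-case per-step multiplier
    of gradient descent on 0.5 * w^T A w."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix, ascending
    return np.max(np.abs(1.0 - alpha * eigvals))

# Stability threshold 2 / lambda_max.
alpha_max = 2.0 / np.linalg.eigvalsh(A_demo)[-1]
print("largest stable step size:", alpha_max)

for alpha in (0.3, 0.9, 1.1):
    rho = contraction_factor(A_demo, alpha)
    status = "converges" if rho < 1 else "diverges"
    print(f"alpha={alpha}: factor={rho:.2f} ({status})")
```

A factor close to 1 means slow convergence, and a factor above 1 means the iterates grow without bound.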

In [None]:
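def gradient_descent(f, grad_f, x0, learning_rate=0.1, n_iters=100):
    """Basic gradient descent with a fixed step size.

    Args:
        f: Objective function mapping a NumPy array to a scalar
        grad_f: Gradient of f, returning an array shaped like its input
        x0: Initial point (NumPy array; it is copied, not modified)
        learning_rate: Step size
        n_iters: Number of iterations

    Returns:
        x_history: Array of iterates, shape (n_iters + 1, *x0.shape)
        f_history: Array of f(x) values, shape (n_iters + 1,)
    """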
    x = x0.copy()
    x_history = [x.copy()]
    f_history = [f(x)]
    
    for _ in range(n_iters):
        grad = grad_f(x)
        x = x - learning_rate * grad
        x_history.append(x.copy())
        f_history.append(f(x))
    
    return np.array(x_history), np.array(f_history)

# Test on quadratic: f(x) = x^T A x / 2
A = np.array([[2, 0], [0, 1]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x0 = np.array([2.0, 2.0])
x_hist, f_hist = gradient_descent(f, grad_f, x0, learning_rate=0.3, n_iters=20)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(f_hist)
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.title('Loss over iterations')

plt.subplot(1, 2, 2)
plt.plot(x_hist[:, 0], x_hist[:, 1], 'o-')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Optimization path')
plt.tight_layout()
plt.show()

## 2. SGD with Momentum

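Momentum accumulates a velocity $v$, an exponentially decaying sum of past gradients with $v_0 = 0$ and coefficient $\beta \in [0, 1)$, and steps along it instead of the raw gradient. This smooths the updates. The examples here use the exact gradient; in SGD proper, $\nabla L$ is replaced by a mini-batch estimate.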
$$v_{t+1} = \beta v_t + \nabla L(w_t)$$
$$w_{t+1} = w_t - \alpha v_{t+1}$$
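
To see what the velocity does, consider the idealized case of a constant gradient $g$ (an assumption made only for this sketch). With $v_0 = 0$ the recursion gives $v_t = g\,(1-\beta^t)/(1-\beta)$, so the effective step grows toward $\alpha g/(1-\beta)$, which is ten times the plain gradient step for $\beta = 0.9$.

```python
import numpy as np

beta = 0.9   # momentum coefficient
g = 1.0      # constant gradient (idealized assumption)

v = 0.0
velocities = []
for _ in range(100):
    v = beta * v + g          # same recursion as the velocity update above
    velocities.append(v)
velocities = np.array(velocities)

# Closed form of the geometric sum for comparison.
t = np.arange(1, 101)
closed_form = g * (1 - beta**t) / (1 - beta)

print(velocities[0], velocities[9], velocities[-1])
print(np.allclose(velocities, closed_form))
```

This is why a learning rate that works for plain gradient descent is often too large once momentum is switched on.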

In [None]:
def sgd_momentum(f, grad_f, x0, learning_rate=0.1, momentum=0.9, n_iters=100):
    """SGD with momentum"""
    x = x0.copy()
    v = np.zeros_like(x)
    x_history = [x.copy()]
    f_history = [f(x)]
    
    for _ in range(n_iters):
        grad = grad_f(x)
        v = momentum * v + grad
        x = x - learning_rate * v
        x_history.append(x.copy())
        f_history.append(f(x))
    
    return np.array(x_history), np.array(f_history)

# Compare with basic GD (note: the GD run above used learning_rate=0.3, this run uses 0.1)
x_hist_mom, f_hist_mom = sgd_momentum(f, grad_f, x0, learning_rate=0.1, momentum=0.9, n_iters=20)

plt.plot(f_hist, label='GD')
plt.plot(f_hist_mom, label='SGD + Momentum')
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.legend()
plt.title('Comparison: GD vs Momentum')
plt.show()

## 3. Adam Optimizer

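Adam combines momentum with per-coordinate adaptive step sizes. With $g_t = \nabla L(w_{t-1})$, $m_0 = v_0 = 0$, and all operations applied elementwise: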
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = m_t / (1-\beta_1^t)$$
$$\hat{v}_t = v_t / (1-\beta_2^t)$$
$$w_t = w_{t-1} - \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$$
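
Bias correction matters most in the first steps, when $m$ and $v$ are still close to their zero initialization. A small sketch (the gradient values are arbitrary illustrative numbers): at $t = 1$ the corrected moments are $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each coordinate moves by roughly $\alpha$ in the direction of $-\operatorname{sign}(g_1)$, whatever the gradient's scale.

```python
import numpy as np

alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
g1 = np.array([250.0, -0.004])   # arbitrary gradients of very different scale

# First Adam step starting from m_0 = v_0 = 0.
m = (1 - beta1) * g1
v = (1 - beta2) * g1**2
m_hat = m / (1 - beta1**1)       # equals g1
v_hat = v / (1 - beta2**1)       # equals g1**2
step = -alpha * m_hat / (np.sqrt(v_hat) + eps)

# Without bias correction the step is scaled by about (1 - beta1) / sqrt(1 - beta2).
step_uncorrected = -alpha * m / (np.sqrt(v) + eps)

print("corrected step:  ", step)
print("uncorrected step:", step_uncorrected)
```

Both coordinates take a step of nearly the same size even though the gradients differ by several orders of magnitude, which is the sense in which the step size adapts per coordinate.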

In [None]:
def adam(f, grad_f, x0, learning_rate=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=100):
    """Adam optimizer"""
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    x_history = [x.copy()]
    f_history = [f(x)]
    
    for t in range(1, n_iters + 1):
        grad = grad_f(x)
        
        # Update biased first moment
        m = beta1 * m + (1 - beta1) * grad
        # Update biased second moment
        v = beta2 * v + (1 - beta2) * grad**2
        
        # Bias correction
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        
        # Update
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
        
        x_history.append(x.copy())
        f_history.append(f(x))
    
    return np.array(x_history), np.array(f_history)

# Test Adam
x_hist_adam, f_hist_adam = adam(f, grad_f, x0, learning_rate=0.5, n_iters=20)

plt.plot(f_hist, label='GD')
plt.plot(f_hist_mom, label='Momentum')
plt.plot(f_hist_adam, label='Adam')
plt.xlabel('Iteration')
plt.ylabel('f(x)')
plt.legend()
plt.title('Optimizer Comparison')
plt.show()

## Exercises

1. Derive the gradient of the logistic regression loss with L2 regularization
2. Implement gradient descent for logistic regression
3. Compare the convergence of GD, Momentum, and Adam on the Rosenbrock function
4. Visualize the effect of the learning rate on convergence
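
As a possible starting point for exercise 3, here is a sketch of the Rosenbrock function $f(x, y) = (a - x)^2 + b\,(y - x^2)^2$ and its gradient. It uses the common choice $a = 1$, $b = 100$; the exercise does not fix these constants, so treat them as an assumption. The function names are placeholders, and both functions take a length-2 array.

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    """Rosenbrock function; global minimum 0 at (a, a**2)."""
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(p, a=1.0, b=100.0):
    """Analytic gradient of the Rosenbrock function."""
    x, y = p
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x**2)
    dy = 2.0 * b * (y - x**2)
    return np.array([dx, dy])

# Sanity check against central finite differences at an arbitrary point.
p0 = np.array([-1.2, 1.0])
h = 1e-6
numeric = np.array([
    (rosenbrock(p0 + h * e) - rosenbrock(p0 - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(np.allclose(numeric, rosenbrock_grad(p0), rtol=1e-5))
```

These can be passed as `f` and `grad_f` to the optimizers defined above. The narrow curved valley needs a much smaller learning rate for plain gradient descent than the quadratic did.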