# Gradient-based Numerical Optimization

We want to minimize a differentiable function $f(x): \mathbb{R}^n \rightarrow \mathbb{R}$ with respect to its multi-dimensional input vector $x \in \mathbb{R}^n$:

$$ \min_x f(x) $$

Let us assume we can efficiently compute the gradient of $f$, that is:

$$ \nabla f = ( \partial_{x_0} f, \partial_{x_1} f, \dots, \partial_{x_{n-1}} f) \in \mathbb{R}^n $$
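
When the gradient is coded by hand, a quick way to gain confidence in it is to compare it against central finite differences. The following is a minimal sketch: the helper `numerical_gradient`, the step `h`, and the test function are all hypothetical choices made for illustration, not part of the original notebook.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h  # perturb only coordinate i
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

# Hypothetical test function f(x) = x_0^2 + 3 x_0 x_1 and its analytic gradient
f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
grad_f = lambda x: np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.5, -0.5])
max_err = np.max(np.abs(numerical_gradient(f, x) - grad_f(x)))
print("largest deviation between numerical and analytic gradient:", max_err)
```

Finite differences need $2n$ function evaluations per gradient, so they are useful as a check rather than as a replacement for an efficient analytic gradient.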

## Gradient Descent

A simple way to find a local minimum of $f$ is an iterative algorithm: at each iteration we move $x$ in the negative gradient direction, scaled by a (generally small) factor $\eta>0$:

$$ x_{k+1} = x_k - \eta \nabla f(x_k)$$

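The learning rate $\eta$ determines how far we move along the negative gradient direction at each step: if it is too small, convergence is slow; if it is too large, the iterates overshoot the minimum or diverge.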
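
The update rule can be sketched as a small reusable function. This is only an illustration under stated assumptions: the name `gradient_descent`, its arguments, and the test problem $f(x) = \lVert x - c \rVert^2$ are hypothetical and do not appear in the cells that follow.

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, n_steps=100):
    """Iterate x_{k+1} = x_k - eta * grad(x_k) and return all iterates."""
    x = np.asarray(x0, dtype=float).copy()
    history = [x.copy()]
    for _ in range(n_steps):
        x = x - eta * grad(x)  # step against the gradient
        history.append(x.copy())
    return np.array(history)

# Hypothetical test problem: f(x) = ||x - c||^2 with gradient 2 (x - c)
c = np.array([1.0, -2.0])
iterates = gradient_descent(lambda x: 2.0 * (x - c), x0=np.zeros(2), eta=0.1, n_steps=100)
x_final = iterates[-1]
print("final iterate:", x_final)
```

A fixed number of steps keeps the sketch simple; a practical implementation would more likely stop once the gradient norm falls below a tolerance.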

# 1. Gradient Descent on a Simple Convex Function

We start with a simple quadratic function:

$$ f(x) = (x - 3)^2 $$

- Convex and smooth
- Global minimum at $x = 3$
- Gradient: $f'(x) = 2(x - 3)$
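
For this function the effect of $\eta$ can be worked out by hand. Substituting the gradient into the update rule and subtracting the minimizer $3$ from both sides gives

$$ x_{k+1} - 3 = (x_k - 3) - 2\eta (x_k - 3) = (1 - 2\eta)(x_k - 3) $$

and therefore

$$ x_k - 3 = (1 - 2\eta)^k (x_0 - 3) $$

The distance to the minimum shrinks by the factor $|1 - 2\eta|$ at every iteration, so the iteration converges exactly when $0 < \eta < 1$. With $\eta = 0.5$ the minimum is reached in a single step, with $\eta = 1$ the iterates jump back and forth between $x_0$ and $6 - x_0$, and for $\eta > 1$ they diverge.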

Use the slider to explore different values of the learning rate $\eta$.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

def gradient_descent_convex(eta=0.1):
    f = lambda x: (x - 3)**2     # objective
    df = lambda x: 2*(x - 3)     # its derivative
    x = 0.0                      # starting point
    history = [x]
    for _ in range(30):
        x -= eta * df(x)         # gradient descent update
        history.append(x)

    x_vals = np.linspace(-1, 7, 200)
    fig = plt.figure(figsize=(12,5))
    ax = fig.subplots(1,2)
    ax[0].plot(x_vals, f(x_vals), label='f(x)=(x-3)^2')
    ax[0].scatter(history, f(np.array(history)), c='red', label='Iterations')
    ax[0].set_title(f'Convergence with Learning Rate η={eta}')
    ax[0].set_xlabel('x'); ax[0].set_ylabel('f(x)')
    ax[0].legend(); ax[0].grid(True)

    ax[1].plot(f(np.array(history)))
    ax[1].set_yscale("log")
    ax[1].set_ylabel("Cost function")
    ax[1].set_xlabel("Iteration")
    ax[1].grid(True)
    plt.show()

interact(gradient_descent_convex, eta=FloatSlider(value=0.1, min=0.01, max=1.1, step=0.01, description='η'));


# 2. Slow Convergence on an Ill-Conditioned Function

Now we explore a 2D function:

$$ f(x, y) = 100x^2 + y^2 $$

This function has very different curvature along the $x$ and $y$ axes. Use the slider to adjust $\eta$ and observe how gradient descent zig-zags in the narrow valley.
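
Because the gradient is $(200x, 2y)$, one gradient descent step multiplies $x$ by $(1 - 200\eta)$ and $y$ by $(1 - 2\eta)$. The sketch below tabulates these per-coordinate contraction factors for a few learning rates. The names `hessian_eigs`, `contraction_factors`, `eta_max`, and `eta_best` are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Curvatures (Hessian eigenvalues) of f(x, y) = 100 x^2 + y^2 along x and y
hessian_eigs = np.array([200.0, 2.0])
kappa = hessian_eigs.max() / hessian_eigs.min()  # condition number

def contraction_factors(eta):
    """Per-step factors |1 - eta * lambda| by which x and y are multiplied."""
    return np.abs(1.0 - eta * hessian_eigs)

eta_max = 2.0 / hessian_eigs.max()                          # at or above this, x stops converging
eta_best = 2.0 / (hessian_eigs.max() + hessian_eigs.min())  # equalizes both factors

print(f"condition number: {kappa:.0f}, stability limit on eta: {eta_max}")
for eta in (0.001, 0.005, eta_best, 0.02):
    rx, ry = contraction_factors(eta)
    print(f"eta={eta:.4f}  factor along x: {rx:.3f}  factor along y: {ry:.3f}")
```

A factor above 1 along $x$ means divergence, while a factor close to 1 along $y$ means slow progress: the steep direction caps the usable learning rate, and the flat direction then converges slowly.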

In [2]:
def gradient_descent_ill_conditioned(eta=0.05):
    f = lambda x, y: 100*x**2 + y**2
    df = lambda x, y: np.array([200*x, 2*y])

    x = np.array([1.5, 1.5])
    history = [x.copy()]
    history_f = [f(*x)]
    for _ in range(100):
        x -= eta * df(*x)  # gradient descent update
        history.append(x.copy())
        history_f.append(f(*x))
    history = np.array(history)

    X, Y = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
    Z = f(X, Y)
    fig = plt.figure(figsize=(12,5))
    ax = fig.subplots(1,2)
    ax[0].contour(X, Y, Z, levels=np.logspace(-1, 3, 20))
    ax[0].plot(history[:,0], history[:,1], 'ro-', label='Trajectory', alpha=0.5)
    ax[0].set_title(f'Ill-Conditioned Function, η={eta}')
    ax[0].set_xlabel('x'); ax[0].set_ylabel('y')
    ax[0].legend(); ax[0].grid(True)

    ax[1].plot(history_f)
    ax[1].set_yscale("log")
    ax[1].set_ylabel("Cost function")
    ax[1].set_xlabel("Iteration")
    ax[1].grid(True)
    plt.show()

interact(gradient_descent_ill_conditioned, eta=FloatSlider(value=0.01, min=0.001, max=0.05, step=0.001, description='η'));


# Summary and Discussion

| Case | Function | Behavior | Lesson |
|------|-----------|-----------|---------|
| Simple convex | $(x-3)^2$ | Converges quickly for a suitable $\eta$ | Works well when convex and well-scaled |
| Ill-conditioned | $100x^2 + y^2$ | Zig-zags along $x$, slow progress along $y$ | Sensitive to scaling, motivates preconditioning |

These interactive examples illustrate why plain gradient descent is often not enough for real-world problems, which motivates more advanced (e.g., **second-order**) optimization methods.
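
As one illustration of the second-order idea, a Newton step rescales the gradient by the inverse Hessian, which removes the difference in curvature between the two axes. The sketch below applies a single Newton step to the ill-conditioned quadratic from section 2. It assumes the Hessian is available in closed form, and the names `grad`, `hessian`, `p0`, and `p_newton` are hypothetical.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = 100 x^2 + y^2."""
    x, y = p
    return np.array([200.0 * x, 2.0 * y])

# The Hessian of this quadratic is constant
hessian = np.diag([200.0, 2.0])

p0 = np.array([1.5, 1.5])
# Newton step: solve H d = grad(p0) instead of forming the inverse explicitly
p_newton = p0 - np.linalg.solve(hessian, grad(p0))
print("after one Newton step:", p_newton)
```

For a quadratic function a single Newton step lands on the minimum. For general functions the step is only a local model and is usually combined with a line search or damping.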