# Simple gradient descent example
### Christian Igel, 2024

In [None]:
usetex = True  # set to False if you encounter rendering problems
import matplotlib.pyplot as plt
if usetex:
    plt.rcParams.update({
        "text.usetex": True,
        "font.family": "sans-serif",
        "font.sans-serif": "Helvetica",
    })
import numpy as np

Let's define a quadratic function $f:\mathbb R^2\to\mathbb R$ as
$$f(x,y)=(\alpha x)^2 + y^2 + \alpha xy$$
with $\alpha = \frac{1}{2}$, together with its gradient
$$\nabla f(x,y)=\left(2\alpha^2 x + \alpha y,\; 2y + \alpha x\right):$$

In [None]:
alpha = 0.5  # some parameter changing the shape of the function
# Quadratic function
def f(x, y):
    return (alpha*x)**2 + y**2 + alpha*x*y
# Gradient of the function
def df(x, y):
    return (alpha**2)*2*x + alpha*y, 2*y + alpha*x
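
Before running the optimizer, it can be worth checking that the hand-derived gradient agrees with a numerical approximation. The following is a minimal sketch, not part of the original notebook: `numerical_gradient` is a hypothetical helper based on central differences, and the step size `h` and the test point are arbitrary choices. `f` and `df` are repeated so that the cell runs on its own.

```python
import numpy as np

alpha = 0.5  # same parameter as above


def f(x, y):
    return (alpha * x) ** 2 + y ** 2 + alpha * x * y


def df(x, y):
    return 2 * alpha ** 2 * x + alpha * y, 2 * y + alpha * x


def numerical_gradient(func, x, y, h=1e-6):
    """Central-difference approximation of the gradient (hypothetical helper)."""
    gx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    gy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return gx, gy


# Compare both gradients at the starting point used later in the notebook
analytic = df(0.9, 0.8)
numeric = numerical_gradient(f, 0.9, 0.8)
grad_ok = np.allclose(analytic, numeric, atol=1e-6)
print("analytic:", analytic)
print("numeric: ", numeric)
print("match:   ", grad_ok)
```

If the two gradients disagree, the usual suspects are a sign error or a missing factor in `df`.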

Now we minimize the function using gradient descent with learning rate `eta`, repeatedly moving against the gradient: $(x, y) \leftarrow (x, y) - \eta\,\nabla f(x, y)$.

In [None]:
# Learning rate
eta = 0.5
# Number of steps
n_iter = 4

r = 1.  # we will plot the function over x, y in [-r, r]

# Define starting point in the upper right corner of plot
xi = 0.9*r  
yi = 0.8*r
p_x = [xi]  # list of x-values
p_y = [yi]  # list of y-values

# Run gradient descent (steepest descent with a fixed step size):
for i in range(n_iter):
    dx, dy = df(xi, yi)  # compute gradient
    xi -= eta * dx  # update x-coordinate
    yi -= eta * dy  # update y-coordinate
    p_x.append(xi)  # store x-coordinate
    p_y.append(yi)  # store y-coordinate
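
Because $f$ is quadratic, its Hessian is constant, and the loop above is a linear iteration whose behaviour is determined by the Hessian's eigenvalues. The sketch below is an addition, not part of the original notebook. It computes those eigenvalues with `np.linalg.eigvalsh` and reports, for a few learning rates, the worst-case factor by which the distance to the optimum shrinks per step. The names `H`, `eta_max`, and `contraction_factor` are illustrative assumptions.

```python
import numpy as np

alpha = 0.5  # same parameter as above

# Hessian of f(x, y) = (alpha*x)**2 + y**2 + alpha*x*y (constant for a quadratic)
H = np.array([[2 * alpha ** 2, alpha],
              [alpha, 2.0]])

eigvals = np.linalg.eigvalsh(H)  # eigenvalues in ascending order
eta_max = 2.0 / eigvals[-1]      # fixed-step gradient descent diverges above this


def contraction_factor(eta):
    """Largest |1 - eta * lambda| over the Hessian eigenvalues (illustrative helper)."""
    return float(np.max(np.abs(1.0 - eta * eigvals)))


print("eigenvalues:", eigvals)
print("largest stable learning rate:", eta_max)
for eta in [0.01, 0.1, 0.5, 0.75]:
    print(f"eta = {eta}: per-step contraction factor {contraction_factor(eta):.4f}")
```

A factor below 1 means the iterates converge to the optimum. A negative value of $1 - \eta\lambda$ for the largest eigenvalue means the iterate overshoots along that direction and changes sign there at every step.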

Plot the optimization steps on top of the contour lines of $f$:

In [None]:
# Make contour plot
x = np.linspace(-r, r, 50)
y = np.linspace(-r, r, 50)

X, Y = np.meshgrid(x, y)
Z = f(X, Y)
contours = plt.contour(X, Y, Z, [0.01, 0.05, 0.1, 0.5, 1.], colors='grey')
plt.clabel(contours, inline=True, fontsize=6)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.imshow(Z, extent=[-r, r, -r, r], origin='lower', cmap='RdGy', alpha=0.5)

# Add optimum
plt.plot(0, 0, 'x', c='k')

# Plot gradient steps
for i in range(n_iter):
    plt.arrow(p_x[i], p_y[i], p_x[i+1]-p_x[i], p_y[i+1]-p_y[i],
              width=.005, head_width=.045, head_length=.025,
              length_includes_head=True, fc='b', ec='b', zorder=10)

Modify the example:

* Try different learning rates, `eta = 0.01`, `eta = 0.1`, `eta = 0.5`, and `eta = 0.75`, and vary the number of steps `n_iter`.

* Add a momentum term to the update step.

* Reimplement the model in PyTorch and use [automatic differentiation](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) to compute the gradient.
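
As a possible starting point for the momentum exercise, here is a minimal sketch of one common variant (often called heavy-ball momentum). It is an assumption about what the exercise intends, not the author's reference solution. The function name `gradient_descent_momentum`, the momentum coefficient `mu`, and the default values are all illustrative choices. `df` is repeated so that the cell runs on its own.

```python
alpha = 0.5  # same parameter as above


def df(x, y):
    return 2 * alpha ** 2 * x + alpha * y, 2 * y + alpha * x


def gradient_descent_momentum(x0, y0, eta=0.1, mu=0.5, n_iter=200):
    """Gradient descent with a heavy-ball momentum term (illustrative sketch).

    The velocity (vx, vy) accumulates past gradient steps; mu = 0 gives
    plain gradient descent as in the loop above.
    """
    x, y = x0, y0
    vx, vy = 0.0, 0.0
    p_x, p_y = [x], [y]
    for _ in range(n_iter):
        dx, dy = df(x, y)          # compute gradient
        vx = mu * vx - eta * dx    # update velocity in x
        vy = mu * vy - eta * dy    # update velocity in y
        x += vx                    # move along the velocity
        y += vy
        p_x.append(x)
        p_y.append(y)
    return p_x, p_y


p_x, p_y = gradient_descent_momentum(0.9, 0.8)
print("final point:", p_x[-1], p_y[-1])
```

The returned lists have the same layout as `p_x` and `p_y` in the notebook, so the plotting cell can be reused after replacing `n_iter` with `len(p_x) - 1` in the arrow loop.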