# Lesson 3 (fast.ai 2022): Gradient Descent Function

As an exercise to make sure I understood how the code for gradient descent works, I tried packaging the gradient descent code from the first part of the [lesson](https://www.youtube.com/watch?v=hBBOjCiFcuo) into a reusable `gradient_descent` function.

Then below, I try using `gradient_descent` with various functions, plotting the result of each step on a graph.

**Please note that I am a complete Python (and AI) noob so some things might be strange / non-standard.**

Also, as of writing this, I have not watched the Excel part after the 'break'.

## Gradient Descent Function

This is the code for the `gradient_descent` function. It receives all required input as parameters. The lesson's notebook uses many variables in the outer (global) scope, so it was a bit hard for me to keep track of which variable came from where. With all required input passed in as parameters, it was easier to follow (for me at least).

In [1]:
from ipywidgets import interact, FloatSlider
import numpy as np
from numpy.random import normal, seed, uniform
from functools import partial

import torch
import matplotlib.pyplot as plt

# loss functions, can use "mean squared error" or "mean absolute error". 
# From my tests it seemed mse might be faster

def mse(preds, acts): 
    return ((preds-acts)**2).mean()

def mae(preds, acts):
    return (torch.abs(preds-acts)).mean()

# utility function for gradient_descent
# gets loss of predicted values using current parameter values
def _get_predicted_loss(get_predicted_y_fn, current_params, x_values, act_y_values, loss_function):
    predicted_y_values = get_predicted_y_fn(*current_params)(x_values)
    # detach before plotting so matplotlib gets a plain tensor without autograd history
    plt.plot(x_values, predicted_y_values.detach()) # plot latest result
    return loss_function(predicted_y_values, act_y_values)

# do gradient descent, returns resulting params
# can specify loss function to use
def gradient_descent(x_values, act_y_values, get_predicted_y_fn, start_params, steps, learning_rate, loss_function=mse):
    # work on a copy so the caller's start_params tensor is not modified in place
    current_params = start_params.clone().detach().requires_grad_()
    for i in range(steps):
        # _get_predicted_loss gets the loss using the supplied get_predicted_y_fn and x/y values
        loss = _get_predicted_loss(get_predicted_y_fn, current_params, x_values, act_y_values, loss_function)
        loss.backward()
        with torch.no_grad():
            current_params -= current_params.grad * learning_rate
        current_params.grad.zero_() # reset: backward() adds to .grad instead of overwriting it
    return current_params

######################################################

# utility noise methods for later

def noise(x, scale): return np.random.normal(scale=scale, size=x.shape)
def add_noise(x, mult, add): return x * (1+noise(x, mult)) + noise(x, add)
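
The `current_params.grad.zero_()` call in `gradient_descent` is there because PyTorch adds each new gradient onto whatever is already stored in `.grad` rather than overwriting it. Here is a minimal sketch of that behaviour, using a made-up tensor `p` and a toy loss that are not part of the notebook:

```python
import torch

# hypothetical parameters and toy loss, only for illustration
p = torch.tensor([1.0, 2.0], requires_grad=True)

(p ** 2).sum().backward()
first_grad = p.grad.clone()         # gradient of sum(p**2) is 2*p

(p ** 2).sum().backward()           # second backward pass without zeroing
accumulated_grad = p.grad.clone()   # first gradient + second gradient

p.grad.zero_()                      # reset, as gradient_descent does after each step
(p ** 2).sum().backward()
fresh_grad = p.grad.clone()         # a single gradient again

print(first_grad, accumulated_grad, fresh_grad)
```

Without the reset, every step would use the sum of all gradients seen so far, so the updates would keep growing.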


## Test Gradient Descent with Quadratic Function

The code below defines the quadratic function and uses the `gradient_descent` function to try to find the correct parameters.

The slider bars can be used to try changing the actual [a, b, c] values, the starting [a, b, c] values, the learning steps, and the learning rate.

Each line on the graph corresponds to the results of a step.

**Try playing around with all the sliders!**

It is quite magical how, whatever the target and wherever it starts from, it gradually converges on the correct values!
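
To see what each step is doing without autograd, here is a minimal pure-Python sketch of the same kind of loop for the quadratic, with the MSE gradients worked out by hand. The helper name `fit_quadratic`, the noise-free data, and the 500 steps are assumptions made for this illustration; they are not taken from the lesson.

```python
def fit_quadratic(xs, ys, start, steps, learning_rate):
    """Plain gradient descent on MSE for y = a*x**2 + b*x + c (hypothetical helper)."""
    a, b, c = start
    n = len(xs)
    for _ in range(steps):
        # error of the current guess at every x
        errs = [a * x**2 + b * x + c - y for x, y in zip(xs, ys)]
        # hand-derived partial derivatives of mean(err**2)
        grad_a = sum(2 * e * x**2 for e, x in zip(errs, xs)) / n
        grad_b = sum(2 * e * x for e, x in zip(errs, xs)) / n
        grad_c = sum(2 * e for e in errs) / n
        # step against the gradient
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
        c -= learning_rate * grad_c
    return a, b, c

# 20 evenly spaced x values in [-2, 2], like torch.linspace(-2, 2, steps=20)
xs = [-2 + 4 * i / 19 for i in range(20)]
ys = [3.0 * x**2 + 2.0 * x + 1.0 for x in xs]  # noise-free targets for a=3, b=2, c=1

a, b, c = fit_quadratic(xs, ys, start=(1.1, 1.1, 1.1), steps=500, learning_rate=0.05)
print(a, b, c)
```

Each parameter is nudged in whichever direction reduces the average squared error, which is the same update that `loss.backward()` plus the subtraction performs on the tensor of parameters.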

In [2]:
# define quad and mk_quad
def quad(a, b, c, x): return a*x**2 + b*x + c
def mk_quad(a, b, c): return partial(quad, a, b, c)

np.random.seed(42)

# create input x values
quad_x = torch.linspace(-2, 2, steps=20)[:, None]

@interact(a=3.0, b=2.0, c=1.0, start_a=1.1, start_b=1.1, start_c=1.1, steps=30, learning_rate=FloatSlider(min=0.01, max=0.1, step=0.01, value=0.05))
def test_gd_quad(a, b, c, start_a, start_b, start_c, steps, learning_rate):
    # create actual y values with some noise
    quad_y = add_noise(mk_quad(a, b, c)(quad_x), 0.01, 0.2)

    # scatter plot the input x and actual y values
    plt.scatter(quad_x, quad_y)

    # prepare the starting params
    params = torch.tensor([start_a, start_b, start_c])

    # we call the gradient descent function
    # we pass
    # - our input x and actual y values
    # - the mk_quad function
    # - the starting params (starting a, b, c values)
    # - how many steps we want it to do the gradient descent
    # - the learning rate used
    # - the loss function which is set to mse (you can try mae if you want)
    # the return value is the result params ie. the guessed values of [a, b, c]
    result_params = gradient_descent(quad_x, quad_y, mk_quad, params, steps, learning_rate, loss_function=mse)

    plt.title(result_params)



## Test Gradient Descent with Linear Function

Same thing, but we test it with the linear function.
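
For a straight line and the MSE loss there is also a closed-form least-squares answer, which makes a handy sanity check for where gradient descent should end up. Here is a minimal sketch in plain Python; `least_squares_line` is a hypothetical helper name and the noise-free data is an assumption for the example:

```python
def least_squares_line(xs, ys):
    """Closed-form slope and intercept that minimise the mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    b = mean_y - m * mean_x
    return m, b

# 20 evenly spaced x values in [-2, 2], like torch.linspace(-2, 2, steps=20)
xs = [-2 + 4 * i / 19 for i in range(20)]
ys = [5.0 * x - 2.0 for x in xs]  # noise-free line with m=5, b=-2

m, b = least_squares_line(xs, ys)
print(m, b)
```

On noisy data, gradient descent with `mse` heads toward these least-squares values rather than toward the exact `m` and `b` set on the sliders.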

In [3]:
def linear(m, b, x): return m*x + b
def mk_linear(m, b): return partial(linear, m, b)

np.random.seed(42)

linear_x = torch.linspace(-2, 2, steps=20)[:, None]

@interact(m=5.0, b=-2.0, start_m=1.1, start_b=1.1, steps=30, learning_rate=FloatSlider(min=0.01, max=0.1, step=0.01, value=0.05))
def test_gd_linear(m, b, start_m, start_b, steps, learning_rate):
    linear_y = add_noise(mk_linear(m, b)(linear_x), 0.01, 1)
    plt.scatter(linear_x, linear_y)
    params = torch.tensor([start_m, start_b])
    result_params = gradient_descent(
        linear_x, linear_y, mk_linear, params, steps, learning_rate, loss_function=mse)
    plt.title(result_params)



## Test Gradient Descent with Rectified Linear Function

It seems the rectified linear function takes more steps to converge (or maybe it's just my starting parameters).
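
One possible reason (a guess, not something checked against the runs here): wherever `m*x + b` is negative the ReLU output is flat at zero, so those points contribute zero gradient and do not pull on `m` or `b`. Here is a small pure-Python sketch that counts how many of the 20 x values are "active" for the default slider values; `count_active` is a hypothetical helper:

```python
def count_active(m, b, xs):
    """Number of points where m*x + b > 0, i.e. where the ReLU passes a gradient."""
    return sum(1 for x in xs if m * x + b > 0)

# 20 evenly spaced x values in [-2, 2], like torch.linspace(-2, 2, steps=20)
xs = [-2 + 4 * i / 19 for i in range(20)]

active_at_start = count_active(5.0, -2.0, xs)   # default start_m, start_b
active_at_target = count_active(8.0, 2.0, xs)   # default m, b

print(active_at_start, active_at_target, len(xs))
```

With the defaults, only 8 of the 20 points drive the first update, while the target function is non-zero on 11 of them.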

In [4]:
def relu(m,b,x):
    y = m*x+b
    return torch.clip(y, 0.)

def mk_relu(m, b): return partial(relu, m, b)


np.random.seed(42)

relu_x = torch.linspace(-2, 2, steps=20)[:, None]

@interact(m=8.0, b=2.0, start_m=5.0, start_b=-2.0, steps=30, learning_rate=FloatSlider(min=0.01, max=0.1, step=0.01, value=0.05))
def test_gd_relu(m, b, start_m, start_b, steps, learning_rate):
    relu_y = add_noise(mk_relu(m, b)(relu_x), 0.01, 0.2)
    plt.scatter(relu_x, relu_y)
    params = torch.tensor([start_m, start_b])
    result_params = gradient_descent(
        relu_x, relu_y, mk_relu, params, steps, learning_rate, loss_function=mse)
    plt.title(result_params)



## Test Gradient Descent with Double Rectified Linear Function

I also tried it out with the double rectified linear function! It takes a lot more steps to get close to the actual values, but it is pretty cool!
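
One thing to keep in mind when comparing the result with the "actual" values: the sum of two ReLUs does not care which ReLU is listed first, so the fitted `(m1, b1)` and `(m2, b2)` pairs can come out swapped and still describe the same curve. Here is a minimal pure-Python sketch of that symmetry; the `_py` function names are hypothetical stand-ins for the torch versions in the notebook:

```python
def relu_py(m, b, x):
    """Plain-Python rectified line: max(m*x + b, 0)."""
    return max(m * x + b, 0.0)

def double_relu_py(m1, b1, m2, b2, x):
    """Sum of two rectified lines."""
    return relu_py(m1, b1, x) + relu_py(m2, b2, x)

# 40 evenly spaced x values in [-10, 10], like torch.linspace(-10, 10, steps=40)
xs = [-10 + 20 * i / 39 for i in range(40)]

original = [double_relu_py(3.0, 3.0, -5.0, 12.0, x) for x in xs]
swapped = [double_relu_py(-5.0, 12.0, 3.0, 3.0, x) for x in xs]

print(original == swapped)
```

So a result whose parameters look "wrong" position by position may still be a good fit.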

In [5]:
def double_relu(m1,b1,m2,b2,x):
    return relu(m1,b1,x) + relu(m2,b2,x)
    #return linear(m1, b1, x) + linear(m2, b2, x)

def mk_double_relu(m1, b1, m2, b2): return partial(double_relu, m1, b1, m2, b2)

np.random.seed(42)

double_relu_x = torch.linspace(-10, 10, steps=40)[:, None]


@interact(m1=3.0, b1=3.0, m2=-5.0, b2=12.0, start_m1=5.0, start_b1=-2.0, start_m2=1.1, start_b2=1.1, steps=300, learning_rate=FloatSlider(min=0.01, max=0.1, step=0.01, value=0.04))
def test_gd_double_relu(m1, b1, m2, b2, start_m1, start_b1, start_m2, start_b2, steps, learning_rate):
    double_relu_y = add_noise(mk_double_relu(m1, b1, m2, b2)(double_relu_x), 0.01, 0.3)
    plt.scatter(double_relu_x, double_relu_y)
    params = torch.tensor([start_m1, start_b1, start_m2, start_b2])
    result_params = gradient_descent(
        double_relu_x, double_relu_y, mk_double_relu, params, steps, learning_rate, loss_function=mse)
    plt.title(result_params)

