In [None]:
from fastai.vision.all import *
from fastbook import *

Machine learning models fit functions to data

We start off with an infinitely flexible function and get it to recognize patterns in input data

Let's start off with a quadratic:

In [None]:
def f(x): return 3*x**2 + 2*x + 1

plot_function(f, "$3x^2 + 2x + 1$") # DOLLAR SIGNS LET US WRITE MATH EQUATIONS

Let's create a function that makes creating quadratics easier:

In [None]:
def quad(a, b, c, x): return a*x**2 + b*x + c

quad(3,2,1, 1.5)

In [None]:
from functools import partial
# Partial application of quad where a,b,c are fixed values
def mk_quad(a,b,c): return partial(quad, a,b,c)
f = mk_quad(3,2,1)
f(1.5)  # Now only value we have to pass is x
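
To see what `partial` does on its own, here is a minimal sketch that only needs the standard library. `quad` is repeated from above so the block runs by itself, and `g` is just an illustrative name:

```python
from functools import partial

def quad(a, b, c, x): return a*x**2 + b*x + c

# Fix a=3, b=2, c=1; what comes back is a function of x only
g = partial(quad, 3, 2, 1)

# Calling g(x) is the same as calling quad(3, 2, 1, x)
print(g(1.5))                          # 3*1.5**2 + 2*1.5 + 1 = 10.75
print(g(1.5) == quad(3, 2, 1, 1.5))    # True
```

So `mk_quad(a,b,c)` is really just a shortcut for "give me `quad` with the first three arguments already filled in".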

In [None]:
plot_function(f)

Now we're going to create some data that roughly follows the shape of the function:
- we add noise because real-world data is noisy too

In [None]:
from numpy.random import normal,seed,uniform
np.random.seed(42)
# sets the seed so we get the same random numbers

def noise(x, scale): return normal(scale=scale, size=x.shape)
# normal() creates normally distributed random numbers

def add_noise(x, mult, add): return x * (1+noise(x,mult)) + noise(x,add)
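
A rough sketch of the same idea on a single number, to show what the two noise terms do. It uses the standard library's `random.gauss` instead of NumPy's `normal`, and `add_noise_scalar` is a made-up name for illustration, so treat it as an approximation of `add_noise` rather than the same function:

```python
import random

random.seed(42)  # fixed seed so repeated runs give the same numbers

def add_noise_scalar(v, mult, add):
    # v is scaled by (1 + noise), then shifted by a second, independent noise term
    return v * (1 + random.gauss(0, mult)) + random.gauss(0, add)

clean = 10.75    # f(1.5) for f = 3x^2 + 2x + 1
noisy = [add_noise_scalar(clean, 0.3, 1.5) for _ in range(5)]
print(noisy)     # five different values scattered around 10.75

# With both noise scales set to 0 the value comes back unchanged
print(add_noise_scalar(clean, 0.0, 0.0))
```

The multiplicative part makes the noise grow with the size of the value, while the additive part is the same size everywhere.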

In [None]:
x = torch.linspace(-2, 2, steps=20)[:,None]
# creates 20 evenly spaced points from -2 to 2; [:,None] adds a second axis, giving a 20x1 column

y = add_noise(f(x), 0.3, 1.5)
# f(x) with random noise added to it

plt.scatter(x,y);

The idea is that we're going to reconstruct the original equation: find the quadratic that best matches this data

In [None]:
from ipywidgets import interact

@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a, b, c):
    plt.scatter(x, y)
    plot_function(mk_quad(a, b, c), min=-2, max=2)
    plt.show()

What's supposed to happen is that we manually change the parameters ourselves and eyeball the fit.

That's not the best approach because we want a way to measure how far off we are from the true function.

So we're going to implement a loss function:

In [None]:
def mse(preds, acts): return ((preds-acts)**2).mean()
# MEAN SQUARED ERROR
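
A small worked example of mean squared error on plain Python lists, so the arithmetic is visible. `mse_list` is a hypothetical list-based stand-in for the tensor version above:

```python
def mse_list(preds, acts):
    # average of the squared differences between predictions and actual values
    return sum((p - a)**2 for p, a in zip(preds, acts)) / len(preds)

preds = [1.0, 2.0, 3.0]
acts  = [1.0, 4.0, 0.0]

# differences: 0, -2, 3  ->  squared: 0, 4, 9  ->  mean: 13/3
print(mse_list(preds, acts))
print(mse_list(acts, acts))   # 0.0: a perfect prediction has zero loss
```

Squaring means big misses are punished much more than small ones, and errors in either direction count the same.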

In [None]:
@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a,b,c):
    f = mk_quad(a,b,c)
    plt.scatter(x,y)
    loss = mse(f(x),y)
    plot_function(f, title=f"MSE: {loss:.2f}")

This way of changing the parameters is still manual. BUT, now we have a way of actually knowing how far we are from the true function (the lower the loss score the better our prediction)

### AUTOMATE PARAMETER OPTIMIZATION

We could keep tweaking the weights by hand, keeping only the changes that lead to a lower loss.

This is hella slow.

Instead we should use derivatives (remember we're trying to get to the bottom of the loss curve, where the weights are so close to the best-fitting parameters that a small change barely moves the loss)

We need a function that tells us: if an input increases, does the output increase or decrease, and by how much?

PyTorch can automatically do this for us :)
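
To make "does the output go up or down, and by how much" concrete, here is a rough sketch that estimates a slope numerically. `loss_1d` and `slope` are hypothetical helpers, and this finite-difference estimate is not how PyTorch gets its gradients (autograd differentiates the recorded operations directly); it only shows what the number means:

```python
def loss_1d(a):
    # toy loss with its minimum at a = 3
    return (a - 3)**2

def slope(fn, v, eps=1e-6):
    # central finite difference: approximate rate of change of fn at v
    return (fn(v + eps) - fn(v - eps)) / (2 * eps)

print(slope(loss_1d, 1.5))   # about -3: increasing a lowers the loss
print(slope(loss_1d, 5.0))   # about +4: increasing a raises the loss
print(slope(loss_1d, 3.0))   # about 0: we're at the bottom of the curve
```

A negative slope says "increase the parameter", a positive slope says "decrease it", and the size says how strongly the loss reacts.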

First thing we need is a function that takes the coefficients of the quadratic a, b, and c as inputs

In [None]:
def quad_mse(params):
    f = mk_quad(*params)    # star is used to pass list into mk_quad as separate inputs
    return mse(f(x), y)     # returns mean squared errors of predictions against actuals

This function takes in the coefficients of the quadratic and returns the loss

Example:

In [None]:
quad_mse([1.5, 1.5, 1.5])

So our MSE is 5.8336

It says it's a tensor, which means it doesn't just work with single numbers but also with lists or vectors of numbers (a 1D tensor)

We're now going to create parameters a, b, and c

In [None]:
abc = torch.tensor([1.5,1.5,1.5])   # All parameters are put into a rank-1-tensor
abc.requires_grad_()                # tell PyTorch that we want gradient calculated
                                    # for these numbers when used in a calculation

So now let's use it in a calculation:

In [None]:
loss = quad_mse(abc)
loss

The grad_fn at the end tells us that PyTorch knows how to calculate the gradients for our inputs, if we want them.

To tell PyTorch to do it, we use the backward() method:

In [None]:
loss.backward()

Now abc has an attribute called grad

In [None]:
abc.grad

This tells us:
- Increasing a will lead to a lower loss    (also the biggest change)
- Increasing b will lead to a lower loss    (not as much as a)
- Increasing c will lead to a lower loss    (lowest change of the three)

Thus, we should increase all of them to lower the loss

In [None]:
with torch.no_grad():
    # abc is being used in a function but we don't want gradients to be calculated
    
    abc -= abc.grad*0.01
    # only change parameters by a fraction of its respective gradient

    loss = quad_mse(abc)
    # calculate loss again
    
print(f'loss={loss:.2f}')

Let's now automate it so that we continue to decrease the loss:

In [None]:
for i in range(5):
    loss = quad_mse(abc)                        # calculate loss
    if abc.grad is not None: abc.grad.zero_()   # clear old gradients (backward() adds to them)
    loss.backward()                             # calculate gradients
    with torch.no_grad(): abc -= abc.grad*0.01  # update parameters
    print(f'step={i}; loss={loss:.2f}')         # print results

So we now have some coefficients:

In [None]:
abc

This is the most basic type of optimizer using gradient descent
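
The same loop can be sketched without PyTorch to show there's no magic in it. Assumptions here: the data is noise-free (so the answer is known to be a=3, b=2, c=1), the gradients are written out by hand instead of coming from `loss.backward()`, and `loss_and_grads` is a made-up helper:

```python
# Noise-free data from the quadratic 3x^2 + 2x + 1 on 20 points in [-2, 2]
xs = [-2 + 4*i/19 for i in range(20)]
ys = [3*x**2 + 2*x + 1 for x in xs]

def loss_and_grads(a, b, c):
    n = len(xs)
    errs = [a*x**2 + b*x + c - y for x, y in zip(xs, ys)]
    loss = sum(e**2 for e in errs) / n
    # derivative of the mean squared error with respect to each coefficient
    ga = sum(2*e*x**2 for e, x in zip(errs, xs)) / n
    gb = sum(2*e*x for e, x in zip(errs, xs)) / n
    gc = sum(2*e for e in errs) / n
    return loss, (ga, gb, gc)

a, b, c = 1.5, 1.5, 1.5    # same starting guess as above
lr = 0.01                  # same step size as above
losses = []
for step in range(2000):
    loss, (ga, gb, gc) = loss_and_grads(a, b, c)
    losses.append(loss)
    # step each parameter a little in the direction that lowers the loss
    a, b, c = a - lr*ga, b - lr*gb, c - lr*gc

print(f"first loss={losses[0]:.2f}, last loss={losses[-1]:.6f}")
print(f"a={a:.2f}, b={b:.2f}, c={c:.2f}")   # ends up close to 3, 2, 1
```

Five steps barely move the parameters; it takes a lot more steps at this step size before they settle near the true coefficients.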

THIS IS THE FOUNDATION OF HOW WE CREATE PARAMETERS

So, what is the mathematical function we are finding those parameters for?

We can't just use quadratics because it's unlikely that complex, real problems follow a quadratic

We can create an infinitely flexible function from rectified linear units:

In [None]:
def rectified_linear(m,b,x):
    y = m*x+b                   # LINEAR FUNCTION
    return torch.clip(y, 0.)    # Takes output y and turns anything below 0 to 0

In [None]:
plot_function(partial(rectified_linear, 1,1))

Let's now make this plot function interactive:

In [None]:
@interact(m=1.5, b=1.5)
def plot_relu(m, b):
    plot_function(partial(rectified_linear, m,b))

We could take this rectified linear function and create a double ReLU, which adds two rectified linear functions together:

In [None]:
def double_relu(m1,b1,m2,b2,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

In [None]:
@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1,b1,m2,b2))

We could add as many ReLUs together as we want. However squiggly the target function is, with enough ReLUs we can match it as closely as we want

Imagine an audio waveform: we could add together millions of ReLUs to match it almost exactly
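
Here is a rough sketch of that claim on a sine wave. One assumption to flag loudly: each ReLU gets its own output weight and a constant offset is added (like the output layer of a neural net), which `double_relu` above doesn't have. The weights are also computed directly from the target instead of learned, and `relu_sum_approx` and `max_error` are made-up helpers:

```python
import math

def relu(v): return max(v, 0.0)

def relu_sum_approx(fn, lo, hi, n):
    # Build a piecewise-linear approximation of fn on [lo, hi] from n ReLUs
    knots = [lo + (hi - lo)*i/n for i in range(n + 1)]
    slopes = [(fn(knots[i+1]) - fn(knots[i])) / (knots[i+1] - knots[i])
              for i in range(n)]
    # weight of each ReLU = how much the slope changes at its knot
    weights = [slopes[0]] + [slopes[i] - slopes[i-1] for i in range(1, n)]
    offset = fn(lo)
    def approx(x):
        # zip stops at the n weights, so the last knot (hi) is not used
        return offset + sum(w * relu(x - t) for w, t in zip(weights, knots))
    return approx

def max_error(fn, approx, lo, hi, samples=1000):
    # largest gap between fn and approx on an evenly spaced grid
    pts = [lo + (hi - lo)*i/samples for i in range(samples + 1)]
    return max(abs(fn(p) - approx(p)) for p in pts)

errs = {}
for n in (4, 16, 64):
    g = relu_sum_approx(math.sin, 0.0, 2*math.pi, n)
    errs[n] = max_error(math.sin, g, 0.0, 2*math.pi)
    print(n, errs[n])    # the error shrinks as more ReLUs are added
```

More ReLUs means more places where the line is allowed to bend, so the approximation hugs the curve more tightly.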

With this foundation, you can construct an arbitrarily accurate model

The problem is that we need some parameters, but we can easily get these using gradient descent

We have just derived deep learning.

Everything from now on is about ways to make it faster and make it need less data.