In [3]:
from random import randint
from typing import List
from plotly import graph_objects as go


## Optimisation

In machine learning, optimisation is the process by which an algorithm iteratively adjusts its parameters so that its predictions improve. A `loss` function measures how far a prediction is from the observed value. A shrinking loss means the algorithm is improving; a growing loss means it is getting worse. The most common `loss` functions are:

#### Mean Squared Error (MSE)

$$
MSE = \frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - \hat{Y}_{i})^{2}
$$

In matrix notation, with the errors $e_{i} = Y_{i} - \hat{Y}_{i}$ collected in the vector $e$,

$$
MSE = \frac{1}{n} \sum_{i=1}^{n}(e_{i})^{2} = \frac{1}{n}e^{T}e
$$

#### Residual Sum of Squares (RSS)/Sum of Squared Errors (SSE)

$$
RSS = \sum_{i = 1}^{n} (y_{i} - f(x_{i}))^{2}
$$

#### Root Mean Square Error (RMSE)

$$
RMSE = \sqrt{\frac{\sum_{i = 1}^{N}(x_{i} - \hat{x}_{i})^{2}}{N}}
$$
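
The three formulas above can be sketched in plain Python to show how they relate: MSE is RSS divided by the number of samples, and RMSE is the square root of MSE. The function names and the sample values below are assumptions chosen for illustration; they are not part of the original notebook.

```python
from math import sqrt
from typing import List


def rss(actual: List[float], predicted: List[float]) -> float:
    """Residual sum of squares: the sum of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mse(actual: List[float], predicted: List[float]) -> float:
    """Mean squared error: RSS averaged over the number of samples."""
    return rss(actual, predicted) / len(actual)


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean square error: the square root of the MSE."""
    return sqrt(mse(actual, predicted))


# Hypothetical sample values, chosen only for illustration
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

print(rss(actual, predicted))   # 1 + 0 + 4 = 5.0
print(mse(actual, predicted))   # 5 / 3
print(rmse(actual, predicted))  # sqrt(5 / 3)
```

Because RMSE is expressed in the same units as the observed values, it is often easier to interpret than MSE.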

Gradient descent is like walking down a mountain: at each position you take a step in the direction that descends most steeply, and if a step would take you higher, you step back and look for another path, until you reach the lowest point of the mountain.

In [2]:
def paraboloid(x, y):
    return x**2 + y**2

In [6]:
# Test data generation (only really necessary for the plotting below)
xs_start = ys_start = -10
xs_stop = ys_stop = 11
xs_step = ys_step = 1

xs: List[float] = [i for i in range(xs_start, xs_stop, xs_step)]
ys: List[float] = [i for i in range(ys_start, ys_stop, ys_step)]
zs: List[List[float]] = []

for x in xs:
    temp_res: List[float] = []
    for y in ys:
        result: float = paraboloid(x, y)
        temp_res.append(result)
    zs.append(temp_res)

print(f'xs: {xs}\n')
print(f'ys: {ys}\n')
print(f'zs: {zs[:5]} ...\n')

xs: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

ys: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

zs: [[200, 181, 164, 149, 136, 125, 116, 109, 104, 101, 100, 101, 104, 109, 116, 125, 136, 149, 164, 181, 200], [181, 162, 145, 130, 117, 106, 97, 90, 85, 82, 81, 82, 85, 90, 97, 106, 117, 130, 145, 162, 181], [164, 145, 128, 113, 100, 89, 80, 73, 68, 65, 64, 65, 68, 73, 80, 89, 100, 113, 128, 145, 164], [149, 130, 113, 98, 85, 74, 65, 58, 53, 50, 49, 50, 53, 58, 65, 74, 85, 98, 113, 130, 149], [136, 117, 100, 85, 72, 61, 52, 45, 40, 37, 36, 37, 40, 45, 52, 61, 72, 85, 100, 117, 136]] ...



In [None]:
# Plotting the generated test data
fig = go.Figure(
    go.Surface(
        x = xs, y = ys, z = zs, colorscale = 'Viridis'
        ))

fig.show()

## Gradients and Derivatives

> ...a vector-valued function $f$ ..., whose value at a point $p$ is the vector whose components are the partial derivatives of $f$ at $p$.

In short, at any point $p$ in the domain of $f$ there is a vector of `partial derivatives`, the gradient, which points in the direction of greatest increase.

- A derivative measures the rate of change of a function with respect to changes in its input.
- A partial derivative is taken with respect to one variable while every other variable is held constant.

Using our paraboloid example:

$$
\frac{\partial}{\partial x} (x^{2} + y^{2}) = 2x
$$

$$
\frac{\partial}{\partial y} (x^{2} + y^{2}) = 2y
$$

With those partial derivatives we can now compute the gradient at any point $p$ on the plotted surface of the function $f$.

In [7]:
def compute_gradient(vec: List[float]) -> List[float]:
    assert len(vec) == 2
    x: float = vec[0]
    y: float = vec[1]
    return [2 * x, 2 * y]
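
One way to sanity-check the analytic partial derivatives is to compare them with a finite-difference approximation, which changes one variable by a small amount while holding the other constant. The sketch below is self-contained; `analytic_gradient`, `numeric_gradient`, the step `h` and the sample point are names and values assumed for illustration.

```python
from typing import List


def paraboloid(x: float, y: float) -> float:
    return x**2 + y**2


def analytic_gradient(vec: List[float]) -> List[float]:
    # Partial derivatives of x^2 + y^2 are 2x and 2y
    x, y = vec
    return [2 * x, 2 * y]


def numeric_gradient(vec: List[float], h: float = 1e-6) -> List[float]:
    x, y = vec
    # Central differences: vary one variable, hold the other constant
    d_dx = (paraboloid(x + h, y) - paraboloid(x - h, y)) / (2 * h)
    d_dy = (paraboloid(x, y + h) - paraboloid(x, y - h)) / (2 * h)
    return [d_dx, d_dy]


# Hypothetical sample point, chosen only for illustration
point = [3.0, -4.0]
print(analytic_gradient(point))  # [6.0, -8.0]
print(numeric_gradient(point))   # approximately [6.0, -8.0]
```

If the two results disagree by more than a small rounding error, the analytic derivative is probably wrong.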

Right now, the gradient vector is pointing upwards to the direction of greatest **increase**. We need to turn that vector into the opposite direction so that it points to the direction of greatest decrease.

We can do that if we multiply the gradient vector by `-1`.

The algorithm works as follows.

1. Get the starting position $p$ (which is represented as a vector) on $f$
2. Compute the gradient at point $p$
3. Multiply the gradient by a negative `step size` (usually a value smaller than `1`)
4. Compute the next position of $p$ on the surface by adding the rescaled gradient vector to the vector $p$
5. Repeat `step 2` with the new $p$ until convergence
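
As a compact restatement of the steps above, the update can be written as a single rule. The symbols $\eta$ (step size) and $p_{k}$ (position after $k$ steps) are notation introduced here for illustration:

$$
p_{k+1} = p_{k} - \eta \nabla f(p_{k})
$$

For the paraboloid, $\nabla f(p) = 2p$, so each step simply rescales the position:

$$
p_{k+1} = (1 - 2\eta) p_{k}
$$

With $0 < \eta < 1$ the factor $1 - 2\eta$ has magnitude below 1, so the position shrinks towards the minimum at $(0, 0)$.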

In [8]:
# learning_rate is the step size from the algorithm above

def compute_step(curr_pos: List[float], learning_rate: float) -> List[float]:
    grad: List[float] = compute_gradient(curr_pos)
    # Step against the gradient (the direction of greatest decrease),
    # scaled by the learning rate
    next_pos: List[float] = [
        curr_pos[0] - learning_rate * grad[0],
        curr_pos[1] - learning_rate * grad[1],
    ]
    return next_pos

In [9]:
# define a random starting position p

start_pos: List[float]

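# Ensure that we don't start at the minimum ((0, 0) in our case)
while True:
    # randint includes both endpoints, so stop at the last grid value
    start_x: float = randint(xs_start, xs_stop - 1)
    start_y: float = randint(ys_start, ys_stop - 1)
    # Reject only the origin itself; points on a single axis are fine
    if start_x != 0 or start_y != 0: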
        start_pos = [start_x, start_y]
        break

print(start_pos)

[8, 10]


Finally, we wrap our `compute_step` function in a loop to walk down the surface iteratively and approach the minimum (for this paraboloid, the global minimum at $(0, 0)$):

In [10]:
epochs: int = 5000
learning_rate: float = 0.001

best_pos: List[float] = start_pos

for i in range(0, epochs):
    next_pos: List[float] = compute_step(best_pos, learning_rate)
    if i % 500 == 0:
        print(f'Epoch {i}: {next_pos}')
    best_pos = next_pos

print(f'Best guess for a minimum: {best_pos}')

Epoch 0: [7.984, 9.98]
Epoch 500: [2.9342098587795595, 3.6677623234744514]
Epoch 1000: [1.078355147214322, 1.3479439340179038]
Epoch 1500: [0.3963076533344121, 0.4953845666680148]
Epoch 2000: [0.14564752298642572, 0.1820594037330319]
Epoch 2500: [0.05352710393957821, 0.06690887992447267]
Epoch 3000: [0.019671813137703987, 0.02458976642212994]
Epoch 3500: [0.007229612731553134, 0.009037015914441402]
Epoch 4000: [0.0026569640471043837, 0.0033212050588804732]
Epoch 4500: [0.0009764641910616871, 0.0012205802388271054]
Best guess for a minimum: [0.00035958074166348857, 0.0004494759270793589]
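
The loop above runs for a fixed number of epochs, whereas step 5 of the algorithm says to repeat until convergence. A minimal sketch of one possible stopping rule follows: stop once the gradient's length falls below a tolerance. The names `descend`, `tolerance` and `max_steps`, and the learning rate of `0.1`, are assumptions chosen for illustration; the start position is taken from the sample output above.

```python
from math import hypot
from typing import List, Tuple


def compute_gradient(vec: List[float]) -> List[float]:
    x, y = vec
    return [2 * x, 2 * y]


def descend(
    start: List[float],
    learning_rate: float,
    tolerance: float = 1e-6,
    max_steps: int = 100_000,
) -> Tuple[List[float], int]:
    """Walk downhill until the gradient is nearly zero or max_steps is hit."""
    pos = list(start)
    for step in range(max_steps):
        grad = compute_gradient(pos)
        # A near-zero gradient means the surface is locally flat
        if hypot(grad[0], grad[1]) < tolerance:
            return pos, step
        pos = [
            pos[0] - learning_rate * grad[0],
            pos[1] - learning_rate * grad[1],
        ]
    return pos, max_steps


final_pos, steps = descend([8.0, 10.0], learning_rate=0.1)
print(f'Stopped after {steps} steps at {final_pos}')
```

The `max_steps` cap is a safeguard: with a learning rate that is too large the iterates can diverge, and the tolerance check alone would never end the loop.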
