# How to train your neural net

## Step 1: Define the neural network

In [4]:
var y = w * x;

Our neural net has one neuron with one input `x` and hence one weight `w`. The weight `w` is our single parameter.

Goal: We want to _train_ the neural net to return the value `2` when we give it the input `1`. I.e. we want to find the `w` for which this is the case.
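
For this single training example the answer can be read off directly, which gives us a value to check the training against later:

$$
w \times 1 = 2 \implies w = 2
$$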


## Step 2: Pick random start values for the parameters

To keep things simple I will be the random number generator.

In [5]:
var w = 0.5;

## Step 3: Training data

We need some.

In [6]:
double x = 1;          // input
double expectedY = 2;  // expected output, want our neural net to return
                       // this value for the given input

## Step 4: Calculate actual output

In [7]:
var y = w * x;
y

## Step 5: Calculate how far we are away from where we want to be

We want to quantify how far away we are from our goal, so that we can then reduce that distance, a.k.a. "the loss".

In [8]:
var loss = y - expectedY;
loss

(Build up function composition graph.)

## Step 6: Find the gradient of the loss function with respect to each parameter

(This is the revenge bit.)

Want to know how our parameter affects the loss. More specifically, we want to know which direction we need to nudge the parameter to reduce the loss. In other words, need to know the gradient of the loss function with respect to the parameter.

Could calculate $\frac{dloss}{dw}$ straight up in this simple case,

$$
\frac{dloss}{dw} = \frac{d(y - expectedY)}{dw} = \frac{d((w \times x) - expectedY)}{dw} = ?
$$

But that is not practical for more complex cases (like neural networks), so let's break it down, for science. Breaking it down also allows _automating_ the process.

Function composition allows application of chain-rule.

\begin{align*}
& & \text{Chain rule} & \implies & \frac{dloss}{dw} & = & \frac{dloss}{dy} \times \frac{dy}{dw} & &\\
loss & = & y - expectedY & \implies & \frac{dloss}{dy} & & & = & 1\\
y & = & w \times x         & \implies & \frac{dy}{dw} & & & = & x\\
& & \text{Chain rule} & \implies & \frac{dloss}{dw} & = & 1 \times x & = & x
\end{align*}

In [9]:
var grad = x;
grad
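
One way to sanity-check a hand-derived gradient is to compare it against a numerical estimate. The sketch below is not part of the original notebook: `Loss`, `h` and `numericGrad` are names made up for illustration, and the central-difference estimate is only an approximation of the true derivative.

```csharp
using System;

double x = 1;          // input
double expectedY = 2;  // expected output
double w = 0.5;        // current weight

// Hypothetical helper: the Step 5 loss written as a function of the weight.
double Loss(double weight) => weight * x - expectedY;

// Central difference: nudge the weight a little in both directions
// and measure how much the loss changes per unit of weight.
double h = 1e-6;
double numericGrad = (Loss(w + h) - Loss(w - h)) / (2 * h);

Console.WriteLine($"numeric: {numericGrad}, chain rule: {x}");
```

The two numbers should agree up to floating-point noise, since the chain rule says the gradient is simply `x`.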

## Step 7: Adjust parameters

In [10]:
w = w - 0.01 * grad

(`0.01` is a magic number called the "learning rate")

## Step 8: Rinse and repeat...

...until the loss is acceptably small. Profit.

## Step 9: However...

If we calculate the actual output again, we see that it has actually gotten further away from the expected output?! :confused:

In [11]:
y = w * x;
y

And if we calculate the loss again, we can see that it has gotten bigger (in absolute terms):

In [12]:
loss = y - expectedY;
loss
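
Working through the numbers from the cells so far (same values, just written out by hand):

\begin{align*}
\text{before the nudge:} \quad & w = 0.5 & \implies \quad & y = 0.5 \times 1 = 0.5 & \implies \quad & loss = 0.5 - 2 = -1.5\\
\text{the nudge:} \quad & w = 0.5 - 0.01 \times 1 = 0.49 & & & &\\
\text{after the nudge:} \quad & w = 0.49 & \implies \quad & y = 0.49 \times 1 = 0.49 & \implies \quad & loss = 0.49 - 2 = -1.51
\end{align*}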

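It turns out that, depending on our training data, which way round we do the subtraction matters. Here `y` is below `expectedY`, so the loss is negative, and reducing it pushes `y` even further from the target. To avoid this problem, we can square the difference.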

In [13]:
loss = Math.Pow(y - expectedY, 2);
loss

But that makes the maths more complicated:

$$
\frac{dloss}{dw} = \frac{d(((w \times x) - expectedY)^2)}{dw} = ?
$$


\begin{align*}
loss & = & z^2 & \implies & \frac{dloss}{dz} & = & & & 2z\\
z & = & y - expectedY & \implies & \frac{dz}{dy} & = & & & 1\\
& & \text{Chain rule} & \implies & \frac{dloss}{dy} & = & \frac{dloss}{dz} \times \frac{dz}{dy} & = & 2z\\
y & = & w \times x         & \implies & \frac{dy}{dw} & = & & & x\\
& & \text{Chain rule} & \implies & \frac{dloss}{dw} & = & \frac{dloss}{dz} \times \frac{dz}{dw} & = & \frac{dloss}{dz} \times \frac{dz}{dy} \times \frac{dy}{dw} = 2z \times 1 \times x\\
& & & & & & & = & 2 \times (y - expectedY) \times x
\end{align*}

In [14]:
grad = 2 * (y - expectedY) * x

Nudge again:

In [15]:
w = w - 0.01 * grad 
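
Written out by hand, assuming the notebook state at this point is `w = 0.49` and `y = 0.49` (the values left over from Step 9), the nudge now goes the right way, towards `w = 2`:

\begin{align*}
grad & = 2 \times (0.49 - 2) \times 1 = -3.02\\
w & = 0.49 - 0.01 \times (-3.02) = 0.5202
\end{align*}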

## Step 10: Putting it all together

In [16]:
// neural net
double w = 0.5;

// training data
double x = 1;
double expectedY = 2;

// just declared here for optimization...
double y;
double loss;
double grad;


In [19]:

for (int i = 0; i < 100; i++)
{
    y = w * x;                              // forward pass
    loss = Math.Pow(y - expectedY, 2);

    Console.WriteLine(loss);

    grad = 2 * (y - expectedY) * x;         // backward pass
    w = w - 0.01 * grad;
}

0.03957287986287364
0.0380057938203038
0.03650076438501975
0.035055334115372976
0.033667142884404186
0.03233392402618175
0.03105350063474499
0.02982378200960908
0.028642760242028557
0.02750850693644425
0.02641917006176105
0.025372970927315296
0.024368201278593637
0.0234032205079613
0.02247645297584601
0.021586385438002488
0.02073156457465762
0.01991059461750116
0.019122135070648125
0.018364898521850456
0.017637648540385166
0.01693919765818591
0.016268405430921742
0.015624176575857253
0.015005459183453283
0.014411242999788559
0.013840557776996946
0.013292471689027856
0.012766089810142366
0.01226055265366075
0.0117750347685758
0.011308743391740202
0.010860917153427276
0.010430824834151569
0.010017764170719161
0.009621060709558673
0.009240066705460138
0.008874160063923924
0.008522743325392549
0.00818524268970699
0.00786110707919461
0.00754980723885851
0.007250834872199707
0.006963701811260583
0.0066879392195346625
0.006423096826441072
0.006168742192114011
0.0059244600013063035
0.005689851

In [20]:
y

So now we can use our trained function to calculate outputs for inputs that it was not trained on. Behold:

In [21]:
var y = w * 3;
y
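
Step 8 said to repeat "until the loss is acceptably small", whereas the loop above runs a fixed number of times. A minimal sketch of the threshold version could look like the following. The names `learningRate`, `tolerance`, `maxSteps` and `steps`, and the chosen threshold, are assumptions made for illustration and are not from the original notebook.

```csharp
using System;

// neural net
double w = 0.5;

// training data
double x = 1;
double expectedY = 2;

double learningRate = 0.01;
double tolerance = 1e-6;   // assumed meaning of "acceptably small"
int maxSteps = 10000;      // safety cap so the loop always terminates
int steps = 0;

double loss = Math.Pow(w * x - expectedY, 2);

while (loss > tolerance && steps < maxSteps)
{
    double y = w * x;                          // forward pass
    double grad = 2 * (y - expectedY) * x;     // backward pass
    w = w - learningRate * grad;               // nudge the weight

    loss = Math.Pow(w * x - expectedY, 2);     // loss after the nudge
    steps++;
}

Console.WriteLine($"steps = {steps}, w = {w}, loss = {loss}");
```

With these values the loop should stop well before the safety cap, with `w` close to `2`.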

## One more thing (or two)

Neural nets are just more complicated functions that use function composition to combine neurons. The structure of neural nets makes deep learning possible, because we can use linear algebra/matrices to represent neural nets, and GPUs are good at doing linear algebra.

Also, there are tools that automate calculating the gradient. PyTorch and TensorFlow do this, which makes creating neural nets a lot easier.
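
To give a feel for what "automating" can mean, here is a toy sketch that gets the gradient without deriving a formula by hand. It is not how PyTorch or TensorFlow are implemented: it uses forward-mode differentiation with value/derivative pairs, whereas backpropagation walks the function composition graph in reverse. `Mul` and `Sub` are hypothetical helpers invented for this example.

```csharp
using System;

// Hypothetical helpers (not from any library). Every number is a pair:
// its value, and its derivative with respect to w.
(double Value, double Deriv) Mul((double Value, double Deriv) a, (double Value, double Deriv) b)
    => (a.Value * b.Value, a.Deriv * b.Value + a.Value * b.Deriv);   // product rule

(double Value, double Deriv) Sub((double Value, double Deriv) a, (double Value, double Deriv) b)
    => (a.Value - b.Value, a.Deriv - b.Deriv);                       // difference rule

var w = (Value: 0.5, Deriv: 1.0);          // dw/dw = 1
var x = (Value: 1.0, Deriv: 0.0);          // not a parameter, so derivative 0
var expectedY = (Value: 2.0, Deriv: 0.0);  // not a parameter, so derivative 0

// Same function composition as in Step 9; the derivative rides along.
var y = Mul(w, x);
var z = Sub(y, expectedY);
var loss = Mul(z, z);                      // z squared

// Hand-derived gradient from Step 9, for comparison.
double byHand = 2 * (y.Value - expectedY.Value) * x.Value;

Console.WriteLine($"automatic: {loss.Deriv}, by hand: {byHand}");
```

Both routes give the same gradient for these inputs, but the automatic one never needed the formula `2 * (y - expectedY) * x` to be worked out.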

### Dictionary corner

Can now understand some more terms:

#### Backpropagation

`Propagating the gradient "backwards" through the neural net.`

#### Gradient descent

`Minimizing the loss by nudging the parameters in the right direction given their gradient.`
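
As a formula, this is the nudge from Step 7. The symbol $\eta$ for the learning rate is notation assumed here, not from the steps above, where it was the magic number `0.01`:

$$
w \leftarrow w - \eta \times \frac{dloss}{dw}
$$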

### The final thing

Training a neural net requires judgement and experience. It is something of an art form that involves "fudging numbers", e.g. squaring the loss, choosing the learning rate, or handling a gradient that goes to 0.