# How to train a neural net(work)

## Step 1: Define the neural network

Our neural net has one neuron with one input and hence one weight. The weight is our single parameter.

Goal: We want to _train_ the neural net to return the desired output for a given input.

In [1]:
var y = w * x;

Error: (1,9): error CS0103: The name 'w' does not exist in the current context
(1,13): error CS0103: The name 'x' does not exist in the current context

## Step 2: Pick random start values for the parameters

To keep things simple I will be the random number generator.

In [2]:
var w = 0.5;

## Step 3: Training data

We need some.

In [3]:
double x = 1;          // input
double expectedY = 2;  // expected output, want our function to return this value for our given input

## Step 4: Calculate actual output

In [4]:
var y = w * x;
y

## Step 5: Calculate how far we are away from where we want to be

Want to quantify how far away we are from our goal, so we can then minimize "the loss".

In [5]:
var loss = y - expectedY; // = (w * x) - expectedY
loss

(Build up function composition graph.)

## Step 6: Find the gradient of the loss function with respect to each parameter

Want to know which direction we need to nudge our parameters in to reduce the loss. Calculus to the rescue! Could do $\frac{dl}{dw}$ straight up in this simple case,

$\frac{dl}{dw} = \frac{d((w * x) - expectedY)}{dw}$ = ?

But not possible for more complex examples, so let's break it down, for science. Also allows _automating_ the process.

Function composition allows application of the chain rule.

\begin{align*}
& & \text{Chain rule} & \implies & \frac{dl}{dw} & = & \frac{dl}{dy} * \frac{dy}{dw} & &\\
l & = & y - expectedY & \implies & \frac{dl}{dy} & & & = & 1\\
y & = & w * x         & \implies & \frac{dy}{dw} & & & = & x\\
& & \text{Chain rule} & \implies & \frac{dl}{dw} & = & 1 * x & = & x
\end{align*}


In [24]:
var grad = x;
grad
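
One way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. The following is a minimal sketch, assuming the loss from Step 5 and the values from Steps 2 and 3; `lossAt`, `w0`, `h` and `numericGrad` are names introduced here for illustration only.

```csharp
using System;

double x = 1;          // input from Step 3
double expectedY = 2;  // expected output from Step 3

// The Step 5 loss, written as a function of the weight alone.
Func<double, double> lossAt = w => (w * x) - expectedY;

double w0 = 0.5;   // start value from Step 2
double h = 1e-6;   // small step size, chosen arbitrarily for this sketch

// Central difference: slope of the loss around w0.
double numericGrad = (lossAt(w0 + h) - lossAt(w0 - h)) / (2 * h);

Console.WriteLine(numericGrad); // should be very close to x, i.e. 1
```

If the estimate and the chain-rule result disagree by more than rounding noise, the derivation is probably wrong somewhere.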

## Step 7: Adjust parameters

In [7]:
w = w - 0.01 * grad

(`0.01` is a magic number called the "learning rate")

## Step 8: Rinse and repeat...

...until the loss is acceptably small. Profit.

## Step 9: However...

If we calculate the actual output again, we see that it has gotten further away from the expected output.

In [8]:
y = w * x;
y

And if we calculate the loss again, we can see that it has gotten bigger (in absolute terms).

In [9]:
loss = y - expectedY;
loss
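
To make this concrete, here are the numbers worked by hand, assuming the cells ran once each in the order shown (so $w = 0.5$, $x = 1$, $expectedY = 2$ and a learning rate of $0.01$):

\begin{align*}
y_{before} & = 0.5 * 1 = 0.5 & l_{before} & = 0.5 - 2 = -1.5\\
w_{new} & = 0.5 - 0.01 * 1 = 0.49 & & \\
y_{after} & = 0.49 * 1 = 0.49 & l_{after} & = 0.49 - 2 = -1.51
\end{align*}

So $l$ went down as a signed number while $y$ moved away from $2$.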

It turns out that, depending on our training data, which way round we do the subtraction matters: minimizing a signed difference rewards making it ever more negative rather than bringing it to zero. To avoid this problem, we can square the difference.

In [10]:
loss = Math.Pow(y - expectedY, 2);
loss

But that makes the maths more complicated:

```
dl/dw = d(((w * x) - expectedY)^2)/dw

l = z^2              => dl/dz = 2z
z = y - expectedY    => dz/dy = 1
Chain rule           => dl/dy = dl/dz * dz/dy = 2z
y = w * x            => dy/dw = x
Chain rule           => dl/dw = dl/dy * dy/dw = 2zx = 2 * (y - expectedY) * x
```

In [25]:
grad = 2 * x * (y - expectedY)

Nudge again

In [12]:
w = w - 0.01 * grad 
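
As a sanity check, here is the same hand calculation for the squared loss, again assuming the cells ran in the order shown (so $w = 0.49$ and $y = 0.49$ going in):

\begin{align*}
grad & = 2 * 1 * (0.49 - 2) = -3.02\\
w_{new} & = 0.49 - 0.01 * (-3.02) = 0.5202\\
y_{new} & = 0.5202 * 1 = 0.5202
\end{align*}

This time the nudge moves $y$ towards $2$.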

## Step 10: Putting it all together

In [13]:
// neural net
double w = 0.5;

// training data
double x = 1;
double expectedY = 2;

// working variables for the optimization loop
double y;
double loss;
double grad;


In [15]:

for (int i = 0; i < 100; i++)
{
    y = w * x;                              // forward pass
    loss = Math.Pow(y - expectedY, 2);

    Console.WriteLine(loss);

    grad = 2 * x * (y - expectedY);         // backward pass
    w = w - 0.01 * grad;
}

0.03957287986287364
0.0380057938203038
0.03650076438501975
0.035055334115372976
0.033667142884404186
0.03233392402618175
0.03105350063474499
0.02982378200960908
0.028642760242028557
0.02750850693644425
0.02641917006176105
0.025372970927315296
0.024368201278593637
0.0234032205079613
0.02247645297584601
0.021586385438002488
0.02073156457465762
0.01991059461750116
0.019122135070648125
0.018364898521850456
0.017637648540385166
0.01693919765818591
0.016268405430921742
0.015624176575857253
0.015005459183453283
0.014411242999788559
0.013840557776996946
0.013292471689027856
0.012766089810142366
0.01226055265366075
0.0117750347685758
0.011308743391740202
0.010860917153427276
0.010430824834151569
0.010017764170719161
0.009621060709558673
0.009240066705460138
0.008874160063923924
0.008522743325392549
0.00818524268970699
0.00786110707919461
0.00754980723885851
0.007250834872199707
0.006963701811260583
0.0066879392195346625
0.006423096826441072
0.006168742192114011
0.0059244600013063035
0.005689851

In [16]:
y

So now we can use our trained function to calculate outputs for inputs that it was not trained on. Behold:

In [17]:
var y = w * 3;
y
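
Instead of a fixed number of iterations, the loop can also stop once the loss is "acceptably small". The following is a rough sketch of that idea, not the only way to do it; `tolerance`, `maxSteps`, `learningRate` and `steps` are names and values assumed here for illustration.

```csharp
using System;

double w = 0.5;             // start value from Step 2
double x = 1;               // training input
double expectedY = 2;       // training target
double learningRate = 0.01; // same magic number as before
double tolerance = 1e-6;    // assumed threshold for "acceptably small"
int maxSteps = 10000;       // assumed safety cap so the loop always ends
int steps = 0;

for (; steps < maxSteps; steps++)
{
    double y = w * x;                            // forward pass
    double loss = Math.Pow(y - expectedY, 2);
    if (loss < tolerance) break;                 // good enough, stop training

    double grad = 2 * x * (y - expectedY);       // backward pass
    w = w - learningRate * grad;
}

Console.WriteLine($"stopped after {steps} steps, w = {w}");
Console.WriteLine(w * 3);   // prediction for an input we did not train on
```

With a single training pair of input `1` and expected output `2`, the weight settles near `2`, so the prediction for an input of `3` lands near `6`.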

## One more thing

Can extend this to more complicated functions, as long as they are differentiable; neural nets are such functions with a particular structure.

One neuron:

$y = f(\sum w_i x_i + b) \text{ for some } f$

In [22]:
double[] weights = [0.5, 0.4];
double[] xs      = [1  , 2  ];

var actualY = xs.Zip(weights, (x, w) => x * w)
                .Sum();

Console.WriteLine(actualY);

actualY = Math.Tanh(actualY);

Console.WriteLine(actualY);

1.3
0.8617231593133063
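
The same recipe carries over to this neuron. Below is a minimal sketch of one forward pass, one backward pass and one nudge, assuming $f = \tanh$ and a squared loss; the bias `b`, the target `expectedY` and the learning rate `lr` are made-up values for illustration.

```csharp
using System;
using System.Linq;

double[] weights = { 0.5, 0.4 };
double[] xs      = { 1, 2 };
double b = 0.1;          // hypothetical bias
double expectedY = 0.5;  // hypothetical target
double lr = 0.01;        // learning rate

// forward pass: s = sum(w_i * x_i) + b, y = tanh(s)
double s = xs.Zip(weights, (x, w) => x * w).Sum() + b;
double y = Math.Tanh(s);
double loss = Math.Pow(y - expectedY, 2);

// backward pass via the chain rule:
// dl/dy = 2 * (y - expectedY), dy/ds = 1 - tanh(s)^2, ds/dw_i = x_i, ds/db = 1
double dlds = 2 * (y - expectedY) * (1 - y * y);
double[] gradW = xs.Select(x => dlds * x).ToArray();
double gradB = dlds;

// nudge every parameter against its gradient
double[] newWeights = weights.Zip(gradW, (w, g) => w - lr * g).ToArray();
double newB = b - lr * gradB;

double newY = Math.Tanh(xs.Zip(newWeights, (x, w) => x * w).Sum() + newB);
double newLoss = Math.Pow(newY - expectedY, 2);

Console.WriteLine($"{loss} -> {newLoss}"); // the loss should shrink slightly
```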


![title](img/neuralnet.png)

### Dictionary corner

#### Backpropagation

`Propagating the gradient "backwards" through the neural net.`

#### Gradient descent

`Minimizing the loss by nudging the parameters in the right direction given their gradient.`

### A cool thing

Can automate backpropagation; this is called autograd (automatic differentiation). PyTorch and TensorFlow do this.
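
To give a feel for what such automation involves, here is a toy sketch of a gradient "tape": every operation records its value plus a step that pushes its gradient back to its inputs. This is only an illustration of the idea, not how PyTorch or TensorFlow are implemented; all names (`vals`, `grads`, `steps`, `Node`, `Leaf`, `Mul`, `Sub`, `Square`) are invented for this example.

```csharp
using System;
using System.Collections.Generic;

var vals  = new List<double>();  // value of each node
var grads = new List<double>();  // d(loss)/d(node), filled in by the backward pass
var steps = new List<Action>();  // how each node passes its gradient to its inputs

int Node(double v, Action back)
{
    vals.Add(v); grads.Add(0.0); steps.Add(back);
    return vals.Count - 1;       // a node is identified by its index on the tape
}

int Leaf(double v) => Node(v, () => { });

int Mul(int a, int b)
{
    int o = vals.Count;          // index the new node is about to get
    return Node(vals[a] * vals[b], () =>
    {
        grads[a] += vals[b] * grads[o];
        grads[b] += vals[a] * grads[o];
    });
}

int Sub(int a, int b)
{
    int o = vals.Count;
    return Node(vals[a] - vals[b], () => { grads[a] += grads[o]; grads[b] -= grads[o]; });
}

int Square(int a)
{
    int o = vals.Count;
    return Node(vals[a] * vals[a], () => grads[a] += 2 * vals[a] * grads[o]);
}

// forward pass: loss = ((w * x) - expectedY)^2
int wNode = Leaf(0.5);
int xNode = Leaf(1.0);
int expectedNode = Leaf(2.0);
int lossNode = Square(Sub(Mul(wNode, xNode), expectedNode));

// backward pass: seed d(loss)/d(loss) = 1, then replay the tape in reverse
grads[lossNode] = 1.0;
for (int i = lossNode; i >= 0; i--) steps[i]();

Console.WriteLine(grads[wNode]); // -3, matching 2 * x * (y - expectedY)
```

The hand-written `grad` line from the training loop is exactly what the reverse replay computes, which is why the derivation never has to be done by hand once the operations know their own local derivatives.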

* Limitations of neural nets: no loops, branching, recursion
* Things to understand: attention, transformers.

## Why train neural nets?

Because, somehow, when the neural network has the right structure, when the number of parameters is big enough, and when we train on the right data, magic happens.
