# 3-Layer Gradient Descent 

The basic flow:

* We started with a 2-layer network with one input and one output.
* We can generalize it (still as a 2-layer network) to (1) multiple inputs and one output, and (2) multiple inputs and multiple outputs.
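
As a rough sketch of that generalization, the snippet below computes a prediction for each of the three 2-layer variants. All numbers and names (`inputs`, `weight_matrix`, `truth`) are made up for illustration and are not taken from the book.

```python
import numpy as np

# Hypothetical values, chosen only to show the shapes involved.

# (0) one input, one output: a single multiplication
pred_1_1 = 2.0 * 0.5

# (1) multiple inputs, one output: one weighted sum
inputs = np.array([8.5, 0.65, 1.2])
weights = np.array([0.1, 0.2, 0.0])
pred_n_1 = inputs.dot(weights)

# (2) multiple inputs, multiple outputs: one weighted sum per output.
# weight_matrix has shape (number of inputs, number of outputs).
weight_matrix = np.array([[0.1, 0.2],
                          [0.1, 0.0],
                          [0.0, 0.3]])
pred_n_m = inputs.dot(weight_matrix)

# The learning rule keeps the same form: every weight gets
# (its input) x (the delta of the output it feeds).
truth = np.array([1.0, 0.5])
node_delta = pred_n_m - truth
weight_delta = np.outer(inputs, node_delta)

print(pred_n_m.shape, weight_delta.shape)
```

The only thing that changes between the variants is the shape of the weights; `weight_delta` always has the same shape as the weights it updates.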

The basic learning steps are the same in every case; see below.

## Basic steps of learning


1. Start with the weight(s) initialized to some value.
2. Take an input (or inputs) and the corresponding truth value.
3. Calculate `pred = input(s) x weights`.
4. Calculate `error = (pred - truth)**2`.
5. Calculate `node_delta = pred - truth`.
6. Calculate `weight_delta = node_delta x input(s)`.
7. Learn by adjusting the weights: `weight = weight - alpha x weight_delta`, where `alpha` is the learning rate.
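
The steps can be sketched for the simplest case, one input and one weight. The starting values below are arbitrary assumptions. The sketch scales the delta by the input (as in the book's `weight_delta = input * delta`) and by a learning rate `alpha`, matching the full example later in these notes.

```python
# Hypothetical starting values, chosen only for illustration
weight = 0.5    # step 1: initial weight
inp = 2.0       # step 2: one input ...
truth = 0.8     # ... and its truth value
alpha = 0.1     # learning rate

errors = []
for step in range(20):
    pred = inp * weight                      # step 3
    error = (pred - truth) ** 2              # step 4
    node_delta = pred - truth                # step 5
    weight_delta = node_delta * inp          # step 6
    weight = weight - alpha * weight_delta   # step 7
    errors.append(error)

print(f"final weight: {weight:.4f}, final error: {errors[-1]:.2e}")
```

With these values the weight should settle near `truth / inp`, since that is the weight for which `pred` equals `truth`.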


Grokking uses some confusing variable names. In particular, `neural_network()` and `*_ele_mul()` both take `input` as their first parameter; not only do the two differ from each other, neither is the same as the actual `input`.

## Full, Stochastic and Batch Gradient Descent

* If you update the weights (learn) after each individual example, you have **Stochastic Gradient Descent**.

* If you update the weights once, after all examples have been processed, you have **Full Gradient Descent**.

* Somewhere in between, updating after every batch of n examples, you have **Batch Gradient Descent** (often called mini-batch gradient descent elsewhere).
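
One way to see that the three variants differ only in how many examples go into each update is a single training function with a `batch_size` parameter. This is a minimal sketch on a made-up, linearly solvable toy dataset with a single weight layer. The function name `train` and all values are assumptions, not the book's code.

```python
import numpy as np

# Hypothetical toy data: the output simply copies the middle input
inputs = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 1]], dtype=float)
truth = np.array([[0, 1, 0, 1]], dtype=float).T

def train(batch_size, alpha=0.3, epochs=1000):
    """Return the final summed squared error for a given batch size."""
    weights = np.zeros((3, 1))
    for _ in range(epochs):
        for start in range(0, len(inputs), batch_size):
            x = inputs[start:start + batch_size]
            y = truth[start:start + batch_size]
            delta = x.dot(weights) - y
            # one update per batch, averaged over the examples in it
            weights -= alpha * x.T.dot(delta) / len(x)
    return float(np.sum((inputs.dot(weights) - truth) ** 2))

stochastic_error = train(batch_size=1)        # update after every example
batch_error = train(batch_size=2)             # update after every 2 examples
full_error = train(batch_size=len(inputs))    # update once per pass

print(stochastic_error, batch_error, full_error)
```

Averaging the gradient over the batch is a design choice made here so that the same `alpha` is reasonable for every batch size; summing instead would require a smaller `alpha` for larger batches.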


## Putting it together


The Grokking book is not consistent here:
* p.128 uses a hidden size of 3, while the code example uses `hidden_size = 4`.
* p.126 updates the weights by addition. This works because `layer_2_delta = walk_vs_stop[i:i+1] - layer_2` also switches the order of the subtraction, which flips the sign of the delta. It is easy to miss unless you read carefully.
* `[i:i+1]` is important here: it returns a 2-D array of shape `(1, n)` rather than a 1-D array, so the matrix products have the right shapes.
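
The effect of `[i:i+1]` is easy to check in isolation. The snippet below compares plain indexing with slicing on the same data. `delta` is a hypothetical stand-in for `layer_1_delta`, used only to show the shape of the resulting weight update.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])

row_1d = streetlights[0]      # plain index: 1-D array
row_2d = streetlights[0:1]    # slice: 2-D array with a single row

print(row_1d.shape, row_2d.shape)

# .T does nothing to a 1-D array, so only the 2-D slice can be
# transposed into a column for the weight update.
delta = np.ones((1, 4))               # hypothetical layer_1_delta
weight_delta = row_2d.T.dot(delta)    # (3, 1) x (1, 4)

print(row_1d.T.shape, weight_delta.shape)
```

With the 1-D row, `row_1d.T.dot(delta)` would fail because the shapes `(3,)` and `(1, 4)` do not line up.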


```python
import numpy as np
np.random.seed(1)

def relu(x):
    ''' Set all negative numbers to 0 '''
    return (x > 0) * x

def relu2deriv(x):
    ''' Return True (1) for x > 0; return False (0) otherwise '''
    return x > 0


alpha = 0.2
hidden_size = 4
streetlights = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1]
])

walk_vs_stop = np.array([[1, 1, 0, 0]]).T

# randomly initialize the weight matrices with values in [-1, 1)
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1


for it in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        # go through each input
        # do forward propagation, which is a weighted sum at each layer
        layer_0 = streetlights[i:i+1]

        # REFER TO Step #3
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        # REFER TO Step #4
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1])**2)

        # REFER TO Step #5
        layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])

        # NEW, not covered in previous steps
        # this line computes the delta at layer_1 given the delta at layer_2
        # by taking the layer_2_delta and multiplying it by its connecting
        # weights (weights_1_2); relu2deriv zeroes the delta of hidden
        # nodes that were switched off by relu
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        # REFER TO Step #6, in matrix form: each weight_delta is the
        # layer's input (transposed) times the delta of the layer it feeds
        weight_delta_1_2 = layer_1.T.dot(layer_2_delta)
        weight_delta_0_1 = layer_0.T.dot(layer_1_delta)

        # REFER TO Step #7: update weights
        weights_1_2 -= alpha * weight_delta_1_2
        weights_0_1 -= alpha * weight_delta_0_1

    # report the accumulated error every 10th iteration
    if (it % 10 == 9):
        print(f"Error: {layer_2_error}")
```


```
Error: 0.6342311598444467
Error: 0.35838407676317513
Error: 0.0830183113303298
Error: 0.006467054957103705
Error: 0.0003292669000750734
Error: 1.5055622665134859e-05
```
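
The `layer_1_delta` and `weight_delta_*` formulas in the training loop can be sanity-checked numerically. The sketch below is an assumption-laden check, not part of the book: it compares them against finite-difference gradients of *half* the squared error for one example (the factor of 2 from differentiating the square is effectively absorbed into `alpha`). The helpers `half_error` and `numeric_grad` are hypothetical names introduced here, and the weights are fresh random values rather than the trained ones.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return (x > 0) * x

# One training example and fresh random weights (assumed values)
layer_0 = np.array([[1.0, 0.0, 1.0]])
truth = np.array([[1.0]])
weights_0_1 = rng.uniform(-1, 1, size=(3, 4))
weights_1_2 = rng.uniform(-1, 1, size=(4, 1))

def half_error(w01, w12):
    """0.5 * squared error of the 3-layer network for this example."""
    layer_1 = relu(layer_0.dot(w01))
    layer_2 = layer_1.dot(w12)
    return 0.5 * np.sum((layer_2 - truth) ** 2)

# Analytic gradients, written exactly as in the training loop
layer_1 = relu(layer_0.dot(weights_0_1))
layer_2 = layer_1.dot(weights_1_2)
layer_2_delta = layer_2 - truth
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * (layer_1 > 0)
weight_delta_1_2 = layer_1.T.dot(layer_2_delta)
weight_delta_0_1 = layer_0.T.dot(layer_1_delta)

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference gradient of f with respect to w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

num_0_1 = numeric_grad(lambda w: half_error(w, weights_1_2), weights_0_1)
num_1_2 = numeric_grad(lambda w: half_error(weights_0_1, w), weights_1_2)

print(np.allclose(weight_delta_0_1, num_0_1, atol=1e-6))
print(np.allclose(weight_delta_1_2, num_1_2, atol=1e-6))
```

If both comparisons hold, the matrix expressions are the same "input times delta" rule as step 6, applied to every weight of a layer at once.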


## Visualize a Neural Network

Chapter 7 of Grokking is the best chapter I have read so far: the description is consistent, and there are no code snippets to confuse people.

The code, the math, and the mental picture come together as far as forward propagation goes. For a 3-layer network:

$L_2 = \text{relu}(L_0 W_0) W_1$

It is also important to see a vector-matrix multiplication as several weighted sums, one per matrix column.
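
A small sketch can make both points concrete. It evaluates the forward-propagation formula with NumPy and then rebuilds the hidden layer's weighted sums one matrix column at a time. The weights are arbitrary random values, and the names `L0`, `W0`, `W1` simply mirror the symbols in the formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

L0 = np.array([[1.0, 0.0, 1.0]])       # one input row, shape (1, 3)
W0 = rng.uniform(-1, 1, size=(3, 4))   # input -> hidden weights
W1 = rng.uniform(-1, 1, size=(4, 1))   # hidden -> output weights

# Forward propagation: L2 = relu(L0 W0) W1
hidden_pre = L0.dot(W0)                # shape (1, 4)
L2 = relu(hidden_pre).dot(W1)          # shape (1, 1)

# The same hidden values, computed as one weighted sum per column of W0
by_column = np.array([[np.sum(L0[0] * W0[:, j])
                       for j in range(W0.shape[1])]])

print(np.allclose(hidden_pre, by_column), L2.shape)
```

Each hidden node corresponds to one column of `W0`, so a matrix with 4 columns produces 4 weighted sums of the same input.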


