**Lecture 3 - Backpropagation, Neural Networks**

- non-linearities (activation functions)
    - logistic/sigmoid
    - tanh
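    - hard tanh
    - ReLU (rectified linear unit): ReLU(z) = max(z,0)
        - for a given input many units are "dead" (output 0), but some are active
        - trains quickly and performs well due to good gradient backflow
    - leaky ReLU / parametric ReLU
        - for z<0 the output is not 0 but follows a small positive slope, so it is slightly negative (parametric ReLU learns the slope)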
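
The activation functions listed above can be written out in a few lines. This is a minimal sketch in plain Python on scalars; the function names, the hard tanh clipping range of [-1, 1], and the leaky slope of 0.01 are illustrative assumptions, not values taken from the lecture.

```python
import math

def sigmoid(z):
    # logistic function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def hard_tanh(z):
    # piecewise-linear stand-in for tanh: clip z to [-1, 1]
    return max(-1.0, min(1.0, z))

def relu(z):
    # ReLU(z) = max(z, 0)
    return max(z, 0.0)

def leaky_relu(z, slope=0.01):
    # like ReLU, but z < 0 keeps a small slope instead of being zeroed;
    # slope=0.01 is an assumed default (parametric ReLU would learn it)
    return z if z > 0 else slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z), hard_tanh(z), relu(z), leaky_relu(z))
```

For negative inputs `relu` returns exactly 0 (the "dead" region), while `leaky_relu` returns a small negative value, so some gradient still flows.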

- why are non-linearities needed?
    - neural networks do function approximation (regression/classification)
    - without non-linearities, a deep NN can't do anything more than a linear transform
    - extra layers collapse into a single linear transform: W_1 * W_2 * x = Wx
    - but with more layers that include non-linearities, networks can approximate much more complex functions
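
The collapse of stacked linear layers can be checked numerically. The sketch below uses plain Python lists; the helper names (`matvec`, `matmul`, `relu`) and the values of `W1`, `W2`, and `x` are made up for illustration.

```python
def matvec(W, x):
    # matrix-vector product Wx
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # matrix-matrix product AB
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    # elementwise ReLU
    return [max(t, 0.0) for t in v]

W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[2.0, 0.0], [-1.0, 1.0]]
x = [1.0, -3.0]

two_layers = matvec(W1, matvec(W2, x))       # W_1 * (W_2 * x)
one_layer = matvec(matmul(W1, W2), x)        # (W_1 * W_2) * x = Wx
with_relu = matvec(W1, relu(matvec(W2, x)))  # non-linearity in between

print(two_layers, one_layer, with_relu)
```

The first two results agree, so the second linear layer adds no expressive power. Once ReLU sits between the layers, the result differs and can no longer be written as a single matrix applied to `x`.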

**Backpropagation**

- how do we calculate the gradient (slope) of the loss function?

    - 1. by hand (matrix calculus)
        - fully vectorized gradients
        - gradient = vector of partial derivatives with respect to each input
        - given a function with m outputs and n inputs
            - its Jacobian is an m x n matrix of partial derivatives
        - for a composition of one-variable functions: multiply derivatives (dz/dx = dz/dy * dy/dx)
            - for multi-variable functions: multiply Jacobians
                - h = f(z)
                - z = Wx + b
                - dh/dx = dh/dz * dz/dx
        - apply the chain rule
            - s = u^transpose * h
            - h = f(z)
            - z = Wx + b
            - x (input)
            - ds/db = ds/dh * dh/dz * dz/db
            - ds/dW = ds/dh * dh/dz * dz/dW
            - note: the first 2 factors are the same in both expressions, so let's avoid duplicated computation and set their product = _delta_
                - _delta_ = ds/dh * dh/dz = ds/dz is the upstream gradient ("error signal")
        - what's the output shape of ds/dW?
            - 1 output, nm inputs = 1 by nm Jacobian
            - computationally expensive
            - leave pure math and use the **shape convention:** the shape of the gradient is the shape of the parameters!
                - so n x m matrix
        - derivative with respect to matrix
            - ds/dW = _delta_ * dz/dW
            - ds/dW = _delta_^transpose * x^transpose
                - _delta_ = upstream gradient/error signal at z
                - x is local input signal
            - why the transpose?
                - each input goes to each output - you want to get outer product
                - let's consider the derivative of a single weight Wij
                - Wij only contributes to z_i (e.g. W_23 is only used to compute z_2, not z_1)
        - what shape should derivatives be?
            - disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy)
                - e.g. the Jacobian form gives ds/db as a row vector, but the shape convention says the gradient should be a column vector because b is a column vector
            - expect hw answers to follow shape convention
            - but Jacobian form is useful for computing the answers
            - 2 options:
                - 1. use Jacobian as much as possible and reshape to follow shape convention at the end
                - 2. always follow the shape convention
                    - look at dimensions to figure out when to transpose and/or reorder terms
                    - the error signal _delta_ that arrives at a hidden layer has the same dimensionality as that hidden layer

    - 2. **backpropagation** algorithm
        - one more concept: we re-use derivatives computed for higher layers in computing derivatives for lower layers to minimize computation
        - forward propagation
            - x --> W
            - then Wx --> b 
            - then z = Wx + b --> f
            - then h = f(z) --> u
            - then s --> output
        - backward propagation
            - go backwards along edges, pass along gradients
            - ds/ds --> ds/dh --> ds/dz --> ds/db
            - single node:
                - node receives an upstream gradient (node f receives ds/dh coming from h)
                - goal is to pass on the correct downstream gradient (node f passes ds/dz to z)
                    - ds/dz = ds/dh * dh/dz (chain rule)
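            - node intuitions (how each forward operation treats the gradient in the backward pass):
                - + distributes the upstream gradient to each input unchanged
                - max() routes the upstream gradient to the larger input only
                - * switches the forward values: each input's downstream gradient is the upstream gradient times the other input
            - efficiently compute
                - compute all gradients in one backward pass, reusing shared upstream gradients, instead of computing each one separately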
        - **automatic differentiation**
            - the gradient computation can be inferred from the symbolic expression of the forward pass (fprop)
            - each node type needs to know how to compute its output and how to compute the gradient wrt its inputs given the gradient wrt its output
            - modern DL frameworks (TensorFlow, PyTorch) do backpropagation for you but mainly leave the layer/node writer to hand-calculate the local derivative
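
The chain ds/ds --> ds/dh --> ds/dz --> ds/db for the network s = u^transpose * h, h = f(z), z = Wx + b can be sketched in plain Python and checked against finite differences. Everything concrete here is an assumption for illustration: f is taken to be tanh, the sizes are n = 2 and m = 3, and the numbers in `W`, `b`, `u`, `x` are arbitrary. The gradients follow the shape convention, so `grad_W` has the same n x m shape as `W`.

```python
import math

def forward(W, b, u, x):
    # z = Wx + b, h = f(z), s = u^T h; intermediate values are kept for reuse
    z = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    h = [math.tanh(zi) for zi in z]
    s = sum(ui * hi for ui, hi in zip(u, h))
    return z, h, s

def backward(u, x, h):
    # ds/dh = u; dh/dz is diagonal with entries f'(z_i) = 1 - tanh(z_i)^2
    # delta = ds/dz is computed once and shared by both parameter gradients
    delta = [ui * (1.0 - hi * hi) for ui, hi in zip(u, h)]
    grad_b = list(delta)                              # dz/db is the identity
    grad_W = [[di * xj for xj in x] for di in delta]  # outer product, n x m
    return grad_b, grad_W

W = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6]]
b = [0.05, -0.05]
u = [1.0, -2.0]
x = [1.0, 2.0, 3.0]

z, h, s = forward(W, b, u, x)
grad_b, grad_W = backward(u, x, h)

def numeric_grad_W(i, j, eps=1e-6):
    # central finite difference of s with respect to the single weight W[i][j]
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (forward(W_plus, b, u, x)[2] - forward(W_minus, b, u, x)[2]) / (2 * eps)

print(grad_W[1][2], numeric_grad_W(1, 2))
```

The analytic and numeric values should agree to several decimal places. Comparing against finite differences like this is a common way to catch a misplaced transpose in a hand-derived gradient.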


    

In [None]:
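# backprop implementation

class ComputationalGraph(object):
    # schematic sketch: gates are assumed to hold their own inputs and
    # to return their output from forward()

    def __init__(self, graph):
        self.graph = graph

    def forward(self, inputs):
        # [pass inputs to input gates...]
        # run the gates in topological order; the last one outputs the loss
        loss = None
        for gate in self.graph.nodes_topologically_sorted():
            loss = gate.forward()
        return loss

    def backward(self):
        # reverse topological order: from the loss back to the inputs
        inputs_gradients = None
        for gate in reversed(self.graph.nodes_topologically_sorted()):
            inputs_gradients = gate.backward()
        return inputs_gradients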
    
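# forward/backward API

class MultiplyGate(object):

    def forward(self, x, y):
        z = x * y
        # cache the inputs; backward() needs them for the local gradients
        self.x = x
        self.y = y
        return z

    def backward(self, dz):
        # dz is the upstream gradient dL/dz
        dx = self.y * dz  # [dz/dx * dL/dz]
        dy = self.x * dz  # [dz/dy * dL/dz]
        return [dx, dy]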
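
To see the forward/backward API and the node intuitions (+ distributes, max routes, * switches) working together, here is a small sketch on the function f(x, y, z) = (x + y) * max(y, z). `AddGate` and `MaxGate` are hypothetical gates written in the same style as `MultiplyGate`, which is restated so the snippet runs on its own; the input values are arbitrary.

```python
class AddGate(object):
    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        # "+" distributes: both inputs get the upstream gradient unchanged
        return [dz, dz]

class MaxGate(object):
    def forward(self, x, y):
        self.x = x
        self.y = y
        return max(x, y)

    def backward(self, dz):
        # max "routes": only the larger input gets the upstream gradient
        # (ties go to the first input here, an arbitrary choice)
        return [dz, 0.0] if self.x >= self.y else [0.0, dz]

class MultiplyGate(object):
    def forward(self, x, y):
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # "*" switches: each input's gradient is scaled by the other input
        return [self.y * dz, self.x * dz]

x, y, z = 1.0, 2.0, 0.0
add, mx, mul = AddGate(), MaxGate(), MultiplyGate()

# forward pass, in topological order
a = add.forward(x, y)
b = mx.forward(y, z)
f = mul.forward(a, b)

# backward pass, in reverse order, starting from df/df = 1
da, db = mul.backward(1.0)
dx, dy_add = add.backward(da)
dy_max, dz = mx.backward(db)
dy = dy_add + dy_max  # y feeds two gates, so its gradients are summed

print(f, dx, dy, dz)  # 6.0 2.0 5.0 0.0
```

Note that `y` is used by two gates, so its total gradient is the sum of the gradients arriving along both paths.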

**Summary**

- backpropagation = recursively apply the chain rule along the computation graph
    - downstream gradient = upstream gradient x local gradient
- forward pass = compute results of operations and save intermediate values
- backward pass = apply chain rule to compute gradients