# Backpropagation example with numpy on MNIST
E. Krupczak - 19 Aug 2018

In [1]:
import numpy as np

Last node goes to loss, so d(loss)/d(final node) = 1
$$\frac{d(loss)}{d(node)} = \text{node weight}$$
i.e. for an affine layer:
$$Loss = x_1+x_2+x_3+x_4$$
$$\frac{dLoss}{dx_1} = 1$$
and for a non-linear layer:
$$Loss = \tanh(x_1)$$
$$\frac{dLoss}{dx_1}= \frac{d\tanh(x_1)}{dx_1}$$

Use chain rule to combine layers, eg. $\alpha$, $\beta$:
$$\frac{dL}{d\alpha} = \frac{d\beta}{d\alpha}\cdot\frac{dL}{d\beta}$$

Class for every kind of layer:
- Affine layer
    - Matrix multiplication layer
    - Bias layer
- Nonlinear layer (eg. tanh)
- Loss/squared-error layer
- Softmax cross entropy layer

In [78]:
import numpy as np

class Network:
    def __init__(self, layers):
        self.layers = layers
    
    def forward_pass(self, i):
        output = i
        for layer in self.layers:
            output = layer.forward_pass(output)
        return output
    
    def backward_pass(self, gradient, learning_rate):
        output = gradient
        for layer in self.layers[::-1]:
            output = layer.backward_pass(output, learning_rate)
        return output

class LinearLayer:
    '''
    Takes a vector of length i, returns a vector of length j.
    Network weights stored as "weights", a j by i matrix.
    '''
    def __init__(self, weights):
        self.weights = weights
    
    def forward_pass(self, i):
        self.last_input = i        
        return self.weights @ i
    
    def backward_pass(self, gradient, learning_rate):
        self.weights -= learning_rate * gradient @ self.last_input.T
        return self.weights.T @ gradient
    
class BiasLayer:
    '''
    Takes a vector of length i, returns a vector of length i.
    Biases stored in a vector of length i.
    '''
    def __init__(self, bias):
        self.bias = bias
        
    def forward_pass(self, i):
        self.last_input = i
        return i + self.bias
    
    def backward_pass(self, gradient, learning_rate):
        self.bias -= learning_rate * gradient
        return gradient
    
class TanhLayer:
    '''
    Takes a vector of length i, returns a vector of length i.
    '''
    def forward_pass(self, i):
        self.last_input = i
        return np.tanh(i)
    
    def backward_pass(self, gradient, learning_rate):
        '''
        gradient: gradient of loss function at output
        returns loss function at input
        '''
        return gradient / np.cosh(self.last_input)**2
    
class SoftmaxLayer:
    '''
    Softmax: exponentiate and normalize.
    '''
    def forward_pass(self, i):
        self.last_input = i
        exp = np.exp(i)
        return exp / np.sum(exp)
    
    def backward_pass(self, gradient, learning_rate):
        deriv = 
        

For the tanh layer:
$y =\tanh(x)$ on the forwards pass. 
On the backwards pass, then, to take $\frac{dL}{dy}$ to $\frac{dL}{dx}$, we use the chain rule:
$$\frac{dL}{dy}\cdot\frac{dy}{dx}\bigg|_{x_0} = \frac{dL}{dx}\bigg|_{x_0}$$

For the softmax layer:
$$y_j = \frac{e^x_j}{\sum_{k}e^x_k}$$
On the backwards pass:
$$\frac{dy_i}{dx_j} = x_i(\delta_{ij}-x_j)$$

In [74]:
inputvec = np.random.randn(5,1)
weights = np.random.randn(5,5)
biases = np.ones([5,1])

In [75]:
linlayer = LinearLayer(weights)
biaslayer = BiasLayer(biases)
tanhlayer = TanhLayer()
network = Network([linlayer, biaslayer, tanhlayer])

In [76]:
network.forward_pass(inputvec)

array([[-0.95741829],
       [ 0.99770961],
       [ 0.99842318],
       [ 0.93810152],
       [ 0.77690749]])

In [77]:
network.backward_pass(inputvec, learning_rate = 0.1)

array([[ 0.03072998],
       [ 0.08828204],
       [-0.08143696],
       [-0.06928156],
       [-0.14483833]])