<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Backpropagation Practice

## *Data Science Unit 4 Sprint 2 Assignment 2*

Implement a Multilayer Perceptron with 3 inputs, one hidden layer of 4 nodes, and 1 output node on the following dataset:

| x1 | x2 | x3 | y |
|----|----|----|---|
| 0  | 0  | 1  | 0 |
| 0  | 1  | 1  | 1 |
| 1  | 0  | 1  | 1 |
| 0  | 1  | 0  | 1 |
| 1  | 0  | 0  | 1 |
| 1  | 1  | 1  | 0 |
| 0  | 0  | 0  | 0 |

If you look at the data, you'll notice that the target is the XOR of the first two columns, while the last column is mostly just noise. Recall that XOR is the function the single-layer perceptron was famously criticized for being unable to learn.
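
As a quick sanity check of that observation, here is a minimal sketch that recomputes XOR from the first two columns and compares it with the target. The names `rows`, `labels`, and `x3_agreement` are illustrative and not part of the assignment.

```python
import numpy as np

# Same rows as the table above: columns are x1, x2, x3
rows = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0],
                 [1, 1, 1],
                 [0, 0, 0]])
labels = np.array([0, 1, 1, 1, 1, 0, 0])

# XOR of the first two columns, converted from booleans to 0/1
xor_x1_x2 = np.logical_xor(rows[:, 0], rows[:, 1]).astype(int)
matches_xor = np.array_equal(xor_x1_x2, labels)

# Fraction of rows where x3 alone happens to equal the target
x3_agreement = np.mean(rows[:, 2] == labels)

print(matches_xor, x3_agreement)
```

The third column agrees with the target on fewer than half of the rows, which is why it can be treated as noise.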

In [1]:
import numpy as np
import pandas as pd
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
import random

In [2]:
# input
X = np.array(([0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 0, 0]), dtype=float)

# output
y = np.array(([0],
              [1],
              [1],
              [1],
              [1],
              [0],
              [0]), dtype=float)

In [3]:
# defines the sigmoid activation and its derivative
def sigmoid(s):
    return 1 / (1+np.exp(-s))
def sigmoidPrime(s):
    # s must already be a sigmoid output: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    return s * (1-s)

# sets initial weights
inputWeights = np.random.randn(3, 4)
hiddenWeights = np.random.randn(4, 1)
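
One detail worth spelling out: `sigmoidPrime` is written in terms of the sigmoid's output, so it should be handed activations rather than raw weighted sums. A small sketch, assuming a central finite-difference comparison as the check; the helper name `sigmoid_prime_from_activation` and the test points are illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def sigmoid_prime_from_activation(a):
    # expects a = sigmoid(s), not the raw sum s
    return a * (1 - a)

s = np.linspace(-3, 3, 7)   # a few raw weighted sums
a = sigmoid(s)              # their activations

# numerical derivative of sigmoid at each s
eps = 1e-6
numeric = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)

# analytic derivative, computed from the activations
analytic = sigmoid_prime_from_activation(a)
max_gap = np.max(np.abs(numeric - analytic))

# passing the raw sums instead gives a different, incorrect result
wrong = sigmoid_prime_from_activation(s)

print(max_gap, np.allclose(wrong, analytic))
```

This is why the later cells call `sigmoidPrime(outputActv)` and `sigmoidPrime(hiddenActv)` rather than passing `outputSum` or `hiddenSum`.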

In [4]:
# shows X to compare with the dot product in a later cell
X

array([[0., 0., 1.],
       [0., 1., 1.],
       [1., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 1., 1.],
       [0., 0., 0.]])

In [5]:
# initial weights for the inputs
# 3x4 matrix for the 3 inputs to have an individual weight for each of the 4 hidden nodes
inputWeights

array([[-0.74263146, -0.24268876,  0.54229065,  0.5439284 ],
       [-0.15148208, -0.01477208, -1.01325626, -0.27265018],
       [ 0.04328582,  1.24773111,  0.45641554, -0.19573693]])

In [6]:
# a matrix of the weighted sums
# each row represents one observation from the input
# each column represents one of the 4 hidden nodes
hiddenSum = np.dot(X, inputWeights)
hiddenSum

array([[ 0.04328582,  1.24773111,  0.45641554, -0.19573693],
       [-0.10819626,  1.23295902, -0.55684071, -0.46838711],
       [-0.69934564,  1.00504234,  0.99870619,  0.34819147],
       [-0.15148208, -0.01477208, -1.01325626, -0.27265018],
       [-0.74263146, -0.24268876,  0.54229065,  0.5439284 ],
       [-0.85082772,  0.99027026, -0.01455006,  0.07554129],
       [ 0.        ,  0.        ,  0.        ,  0.        ]])

In [7]:
# applying the sigmoid activation function to each of the sums in the hidden layer
hiddenActv = sigmoid(hiddenSum)
hiddenActv

array([[0.51081977, 0.77690686, 0.6121635 , 0.45122141],
       [0.47297729, 0.77433605, 0.36427877, 0.38499806],
       [0.33195732, 0.73204881, 0.73080412, 0.58617895],
       [0.46220173, 0.49630705, 0.26634308, 0.4322566 ],
       [0.32242898, 0.43962385, 0.63234512, 0.63272579],
       [0.29925925, 0.7291413 , 0.49636255, 0.51887635],
       [0.5       , 0.5       , 0.5       , 0.5       ]])

In [8]:
# initial hidden weights
# 4x1 matrix: one weight from each of the 4 hidden nodes to the single output node
hiddenWeights

array([[-0.38140672],
       [ 0.69070355],
       [ 1.77215819],
       [-0.24780804]])

In [9]:
# vector of weighted sums
outputSum = np.dot(hiddenActv, hiddenWeights)
outputSum

array([[1.3148165 ],
       [0.90459394],
       [1.52885862],
       [0.5313996 ],
       [1.14449422],
       [1.14053222],
       [0.91682349]])

In [11]:
# getting activation values by applying sigmoid to the output weighted sums
outputActv = sigmoid(outputSum)
outputActv

array([[0.78831801],
       [0.71189264],
       [0.82183925],
       [0.62980949],
       [0.75850383],
       [0.75777734],
       [0.71439443]])

In [36]:
# showing y for future cell calculation
y

array([[0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.]])

In [12]:
# taking the difference between the actual result and the activated output
# this is the per-row error, not the cost itself; a cost reduces it to one number (e.g. mean squared error)
outputError = y - outputActv 
outputError

array([[-0.78831801],
       [ 0.28810736],
       [ 0.17816075],
       [ 0.37019051],
       [ 0.24149617],
       [-0.75777734],
       [-0.71439443]])
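
The error vector above is not yet a single cost value. Here is a sketch that collapses it into a mean squared error, the same quantity the `train` method later in this notebook reports as its loss. The numbers are copied from the cell outputs above; the snake_case names are illustrative.

```python
import numpy as np

# targets and activated outputs copied from the cells above
y = np.array([[0.], [1.], [1.], [1.], [1.], [0.], [0.]])
output_actv = np.array([[0.78831801],
                        [0.71189264],
                        [0.82183925],
                        [0.62980949],
                        [0.75850383],
                        [0.75777734],
                        [0.71439443]])

output_error = y - output_actv            # one error per row
mse = np.mean(np.square(output_error))    # one number for the whole batch

print(mse)  # roughly 0.288 for these untrained weights
```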

In [13]:
# shows the derivative of the sigmoid at each activated output
sigmoidPrime(outputActv) # one chain-rule factor of the gradient, not the full gradient yet

array([[0.16687272],
       [0.20510151],
       [0.14641949],
       [0.2331495 ],
       [0.18317577],
       [0.18355084],
       [0.20403503]])

In [14]:
# gets the derivative of the activated output multiplied by the error
outputDelta = outputError * sigmoidPrime(outputActv) # a learning rate can scale this delta, as the class below does
outputDelta

array([[-0.13154877],
       [ 0.05909125],
       [ 0.02608621],
       [ 0.08630973],
       [ 0.04423625],
       [-0.13909067],
       [-0.14576149]])

In [15]:
# shows hidden weights transposed for next cell
hiddenWeights.T

array([[-0.38140672,  0.69070355,  1.77215819, -0.24780804]])

In [16]:
# takes the dot product of the output delta and the transposed hidden weights
# this gives each hidden node's share of the output error, one row per observation
HiddenError = outputDelta.dot(hiddenWeights.T)
HiddenError

array([[ 0.05017359, -0.0908612 , -0.23312524,  0.03259884],
       [-0.0225378 ,  0.04081454,  0.10471905, -0.01464329],
       [-0.00994945,  0.01801784,  0.04622888, -0.00646437],
       [-0.03291911,  0.05961444,  0.1529545 , -0.02138825],
       [-0.016872  ,  0.03055413,  0.07839363, -0.0109621 ],
       [ 0.05305012, -0.09607042, -0.24649067,  0.03446779],
       [ 0.05559441, -0.10067798, -0.25831242,  0.03612087]])

In [17]:
# shows the derivative of the activated hidden sums
sigmoidPrime(hiddenActv)

array([[0.24988293, 0.17332259, 0.23741935, 0.24762065],
       [0.24926977, 0.17473973, 0.23157975, 0.23677455],
       [0.22176166, 0.19615335, 0.19672946, 0.24257319],
       [0.24857129, 0.24998636, 0.19540444, 0.24541083],
       [0.21846853, 0.24635472, 0.23248477, 0.23238386],
       [0.20970315, 0.19749426, 0.24998677, 0.24964368],
       [0.25      , 0.25      , 0.25      , 0.25      ]])

In [18]:
# gets the hidden-layer delta by multiplying the hidden error by the derivative of
# the hidden activations
HiddenDelta = HiddenError * sigmoidPrime(hiddenActv)
HiddenDelta

array([[ 0.01253752, -0.0157483 , -0.05534844,  0.00807215],
       [-0.00561799,  0.00713192,  0.02425081, -0.00346716],
       [-0.00220641,  0.00353426,  0.00909458, -0.00156808],
       [-0.00818275,  0.0149028 ,  0.02988799, -0.00524891],
       [-0.003686  ,  0.00752715,  0.01822532, -0.00254741],
       [ 0.01112478, -0.01897336, -0.06161941,  0.00860467],
       [ 0.0138986 , -0.02516949, -0.0645781 ,  0.00903022]])

In [19]:
# shows transpose of hiddenActv for use in next cell
hiddenActv.T

array([[0.51081977, 0.47297729, 0.33195732, 0.46220173, 0.32242898,
        0.29925925, 0.5       ],
       [0.77690686, 0.77433605, 0.73204881, 0.49630705, 0.43962385,
        0.7291413 , 0.5       ],
       [0.6121635 , 0.36427877, 0.73080412, 0.26634308, 0.63234512,
        0.49636255, 0.5       ],
       [0.45122141, 0.38499806, 0.58617895, 0.4322566 , 0.63272579,
        0.51887635, 0.5       ]])

In [21]:
# shows adjustments to be added to the hidden weights
hiddenActv.T.dot(outputDelta)

array([[-0.09093874],
       [-0.14936234],
       [-0.13089933],
       [-0.10107066]])

In [22]:
# shows hidden weights
hiddenWeights

array([[-0.38140672],
       [ 0.69070355],
       [ 1.77215819],
       [-0.24780804]])

In [23]:
# value of new hidden weights
hiddenWeightsNEW = hiddenWeights + hiddenActv.T.dot(outputDelta)
hiddenWeightsNEW

array([[-0.47234546],
       [ 0.54134121],
       [ 1.64125886],
       [-0.3488787 ]])

In [24]:
# shows transpose of X
X.T

array([[0., 0., 1., 0., 1., 1., 0.],
       [0., 1., 0., 1., 0., 1., 0.],
       [1., 1., 1., 0., 0., 1., 0.]])

In [25]:
# shows the hidden-layer delta
HiddenDelta

array([[ 0.01253752, -0.0157483 , -0.05534844,  0.00807215],
       [-0.00561799,  0.00713192,  0.02425081, -0.00346716],
       [-0.00220641,  0.00353426,  0.00909458, -0.00156808],
       [-0.00818275,  0.0149028 ,  0.02988799, -0.00524891],
       [-0.003686  ,  0.00752715,  0.01822532, -0.00254741],
       [ 0.01112478, -0.01897336, -0.06161941,  0.00860467],
       [ 0.0138986 , -0.02516949, -0.0645781 ,  0.00903022]])

In [26]:
# takes dot product of the transpose of X and delta of the hidden layer
X.T.dot(HiddenDelta)

array([[ 0.00523237, -0.00791194, -0.0342995 ,  0.00448917],
       [-0.00267596,  0.00306136, -0.00748061, -0.0001114 ],
       [ 0.0158379 , -0.02405548, -0.08362245,  0.01164157]])

In [27]:
# shows input weights
inputWeights

array([[-0.74263146, -0.24268876,  0.54229065,  0.5439284 ],
       [-0.15148208, -0.01477208, -1.01325626, -0.27265018],
       [ 0.04328582,  1.24773111,  0.45641554, -0.19573693]])

In [28]:
# value of new input weights
inputWeightsNEW = X.T.dot(HiddenDelta) + inputWeights
inputWeightsNEW

array([[-0.73739909, -0.25060071,  0.50799115,  0.54841756],
       [-0.15415804, -0.01171072, -1.02073686, -0.27276158],
       [ 0.05912372,  1.22367563,  0.37279309, -0.18409536]])
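
The two update terms, `hiddenActv.T.dot(outputDelta)` and `X.T.dot(HiddenDelta)`, can be checked numerically. Assuming the cost is half the sum of squared errors (an assumption, since the notebook never writes the cost down), each term should equal the negative gradient of that cost with respect to the matching weight matrix. A minimal sketch with freshly seeded weights rather than the values printed above; all function names are illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def cost(W1, W2, X, y):
    # assumed cost: half the sum of squared errors
    out = sigmoid(sigmoid(X @ W1) @ W2)
    return 0.5 * np.sum((y - out) ** 2)

def backprop_updates(W1, W2, X, y):
    # same steps as the cells above, with a learning rate of 1
    hidden = sigmoid(X @ W1)
    out = sigmoid(hidden @ W2)
    output_delta = (y - out) * out * (1 - out)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
    return X.T @ hidden_delta, hidden.T @ output_delta

def numerical_gradient(f, W, eps=1e-6):
    # central differences, nudging one weight at a time in place
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = f()
        W[idx] = original - eps
        minus = f()
        W[idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 0],
              [1, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=float)
y = np.array([[0], [1], [1], [1], [1], [0], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((4, 1))

dW1, dW2 = backprop_updates(W1, W2, X, y)
num_dW1 = numerical_gradient(lambda: cost(W1, W2, X, y), W1)
num_dW2 = numerical_gradient(lambda: cost(W1, W2, X, y), W2)

# the updates are the NEGATIVE gradient, so update + gradient should be ~0
max_diff = max(np.max(np.abs(dW1 + num_dW1)), np.max(np.abs(dW2 + num_dW2)))
print(max_diff)
```

Because the updates are the negative gradient, adding them to the weights moves the cost downhill, which is what the `NEW` weight cells above do.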

In [63]:
# neural network class

class NeuralNetwork:
    
    def __init__(self, inputs, hiddenNodes, outputNodes=1):
        # Set up number of nodes with user input
        self.inputs = inputs
        self.hiddenNodes =  hiddenNodes
        self.outputNodes = outputNodes
        
        # set initial weights to random values in numpy array of correct shape
        self.inputWeights = np.random.randn(self.inputs, self.hiddenNodes)
        self.hiddenWeights = np.random.randn(self.hiddenNodes, self.outputNodes)
        
    def sigmoid(self, s):
        return 1 / (1+np.exp(-s))
        
    def sigmoidPrime(self, s):
        return s * (1-s)
    
    def feed_forward(self, X):
        '''
        the feed forward steps from earlier
        '''

        self.hiddenSum = np.dot(X, self.inputWeights)

        self.hiddenActv = self.sigmoid(self.hiddenSum)

        self.outputSum = np.dot(self.hiddenActv, self.hiddenWeights)

        self.outputActv = self.sigmoid(self.outputSum)
        
        return(self.outputActv) # used as output for backward
    
    def backward(self, X, y, output, learning_rate=.1):
        '''
        backprop steps from earlier
        '''
        
        self.outputError = y - output

        self.outputDelta = learning_rate * (self.outputError * self.sigmoidPrime(output)) 
        
        self.hiddenError = self.outputDelta.dot(self.hiddenWeights.T)
        
        self.hiddenDelta = self.hiddenError * self.sigmoidPrime(self.hiddenActv)
        
        self.hiddenWeights += self.hiddenActv.T.dot(self.outputDelta)

        self.inputWeights += X.T.dot(self.hiddenDelta)
        
    def train(self, X, y, epochs=100, learning_rate=.1):
        '''
        runs feedforward then updates weights with backprop
        '''
        
        for _ in range(epochs):
            output = self.feed_forward(X)
            self.backward(X, y, output, learning_rate=learning_rate)
        self.loss = np.mean(np.square(y - self.feed_forward(X)))
        return(f'Loss after {epochs} epochs was {self.loss}')
    
    def predict(self, X):
        '''
        creates a prediction vector to compare with y vector
        '''
        
        output = self.feed_forward(X)
        # threshold the sigmoid outputs at 0.5 to get 0/1 class labels
        self.predictions = (output >= 0.5).astype(int)
        
    def check(self, y):
        '''
        checks the fraction of correct predictions
        '''
        
        correct_predictions = np.sum(self.predictions == y)
        total_predictions = len(self.predictions)
        accuracy = correct_predictions / total_predictions
        return(f'The model had a {accuracy} accuracy score')

In [68]:
# trains neural network with 3 inputs and 4 nodes in the hidden layer
nn = NeuralNetwork(3,4)
nn.train(X, y, epochs=10000, learning_rate=1)

'Loss after 10000 epochs was 0.0001318357155550651'

In [69]:
# makes prediction using X values and checks the fraction of values it predicted correctly
nn.predict(X)
nn.check(y)

'The model had a 1.0 accuracy score'

## Try building/training a more complex MLP on a bigger dataset.

Use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to build the canonical handwritten digit recognizer and see what kind of accuracy you can achieve.

If you need inspiration, the internet is chock-full of tutorials, but I want you to see how far you can get on your own first. I've linked to the original MNIST dataset above, but it will probably be easier to download the data through a neural network library. If you reference outside resources, make sure you understand every line of code that you're using from other sources, and share helpful resources that you find with your fellow students.


### Parts
1. Gathering & Transforming the Data
2. Making MNIST a Binary Problem
3. Estimating your Neural Network (the part you focus on)

### Gathering the Data 

`keras` has a handy method to pull the MNIST dataset for you. You'll notice that each observation is a 28x28 array which represents an image. Although most neural network frameworks can handle higher-dimensional data, that is more overhead than we need. We need to flatten each image into one long row of 784 values (28x28). Basically, you will be appending the rows of the image to one another to make one really long row.
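
To make the flattening step concrete without downloading anything, here is a small sketch on a stand-in array. The values are made up (just a running count), not real MNIST pixels, and the variable names are illustrative.

```python
import numpy as np

# stand-in for two 28x28 images (made-up values, not real MNIST pixels)
images = np.arange(2 * 28 * 28).reshape(2, 28, 28)

# one row of 784 values per image
flat = images.reshape(images.shape[0], 28 * 28)

# reshape lays the image rows end to end, exactly like concatenating them
first_by_hand = np.concatenate([row for row in images[0]])
same_as_concatenating_rows = np.array_equal(flat[0], first_by_hand)

print(flat.shape, same_as_concatenating_rows)
```

Position 28 of the flattened row holds the first pixel of the image's second row, so no pixel values are lost, only the 2D layout.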

In [32]:
# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(x_train.shape[0], img_rows * img_cols)
x_test = x_test.reshape(x_test.shape[0], img_rows * img_cols)

# Normalize Our Data
x_train = x_train / 255
x_test = x_test / 255

# Now the data should be in a format you're more familiar with
x_train.shape

(60000, 784)

### Making MNIST a Binary Problem 
MNIST is a multiclass classification problem; however, we haven't covered all the necessary techniques to handle this yet. You would need to one-hot encode the target, use a different loss function, and use softmax activations for the last layer. This is all stuff we'll cover later this week, but let us simplify the problem for now: zero or all else.

In [58]:
# reduces the labels to a binary target: 1 if the digit is 0, otherwise 0
y_temp = np.zeros(y_train.shape)
y_temp[np.where(y_train == 0.0)[0]] = 1
y_train = y_temp

y_temp = np.zeros(y_test.shape)
y_temp[np.where(y_test == 0.0)[0]] = 1
y_test = y_temp

# reshape to column vectors to work with the previous class
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
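
One thing to keep in mind with a "zero or all else" target is that the classes are unbalanced: roughly one digit in ten is a zero, so a model that always answers "not zero" already scores high accuracy. A sketch of that baseline on a short, hypothetical list of digit labels standing in for `y_test`; the names `digits`, `y_binary`, and `baseline_accuracy` are illustrative.

```python
import numpy as np

# hypothetical digit labels standing in for y_test
digits = np.array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9,
                   0, 6, 9, 0, 1, 5, 9, 7, 3, 4])

# same binary target as above: 1 if the digit is 0, otherwise 0
y_binary = (digits == 0).astype(float).reshape(-1, 1)

# a "model" that never predicts zero
always_not_zero = np.zeros_like(y_binary)
baseline_accuracy = np.mean(always_not_zero == y_binary)

print(baseline_accuracy)  # 0.85 for this made-up sample
```

Any accuracy reported by `check` is easier to interpret when compared against this kind of baseline.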

### Estimating Your Neural Network

In [62]:
# reuses the previous neural network class with 784 inputs and 3 hidden nodes
nn2 = NeuralNetwork(784, 3)

# trains with full train dataset
nn2.train(x_train, y_train, epochs=1000)


In [60]:
# predicts values and checks what fraction of values are correct
nn2.predict(x_test)
nn2.check(y_test)
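
The `backward` method sums its updates over every row it is given, so with 60,000 rows the effective step is far larger than it was on the 7-row toy dataset. One common alternative, which is not what the class above does, is mini-batch training with the update averaged over the batch. A minimal sketch under those assumptions, run on made-up data rather than MNIST; `train_minibatch`, `predict_proba`, and all parameter values are hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def predict_proba(X, W1, W2):
    return sigmoid(sigmoid(X @ W1) @ W2)

def train_minibatch(X, y, hidden_nodes=3, epochs=200, batch_size=20,
                    learning_rate=0.5, seed=0):
    """Same update rule as the class above, applied per mini-batch
    and averaged over the rows in each batch."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((X.shape[1], hidden_nodes))
    W2 = rng.standard_normal((hidden_nodes, 1))
    initial_mse = np.mean(np.square(y - predict_proba(X, W1, W2)))
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle the rows every epoch
        for start in range(0, len(X), batch_size):
            rows = order[start:start + batch_size]
            xb, yb = X[rows], y[rows]
            hidden = sigmoid(xb @ W1)
            out = sigmoid(hidden @ W2)
            output_delta = (yb - out) * out * (1 - out)
            hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
            # dividing by the batch size keeps the step size independent of row count
            W2 += learning_rate * (hidden.T @ output_delta) / len(rows)
            W1 += learning_rate * (xb.T @ hidden_delta) / len(rows)
    final_mse = np.mean(np.square(y - predict_proba(X, W1, W2)))
    return W1, W2, initial_mse, final_mse

# made-up data: 4 random features plus a constant column;
# the label is 1 when the first feature is above 0.5
rng = np.random.default_rng(1)
X_demo = np.hstack([rng.random((200, 4)), np.ones((200, 1))])
y_demo = (X_demo[:, [0]] > 0.5).astype(float)

W1, W2, initial_mse, final_mse = train_minibatch(X_demo, y_demo)
print(initial_mse, final_mse)
```

The same loop could be pointed at `x_train` and `y_train`, though the batch size, learning rate, and number of epochs would need tuning for MNIST.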

## Stretch Goals: 

- Make MNIST a multiclass problem using cross entropy & softmax
- Implement Cross Validation model evaluation on your MNIST implementation
- Research different [Gradient Descent Based Optimizers](https://keras.io/optimizers/)
  - [Siraj Raval: The Evolution of Gradient Descent](https://www.youtube.com/watch?v=nhqo0u1a6fw)
- Build a housing price estimation model using a neural network. How does its accuracy compare with the regression models that we fit earlier on in class?
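
As a starting point for the first stretch goal, here is a minimal NumPy sketch of the three pieces it needs: one-hot encoding, softmax, and cross entropy. The function names, the made-up labels, and the random logits are illustrative; this is only the output side, not a full multiclass network.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # one row per label, with a single 1 in the column of its class
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1
    return encoded

def softmax(logits):
    # subtracting the row max avoids overflow and does not change the result
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy(probs, targets, eps=1e-12):
    # mean negative log-probability assigned to the true class
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=1))

labels = np.array([5, 0, 4])             # made-up digit labels
targets = one_hot(labels)

rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 10))    # stand-in for the last layer's weighted sums
probs = softmax(logits)
loss = cross_entropy(probs, targets)

# with all-equal logits every class gets probability 1/10, so the loss is log(10)
uniform_loss = cross_entropy(softmax(np.zeros((3, 10))), targets)

print(probs.sum(axis=1), loss, uniform_loss)
```

A loss near `log(10)` is what an untrained 10-class model should score, which gives a reference point when training starts.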