#### Overview

* What are NNs?
* Why use NNs? Why not multivariable non-linear regression?

#### Imports

In [25]:
import numpy as np

#### Data

* How can we normalize the data so that our ML model will perform better?
* Generate good data that makes for easy example

In [139]:
features = np.array([
        [1,3],
        [4,2],
        [5,6]
    ])
targets = np.array([
        [53],
        [27],
        [23]
    ])

#### Architecture

* Get a good diagram

In [140]:
#Neural Network Architecture
layer0Size = 2
layer1Size = 3
layer2Size = 1
learningRate = .01

#### Neuron

Each neuron stores a vector whose size equals the number of incoming synapses. Neurons don’t store a single value. They store information about the cost to move to that neuron from every feature/input/synapse. So a layer is essentially a big matrix of dimensions equal to the rows of the incoming layer and the columns of the incoming weights matrix. It can be 3x1 or 3x3. Even in the final layer with a single neuron, the neuron will still store a vector (1D array) with a value for each incoming connection. This is important b/c we need a vector of derivatives (one for each direction/synapse) so we can start backprop optimization. The final output isn’t a single number (like it would be if we were making a prediction). If we wanted to make a prediction, we would sum up that final vector values?

#### Hidden Layer

* Explain how neurons are connected in a hidden layer

### Weights

In [141]:
#Layer 0 --> 1
#Layer 0 has 2 columns
#Layer 1 is of size 3
#So our weights matrix must be size 2 x 3
#Rows of input by Size of next layer
l0_weights = np.array([
        [.5,.5,.5],
        [.5,.5,.5]        
    ])

#L1 dimensions = 3 x 3
#L2 = size 1
#L1 weights must be of 3 x 1
#Rows of input by Size of next layer
l1_weights = np.array([
        [.5],
        [.5],
        [.5]
    ])

### Activation Function

* What is the purpose of the activation function? 
* Why do we use it?

In [142]:
def sigmoid(X):
    #Non-linear activation function..
    return 1 / (1 + np.exp(-X))

In [143]:
#Test
sigmoid(features)

array([[ 0.73105858,  0.95257413],
       [ 0.98201379,  0.88079708],
       [ 0.99330715,  0.99752738]])

### Feed Forward

* How does feed forward work?

In [144]:
def feedForward(X, W):
    '''
    X - Previous Layer Matrix
    W - Weights matrix connecting previous layer to current
    Returns matrix representing neurons of current layer
    with sigmoid activation function applied 
    '''
    return sigmoid(np.dot(X,W))

In [145]:
a1 = feedForward(features,l0_weights)
a2 = feedForward(a1,l1_weights)

print "Layer1 Output"
print a1
print "Layer2 Output"
print a2

Layer1 Output
[[ 0.88079708  0.88079708  0.88079708]
 [ 0.95257413  0.95257413  0.95257413]
 [ 0.99592986  0.99592986  0.99592986]]
Layer2 Output
[[ 0.78938056]
 [ 0.80672381]
 [ 0.81666214]]


### Cost

* Which cost function do we use? And Why?
* How is it different from cost functions in Linear/Logistic Regression?
* Let's use MSE = 1/2(yHat-y)^2
* The derivative of MSE is (yHat - y)

In [146]:
def cost_function(predictions, targets):
    '''
    Predictions: N x M matrix
    Targets: M x K matrix
    Returns average squared error across predictions
    '''
    N = len(targets)
    
    #Take the squared error of each row
    sq_error = (predictions - targets)**2

    #Return the mean sum squared error among predictions
    return 1.0/(2*N) * sq_error.sum()

cost_function(a2,targets)

650.68949642533767

### Backpropagation

In [147]:
'''
# Our cost function == Cost(Sigmoid(Hypothesis(x)))
# Chain Rule = 
# via Chain Rule = dC/S * dS/H * dH/x

# So to calculate the gradient, we take the 
    1. Derivative of Hypothesis w repect to X 
    2. Derivative of Sigmoid(Hypothesis)
    3. Derivative of Cost(Sigmoid(Hypothesis)))
    4. Multiply them all together
    
Derivative of Hypothesis = h(x) = W1*x1 + W2*x2 = W1+W2  =  constant (e.g. d/dx of 2x is 2)
Derivative of Sigmoid = s(x) = x * (1-x)  =  h(x)*(1-h(x))
Derivative of Cost = x(y-h(x))  =  s(x)(y-h(x))  =  h(x)*(1-h(x)) * (y-h(x))

Sweet!!!!!!!!
'''
def costDeriv(yHat, y):
    return yHat - y

'''
One of the desirable properties of the sigmoid function is that its output 
can be used to create its derivative. If the sigmoid's output 
is x, then its derivative is x*(1-x). 
http://iamtrask.github.io/2015/07/12/basic-python-network/
'''
def sigmoidDeriv(X):
    '''
    X - Matrix representing layer of neurons
    Modifies values in matrix to derivative
    
    This is important because it's part of backprop
    via the Chain Rule
    
    Hadamard element-wise multiplication
    We are just modifying the matrix values in-place
    The matrix dimensions remain the same
    '''
    return X * (1 - X)

def lastLayerError(lastLayerWeightedInput, lastLayerOutput, targets):
    '''
    Should return a vector of the same dimensions as target
    
    l0_activation = 9x9
    l1_activation = 3x1
    targets = 3x1
    
    (lastLayerOutput-targets) * sigmoidDeriv(lastLayerWeightedInput)
    (l1_activation - targets) * sigmoidDeriv(l0_activation)
    [yHat]   [y]     
    [yHat] - [y]  
    [yHat]   [y]
    
    l1_activation - targets = 3x1
    sigmoidDeriv(l0_activation) = 3x3
    
    x = hadamard product (element-wise multiplication)
    
    [yHat - y]     [S S S]     [E]
    [yHat - y]  x  [S S S]  =  [E] 
    [yHat - y]     [S S S]     [E] 
    
    '''
    return (lastLayerOutput-targets) * sigmoidDeriv(lastLayerWeightedInput)

def hiddenLayerError(hiddenLayerWeightedInput, weightsToNextLayer, nextLayerError):
    return np.dot(nextLayerError, weightsToNextLayer.T) * sigmoidDeriv(hiddenLayerWeightedInput)

def costGradientWithRespectToWeights(currentLayerError, currentLayerInput):
    return np.dot(currentLayerInput.T, currentLayerError)

In [None]:
#Operation for multiplying weights + features and wrapping with sigmoid
# Get formula
#sigmoid(np.dot(features, weights))

'''
Above we had 2 incoming features and 3 neurons in the next layer

We have 6 weights (synapses):
    [.5,.5,.5],
    [.5,.5,.5]

The output looks like this:
    [ 0.88079708  0.88079708  0.88079708]
    [ 0.95257413  0.95257413  0.95257413]
    [ 0.99592986  0.99592986  0.99592986]
 
 This really means:
 
[[ sigmoid(f(W1,X))  sigmoid(f(W1,X))  sigmoid(f(W1,X))]
 [ sigmoid(f(W2,X))  sigmoid(f(W2,X))  sigmoid(f(W2,X))]
 [ sigmoid(f(W3,X))  sigmoid(f(W3,X))  sigmoid(f(W3,X))]]
 
Given these are functions, we can take the partial derivatives to guide us in the correct direction to update each weight.
'''
print "hey"

In [148]:
a = np.array([[2],[2],[2]])
b = np.array([
        [3, 3, 3],
        [3, 3, 3],
        [3, 3, 3],
    ])

In [149]:
print features
sigmoidDeriv(features)

[[1 3]
 [4 2]
 [5 6]]


array([[  0,  -6],
       [-12,  -2],
       [-20, -30]])

In [150]:
a = np.array([
        [1],[2]
    ])
b = np.array([
        [3,4],
        [5,6]
    ])
c = np.array([
        [3,4],
        [5,6],
        [5,6],
    ])
a - b

array([[-2, -3],
       [-3, -4]])

### Gradient Descent

In [151]:
def update_weights(features, l0_weights, l1_weights, learningRate=.01):

    # Feed Forward
    '''
    [1,3]                       [1 1 1]
    [4,2]  dot  [.5 .5 .5 ]  =  [1 1 1] 
    [5,6]       [.5 .5 .5 ]     [1 1 1]

    Then apply Sigmoid()
    
    10_activation = 9x9
    '''
    l0_activation = feedForward(features, l0_weights)

    '''
    [1 1 1]       [.5 ]     [2]
    [1 1 1]  dot  [.5 ]  =  [2] 
    [1 1 1]       [.5 ]     [2]
    
    l1_activation = 3x1
    '''
    l1_activation = feedForward(l0_activation, l1_weights)
    
    
    ### Backpropagate ###
    
    # Calculate error in each layer
    '''
    l0_activation = 9x9
    l1_activation = 3x1
    targets = 3x1
    
    (lastLayerOutput-targets) * sigmoidDeriv(lastLayerInput)
    (l1_activation - targets) * sigmoidDeriv(l0_activation)
    [yHat]   [y]     
    [yHat] - [y]  
    [yHat]   [y]
    
    l1_activation - targets = 3x1
    sigmoidDeriv(l0_activation) = 3x1
    
    x = hadamard product (element-wise multiplication)
    
    [yHat - y]     [S'(Z2)]     [E]
    [yHat - y]  x  [S'(Z2)]  =  [E] 
    [yHat - y]     [S'(Z2)]     [E] 
    
    '''
    l1_error = lastLayerError(np.dot(l0_activation, l1_weights), l1_activation, targets)
    l0_error = hiddenLayerError(np.dot(features, l0_weights), l1_weights, l1_error)
    
    # Calculate cost gradient for weights
    w1_gradient = costGradientWithRespectToWeights(l1_error, l0_activation)
    w0_gradient = costGradientWithRespectToWeights(l0_error, features)

    # Update weights based on our cost gradient
    l0_weights -= learningRate * w0_gradient
    l1_weights -= learningRate * w1_gradient
    print l0_weights, l1_weights

for i in range(10):
    update_weights(features, l0_weights, l1_weights, .001)

[[ 1.72750253  1.72750253  1.72750253]
 [ 1.87805752  1.87805752  1.87805752]] [[ 0.44889445]
 [ 0.44889445]
 [ 0.44889445]]
[[ 13.2603347   13.2603347   13.2603347 ]
 [ 15.05160837  15.05160837  15.05160837]] [[ 0.40201064]
 [ 0.40201064]
 [ 0.40201064]]
[[ 372.93232176  372.93232176  372.93232176]
 [ 428.04776666  428.04776666  428.04776666]] [[ 0.37699078]
 [ 0.37699078]
 [ 0.37699078]]
[[ 162699.04072517  162699.04072517  162699.04072517]
 [ 187020.57553104  187020.57553104  187020.57553104]] [[ 0.36206975]
 [ 0.36206975]
 [ 0.36206975]]
[[  1.88008649e+10   1.88008649e+10   1.88008649e+10]
 [  2.16139059e+10   2.16139059e+10   2.16139059e+10]] [[ 0.35263473]
 [ 0.35263473]
 [ 0.35263473]]
[[  1.60006101e+20   1.60006101e+20   1.60006101e+20]
 [  1.83947138e+20   1.83947138e+20   1.83947138e+20]] [[ 0.34646167]
 [ 0.34646167]
 [ 0.34646167]]
[[  7.61029309e+39   7.61029309e+39   7.61029309e+39]
 [  8.74895098e+39   8.74895098e+39   8.74895098e+39]] [[ 0.34233596]
 [ 0.34233596]
 [ 



#### Train

#### Evaluate

### Scikit-Learn Example

In [None]:
#https://scikit-neuralnetwork.readthedocs.io/en/latest/module_mlp.html#regressor
from sknn.mlp import Regressor
from sknn.mlp import Layer
hiddenLayer = Layer("Sigmoid", units=2)
outputLayer = Layer("Linear", units=1)
nn = Regressor([hiddenLayer, outputLayer],learning_rule='sgd',learning_rate=.01,
               batch_size=1,loss_type="mse",debug=True,verbose=True,regularize=None)

In [None]:
inputs = get_inputs(-4,5)
targets = get_targets(inputs)
nn.fit(inputs,targets)

In [None]:
predictions = nn.predict(inputs)
plt.plot(predictions)
plt.plot(targets)
plt.show()