# Python ML library: Keras


## 0. Motivation for Neural Networks

Imagine we need to build a model that predicts how many transactions each customer will make next year.  
We have predictive data (or features) such as each customer's age, bank balance, and whether they are retired.

Consider how a linear regression would solve this problem:  
Linear regression embeds the assumption that the outcome, in this case how many transactions a customer makes, is the sum of individual parts. It starts by asking "what is the average?" and then adds the effect of each part.  
So the linear regression model does *not* identify interactions between these parts, or how those interactions affect banking activity.

We can plot the predictions for retired and non-retired people as two lines, with current bank balance on the x-axis and the predicted number of transactions on the y-axis. In a model with no interactions the two lines are parallel, because we simply add up the effects of retirement status and current bank balance. If interactions are allowed, the two lines need not be parallel.
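
The parallel-lines point can be checked numerically. The sketch below is only an illustration: the coefficients and the `predict` helper are hypothetical and are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
intercept = 10.0
coef_balance = 0.002      # extra transactions per unit of bank balance
coef_retired = -4.0       # shift for retired customers
coef_interaction = 0.001  # extra balance effect for retired customers

balance = np.array([1000.0, 2000.0, 3000.0])

def predict(balance, retired, with_interaction):
    """Linear prediction, optionally with a balance x retired interaction term."""
    pred = intercept + coef_balance * balance + coef_retired * retired
    if with_interaction:
        pred = pred + coef_interaction * balance * retired
    return pred

# Additive model: the gap between the two lines is the same at every balance
gap_additive = predict(balance, 1, False) - predict(balance, 0, False)
print(gap_additive)     # constant gap -> parallel lines

# With an interaction term the gap changes with balance
gap_interaction = predict(balance, 1, True) - predict(balance, 0, True)
print(gap_interaction)  # varying gap -> non-parallel lines
```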

Neural networks are a powerful modeling approach that accounts for interactions like this well.     
Deep learning uses especially powerful neural networks.     

Because deep learning models account for interactions so well, they perform well on most prediction problems we have seen before.  
Their ability to capture extremely complex interactions also allows them to do amazing things with text, images, video, source code, etc.

Going back to the transaction problem: if there is an interaction between retirement status and bank balance, then instead of having them affect the outcome *separately*, we can calculate a function of these variables that accounts for their interaction, and use that function (along with the other features) to predict the outcome.

In practice, we can represent interactions in a neural network like this:  
Leftmost: the input layer (consists of the predictive features, e.g. age, income)  
Rightmost: the output layer (the prediction of the model, e.g. number of transactions)  
All layers that are not the input or output layers are called *hidden layers*.  
They are called hidden because they are not something we have data about or observe directly from the world.  
Each *node* in a hidden layer represents an aggregation of information from our input data, and each node adds to the model's ability to capture interactions. More nodes -> more interactions we can capture.




## 0.1 Forward propagation

Forward propagation algorithm: Use data to make predictions.    

Going back to the bank example, we try to predict how many transactions a customer will make at our bank.  
Suppose we make predictions based only on the number of children and the number of existing accounts.  
For a customer with 2 children and 3 accounts, there will be 2 nodes (with values 2 and 3) in the input layer.  
The hidden layer (also 2 nodes) is connected to the input layer nodes by lines. Each line has a weight indicating how strongly that input affects the hidden node the line ends at. These are the first set of weights: the top node of the hidden layer is connected to the number of children (2) by a line with weight 1, and to the number of accounts (3) by a line with weight 1. These weights (the numbers on the lines) are the parameters we train, or change, when we fit a neural network to data.

To calculate the value of the top node of the hidden layer, we take the value of each node in the input layer, multiply it by the weight of the line that ends at the hidden node, and sum the results. So this node will have the value 2x1+3x1 = 5.

We repeat this calculation node by node and layer by layer until we reach the value at the output layer.

So forward propagation uses a multiply-then-add process, which is actually a dot product.  
This is forward propagation for a single data point. In general, we do forward propagation for 1 data point at a time.



In [None]:
# forward propagation simple example
import numpy as np

# customer with 2 children and 3 accounts
input_data = np.array([2, 3])

# weights into each node in the hidden layer and into the output node
weights = {
    "node_0": np.array([1, 1]),
    "node_1": np.array([-1, 1]),
    "output": np.array([2, -1]),
}

# top hidden node value: 2*1 + 3*1 = 5
node_0_value = (input_data * weights["node_0"]).sum()
# bottom hidden node value: 2*(-1) + 3*1 = 1
node_1_value = (input_data * weights["node_1"]).sum()

hidden_layer_values = np.array([node_0_value, node_1_value])
print(hidden_layer_values)

# output value: 5*2 + 1*(-1) = 9
output = (hidden_layer_values * weights["output"]).sum()
print(output)
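
Since the multiply-then-add step is a dot product, the same calculation can be written with `np.dot` and a matrix-vector product. This is a minimal sketch using the same numbers as the cell above; the names `hidden_weights` and `output_weights` are introduced here only for illustration.

```python
import numpy as np

input_data = np.array([2, 3])

# Each row holds the weights into one hidden node (same values as above)
hidden_weights = np.array([[1, 1],
                           [-1, 1]])
output_weights = np.array([2, -1])

# Multiply-then-add for a single node is a dot product
node_0_value = np.dot(input_data, hidden_weights[0])

# A matrix-vector product computes every hidden node at once
hidden_layer_values = hidden_weights @ input_data
output = np.dot(hidden_layer_values, output_weights)

print(node_0_value, hidden_layer_values, output)
```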

## 0.2 Activation functions

Creating the multiply-add process is only half the story for hidden layers.  
For neural networks to achieve their maximum predictive power, we must apply an *activation function* in the hidden layers.

An *activation function* allows the model to capture non-linearities. That is, if the relationships in the data are not straight-line relationships, we need an activation function that captures non-linearities.

The *activation function* is applied to the value coming into a node and transforms it into the value stored in that node (the node output).

An example of an activation function is the tanh function. Applied to the top node of the hidden layer in the example above, the node value would be tanh(5) instead of 5.

Today, the standard in both industry and research applications is ReLU, the Rectified Linear activation function:  
ReLU(x) = 0 if x < 0, x if x >= 0  
It consists of 2 linear pieces, and it can be powerful when composed through multiple successive hidden layers.



In [6]:
# forward propagation simple example
# tanh as activation function
import numpy as np


input_data = np.array([-1,2])

# weights into each node in the hiddenlayer and to the output
weights = {"node_0": np.array([3,3]),"node_1": np.array([1,5]),"output": np.array([2,-1])}

#top hidden node value
node_0_input = (input_data * weights["node_0"]).sum()
node_0_output = np.tanh(node_0_input)
#bottom hidden node value
node_1_input = (input_data * weights["node_1"]).sum()
node_1_output = np.tanh(node_1_input)
                
hidden_layer_output = np.array([node_0_output,node_1_output])
                
print(hidden_layer_output)

output = (hidden_layer_output*weights["output"]).sum()
print(output)

[0.99505475 0.99999997]
0.9901095378334199


In [7]:
# activation function: relu(x)
def relu(x):
    '''Return x if x is positive, otherwise 0.'''
    # max() clips negative inputs to 0 (x avoids shadowing the built-in input)
    output = max(0, x)
    
    # Return the value just calculated
    return output

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output)

-3
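
The `relu` above relies on the built-in `max`, so it handles one number at a time. As a possible alternative, `np.maximum` applies the same rule element-wise to a whole array. The sketch below reuses the weights from the cell above; `relu_vectorized` and the matrix layout are names chosen here for illustration.

```python
import numpy as np

def relu_vectorized(x):
    """Element-wise ReLU: negative entries become 0, the rest pass through."""
    return np.maximum(0, x)

input_data = np.array([-1, 2])
hidden_weights = np.array([[3, 3],   # weights into node 0
                           [1, 5]])  # weights into node 1
output_weights = np.array([2, -1])

# Apply ReLU to both hidden nodes in one call
hidden_layer_outputs = relu_vectorized(hidden_weights @ input_data)
model_output = np.dot(hidden_layer_outputs, output_weights)

print(relu_vectorized(np.array([-2.0, 0.0, 3.5])))
print(model_output)
```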


In [8]:
# generate predictions for multiple data observations
input_data = [np.array([3, 5]), np.array([ 1, -1]), np.array([0, 0]), np.array([8, 4])]
weights = {'node_0': np.array([2, 4]), 'node_1': np.array([ 4, -5]), 'output': np.array([2, 7])}




# Define predict_with_network()
def predict_with_network(input_data_row, weights):

    # Calculate node 0 value
    node_0_input = (input_data_row * weights["node_0"]).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights["node_1"]).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    
    # Calculate model output
    input_to_final_layer = (hidden_layer_outputs * weights["output"]).sum()
    model_output = relu(input_to_final_layer)
    
    # Return model output
    return(model_output)


# Create empty list to store prediction results
results = []
for input_data_row in input_data:
    # Append prediction to results
    results.append(predict_with_network(input_data_row,weights))

# Print results
print(results)
        

[52, 63, 0, 148]
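
The loop above handles one row at a time. Assuming the rows are stacked into a 2-D array, the same predictions can be sketched with matrix products; `X`, `W_hidden` and `w_output` are illustrative names holding the same values as above.

```python
import numpy as np

# One row per observation (same values as the loop above)
X = np.array([[3, 5], [1, -1], [0, 0], [8, 4]])

# One column per hidden node: column 0 is node_0, column 1 is node_1
W_hidden = np.array([[2, 4],
                     [4, -5]])
w_output = np.array([2, 7])

hidden = np.maximum(0, X @ W_hidden)            # ReLU on every hidden node
predictions = np.maximum(0, hidden @ w_output)  # ReLU on the output, as above

print(predictions.tolist())
```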


## 0.3 Deeper networks  

We can forward propagate through successive layers in a similar way to what we use for a single hidden layer.      
We first fill in the values for hidden layer one as a function of the input layer. Then apply the activation functions to fill in the values of these nodes. Then use values from the first hidden layer to fill in the second hidden layer. Then we can make a prediction based on the output of the hidden layer 2. We use the same forward propagation process, but apply the iterative process more times.     
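
The repeated process described above can be sketched as a loop over layers. The weights below are hypothetical, and the `forward` helper is an illustration under the assumption that ReLU is used in every hidden layer and no activation is applied to the output.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU activation."""
    return np.maximum(0, x)

def forward(input_row, layer_weights, output_weights):
    """Forward propagate one data point through any number of hidden layers."""
    values = input_row
    for W in layer_weights:        # one weight matrix per hidden layer
        values = relu(W @ values)  # multiply-add, then activation
    return np.dot(values, output_weights)  # no activation on the output

# Hypothetical weights: each row feeds one node of the next layer
layer_weights = [
    np.array([[2, 4], [4, -5]]),   # input -> hidden layer 1
    np.array([[-1, 1], [2, 2]]),   # hidden layer 1 -> hidden layer 2
]
output_weights = np.array([-3, 7])

prediction = forward(np.array([3, 5]), layer_weights, output_weights)
print(prediction)
```

Adding another hidden layer only means appending one more matrix to `layer_weights`; the loop itself does not change.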

Deep networks internally build up representations of the patterns in the data that are useful for making predictions, and they find increasingly complex patterns as we go through successive hidden layers of the network.  
In this way neural networks partially replace the need for feature engineering (manually creating better predictive features).  
Deep learning is also sometimes called representation learning, because subsequent layers build increasingly sophisticated representations of the raw data, until we get to a stage where we can make predictions.

When a neural network tries to classify an image, the first hidden layers build up patterns or interactions that are conceptually simple. A simple interaction would look at groups of nearby pixels and find patterns like horizontal lines. Subsequent layers combine this information and find larger patterns.

The cool thing about deep learning is that the modeler doesn't need to specify those interactions. We never tell the model to look for horizontal lines, for example. Instead, when we train the model, the neural network gets weights that find the relevant patterns to make better predictions.

## 0.4 The need for optimisation

To see the importance of model weights, we will revisit the simple neural network above (2 input nodes, 1 hidden layer with 2 nodes, and 1 output node).  
For the moment, we won't use an activation function in this example. Suppose our input values are 2 and 3 and the true value of the target is 13. The closer our prediction is to 13, the more accurate the model is for this data point.  
If we use the existing weights and perform forward propagation, our output is 9. Since the true target value is 13, our error is 9-13=-4. (error = predicted - actual)

Changing any weight will change our prediction down the network. Making accurate predictions gets harder with more data points. At any set of weights, there are many values of the error, corresponding to the many points we make predictions for. We use a *loss function* to aggregate all the errors into a single measure of the model's predictive performance. A common loss function for regression tasks is mean squared error (i.e. square the error for each data point and take the average as a measure of model quality). The goal is to find the weights that give the lowest value of the loss function.
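
As a small illustration of how a loss function collapses many errors into one score, here is a sketch of mean squared error. The predictions and targets are made-up numbers, and `mean_squared_error` is a helper written for this example rather than a library call.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of the squared errors over all data points."""
    errors = np.asarray(predictions) - np.asarray(targets)
    return np.mean(errors ** 2)

# Hypothetical predictions and targets for three data points
predictions = [9, 12, 7]
targets = [13, 12, 5]

# Errors are -4, 0 and 2, so the loss is (16 + 0 + 4) / 3
loss = mean_squared_error(predictions, targets)
print(loss)
```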

We do this with an algorithm called *gradient descent*:  
1) start at a random point  
2) until you are somewhere flat, find the slope and take a step downhill  

If the slope is positive:  
-> going opposite to the slope means moving to lower numbers  
-> we subtract the slope from the current value, but too big a step might lead us astray  
-> solution: the learning rate. We multiply the slope by a small number, called the learning rate (e.g. 0.01), and change each weight by subtracting (learning rate * slope).
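
The steps above can be sketched on a toy one-dimensional loss. The loss `(w - 3)**2`, the starting point and the number of steps are all assumptions made for illustration; only the update rule comes from the text.

```python
def slope(w):
    """Slope of the toy loss (w - 3)**2, whose minimum is at w = 3."""
    return 2 * (w - 3)

w = 0.0               # arbitrary starting point
learning_rate = 0.01
for _ in range(1000):
    w = w - learning_rate * slope(w)  # step against the slope

print(w)  # ends up very close to 3, where the loss is flat
```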

**Slope for a weight**:  
Weights feed from one node into another, and you always get the slope you need by multiplying 3 things:  
1) slope of the loss function w.r.t. the value at the node we feed into  
2) value of the node that feeds into our weight  
3) slope of the activation function w.r.t. the value we feed into  

Consider 2 nodes: 3 -- 2 --> 6 (actual = 10), i.e. an input node with value 3, a weight of 2, and a prediction of 6:  
(answers to the 3 points above)  
1) here it is the slope of the squared-error loss function w.r.t. the prediction node: 2*(predicted value - actual value) = 2 x error = 2*(6-10) = -8  
2) 3  
3) no activation function, so the slope is 1  
so the total slope of the loss function w.r.t. this weight = -8x3x1 = -24

We would now improve this weight by subtracting the learning rate times that slope. With a learning rate of 0.01, the new weight is 2-0.01(-24)=2.24
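
The worked example can be reproduced in a few lines. This sketch assumes no activation function, which means the activation slope is 1.

```python
input_value = 3
weight = 2
target = 10
learning_rate = 0.01

prediction = input_value * weight            # 3 * 2 = 6
loss_slope = 2 * (prediction - target)       # 2 * (6 - 10) = -8
activation_slope = 1                         # identity: no activation function
weight_slope = loss_slope * input_value * activation_slope  # -8 * 3 * 1 = -24

new_weight = weight - learning_rate * weight_slope
new_prediction = input_value * new_weight    # moves from 6 towards the target of 10
print(weight_slope, new_weight, new_prediction)
```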

For multiple weights feeding to the output, we repeat this calculation separately for each weight. Then we update all the weights simultaneously using their respective derivatives.





In [9]:
# code to calculate slopes and update weights
weights = np.array([1,2])
input_data = np.array([3,4])
target = 6
learning_rate = 0.01
preds = (weights * input_data).sum()

error = preds - target
print(error)

5


In [10]:
# slope calculation and weight update
# the gradient holds one slope per weight: 2 * input value * error
# notice we update both weights simultaneously
gradient = 2*input_data*error
print(gradient)

weights_updated = weights - learning_rate * gradient

preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error_updated)

[30 40]
2.5
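
A single update halves the error here (from 5 to 2.5). Repeating the same update shows the squared error shrinking step by step. This is a sketch reusing the values above; the number of steps is an arbitrary choice.

```python
import numpy as np

weights = np.array([1.0, 2.0])
input_data = np.array([3, 4])
target = 6
learning_rate = 0.01

squared_errors = []
for _ in range(20):   # number of steps is an arbitrary choice
    error = (weights * input_data).sum() - target
    squared_errors.append(error ** 2)
    gradient = 2 * input_data * error             # one slope per weight
    weights = weights - learning_rate * gradient  # update both weights together

print(squared_errors[0], squared_errors[-1])
```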


## 0.5 Back Propagation

We used gradient descent to optimise the weights in a simple model. We can use back propagation to calculate the slopes we need to optimise more complex deep learning models.

Just as forward propagation sends input data through the hidden layers and into the output layer, back propagation takes the error from the output layer and propagates it backward through the hidden layers, towards the input layer.    

It calculates the necessary slopes sequentially from the weights closest to the prediction, through hidden layers, eventually back to the weights coming from the inputs.     
It allows gradient descent to update all weights in a neural network, by getting gradients for all weights.

We then use these slopes to update our weights.    

It is important to understand the process and focus on the general structure of the algorithm.    

In the big picture, we are trying to estimate the slope of the loss function w.r.t. each weight in our network.  
We always do forward propagation to make a prediction and calculate an error before we do back propagation, as we need the predicted value to get the error (which is needed to get the slope).

For the ReLU activation function, the slope is 0 when x < 0 and 1 when x > 0.

So far we have focused on getting the slopes of the loss function w.r.t. the weights. We also need to keep track of the slopes of the loss function w.r.t. the node values, because we use these slopes in our calculations of the slopes at the weights.

The slope of the loss function w.r.t. any node value is found by summing over every weight coming out of that node: for each such weight, multiply the weight by the slope of the loss function w.r.t. the input of the node it feeds into, then add the results.
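
To make the backward pass concrete, here is a minimal sketch for a network with one hidden layer, ReLU in the hidden layer, no activation on the output, and a squared-error loss for a single data point. The data point, weights and variable names are all hypothetical.

```python
import numpy as np

def forward(x, W_hidden, w_output):
    """Return hidden inputs, hidden outputs (after ReLU) and the prediction."""
    hidden_in = W_hidden @ x
    hidden_out = np.maximum(0, hidden_in)
    return hidden_in, hidden_out, np.dot(hidden_out, w_output)

# Hypothetical data point and starting weights
x = np.array([1.0, 2.0])
target = 10.0
W_hidden = np.array([[1.0, 1.0],    # weights into hidden node 0
                     [-3.0, 1.0]])  # weights into hidden node 1
w_output = np.array([2.0, -1.0])

# Forward pass first: the prediction is needed to get the error
hidden_in, hidden_out, prediction = forward(x, W_hidden, w_output)

# Backward pass for the loss (prediction - target)**2
slope_prediction = 2 * (prediction - target)
slope_w_output = slope_prediction * hidden_out   # weights closest to the output
slope_hidden_out = slope_prediction * w_output   # slopes w.r.t. hidden node values
relu_slope = (hidden_in > 0).astype(float)       # 1 for positive inputs, else 0
slope_hidden_in = slope_hidden_out * relu_slope
slope_W_hidden = np.outer(slope_hidden_in, x)    # weights coming from the inputs

print(slope_w_output)
print(slope_W_hidden)
```

Hidden node 1 receives a negative input, so its ReLU slope is 0 and the weights feeding into it get a slope of 0 for this data point.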

