# Intro to Deep Learning

## Part 1

Imagine you work for a bank and you need to predict how many transactions each customer will make next year.

### Interactions
* Neural networks account for interactions really well
* Deep learning uses especially powerful neural networks, and is used for:
    * Text
    * Images
    * Videos
    * Audio
    * Source code

### Course structure
* First two chapters focus on conceptual knowledge
    * Debug and tune deep learning models on conventional prediction problems
    * Lay the foundation for progressing towards modern applications
* This will pay off in the third and fourth chapters

```python
# Build deep learning models with keras
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
```


```python
predictors = np.loadtxt('predictors_data.csv', delimiter=',')
n_cols = predictors.shape[1]
model = Sequential()
model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))
```

---
## Forward propagation
* Gives the network the ability to make predictions

#### Example case: How many bank transactions will someone do?
* Make predictions based on:
    * Number of children
    * Number of existing accounts
    
#### Forward propagation
* Multiply - add process
* Dot product
* Forward propagation for one data point at a time
* Output is the prediction for that data point
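
As a minimal sketch of the multiply-add process, the weights feeding each hidden node can be stacked into a matrix so that every layer becomes a single dot product. The input values (2 children, 3 existing accounts) and all weights are illustrative assumptions, not fitted values, and the names `hidden_weights` and `output_weights` are hypothetical.

```python
import numpy as np

# One data point: [number of children, number of existing accounts] (assumed values)
input_data = np.array([2, 3])

# Each row holds the weights feeding one hidden node (assumed values)
hidden_weights = np.array([[1, 1],
                           [-1, 1]])
output_weights = np.array([2, -1])

# Multiply-add for every hidden node at once: one dot product per row
hidden_layer_values = hidden_weights.dot(input_data)

# The prediction is the dot product of the hidden values and the output weights
prediction = hidden_layer_values.dot(output_weights)

print(hidden_layer_values)  # [5 1]
print(prediction)           # 9
```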

#### Forward propagation Code

```python
input_data = np.array([2, 3])

weights = {'node_0': np.array([1, 1]),
           'node_1': np.array([-1, 1]),
           'output': np.array([2, -1])}

# Multiply-add for each hidden node
node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()

hidden_layer_values = np.array([node_0_value, node_1_value])

print(hidden_layer_values)  # [5 1]
```

```python
output = (hidden_layer_values * weights['output']).sum()
print(output)  # 9
```


---

## Activation functions
* Allow the model to capture non-linearities
* Applied to node inputs to produce node outputs
* Used to improve our neural network
    
### ReLU (Rectified Linear Activation)

$$\text{ReLU}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \geq 0 \end{cases}$$

```python
input_data = np.array([-1, 2])

weights = {'node_0': np.array([3, 3]),
           'node_1': np.array([1, 5]),
           'output': np.array([2, -1])}

# Apply the tanh activation to each hidden node's input
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = np.tanh(node_0_input)

node_1_input = (input_data * weights['node_1']).sum()
node_1_output = np.tanh(node_1_input)

hidden_layer_outputs = np.array([node_0_output, node_1_output])
output = (hidden_layer_outputs * weights['output']).sum()

print(output)  # approximately 0.9901
```


```python
def relu(input):
    '''Return the ReLU of a single number: 0 for negative inputs, the input itself otherwise.'''
    # Calculate the value for the output of the relu function: output
    output = max(0, input)

    # Return the value just calculated
    return output
```
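
The scalar function above handles one node at a time. As a sketch, the same activation can be applied to a whole hidden layer with `np.maximum`; the helper name `relu_vec` is hypothetical, and the weights are the illustrative values used in this section.

```python
import numpy as np

def relu_vec(values):
    """Element-wise ReLU: negative entries become 0, the rest pass through."""
    return np.maximum(0, values)

input_data = np.array([-1, 2])
weights = {'node_0': np.array([3, 3]),
           'node_1': np.array([1, 5]),
           'output': np.array([2, -1])}

# Inputs to the two hidden nodes (multiply-add), then ReLU applied to both at once
hidden_inputs = np.array([(input_data * weights['node_0']).sum(),
                          (input_data * weights['node_1']).sum()])
hidden_outputs = relu_vec(hidden_inputs)

output = (hidden_outputs * weights['output']).sum()
print(hidden_outputs)  # [3 9]
print(output)          # -3
```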

---
## Deeper networks
* Multiple hidden layers

#### Representation learning
* Deep networks internally build representations of patterns in the data
* Partially replace the need for feature engineering
* Subsequent layers build increasingly sophisticated representations of raw data
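
To make "multiple hidden layers" concrete, here is a rough sketch of forward propagation through two hidden layers, where each layer's output becomes the next layer's input. The function name `forward`, the network shape, and every weight value are assumptions chosen for illustration.

```python
import numpy as np

def relu(values):
    return np.maximum(0, values)

def forward(input_data, layer_weights):
    """Pass one data point through each hidden layer, then a linear output layer."""
    activations = input_data
    for weight_matrix in layer_weights[:-1]:
        # Multiply-add for the whole layer, followed by the activation function
        activations = relu(weight_matrix.dot(activations))
    return layer_weights[-1].dot(activations)

# Two hidden layers of two nodes each, then one output node (assumed weights)
layer_weights = [
    np.array([[2, 4], [4, -5]]),   # input -> hidden layer 1
    np.array([[-1, 2], [1, 2]]),   # hidden layer 1 -> hidden layer 2
    np.array([2, 7]),              # hidden layer 2 -> output
]

print(forward(np.array([3, 5]), layer_weights))  # 182
```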

#### Deep learning
* Modeler doesn’t need to specify the interactions
* When you train the model, the neural network gets weights that find the relevant patterns to make better predictions

## The need for optimization

#### Predictions with multiple points
* Making accurate predictions gets harder with more points
* At any set of weights, there are many values of the error
    * ... corresponding to the many points we make predictions for
    
#### Loss function
* Aggregates errors in predictions from many data points into single number
* Measure of model’s predictive performance
* Lower loss function value means a better model
* __Goal: find the weights that give the lowest value for the loss function__
* __Method: gradient descent__
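
As a small illustration of how a loss function collapses many per-point errors into one number, the sketch below computes mean squared error for two hypothetical sets of predictions. The targets and predictions are made-up values.

```python
import numpy as np

def mean_squared_error(predictions, targets):
    """Average of the squared prediction errors over all data points."""
    errors = np.asarray(predictions) - np.asarray(targets)
    return np.mean(errors ** 2)

# Hypothetical targets and predictions from two candidate sets of weights
targets = np.array([10, 8, 6])
predictions_a = np.array([9, 8, 7])    # errors: -1, 0, 1
predictions_b = np.array([6, 8, 10])   # errors: -4, 0, 4

mse_a = mean_squared_error(predictions_a, targets)
mse_b = mean_squared_error(predictions_b, targets)
print(mse_a, mse_b)
```

The first set of predictions has the lower loss, so under this measure its weights are the better model.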

#### Gradient descent
* Imagine you are in a pitch dark field
* Want to find the lowest point
* Feel the ground to see how it slopes
* Take a small step downhill
* Repeat until it is uphill in every direction

#### Gradient descent steps
* Start at random point
* Until you are somewhere flat:
    * Find the slope
    * Take a step downhill
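
The steps above can be sketched on a one-weight toy problem. The loss `(w - 3)^2`, the starting point, the step size, and the flatness threshold are all assumptions picked to keep the example short.

```python
def loss(w):
    return (w - 3.0) ** 2

def slope(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = -4.0          # arbitrary starting point
step_size = 0.1
n_steps = 0

# Until we are somewhere (nearly) flat: find the slope, take a step downhill
while abs(slope(w)) > 1e-6 and n_steps < 1000:
    w -= step_size * slope(w)
    n_steps += 1

print(round(w, 4), n_steps)
```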

--- 
### Gradient descent

* If the slope is positive:
    * Going opposite the slope means moving to lower numbers
    * Subtract the slope from the current value
    * Too big a step might lead us astray
    
* Solution: learning rate
    * Update each weight by subtracting (__learning rate * slope__)
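
To see why the learning rate matters, the sketch below runs the same number of updates with three learning rates on an assumed toy loss `(w - 3)^2`. The function name `run_gradient_descent` and the specific rates are illustrative only.

```python
def run_gradient_descent(learning_rate, start=0.0, n_updates=20):
    """Minimize (w - 3)^2 and return the final weight."""
    w = start
    for _ in range(n_updates):
        slope = 2.0 * (w - 3.0)
        # Update the weight by subtracting learning rate * slope
        w = w - learning_rate * slope
    return w

distances = {}
for lr in [0.001, 0.1, 1.1]:
    distances[lr] = abs(run_gradient_descent(lr) - 3.0)
    print(lr, distances[lr])
```

With these assumed settings, the smallest rate barely moves from the start, the middle rate gets close to the minimum, and the largest rate overshoots further on every step.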

#### Slope calculation example
    3 --- 2 ---> 6 : Actual target value = 10

* To calculate the slope for a weight, multiply:
    * Slope of the loss function w.r.t. the value at the node we feed into
        * Slope of the mean-squared loss function w.r.t. the prediction: 2 * error = 2 * (-4)
    * The value of the node that feeds into our weight
        * Here that value is 3
    * Slope of the activation function w.r.t. the value we feed into
        * Here it is 1, because the output node applies no activation function
* Slope for the weight: 2 * (-4) * 3 = -24
* If the learning rate is 0.01, the new weight is 2 - 0.01 * (-24) = 2.24
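
The arithmetic in this example can be checked with a few lines of code. The variable names are chosen for illustration; the numbers are the ones from the example.

```python
input_value = 3       # value of the node feeding into the weight
weight = 2
target = 10
learning_rate = 0.01

prediction = input_value * weight        # 6
error = prediction - target              # -4

# Slope of the squared-error loss w.r.t. the weight (no activation function here)
slope = 2 * error * input_value
new_weight = weight - learning_rate * slope

print(slope)                 # -24
print(round(new_weight, 2))  # 2.24
```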

```python
weights = np.array([1, 2])
input_data = np.array([3, 4])
target = 6
learning_rate = 0.01
preds = (weights * input_data).sum()
error = preds - target
print(error)  # 5
```

```python
gradient = 2 * input_data * error
print(gradient)  # [30 40]
```

```python
weights_updated = weights - learning_rate * gradient
preds_updated = (weights_updated * input_data).sum()
error_updated = preds_updated - target
print(error_updated)  # 2.5
```


### Example: Making multiple updates to weights

You're now going to make multiple updates so you can dramatically improve your model weights, and see how the predictions improve with each update.

To keep your code clean, there is a pre-loaded _get_slope()_ function that takes _input_data_, _target_, and _weights_ as arguments. There is also a _get_mse()_ function that takes the same arguments. The _input_data_, _target_, and _weights_ have been pre-loaded.

This network does not have any hidden layers, and it goes directly from the input (with 3 nodes) to an output node. Note that weights is a single array.

We have also pre-loaded _matplotlib.pyplot_, and the error history will be plotted after you have done your gradient descent steps.

```python 
n_updates = 20
mse_hist = []

# Iterate over the number of updates
for i in range(n_updates):
    # Calculate the slope: slope
    slope = get_slope(input_data, target, weights)
    
    # Update the weights: weights
    weights = weights - 0.01 * slope
    
    # Calculate mse with new weights: mse
    mse = get_mse(input_data, target, weights)
    
    # Append the mse to mse_hist
    mse_hist.append(mse)

# Plot the mse history
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show()
```
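
The loop above relies on pre-loaded helpers whose definitions are not shown. One plausible version, assuming a single data point and a squared-error loss, is sketched below; the helper `get_error` and the data values are assumptions, and the real pre-loaded functions may differ.

```python
import numpy as np

def get_error(input_data, target, weights):
    """Prediction error for a network with no hidden layers."""
    preds = (weights * input_data).sum()
    return preds - target

def get_slope(input_data, target, weights):
    """Slope of the squared error w.r.t. each weight."""
    return 2 * input_data * get_error(input_data, target, weights)

def get_mse(input_data, target, weights):
    """Squared error for the single data point."""
    return get_error(input_data, target, weights) ** 2

# Assumed values for illustration
input_data = np.array([1.0, 2.0, 3.0])
target = 0.0
weights = np.array([0.0, 2.0, 1.0])

mse_hist = []
for _ in range(20):
    weights = weights - 0.01 * get_slope(input_data, target, weights)
    mse_hist.append(get_mse(input_data, target, weights))

print(mse_hist[0], mse_hist[-1])
```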

---
## Backpropagation

* Allows gradient descent to update all weights in neural network (by getting gradients for all weights)
* Comes from chain rule of calculus
* Important to understand the process, but you will generally use a library that implements this

#### Backpropagation process

* Trying to estimate the slope of the loss function w.r.t each weight
* Do forward propagation to calculate predictions and errors

* Go back one layer at a time
* The gradient for a weight is the product of:
    1. Node value feeding into that weight
    2. Slope of loss function w.r.t node it feeds into
    3. Slope of activation function at the node it feeds into


* Need to also keep track of the slopes of the loss function w.r.t node values
* The slope for a node value is the sum of the slopes for all the weights that come out of it
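
As a rough sketch of this process, the code below computes the gradients for a tiny network with two inputs, two ReLU hidden nodes, and one linear output node, using a squared-error loss. The function names, weights, and data point are assumptions made for illustration; libraries such as keras do this for you.

```python
import numpy as np

def forward(x, w_hidden, w_out):
    hidden_in = w_hidden.dot(x)              # inputs to the hidden nodes
    hidden_out = np.maximum(0, hidden_in)    # ReLU activation
    return hidden_in, hidden_out, w_out.dot(hidden_out)

def backprop(x, target, w_hidden, w_out):
    hidden_in, hidden_out, pred = forward(x, w_hidden, w_out)
    # Slope of the squared-error loss w.r.t. the prediction
    d_pred = 2 * (pred - target)
    # Output weights: node value feeding in, times slope at the node fed into
    grad_out = hidden_out * d_pred
    # Go back one layer: slope of the loss w.r.t. each hidden node's output
    d_hidden_out = w_out * d_pred
    # Slope of ReLU is 1 where the node input is positive, 0 otherwise
    d_hidden_in = d_hidden_out * (hidden_in > 0)
    # Hidden weights: input value feeding in, times slope at the hidden node
    grad_hidden = np.outer(d_hidden_in, x)
    return grad_hidden, grad_out

# Assumed data point and weights
x = np.array([1.0, 2.0])
target = 4.0
w_hidden = np.array([[0.5, 1.0], [-1.0, 0.25]])
w_out = np.array([2.0, -1.0])

grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
print(grad_hidden)
print(grad_out)
```

The second hidden node receives a negative input here, so the ReLU slope is 0 and all gradients flowing through it are 0.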

#### Backpropagation: Recap

* Start at some random set of weights
* Use forward propagation to make a prediction
* Use backward propagation to calculate the slope of the loss function w.r.t each weight
* Multiply that slope by the learning rate, and subtract from the current weights
* Keep going with that cycle until we get to a flat part

#### Stochastic gradient descent

* It is common to calculate slopes on only a subset of the data (‘batch’)
* Use a different batch of data to calculate the next update
* Start over from the beginning once all data is used
* Each time through the training data is called an epoch
* When slopes are calculated on one batch at a time: stochastic gradient descent
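
A minimal sketch of this batching scheme on a linear model with no hidden layers is shown below. The synthetic data, batch size, learning rate, and the choice to reshuffle at the start of every epoch are assumptions, not details from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed): target = 2*x0 - 1*x1 plus small noise
predictors = rng.normal(size=(200, 2))
target = predictors.dot(np.array([2.0, -1.0])) + 0.01 * rng.normal(size=200)

weights = np.zeros(2)
learning_rate = 0.05
batch_size = 20

for epoch in range(10):                         # one epoch = one pass over the data
    order = rng.permutation(len(predictors))    # assumed: reshuffle every epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        x_batch, y_batch = predictors[batch], target[batch]
        error = x_batch.dot(weights) - y_batch
        # Slope of the mean squared error computed on this batch only
        slope = 2 * x_batch.T.dot(error) / len(batch)
        weights = weights - learning_rate * slope

print(weights)
```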

***
## Creating a keras model

#### Model building steps

* Specify Architecture
* Compile
* Fit
* Predict

#### Model specification

```python

import numpy as np
from keras.layers import Dense
from keras.models import Sequential

predictors = np.loadtxt('predictors_data.csv', delimiter=',')

n_cols = predictors.shape[1]
model = Sequential()

model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(1))
```

#### Compiling a model

* Specify the optimizer
* Many options and mathematically complex
    * “Adam” is usually a good choice
* Loss function
    * “mean_squared_error” common for regression
    
```python
model.compile(optimizer='adam', loss='mean_squared_error')
```

#### Fitting a model

* Applying backpropagation and gradient descent with your data to update the weights
* Scaling data before fitting can ease optimization

```python
model.fit(predictors, target)
```
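
One common way to scale data before fitting is to standardize each column to zero mean and unit standard deviation. The sketch below uses plain NumPy on made-up predictor values; the column meanings are hypothetical.

```python
import numpy as np

# Hypothetical predictors on very different scales (e.g. age, account balance)
predictors = np.array([[25.0, 1200.0],
                       [40.0, 56000.0],
                       [31.0, 8300.0],
                       [58.0, 150000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
col_means = predictors.mean(axis=0)
col_stds = predictors.std(axis=0)
predictors_scaled = (predictors - col_means) / col_stds

print(predictors_scaled.mean(axis=0).round(6))
print(predictors_scaled.std(axis=0).round(6))
```

The same `col_means` and `col_stds` would then be reused to scale any new data passed to `predict`.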

---

## Classification models

* `'categorical_crossentropy'` loss function
* Similar to log loss: lower is better
* Add `metrics=['accuracy']` to the compile step for easy-to-understand diagnostics
* Output layer has a separate node for each possible outcome, and uses `'softmax'` activation

```python

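import pandas as pd
from keras.utils import to_categorical

data = pd.read_csv('basketball_shot_log.csv')

# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
predictors = data.drop(['shot_result'], axis=1).to_numpy()
n_cols = predictors.shape[1]
target = to_categorical(data.shot_result)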

model = Sequential()
model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(predictors, target)
```
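
To make the softmax output and the loss more concrete, here is a NumPy sketch of what they compute. It is not how keras implements them internally; the function names, labels, and raw scores are assumptions. The one-hot matrix has the same layout that `to_categorical` produces for integer labels.

```python
import numpy as np

def softmax(scores):
    """Turn raw output-node scores into probabilities that sum to 1 per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probabilities):
    """Mean negative log-probability assigned to the true class."""
    true_class_probs = (one_hot_targets * probabilities).sum(axis=1)
    return -np.mean(np.log(true_class_probs))

# Hypothetical labels (0 = miss, 1 = made shot), one-hot encoded
labels = np.array([0, 1, 1])
one_hot = np.eye(2)[labels]

# Hypothetical raw scores from a two-node output layer
scores = np.array([[2.0, 0.0], [0.5, 1.5], [3.0, -1.0]])
probs = softmax(scores)
loss = categorical_crossentropy(one_hot, probs)

print(probs.sum(axis=1))
print(loss)
```

The third row gives a low probability to the true class, which is what drives most of the loss in this example.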

#### Using models

* Save
* Reload
* Make predictions

```python
from keras.models import load_model

model.save('model_file.h5')

my_model = load_model('model_file.h5')

predictions = my_model.predict(data_to_predict_with)
probability_true = predictions[:,1]

# Verifying model structure
my_model.summary()
```

***
## Understanding model optimization

#### Why optimization is hard
* Simultaneously optimizing 1000s of parameters with complex relationships
* Updates may not improve model meaningfully
* Updates too small (if learning rate is low) or too large (if learning rate is high)

```python
from keras.optimizers import SGD

input_shape = (n_cols,)

def get_new_model(input_shape=input_shape):
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape=input_shape))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return model

lr_to_test = [.000001, 0.01, 1]

# Loop over learning rates, training a fresh model for each one
for lr in lr_to_test:
    model = get_new_model()
    # Newer Keras releases name this argument learning_rate instead of lr
    my_optimizer = SGD(lr=lr)
    model.compile(optimizer=my_optimizer,
                  loss='categorical_crossentropy')
    model.fit(predictors, target)
```
#### The dying neuron problem

* When a node with ReLU activation starts receiving only negative inputs
    * It outputs 0 with a slope of 0, so its weights stop updating and it may keep receiving only negative inputs
* The node then contributes nothing to the model (a "dead" neuron)
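
A small sketch of why a dead neuron stays dead: when every input to a ReLU node is negative, both its output and the slope used to update its weights are zero. The data and weights below are assumptions chosen so that all node inputs are negative.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_slope(x):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

# Assumed: three data points with positive features, and negative weights
inputs = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 1.0]])
weights = np.array([-1.0, -0.5])

node_inputs = inputs.dot(weights)       # negative for every data point
node_outputs = relu(node_inputs)

# Slope of the node output w.r.t. each weight, per data point
weight_slopes = relu_slope(node_inputs)[:, None] * inputs

print(node_inputs)
print(node_outputs)
print(weight_slopes)
```

Because every slope is zero, gradient descent has nothing to push these weights with.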

#### Vanishing gradients

* Occurs when many layers have very small slopes (e.g. due to being on flat part of tanh curve)
* In deep networks, the weight updates computed by backprop can then be close to 0
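
A simplified illustration: backprop multiplies activation slopes layer by layer, so many small slopes shrink the gradient quickly. The sketch below ignores the weights in that product and assumes every layer sits at the same point on the tanh curve, which is a deliberate simplification.

```python
import numpy as np

def tanh_slope(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

node_input = 3.0   # assumed: a value on the flat part of the tanh curve

for n_layers in [1, 5, 10]:
    # Product of the activation slopes across n_layers layers
    print(n_layers, tanh_slope(node_input) ** n_layers)
```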

### Validation in deep learning

* Commonly use validation split rather than cross-validation
* Deep learning widely used on large datasets
* Single validation score is based on large amount of data, and is reliable
* Repeated training from cross-validation would take a long time

```python
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
model.fit(predictors, target, validation_split=0.3)

# ====== Early Stopping ======= #
from keras.callbacks import EarlyStopping
# patience sets how many epochs the model can go without improving before training stops
early_stopping_monitor = EarlyStopping(patience=2)
model.fit(predictors, target, validation_split=0.3, epochs=20,
          callbacks = [early_stopping_monitor])
```
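
The idea behind `patience` can be sketched in plain Python. This is a simplified illustration of the stopping rule, not the keras implementation; the function name and the validation losses are hypothetical.

```python
def epochs_run_with_early_stopping(val_losses, patience=2):
    """Return how many epochs run before stopping, given per-epoch validation losses."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            # New best score: reset the counter
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses, one per epoch
val_losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.55, 0.54]

print(epochs_run_with_early_stopping(val_losses, patience=2))  # 5
print(epochs_run_with_early_stopping(val_losses, patience=3))  # 7
```

With a patience of 2, training stops before the later improvement at epoch 6 is ever seen, which is the trade-off this setting controls.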

### Example: Adding layers to a network

You've seen how to experiment with wider networks. In this exercise, you'll try a deeper network (more hidden layers).

Once again, you have a baseline model called model_1 as a starting point. It has 1 hidden layer, with 50 units. You can see a summary of that model's structure printed out. You will create a similar network with 3 hidden layers (still keeping 50 units in each layer).

This will again take a moment to fit both models, so you'll need to wait a few seconds to see the results after you run your code.

```python
# The input shape to use in the first hidden layer
input_shape = (n_cols,)

# Create the new model: model_2
model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(50, activation='relu', input_shape=input_shape))
model_2.add(Dense(50, activation='relu'))
model_2.add(Dense(50, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model 1
model_1_training = model_1.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.show()
```


### Thinking about model capacity

* Pay attention to overfitting

#### Workflow for optimizing model capacity
* Start with a small network
* Gradually increase capacity
* Keep increasing capacity until validation score is no longer improving

#### Recognizing handwritten digits

* MNIST dataset
* 28 x 28 grid flattened to 784 values for each image
* Value in each part of array denotes darkness of that pixel
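
The flattening step can be sketched with NumPy's `reshape`. The random array below is only a stand-in for real MNIST images, and the 0 to 255 pixel range is an assumption.

```python
import numpy as np

# Stand-in for a batch of images: 5 random 28 x 28 grids with pixel values 0-255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))

# Flatten each grid row by row into a single vector of 784 values
X = images.reshape(len(images), 28 * 28)

print(X.shape)  # (5, 784)
```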


### Example: Building your own digit recognition model

You've reached the final exercise of the course - you now know everything you need to build an accurate model to recognize handwritten digits!

We've already done the basic manipulation of the MNIST dataset shown in the video, so you have X and y loaded and ready to model with. Sequential and Dense from keras are also pre-imported.

To add an extra challenge, we've loaded only 2500 images, rather than the 60000 that you will see in some published results. Deep learning models perform better with more data; however, they also take longer to train, especially as they become more complex.

```python
# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(784,)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(X, y, validation_split=0.3)
```