#Lab 5 - Deep Learning

##Introduction
In this lab we will cover some basic deep learning in Python. While Labs 1-4 used the Python package **scikit-learn** for most of the Machine Learning algorithms, it wasn't designed with deep learning applications in mind. **Keras** is another Python package widely used for just such purposes. It provides a high-level coding interface to the typically hard-to-approach **TensorFlow** deep learning platform built by Google but can function in its own right too.

Here is some additional information about each of these packages. You can also get acquainted with some of their features and capabilities that we won't necessarily have time to cover in class:

* **[Keras](https://keras.io/)**
* **[TensorFlow](https://www.tensorflow.org/)**

##Revisiting Machine Learning vs. Deep Learning
Machine Learning and Deep Learning aren't that different and are technically trying to accomplish the same goals. Remember that Deep Learning is considered a subfield of Machine Learning anyway:

![alt text](https://qph.fs.quoracdn.net/main-qimg-cf42db79eb79239884a29568fcc24002-c)

From a prediction standpoint, both ML and DL take inputs (i.e. data) and outcomes (i.e. some target feature we'd like to predict) and try to map the former to the latter using statistics, probability, and mathematics. The "learning" part of these algorithms is derived from how they look at our data in different ways called "representations." Representations are simply ways we transform our data -- something simple like rotating it about an axis or something complex like breaking an image into its constituent RGB colors.
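
To make the idea of a representation concrete, here is a minimal sketch using NumPy. The 2x2 "image" and the plain-average grayscale conversion are made up for illustration; they are not part of the lab's data.

```python
import numpy as np

# A tiny 2x2 "image": each pixel holds (red, green, blue) intensities in 0-255.
image = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Representation 1: split the image into its three color channels.
red, green, blue = image[..., 0], image[..., 1], image[..., 2]

# Representation 2: collapse the channels into one grayscale value per pixel
# (a plain average here, chosen only for simplicity).
gray = image.mean(axis=-1)

print(red)
print(gray)
```

Both `red` and `gray` describe the same picture, just viewed in a different way. A learning algorithm's job is to find the views that make prediction easiest.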

Why don't we just use ML algorithms for everything then? Well, they aren't that good at learning more than a couple of representations ("shallow learning"), whereas DL algorithms (almost always some form of Artificial Neural Network, or ANN) can learn 100s, if not 1,000s or 1,000,000s ("deep learning"). That's really useful when trying to solve complex problems but bad when trying to *explain* how we arrived at a particular decision/prediction. As with the human brain, even experts in deep learning have a hard time explaining exactly how ANNs achieve such high-quality predictions!

Let's get started with a simple ANN for predicting the digit shown in an image of a handwritten digit!

##ANN for Digit Recognition
Have you recently deposited a check at an ATM? If you have, then you've probably seen how the ATM can "read" the amount written on the check and verify it with you. This is an example of how we can "teach" a computer to understand what makes a "2" a two and a "5" a five. We will do the same with our first ANN.

In [0]:
#Load our packages
from keras.datasets import mnist
from keras import models 
from keras import layers
from keras.utils import to_categorical
from keras.datasets import boston_housing
import numpy as np

In [0]:
#Put training/testing images/labels into objects and inspect
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print('Training size=', train_images.shape)
print('Testing size=', test_images.shape)

The Keras package comes with a built-in data set commonly used to test ANNs called MNIST for "Modified National Institute of Standards and Technology." This is a set of 70,000 images of handwritten digits -- 60,000 to train our ANN and 10,000 to test its performance. Simply put: we "feed" our ANN training images, it "learns" representations that associate each image with the number it is labeled with, then we ask for predictions. Rinse and repeat!

![alt text](https://cdn-images-1.medium.com/max/584/1*9Mjoc_J0JR294YwHGXwCeg.jpeg)

Representations are learned within each layer, and each consecutive layer learns better and better representations that combine what was learned in previous layers. Eventually, that is: remember that at first our ANN's **weights** and **biases** are randomized, and these are ultimately what make learning happen. The **backpropagation** process works out how much each of them contributed to the prediction error so that the optimizer can update their values to improve predictions.
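
As a rough illustration of weights and biases starting out random and being nudged toward better values, here is a minimal sketch of a single "neuron" trained with plain gradient descent in NumPy. The toy data, the learning rate, and the number of steps are all assumptions chosen for the example; Keras handles these details for us in a real ANN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data following y = 3x + 1; the "network" has to discover 3 and 1.
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 1.0

# Start from random parameters, as a freshly created layer would.
w, b = rng.normal(size=2)
learning_rate = 0.1

for step in range(500):
    pred = w * x + b                   # forward pass
    error = pred - y
    grad_w = 2.0 * np.mean(error * x)  # derivative of the MSE with respect to w
    grad_b = 2.0 * np.mean(error)      # derivative of the MSE with respect to b
    w -= learning_rate * grad_w        # move against the gradient
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

With only one weight and one bias the gradients can be written by hand. Backpropagation is the bookkeeping that computes the same kind of gradients for every weight and bias in a multi-layer network.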

Let's build our ANN:

In [0]:
#Create ANN architecture
network = models.Sequential() 
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,))) 
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop', 
                loss='categorical_crossentropy', 
                metrics=['accuracy'])

train_images = train_images.reshape((60000, 28 * 28)) 
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)) 
test_images = test_images.astype('float32') / 255

train_labels = to_categorical(train_labels) 
test_labels = to_categorical(test_labels)

Let's walk through each line of the above code:

1. `network = models.Sequential()` tells Keras we are going to build our ANN as a simple stack of layers, where each layer feeds its output to the next one.
2. `network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))` creates the first layer. `Dense` means the layer is **fully-connected** with `512` nodes all using the `relu` activation function. The parameter `input_shape` indicates the size of each input: one 28x28 pixel digit image flattened into a vector of 784 values.
3. `network.add(layers.Dense(10, activation='softmax'))` creates our output layer. We chose `10` nodes because we have 10 possible classes our target can take -- digits 0-9. The activation function `softmax` is used when we have a multi-class classification problem. It transforms the outputs of the 10 nodes into probabilities that sum to 1 across all 10 potential classifications 0-9.
4. `network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])` tells our ANN how to learn and optimize parameters. We can use different values for `optimizer` but `rmsprop` is fairly standard. We use the `loss` parameter to indicate our cost/loss function -- in this case categorical cross-entropy since we have a multi-class classification problem. Finally, `metrics` simply specifies the measure(s) we will use to determine how well our model works.
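5. `train_images = train_images.reshape((60000, 28 * 28))`  
`train_images = train_images.astype('float32') / 255`  
`test_images = test_images.reshape((10000, 28 * 28))`  
`test_images = test_images.astype('float32') / 255`  
These transform our data into the format our ANN is expecting: each 28x28 image is flattened into a row of 784 values, and the pixel values (originally integers from 0 to 255) are converted to floating point numbers between 0 and 1.
6. `train_labels = to_categorical(train_labels)`  
`test_labels = to_categorical(test_labels)`  
This converts our target labels (i.e. "0-9") into categories: each label becomes a vector of 10 values with a 1 in the position of the correct digit and 0s everywhere else (often called one-hot encoding).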
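
The building blocks in the walkthrough above can be sketched in a few lines of NumPy. The functions `relu`, `softmax`, and `one_hot` below are hand-rolled stand-ins written for illustration, not the Keras implementations, and the "images" are fake arrays rather than MNIST digits.

```python
import numpy as np

def relu(z):
    # Negative values become 0; positive values pass through unchanged.
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def one_hot(labels, num_classes):
    # Row i gets a 1 in the column given by labels[i] and 0 elsewhere.
    encoded = np.zeros((len(labels), num_classes), dtype='float32')
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# Two fake 28x28 "images" with pixel values 0-255, flattened and rescaled.
fake_images = np.full((2, 28, 28), 255, dtype='uint8')
flat = fake_images.reshape((2, 28 * 28)).astype('float32') / 255

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(relu(np.array([-1.5, 0.0, 2.0])))
print(probs, probs.sum())
print(one_hot(np.array([3, 0]), num_classes=10))
print(flat.shape, flat.min(), flat.max())
```

Notice that the largest input to `softmax` receives the largest probability, and that the one-hot row for the label `3` has its single 1 in position 3.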

Now let's fit our model and see how we do!

In [0]:
#Fit our ANN and evaluate accuracy
network.fit(train_images, 
            train_labels, 
            epochs=5, 
            batch_size=128)

test_loss, test_acc = network.evaluate(test_images, test_labels)
print('Testing accuracy=', test_acc)
print('Testing loss=', test_loss)

Not bad at **98%** (scores may vary due to randomization)! Here are two parameters we haven't seen yet: **epochs** and **batch_size**. One epoch is simply one full pass of our training data set through the ANN. Remember that ANNs send input data through in small **batches** so that with each successive batch we can improve our model a bit and try with another batch. Once we've passed all batches through, we call that an epoch. Go ahead and adjust the `epochs` and `batch_size` parameters to see if you can improve the model. You can also set the cursor inside a blank space between the parentheses of `network.fit()`, press the Tab key, and see what other parameters you can pass to it. Careful though, you could break it!
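
To see how epochs and batches relate, here is a quick piece of arithmetic using the numbers from the MNIST example. It assumes the final, smaller batch is kept rather than dropped.

```python
import math

num_training_images = 60000
batch_size = 128
epochs = 5

# The last batch may be smaller than batch_size, hence the ceiling.
batches_per_epoch = math.ceil(num_training_images / batch_size)
last_batch_size = num_training_images - (batches_per_epoch - 1) * batch_size
total_weight_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_weight_updates)
# prints: 469 96 2345
```

A smaller `batch_size` means more (but noisier) updates per epoch, while a larger one means fewer updates that each see more data.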

###ANN Math
Briefly, ANNs rely on a *lot* of complex mathematics including linear algebra/matrix operations and differential calculus. We don't have time to cover that, but just remember that under the hood of a running ANN there are a significant number of calculations being made, which again is why ANNs are sometimes called "black box" algorithms -- it's really hard to explain how they work!
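
For a sense of scale, here is a rough NumPy sketch of a single forward pass through a network with the same layer sizes as our MNIST ANN (784 inputs, 512 hidden nodes, 10 outputs). The weights and the input batch are random placeholders, so the output probabilities are meaningless; the point is only the matrix shapes and the parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized parameters with the same shapes as the MNIST network.
W1, b1 = rng.normal(scale=0.01, size=(784, 512)), np.zeros(512)
W2, b2 = rng.normal(scale=0.01, size=(512, 10)), np.zeros(10)

# A batch of 128 fake flattened images with values in [0, 1).
batch = rng.random((128, 784))

hidden = np.maximum(batch @ W1 + b1, 0.0)   # Dense(512) followed by relu
logits = hidden @ W2 + b2                   # Dense(10) before softmax
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)   # softmax per row

num_parameters = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, probs.shape, num_parameters)
```

Even this small two-layer network has 407,050 weights and biases, every one of which is adjusted during training.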

##ANN for Boston Home Prices
Let's return to our previous data set. Keras also comes with the **Boston Housing Price** data set built into it. We first used linear regression to predict the median home value (**MEDV**) for homes in Boston census tracts (each observation was the average of homes within one tract). Will we get better results with an ANN?

In [0]:
#Put training/testing data into objects and inspect
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
print('Training size=', train_data.shape)
print('Testing size=', test_data.shape)

Our training set contains 404 observations and 13 features while our testing set contains 102 observations and 13 features, just like last time. Rather than just loading in our data as-is, we should make an adjustment first. This is something we didn't talk about in the lectures but it is important in real-life applications of both ML and DL. **Normalization** is where we take our input features and rescale them so that they are all on a comparable scale. With the approach used here, each feature ends up centered on 0 with a standard deviation of 1. The reason we do this is so that they are all *competing* at the same level and no one variable intrinsically dominates the others because it takes on a much wider range of values (and therefore has a greater variance).

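Normalization is simple. We calculate the mean of a numeric feature, subtract that mean from each value, then divide by the standard deviation (which, remember, is a measure of how far values typically fall from the mean within a feature). Note that the test data are scaled using the mean and standard deviation of the *training* data, so that nothing about the test set leaks into our model.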
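
As a small worked example of this calculation, the sketch below scales two made-up features that live on very different scales. The numbers are invented for illustration and have nothing to do with the Boston data.

```python
import numpy as np

# Two made-up features on very different scales (3 observations each).
train = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])
test = np.array([[2.0, 400.0]])

mean = train.mean(axis=0)   # per-feature mean
std = train.std(axis=0)     # per-feature standard deviation

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, both training features have a mean of 0 and a standard deviation of 1, so values below the mean become negative and values above it become positive.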

In [0]:
#Normalize
mean = train_data.mean(axis=0) 
train_data -= mean 
std = train_data.std(axis=0) 
train_data /= std
test_data -= mean 
test_data /= std

Now let's build our model like we did for the MNIST data set, only this time we are going to set it up as a function in Python. The reason for this is that, compared to our MNIST data which had 60,000 training examples, we now only have 404. With so few observations we could easily overfit our model, so we will want to build and train a fresh copy of the ANN several times, and a function makes that easy.

In [0]:
#Define ANN function
def build_model(): 
    model = models.Sequential() 
    model.add(layers.Dense(64, 
                           activation='relu', 
                           input_shape=(train_data.shape[1],)))
    
    model.add(layers.Dense(64, activation='relu')) 
    model.add(layers.Dense(1)) 
    model.compile(optimizer='rmsprop', 
                  loss='mse', 
                  metrics=['mae']) 
    
    return model

This model differs from the MNIST one because it has 3 layers rather than 2. We are also changing our cost/loss function and overall quality measure to **Mean Squared Error** and **Mean Absolute Error**, respectively. Why do you think that is? (NOTE: Mean absolute error is calculated just like mean squared error, only instead of squaring the differences between the predictions and the actual outcomes, we just take the absolute value. This means the resulting score is in the same units as the target we are trying to predict.)

Also notice that our final `Dense` layer is just a single node compared to the `10` nodes in our MNIST ANN. Why so?
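
To make the two error measures mentioned above concrete, here is a short worked example with made-up home values (in thousands of dollars). None of these numbers come from the Boston data.

```python
import numpy as np

# Made-up median home values in thousands of dollars.
actual = np.array([20.0, 25.0, 30.0, 50.0])
predicted = np.array([22.0, 24.0, 27.0, 40.0])

errors = predicted - actual
mse = np.mean(errors ** 2)      # units: (thousands of dollars) squared
mae = np.mean(np.abs(errors))   # units: thousands of dollars

print(mse, mae)
# prints: 28.5 4.0
```

The single large miss of 10 dominates the mean squared error, while the mean absolute error of 4.0 reads directly as "off by $4,000 on average."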

Let's move on. It was mentioned that we only have a small number of observations to build our ANN this time compared to the last. This means our ANN model could suffer from high *variance*, meaning that if we were to take another set of similar inputs to build our model, the model itself could change quite a bit. Therefore we are going to use **cross validation** to build several ANNs and average their error scores. Remember that ANNs train using batches of data. Adding cross validation builds on that idea because now we are also training on a different subset of the data in each iteration.

Let's build a **k-fold cross validation** routine into our ANN training. (Before running this next code block, you might want to try using a GPU backend. Go to "Runtime --> Change Runtime Type" and select "GPU" then save.)



In [0]:
#Build 5-fold cross validation
k=5
num_val_samples = len(train_data) // k
epochs = 100 
all_scores = []

for i in range(k): 
    print('processing fold #', i) 
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples] 
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    
    partial_train_data = np.concatenate( 
        [train_data[:i * num_val_samples], 
         train_data[(i + 1) * num_val_samples:]], 
         axis=0)
    
    partial_train_targets = np.concatenate( 
        [train_targets[:i * num_val_samples], 
         train_targets[(i + 1) * num_val_samples:]], 
         axis=0)
    
    model = build_model() 
    
    model.fit(partial_train_data, 
              partial_train_targets, 
              epochs=epochs, 
              batch_size=1, 
              verbose=0)
    
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0) 
    all_scores.append(val_mae)

print(k,'-fold ANN CV mean absolute error=',np.mean(all_scores))

To quickly summarize what is happening: before the for-loop we set our number of folds `k` to `5`, divide the number of training observations by `k` to get the size of each fold, set the number of epochs (the number of times we pass the full data set through the ANN to train), then create an empty list to store each fold's mean absolute error in.

Within the for-loop we build 5 ANNs, each using 4 of our folds to train and 1 to validate, rotating them each iteration. Each model's mean absolute error is stored in `all_scores`, which we then take the mean of and print.
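
The fold rotation described above can be traced without training anything. This sketch only reproduces the index arithmetic for 404 observations and 5 folds; the printed wording is our own.

```python
num_samples = 404   # size of the Boston training set
k = 5
num_val_samples = num_samples // k

folds = []
for i in range(k):
    start, stop = i * num_val_samples, (i + 1) * num_val_samples
    folds.append((start, stop))
    print('fold', i, 'validates on rows', start, 'to', stop - 1,
          'and trains on the other', num_samples - (stop - start), 'rows')

# Rows past the last fold boundary are never used for validation.
never_validated = num_samples - k * num_val_samples
print('rows never validated on:', never_validated)
```

Because 404 is not evenly divisible by 5, each fold holds 80 rows and the last 4 rows always end up in the training portion.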

Although results may vary due to randomization, the mean absolute error on first pass was **2.46**. Since **MEDV** is recorded in thousands of dollars, this means that our ANN model is off by about $2,460 on average in its prediction of **MEDV**. Is this acceptable to you?



##Your Turn!
Now see what you can do with ANNs! You have two examples and links to the resources you need to operate Keras/TensorFlow. Try with another built-in data set from Keras called **[CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html)** which is another image database similar to MNIST only this time we are trying to classify images of things like dogs and ships. See if you can set things up on your own. Maybe even try a convolutional neural network (CNN) since those are made for computer vision.

Here is the code to load the data set and get you started. Good luck!

In [0]:
from keras.datasets import cifar10

(cf_train_images, cf_train_labels), (cf_test_images, cf_test_labels) = cifar10.load_data()
print('Training size=', cf_train_images.shape)
print('Testing size=', cf_test_images.shape)

**Build a `Sequential` ANN like we did above for the MNIST data set. What parameters did you use and why? Do you have to do anything different this time to set up the ANN correctly?**

**Fit your ANN model and report the results. How many epochs did you use? What about batch size? Do you think we can improve accuracy/loss if we change parameters -- or something else?**