# Introduction to Keras
In this notebook, we begin our introduction to Keras. We will follow Chapter 2 of **Deep Learning with Python** for most of it, with some slight changes.

We will build a simple model to start with, implementing a network just like the one we used in assignment7_prep.

# Make Sure Your Environment Is Set Up Properly
First, make sure that your **kernel** is Python 3.6 (Conda 5.2). This is shown in the upper right-hand corner of this notebook. If it is not, change it by selecting the **Kernel** menu option and navigating to **Change Kernel**.

Next, execute the following cell, using **pip list** to determine if you have the proper Keras and tensorflow packages.   You should have:
1. Keras                              2.2.4
2. tensorflow-gpu                     1.9.0 

In [1]:
!pip list

Package                            Version   
---------------------------------- ----------
absl-py                            0.8.0     
alabaster                          0.7.11    
anaconda-clean                     1.0       
anaconda-client                    1.7.1     
anaconda-navigator                 1.8.7     
anaconda-project                   0.8.2     
appdirs                            1.4.3     
argcomplete                        1.9.4     
asn1crypto                         0.24.0    
astor                              0.8.0     
astroid                            2.0.4     
astropy                            3.0.4     
atomicwrites                       1.1.5     
attrs                              18.1.0    
Automat                            0.7.0     
Babel                              2.6.0     
backcall                           0.1.0     
backports.shutil-get-terminal-size 1.0.0     
backports.weakref                  1.0.post1 
beautifulsoup

You should consider upgrading via the 'pip install --upgrade pip' command.
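Scanning the full `pip list` output by eye is error-prone. As a minimal sketch, the check can be automated by parsing that output. The helper names `parse_pip_list`, `check_versions`, and `installed_packages` are hypothetical, and the expected versions are simply the two listed above.

```python
import subprocess
import sys

# Versions this notebook expects (taken from the list above).
EXPECTED = {"Keras": "2.2.4", "tensorflow-gpu": "1.9.0"}

def parse_pip_list(text):
    """Turn the column output of `pip list` into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row, the dashed separator, and anything malformed.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0]] = parts[1]
    return packages

def check_versions(installed, expected=EXPECTED):
    """Return {name: (found, wanted)} for every package that does not match."""
    lowered = {name.lower(): version for name, version in installed.items()}
    problems = {}
    for name, wanted in expected.items():
        found = lowered.get(name.lower())  # None means "not installed"
        if found != wanted:
            problems[name] = (found, wanted)
    return problems

def installed_packages():
    """Run `pip list` for the current interpreter and parse its output."""
    result = subprocess.run([sys.executable, "-m", "pip", "list"],
                            stdout=subprocess.PIPE, universal_newlines=True)
    return parse_pip_list(result.stdout)

# Small self-contained demonstration on a hand-written sample.
sample = """Package        Version
-------------- -------
Keras          2.2.4
tensorflow-gpu 1.12.0
"""
print(check_versions(parse_pip_list(sample)))
# {'tensorflow-gpu': ('1.12.0', '1.9.0')}
```

In the notebook itself, `check_versions(installed_packages())` should return an empty dictionary when the environment matches the versions above.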


# Modify package environment - if necessary
If the above Keras and tensorflow packages were installed, you are good and can go to the next cell.

If the above Keras and tensorflow packages were **not** installed, do the following:
1. Open up a pitzer terminal window.
2. Load the proper python by executing this: 
     * module load python/3.6-conda5.2
3. Uninstall keras and tensorflow:
     * pip uninstall tensorflow
     * pip uninstall tensorflow-gpu
     * pip uninstall keras
4. Install the proper packages:
     * pip install --user keras==2.2.4
     * pip install --user tensorflow-gpu
5. Come back to this notebook and **restart** the kernel.

Now you can proceed with the rest of this notebook!


## Get the data
We will use our MNIST data sample yet again!   This time, we will use the version that comes along prepackaged with the keras package.

Keras has a small number of datasets included as part of the package (see [here](https://keras.io/datasets/) for more details). These include:
1.  MNIST:  60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.
2.  Reuters newswire topics classification:  11,228 newswires from Reuters, labeled over 46 topics, for text processing and classification. 
3.  CIFAR10 small image classification: Dataset of 50,000 32x32 color training images, labeled over 10 categories, and 10,000 test images.   There is a similar dataset (CIFAR100) with 100 labeled categories.

Below we load the MNIST dataset (both training and test).   We include an option for loading a "short" version to speed things up, but for real studies you should set short to False.

In [2]:
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

short = True
if short:
    train_images = train_images[:7000,:]
    train_labels = train_labels[:7000]
    test_images = test_images[:3000,:]
    test_labels = test_labels[:3000]
#
print("Train info",train_images.shape, train_labels.shape)
print("Test info",test_images.shape, test_labels.shape)


Using TensorFlow backend.


Train info (7000, 28, 28) (7000,)
Test info (3000, 28, 28) (3000,)


## Prepare the feature data
We need to make sure the feature data is:
1. shaped appropriately. Each sample needs to be a 1D vector.
2. normalized. Since we know the pixel values range from 0 to 255, we can just divide each pixel by 255.

In [3]:
train_images = train_images.reshape((train_images.shape[0],28*28))
train_images = train_images.astype('float32')/255

test_images = test_images.reshape((test_images.shape[0],28*28))
test_images = test_images.astype('float32')/255
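
To see what the reshape and the normalization do, here is a small sketch on a made-up batch of two 28x28 images. The array `fake_images` is illustrative only and is not MNIST data.

```python
import numpy as np

# Two made-up 28x28 "images" with integer pixel values in 0..255.
fake_images = np.zeros((2, 28, 28), dtype=np.uint8)
fake_images[0, 0, 0] = 255      # brightest possible pixel
fake_images[1, 27, 27] = 128    # a mid-gray pixel

# Flatten each image into a 1D vector of 784 features.
flat = fake_images.reshape((fake_images.shape[0], 28 * 28))

# Convert to float before dividing so that values land in [0, 1].
scaled = flat.astype('float32') / 255

print(flat.shape)                  # (2, 784)
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Converting to `float32` first matters: the original pixels are stored as integers, and the network expects floating-point inputs.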


## Prepare the label data
The labels run from 0-9, but we need to make them 1-hot.

In [4]:
from keras.utils import to_categorical

train_labels_cat = to_categorical(train_labels)
test_labels_cat = to_categorical(test_labels)
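
For intuition, 1-hot encoding replaces each integer label with a vector that is 1 at the label's index and 0 elsewhere. The NumPy sketch below mimics this by hand; `one_hot` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes), dtype='float32')
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9])
print(example.shape)   # (3, 10)
print(example[0])      # 1 in position 3, zeros elsewhere
```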


## Build the Model
Our model will be just like the one we built from scratch in assignment 7 prep:
1. An input layer, 784 features wide.
2. A hidden layer, 100 "nodes" wide, using the "tanh" activation function.
3. An output layer, 10 nodes wide, using the softmax activation function.

Building this network with keras is quite simple.

In [5]:
from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(100,activation='tanh',input_shape=(28*28,)))
network.add(layers.Dense(10,activation='softmax'))
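
To make the architecture concrete, the sketch below reproduces the same forward pass in plain NumPy with randomly initialized weights, and counts the trainable parameters. The names `W1`, `b1`, `W2`, `b2`, and `forward` are illustrative, and the random initialization is only a stand-in; nothing here describes how Keras stores or initializes its weights internally.

```python
import numpy as np

rng = np.random.RandomState(0)

n_in, n_hidden, n_out = 28 * 28, 100, 10

# Small random weights and zero biases (an assumption made for this sketch).
W1 = rng.normal(scale=0.01, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    """Dense(100, tanh) followed by Dense(10, softmax)."""
    hidden = np.tanh(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

x = rng.uniform(size=(5, n_in))      # five fake flattened images
probs = forward(x)

n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape)   # (5, 10)
print(n_params)      # 79510
```

The parameter count (784x100 + 100 for the hidden layer, 100x10 + 10 for the output layer) should agree with what `network.summary()` reports for the Keras model.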

## Compile the model
Compiling the model is necessary before you can train it.  Compiling configures the learning process, and requires three things:

1.  A loss function. This is the objective that the model will try to minimize. There is a range of choices, which can be examined [here](https://keras.io/losses/).   For classification problems the typical choices are:
    * categorical_crossentropy: used for multi-class classification (like MNIST)
    * binary_crossentropy: used for binary classification (like any one vs all problem)
2.  An optimizer. This controls how the minimum of the loss function is found.   SGD (stochastic gradient descent) is typical, as is Adam (see [here](https://arxiv.org/abs/1412.6980v8) for more details).
3.  A list of metrics. For any classification problem you will want to set this to metrics=['accuracy']. 

Another thing we do below is to save the weights of the compiled network right after we first compile it.   These weights are initialized to some random (and typically small) values.   This will be useful if we end up calling the network in an optimization loop later.  For now, just make sure you do this.

In [6]:

network.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
#
# If we reload this right before fitting the model, the model will start from scratch
network.save_weights('model_init.h5')
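
As a worked example of the loss chosen above: for a single sample with a 1-hot label, categorical cross-entropy reduces to minus the log of the probability the network assigned to the true class, and the reported loss is the average over samples. A minimal pure-Python sketch follows; the function name and the numbers are made up for illustration, and this is not the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Mean over samples of -sum_k y_k * log(p_k)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        # Only classes with a non-zero label contribute to the sum.
        total += -sum(t * math.log(p) for t, p in zip(truth, pred) if t > 0)
    return total / len(y_true)

# Two samples, three classes (kept small for readability).
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [[0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]

loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))   # 0.4581
```

A confident correct prediction (probability near 1 for the true class) contributes a loss near 0, while a confident wrong prediction is penalized heavily.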


## Fitting the model
The "fit" method takes the following arguments:
1.  The input features: in our case this is "train_images".
2.  The output labels: in our case this is the 1-hot "train_labels_cat".
3.  The number of epochs to run.  Remember that an "epoch" is defined as an iteration in which the entire set of training samples has been passed through the network.   We use 100 in the first fit below, but it is important to choose a number large enough that your performance (on the validation set!) converges.   We will find that we might not want to use ALL of the epochs we give to the "fit" method - this is called "early stopping".   More on this below.
4.  The batch size: this is the number of training samples that are passed through the network before the weights are updated.  Note the difference between this and the number of "epochs".  We will use 128 (typically). A good discussion of the issues surrounding batch size and epochs is found [here](https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network).
5.  An **optional** validation set.   This is a set of features and labels that are used to assess the performance of the model during the fit, at the end of each epoch.   Statistics on this (and the training set) are collected and returned when the fit is finished.
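
The difference between batch size and epochs is easiest to see with a little arithmetic. The sketch below counts weight updates for the numbers used in this notebook (5600 training samples after the split, batch size 128, 100 epochs); the helper name `updates_per_epoch` is hypothetical.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training data."""
    return math.ceil(n_samples / batch_size)

n_train = 5600      # training samples left after the 80/20 split
batch_size = 128
epochs = 100

per_epoch = updates_per_epoch(n_train, batch_size)
print(per_epoch)             # 44 (43 full batches plus one partial batch)
print(per_epoch * epochs)    # 4400 updates over the whole fit
```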

The fit returns a **history** object, containing a .history dictionary with the following entries:
*  history.history\['loss'\]: A list of the values of the loss function (evaluated on the training sample) at the end of each epoch, ordered by epoch.
*  history.history\['acc'\]: A list of the values of the accuracy (evaluated on the training sample) at the end of each epoch, ordered by epoch.
*  history.history\['val_loss'\]: A list of the values of the loss function (evaluated on the validation sample) at the end of each epoch, ordered by epoch.  Only returned if a validation sample is supplied.
*  history.history\['val_acc'\]: A list of the values of the accuracy (evaluated on the validation sample) at the end of each epoch, ordered by epoch.  Only returned if a validation sample is supplied.
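
Because each entry is a plain Python list indexed by epoch, the history can be inspected with ordinary list operations. The sketch below uses a hand-made dictionary in place of a real `history.history` (the numbers are illustrative only), and the helper `best_epoch` is hypothetical.

```python
def best_epoch(history_dict, key='val_loss'):
    """Return (epoch_index, value) at which the monitored quantity is lowest."""
    values = history_dict[key]
    index = min(range(len(values)), key=values.__getitem__)
    return index, values[index]

# Hand-made stand-in for history.history (illustrative numbers only).
fake_history = {
    'loss':     [1.00, 0.45, 0.30, 0.22, 0.15],
    'acc':      [0.72, 0.88, 0.91, 0.94, 0.96],
    'val_loss': [0.52, 0.38, 0.33, 0.35, 0.36],
    'val_acc':  [0.86, 0.90, 0.91, 0.91, 0.90],
}

epoch, value = best_epoch(fake_history)
print(epoch, value)   # 2 0.33
```

The returned index is 0-based, so index 2 is the third epoch. Key names can differ between Keras versions (for example `accuracy` rather than `acc`), so it is worth printing `history.history.keys()` before relying on them.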



## Training vs Validation vs Testing
You may have noticed that we introduced a new concept called the "validation" sample.   This is sometimes confused with the "testing" sample, but they are different.

To be clear:
1.  **Training set**: A set of examples used for learning, that is, to **fit** the parameters (weights) of the classifier.

2.  **Validation set**: A set of examples used to **tune** the hyperparameters (for example the number of nodes in the hidden layer) of a classifier.

3.  **Test set**: A set of examples used only to assess the performance of an already fit classifier.

If we do k-fold validation, we typically have *no* separate validation or testing set.   We split the training set up into k folds, train on k-1 of them while validating on the remaining fold, repeat so that each fold serves as the validation set once, and average the results.   We do this for many parameter settings (like the number of hidden nodes) to choose the best one.   Once finished, we retrain our model using the **full** training sample.   Our expected performance is the average performance from the k-fold results (at the parameter setting we chose).
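
As a minimal sketch of the k-fold splitting described above, the generator below produces the index sets for each fold in pure Python. The name `kfold_indices` is hypothetical, and the sketch does no shuffling, which real code would normally add before splitting.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

for train_idx, val_idx in kfold_indices(10, 3):
    print(len(train_idx), len(val_idx))
# 6 4
# 7 3
# 7 3
```

Each sample lands in exactly one validation fold, so every sample is used for validation once and for training k-1 times.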

For this MNIST data sample we will do something slightly different, since we have a large available training set:
1.  We will use the MNIST **training** sample to supply data for our k-fold validation process - meaning this sample will be broken up into training and validation.
2.  We will use the MNIST **testing** sample to test our fully trained model, after k-fold validation.

In the example fit below, we split the MNIST **training** sample into a single **temporary** training sample and a separate **validation** set, and pass both to the "fit" function.  We will use the **test** sample from above separately.

In [7]:
from sklearn.model_selection import train_test_split
train_images_temp,val_images,train_labels_cat_temp,val_labels_cat = train_test_split(train_images,train_labels_cat, test_size=0.2, random_state=42)


In [8]:
network.load_weights('model_init.h5')
history = network.fit(train_images_temp,train_labels_cat_temp,epochs=100,batch_size=128,validation_data=(val_images,val_labels_cat))

Train on 5600 samples, validate on 1400 samples
Epoch 1/100
...
Epoch 100/100


## Saving a model
Once we have trained our model, we are ready to use it.  However, it often takes a **long time** to train a model, and once trained we may want to use it at a different time (and using a different python program).   Retraining the model right before we use it is not practical.  Instead, we will often save the model immediately upon training it, so we can simply **load** the already trained model into memory the next time we want to use it.

In [9]:
network.save('fully_trained_model.h5')  # creates an HDF5 file 'fully_trained_model.h5'

## Examine performance
First let's look at the returned history object:

In [10]:
training_vals_acc = history.history['acc']
training_vals_loss = history.history['loss']
valid_vals_acc = history.history['val_acc']
valid_vals_loss = history.history['val_loss']
iterations = len(training_vals_acc)
print("Number of iterations:",iterations)
print("Epoch\t Train Loss\t Train Acc\t Val Loss\t Val Acc")
for tl,ta,vl,va in zip(training_vals_loss,training_vals_acc,valid_vals_loss,valid_vals_acc):
    print(round(tl,5),'\t',round(ta,5),'\t',round(vl,5),'\t',round(va,5))

Number of iterations: 100
Epoch	 Train Loss	 Train Acc	 Val Loss	 Val Acc
1.01098 	 0.71929 	 0.52308 	 0.85643
0.44448 	 0.88286 	 0.37678 	 0.90214
0.34384 	 0.90768 	 0.325 	 0.91357
0.29343 	 0.92214 	 0.294 	 0.91857
0.25936 	 0.93143 	 0.27672 	 0.92643
0.23461 	 0.93839 	 0.26667 	 0.925
0.21422 	 0.94464 	 0.25825 	 0.92714
0.19348 	 0.94804 	 0.24709 	 0.92857
0.17585 	 0.95464 	 0.23915 	 0.93286
0.16259 	 0.95696 	 0.2346 	 0.93571
0.14847 	 0.96321 	 0.23247 	 0.93214
0.13547 	 0.96571 	 0.2278 	 0.93214
0.12413 	 0.97089 	 0.22609 	 0.93286
0.11383 	 0.97339 	 0.22561 	 0.93
0.10658 	 0.97464 	 0.22476 	 0.93
0.099 	 0.9775 	 0.21804 	 0.93429
0.0876 	 0.98125 	 0.21778 	 0.93571
0.08224 	 0.98232 	 0.21376 	 0.93714
0.07544 	 0.98536 	 0.2157 	 0.935
0.06835 	 0.98839 	 0.21434 	 0.93714
0.06367 	 0.98821 	 0.21564 	 0.93357
0.05879 	 0.99071 	 0.20984 	 0.93857
0.05288 	 0.99268 	 0.21247 	 0.93786
0.0494 	 0.99268 	 0.21017 	 0.93714
0.04548 	 0.99375 	 0.21281 	 0.9357

## Plotting Performance
Here we look at the train and validation performance versus epoch.

In [11]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
#
# OLD (google colab)
#  display(IPython.core.display.HTML('''
#        <script src="/static/components/requirejs/require.js"></script>
#  '''))
#  init_notebook_mode(connected=False)
#
# New (OSC) [thanks to Stephen Gant for this!]
  init_notebook_mode(connected=True)


In [12]:
from plotly.offline import iplot
import plotly.graph_objs as go
import numpy as np

enable_plotly_in_cell()
#
# Loss versus epoch
data_train = go.Scatter(
    x=np.arange(len(history.history['loss'])),
    y=history.history['loss'],
    mode='markers',
    name="Train data"
)
data_val = go.Scatter(
    x=np.arange(len(history.history['val_loss'])),
    y=history.history['val_loss'],
    mode='markers',
    name="Validation data"
)
iplot(dict(data=[data_train,data_val]))

#
# Accuracy versus epoch
data_train = go.Scatter(
    x=np.arange(len(history.history['acc'])),
    y=history.history['acc'],
    mode='markers',
    name="Train data"
)
data_val = go.Scatter(
    x=np.arange(len(history.history['val_acc'])),
    y=history.history['val_acc'],
    mode='markers',
    name="Validation data"
)
iplot(dict(data=[data_train,data_val]))

## Loading a pre-trained network
Here we will load the pretrained network (deleting the version in memory to prove that this works!), and then apply this network to unseen data - our testing sample that we loaded above.

To get the network performance, we have two options:
1.  network.evaluate: This we use if we have labeled samples.   It returns the overall loss, as well as the calculated accuracy on that labeled dataset.
2.  network.predict:  This can be used on labeled or unlabeled data.  It returns the output of the network (in our case the 10 probabilities for the 10 classes) for each sample.  If you do have labeled data, you can compare the predicted output to the known label.
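
The two options are closely related: an accuracy like the one `evaluate` reports can be recovered from the output of `predict` by taking the most probable class for each sample and comparing it to the true label. A small NumPy sketch with made-up numbers (all array names are illustrative):

```python
import numpy as np

# Made-up network outputs for four samples and three classes.
fake_predictions = np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.3, 0.3, 0.4],
                             [0.6, 0.3, 0.1]])
fake_labels = np.array([0, 1, 2, 1])   # integer (not 1-hot) true labels

# Most probable class per row, then the fraction that matches the truth.
predicted_classes = np.argmax(fake_predictions, axis=1)
accuracy = np.mean(predicted_classes == fake_labels)

print(predicted_classes)   # [0 1 2 0]
print(accuracy)            # 0.75
```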

In [13]:
from keras.models import load_model
import numpy as np
#
# Delete the current model if it exists
try:
    del network  # deletes the existing model
except NameError:
    print("network already deleted")
    
# returns a compiled model
# identical to the previous one (note the new name!!)
trained_network = load_model('fully_trained_model.h5')
#
# Get the overall performance for the test sample
test_loss, test_acc = trained_network.evaluate(test_images,test_labels_cat)
print("Test sample loss: ",test_loss, "; Test sample accuracy: ",test_acc)
#
# Get the individual predictions for each sample in the test set
predictions = trained_network.predict(test_images)
#
# Get the max probability for each row
probs = np.max(predictions, axis = 1)
#
# Get the predicted classes for each row
classes = np.argmax(predictions, axis = 1)
#
# Now loop over the first twenty samples and compare truth to prediction
print("Label\t Pred\t Prob")
for label,cl,pr in zip(test_labels[:20],classes[:20],probs[:20]):
    print(label,'\t',cl,'\t',round(pr,3))


Test sample loss:  0.36053816095987956 ; Test sample accuracy:  0.916666666507721
Label	 Pred	 Prob
7 	 7 	 1.0
2 	 2 	 0.996
1 	 1 	 1.0
0 	 0 	 1.0
4 	 4 	 1.0
1 	 1 	 1.0
4 	 4 	 1.0
9 	 9 	 1.0
5 	 6 	 0.84
9 	 9 	 1.0
0 	 0 	 1.0
6 	 6 	 1.0
9 	 9 	 1.0
0 	 0 	 1.0
1 	 1 	 1.0
5 	 5 	 0.999
9 	 9 	 1.0
7 	 7 	 1.0
3 	 3 	 0.782
4 	 4 	 1.0


## Early Stopping
Notice that in the loss plot above, the network performance was best somewhere in the range of epochs 10-20, yet we continued to train the network until epoch 100.   Keras makes it possible to do two things:
1.  Stop the training once a condition has been met, using a callback called "EarlyStopping".   We will set two of its parameters:
   * "monitor": what is monitored for stopping. We will use 'val_loss', the loss on the validation set.
   * "patience": how many epochs to wait without improvement before stopping.  The idea is that the quantity you are monitoring fluctuates from epoch to epoch, and you don't want to stop just because of a small upward fluctuation in the loss.   So you wait a few epochs to see whether the performance gets better again.
2.  Save the best network prior to stopping, using a callback called "ModelCheckpoint".   You tell this callback what to monitor, and every time the monitored quantity improves (with save_best_only=True), it writes out a new file containing the full model info, overwriting the previous one.
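
The stopping rule can be sketched in a few lines of pure Python. This is a simplified model of the logic described above, not the Keras source: it ignores the callback's other options, and the function name and loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both as 0-based indices.

    Training stops once `patience` consecutive epochs have passed without
    the monitored loss improving on its best value so far.
    """
    best_loss = float('inf')
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience was exhausted.
    return len(val_losses) - 1, best_epoch

losses = [0.52, 0.38, 0.33, 0.30, 0.31, 0.32, 0.29, 0.30, 0.31, 0.33]
print(simulate_early_stopping(losses, patience=3))   # (9, 6)
```

With `patience=3` the small rise at epochs 4 and 5 is tolerated, the new minimum at epoch 6 resets the counter, and training stops three epochs later. The `best_epoch` value corresponds to the last time a checkpoint with `save_best_only=True` would have been written.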

In [14]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

hidden_nodes = 100
activation = 'tanh'
optimizer = 'adam'
network = models.Sequential()
network.add(layers.Dense(hidden_nodes,activation=activation,input_shape=(28*28,)))
network.add(layers.Dense(10,activation='softmax'))
network.compile(optimizer=optimizer,loss='categorical_crossentropy',metrics=['accuracy'])
#
# If we reload this right before fitting the model, the model will start from scratch
network.save_weights('model_init.h5')
callbacks = [EarlyStopping(monitor='val_loss', patience=10),
             ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]

network.load_weights('model_init.h5')
history = network.fit(train_images_temp,train_labels_cat_temp,
                              epochs=50,
                              batch_size=128,
                              verbose=1, # set to 0 for no printout while running
                              callbacks=callbacks, # Early stopping
                              validation_data=(val_images,val_labels_cat))
#
# get performance info
training_vals_acc = history.history['acc']
training_vals_loss = history.history['loss']
valid_vals_acc = history.history['val_acc']
valid_vals_loss = history.history['val_loss']
iterations = len(training_vals_acc)
print("Number of iterations:",iterations)
print("Epoch\t Train Loss\t Train Acc\t Val Loss\t Val Acc")
for i,(tl,ta,vl,va) in enumerate(zip(training_vals_loss,training_vals_acc,valid_vals_loss,valid_vals_acc)):
    print(i,'\t',round(tl,5),'\t',round(ta,5),'\t',round(vl,5),'\t',round(va,5))

Train on 5600 samples, validate on 1400 samples
Epoch 1/50
...
Epoch 39/50
Number of iterations: 39
Epoch	 Train Loss	 Train Acc	 Val Loss	 Val Acc
0 	 1.03613 	 0.70411 	 0.52165 	 0.86643
1 	 0.44139 	 0.88518 	 0.38103 	 0.90786
2 	 0.34545 	 0.90804 	 0.33323 	 0.91857
3 	 0.29662 	 0.92268 	 0.30028 	 0.92286
4 	 0.26138 	 0.92929 	 0.28444 	 0.92429
5 	 0.23341 	 0.94089 	 0.26669 	 0.93143
6 	 0.2129 	 0.94357 	 0.26161 	 0.93071
7 	 0.19268 	 0.94893 	 0.25075 	 0.93286
8 	 0.17624 	 0.95607 	 0.24805 	 0.93214
9 	 0.15988 	 0.95696 	 0.24675 	 0.93143
1

In [15]:
training_vals_acc[iterations-1]

0.9996428571428572

In [16]:
training_vals_acc

[0.7041071428571428,
 0.8851785717691694,
 0.9080357146263123,
 0.9226785714285715,
 0.9292857139451163,
 0.9408928568022592,
 0.9435714282308306,
 0.9489285717691694,
 0.9560714289120266,
 0.9569642860548837,
 0.9630357142857143,
 0.9662499996594021,
 0.969999999659402,
 0.9732142853736877,
 0.9741071428571428,
 0.9780357142857142,
 0.9816071425165449,
 0.983750000340598,
 0.9857142860548836,
 0.9866071428571429,
 0.9901785710879735,
 0.99125,
 0.9923214285714286,
 0.9932142857142857,
 0.9942857139451163,
 0.9958928568022591,
 0.9955357142857143,
 0.9969642857142857,
 0.9975,
 0.9978571425165449,
 0.9985714285714286,
 0.9983928571428572,
 0.99875,
 0.9991071428571429,
 0.9991071428571429,
 0.9996428571428572,
 0.9994642857142857,
 0.9996428571428572,
 0.9996428571428572]

**NOTICE**: Training stopped at XX epochs (39 in the run shown above; the exact number is somewhat random), not 50, and the minimum validation sample loss was at epoch XX-10 (training continued for patience=10 epochs after this minimum to make sure we did not miss a later, lower minimum).