<h2>Convolutional Neural Network with Graph in Keras</h2>
<p>Train a simple convnet on the MNIST dataset.</p>
<p>Run on GPU: THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_cnn.py</p>
<p>Gets to 99.25% test accuracy after 12 epochs (there is still a lot of margin for parameter tuning), at 16 seconds per epoch on a GRID K520 GPU.</p>
<p>
For this tutorial, a convolutional neural network (CNN) is built using Keras. It is trained and tested using the MNIST handwritten digits dataset. The CNN consists of multiple layers of convolution and max pooling, ending with a fully connected MLP for classification. 
</p>
<p>
This example is built using the Graph model rather than a Sequential model. 
</p>

In [1]:
from __future__ import print_function
import numpy as np

In [2]:
np.random.seed(1337) # for reproducibility of initial weights

In [3]:
from keras.datasets import mnist
from keras.models import Graph
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

Using Theano backend.


In [4]:
# Batch size for stochastic gradient descent, i.e. the number of samples per gradient update
batch_size = 128
# Output number of classes. MNIST has 10 possible classes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
nb_classes = 10
# Number of iterations over the entire dataset when training
nb_epoch = 12

In [5]:
# Input image dimensions for MNIST
img_rows, img_cols = 28, 28
# Number of image bands (channels): 3 for RGB, 1 for a single band (MNIST is grayscale)
nb_image_bands = 1
# Number of convolutional filters to use; can differ between convolutional layers
nb_filters1 = 32
nb_filters2 = 32
# Size of the pooling area for max pooling
nb_pool = 2
# Convolution kernel size; can vary between layers
nb_conv1 = 3 # (3x3 convolution)
nb_conv2 = 4 # (4x4 convolution)

In [6]:
# The data, shuffled and split between train and test sets. The download may fail behind a proxy/firewall.
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [7]:
# Reshape to include a 4th dimension, so each dataset has shape (num_samples, num_bands, img_num_rows, img_num_cols)
X_train = X_train.reshape(X_train.shape[0], nb_image_bands, img_rows, img_cols)
X_test = X_test.reshape(X_test.shape[0], nb_image_bands, img_rows, img_cols)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# Scale the pixel values of both sets from [0, 255] to the range [0, 1]
X_train /= 255
X_test /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples


In [8]:
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
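
The conversion above can be pictured with a small NumPy sketch. The helper name one_hot is hypothetical and introduced only for illustration; it is not the Keras implementation of np_utils.to_categorical, just an assumed equivalent for integer labels in the range 0 to nb_classes - 1.

```python
import numpy as np

def one_hot(labels, nb_classes):
    """Hypothetical helper: turn integer class labels into a binary class matrix."""
    labels = np.asarray(labels, dtype='int64')
    # One row per sample, one column per class, all zeros to start
    matrix = np.zeros((labels.shape[0], nb_classes), dtype='float32')
    # Put a single 1 in each row, at the column given by that sample's label
    matrix[np.arange(labels.shape[0]), labels] = 1.0
    return matrix

example = one_hot([5, 0, 4], 10)
print(example.shape)  # (3, 10)
print(example[0])     # the 1 sits at index 5
```

Each row sums to 1, which is what lets the softmax output be compared against it column by column.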

In [9]:
# Initialize an empty Graph model
model = Graph()

model.add_input(name='input', input_shape=(nb_image_bands, img_rows, img_cols))
# First layer: Convolution2D - Generates 32 feature maps, using a 3x3 convolution filter
model.add_node(Convolution2D(nb_filters1, nb_conv1, nb_conv1, border_mode='valid'), name='conv1', input='input')
# Apply the Rectified Linear Unit (ReLU) activation function to the weighted sums produced by the convolution.
# Other options include tanh, sigmoid, softplus, hard_sigmoid, and linear. The softmax activation is also available,
# but it only makes sense for the output layer, where it yields the class probabilities.
model.add_node(Activation('relu'), name='activation1', input='conv1')
# Run max pooling using a 2x2 pooling filter, storing the maximum value
model.add_node(MaxPooling2D(pool_size=(nb_pool, nb_pool)), name='maxpool1', input='activation1')
# Second Convolution layer - Generates 32 feature maps using a 4x4 convolution filter
model.add_node(Convolution2D(nb_filters2, nb_conv2, nb_conv2, border_mode='valid'), name='conv2', input='maxpool1')
# Activation for second convolution step
model.add_node(Activation('relu'), name='activation2', input='conv2')
# Second max pooling step
model.add_node(MaxPooling2D(pool_size=(nb_pool, nb_pool)), name='maxpool2', input='activation2')
# Dropout randomly sets a fraction of its input units to zero at each update during training. Here, 25% of the
# input units are "dropped", so they contribute nothing to that update. This helps prevent overfitting.
model.add_node(Dropout(0.25), name='dropout1', input='maxpool2')

# After the convolution/max pooling steps, the extracted features can be passed to a classification algorithm.
# A common approach is to simply pass the features through a fully connected layer before classifying using softmax.

# Convert the features into a single dimension vector
model.add_node(Flatten(), name='flatten', input='dropout1')
# Add a fully connected hidden layer of 128 nodes - This can be modified as a hyperparameter for testing various models
model.add_node(Dense(128), name='hidden1', input='flatten')
# ReLu activation function for hidden layer
model.add_node(Activation('relu'), name='activation3', input='hidden1')
# Dropout percentage for hidden layer
model.add_node(Dropout(0.5), name='dropout2', input='activation3')
# Output layer, fully connected to 10 nodes, for each possible class (0-9)
model.add_node(Dense(nb_classes), name='output', input='dropout2')
# Softmax, a generalization of the logistic function, converts the output values into a probability
# for each class
model.add_node(Activation('softmax'), name='softmax', input='output')
# Add model output
model.add_output(name='outputActivation', input='softmax')
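
To see how many features reach the Flatten node, the spatial size can be traced through the layers by hand. The sketch below assumes 'valid' convolutions shrink each dimension by kernel size minus one and that max pooling uses non-overlapping windows (stride equal to the pool size); the helper names are hypothetical and are not part of Keras.

```python
def valid_conv_size(size, kernel):
    # A 'valid' convolution only places the kernel fully inside the image
    return size - kernel + 1

def pooled_size(size, pool):
    # Non-overlapping pooling divides each dimension by the pool size
    return size // pool

size = 28
size = valid_conv_size(size, 3)  # conv1:    28 -> 26
size = pooled_size(size, 2)      # maxpool1: 26 -> 13
size = valid_conv_size(size, 4)  # conv2:    13 -> 10
size = pooled_size(size, 2)      # maxpool2: 10 -> 5
flattened = 32 * size * size     # 32 feature maps of 5x5 values each
print(size, flattened)           # 5 800
```

Under these assumptions the 128-node hidden layer receives an 800-element vector per image.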

In [18]:
# Compile the model, using the Adadelta optimizer and the categorical cross-entropy loss function.

# Adadelta is a variant of stochastic gradient descent: a per-dimension learning-rate method that adapts over time
# and requires no manual tuning of the learning rate.

# categorical_crossentropy is used with softmax to measure the N-category cross entropy between the predicted and
# target class distributions. Also known as multiclass logloss.
# Many additional loss functions are available, including mean_squared_error / mse, root_mean_squared_error / rmse,
# mean_absolute_error / mae, mean_absolute_percentage_error / mape, mean_squared_logarithmic_error / msle,
# squared_hinge, hinge, and binary_crossentropy (also known as logloss).
# For comparison, the equivalent call on a Sequential model would be:
#model.compile(loss='categorical_crossentropy', optimizer='adadelta')
%time model.compile(optimizer = 'adadelta', loss = {'outputActivation':'categorical_crossentropy'})

CPU times: user 2.76 s, sys: 23 ms, total: 2.78 s
Wall time: 2.77 s
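
The pairing of softmax with categorical cross entropy can be sketched in plain NumPy. This is an illustration under assumptions, not the code Keras or Theano runs: the function names, the clipping constant eps, and the toy scores are all made up for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(probs, targets, eps=1e-7):
    # Mean over samples of -sum(target * log(prob)); clip to avoid log(0)
    probs = np.clip(probs, eps, 1.0)
    return float(np.mean(-np.sum(targets * np.log(probs), axis=1)))

# Two toy samples with three classes each (made-up values)
scores = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = softmax(scores)
loss = categorical_crossentropy(probs, targets)
print(probs.sum(axis=1))  # each row of probabilities sums to 1
print(loss)
```

Because the targets are one-hot, only the predicted probability of the true class contributes to each sample's loss, and the loss approaches zero as that probability approaches 1.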


In [19]:
# Begin Training the model
#
# Pass the training set: input and targets
# batch_size: size of the mini-batch, i.e. the number of samples processed per gradient update,
# rather than running the entire dataset at once
# nb_epoch: number of epochs, i.e. iterations over the entire dataset
# verbose: how much detail to display: 0 - no output, 1 - progress bar, 2 - one line per epoch
# validation_data: Dataset the model is validated against, the output displays the loss and accuracy 
# of the validation set
#
# The loss function should be minimized. Note that the graph model does not have a show_accuracy. According to the
# developers, due to the complexity of the graph model, it is very difficult to include this value as output of the
# model. Their recommendation is that it is much easier to take the predictions and calculate the accuracy directly.
%time history = model.fit({'input':X_train, 'outputActivation':Y_train}, nb_epoch=nb_epoch, \
                    batch_size=batch_size, verbose=1, validation_data=({'input':X_test, 'outputActivation':Y_test}))

Train on 60000 samples, validate on 10000 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12
CPU times: user 1h 31min 1s, sys: 9.39 s, total: 1h 31min 11s
Wall time: 1h 31min 15s
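
How batch_size and nb_epoch interact can be checked with a little arithmetic. The sketch below assumes the final, smaller batch of an epoch is still used rather than dropped; batch_slices is a hypothetical helper, not a Keras function.

```python
def batch_slices(nb_samples, batch_size):
    """Hypothetical helper: yield (start, stop) index pairs covering the dataset once."""
    for start in range(0, nb_samples, batch_size):
        yield start, min(start + batch_size, nb_samples)

slices = list(batch_slices(60000, 128))
updates_per_epoch = len(slices)
print(updates_per_epoch)       # 469 gradient updates per epoch
print(slices[-1])              # the last batch is smaller: (59904, 60000)
print(updates_per_epoch * 12)  # 5628 updates over 12 epochs
```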


In [20]:
# Run the trained model on the test set. For this example, the test and validation sets are the same. This function
# is useful for running the model on a new dataset not previously seen. 
score = model.evaluate({'input': X_test, 'outputActivation': Y_test}, batch_size=batch_size, verbose=1)



In [21]:
# The Graph model does not report accuracy. Here, we calculate a score ourselves
prediction = model.predict({'input': X_test}, batch_size=batch_size, verbose=1)
# Take the absolute differences between the predicted probabilities and the one-hot targets, sum them, and divide
# by the number of samples. Note that this is 1 minus the mean L1 distance to the targets, a soft score that
# depends on how confident the predictions are, rather than the fraction of correctly classified samples.
accuracy = 1 - np.sum(np.abs(prediction['outputActivation'] - Y_test)) / len(Y_test)
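
For the fraction of correctly classified samples, a common alternative is to compare the most probable predicted class with the true class. The sketch below is an assumption about how that could be done here, not part of the original notebook; the function name is hypothetical and the toy arrays stand in for prediction['outputActivation'] and Y_test.

```python
import numpy as np

def classification_accuracy(probs, one_hot_targets):
    # Predicted class = index of the highest probability; true class = index of the 1
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return float(np.mean(predicted == actual))

# Toy stand-ins for the model output and the one-hot targets (made-up values)
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.6, 0.3],
                      [0.5, 0.3, 0.2],
                      [0.2, 0.2, 0.6]])
toy_targets = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 0, 1]])
print(classification_accuracy(toy_probs, toy_targets))  # 0.75
```

Unlike the L1-based score, this value does not change when a correct prediction becomes more or less confident.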



In [23]:
# print the categorical_crossentropy value of model run on the test set
print('Test score:', score)
# print the accuracy of the model run on the test set
print('Test accuracy:', accuracy)

Test score: 0.0235688971492
Test accuracy: 0.980328007724
