# Lecture 8: Neural Networks

In this lecture, we'll be looking at various kinds of neural networks.  Neural networks in Python are a very quickly evolving area, and there are many competing packages for working with them.  Unfortunately, scikit-learn does not yet offer a standard set of neural network tools like the ones we've seen for many other machine learning methods.  

Most of the packages are high level wrappers around [Theano](http://deeplearning.net/software/theano/), which is a mathematical package for easily working with numerical expressions of arrays and matrices and their gradients.  Additionally, Theano code will also run seamlessly on a GPU if one is available.  This makes training much, much faster.  Here's an [ipython notebook](http://nbviewer.ipython.org/github/craffel/theano-tutorial/blob/master/Theano%20Tutorial.ipynb) on Theano if you're interested. 

We're going to look at three packages.  The first is [scikit-neuralnetwork](https://github.com/aigamedev/scikit-neuralnetwork) (installation instructions are at this link as well).  Its interface is the simplest, but it doesn't appear to be as widely used, and it's unclear whether this package will "win" the race.

[Keras](https://github.com/fchollet/keras/) is a relatively new package.  It looks to be a good balance between sophistication and simplicity.  Installation instructions are at that link.

We will also look at Google's tensorflow package at the end.

In [8]:
%matplotlib inline

In [9]:
import gzip
import os
import sys
import time
import numpy as np
import pandas as pd

# pickle lets us save python objects to a file and read them back in
import pickle
import itertools

# here are our neural network imports
#from sknn import mlp

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.utils import np_utils

import theano

# in Python 3, urlretrieve lives in urllib.request (it was in plain urllib in Python 2)
from urllib.request import urlretrieve
from sklearn import datasets
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
# train_test_split moved to model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
# scikit-learn does have a restricted Boltzmann machine class for doing unsupervised
# feature learning
from sklearn.neural_network import BernoulliRBM

import matplotlib.pyplot as plt
import seaborn as sns

Let's see where Theano will run our code:

In [3]:
print (theano.config.device)

cpu


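If we had a GPU, we could ask Theano to use it by setting `device=gpu` in the `THEANO_FLAGS` environment variable (or in `~/.theanorc`) before importing Theano; the device cannot be switched once Theano has been imported.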

## Multilayer Perceptron with a Single Hidden Layer

We'll be working with the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), a standard dataset of handwritten digits.  Note that this is a much bigger, higher resolution dataset than the handwritten digits dataset that we've seen in previous lectures.  In this first example, we'll use scikit-neuralnetwork.

First, we'll download the MNIST dataset as a pickle file, save it to a local file, and then read in its contents.  If we've already downloaded the file, we'll just read it in.

In [6]:
DATA_URL = 'http://deeplearning.net/data/mnist/mnist.pkl.gz'
DATA_FILENAME = 'mnist.pkl.gz'

if not os.path.exists(DATA_FILENAME):
    print ("Downloading MNIST dataset...")
    urlretrieve(DATA_URL, DATA_FILENAME)

with gzip.open(DATA_FILENAME, 'rb') as f:
    # the file was pickled under Python 2, so Python 3 needs encoding='latin1' to read it
    data = pickle.load(f, encoding='latin1')

Downloading MNIST dataset...

The pickle object has a training set, a validation set, and a test set.  Let's split those out.

In [None]:
X_train, y_train = data[0]
X_valid, y_valid = data[1]
X_test, y_test = data[2]

The images of handwritten digits are 28 by 28 pixels (28*28=784):

In [None]:
X_train.shape

In [None]:
X_train[0, :]

The response variable is a label ranging between 0 and 9.

In [None]:
y_train.shape

In [None]:
y_train

The previous handwritten digits dataset that we worked with was only 8 by 8 pixels.  Let's define a function so that we can look at some of the images:

In [None]:
def plot_handwritten_digit(the_image, label):
    plt.axis('off')
    plt.imshow(the_image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

In [None]:
image_num = 1220
plot_handwritten_digit(X_train[image_num].reshape((28, 28)), y_train[image_num])

Here, we define some constants that will be used in the training:

In [None]:
# number of units in a single hidden layer
NUM_HIDDEN_UNITS = 512

# these parameters control the gradient descent process to learn the weights
LEARNING_RATE = 0.01
MOMENTUM = 0.9 

# we'll feed in this many training examples at a time (for stochastic gradient descent)
BATCH_SIZE = 600

# this is how many times we'll go through the set of batches, i.e. a full pass over
# all of the training data
NUM_EPOCHS = 10
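
To make the roles of `LEARNING_RATE` and `MOMENTUM` concrete, here is a minimal sketch of the classical momentum update on a toy one-dimensional loss.  The function `sgd_momentum` and the toy loss are illustrative only; they are not how any of the packages below implement their optimizers (the networks in this lecture use the Nesterov variant).

```python
def sgd_momentum(grad_fn, w0, learning_rate=0.01, momentum=0.9, num_steps=500):
    """Minimize a 1-D function with classical momentum, given its gradient."""
    w = w0
    velocity = 0.0
    for _ in range(num_steps):
        # the velocity is a decaying running sum of past gradient steps
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w

# toy loss f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

The learning rate scales each step, while the momentum term lets consistent gradients build up speed across steps.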


### `scikit-neuralnetwork`

First, let's define a network with scikit-neuralnetwork because it's by far the simplest.  Unfortunately, it looks like this package only supports squared loss and not cross-entropy for classification problems:

In [10]:
# the code below refers to `mlp`, so import that submodule by name
from sknn import mlp

ImportError: No module named 'sknn'

In [None]:
# First, we'll specify the layers. We'll build a simple network with an input layer, 
# one hidden layer that uses sigmoid activation, and one output layer that uses
# softmax activation.  We're using softmax for the output layer since we're doing multi-class classification; softmax
# rescales the output layer so that the values of the output nodes are all positive and sum to 1 (hence, they
# can be viewed as a probability distribution).

# The 'mlp' you see throughout corresponds to 'multi-layer perceptron'
layers = [mlp.Layer("Sigmoid", units=NUM_HIDDEN_UNITS), mlp.Layer("Softmax")]

# Second, we create the model, much as we did for models earlier in the course
# This model contains all the settings and hyperparameters required to train the network. 
# It does not actually refer to our training data.  
# Notice also that we're using the mlp.Classifier class, since we're doing classification.
# With this information, sknn will know that our output consists of a set of labels, and so
# we don't need to vectorize the response data.
sknn_mlp = mlp.Classifier(loss_type="mse", batch_size=BATCH_SIZE, layers=layers, learning_rate=LEARNING_RATE, 
                        learning_rule="nesterov", learning_momentum=MOMENTUM, n_iter=NUM_EPOCHS, verbose=True)

# We can also get a summary of the model's settings.
sknn_mlp
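
The comments above say that softmax rescales the output layer so that its values are all positive and sum to 1.  A minimal sketch of that calculation in plain Python follows; the function name `softmax` and the example scores are made up for illustration, and the packages compute the same quantity on whole arrays at once.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to positive values that sum to 1."""
    # subtracting the max does not change the result but avoids overflow in exp
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score always receives the largest probability, which is why taking the maximum of the softmax output gives the predicted class.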

Then we'll fit it and make predictions:

In [None]:
sknn_mlp.fit(X_train, y_train)

Next, we can predict on the test data using the trained network.

In [None]:
test_preds = sknn_mlp.predict(X_test)
test_preds

We can also look at the network's performance, using the same classification accuracy metrics we used in earlier lectures.

In [None]:
print(classification_report(y_test, test_preds))
print(accuracy_score(y_test, test_preds))

### `keras`

sknn is easy to use, but not especially popular.  keras, on the other hand, is quite popular.  Interestingly, it does not perform the computations itself: it provides a relatively easy-to-use syntax for setting up and training neural networks, and relies on a computational backend (theano or tensorflow) to do the work.  This is very convenient, since the keras code you write with a theano backend is the same code you'd write with a tensorflow backend.  Swapping backends requires only a simple change to the keras configuration file (```keras.json```, which should be located in the ```.keras``` folder in your home directory; see https://keras.io/backend/).

For a list of keras commands, and references on what each does, see: https://keras.io/layers/core/.

These examples are based on the example found at: 
https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py
Be careful with keras examples: you have to go to the actual Github
repository to get examples that are consistent with the latest version
of the package, otherwise everything breaks...

First, we set up the model.  This model is again simple: one layer for inputs, one hidden layer, and one layer for outputs.  We will use sigmoid activation for the hidden layer and softmax activation for the output layer.  This setup lets us perform classification: by taking the largest value in the softmax layer across the ten possible outputs (0, 1, ..., 9), we can classify new observations.  We will also measure our network's performance on the test set.

In [None]:
# Tell keras we want to create a sequential (feed-forward network) model, in which one
# layer follows the next
model = Sequential()

# Create the input layer and the hidden layer of the network
# 'Dense' indicates that we want all inputs to connect to every node in the hidden layer
# 'input_shape' tells the hidden layer the dimension of the input to expect, which is determined
#    by our data (the number of predictors in our data set = 784 pixels per image)
# 'NUM_HIDDEN_UNITS' (which we defined above) tells keras how many nodes are in the hidden layer
#    connected to the input layer
# 'activation' specifies how values from the input nodes should be processed by the hidden nodes.
model.add(Dense(NUM_HIDDEN_UNITS, input_shape=(784,), activation='sigmoid', init='uniform'))

# Next, we'll create an output layer.
# The value '10' tells keras we want this layer to have ten nodes
# The 'activation' tells keras we want to use the softmax function
# Note that we don't specify 'input_shape'.  The input to this layer is the output of the hidden
# layer we created above. keras is smart enough to figure this out, which is why we only need
# to specify how many nodes are in the output layer.
model.add(Dense(10, activation='softmax', init='uniform'))

# Next, we specify the properties of our optimizer.
# We'll be using stochastic gradient descent with momentum, along with 
# the crossentropy cost function.
# Note: The 'decay' argument was not discussed in lecture - it reduces the learning
# rate as we get further into training (e.g., as we go from one epoch to the next)
# The benefit of decay is that the optimizer will make bigger adjustments
# early on, then do fine-tuning later in the training process.
sgd = SGD(lr=LEARNING_RATE, decay=1e-6, momentum=MOMENTUM, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)

# Summarize the model setup
model.summary()
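
The `decay` comment above can be made concrete with a small calculation.  The sketch below assumes the time-based schedule `lr / (1 + decay * iteration)`, which is my understanding of how this generation of keras applies `decay` in `SGD`; treat the formula as an assumption and check it against your installed version.  The update counts also assume 50,000 training images, i.e. 84 batches of 600 per epoch.

```python
def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay: the rate shrinks as the number of updates grows."""
    return initial_lr / (1.0 + decay * iteration)

# learning rate at the start, after one epoch (84 updates), and after ten (840)
for iteration in (0, 84, 840):
    print(iteration, decayed_learning_rate(0.01, 1e-6, iteration))
```

Under these assumptions, a decay of `1e-6` changes the learning rate by less than 0.1% over ten epochs, so its effect on a run this short is tiny.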

Second, we will convert the training labels to vectors of dimension 10. Each vector has exactly one '1', at the index corresponding to the correct label.  For instance, if an image is a '5', then there will be a '1' at index 5 of the response vector, and 0's everywhere else.  This approach mirrors the fact that our output layer has 10 nodes, one per possible digit.

In [None]:
# The np_utils module is part of keras, and has functions designed to make it easier to work with neural networks.
# Here, we're using it to convert the label response data into vectors.
Y_train = np_utils.to_categorical(y_train, 10)
Y_train
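
For intuition, the conversion performed above can be sketched in a few lines of plain Python.  The helper `to_one_hot` is a hypothetical stand-in written for this illustration, not the keras function itself.

```python
def to_one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

# three example labels: a '5', a '0', and a '9'
for row in to_one_hot([5, 0, 9], 10):
    print(row)
```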

Third, let's fit the model.  We pass it the training data, the batch_size (for SGD), and the number of epochs.  Notice the ```.fit``` syntax, which is the same syntax used for all scikit-learn models we've used thus far.

In [None]:
model.fit(X_train, Y_train, batch_size=BATCH_SIZE, nb_epoch=NUM_EPOCHS)

Next, we'll predict on the test set using the neural network.  Since we had ten output nodes, keras will give us a 10-dimensional vector for each test observation.  Each entry in the vector can be interpreted as the probability that the test vector's label corresponds to the entry (e.g., first entry is the probability that the test observation is a `0`).

In [None]:
Y_test_keras = model.predict(X_test)
Y_test_keras

In [None]:
# Now, we'll use np_utils again to convert the 10-dimensional probability distribution for each test
# observation into an actual label (by looking at the max)
test_preds = np_utils.categorical_probas_to_classes(Y_test_keras)
test_preds
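
Converting probabilities to labels amounts to taking, for each row, the index of the largest entry.  A minimal plain-Python sketch follows; `probas_to_classes` and the example rows are illustrative and use three classes instead of ten to keep the output short.

```python
def probas_to_classes(prob_rows):
    """Return, for each row of class probabilities, the index of the largest one."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in prob_rows]

example_probs = [
    [0.05, 0.80, 0.15],   # most mass on class 1
    [0.60, 0.30, 0.10],   # most mass on class 0
]
print(probas_to_classes(example_probs))
```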

Finally, we measure the accuracy of the neural network.  Recall that precision is the fraction of observations labeled as a class that actually are of that class, and recall is the fraction of observations from a class that were labeled correctly.
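
The two definitions above can be checked by hand on a tiny example.  The sketch below computes precision and recall for a single class from paired lists of true and predicted labels; the function name and the labels are made up for illustration, and `classification_report` does the equivalent for every class at once.

```python
def precision_recall(y_true, y_pred, target_class):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == target_class and p == target_class)
    predicted = sum(1 for _, p in pairs if p == target_class)
    actual = sum(1 for t, _ in pairs if t == target_class)
    precision = true_pos / float(predicted) if predicted else 0.0
    recall = true_pos / float(actual) if actual else 0.0
    return precision, recall

# made-up labels, purely for illustration
y_true_demo = [3, 3, 3, 8, 8, 5]
y_pred_demo = [3, 3, 8, 8, 3, 5]
print(precision_recall(y_true_demo, y_pred_demo, target_class=3))
```

Here three observations were labeled '3' and two of them really are '3' (precision 2/3), while two of the three true '3's were found (recall 2/3).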

In [None]:
# Finally, we measure the accuracy of the neural network.
print(classification_report(y_test, test_preds))
print(accuracy_score(y_test, test_preds))

Much better!

If we want to visualize what the hidden nodes are picking up on, we can extract the weights for a single hidden node, then pass those weights into the function we created earlier to plot digits.  This works because there is one weight per pixel, so the weights for a node can be reshaped into an 'image'.  When a pixel has a large weight, it will be darker in the image, and when a pixel has a low weight, it will be lighter in the image.

In [None]:
# We can extract the weights from the hidden layer using the get_weights
# function provided by keras.
len(model.get_weights())

This tells us that keras is tracking four sets of 'weights'.  The first set, ```model.get_weights()[0]```, holds the weights used when mapping the inputs to the hidden layer.  The second set, ```model.get_weights()[1]```, actually holds the bias terms for the hidden layer.  The last two entries, indexed by ```2``` and ```3```, are the weights and biases applied to the hidden layer when mapping to the output layer.
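
As a sanity check on those four arrays, the sketch below works out the shapes and the total number of trainable parameters for a 784-512-10 network.  The helpers `dense_shapes` and `count_params` are hypothetical names written for this illustration; they only do the arithmetic and do not touch keras.

```python
def dense_shapes(layer_sizes):
    """Shapes of the weight and bias arrays for a stack of fully connected layers."""
    shapes = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        shapes.append((n_in, n_out))  # one row per input, one column per node
        shapes.append((n_out,))       # one bias per node
    return shapes

def count_params(shapes):
    """Total number of entries across all arrays."""
    total = 0
    for shape in shapes:
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total

shapes = dense_shapes([784, 512, 10])
print(shapes)
print(count_params(shapes))
```

Almost all of the parameters sit in the first weight array, since every one of the 784 pixels connects to every one of the 512 hidden nodes.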

In [None]:
model.get_weights()[0].shape

We see that the first set of weights is stored in an array with one row per pixel (i.e., one row per input node in our network), and one column per hidden node.  Therefore, if we want to see what pattern a hidden node is detecting, we pass the corresponding weight column to the digit plot function we created earlier.

In [None]:
# Which hidden node?
hid_node = 450

# Plot it
# we're recycling the function from before, so the title will still say
# 'Training', but ignore that.
plot_handwritten_digit(model.get_weights()[0][:,hid_node-1].reshape((28, 28)), hid_node)

In [None]:
model.get_weights()[0][:,1].shape

## Deeper Network with Dropout

Let's also try a deeper neural network with more layers.  Specifically, one input layer, two fully connected hidden layers using sigmoid activation, and one output layer using softmax activation.

### `keras`

In [None]:
# Tell keras we want to create a sequential (feed-forward network) model, in which one
# layer follows the next
deeper_model = Sequential()

# Create the input layer and the hidden layer of the network
# 'Dense' indicates that we want all inputs to connect to every node in the hidden layer
# 'input_shape' tells the hidden layer the dimension of the input to expect, which is determined
#    by our data (the number of predictors in our data set = 784 pixels per image)
# 'NUM_HIDDEN_UNITS' (which we defined above) tells keras how many nodes are in the hidden layer
#    connected to the input layer
# 'activation' specifies how values from the input nodes should be processed by the hidden nodes.
deeper_model.add(Dense(NUM_HIDDEN_UNITS, input_shape=(784,), activation='sigmoid', init='uniform'))

# Next, apply dropout to the first hidden layer, with 50% of its outputs being
# dropped (set to zero) on each training update.
deeper_model.add(Dropout(0.5))

# Next, create a second hidden layer with sigmoid activation, and
# implement dropout.  Keras will look at the most recent layer we created
# to figure out how many inputs to expect.
deeper_model.add(Dense(NUM_HIDDEN_UNITS, activation='sigmoid', init='uniform'))
deeper_model.add(Dropout(0.5))

# Next, we'll create the output layer.
deeper_model.add(Dense(10, activation='softmax', init='uniform'))

# Next, we specify the properties of our optimizer.
# We'll be using stochastic gradient descent with momentum, along with 
# the crossentropy cost function.
# Note: We're using a larger learning rate than in the single-hidden-layer network,
# and no decay.  I arrived at these settings through tuning, but intuitively this is 
# a more complicated function to optimize (given the additional layer)
# and since I don't want to run over a huge number of epochs, I'm asking keras
# to learn faster.  This may not always work in practice.
sgd = SGD(lr=4.*LEARNING_RATE, decay=0., momentum=MOMENTUM, nesterov=True)
deeper_model.compile(loss='categorical_crossentropy', optimizer=sgd)

# Summarize the model setup
deeper_model.summary()
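
To show what a dropout layer does during training, here is a minimal plain-Python sketch of "inverted" dropout, where the surviving values are rescaled so that the expected total is unchanged.  I believe keras rescales in this way, but treat that as an assumption; the function `dropout` below is written for illustration and is not the keras layer.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each value with probability drop_prob, rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
hidden = [1.0] * 10000          # pretend outputs of a hidden layer
dropped = dropout(hidden, 0.5, rng)

# fraction of values zeroed out, and the mean of what is left
print(sum(1 for a in dropped if a == 0.0) / float(len(dropped)))
print(sum(dropped) / len(dropped))
```

Dropout is only applied while training; at prediction time all nodes are used.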

In [None]:
# Vectorize the response once again, and train the network.
Y_train = np_utils.to_categorical(y_train, 10)
deeper_model.fit(X_train, Y_train, batch_size=BATCH_SIZE, nb_epoch=NUM_EPOCHS)

Finally, let's check the accuracy of this deeper network.

In [None]:
Y_test_keras = deeper_model.predict(X_test)
test_preds = np_utils.categorical_probas_to_classes(Y_test_keras)
print(classification_report(y_test, test_preds))
print(accuracy_score(y_test, test_preds))

Not great, especially since it's worse than the single layer neural network!  I played around with this, and found that turning decay off and going for 30 epochs, instead of 10, helps quite a bit.  Let's do that now.

## Tensorflow

Tensorflow is an open-source package provided by Google specifically designed for deep learning. In terms of what we've done so far in this notebook, it's comparable to theano in that it's a computational backend that can be paired with keras.  However, it can also be used directly (as can theano actually) to specify, train, and evaluate neural networks.  

The basic idea behind tensorflow is to describe a network as a  'map' in which nodes correspond to computations (e.g., apply softmax function or sigmoid function), and the edges between nodes correspond to the flow of *tensors* (or multi-dimensional arrays) as they get processed via successive computations. Essentially, the tensors "flow" through the map, getting modified along the way.  

Like theano, one key function of tensorflow is its ability to automatically calculate the derivatives required by gradient descent.  It figures out how all the parameters feed into the loss function, and calculates derivatives accordingly.
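
To give a feel for what "calculating derivatives" means here, the sketch below takes a single sigmoid neuron with squared-error loss, writes out its derivative by hand using the chain rule, and checks it against a finite-difference estimate.  This is a toy illustration with made-up numbers; tensorflow and theano derive such expressions automatically for the whole network rather than by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - target) ** 2

def loss_grad_w(w, b, x, target):
    """d(loss)/dw by the chain rule: outer derivative times sigmoid' times x."""
    out = sigmoid(w * x + b)
    return 2.0 * (out - target) * out * (1.0 - out) * x

w, b, x, target = 0.7, -0.2, 1.5, 1.0
eps = 1e-6
analytic = loss_grad_w(w, b, x, target)
# central finite difference: nudge w up and down and measure the change in loss
numeric = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
print(analytic, numeric)
```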

In addition, tensorflow has many advanced features that we won't discuss.  For instance, it is designed to work well with GPUs, which have been found to be very good at performing the computations required by neural networks.  It can also be used for distributed processing, or even on smart phones.

Instructions for installing tensorflow can be found here: https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html.  If you want the simplest possible install, I'd stick with the 'CPU Only' approach, which avoids having to set up the additional packages required for GPU processing.

Also, there is now a tensorflow wrapper for R, which lets you use tensorflow without leaving R. I have not tried it, but if you do, let me know what you think.
https://rstudio.github.io/tensorflow/

Note that the R wrapper warns against installing tensorflow using anaconda, which is unfortunate.  I used the conda installation for use in Python, which worked fine.

We'll now do the digit classification problem in tensorflow.  This example is heavily based on Google's own example: https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/mnist/mnist_softmax.py

### Digit Classification

In [None]:
# Let's import tensorflow
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

In [None]:
# Pull down the data in a tensorflow-friendly format
mnist_tf = input_data.read_data_sets("MNIST_data/",one_hot=True)

Our data again consists of 28x28 pixel images, for 784 total pixels per image.  We therefore create a tensorflow *placeholder* to represent the input layer in our network.  We won't actually provide data yet; we're just telling tensorflow to expect data of a certain shape.

In [None]:
x = tf.placeholder(tf.float32, [None,784])

The ```float32``` tells tensorflow the type of data that will be stored in ```x```.  The ```[None,784]``` tells tensorflow to expect data with some (unspecified) number of rows, and 784 columns.

Next, we need to add objects to our tensorflow map that correspond to the weights and biases of our neural network.  Specifically, we create tensorflow ```variables```, which can be updated as tensorflow performs gradient descent.

In [None]:
# Weights and biases used to map inputs to hidden layer
W = tf.Variable(tf.zeros([784,500]))
b = tf.Variable(tf.zeros([500]))

# Calculation of hidden layer using sigmoid activation
h = tf.sigmoid(tf.matmul(x,W)+b)

# Weights and biases used to map hidden layer to output layer
hW = tf.Variable(tf.zeros([500,10]))
hb = tf.Variable(tf.zeros([10]))

# Calculation to map hidden layer to output layer
# Note: We'll apply softmax implicitly in the cost function
y = tf.matmul(h,hW)+hb

# Placeholder for the training response data
y_ = tf.placeholder(tf.float32,[None,10])


Next, we specify our objective function and how we'd like to update our parameters (in our case, via gradient descent).

In [None]:
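# y holds the raw output-layer values (logits) and y_ the one-hot labels; naming the
# arguments avoids mixing up their order
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y, labels=y_))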
train_step = tf.train.GradientDescentOptimizer(0.9).minimize(cross_entropy)
#train_step = tf.train.MomentumOptimizer(learning_rate=0.5,momentum=0.9).minimize(cross_entropy)
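
The cost function above applies softmax and cross-entropy in one step, working directly from the raw output-layer values.  A minimal plain-Python sketch of that combined calculation for a single example follows; `softmax_cross_entropy` is an illustrative helper, not the tensorflow function, and it uses three classes instead of ten.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label, for one example."""
    # log-sum-exp with the max subtracted, so large logits do not overflow
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    log_probs = [z - log_norm for z in logits]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))

uniform_loss = softmax_cross_entropy([0.0, 0.0, 0.0], [0.0, 1.0, 0.0])
confident_loss = softmax_cross_entropy([0.0, 10.0, 0.0], [0.0, 1.0, 0.0])
print(uniform_loss)     # equals log(3): the network is guessing uniformly
print(confident_loss)   # close to 0: confident and correct
```

Combining the two steps this way is numerically safer than computing the softmax probabilities first and taking their logarithm afterwards.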

Finally, we initialize all of the variables in the tensorflow map, start a session (tensorflow operations need to take place within a session), then iterate using the gradient descent algorithm in order to adjust the weights and minimize cost.

In [None]:
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

for i in range(1000):
    batch_xs, batch_ys = mnist_tf.train.next_batch(600)
    sess.run(train_step,feed_dict={x: batch_xs, y_: batch_ys})
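
Each pass through the loop above draws 600 examples, so 1000 iterations process 600,000 examples in total.  The sketch below shows one common way to carve a data set into mini-batches: shuffle the indices, then slice.  It is a generic illustration with made-up sizes and a hypothetical helper name; I have not checked how `next_batch` does this internally.

```python
import random

def iterate_minibatches(num_examples, batch_size, rng):
    """Yield lists of example indices that together cover one shuffled epoch."""
    indices = list(range(num_examples))
    rng.shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
batches = list(iterate_minibatches(num_examples=2000, batch_size=600, rng=rng))
# the last batch is smaller when the batch size does not divide the data evenly
print([len(batch) for batch in batches])
```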

In [None]:
# Lastly, we calculate accuracy of our fitted neural network
correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

print(sess.run(accuracy,feed_dict={x: mnist_tf.test.images, y_: mnist_tf.test.labels}))

Definitely bad, but I did no tuning. 
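
One likely contributor to the poor accuracy (my reading, not something established above) is that every weight starts at exactly zero.  When all hidden nodes start identical, they receive identical gradient updates and stay identical, so the hidden layer behaves like a single node.  The plain-Python sketch below demonstrates this on a tiny made-up network and data set; `train_zero_init` is illustrative and unrelated to the tensorflow code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def train_zero_init(data, n_in, n_hidden, n_out, lr=0.5, steps=50):
    """Per-example gradient descent on a one-hidden-layer net with all-zero initial weights."""
    W = [[0.0] * n_hidden for _ in range(n_in)]     # input -> hidden weights
    b = [0.0] * n_hidden
    V = [[0.0] * n_out for _ in range(n_hidden)]    # hidden -> output weights
    c = [0.0] * n_out
    for _ in range(steps):
        for x, target in data:
            h = [sigmoid(sum(x[i] * W[i][j] for i in range(n_in)) + b[j])
                 for j in range(n_hidden)]
            p = softmax([sum(h[j] * V[j][k] for j in range(n_hidden)) + c[k]
                         for k in range(n_out)])
            d_out = [p[k] - target[k] for k in range(n_out)]   # gradient at the logits
            d_hid = [sum(d_out[k] * V[j][k] for k in range(n_out)) * h[j] * (1 - h[j])
                     for j in range(n_hidden)]
            for j in range(n_hidden):
                for k in range(n_out):
                    V[j][k] -= lr * d_out[k] * h[j]
                for i in range(n_in):
                    W[i][j] -= lr * d_hid[j] * x[i]
                b[j] -= lr * d_hid[j]
            for k in range(n_out):
                c[k] -= lr * d_out[k]
    return W

toy_data = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
W_demo = train_zero_init(toy_data, n_in=2, n_hidden=3, n_out=2)
for row in W_demo:
    print(row)   # within each row, every hidden node ends up with the same weight
```

A common remedy is to start the weights at small random values instead of zeros, which breaks the symmetry between hidden nodes.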

In [None]:
# Close the tensorflow session
sess.close()