# Neural Networks

In this lecture, we'll be looking at various kinds of neural networks.  Neural networks in Python are a very quickly evolving area, and there are many competing packages for working with them.  Unfortunately, scikit-learn doesn't yet offer a standard set of neural network tools like the ones we've seen for many other machine learning methods.

Most of the packages are high-level wrappers around [Theano](http://deeplearning.net/software/theano/), a mathematical package for working with numerical expressions involving arrays and matrices, and for computing their gradients.  Theano code will also run on a GPU if one is available, which makes training much faster.  Here's an [IPython notebook](http://nbviewer.ipython.org/github/craffel/theano-tutorial/blob/master/Theano%20Tutorial.ipynb) on Theano if you're interested.

We're going to look at two packages.  The first is [scikit-neuralnetwork](https://github.com/aigamedev/scikit-neuralnetwork) (installation instructions are at this link as well).  Its interface is the simplest, but it doesn't appear to be as widely used, and it's unclear whether this package will "win" the race.

The second is [Keras](https://github.com/fchollet/keras/), a relatively new package.  It looks to be a good balance between sophistication and simplicity.  Installation instructions are at that link.

[Lasagne](http://lasagne.readthedocs.org/en/latest/) is another, more full-featured, package, but we won't have time to go into it.

In [None]:
%matplotlib inline

In [None]:
import gzip
import os
import sys
import time
import numpy as np
import pandas as pd

# pickle lets us save python objects to a file and read them back in
import pickle
import itertools

# here are our neural network imports
from sknn import mlp

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.utils import np_utils

import theano

from urllib import urlretrieve
from sklearn import datasets
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
# scikit-learn does have a restricted Boltzmann machine class for doing unsupervised
# feature learning
from sklearn.neural_network import BernoulliRBM

import matplotlib.pyplot as plt
import seaborn as sns

Let's see where Theano will run our code:

In [None]:
print theano.config.device

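If we had a GPU, we could tell Theano to use it by setting `device=gpu`, for example through the `THEANO_FLAGS` environment variable or a `.theanorc` file.  The device has to be chosen before Theano is imported; assigning to `theano.config.device` afterwards won't switch it.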

## Multilayer Perceptron with a Single Hidden Layer

We'll be working with the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), a standard dataset of handwritten digits.  Note that this is a much bigger, higher resolution dataset than the handwritten digits dataset that we've seen in previous lectures.  In this first example, we'll use scikit-neuralnetwork and then Keras.  The data loading code is adapted from the Lasagne MNIST example [here](https://github.com/Lasagne/Lasagne/blob/master/examples/mnist.py).

First, we'll download the MNIST dataset as a pickle file, save it to a local file, and then read in its contents.  If we've already downloaded the file, we'll just read it in.

In [None]:
DATA_URL = 'http://deeplearning.net/data/mnist/mnist.pkl.gz'
DATA_FILENAME = 'mnist.pkl.gz'

if not os.path.exists(DATA_FILENAME):
    print "Downloading MNIST dataset..."
    urlretrieve(DATA_URL, DATA_FILENAME)

with gzip.open(DATA_FILENAME, 'rb') as f:
    data = pickle.load(f)

The pickle object has a training set, a validation set, and a test set, each stored as a pair of features and labels.  Let's split those out:

In [None]:
X_train, y_train = data[0]
X_valid, y_valid = data[1]
X_test, y_test = data[2]

The images of handwritten digits are 28 by 28 pixels, so each row of `X_train` holds 28 * 28 = 784 pixel values:

In [None]:
X_train.shape

In [None]:
X_train[0, :]

The previous handwritten digits dataset that we worked with was only 8 by 8 pixels.  Let's define a function so that we can look at some of the images:

In [None]:
def plot_handwritten_digit(the_image, label):
    plt.axis('off')
    plt.imshow(the_image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

In [None]:
image_num = 109
plot_handwritten_digit(X_train[image_num].reshape((28, 28)), y_train[image_num])
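
Since each row of `X_train` is a flattened image, it may help to see how a flat vector of 784 values maps onto a 28 by 28 grid.  The sketch below uses a made-up array (`flat_image` is a hypothetical stand-in, not MNIST data) and assumes NumPy's default row-major ordering, which is what `reshape` uses unless told otherwise.

```python
import numpy as np

# hypothetical stand-in for one row of X_train: 784 values, one per pixel
flat_image = np.arange(784, dtype='float32')

# reshape fills the grid row by row (row-major order)
image = flat_image.reshape((28, 28))

# the pixel at (row, col) comes from position row * 28 + col in the flat vector
row, col = 3, 5
print(image[row, col] == flat_image[row * 28 + col])

# flattening the grid again recovers the original vector
print(np.array_equal(image.ravel(), flat_image))
```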

Here, we define some constants that will be used in the training:

In [None]:
# we'll feed in this many training examples at a time
BATCH_SIZE = 600

# number of epochs to train for; one epoch is a full pass over all of the
# training data, i.e. one trip through the whole set of batches
NUM_EPOCHS = 10

# number of units in the hidden layer
NUM_HIDDEN_UNITS = 512

# these parameters control the gradient descent process to learn the weights
LEARNING_RATE = 0.01
MOMENTUM = 0.9 
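
To make the roles of these constants more concrete, here is a rough sketch of how they combine.  It assumes a training set of 50,000 examples (`N_TRAIN` is an assumption; check `X_train.shape[0]`) and uses one common formulation of the Nesterov momentum update on a toy objective.  The function `nesterov_step` is hypothetical and written for illustration; the packages used here may implement the update differently.

```python
import numpy as np

BATCH_SIZE = 600
NUM_EPOCHS = 10
LEARNING_RATE = 0.01
MOMENTUM = 0.9
N_TRAIN = 50000  # assumed training set size

# each epoch is split into batches; each batch produces one weight update
batches_per_epoch = int(np.ceil(N_TRAIN / float(BATCH_SIZE)))
total_updates = batches_per_epoch * NUM_EPOCHS
print("%d batches per epoch, %d updates in total" % (batches_per_epoch, total_updates))

def nesterov_step(w, velocity, grad_fn, lr=LEARNING_RATE, momentum=MOMENTUM):
    """One Nesterov momentum update (one common formulation, for illustration)."""
    grad = grad_fn(w)
    # the velocity is a decaying running sum of past gradient steps
    velocity = momentum * velocity - lr * grad
    # Nesterov variant: step using the updated velocity plus the current gradient
    w = w + momentum * velocity - lr * grad
    return w, velocity

# toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
velocity = np.zeros_like(w)
for _ in range(total_updates):
    w, velocity = nesterov_step(w, velocity, lambda x: x)

# after training, w should be very close to the minimum at the origin
print(np.linalg.norm(w))
```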

### `scikit-neuralnetwork`

First, let's define a network with scikit-neuralnetwork because it's by far the simplest.  Unfortunately, it looks like this package only supports squared loss and not cross-entropy for classification problems:

In [None]:
import sknn

In [None]:
layers = [mlp.Layer("Sigmoid", units=NUM_HIDDEN_UNITS), mlp.Layer("Softmax")]
sknn_mlp = mlp.Classifier(loss_type="mse", batch_size=BATCH_SIZE, layers=layers, learning_rate=LEARNING_RATE, 
                        learning_rule="nesterov", learning_momentum=MOMENTUM, n_iter=NUM_EPOCHS, verbose=True)
sknn_mlp
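
The two layers defined above correspond to a fairly simple computation.  The following is a minimal NumPy sketch of a forward pass through a sigmoid hidden layer followed by a softmax output layer.  It is not scikit-neuralnetwork's actual implementation: the weight names (`W1`, `b1`, `W2`, `b2`) are hypothetical, the weights are random rather than trained, and `X_fake` is made-up input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(X, W1, b1, W2, b2):
    """Sigmoid hidden layer followed by a softmax output layer."""
    hidden = sigmoid(X.dot(W1) + b1)
    return softmax(hidden.dot(W2) + b2)

rng = np.random.RandomState(0)
n_inputs, n_hidden, n_classes = 784, 512, 10

# small random weights stand in for trained parameters
W1 = rng.uniform(-0.05, 0.05, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

# five made-up "images" with pixel values in [0, 1)
X_fake = rng.rand(5, n_inputs)
probs = mlp_forward(X_fake, W1, b1, W2, b2)

print(probs.shape)        # one row of class probabilities per example
print(probs.sum(axis=1))  # each row sums to 1
```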

Then we'll fit it and make predictions:

In [None]:
sknn_mlp.fit(X_train, y_train)

In [None]:
test_preds = sknn_mlp.predict(X_test)
test_preds

In [None]:
print classification_report(y_test, test_preds)
print accuracy_score(y_test, test_preds)

### `keras`

First, we set up the model:

In [None]:
model = Sequential()

model.add(Dense(X_train.shape[1], NUM_HIDDEN_UNITS, init='uniform', activation='sigmoid'))
model.add(Dense(NUM_HIDDEN_UNITS, 10, init='uniform', activation='softmax'))

sgd = SGD(lr=LEARNING_RATE, decay=1e-6, momentum=MOMENTUM, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
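
This model is compiled with categorical cross-entropy, whereas the scikit-neuralnetwork model above used squared error.  As a rough illustration of the difference, the sketch below evaluates both losses on a made-up three-class example.  The function names are hypothetical and the squared error is summed over classes, which is only one of several conventions.

```python
import numpy as np

def squared_error(y_onehot, probs):
    # sum of squared differences over classes, averaged over examples
    return np.mean(np.sum((y_onehot - probs) ** 2, axis=1))

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    # negative log probability assigned to the true class, averaged over examples
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

# one example whose true class is the third one
y_onehot = np.array([[0.0, 0.0, 1.0]])
confident_right = np.array([[0.05, 0.05, 0.90]])
confident_wrong = np.array([[0.90, 0.05, 0.05]])

for name, probs in [("right", confident_right), ("wrong", confident_wrong)]:
    print("%s: squared error = %.3f, cross-entropy = %.3f"
          % (name, squared_error(y_onehot, probs),
             categorical_crossentropy(y_onehot, probs)))
```

With this convention the squared error can never exceed 2 for a single example, while the cross-entropy keeps growing as the probability assigned to the true class approaches zero, so confident mistakes are penalized more heavily.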

Then we fit it:

In [None]:
# keras expects the labels as a binary (one-hot) matrix with one column per class
Y_train = np_utils.to_categorical(y_train, 10)
Y_train
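
For reference, this kind of label matrix can be built by hand.  The sketch below is a NumPy version of one-hot encoding, on the assumption that `to_categorical` produces the standard one-hot layout; `one_hot` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (n_examples, n_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], n_classes))
    # put a 1 in the column given by each example's label
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

example_labels = [5, 0, 4]
Y_example = one_hot(example_labels, 10)
print(Y_example)
```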

In [None]:
model.fit(X_train, Y_train, batch_size=BATCH_SIZE, nb_epoch=NUM_EPOCHS)

In [None]:
Y_test_keras = model.predict(X_test)
Y_test_keras

In [None]:
test_preds = np_utils.categorical_probas_to_classes(Y_test_keras)
test_preds
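
Going from the matrix of predicted probabilities back to class labels amounts to picking the most probable class in each row.  The sketch below does this with `argmax` on made-up probabilities; it assumes that `categorical_probas_to_classes` behaves the same way.

```python
import numpy as np

# made-up predicted probabilities for two examples and three classes
example_probs = np.array([[0.1, 0.7, 0.2],
                          [0.6, 0.3, 0.1]])

# the predicted class is the column with the highest probability in each row
example_classes = example_probs.argmax(axis=1)
print(example_classes)  # [1 0]
```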

In [None]:
print classification_report(y_test, test_preds)
print accuracy_score(y_test, test_preds)

## Deeper Network with Dropout

Let's also try a deeper neural network with two hidden layers, ReLU activations, and dropout after each hidden layer.

### `keras`

In [None]:
deeper_model = Sequential()

deeper_model.add(Dense(X_train.shape[1], NUM_HIDDEN_UNITS, init='he_normal', activation='relu'))
deeper_model.add(Dropout(0.5))
deeper_model.add(Dense(NUM_HIDDEN_UNITS, NUM_HIDDEN_UNITS, init='he_normal', activation='relu'))
deeper_model.add(Dropout(0.5))
deeper_model.add(Dense(NUM_HIDDEN_UNITS, 10, init='he_normal', activation='softmax'))

sgd = SGD(lr=4.*LEARNING_RATE, decay=0., momentum=MOMENTUM, nesterov=True)
deeper_model.compile(loss='categorical_crossentropy', optimizer=sgd)
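
To give a sense of what the `Dropout(0.5)` layers do, here is a minimal NumPy sketch of a ReLU activation followed by dropout.  It uses the "inverted dropout" formulation, in which the surviving activations are rescaled during training; this is an assumption for illustration and may not match how Keras implements it internally.  The function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero out units during training; do nothing at prediction time."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    # each unit is kept independently with probability keep_prob
    mask = rng.binomial(1, keep_prob, size=activations.shape)
    # rescale so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = relu(np.ones((100, 100)))  # made-up activations, all equal to 1

dropped = dropout(hidden, 0.5, rng, training=True)
unchanged = dropout(hidden, 0.5, rng, training=False)

print(np.mean(dropped == 0))  # roughly half of the units are zeroed
print(dropped.mean())         # the average activation stays close to 1
```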

In [None]:
# keras expects the labels as a binary (one-hot) matrix with one column per class
Y_train = np_utils.to_categorical(y_train, 10)
deeper_model.fit(X_train, Y_train, batch_size=BATCH_SIZE, nb_epoch=NUM_EPOCHS)

In [None]:
Y_test_keras = deeper_model.predict(X_test)
test_preds = np_utils.categorical_probas_to_classes(Y_test_keras)
print classification_report(y_test, test_preds)
print accuracy_score(y_test, test_preds)

## Unsupervised Feature Learning with a Restricted Boltzmann Machine (RBM)

Here, we'll do unsupervised feature learning, which can also be seen as a form of dimensionality reduction.  This can be done with an autoencoder or a restricted Boltzmann machine (RBM).  Without getting into too many of the details, both an autoencoder and an RBM have an input layer and a single hidden layer of reduced dimensionality.  The hidden layer is trained so that it can take the input and reconstruct it as accurately as possible, but with a smaller set of hidden nodes than input nodes.  Luckily for us, scikit-learn has an RBM class.

In [None]:
#digits = datasets.load_digits()
#X_train = np.asarray(digits.data, 'float32')
#y_train = digits.target

Let's take our training set and rescale it so that the features are all between 0 and 1:

In [None]:
# 0-1 scaling per pixel; the small constant avoids division by zero for constant pixels
X_train = (X_train - np.min(X_train, 0)) / (np.max(X_train, 0) - np.min(X_train, 0) + 0.0001)
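
As a small worked example of this kind of scaling, the sketch below computes the per-column minimum and range on one array and then applies the same statistics to a second array, which is a common way to keep held-out data on the same scale.  The helper names and the random data are hypothetical, and this is not a step the notebook itself performs on the test set.

```python
import numpy as np

def fit_min_max(X, eps=1e-4):
    """Per-column minimum and range; eps guards against division by zero."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min + eps
    return col_min, col_range

def apply_min_max(X, col_min, col_range):
    return (X - col_min) / col_range

rng = np.random.RandomState(0)
X_fit = rng.rand(100, 4) * 255  # made-up data used to compute the statistics
X_new = rng.rand(20, 4) * 255   # made-up held-out data

col_min, col_range = fit_min_max(X_fit)
X_fit_scaled = apply_min_max(X_fit, col_min, col_range)
X_new_scaled = apply_min_max(X_new, col_min, col_range)

print(X_fit_scaled.min(axis=0))  # 0 for every column
print(X_fit_scaled.max(axis=0))  # just under 1 for every column
```

Values in `X_new_scaled` can fall slightly outside the 0 to 1 range when the held-out data contains values beyond those seen in `X_fit`.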

Let's take a random subset of 10,000 observations to train on:

In [None]:
N_EXAMPLES = 10000
X_train_rbm, X_test_rbm, y_train_rbm, y_test_rbm = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

X_train_rbm = X_train_rbm[0:N_EXAMPLES]
y_train_rbm = y_train_rbm[0:N_EXAMPLES]

The RBM needs a learning rate and a number of iterations to train for.  The parameter `n_components` tells it how many hidden nodes to use.  This is the dimensionality of the reduced space, analogous to the number of components in PCA or t-SNE.

In [None]:
rbm = BernoulliRBM(learning_rate=0.05, n_iter=20, n_components=200, random_state=0, verbose=True)

We fit the RBM like so:

In [None]:
rbm.fit(X_train_rbm)

In [None]:
rbm.components_.shape
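
The weight matrix has one row per hidden node and one column per input pixel.  To show how those weights are used, here is a rough NumPy sketch of the two conditional distributions in a Bernoulli RBM: the probability that each hidden node is on given the input, and the probability that each pixel is on given the hidden nodes.  The weights below are random placeholders rather than the fitted `rbm.components_`, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probabilities(V, components, intercept_hidden):
    """P(h = 1 | v) for each row of V; components is (n_hidden, n_visible)."""
    return sigmoid(V.dot(components.T) + intercept_hidden)

def visible_probabilities(H, components, intercept_visible):
    """P(v = 1 | h) for each row of H, i.e. a reconstruction of the input."""
    return sigmoid(H.dot(components) + intercept_visible)

rng = np.random.RandomState(0)
n_visible, n_hidden = 784, 200

# random placeholder parameters standing in for a trained RBM
components = rng.normal(0, 0.01, size=(n_hidden, n_visible))
intercept_hidden = np.zeros(n_hidden)
intercept_visible = np.zeros(n_visible)

V = rng.rand(3, n_visible)  # three made-up inputs scaled to [0, 1)
H = hidden_probabilities(V, components, intercept_hidden)
V_recon = visible_probabilities(H, components, intercept_visible)

print(H.shape)        # one 200-dimensional feature vector per input
print(V_recon.shape)  # one 784-pixel reconstruction per input
```

The hidden probabilities play the role of the reduced-dimension features: 784 inputs are summarized by 200 values between 0 and 1.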

To see what kinds of features the hidden nodes are learning, we can plot each node as a 28 by 28 pixel image, where the darkness of each pixel shows the size of the weight connecting the hidden node to the corresponding input pixel:

In [None]:
plt.figure(figsize=(4.2, 4))
for i, comp in enumerate(rbm.components_):
    plt.subplot(10, 20, i + 1)
    plt.imshow(comp.reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
    plt.xticks(())
    plt.yticks(())

plt.suptitle('Components extracted by the RBM', fontsize=16)
plt.subplots_adjust(0.08, 0.02, 0.92, 0.85, 0.08, 0.23)

plt.show()

Let's look in more detail at some of the individual hidden nodes:

In [None]:
def plot_rbm_component(comp_num):
    plt.figure(figsize=(4.2, 4))

    comp = rbm.components_[comp_num]
    plt.imshow(comp.reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
    plt.xticks(())
    plt.yticks(())

    plt.show()

In [None]:
# 60, 80, 90, 120
plot_rbm_component(120)

Let's see how a logistic regression trained using just the raw pixel values does:

In [None]:
N_LOGIT_TRAIN_EXAMPLES = 4000

pixel_logit = LogisticRegression()
pixel_logit.fit(X_train_rbm[0:N_LOGIT_TRAIN_EXAMPLES], y_train_rbm[0:N_LOGIT_TRAIN_EXAMPLES])

# score it on the same test set we used above
print classification_report(y_test, pixel_logit.predict(X_test))
print accuracy_score(y_test, pixel_logit.predict(X_test))

Now let's see how it does with the features learned by the RBM:

In [None]:
rbm_logit = LogisticRegression()
rbm_features_train = rbm.transform(X_train_rbm[0:N_LOGIT_TRAIN_EXAMPLES]) 
rbm_logit.fit(rbm_features_train, y_train_rbm[0:N_LOGIT_TRAIN_EXAMPLES])

rbm_features_test = rbm.transform(X_test) 
print classification_report(y_test, rbm_logit.predict(rbm_features_test))
print accuracy_score(y_test, rbm_logit.predict(rbm_features_test))

We can also take the hidden features learned by the RBM, and train a new RBM on them:

In [None]:
rbm_2 = BernoulliRBM(learning_rate=0.05, n_iter=50, n_components=150, random_state=0, verbose=True)
rbm_2.fit(rbm_features_train)

In [None]:
rbm_logit_2 = LogisticRegression()
rbm_features_train_2 = rbm_2.transform(rbm_features_train) 
rbm_logit_2.fit(rbm_features_train_2, y_train_rbm[0:N_LOGIT_TRAIN_EXAMPLES])

rbm_features_test_2 = rbm_2.transform(rbm_features_test) 
print classification_report(y_test, rbm_logit_2.predict(rbm_features_test_2))
print accuracy_score(y_test, rbm_logit_2.predict(rbm_features_test_2))