#Traditional (10 years ago) 3-layer ANN

##What you'll learn in this lecture
1.  How to add momentum-method weight updates to a simple ANN
2.  The structure of a standard neural net
3.  Familiarization with the Cifar data set
4.  How to code a simple ANN for color images

##Order of topics

1.  Upgrade 2-layer ANN to incorporate momentum.
2.  Architecture of standard neural net
3.  Code for MNIST classifier using standard ANN
4.  Introduction to Cifar image classification data set
5.  Coding exercise to build Cifar classifier


##Pre-reading

http://www.cs.toronto.edu/~kriz/cifar.html



##Upgrading 2-Layer Network to Incorporate Momentum-Method Weight Updates

Recall the equations for the momentum method from Lecture 2.2.  They're repeated below.

$Step_{i+1} = \beta Step_i - (1 - \beta)\delta\nabla J_i$

and

$w_{i+1} = w_i + Step_{i+1}$

With simple gradient descent you adjust the weights incrementally using gradient information.  The momentum method introduces a new variable (called "Step" in the equations above).  The new variable is adjusted incrementally using gradient information, and the weights are then adjusted incrementally using the new variable.  Recall that the "updates" parameter of theano.function provided the mechanism for incrementally adjusting the weights with the gradient.  You'll use that same mechanism to update the new variable that the momentum method introduces.

The code block below highlights the alterations required to incorporate momentum method updates.  The key difference shows up in the definition of "update".  With pure gradient descent the line of code defining the update previously was:

update = [[w, w - gradient * 0.05]]

The python list had two components.  The first is the (shared) variable being updated in the function and the second is the new value to be assigned.  If you were doing the updates in python you might author a line something like 

w = w - gradient * 0.05

The old definition gets replaced with 

update = [(w, w + dw), (dw, beta * dw - (1 - beta) * delta * gradient)]

Inside the list there are two tuples.  What's called "Step" in the equations above is called "dw" in the code.  The first tuple is (w, w + dw), which describes the new weight update, and the second is (dw, beta * dw - (1 - beta) * delta * gradient), which describes the update for the new variable introduced by the momentum method.  Enclosing each update expression in a tuple instead of a list demonstrates that a tuple works the same way.  You can use whichever is convenient for you.  The second novel element is that having several variables to update leads to a list of tuples (or lists), where each tuple describes the update equation for its corresponding variable.  You'll see that mechanism exercised again to update weights for more than one layer.
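
To see the two momentum equations at work outside of Theano, here is a minimal NumPy sketch.  The function name momentum_descent and the quadratic test problem are made up for illustration and are not part of the lecture code.  The values of beta and delta are the ones used in this lecture.

```python
import numpy as np

def momentum_descent(grad_fn, w0, beta=0.9, delta=0.5, n_steps=300):
    """Minimize a function using the two momentum-method equations."""
    w = np.asarray(w0, dtype=float)
    step = np.zeros_like(w)   # plays the role of "dw" in the Theano code
    for _ in range(n_steps):
        # Step_{i+1} = beta * Step_i - (1 - beta) * delta * grad J_i
        step = beta * step - (1.0 - beta) * delta * grad_fn(w)
        # w_{i+1} = w_i + Step_{i+1}
        w = w + step
    return w

# Hypothetical test problem: J(w) = 0.5 * ||w - target||^2, so grad J = w - target.
target = np.array([3.0, -2.0])
w_final = momentum_descent(lambda w: w - target, w0=[0.0, 0.0])
print(w_final)
```

One difference from this sketch is worth knowing: Theano evaluates every expression in the updates list using the values the shared variables held before the call.  In the Theano version w is therefore advanced by the previous value of dw, a one-iteration lag relative to the loop above.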



In [1]:
#code for 2-layer classifier network
import theano
from theano import tensor as T
__author__ = 'mike.bowles'
import numpy as np
import pylab as plt
import matplotlib.cm as cm
from mnistReader import mnist
%matplotlib inline

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def init_dw(shape):                                #new init function for momentum variable
    return theano.shared(floatX(np.zeros(shape)))

def model(X, w):
    return T.nnet.softmax(T.dot(X, w))

xTrain, xTest, yTrain, yTest = mnist()

X = T.fmatrix()
Y = T.fmatrix()

shape = (784, 10)
w = init_weights(shape)
dw = init_dw(shape)


py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)

cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)

beta = T.constant(0.9)    #parameter for momentum method
delta = T.constant(0.5)   #parameter for momentum method

update = [(w, w + dw), (dw, beta * dw - (1 - beta) * delta * gradient)]  #expanded for momentum

train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)

for i in range(101):
    for start, end in zip(range(0, len(xTrain), 128), range(128, len(xTrain), 128)):
        cost = train(xTrain[start:end], yTrain[start:end])
    if i%10 == 0: print i, np.mean(np.argmax(yTest, axis=1) == predict(xTest))

0 0.9046
10 0.9225
20 0.9233
30 0.9239
40 0.9232
50 0.9223
60 0.9211
70 0.9213
80 0.9215
90 0.9214
100 0.9218


##In-class coding exercise
1.  In the 2-layer MNIST classifier change the weights to satisfy Glorot's conditions and change the weight updates from momentum to AdaGrad or AdaDelta.  

##The ANN BD (Before Deep Learning)

The figure below shows the basic structure used for neural networks up until new training approaches were developed in the last 10 years or so.  The network has three layers: an input layer, a hidden layer and an output layer.  Typically the number of hidden neurons is less than the number of input neurons, and the number of output neurons is smaller still (dictated largely by the type of problem being solved).
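
The three-layer structure amounts to two matrix multiplications, each followed by a non-linearity.  The NumPy sketch below is an illustration only, not the lecture's Theano implementation.  It assumes a sigmoid hidden layer and a softmax output layer, with the layer sizes (784, 625, 10) used for MNIST in this lecture, and it feeds in random numbers in place of real images.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift each row for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, w_h, w_o):
    h = sigmoid(X.dot(w_h))        # input layer -> hidden layer
    return softmax(h.dot(w_o))     # hidden layer -> output class probabilities

rng = np.random.RandomState(0)
n_in, n_hidden, n_out = 784, 625, 10
w_h = rng.randn(n_in, n_hidden) * 0.01
w_o = rng.randn(n_hidden, n_out) * 0.01

X = rng.rand(4, n_in)              # stand-in batch of 4 random "images"
probs = forward(X, w_h, w_o)
print(probs.shape)
```

Each row of probs holds one probability per class, so taking the argmax along axis 1 gives the predicted class for each input.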



In [4]:
import theano
from theano import tensor as T
import numpy as np
from math import sqrt



def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    (h, w) = shape
    normalizer = 2.0 * sqrt(6) / sqrt(h + w) * 0.2 # 0.25 for sigmoid and 0.1 for softmax
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))  #code for standard initialization
    #return theano.shared(floatX((np.random.random_sample(shape) - 0.5) * normalizer))  #code for using Glorot init

def sgd(cost, params, lr=0.05):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        updates.append([p, p - g * lr])
    return updates

def model(X, w_h, w_o):
    #code for tanh activation functions
    #h = T.tanh(T.dot(X, w_h))

    #code for sigmoid activation functions
    h = T.nnet.sigmoid(T.dot(X, w_h))
    pyx = T.nnet.softmax(T.dot(h, w_o))
    return pyx

X = T.fmatrix()
Y = T.fmatrix()

w_h = init_weights((784, 625))
w_o = init_weights((625, 10))

py_x = model(X, w_h, w_o)
y_x = T.argmax(py_x, axis=1)

cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
params = [w_h, w_o]
updates = sgd(cost, params)

train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)

for i in range(101):
    for start, end in zip(range(0, len(xTrain), 128), range(128, len(xTrain), 128)):
        cost = train(xTrain[start:end], yTrain[start:end])
    if i%10 == 0: print "i = ", i, "  accuracy = ", np.mean(np.argmax(yTest, axis=1) == predict(xTest))

i =  0   accuracy =  0.7021
i =  10   accuracy =  0.909
i =  20   accuracy =  0.9184
i =  30   accuracy =  0.9227
i =  40   accuracy =  0.9282
i =  50   accuracy =  0.9338
i =  60   accuracy =  0.9397
i =  70   accuracy =  0.945
i =  80   accuracy =  0.9491
i =  90   accuracy =  0.9536
i =  100   accuracy =  0.9574


##In-Class coding exercise

1.  Change to Glorot, Bengio initial weights and tanh non-linearities and rerun.  You can make those changes by uncommenting the code above.

2.  Pick a configuration and switch to AdaGrad, AdaDelta or NAG updates (keeping Glorot, Bengio initial weights).  What is the difference?
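
As a reference for exercise 1, the Glorot, Bengio initialization draws each weight uniformly from plus or minus sqrt(6) / sqrt(fan_in + fan_out).  The NumPy sketch below shows that range on its own.  The function name glorot_uniform and its scale argument are hypothetical.  The scale argument stands in for the extra multiplier (0.2) that the lecture code applies to its normalizer.

```python
import numpy as np
from math import sqrt

def glorot_uniform(shape, scale=1.0, rng=np.random):
    """Draw weights uniformly from [-limit, limit) for a (fan_in, fan_out) matrix."""
    fan_in, fan_out = shape
    limit = scale * sqrt(6.0) / sqrt(fan_in + fan_out)
    # random_sample is uniform on [0, 1); shift and stretch it to [-limit, limit)
    return (rng.random_sample(shape) - 0.5) * 2.0 * limit

rng = np.random.RandomState(0)
w_h = glorot_uniform((784, 625), rng=rng)
limit = sqrt(6.0) / sqrt(784 + 625)
print(w_h.shape, np.abs(w_h).max() <= limit)
```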

##CIFAR data set - processing full color images

The Cifar image classification images can be downloaded from the web page listed under Pre-reading (http://www.cs.toronto.edu/~kriz/cifar.html).  You'll build a neural net to classify the Cifar-10 images.  This data set has 60,000 32x32 pixel color images in 10 different classes.  The data set is balanced: it contains 6,000 images of each of the 10 classes.  The data are divided into 50,000 training examples and 10,000 test examples.  The 10 classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.  The image below shows several examples.
<img src="CifarImages.png">
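
A fully connected network takes each image as one flat row of numbers.  The sketch below shows one way to move between a 32x32 color image and such a row.  It assumes the layout described on the Cifar web page for the python version of the data (all red values first, then green, then blue, each in row-major order).  Whether the readCifar module used in this lecture keeps that layout is an assumption, and the helper names are made up for illustration.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 32, 32, 3

def image_to_row(image):
    """Flatten a (32, 32, 3) image into one row: red plane, green plane, blue plane."""
    return image.transpose(2, 0, 1).reshape(-1)

def row_to_image(row):
    """Rebuild the (32, 32, 3) image from a flattened row."""
    return row.reshape(CHANNELS, HEIGHT, WIDTH).transpose(1, 2, 0)

rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(HEIGHT, WIDTH, CHANNELS)).astype(np.uint8)  # stand-in image

row = image_to_row(image)
restored = row_to_image(row)
print(row.shape, restored.shape)
```

The array returned by row_to_image has the height, width, channel ordering that plotting functions such as matplotlib's imshow expect.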

The code below reads the training and test data sets.  You'll only train on 10,000 of the training examples in order to cut down on the training time.  



In [5]:
__author__ = 'mike.bowles'
from readCifar import cifar

xTrain, yTrain, xTest, yTest = cifar()

print 'xTrain shape = ', xTrain.shape
print 'yTrain shape = ', yTrain.shape
print 'xTest shape = ', xTest.shape
print 'yTest shape = ', yTest.shape

xTrain shape =  (10000, 3072)
yTrain shape =  (10000, 10)
xTest shape =  (10000, 3072)
yTest shape =  (10000, 10)


##Q's

1.  There are 10000 rows of x's and y's for training and for testing.  Explain the other dimension of the xTrain and xTest arrays.  

##In-class coding exercise

Adapt the simple logistic regression code from last lecture to train a classifier for the Cifar images.  Notice that the code snippet above has already read in the training and testing files.  The code block below contains the MNIST classifier code from last lecture as a starting point.

In [None]:
import theano
from theano import tensor as T
import numpy as np

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def model(X, w):
    return T.nnet.softmax(T.dot(X, w))

xTrain, xTest, yTrain, yTest = mnist(onehot=True)  #same return order as the first code cell

X = T.fmatrix()
Y = T.fmatrix()

w = init_weights((784, 10))

py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)

cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]

train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)

for i in range(101):
    for start, end in zip(range(0, len(xTrain), 128), range(128, len(xTrain), 128)):
        cost = train(xTrain[start:end], yTrain[start:end])
    if i % 10 == 0: print i, np.mean(np.argmax(yTest, axis=1) == predict(xTest))


##Homework exercise
Code a standard 3-layer ANN for classifying Cifar images.