#Lecture 4.1 - Training a Neural Net


##Pre-reading

The references below are related. The first is the full text of a paper on training problems, and the second is an outline of the topics that paper covers. You might skim the first one and then look at the outline to see what you missed or to uncover things you'd like to look at in more detail. The remaining three deal with choosing and adapting learning rates.

http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf - full text of paper on problems and approaches to training

https://aresearch.wordpress.com/2015/04/11/efficient-backprop-lecun-bottou-orr-muller-neural-networks-tricks-of-the-trade-1998q/ - outline summary of paper above 

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40808.pdf - empirical learning-rate algorithms (Google NY)

http://arxiv.org/pdf/1206.1106.pdf - Schaul (with LeCun) paper on adaptive learning rates

http://arxiv.org/pdf/1301.3764.pdf - Schaul replaces the diagonal Hessian with a finite-difference approximation


##Learning Goals


##Training problems

As you discovered in the last homework problem, neural nets are a pain to train. That difficulty has long been a source of criticism of neural nets. For example, look at the strong language used in the training section of the Wikipedia page on neural nets: https://en.wikipedia.org/wiki/Artificial_neural_network#Training_issues

The quote on Wikipedia comes from 1997, the dark ages for neural nets. Things have improved, but as you saw with the homework, neural nets can still be tough to train. In today's class you'll learn how to monitor the progress of training and what to do if your network isn't training properly.

The code below has several options built in so you can alter the training properties of the network and see how they affect training performance. Your goal for the class is to learn how to determine what's going wrong in the network you're training and to modify the network in a way that fixes the problem.
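
As a starting point for monitoring, here is a minimal sketch under stated assumptions. The training loop in the code below prints only the mean of each weight matrix, and a few extra statistics can make trouble easier to spot. The helper `weight_summary` is hypothetical (it is not part of the notebook), and the demo matrix is random stand-in data rather than real network weights.

```python
import numpy as np

def weight_summary(w):
    """Return basic statistics for one weight matrix (hypothetical helper)."""
    w = np.asarray(w, dtype=np.float64)
    return {
        "mean": float(w.mean()),
        "std": float(w.std()),
        "max_abs": float(np.abs(w).max()),
        # False as soon as any entry is NaN or infinite
        "all_finite": bool(np.isfinite(w).all()),
    }

# Stand-in for something like w_h.get_value(): random data, not trained weights.
rng = np.random.RandomState(0)
w_demo = rng.uniform(-0.05, 0.05, size=(20, 10))
stats = weight_summary(w_demo)
print("mean=%.4f std=%.4f max_abs=%.4f finite=%s"
      % (stats["mean"], stats["std"], stats["max_abs"], stats["all_finite"]))
```

In the notebook you could pass `w_h.get_value()` or `w_h2.get_value()` to a helper like this once per epoch and watch how the spread and the largest weight change over time.
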

import theano
from theano import tensor as T
import numpy as np
from math import sqrt
from cifarHandler import cifar
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
import matplotlib.pyplot as plt
import itertools
import time

srng = RandomStreams()

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    (h, w) = shape
    # Glorot (uniform) initialization: with the normalizer below, the weights are
    # drawn uniformly from [-sqrt(6)/sqrt(h + w), +sqrt(6)/sqrt(h + w)].
    # The last factor depends on the non-linearity:
    # 0.25 for sigmoid and 0.1 for softmax, 1.0 for tanh or ReLU
    normalizer = 2.0 * sqrt(6) / sqrt(h + w) * 1.0
    # standard (small Gaussian) initialization, kept for comparison:
    #return theano.shared(floatX(np.random.randn(*shape) * 0.01))
    # Glorot initialization
    return theano.shared(floatX((np.random.random_sample(shape) - 0.5) * normalizer))

def rectify(X):
    return T.maximum(X, 0.)

def softmax(X):
    e_x = T.exp(X - X.max(axis=1).dimshuffle(0, 'x'))
    return e_x / e_x.sum(axis=1).dimshuffle(0, 'x')

def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates

def adaGrad(cost, params, eta=0.1, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        sumGSq = theano.shared(p.get_value() * 0.)
        sumGSq_new = sumGSq + g ** 2
        gradient_scaling = T.sqrt(sumGSq_new + epsilon)
        g = g / gradient_scaling
        updates.append((sumGSq, sumGSq_new))
        updates.append((p, p - eta * g))
    return updates

def adaDelta(cost, params, eta=1.0, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
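        # running averages of squared gradients and squared updates
        gSq = theano.shared(p.get_value() * 0.)
        dwSq = theano.shared(p.get_value() * 0.)

        # exponentially smoothed squared gradient
        gSqNew = rho * gSq + (1 - rho) * g * g

        # update step: RMS of past updates over RMS of gradients; the denominator
        # uses the freshly updated gSqNew, as in the published AdaDelta rule
        dw = eta * T.sqrt(dwSq + epsilon) * g / T.sqrt(gSqNew + epsilon)
        # exponentially smoothed squared update
        dwSqNew = rho * dwSq + (1 - rho) * dw * dw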

        updates.append((dwSq, dwSqNew))
        updates.append((gSq, gSqNew))
        updates.append((p, p - dw))
    return updates


def dropout(X, p=0.):
    if p > 0:
        retain_prob = 1 - p
        X *= srng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
        X /= retain_prob
    return X

def model(X, w_h, w_h2, w_o, p_drop_input, p_drop_hidden):
    X = dropout(X, p_drop_input)
    h = rectify(T.dot(X, w_h))

    h = dropout(h, p_drop_hidden)
    h2 = rectify(T.dot(h, w_h2))

    h2 = dropout(h2, p_drop_hidden)
    py_x = softmax(T.dot(h2, w_o))
    return h, h2, py_x


X = T.fmatrix()
Y = T.fmatrix()

w_h = init_weights((3072, 1500))
w_h2 = init_weights((1500, 700))
w_o = init_weights((700, 10))

noise_h, noise_h2, noise_py_x = model(X, w_h, w_h2, w_o, 0.2, 0.5)
h, h2, py_x = model(X, w_h, w_h2, w_o, 0., 0.)
y_x = T.argmax(py_x, axis=1)

cost = T.mean(T.nnet.categorical_crossentropy(noise_py_x, Y))
params = [w_h, w_h2, w_o]

#updates = RMSprop(cost, params, lr=0.0001)
#updates = adaGrad(cost, params, eta=0.001, epsilon=0.01) #
updates = adaDelta(cost, params, eta=0.1, rho=0.9, epsilon=0.1)

xTrain, yTrain, xTest, yTest = cifar(2, True)

train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)


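iJump = 128  # minibatch size

for i in range(101):
    for start, end in zip(range(0, len(xTrain), iJump), range(iJump, len(xTrain), iJump)):
        # separate name so the symbolic `cost` defined above is not overwritten
        batchCost = train(xTrain[start:end], yTrain[start:end])
        # epoch, test accuracy, mean of first- and second-layer weights
        print i, np.mean(np.argmax(yTest, axis=1) == predict(xTest)), np.mean(w_h.get_value()), np.mean(w_h2.get_value())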

    #asList = list(itertools.chain.from_iterable((w_h.get_value()).tolist()))
    #plt.hist(asList)
    #plt.show()
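
To see how the three update rules in the code above differ, it can help to strip away Theano. What follows is a minimal NumPy sketch, not the notebook's implementation: the function names, the `state` dictionary, and the badly scaled toy quadratic loss are all assumptions made for illustration. The AdaDelta step here divides by the freshly updated gradient average, as in the published rule.

```python
import numpy as np

def rmsprop_step(w, g, state, lr=0.001, rho=0.9, epsilon=1e-6):
    # exponentially smoothed squared gradient
    state["acc"] = rho * state["acc"] + (1 - rho) * g ** 2
    return w - lr * g / np.sqrt(state["acc"] + epsilon)

def adagrad_step(w, g, state, eta=0.1, epsilon=1e-6):
    # running sum of squared gradients (never decays)
    state["acc"] = state["acc"] + g ** 2
    return w - eta * g / np.sqrt(state["acc"] + epsilon)

def adadelta_step(w, g, state, eta=1.0, rho=0.9, epsilon=1e-6):
    state["acc"] = rho * state["acc"] + (1 - rho) * g ** 2
    # RMS of past updates divided by RMS of gradients
    dw = eta * np.sqrt(state["dw_acc"] + epsilon) * g / np.sqrt(state["acc"] + epsilon)
    state["dw_acc"] = rho * state["dw_acc"] + (1 - rho) * dw ** 2
    return w - dw

# Toy problem (an assumption): loss = 0.5 * sum(CURV * w**2), curvatures 1 and 100.
CURV = np.array([1.0, 100.0])

def loss_and_grad(w):
    return 0.5 * np.sum(CURV * w ** 2), CURV * w

def run(step_fn, n_steps=200, **kwargs):
    w = np.array([1.0, 1.0])
    state = {"acc": np.zeros_like(w), "dw_acc": np.zeros_like(w)}
    for _ in range(n_steps):
        _, g = loss_and_grad(w)
        w = step_fn(w, g, state, **kwargs)
    return loss_and_grad(w)[0]

initial_loss = loss_and_grad(np.array([1.0, 1.0]))[0]
results = {
    "RMSprop": run(rmsprop_step, lr=0.01),
    "adaGrad": run(adagrad_step, eta=0.1),
    "adaDelta": run(adadelta_step, eta=1.0),
}
print("initial loss %.4g" % initial_loss)
for name in sorted(results):
    print("%-8s final loss %.4g" % (name, results[name]))
```

All three rules divide the gradient by a running measure of its own size, so the two coordinates move at similar speeds despite the hundredfold difference in curvature. The rules differ in how that measure is accumulated and in whether a separate learning rate is needed.
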

##In-class training exercises
1  Use only one of the CIFAR training data sets.  Use RMSprop updates for the weights.  Set normalization to "False" in the routine that reads in the CIFAR data.  Increase the learning rate until the network obviously overtrains.  What characterizes over-training behavior?  How can you identify it?  When the network is overtraining, uncomment the plot-related statements at the bottom of the code and observe what the weight histogram looks like as the network overtrains.

2  Comment on any odd behaviors that you observe in the weight matrix as it's training.  


3  Try this exercise with the other update algorithms available in the code for this network.  Does one of the update algorithms seem better or worse behaved than the others?  What properties make it stand out?

4  Pick the worst behaved algorithm (in your view) and change the "Normalized" option for the CIFAR reader to "True".  Do the same thing with the best behaved algorithm.  How much difference would you say normalization makes?  Can you get both the best and the worst update algorithms to train?

5  Change the number of data packages included in the training to 5, so that all the data are available for training.  Start training with the parameter settings that you got working in the last exercise.  Monitor the training by looking at the available output variables.  Add your own if you like.  Does adding more data let you increase the learning rate and train more rapidly?
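
Exercise 5 invites you to add your own monitored variables. One possibility, offered only as a sketch: the `model` function already returns the hidden activations `h` and `h2`, so you could track how many rectifier units never switch on for a batch. The helper `inactive_fraction` and the demo array are hypothetical and are not part of the notebook.

```python
import numpy as np

def inactive_fraction(activations):
    """Fraction of hidden units that output zero for every example in the batch.

    `activations` has shape (n_examples, n_units), as produced by a ReLU layer.
    """
    activations = np.asarray(activations)
    never_active = np.all(activations <= 0.0, axis=0)
    return float(never_active.mean())

# Stand-in batch: 4 examples, 3 hidden units; the middle unit never fires.
h_demo = np.array([[0.5, 0.0, 0.0],
                   [0.0, 0.0, 1.2],
                   [0.3, 0.0, 0.0],
                   [0.0, 0.0, 0.7]])
print("inactive fraction: %.3f" % inactive_fraction(h_demo))
```

To feed it real activations you would need to compile an extra function, for example something like `theano.function(inputs=[X], outputs=[h, h2], allow_input_downcast=True)`. That call does not appear in the original code, so treat it as an assumption about how you might wire this in.
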

##Homework
Build a function to perform Nesterov updates and run enough training to compare it to your favorite of the update algorithms above.


