#Lecture 3.4 - Deep Learning Networks.  

##What you'll learn today
1.  The basic architecture and features of a modern deep learning neural net.
2.  How to apply a deep learning network to a new problem.
3.  Some of the alterations you can make to improve the performance of your deep network on a particular problem.

##Order of Topics
1.  Review the second Glorot, Bengio paper.
2.  Look at a 4-layer network incorporating Glorot-Bengio rectifying linear units (ReLU).  
3.  In-class exercises: modify the network, looking for performance improvements.



##Pre Reading
http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf - Glorot, Bengio - ReLU in deep learning.

http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf - Srivastava, Hinton paper on using dropout for regularization.  

http://deeplearning.net/software/theano/tutorial/examples.html - Read and run the sample code for generating random numbers in Theano.  

http://climin.readthedocs.org/en/latest/rmsprop.html#id1 - Look at the definition of RMSProp.

##Modern Deep Neural Net

There are several key features of this network that make it perform as well as it does.  The main features are:

1.  More than one hidden layer - 4 or more network layers
2.  Rectifying linear units for activation functions
3.  Dropout for regularization
4.  A better weight update than plain gradient descent.
5.  Proper weight initialization.  

Having more than one hidden layer is what defines deep learning. The Glorot, Bengio paper that you saw a few lectures ago showed how the layers farther from the input layer tend to saturate and stall training.  The remaining items in the list are the improvements needed to train networks with more than one hidden layer reliably.  

A rectifying linear unit (ReLU) uses the activation function f(x) = max(0, x).  It is flat and equal to zero for negative input values and equal to the input for positive input values.  
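As a quick illustration of the definition above, here is a minimal NumPy sketch of the rectifier. The name `relu` and the sample inputs are chosen only for this example; the lecture code implements the same function in Theano as `rectify`.

```python
import numpy as np

def relu(x):
    """Rectifier activation: elementwise max(0, x)."""
    return np.maximum(x, 0.0)

# Sample inputs (arbitrary, for illustration only).
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)
print(y)  # negative inputs map to zero, positive inputs pass through
```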

##Q's
Can you think of any possible issues with the rectifying linear unit?

The Glorot paper http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf explains the use and benefits of rectifying linear units.  

Dropout is a method for preventing overtraining in a neural net.  The basic idea is to randomly set some of the outputs from input neurons and hidden neurons to zero.  Because no individual input can be counted on to be present, the network cannot depend too heavily on any one of them, which reduces overtraining.  The basic method is outlined and discussed in the Srivastava, Hinton paper.  http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
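The idea can be sketched in plain NumPy. This is an illustrative version only, assuming the "inverted" variant in which surviving outputs are divided by the retain probability so that the expected output is unchanged (the lecture's Theano `dropout` function does the same rescaling). The seed, array shape, and drop probability are arbitrary choices for the example.

```python
import numpy as np

def dropout(x, p, rng):
    """Zero each entry of x with probability p and rescale the survivors."""
    if p <= 0:
        return x
    retain_prob = 1.0 - p
    # 0/1 mask: each entry is kept with probability retain_prob.
    mask = rng.binomial(1, retain_prob, size=x.shape)
    # Dividing by retain_prob keeps the expected value equal to x.
    return x * mask / retain_prob

rng = np.random.RandomState(0)   # arbitrary seed, for repeatability
x = np.ones((1000, 50))          # stand-in for a layer's outputs
dropped = dropout(x, p=0.5, rng=rng)

kept_fraction = np.mean(dropped != 0)
print(kept_fraction, dropped.mean())  # roughly 0.5 and roughly 1.0
```

With `p=0.5`, about half of the entries are zeroed and the rest are doubled, so the mean stays close to 1. At prediction time the lecture code simply calls the model with both drop probabilities set to zero.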

The code below implements a four-layer fully connected network for classifying MNIST digits.  This code uses RMSProp, an accelerated gradient descent method that bears some similarity to AdaDelta and AdaGrad.  Here's a link to a description of RMSProp.  http://climin.readthedocs.org/en/latest/rmsprop.html#id1
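To make the update concrete before reading the Theano version, here is a small NumPy sketch of a single RMSProp step. It follows the same form as the `RMSprop` function in the lecture code: a running average of squared gradients, with the gradient divided by the square root of that average. The function name `rmsprop_step` and the toy quadratic objective are assumptions made up for this example.

```python
import numpy as np

def rmsprop_step(p, g, acc, lr=0.001, rho=0.9, epsilon=1e-6):
    """One RMSProp update for parameters p with gradient g.

    acc is the running average of squared gradients, one entry per parameter.
    """
    acc_new = rho * acc + (1 - rho) * g ** 2
    p_new = p - lr * g / np.sqrt(acc_new + epsilon)
    return p_new, acc_new

# Toy objective (hypothetical): f(p) = 0.5 * sum(scale * p**2),
# whose two coordinates have gradients differing by a factor of 100.
scale = np.array([100.0, 1.0])
p = np.array([1.0, 1.0])
acc = np.zeros_like(p)

for _ in range(2000):
    g = scale * p                      # gradient of the toy objective
    p, acc = rmsprop_step(p, g, acc, lr=0.01)

print(p)  # both coordinates end up close to zero
```

Because each gradient is divided by its own running magnitude, the step size depends only weakly on how large the raw gradient is, which is why both coordinates make progress here despite the very different scales.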

##Code Walk-through - group discussion

1  Look at the function definitions to get a rough idea of what each of them does.

2  Walk through the main body of the code and discuss the overall structure.  

3  Any comments on the weight initialization?   

Handling random numbers in Theano requires using a Theano version of a random number generator.  Dropout requires random draws to determine which neuron outputs to ignore.  Here's how the Theano version of random number generation works.  

http://deeplearning.net/software/theano/tutorial/examples.html

4  Walk through the handling of the weight and gradient-handling variables.  Do you see any problems with the update?  

5  Group coding exercise - Fix the problem with the subroutine that returns update equations.  

6  Rewrite the random weight initialization to incorporate Glorot scaling.   

7  Re-run and discuss improvement or deterioration.  
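As a starting point for item 6, here is a NumPy sketch of one common form of Glorot scaling: the "normalized" uniform initialization from the Glorot, Bengio paper, which draws weights from U[-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The function name `init_weights_glorot` is hypothetical, and this is only one of several variants you could try in the exercise.

```python
import numpy as np

def init_weights_glorot(shape, rng):
    """Draw a (fan_in, fan_out) weight matrix with Glorot uniform scaling."""
    fan_in, fan_out = shape
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.RandomState(0)              # arbitrary seed
w = init_weights_glorot((784, 625), rng)    # same shape as w_h in the lecture code

# The variance of U[-limit, limit] is limit**2 / 3 = 2 / (fan_in + fan_out).
print(w.shape, w.var(), 2.0 / (784 + 625))
```

To use this in the lecture code, the returned array would still need to be wrapped the same way the existing `init_weights` does it, with `theano.shared(floatX(...))`.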

In [1]:
import theano
from theano import tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from mnistReader import mnist
import numpy as np

srng = RandomStreams()

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def rectify(X):
    return T.maximum(X, 0.)

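def softmax(X):
    # Subtract each row's max before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the result is unchanged.
    e_x = T.exp(X - X.max(axis=1).dimshuffle(0, 'x'))
    return e_x / e_x.sum(axis=1).dimshuffle(0, 'x')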

def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates

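def dropout(X, p=0.):
    # Zero each element with probability p, then rescale the survivors by
    # 1 / (1 - p) so the expected value of the output matches the input.
    if p > 0:
        retain_prob = 1 - p
        X *= srng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
        X /= retain_prob
    return X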

def model(X, w_h, w_h2, w_o, p_drop_input, p_drop_hidden):
    X = dropout(X, p_drop_input)
    h = rectify(T.dot(X, w_h))

    h = dropout(h, p_drop_hidden)
    h2 = rectify(T.dot(h, w_h2))

    h2 = dropout(h2, p_drop_hidden)
    py_x = softmax(T.dot(h2, w_o))
    return h, h2, py_x

xTrain, xTest, yTrain, yTest = mnist()
X = T.fmatrix()
Y = T.fmatrix()

w_h = init_weights((784, 625))
w_h2 = init_weights((625, 625))
w_o = init_weights((625, 10))

noise_h, noise_h2, noise_py_x = model(X, w_h, w_h2, w_o, 0.2, 0.5)
h, h2, py_x = model(X, w_h, w_h2, w_o, 0., 0.)
y_x = T.argmax(py_x, axis=1)

cost = T.mean(T.nnet.categorical_crossentropy(noise_py_x, Y))
params = [w_h, w_h2, w_o]
updates = RMSprop(cost, params, lr=0.001)

train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)

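for i in range(101):
    # Mini-batches of 128; this zip skips the last (possibly partial) batch.
    for start, end in zip(range(0, len(xTrain), 128), range(128, len(xTrain), 128)):
        cost = train(xTrain[start:end], yTrain[start:end])
    # Report test accuracy every 10 epochs (works on Python 2 and 3).
    if i % 10 == 0:
        print("%d %.4f" % (i, np.mean(np.argmax(yTest, axis=1) == predict(xTest))))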

0 0.9373
10 0.9822
20 0.9842
30 0.9858
40 0.9862


KeyboardInterrupt: 

##In-class coding exercise
Replace RMSProp with a different accelerated gradient method: AdaDelta, NAG, or AdaGrad.  

##Homework Exercise
Build a 4-layer network for classifying CIFAR images.  Use 10k training examples (as in the last lecture) to shorten the training time.  