# Deep Learning - The 4-layer network

## What you'll learn today
1.  Why a 4-layer network is materially harder to train than a 3-layer network
2.  How to overcome the training difficulties of a 4-layer network
3.  The basic architecture and features of a modern deep learning neural net
4.  How to apply a deep learning network to a new problem
5.  Some of the alterations you can make to improve the performance of your deep network on a particular problem

## Order of Topics
1.  Review the second Glorot, Bengio paper.
2.  Look at a 4-layer network incorporating Glorot and Bengio's rectified linear units (ReLU).
3.  In-class exercises: modify the network, looking for performance improvements.



## Pre Reading
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf - Glorot, Bengio paper on training  
http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf - Glorot, Bengio - Relu in deep learning  
http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf - Srivastava, Hinton paper on using dropout for regularization.  
https://www.tensorflow.org/versions/r0.10/api_docs/python/nn.html#dropout - Dropout function in TensorFlow  
http://climin.readthedocs.org/en/latest/rmsprop.html#id1 - look at definition of RMSProp  
http://arxiv.org/pdf/1502.03167v3.pdf - look at batch normalization  

## Modern Deep Neural Net

There are several key features of this network that make it perform as well as it does.  The main features are:

1.  More than one hidden layer - 4 or more network layers
2.  Rectified linear units (ReLU) for activation functions
3.  Dropout for regularization
4.  A better weight update than plain gradient descent
5.  Proper weight initialization

Having more than one hidden layer is what defines deep learning. The Glorot, Bengio paper that you saw a few lectures ago showed how the layers further from the input layer tend to saturate and stall training.  The remaining items in the list describe the improvements that are needed to train networks with more than one hidden layer reliably.  
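
To make the weight initialization item concrete, here is a minimal NumPy sketch of Glorot (normalized) uniform initialization next to a fixed small-Gaussian initialization. The function name `glorot_uniform`, the seed, and the layer size are assumptions chosen for illustration. The range formula is the same one that appears in the lecture code's `init_weights`.

```python
import numpy as np

def glorot_uniform(n_inputs, n_outputs, rng):
    """Draw weights from U(-limit, limit) with limit = sqrt(6 / (n_inputs + n_outputs))."""
    limit = np.sqrt(6.0 / (n_inputs + n_outputs))
    return rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
w_glorot = glorot_uniform(784, 625, rng)
w_small = rng.normal(0.0, 0.01, size=(784, 625))  # fixed standard deviation of 0.01

# The variance of U(-a, a) is a^2 / 3, so the Glorot weights have
# standard deviation sqrt(2 / (n_inputs + n_outputs)), which depends on layer size.
print(w_glorot.std(), np.sqrt(2.0 / (784 + 625)))
print(w_small.std())
```

The point of the comparison is that the Glorot range adapts to the number of inputs and outputs of each layer, while the fixed standard deviation does not.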

A rectified linear unit (ReLU) is the activation function f(x) = max(0, x).  It is flat and equal to zero for negative inputs and equal to the input for positive inputs.  
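
As a quick illustration of f(x) = max(0, x), here is a NumPy sketch applied element-wise to a small array. The function name `relu` and the sample values are just for illustration.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # negative entries become 0, positive entries pass through unchanged
```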

## Q's
Can you think of any possible issues with the rectified linear unit?

The Glorot paper http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf explains the use and benefits of rectified linear units.  

Dropout is a method for preventing over-training in a neural net.  The basic idea is to randomly set some of the outputs of the input neurons and hidden neurons to zero during training.  Because no individual input is reliably present, the network cannot depend too heavily on any one of them, which reduces its tendency to overtrain.  The basic method is outlined and discussed in the Srivastava, Hinton paper.  http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
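
Here is a minimal NumPy sketch of the idea. It assumes the "inverted dropout" convention that `tf.nn.dropout` uses: kept outputs are scaled up by 1 / keep_prob during training, so nothing needs rescaling at test time. (The Srivastava, Hinton paper instead scales the weights at test time.) The function name, seed, and keep probability are illustrative assumptions.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob; scale survivors by 1 / keep_prob."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
h = np.ones((1000, 50))            # stand-in for a layer's outputs

h_train = dropout(h, 0.5, rng)     # training: about half the outputs are zeroed
h_test = dropout(h, 1.0, rng)      # testing: keep everything

print(np.unique(h_train))          # only zeros and scaled-up survivors remain
print(h_train.mean())              # close to the original mean because of the scaling
```

The scaling keeps the expected value of each output the same with and without dropout, which is why the same network can be used unchanged at test time with keep_prob set to 1.0.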

The code below implements a four-layer fully connected network for classifying MNIST digits.  As listed, it updates the weights with a hand-crafted plain gradient descent step.  RMSProp is an accelerated gradient descent method that bears some similarity to AdaDelta and AdaGrad and is a natural replacement.  Here's a link to a description of RMSProp.  http://climin.readthedocs.org/en/latest/rmsprop.html#id1
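
To give a feel for what RMSProp does, here is a NumPy sketch of one common form of the update: keep a running average of the squared gradient and divide each step by its square root. The function name and the values of `lr`, `decay`, and `eps` are assumptions for illustration, and the variant described at the link above may differ in details (for example, optional momentum).

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update; returns the new weights and the new running average."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

# Toy problem: minimize 0.5 * sum(scale * w^2), where the two coordinates
# have gradients that differ in size by a factor of 100.
scale = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
ms = np.zeros_like(w)
for _ in range(500):
    grad = scale * w
    w, ms = rmsprop_step(w, grad, ms)

print(w)  # both coordinates end up close to the minimum at 0
```

Because each step is divided by the recent gradient magnitude, the step size is roughly independent of the gradient's scale, so the steep and the shallow coordinate make similar progress.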

## Code Walk-through - group discussion

1.  Look at the function definitions to get a rough idea of what each of them does.  
2.  Walk through the main body of the code and discuss the overall structure.   
3.  Any comments on the weight initialization?   
4.  Walk through the handling of the weight and gradient variables.  Do you see any problems with the update?   
5.  Group coding exercise - Fix the problem with the subroutine that returns update equations.  
6.  Rewrite the random weight initialization to incorporate Glorot scaling.   
7.  Re-run and discuss improvement or deterioration.  

In [2]:
# 4-layer MNIST
import tensorflow as tf
import numpy as np
from mnistReader import mnist
from math import sqrt

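#build and initialize weights
def init_weights(shape, name, glorot=False):
    #shape is [n_inputs, n_outputs] for a fully connected layer
    [n_inputs, n_outputs] = shape
    if glorot:
        #Glorot scaling: uniform in +/- sqrt(6 / (n_inputs + n_outputs))
        init_range = sqrt(6.0 / (n_inputs + n_outputs))
        return tf.Variable(tf.random_uniform(shape, -init_range, init_range), name=name)
    #default: small Gaussian weights with a fixed standard deviation
    return tf.Variable(tf.random_normal(shape, stddev=0.01), name=name)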
    
def gdUpdate(W, G, lr): # not used anywhere
    for (w, g) in zip(W, G):
        w.assign(w - lr * g)
    return W[0]


#read in the data and run through training
xTrain, xTest, yTrain, yTest = mnist()


tf.reset_default_graph() 
graph = tf.Graph() 
with graph.as_default():
    X = tf.placeholder(tf.float32, shape=[None, 784])
    Y = tf.placeholder(tf.float32, shape=[None, 10])
    lr = tf.constant(0.00002, dtype=tf.float32, name='lr')

    w1 = init_weights([784, 625], 'w1')
    w2 = init_weights([625, 300], 'w2')
    w3 = init_weights((300, 10), 'w3')

    #define network
    h1 = tf.nn.relu(tf.matmul(X, w1))  #look under Neural Net -> Activation in API left column
    h2 = tf.nn.relu(tf.matmul(h1, w2))
    logits = tf.matmul(h2, w3)
    py_x = tf.nn.softmax(logits)
    y_pred = tf.argmax(py_x, dimension=1)

    #define cost
    rows_of_cost = tf.nn.softmax_cross_entropy_with_logits(logits, Y, name='rows_of_cost')
    cost = tf.reduce_mean(rows_of_cost, reduction_indices=None, keep_dims=False, name='cost')

    #start building list that you'll reference in sess.run
    udList = [cost]

    #use hand-crafted updater
    W = [w1, w2, w3]

    #calculate gradients
    grad = tf.gradients(cost, W)

    #form a list of the updates - including this in sess.run will force calculation of new weights each step
    udList = udList + [w.assign(w - lr * g) for (w, g) in zip(W, grad)]

    #use tf.optimizer by uncommenting the following two lines (and modifying where necessary)
    #optimizer = tf.train.GradientDescentOptimizer(lr)
    #train = optimizer.minimize(cost)

    #output for tensorboard
    summary1 = tf.scalar_summary("Cost over time", cost) 
    summary2 = tf.histogram_summary('Weight w1 over time', w1)
    summary3 = tf.histogram_summary('Weight w2 over time', w2)
    summary4 = tf.histogram_summary('Weight w3 over time', w3)
    merged = tf.merge_summary([summary1, summary2, summary3, summary4]) 

    #add tensorboard output to sess.run list
    udList.append(merged)

with tf.Session(graph=graph) as sess:
    result = sess.run(tf.initialize_all_variables())
    writer = tf.train.SummaryWriter('logs/',graph=sess.graph)
    miniBatchSize = 40
    startEnd = zip(range(0, len(xTrain), miniBatchSize), range(miniBatchSize, len(xTrain) + 1, miniBatchSize))
    costList = []
    nPasses = 30
    iteration = 0
    for iPass in range(nPasses):
        for (s, e) in startEnd:
            [costVal, update1, update2, update3, tbSummary] = sess.run(udList, feed_dict={X: xTrain[s:e,], Y: yTrain[s:e]})
            
            writer.add_summary(tbSummary, iteration)
            iteration += 1
            costList.append(costVal)
        if iPass % 5 == 0: 
            testResult = sess.run([y_pred], feed_dict={X:xTest})
            print iPass, np.mean(np.argmax(yTest, axis=1) == testResult)

0 0.1289
5 0.1484
10 0.1644
15 0.1783
20 0.1924
25 0.2054


### If you accidentally stop the neural net or want to continue to train the neural net without starting over:
Replace  
w1 = init_weights([784, 625], 'w1')  
w2 = init_weights([625, 300], 'w2')  
w3 = init_weights((300, 10), 'w3')  
with  
w1 = tf.Variable(update1)  
w2 = tf.Variable(update2)  
w3 = tf.Variable(update3)  
where update1, update2 and update3 are the weight matrices returned by sess.run. You can save them in a .pkl file and restart training wherever you left off.
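
Saving the weights to a .pkl file could look something like the sketch below. The arrays here are random stand-ins for the values returned by sess.run (update1, update2, update3 in the training loop). The file name `weights.pkl`, the temporary directory, and the dictionary keys are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins for the weight arrays returned by sess.run (hypothetical values).
rng = np.random.RandomState(0)
update1 = rng.normal(size=(784, 625)).astype(np.float32)
update2 = rng.normal(size=(625, 300)).astype(np.float32)
update3 = rng.normal(size=(300, 10)).astype(np.float32)

# Hypothetical location; use any path you like.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'weights.pkl')

# Save the current weights.
with open(checkpoint_path, 'wb') as f:
    pickle.dump({'w1': update1, 'w2': update2, 'w3': update3}, f)

# Later: load them back before rebuilding the graph.
with open(checkpoint_path, 'rb') as f:
    saved = pickle.load(f)

# saved['w1'], saved['w2'], saved['w3'] can now be passed to tf.Variable(...)
print(saved['w1'].shape, saved['w2'].shape, saved['w3'].shape)
```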

## Q's
This network performs at about the same level as the 3-layer network that you saw in the last lecture.  Could the problem be overfitting?  What can you look at to determine if that's true?  What further tests can you run to determine if the problem is overfitting?  What can you do to improve the performance of this classifier?  

## In-class coding exercises
Replace plain vanilla gradient descent with an accelerated gradient method - momentum, AdaDelta, NAG (Nesterov accelerated gradient) or AdaGrad. Momentum is the easiest.   
Hint: if you need the shape of a TensorFlow variable, you can use tensorflow_variable.get_shape().  
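
For reference, the math of classical momentum can be sketched in NumPy as follows. This is not the TensorFlow solution to the exercise, only the update rule. The function name and the values of `lr` and `mu` are assumptions for illustration. Note that the velocity has the same shape as the weights, which is where the shape hint comes in.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """One classical momentum update; returns the new weights and the new velocity."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Toy problem: minimize 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)  # one velocity entry per weight
for _ in range(300):
    grad = w
    w, v = momentum_step(w, grad, v)

print(w)  # close to the minimum at 0
```
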
<br>
Add dropout for regularization - What meta parameters does dropout add?  
Hint: just add 1 placeholder for the fraction of neurons you want to keep and add 1 line using tf.nn.dropout(layer_here, percent_you_want_to_keep). In sess.run, fill in the fraction of kept neurons in feed_dict. Do this for both training and testing. In testing, you want to keep all the neurons (i.e. 1.0).    
<br>
Also try an L2 weight penalty instead of dropout for regularization.  What meta parameters does the weight penalty add?  
Hint: just add 1 placeholder for the regularization factor and modify the cost function. In sess.run, fill in the regularization placeholder in the feed_dict.  
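
The modified cost can be sketched in NumPy like this: add the regularization factor times the sum of squared weights to the data cost. The function names and the sample numbers are assumptions for illustration. In TensorFlow, `tf.nn.l2_loss(w)` computes sum(w ** 2) / 2, so the factor of 2 below would be absorbed there.

```python
import numpy as np

def l2_penalized_cost(data_cost, weights, reg):
    """Data cost plus reg times the sum of squared entries of every weight matrix."""
    penalty = sum(np.sum(w ** 2) for w in weights)
    return data_cost + reg * penalty

def l2_penalty_grads(weights, reg):
    """Extra gradient contributed by the penalty, one array per weight matrix."""
    return [2.0 * reg * w for w in weights]

weights = [np.array([[1.0, -2.0]]), np.array([[3.0]])]  # toy weight matrices
cost = l2_penalized_cost(0.5, weights, reg=0.01)
grads = l2_penalty_grads(weights, reg=0.01)

print(cost)   # data cost 0.5 plus 0.01 * (1 + 4 + 9)
print(grads)
```

The penalty's gradient pulls every weight toward zero in proportion to its size, which is why this is also called weight decay.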

## Batch Normalization
http://arxiv.org/pdf/1502.03167v3.pdf  
https://www.tensorflow.org/versions/r0.10/api_docs/python/contrib.layers.html#batch_norm  

Insert batch normalization layers into the code below and compare the performance to the neural net above, which doesn't include batch normalization.  
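
Before touching the TensorFlow code, it may help to see what a batch normalization layer computes. The NumPy sketch below covers the training-time forward pass only: normalize each feature over the mini-batch, then scale by gamma and shift by beta. The function name, `eps` value, and input values are assumptions for illustration. At test time, running averages of the mean and variance are used instead of the batch statistics.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch norm: normalize each column over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.RandomState(0)
# Stand-in for a layer's pre-activations: 40 rows (the mini-batch size used in the code), 4 units.
pre_activations = rng.normal(loc=5.0, scale=3.0, size=(40, 4))

out = batch_norm_train(pre_activations, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # approximately 0 for every unit
print(out.std(axis=0))   # approximately 1 for every unit
```

In the batch normalization paper the transform is applied to the output of the matrix multiply, immediately before the nonlinearity.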

In [None]:
# 4-layer MNIST--completely copied from above. Please add batch norm layers. See link for code.
import tensorflow as tf
import numpy as np
from mnistReader import mnist
from math import sqrt

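#build and initialize weights
def init_weights(shape, name, glorot=False):
    #shape is [n_inputs, n_outputs] for a fully connected layer
    [n_inputs, n_outputs] = shape
    if glorot:
        #Glorot scaling: uniform in +/- sqrt(6 / (n_inputs + n_outputs))
        init_range = sqrt(6.0 / (n_inputs + n_outputs))
        return tf.Variable(tf.random_uniform(shape, -init_range, init_range), name=name)
    #default: small Gaussian weights with a fixed standard deviation
    return tf.Variable(tf.random_normal(shape, stddev=0.01), name=name)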
    
def gdUpdate(W, G, lr): # not used anywhere
    for (w, g) in zip(W, G):
        w.assign(w - lr * g)
    return W[0]


#read in the data and run through training
xTrain, xTest, yTrain, yTest = mnist()


tf.reset_default_graph() 
graph = tf.Graph() 
with graph.as_default():
    X = tf.placeholder(tf.float32, shape=[None, 784])
    Y = tf.placeholder(tf.float32, shape=[None, 10])
    lr = tf.constant(0.00002, dtype=tf.float32, name='lr')

    w1 = init_weights([784, 625], 'w1')
    w2 = init_weights([625, 300], 'w2')
    w3 = init_weights((300, 10), 'w3')

    #define network
    h1 = tf.nn.relu(tf.matmul(X, w1))  #look under Neural Net -> Activation in API left column
    
    h2 = tf.nn.relu(tf.matmul(h1, w2))
    logits = tf.matmul(h2, w3)
    py_x = tf.nn.softmax(logits)
    y_pred = tf.argmax(py_x, dimension=1)

    #define cost
    rows_of_cost = tf.nn.softmax_cross_entropy_with_logits(logits, Y, name='rows_of_cost')
    cost = tf.reduce_mean(rows_of_cost, reduction_indices=None, keep_dims=False, name='cost')

    #start building list that you'll reference in sess.run
    udList = [cost]

    #use hand-crafted updater
    W = [w1, w2, w3]

    #calculate gradients
    grad = tf.gradients(cost, W)

    #form a list of the updates - including this in sess.run will force calculation of new weights each step
    udList = udList + [w.assign(w - lr * g) for (w, g) in zip(W, grad)]

    #use tf.optimizer by uncommenting the following two lines (and modifying where necessary)
    #optimizer = tf.train.GradientDescentOptimizer(lr)
    #train = optimizer.minimize(cost)

    #output for tensorboard
    summary1 = tf.scalar_summary("Cost over time", cost) 
    summary2 = tf.histogram_summary('Weight w1 over time', w1)
    summary3 = tf.histogram_summary('Weight w2 over time', w2)
    summary4 = tf.histogram_summary('Weight w3 over time', w3)
    merged = tf.merge_summary([summary1, summary2, summary3, summary4]) 

    #add tensorboard output to sess.run list
    udList.append(merged)

with tf.Session(graph=graph) as sess:
    result = sess.run(tf.initialize_all_variables())
    writer = tf.train.SummaryWriter('logs/',graph=sess.graph)
    miniBatchSize = 40
    startEnd = zip(range(0, len(xTrain), miniBatchSize), range(miniBatchSize, len(xTrain) + 1, miniBatchSize))
    costList = []
    nPasses = 30
    iteration = 0
    for iPass in range(nPasses):
        for (s, e) in startEnd:
            [costVal, update1, update2, update3, tbSummary] = sess.run(udList, feed_dict={X: xTrain[s:e,], Y: yTrain[s:e]})
            
            writer.add_summary(tbSummary, iteration)
            iteration += 1
            costList.append(costVal)
        if iPass % 5 == 0: 
            testResult = sess.run([y_pred], feed_dict={X:xTest})
            print iPass, np.mean(np.argmax(yTest, axis=1) == testResult)

## Homework Exercise
Build a 4-layer network for classifying CIFAR images.  Use 10k training examples (as in the last lecture) to shorten the training time.  
Hint: You can either copy code from last lecture's homework or from this lecture's code above. There are lots of things to change. Good luck!  