#Lecture 4.2 - LeNet, Convolutional Neural Net (CNN)

##Learning Objectives
1.  What is a convolutional layer, why is it used, and on what types of problems?
2.  What are typical convolutional neural net (CNN) architectures?
3.  How are convolutional layers sized?
4.  What choices need to be made in designing a CNN?

##Topics for today
1.  Intro CNN
2.  Basic elements of CNN and their function
3.  Layer sizing calculations
4.  Tradeoffs between convolutional receptive field and network depth
5.  Architectural choices and trends


##Pre-reading materials

http://cs231n.github.io/convolutional-networks/ - Andrej Karpathy - Intro to CNN.  Go over this before class.  Lecture will discuss several sections.

http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf - Paper describing the winning 2012 ImageNet entry


##Convolutional Neural Nets

You saw in an earlier lecture that convolution transforms an image into a new image by applying a mapping to groups of adjacent pixels.  The Wikipedia page gives a one-dimensional example https://en.wikipedia.org/wiki/Convolution and you'll see some more examples later in this discussion.  The attraction of a convolutional operator is that it needs far fewer weights than a fully connected layer, because the same small set of weights is reused at every position in the input, and so it is less prone to overfitting.  Fewer weights also make a convolutional layer easier to train.  For its success, a convolutional layer depends on local features being useful across the whole input array.  For example, detecting edges is likely to be useful across all of an image.  That's a property of images that may not hold for other types of data.
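As a rough illustration of this weight sharing, here is a minimal NumPy sketch (it is not the Theano code used later in this lecture). The helper `conv2d_valid`, the 6x6 toy image, and the vertical-edge kernel are all made up for this example. The point is only that the same nine weights respond to an edge wherever it appears in the image.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution: flip the kernel, then slide it over the image."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are applied to every 3x3 patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# Toy 6x6 image (assumption for illustration): dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hypothetical 3x3 vertical-edge kernel: 9 weights in total.
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

response = conv2d_valid(image, kernel)
print(response.shape)  # (4, 4)
print(response)        # every row holds the values 0, 3, 3, 0
print(kernel.size)     # 9 shared weights, versus 36 inputs per fully connected neuron
```

The nonzero responses sit only in the columns whose 3x3 window straddles the edge, and every row gives the same answer because the kernel is reused unchanged at each position.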

##Q's
1  It is useful to visualize the input to a convolutional layer as a volume, with two of its dimensions corresponding to the two dimensions of the input image.  What is the other dimension of the volume, for the input layer and for subsequent layers?

2  The MNIST data set contains 28x28 pixel images.  How many weights would be required for a neuron in the first hidden layer that's fully connected to the input layer?  How many would be required for a 3x3 convolution?  Roughly how many features would the 3x3 convolution generate?
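When working through weight counts like these, it can help to write the bookkeeping down explicitly. The sketch below deliberately uses a hypothetical 32x32 input of depth 3 rather than MNIST, so the question above is still left as an exercise. The function names are invented for illustration, and bias terms are ignored as a simplifying assumption.

```python
def dense_neuron_weights(height, width, depth):
    """Weights for ONE neuron connected to every value in the input volume."""
    return height * width * depth

def conv_filter_weights(filter_size, depth):
    """Weights for ONE square filter that spans the full depth of the input."""
    return filter_size * filter_size * depth

def conv_output_positions(height, width, filter_size):
    """Positions one filter visits with stride 1 and no padding ('valid')."""
    return (height - filter_size + 1) * (width - filter_size + 1)

# Hypothetical input volume (not MNIST): 32 x 32 with depth 3.
h, w, d = 32, 32, 3

print(dense_neuron_weights(h, w, d))   # 3072 weights per fully connected neuron
print(conv_filter_weights(3, d))       # 27 weights per 3x3 filter
print(conv_output_positions(h, w, 3))  # 900 outputs produced by that one filter
```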


##In-class exercises

1 The two code blocks immediately below provide a setup for you to experiment with the Theano nnet.conv.conv2d function and the signal.downsample.max_pool_2d function.  Run a couple of test examples through each of these and confirm that the input and output dimensions satisfy the formula in Karpathy's writeup and the formula in the Theano documentation for the function.

2  The third code block sets up a CNN for MNIST.  On a layer-by-layer basis, the shapes of the layer outputs are:
(128, 32, 15, 15) (128, 64, 7, 7) (128, 1152) (128, 625).  At the end of the code for the MNIST CNN, you can see the commented-out print statement that generated these shape tuples for each of the layer outputs.  Run through the calculations to confirm these dimensions, starting from the layer input, the settings in the conv2d and max_pool_2d functions, and the dimensions used to initialize the layer-by-layer weights.  Use both the calculations and the code snippets to confirm that you get the same result.

(128, 32, 15, 15) (128, 64, 7, 7) (128, 1152) (128, 625) - printed from CNN code

In [8]:
__author__ = 'mike.bowles'
import theano
from theano import tensor as T
import numpy as np
from theano.tensor.signal.downsample import max_pool_2d
from theano.tensor.nnet.conv import conv2d

X = theano.shared(np.array(range(784), dtype=theano.config.floatX).reshape((1,1,28,28)))
w = theano.shared(np.array([[[[0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]]]], dtype=theano.config.floatX))
convOut = conv2d(X, w, border_mode='valid')  #border_mode={'valid', 'full'}


convOutTest = convOut.eval()
print convOutTest
print convOutTest.shape

[[[[  29.   30.   31.   32.   33.   34.   35.   36.   37.   38.   39.   40.
      41.   42.   43.   44.   45.   46.   47.   48.   49.   50.   51.   52.
      53.   54.]
   [  57.   58.   59.   60.   61.   62.   63.   64.   65.   66.   67.   68.
      69.   70.   71.   72.   73.   74.   75.   76.   77.   78.   79.   80.
      81.   82.]
   [  85.   86.   87.   88.   89.   90.   91.   92.   93.   94.   95.   96.
      97.   98.   99.  100.  101.  102.  103.  104.  105.  106.  107.  108.
     109.  110.]
   [ 113.  114.  115.  116.  117.  118.  119.  120.  121.  122.  123.  124.
     125.  126.  127.  128.  129.  130.  131.  132.  133.  134.  135.  136.
     137.  138.]
   [ 141.  142.  143.  144.  145.  146.  147.  148.  149.  150.  151.  152.
     153.  154.  155.  156.  157.  158.  159.  160.  161.  162.  163.  164.
     165.  166.]
   [ 169.  170.  171.  172.  173.  174.  175.  176.  177.  178.  179.  180.
     181.  182.  183.  184.  185.  186.  187.  188.  189.  190.  191.  192.
     193.  194.]
   ...

In [7]:
imageTmp = T.matrix("imageTemp")
testOut = T.signal.downsample.max_pool_2d(imageTmp,
                                          ds=(2,2),
                                          ignore_border=False,
                                          st=None,
                                          padding=(0,0)) 

poolTest = theano.function([imageTmp], testOut)

shape = (5, 5)
testIn = np.ones(shape, dtype=theano.config.floatX)


print 'function method   ', poolTest(testIn)
print 'shape    ', poolTest(testIn).shape
print 'eval method    ', testOut.eval({imageTmp: testIn})

testIn2 = np.array([[1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0],
                    [2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0],
                    [3.0, 4.0, 5.0, 6.0, 5.0, 4.0, 3.0],
                    [2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0],
                    [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]], dtype=theano.config.floatX)
print 'location of max  ', poolTest(testIn2)

print testOut.eval({imageTmp: testIn2})

shape     (3, 3)


##Q's
1  Use the sample code above to verify the calculations of input-output sizing and weight dimensions in the code below.

In [None]:
import theano
from theano import tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
import numpy as np
from theano.tensor.nnet.conv import conv2d
from theano.tensor.signal.downsample import max_pool_2d

srng = RandomStreams()

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def rectify(X):
    return T.maximum(X, 0.)
    #return T.maximum(X, 0.01*X)  #leaky rectifier

def softmax(X):
    e_x = T.exp(X - X.max(axis=1).dimshuffle(0, 'x'))
    return e_x / e_x.sum(axis=1).dimshuffle(0, 'x')

def dropout(X, p=0.0):
    if p > 0:
        retain_prob = 1 - p
        X *= srng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
        X /= retain_prob
    return X

def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates

def model(X, w, w2, w3, w4, p_drop_conv, p_drop_hidden):
    l1a = rectify(conv2d(X, w, border_mode='full'))
    l1 = max_pool_2d(l1a, (2, 2))
    l1 = dropout(l1, p_drop_conv)

    l2a = rectify(conv2d(l1, w2))
    l2 = max_pool_2d(l2a, (2, 2))
    l2 = dropout(l2, p_drop_conv)

    l3a = rectify(conv2d(l2, w3))
    l3b = max_pool_2d(l3a, (2, 2))
    l3 = T.flatten(l3b, outdim=2)
    l3 = dropout(l3, p_drop_conv)

    l4 = rectify(T.dot(l3, w4))
    l4 = dropout(l4, p_drop_hidden)

    pyx = softmax(T.dot(l4, w_o))
    return l1, l2, l3, l4, pyx



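# xTrain, yTrain, xTest, and yTest are assumed to be loaded already:
# MNIST images (784 pixels each) and one-hot encoded labels.
xTrain = xTrain.reshape(-1, 1, 28, 28)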
xTest = xTest.reshape(-1, 1, 28, 28)

X = T.ftensor4()
Y = T.fmatrix()

w = init_weights((32, 1, 3, 3))
w2 = init_weights((64, 32, 3, 3))
w3 = init_weights((128, 64, 3, 3))
w4 = init_weights((128 * 3 * 3, 625))
w_o = init_weights((625, 10))

noise_l1, noise_l2, noise_l3, noise_l4, noise_py_x = model(X, w, w2, w3, w4, 0.2, 0.5)
l1, l2, l3, l4, py_x = model(X, w, w2, w3, w4, 0., 0.)
y_x = T.argmax(py_x, axis=1)


cost = T.mean(T.nnet.categorical_crossentropy(noise_py_x, Y))
params = [w, w2, w3, w4, w_o]
updates = RMSprop(cost, params, lr=0.001)

train = theano.function(inputs=[X, Y], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_x, allow_input_downcast=True)

for i in range(100):
    for start, end in zip(range(0, len(xTrain), 128), range(128, len(xTrain), 128)):
        cost = train(xTrain[start:end], yTrain[start:end])
        #a, b, c, d, e = model(floatX(xTrain[start:end]), w, w2, w3, w4, 0., 0.)
        #print a.eval().shape, b.eval().shape, c.eval().shape, d.eval().shape
    print np.mean(np.argmax(yTest, axis=1) == predict(xTest))

Results - These results were generated in an off-line run.  They represent 20 passes through the MNIST data and took approximately 18 hours of computing time on a single CPU.

0.9332 0.9744 0.9834 0.9828 0.9891 0.9869 0.9911 0.9914 0.992 0.9928 0.9935 0.9923 0.9937 0.9931 0.9934 0.9938 0.9942 0.9938 0.9935 0.9945 0.9934 0.9937 0.9944 0.9938

##Homework
1  Karpathy's notes on convolutional networks suggest that performance can be improved by removing some of the max pooling layers - particularly the ones early in the network.  Test this suggestion by removing the first pooling layer from the network above and re-sizing the convolutional layers to adjust for this removal.  Run the network to demonstrate that your rearrangement works and to get a feel for the resulting improvement or deterioration in performance.
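
One way to plan the re-sizing is to trace the spatial dimensions layer by layer before running anything. The sketch below is a plain-Python helper, not part of Theano, and its function names are invented here. It assumes stride-1 convolutions, where `'valid'` gives W - F + 1 and `'full'` gives W + F - 1 (Karpathy's (W - F + 2P)/S + 1 with P = 0 and P = F - 1 respectively). It also assumes 2x2 pooling that keeps partial border windows (`ignore_border=False`), so the pooled size is rounded up. For the network above it reproduces the 15, 7, and 3 that appear in the printed shapes.

```python
import math

def conv_out(size, filter_size, border_mode="valid"):
    """Spatial size after a stride-1 convolution."""
    if border_mode == "valid":
        return size - filter_size + 1
    if border_mode == "full":
        return size + filter_size - 1
    raise ValueError("border_mode must be 'valid' or 'full'")

def pool_out(size, ds=2, ignore_border=False):
    """Spatial size after non-overlapping ds x ds max pooling."""
    if ignore_border:
        return size // ds                    # drop the partial window
    return int(math.ceil(size / float(ds)))  # keep the partial window

def trace(size, layers):
    """Apply ('conv', filter_size, border_mode) and ('pool', ds) steps in order."""
    sizes = []
    for layer in layers:
        if layer[0] == "conv":
            size = conv_out(size, layer[1], layer[2])
        elif layer[0] == "pool":
            size = pool_out(size, layer[1])
        else:
            raise ValueError("unknown layer type: %r" % (layer[0],))
        sizes.append(size)
    return sizes

# The MNIST network from this lecture: 28x28 input, three conv + pool stages.
mnist_layers = [("conv", 3, "full"), ("pool", 2),
                ("conv", 3, "valid"), ("pool", 2),
                ("conv", 3, "valid"), ("pool", 2)]

sizes = trace(28, mnist_layers)
print(sizes)                 # [30, 15, 13, 7, 5, 3]
print(128 * sizes[-1] ** 2)  # 1152, the first dimension of w4
```

Editing `mnist_layers` (for example, deleting a `("pool", 2)` entry) shows how the downstream sizes, and therefore the shape needed for `w4`, would change.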