# Day 6: Deep Learning

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import sys
sys.path.append('/Users/davidbuchacaprats/Documents/lxmls2015/lxmls-toolkit_fork')
import lxmls
import scipy
#path_inside_lxmls_toolkit_student = "/Users/davidbuchacaprats/Dropbox/lxmls2015/lxmls-toolkit-student"

## Exercise 6.1

Get familiar with the multi-layer perceptron (MLP) class in Numpy and see that, for a single layer, it is simply a log-linear model. Revisit the sentiment classification exercise from day one and reformulate the train and test data in a way suitable for today's exercises.

In [3]:
import numpy as np
import lxmls.readers.sentiment_reader as srs
scr = srs.SentimentCorpus("books")
train_x = scr.train_X.T
train_y = scr.train_y[:, 0]
test_x = scr.test_X.T
test_y = scr.test_y[:, 0]

2000
1600


In [4]:
test_x

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [5]:
train_x.shape, train_x[0], train_x[0].sum()

((13989, 1600), array([ 0.,  0.,  0., ...,  0.,  0.,  0.]), 5.0)

Load the MLP and SGD code and create a single-layer model by specifying the number of inputs, the number of outputs and the type of layer. Note that the number of inputs equals the number of features, and the number of outputs equals the number of classes (2).

In [6]:
# Define MLP (log linear)
import lxmls.deep_learning.mlp as dl 
import lxmls.deep_learning.sgd as sgd
# Model parameters
geometry = [train_x.shape[0], 2] 
actvfunc = ['softmax']

# Instantiate model
mlp      = dl.NumpyMLP(geometry, actvfunc)

In [7]:

mlp.__dict__

{'actvfunc': ['softmax'],
 'n_layers': 1,
 'params': [array([[-0.05110566,  0.02022964, -0.01031658, ..., -0.04864428,
           0.02159766, -0.03177295],
         [ 0.01863954, -0.02708282,  0.07601295, ...,  0.02542927,
          -0.02739964,  0.03537327]]), array([[ 0.],
         [ 0.]])]}
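
To make the log-linear connection concrete, here is a minimal NumPy sketch of what a single softmax layer computes. It assumes the same layout as above (one column per example, weights of shape classes x features, bias of shape classes x 1). The toy sizes, the random data and the helper `softmax_columns` are illustrative assumptions, not the toolkit's code.

```python
import numpy as np

def softmax_columns(z):
    """Softmax over the class axis, applied to each column (example)."""
    z = z - z.max(axis=0, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.RandomState(0)
n_features, n_classes, n_examples = 6, 2, 4       # toy sizes (assumed)
W = 0.1 * rng.randn(n_classes, n_features)        # like mlp.params[0]
b = np.zeros((n_classes, 1))                      # like mlp.params[1]
x = rng.rand(n_features, n_examples)              # one column per example

# Linear scores followed by a softmax: the posterior of a log-linear model
p = softmax_columns(W.dot(x) + b)
```

Each column of `p` is a distribution over the two classes, with scores that are linear in the features. This is what makes the single-layer case a log-linear model.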

Put a breakpoint inside ```lxmls/deep_learning/mlp.py``` and debug step by step. Identify the forward pass in Eq. 6.12 and the computation of the gradients (to be completed in the next exercise).

In [8]:
# Play with the untrained MLP forward
hat_train_y = mlp.forward(train_x)
hat_test_y = mlp.forward(test_x)

# Compute accuracy
acc_train = sgd.class_acc(hat_train_y, train_y)[0]
acc_test = sgd.class_acc(hat_test_y, test_y)[0]
print "Untrained Log-linear Accuracy train: %f test: %f"%(acc_train,acc_test)

Untrained Log-linear Accuracy train: 0.518750 test: 0.542500
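
Both accuracies are close to 0.5, which is roughly what random weights give on a two-class problem. As an illustration of how such an accuracy can be computed from the network output, here is a sketch with a hypothetical helper `accuracy`. It assumes one column of class probabilities per example and is not the toolkit's `sgd.class_acc`.

```python
import numpy as np

def accuracy(probs, labels):
    """Fraction of columns whose most probable class equals the label."""
    predicted = np.argmax(probs, axis=0)    # winning class per example
    return float(np.mean(predicted == labels))

# Toy posteriors for 4 examples (columns) and 2 classes (rows)
probs = np.array([[0.7, 0.4, 0.2, 0.6],
                  [0.3, 0.6, 0.8, 0.4]])
labels = np.array([0, 1, 0, 0])

acc = accuracy(probs, labels)   # predictions are [0, 1, 1, 0], so 3 of 4 match
```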


## Exercise 6.2

Go to ```lxmls/deep_learning/mlp.py```, find the class NumpyMLP and check grads(). Complete the code of the NumpyMLP class with the backpropagation recursion that we just saw. Once you are done, try different network geometries by increasing the number of layers and the layer sizes, e.g.

In [9]:
geometry = [train_x.shape[0], 20, 2]
actvfunc = ['sigmoid', 'softmax']

# Instantiate model
mlp      = dl.NumpyMLP(geometry, actvfunc)
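
For reference, the backpropagation recursion for this particular geometry (one sigmoid layer followed by one softmax layer) can be sketched in plain NumPy as follows. This is a hedged illustration, not the toolkit's `grads()`: it assumes the cost is the mean negative log-probability of the correct class, it is hard-coded for two layers, and all function names and toy sizes are made up. A finite-difference check on one weight is included to confirm the recursion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_columns(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cost(params, x, y):
    """Mean negative log-probability of the correct class."""
    W1, b1, W2, b2 = params
    p = softmax_columns(W2.dot(sigmoid(W1.dot(x) + b1)) + b2)
    return -np.mean(np.log(p[y, np.arange(y.shape[0])]))

def grads(params, x, y):
    """Backpropagation for a sigmoid layer followed by a softmax layer."""
    W1, b1, W2, b2 = params
    M = y.shape[0]
    h = sigmoid(W1.dot(x) + b1)                 # hidden activations
    p = softmax_columns(W2.dot(h) + b2)         # class posteriors
    delta2 = p.copy()
    delta2[y, np.arange(M)] -= 1.0              # error at the output layer
    delta2 /= M                                 # the cost is a mean over examples
    delta1 = W2.T.dot(delta2) * h * (1.0 - h)   # error propagated backwards
    return [delta1.dot(x.T), delta1.sum(axis=1, keepdims=True),
            delta2.dot(h.T), delta2.sum(axis=1, keepdims=True)]

rng = np.random.RandomState(0)
x = rng.rand(5, 8)                              # 5 features, 8 examples
y = rng.randint(0, 2, size=8)                   # binary labels
params = [0.1 * rng.randn(3, 5), np.zeros((3, 1)),
          0.1 * rng.randn(2, 3), np.zeros((2, 1))]

# Finite-difference check on a single entry of W1
eps = 1e-6
analytic = grads(params, x, y)[0][1, 2]
params[0][1, 2] += eps
f_plus = cost(params, x, y)
params[0][1, 2] -= 2 * eps
f_minus = cost(params, x, y)
params[0][1, 2] += eps                          # restore the original weight
numeric = (f_plus - f_minus) / (2 * eps)
```

If the recursion is right, `analytic` and `numeric` agree to several decimal places. The same check is a handy way to debug your own implementation.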

You can test the different models with the same sentiment analysis problem as in Exercise 6.1.

In [10]:
n_iter = 2
bsize  = 5
lrate  = 0.01


# Train
sgd.SGD_train(mlp, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
acc_train = sgd.class_acc(mlp.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp.forward(test_x), test_y)[0]
print "MLP (%s) Amazon Sentiment Accuracy train: %f test: %f" % (geometry, acc_train, acc_test)

Batch 321/321 (100%)   Epoch  1/ 2 in 5.45 seg
Batch 321/321 (100%)   Epoch  2/ 2 in 5.40 seg

MLP ([13989, 20, 2]) Amazon Sentiment Accuracy train: 0.954375 test: 0.747500


Notice that if the batch size is increased, the code can take advantage of the parallelization of the vectorized functions built into Numpy. Try bsize=256; the run below uses bsize=128.

In [34]:
n_iter = 6
bsize  = 128
lrate  = 0.01


# Train
sgd.SGD_train(mlp, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y))
acc_train = sgd.class_acc(mlp.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp.forward(test_x), test_y)[0]
print "MLP (%s) Amazon Sentiment Accuracy train: %f test: %f" % (geometry, acc_train, acc_test)

Batch 13/13 (100%)   Epoch  1/ 6 in 3.07 seg
Batch 13/13 (100%)   Epoch  2/ 6 in 2.84 seg
Batch 13/13 (100%)   Epoch  3/ 6 in 2.79 seg
Batch 13/13 (100%)   Epoch  4/ 6 in 2.80 seg
Batch 13/13 (100%)   Epoch  5/ 6 in 3.03 seg
Batch 13/13 (100%)   Epoch  6/ 6 in 3.19 seg

MLP ([13989, 20, 2]) Amazon Sentiment Accuracy train: 1.000000 test: 0.825000
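
The gain from larger batches comes from replacing many small matrix-vector products with a single matrix-matrix product. A small sketch, with assumed layer sizes, showing that both routes compute the same values:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(20, 50)         # one layer: 50 inputs, 20 outputs (toy sizes)
batch = rng.rand(50, 128)     # 128 examples stored as columns

# One example at a time: 128 separate matrix-vector products
one_by_one = np.column_stack([W.dot(batch[:, m])
                              for m in range(batch.shape[1])])

# Whole batch at once: a single matrix-matrix product
all_at_once = W.dot(batch)
```

The two results are identical; only the number of calls into the numerical routines changes, which is where the time per epoch is saved.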


## Exercise 6.3

Get familiar with Theano. Learn the difference between a symbolic representation and a function. Start by implementing the first layer of our previous MLP in Numpy.

In [23]:
test_x

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [28]:
x = test_x 

In [30]:
# Numpy code
x = test_x 
W1, b1 = mlp.params[:2]       # Weights and bias of the first layer
z1 = np.dot(W1,x) + b1        # Linear transformation
tilde_z1 = 1/(1+np.exp(-z1))  # Non-linear transformation

Now we will implement this in Theano. We start by creating the variables over which we will produce the operations. For example the symbolic input is defined as

In [31]:
# Theano code.
# NOTE: We use an underscore to denote symbolic equivalents of Numpy variables.
# This is not a Python convention!
import theano
import theano.tensor as T
_x = T.matrix('x')

Note that this variable does not have any particular value, nor a space reserved in memory for it. It contains just a symbolic definition of what the variable can contain. The particular values will be given when we use it to compile a function.

We could actually use the same definition format to define the weights and give their particular values as inputs to the compiled function. However, since we will be using a more complicated format in later exercises, we will use it here as well. The shared class lets us define variables that are shared across functions. They are also given a concrete value, so that we do not need to provide it on each function call. This format is therefore ideal for the weights of our network.

In [33]:
_W1 = theano.shared(value=W1, name='W1', borrow=True)
_b1 = theano.shared(value=b1, name='b1', borrow=True, broadcastable=(False, True))

Now let's describe the operations we want to perform on the variables, again only symbolically. This is done by replacing our usual operations with Theano symbolic ones when necessary, e.g. the inner product dot() or the sigmoid. Some operations, such as +, are automatically recognized by Theano (operator overloading).

In [35]:
_z1            = T.dot(_W1, _x) + _b1
_tilde_z1      = T.nnet.sigmoid(_z1)
# Keep in mind that naming variables is useful when debugging
_z1.name       = 'z1'
_tilde_z1.name = 'tilde_z1'

When debugging the code it is often useful to print the graph of computations.

In [36]:
theano.printing.debugprint(_tilde_z1)

sigmoid [@A] 'tilde_z1'   
 |Elemwise{add,no_inplace} [@B] 'z1'   
   |dot [@C] ''   
   | |W1 [@D]
   | |x [@E]
   |b1 [@F]


It is important to keep in mind that, until this point, we do not have a function we can use to produce any actual output. To obtain one, we have to compile the expression into a function by calling


In [37]:
layer1 = theano.function([_x], _tilde_z1)

Note the use of [ ] for the input variables, even if we just specify one variable. We can now do a test to compare the Numpy and Theano implementations and see that they give the same outputs.

In [38]:
# Check Numpy and Theano match
if np.allclose(tilde_z1, layer1(x.astype(theano.config.floatX))):
    print "\nNumpy and Theano Perceptrons are equivalent"
else:
    raise ValueError, "Numpy and Theano Perceptrons are different"


Numpy and Theano Perceptrons are equivalent


## Exercise 6.4

Complete the method forward() inside the ```lxmls/deep_learning/mlp.py``` class TheanoMLP. Note that this is called only once, at the initialization of the class. To debug your implementation, put a breakpoint at the ```__init__``` function call. Hint: Note that this is very similar to ```NumpyMLP.forward()```. You just need to keep track of the symbolic variable representing the output of the network after each layer is applied and compile the function at the end. After you are finished, instantiate a Theano class and check that the Numpy and Theano forward passes are the same.

In [45]:

mlp_a = dl.NumpyMLP(geometry, actvfunc)
mlp_b = dl.TheanoMLP(geometry, actvfunc)

## Exercise 6.5

We first see an example that does not use any of the code in TheanoMLP but rather continues from what you wrote in Exercise 6.3. In that exercise you completed a sigmoid layer with Theano. To get some values for the weights, we used the first layer of the network you trained in 6.2. Now we are going to use the second layer as well. This assumes that your network in 6.2 has only two layers, e.g. the recommended geometry (I, 20, 2). Make sure this is the case before starting this exercise.

For the sake of clarity, let's write here the part of Ex. 6.3 that we had completed

In [46]:
# Get the values from our MLP from Ex 6.2
W1, b1 = mlp.params[:2] # Weights and bias of first layer
# First layer symbolic variables
_x = T.matrix('x')
_W1 = theano.shared(value=W1, name='W1', borrow=True)
_b1 = theano.shared(value=b1, name='b1', borrow=True, broadcastable=(False, True))
# First layer symbolic expressions
_z1 = T.dot(_W1, _x) + _b1
_tilde_z1 = T.nnet.sigmoid(_z1)

Now we just need to complete this with the second layer, using a softmax non-linearity

In [47]:
W2, b2 = mlp.params[2:] # Weights and bias of second (and last!) layer
# Second layer symbolic variables
_W2 = theano.shared(value=W2, name='W2', borrow=True)
_b2 = theano.shared(value=b2, name='b2', borrow=True, broadcastable=(False, True))
# Second layer symbolic expressions
_z2       = T.dot(_W2, _tilde_z1) + _b2
_tilde_z2 = T.nnet.softmax(_z2.T).T


With this, we could compile a function to obtain the output of the network ```_tilde_z2``` for a given input ```_x```. In this exercise, however, we are interested in obtaining the misclassification cost. This is given in Eq. 6.6. First we are going to need the symbolic variable for the correct output

In [48]:
_y = T.ivector('y')

The posterior probability of the correct class given the input is the k(m)-th softmax output, where k(m) is the index of the correct class for x_m. The cost is minus the logarithm of this probability. If we want to compute it for a vector y containing M different examples, we average over the examples and write this as

In [49]:
_F = -T.mean(T.log(_tilde_z2[_y, T.arange(_y.shape[0])]))
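
The indexing expression may look cryptic at first. The same selection can be tried in plain NumPy, here with made-up softmax outputs for three examples:

```python
import numpy as np

# Softmax outputs for M = 3 examples (columns) and 2 classes (rows)
tilde_z2 = np.array([[0.9, 0.2, 0.5],
                     [0.1, 0.8, 0.5]])
y = np.array([0, 1, 1])                  # correct class k(m) for each example

# Pick row y[m] in column m: the probability assigned to the correct class
correct_probs = tilde_z2[y, np.arange(y.shape[0])]   # [0.9, 0.8, 0.5]

# Mean negative log-probability, the NumPy counterpart of _F
F = -np.mean(np.log(correct_probs))
```

Pairing the index vector `y` with `np.arange(M)` selects exactly one entry per column, which is what the symbolic expression does with `_y` and `T.arange`.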

Now obtaining a function that computes the gradient could not be easier.

In [56]:
_nabla_F = T.grad(_F, _W1)
nabla_F = theano.function([_x,_y], _nabla_F)

To finish this exercise, have a look at the TheanoMLP class. As you may realise, it just implements what is shown above for the generic case of N layers.

## Exercise 6.6

Let's first understand how train/test data are handled inside the Theano computation graph. One important aspect to take into account is that both the type and the shape of the data have to match those of their corresponding graph variables. Mismatches here are the main source of errors when you are starting with Theano.

In [57]:
# Cast data into the types and shapes used in the Theano graph
train_x = train_x.astype(theano.config.floatX)
train_y = train_y.astype('int32')
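
The effect of these casts can be checked in NumPy alone. In the sketch below, `float_type` is an assumed stand-in for theano.config.floatX, and the arrays are toy data. The targets are a 2-D float array (as expected by a variable like `T.matrix`) and a 1-D int32 array (as expected by `T.ivector`).

```python
import numpy as np

float_type = 'float32'    # assumed stand-in for theano.config.floatX

x = np.array([[0, 1, 2], [3, 4, 5]])    # features stored as integers
y = np.array([0.0, 1.0, 1.0])           # labels stored as floats

x_cast = x.astype(float_type)           # 2-D float array
y_cast = y.astype('int32')              # 1-D int32 array
```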

Note the Theano type theano.config.floatX. It resolves to the float type Theano is configured to use, typically float32 on the GPU and float64 on the CPU. To use data in a Theano computation graph, we use the theano.shared variable. This will also push the data to the GPU, if one is used.

In [62]:
_train_x = theano.shared(train_x, "train_x", borrow=True)
_train_y = theano.shared(train_y, "train_y", borrow=True)

Once this is done, we can create and compile functions using these variables. One function that will be useful in the future will be one returning a batch of instances

In [64]:
_i = T.lscalar()
get_tr_batch_y  = theano.function([_i], _train_y[_i * bsize:(_i+1)*bsize])
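
The slicing logic is the same as in NumPy. A sketch with toy labels, where `get_batch_y` is a hypothetical NumPy counterpart of the compiled function:

```python
import numpy as np

bsize = 5
train_y = np.arange(12)       # 12 toy labels, so the last batch is short

def get_batch_y(i):
    """Labels of the i-th mini-batch."""
    return train_y[i * bsize:(i + 1) * bsize]

first = get_batch_y(0)        # labels 0..4
last = get_batch_y(2)         # slicing past the end simply truncates
```

One difference is worth keeping in mind: in the Theano version, bsize is a plain Python integer, so the value it has when theano.function is called is fixed in the compiled function. Changing bsize afterwards requires compiling again.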

## Exercise 6.7

The mini-batch function in the previous exercise is the key to fast batch updates. It is combined with the updates argument of theano.function. The input to this argument is a list of tuples, each pairing a parameter with its update rule. This can be compactly defined using list comprehensions.

In [69]:
mlp_c   = dl.TheanoMLP(geometry, actvfunc)
_x      = T.matrix('x')
_y      = T.ivector('y')
_F      = mlp_c._cost(_x, _y)
updates = [(par, par - lrate*T.grad(_F, par)) for par in mlp_c.params]
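
The rule encoded in each tuple is a plain gradient descent step. With made-up parameter and gradient values, the arithmetic looks like this in NumPy:

```python
import numpy as np

lrate = 0.01
params = [np.array([[1.0, 2.0]]), np.array([[0.5]])]      # toy W and b
grads = [np.array([[10.0, -10.0]]), np.array([[5.0]])]    # toy gradients

# Same rule as in `updates`: new value = old value - lrate * gradient
new_params = [par - lrate * grad for par, grad in zip(params, grads)]
```

In Theano, each pair holds a shared variable and the symbolic expression for its new value, and the assignment is carried out every time the compiled function is called.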


This can now be combined with the givens argument of theano.function, which maps the input and target to other variables; in this case, to a mini-batch of inputs and targets selected by an index.

In [71]:
_j      = T.lscalar()
givens  = { _x : _train_x[:, _j*bsize:(_j+1)*bsize],
            _y : _train_y[_j*bsize:(_j+1)*bsize] }

With updates and givens, we can now define the batch update function. This will return the cost of each batch and update the MLP parameters at the same time using updates

In [73]:
batch_up = theano.function([_j], _F, updates=updates, givens=givens)
n_batch  = train_x.shape[1]/bsize  + 1
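
Note that the expression for n_batch relies on Python 2 integer division. With 1600 training examples and bsize = 5 it gives 321, which matches the progress logs. However, if the index runs from 0 to n_batch - 1, the last one addresses columns 1600 to 1604, which do not exist, so that slice is empty. A hedged alternative is ceiling division, sketched here with hypothetical helper names:

```python
def n_batches_notebook(n_examples, bsize):
    """Batch count as computed above (integer division plus one)."""
    return n_examples // bsize + 1

def n_batches_ceil(n_examples, bsize):
    """Ceiling division: counts only the batches that contain data."""
    return (n_examples + bsize - 1) // bsize

counts_5 = (n_batches_notebook(1600, 5), n_batches_ceil(1600, 5))        # (321, 320)
counts_128 = (n_batches_notebook(1600, 128), n_batches_ceil(1600, 128))  # (13, 13)
```

The two formulas differ only when the number of examples is an exact multiple of the batch size.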

Once we have defined this, we can compare the speed and accuracy of the Numpy version, the version with Theano-compiled gradients and the compiled batch update using

In [80]:
import time
# Model
geometry = [train_x.shape[0], 20, 2] 
actvfunc = ['sigmoid', 'softmax']

# Numpy MLP
mlp_a = dl.NumpyMLP(geometry, actvfunc)
init_t = time.clock()
sgd.SGD_train(mlp_a, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y)) 
print "\nNumpy version took %2.2f sec" % (time.clock() - init_t)

acc_train = sgd.class_acc(mlp_a.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_a.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)
# Theano grads
mlp_b = dl.TheanoMLP(geometry, actvfunc)
init_t = time.clock()
sgd.SGD_train(mlp_b, n_iter, bsize=bsize, lrate=lrate, train_set=(train_x, train_y)) 
print "\nCompiled gradient version took %2.2f sec" % (time.clock() - init_t) 

acc_train = sgd.class_acc(mlp_b.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_b.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n" % (acc_train, acc_test)
# Theano batch update
init_t = time.clock()
sgd.SGD_train(mlp_c, n_iter, batch_up=batch_up, n_batch=n_batch)
print "\nTheano compiled batch update version took %2.2f" % (time.clock() - init_t) 
acc_train = sgd.class_acc(mlp_c.forward(train_x), train_y)[0]
acc_test = sgd.class_acc(mlp_c.forward(test_x), test_y)[0]
print "Amazon Sentiment Accuracy train: %f test: %f\n"%(acc_train,acc_test)

Batch 321/321 (100%)   Epoch  1/ 2 in 8.36 seg
Batch 321/321 (100%)   Epoch  2/ 2 in 8.28 seg


Numpy version took 16.64 sec
Amazon Sentiment Accuracy train: 0.954375 test: 0.747500

Batch 321/321 (100%)   Epoch  1/ 2 in 4.48 seg
Batch 321/321 (100%)   Epoch  2/ 2 in 4.52 seg


Compiled gradient version took 9.00 sec
Amazon Sentiment Accuracy train: 0.954375 test: 0.747500

Batch 321/321 (100%)   Epoch  1/ 2 in 1.16 seg
Batch 321/321 (100%)   Epoch  2/ 2 in 1.24 seg


Theano compiled batch update version took 2.40
Amazon Sentiment Accuracy train: 0.954375 test: 0.747500



As you may observe, just computing the gradients with Theano gives only part of the achievable speed-up (9.00 sec against 16.64 sec for Numpy in the run above), and depending on the setup it may not reduce the computing time at all. To maximally exploit the power of Theano, it is necessary to bundle both computations and data together using approaches like the compiled batch update, which took 2.40 sec here.

In [None]:
geometry =