#Lecture 5.1 - Autoencoders and Semi-Supervised Learning

##Learning Objectives
1.  What is semi-supervised learning?
2.  What is an autoencoder?
3.  How can an autoencoder be built for semi-supervised learning problems?

##Order of Topics
1.  Definition of semi-supervised problems and some examples
2.  Background on stacked autoencoders
3.  Code walkthrough for a 2-hidden-layer autoencoder


##Pre-class reading
https://en.wikipedia.org/wiki/Semi-supervised_learning - Background on semi-supervised learning problems

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders - Look this over.  We'll build one of these autoencoders in Theano.

##Semi-supervised learning
Sometimes labels are easy to produce along with each instance.  For example, if the problem is to predict whether a visitor to a web site will click on an ad, then the labels for a training set are whether or not past visitors clicked, and the cost of labeling the data is just the cost of storing the "yes/no" bit.  A computer does the work of attaching a label to each example.  If the problem is to determine whether a picture violates the standards of your website, then the images are relatively easy to get, but labeling them requires a human to look at every picture.  Maybe that's a job for Mechanical Turk.  If the problem is to determine the toxicity of a drug by giving the drug to rats and then inspecting their livers for signs of damage, then getting the examples takes time and money, and each label requires a reading from a pathologist.

Semi-supervised learning is a collection of techniques for incorporating both labeled and unlabeled data in order to achieve better accuracy than would be available with either supervised or unsupervised techniques alone.  You'll see a couple of methods for accomplishing this.  Today you'll see how to use some of the deep learning techniques that you've learned in the last couple of weeks.

##Autoencoder stack
The basic idea with an autoencoder stack is to train a neural network one layer at a time, without using labels.  Instead of being trained against labels, each layer is paired with a symmetric output layer, and that two-layer stack is trained to reproduce its own input.  Walking through the figures included in http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders will make the concepts clear.
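
As a compact summary, the relations below restate what the training code in this lecture computes for a single layer. The symbols $W$, $f$, $h$, and $\hat{x}$ are notation introduced here for illustration (the code calls them `w_h`, `rectify`, `h`, and `x_x`), and $x$ is assumed to be a minibatch with $m$ rows and $n$ columns.

```latex
\begin{aligned}
h &= f(xW), \qquad f(z) = \max(z,\ 0.01\,z) \\
\hat{x} &= f\!\left(hW^{\top}\right) \\
\text{cost} &= \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(x_{ij} - \hat{x}_{ij}\right)^{2}
\end{aligned}
```

Note that the code does not share a single parameter between encoder and decoder. It keeps two matrices and resets the decoder to the transpose of the encoder by averaging them after every update, so the second relation holds at the start of each training step.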

The code below demonstrates training the first layer in the stack and shows the results of training.  



In [None]:
__author__ = 'mike_bowles'
#training for 1st layer
import theano
from theano import tensor as T
import numpy as np
from math import sqrt
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from mnistReader import mnist  #same reader module as the second-layer script
from sklearn import preprocessing
import matplotlib.pyplot as plt
import itertools
from random import sample

srng = RandomStreams()


def writeFile(w, filename):
    import cPickle
    #'with' closes the file so the pickle is fully flushed to disk
    with open(filename, 'wb') as f:
        cPickle.dump(w, f)

def readFile(filename):
    import cPickle
    with open(filename, 'rb') as f:
        return cPickle.load(f)

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    (h, w) = shape
    # Glorot normalization - last factor depends on non-linearity
    # 0.25 for sigmoid and 0.1 for softmax, 1.0 for tanh or Relu
    normalizer = 2.0 * sqrt(6) / sqrt(h + w) * 1.0
    #return theano.shared(floatX(np.random.randn(*shape) * 0.01))  #code for standard initialization
    #code for using Glorot initialization
    return theano.shared(floatX((np.random.random_sample(shape) - 0.5) * normalizer))

def rectify(X):
    #return T.maximum(X, 0.)   #original rectifier
    return T.maximum(X, 0.01 * X)  #leaky rectifier


def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates

def adaGrad(cost, params, eta=0.1, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        sumGSq = theano.shared(p.get_value() * 0.)
        sumGSq_new = sumGSq + g ** 2
        gradient_scaling = T.sqrt(sumGSq_new + epsilon)
        g = g / gradient_scaling
        updates.append((sumGSq, sumGSq_new))
        updates.append((p, p - eta * g))
    return updates

def adaDelta(cost, params, eta=1.0, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        #calc g-squared
        gSq = theano.shared(p.get_value() * 0.)
        dwSq = theano.shared(p.get_value() * 0.)

        #exp smoothed g squared
        gSqNew = rho * gSq + (1 - rho) * g * g

        #calc dx-squared
        dw = eta * T.sqrt(dwSq + epsilon) * g / T.sqrt(gSq + epsilon)
        dwSqNew = rho * dwSq + (1 - rho) * dw * dw

        updates.append((dwSq, dwSqNew))
        updates.append((gSq, gSqNew))
        updates.append((p, p - dw))
    return updates

def nesterovAG(cost, params, c=1.0):
    grads = T.grad(cost, wrt=params)
    updates = []
    lam = theano.shared(np.asarray(1.0, dtype=theano.config.floatX), 'lam')
    lamSP1 = (1 + T.sqrt(1 + 4 * lam * lam)) / 2.0
    updates.append((lam, lamSP1))
    gammaS = (1 - lam) / lamSP1

    for p,g in zip(params, grads):
        #calc y and x at s + 1
        yS = theano.shared(p.get_value())  #initialize yS
        ySP1 = p - g / c
        pSP1 = (1 - gammaS) * ySP1 + gammaS * yS

        updates.append((yS, ySP1))
        updates.append((p, pSP1))
    return updates

def dropout(X, p=0.):
    if p > 0:
        retain_prob = 1 - p
        X *= srng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
        X /= retain_prob
    return X

def model(X, w_h, w_o, p_drop_input, p_drop_hidden):
    X = dropout(X, p_drop_input)
    h = rectify(T.dot(X, w_h))

    h = dropout(h, p_drop_hidden)
    x_x = rectify(T.dot(h, w_o))

    return h, x_x


X = T.fmatrix()
Y = T.fmatrix()

w_h = init_weights((784, 450))
w_o = init_weights((450, 784))

#Tie weights together
wtVal = 0.5 * (w_h.get_value() + np.transpose(w_o.get_value()))
w_h.set_value(wtVal)
w_o.set_value(np.transpose(wtVal))

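#graph with dropout noise (20% on inputs, 50% on hidden units); not used in the cost below
noise_h, noise_x_x = model(X, w_h, w_o, 0.2, 0.5)
#noise-free graph; training and prediction both use this one
h, x_x = model(X, w_h, w_o, 0., 0.)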

cost = T.mean(T.sqr(X - x_x))
params = [w_h, w_o]

#updates = RMSprop(cost, params, lr=0.0001)
updates = adaGrad(cost, params, eta=0.1, epsilon=1.0) #
#updates = adaDelta(cost, params, eta=0.1, rho=0.9, epsilon=1.0e-1)
#updates = nesterovAG(cost, params, c=100.0)

xTr, xTe, yTr, yTe = mnist(onehot=True)

train = theano.function(inputs=[X], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=x_x, allow_input_downcast=True)


for i in range(101):
    iJump = 40  #minibatch size

    for start, end in zip(range(0, len(xTr), iJump), range(iJump, len(xTr), iJump)):
        batchCost = train(xTr[start:end])  #separate name so the symbolic cost is not overwritten
        #make sure weights stay tied together
        wtVal = 0.5 * (w_h.get_value() + np.transpose(w_o.get_value()))
        w_h.set_value(wtVal)
        w_o.set_value(np.transpose(wtVal))
    #training error is measured on the last minibatch only; test error on the full test set
    outTr = predict(xTr[start:end])
    outTe = predict(xTe)
    print i, np.mean(np.square(xTr[start:end] - outTr)), np.mean(np.square(xTe - outTe))
    filename = 'wh1_784x450'
    #save the numpy array (not the shared variable) so the second-layer script can use it directly
    writeFile(w_h.get_value(), filename)

    #asList = list(itertools.chain.from_iterable((w_h.get_value()).tolist()))
    #plt.hist(asList)
    #plt.show()

epoch   training error   test error

0 0.00720460403344 0.00558570951932  
1 0.00543443144029 0.00422348246745  
2 0.0046455672724 0.00361716637157  
3 0.00417445467595 0.00324217888212  
4 0.00385111593976 0.00297388732927  
5 0.00360445778003 0.00276614644623  
6 0.00340239912557 0.00259675805118  
7 0.0032322766274 0.0024537540176  
8 0.00308121691577 0.00233017746953  
9 0.00295164503737 0.00222138863088  
10 0.00283979780605 0.00212492667653  
11 0.00273879005775 0.00203863466993  
12 0.00264886922637 0.00196052569108  
13 0.00256720595295 0.00188942103276  
14 0.00249258589244 0.00182407461019  
15 0.00242454550107 0.00176386154547  
16 0.00236264458621 0.00170793366705  
17 0.00230631018995 0.00165582853733  
18 0.00225383456126 0.00160708400946  
19 0.0022053266137 0.0015612323431  
20 0.00215938875221 0.0015178246612  
21 0.00211525037984 0.0014767117029  
22 0.00207342402015 0.00143746157609  
23 0.00203322783804 0.00139999618543  
24 0.00199495118219 0.00136412340604  
25 0.00195746655757 0.00132977314186  
26 0.00192219023513 0.00129674631617  
27 0.00188703374476 0.00126498188364  
28 0.00185297203411 0.00123442277568  
29 0.00182144361564 0.00120480502838  
30 0.00179030642091 0.0011763192598  
31 0.00175912686576 0.00114904858212  
32 0.00172852706855 0.00112287862498  
33 0.00169923333915 0.00109772200791  
34 0.00167067637625 0.00107358720205  
35 0.00164296463865 0.00105040718782  
36 0.00161614120953 0.00102812043263  
37 0.00158991740138 0.00100669241586  
38 0.00156435792518 0.000986102135528  
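
The weight-tying step in the training loop above can be looked at in isolation. The following is a minimal NumPy sketch, not the lecture's Theano code: the helper name `tie_weights` and the toy 6x3 shapes are assumptions made here for illustration. It shows that averaging the encoder with the transposed decoder leaves the decoder exactly equal to the encoder's transpose.

```python
import numpy as np

def tie_weights(w_h, w_o):
    """Average the encoder with the transposed decoder and return the tied pair."""
    avg = 0.5 * (w_h + w_o.T)
    return avg, avg.T

rng = np.random.RandomState(0)
w_h = rng.uniform(-0.1, 0.1, size=(6, 3))   # toy encoder weights (6 inputs, 3 hidden units)
w_o = rng.uniform(-0.1, 0.1, size=(3, 6))   # toy decoder weights

w_h_tied, w_o_tied = tie_weights(w_h, w_o)

# After tying, the decoder is exactly the transpose of the encoder.
print(np.array_equal(w_o_tied, w_h_tied.T))

# Tying an already tied pair leaves it unchanged.
w_h_again, w_o_again = tie_weights(w_h_tied, w_o_tied)
print(np.allclose(w_h_again, w_h_tied))
```

Because the gradient step treats the two matrices as independent parameters, they drift apart slightly on every update, which is why the loop repeats the averaging after each minibatch.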

The next section of code takes the first-layer weights calculated by the program above and uses them to calculate the input to the second hidden layer.  That second-layer input is then used as both the input and the reconstruction target for training the second-layer weights.
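
The forward pass described in the paragraph above can be sketched in plain NumPy. This is an illustrative sketch rather than the lecture's script: the helper names `leaky_relu` and `encode` are introduced here, and the weight matrices are random stand-ins, not trained weights. Only the shapes (784 to 450 to 225) and the leaky rectifier come from the lecture code.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # Same nonlinearity as rectify() in the lecture code: max(z, 0.01 * z)
    return np.maximum(z, slope * z)

def encode(x, weights):
    """Pass x through each encoder matrix in turn, applying the leaky rectifier."""
    h = x
    for w in weights:
        h = leaky_relu(np.dot(h, w))
    return h

rng = np.random.RandomState(0)
x = rng.rand(5, 784)                       # 5 stand-in "images" (random, not MNIST)
w1 = rng.uniform(-0.05, 0.05, (784, 450))  # stand-in for trained first-layer weights
w2 = rng.uniform(-0.05, 0.05, (450, 225))  # stand-in for trained second-layer weights

h1 = encode(x, [w1])        # input (and target) for training the second layer
h2 = encode(x, [w1, w2])    # features produced by the two-layer stack
print(h1.shape)
print(h2.shape)
```

Passing a list of matrices keeps the same function usable once more layers are added to the stack.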

In [2]:

__author__ = 'mike_bowles'
#training for second layer
import theano
from theano import tensor as T
import numpy as np
from math import sqrt
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from mnistReader import mnist
from sklearn import preprocessing
import matplotlib.pyplot as plt
import itertools
from random import sample

srng = RandomStreams()


def writeFile(w, filename):
    import cPickle
    #'with' closes the file so the pickle is fully flushed to disk
    with open(filename, 'wb') as f:
        cPickle.dump(w, f)

def readFile(filename):
    import cPickle
    with open(filename, 'rb') as f:
        return cPickle.load(f)

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    (h, w) = shape
    # Glorot normalization - last factor depends on non-linearity
    # 0.25 for sigmoid and 0.1 for softmax, 1.0 for tanh or Relu
    normalizer = 2.0 * sqrt(6) / sqrt(h + w) * 1.0
    #return theano.shared(floatX(np.random.randn(*shape) * 0.01))  #code for standard initialization
    #code for using Glorot initialization
    return theano.shared(floatX((np.random.random_sample(shape) - 0.5) * normalizer))

def rectify(X):
    #return T.maximum(X, 0.)   #original rectifier
    return T.maximum(X, 0.01 * X)  #leaky rectifier


def RMSprop(cost, params, lr=0.001, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        acc = theano.shared(p.get_value() * 0.)
        acc_new = rho * acc + (1 - rho) * g ** 2
        gradient_scaling = T.sqrt(acc_new + epsilon)
        g = g / gradient_scaling
        updates.append((acc, acc_new))
        updates.append((p, p - lr * g))
    return updates

def adaGrad(cost, params, eta=0.1, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        sumGSq = theano.shared(p.get_value() * 0.)
        sumGSq_new = sumGSq + g ** 2
        gradient_scaling = T.sqrt(sumGSq_new + epsilon)
        g = g / gradient_scaling
        updates.append((sumGSq, sumGSq_new))
        updates.append((p, p - eta * g))
    return updates

def adaDelta(cost, params, eta=1.0, rho=0.9, epsilon=1e-6):
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    for p, g in zip(params, grads):
        #calc g-squared
        gSq = theano.shared(p.get_value() * 0.)
        dwSq = theano.shared(p.get_value() * 0.)

        #exp smoothed g squared
        gSqNew = rho * gSq + (1 - rho) * g * g

        #calc dx-squared
        dw = eta * T.sqrt(dwSq + epsilon) * g / T.sqrt(gSq + epsilon)
        dwSqNew = rho * dwSq + (1 - rho) * dw * dw

        updates.append((dwSq, dwSqNew))
        updates.append((gSq, gSqNew))
        updates.append((p, p - dw))
    return updates

def nesterovAG(cost, params, c=1.0):
    grads = T.grad(cost, wrt=params)
    updates = []
    lam = theano.shared(np.asarray(1.0, dtype=theano.config.floatX), 'lam')
    lamSP1 = (1 + T.sqrt(1 + 4 * lam * lam)) / 2.0
    updates.append((lam, lamSP1))
    gammaS = (1 - lam) / lamSP1

    for p,g in zip(params, grads):
        #calc y and x at s + 1
        yS = theano.shared(p.get_value())  #initialize yS
        ySP1 = p - g / c
        pSP1 = (1 - gammaS) * ySP1 + gammaS * yS

        updates.append((yS, ySP1))
        updates.append((p, pSP1))
    return updates

def dropout(X, p=0.):
    if p > 0:
        retain_prob = 1 - p
        X *= srng.binomial(X.shape, p=retain_prob, dtype=theano.config.floatX)
        X /= retain_prob
    return X

def model(X, w_h, w_o, p_drop_input, p_drop_hidden):
    X = dropout(X, p_drop_input)
    h = rectify(T.dot(X, w_h))

    h = dropout(h, p_drop_hidden)
    x_x = rectify(T.dot(h, w_o))

    return h, x_x


X = T.fmatrix()
Y = T.fmatrix()

w_h = init_weights((450, 225))
w_o = init_weights((225, 450))

#Tie weights together
wtVal = 0.5 * (w_h.get_value() + np.transpose(w_o.get_value()))
w_h.set_value(wtVal)
w_o.set_value(np.transpose(wtVal))

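#graph with dropout noise (20% on inputs, 50% on hidden units); not used in the cost below
noise_h, noise_x_x = model(X, w_h, w_o, 0.2, 0.5)
#noise-free graph; training and prediction both use this one
h, x_x = model(X, w_h, w_o, 0., 0.)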

cost = T.mean(T.sqr(X - x_x))
params = [w_h, w_o]

#updates = RMSprop(cost, params, lr=0.0001)
updates = adaGrad(cost, params, eta=0.1, epsilon=0.001) #
#updates = adaDelta(cost, params, eta=0.1, rho=0.9, epsilon=1.0e-1)
#updates = nesterovAG(cost, params, c=100.0)

xTrTmp, xTeTmp, yTr, yTe = mnist(onehot=True)
w1 = readFile('wh1_784x450-keep')

print w1.shape
xTr = np.maximum(np.dot(xTrTmp, w1), 0.01 * np.dot(xTrTmp, w1))
xTe = np.maximum(np.dot(xTeTmp, w1), 0.01 * np.dot(xTeTmp, w1))
print xTr.shape, xTe.shape


train = theano.function(inputs=[X], outputs=cost, updates=updates, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=x_x, allow_input_downcast=True)


for i in range(3):
    iJump = 40  #minibatch size

    for start, end in zip(range(0, len(xTr), iJump), range(iJump, len(xTr), iJump)):
        batchCost = train(xTr[start:end])  #separate name so the symbolic cost is not overwritten
        #make sure weights stay tied together
        wtVal = 0.5 * (w_h.get_value() + np.transpose(w_o.get_value()))
        w_h.set_value(wtVal)
        w_o.set_value(np.transpose(wtVal))
    #training error is measured on the last minibatch only; test error on the full test set
    outTr = predict(xTr[start:end])
    outTe = predict(xTe)
    print i, np.mean(np.square(xTr[start:end] - outTr)), np.mean(np.square(xTe - outTe))
    filename = 'wh2_450x225'
    writeFile(w_h.get_value(), filename)

    #asList = list(itertools.chain.from_iterable((w_h.get_value()).tolist()))
    #plt.hist(asList)
    #plt.show()

(784, 450)
(60000, 450) (10000, 450)
0 0.0126660570167 0.0114325418772
1 0.00962611170861 0.00867053096868
2 0.0082047344161 0.0073957540318


epoch   in-sample   out-of-sample  
0 0.010870159608 0.0104834690039  
1 0.00825208722435 0.00797470370934  
2 0.00710059428044 0.00684058742782  
3 0.00642624539801 0.00615923461415  
4 0.00597277514473 0.00569340612338  
5 0.0056449302224 0.0053501671885  
6 0.00539468920043 0.00508671921458  
7 0.00519488359327 0.0048773264877  
8 0.00503135127701 0.00470778592707  
9 0.00489191450737 0.0045675123186  
:      :                :  
:      :                :  
69 0.00332224730792 0.00309935234036  
70 0.00331477648566 0.00309371076566  
71 0.00330752030716 0.00308835436001  
72 0.00330044540032 0.00308304784549  
73 0.00329322725735 0.00307783421705  
74 0.00328622875749 0.00307279897479  

In [None]:
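import numpy as np
from sklearn.ensemble import RandomForestClassifier
from mnistReader import mnist

#read full data sets and pull out subset of training data to simulate semi-supervised problem
xTr, xTe, yTr, yTe = mnist(onehot=True)
nPts = 10000
xTr = xTr[:nPts, :]
yTr = yTr[:nPts, :]

#baseline: random forest trained on the raw 784 pixel features
rfMnistRaw = RandomForestClassifier(n_estimators=100)
rfMnistRaw.fit(xTr, yTr)

pred = rfMnistRaw.predict(xTe)

print pred[0], yTe[0,:]
#labels are one-hot, so a typical misclassified row differs from the truth in two positions;
#dividing the mismatch count by 2 turns entry mismatches into an approximate row error rate
print 'Error Rate on Raw Features =  ', float(np.sum(pred != yTe))/(2.0*len(xTe))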
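
The error rate printed above counts mismatched one-hot entries and divides by two, which is exact only when every wrong row differs from the truth in exactly two positions. A forest fit on one-hot targets predicts each column separately, so a row can, in principle, come back with no class (or more than one) switched on. The sketch below compares that count with an argmax-based rate on a made-up toy example. The function name `onehot_error_rate` and the toy arrays are assumptions for illustration, not MNIST results.

```python
import numpy as np

def onehot_error_rate(pred, truth):
    """Fraction of rows whose predicted class (argmax) differs from the true class."""
    return float(np.mean(np.argmax(pred, axis=1) != np.argmax(truth, axis=1)))

# Toy example with 4 samples and 3 classes (made-up values)
truth = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])
pred = np.array([[1, 0, 0],    # correct
                 [0, 0, 1],    # wrong: differs from truth in two positions
                 [0, 0, 1],    # correct
                 [0, 0, 0]])   # no class predicted: differs in only one position

mismatch_based = float(np.sum(pred != truth)) / (2.0 * len(truth))
argmax_based = onehot_error_rate(pred, truth)
print(mismatch_based)
print(argmax_based)
```

One caveat with the argmax version: an all-zero row has its argmax at index 0, so it is scored as a prediction of class 0. Training on integer class labels instead of one-hot rows would avoid the ambiguity altogether.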

##In-class coding
1.  Take out the code tying the weights together to see how it affects the test error.
2.  Add another layer to the autoencoder.

##Homework
To see how effective this semi-supervised learning is, suppose that you've only got 10k labelled MNIST examples.  Try two approaches to building a classifier.  First, train a random forest on the 10k examples and see what out-of-sample error you get.  Then train a random forest again on the 10k examples, but use the 225 features that you get by running each example through the trained two-layer autoencoder stack that we've just built.  How does the performance compare?

What could you do to improve the performance?  