#Lecture 5.2 - Methods for Semi-supervised learning

##Readings for the lecture
https://en.wikipedia.org/wiki/Semi-supervised_learning - General summary

http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf - Label propagation

http://papers.nips.cc/paper/2506-learning-with-local-and-global-consistency.pdf - Label Spreading

##Self training
One method for semi-supervised learning is to train a model on the labeled data and then apply that model to the unlabeled data to generate labels.  The newly labeled data are then included in the training set and a new model is trained.  Many classifiers give probability estimates for each of the possible classes, which makes it possible to add only the new points where the classifier is most certain.  That also makes the process iterative: adding points to the training set alters the model somewhat, which in turn results in new points being added.  The code below implements self training using random forest as the base classifier.  The MNIST data are used for illustration.  The first 5000 MNIST training samples are treated as labeled and the next 5000 as unlabeled.  The code incorporates a probability threshold, pThresh, which is set to 0.75.  

In [3]:
__author__ = 'mike_bowles'

import numpy as np
from math import sqrt
from sklearn.ensemble import RandomForestClassifier
from mnistReader import mnist
from sklearn import preprocessing

#read full data sets and pull out subset of training data to simulate semi-supervised problem
xTr, xTe, yTr, yTe = mnist(onehot=False)
nUnLab = 5000
nLab = 5000
pThresh = 0.75

xLab = xTr[:nLab, :]
xTot = xTr[:nLab + nUnLab, :]

yLab = yTr[:nLab]
yTot = yTr[:nUnLab + nLab]

rfSemi = RandomForestClassifier(n_estimators=100)
rfSemi.fit(xLab, yLab)

#baseline: test error of the model trained on the labeled data only
predTe = rfSemi.predict(xTe)
print(predTe[0], yTe[0])
print('Error Rate =  ', float(np.mean(predTe != yTe)))


#predict labels and class probabilities for the whole pool (labeled + unlabeled)
totPred = rfSemi.predict(xTot)
totProb = rfSemi.predict_proba(xTot)

#keep predictions whose top class probability is above threshold,
#then drop the indices that are already in the labeled set (the first nLab rows)
labGtTh = np.where(np.amax(totProb, axis=1) > pThresh)[0]
newLab = labGtTh[labGtTh >= nLab]



nIter = 10
for it in range(nIter):
    #pseudo-labels come from the model's own predictions,
    #not from the held-back true labels in yTot
    yTemp = np.hstack((yLab, totPred[newLab]))
    xTemp = np.vstack((xLab, xTot[newLab, :]))

    rfSemi.fit(xTemp, yTemp)

    #re-score the whole pool with the retrained model
    totPred = rfSemi.predict(xTot)
    totProb = rfSemi.predict_proba(xTot)

    #re-select the confident points outside the original labeled set
    labGtTh = np.where(np.amax(totProb, axis=1) > pThresh)[0]
    newLab = labGtTh[labGtTh >= nLab]

    predTe = rfSemi.predict(xTe)
    print('Error Rate =  ', float(np.mean(predTe != yTe)), '  #Added Cases=  ', len(newLab))


7 7
Error Rate =   0.0636
Error Rate =   0.0661   #Added Cases=   2586
Error Rate =   0.0651   #Added Cases=   2715
Error Rate =   0.066   #Added Cases=   2790
Error Rate =   0.0639   #Added Cases=   2858
Error Rate =   0.0636   #Added Cases=   2899
Error Rate =   0.0622   #Added Cases=   2943
Error Rate =   0.0653   #Added Cases=   2973
Error Rate =   0.065   #Added Cases=   2999
Error Rate =   0.065   #Added Cases=   3025
Error Rate =   0.0636   #Added Cases=   3054


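As you can see from the results above, the performance doesn't improve dramatically with the incorporation of the self-trained data, given the parameter choices shown in the code.  The test error rate stays between about 0.062 and 0.066 across the ten iterations, compared with 0.0636 for the model trained on the labeled data alone.  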
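
The selection step at the heart of self training can be looked at in isolation.  The sketch below is a minimal illustration on a made-up probability matrix, not part of the lecture code: the helper name `select_confident` and the toy numbers are assumptions introduced here.  It assumes the class labels are 0, 1, 2, ... so that the column index of the largest probability is the predicted label (with an sklearn classifier you would map the column through `classes_`).

```python
import numpy as np

def select_confident(probs, n_labeled, threshold):
    """Pick unlabeled rows whose top class probability exceeds `threshold`.

    Rows [0, n_labeled) are treated as already labeled and are never selected.
    Returns the selected row indices and their pseudo-labels.
    """
    confidence = probs.max(axis=1)                 # top probability per row
    idx = np.where(confidence > threshold)[0]      # rows above threshold
    idx = idx[idx >= n_labeled]                    # drop the labeled rows
    pseudo = probs[idx].argmax(axis=1)             # predicted class per kept row
    return idx, pseudo

# toy class-probability matrix: 2 labeled rows followed by 4 unlabeled rows
probs = np.array([
    [0.90, 0.05, 0.05],   # labeled: skipped even though it is confident
    [0.40, 0.30, 0.30],   # labeled
    [0.10, 0.80, 0.10],   # unlabeled, confident, class 1
    [0.50, 0.30, 0.20],   # unlabeled, below threshold
    [0.05, 0.15, 0.80],   # unlabeled, confident, class 2
    [0.75, 0.15, 0.10],   # unlabeled, exactly at threshold: not kept (strict >)
])

idx, pseudo = select_confident(probs, n_labeled=2, threshold=0.75)
print(idx, pseudo)   # rows 2 and 4 are kept, with pseudo-labels 1 and 2
```

Raising the threshold admits fewer, safer pseudo-labels; lowering it admits more points at the cost of more label noise.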

##In-class coding exercise
Alter some of the parameter values in the code above to learn what elements of the problem help or hurt performance improvement relative to the value achieved by training on the labeled cases only.  You can alter the proportion of labeled to unlabeled points and the probability threshold.  

##Label Propagation
Label propagation operates by building a proximity matrix between all pairs of points in the data set.  The algorithm assigns a label to each point based on the labels of the points that are close to it.  The paper discusses how the Euclidean distance is used to derive weights, which are then used to calculate an overall score for each possible class based on the number and distance of points of each class in a neighborhood of the point being scored.  This results in labels being assigned to the unlabeled points.  The labels for the labeled points are not changed.  Because the labels assigned to the unlabeled points keep changing as the points influence one another, label propagation is an iterative algorithm.  
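
The iteration described above can be sketched in a few lines of NumPy.  This is a minimal illustration under stated assumptions, not the exact formulation from the paper or from sklearn: it uses RBF weights with a hypothetical `gamma`, row-normalizes them, and hard-clamps the labeled points after every update.  The function name and the toy one-dimensional data are made up for the example.

```python
import numpy as np

def label_propagation(X, y, gamma=1.0, max_iter=200, tol=1e-6):
    """Toy label propagation; y uses -1 for unlabeled points, 0..k-1 otherwise."""
    y = np.asarray(y)
    rows = np.where(y != -1)[0]                    # indices of labeled points
    n_classes = int(y[rows].max()) + 1

    # pairwise squared Euclidean distances turned into RBF weights
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dist)
    T = W / W.sum(axis=1, keepdims=True)           # each row sums to 1

    # one-hot class scores for labeled points, zeros for unlabeled points
    Y0 = np.zeros((len(y), n_classes))
    Y0[rows, y[rows]] = 1.0

    F = Y0.copy()
    for _ in range(max_iter):
        F_new = T @ F                              # average the neighbors' scores
        F_new[rows] = Y0[rows]                     # labeled points keep their labels
        converged = np.abs(F_new - F).max() < tol
        F = F_new
        if converged:
            break
    return F.argmax(axis=1)

# two well-separated 1-D clusters with one labeled point each
X = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
y = np.array([0, -1, -1, 1, -1, -1])

labels = label_propagation(X, y, gamma=1.0)
print(labels)   # points 0-2 end up in class 0, points 3-5 in class 1
```

Each unlabeled point inherits the label of the cluster it sits in because its weights to nearby points dominate its weights to the far cluster.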

The code below uses the LabelPropagation program from sklearn.  

In [7]:
__author__ = 'mike_bowles'

from sklearn.semi_supervised import LabelPropagation, LabelSpreading
import numpy as np
from mnistReader import mnist
from sklearn import preprocessing

#read full data sets and pull out subset of training data to simulate semi-supervised problem
xTr, xTe, yTr, yTe = mnist(onehot=False)
nUnLab = 5000
nLab = 5000


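#alpha is not a LabelPropagation parameter in current sklearn releases (labels are hard-clamped);
#n_neighbors is only used with kernel='knn', so it has no effect with the rbf kernel
lpModel = LabelPropagation(kernel='rbf', gamma=2.0, n_neighbors=13, max_iter=10, tol=0.001)

x = xTr[:nLab + nUnLab, :]
#copy so that masking the labels does not overwrite yTr
y = yTr[:nLab + nUnLab].copy()
#sklearn marks unlabeled points with -1
y[nLab:nLab + nUnLab] = -1

lpModel.fit(x,y)

#fraction of test points classified correctly (accuracy)
print(float(np.sum(lpModel.predict(xTe) == yTe))/float(len(xTe)))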

0.4824


The results here are a little underwhelming.  The printed value is the test-set accuracy, about 48%.  The out-of-sample performance with the added unlabeled data is worse than without adding it.  That's not good.  At least it can be measured, so you'll know if you're getting any improvement from trying to incorporate the unlabeled data.  The authors admit that the algorithm doesn't consistently lead to improvement.  

##In-class coding exercise
Twiddle the parameters in the code example above and discuss how the relative sizes of the labeled and unlabeled sets affect the results.  

##Label Spreading
The code below uses LabelSpreading from sklearn, the method described in the Label Spreading reading listed above.  



In [9]:
__author__ = 'mike_bowles'

from sklearn.semi_supervised import LabelPropagation, LabelSpreading
import numpy as np
from mnistReader import mnist
from sklearn import preprocessing

#read full data sets and pull out subset of training data to simulate semi-supervised problem
xTr, xTe, yTr, yTe = mnist(onehot=False)
nUnLab = 2000
nLab = 2000



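#n_neighbors is only used with kernel='knn', so it has no effect with the rbf kernel
lsModel = LabelSpreading(kernel='rbf', gamma=3.0, n_neighbors=25, alpha=0.2, max_iter=30, tol=0.001)
x = xTr[:nLab + nUnLab, :]
#copy so that masking the labels does not overwrite yTr
y = yTr[:nLab + nUnLab].copy()
#sklearn marks unlabeled points with -1
y[nLab:nLab + nUnLab] = -1

lsModel.fit(x,y)

#fraction of test points classified correctly (accuracy)
print(float(np.sum(lsModel.predict(xTe) == yTe))/float(len(xTe)))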

0.4405


Again the results aren't spectacular.  
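
To see what the `alpha` parameter is doing, it helps to write out the update from the Label Spreading reading, F <- alpha * S * F + (1 - alpha) * Y, where S is the symmetrically normalized weight matrix and Y holds the initial labels.  The sketch below is a minimal NumPy illustration under assumptions, not sklearn's implementation: the function name, the fixed iteration count, and the toy data are made up for the example.

```python
import numpy as np

def label_spreading(X, y, gamma=1.0, alpha=0.2, n_iter=50):
    """Toy label spreading; y uses -1 for unlabeled points, 0..k-1 otherwise."""
    y = np.asarray(y)
    rows = np.where(y != -1)[0]                    # indices of labeled points
    n_classes = int(y[rows].max()) + 1

    # RBF weights with the diagonal zeroed out (no self-reinforcement)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-gamma * sq_dist)
    np.fill_diagonal(W, 0.0)

    # symmetric normalization: S = D^(-1/2) W D^(-1/2)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))

    # one-hot class scores for labeled points, zeros for unlabeled points
    Y0 = np.zeros((len(y), n_classes))
    Y0[rows, y[rows]] = 1.0

    F = Y0.copy()
    for _ in range(n_iter):
        # blend the neighbors' scores with the initial labels
        F = alpha * (S @ F) + (1.0 - alpha) * Y0
    return F.argmax(axis=1)

# two well-separated 1-D clusters with one labeled point each
X = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
y = np.array([0, -1, -1, 1, -1, -1])

spread = label_spreading(X, y, gamma=1.0, alpha=0.2)
print(spread)   # points 0-2 end up in class 0, points 3-5 in class 1
```

Unlike the hard clamping in label propagation, the labeled points here are only pulled toward their initial labels with weight 1 - alpha, so a small alpha such as 0.2 keeps most of the original label information.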

##Homework problem
Using the autoencoder method on the MNIST data, you were getting slightly worse results using the autoencoder weights and random forest than running random forest on the raw data.  Train a four-layer network on a 10k sample of the MNIST data, but instead of using random initialization for the weights, start from the weights that came from autoencoder training.  