*ADS-A Week 4 Assignment*

# Face Recognition with Support Vector Machines

In this partly finished notebook, a Support Vector Machine (SVM) is used to recognize faces. We will use the Olivetti faces dataset, as included in the scikit-learn library. More info at: http://scikit-learn.org/stable/datasets/olivetti_faces.html

We start by importing numpy, scikit-learn, and matplotlib, the Python libraries we will be using for this analysis.

First, show the versions of these libraries (always wise in case you have to report problems running the notebook!) and enable inline plotting.

In [4]:
import sklearn as sk
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

## Your code ...
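
One possible way to fill in this cell is sketched below. It assumes the three libraries are installed; the `versions` dictionary is only an illustrative name, not part of the assignment.

```python
import sys

import numpy as np
import sklearn as sk
import matplotlib

# In a notebook, also run the IPython magic below to get inline plots.
# It is left as a comment because it is not valid in a plain Python script.
# %matplotlib inline

# Collect the version strings so they can be printed (and pasted into a bug report).
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "scikit-learn": sk.__version__,
    "matplotlib": matplotlib.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```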


## 1 - Load Olivetti Face Dataset

Import the olivetti faces dataset.

In [2]:
from sklearn.datasets import fetch_olivetti_faces

# Fetch the faces data
faces = fetch_olivetti_faces()
print(faces.DESCR)

downloading Olivetti faces from http://cs.nyu.edu/~roweis/data/olivettifaces.mat to C:\Users\Gebruiker\scikit_learn_data
Modified Olivetti faces dataset.

The original database was available from

    http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html

The version retrieved here comes in MATLAB format from the personal
web page of Sam Roweis:

    http://www.cs.nyu.edu/~roweis/

There are ten different images of each of 40 distinct subjects. For some
subjects, the images were taken at different times, varying the lighting,
facial expressions (open / closed eyes, smiling / not smiling) and facial
details (glasses / no glasses). All the images were taken against a dark
homogeneous background with the subjects in an upright, frontal position (with
tolerance for some side movement).

The original dataset consisted of 92 x 112, while the Roweis version
consists of 64x64 images.



## 2 - Investigate the Olivetti Face Dataset

Have a look at the data. As you can see by running the following statement, `faces` is a dictionary-like object with the keys `data`, `images`, `target`, and `DESCR`. Investigate these items. How many images are present in the dataset? What is the image size in pixels? How many persons are there?

In [8]:
print(faces.keys())

## Your code ...


dict_keys(['data', 'images', 'target', 'DESCR'])


400
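
The questions above can be answered from the array shapes. The sketch below uses a hypothetical helper, `describe_faces`, and runs it on a synthetic stand-in with the same layout as the Olivetti data (so no download is needed); in the notebook you would call it on the real `faces` object instead.

```python
import numpy as np
from sklearn.utils import Bunch


def describe_faces(faces):
    """Summarize a faces-like object with `images`, `data` and `target` attributes."""
    n_images, height, width = faces.images.shape
    return {
        "n_images": n_images,
        "image_shape": (height, width),
        "n_features": faces.data.shape[1],
        "n_persons": len(np.unique(faces.target)),
    }


# Synthetic stand-in (random pixels), NOT the real dataset: 40 persons x 10 images of 64x64.
rng = np.random.default_rng(0)
images = rng.random((400, 64, 64)).astype(np.float32)
demo = Bunch(
    images=images,
    data=images.reshape(400, -1),          # each image flattened to 4096 features
    target=np.repeat(np.arange(40), 10),   # person id for every image
)

summary = describe_faces(demo)
print(summary)
# In the notebook: print(describe_faces(faces))
```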

We don't have to scale the features. Why?

In [None]:
# Show that the data is already normalized

## Your code ...
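
A minimal sketch of such a check: if the minimum and maximum over all pixels lie in [0, 1], every feature already shares the same range. The helper name `value_range` is an assumption, and the demo array is random stand-in data; in the notebook you would pass `faces.data`.

```python
import numpy as np


def value_range(data):
    """Return (min, max, mean) over all pixel values."""
    data = np.asarray(data)
    return float(data.min()), float(data.max()), float(data.mean())


# Stand-in data with the same shape as faces.data (random values, not real faces).
rng = np.random.default_rng(0)
demo_data = rng.random((400, 4096)).astype(np.float32)

lo, hi, mean = value_range(demo_data)
print(f"min={lo:.3f} max={hi:.3f} mean={mean:.3f}")

# True when all features lie in the common interval [0, 1].
already_scaled = 0.0 <= lo and hi <= 1.0
print("already scaled to [0, 1]:", already_scaled)
# In the notebook: print(value_range(faces.data))
```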


Plot the first 20 images in a row. 

In [5]:
def print_faces(images, target, top_n):
    # set up the figure size in inches
    fig = plt.figure(figsize=(12, 12))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(top_n):
        # plot the images in a matrix of 20x20
        p = fig.add_subplot(20, 20, i + 1, xticks=[], yticks=[])
        p.imshow(images[i], cmap=plt.cm.bone)
        
        # label the image with the target value
        p.text(0, 14, str(target[i]))
        p.text(0, 60, str(i))
        
print_faces(faces.images, faces.target, 20)

Plot all the faces in a 20x20 matrix. For each one, put the target value in the top left corner and its index in the bottom left corner.
It may take a few seconds.

In [None]:
## Your code ...


As you can see, this confirms that the dataset contains 40 individuals with 10 different images each.

## 3 - Analysis with SVM

We will build a classifier whose model is a hyperplane that separates instances of one class from the rest. Support Vector Machines (SVM) are supervised learning methods that try to obtain these hyperplanes in an optimal way, by selecting the ones that pass through the widest possible gaps between instances of different classes. New instances are classified as belonging to a certain category based on which side of the hyperplane they fall on. Let's import the `SVC` class from the `sklearn.svm` module. SVC stands for Support Vector Classifier.

In [None]:
from sklearn.svm import SVC
svc_1 = SVC(kernel='linear')
print(svc_1)

Build training and testing sets.

In [None]:
# Use the train_test_split() function to create a train set (75%) and test set (25%)
from sklearn.model_selection import train_test_split

## Your code ...
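
A sketch of the split, shown on random stand-in arrays with the same shape as `faces.data` and `faces.target`. The `random_state=0` value is an arbitrary choice made here for reproducibility, not something the assignment prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays shaped like faces.data and faces.target (random, not real faces).
rng = np.random.default_rng(0)
X = rng.random((400, 4096))
y = np.repeat(np.arange(40), 10)

# 75% train, 25% test; random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
# In the notebook: train_test_split(faces.data, faces.target, test_size=0.25, random_state=0)
```

With only 10 images per person, passing `stratify=y` may also be worth trying so that every person appears in both sets.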


Once you have the classifier (algorithm) defined and the train and test data available, you are ready to do the analysis.

In [None]:
# Do the analysis with the classifier, next predict the labels of the test set and use the accuracy_score helper 
# function to determine the accuracy.
from sklearn import metrics

## Your code ...
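
The fit, predict, and score steps could look like the sketch below. To keep it self-contained it runs on synthetic data (40 noisy clusters, an assumption made purely for illustration); in the notebook you would reuse `svc_1` and your own `X_train`, `X_test`, `y_train`, `y_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 40 classes, 10 samples each, drawn around random class centers.
rng = np.random.default_rng(0)
n_classes, per_class, n_features = 40, 10, 64
centers = rng.normal(size=(n_classes, n_features))
y = np.repeat(np.arange(n_classes), per_class)
X = centers[y] + 0.1 * rng.normal(size=(n_classes * per_class, n_features))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)          # learn from the training set only
y_pred = clf.predict(X_test)       # predict labels for unseen samples
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```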


### <span style="color:blue">Explanation Cross-validation</span>

<span style="color:blue">
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a **test set**.
</span>

<span style="color:blue">
As you have seen, in scikit-learn a random split into training and test sets can be quickly computed with the ``train_test_split`` helper function.
</span>

<span style="color:blue">
When evaluating different settings (“hyperparameters”) for estimators (such as the C setting that must be manually set for an SVM) there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
</span>

<span style="color:blue">
However, by partitioning the available data into three sets, we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
</span>

<span style="color:blue">
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;
- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data.
</span>

<span style="color:blue">
The function ``KFold`` divides the dataset into k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k-1 folds, and the fold left out is used for testing.
</span>
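
To make the fold mechanics concrete, here is a small sketch that splits four samples into two folds and prints the train and test indices of each round. The toy array `X` is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Four toy samples; KFold only needs to know how many samples there are.
X = np.array(["a", "b", "c", "d"])

kf = KFold(n_splits=2)  # no shuffling: folds are consecutive blocks of indices
splits = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
# train: [2, 3] test: [0, 1]
# train: [0, 1] test: [2, 3]
```

Because the folds are consecutive blocks by default, data that is sorted by class is usually better handled with `shuffle=True` or with stratified folds.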

Perform 5-fold cross-validation. Show what all the accuracy scores are and compute the average value. Consult the sklearn documentation and when needed ask your fellow students or teacher for help.

In [None]:
from sklearn.model_selection import cross_val_score, KFold

## Your code ...
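
A sketch of 5-fold cross-validation with `cross_val_score`, again on synthetic stand-in data so that it runs on its own. Shuffling with a fixed `random_state` is a choice made here, not a requirement of the assignment; in the notebook you would pass `svc_1` and your training set.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: 40 classes, 10 samples each, drawn around random class centers.
rng = np.random.default_rng(0)
n_classes, per_class, n_features = 40, 10, 64
centers = rng.normal(size=(n_classes, n_features))
y = np.repeat(np.arange(n_classes), per_class)
X = centers[y] + 0.1 * rng.normal(size=(n_classes * per_class, n_features))

# Five shuffled folds; each fold serves once as the validation part.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)

print("accuracy per fold:", np.round(scores, 3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```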


Write down your conclusion of the K-fold cross validation.

In [None]:
## Your answer ...


### <span style="color:blue">Optionally: More on Cross-validation</span>

<span style="color:blue">
The function ``StratifiedKFold`` is a variation of k-fold which returns stratified folds: Each set contains approximately the same percentage of samples of each target class as the complete set.
</span>

In [None]:
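import numpy as np
from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed

labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)
# split() needs a feature array of matching length; its values are not used here
for train, test in skf.split(np.zeros(len(labels)), labels):
    print("%s %s" % (train, test))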

<span style="color:blue">
The function ``LeaveOneOut`` (or LOO) is a simple cross-validation scheme. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set.
</span>

In [None]:
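import numpy as np
from sklearn.model_selection import LeaveOneOut  # sklearn.cross_validation was removed

loo = LeaveOneOut()
# four dummy samples; only the number of rows matters for the split
for train, test in loo.split(np.zeros((4, 1))):
    print("%s %s" % (train, test))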

<span style="color:blue">
Potential users of LOO for model selection should weigh a few known caveats. Compared with k-fold cross-validation, LOO builds n models from n samples instead of k models, and each model is trained on n-1 samples rather than (k-1)n/k. Hence, assuming k is much smaller than n, LOO is computationally more expensive than k-fold cross-validation.
</span>

<span style="color:blue">
In terms of accuracy, LOO often results in high variance as an estimator for the test error. Intuitively, since n-1 of the n samples are used to build each model, models constructed from folds are virtually identical to each other and to the model built from the entire training set.
However, if the learning curve is steep for the training size in question, then 5- or 10- fold cross validation can overestimate the generalization error.
</span>

<span style="color:blue">
As a general rule, most authors, and empirical evidence, suggest that 5- or 10- fold cross validation should be preferred to LOO.
</span>

## 4 - Optionally: Other Metrics

Import the sklearn ``metrics`` package and also determine precision and recall on the test set, for _each class_. The code is given. Can you figure out what happens?

In [None]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

In [None]:
train_and_evaluate(svc_1, X_train, X_test, y_train, y_test)

What is your overall conclusion? Can you explain the confusion matrix?

In [None]:
## Your answer

# Conclusion: Performance of SVM for face recognition is incredibly high!
# Confusion matrix: Excellent tool to find problematic data (which face goes wrong).
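
To read a confusion matrix, remember that rows are true classes and columns are predicted classes, so every non-zero off-diagonal entry is a specific kind of mistake. The sketch below lists those entries for a tiny made-up example; `y_true` and `y_pred` are illustrative values, not results from the faces data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny made-up example: one sample of class 1 is predicted as class 2.
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Zero the diagonal (correct predictions) so only the mistakes remain.
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)

# Each tuple is (true class, predicted class, number of such errors).
errors = [
    (int(t), int(p), int(off_diagonal[t, p]))
    for t, p in zip(*np.nonzero(off_diagonal))
]
print(errors)
```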

## 5 - Discriminate People with or without Glasses

Now, another problem. 

Try to classify images of people with and without glasses. A few tips:
- Relabel all the images (by hand; look carefully at the 20x20 matrix plot above)
- Create a training & test set for this new problem
- Again try a [linear SVC classifier](http://en.wikipedia.org/wiki/Kernel_%28linear_algebra%29) (start by using the default parameters)
- Do the analysis and evaluate.
- Show a classification report as above.
- Which images go wrong?

In [None]:
## Your code ...
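
One way to turn hand-made labels into a target vector is to write down index ranges of images that show glasses and mark them with 1. The helper `create_target` and the ranges in `example_segments` below are hypothetical placeholders; the real ranges have to come from your own inspection of the 20x20 plot.

```python
import numpy as np


def create_target(segments, n_samples=400):
    """Build a 0/1 target: 1 for every index inside an inclusive (start, end) range."""
    y = np.zeros(n_samples, dtype=int)
    for start, end in segments:
        y[start:end + 1] = 1
    return y


# HYPOTHETICAL ranges for illustration only; replace them with your own labelling.
example_segments = [(10, 19), (30, 32)]

target_glasses = create_target(example_segments)
print("images labelled 'glasses':", int(target_glasses.sum()))
# In the notebook, the new problem then uses faces.data with target_glasses as labels.
```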


In [None]:
## Your answers ...
