# Eigenfaces

A classic result in Principal Component Analysis (PCA) comes from applying it to images of faces: the resulting principal components are known as eigenfaces, and they form the optimal linear basis, in the least-squares sense, for representing the faces in the dataset.

We'll perform this analysis on the Labeled Faces in the Wild (LFW) dataset.  To get this dataset, use the following commands.  Note that this will likely take a few minutes the first time, as the dataset is nearly 200MB.


In [None]:
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=20, resize=0.4, color=False)

The resulting dataset is a dictionary-like object.  It has a field called 'data', which is an m by 1850 numpy array, where m is the number of images.  Each row is a flattened image, as with MNIST.  To recover a proper grayscale face image, reshape a row to 50 by 37.  Try this with a few of the images and plot them using imshow (you may wish to switch to a grayscale colormap).  NOTE: The LFW pixel values may be scaled from 0-255, depending on your scikit-learn version, so check the maximum value of the array.  If it is 255, rescale to 0-1 by dividing by 255.
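
As a rough sketch of the reshaping and rescaling step, the snippet below uses a small synthetic array in place of the real data so that it runs without the download. The array `data` and its 0-255 range are assumptions made for illustration; with the real dataset you would start from `lfw_people.data` instead.

```python
import numpy as np

# Synthetic stand-in for lfw_people.data: 5 flattened 50x37 images with
# pixel values in 0-255 (an assumption made for this sketch).
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(5, 1850)).astype(np.float32)

# Rescale to 0-1 only if the values are not already in that range.
X = data / 255.0 if data.max() > 1.0 else data

# Each row is one flattened image; restore the 50 x 37 layout.
first_face = X[0].reshape(50, 37)
print(first_face.shape)
```

With matplotlib imported as `plt`, `plt.imshow(first_face, cmap="gray")` would then display the image in grayscale.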

Now, perform a principal component analysis on this dataset.  If we call the m by 1850 matrix of faces $X$, then the first step is to center the data as
$$X' = X - \bar{X},$$
with $\bar{X}$ the pixelwise mean of the data (as such, $\bar{X}$ should be size 1 by 1850).  


Next we can easily compute the covariance matrix as 
$$ S = \frac{1}{m} X'^T X',$$
which should have dimensions 1850 by 1850.
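
The centering and covariance steps might look like the following sketch. It uses random data with 30 columns rather than 1850 so that it runs quickly; the names `X_bar`, `X_c`, and `S` are this sketch's own, with `X_c` playing the role of $X'$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 30          # d stands in for the 1850 pixels to keep this fast
X = rng.random((m, d))

X_bar = X.mean(axis=0, keepdims=True)   # 1 x d pixelwise mean
X_c = X - X_bar                         # centered data, X' in the text
S = X_c.T @ X_c / m                     # d x d covariance matrix

# np.cov with bias=True uses the same 1/m normalization.
print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))
```

Comparing against `np.cov` is a quick way to check that the centering and the $1/m$ factor were applied as intended.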

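Now, we can compute the principal components and the variance captured by each by taking an eigenvalue decomposition of $S$: the eigenvectors are the principal components and the eigenvalues are the corresponding variances.  You can use the numpy command linalg.eig(S) to do this.  It may take a few moments to run, as the cost is cubic in the dimension of $S$!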

In [None]:
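import numpy as np
lamda, V = np.linalg.eig(S)  # S: the 1850 x 1850 covariance matrix from above
order = np.argsort(lamda.real)[::-1]  # eig does not sort; order by decreasing eigenvalue
lamda, V = lamda[order].real, V[:, order].real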

Next, plot the eigenvalues $\lambda$.  You should find that they rapidly decrease in size (you may want to use a logarithmic $y$ axis).  Choose a criterion for selecting a truncation index $p$, beyond which the components are discarded.  Create a matrix $V'$ that contains as its columns the first $p$ principal components.
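
One common criterion, offered here only as an example, is to keep enough components to capture a fixed fraction of the total variance. The sketch below uses synthetic eigenvalues, and the 0.95 threshold is a hypothetical choice rather than a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic eigenvalues sorted in decreasing order (a stand-in for lamda).
lamda = np.sort(rng.random(50) ** 4)[::-1]

# Fraction of total variance captured by the first k components.
explained = np.cumsum(lamda) / np.sum(lamda)

threshold = 0.95  # hypothetical cutoff; pick whatever suits your purpose
# Smallest number of components whose cumulative fraction reaches the cutoff.
p = int(np.searchsorted(explained, threshold) + 1)
print(p, explained[p - 1])
```

Other reasonable choices include looking for an elbow in the eigenvalue plot or fixing $p$ by a storage budget.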

Each principal component has 1850 entries, so it can be reshaped into a 50 by 37 image.  Visualize a few principal components (perhaps the first, second, tenth, hundredth, etc.).  These are the so-called eigenfaces: adding them to the mean face in various proportions recovers nearly any face in the dataset.

Next, let's explore the effects of truncation.  First, transform the data into its PCA coordinates by using the formula
$$ Z = X' V'.$$
$Z$ is an $m$ by $p$ matrix.  It contains the coordinates of each face, now expressed as coefficients in the eigenface basis.  It's also lower dimensional, since $p<1850$, so some information has been lost.  To get a sense of how much information was lost, we can transform $Z$ back into the original data coordinates using
$$
X_{recon} = Z V'^T + \bar{X}.
$$
Plot a few images from $X_{recon}$ alongside the corresponding originals from $X$.  Comment on the quality of the reconstruction.
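
The projection and reconstruction formulas can be sketched end to end on small random data. Note that this sketch swaps in `np.linalg.eigh`, which is intended for symmetric matrices, in place of `eig`; the sizes and the names `X_c` and `V_p` (for $X'$ and $V'$) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, p = 200, 30, 5                     # small sizes for illustration
X = rng.random((m, d))

X_bar = X.mean(axis=0, keepdims=True)
X_c = X - X_bar
S = X_c.T @ X_c / m

# eigh is used here because S is symmetric; it returns ascending eigenvalues.
lamda, V = np.linalg.eigh(S)
order = np.argsort(lamda)[::-1]
lamda, V = lamda[order], V[:, order]

V_p = V[:, :p]                           # V' in the text: first p components
Z = X_c @ V_p                            # m x p PCA coordinates
X_recon = Z @ V_p.T + X_bar              # back to pixel space

# Mean squared error per sample equals the sum of discarded eigenvalues.
mse = np.mean(np.sum((X - X_recon) ** 2, axis=1))
print(mse, lamda[p:].sum())
```

The two printed numbers should agree up to rounding, which gives a numerical measure of the lost information to accompany the visual comparison.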

Finally, repeat the above procedure for different values of $p$.  For example, $p=1$, $p=5$, $p=20$, etc.  How does the quality of the reconstruction change with different $p$?
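
A loop over several values of $p$ might be organized as in the sketch below, again on small random data and using `np.linalg.eigh` as a substitute for `eig`. The dictionary `errors` is a hypothetical helper for collecting the mean squared pixel error at each $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 30
X = rng.random((m, d))
X_bar = X.mean(axis=0, keepdims=True)
X_c = X - X_bar

_, V = np.linalg.eigh(X_c.T @ X_c / m)
V = V[:, ::-1]                           # reorder to decreasing eigenvalue

errors = {}
for p in (1, 5, 20, d):
    V_p = V[:, :p]
    X_recon = X_c @ V_p @ V_p.T + X_bar  # project, then reconstruct
    errors[p] = float(np.mean((X - X_recon) ** 2))
    print(p, errors[p])
```

The error shrinks as $p$ grows and vanishes, up to rounding, once every component is kept.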

BONUS ROUND: Repeat the above procedure, but in color.  You can get the color LFW dataset by setting color=True in fetch_lfw_people.