<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/PCA5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook uses Randomized PCA<br>
The major difference between RandomizedPCA and PCA is:<br>
>PCA works hard to pick the "best" basis vectors by looking for the directions in which the original data varies most. <br>

>Random projection picks its directions randomly!

Randomly picked directions may be "worse" because they are chosen blindly, but often they are not much worse at all, and of course **picking them randomly is much faster than running PCA**. Randomized PCA sits between the two: it uses random projections to approximate the leading principal components instead of computing them exactly.
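
To make the trade-off concrete, here is a minimal sketch that compares how much variance is kept by the top PCA directions versus the same number of randomly chosen orthonormal directions. The synthetic data, the sizes, and the helper name `captured_variance_fraction` are assumptions made for illustration only; they are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 500 samples, 50 features,
# with unequal variances that are then mixed together by a random rotation.
n_samples, n_features, k = 500, 50, 5
scales = np.linspace(10.0, 0.5, n_features)
rotation, _ = np.linalg.qr(rng.normal(size=(n_features, n_features)))
X = (rng.normal(size=(n_samples, n_features)) * scales) @ rotation
Xc = X - X.mean(axis=0)  # center the data, as PCA does

def captured_variance_fraction(Xc, basis):
    """Fraction of total variance kept after projecting onto the
    orthonormal columns of `basis`."""
    projected = Xc @ basis
    return (projected ** 2).sum() / (Xc ** 2).sum()

# PCA basis: the top-k right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_basis = Vt[:k].T

# Random basis: k orthonormal directions chosen without looking at the data.
random_basis, _ = np.linalg.qr(rng.normal(size=(n_features, k)))

pca_fraction = captured_variance_fraction(Xc, pca_basis)
random_fraction = captured_variance_fraction(Xc, random_basis)
print(f"PCA directions keep    {pca_fraction:.1%} of the variance")
print(f"random directions keep {random_fraction:.1%} of the variance")
```

The PCA directions can never keep less variance than any other set of the same number of orthonormal directions; how large the gap is depends on how unevenly the variance is spread across directions.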

## Eigenfaces

This notebook uses the Labeled Faces in the Wild dataset made available through Scikit-Learn:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html

Note that the faces have already been localized and scaled to a common size. This is an important preprocessing step for facial recognition, and it is a process that can require a large collection of training data.<br>

This notebook uses PCA “eigenfaces” as a preprocessing step for facial recognition. We chose PCA because it is a broadly applicable technique that is useful for a wide array of data types. Research in facial recognition in particular, however, has shown that other, more specific feature extraction methods can be much more effective.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import fetch_lfw_people
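# The standalone RandomizedPCA class was removed from scikit-learn;
# PCA with svd_solver='randomized' provides the same method, so PCA is aliased here.
from sklearn.decomposition import PCA as RandomizedPCA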

Get the people who have at least 60 images in the dataset.<br>
>There are 1348 images in the selected dataset<br>
There are 8 people in the selected dataset


In [None]:
#The extracted dataset will only retain pictures of people that have at least
#min_faces_per_person different pictures.
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

Plot some of the faces of the people who have at least 60 images

In [None]:
# Plot the photos
fig, ax = plt.subplots(1, 8, figsize=(15, 4),
                       subplot_kw={'xticks':[], 'yticks':[]},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(0,8):
    ax[i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[i].set_xlabel(faces.target[i])

Since this is a large dataset, we will use ``RandomizedPCA``. It uses a randomized method to approximate the first $N$ principal components much more quickly than the standard ``PCA`` estimator, and thus is very useful for high-dimensional data (here each image has 62 × 47 = 2,914 pixels, so the dimensionality is nearly 3,000).
We will take a look at the first 150 components:

In [None]:
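# For the assignment, change this number. Valid values run from 1 to 1348,
# because n_components cannot exceed min(n_samples, n_features) and there are
# 1348 images. The eigenface plot further down needs at least 24 components.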
n_pcs = 150

In [None]:
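# Request the randomized solver explicitly; random_state makes the result repeatable
pca = RandomizedPCA(n_components=n_pcs, svd_solver='randomized', random_state=42)
pca.fit(faces.data)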
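
As a rough check on how close the randomized approximation gets, the sketch below fits both the exact solver and the randomized solver on the same data and compares the variance each component captures. It uses a small synthetic matrix as a stand-in for the face data (an assumption for illustration, chosen so it runs without downloading anything); the agreement on real data depends on how quickly the variance falls off.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the face matrix (an assumption for illustration):
# a rank-10 signal plus a little noise, 400 samples by 300 features.
signal = rng.normal(size=(400, 10)) @ rng.normal(size=(10, 300))
X = signal + 0.01 * rng.normal(size=(400, 300))

n_components = 10
exact = PCA(n_components=n_components, svd_solver='full').fit(X)
approx = PCA(n_components=n_components, svd_solver='randomized',
             random_state=0).fit(X)

# Compare the variance captured by each solver, component by component.
max_rel_diff = np.max(
    np.abs(exact.explained_variance_ - approx.explained_variance_)
    / exact.explained_variance_
)
print(f"largest relative difference in explained variance: {max_rel_diff:.2e}")
```

To try this on the faces, replace `X` with `faces.data` and `n_components` with `n_pcs`.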

In this case, it can be interesting to visualize the images associated with the first several principal components (these components are technically known as "eigenvectors,"
so these types of images are often called "eigenfaces").
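
The word "eigenvectors" can be checked directly: the rows of ``components_`` should match the leading eigenvectors of the data's covariance matrix, up to sign. The following is a minimal sketch on a small synthetic data set (an assumption for illustration, since the covariance matrix of the faces would be 2914 by 2914).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Small synthetic data set (an assumption for illustration) whose features
# have clearly different variances, so the eigenvalues are well separated.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA(n_components=3, svd_solver='full').fit(X)

# Eigen-decomposition of the sample covariance matrix (features x features).
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigenvalues)[::-1][:3]        # indices of the 3 largest
top_vectors = eigenvectors[:, order].T           # one eigenvector per row

# Each PCA component should match an eigenvector up to sign, so the
# absolute value of their dot product should be 1.
cosines = np.abs(np.sum(pca.components_ * top_vectors, axis=1))
print(cosines)
print(np.allclose(pca.explained_variance_, eigenvalues[order]))
```

The matching eigenvalues are what scikit-learn reports as ``explained_variance_``.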

In [None]:
fig, axes = plt.subplots(3, 8, figsize=(15, 6),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='bone')

The results are very interesting, and give us insight into how the images vary: for example, the first few eigenfaces (from the top left) seem to be associated with the angle of lighting on the face, and later principal vectors seem to be picking out certain features, such as eyes, noses, and lips.
Let's take a look at the cumulative variance of these components to see how much of the information in the data the projection preserves:

In [None]:
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
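
The curve can also be read off numerically. The sketch below finds the smallest number of leading components whose cumulative explained variance reaches a chosen threshold. The helper name `components_for_variance` and the example ratios are made up for illustration; in this notebook you would pass `pca.explained_variance_ratio_` instead.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.90):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`, or None if it never does."""
    cumulative = np.cumsum(explained_variance_ratio)
    reached = np.nonzero(cumulative >= threshold)[0]
    return int(reached[0]) + 1 if reached.size else None

# Made-up ratios for illustration; the cumulative sums are
# 0.50, 0.75, 0.85, 0.91, 0.95.
example_ratios = np.array([0.50, 0.25, 0.10, 0.06, 0.04])
print(components_for_variance(example_ratios, 0.90))  # prints 4
```

A result of `None` means the fitted components never reach the threshold, which is a hint to refit with a larger `n_pcs`.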

We see that these 150 components account for just over 90% of the variance.
That would lead us to believe that using these 150 components, we would recover most of the essential characteristics of the data.
To make this more concrete, we can compare the input images with the images reconstructed from these 150 components:

Suppose you had a data set with 100 rows, each with 500 columns. If you constructed a PCA with ``PCA(n_components=10)`` and then called ``fit``, you would find that ``components_`` has 10 rows, one for each component you requested, and 500 columns, matching the input dimension.

``transform`` projects all 100 rows of your data into this 10-dimensional space, so the output has 100 rows (one for each input row) but only 10 columns, which reduces the dimension of your data.
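
The shapes described above can be confirmed with a short sketch. The data here is random noise used purely as a placeholder of the right size.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))  # 100 rows, 500 columns (random placeholder data)

pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)

print(pca.components_.shape)  # (10, 500): one row per component
print(X_reduced.shape)        # (100, 10): one row per input row
```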

In [None]:
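# Compute the components and projected faces
pca = RandomizedPCA(n_components=n_pcs, svd_solver='randomized', random_state=42)
pca.fit(faces.data)
# Project each face onto the n_pcs principal components
components = pca.transform(faces.data)
# Reconstruct the images from their low-dimensional representation
projected = pca.inverse_transform(components)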
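
To see what ``inverse_transform`` is doing, the sketch below rebuilds the reconstruction by hand on random placeholder data (an assumption for illustration). With the default ``whiten=False``, each reconstructed row is a weighted sum of the components plus the per-feature mean that PCA subtracted during ``fit``.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # random placeholder data

pca = PCA(n_components=5).fit(X)
codes = pca.transform(X)                      # shape (100, 5)
reconstructed = pca.inverse_transform(codes)  # shape (100, 20)

# Manual reconstruction: weights times components, plus the stored mean.
manual = codes @ pca.components_ + pca.mean_
print(np.allclose(reconstructed, manual))
```

For the faces, this means each reconstructed image is the mean face plus a weighted combination of the eigenfaces.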

In [None]:
# Plot the results
fig, ax = plt.subplots(2, 8, figsize=(15, 4),
                       subplot_kw={'xticks':[], 'yticks':[]},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(8):
    ax[0, i].imshow(faces.data[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].imshow(projected[i].reshape(62, 47), cmap='binary_r')
    ax[1, i].set_xlabel(faces.target[i])

ax[0, 0].set_ylabel('full-dim\ninput')
# Use n_pcs so the label stays correct when the number of components changes
ax[1, 0].set_ylabel(f'{n_pcs}-dim\nreconstruction');

## Assignment
Change the number of components used by PCA. When does the reconstructed image become "acceptable" to you?
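
Whether a reconstruction looks acceptable is a visual judgment, but a number can help when comparing settings. One possible approach, sketched below, is to track the mean squared reconstruction error for several component counts. The helper name `reconstruction_mse`, the candidate counts, and the random placeholder data are assumptions for illustration; in the notebook you would pass `faces.data` and counts of your own choosing. The sketch uses the exact solver so that the error is guaranteed not to increase as components are added.

```python
import numpy as np
from sklearn.decomposition import PCA

def reconstruction_mse(X, n_components):
    """Mean squared error between X and its PCA reconstruction."""
    pca = PCA(n_components=n_components, svd_solver='full').fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return float(np.mean((X - X_hat) ** 2))

# Random placeholder data for illustration; in the notebook, use faces.data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) * np.linspace(5.0, 0.1, 50)

candidates = [1, 5, 10, 25, 50]
errors = [reconstruction_mse(X, k) for k in candidates]
for k, err in zip(candidates, errors):
    print(f"{k:3d} components: MSE = {err:.4f}")
```

The error drops to essentially zero once every component is kept, so the interesting question is how few components you can use before the faces stop being recognizable.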