# Problem 2: Face Recognition Using SVM and PCA

#### a) Download the dataset “Face” from this link: https://drive.google.com/drive/folders/14Mi1I91iVQ13PG0SPjh9wN5NSNlBb3rb?usp=sharing

#### Check out the dataset. This is an image dataset from the AT&T research lab. It includes 400 faces (64x64 pixels) from 40 people (10 images per person).

#### You also have to download the csv file that contains the image labels (each label is the person’s ID; the file is in the same folder). The goal is to build a face recognition algorithm that recognizes each person using PCA dimensionality reduction and a non-linear SVM.

#### You can use:

#### mpimg.imread(file_name) to load an image, and

#### plt.imshow(image_name, cmap=plt.cm.gray) to show an image (this is a little different from what we had before!). Add %matplotlib inline at the top of your code to make sure the images are shown inside the Jupyter notebook page.

In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.image as mpimg

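csv_path = 'label.csv'
face_dir = './Face'

# Labels: one person ID per image
labels = pd.read_csv(csv_path)

# Keep only image files (skip label.csv if it sits in the same folder),
# then sort numerically by file name so that row i matches label i
image_files = [f for f in os.listdir(face_dir) if f != 'label.csv']
image_files.sort(key=lambda x: int(x.split(".")[0]))

# One row per image: 64x64 pixels flattened into 4096 features
images = np.empty((len(image_files), 4096))
for i, file in enumerate(image_files):
    path = os.path.join(face_dir, file)
    images[i] = mpimg.imread(path).flatten()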
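
The loading cell above never displays an image, so here is a minimal sketch of the `mpimg.imread` / `plt.imshow` pair from the prompt. It assumes Pillow is installed (it normally comes with matplotlib) and writes a synthetic 64x64 grayscale file to a temporary folder. `demo_pixels` and `demo_face.png` are hypothetical stand-ins, not files from the Face dataset.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend; in Jupyter, use %matplotlib inline instead
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Hypothetical stand-in for one dataset file: a 64x64 grayscale gradient
demo_pixels = np.tile(np.arange(0, 256, 4, dtype=np.uint8), (64, 1))

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_face.png")
    Image.fromarray(demo_pixels).save(demo_path)
    face = mpimg.imread(demo_path)  # 2-D array for a grayscale file

# Show the image with a gray colormap, as the prompt suggests
fig, ax = plt.subplots()
ax.imshow(face, cmap=plt.cm.gray)
ax.set_axis_off()

print(face.shape)
```

In the notebook, the same two calls would presumably be pointed at one of the files in `./Face` instead of the temporary file.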

#### b) Build the feature matrix and label vector: Each image is considered a data sample with its pixels as features. Thus, to build the feature table you have to convert each 64x64 image into a row of the feature matrix with 4096 columns (i.e., 4096 features for 4096 pixels).

In [12]:
prob2_X = images
prob2_y = labels['Label']
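
To make the row layout concrete, the sketch below flattens a small stack of synthetic 64x64 arrays and checks where a given pixel ends up. The arrays are random placeholders (`toy_images` is a hypothetical name), not the Face data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical 64x64 grayscale images
toy_images = rng.integers(0, 256, size=(3, 64, 64))

# Flatten each image into one row of 64 * 64 = 4096 pixel features
toy_X = toy_images.reshape(len(toy_images), -1)

# With row-major flattening, pixel (r, c) lands in column r * 64 + c
r, c = 5, 7
same_pixel = toy_X[1, r * 64 + c] == toy_images[1, r, c]

print(toy_X.shape, same_pixel)
```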

#### c) Normalization: Normalize each column of your feature matrix using preprocessing.scale (this step is very important!).

In [13]:
from sklearn import preprocessing

# normalize/scale the data
prob_2_X_norm = preprocessing.scale(prob2_X)
prob_2_X_norm[:5]

array([[ 1.37649641,  1.11885303,  0.79610373, ..., -1.17094622,
        -1.24726506, -1.21711982],
       [ 1.68113398,  1.3654141 ,  1.03570156, ...,  0.68710075,
         1.48558299,  1.58234648],
       [-0.31593455, -0.59063704, -0.75329558, ...,  1.84210291,
         1.84204144,  1.13304942],
       [-0.73904229, -0.40982559, -0.49772456, ...,  1.10557979,
        -0.31368343, -0.99247129],
       [-0.09591852,  0.31342021,  0.57247909, ...,  0.40253499,
         0.26343976,  0.71831368]])
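
As a rough check of what `preprocessing.scale` does, the sketch below compares it with a manual column-wise standardization (subtract the column mean, divide by the column standard deviation) on synthetic data. `toy_X` is a hypothetical placeholder matrix.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 8 samples, 5 features, far from zero mean
toy_X = rng.normal(loc=100.0, scale=20.0, size=(8, 5))

# Library call used in the notebook
scaled = preprocessing.scale(toy_X)

# Manual equivalent: per-column mean and population standard deviation
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no single pixel dominates the PCA projection just because of its brightness range.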

#### d) Use sklearn functions to split the normalized dataset into testing and training sets with the following parameters: test_size=0.25, random_state=5.

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(prob_2_X_norm, prob2_y, test_size=0.25, random_state=5)
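
A small sketch of the same split on placeholder data, to confirm the resulting sizes and that each row stays paired with its label. The arrays are synthetic, with 400 samples and 40 classes of 10 samples each to mirror the dataset's shape.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i starts with the value 8 * i, label is i // 10
toy_X = np.arange(400 * 8).reshape(400, 8)
toy_y = np.repeat(np.arange(40), 10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=5
)

# Recover each test row's original index and check its label still matches
original_index = toy_X_test[:, 0] // 8
labels_still_paired = np.all(original_index // 10 == toy_y_test)

print(toy_X_train.shape, toy_X_test.shape, labels_still_paired)
```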

#### e) The dimensionality of the data samples is 4096. Use PCA (Principal Component Analysis) to reduce the dimensionality from 4096 to 50 (i.e. only k=50 principal components!). You should “fit” your PCA on your training set only, and then use this fitted model to “transform” both training and testing sets (When you finish this step, the number of columns in your testing and training sets should be 50).

In [15]:
from sklearn.decomposition import PCA

k = 50  # k is the number of components (new features) after dimensionality reduction
my_pca = PCA(n_components = k)

# new datasets after PCA
X_Train_PCA = my_pca.fit_transform(X_train)
X_Test_PCA = my_pca.transform(X_test)
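
The fit-on-train, transform-both pattern can be sketched on synthetic data as follows. The matrices are random placeholders with fewer columns than the real 4096 to keep it quick, and `toy_pca` is a hypothetical name. The `explained_variance_ratio_` attribute gives a rough idea of how much variance the kept components retain.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical training and testing sets: 300 and 100 samples, 200 features
toy_train = rng.normal(size=(300, 200))
toy_test = rng.normal(size=(100, 200))

toy_pca = PCA(n_components=50)

# Learn the projection from the training set only
toy_train_pca = toy_pca.fit_transform(toy_train)

# Reuse the same projection on the testing set (no refitting)
toy_test_pca = toy_pca.transform(toy_test)

# Fraction of training-set variance kept by the 50 components
retained = toy_pca.explained_variance_ratio_.sum()

print(toy_train_pca.shape, toy_test_pca.shape)
```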

#### f) Design and train a non-linear SVM classifier with an “RBF kernel” to recognize the faces based on the training dataset that you built. Use SVC(C=1, kernel='rbf', gamma=0.0005, random_state=1). Then test your SVM on the testing set, and calculate and report the accuracy. Also, calculate and report the confusion matrix using confusion_matrix(y_test, y_predict).

In [16]:
from sklearn.svm import SVC

svm = SVC(C=1, kernel='rbf', gamma=0.0005, random_state=1)

svm.fit(X_Train_PCA, y_train)

y_predict = svm.predict(X_Test_PCA)

In [17]:
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_predict))

0.91


#### Accuracy: 91%

In [18]:
conf_matrix = confusion_matrix(y_test, y_predict)
print(conf_matrix)

[[3 0 0 ... 0 0 0]
 [0 3 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 4 0]
 [0 0 0 ... 0 0 1]]
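
The accuracy and the confusion matrix are two views of the same predictions: correct predictions sit on the diagonal, so accuracy equals the trace divided by the total count. A minimal sketch on made-up labels (`toy_true` and `toy_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for three classes
toy_true = np.array([0, 0, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 2, 0])

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)

# Diagonal entries are correct predictions
acc_from_cm = np.trace(toy_cm) / toy_cm.sum()

print(toy_cm)
print(acc_from_cm, accuracy_score(toy_true, toy_pred))
```

Because the real matrix is 40x40, NumPy abbreviates it with `...` when printed. Raising the print threshold, for example with `np.set_printoptions(threshold=sys.maxsize)`, should show it in full.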




#### g) Now, use GridSearchCV to find the best value for parameter C in your SVM with scoring='accuracy'. Search in this list: [0.1, 1, 10, 100, 1e3, 5e3, 1e4, 5e4, 1e5].



In [19]:
from sklearn.model_selection import GridSearchCV

# PCA is refit here on all 400 samples before cross-validation splits them,
# so the validation folds influence the projection and scores may be slightly optimistic
X_normalized_pca = my_pca.fit_transform(prob_2_X_norm)

# Candidate C values from the assignment
param_grid = {'C': [0.1, 1, 10, 100, 1e3, 5e3, 1e4, 5e4, 1e5]}

# 10-fold cross-validated grid search over C for the RBF SVM
grid = GridSearchCV(SVC(kernel='rbf', gamma=0.0005, random_state=1), param_grid, scoring='accuracy', cv=10)

# Fit the grid search on the entire PCA-reduced dataset
grid_result = grid.fit(X_normalized_pca, prob2_y)

# Report the best cross-validated accuracy and the corresponding C
print("Best score:", grid.best_score_)
print("Best C value:", grid.best_params_['C'])

Best score: 0.9649999999999999
Best C value: 10
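
The cell above reduces the whole dataset with PCA once and then cross-validates the SVM on the result. One possible alternative, sketched here and not what the notebook does, is to wrap PCA and the SVM in a `Pipeline` so that PCA is refit on the training part of each fold. The data, the component count, and the shortened C list below are hypothetical placeholders chosen to keep the sketch fast.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical data: 4 classes, 20 samples each, 30 features, shifted per class
toy_y = np.repeat(np.arange(4), 20)
toy_X = rng.normal(size=(80, 30)) + toy_y[:, None] * 2.0

# PCA runs inside each cross-validation fold, followed by the RBF SVM
pipe = Pipeline([
    ("pca", PCA(n_components=5)),
    ("svm", SVC(kernel="rbf", gamma=0.0005, random_state=1)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
toy_grid = {"svm__C": [0.1, 1, 10, 100]}

search = GridSearchCV(pipe, toy_grid, scoring="accuracy", cv=5)
search.fit(toy_X, toy_y)

print("Best score:", search.best_score_)
print("Best C value:", search.best_params_["svm__C"])
```

With this layout the validation samples in each fold never influence the projection, which tends to give a slightly more conservative accuracy estimate.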

