
Build an SVM Model for a Face Recognition Problem
---

We will use a well-known dataset, Labeled Faces in the Wild (LFW), a collection of
face images of famous people. It is available at http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz.

However, note that it can be imported directly with scikit-learn through the sklearn.datasets module.
The subset loaded below contains 1348 images, each flattened to 2914 pixel features (see the shape printed in step 3): we could proceed by simply using every pixel as a feature in the model.



Fitting an SVM to non-linear data using the kernel trick produces non-linear decision boundaries.
In particular, we seek to:
* Build an SVM model with a radial basis function (RBF) kernel
* Use grid search with cross-validation to explore combinations of parameters.
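
To make the kernel concrete: the RBF kernel scores the similarity of two samples as exp(-γ‖x − x′‖²). The sketch below computes it by hand on a few made-up points and compares the result with scikit-learn's `rbf_kernel` helper. The array `X_demo` and the value of `gamma` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three made-up 2-D points, used only for illustration.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 0.5  # illustrative value; larger gamma makes similarity decay faster

# Squared Euclidean distance between every pair of points.
diff = X_demo[:, None, :] - X_demo[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
K_manual = np.exp(-gamma * sq_dist)
K_sklearn = rbf_kernel(X_demo, gamma=gamma)

print(np.round(K_manual, 3))
print(np.allclose(K_manual, K_sklearn))
```

Each point has similarity 1 with itself, and the similarity shrinks toward 0 as points move apart, which is why gamma effectively sets how far the influence of a single training sample reaches.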

1. Load the data from sklearn.datasets:

In [1]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

2. Before modeling, explore the dataset object returned by fetch_lfw_people.


a- Print the field names (that is, the keys of the dictionary-like object)

In [2]:
# What fields are in the dictionary?
print(faces.keys())

dict_keys(['data', 'images', 'target', 'target_names', 'DESCR'])


b- Print the dataset description contained in the DESCR field

In [None]:

print(faces.DESCR)

.. _labeled_faces_in_the_wild_dataset:

The Labeled Faces in the Wild face recognition dataset
------------------------------------------------------

This dataset is a collection of JPEG pictures of famous people collected
over the internet, all details are available on the official website:

http://vis-www.cs.umass.edu/lfw/

Each picture is centered on a single face. The typical task is called
Face Verification: given a pair of two pictures, a binary classifier
must predict whether the two images are from the same person.

An alternative task, Face Recognition or Face Identification is:
given the picture of the face of an unknown person, identify the name
of the person by referring to a gallery of previously seen pictures of
identified persons.

Both Face Verification and Face Recognition are tasks that are typically
performed on the output of a model trained to perform Face Detection. The
most popular model for Face Detection is called Viola-Jones and is
implemented in the OpenCV li

3. Print the data, its shape, and the target names.

In [None]:

print(faces.data)

[[0.53464055 0.5254902  0.49673203 ... 0.00653595 0.00653595 0.00261438]
 [0.28627452 0.20784314 0.2522876  ... 0.96993464 0.9490196  0.9346406 ]
 [0.31895426 0.39215687 0.275817   ... 0.4261438  0.7908497  0.9555555 ]
 ...
 [0.11633987 0.11111111 0.10196079 ... 0.5686274  0.5803922  0.5542484 ]
 [0.19346406 0.21176471 0.2901961  ... 0.6862745  0.654902   0.5908497 ]
 [0.12287582 0.09803922 0.10980392 ... 0.12941177 0.1633987  0.29150328]]


In [None]:

print(faces.data.shape)

(1348, 2914)
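
The 2914 columns are the pixels of each image flattened into a single row. As a sketch of that relationship, the snippet below uses a random stand-in array rather than the real download. The 62 × 47 image size is what `fetch_lfw_people` should return with its default `resize` setting (62 × 47 = 2914), and `fake_images` is a hypothetical name.

```python
import numpy as np

# Random stand-in for faces.images: 5 grayscale images of 62 x 47 pixels.
rng = np.random.default_rng(0)
fake_images = rng.random((5, 62, 47))

# Flatten each image into one feature row, mirroring the layout of faces.data.
fake_data = fake_images.reshape(len(fake_images), -1)
print(fake_data.shape)

# A row can be reshaped back into an image, for example for plotting.
restored = fake_data[0].reshape(62, 47)
print(np.array_equal(restored, fake_images[0]))
```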


In [None]:

print(faces.target_names)

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']


4. Assign the features (X) from faces.data and the target (y) from faces.target

In [None]:

X = faces.data
y = faces.target

5. Split the data into training and testing sets.

We train the model with 70% of the samples and test with the remaining 30%.

In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)




# Print the shapes of the training and test sets to verify the split.
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(943, 2914)
(405, 2914)
(943,)
(405,)
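
Because the classes are unevenly sized, one option (not used in this exercise) is to pass `stratify=y` to `train_test_split` so that both splits keep roughly the same class proportions. The sketch below shows the effect on made-up labels; `X_demo` and `y_demo` are hypothetical stand-ins for the face data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0, 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Class counts per split; with stratification both stay close to 80/20.
print(np.bincount(y_tr))
print(np.bincount(y_te))
```

Without stratification, a small class can end up over- or under-represented in the test set purely by chance, which makes per-class metrics noisier.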


6. Declare SVM model with kernel='rbf', class_weight='balanced'

In [None]:
from sklearn.svm import SVC


svm_model = SVC(kernel='rbf', class_weight='balanced')
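
With `class_weight='balanced'`, scikit-learn weights each class inversely to its frequency, using n_samples / (n_classes × class count). The sketch below checks that formula against the `compute_class_weight` utility on made-up labels; `y_demo` is a hypothetical stand-in, not the face labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: class 0 is four times as common as class 1.
y_demo = np.array([0] * 80 + [1] * 20)
classes = np.unique(y_demo)

# Formula: n_samples / (n_classes * count of each class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(manual)  # [0.625 2.5]
print(np.allclose(manual, weights))
```

The rarer class receives the larger weight, so misclassifying one of its samples costs more during training.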

7. Use grid search with 10-fold cross-validation to explore combinations of parameters.
    - we will adjust C, which controls the hardness of the margin (how strongly misclassified training points are penalized)
    - and gamma (γ), which controls the width of the radial basis function kernel, and determine the best model.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1,5,10,50],'gamma': [0.001,0.0005,0.01,0.1]}


grid_search = GridSearchCV(svm_model, param_grid, cv=10)
grid_search.fit(X_train, y_train)
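
After fitting, the search object exposes the winning combination through `best_params_` and its mean cross-validated score through `best_score_`. The sketch below runs the same pattern on scikit-learn's small bundled digits dataset so that it finishes quickly. The dataset, the reduced grid, and `cv=3` are assumptions made for the demo, not the settings of this exercise.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small bundled dataset used as a quick stand-in for the faces data.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:300], y_demo[:300]

# Reduced grid: every (C, gamma) pair is tried, 2 x 2 = 4 candidates.
demo_grid = {"C": [1, 10], "gamma": [0.001, 0.01]}
demo_search = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"), demo_grid, cv=3
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)            # winning C and gamma
print(round(demo_search.best_score_, 3))   # mean cross-validated accuracy
```

If the best value sits at the edge of the grid, it is usually worth extending the grid in that direction and searching again.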


8. Predict on the test set, using the best model from the above step (best_estimator_)

In [None]:

y_pred = grid_search.best_estimator_.predict(X_test)

9. Model performance:
Run the following code to print the model evaluation metrics

In [13]:
from sklearn.metrics import classification_report
labels = list(faces.target_names)
print(classification_report(y_test,y_pred,target_names=labels))

                   precision    recall  f1-score   support

     Ariel Sharon       0.59      0.76      0.67        17
     Colin Powell       0.80      0.83      0.81        84
  Donald Rumsfeld       0.67      0.81      0.73        36
    George W Bush       0.90      0.86      0.88       146
Gerhard Schroeder       0.69      0.71      0.70        28
      Hugo Chavez       1.00      0.63      0.77        27
Junichiro Koizumi       0.89      1.00      0.94        16
       Tony Blair       0.82      0.78      0.80        51

         accuracy                           0.81       405
        macro avg       0.79      0.80      0.79       405
     weighted avg       0.83      0.81      0.82       405




From the classification report, the following observations can be made about the model's performance:

1. **Overall Accuracy**:
   - The model achieved an overall accuracy of **0.81**, which means 81% of the test samples were correctly classified.

2. **Class-Wise Performance**:
   - **Best Performing Classes**:
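     - *Junichiro Koizumi* has the highest F1-score (**0.94**), with precision **0.89** and recall **1.00**, meaning every test image of this class was correctly identified.
     - *George W. Bush* also has high precision (**0.90**) and recall (**0.86**) with an F1-score of **0.88**, likely helped by the large number of samples (support: **146**).
     - *Hugo Chavez* has perfect precision (**1.00**) but a recall of only **0.63** (F1-score **0.77**): every image predicted as this class was correct, but over a third of his images were assigned to other classes.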
   - **Poor Performing Classes**:
     - *Ariel Sharon* shows weaker performance with precision **0.59**, recall **0.76**, and an F1-score of **0.67**, suggesting difficulty in distinguishing this class.
     - *Gerhard Schroeder* has relatively lower scores, with precision **0.69**, recall **0.71**, and an F1-score of **0.70**.

3. **Macro Average vs. Weighted Average**:
   - **Macro Avg** (unweighted mean across classes): The F1-score is **0.79**, reflecting average performance when every class counts equally, regardless of class size.
   - **Weighted Avg** (weighted by the number of instances in each class): The F1-score is **0.82**, which is slightly higher because larger classes like *George W. Bush* and *Colin Powell* perform well and dominate the overall metric.

4. **Imbalance in Support**:
   - Some classes, such as *Ariel Sharon* (support: **17**) and *Junichiro Koizumi* (support: **16**), have far fewer samples compared to others, like *George W. Bush* (support: **146**). This class imbalance may contribute to variations in performance, although *Junichiro Koizumi* scores best despite its small support, so class size alone does not explain the differences.

5. **Precision and Recall Trends**:
   - Precision and recall are reasonably close for most classes. The main exception is *Hugo Chavez* (precision **1.00**, recall **0.63**), where the model misses many true instances (false negatives).
   - The lower precision for *Ariel Sharon* (**0.59**) and *Donald Rumsfeld* (**0.67**) indicates a higher share of false positives among the predictions for these classes.
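
The averages in item 3 can be checked by hand from the per-class rows of the report. The sketch below copies the F1-scores and supports from the table; because those inputs are already rounded to two decimals, the results only approximately match the reported 0.79 and 0.82. It also recomputes one F1-score from its precision and recall.

```python
# Per-class F1-scores and supports, copied from the classification report.
f1 = [0.67, 0.81, 0.73, 0.88, 0.70, 0.77, 0.94, 0.80]
support = [17, 84, 36, 146, 28, 27, 16, 51]

# Macro average: plain mean, every class counts equally.
macro_f1 = sum(f1) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

# F1 is the harmonic mean of precision and recall (Hugo Chavez row).
precision, recall = 1.00, 0.63
f1_chavez = 2 * precision * recall / (precision + recall)

print(f"macro F1    ~ {macro_f1:.4f}")
print(f"weighted F1 ~ {weighted_f1:.4f}")
print(f"Chavez F1   ~ {f1_chavez:.4f}")
```

The weighted figure comes out higher than the macro figure because the two largest classes, which together make up more than half of the test set, both score above the macro mean.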

### Summary:
The model performs well overall, particularly for the classes with the most support (*George W. Bush* and *Colin Powell*), but is weaker on some smaller classes such as *Ariel Sharon* and *Gerhard Schroeder*. Class imbalance and potential overlap in feature representation for some classes might explain the variability in performance.