<a href="https://colab.research.google.com/github/gherbin/ComputerVisionKUL/blob/master/CV_Group9_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hi there! 

> *\[14 Apr 2020] A notebook written by Geoffroy Herbin, group9, r0426473, in the context of the Computer Vision course [H02A5](https://p.cygnus.cc.kuleuven.be/webapps/blackboard/execute/announcement?method=search&context=course_entry&course_id=_891702_1&handle=announcements_entry&mode=view), Master of Artificial Intelligence, KULeuven.* 


Welcome to this Colab where we'll dig into some Computer Vision fancy stuff!

![face recognition image](https://i.ibb.co/KKmkZYJ/emma-stone.jpg)

The goal of this notebook is to perform face classification and identification. To reach that goal, we will first:
- retrieve training and test images
- build two "features". At this point, we simply say that a "feature" is another way to represent the input data (images, in our case):
    1. handcrafted feature: Histogram of Oriented Gradients
    2. feature learnt from the data: Principal Component Analysis

Then, we will train different models based on the two features and compare the classification and identification results.

---

* Several optimized libraries (e.g. `sklearn`) will be used extensively. However, at first, some of the key functionalities will be coded by hand, to provide a better view of what really happens behind the calls to library functions.

* The notebook is compatible with both grayscale and color modes, depending on a parameter defined a bit later. The static text is written based on the analysis made in `color = False` mode. Results may vary a little in color mode.


## Import (most of) the required packages

Almost all the required packages are imported first.
> A few, used very locally, are imported in the code snippet that needs them.


In [0]:
import os
import cv2
import tarfile
import zipfile
import shutil

import numpy as np
from numpy.testing import assert_array_almost_equal

import random
import logging

from urllib import request
from socket import timeout
from urllib.error import HTTPError, URLError

from google.colab import drive
from google.colab.patches import cv2_imshow

from distutils.dir_util import copy_tree

from skimage.feature import hog as skimage_feature_hog
from skimage import exposure

from sklearn.decomposition import PCA as sklearn_decomposition_PCA
from sklearn.model_selection import GridSearchCV
from sklearn import svm 
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

import sklearn.manifold


from math import sqrt

from matplotlib import pyplot as plt
from matplotlib.gridspec import GridSpec
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
%matplotlib inline

import seaborn as sns

from scipy.interpolate import RectBivariateSpline
from scipy.linalg import svd as scipy_linalg_svd
from scipy import ndimage, misc


logging.basicConfig(level=logging.INFO)

mpl_logger = logging.getLogger("matplotlib")
mpl_logger.setLevel(logging.WARNING)


## Parameters

As in any software system, several parameters define how the notebook behaves.
Those parameters are centralized here.

In [0]:
base_path = "/content/sample_data/CV__Group_assignment"
path_datasets = r"/content/datasets/"
path_discard = r"/content/discard/"
path_database = r"/content/DATABASE/"

'''
Parameters handling the database build up
'''
need_vgg_download = False
confirmation_db_renewal = False
to_drive_confirmation = False
to_drive_confirmation_vgg = False

load_from_local_drive = True # allows downloading the source images directly from the archive in github repository (see "important note")

show = True # similar to a global verbose parameter, for images (when custom functions allow it)

sq_size = 64 # square size used -> shall not exceed the output of get_min_size(faces_cropped) # assert sq_size <= min(get_min_size(faces_cropped))

color = False # if False, tasks are run in grayscale. If True, tasks are run in full color mode

if color:
    my_color_map = plt.cm.viridis
else:
    my_color_map = plt.cm.gray



## Several utils functions

Several *utils* functions are used to:
- pretty print the content of a dictionary,
- retrieve the minimal size of a batch of images,
- reshape an input in the appropriate way, considering the `color` and `sq_size` parameters defined above,
- retrieve the data in the appropriate format from the datasets initially built.


You can find more info on the functions in the code below, or when we use them later in the tutorial.

In [0]:
def pretty_return_dict_size(my_dict):
    '''
    returns a string containing the different size of the elements of a dict
    '''
    output_list = ["\n"]
    for k in my_dict.keys():
        output_list.append(str(k))
        output_list.append(":")
        output_list.append(str(len(my_dict[k])))
        output_list.append("\n")
    return ''.join(output_list)

def show_images_from_dict(my_dict, show_index = False):
    '''
    shows the images contained in a dictionary, going through all keys
    '''
    for k in my_dict.keys():
        logging.debug("@------------------- Images of " + str(k) + " -------------------@")
        index = 0
        for img in my_dict[k]:
            if show_index:
                logging.debug("Image index: " + str(index))
                index+=1
            cv2_imshow(img)
            logging.debug("Shape = " + str(img.shape))

def get_min_size(images_dict):
    '''
    returns the minimum size (rows, cols) over all images contained in the images_dict input
    '''
    min_rows, min_cols = float("inf"), float("inf")
    # iterate over the dictionary itself rather than over the global `persons` list
    for person_images in images_dict.values():
        for src in person_images:
            r, c = src.shape[0], src.shape[1]
            min_rows = min(min_rows, r)
            min_cols = min(min_cols, c)
    # logging.info("smallest px numbers (row, cols) = " + str((min_rows,min_cols)))
    return min_rows, min_cols
        
def my_reshape(image_vector, sq_size, color):
    '''
    returns a reshape version of an image represented as an image array, depending of the color parameter.
    If color is True, it returns a colored RGB format image of size (sq_size x sq_size) (useable as is by matplotlib)
    If color is False, it returns a grayscale image (sq_size x sq_size)
    '''
    flattened = image_vector.ndim == 1
    if flattened:
        if color:
            img_reshaped = (np.reshape(image_vector, (sq_size, sq_size, 3))).astype('uint8')
            return cv2.cvtColor(img_reshaped, cv2.COLOR_BGR2RGB)
        else:
            return np.reshape(image_vector, (sq_size, sq_size))
    else:
        if color:
            img_reshaped = (np.reshape(image_vector, (sq_size, sq_size, 3))).astype('uint8')
            return cv2.cvtColor(img_reshaped, cv2.COLOR_BGR2RGB)
        else:
            return image_vector

def get_matrix_from_set(images_set, color, sq_size = 64, flatten = True):
    '''
    from images_set (training_set or test_set), create and fill in matrix so that it contains the input data.
    if flatten, then the matrix contains images represented in 1D
    '''

    # init output
    matrix = None
    nb_faces = sum([len(images_set[x]) for x in images_set if isinstance(images_set[x], list)])
    # depending on the mode, allocate a matrix of the appropriate shape
    
    if color and flatten:
        matrix = np.empty((nb_faces, sq_size*sq_size*3)) # *3 => color images
    elif color and (not flatten):
        matrix = np.empty((nb_faces, sq_size, sq_size, 3))
    elif (not color) and flatten:
        matrix = np.empty((nb_faces, sq_size*sq_size))
    elif (not color) and (not flatten):
        matrix = np.empty((nb_faces, sq_size,sq_size ))
    else:
        raise RuntimeError

    i = 0
    for person in persons:
        for src in images_set[person]:
            src_rescaled = cv2.resize(src, (sq_size,sq_size))
            if color and flatten:
                matrix[i,:] = src_rescaled.flatten()
            elif color and (not flatten):
                matrix[i,:,:,:] = src_rescaled
            elif (not color) and flatten:
                matrix[i,:] = cv2.cvtColor(src_rescaled, cv2.COLOR_BGR2GRAY).flatten()
            elif (not color) and (not flatten):
                matrix[i,:,:] = cv2.cvtColor(src_rescaled, cv2.COLOR_BGR2GRAY)

            i +=1
    return matrix

def plot_matrix(images_matrix, color, my_color_map, h=8, w=5, transpose = False, return_figure = False):
    '''
    plots the images contained in a matrix of data, reshaping and coloring them
    '''
    fig = plt.figure(figsize=(w,h)) 
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05) 
    # plot the faces, each image is 64 by 64 pixels 

    if transpose:
        images_matrix_used = images_matrix.T.copy()
    else:
        images_matrix_used = images_matrix.copy()

    i=0
    for img_vector in images_matrix_used: 
        ax = fig.add_subplot(h, w, i+1, xticks=[], yticks=[]) 
        ax.imshow(my_reshape(img_vector, sq_size, color), cmap = my_color_map, interpolation='nearest') 
        i+=1
    plt.show()

    if return_figure:
        return fig

## Inputs

* The very first input of the system is an archive of text files, each containing 1000 web links to images. This archive is downloaded from [here](http://www.robots.ox.ac.uk/~vgg/data/vgg_face) and extracted locally in the `/content/sample_data/CV__Group_assignment` folder. We download and extract it only if needed.

* To extract the faces from the images, we download the pre-trained Haar cascade model (`haarcascade_frontalface_default.xml`).




In [0]:
if not os.path.isdir(base_path):
  os.makedirs(base_path)

if need_vgg_download:
    vgg_face_dataset_url = "http://www.robots.ox.ac.uk/~vgg/data/vgg_face/vgg_face_dataset.tar.gz"
    with request.urlopen(vgg_face_dataset_url) as r, open(os.path.join(base_path, "vgg_face_dataset.tar.gz"), 'wb') as f:
        f.write(r.read())

    with tarfile.open(os.path.join(base_path, "vgg_face_dataset.tar.gz")) as f:
        f.extractall(os.path.join(base_path))


trained_haarcascade_url = "https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml"
with request.urlopen(trained_haarcascade_url) as r, open(os.path.join(base_path, "haarcascade_frontalface_default.xml"), 'wb') as f:
    f.write(r.read())
logging.info("Downloaded haarcascade_frontalface_default")

# Data Retrieval

This tutorial will extensively use images from four different actors. The images are selected (pseudo-)randomly.

The movie stars are (chosen quite randomly as well):
1.   personA: Emma Stone
2.   personB: Bradley Cooper
3.   personC: Jane Levy
4.   personD: Marc Blucas


Process to get the dataset images:
1. Randomly pick N images (60 for persons A and B, 30 for persons C and D) from the list of 1000 images provided in the text file.
2. Reject the "i" images that are not appropriate (see the rejection step later) for each person. "i" may of course differ from one actor to another.
3. Select M images randomly out of the (N-i) images remaining for each person:
    - M=30 for personA and personB (Training and Test),
    - M=10 for personC and personD (Test only)
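
The three steps above can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's actual pipeline: `pick_candidates`, `select_final` and the rejected values are hypothetical, and rejection is done here by value instead of by list position.

```python
import random

def pick_candidates(pool_size, n, seed):
    """Step 1: draw n distinct indices from range(pool_size), reproducibly."""
    rng = random.Random(seed)      # private generator, seeded per person
    return rng.sample(range(pool_size), n)

def select_final(candidates, rejected, m, seed):
    """Steps 2 and 3: drop the rejected candidates, then keep m of the rest."""
    kept = [c for c in candidates if c not in rejected]
    if len(kept) < m:
        raise ValueError("not enough images left after rejection")
    rng = random.Random(seed)
    return rng.sample(kept, m)

candidates = pick_candidates(pool_size=1000, n=60, seed="Emma_Stone.txt")
rejected = set(candidates[:17])    # hypothetical: pretend 17 images were rejected
final = select_final(candidates, rejected, m=30, seed="Emma_Stone.txt")
print(len(candidates), len(final))  # 60 30
```

Seeding with the person's name means that re-running the selection yields the same images, which is what makes the datasets reproducible.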


---


**[IMPORTANT NOTE]**

If nothing else is set up, getting the N images requires performing URL requests on websites we do not control. This is risky: at any time, the target website could be modified, stop responding, respond too slowly, or have removed the picture of interest.
To prevent such issues, this tutorial provides the code to do things differently:
- the first time (with the relevant parameters properly set), the source images are downloaded from the websites (catching errors, skipping websites that are too slow, etc.)
- the downloaded images are then saved and zipped with a logfile
- this zip archive is then uploaded to my personal GitHub account, as a public file

This leads to a controlled database containing the source images, and ensures reproducibility across the different test runs.
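
As an illustration of the "catch errors and skip slow websites" idea, here is a minimal sketch of a download helper. The name `fetch_bytes` and its `timeout_s` parameter are hypothetical; the demonstration uses a local `file://` URL so that it runs offline.

```python
import pathlib
import socket
import tempfile
from urllib import request
from urllib.error import HTTPError, URLError

def fetch_bytes(url, timeout_s=1.0):
    """Return the raw bytes behind url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except (HTTPError, URLError, socket.timeout, ValueError) as error:
        # ValueError covers malformed URLs; the others cover network failures.
        print(f"skipped {url}: {error}")
        return None

# Offline demonstration: one readable file, one missing file.
with tempfile.TemporaryDirectory() as tmp:
    local = pathlib.Path(tmp) / "sample.bin"
    local.write_bytes(b"placeholder")
    print(fetch_bytes(local.as_uri()) == b"placeholder")
    print(fetch_bytes((pathlib.Path(tmp) / "missing.bin").as_uri()) is None)
```

Returning `None` instead of raising lets the calling loop simply move on to the next candidate image.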

*Three remarks*
1. Only the original files are stored in the ZIP archive. Those files were selected randomly, using a random number generator.
2. The curation of the source files, the face cropping, and the split between training and test sets are still done at every run of this notebook.
3. The code dedicated to the archiving and saving part is provided but not detailed in this notebook. You are welcome to contact me for more details at geoffroy.herbin@student.kuleuven.be.

---


Start from a clean slate


In [0]:
# remove each folder independently, so that a missing folder does not prevent removing the others
for path in (path_database, path_datasets, path_discard):
    shutil.rmtree(path, ignore_errors=True)

Create required folders

In [0]:
file_info = path_database+ r"info_retrieved.txt"

# create each folder independently; exist_ok avoids an error if the folder is already there
for path in (path_database, path_datasets, path_discard):
    os.makedirs(path, exist_ok=True)



Instead of downloading random images from the web, download them from a "clean" and controlled repository (on [github](https://raw.githubusercontent.com/gherbin/cv_group9_database_replica/master/DATABASE-20200318T142918Z-001.zip)) dedicated to this notebook. This ensures reproducibility and accessibility of the 180 input images.
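
For readers running outside Colab (where the `!wget` and `!rm` shell magics are unavailable), the extract-and-check step can be sketched in plain Python. `extract_and_count` is a hypothetical helper, and the tiny archive built below is only a stand-in for the real one.

```python
import os
import tempfile
import zipfile

def extract_and_count(archive_path, target_dir):
    """Extract archive_path into target_dir and return the number of files."""
    with zipfile.ZipFile(archive_path, "r") as archive:
        archive.extractall(target_dir)
    return sum(len(files) for _, _, files in os.walk(target_dir))

# Offline demonstration on a stand-in archive: 3 images + 1 log file.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        for name in ("0_a.jpg", "1_a.jpg", "2_b.jpg", "info_retrieved.txt"):
            archive.writestr("DATABASE/" + name, b"placeholder")
    n_files = extract_and_count(archive_path, os.path.join(tmp, "out"))
    print(n_files == 3 + 1)  # True
```

The same count check (180 images plus one log file) is what the notebook uses to decide whether the retrieval succeeded.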


In [0]:
if load_from_local_drive:
    
    !wget https://raw.githubusercontent.com/gherbin/cv_group9_database_replica/master/DATABASE-20200318T142918Z-001.zip

    with zipfile.ZipFile("DATABASE-20200318T142918Z-001.zip", 'r') as zip_ref:
        zip_ref.extractall()
    !rm -r "DATABASE-20200318T142918Z-001.zip"

path_, dirs_, files = next(os.walk(path_database))
if len(files) == 180+1:
    logging.info("Successful database retrieval")
elif load_from_local_drive:
    logging.error("Most Likely problem with database retrieval, number of files = " + str(len(files)))
else:
    logging.info("No database images retrieved yet")

### Definition of several data structures

`images_size` : dictionary containing the number of images to first get from the web

`persons` : list containing the names of the four persons (each name is actually the name of that person's text file in the original database)

`images`: dictionary containing the source images. The keys of the dictionary are the names of the four persons of interest

In [0]:
personA = "Emma_Stone.txt"
personC = "Jane_Levy.txt"
personB = "Bradley_Cooper.txt"
personD = "Marc_Blucas.txt"
persons = [personA, personB, personC, personD]
datasets_dict = {}
images_size = {}
images_size[personA] = 60
images_size[personB] = 60
images_size[personC] = 30
images_size[personD] = 30

total_images_size = sum(images_size.values())

# Dictionary containing the ids of the pictures downloaded from internet
vgg_ids = {}
for p in persons:
    vgg_ids[p] = []


If `confirmation_db_renewal` is `True`, the following code randomly picks the images from the web, using the name of the person as the random seed.

For a normal run, if the user does not want to change the original sourced data, `confirmation_db_renewal` should remain `False` (aka *change at your own risk* ;-) )
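
The download cell keeps only the lines whose ninth space-separated field equals 1 (the curation flag) and reads the image id from the first field. A minimal sketch of that parsing step follows; the `Entry` type, the assumption that the URL is the second field, and the sample line are all hypothetical illustrations, not taken from the real files.

```python
from typing import NamedTuple, Optional

class Entry(NamedTuple):
    image_id: int   # first field, 1-based id in the text file
    url: str        # assumed to be the second field
    curated: bool   # ninth field equals "1"

def parse_line(line: str) -> Optional[Entry]:
    """Parse one line of a person's text file, or return None if it is too short."""
    fields = line.split()
    if len(fields) < 9:
        return None
    return Entry(image_id=int(fields[0]), url=fields[1], curated=fields[8] == "1")

# Hypothetical sample line, for illustration only.
sample = "00000001 http://example.com/a.jpg 10.0 20.0 110.0 120.0 1.00 3.50 1\n"
entry = parse_line(sample)
print(entry.image_id, entry.url, entry.curated)
```

Splitting on whitespace assumes the URL itself contains no spaces; the notebook's cell instead locates the URL between `http://` and `.jpg`.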



In [0]:
if confirmation_db_renewal:
    try:
        shutil.rmtree(path_database)
    except:
        pass 
    try:
        os.mkdir(path_database)
    except:
        pass 

    fo = open(file_info, "w+")

    # images = {}
    # images_nominal_indices = {}
    for person in persons:
        logging.debug("Taking care of: " + str(person))
        random.seed(person)
        # print(hash(person))
        images_ = []
        # images_nominal_indices_ = []
        prev_index = []


        with open(os.path.join(base_path, "vgg_face_dataset", "files", person), 'r') as f:
            lines = f.readlines()       
        

        while len(images_) < images_size[person]:
            index = random.randrange(0, 1000)
            logging.debug("Index = " + str(index))
            if index in prev_index:
                logging.debug("Index = " + str(index) + " => already there")
                continue
            else:
                prev_index.append(index)
                line = lines[index]
                # only curated data
                if int(line.split(" ")[8]) == 1:
                    url = line[line.find("http://"): line.find(".jpg") + 4]
                    logging.debug("URL > \"" + str(url))
                    try:
                        res = request.urlopen(url, timeout = 1)
                        img = np.asarray(bytearray(res.read()), dtype="uint8")
                        img = cv2.imdecode(img, cv2.IMREAD_COLOR)

                        h, w = img.shape[:2]
                        cv2_imshow(cv2.resize(img, (w//4, h//4)))
                        # images_nominal_indices_.append(index)

                        filename = path_database +  str(index) + "_" + str(person.split(".")[0]) + ".jpg"

                        value = cv2.imwrite(filename, img) 
                        # logging.debug("saved in DB: " + str(filename))
                        images_.append(img)
                        fo.write(line)
                    except ValueError as e:
                            logging.error("Value Error >" + str(e))
                    except (HTTPError, URLError) as e:
                            logging.error('ERROR RETRIEVING URL >' + str(e))
                    except timeout:
                            logging.error('socket timed out - URL %s', str(url))
                    except cv2.error as e: 
                            logging.error("ERROR WRITING FILE IN DB  >" + str(e))
                    except:
                        logging.error("Weird exception : " + str(line))
                else:
                    logging.debug("File not curated => rejected (id = " + str(index) + " )")    
                
                # images[person] = images_
                # images_nominal_indices[person] = images_nominal_indices_

    fo.close()
else:
    logging.warning("If you really want to erase and renew the database, first set the \"confirmation_db_renewal\" parameter to True in the Parameters section")


From the logfile in the archive, extract the information and fill the dictionary containing all the images `images`.
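
Since the log file lists all lines of personA first, then personB, personC and personD, the owner of a line follows from cumulative sizes. A compact sketch of that lookup, using placeholder person names and a hypothetical `person_for_line` helper:

```python
from bisect import bisect_right
from itertools import accumulate

sizes = {"personA": 60, "personB": 60, "personC": 30, "personD": 30}
names = list(sizes)
boundaries = list(accumulate(sizes.values()))   # [60, 120, 150, 180]

def person_for_line(line_index):
    """Return the person owning the given 0-based line of the log file."""
    if not 0 <= line_index < boundaries[-1]:
        raise IndexError("line index outside the log file")
    return names[bisect_right(boundaries, line_index)]

print(person_for_line(0), person_for_line(59), person_for_line(60), person_for_line(179))
# personA personA personB personD
```

This is equivalent to a chain of `range` tests, but it does not need to be rewritten when a person is added.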



In [0]:
with open(file_info, 'r') as f: 
    lines = f.readlines()

assert len(lines)==total_images_size, "amount of lines in file incompatible" 

images = {}

for p in persons:
    images[p] = []


images_index = {}
for running_index in range(len(lines)):
    if running_index in range(0,images_size[personA]):
        p = personA
    elif running_index in range(images_size[personA],images_size[personA]+images_size[personB]):
        p = personB
    elif running_index in range(images_size[personA]+images_size[personB],images_size[personA]+images_size[personB]+images_size[personC]):
        p = personC
    elif running_index in range(images_size[personA]+images_size[personB]+images_size[personC],total_images_size):
        p = personD
    ind = str(int(lines[running_index].split(" ")[0])-1)
    vgg_ids[p].append(ind)
    filename = ind + "_" + str(p.split(".")[0]) + ".jpg"
    images[p].append(cv2.imread(path_database+filename, cv2.IMREAD_COLOR))
    

### Rejection step
Although the downloaded images were flagged as curated in the source files, several of them need to be removed before they can be used in the context of this *educational* tutorial. The main reasons are:

*   too different from usual representation (make up, masks, ...)
*   really poor image quality
*   irrelevance and/or error in dataset
*   same image already in dataset
*   cropped image

Considering the tight selection of images to train our model (20 from personA and 20 from personB), and the relatively large global amount of image candidates (1000 for each person), it is acceptable to reject the images we know won't help.

From the initially retrieved images, we then remove the undesired ones, which we copy to a discard folder for tracking purposes. We/you may want to use them later.

---
`datasets_size`: dictionary of the size required per person (keys = person names)

`to_remove`: dictionary containing the indices to remove, per person (keys = person names)

`print_images = False` indicates that the remaining images in `images` dictionary will not be printed. 

---

From the remaining images (after rejection), we can select randomly the images that are part of the final sets (training and test sets are not split yet).
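
Removing items by position is only safe when the positions are processed in descending order, otherwise each removal shifts the items that follow. A small sketch with made-up data (`remove_indices` is a hypothetical helper):

```python
def remove_indices(items, indices):
    """Remove the given positions from items in place; return what was removed."""
    removed = []
    for index in sorted(set(indices), reverse=True):
        removed.append(items.pop(index))
    return removed

images = ["img0", "img1", "img2", "img3", "img4"]
ids = [10, 11, 12, 13, 14]          # a list kept in parallel with images
to_remove = [1, 3]

remove_indices(images, to_remove)
remove_indices(ids, to_remove)      # same removals, so both lists stay aligned
print(images, ids)
# ['img0', 'img2', 'img4'] [10, 12, 14]
```

Any list kept in parallel with the images (such as a list of ids) needs the same removals, or positions in one list no longer match the other.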

In [0]:
# Dictionary of the size required (see section 3)
datasets_size = {}
datasets_size[personA] = 30
datasets_size[personB] = 30
datasets_size[personC] = 10
datasets_size[personD] = 10


# manually remove images that are not relevant or considered not good enough to be part of the dataset
to_remove = {}
to_remove[personA] = [0,1,4,8,12,13,16,23,28,34,36,42,44,47,48,49,54]
to_remove[personB] = [4,7,11,12,13,16,21,22,23,24,25,26,27,32,36,39,41,46,49,53,55,58]
to_remove[personC] = [0,1,6,7,8,11,14,16,17,19,20,21,24]
to_remove[personD] = [0,3,5,6,8,10,12,15,16,17,24]

# goal is to sort in descending to remove elements from lists without modifying the indexes
for p in persons:
    to_remove[p].sort(reverse = True)


# retrieve images candidates
# --------------------------
if len(os.listdir(path_datasets) ) == 0 or True:
    logging.debug("datasets empty - need to retrieve all !")
    # removing images to discard
    for person in persons:
        for index in to_remove[person]:
            img = images[person].pop(index)
            vgg_ids[person].pop(index)  # keep the ids aligned with the remaining images
            logging.debug("Removing item " + str(index) + " from list " + str(person))
            filename = path_discard +  str(index) +"_discarded_" + str(person.split(".")[0]) + ".jpg"
            try:
                cv2.imwrite(filename, img)
            except cv2.error:
                logging.error("Error while writing discarded image " + str(filename))

    # randomly select among remaining images
    for person in persons:
        # build list of indices from remaining images
        logging.debug("Phase 2 (part 1) -> random indices selection for " + str(person))

        images_ = []
        indices = []
        new_ids = []
        # prev_index = []
        random.seed(person)

        while len(indices) < datasets_size[person]:       
            index = random.randrange(0, len(images[person]))
            if index in indices:
                logging.debug("Index among remaining = " + str(index) + " => already there")
                continue
            else:
                # prev_index.append(index)
                indices.append(index)

        logging.debug("Phase 2 (part 2) -> image selection based on indices")

        for index in indices:
            img = images[person][index]
            images_.append(img)
            filename = path_datasets +  str(vgg_ids[person][index]) + "_" + str(person.split(".")[0]) + ".jpg"
            logging.debug("saved: " + str(filename))
            cv2.imwrite(filename, img) 
            new_ids.append(vgg_ids[person][index])
        images[person] = images_
        vgg_ids[person] = new_ids
else:
    logging.debug("folders not empty => can build directly images dictionary")

# logging.debug("Number of images keys=" + len(images.keys))
# logging.debug("Number of images values=" + len(images.values))

logging.info(pretty_return_dict_size(images))
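# Print the images to help choose the to_remove indices.
# Note: this block uses `faceCascade`, which is only created in the face detection
# section further down; run that cell first before setting print_images to True.
print_images = False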
if print_images:
    for person in persons:
        counter = 0
        for img in images[person]:
            h = 0
            w = 0
            img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            faces = faceCascade.detectMultiScale(
                img_gray,
                scaleFactor=1.13,
                minNeighbors=10,
                minSize=(30, 30),
                flags=cv2.CASCADE_SCALE_IMAGE
            )
            for (x,y,w,h) in faces:
                # faces_cropped[person].append(img[y:y+h, x:x+w])
                cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)
            logging.debug("------------------------------------------------------")
            logging.debug("Photo ID = " + str(counter))
            logging.debug("size = " + str((h,w)))
            cv2_imshow(img)
            counter += 1

### Save images on drive (only if required)
Mount drive and save images, according to the parameter `to_drive_confirmation` value. *Change at your own risk ;-)*

In [0]:
# Save to drive folders
path_drive_DB = r"/content/drive/My Drive/ComputerVision/DATABASE"
path_drive_Datasets = r"/content/drive/My Drive/ComputerVision/DATASETS"

# drive folders should be properly set up

if to_drive_confirmation:
    
    logging.warning("You need to have a drive mounted for this snippet to run successfully")

    try:
        drive.mount('/content/drive')
    except:
        pass
    try:
        shutil.rmtree(path_drive_DB)
        shutil.rmtree(path_drive_Datasets)
    except:
        logging.error("Error in rmtree") 


    try: 
        os.mkdir(path_drive_DB) 
        os.mkdir(path_drive_Datasets)
    except OSError as error: 
        logging.error(error) 
    
    logging.debug("Saving database in drive : start")

    fromDirectory = path_database
    toDirectory = path_drive_DB
    copy_tree(fromDirectory, toDirectory)

    logging.debug("Saving datasets in drive : start")

    fromDirectory = path_datasets
    toDirectory = path_drive_Datasets
    copy_tree(fromDirectory, toDirectory)

    logging.debug("Saving: done !")


In [0]:
path_vgg = r"/content/sample_data/CV__Group_assignment"
path_drive_vgg = r"/content/drive/My Drive/ComputerVision/CV__Group_assignment"

# drive folders should be properly set up

if to_drive_confirmation_vgg:
    
    logging.warning("You need to have a drive mounted for this snippet to run successfully")

    try:
        drive.mount('/content/drive')
    except:
        pass
    try:
        shutil.rmtree(path_drive_vgg)
    except:
        logging.error("Error in rmtree") 

    try: 
        os.mkdir(path_drive_vgg) 
    except OSError as error: 
        logging.error(error) 
    
    logging.debug("Saving database in drive : start")

    fromDirectory = path_vgg
    toDirectory = path_drive_vgg
    copy_tree(fromDirectory, toDirectory)

    logging.debug("Saving: done !")


## Face detection using Haar Cascade
From the raw images saved in the `images` dictionary, the faces are extracted using the *HaarCascade* method.

The following code is based on the tutorial: [How to detect faces using Haar Cascade](https://www.digitalocean.com/community/tutorials/how-to-detect-and-extract-faces-from-an-image-with-opencv-and-python)

The faces are saved in a new dictionary: `faces_cropped`
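
A cascade detector may return zero, one or several boxes for a single image. The notebook's code keeps every detection; the sketch below shows one alternative policy (keeping only the largest box), which is an assumption for illustration and not what the notebook does. Images are represented as plain lists of rows so that the example needs no library.

```python
def largest_detection(detections):
    """Return the (x, y, w, h) box with the largest area, or None if there is none."""
    boxes = [tuple(int(v) for v in box) for box in detections]
    if not boxes:
        return None
    return max(boxes, key=lambda box: box[2] * box[3])

def crop_rows(image_rows, box):
    """Crop a 2-D image stored as a list of rows, mirroring img[y:y+h, x:x+w]."""
    x, y, w, h = box
    return [row[x:x + w] for row in image_rows[y:y + h]]

# Hypothetical detections: a small false positive and the actual face.
detections = [(5, 5, 30, 30), (40, 10, 64, 64)]
print(largest_detection(detections))          # (40, 10, 64, 64)

image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(crop_rows(image, (1, 1, 2, 2)))         # [[5, 6], [9, 10]]
```

With such a policy, each source image contributes at most one face, which keeps the number of cropped faces equal to the number of images where a face was found.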

In [0]:
faceCascade = cv2.CascadeClassifier(os.path.join(base_path, "haarcascade_frontalface_default.xml"))
faces_cropped = {}

with open(file_info, 'r') as f: 
    lines = f.readlines()

for person in persons:

    faces_cropped[person] = []

    for img in images[person]:
        img_ = img.copy()
        img_gray = cv2.cvtColor(img_, cv2.COLOR_BGR2GRAY)
        faces = faceCascade.detectMultiScale(
            img_gray,
            scaleFactor=1.13,
            minNeighbors=10,
            minSize=(30, 30),
            flags=cv2.CASCADE_SCALE_IMAGE
        )
        for (x,y,w,h) in faces:
            faces_cropped[person].append(img[y:y+h, x:x+w])
            cv2.rectangle(img_, (x, y), (x+w, y+h), (0, 255, 0), 2)

        # h, w = img_.shape[:2]
        # draw_box(lines, int(vgg_ids[person][running_index])+1, img_, person)
        # cv2_imshow(cv2.resize(img_, (w // 2, h // 2)))

logging.info("Faces extracted and saved in dictionary faces_cropped")
logging.info(pretty_return_dict_size(faces_cropped))

At this point, provided the detector found exactly one face in each selected image, the `faces_cropped` dictionary contains 30 cropped faces each for personA and personB, and 10 each for personC and personD.

The following code selects randomly (based on a seed) the 20 images part of the training set for personA and personB. The other faces (10 for each person) are then part of the test set.

---

* `training_set`: dictionary containing faces cropped (original size) part of the training set
* `test_set`: dictionary containing faces cropped (original size) part of the test set
---

At this point, there are no dedicated validation sets (yet). This is discussed later on.

All the training will be done on the training set faces, without any tailoring or dedicated fitting on the test set images. Indeed, metrics on the test set faces indicate how well our model will generalize. It's therefore important to not influence our model with the data of the test set.
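
The seeded split described above can be sketched as follows. `split_indices` is a hypothetical helper; it guarantees that training and test indices are disjoint and that the split is identical on every run.

```python
import random

def split_indices(n_items, n_train, seed):
    """Split range(n_items) into n_train training indices and the remaining test indices."""
    rng = random.Random(seed)
    train = rng.sample(range(n_items), n_train)
    train_set = set(train)
    test = [i for i in range(n_items) if i not in train_set]
    return train, test

train, test = split_indices(n_items=30, n_train=20, seed="Emma_Stone.txt")
print(len(train), len(test), set(train).isdisjoint(test))  # 20 10 True

# Persons used for testing only simply get an empty training part.
print(split_indices(n_items=10, n_train=0, seed="Jane_Levy.txt")[0])  # []
```

Keeping the two index lists disjoint is what ensures that no test face ever influences the trained model.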


In [0]:
training_sets_size = {}
training_sets_size[personA] = 20
training_sets_size[personB] = 20
training_sets_size[personC] = 0
training_sets_size[personD] = 0

test_sets_size = {}
test_sets_size[personA] = 10
test_sets_size[personB] = 10
test_sets_size[personC] = 10
test_sets_size[personD] = 10

training_set = {}
test_set = {}
for person in persons:
    image_ = faces_cropped[person]
    training_set_ = []
    random.seed(person)
    init_set = set(range(0, len(image_)))

    indices_training = random.sample(sorted(init_set), training_sets_size[person])  # random.sample needs a sequence, not a set (Python >= 3.11)
    indices_test = list(init_set - set(indices_training))

    training_set[person] = [faces_cropped[person][i] for i in indices_training] 
    test_set[person] = [faces_cropped[person][i] for i in indices_test]

logging.info("Faces saved in dictionary training_set: ")
logging.info(pretty_return_dict_size(training_set))
logging.info("Faces saved in dictionary test_set: ")
logging.info(pretty_return_dict_size(test_set))

In [0]:
# show_images_from_dict(training_set)

In [0]:
# show_images_from_dict(test_set)

### Faces of the training set
- 20 faces of Emma Stone, personA
- 20 faces of Bradley Cooper, personB

PersonA and PersonB are quite different, A being female and B male. Furthermore, the images within a class are somewhat dissimilar as well:
- different viewpoints (front, left, right)
- different lighting conditions
- different hair colors
- beard/no beard (personB)
- different (limited) backgrounds

However, a shared characteristic is that both actors are smiling on most of the extracted faces, sometimes showing their teeth, sometimes not.


In [0]:
#  Visualization 
training_set_matrix = get_matrix_from_set(training_set, color = True, sq_size = sq_size,flatten = True)
plot_matrix(training_set_matrix, color = True, my_color_map = plt.cm.viridis, h=4, w=10, transpose = False)

### Faces of the test set
Test images are needed for four persons:
- 10 faces of Emma Stone, personA
- 10 faces of Bradley Cooper, personB
- 10 faces of Jane Levy, personC
- 10 faces of Marc Blucas, personD

(A - C) and (B - D) respectively share some characteristics:
* both female / male
* same kind of skin tone
* visually quite similar (especially for A and C)

Within each group, as for the training set, the faces are taken from different viewpoints, lighting conditions, ...


In [0]:
test_set_matrix = get_matrix_from_set(test_set, color = True, sq_size=sq_size, flatten = True)
plot_matrix(test_set_matrix, color = True, my_color_map = plt.cm.viridis, h=4, w=10, transpose = False)

# Feature Representations

A feature representation of an object is, intuitively, a piece of *information* of reduced dimension with respect to the object, which nevertheless captures the object.
It tells what defines the object, and allows differentiating between objects.

In the context of an image, a good feature needs to be:
- **robust**: the same feature extracted from the same object should be *close* across images, even if the lighting conditions change, the viewpoint changes, ...
- **discriminative**: different images, representing different objects, should lead to different representations in the feature space. As a toy example, the size of an image is not a good feature to detect a person, as several persons can be represented in images of the same size.

We will look at two features:
1. **HOG**: Histogram of Oriented Gradients. This is a handcrafted feature, extracted using a specific algorithm 
2. **PCA**: Principal Component Analysis. This is a feature learnt from the data.






## Histogram of Oriented Gradients - HOG



As this is a tutorial on Computer Vision, let's first look at what the HOG is, visually / intuitively, on a real image - one of the Emma Stone (personA) faces. Executing the following code snippet shows the input face on the left and the result on the right.

Then, we'll see the details of the algorithm and its specificities (parameters).

This section is extensively inspired by [this course](https://www.learnopencv.com/histogram-of-oriented-gradients/), while the technique was introduced by [this paper](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf), which I strongly advise reading!

In [0]:
'''
First example - Emma stone first image
'''
src = faces_cropped[personA][0]

#1 resizing
resized_img = cv2.resize(src, (sq_size, sq_size))
#2 computing HOG
fd, hog_image = skimage_feature_hog(resized_img, 
                    orientations=9, 
                    pixels_per_cell=(8,8), 
                    cells_per_block=(2, 2), 
                    block_norm = "L2",
                    visualize=True, 
                    transform_sqrt = True,
                    multichannel=True)

'''
Plotting results
'''
logging.info("Resized image and its Histogram of Oriented Gradients (its visual representation)")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8), sharex=True, sharey=True) 
ax1.imshow(cv2.cvtColor(resized_img, cv2.COLOR_BGR2RGB))
ax1.set_title('Input image') 

# Rescale histogram for better display 
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10)) 

ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray) 
ax2.set_title('Histogram of Oriented Gradients - rescaled')
logging.debug("HOG Rescaled: " + str(hog_image_rescaled.min()) + " -> " + str(hog_image_rescaled.max()) )

# ax3.imshow(hog_image, cmap=plt.cm.gray) 
# ax3.set_title('Histogram of Oriented Gradients')
# logging.debug("HOG: " + str(hog_image.min()) + " -> " + str(hog_image.max()) )

plt.show()

# logging.debug(hog_image_rescaled.shape)
logging.debug("fd shape = " + str(fd.shape))

### HOG - What is that ?

HOG is a feature descriptor that extracts information from an image (or more precisely, a patch) based on the gradients in this image. More specifically, it builds a vector representing the weighted distribution of the gradient orientations across the image.

#### Why is it interesting ?
Let's remember that our goal is to perform image classification and identification. 
A face can be recognized through its inherent shapes: the roundness of the face, the shapes of the eyes and the nose, potentially the glasses, etc. The *edge* information is therefore useful! It is even more useful than the colors... Intuitively, you can recognize someone familiar from only a few contours of their face.

![Drawing Obama](http://www.drawingskill.com/wp-content/uploads/2/Barack-Obama-Drawing-Pics.jpg)

It is easy to recognize the former US President, while no color information is given. Intuitively, HOG gives the same information:
- magnitude of gradient is large around the edges and corners
- orientation gives the shape

#### How to compute the HOG of an image ?

In a nutshell:
- The gradients are first computed on each pixel. 
- The gradient orientations and magnitudes are used to build a histogram for each cell. The size of a cell is typically 8 x 8 pixels. 
- Those histograms are normalized.
- All the histograms computed on the image are then concatenated into a *long* vector, yet much smaller than the original image. 
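
The four steps above can be sketched in a few lines of `numpy`. This is only a minimal illustration under simplifying assumptions (grayscale input, centered differences for the gradients, hard assignment of each pixel to one bin, one cell per block); the function name `tiny_hog` is hypothetical and is not the `MyHog` class detailed afterwards.

```python
import numpy as np

def tiny_hog(gray, cell=8, n_bins=9):
    """Illustrative HOG (hypothetical helper): per-cell histograms, L2-normalized per cell."""
    gray = gray.astype(np.float64)

    # Step 1: gradients with centered differences (border pixels left at zero).
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    orn = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned orientation

    # Step 2: one magnitude-weighted histogram per cell.
    n_rows, n_cols = gray.shape[0] // cell, gray.shape[1] // cell
    bin_idx = np.minimum((orn // (180.0 / n_bins)).astype(int), n_bins - 1)
    hists = np.zeros((n_rows, n_cols, n_bins))
    for r in range(n_rows):
        for c in range(n_cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hists[r, c] = np.bincount(bin_idx[sl].ravel(),
                                      weights=mag[sl].ravel(),
                                      minlength=n_bins)

    # Step 3: L2-normalize each histogram (epsilon avoids a division by zero).
    hists = hists / (np.linalg.norm(hists, axis=2, keepdims=True) + 1e-12)

    # Step 4: concatenate everything into one long vector.
    return hists.ravel()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64))           # random stand-in for a face
descriptor = tiny_hog(image)
print(descriptor.shape)  # (576,)
```

A $64 \times 64$ image gives $8 \times 8$ cells of 9 bins each, hence 576 values instead of 4096 pixel values.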

In the upcoming sections, we will detail the process, as well as the code and parameters required at each step.

The following code snippet is a homemade class used to compute the HOG. The results obtained with this code will be compared with those of the well-known skimage library, which provides an optimized HOG descriptor. 

You can simply run the snippet and come back later on to see the details of the implementation.


In [0]:
class MyHog():
    
    def __init__(self, img):
        self.img = img # image of the size 64x64; 64x128; 128x128; ... => resized image of the original
        self.mag_max, self.orn_max = self.compute_gradients()

    def compute_gradients(self):
        gx = cv2.Sobel(self.img, cv2.CV_32F, 1, 0, ksize = 1)
        gy = cv2.Sobel(self.img, cv2.CV_32F, 0, 1, ksize = 1)

        mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
        orn = angle.copy()

        logging.debug("mag shape :" + str(mag.shape))
        logging.debug("orn shape :" + str(orn.shape))

        # constructing matrices of max dimension
        mag_max = np.zeros((mag.shape[0], mag.shape[1]))
        orn_max = np.zeros((orn.shape[0], orn.shape[1]))
        for i in range(mag.shape[0]):
            for j in range(mag.shape[1]):
                mag_max[i,j] = mag[i,j].max()
                idx = np.argmax(mag[i,j])
                orn_max[i,j] = orn[i,j,idx] 

        # mag_max = mag_max.T    
        # orn_max = orn_max.T 

        return mag_max, orn_max

    def get_cells_mag_orn(self, y_start, x_start, cell_h, cell_w):
        '''
        returns the cell magnitude, orientation and "clipped" orientation 
        ( where 0 -> 360 is mapped into 0 -> 180)
        '''
        cell_mag = np.zeros((cell_h,cell_w))
        cell_orn = np.zeros((cell_h,cell_w))
        for i in range(cell_h):
            for j in range(cell_w):
                cell_mag[i,j] = self.mag_max[y_start+i, x_start+j]
                cell_orn[i,j] = round(self.orn_max[y_start+i,x_start+j])
        
        cell_orn_clipped = cell_orn.copy() 
        cell_orn_clipped = ((cell_orn_clipped) + 90 ) % 360
        for i in range(cell_h):
            for j in range(cell_w):
                if 0 <= cell_orn_clipped[i,j] < 180:
                    cell_orn_clipped[i,j] = 180 - cell_orn_clipped[i,j]
                elif 180 <= cell_orn_clipped[i,j] <=360:
                    cell_orn_clipped[i,j] = 180 - cell_orn_clipped[i,j] % 180

        return cell_mag.T, cell_orn.T, cell_orn_clipped.T
    
    def fill_bins_one_pixel(self, mag, orn, bin_list, implementation_type = "skimage"):
        '''
        # mag: magnitude of the gradient of 1 px
        # orn: orientation of the gradient of 1 px
        bin_list: reference, list of bins that is incremented
        '''
        N_BUCKETS = len(bin_list)
        assert N_BUCKETS == 9, "N_BUCKETS is not 9!!!"
        size_bin = 20.
        if implementation_type == "learnopencv":
            if orn >= 160:
                left_bin = 8
                right_bin = 9
                left_val= mag * (right_bin * 20 - orn) / 20
                right_val = mag * (orn - left_bin * 20) / 20
                left_bin = 8
                right_bin = 0
            else:
                left_bin = int(orn / size_bin)
                right_bin = (int(orn / size_bin) + 1) % N_BUCKETS
                left_val= mag * (right_bin * 20 - orn) / 20
                right_val = mag * (orn - left_bin * 20) / 20
            
            assert left_val >= 0, "leftval = " + str(left_val) + ", " + str("mag = ") + str(mag) + " & orn = " + str(orn)
            assert right_val >= 0, "rightval = " + str(right_val) + ", " + str("mag = ") + str(mag) + " & orn = " + str(orn)

            # print(left_val)
            # print(right_val)

            bin_list[left_bin] += left_val
            bin_list[right_bin] += right_val

        elif implementation_type == "skimage":
            # easiest 
            '''
            this implementation mimics the one from skimage
            '''

            if 0 <= orn <= 10:
                bin_list[4] += mag
            elif 10 < orn <= 30:
                bin_list[3] += mag
            elif 30 < orn <= 50:
                bin_list[2] += mag
            elif 50 < orn <= 70:
                bin_list[1] += mag
            elif 70 < orn <= 90:
                bin_list[0] += mag
            elif 90 < orn <= 110:
                bin_list[8] += mag 
            elif 110 < orn <= 130:
                bin_list[7] += mag
            elif 130 < orn <= 150:
                bin_list[6] += mag
            elif 150 < orn <= 170:
                bin_list[5] += mag 
            elif 170 < orn <= 180:
                bin_list[4] += mag 
            else:
                raise RuntimeError("Impossible ! > " + str(orn))
        else:
            raise NotImplementedError



    def compute_hog_bins(self, y_start, x_start, cell_h, cell_w, show_src = True, show_results=True, figsize = (12,4)):
        '''
        y_start: y value of the top left pixel
        x_start: x value of the top left pixel
        cell_h : height of the cell in which HOG is computed
        cell_w : width of the cell in which HOG is computed
        '''
        cell_img = self.img[y_start:y_start + cell_h, x_start:x_start+cell_w]

        if show_src:
            tmp = self.img.copy()
            cv2.rectangle(tmp, (x_start-1, y_start-1), (x_start+cell_w+1, y_start+cell_h+1), (0,255,0))

            fig, ax = plt.subplots(1,1, figsize = (figsize[1],figsize[1]))
            ax.imshow(cv2.cvtColor(tmp, cv2.COLOR_BGR2RGB))
            ax.set_title("Selection of a cell")

            plt.show()

        # construction of the magnitude and orn matrices
        cell_mag, cell_orn, cell_orn_clipped = self.get_cells_mag_orn(y_start, x_start, cell_h, cell_w)

        number_of_bins = 9
        bin_list = np.zeros(number_of_bins)
        for i in range(cell_h):
            for j in range(cell_w):
                # m = round(mag_normalized[y_start+j,x_start+i].max())
                m = cell_mag[i,j]
                d = cell_orn_clipped[i,j]
                # print("m,d =" + str((m,d)))
                self.fill_bins_one_pixel(m,d,bin_list)
        
        # logging.debug("Bins computed:" + str(bin_list))
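        n = np.linalg.norm(bin_list)
        # guard against a perfectly flat cell (all gradients zero), which would divide by zero
        bin_norms = bin_list / n if n > 0 else bin_list.copy()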
        # logging.debug("Bins normalized:" + str(bin_norms))
             
        if show_results:
            fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=figsize, sharex=True, sharey=True) 

            # ax1 => just to visually represent the arrows
            for i in range(cell_h):
                for j in range(cell_w):

                    radius = cell_mag[i,j] / (cell_mag.max() - cell_mag.min())
                    angle_ = cell_orn[i,j]
                    
                    orn_value_clipped = cell_orn_clipped[i,j]

                    mag_value = round(cell_mag[i,j])
                    
                    ax1.arrow(i, j, radius*np.cos(np.deg2rad(angle_)), radius*np.sin(np.deg2rad(angle_)), head_width=0.15, head_length=0.15, fc='b', ec='b')
                    ax2.text(i, j, str(orn_value_clipped.astype(np.int64)), fontsize=10,va='center', ha='center')
                    ax3.text(i, j, str(mag_value.astype(np.int64)), fontsize=10,va='center', ha='center')

            ax1.imshow(cv2.cvtColor(cell_img, cv2.COLOR_BGR2RGB))
            ax1.set_title('Input image') 

            ax2.matshow(cell_orn_clipped, alpha=0)
            ax2.set_title('Orientation values')

            ax3.set_title('Magnitude values')
            intersection_matrix = np.ones(cell_mag.shape)
            ax3.matshow(cell_mag, alpha = 0)
           
            plt.show()
        return bin_list, bin_norms

### HOG How-to, Step1: Preprocessing the image

Usually, the size of an image is not appropriate to perform the HOG computation. The easiest thing is just to resize the image to an appropriate size. In this tutorial, we use a square image whose side is a multiple of 8. Considering the smallest face cropped, we select 64 x 64 pixels.

Furthermore, as often in image processing, the larger the image, the more resource-consuming it is. Keeping a reasonably small image helps in having a decent computation time.


In [0]:
logging.info("Toy Example")

img = faces_cropped[personA][0].copy()
resized_img = cv2.resize(img, (64,64))
logging.info("Shape of source  face: " + str(img.shape))
logging.info("Shape of resized face: " + str(resized_img.shape))


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=False, sharey=False) 

ax1.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
ax1.set_title('Input image') 

ax2.imshow(cv2.cvtColor(resized_img, cv2.COLOR_BGR2RGB))
ax2.set_title('Resized image') 
plt.show()

### HOG How-to, Step2: Computing the gradient for all pixels

Computing the gradient in horizontal ($x$) and vertical ($y$) directions can be done using a pass of the Sobel Filter, part of the *openCV* library. 
This is implemented in the `MyHog.compute_gradients()` function (see implementation of `MyHog` class, above). From these gradients in $x$ and $y$ we can derive the magnitudes and orientations in all pixels using the formulas:
> $
\begin{align} 
mag &= \sqrt{g_x^2 + g_y^2} \\ 
orn &= \operatorname{atan2}(g_y, g_x) 
\end{align}
$

This is implemented using *openCV* library with `cartToPolar`.
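
To make the magnitude and orientation computation concrete, here is a small `numpy`-only sketch on a toy $5 \times 5$ image with a vertical edge. It assumes centered differences with the kernel $[-1, 0, 1]$ as a stand-in for the Sobel pass, and leaves the border pixels at zero; it is an illustration, not the notebook's *openCV* call.

```python
import numpy as np

# 5x5 grayscale toy image: dark on the left, bright on the right (vertical edge).
img = np.zeros((5, 5), dtype=np.float64)
img[:, 3:] = 100.0

# Centered differences [-1, 0, 1]; border pixels are left at zero for simplicity.
gx = np.zeros_like(img)
gy = np.zeros_like(img)
gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
gy[1:-1, :] = img[2:, :] - img[:-2, :]

mag = np.sqrt(gx ** 2 + gy ** 2)
# arctan2 keeps track of the quadrant and copes with g_x == 0; angles in [0, 360).
orn = np.rad2deg(np.arctan2(gy, gx)) % 360.0

print(mag[2])   # non-zero only in the two columns around the edge
print(orn[2])   # horizontal gradient everywhere: 0 degrees
```

The magnitude is zero in the flat regions and equals 100 in the two columns around the edge, which is exactly the "edge information" the HOG relies on.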

##### *Grayscale or Colored image*
> For a grayscale image, every pixel has one value, so this computation is straightforward. For a colored image, a pixel has three values (one for Red, one for Green, one for Blue). In the HOG algorithm, the gradients are computed for all three channels; the final magnitude is the maximum of the three, and the orientation is the one of the channel with that maximum magnitude.
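
The per-pixel channel selection can also be written without explicit loops. The sketch below is a vectorized variant of that idea, assuming `mag` and `orn` are arrays of shape `(H, W, C)`; the helper name `strongest_channel_gradient` is hypothetical.

```python
import numpy as np

def strongest_channel_gradient(mag, orn):
    """Hypothetical helper: for each pixel, keep the magnitude and orientation
    of the channel whose gradient magnitude is the largest.
    mag, orn: arrays of shape (H, W, C). Returns two (H, W) arrays."""
    idx = np.argmax(mag, axis=2)[..., np.newaxis]            # shape (H, W, 1)
    mag_max = np.take_along_axis(mag, idx, axis=2)[..., 0]
    orn_max = np.take_along_axis(orn, idx, axis=2)[..., 0]
    return mag_max, orn_max

# Toy data: 1 row, 2 pixels, 3 channels.
mag = np.array([[[1.0, 5.0, 2.0], [4.0, 3.0, 0.0]]])
orn = np.array([[[10.0, 90.0, 45.0], [180.0, 20.0, 0.0]]])
mag_max, orn_max = strongest_channel_gradient(mag, orn)
print(mag_max)   # largest magnitude per pixel
print(orn_max)   # orientation of the corresponding channel
```

For the first pixel the second channel wins (magnitude 5, orientation 90), for the second pixel the first channel wins (magnitude 4, orientation 180).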


In [0]:
'''
Create an object MyHog, which takes a resized image as input, and compute its 
gradients.
'''
myhog = MyHog(resized_img)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True) 

ax1.imshow(myhog.mag_max, cmap = plt.cm.jet)
ax1.set_title('Max of magnitude') 

ax2.imshow(myhog.orn_max, cmap = plt.cm.jet)
ax2.set_title('Max of Orientation') 
plt.show()

As visible on the magnitude and orientation plots above, only essential information regarding the edges is kept. 

### HOG How-to, Step3: Compute histograms for cells

The histograms are first computed for small cells. This has two major benefits:
1. the representation is more compact: suppose we take cells of $8 \times 8$ pixels. The gradient of each pixel is described using 2 numbers (magnitude and orientation), leading to 128 numbers per cell. A histogram computed on such a cell represents those 128 numbers by a tiny array of typically 9 numbers. In total, a colored image of $64 \times 64$ pixels is represented by a vector of $9 \times 8 \times 8 = 576$ values.
2. the representation is less sensitive to noise, as aggregating over a cell acts like a low-pass filter. Higher-frequency outliers therefore matter less.
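
The compactness argument is plain arithmetic, spelled out here for the sizes used in this tutorial (one histogram per cell, before any grouping of cells into blocks):

```python
img_h, img_w, channels = 64, 64, 3
cell_h, cell_w, n_bins = 8, 8, 9

raw_values = img_h * img_w * channels              # pixel values of the colored image
gradient_values_per_cell = cell_h * cell_w * 2     # magnitude + orientation per pixel
n_cells = (img_h // cell_h) * (img_w // cell_w)
hog_values = n_cells * n_bins                      # one 9-bin histogram per cell

print(raw_values, gradient_values_per_cell, n_cells, hog_values)
# 12288 128 64 576
```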


The choice of the cell size is a design choice that can be modified. In a later section, we will modify this parameter to see how it can affect the results. 
In the paper that first presented the technique, a cell of $8 \times 8$ pixels was used - we will continue with this (hyper-)parameter. 


> *You may try yourself to modify the cell size or the x_start or y_start values, to see the influence on the histogram computed for that particular cell*





In [0]:
'''
Size of a cell
'''
cell_h = 8
cell_w = 8

'''
starting point of the cell on the image
'''
y_start = 3 * cell_h - 1
x_start = 2 * cell_w - 1

hog_val, hog_val_normalized = myhog.compute_hog_bins(y_start, x_start, cell_h, cell_w, show_results=True, figsize=(16,5))

fig, ax = plt.subplots(1,1, figsize = (2*5, 5))
ax.bar(["]70-90]","]50-70]", "]30-50]", "]10-30]","]170-\n10]","]150-\n170]","]130-\n150]","]110-\n130]", "]90-\n110]"], hog_val_normalized)
ax.set_title("Histogram computed with MyHog (homemade)")
plt.show()
logging.info("MyHog bins   normalized  : {}, {}, {}, {}, {}, {}, {}, {}, {}".format(*np.round(hog_val_normalized,2)))


**Legend of the above images**
1. source image with *cell* visible in flashy green,
2. details of the gradients computations
    - *leftmost*: cell enlarged with an arrow indicating the gradient: the length of the arrow represents the magnitude, and its direction is the gradient orientation computed on that pixel
    - *middle*: matrix (shape == cell) containing the orientations computed (unsigned)
    - *rightmost*: matrix (shape == cell) containing the magnitudes computed
3. histogram computed (`implementation_type = "skimage"`)


#### In detail

Building the histogram for a cell is not complicated:
- consider N bins. N is a design parameter, and each of the bins represents a range of gradient orientations. I have chosen N = 9, following the [introducing paper](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf), as it gives a granularity fine enough to observe changes in the picture.
> The HOG is usually applied using **unsigned gradients**. The numbers in the orientation matrix are between 0 and 180 instead of 0 and 360 degrees. Concretely, an angle $\alpha [deg] $ and $(180 + \alpha) [deg]$ contribute to the same bin. Empirically, this has been observed not to decrease detection performance. Of course, nothing forbids using signed gradients. 
- for a particular pixel:
    - the bin is selected according to the orientation of the gradient; 
    - the value that goes in the bin is based on the magnitude. Different methods were proposed by the [introducing paper](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf), and are also explained in detail, for instance, in the [Analytics Vidhya blog](https://www.analyticsvidhya.com/blog/2019/09/feature-engineering-images-introduction-hog-feature-descriptor/). The class `MyHog` above contains two implementations: either the magnitude is split proportionally between two bins (as described in [learnopencv](https://www.learnopencv.com/histogram-of-oriented-gradients/)), or, as the `"skimage"` branch of `MyHog` does to mimic the *skimage* library, the whole magnitude is assigned to the closest bin. 
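
The two voting schemes can be compared on a handful of gradients. The sketch below assumes bin centers at 0, 20, ..., 160 degrees (the convention of the learnopencv article); the bin order therefore differs from the index order used inside `MyHog`, and the names `vote_nearest` and `vote_split` are hypothetical.

```python
import numpy as np

N_BINS = 9
BIN_WIDTH = 180.0 / N_BINS   # 20 degrees per bin

def vote_nearest(mag, orn, bins):
    """Hypothetical helper: the whole magnitude goes to the closest bin center."""
    idx = int(round((orn % 180.0) / BIN_WIDTH)) % N_BINS
    bins[idx] += mag

def vote_split(mag, orn, bins):
    """Hypothetical helper: the magnitude is shared linearly between the two
    neighbouring bin centers."""
    pos = (orn % 180.0) / BIN_WIDTH          # fractional bin position in [0, 9)
    left = int(np.floor(pos)) % N_BINS
    right = (left + 1) % N_BINS              # the last bin wraps around to bin 0
    frac = pos - np.floor(pos)
    bins[left] += mag * (1.0 - frac)
    bins[right] += mag * frac

nearest = np.zeros(N_BINS)
split = np.zeros(N_BINS)
# (magnitude, orientation) pairs; 205 degrees is the same unsigned direction as 25 degrees.
for mag, orn in [(10.0, 25.0), (4.0, 165.0), (2.0, 205.0)]:
    vote_nearest(mag, orn, nearest)
    vote_split(mag, orn, split)

print(nearest)   # everything lands in bin 1 (20 deg) and bin 8 (160 deg)
print(split)     # the same total magnitude, spread over neighbouring bins
```

Both schemes distribute the same total magnitude (16 here); the split variant simply avoids abrupt jumps when an orientation crosses a bin boundary.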

While this step is not difficult, let's visualize how it's done!


#### Creation of dedicated images

The following function creates images "on demand", in order to better understand the histogram computation. The parameter `special` indicates what type of image is required.

In [0]:
'''
CREATE_IMAGE returns an image created based on a keyword. 
'''
def create_image(height, width, special=None):

    mat0 = np.ones((height, width), dtype=np.uint8)*255
    mat1 = np.ones((height, width), dtype=np.uint8)*255
    mat2 = np.ones((height, width), dtype=np.uint8)*255

    if special == "center_black":
        mat0[height//2-1:height//2+1, width//2-1 : width//2+1 ] = 0
        mat1[height//2-1:height//2+1, width//2-1 : width//2+1 ] = 0
        mat2[height//2-1:height//2+1, width//2-1 : width//2+1 ] = 0
    elif special == "center_gray":
        mat0[height//2-3:height//2+3, width//2-3 : width//2+3 ] = 125
        mat1[height//2-3:height//2+3, width//2-3 : width//2+3 ] = 125
        mat2[height//2-3:height//2+3, width//2-3 : width//2+3 ] = 125
        mat0[height//2-1:height//2+1, width//2-1 : width//2+1 ] = 0
        mat1[height//2-1:height//2+1, width//2-1 : width//2+1 ] = 0
        mat2[height//2-1:height//2+1, width//2-1 : width//2+1 ] = 0
    elif special == "90":
        mat0[0:height, width//2 : width ] = 0
        mat1[0:height, width//2 : width ] = 0
        mat2[0:height, width//2 : width ] = 0
    elif special == "180":
        mat0[height//2:height, 0 : width ] = 0
        mat1[height//2:height, 0 : width ] = 0
        mat2[height//2:height, 0 : width ] = 0
    elif special == "135":
        for i in range(height):
            for j in range(i,width):
                mat0[i,j] = mat1[i,j] = mat2[i,j] = 0
    elif special == "45":
        for i in range(height):
            for j in range(width-i-1,width):
                mat0[i,j] = mat1[i,j] = mat2[i,j] = 0
    elif special == "28_34_37":
        mat0[4, 6:8] = 200
        mat0[5, 4:8] = 150
        mat0[6, 2:8] = 100
        mat0[7, 0:8] = 50
        mat1 = mat0.copy()
        mat2 = mat0.copy()
    elif special == "up_01":
        mat0[4, 4:8] = 250
        mat0[5, 0:8] = 50
        mat0[6, 0:8] = 50
        mat0[7, 0:8] = 50
        mat1 = mat0.copy()
        mat2 = mat0.copy()
    elif special == "up_10":
        mat0[4, 4:8] = 220
        mat0[5, 0:8] = 50
        mat0[6, 0:8] = 50
        mat0[7, 0:8] = 50
        mat1 = mat0.copy()
        mat2 = mat0.copy()
    elif special == "up_11":
        mat0[4, 4:8] = 225
        mat0[5, 0:8] = 50
        mat0[6, 0:8] = 50
        mat0[7, 0:8] = 50
        mat1 = mat0.copy()
        mat2 = mat0.copy()
    elif special == "up_15":
        mat0[4, 4:8] = 200
        mat0[5, 0:8] = 50
        mat0[6, 0:8] = 50
        mat0[7, 0:8] = 50
        mat1 = mat0.copy()
        mat2 = mat0.copy()
    elif special == "up_27":
        mat0[4, 4:8] = 150
        mat0[5, 0:8] = 50
        mat0[6, 0:8] = 50
        mat0[7, 0:8] = 50
        mat1 = mat0.copy()
        mat2 = mat0.copy()
    elif special == "up_152":
        mat0 = np.ones((height, width), dtype=np.uint8)*0
        mat1 = np.ones((height, width), dtype=np.uint8)*0
        mat2 = np.ones((height, width), dtype=np.uint8)*0
        mat0[0,6:7] = mat1[0,6:7] = mat2[0,6:7] = 255
        mat0[1,7] = mat1[1,7] = mat2[1,7] = 135
    elif special == "down_153":
        mat0 = np.ones((height, width), dtype=np.uint8)*255
        mat1 = np.ones((height, width), dtype=np.uint8)*255
        mat2 = np.ones((height, width), dtype=np.uint8)*255
        mat0[0,6:7] = mat1[0,6:7] = mat2[0,6:7] = 0
        mat0[1,7] = mat1[1,7] = mat2[1,7] = 125
    elif special == "up_141":
        mat0 = np.ones((height, width), dtype=np.uint8)*0
        mat1 = np.ones((height, width), dtype=np.uint8)*0
        mat2 = np.ones((height, width), dtype=np.uint8)*0
        mat0[0,6:7] = mat1[0,6:7] = mat2[0,6:7] = 255
        mat0[1,7] = mat1[1,7] = mat2[1,7] = 210
    elif special == "up_111":
        mat0 = np.ones((height, width), dtype=np.uint8)*0
        mat1 = np.ones((height, width), dtype=np.uint8)*0
        mat2 = np.ones((height, width), dtype=np.uint8)*0
        mat0[0,0:7] = mat1[0,0:7] = mat2[0,0:7] = 100
        mat0[1,7] = mat1[1,7] = mat2[1,7] = 255
    elif special == "personA":
        mat0 = np.ones((height, width), dtype=np.uint8)*122
        mat1 = np.ones((height, width), dtype=np.uint8)*122
        mat2 = np.ones((height, width), dtype=np.uint8)*122
        for i in range(24):
            for j in range(40, width):
                    if j >= (i+40):
                        mat0[i,j] = mat1[i,j] = mat2[i,j] = 10
    elif special == "personB":
        mat0 = np.ones((height, width), dtype=np.uint8)*122
        mat1 = np.ones((height, width), dtype=np.uint8)*122
        mat2 = np.ones((height, width), dtype=np.uint8)*122
        mat0[:,50:width] = 10
        mat1[:,50:width] = 10
        mat2[:,50:width] = 10
    image = np.dstack((mat0, mat1, mat2))
    return image

#### Examples of histograms computed on created images
In order to understand how the bins are filled, let's look at a few simple images. Those images are $8 \times 8$, meaning 1 cell == 1 image:

* pure 90° gradient
* pure 180° gradient
* diagonal: 45°
* diagonal: 135°

For each case, we plot:
1. the arrows representing the gradients, 
2. the matrices of the magnitude and orientation values, 
3. the resulting histogram

**and** we log:

- the raw values of the histogram bins
- the normalized values of the histogram bins, using *L2-Normalization*:
$\begin{align}
v = bins\_values &= [x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9] \\
\|v\| &= \sqrt{x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 + x_6^2 + x_7^2 + x_8^2 + x_9^2 } \\
bins\_values_{normalized} &=  \frac{v}{\|v\|} \\ 
&= \Bigg[ \frac{x_1}{\|v\|}, \frac{x_2}{\|v\|}, \frac{x_3}{\|v\|},  \frac{x_4}{\|v\|}, \frac{x_5}{\|v\|}, \frac{x_6}{\|v\|},  \frac{x_7}{\|v\|}, \frac{x_8}{\|v\|}, \frac{x_9}{\|v\|} \Bigg]
\end{align}$
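
As a quick numeric check of the L2-Normalization, consider a made-up histogram with only two non-zero bins (the values 3 and 4 are chosen so that the norm is exactly 5):

```python
import numpy as np

# Illustrative bin values, not taken from a real image.
bins_values = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

norm = np.sqrt(np.sum(bins_values ** 2))     # same value as np.linalg.norm(bins_values)
bins_normalized = bins_values / norm

print(norm)                  # 5.0
print(bins_normalized[:2])   # [0.6 0.8]
```

After normalization, the vector has unit length while the proportions between bins are preserved.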



#### Validation of intuition
To convince ourselves that our implementation and understanding are correct, we will compare the results with the well-known library `skimage`, using `skimage.feature.hog` with the same parameters as the handmade function: 9 bins, an $8\times 8$ cell, and 1 cell per *block* (we discuss the *blocks* in the next section). Two parameters are still unexplained: `transform_sqrt` and `multichannel`.
- `transform_sqrt`: if True, the square root operator is applied to the whole image first. It can give better results. We can safely leave it to False for the purpose of this exercise on the HOG bins...
- `multichannel`: simply indicates if the image is grayscale (`multichannel = False`) or in color (`multichannel = True`) 
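
To get a feeling for what the square-root transform does, the following toy computation assumes it amounts to an element-wise square root of the pixel values before the gradients are taken. It compares two edges with the same relative contrast, one in shadow and one in bright light (the pixel values are made up for the illustration).

```python
import numpy as np

# Two edges with the same relative contrast (x4): one in shadow, one in bright light.
shadow_edge = np.array([4.0, 16.0])
bright_edge = np.array([64.0, 256.0])

# Ratio between the two gradient magnitudes, without and with the square root.
raw_ratio = (bright_edge[1] - bright_edge[0]) / (shadow_edge[1] - shadow_edge[0])
sqrt_ratio = (np.sqrt(bright_edge[1]) - np.sqrt(bright_edge[0])) / (
    np.sqrt(shadow_edge[1]) - np.sqrt(shadow_edge[0]))

print(raw_ratio, sqrt_ratio)  # 16.0 4.0
```

The bright edge dominates the shadowed one by a factor 16 on the raw values, but only by a factor 4 after the square root: the compression reduces the influence of the overall brightness on the gradient magnitudes.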

<!--Note: we briefly discussed the `block_norm` parameter, but more to come in the next step.-->

#### Finally...
Coming back to the very first example of the HOG, we saw the Emma Stone picture with weird white-ish pictograms describing her face... Well, thanks to the `skimage` library, it's very easy to get this image, and we show it for the toy example we are studying now. This gives a full overview of how the complete histogram and its visualization are eventually computed.

> *Of course, don't hesitate to change the list of images that are analyzed yourself. The accepted keywords are given in the `special` list.*


In [0]:
'''
Creation of the images
'''
special = ["up_01","45","90", "135","180","up_10","up_11","up_15", "up_27", "28_34_37", "up_111", "up_141", "up_152", "down_153","center_black", "center_gray"]
list_as_example = ["90", "180", "45", "135", "28_34_37"]

# created_img = create_image(cell_h,cell_w, "45")

'''
Homemade implementation of the histogram
and
Validation with an optimized library
'''
for keyword in list_as_example:
    logging.info("Considering image: " + str(keyword))

    # creation of the simple image
    created_img = create_image(cell_h, cell_w, keyword)

    # creation of MyHog object
    toyhog=MyHog(created_img)

    # compute bins using MyHog
    y_start_loc = 0
    x_start_loc = 0
    bins, bins_normalized = toyhog.compute_hog_bins(y_start_loc, x_start_loc, cell_h, cell_w, show_src=False, show_results=True, figsize = (14,4))

    # compute bins using Skimage 
    fd, hog_image = skimage_feature_hog(created_img, 
                orientations=9, 
                pixels_per_cell=(8,8), 
                cells_per_block=(1, 1), 
                block_norm = "L2",
                visualize=True, 
                transform_sqrt = False,
                multichannel=True)

    # plot results from Skimage
    plt.figure()
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4), sharex=False, sharey=False) 

    # Rescale hog for better display 
    hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10)) 


    xlabels = ["]70-90]","]50-70]", "]30-50]", "]10-30]","]170-\n10]","]150-\n170]","]130-\n150]","]110-\n130]", "]90-\n110]"]
    x = np.arange(9)
    w = 0.2
    ax1.bar( x-w, bins_normalized,  width=2*w, align='center',color="b")
    ax1.bar( x+w, fd, width=2*w, align='center',color="r")
   
    ax1.set_title("Histogram computed by Skimage.feature.hog")
    ax1.legend(["MyHog (homemade)", "Skimage.feature.hog"])
    # start, end = ax.get_xlim()
    # ax.xaxis.set_ticks(np.arange(start, end, 1))
    # ax1.set_xticklabels(xlabels)
    ax1.set_xticks(x)
    ax1.set_xticklabels(xlabels)

    ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray) 
    ax2.set_title('Histogram of Oriented Gradients - visual')
    plt.show()



    logging.info("MyHog bins     computed  : {}, {}, {}, {}, {}, {}, {}, {}, {}".format(*np.round(bins,2)))
    logging.info("MyHog bins   normalized  : {}, {}, {}, {}, {}, {}, {}, {}, {}".format(*np.round(bins_normalized,2)))
    logging.info("Skimage bins normalized  : {}, {}, {}, {}, {}, {}, {}, {}, {}".format(*np.round(fd,2)))
    logging.info("***"*30)





> Notice: as a reminder, the purpose of the `MyHog` class (or any of the other classes from this tutorial) is definitely not to reproduce exactly the behavior of a well-known and optimized library, but solely to break the magic of using a library function without understanding the algorithm behind it. As a result, the HOG computed may differ in several ways.

#### Coming back to our initial cell...
The Emma Stone cell defined above can now be shown in terms of HOG, with the help of the `skimage` library.



In [0]:
'''
Retrieving the cell defined above
'''
cell_img=resized_img[y_start:y_start + cell_h, x_start:x_start+cell_w]

'''
computing HOG of the cell using same parameters
'''
fd, hog_image = skimage_feature_hog(cell_img, 
                    orientations=9, 
                    pixels_per_cell=(8,8), 
                    cells_per_block=(1, 1), 
                    block_norm = "L2",
                    visualize=True, 
                    transform_sqrt = False,
                    multichannel=True)



plt.figure()

fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(12, 4), sharex=False, sharey=False) 
# hog_cropped = hog_image[y_start:y_start + cell_h, x_start:x_start+cell_w]
ax0.imshow(cv2.cvtColor(resized_img, cv2.COLOR_BGR2RGB))
ax0.set_title('Input image') 

ax1.imshow(cv2.cvtColor(cell_img, cv2.COLOR_BGR2RGB))
ax1.set_title('Extracted cell') 

# Rescale histogram for better display 
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10)) 
ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray) 
ax2.set_title('Histogram of Oriented Gradients - rescaled')
# print(hog_image_rescaled)
plt.show()

logging.info("Skimage bins normalized  : {}, {}, {}, {}, {}, {}, {}, {}, {}".format(*np.round(fd,2)))




This is the end of Step3: the computation of the histogram for one cell! A careful eye will have spotted the parameters `cells_per_block` and `block_norm` of the library method. They are linked to Step4: Block Normalization!


### HOG How-to, Step4: Block normalization

In Step3, we have extensively seen how to compute the histogram of gradients for a cell. We are almost at the end of the feature representation build-up, but there is still one normalization step. 
> In the previous step, we actually already normalized the histogram values using *L2-Normalization*. This is a simple case of Block normalization where there is 1 cell per block. In general, we can perform Block normalization on more than 1 cell per block. A common value is 4, as discussed in the [introducing paper](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). 

#### <u>Why do we need normalization?</u>
 
When normalized, a histogram becomes independent of lighting variations. 
Indeed, illumination has the effect of increasing/decreasing the values of the pixels in a cell. 

With normalization, a change in the pixel values has no impact if all the pixels in a cell are scaled by the same factor. 
> Let's say a low illumination divides the pixel values by two. A normalized histogram on the cell will not be affected by such a change, as in the end, the absolute values are not important: only the value of one pixel relative to the others matters. This is the very essence of the normalization.

As a result, normalizing the histogram makes it quite independent of the lighting condition (provided that in a cell, all the pixels have the same lighting condition, which seems a sensible assumption).
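
This invariance is easy to verify numerically. The sketch below builds a random $8 \times 8$ grayscale cell, halves its pixel values, and compares the histograms before and after L2 normalization. The helper `cell_histogram` is hypothetical and uses `np.gradient` with hard binning, not the notebook's `MyHog`.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Hypothetical helper: magnitude-weighted histogram of unsigned orientations."""
    gy, gx = np.gradient(cell.astype(np.float64))   # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    orn = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    idx = np.minimum((orn // (180.0 / n_bins)).astype(int), n_bins - 1)
    return np.bincount(idx.ravel(), weights=mag.ravel(), minlength=n_bins)

rng = np.random.default_rng(42)
cell = rng.integers(0, 256, size=(8, 8)).astype(np.float64)
dark_cell = cell / 2.0                              # same scene, half the illumination

h_bright = cell_histogram(cell)
h_dark = cell_histogram(dark_cell)

raw_differ = not np.allclose(h_bright, h_dark)
normalized_match = np.allclose(h_bright / np.linalg.norm(h_bright),
                               h_dark / np.linalg.norm(h_dark))
print(raw_differ, normalized_match)
```

The raw histograms differ (every bin is halved), while the normalized ones coincide. Note that this covers a multiplicative change only; a constant offset added to all pixels already disappears when the gradient is taken.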

#### <u>Normalizing by block</u> 
A nice visualization of the normalization by block of multiple cells is given in [learnopencv](https://www.learnopencv.com/wp-content/uploads/2016/12/hog-16x16-block-normalization.gif) where we see in green the different cells, and in blue a block of 4 cells. 
Block normalization, that is, normalizing multiple cells at once and sliding the normalization window across the image, is introduced in the [introducing paper](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf), which shows some benefits in terms of miss rate. Typically, 4 cells per block are used. In a later section (see Classification), a grid search tries out other block sizes.
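
The sliding-block idea can be sketched as follows, assuming the per-cell histograms are already available as an array of shape `(n_rows, n_cols, n_bins)` and that the block moves one cell at a time. The function name `normalize_blocks` is hypothetical, and random numbers stand in for real histograms.

```python
import numpy as np

def normalize_blocks(cell_hists, cells_per_block=2, eps=1e-12):
    """Hypothetical helper: slide a block of cells_per_block x cells_per_block cells
    over the grid (stride of one cell) and L2-normalize each block."""
    n_rows, n_cols, _ = cell_hists.shape
    blocks = []
    for r in range(n_rows - cells_per_block + 1):
        for c in range(n_cols - cells_per_block + 1):
            block = cell_hists[r:r + cells_per_block, c:c + cells_per_block].ravel()
            blocks.append(block / (np.linalg.norm(block) + eps))
    return np.concatenate(blocks)

rng = np.random.default_rng(1)
cell_hists = rng.random((8, 8, 9))     # stand-in for the 8 x 8 cells of a 64 x 64 image
descriptor = normalize_blocks(cell_hists)
print(descriptor.shape)  # (1764,)
```

With $8 \times 8$ cells and blocks of $2 \times 2$ cells, there are $7 \times 7 = 49$ block positions, each contributing $4 \times 9 = 36$ values, hence 1764 values in total. Because the blocks overlap, each interior cell appears in several blocks, each time normalized against different neighbours.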

####<u>What normalization ?</u>
Several normalization schemes can be considered: *L1*, *L2*, *L2-Hys*, ...
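
As a rough illustration of these schemes, here is a minimal sketch. The helper `normalize_block`, the `eps` value and the random block vector are assumptions made for the example; the clipping value of 0.2 for *L2-Hys* is the one commonly used in the literature.

```python
import numpy as np

def normalize_block(block, method="L2", eps=1e-5):
    """Normalize a block vector (hypothetical helper, for illustration only)."""
    v = np.asarray(block, dtype=float)
    if method == "L1":
        return v / (np.sum(np.abs(v)) + eps)
    if method == "L2":
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if method == "L2-Hys":
        # L2-normalize, clip the largest values, then L2-normalize again
        v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
        v = np.minimum(v, 0.2)
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    raise ValueError("unknown normalization method: " + method)

rng = np.random.default_rng(0)
block = rng.random(36)  # stand-in for 4 concatenated 9-bin histograms

for method in ("L1", "L2", "L2-Hys"):
    out = normalize_block(block, method)
    print(method, "sum =", round(float(out.sum()), 3),
          "L2 norm =", round(float(np.linalg.norm(out)), 3))
```

With *L1* the entries sum to (almost) 1, while with *L2* and *L2-Hys* the vector has (almost) unit length. The clipping in *L2-Hys* limits the influence of a few very strong gradients.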



In [0]:
'''
Definition of the block size (in number of px)
1 cell = 8 x 8
1 block => 16 x 16 => 4 cells
'''
block_w = 16
block_h = 16

x_cells = np.arange(0,64,8)
y_cells = np.arange(0,64,8)

# credit: https://stackoverflow.com/questions/44816682/drawing-grid-lines-across-the-image-uisng-opencv-python
def draw_grid(img, line_color=(0, 255, 0), thickness=1, type_=8, pxstep=8):
    '''(ndarray, 3-tuple, int, int) -> void
    draw gridlines on img
    line_color:
        BGR representation of colour
    thickness:
        line thickness
    type_:
        8, 4 or cv2.LINE_AA
    pxstep:
        grid line frequency in pixels
    '''
    x = pxstep
    y = pxstep
    while x < img.shape[1]:
        cv2.line(img, (x, 0), (x, img.shape[0]), color=line_color, lineType=type_, thickness=thickness)
        x += pxstep

    while y < img.shape[0]:
        cv2.line(img, (0, y), (img.shape[1], y), color=line_color, lineType=type_, thickness=thickness)
        y += pxstep

def draw_one_block(img, origin=(0,0), line_color=(255,0, 0), block_size = 16, thickness=1, type_=8):
    cv2.rectangle(img, origin, (origin[0]+block_size, origin[1]+block_size), line_color, thickness=thickness, lineType =type_)

def draw_all_blocks(img, thickness):
    color_list = [(255,0,0), (255,255,0), (255,0,255)]
    x = 0
    y = 0
    counter = 0
    while x < img.shape[1]-8:
        # cv2.line(img, (x, 0), (x, img.shape[0]), color=line_color, lineType=type_, thickness=thickness)
        # draw_one_block(img, (x,y))
        
        
        while y < img.shape[0]-8:
            # cv2.line(img, (0, y), (img.shape[1], y), color=line_color, lineType=type_, thickness=thickness)
            lc = color_list[counter%3]
            draw_one_block(img, (x,y),line_color=lc, thickness=thickness)
            counter += 1
            y += 8
        y=0
        x += 8
    return counter

'''
creating a green grid covering the resized image
'''

grid_cells_img = resized_img.copy()
draw_grid(grid_cells_img, type_=8)

'''
Creating the three first blocks
'''

first_block_img = grid_cells_img.copy()
second_block_img = grid_cells_img.copy()
third_block_img = grid_cells_img.copy()

draw_one_block(first_block_img, origin=(0,0), line_color=(255,0,0), thickness=2)
draw_one_block(second_block_img, origin=(8,0), line_color=(255,255,0),thickness=2)
draw_one_block(third_block_img, origin=(16,0), line_color=(255,0,255),thickness=2)

'''
Creating all the blocks on top of the cells
'''

blocks_img = grid_cells_img.copy()
counter = draw_all_blocks(blocks_img, thickness=1)

'''
Visualization
'''

fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize=(16, 4), sharex=False, sharey=False) 
ax0.imshow(cv2.cvtColor(grid_cells_img, cv2.COLOR_BGR2RGB))
ax0.set_title('Cells') 

ax1.imshow(cv2.cvtColor(first_block_img, cv2.COLOR_BGR2RGB))
ax1.set_title('first block') 

ax2.imshow(cv2.cvtColor(second_block_img, cv2.COLOR_BGR2RGB))
ax2.set_title('second block') 

ax3.imshow(cv2.cvtColor(third_block_img, cv2.COLOR_BGR2RGB))
ax3.set_title('third block') 

fig, ax = plt.subplots(1,1,figsize=(4,4))
ax.imshow(cv2.cvtColor(blocks_img, cv2.COLOR_BGR2RGB))
ax.set_title("All Blocks")

plt.show()

logging.info("In total, there are: " + str(counter) + " blocks possible in the picture")


As shown in the previous example, there are $49$ possible blocks of size $(16 \times 16)$ pixels on the chosen image.

###HOG How-to, Step5: concatenation

After the normalization, the only remaining step is the concatenation of the computed vectors into a larger one that represents the input image. This will be the feature representation of the image, based on the *oriented* gradients in that image.

###What size is this feature representation ?
- one cell is represented by $9$ numbers (histogram)
- four histograms are normalized together, leading to a $(36,1)$ vector
- there are $49$ such vectors representing the image.
    If the image has a width of $(w*8)$ pixels and a height of $(h*8)$ pixels, the image dimension is $(8*h \times  8*w)$. In such an image, there are:
    *   h cells vertically and w cells horizontally,
    *  (h-1) blocks vertically and (w-1) blocks horizontally.

The length of the final vector is then $36 \cdot 49$ numbers, or a $(1764,1)$ vector.

Of course, this is still large... But much more compact than our initial $(64,64,3)$ array of $12288$ numbers.
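
The arithmetic above can be wrapped in a small function. This is a minimal sketch: `hog_feature_length` is a hypothetical helper, and it assumes blocks slide by one cell at a time, as in the block drawing shown earlier.

```python
def hog_feature_length(img_h, img_w, cell=8, cells_per_block=2, orientations=9):
    """Length of the HOG vector when blocks slide one cell at a time."""
    n_cells_y = img_h // cell
    n_cells_x = img_w // cell
    n_blocks_y = n_cells_y - cells_per_block + 1
    n_blocks_x = n_cells_x - cells_per_block + 1
    values_per_block = cells_per_block * cells_per_block * orientations
    return n_blocks_y * n_blocks_x * values_per_block

print(hog_feature_length(64, 64))  # 7 * 7 blocks of 36 values
```

Changing `cell` or `cells_per_block` shows how quickly the descriptor length varies with those parameters.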



In [0]:
fd, hog_image = skimage_feature_hog(resized_img, 
                    orientations=9, 
                    pixels_per_cell=(8,8), 
                    cells_per_block=(2, 2), 
                    block_norm = "L2",
                    visualize=True, 
                    transform_sqrt = True,
                    multichannel=True)

'''
Plotting results
'''

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8), sharex=True, sharey=True) 
ax1.imshow(cv2.cvtColor(resized_img, cv2.COLOR_BGR2RGB))
ax1.set_title('Input image') 

# Rescale histogram for better display 
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10)) 
ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray) 
ax2.set_title('Histogram of Oriented Gradients - rescaled')
logging.debug("HOG Rescaled: " + str(hog_image_rescaled.min()) + " -> " + str(hog_image_rescaled.max()) )

plt.show()

# logging.debug(hog_image_rescaled.shape)
logging.info("Shape of the HOG feature : " + str(fd.shape))



<!--histogram construction is based on the gradient computed - both magnitude and orientation (as defined)

- the bin is selected according to the orientation (direction) of the gradient; 
- the value that goes in the bin is based on the magnitude

For instance on the toy example:
first pixel has:
* mag = 6; orn = 45°. So, the vote of this pixel goes for 75% in the bin of 40°, and 25% in the bin of 60°, as closer to 40°. As a result, we add 4.5 to bin nb 3, and 1.5 to bin nb 4. -->

###HOG: Exec summary

* one cell (typ 8 x 8) of an image is represented by a histogram 
    * the orientation and magnitude of the gradient are computed on each pixel
    * the orientation of the gradient indicates a bin
    * the magnitude indicates the amount to place into the bin
* the histogram is a vector of size 9 (as 9 bins)
* one block (16 x 16) histogram is the concatenation of the 4 histograms, each representing one cell of the block, hence represented by a (36 x 1) vector.
* the final HOG feature vector is based on the concatenation of all blocks.


For an image of 64 x 64, it follows:
* 8 cells along the width
* 8 cells along the height
* Number of cells = 64 cells
* Number of blocks = 7 * 7 = 49 blocks

Each block has a representative vector of dimension $(36 \times 1)$, and the resulting vector has dimension $(49*36 \times 1)$, or $(1764,)$


This ends the first part on the Histogram of Oriented Gradients, where I showed in detail how to compute such a feature representation, and the meaning of the different parameters.

Like many other parameters, the "hyper-parameters" of a HOG method should always be assessed against the specific problem at hand. 
- So far, the parameters have mainly been set to the values suggested in the literature, in this [HOG paper](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf)
- Later in this tutorial, we will optimize a classifier using a systematic method and a cross-validation set - stay tuned! 

Recall that we must **never** use the test set to fine-tune our model parameters.

<!--Still to do : 
 These [hog] parameters were obtained by experimentation and examining the accuracy of the classifier — you should expect to do this as well whenever you use the HOG descriptor. Running experiments and tuning the HOG parameters based on these parameters is a critical component in obtaining an accurate classifier.-->

### Detecting an object of interest in a new image

In this part, the goal is to use the HOG feature descriptor in order to detect if an object is present or not in a new image.

> we are not building a classifier or identifier (yet!). The goal is *just* to detect which regions of an image have structures that correspond well to those of the feature descriptor.


To do so, the steps are:
1. Select an object
2. Compute its HOG feature description
3. Select a new image
4. Pre-process this image in terms of dimensions
5. Compute the image HOG at every location
6. Assess the matching between the descriptor and the image


####Select an object and compute its hog

As we have built a training set and a test set, let's just randomly pick one image from the training set. We can do even better and compute the HOG for all images in the training and test sets using the parameters already seen. We store the results in the dictionaries `hog_training` and `hog_test`. As *usual*, the keys are the persons' names.

Later, we can select any of those as the `image_of_interest`.


In [0]:
hog_training = {}
hog_test = {}

for person in persons:
    hog_training[person] = []
    for src_img in training_set[person]:
        if not color:
            src_img = cv2.cvtColor(src_img, cv2.COLOR_BGR2GRAY)
        resized_img = cv2.resize(src_img, (sq_size,sq_size))
        fd, hog_image = skimage_feature_hog(resized_img, 
                                            orientations=9, 
                                            pixels_per_cell=(8,8), 
                                            cells_per_block=(2,2), 
                                            block_norm = "L2",
                                            visualize=True, 
                                            transform_sqrt = True,
                                            multichannel=color)
        hog_training[person].append((fd, hog_image, resized_img))

logging.debug(pretty_return_dict_size(hog_training))
logging.info("hog_training dictionary contains the HOG descriptors (resized) for all faces of the training set")

for person in persons:
    hog_test[person] = []
    for src_img in test_set[person]:
        if not color:
            src_img = cv2.cvtColor(src_img, cv2.COLOR_BGR2GRAY)
        resized_img = cv2.resize(src_img, (sq_size,sq_size))
        fd, hog_image = skimage_feature_hog(resized_img, 
                                            orientations=9, 
                                            pixels_per_cell=(8,8), 
                                            cells_per_block=(2,2), 
                                            block_norm = "L2",
                                            visualize=True, 
                                            transform_sqrt = True,
                                            multichannel=color)
        hog_test[person].append((fd, hog_image, resized_img))

logging.debug(pretty_return_dict_size(hog_test))
logging.info("hog_test dictionary contains the HOG descriptors (resized) for all faces of the test set")

In [0]:
'''
Let's pick an image of interest, by its index [0 -> 19]
'''

idx_of_interest = 2
image_of_interest = training_set[personA][idx_of_interest]
hog_of_interest = hog_training[personA][idx_of_interest]

'''
Visualization of the image and its hog selected as image_of_interest
'''
fig, (ax0, ax1) = plt.subplots(1,2,figsize = (8,4), sharex=False, sharey=False)

ax0.imshow(cv2.cvtColor(image_of_interest, cv2.COLOR_BGR2RGB))
ax0.set_title("Image of interest from training set")

ax1.imshow(hog_of_interest[1])
ax1.set_title("Visualization of the HOG of interest")

logging.info("Shape of the descriptor       : " + str(hog_of_interest[0].shape))
logging.info("Shape of the descriptor (visu): " + str(hog_of_interest[1].shape))
logging.info("Shape of the image of interest: " + str(image_of_interest.shape))


####Select a new image
This is the image on which we want to apply the descriptor matching.
In other words, we want to verify, using a distance metric such as the Euclidean distance, whether the HOG descriptor of the *image_of_interest* is similar to a region of a new image.

Let's just pick a certain number of images from our **original** downloaded images, the ones that are neither cropped nor resized yet. 


In [0]:
person_ = personA # can be {personA, personB, personC, personD}
number_of_candidates = 3
random.seed("7/4/2020")
images_candidates = random.sample(images[person_], number_of_candidates)

fig = plt.figure(figsize=(12,4))
i=0

for img in images_candidates:
    ax=fig.add_subplot(1,number_of_candidates, i+1)
    ax.imshow( cv2.cvtColor(img, cv2.COLOR_BGR2RGB) )
    i+=1
plt.show()


The different images selected don't have the same size.

In [0]:
logging.info("Shapes of image on which to look for the image_of_interest: ")
for img in images_candidates:
    logging.info("Image shape: " + str(img.shape))

####Matching
In this larger step, we want to see if there is a match between the descriptor of interest and the image candidate. 

Several problems arise:
- The descriptor of interest has a fixed (designed) size of $(1764,1)$
- The different image candidates have different sizes,
- The image candidates contain more information than just a face

*Why not just crop the faces in the image and resize them?*

We could of course do that - but that is not really the purpose :-) Rather, we want to assess, at **every location** of the image candidate, whether there is a chance to see a pattern such as the one captured by the descriptor of the image of interest.

*Simpler case*
Let's first assume that if the image candidate contains a face (=object), this face has approximately the same size as the original face of interest.

One way to proceed is to slide, across the image candidate, a window of the size of the object to detect. 
- at *each* location, crop the part of the image candidate in the sliding window
> evaluating every single location is cumbersome and time consuming. The parameter `win_stride` defines the number of pixels skipped in each direction during the sliding.
- compute the HOG of this crop
- compare (using Euclidean distance for instance) this descriptor with the descriptor of the object to find
- go to the next location

At the end, the result is a score attributed to every location, indicating the correspondence between the object to detect and the area in the image. 
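
To get a feeling for the cost of the sliding, the sketch below counts the number of window positions for a few strides. The helper `count_windows` and the image and face sizes are hypothetical; the ranges assume a window centered on each position that must fit entirely inside the image.

```python
def count_windows(image_shape, window_shape, stride):
    """Number of window centers visited by a centered sliding window
    that must fit entirely inside the image."""
    img_h, img_w = image_shape[:2]
    win_h, win_w = window_shape[:2]
    rows = range(win_h // 2, img_h - win_h // 2 + 1, stride[0])
    cols = range(win_w // 2, img_w - win_w // 2 + 1, stride[1])
    return len(rows) * len(cols)

# hypothetical 480 x 640 image candidate and 100 x 100 face of interest
for stride in [(1, 1), (8, 8), (16, 16)]:
    print("stride", stride, "->", count_windows((480, 640), (100, 100), stride), "HOG computations")
```

Each position requires one crop, one resize and one HOG computation, so a larger stride reduces the work roughly by the product of the two stride values, at the price of a coarser score map.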

 

*Helper functions*
- `get_local_crop`: performs the cropping before computing the HOG, ensuring the appropriate size to compute the HOG
- `match_hog`: homemade matching between an object of interest and an image candidate, implementing this "sliding" across the image
- `fill_matrix_min_neighbours`: compensates for the effect of the `win_stride` parameter in the plot

In [0]:
def get_local_crop(src_img, center_pixel, crop_shape, show = False):
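    '''
    Crop a region of src_img centered on a given pixel.
    src_img: image (2D grayscale or 3D color array) to crop from
    center_pixel: (row, column) of the center of the crop
    crop_shape: (height, width[, channels]) of the crop; only the first two values are used
    '''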
    crop_height =  crop_shape[0]
    crop_width = crop_shape[1]
    top_= center_pixel[0] - crop_height // 2
    bottom_ = center_pixel[0] + crop_height // 2
    left_ = center_pixel[1] - crop_width // 2  
    right_ = center_pixel[1] + crop_width // 2
    if len(src_img.shape)> 2:
        crop = src_img[ top_ : bottom_, left_:right_, :]
    else:
        crop = src_img[ top_ : bottom_, left_:right_]

    if show:
        cv2_imshow(crop)
    return crop

def match_hog(src_img, hog_desc, original_face_shape, win_stride = (8,8), show = False):
    '''
    src_img : image to analyse
    hog_desc: (fd, hog_image) of the corresponding faces_cropped
    original_face_shape: shape of the face used for hog_desc computation
    '''
    height = src_img.shape[0] # height of the image to analyze
    width  = src_img.shape[1] # width of the image to analyze
    
    height_face = original_face_shape[0]
    width_face = original_face_shape[1]

    res_shape = (src_img.shape[0], src_img.shape[1])
    res = np.ones(res_shape)*-1

    if show:
        tmp_image = src_img.copy()
        cv2.rectangle(tmp_image, (width_face//2, height_face//2), (width - width_face//2, height - height_face//2), (0, 255, 0))
        cv2_imshow(tmp_image)

    running_h_idx = range(height_face //2, height - height_face//2+1, win_stride[0])
    running_w_idx = range(width_face //2, width - width_face//2+1, win_stride[1])

    for h_idx in running_h_idx:
        for w_idx in running_w_idx:
            local_crop = get_local_crop(src_img, (h_idx, w_idx), original_face_shape, False)
            local_resized_img = cv2.resize(local_crop, (64,64))
            local_fd = skimage_feature_hog(local_resized_img, 
                                            orientations=9, 
                                            pixels_per_cell=(8,8), 
                                            cells_per_block=(2, 2), 
                                            block_norm = "L2",
                                            visualize=False, 
                                            transform_sqrt = True,
                                            multichannel=True)

            # computing euclidean distance here !!            
            res[h_idx,w_idx]= np.linalg.norm(local_fd-hog_desc[0])
            if show:
                cv2_imshow(local_resized_img)

    return res

def fill_matrix_min_neighbours(matrx, win_stride = (16,16), margin = 0):
    '''
    fill the gaps in the  computation by taking the min values from closest neighbours
    that were computed.
    '''
    l = np.argwhere(matrx != -1)
    res = matrx.copy()

    if len(l)<2:
        idx_to_change = res == -1
        logging.warning("len < 2 --> len(idx_to_change) = " + str(len(idx_to_change)))
        res[idx_to_change] = res.max() + margin
        return res

    top_left_corner = l[0]
    bottom_right_corner = l[-1]

    for i in range(top_left_corner[0], bottom_right_corner[0]+win_stride[0]//2):
        for j in range(top_left_corner[1],bottom_right_corner[1]+win_stride[1]//2):
            if matrx[i,j] == -1:
                local_roi = get_local_crop(matrx, (i,j),win_stride)
                cand = local_roi[ local_roi != -1 ]
                res[i,j] = cand.min()
    idx_to_change = res == -1
    res[idx_to_change] = res.max() + margin
    return res

*Details*

For each image in the `images_candidates` list: 
- compute the matching at every required location (points spaced by `win_stride`)
- fill the non-computed points with the minimum of their computed neighbours
- normalize (min-max) to get values between 0 and 1
- show the results

In [0]:
for img in images_candidates:
    win_stride=(16,16)
    res = match_hog(img, hog_of_interest, image_of_interest.shape, win_stride)
    new_res = fill_matrix_min_neighbours(res, win_stride)

    logging.debug("worst match hog results: " + str(new_res.max()))
    logging.debug("best match hog results: " + str(new_res.min()))

    b = new_res.copy()
    bmax, bmin = b.max(), b.min()
    if bmax == bmin:
        # constant map: min-max normalization would divide by zero
        if bmax == 0:
            logging.info("Perfect match!")
        b = np.zeros_like(b)
    else:
        b = (b - bmin)/(bmax - bmin)
    # b = (b - bmin)/(bmax)

    logging.debug("worst match hog results normalized [expected 1]: " + str(b.max()))
    logging.debug("best match hog results normalized [expected 0]: " + str(b.min()))



    fig, (ax1, ax2) = plt.subplots(1,2 , figsize=(16, 8), sharex=True, sharey=True) 

    ax1.imshow(cv2.cvtColor(img,cv2.COLOR_BGR2RGB))
    ax1.set_title("Image where to find base face")

    # ax2.imshow(new_res, cmap="jet")
    # ax2.set_title("Results gaps filled with min neighbour")

    ax2.imshow(b, cmap="jet")
    ax2.set_title("Normalized results of Matching")

    plt.show()
    logging.debug("=====================================================")




####Matching - part2

As shown in the results above, it works! Several observations nonetheless:
1. The borders are red (indicating a large distance): the sliding window must fit entirely inside the image, so no distance is computed near the borders, and these locations are filled with the largest distance found.

2. It works even if the shape is not exactly the same!
    * provided that a threshold is chosen, one could use this to recognize an object
    * only a *LOCAL* description is given: only local object shape and appearance (to some extent) can be represented. 
    * Because of this locality and limited expressiveness, the faces of Bradley Cooper and Emma Stone may look alike in terms of HOG descriptors.

3. Only *ONE* size of the object is verified. If the object to find has the same size as the object in the image, the descriptors will be very similar, and the distance small. However, if the object appears smaller or larger in the image candidate, the HOG descriptors won't match at all. 
To solve this problem, one can perform multiscale detection, where the search is repeated at several scales so that the object, if present, is detected at one of them.

A nice overview of this multiscaling can be found in [pyimagesearch](https://www.pyimagesearch.com/2015/11/16/hog-detectmultiscale-parameters-explained/)

4. Detecting objects with this sliding window takes time, and is even more time and resource consuming when combined with multiscale analysis. Several techniques can be put in place (also discussed in [pyimagesearch](https://www.pyimagesearch.com/2015/11/16/hog-detectmultiscale-parameters-explained/)) such as:
    - reducing the size of the image candidate, without losing too much information
    - adjusting the HOG parameters so that the computation time is optimized for a **specific application**
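
The multiscale detection mentioned in observation 3 is often built on an image pyramid. The sketch below is only an illustration under stated assumptions: `resize_nearest` is a NumPy-only stand-in for a proper resizing function such as `cv2.resize`, and `image_pyramid`, the scale factor and the minimum size are hypothetical choices.

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize with NumPy only (stand-in for cv2.resize)."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows][:, cols]

def image_pyramid(img, scale=1.5, min_size=(64, 64)):
    """Yield (factor, image) pairs, shrinking the original image by `scale`
    at each level until it becomes smaller than `min_size` (height, width)."""
    factor = 1.0
    current = img
    while current.shape[0] >= min_size[0] and current.shape[1] >= min_size[1]:
        yield factor, current
        factor *= scale
        new_h = int(img.shape[0] / factor)
        new_w = int(img.shape[1] / factor)
        if new_h < 1 or new_w < 1:
            break
        current = resize_nearest(img, new_h, new_w)

candidate = np.zeros((300, 400, 3), dtype=np.uint8)  # placeholder image
shapes = [level.shape[:2] for _, level in image_pyramid(candidate)]
print(shapes)
```

The fixed-size sliding window would then be run on every level: a face that is too large for the window in the original image becomes comparable in size at a coarser level. The cost grows with the number of levels, which is why the stride and the scale factor need to be chosen together.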




###Wrap up the feature representation

We can now wrap up the results in a matrix of dimension $(n \times p)$, where $n$ is the number of training images and $p$ is the number of features (the length of the feature representation).

We call this matrix `X_HOG_train`



In [0]:
'''
get dimensions from previously built data structure
'''
local_n = sum([len(hog_training[person]) for person in persons])
local_p = len(hog_training[personA][0][0])

'''
create data structure X_HOG_train
'''

X_HOG_train = np.empty((local_n, local_p))

'''
Fill data structure with feature representation
'''
i=0
for person in hog_training.keys():
    for item in hog_training[person]:
        X_HOG_train[i,:] = item[0]
        i+=1
        
'''
log the shape of the data structure
'''
logging.info("X_HOG_train shape: " + str(X_HOG_train.shape))


and let's do the same with the test set...

In [0]:
'''
get dimensions from previously built data structure
'''
local_t_n = sum([len(hog_test[person]) for person in persons])
local_t_p = len(hog_test[personA][0][0])

'''
create data structure X_HOG_test
'''

X_HOG_test = np.empty((local_t_n, local_t_p))

'''
Fill data structure with feature representation
'''
i=0
for person in hog_test.keys():
    for item in hog_test[person]:
        X_HOG_test[i,:] = item[0]
        i+=1
        
'''
log the shape of the data structure
'''
logging.info("X_HOG_test shape: " + str(X_HOG_test.shape))

###HOG Conclusion

In this first feature representation, we studied in detail how the descriptor is built and computed, and the different parameters that come into play, in particular:
- the cell size
- the block size
- the normalization method

We then used the HOG descriptor to try to detect whether an object is in an image, by computing the descriptor at (almost) every position of the image and assessing the distance (as similarity metric) to the descriptor of the face of interest.
Doing so, we saw that the technique works well to find regions of similar shape in the image, which illustrates the locality of the descriptor. We also covered some of the challenges related to the image sizes, the locality of the descriptors, and the complexity of the computations.


##  Principal Component Analysis - **PCA**

We have extensively covered HOG. Similarly, we will now go through the PCA technique. First, we will give some intuition; then we will go through the maths, as this technique relies heavily on computation; finally, we will apply PCA to the training set and observe some results.

When discussing PCA, we will mainly focus on the technique as applied to images. There are plenty of blogs out there precisely defining and tailoring the technique for all kinds of applications. This tutorial is just an example of PCA applied to face images.




---

As for the previous HOG technique, homemade code is fully provided. It details the implementation and gives insight into how things are calculated. For this part of the tutorial, most of the examples are done with this homemade code. A section is dedicated to a demo of a library tool as well. For the classification and identification parts, however, optimized library code will be used, as that is the purpose of those libraries. 
Don't hesitate to try out different parameters in the provided functions, and please report any bug to geoffroy.herbin@student.kuleuven.be

---



In [0]:
class MyPCA:
    '''
    homemade class to perform PCA using several methods and compare
    Three methods are implemented
    - method = "svd": singular value decomposition technique
    - method = "eigen": nominal eigenvalue decomposition technique is implemented
    - method = "eigen_fast": eigenvalue decomposition using dimensionality 
                            reduction is implemented
    '''
    def __init__(self, method = "svd"):
        self.method = method
        self.eigenvalues = None
        self.eigenvectors = None
        self.X_mean = None

    def fit(self, data):
        X = data.astype(float)  # float copy, so the in-place centering below is safe
        n, m = X.shape
        assert n < m, "n is not smaller than m -> you most likely need " + \
                    "to transpose your input data"
        self.X_mean = np.mean(X, axis = 0)
        X -= self.X_mean

        self.eigenvalues = None
        self.eigenvectors = None

        if self.method == "svd":
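            # singular value decomposition (economy size: Vt is (n x m),
            # which avoids building the full (m x m) matrix)
            U, S, Vt = np.linalg.svd(X, full_matrices=False)
            self.eigenvalues = S**2 / (n - 1)
            self.eigenvectors = Vt.T[:,0:n]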

        elif self.method == "eigen_fast":

            # compute small covariance matrix
            D = np.dot(X, X.T) / (n - 1)

            # eigen decomposition (D is symmetric, so eigh applies and
            # guarantees real eigenvalues and eigenvectors)
            LD, W = np.linalg.eigh(D)

            order_D = np.argsort(LD)[::-1]
            LD_sorted = LD[order_D]
            W_sorted = W[:,order_D]
            
            eigenVectors_sorted_tmp = np.dot(X.T, W_sorted)
            eigenVectors_sorted = np.empty(eigenVectors_sorted_tmp.shape)
            for i in range(n):
                v = eigenVectors_sorted_tmp[:,i]
                eigenVectors_sorted[:,i] = v / np.linalg.norm(v)
            
            self.eigenvalues = LD_sorted
            self.eigenvectors = eigenVectors_sorted
                
        elif self.method =="eigen":
            # compute covariance matrix

            Cov = np.dot(X.T, X) / (n - 1)

            # eigen decomposition (Cov is symmetric, so eigh applies and
            # guarantees real eigenvalues and eigenvectors)
            LC, V = np.linalg.eigh(Cov)

            # sort in appropriate order and keep only relevant component
            order = np.argsort(LC)[::-1]
            LC_sorted = LC[order][0:n]
            V_sorted = V[:,order][:,0:n]

            self.eigenvalues = LC_sorted
            self.eigenvectors = V_sorted.real
        else:
            raise RuntimeError("method value unknown")
        
    def projectPC(self, X, k):
        Vk = self.eigenvectors[:, 0:k]
        # logging.debug("Reduce X " + str(X.shape) + "using k = " + str(k) + " components")
        # logging.debug("Vk (4096 x k)= " + str(Vk.shape))
        X_reduced = np.dot(X, Vk)
        # logging.debug("X_reduced (n x k) = " + str(X_reduced.shape))

        return X_reduced
    
    def reconstruct(self, X_reduced, k, show = False):
        if len(X_reduced.shape) > 1:
            X_reduced = X_reduced[:,0:k]
        else:
            X_reduced = X_reduced[0:k]
        
        Vkt = self.eigenvectors[:, 0:k].T
        # logging.debug("Reduce using X_reduced = " + str(X_reduced.shape) )
        # logging.debug("Reconstruct using k = " + str(k) + " components")
        # logging.debug("Vkt (k x 4096)= " + str(Vkt.shape))
        X_hat_centered = np.dot(X_reduced, Vkt)
        # logging.debug("X_hat_centered.shape (20,4096): " + str(X_hat_centered.shape))
        if show:
            self.show_data(X_hat_centered, add_mean = True)
        return X_hat_centered 

    def compute_error(self, X, X_hat):
        return sqrt(mean_squared_error(X, X_hat))

    def show_principal_components(self,k, figsize=(10,4)):
        w = figsize[0]
        h = figsize[1]
        fig = plt.figure(figsize=(w,h)) 
        fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05) 
        logging.debug("self.eigenvectors.shape = " + str(self.eigenvectors.shape) )
        for i in range(k):
            pc = self.eigenvectors[:,i]
            assert pc.shape[0]==(sq_size**2)*1 or pc.shape[0]==(sq_size**2)*3, "Not proper shape (expected (sq_size**2) (*3)) " + str(pc.shape)
            
            ax = fig.add_subplot(h, w, i+1, xticks=[], yticks=[]) 
            if color:
                pc_img = (np.reshape(pc.real, (sq_size, sq_size, 3))*255).astype("uint8") 
                ax.imshow(pc_img, interpolation='nearest') 
            else:
                pc_img = np.reshape(pc.real, (sq_size, sq_size))
                ax.imshow(pc_img  , cmap=plt.cm.gray, interpolation='nearest')

        plt.show()

    def compute_explained_variance(self, show=True):
        sum_all_eigenValues = sum(self.eigenvalues)
        logging.debug("\nSum of all eigenValues: " + str(sum_all_eigenValues))

        explained_variance      = [(value / sum_all_eigenValues)*100 for value in self.eigenvalues]
        cum_explained_variance  = np.cumsum(explained_variance)
        logging.debug("Cum explained variance : \n" + str(cum_explained_variance))
        if show:
            fig = plt.figure(figsize=(12, 6))
            ax1 = fig.add_subplot(121)
            ax1.bar(range(len(self.eigenvalues)), self.eigenvalues)
            ax1.set_xlabel('eigenvalues')
            ax1.set_ylabel('values')

            ax2 = fig.add_subplot(122)
            ax2.bar(range(len(explained_variance)), explained_variance)
            ax2.plot(range(len(cum_explained_variance)), cum_explained_variance, color='green', linestyle='dashed', marker='o', markersize=5)

            ax2.set_xlabel('eigenvalues')
            ax2.set_ylabel('% information ')
            ax2.legend( labels = ["Cumulative Expl. Var.", "Explained Variance"])
            ax2.grid()
            plt.show()
        return explained_variance, cum_explained_variance

    def show_data(self, X, add_mean = False):
        
        # copy so that adding the mean does not modify the original centered
        # data X
        X_ = X.copy()

        if len(X.shape) > 1:
            self._show_data(X_, add_mean)
        else:
            fig = plt.figure(figsize=(3,3))
            if add_mean:
                    X_ += self.X_mean
            # img = np.reshape(X_, (sq_size, sq_size))
            img = my_reshape(X_, sq_size, color)

            ax = fig.add_subplot(1, 1, 1, xticks=[], yticks=[]) 
            ax.imshow(img, cmap = my_color_map, interpolation='nearest') 

            plt.show()


    def _show_data(self, X, add_mean = False):
        fig = plt.figure(figsize=(8,8)) 
        fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05) 
        i=0
        for x in X:
            if add_mean:
                    x += self.X_mean
            # assert x.shape[0]==(sq_size**2)*3, "Not proper shape (expected (sq_size**2)*3) " + str(x.shape)
            ax = fig.add_subplot(8, 5, i+1, xticks=[], yticks=[]) 
            img = my_reshape(x, sq_size, color) 
            ax.imshow(img, cmap = my_color_map, interpolation='nearest') 

            i+=1
        plt.show()
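
The `eigen_fast` method relies on a classical property: if $w$ is an eigenvector of the small $(n \times n)$ matrix $X X^T$ with eigenvalue $\lambda$, then $X^T w$ is an eigenvector of the large $(m \times m)$ matrix $X^T X$ with the same eigenvalue. The sketch below checks this on random data. It is independent of the `MyPCA` class, and the sizes and the random matrix are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 50                      # few samples, many features
X = rng.normal(size=(n, m))
X -= X.mean(axis=0)               # center the data

# eigen decomposition of the small (n x n) matrix
D = X @ X.T / (n - 1)
vals_small, W = np.linalg.eigh(D)
order = np.argsort(vals_small)[::-1]
vals_small, W = vals_small[order], W[:, order]

# map to the large space; the last component is skipped because centering
# makes its eigenvalue (numerically) zero
k = n - 1
V = X.T @ W[:, :k]
V /= np.linalg.norm(V, axis=0)

# check that the columns of V are eigenvectors of the large covariance matrix
Cov = X.T @ X / (n - 1)
residual = np.abs(Cov @ V - V * vals_small[:k]).max()

# the same eigenvalues are obtained from the singular values of X
svd_vals = np.linalg.svd(X, compute_uv=False) ** 2 / (n - 1)
print("max residual:", residual)
print("eigenvalues match SVD:", np.allclose(svd_vals[:k], vals_small[:k]))
```

Only an $(n \times n)$ problem is solved, which is much cheaper than decomposing the full covariance matrix when the number of pixels is far larger than the number of images.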

### Basic idea

If you are completely unfamiliar with the principal components analysis, the thread in [PCA intuition](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues) contains a wonderful layered explanation of what the PCA is, and what can its use be.
Another very nice introduction is given in [medium](https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0)

Essentially, the PCA technique extracts information from the data to get the main directions intrinsically present in it. To achieve that, PCA uses the covariance, which is a *measure of the extent to which corresponding elements from two sets of ordered data move in the same direction* (definition extracted from [medium website](https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0)).


Based on this covariance information, the PCA  *\"finds a new set of dimensions (or a set of basis of views) such that all the dimensions are orthogonal (and hence linearly independent) and ranked according to the variance of data along them. It means more important principle axis occurs first\"* (source: [medium website](https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0))
If you're more of a visual person, the plot [here](https://i.stack.imgur.com/lNHqt.gif) shows the first principal component of a cloud of points, with an animation that shows how the variance of the projected points is maximized (equivalently, how the reconstruction error is minimized).

It is really an extraction of information from the data, hence an unsupervised technique, that is used to reduce the dimensionality of the problem. Applying PCA on images can therefore be used as a feature representation!
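
To make the idea concrete, here is a minimal sketch on toy data, assuming a hypothetical cloud of 200 correlated 2-D points (not the face images of this notebook). It centers the points, computes the covariance matrix and reads the first principal component off its eigendecomposition.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 correlated 2-D points.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
points = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# Center, then eigendecompose the (2 x 2) covariance matrix.
centered = points - points.mean(axis=0)
cov = centered.T @ centered / (len(points) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

# Reorder so that the first principal component comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

first_pc = eigvecs[:, 0]
print("first principal component:", first_pc)
print("share of variance explained:", eigvals[0] / eigvals.sum())
```

The first component should roughly follow the line of slope 0.5 used to generate the cloud, and carry almost all of the variance.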






### PCA How-to, Step1: Pre-processing

The PCA is applied on *hyper-points*: points in our high-dimensional space. We represent each of those points by a vector. Applied to images, we then need to convert our sets of images into a usable format. 

The representation chosen is a $(n \times p)$ matrix where
- n is the number of images
- p is the dimension of the vector

The matrix containing the training data is then a $(40 \times p)$ matrix, and -- in our specific problem so far -- the matrix containing the test data is also a $(40 \times p)$ matrix.

#### Resizing
All the images (*hyper-points*) shall have the same dimensions to start with. There is therefore a first step of resizing all the images to a common size, keeping in mind that the faces are square images. The parameter indicating the length of the side of this square is defined as `sq_size` in this notebook. 

`sq_size = 64` indicates that the images are resized as a $(64 \times 64)$ matrix, containing 1 or 3 channels depending on their colorscale. 


#### Color or Grayscale ?
In the literature, we find both implementations, and I decided not to choose at this point. A grayscale image contains only one channel, a color image contains three. 
> to change from grayscale to color, just change the boolean parameter `color` at the beginning of this notebook.

The grayscale image is then simply converted to a vector by flattening the matrix, concatenating each row one after the others. 

The colored image is converted applying the same technique on the three channels, then concatenating the three resulting vectors into one, longer, vector.

#### Resulting Dimensionality
From a training set image of dimensions $(h, w)$:
- resize to $(64, 64)$
- convert to vector:
    * if grayscale, convert to $(1, 4096)$
    * if color, convert to $(1, 12288)$

The resulting matrix representing the training set has shape $(40, 4096)$ or $(40, 12288)$, depending on whether grayscale or color is used.
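
The conversion described above can be sketched as follows. This is only an illustration on random placeholder pixels; the actual helper `get_matrix_from_set` used in this notebook may order the values differently.

```python
import numpy as np

sq_size = 64  # same value as in the text

# Placeholder images (assumption): random pixels instead of real faces.
rng = np.random.default_rng(0)
gray_img = rng.integers(0, 256, size=(sq_size, sq_size))
color_img = rng.integers(0, 256, size=(sq_size, sq_size, 3))

# Grayscale: concatenate the rows one after the other.
gray_vec = gray_img.reshape(-1)

# Color: flatten each channel separately, then concatenate the three vectors.
color_vec = np.concatenate([color_img[:, :, c].reshape(-1) for c in range(3)])

print(gray_vec.shape, color_vec.shape)   # (4096,) (12288,)

# Stacking n such vectors row by row gives the (n x p) data matrix.
data_matrix = np.stack([gray_vec] * 40)
print(data_matrix.shape)                 # (40, 4096)
```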

#### Process the training set as a data matrix

In [0]:
'''
Build useable training set from (hyper-)parameters
'''

# the matrix is (n x p): one image per row
# row 0 = first image
# row 1 = second image
# ...
m_src = get_matrix_from_set(training_set, color, sq_size = sq_size, flatten = True)

logging.debug(" m_src: original matrix")
logging.debug(" m_src: shape = " + str(m_src.shape))

plot_matrix(m_src, color, my_color_map, h=4, w=10, transpose = False)

#### Process the test set as a data matrix

In [0]:
'''
Build useable test set from (hyper-)parameters
'''
m_test_src = get_matrix_from_set(test_set, color, sq_size = sq_size, flatten = True)


logging.debug(" m_test_src: original matrix")
logging.debug(" m_test_src: shape = " + str(m_test_src.shape))


plot_matrix(m_test_src, color, my_color_map,h = 4, w=10,transpose = False)

### PCA How-to, Step2: Centering the Data

Centering the data is essential so that the eigenvectors, once sorted by their eigenvalues, really mean what we want: the directions of maximum variance in the data.

The following code shows:
1. (left) a plot of the data (training) matrix 
2. (middle) the mean image
3. (right) a plot of the data where the mean image is subtracted: the data are now centered


In [0]:
X = m_src.copy()
X_mean = np.mean(X, axis = 0)
Xc = X - X_mean

fig = plt.figure(figsize=(18, 5))

Xs = np.arange(0, X.shape[0])
Ys = np.arange(0, X.shape[1])
Xs, Ys = np.meshgrid(Xs, Ys)
  
ax1 = fig.add_subplot(131,projection='3d')
surf1 = ax1.plot_surface(Xs, Ys, X.T, cmap=plt.cm.jet, antialiased=True)
ax1.set_xlabel('image index')
ax1.set_ylabel('vector index')
ax1.set_zlabel('value')
ax1.set_title('Original data')
ax2 = fig.add_subplot(132)
ax2.imshow(my_reshape(X_mean, sq_size, color), cmap = my_color_map, interpolation='nearest') 
ax2.set_title("Mean image")


ax3 = fig.add_subplot(133,projection='3d')
surf3 = ax3.plot_surface(Xs, Ys, Xc.T, cmap=plt.cm.jet, antialiased=True)
ax3.set_xlabel('image index')
ax3.set_ylabel('vector index')
ax3.set_zlabel('value')
ax3.set_title("Centered data")
plt.show()

logging.info("Visualization of the images - centered:")

plot_matrix(Xc, color, my_color_map, h=4, w=10,transpose = False)
plt.show()

### PCA How-to, Step3: Decomposition


#### *Canonical - EigenDecomposition*
#### 0. <u>Data </u>
Let's take our initial data matrix $X$, a $(n \times p)$ matrix of data where n is the number of images, and p is the number of variables. In our case, the number of variables is the number of pixels of one image (or three times this number, for a color image). First, we have to center the data, hence subtracting the mean image. This was done in the previous step. In the following text, $X$ is assumed to be centered. 



#### 1. <u>Covariance Matrix </u>

We compute the covariance matrix, which indicates how each variable (= pixel intensity) varies with respect to every other one. 

$C = \frac{X^T \cdot X}{n-1}$ and has dimension $(p \times p)$. This is a pretty large matrix already.
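
As a quick sanity check of this formula, here is a small sketch on a toy matrix (the sizes and names are assumptions chosen for illustration). It compares the manual computation with `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 8          # tiny sizes, for illustration only
X_toy = rng.normal(size=(n_samples, n_features))   # rows = samples

Xc_toy = X_toy - X_toy.mean(axis=0)   # centering (Step 2)
C = Xc_toy.T @ Xc_toy / (n_samples - 1)   # (p x p) covariance matrix

# np.cov expects variables in rows by default, hence rowvar=False here.
print(C.shape)                                       # (8, 8)
print(np.allclose(C, np.cov(X_toy, rowvar=False)))   # True
```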

#### 2. <u>Eigenvalues and EigenVectors </u>
Having computed the covariance matrix $C$, we compute its eigenvectors and eigenvalues, which indicate the main directions along which the data vary and how strongly they vary. The eigendecomposition is expressed as: $C = V L V^T$ where
- $L$ is a diagonal matrix of eigenvalues
- $V$ is the $(p \times p)$ matrix of eigenvectors


<u>*Mathematical Trick: Exploiting the dimensionality of the matrix*</u>

Computing the eigenvalues and eigenvectors of $C$ is pretty cumbersome, as $C$ is a large $(p \times p)$ matrix.

Recalling our algebra skills, given the dimension of $X$, we know that only a limited number of eigenvalues are non-zero: since $X$ has only $n$ rows and is centered, there are at most $(n-1)$ non-zero eigenvalues. There is no need to compute the $p$ eigenvalues and the related $(p \times p)$ eigenvector matrix, as all the information is contained in only $(n-1)$ eigenvalues.

As a result, to speed up the computation and take advantage of this property, instead of computing the eigenvalues and eigenvectors of $C = \frac{X^T \cdot X}{n-1}$ of size $(p \times p)$, let's rather compute the $n$ eigenvalues and corresponding eigenvectors of the matrix $D = \frac{X \cdot X^T}{n-1}$ of size $(n \times n)$, such that $D = W  L W^T$
- the non-zero eigenvalues computed are the same as $C$'s
- the corresponding eigenvectors of $C$, in matrix $V$, are obtained as $V = X^T \cdot W$, up to a normalization of each column. Indeed, if $D w = \lambda w$, then $C (X^T w) = \frac{X^T X X^T w}{n-1} = X^T (D w) = \lambda (X^T w)$

This way, it takes advantage of the dimensions of the problem.
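
The trick can be checked numerically. The following is a minimal sketch on toy data (hypothetical sizes and names, not the `MyPCA` implementation of this notebook). It solves the small $(n \times n)$ problem, maps the eigenvectors back, and compares with the direct eigendecomposition of $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6, 50
X_toy = rng.normal(size=(n_samples, n_features))
Xc_toy = X_toy - X_toy.mean(axis=0)

# Small (n x n) problem instead of the large (p x p) one.
D = Xc_toy @ Xc_toy.T / (n_samples - 1)
eigvals_D, W = np.linalg.eigh(D)
order = np.argsort(eigvals_D)[::-1]
eigvals_D, W = eigvals_D[order], W[:, order]

# Map back to image space and normalize; the last direction is dropped
# because its eigenvalue is (numerically) zero after centering.
k = n_samples - 1
V_small = Xc_toy.T @ W[:, :k]
V_small /= np.linalg.norm(V_small, axis=0)

# Reference: direct eigendecomposition of the (p x p) covariance matrix.
C = Xc_toy.T @ Xc_toy / (n_samples - 1)
eigvals_C = np.sort(np.linalg.eigvalsh(C))[::-1]

print(np.allclose(eigvals_D[:k], eigvals_C[:k]))           # True
print(np.allclose(C @ V_small, V_small * eigvals_D[:k]))   # True
```

The second check verifies the defining relation $C v = \lambda v$ for each mapped-back column, which avoids having to deal with the sign ambiguity of eigenvectors.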

#### 3. <u>Principal Components</u>

The principal components are defined as the eigenvectors $V$. The eigenvectors can be sorted according to the value of their associated eigenvalue, in decreasing order.

The eigenvector with the largest associated eigenvalue is called the "first principal component", the one with the second largest is called the "second principal component", and so on.

In the context of face analysis, the eigenvectors are often called **eigenfaces**, as they can be reshaped as an image and displayed (provided their values are rescaled to a displayable range).

By projecting the original data $X$ on the new directions, the eigenfaces, we get *new coordinates* that still *fully describe the original data*. 

Furthermore, the number of eigenvectors on which we project the data is reduced with respect to the dimensionality of the original problem.   


#### *Singular Value Decomposition*

The idea behind the SVD is essentially mathematical, and helps in computing the eigenvectors and eigenvalues in a different way. From a centered matrix $X$ (as in the previous section), one can compute its decomposition $X = U \cdot S \cdot V^T$ 
where $U$ is a unitary matrix and $S$ is diagonal, containing the so-called singular values $s_i$. 
One can see that the columns of $V$, the right singular vectors, are related to the eigenvectors of the covariance matrix from the previous section.

Indeed, computing this covariance matrix gives:

\begin{align}
Cov &= \frac{X^T \cdot X}{n-1} \\
    &= \frac{V \cdot S \cdot U^T \cdot U \cdot S \cdot V^T}{n-1} \\
    &= V \cdot \frac{S^2}{n-1} V^T
\end{align}

There is therefore a link between the singular values ($s_i$) and the eigenvalues ($\lambda_i$):
$$\lambda_i = \frac{s_i^2}{n-1}$$ and the right singular vectors are the eigenvectors $V$.

Note that in practice, most implementations of the PCA algorithm use singular value decomposition and start by centering the data, such as the library function `sklearn.decomposition.PCA` that we will use extensively later on.
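
The relation $\lambda_i = \frac{s_i^2}{n-1}$ can be verified with a few lines of NumPy. This is a sketch on toy data, with hypothetical sizes and names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6, 50
X_toy = rng.normal(size=(n_samples, n_features))
Xc_toy = X_toy - X_toy.mean(axis=0)

# Thin SVD: U is (n x n), s has n entries, Vt is (n x p).
U, s, Vt = np.linalg.svd(Xc_toy, full_matrices=False)

# Eigenvalues of the covariance matrix, largest first.
C = Xc_toy.T @ Xc_toy / (n_samples - 1)
eigvals_C = np.sort(np.linalg.eigvalsh(C))[::-1][:n_samples]

print(np.allclose(s**2 / (n_samples - 1), eigvals_C))   # True
# Each row of Vt is an eigenvector of C (checked here for the first one).
print(np.allclose(C @ Vt[0], eigvals_C[0] * Vt[0]))     # True
```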

#### Decomposition and Eigenfaces

The following lines of code create an object of type `MyPCA`, and compute the decompositions following two methods: "svd" and "eigen_fast". 



In [0]:
nb_training_faces = sum(training_sets_size.values())

'''
Ensuring reset of object if cell is rerun
'''
my_pca = None
my_pca2 = None

'''
Getting the source matrix
'''
X = m_src.copy()

'''
Creating the MyPCA objects
-> solving according to SVD
-> solving according to EigenDecomposition (with Math Trick)
'''
my_pca = MyPCA("svd")
my_pca.fit(X)
print("All principal components, using SVD")
my_pca.show_principal_components(k=nb_training_faces)

my_pca2 = MyPCA("eigen_fast")
my_pca2.fit(X)
print("All principal components, using eigendecomposition")
my_pca2.show_principal_components(k=nb_training_faces)

# my_pca3 = MyPCA("eigen")
# my_pca3.fit(X)
# print("All principal components, using nominal eigendecomposition")
# my_pca3.show_principal_components(k=nb_training_faces)

The two plots above show the same information, the eigenfaces, computed in two different ways: using SVD and using eigendecomposition. Several things are worth noticing:
- the eigenfaces are very much alike. Actually, they are exactly the same (except for the last one, see next point) once the well-known sign ambiguity of the decomposition is taken into account, see [Stanford course, sect. 5.3, Properties of eigenvectors](https://graphics.stanford.edu/courses/cs205a-13-fall/assets/notes/chapter5.pdf). Long story short, white and black may be reversed without any issue in the eigenfaces, on each image independently. 
    * In the commonly used library implementations, there is often a sign-flipping function such as `svd_flip` to ensure repeatability of the results of the decomposition. See [documentation](https://kite.com/python/docs/sklearn.utils.extmath.svd_flip)
- the last eigenface seems to differ... This definitely isn't an issue. 
    * recall that there are only $(n - 1)$ non-zero eigenvalues. The last eigenface is associated with a zero eigenvalue: it carries no variance and doesn't play a role. 

If interested, you may uncomment the last lines in order to create a PCA with parameter `method = "eigen"`, and see the result of the true eigendecomposition, without the mathematical trick. Or you can trust me that the result is the same - with the actual noisy 40th component (time to run ~50 seconds)
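
The sign ambiguity can be illustrated with a short sketch on toy data (hypothetical names, independent of `MyPCA`). The components obtained from SVD and from the eigendecomposition coincide once each one is flipped to point the same way as its counterpart.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6, 50
X_toy = rng.normal(size=(n_samples, n_features))
Xc_toy = X_toy - X_toy.mean(axis=0)
k = n_samples - 1   # number of components with a non-zero eigenvalue

# Components from SVD (rows of Vt) ...
_, _, Vt = np.linalg.svd(Xc_toy, full_matrices=False)
svd_components = Vt[:k]

# ... and from the (n x n) eigendecomposition, mapped back and normalized.
eigvals, W = np.linalg.eigh(Xc_toy @ Xc_toy.T / (n_samples - 1))
W = W[:, np.argsort(eigvals)[::-1][:k]]
eig_components = (Xc_toy.T @ W).T
eig_components /= np.linalg.norm(eig_components, axis=1, keepdims=True)

# Flip each eigen-component so that it points the same way as its SVD twin.
signs = np.sign(np.sum(svd_components * eig_components, axis=1))
aligned = eig_components * signs[:, None]

print(np.allclose(aligned, svd_components, atol=1e-6))   # True
```

Library functions such as `svd_flip` remove the ambiguity with a fixed convention instead of comparing two decompositions, as done here.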

#### Reconstructing on k components

In [0]:
'''
Selecting X as the input image we want to reconstruct
'''
X = m_src.copy()[9,:]

logging.debug("Shape of input X = " + str((X.shape)))

'''
Centering the data
'''
X_mean = np.mean(m_src.copy(), axis=0)
Xc = X - X_mean


'''
Computing X_reduced, projection of Xc on the principal components space
'''
X_reduced = my_pca.projectPC(Xc, k=nb_training_faces)
logging.debug("Shape of X_reduced = " + str(X_reduced.shape))

'''
Reconstructing progressively
Based on k first components only
'''
k=0
fig = plt.figure(figsize=(3,3))
img = my_reshape(X_mean, sq_size, color)
ax = fig.add_subplot(1, 1, 1, xticks=[], yticks=[]) 
ax.imshow(img, cmap = my_color_map, interpolation='nearest') 
plt.show()
logging.info("Principal components used = " + str(k) + ";\nReconstruction error = " + str(np.round(my_pca.compute_error(Xc+X_mean, X_mean),2)))

for k in [1,2,3,5,8,10,12,15,20,25,30,40]:
    X_hat_centered = my_pca.reconstruct(X_reduced, k, show=True)
    logging.info("\nAbove: \nPrincipal components used = " + str(k) + 
                 ";\nReconstruction error = " + 
                 str(np.round(my_pca.compute_error(Xc, X_hat_centered),2)) + 
                 "\n"+"___"*30)


In [0]:
expl_var, cum_expl_var = my_pca.compute_explained_variance(show = True)

*Discussions*

On the above results and images, several things can be observed:
1. First, the different reconstructions of one selected face. As expected, the reconstruction error decreases as the number of components used (k) increases. This also matches the intuition behind the explained variance.
2. Second, the final remaining error is 0, indicating that no information is lost when all 40 components are used. That confirms the mathematical theory.
3. The last two graphs show on the left, the eigenvalues, and on the right, the cumulative explained variance. 
    * the eigenvalues are sorted from the most important to the least important, consistent with the curve of the cumulative explained variance on the right. However, the values do not decrease fast (not exponentially). Furthermore, there is no big drop-off in the values.
    * this indicates that the choice of an **optimal number p** of components, such that the dimensionality of the feature space is reduced but still informative, is not obvious. 
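
For reference, the explained variance and its cumulative sum can be derived from the eigenvalues as sketched below. The eigenvalues are made-up numbers, and this is not necessarily how `compute_explained_variance` is implemented in `MyPCA`.

```python
import numpy as np

# Hypothetical eigenvalues, sorted in decreasing order (not the notebook's).
eigenvalues = np.array([5.0, 3.0, 1.5, 0.4, 0.1])

# Share of the total variance carried by each component ...
explained_variance = eigenvalues / eigenvalues.sum()
# ... and by the first k components together.
cum_explained_variance = np.cumsum(explained_variance)

print(explained_variance)
print(cum_explained_variance)
```

In this made-up example the first three components carry about 95% of the variance; a sharp drop-off of this kind is precisely what is missing in the face data.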

#### Choice of optimal $p$


$p$ is defined as the optimal number of components used in the reduced space so that:
1. the dimension is reduced (lower than n)
2. the reconstructed information is *close* to the input data, meaning the reconstruction is still informative.

Selecting only the first $p$ components has the effect of removing small variances. This can matter for some applications, if there is little variance between different classes.


As said above:
- there is no clear drop-off in the eigenvalues,
- there is no exponential decrease of the eigenvalues.

It makes the choice of an optimal $p$ complicated. 

To choose, we will:
1. compute the reconstruction loss (error), as an RMSE:
    * for all training examples
    * for all possible choices of $p$
2. plot this RMSE, in absolute value, 
3. plot, in percentage, the ratio between the reconstruction loss using $p$ components and the RMSE between the mean image and the input image. This gives the notion of percentage of reconstruction error -- that can actually be related to the cumulative explained variance!
4. define a threshold of 95% of the information kept (5% of error tolerated)

That will lead to a sensible choice of $p$. The threshold itself, however, is purely arbitrary.
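
The steps above can be sketched compactly on toy data. The sizes, the helper `rmse` and the variable names are assumptions made for this illustration; the notebook's own cell relies on `MyPCA` instead.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two vectors."""
    return np.sqrt(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
n_samples, n_features = 10, 30          # toy sizes, not the notebook's
X_toy = rng.normal(size=(n_samples, n_features))
Xc_toy = X_toy - X_toy.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc_toy, full_matrices=False)

# rel_err[i, k] = RMSE with k components / RMSE against the mean image, in %.
rel_err = np.empty((n_samples, n_samples + 1))
for i, x in enumerate(Xc_toy):
    base = rmse(x, np.zeros(n_features))    # k = 0: the mean image only
    for k in range(n_samples + 1):
        x_hat = (x @ Vt[:k].T) @ Vt[:k]     # project, then reconstruct
        rel_err[i, k] = 100 * rmse(x, x_hat) / base

mean_rel_err = rel_err.mean(axis=0)
tolerance = 5.0                             # 5% of error tolerated
p_opt = int(np.argmax(mean_rel_err <= tolerance))   # first k under tolerance
print("smallest number of components under the tolerance:", p_opt)
```

On random toy data like this the error decreases slowly, so the selected value ends up close to the maximum number of components, much like the behaviour discussed for the faces.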


In [0]:
X_train = m_src.copy()
X_train_mean = np.mean(X_train, axis = 0)
X_train_centered = X_train - X_train_mean

my_pca = None
my_pca = MyPCA("svd")
my_pca.fit(X_train)

n = X_train.shape[0]
'''
np array containing all the rmse computed

if n = 40, max number of components, then rmse has a size 40x41 (0 -> 40 included)
> a row matches the rmse of one image wrt the dimension reconstructed. Last column should be 0 (or close, ~e-14)
'''
rmse = np.empty((n,n+1))
rmse_pc = np.empty((n,n+1)) # rmse in percentage
rmse_base = np.empty((n,))
for i in range(n):
    rmse_base[i]=my_pca.compute_error(X_train[i], X_train_mean)


index_image = 0
for img_center_vector in X_train_centered:
    # logging.debug("projecting on " + str(n) + " principal components")
    X_centered_reduced = my_pca.projectPC(img_center_vector, n)
    for k in range(n+1):
        # from 0 to n, included (k = 0: mean image only)
        if k == 0:
            rmse[index_image, k] = rmse_base[index_image]
            rmse_pc[index_image, k] = 100 * rmse[index_image,k] / rmse_base[index_image]
        else:
            # logging.debug("reconstructing using " + str(p) + " principal components")
            X_hat_centered = my_pca.reconstruct(X_centered_reduced, k)
            rmse[index_image,k] = my_pca.compute_error(img_center_vector, X_hat_centered)
            rmse_pc[index_image, k] = 100 * rmse[index_image,k] / rmse_base[index_image]
        
    index_image += 1

fig = plt.figure(figsize = (16,16))
ax1 = fig.add_subplot(2,2,1)

for idx in range(n):
    rmse_ = rmse[idx,:]
    ax1.plot([i for i in range(0,n+1)],rmse_, '-')
ax1.set_title("reconstruction errors (RMSE) for all images")
ax1.set_xlabel("reconstruction dimension(s) \'p\' ")
ax1.set_ylabel("RMSE")

rmse_mean = np.mean(rmse, axis = 0)
ax2 = fig.add_subplot(2,2,2)
ax2.plot([i for i in range(0, n+1)], rmse_mean, "ro-")
ax2.set_title("Mean of RMSE for all images")
ax2.set_xlabel("reconstruction dimension(s) \'p\' ")
ax2.set_ylabel("mean of RMSEs")

ax1.set_ylim((0,80))
ax2.set_ylim((0,80))
ax1.grid()
ax2.grid()

ax3 = fig.add_subplot(2,2,3)

for idx in range(n):
    rmse_ = rmse_pc[idx,:]
    ax3.plot([i for i in range(0,n+1)],rmse_, '-')
ax3.set_title("reconstruction errors (RMSE) for all images, in %")
ax3.set_xlabel("reconstruction dimension(s) \'p\' ")
ax3.set_ylabel("RMSE")

rmse_pc_mean = np.mean(rmse_pc, axis = 0)
ax4 = fig.add_subplot(2, 2,4)
ax4.plot([i for i in range(0, n+1)], rmse_pc_mean, "ro-")
ax4.set_title("Mean of RMSE for all images, in %")
ax4.set_xlabel("reconstruction dimension(s) \'p\' ")
ax4.set_ylabel("mean of RMSEs")

ax3.set_ylim((0,110))
ax4.set_ylim((0,110))
ax3.grid()
ax4.grid()

# print(rmse_pc_mean)

Following the process explained above, the choice falls on $p = 35$. This is definitely not an extraordinary result, yet it still constitutes a reduction, based on a reasonable choice.

We can now visualize the training image reconstructed based on the first 35 components.

In [0]:
'''
Define optimal p
'''
p = 35

'''
Selecting X as the input image we want to reconstruct
'''
X = m_src.copy()
logging.debug("Shape of input X = " + str((X.shape)))

'''
Centering the data
'''
X_mean = np.mean(m_src.copy(), axis=0)
Xc = X - X_mean

'''
Creating data structure for outputs
'''
X_hat=np.empty(X.shape)


'''
Computing X_reduced, projection of Xc on the principal components space
'''
X_reduced = my_pca.projectPC(Xc, k=p)
logging.debug("Shape of X_reduced = " + str(X_reduced.shape))

'''
Reconstructing based on p first components only
'''
X_hat_centered = my_pca.reconstruct(X_reduced, p, show=False)
X_hat = X_hat_centered + X_mean
logging.info("\nReconstructed images:")
plot_matrix(X_hat, color, my_color_map, h=4, w=10, transpose=False)

'''
Visualize original input, for comparison purpose
'''
logging.info("\nOriginal images:")
plot_matrix(X, color, my_color_map,h=4, w=10, transpose=False)


# for axis in ['top','bottom','left','right']:
#     fig2.axes[1].spines[axis].set_linewidth(2)
#     fig2.axes[1].spines[axis].set_color('white')



In the two series of images above, the top one is reconstructed using 35 components, and the bottom one shows the original data.
- overall, the quality seems good enough to recognize all the faces pretty well, confirming that most of the variance is kept and the remaining reconstruction loss is small.
- several images, however, clearly show this loss (analysis in grayscale):
    * image[0,6] visibly contains some extra wrinkles on the bottom left of the face 
    * image[2,1] appears deformed
    * image[3,6] shows remnants of hair on Bradley Cooper's forehead.
    * ...


Reassuringly, the images still show some loss in the reconstruction, confirming the previous results. Nonetheless, the quality is considered good enough to go on with the selected $p=35$. 




#### Plot on first two Principal Components
PCA is often used as a dimensionality reduction technique to represent high-dimensional data. Let's show the reconstructed faces in the basis of the first two principal components.

*We do that first with the training images, reconstructed using p=35, for information. Later, we will apply the same thing on some test images*

In [0]:
my_pca = None

X = m_src.copy()
my_pca = MyPCA("svd")
my_pca.fit(X)

data = m_src.copy()
data_mean = np.mean(m_src.copy(), axis = 0)
data_centered = data - data_mean

# my_pca_svd.show_data(X)

# reduce the image to the principal components (k=2)
data_projected = my_pca.projectPC(data_centered, k=2)
logging.debug("shape of data_projected = " + str(data_projected.shape))


plt.figure() 
fig, ax = plt.subplots(1, 1, figsize=(14, 14), sharex=True, sharey=True)

eig1 = data_projected[:,0]
eig2 = data_projected[:,1]
ax.plot(eig1, eig2, 'bo')
ax.set_xlabel("First Component")
ax.set_ylabel("Second Component")

for (x_, y_), img_vector_ in zip(data_projected, X_hat):
    img = my_reshape(img_vector_, sq_size, color)
    ab = AnnotationBbox(OffsetImage(img, cmap = my_color_map), (x_, y_), frameon=False)
    ax.add_artist(ab)

ax.grid()
plt.show()

logging.info("\nIn order to better visualize the plot on top, here is the same view\nwhere personA is in red, and personB is in green")

plt.figure() 
fig, ax = plt.subplots(1, 1, figsize=(8, 8), sharex=True, sharey=True)
ax.plot(eig1[0:20], eig2[0:20], 'ro')
ax.plot(eig1[20:40], eig2[20:40], 'go')
ax.legend(["personA", "personB"])
labels=[i for i in range(20)]*2
for i, txt in enumerate(labels):
    ax.annotate(txt, (eig1[i], eig2[i]))
plt.show()


### Demo scikit learn

Of course, everything that has been done so far regarding PCA can be achieved using dedicated - and optimized - libraries. For that purpose, we can use the `sklearn.decomposition` package that, among other things, implements PCA using the *SVD* decomposition that we've looked at. 

A difference to note is the internal use of `svd_flip(u, vt)`, a function that makes the signs of the vectors deterministic, hence solving the sign ambiguity inherent to matrix decomposition, as already discussed above.

The following part first performs the same operations as we've implemented before, to get familiar with the library.



In [0]:
'''
Let's start --again-- from the training set, processed as a matrix
'''
logging.info("shape of input matrix (n x p) = " + str(m_src.shape))
plot_matrix(m_src, color, my_color_map, h=4, w=10, transpose = False)

In [0]:
'''
Create pca sklearn object, and compute decomposition
'''
X = m_src.copy()
n_components = 40
logging.info("Shape input data: "+str(X.shape))
pca_ = sklearn_decomposition_PCA(n_components=n_components) 
pca_.fit(X)

'''
Get the eigenfaces
'''

eigenfaces = pca_.components_
logging.debug("eigenfaces shape = " + str(eigenfaces.shape))

'''
Visualize the eigenfaces, just as we did before
'''
if color:
    eigenfaces_cvt = (eigenfaces*255).astype(np.uint).copy()
else:
    eigenfaces_cvt = eigenfaces.copy()
plot_matrix(eigenfaces_cvt, color, my_color_map, h=4, w=10, transpose=False)


Without any surprise, we find again the same principal components as before.

Continuing with sklearn library, we can reconstruct the original data based on the first $p$ components. Hopefully, the results are the same as the ones obtained with the homemade PCA. 

In [0]:
p=35
pca_ = sklearn_decomposition_PCA(n_components=p) 
pca_.fit(X)
X_reduced_sk = pca_.transform(X)
X_reconstructed_sk = pca_.inverse_transform(X_reduced_sk)

print(X_reconstructed_sk.shape)
plot_matrix(X_reconstructed_sk, color, my_color_map, h=4, w=10)
logging.info("Difference between Homemade PCA and Scikit-learn PCA: " + str(np.linalg.norm(X_hat - X_reconstructed_sk)))
logging.info(" ==> OK!")

We find the exact same images as the ones reconstructed using the homemade code `MyPCA`, which is the expected but nonetheless relieving and self-rewarding conclusion!
Let's continue with the PCA then...

### Projection of Test Faces reconstructed

Using the same kind of plot as before, we can reconstruct, **in the same eigenfaces basis**, the test set images using the first $p$ components. 


We need to apply to the test set the same steps that were applied to the training set, that is:


1.   centering the data: subtracting the mean **of the training set**
2.   projecting the resulting centered data on the first p principal components computed by applying PCA on the training data. We won't fit the PCA to the test images, we *just* project the test data using the already-found principal components
3.   reconstructing the original data based on those first p principal components, and plotting the result of the 40 images in the space of the first 2 components.
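
The three steps above can be sketched with plain NumPy on toy data. The sizes and names below are assumptions made for illustration, and the actual cell uses `my_pca.projectPC` and `my_pca.reconstruct` instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features, p_toy = 8, 4, 20, 3   # toy sizes (assumptions)
train = rng.normal(size=(n_train, n_features))
test = rng.normal(size=(n_test, n_features))

# Fit on the training data only: mean and principal components.
train_mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - train_mean, full_matrices=False)
components = Vt[:p_toy]                        # (p_toy x n_features)

# 1. center the test data with the TRAINING mean
test_centered = test - train_mean
# 2. project on the components found on the training data (no refit)
test_projected = test_centered @ components.T  # (n_test x p_toy)
# 3. reconstruct in image space and add the training mean back
test_reconstructed = test_projected @ components + train_mean

print(test_projected.shape, test_reconstructed.shape)   # (4, 3) (4, 20)
```

Nothing is estimated from the test data here: both the mean and the components come from the training set.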



In [0]:
scatter_plot_3D = False
scatter_plot_3times = False

'''
Once again, start by locally copying the data structure
> data_test as new data
> data_train_mean for centering
'''

data_test = m_test_src.copy()
data_train_mean = np.mean(m_src.copy(), axis = 0)

data_test_centered = data_test - data_train_mean

'''
Projecting the data onto the p components
'''

data_test_projected = my_pca.projectPC(data_test_centered, k=p)
logging.debug("shape of data_projected = " + str(data_test_projected.shape))

'''
scatter plot in 3D - 3 first components
'''
if scatter_plot_3D:
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111, projection='3d')
    eig1 = data_test_projected[:,0]
    eig2 = data_test_projected[:,1]
    eig3 = data_test_projected[:,2]
    ax.scatter(eig1[0:10], eig2[0:10], eig3[0:10], 'b')
    ax.scatter(eig1[10:20], eig2[10:20], eig3[10:20], 'r')
    ax.scatter(eig1[20:30], eig2[20:30], eig3[20:30], 'g')
    ax.scatter(eig1[30:40], eig2[30:40], eig3[30:40], 'y')
    plt.show()

if scatter_plot_3times:
    fig = plt.figure(figsize=(24,8))

    eig1 = data_test_projected[:,0]
    eig2 = data_test_projected[:,1]
    eig3 = data_test_projected[:,2]


    ax1 = fig.add_subplot(1,3,1)
    eig1 = data_test_projected[:,0]
    eig2 = data_test_projected[:,1]
    eig3 = data_test_projected[:,2]
    ax1.plot(eig1[0:10], eig2[0:10], 'bo')
    ax1.plot(eig1[10:20], eig2[10:20], 'ro')
    ax1.plot(eig1[20:30], eig2[20:30], 'go')
    ax1.plot(eig1[30:40], eig2[30:40], 'yo')
    
    
    ax2 = fig.add_subplot(1,3,2)
    ax2.plot(eig1[0:10], eig3[0:10], 'bo')
    ax2.plot(eig1[10:20], eig3[10:20], 'ro')
    ax2.plot(eig1[20:30], eig3[20:30], 'go')
    ax2.plot(eig1[30:40], eig3[30:40], 'yo')

    ax3 = fig.add_subplot(1,3,3)
    ax3.plot(eig2[0:10], eig3[0:10], 'bo')
    ax3.plot(eig2[10:20], eig3[10:20], 'ro')
    ax3.plot(eig2[20:30], eig3[20:30], 'go')
    ax3.plot(eig2[30:40], eig3[30:40], 'yo')

    labels=[0,1,2,3,4,5,6,7,8,9]*4
    for i, txt in enumerate(labels):
        ax1.annotate(txt, (eig1[i], eig2[i]))
        ax2.annotate(txt, (eig1[i], eig3[i]))
        ax3.annotate(txt, (eig2[i], eig3[i]))
    plt.show()

In [0]:
eig1 = data_test_projected[:,0]
eig2 = data_test_projected[:,1]

'''
Reconstructing the data, using p first principal components
'''
data_test_reconstructed = my_pca.reconstruct(data_test_projected, p, show=False)


'''
Visualization of the reconstructed data
'''

plt.figure() 
fig, ax = plt.subplots(1, 1, figsize=(16, 16), sharex=True, sharey=True)
ax.grid()
ax.plot(eig1, eig2, 'bo')
ax.set_title("Visualization of the reconstructed data using p components onto 2 first PC")

for x_, y_, img_vector_ in zip(eig1, eig2, data_test_reconstructed):
    img = my_reshape(img_vector_ + data_train_mean, sq_size, color)
    ab = AnnotationBbox(OffsetImage(img, cmap = my_color_map), (x_, y_), frameon=False)
    ax.add_artist(ab)



plt.figure() 
fig, ax = plt.subplots(1, 1, figsize=(8, 8), sharex=True, sharey=True)
eig1 = data_test_projected[:,0]
eig2 = data_test_projected[:,1]
ax.plot(eig1[0:10], eig2[0:10], 'bo')
ax.plot(eig1[10:20], eig2[10:20], 'ro')
ax.plot(eig1[20:30], eig2[20:30], 'go')
ax.plot(eig1[30:40], eig2[30:40], 'yo')
ax.legend(["personA", "personB", "personC", "personD"])
ax.set_title("Visualization of the reconstructed data using p components onto 2 first PC")

labels=[0,1,2,3,4,5,6,7,8,9]*4
for i, txt in enumerate(labels):
    ax.annotate(txt, (eig1[i], eig2[i]))
plt.show()



The plots of the test reconstructions confirm the intuition behind the eigenfaces:
- Emma Stone, in blue, is mainly on the top, while Bradley Cooper, in red, is mainly on the bottom. They are well separated along the 2nd eigenface, which seems related to the shape of the face, with a very dark part on the bottom right.
- Jane Levy, in green, a woman who, to a human eye, resembles Emma Stone, is mainly on the top of the plot,
- Marc Blucas, in yellow, is a white man similar to Bradley Cooper. While half of the points are located in the lower left half, where most of the Bradley Cooper images also are, the rest of the Marc Blucas images are in the middle of the blue and green points.

The first two eigenfaces, or principal components, already give us some of the important information present in the data, even if their cumulative explained variance - fitted on the training set inputs - was actually not that high!

Although we will certainly not modify any hyper-parameters based on the test set images, it is still interesting to reproduce the metrics that we built for the training set and the choice of an optimal $p$. It is interesting to answer such questions as:
- what is the final relative reconstruction error ?
- How does it evolve with p ?
- Was $p$ a nice choice, regarding the test sets ?



In [0]:
'''
np array containing all the rmse computed

if n = 40, max number of components, then rmse has a size 40x41 (0 -> 40 included)
> a row matches the rmse of one image wrt the dimension reconstructed. 
Last column should NOT be 0 as we compute the reconstruction of test images (= with loss)
'''
n = m_test_src.shape[0]

rmse_test = np.empty((n,n+1))
rmse_test_pc = np.empty((n,n+1)) # rmse in percentage
rmse_test_base = np.empty((n,))
for i in range(n):
    rmse_test_base[i]=my_pca.compute_error(data_test[i], X_train_mean)


index_image = 0
for img_center_vector in data_test_centered:
    X_test_centered_reduced = my_pca.projectPC(img_center_vector, n)
    for k in range(n+1):
        # from 0 to n, included (k = 0: mean image only)
        if k == 0:
            rmse_test[index_image, k] = rmse_test_base[index_image]
            rmse_test_pc[index_image, k] = 100 * rmse_test[index_image,k] / rmse_test_base[index_image]
        else:
            # logging.debug("reconstructing using " + str(p) + " principal components")
            X_test_hat_centered = my_pca.reconstruct(X_test_centered_reduced, k)
            rmse_test[index_image,k] = my_pca.compute_error(img_center_vector, X_test_hat_centered)
            rmse_test_pc[index_image, k] = 100 * rmse_test[index_image,k] / rmse_test_base[index_image]
        
    index_image += 1

'''
Visualization !
'''

fig = plt.figure(figsize = (16,16))
ax1 = fig.add_subplot(2,2,1)

for idx in range(n):
    rmse_ = rmse_test[idx,:]
    ax1.plot([i for i in range(0,n+1)],rmse_, '-')
ax1.set_title("reconstruction errors (RMSE) for all test images")
ax1.set_xlabel("reconstruction dimension(s) \'p\' ")
ax1.set_ylabel("RMSE")

rmse_test_mean = np.mean(rmse_test, axis = 0)
ax2 = fig.add_subplot(2,2,2)
ax2.plot([i for i in range(0, n+1)], rmse_test_mean, "ro-")
ax2.set_title("Mean of RMSE for all test images")
ax2.set_xlabel("reconstruction dimension(s) \'p\' ")
ax2.set_ylabel("mean of RMSEs")

ax1.set_ylim((0,80))
ax2.set_ylim((0,80))
ax1.grid()
ax2.grid()

ax3 = fig.add_subplot(2,2,3)

for idx in range(n):
    rmse_ = rmse_test_pc[idx,:]
    ax3.plot([i for i in range(0,n+1)],rmse_, '-')
ax3.set_title("reconstruction errors (RMSE) for all test images, in %")
ax3.set_xlabel("reconstruction dimension(s) \'p\' ")
ax3.set_ylabel("RMSE (% of base error)")

rmse_test_pc_mean = np.mean(rmse_test_pc, axis = 0)
ax4 = fig.add_subplot(2, 2,4)
ax4.plot([i for i in range(0, n+1)], rmse_test_pc_mean, "ro-")
ax4.set_title("Mean of RMSE for all test images, in %")
ax4.set_xlabel("reconstruction dimension(s) \'p\' ")
ax4.set_ylabel("mean of RMSEs (% of base error)")

ax3.set_ylim((0,110))
ax4.set_ylim((0,110))
ax3.grid()
ax4.grid()


*Observations*
- the reconstruction loss, computed in the same fashion as for the training set images, also decreases as $p$ increases
- the slope becomes nearly 0 around $p \approx 35$, which supports the choice of $p=35$. However, assessing this parameter on a validation set, or even better with cross-validation (or leave-one-out cross-validation), would be preferable. On the test set, one could also argue that not much information is gained from the components after the 20th.
- using all the components, the remaining reconstruction error is still about 60% of the base error. There is "no way" to do better with components learnt from the training set only.
> as a reminder, the "base" error is the RMSE between the input image and the mean image of the training set. 
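
The behaviour described in these observations can be reproduced on synthetic data. The block below is a minimal sketch, not the `MyPCA` implementation: the data are random vectors standing in for flattened images, the helper names `rmse` and `rmse_curve` are hypothetical, and the principal directions come from a plain `np.linalg.svd` of the centered training data. It illustrates why the error reaches (numerically) zero on the training set but not on unseen data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for flattened images (hypothetical sizes, not the notebook's data).
X_train_demo = rng.normal(size=(40, 256))
X_test_demo = rng.normal(size=(10, 256))

mean_demo = X_train_demo.mean(axis=0)
Xc_train = X_train_demo - mean_demo
Xc_test = X_test_demo - mean_demo  # test data are centered with the *training* mean

# Rows of Vt are the principal directions of the centered training data.
_, _, Vt = np.linalg.svd(Xc_train, full_matrices=False)

def rmse(a, b):
    """Per-sample root mean squared error."""
    return np.sqrt(np.mean((a - b) ** 2, axis=1))

def rmse_curve(Xc, Vt):
    """Entry [i, k]: RMSE of sample i rebuilt from the first k components."""
    n_comp = Vt.shape[0]
    curve = np.empty((Xc.shape[0], n_comp + 1))
    for k in range(n_comp + 1):
        V_k = Vt[:k]                  # shape (k, d); empty when k = 0
        X_hat = (Xc @ V_k.T) @ V_k    # project, then reconstruct
        curve[:, k] = rmse(Xc, X_hat)
    return curve

train_curve = rmse_curve(Xc_train, Vt)
test_curve = rmse_curve(Xc_test, Vt)
# Column 0 is the base error (distance to the training mean).
test_curve_pc = 100 * test_curve / test_curve[:, [0]]
```

On such data the training curve ends at zero, while the test curve decreases but stays well above zero, which is the qualitative picture of the plots above.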

### PCA Conclusion

In this section, we covered a lot :-)! We tried to give the basic idea of the technique, and explained the pre-processing steps required to obtain genuine results using PCA (resizing, centering). 

Then, we covered some of the math behind the technique, and discussed the nominal *eigendecomposition*, the associated mathematical *trick*, and the *singular value decomposition*. Those three methods have been fully implemented in a class `MyPCA` and tested against each other. 

Using `MyPCA`, we furthermore detailed and visualized what the eigenfaces are, and discussed the reconstruction used to recover our original data. This involves the choice of an *optimal* $p$, the number of components used, which is a trade-off between information loss and dimensionality reduction. We discussed this topic at length and showed one way to choose $p$. 

Using this number, we finally plotted the train images **and** the test images in the space of the first 2 components, and discussed those results.

Finally, we also repeated some steps about eigenfaces generation and reconstruction using the well-known and optimized library `sklearn`, which confirmed all the results obtained using the homemade implementation.

If you've reached this line: Congrats! I know it's dense, but it's worth it!
More to come...

## Transfer Learning

*Doing this project alone as a working student, this part can be skipped*

## Features 2D Visualizations

t-SNE (*t-Distributed Stochastic Neighbor Embedding*) is a dimensionality reduction technique used to visualize high-dimensional data in 2D (or 3D) space. Other dimensionality reduction techniques often rely on the variance only, while t-SNE relies on the probability that two points are similar (or not).

We won't go through the details of the technique, but you can surely find the theory in the [original paper](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf), and some other complete tutorial in [towardsdatascience](https://towardsdatascience.com/t-sne-python-example-1ded9953f26) for instance. 


Using this dimensionality reduction technique, and its `sklearn` implementation, we can try to visualize the feature representations we built.


This technique is not trivial, and some interesting remarks and insights are explained in [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/). In particular, it's important to note the hyperparameters:
- `perplexity`, which intuitively is *a guess of the number of close neighbors each point has*
- `learning_rate`, very common in iterative methods. 

Those two parameters are tailored for our application.
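
To get a feeling for the effect of `perplexity`, one can embed the same data with a few different values and compare the plots. The block below is only a sketch under assumptions: the data are two synthetic clusters (not the HOG or PCA features of this notebook), and the perplexity values are arbitrary picks. Note that `perplexity` must stay below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters of 20 points in 50 dimensions (placeholder data).
X_clusters = np.vstack([rng.normal(0.0, 1.0, size=(20, 50)),
                        rng.normal(4.0, 1.0, size=(20, 50))])

embeddings = {}
for perplexity in (5, 20, 30):
    tsne_demo = TSNE(n_components=2, perplexity=perplexity,
                     learning_rate=50.0, init="random", random_state=0)
    # Each value of perplexity gives its own 2D embedding of the same data.
    embeddings[perplexity] = tsne_demo.fit_transform(X_clusters)
```

Each entry of `embeddings` can then be passed to a scatter plot to compare how the clusters are laid out.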



Let's specify what the labels are on the training set:
- personA, Emma Stone: 0
- personB, Bradley Cooper: 1


In [0]:
y_train = np.ones((40,))
y_train[0:20] = 0

logging.info("y_train shape   : " + str(y_train.shape))
logging.info("y_train sum [20]: " + str(sum(y_train)))

Utility function to generate a colored scatter plot, inspired by [datacamp](https://www.datacamp.com/community/tutorials/introduction-t-sne)

In [0]:
# Utility function to visualize the outputs t-SNE
# inspired by https://www.datacamp.com/community/tutorials/introduction-t-sne

def tsne_scatter(x, colors, title):
    # choose a color palette with seaborn.
    num_classes = len(np.unique(colors))
    palette = np.array(sns.color_palette("bright", num_classes))
    # palette = sns.color_palette("bright", num_classes)
    # create a scatter plot.
    f = plt.figure(figsize=(8, 8))
    ax = f.add_subplot(111)
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40, c=palette[colors.astype(int)])
    ax.set_title(title)
    ax.grid()

    # add a text label for each class, named after the class index
    txts = []

    for i in range(num_classes):
        # Position of each label at median of data points.
        xtext, ytext = np.median(x[colors == i, :], axis=0)
        txt = ax.text(xtext, ytext, str(i), fontsize=24)
        txts.append(txt)

    plt.show()

Let's generate a random number based on a seed, for the sake of reproducibility

In [0]:
random.seed(8042020)
rand_nb = random.randrange(0,1000)

logging.info("Random Number generated is: " + str(rand_nb))

Now, we can call the `sklearn` implementation of t-SNE, using two tailored parameters:
- `perplexity` is set to the theoretical number of neighbours, which we know at this point (20 images per person), 
- `learning_rate` is set to a small value, which seems to give a good balance. 

In [0]:
tsne_hog = sklearn.manifold.TSNE(random_state=rand_nb, perplexity=20, learning_rate=50.0)
X_HOG_embedded = tsne_hog.fit_transform(X_HOG_train)

tsne_scatter(X_HOG_embedded, y_train, "Projection of HOG features of the training set using t-SNE")

In [0]:
X_train = get_matrix_from_set(training_set, color, sq_size,flatten=True)
p = 35
pca_ = sklearn_decomposition_PCA(n_components=p) 
pca_.fit(X_train)
X_PCA_train = pca_.transform(X_train)
logging.info("X_PCA_train shape: " + str(X_PCA_train.shape))


In [0]:
tsne_pca = sklearn.manifold.TSNE(random_state=rand_nb, perplexity=20, learning_rate=50.0)
X_PCA_embedded = tsne_pca.fit_transform(X_PCA_train)

tsne_scatter(X_PCA_embedded, y_train, "Projection of PCA features of the training set using t-SNE")

### *Comments on the t-SNE plots*

Before going into the comments, let's first recall three of the six key messages taken from the deep analysis in [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/).
1. Hyper-parameters really matter,
2. Cluster sizes in a t-SNE plot mean nothing,
3. Distances between well-separated clusters in a t-SNE plot may mean nothing


> This analysis was performed with `color = False`, meaning that the PCA is computed on grayscale images. The results, and specifically the t-SNE plots, are of course influenced by this setting.

---

The two plots, one per feature representation, are built with the hyperparameter `perplexity` set to the theoretical number of neighbours. The hyperparameter `learning_rate` was also adjusted to suit the problem. Changing those values changes the results; as is, they seem to give pretty *interesting* results. 


 

### *Interesting* result?

Those two graphs are based on the feature representations built: HOG and PCA. Those are high-dimensional features, and the t-SNE technique is applied to project them, in a non-linear fashion, onto a 2D figure. 

The two plots show what we *hope* to find: some notion of distance/similarity between points of the same class. 
- the plots indicate that in both cases, the features seem to be (mostly) separable between the classes, even if not linearly in 2D after projection. This is a nice and promising result to build upon. 
- As reminded above, the sizes of the clusters and the distances between them should not really matter in the analysis.
- As announced by the literature, the `perplexity` hyperparameter matters a lot.


Intuitively, one could say -- based on the plots above -- that several classifiers may work better than others for certain points. 
> for instance, 3-NN may not work well if an image has a HOG feature representation projected onto [2, 4.5] or a PCA feature projected onto [2, 4] by the t-SNE transformation. 

This kind of intuition may be erroneous, due to the high non-linearity inherent to this transformation. Let's be cautious then, and verify those feelings/ideas in the next steps (see Identification part).
While t-SNE is a great technique for visualization purposes, one should remain careful about the conclusions drawn from the plots.

In [0]:
# tsne of the test set for the HOG feature

# tsne_hog_test = sklearn.manifold.TSNE(random_state=rand_nb, perplexity=10, learning_rate=50.0)
# X_HOG_embedded = tsne_hog.fit_transform(X_HOG_test)
# y_test = np.ones((40,))
# y_test[0:10]=0
# y_test[20:30]=2
# y_test[30:40]=3
# tsne_scatter(X_HOG_embedded, y_test, "Projection of HOG features of the training set using t-SNE")

# Exploit Feature Representations

In this part, we will use the representations learnt before in order to build a classifier and identification system. 
> From an academic standpoint, I was exempted from covering the *Classification* topic, being alone on this project as a working student. I appreciate the flexibility. However, as part of the *Impress your TA's*, and as this ought to be a *fun* part, I have decided to cover this topic as well.

In the next two parts I will construct a classification and an identification system for each of the feature representations, HOG then PCA, and I will qualitatively and quantitatively compare the results obtained. 

> In the analysis, unless specified otherwise, we assume `color=False` and `sq_size=64`. Results with other parameters may differ.

In [0]:
def show_missed(X_test, y_test, y_predict):
    missed = np.where( np.array(y_test != y_predict))
    correct = np.where( np.array(y_test == y_predict))

    if len(missed[0]) > 0:
        logging.info("Mis-classified images: ")
        logging.info("Index: " + str(missed[0]))
        plot_matrix(X_test[missed[0],:], color, my_color_map, h=1, w=len(missed[0]))
    
    logging.info("Properly classified images: ")
    logging.info("Index: " + str(correct[0]))
    plot_matrix(X_test[correct[0],:], color, my_color_map, h=2, w=1+len(correct[0])//2)

## Classification - howto
While the following sections may appear quite dense, the overall idea is the following:
1. pre-process data
    1. Resizing to an appropriate size, as already discussed in the first parts of this tutorial. Typically, the size is smaller than all images in the training and test sets. 
    2. color or grayscale, as this notebook is fully compatible with both.
    3. shuffle the (ordered) training set
    
2. compute feature representation, as described in the previous part
3. train the corresponding classifier using training sets

4. apply the classifier on unseen images (test)
    * apply same preprocessing as for training (except shuffling)
    * predict
    * compute metrics
    * observations and discussions

In case you're lost, don't hesitate to go back a few steps - keeping an eye on the table of contents.
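
The four steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the 8x8 "images" are random arrays, `make_set` and `flatten_images` are hypothetical helpers, and plain flattening stands in for the HOG or PCA feature representation used later in the notebook.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_set(n_per_class):
    """Synthetic 8x8 'images': class 1 is brighter than class 0."""
    X0 = rng.normal(0.0, 1.0, size=(n_per_class, 8, 8))
    X1 = rng.normal(3.0, 1.0, size=(n_per_class, 8, 8))
    return np.concatenate([X0, X1]), np.repeat([0, 1], n_per_class)

def flatten_images(X):
    """Placeholder feature representation (stands in for HOG or PCA)."""
    return X.reshape(len(X), -1)

# 1. pre-process: build the sets and shuffle the (ordered) training set
X_tr, y_tr = make_set(20)
X_te, y_te = make_set(10)
perm = rng.permutation(len(y_tr))
X_tr, y_tr = X_tr[perm], y_tr[perm]

# 2. compute the feature representation
F_tr = flatten_images(X_tr)

# 3. train the classifier on the training features
clf_demo = SGDClassifier(random_state=42, max_iter=1000, tol=1e-6)
clf_demo.fit(F_tr, y_tr)

# 4. same preprocessing on unseen images (no shuffling), predict, compute a metric
F_te = flatten_images(X_te)
y_hat = clf_demo.predict(F_te)
acc_demo = accuracy_score(y_te, y_hat)
```

The real sections follow the same skeleton, with more care taken at each step.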


### Preprocess the data

- Let's just check the "meta-hyper-parameters": color and size.


In [0]:
logging.info("smallest px shall be less than: " + str(min(min(get_min_size(training_set)), min(get_min_size(test_set)))))
logging.info("hyper-parameter sq_size : " + str(sq_size))
logging.info("hyper-parameter color   : " + str(color))

The smallest face crop that we have in training set and test set is (70,70), so we can without issue rescale all our face crops to (64,64) as part of the data pre-processing. 

> In this notebook, this resizing is handled with the parameter `sq_size = 64`, set at the very beginning

- Get raw training data.

Those are the raw faces cropped from personA and personB. 
In order to be sure not to be disturbed by a previous execution or a previous code snippet, let's just get those data again.
> repeating this step is not a performance issue, considering the relatively small amount of data we actually deal with.

- Resize the raw images to a common image size

The image size is defined by `sq_size`, and we need to resize the raw training images accordingly.

Those two actions are performed in a single function. Code is of course provided. 


In [0]:
X_train = get_matrix_from_set(training_set, color, sq_size = sq_size, flatten = False)
X_test = get_matrix_from_set(test_set, color, sq_size = sq_size, flatten = False)
y_train = np.zeros((40,))
y_train[20:40] = 1
 
'''
For now, set up "0" for personC; "1" for personD !
see discussion in a later cell.
'''
y_test = np.zeros((40,))
y_test[10:20] = 1
y_test[30:40] = 1

#### Shuffling

The training methods are of course numerical methods, for which the ordering of the training inputs may have an influence. In order to limit this bias as much as possible, the classifier `fit` method automatically shuffles the training data after each epoch, unless specified otherwise. We leave the default parameter to ensure this shuffling. Nonetheless, as the training data are currently completely ordered, we introduce a pre-shuffling at this stage, before starting the very first run.

For the sake of repeatability, we seed the RNG, as before in this tutorial.
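
What matters in this step is that images and labels receive the *same* permutation. As a hedged alternative to re-seeding the global RNG twice, the sketch below draws one explicit permutation and applies it to both arrays. The function `shuffle_together` and the toy data are hypothetical and not part of the notebook; unlike the in-place version used below, it returns shuffled copies.

```python
import numpy as np

def shuffle_together(X, y, seed=0):
    """Return shuffled copies of X and y that share one permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    return X[perm], y[perm]

# Toy data: sample i is filled with the value i, and its label is i // 2.
X_toy = np.repeat(np.arange(6), 4).reshape(6, 2, 2)
y_toy = np.arange(6) // 2

X_toy_shuffled, y_toy_shuffled = shuffle_together(X_toy, y_toy)
# Each shuffled sample still carries the label it had before shuffling.
still_aligned = np.array_equal(X_toy_shuffled[:, 0, 0] // 2, y_toy_shuffled)
```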



In [0]:
def shuffle_training(X_train, y_train):
    np.random.seed(0)
    np.random.shuffle(X_train)
    np.random.seed(0)
    np.random.shuffle(y_train)
    np.random.seed()

shuffle_training(X_train, y_train)  

In order to keep those data, as we may need them in a future (optimization ;-) ) step, let's create some "backup" variables. It is ok to do that as the amount of data remains small.

In [0]:
X_train_HOG_shuffled_back = X_train.copy()
y_train_HOG_shuffled_back = y_train.copy()
X_test_HOG_back = X_test.copy()
y_test_HOG_back = y_test.copy()


### Visualization of the training images (shuffled)


In [0]:
logging.info("Training set :")
logging.info("Training set shuffled [0 -> 19]:")
plot_matrix(X_train[0:20,:], color, my_color_map, h=1, w=20, transpose = False)
logging.info("Training set shuffled [20 -> 39]:")
plot_matrix(X_train[20:40,:], color, my_color_map, h=1, w=20, transpose = False)


 Obviously, we see that Emma Stone and Bradley Cooper are now interleaved (at least, their images)

### Visualization of the test images

In [0]:
logging.info("Test set :")
logging.info("Test set PersonA [0 -> 9]:")
plot_matrix(X_test[0:10, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonB [10 -> 19]:")
plot_matrix(X_test[10:20, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonC [20 -> 29]:")
plot_matrix(X_test[20:30, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonD [30 -> 39]:")
plot_matrix(X_test[30:40, :], color, my_color_map, h=1, w=10, transpose = False)

### Standard scaler

A recurring question is "should we scale our features or not?". Scaling as in "bring every feature to zero mean and unit variance".

While one could argue it's always better, in the context of this tutorial, we will not. 
The inputs are in essence pixel intensities: they are already on the same scale, and there are no order-of-magnitude differences between them. Of course, this does not guarantee that the feature representations themselves are scaled... Yet, it does not seem to matter very much in our problem.

As it does not strictly contribute to the educational goal of this tutorial, I decided not to include the scaler step in the different systems (or Pipelines, as we will call them). 
Nonetheless, if one wants to try it out, a common scaler is `sklearn.preprocessing.StandardScaler()`.

Even though the data are not scaled, they **do need** to be centered for the PCA technique, as discussed previously. This does not change. 
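
For readers who want to try the scaler anyway, here is a minimal sketch of how `StandardScaler` would be used. The feature matrices are synthetic placeholders (three features with very different spreads), not the HOG or PCA features of this notebook. The statistics are learnt on the training features only and reused on the test features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder features whose columns have very different scales.
F_train_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(40, 3))
F_test_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(20, 3))

scaler = StandardScaler()
F_train_scaled = scaler.fit_transform(F_train_demo)  # learns mean and std here
F_test_scaled = scaler.transform(F_test_demo)        # reuses the training statistics
```

After scaling, every training feature has zero mean and unit variance; the test features are only approximately standardized, since they are scaled with the training statistics.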

## HOG Classification

In this section, we focus on the classifier based on the HOG feature representation.

Utility function to plot a color face and its HOG descriptor side by side, for educational purposes

In [0]:
def show_one_image_hog(idx_of_interest, person=personA, set="training"):
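    if set == "training":
        source_set = training_set
        hog_set = hog_training
    elif set == "test":
        source_set = test_set
        hog_set = hog_test
    else:
        # fail early instead of hitting an undefined variable below
        raise ValueError("set must be 'training' or 'test', got: " + str(set))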


    image_of_interest = source_set[person][idx_of_interest]
    hog_of_interest = hog_set[person][idx_of_interest]

    '''
    Visualization of the image and its hog selected as image_of_interest
    '''
    fig, (ax0, ax1) = plt.subplots(1,2,figsize = (8,4), sharex=False, sharey=False)

    ax0.imshow(cv2.cvtColor(image_of_interest, cv2.COLOR_BGR2RGB))
    ax0.set_title("Face \'barely\' properly classified")

    ax1.imshow(hog_of_interest[1])
    ax1.set_title("Visualization of the HOG of interest")

    logging.info("Shape of the descriptor       : " + str(hog_of_interest[0].shape))
    logging.info("Shape of the descriptor (visu): " + str(hog_of_interest[1].shape))
    logging.info("Shape of the image of interest: " + str(image_of_interest.shape))
    plt.show()

### Construction of HOG Transformer
Following the method and advice of [Kapernikov](https://kapernikov.com/tutorial-image-classification-with-scikit-learn/)

#### [Kézako](https://forum.wordreference.com/threads/k%C3%A9zako.245210/)?
We won't go into the details of the software design pattern behind the Transformers built by `sklearn`, but in essence, a transformer is a class that takes some input and performs some transformation on it, depending (possibly) on extra parameters. 

We actually already used such a transformer during the PCA demo using `sklearn`, and more specifically `sklearn.decomposition.PCA`. This `PCA` class is a `Transformer` because it inherits from `BaseEstimator` and `TransformerMixin`.


In [0]:
import inspect
tuple_ancestor = inspect.getmro(sklearn_decomposition_PCA)
for ancestor_class in tuple_ancestor:
    print("child of " + str(ancestor_class))

In particular, `sklearn.decomposition.PCA` implements -among others- the methods `fit` and `transform`, which are required to be called a `Transformer`. A Transformer helps in defining a systematic way of performing some actions on the data. 

We can build our own transformer, called `HogTransformer`, make it inherit from `BaseEstimator` and `TransformerMixin`, and benefit from the same capabilities. 

*Don't worry if it's still fuzzy, it'll become clearer when we will use it.*



In [0]:
class HogTransformer(BaseEstimator, TransformerMixin):
    '''
    Expects an array of 2D arrays (1 channel images)
    Calculates hog features for each image
    '''
    def __init__(self, y=None, 
                 orientations = 9, 
                 pixels_per_cell = (8,8), 
                 cells_per_block = (2,2),
                 block_norm = "L2-Hys", 
                 transform_sqrt = False,
                 multichannel = False):
        self.y = y
        self.orientations = orientations
        self.pixels_per_cell = pixels_per_cell
        self.cells_per_block = cells_per_block
        self.block_norm = block_norm
        self.transform_sqrt = transform_sqrt
        self.multichannel = multichannel # default is grayscale
    
    def fit(self, X, y=None):
        logging.debug("[HOGTransformer.fit] X.Shape " + str(X.shape))
        return self
    
    def transform(self, X, y=None):
        logging.debug("[HOGTransformer.transform] X.Shape " + str(X.shape))

        def local_hog(X):
            if self.multichannel:
                X_ = X.copy() #.T
                # logging.debug("TO CHECK IF TRANSPOSE STILL NEEDED ?")
                # logging.debug("[HOGTransformer.transform] (1) X.Shape " + str(X.shape))
                # logging.debug("[HOGTransformer.transform] (2) X_.Shape " + str(X_.shape))
            else:
                X_ = X.copy()
            # logging.debug("[HOGTransformer.transform.local_HOG]" )
            # cv2_imshow(X_)
            return skimage_feature_hog(X_,
                                       orientations = self.orientations, 
                                       pixels_per_cell = self.pixels_per_cell,
                                       cells_per_block = self.cells_per_block,
                                       block_norm = self.block_norm,
                                       visualize = False, 
                                       transform_sqrt = self.transform_sqrt, 
                                       feature_vector = True, 
                                       multichannel = self.multichannel)

        try: 
            # tmp = [str(image.shape) for image in X]
            # logging.debug( str(tmp) )
            return np.array([local_hog(image) for image in X])
        except ValueError as ve:
            logging.error(str(ve))
        except NameError as ne:
            logging.error(str(ne))


/!\ A careful reader could recommend also building a Transformer for the color conversion according to the `color` attribute, as well as for the resizing according to the `sq_size` attribute. 

==> That is completely true!

Nonetheless, this part of the code was written much later than the pre-processing steps, and such Transformers are not required to cover the full scope of this tutorial. 
That being said, a future version could indeed replace the utility functions handling the `color` and `sq_size` parameters with Transformers.
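
To give an idea of what such a Transformer could look like, here is a sketch of the color-conversion part only. The class `GrayscaleTransformer` is hypothetical and not part of this notebook; it assumes images stored as arrays with channels in BGR order (the OpenCV convention) and uses the usual luminance weights instead of calling `cv2`. Resizing is left out.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class GrayscaleTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: converts BGR images to grayscale unless color=True."""

    def __init__(self, color=False):
        self.color = color

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if self.color or X.ndim == 3:
            return X  # keep the colors, or the images are already single-channel
        # Weighted sum over the last axis, channels assumed in BGR order.
        weights = np.array([0.114, 0.587, 0.299])
        return X @ weights

images_demo = np.random.default_rng(0).integers(0, 256, size=(5, 64, 64, 3))
gray_demo = GrayscaleTransformer(color=False).fit_transform(images_demo)
kept_demo = GrayscaleTransformer(color=True).fit_transform(images_demo)
```

Because it follows the `fit`/`transform` contract, such a class could be chained in front of the `HogTransformer` in a `sklearn` Pipeline.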


From the previous section, the training data `X_train` and `y_train` are ready. Let's apply (= fit, then transform) the `HogTransformer`. 

We log the final shape of the training data hog representations.

In [0]:

# scalify = StandardScaler()
hogify = HogTransformer(    orientations = 9, 
                            pixels_per_cell = (8,8), 
                            cells_per_block = (2,2),
                            block_norm = "L2", 
                            transform_sqrt = True,
                            multichannel = color)

X_train_hog = hogify.fit_transform(X_train)
# X_train_prepared = scalify.fit_transform(X_train_hog)
X_train_hog_prepared = X_train_hog

logging.info("X_train prepared for HOG classification. Shape: " + str(X_train_hog_prepared.shape))

Now that we have the training data ready to train the classifier, let's go!

Many possibilities exist; let's try out a stochastic gradient descent classifier from `sklearn`.
As a first step, we can leave most of the parameters as is. With the default loss function (`hinge`) and penalty (`l2`), this gives a linear SVM classifier, see [sklearn SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) for more details.

Once the classifier is created, we can call its `fit` method to train it. 


#### Training

In [0]:
sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-6, verbose = 1, shuffle=True)
sgd_clf.fit(X_train_hog_prepared, y_train)

The classifier is now trained.

We can test the result on the test sets. 

> Important note: no validation set was dedicated to parameter evaluation. While using one is good practice, it's not crucial for the moment as the goal is to demonstrate the methodology. In a few sections, when we discuss the optimization of the classifier, we will perform cross-validation using dedicated tools from `sklearn`.


We need to apply the same pre-processing steps as we did on the training set. As we already resized and set up the color according to global hyperparameters, we just need to apply the `transform` method of the `HogTransformer`.

> Why not fit? The `fit` method must never be applied to the test data, as it implies tuning the Transformer to the data, which we don't want with the test set. For the present HOG case, it does not change anything as nothing is done in the method; for other cases such as PCA, this is highly important as we don't want the Principal Components to be modified by the test set.
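
The PCA case mentioned in this note can be checked with a small experiment. The sketch below uses synthetic data and `sklearn.decomposition.PCA` directly (not the notebook's variables): `transform` on the test set leaves the learnt components untouched, whereas a `fit` on the test set yields different components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_fit_demo = rng.normal(size=(40, 100))     # plays the role of the training set
X_unseen_demo = rng.normal(size=(20, 100))  # plays the role of the test set

pca_demo = PCA(n_components=5)
pca_demo.fit(X_fit_demo)                    # components learnt from training data only
components_before = pca_demo.components_.copy()

Z_unseen = pca_demo.transform(X_unseen_demo)  # projection only, no re-fitting
components_unchanged = np.array_equal(components_before, pca_demo.components_)

# Calling fit on the test set instead would produce other components.
pca_refit = PCA(n_components=5).fit(X_unseen_demo)
components_differ = not np.allclose(components_before, pca_refit.components_)
```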


### Test on personA and personB

#### Gathering and processing the test data

Those two persons correspond to the persons of the training set. Their indices in the test set range from 0 to 19 (included), so we can create `X_test_ab`, which contains the test inputs of those 2 persons only.

Similarly, we extract the labels in `y_test_ab`

In [0]:
X_test_ab = X_test[0:20,:]
y_test_ab = y_test[0:20]

print("X_test    shape -> " + str(X_test.shape))
print("X_test_ab shape -> " + str(X_test_ab.shape))
print("sum y_test_ab [10]      -> " + str(sum(y_test_ab)))

'''
Application of the hog transform
'''
X_test_hog_ab = hogify.transform(X_test_ab)
X_test_hog_ab_prepared = X_test_hog_ab




As before, let's save those as backup variables

In [0]:
X_test_HOG_ab_back = X_test_ab.copy()
y_test_HOG_ab_back = y_test_ab.copy()


#### Tests and metrics

We use our classifier to predict the labels of the test inputs. 

In [0]:
y_pred_hog_ab = sgd_clf.predict(X_test_hog_ab_prepared)

Now, we can compute the accuracy for the classifier

In [0]:
accuracy = np.sum(y_pred_hog_ab == y_test_ab)/len(y_test_ab)
logging.info("Accuracy: " + str(accuracy) )

Note that if we are really not sure about how to compute the accuracy, or if we prefer to use library functions, we can!

In [0]:
logging.info("Accuracy [SKlearn]: " + str(metrics.accuracy_score(y_test_ab, y_pred_hog_ab )))

Luckily, both computations give the same answer ;-)!

##### Summary
 
- we built a Transformer that turns the input (= raw images) into HOG features
- we created a stochastic gradient descent classifier
- we trained this classifier using the HOG features
- we transformed the test set raw images into HOG features
- we tested the classifier against those test HOG features, for personA (= class 0) and personB (= class 1)
- we computed the accuracy

It is not 100%, and it may be interesting to observe the failures.

#### Visualization: Confusion Matrix

Let's create a (very simple) confusion matrix. 
The goal is to see where the faulty classifications are. In other words, in the simple classification that we ran, which images failed?
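
In the spirit of coding key functionalities by hand before calling the library, a confusion matrix can be computed in a few lines. The sketch below uses made-up labels (not the results of this notebook) and a hypothetical helper `confusion_counts`. Rows are the true classes and columns the predicted classes, which is also the convention used by `sklearn`.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true.astype(int), y_pred.astype(int)):
        cm[t, p] += 1
    return cm

# Made-up labels for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0])

cm_demo = confusion_counts(y_true_demo, y_pred_demo, n_classes=2)
# Row-normalized version: each row sums to 1 (fractions per true class).
cm_norm_demo = cm_demo / cm_demo.sum(axis=1, keepdims=True)
```

The diagonal holds the correct classifications; every off-diagonal entry tells which class was confused with which.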

In [0]:
from sklearn.metrics import confusion_matrix
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it.
from sklearn.metrics import ConfusionMatrixDisplay

'''
Confusion matrix
'''
def my_plot_confusion_matrix(classifier, X_test, y_test, true_labels, predicted_labels=None):
    # by default, use the same names on both axes
    if predicted_labels is None:
        predicted_labels = true_labels
    
    # one plot with raw counts, one normalized per true class
    titles_options = [("Confusion matrix, without normalization", None),
                    ("Normalized confusion matrix", 'true')]
    for title, normalize in titles_options:
        disp = ConfusionMatrixDisplay.from_estimator(classifier, 
                                    X_test, 
                                    y_test,
                                    display_labels=None,
                                    cmap=plt.cm.viridis,
                                    normalize=normalize,
                                    include_values=True, 
                                    xticks_rotation='horizontal', 
                                    values_format=None)
        disp.ax_.set_title(title)
        disp.ax_.set_xticklabels(predicted_labels)
        disp.ax_.set_yticklabels(true_labels)

        plt.show()



In [0]:
true_labels= ["Emma Stone", "Bradley Cooper"]
my_plot_confusion_matrix(sgd_clf,
                         X_test_hog_ab_prepared, 
                         y_test_ab,
                         true_labels = true_labels,
                         predicted_labels = true_labels) 

The result is pretty clear: all the images that were misclassified (4) are Emma Stone faces that were classified as Bradley Cooper faces.

Let's dig into the analysis to see precisely which images those are.
- let's find the indices of the misclassified images, 
- let's show those test images, separated from all the others.

In [0]:
mis_idx = np.where( np.array(y_test_ab - y_pred_hog_ab) != 0)
logging.info("Indices misclassified: " + str(mis_idx[0]))

images_misclassified = X_test_ab[mis_idx[0]].copy()
logging.info("Images misclassified: ")
plot_matrix(images_misclassified, color, my_color_map, h=1, w=mis_idx[0].shape[0])

logging.info("Images correctly classified: ")
correct_idx = np.where(np.array(y_test_ab - y_pred_hog_ab) == 0)
remaining_images = X_test_ab[correct_idx[0]].copy() 
plot_matrix(remaining_images, color, my_color_map, h=2, w=1+correct_idx[0].shape[0]//2)


logging.info("Again - Overview of the training data to ease the understanding...")
plot_matrix(X_train, color, my_color_map, h=4, w=10)


#### Observations and Discussions on test results

> those observations are done with `color = False` and `sq_size = 64`

The good thing about working with images is that we may get a better understanding by simply looking at the high-dimensional input data: the raw images. 
Images #1, #2, #7 and #9 are misclassified. They are all personA images, and we will focus on those images first.



##### personA (mis)classification
- Images #2, #7 and #9 appear visually really different from all the others: hair color and haircut are dramatically different from most of the Emma Stone faces. 

To be more convinced, let's come back to the HOG feature descriptor, for an image we've already seen.


In [0]:
show_one_image_hog(2, person=personA, set="training")

On this descriptor, it appears clearly that the haircut plays a key role in the description of the face, as shown by the clear oblique line in the top right corner. This line cannot be present in the descriptors of test images #2, #7 and #9, which makes their classification harder.

- Image #1 is also misclassified. It may be more difficult to understand why there was a mistake there. Some intuition however:
    - it is the only image from the personA training and test sets that has this rotation
    - the haircut is - besides being rotated - not as sharp as in most of the other images
    - there is a strong dark background on the left, leading to a large gradient magnitude on this side, which is not present in the other personA images.


To confirm all those intuitions, let's look at the real descriptors used for those misclassified images. We are used to these images from the first part of this tutorial.

In [0]:
'''
computation of the HOG descriptor for the misclassified images
'''
hog_missed = []
for index in mis_idx[0]:
    hog_image = hog_test[personA][index][1]
    hog_missed.append(hog_image)

fig, ax_ = plt.subplots(1,mis_idx[0].shape[0],figsize = (16,4), sharex=True, sharey=True)

for count in range(len(ax_)):
    ax_[count].imshow(hog_missed[count])



Our assumptions seem consistent with the descriptors:
1. Image #1 is rotated, the haircut doesn't lead to a clear difference, and the background on the left is captured by the descriptor
2. Images #2, #7 and #9 indeed do not show the haircut.

##### Decision Boundaries
The classifier gives another piece of information that confirms the intuition that
- image #1 looks more like the others, but the rotation makes it harder to classify, 
- images #2, #7 and #9 lack a key element of the descriptor, hence are more easily misclassified. 

This is confirmed by the *decision function* of the classifier, which returns the signed distance with respect to the decision boundary. As stated in the documentation, it predicts the confidence scores of the samples, where the confidence score of a sample is the signed distance of that sample to the hyperplane, see [sklearn source](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841/sklearn/linear_model/_base.py#L247)

> $\text{distance} > 0 \Rightarrow \text{class} = 1$
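
For a linear classifier, this score is simply $w \cdot x + b$. The sketch below checks that on synthetic two-class data (not the HOG features of this notebook): the output of `decision_function` matches the manual computation, and its sign gives the predicted class. Strictly speaking, the raw score is proportional to the geometric distance; dividing by $\|w\|$ gives the distance itself.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Two synthetic classes in 10 dimensions (placeholder data).
X_lin = np.vstack([rng.normal(-2.0, 1.0, size=(20, 10)),
                   rng.normal(2.0, 1.0, size=(20, 10))])
y_lin = np.repeat([0, 1], 20)

clf_lin = SGDClassifier(random_state=42, max_iter=1000, tol=1e-6).fit(X_lin, y_lin)

scores_lin = clf_lin.decision_function(X_lin)
manual_scores = X_lin @ clf_lin.coef_.ravel() + clf_lin.intercept_[0]  # w . x + b
pred_from_sign = clf_lin.classes_[(scores_lin > 0).astype(int)]        # score > 0 -> class 1

# Dividing by ||w|| turns the raw score into a geometric distance.
distances_lin = scores_lin / np.linalg.norm(clf_lin.coef_)
```

A sample with a score close to 0 sits near the boundary, which is how the "barely" classified images are spotted below.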


In [0]:
'''
Printing classifier scores (= distance to decision plane)
'''
scores = sgd_clf.decision_function(X_test_hog_ab_prepared)
logging.info("scores (decision functions): " + str(scores))

'''
Visualization
'''
# plt.figure(figsize=(8,8))
fig, ax = plt.subplots(1, 1, figsize=(8,8))
scores_a = scores[0:10]
scores_b = scores[10:20]
# explicit labels keep the legend correct whatever the order of the artists
ax.scatter(range(len(scores_a)), scores_a, label="personA")
ax.scatter(range(len(scores_a), len(scores_a)+len(scores_b)), scores_b, label="personB")
ax.plot(range(len(scores)), np.zeros((len(scores),)),"r-", label="Boundary")
ax.set_title("Visualization of distance to classification boundary")
ax.legend()
ax.set_xlabel("test image index",fontsize=12)
ax.set_ylabel("Distance to boundary decision",fontsize=12)
if not color:
    ax.set_xlim(left=-0.5, right=20.5)
    ax.set_ylim(bottom=-275.0, top=275.0)
    ax.text(2,150,"Classified as \nPersonB",fontsize=12)
    ax.text(12,-150,"Classified as \nPersonA",fontsize=12)
    
plt.show()

On the above plot, personA is in blue on the left (indices 0 to 9 included) and personB is in orange on the right (indices 10 to 19 included). 

The graph confirms everything said so far:
* 4 images of personA are on the wrong side of the line, hence misclassified as personB,
* Image #1, the rotated image, is misclassified but remains close to the boundary line: although the image clearly looks like personA's face, the rotation makes it harder for our classifier. Its proximity to the boundary indicates that the classifier *is not so confident* about its choice.
* Images #2 and #7, two of the other three misclassified images (the ones that don't show the usual personA haircut), are further away from the boundary line. This matches the visual hint that those test images do not look like the training images, because of the haircut.
* Image #9, the last test image of personA, is misclassified, but barely! It shows the same characteristics as misclassified images #2 and #7 (blond hair, different haircut, looks younger, ...). "Thanks to" the background and the viewpoint, however, the HOG descriptor differences do not push it as far across the boundary as the two similar images #2 and #7. Its descriptor is shown again below.
* In contrast, personB is properly classified, with high confidence, for all test images. 

In [0]:
show_one_image_hog(index, person=personA, set="test")

##### personB classification

Bradley Cooper, personB, was properly classified 100% of the time, indicating a close resemblance between the HOG descriptors of the personB test set and training set.
As visible on the distance-to-boundary plot above, the classifier is really confident about its choice, especially compared with the results for the personA test images.  



### One More Thing... Pipelining!

Until now, the training and test have been quite *manual*:
- we define data structures between the transformers
- we specify the transformations manually, one after the other

The `Transformer`s that we have already introduced give the opportunity to do (much) better and take advantage of a dedicated *Architecture Style* of software programs called [Pipes and Filters](https://medium.com/@syedhasan010/pipe-and-filter-architecture-bd7babdb908-). 

The very good thing is that it's already provided by `sklearn`, and fully applicable to `Transformer`s, which are in fact just `Filter`s.

This makes the classification much less of a manual process:
- no more need to track the sequence of actions for each run, 
- no more intermediate data structure creation, 
- **very** easy to modify and add/remove steps, without messing with the data structures.

Concretely, we need to define a `Pipeline` object on which we will call `fit` and `predict`. This `Pipeline` will make use of our `Transformer` objects and automatically connect the output of one to the input of the next one.
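
To make the idea concrete, here is a minimal sketch on toy data. `ScaleTransformer` is a hypothetical stand-in for a feature transformer (it is not the notebook's `HogTransformer`), and the data and variable names are assumptions made for illustration only. The sketch checks that chaining the steps by hand and using a `Pipeline` give the same predictions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


class ScaleTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a feature transformer such as HogTransformer."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn: this transformer is stateless
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


rng = np.random.default_rng(0)
X_pipe = np.vstack([rng.normal(-2, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
y_pipe = np.array([0] * 20 + [1] * 20)

# Manual chaining: transform first, then fit and predict
manual_tf = ScaleTransformer(factor=0.5)
manual_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)
manual_clf.fit(manual_tf.fit_transform(X_pipe), y_pipe)
manual_pipe_pred = manual_clf.predict(manual_tf.transform(X_pipe))

# Same steps expressed as a Pipeline
toy_pipeline = Pipeline([
    ("scale", ScaleTransformer(factor=0.5)),
    ("classify", SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)),
])
pipeline_pred = toy_pipeline.fit(X_pipe, y_pipe).predict(X_pipe)
```

The pipeline takes the raw matrix as input for both `fit` and `predict`, so the intermediate transformed matrix never has to be stored by hand.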


We are now ready to reproduce the results using the `Pipeline` architecture!


In [0]:
'''
Definition of a pipeline
'''

HOG_pipeline = Pipeline([('hogify', HogTransformer(
                             orientations = 9,
                             pixels_per_cell = (8,8),
                             cells_per_block = (2,2),
                             block_norm='L2',
                             transform_sqrt=True, 
                             multichannel = color)
                         ),
                         ('classify', SGDClassifier(
                             random_state = 42, 
                             max_iter = 1000, 
                             tol=1e-3)
                         )

])

'''
Training
'''
# we set the X_train before the hogify of course...
classifier = HOG_pipeline.fit(X_train, y_train)
# logging.info("Percentage correct: " + str( 100* np.sum(y_pred == y_test_ab)/len(y_test_ab) ) + " %")

'''
Predicting
'''
y_pred_= classifier.predict(X_test_ab) 

'''
Computing and Showing accuracy
'''
logging.info("Percentage correct: " + str( 100*np.sum(y_pred_ == y_test_ab)/len(y_test_ab)) + " %")
misclassifier_index = np.where( np.array(y_test_ab != y_pred_))
logging.info("Indices misclassified: " + str(misclassifier_index[0]))


At the end of the pipeline, we have reproduced exactly the same results as before.

***From now on, I will use Pipelining instead of regular manual scripting***

### Tests on personC and personD

We have a *not great* accuracy so far for personA and personB, and we can wonder how the current classifier, based on HOG descriptors which are local, performs when personC and personD come into play.

- we create `X_test_cd` containing the test data for personC and personD
- we create `y_test_cd` containing the 0/1 labels for personC and personD

#### Wait - What? Why?
Hmm... Indeed, assigning a 0 or 1 label means we expect personC to be classified as personA, and personD as personB. This is debatable. Let's see this step as just a way to assess how the classification based on our feature works, and how the metric evolves when the inputs are "so" different, keeping in mind that the results should not be as high as previously. 

Another way to see it is to consider persons A-C and B-D as belonging to the same classes, for which only a biased training set is available. We want to assess how the final classifier performs on images never seen and quite different from the training set, yet part of the classes (say, *white_young_female*, *white_40s_male*).



In [0]:
X_test_cd = X_test[20:40, :]
y_test_cd = y_test[20:40]

logging.info("X_test_cd shape: " + str(X_test_cd.shape))
logging.info("y_test_cd shape: " + str(y_test_cd.shape))

X_test_HOG_cd_back = X_test_cd.copy()
y_test_HOG_cd_back = y_test_cd.copy()

Based on that, we can simply reuse the classifier already created.

#### Prediction

In [0]:
'''
Predicting
'''
# note that as we use the pipeline, we can directly set as input the data matrix. 
# the hogify step is included.
y_pred_cd= classifier.predict(X_test_cd) 

#### Confusion Matrix

In [0]:
'''
Computing and Showing accuracy
'''
logging.info("Percentage correct: " + str( 100*np.sum(y_pred_cd == y_test_cd)/len(y_test_cd)) + " %")
misclassifier_index = np.where( np.array(y_test_cd != y_pred_cd))
properclassifier_index = np.where( np.array(y_test_cd == y_pred_cd))

logging.info("Indices misclassified: " + str(misclassifier_index[0]))

true_labels=["Jane Levy", "Marc Blucas"]
predicted_labels = ["Emma Stone", "Bradley Cooper"]
my_plot_confusion_matrix(classifier,
                         X_test_cd, 
                         y_test_cd,
                         true_labels = true_labels,
                         predicted_labels = predicted_labels) 

Besides the *accuracy* of 55%, this gives a very interesting result:
- All the images of personD were *correctly* classified
- All the images of personC, except index 0, were *wrongly* classified

> Again, I emphasize the fact that *correctly* and *wrongly* may not be appropriate considering the previous remark.
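
The 55% figure can be checked by hand. The sketch below rebuilds the labels and predictions from the description above (10 test images per person is assumed from the slicing used earlier; the arrays are reconstructed for illustration, not read from the notebook outputs) and computes the confusion matrix with `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels as described in the text: personC -> 0 (10 images), personD -> 1 (10 images)
y_true_demo = np.array([0] * 10 + [1] * 10)

# Predictions as described: only image 0 of personC gets class 0,
# every other image is predicted as class 1
y_pred_demo = np.array([0] + [1] * 9 + [1] * 10)

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)
print(cm_demo)   # rows: true class, columns: predicted class
print(acc_demo)  # 11 correct out of 20
```

One correct personC image plus ten correct personD images give 11/20, i.e. the 55% reported above.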

Let's have a closer look at the images.

In [0]:
logging.info("Mis-classified images: ")
plot_matrix(X_test_cd[misclassifier_index[0],:], color, my_color_map, h=1, w=len(misclassifier_index[0]))

logging.info("Properly classified images: ")
plot_matrix(X_test_cd[properclassifier_index[0],:], color, my_color_map, h=2, w=1+len(properclassifier_index[0])//2)

#### Decision boundary -- confidence score

In order to visually understand the current classification results, we can plot the decision boundary. 
As indicated in the documentation, "the confidence score for a sample is the signed distance of that sample to the hyperplane", see [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.decision_function).
It gives an idea of how far the sample is from the hyperplane, hence an idea of the confidence the classifier has. 
This is a great way to observe which samples are easy or tricky to classify.

In [0]:
'''
Printing classifier scores (= distance to decision plane)
'''
scores = classifier.decision_function(X_test_cd)
logging.info("scores (decision functions): " + str(scores))

'''
Visualization
'''
# plt.figure(figsize=(8,8))
fig, ax = plt.subplots(1, 1, figsize=(8,4))
scores_a = scores[0:10]
scores_b = scores[10:20]
# first 10 scores belong to personC, last 10 to personD
ax.scatter(range(len(scores_a)), scores_a, label="personC")
ax.scatter(range(len(scores_a), len(scores_a)+len(scores_b)), scores_b, label="personD")
ax.plot(range(len(scores)), np.zeros((len(scores),)),"r-", label="Boundary")
ax.set_title("Visualization of distance to classification boundary")
ax.legend()
ax.set_xlabel("test image index",fontsize=12)
ax.set_ylabel("Distance to boundary decision",fontsize=12)
if not color:
    ax.set_xlim(left=-0.5, right=20.5)
    ax.set_ylim(bottom=-275.0, top=275.0)
    ax.text(2,150,"Classified as \nPersonB",fontsize=12)
    ax.text(12,-150,"Classified as \nPersonA",fontsize=12)
plt.show()

Looking at the above plot of the distance to the decision boundary:
- personC and personD are not clearly separated by the classifier: most images of both end up on the same side of the boundary
- personD was always properly classified as personB (class = "1"). However, the average distance to the boundary line (hence the confidence of the classifier) is lower than for the personB test images (see previous section). This is to be expected: the classifier is correct, but less confident, for personD (absent from training) than for personB.
- personC is almost always misclassified, except for one image, which we plot below.
    - personC is similar to personA **but** does not have the haircut characteristic of personA
    - the rest of the descriptor cannot make up for the haircut, which leads to a misclassification. In particular, the lateral sides are mostly vertical lines, more characteristic of personB than of personA.

#### Better understanding


In [0]:
show_one_image_hog(0,personC, "test")

Above: this is the only image of personC classified as personA so far. Due to the viewpoint and rotation of the picture, we recognize the oblique part in the upper right corner. The other images do not show this, and are therefore classified as personB.

For example, image #1 of personC (test set) and an image of personB (training set) are shown below.

In [0]:
show_one_image_hog(1,personC, "test")
show_one_image_hog(1,personB, "training")

### HOG Classifier conclusion


In this section, we used our HOG feature to train a simple linear classifier (Stochastic Gradient Descent) that reached an accuracy of 80% on the test set of personA and personB, and 55% on the test set of personC and personD. 
These numbers need to be taken with caution considering:
- the small number of images (both in training and test sets)
- the absence, so far, of any optimization of the model hyperparameters (for instance with a cross-validation technique). In particular:
    * `orientations`
    * `pixels_per_cell`
    * `cells_per_block`
    * `block_norm`
    * `transform_sqrt`
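
To give a feel for how the first three parameters shape the representation, the sketch below computes the length of a flattened HOG descriptor for a square image. It assumes a 64x64 input and blocks that slide one cell at a time; the helper `hog_descriptor_length` is hypothetical and only does the arithmetic, it does not call any HOG library. `block_norm` and `transform_sqrt` change the values of the descriptor, not its length.

```python
def hog_descriptor_length(image_size, orientations, pixels_per_cell, cells_per_block):
    """Length of a flattened HOG descriptor for a square image,
    assuming blocks slide one cell at a time."""
    cells_per_side = image_size // pixels_per_cell
    blocks_per_side = cells_per_side - cells_per_block + 1
    values_per_block = cells_per_block ** 2 * orientations
    return blocks_per_side ** 2 * values_per_block

# Assumed 64x64 image with the parameters used in the pipeline above
len_8px = hog_descriptor_length(64, orientations=9, pixels_per_cell=8, cells_per_block=2)
print(len_8px)  # 7 * 7 blocks of 36 values each: 1764

# Halving the cell size on the same image
len_4px = hog_descriptor_length(64, orientations=9, pixels_per_cell=4, cells_per_block=2)
print(len_4px)  # 15 * 15 blocks of 36 values each: 8100
```

Smaller cells give a finer spatial granularity at the cost of a much longer feature vector, which matters with so few training images.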

The results have been assessed, in particular the (relatively) poor performance on the personA and personC test images. This emphasizes the inherent "quality" of the representation: its locality, its fine orientation *granularity* (changes in the orientation can be perceived quite well), its spatial *granularity*, ... all of which depend on the key parameters listed above. Surely, some optimizations are possible and will be treated at a later stage.

## PCA Classification

Applying the same routine as described for the HOG feature representation, we will build a classifier based on the PCA feature representation. 

### Gathering and processing the data

As already discussed in the HOG section, we retrieve the original data, resize and convert them according to the `color` attribute, and reshape them into a usable matrix. These steps could also be done using a `Transformer`. 

#### Training data

Code subtlety: because of the `get_matrix_from_set` function and the current organization of the code, we need to reapply the shuffling, as we reuse the import from the beginning with the different parameters. 

For production code, this would need to be improved.
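
The important property of the shuffling is that rows and labels are reordered together. The implementation of `shuffle_training` is not shown in this section, so the sketch below uses a hypothetical helper, `shuffle_in_unison`, that returns shuffled copies using a single permutation; the toy data is an assumption made for illustration.

```python
import numpy as np

def shuffle_in_unison(X, y, seed=0):
    """Hypothetical helper: return X and y reordered by the same permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    return X[perm], y[perm]

# Toy data: the first feature of each row stores its own label
y_pairs = np.array([0] * 5 + [1] * 5)
X_pairs = np.column_stack([y_pairs, np.arange(10)])

X_shuffled, y_shuffled = shuffle_in_unison(X_pairs, y_pairs)
```

Because the same permutation indexes both arrays, every row still carries its original label after the shuffle.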


In [0]:
'''
get_matrix_from_set
'''
X_train_PCA = get_matrix_from_set(training_set, color, sq_size, flatten=True)
y_train_PCA = np.zeros((40,))
y_train_PCA[20:40] = 1


'''
Shuffling
'''
logging.info("Labels before shuffling:\n" + str(", ".join([str(i.astype(np.uint8)) for i in y_train_PCA])))

shuffle_training(X_train_PCA, y_train_PCA)

logging.info("Labels  after shuffling:\n" + str(", ".join([str(i.astype(np.uint8)) for i in y_train_PCA])))

#### Test data

In anticipation, we already pre-process the data we will use for the test. We split the dataset, for our convenience, between:
- test on personA and personB
- test on personC and personD



In [0]:
'''
get_matrix_from_set
'''
X_test_PCA= get_matrix_from_set(test_set, color, sq_size, flatten = True)
y_test_PCA = y_test.copy()

logging.info("X_test_PCA Shape          = " + str(X_test_PCA.shape))
logging.info("y_test_PCA Shape          = " + str(y_test_PCA.shape))

'''
AB
'''
X_test_PCA_ab = X_test_PCA[0:20, :].copy()
y_test_PCA_ab = y_test_PCA[0:20].copy()
logging.info("X_test_PCA_ab Shape       = " + str(X_test_PCA_ab.shape))
logging.info("y_test_PCA_ab Shape       = " + str(y_test_PCA_ab.shape))
logging.info("Sum y_test_PCA_ab [10]    = " + str(sum(y_test_PCA_ab)))

'''
CD
'''
X_test_PCA_cd = X_test_PCA[20:40, :].copy()
y_test_PCA_cd = y_test_PCA[20:40].copy()
logging.info("X_test_PCA_cd Shape       = " + str(X_test_PCA_cd.shape))
logging.info("y_test_PCA_cd Shape       = " + str(y_test_PCA_cd.shape))
logging.info("Sum y_test_PCA_cd [10]    = " + str(sum(y_test_PCA_cd)))



In [0]:
'''
Define backup variables
'''
X_train_PCA_shuffle_back = X_train_PCA.copy()
y_train_PCA_shuffle_back = y_train_PCA.copy()
X_test_PCA_back = X_test_PCA.copy()
y_test_PCA_back = y_test_PCA.copy()

X_test_PCA_ab_back = X_test_PCA_ab.copy()
y_test_PCA_ab_back = y_test_PCA_ab.copy()

X_test_PCA_cd_back = X_test_PCA_cd.copy()
y_test_PCA_cd_back = y_test_PCA_cd.copy()


### PCA pipeline

Using the Scikit-Learn library, which we already introduced in the previous part, we create the PCA Transformer.

As input, we first set `n_components` equal to the optimal $p$ that we found in the previous section. 
This does not represent much of a dimensionality reduction, as already discussed.
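
A short sketch of the shapes involved, on random data standing in for 40 flattened 64x64 grayscale images (both sizes are assumptions; the notebook accesses the class through its own alias, while the sketch imports `PCA` directly). PCA cannot return more components than there are training images, so keeping 35 components out of at most 40 compresses the training set only slightly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random stand-in for 40 flattened 64x64 grayscale images (assumed sizes)
X_fake = rng.random((40, 64 * 64))

pca_demo = PCA(n_components=35)
X_fake_reduced = pca_demo.fit_transform(X_fake)

print(X_fake_reduced.shape)         # one 35-value vector per image
print(pca_demo.components_.shape)   # 35 components of 4096 pixels each
# Fraction of the training variance kept by the 35 components
kept_variance = np.cumsum(pca_demo.explained_variance_ratio_)[-1]
print(kept_variance)
```

Each image is still reduced from 4096 pixel values to 35 coefficients, which is what the classifier sees.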

In [0]:
'''
The commented code below creates and trains a classifier "manually"
'''
# p = 35
# pcaify = sklearn_decomposition_PCA(n_components = p)
# X_train_PCA = pcaify.fit_transform(X_train)
# X_train_prepared = X_train_PCA

# logging.info("PCA fit_transform result:\nshape:" + str(X_train_prepared.shape))

# sgd_clf = SGDClassifier(random_state=42, max_iter = 1000, tol=1e-4, verbose = 1)
# sgd_clf.fit(X_train_prepared, y_train)

The pipeline is created, and trained using input data directly.

In [0]:
'''
Definition of a PCA pipeline
'''
pcaify = sklearn_decomposition_PCA(n_components = 35)
classify = SGDClassifier( random_state = 42, max_iter = 10000, tol=0.0001)

PCA_pipeline = Pipeline([('pcaify', pcaify),
                         ('classify', classify)])

'''
Training
'''
clf_pca = PCA_pipeline.fit(X_train_PCA, y_train_PCA)


### Test on personA and personB

#### Predictions

Using the pipeline, we can predict the results for the test images of personA and personB.

We then compute the accuracy.


In [0]:
'''
Predicting A and B
'''
y_pred_PCA_ab = clf_pca.predict(X_test_PCA_ab) 
logging.info("Percentage correct: " + str( 100*np.sum(y_pred_PCA_ab == y_test_PCA_ab)/len(y_test_PCA_ab)) + " %")

For the sake of completeness, we can - one more time - get the eigenfaces used, according to the `n_components = 35` parameter.

In [0]:
efaces = pcaify.components_
logging.debug("eigenfaces shape = " + str(efaces.shape))

'''
Visualize the eigenfaces, just as we did before
'''
if color:
    efaces_cvt = (efaces*255).astype(np.uint).copy()
else:
    efaces_cvt = efaces.copy()
plot_matrix(efaces_cvt, color, my_color_map, h=4, w=10, transpose=False)

#### Confusion Matrix
As previously, it's interesting to get the confusion matrix view, even if it is quite simple in our problem, considering only two classes. 



In [0]:
true_labels = ["Emma Stone", "Bradley Cooper"]
my_plot_confusion_matrix(clf_pca,
                         X_test_PCA_ab, 
                         y_test_PCA_ab,
                         true_labels = true_labels) 

The confusion matrix directly shows that on the test images for A and B, the accuracy is 100%: there are no misclassified images.

#### Observations and Discussions on test results


The first thing to note is the absolute result: 100% accuracy at the first attempt. This is pretty good. 
As a reminder, we had "only" 80% using the HOG feature representation. However, we should be careful about drawing any conclusion at this point, as none of the classifiers has been optimized yet in terms of hyperparameters (specifically not the HOG classifier - see later).

Similarly to what we did for the HOG feature representation, let's analyze in more depth the results of the classification on test images of personA and personB using the PCA feature representation, with hyperparameter $p=35$.
This analysis is performed with `sq_size = 64` and `color = False`.


In [0]:
'''
Printing classifier scores (= distance to decision plane)
'''
scores = clf_pca.decision_function(X_test_PCA_ab)
logging.info("scores (decision functions): " + str(scores))

'''
Visualization
'''
# plt.figure(figsize=(8,8))
fig, ax = plt.subplots(1, 1, figsize=(8,4))
scores_a = scores[0:10]
scores_b = scores[10:20]
# explicit labels keep the legend correct whatever the order of the artists
ax.scatter(range(len(scores_a)), scores_a, label="personA")
ax.scatter(range(len(scores_a), len(scores_a)+len(scores_b)), scores_b, label="personB")
ax.plot(range(len(scores)), np.zeros((len(scores),)),"r-", label="Boundary")
ax.set_title("Visualization of distance to classification boundary")
ax.legend()
ax.set_xlabel("test image index",fontsize=12)
ax.set_ylabel("Distance to boundary decision",fontsize=12)
if not color:
    ax.text(2,0.5e8,"Classified as \nPersonB",fontsize=12)
    ax.text(12,-0.5e8,"Classified as \nPersonA",fontsize=12)
plt.show()

Visualizing the decision boundaries score, it seems the classifier is *pretty certain* about the personB classification, and slightly *less certain* for personA. 


### Tests on personC and personD

Similarly to what we did with the HOG feature representation, we can use our classifier on the personC and personD test data. 

The remarks we made before regarding the intrinsic meaning of this test remain applicable.

We use `X_test_PCA_cd` and `y_pred_PCA_cd` as data structures.

We first use the classifier to predict, then print the indices and plot the misclassified images, the confusion matrix and the scores, as we did before.

#### Predictions

In [0]:
'''
Predictions on C and D
'''
y_pred_PCA_cd = clf_pca.predict(X_test_PCA_cd) 
logging.info("Percentage correct: " + str( 100*np.sum(y_pred_PCA_cd == y_test_PCA_cd)/len(y_test_PCA_cd)) + " %")


'''
get misclassified and correctly classified images index
show related images
'''
show_missed(X_test_PCA_cd, y_test_PCA_cd, y_pred_PCA_cd)


#### Confusion Matrix

In [0]:
true_labels=["Jane Levy", "Marc Blucas"]
predicted_labels = ["Emma Stone", "Bradley Cooper"]
my_plot_confusion_matrix(clf_pca,
                         X_test_PCA_cd, 
                         y_test_PCA_cd,
                         true_labels = true_labels,
                         predicted_labels = predicted_labels) 

#### Decision Boundaries


In [0]:
scores = clf_pca.decision_function(X_test_PCA_cd)
logging.info("scores (decision functions): " + str(scores))

'''
Visualization
'''
# plt.figure(figsize=(8,8))
fig, ax = plt.subplots(1, 1, figsize=(8,4))
scores_a = scores[0:10]
scores_b = scores[10:20]
# first 10 scores belong to personC, last 10 to personD
ax.scatter(range(len(scores_a)), scores_a, label="personC")
ax.scatter(range(len(scores_a), len(scores_a)+len(scores_b)), scores_b, label="personD")
ax.plot(range(len(scores)), np.zeros((len(scores),)),"r-", label="Boundary")
ax.set_title("Visualization of distance to classification boundary")
ax.legend()
ax.set_xlabel("test image index",fontsize=12)
ax.set_ylabel("Distance to boundary decision",fontsize=12)
# if not color:
# ax.text(4,4e7,"Classified as \nPersonB",fontsize=12)
# ax.text(12,-2.5e7,"Classified as \nPersonA",fontsize=12)
plt.show()

#### Observations and Discussions on the test results

The results of the prediction on personC and personD are interesting as they differ a lot from the ones of the HOG feature. 

In the case of the PCA feature, both personC and personD have several correct and wrong predictions. Results are slightly better (by one image) for personD, but considering the few images in the test sets, this is most likely not statistically significant. 

Some hints to better understand the results: as a reminder, PCA feature representation is a projection of the images into a vector of coefficients - weights - of the eigenfaces (Principal Components) found during the training phase. 

If an image is not classified properly, it means its feature representation *differs too much* from the class feature representation. Else, the classifier most likely would have found the correct class. Furthermore, the principal components are the directions of maximal variance of the training images. It follows that a wrongly classified image is not explained best by a linear combination of the $p$ principal components (direction of max variance of the training phase).
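
The projection described above can be written out in a few lines. The sketch below uses random vectors as stand-ins for images (sizes and names are assumptions): the feature vector is the set of coefficients of the centered image on each principal component, and the reconstruction error measures the part of the image that the $p$ components cannot explain.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train_demo = rng.random((40, 100))   # stand-in for training images
x_new = rng.random(100)                # stand-in for one test image

pca_proj = PCA(n_components=10).fit(X_train_demo)

# Feature vector: coefficients of the centered image on each principal component
weights = (x_new - pca_proj.mean_) @ pca_proj.components_.T

# Rebuild the image from its weights and measure what the components miss
x_rebuilt = pca_proj.mean_ + weights @ pca_proj.components_
reconstruction_error = np.linalg.norm(x_new - x_rebuilt)
```

A large reconstruction error means the image lies far from the subspace spanned by the training eigenfaces, which is one way to quantify the "unexplained variance" mentioned above.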

Looking above at the misclassified images, one can wonder how an image that looks just like another can be misclassified. There are several possible explanations:
- too much influence from the background
- lighting conditions that are too different
- different scale w.r.t. training images
- different orientation (pose and viewpoint) w.r.t. training images



#### Better understanding

> Note: the results of this section run in `color=True` mode may be slightly different

In order to better understand how the classifier works, let's choose a test image and try to improve the results. To do so, we will modify the image input itself, to see what can lead to a good classification with the system we have.

> This may not be good practice: usually, one would rather work on the training and validation sets, and **not** modify the test set. However, the goal of this subsection is limited to giving a bit more insight into PCA and PCA-based classification, by showing how modifying a chosen image leads to a change in the classification result. The goal is not to improve the classifier, but rather to understand which modifications of the input images make it deliver the expected result.

Let's choose the image of personC with index #6.


In [0]:
logging.info("Image we work on: image personC, index 6")
img = X_test_PCA_cd[6,:].copy()
fig = plt.figure(figsize=(4,4 ))
ax= fig.subplots(1,1)
ax.imshow(my_reshape(img, sq_size, color), cmap = my_color_map, interpolation='nearest') 
ax.set_axis_off()

This image is misclassified as class "1" instead of "0".

- Mean value of the image, after *training_mean* subtraction

In [0]:
img_centered = img - pcaify.mean_

fig = plt.figure(figsize=(4,4))
ax= fig.subplots(1,1)
ax.imshow(my_reshape(img_centered, sq_size, color), cmap = my_color_map, interpolation='nearest') 
ax.set_axis_off()
plt.show()

logging.info("Remaining mean value test ab  : " + str(np.round(np.mean(X_test_PCA_ab-pcaify.mean_),2)))
logging.info("Remaining mean value test cd  : " + str(np.round(np.mean(X_test_PCA_cd-pcaify.mean_),2)))
logging.info("Remaining mean value expected : " + str(np.round(np.mean(X_train_PCA - pcaify.mean_),2)))
logging.info("Remaining mean value image#6  : " + str(np.round(np.mean(img_centered),2)))
logging.info("Nominal   mean value image#6  : " + str(np.round(np.mean(img),2)))


We see that the test images of personC and personD have a residual mean lower than the training set (=0) and also lower than the personA and personB test images. 

> Intuitively, the average is a bit "darker".

We also see that the image is **rotated** with respect to the mean image. This is visible from the whitish areas around the eyes and mouth, and the slightly darker area around the supposed chin. An intuitive PCA-related conclusion is that the variance induced by the rotation is not well explained by the personA images. 

What if we modified this test image to rotate it "back" to a regular front face? Doing so, we hope to decrease this unexplained variance.
- we use the `scipy.ndimage` library
- the rotation angle, found experimentally, is $-22^\circ$
- after the rotation, some pixels have no value to fill the shape. We set those pixels to the average pixel value of the image before modification. 
- Rotation is (experimentally) enough; no extra shift is needed to fit better.

In [0]:
'''
Rotation of the image
'''
img_rotated = ndimage.rotate(my_reshape(img, sq_size, color), -22, reshape=False, mode = "constant", cval=np.mean(img)) 
img_rotated_flatten = img_rotated.flatten()
img_rotated_centered = img_rotated_flatten - pcaify.mean_

'''
Visualization
'''
fig = plt.figure(figsize=(12,4))
ax1, ax2, ax3 = fig.subplots(1,3)
ax1.imshow(my_reshape(img, sq_size, color), cmap = my_color_map, interpolation='nearest') 
ax2.imshow(img_rotated, cmap = my_color_map, interpolation='nearest') 
ax3.imshow(my_reshape(img_rotated_centered, sq_size, color), cmap = my_color_map, interpolation='nearest') 

ax1.set_title("Original #6")
ax2.set_title("Rotated #6")
ax3.set_title("Rotated #6 - training_mean ")

ax1.set_axis_off()
ax2.set_axis_off()
ax3.set_axis_off()



logging.info("New mean value image#6 modified : " + str(np.round(np.mean(img_rotated_centered),2)))

Now, we can replace the misclassified image #6 by the newly rotated image, in which the missing values are filled with the mean of the original image. We successively show the original test set for personC and personD, and the modified one (image #6 replaced by its rotated version).

In [0]:
'''
Replacement of the original image #6 by its rotated version
'''
X_test_PCA_cd_new = X_test_PCA_cd.copy()
X_test_PCA_cd_new[6,:] =  img_rotated_flatten 

logging.info("Usual and nominal test set for personC and personD")
plot_matrix(X_test_PCA_cd, color, my_color_map, h=2, w=10)

logging.info("Modified test set for personC and personD, image#6 replaced")
plot_matrix(X_test_PCA_cd_new, color, my_color_map, h=2, w=10)


Let's try to predict the class again for this "new" test set, which only differs by image #6.

In [0]:
'''
Prediction on new test set
'''
y_pred_PCA_cd_new = clf_pca.predict(X_test_PCA_cd_new) 
logging.info("Percentage correct: " + str( 100*np.sum(y_pred_PCA_cd_new == y_test_PCA_cd)/len(y_test_PCA_cd)) + " %")

misclassifier_index = np.where( np.array(y_test_PCA_cd != y_pred_PCA_cd_new))
properclassifier_index = np.where( np.array(y_test_PCA_cd == y_pred_PCA_cd_new))

logging.info("Indices misclassified: " + str(misclassifier_index[0]))
scores = clf_pca.decision_function(X_test_PCA_cd_new)


# Visualization
true_labels=["Jane Levy", "Marc Blucas"]
predicted_labels = ["Emma Stone", "Bradley Cooper"]
my_plot_confusion_matrix(clf_pca,
                         X_test_PCA_cd_new, 
                         y_test_PCA_cd,
                         true_labels = true_labels,
                         predicted_labels = predicted_labels) 


Now, the rotated image #6 is properly classified, as hoped, thanks to the rotation. Concretely, the rotation mostly had the following impacts:
- decrease of the influence of a dark area due to the hair
    * influence of background
    * lighting conditions
- better match with eigenfaces
    * variance better explained by personA-related eigenfaces

This is not a rigorous method to determine the exact behavior of the classifier. Rather, it's an intuitive reasoning showing how an image can be (mis-)classified by a classifier based on the PCA representation. It also shows that the rotation of an image matters for the PCA feature representation, just as for the HOG representation.

### HOG vs PCA classification 

With the same training samples and the same classification technique, PCA does a better job at classifying the test set images. 

The test set contains some images that closely resemble the training set, but also some with different viewpoints and visual differences in terms of haircut, color, ... 

PCA-based classification seems less dramatically confused by those aspects, and seems to have successfully captured the representation associated with personA and personB (the two classes). Yet, we have seen and analyzed that rotation may have a large impact on classification results.

HOG-based classification, on the other hand, suffers more from these differences between training and test sets and, as we saw, is also more disturbed by the local haircut change between personA and personC. 

However, we should remain careful at this point:
- the classifiers have not been optimized in any way,
- the training set is quite small.
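
As a hedged illustration of what such an optimization could look like, the sketch below runs a cross-validated grid search over a PCA + SGD pipeline. The data is synthetic, and the grid values, the number of folds and all variable names are assumptions chosen for the example; this is not the tuning performed in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# Synthetic stand-in for the 40 training vectors (two classes of 20)
X_cv = np.vstack([rng.normal(0.0, 1.0, (20, 50)), rng.normal(1.0, 1.0, (20, 50))])
y_cv = np.array([0] * 20 + [1] * 20)

cv_pipeline = Pipeline([
    ("pcaify", PCA()),
    ("classify", SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    "pcaify__n_components": [5, 10, 20],
    "classify__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(cv_pipeline, param_grid, cv=4)
search.fit(X_cv, y_cv)
print(search.best_params_)
```

With only 40 training images, each fold is tiny, so the scores of such a search would be noisy and should be read with the same caution as the accuracies above.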


## Identification
In an identification setup the goal is to **compute similarity scores** between pairs of data examples and use them to identify new images. 
In this section, we will:
1. describe the visualizations used along the section
2. compute the feature representations HOG / PCA
3. compute the pairwise distances between the test set images and the training set images. 
4. discuss those distance results, macroscopically and at image level, 
5. use k-NN to label an image based on its nearest neighbours. In particular, we will:
    - discuss the choice of the $k$ parameter, in different ways
    - discuss the results of the labeling
    - analyze the images having the closest and furthest nearest neighbor
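
Steps 3 and 5 can be sketched in a few lines on toy feature vectors (random data, not the notebook's HOG or PCA features; all names are illustrative). The distance matrix has one row per test vector and one column per training vector, and the simplest case, $k=1$, gives each test vector the label of its closest training vector.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
# Toy feature vectors: two well-separated groups (not the notebook's data)
train_feats = np.vstack([rng.normal(0, 0.1, (3, 4)), rng.normal(5, 0.1, (3, 4))])
train_labels = np.array([0, 0, 0, 1, 1, 1])
test_feats = np.vstack([rng.normal(0, 0.1, (2, 4)), rng.normal(5, 0.1, (2, 4))])

# Distance matrix of shape (#test_image, #train_image)
dist = euclidean_distances(test_feats, train_feats)

# 1-NN: each test vector takes the label of its closest training vector
nearest_idx = dist.argmin(axis=1)
nn_labels = train_labels[nearest_idx]
```

For larger $k$, the label would instead be decided by a vote among the $k$ smallest distances of each row.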

First, we need to import several pairwise metric functions from the `sklearn` library.

In [0]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import manhattan_distances



```
>>> from numpy.linalg import norm
>>> norm(X, axis=1, ord=1)  # L-1 norm
>>> norm(X, axis=1, ord=2)  # L-2 norm
>>> norm(X, axis=1, ord=np.inf)  # L-∞ norm
```
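
The three norms listed above can be checked on a small matrix. The values below are chosen by hand for illustration; `axis=1` means one norm per row.

```python
import numpy as np
from numpy.linalg import norm

X_small = np.array([[3.0, -4.0],
                    [1.0,  1.0]])

l1_norms = norm(X_small, axis=1, ord=1)         # sum of absolute values per row: 7 and 2
l2_norms = norm(X_small, axis=1, ord=2)         # Euclidean length per row: 5 and sqrt(2)
linf_norms = norm(X_small, axis=1, ord=np.inf)  # largest absolute value per row: 4 and 1
print(l1_norms, l2_norms, linf_norms)
```

Dividing each row by its L-1 norm makes the absolute values of that row sum to 1, which is the kind of row normalization used for the percentage view of the matrices plotted later.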




We also create a utility function that plots the similarity matrix (or distance matrix), with several parameters. It will be used extensively, while the code itself isn't that important.
> Note that by default, the colormap "jet" is used. This is a personal preference, as I'm used to working with it. Should that **not** suit you, you can of course change this colormap, either locally through the `cmap` argument of the `plot_similarity_matrix` function, or globally through the default value of this parameter.


In [0]:
from mpl_toolkits.axes_grid1 import make_axes_locatable

def plot_similarity_matrix(similarity_matrix, show_numbers = False, vmax1 = None, vmax2 = None, norm_only=False, width=16, height = 8, fontsize = 10, return_normed = False, cmap=plt.cm.jet):
    '''
    Plot a similarity (or distance) matrix "as is" and/or row-normalized.
    For the normalized view, each row is divided by its L1 norm and expressed in percent.
    '''
    similarity_matrix_norm = 100*similarity_matrix / np.linalg.norm(similarity_matrix, axis = 1, ord=1, keepdims=True)
    

    if norm_only:
        fig = plt.figure()
        ax = fig.add_subplot(111)
        fig.set_size_inches(width, height)

        im = ax.imshow(similarity_matrix_norm, vmax=vmax2, cmap=cmap)
        ax.set_title('%')
        divider = make_axes_locatable(ax)
        cax = divider.append_axes("right", size="5%", pad=0.1)
        fig.colorbar(im, cax=cax)
        if show_numbers:
            for i in range(similarity_matrix.shape[0]):
                for j in range(similarity_matrix.shape[1]):
                    ax.text(j, i, str(round(similarity_matrix_norm[i,j],0)).split(".")[0],fontsize=fontsize,va='center', ha='center')
        
    else:
        fig, ax = plt.subplots(ncols=2)
        fig.set_size_inches(width, height)
    
        im1 = ax[0].imshow(similarity_matrix, vmax=vmax1, cmap=cmap)
        ax[0].set_title('as is')
        im2 = ax[1].imshow(similarity_matrix_norm, vmax=vmax2, cmap=cmap)
        ax[1].set_title('%')
        dividers = [make_axes_locatable(a) for a in ax]
        cax1, cax2 = [divider.append_axes("right", size="5%", pad=0.1) for divider in dividers]
    
        fig.colorbar(im1, cax=cax1)
        fig.colorbar(im2, cax=cax2)

        if show_numbers:
            for i in range(similarity_matrix.shape[0]):
                for j in range(similarity_matrix.shape[1]):
                    ax[0].text(j, i, str(similarity_matrix[i,j]), fontsize=fontsize,va='center', ha='center')
                    ax[1].text(j, i, str(round(similarity_matrix_norm[i,j],0)).split(".")[0],fontsize=fontsize,va='center', ha='center')
    fig.tight_layout()
    plt.show()

    if return_normed:
        return similarity_matrix_norm

#### Illustration of distance measures


To better understand the visualization used, a toy example is given below. The input is a handmade matrix. Later, it will be the matrix of shape (#test_images, #train_images) containing the pairwise distances.

Regarding the input matrix:
- on each row, the ratios between the numbers are kept the same
- the five rows contain three different scales of numbers:    

```
[[ 100,  20,  30,  40,  50], 
 [   2,  10,   3,   4,   5], 
 [  20,  30, 100,  40,  50],
 [   2,   3,   4,  10,   5], 
 [ 200, 300, 400, 500,1000]]
```


The number written in each cell is its value.

**Left side**: The "as is" matrix colors each cell according to its raw value. The color scale is therefore very wide, which is useful to observe the disparity of the measures across all test images. 

If the represented matrix is a distance matrix, for instance, one can observe that the 5th test image is *very far* from all the training images, as the 5th row has larger numbers than any other row. 
The drawback is that if some numbers are much higher than others, we lose granularity in representing the differences between the smaller numbers. For instance, a distance of "2" and a distance of "20" get nearly the same color.

**Right side**: The figure shows the same input matrix but normalized by row. That means that the sum of all numbers in a row is equal to 100 %. 
It is helpful to analyze *locally* the distance/similarity values for 1 specific test image with respect to all training images. 
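
As a small worked example of this row normalization (a sketch using `numpy` only; the variable names are illustrative and not part of the notebook's helpers), the first row of the toy matrix sums to 240, so its cells become 100/240, 20/240, ... expressed in percent:

```python
import numpy as np

# First two rows of the toy matrix used in this illustration.
toy = np.array([[100, 20, 30, 40, 50],
                [2, 10, 3, 4, 5]], dtype=float)

# Divide each row by its L1 norm (sum of absolute values) and express it in percent,
# which is what plot_similarity_matrix does for its "%" panel.
row_percent = 100 * toy / np.abs(toy).sum(axis=1, keepdims=True)

print(np.round(row_percent, 1))
print(row_percent.sum(axis=1))  # each row sums to 100 (up to rounding)
```

Both rows end up with the same set of percentages, only at different positions, which is why the "%" panel makes rows of very different scales comparable.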

In [0]:
test_dist_mtx = np.array([[100,20,30,40,50], [2,10,3,4,5], [20,30,100,40,50],[2,3,4,10,5], [200,300,400,500,1000]])
plot_similarity_matrix(test_dist_mtx, show_numbers=True, width=8, height=4, fontsize=14)


In the next sections of this tutorial, these visualizations will be used to detail the similarity/distance measures between our different images. 
The vertical axis corresponds to the test images, with the following mapping:
- 0 -> 9: images of PersonA, Emma Stone
- 10 -> 19: images of PersonB, Bradley Cooper
- 20 -> 29: images of PersonC, Jane Levy
- 30 -> 39: images of PersonD, Marc Blucas

The horizontal axis corresponds to the training images, with the following mapping:
- 0 -> 19: images of PersonA (training set)
- 20 -> 39: images of PersonB (training set)

If the feature descriptions are appropriate, the feature distance measurements should lead to
- a very small distance (= large similarity measure) between training and test images of personA (and similarly for personB)
- a very high distance (= small similarity measure) between test images of personA and training images of personB (and vice versa)

Put another way, images of the same class should share feature descriptions, and not share the feature descriptions of the other class. This is exactly in line with how we defined a *good* feature at the beginning of this notebook.

Intuitively, the feature description of:
- test personA $\simeq$ training personA $\&$ test personA $\neq$ training personB; 
- test personB $\neq$ training personA $\&$ test personB $\simeq$ training personB; 
- test personC $\sim$ training personA $\&$ test personC $\neq$ training personB; 
- test personD $\neq$ training personA $\&$ test personD $\sim$ training personB; 

Plotted as a matrix, it could hence look like the following images.

In [0]:
test_dist_mtx = np.array([[5,20], [20,5], [10,20], [20,10]])
plot_similarity_matrix(test_dist_mtx, show_numbers=True, width=5, height=5, fontsize=14)

To ease later computations, we also write a simple `get_distances` function (the naming is not great...) that returns the sum of the distances in each of the eight blocks of the global pairwise distance matrix (4 groups of test images × 2 groups of training images). If this is not clear yet, it will become so when we use it.

In [0]:
def get_distances(matrix_distances):
    '''
    res:[ 
            [dist(A,A), dist(A,B)]
            [dist(B,A), dist(B,B)]
            [dist(C,A), dist(C,B)]
            [dist(D,A), dist(D,B)]
        ]

    '''
    res = np.empty((4,2))
    res[0,0] = sum(sum(matrix_distances[0:10,0:20]))
    res[0,1] = sum(sum(matrix_distances[0:10,20:40]))
    res[1,0] = sum(sum(matrix_distances[10:20,0:20]))
    res[1,1] = sum(sum(matrix_distances[10:20,20:40]))
    res[2,0] = sum(sum(matrix_distances[20:30,0:20]))
    res[2,1] = sum(sum(matrix_distances[20:30,20:40]))
    res[3,0] = sum(sum(matrix_distances[30:40,0:20]))
    res[3,1] = sum(sum(matrix_distances[30:40,20:40]))
    return res
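
The same eight block sums can also be obtained without writing each slice by hand. The following is a minimal sketch, assuming the layout described above (4 groups of 10 test images as rows, 2 groups of 20 training images as columns); `block_sums` is a hypothetical name, not a function of this notebook:

```python
import numpy as np

def block_sums(matrix_distances, n_test_groups=4, n_train_groups=2):
    """Sum a (40, 40) distance matrix over its (test group, training group) blocks."""
    n_test, n_train = matrix_distances.shape
    rows = n_test // n_test_groups      # test images per person (10 here)
    cols = n_train // n_train_groups    # training images per person (20 here)
    # Split both axes into (group, index-in-group) and sum over the in-group axes.
    blocks = matrix_distances.reshape(n_test_groups, rows, n_train_groups, cols)
    return blocks.sum(axis=(1, 3))

# Quick check against one explicit slice, on random data.
rng = np.random.default_rng(0)
demo = rng.random((40, 40))
res = block_sums(demo)
print(res.shape)                                        # (4, 2)
print(np.isclose(res[2, 1], demo[20:30, 20:40].sum()))  # True
```

The reshape relies on the images of each person being stored contiguously, which is the ordering used throughout this section.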


### HOG feature descriptors





#### Pre-process data

Similarly to classification, the very first step is to pre-process the data. 
We use the `get_matrix_from_set` function, which supports different image sizes and color modes. 

For the identification, it is not required to shuffle the training set, as the distance to all samples is computed.
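
To illustrate why shuffling is unnecessary here, consider the toy sketch below (random vectors stand in for the image features; all names are illustrative). Permuting the training rows only permutes the columns of the distance matrix, so each test sample keeps the same nearest training sample and therefore the same label:

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.random((6, 4))          # 6 toy "training" feature vectors
labels = np.array([0, 0, 0, 1, 1, 1])
test = rng.random((3, 4))           # 3 toy "test" feature vectors

def nearest_labels(test, train, labels):
    # Euclidean distance between every test row and every training row.
    dist = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return labels[dist.argmin(axis=1)]

perm = rng.permutation(len(train))  # shuffle the training set
same = np.array_equal(nearest_labels(test, train, labels),
                      nearest_labels(test, train[perm], labels[perm]))
print(same)  # True
```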

In [0]:
X_train = get_matrix_from_set(training_set, color, sq_size = sq_size, flatten = False)
X_test = get_matrix_from_set(test_set, color, sq_size = sq_size, flatten = False)

y_train = np.zeros((40,))
y_train[20:40] = 1
 
'''
For now, set up "0" for personC; "1" for personD !
'''
y_test = np.zeros((40,))
y_test[10:20] = 1
y_test[30:40] = 1


##### Visualization


In [0]:
logging.info("Training set (horizontal matrix axis):")
logging.info("Training set PersonA [0 -> 19]:")
plot_matrix(X_train[0:20,:], color, my_color_map, h=1, w=20, transpose = False)
logging.info("Training set PersonB [20 -> 39]:")
plot_matrix(X_train[20:40,:], color, my_color_map, h=1, w=20, transpose = False)

logging.info("Test set (vertical matrix axis):")
logging.info("Test set PersonA [0 -> 9]:")
plot_matrix(X_test[0:10, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonB [10 -> 19]:")
plot_matrix(X_test[10:20, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonC [20 -> 29]:")
plot_matrix(X_test[20:30, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonD [30 -> 39]:")
plot_matrix(X_test[30:40, :], color, my_color_map, h=1, w=10, transpose = False)

##### Creation of the transformer

As usual now, let's create the transformer that computes the HOG of the images.
The parameter values may look new to you: don't panic. They result from an optimization discussed later, and their effect on the identification is also discussed a bit later. 

In [0]:
hogify = None
hogify = HogTransformer(    orientations = 9, 
                            pixels_per_cell = (16,16), 
                            cells_per_block = (2,2),
                            block_norm = "L2-Hys", 
                            transform_sqrt = False,
                            multichannel = color)


##### Transformation of inputs
 - **X_train**: application of the `fit` then `transform` methods of the transformer
 - **X_test**: application of the `transform` method of the transformer

As discussed already, this doesn't really change anything for the HOG computation, as `fit` doesn't do much (recall that it is a homemade transformer we created in the previous task).

In [0]:
X_train_hog = hogify.fit_transform(X_train)
X_test_hog = hogify.transform(X_test)

logging.debug("X_train_hog Shape: " + str(X_train_hog.shape))
logging.debug("X_test_hog Shape: " + str(X_test_hog.shape))


#### Compute pairwise distances

After computing the feature representations of both data sets (training and test), we can compute the pairwise distances, i.e. the distance between each (test, training) pair. 

Three common distance formulas are used:
1. Euclidean distance
2. Manhattan distance
3. Cosine distance

While the Euclidean distance is intuitive up to 3 dimensions, it is known to behave less well in high dimensions, where "everything is far away". Comparing the results of the three metrics will give better insight into whether there is an issue with the distance measurements.
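
To make the three formulas concrete, here is a minimal `numpy` sketch for a single pair of vectors (the helper names are illustrative; the notebook itself relies on the `sklearn` pairwise functions imported earlier, which compute the same quantities for every pair of rows). It also shows one practical difference: scaling a vector changes the Euclidean and Manhattan distances, but not the cosine distance, which only depends on the angle.

```python
import numpy as np

def euclidean(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

def manhattan(u, v):
    return np.sum(np.abs(u - v))

def cosine_distance(u, v):
    # 1 - cosine similarity: depends only on the angle between u and v.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
v = 2 * u  # same direction, twice the magnitude

print(euclidean(u, v))        # sqrt(14), about 3.742
print(manhattan(u, v))        # 6.0
print(cosine_distance(u, v))  # 0 up to floating-point error
```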


Let's recall the formalism we use:
- horizontal axis: training images [20 personA; 20 personB]
- vertical axis: test images [10 personA; 10 personB, 10 personC, 10 personD]



In [0]:
hog_distances_eucl = euclidean_distances(X_test_hog, X_train_hog)
hog_distances_man = manhattan_distances(X_test_hog, X_train_hog)
hog_distances_cos = cosine_distances(X_test_hog, X_train_hog)

# logging.debug("DISTANCE BASED ON COSINE - LOG10")
# plot_similarity_matrix(np.log10(hog_distances_cos))

logging.info("DISTANCE BASED ON EUCL")
plot_similarity_matrix((hog_distances_eucl))

logging.info("DISTANCE BASED ON MANHATTAN")
plot_similarity_matrix((hog_distances_man))

logging.info("DISTANCE BASED ON COSINE")
plot_similarity_matrix((hog_distances_cos))

#### Analysis on the distance computed

From a macroscopic view, the first thing to note is that all the distances give *similar* results. By macroscopic, I mean viewing the matrix as divided into the eight blocks presented above. This is particularly true when comparing the normed distances, on the right side. 
> Remember that on the right side, the values of each row sum to 100. This helps to see, in relative terms, which training images are the closest to/furthest from a specific test image, which is what we are actually interested in.

Of course there are differences between the colors, but they don't change drastically, and the Euclidean distance seems to give enough granularity to continue using it.

As a lot of colors and matrices have been shown, let's plot below only the Euclidean distance, normed per test sample (that is, the Euclidean distance as a percentage of the sum of the distances, per test sample).



In [0]:
logging.info("Expected Look-alike matrix coloration")
plot_similarity_matrix(test_dist_mtx, show_numbers=True, norm_only=True, width=3, height=3, fontsize=14)


logging.info("Global results of pairwise distances")
hog_eucl_normed = plot_similarity_matrix(hog_distances_eucl, show_numbers = False, norm_only=True, width=10, height = 10, fontsize = 10, return_normed=True)

res=get_distances(hog_distances_eucl)
logging.info("Macroscopic view")
plot_similarity_matrix(res, show_numbers=True, norm_only=True, width=4, height=4)

> The macroscopic plot is a very coarse view of the large matrix. It reads as *"56% of the summed distances from all personB test images to all training images is due to personA training images, while only 44% is due to personB training images. It seems reasonable to assume that personB test images are closer to personB training images than to personA training images"*.

Continuing the analysis of the pairwise distance results, focusing on this normed matrix:
- from the macroscopic view, it complies with expectations for all eight blocks (test personX - training personY) except for Jane Levy, personC
- Test images of personC are overall closer to personB than to personA, which is not what we would expect considering personC is a young white female with a hair color similar to personA's.
- PersonB test images are clearly close to personB training images; personA test images are globally further from their corresponding training images.

However, this is fully in line with the classification results obtained in the previous section for the HOG feature, and with the digging we did into the HOG representation. Because of the haircut, a large part of the histogram of personC is actually much more similar to that of Bradley Cooper (personB). 

- At the image level, it is quite obvious that some training images seem (very) far from all others. Visually, this is the case for training set images 0, 18 and 19. This is confirmed when looking at the sum of the (positive) distances over all test examples -- see next plot. 
It seems those images are not very useful, for identification purposes at least. A careful eye confirms that these three images belong to the personA dataset. A conclusion is then: *Based on the pairwise distances between all HOG representations, three images of the personA dataset seem to be of little help in identification tasks.*  Note that the threshold on the graph below is chosen purely arbitrarily.

In [0]:
if not color:
    threshold = 65
else:
    threshold = 65
eucl_dist_vert_sum = np.sum(np.abs(hog_distances_eucl), axis = 0)

# Visualization
fig=plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.scatter([i for i in range(len(eucl_dist_vert_sum))],eucl_dist_vert_sum)

ax.plot([-1,41],[threshold,threshold], '-r')
ax.set_title("Sum of the distances to test images, per training images")
plt.show()

indices = np.where(eucl_dist_vert_sum > threshold)[0]
logging.info("Indices of training images > Threshold: " + str(indices))


plot_matrix(X_train[indices,:], color, my_color_map, h=1, w=len(indices))

- The results obtained for Jane Levy are *completely* in line with the results obtained in the previous classification task! The reason behind these *bad* results, as already discussed in detail in the previous section, is mainly the haircut, which makes personC's HOG similar to personB's. 

- Nothing much special about Marc Blucas, personD, other than that the results are as expected: less similar than average to personA, more similar to personB.

#### HOG Transformer parameters

As for the classification, the distance results of course depend on the HOG transformer parameters, specifically `pixels_per_cell` and `cells_per_block`, which define the spatial granularity of the representation. 
We expect that, with a smaller number of pixels per cell, the actual values will change but the overall behavior (macroscopic view) should remain. 

We can try that out with `pixels_per_cell = (4,4)`, for instance, hence *16 times finer* (each cell covers 16 times fewer pixels than with `(16,16)`).
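
To get a feeling for what this finer granularity does to the descriptor, the sketch below counts the number of HOG values per image. It assumes that `HogTransformer` wraps `skimage.feature.hog` (the parameter names suggest so, but this is an assumption), that blocks slide one cell at a time, and a hypothetical square image of side 64, since `sq_size` is defined elsewhere in the notebook; `hog_feature_length` is an illustrative helper.

```python
def hog_feature_length(image_side, pixels_per_cell, cells_per_block=2, orientations=9):
    """Descriptor length for a square image, assuming blocks slide one cell at a time."""
    cells = image_side // pixels_per_cell           # cells along one side
    blocks = cells - cells_per_block + 1            # block positions along one side
    return blocks * blocks * cells_per_block ** 2 * orientations

side = 64  # hypothetical image side; the notebook uses sq_size, defined elsewhere
print(hog_feature_length(side, 16))  # 324
print(hog_feature_length(side, 4))   # 8100
```

Under these assumptions the descriptor becomes 25 times longer with the smaller cells, which is consistent with the raw distances (and the threshold used below) being on a larger scale than before.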


In [0]:
'''
Hog Transformer
'''
hogify = HogTransformer(    orientations = 9, 
                            pixels_per_cell = (4,4), 
                            cells_per_block = (2,2),
                            block_norm = "L2-Hys", 
                            transform_sqrt = False,
                            multichannel = color)

'''
Preparation of the data
'''
X_train_hog = hogify.fit_transform(X_train)
X_test_hog = hogify.transform(X_test)

'''
Pairwise distance computing
'''
hog_distances_eucl = euclidean_distances(X_test_hog, X_train_hog)

'''
Show the Matrix
'''
logging.info("DISTANCE BASED ON EUCL")
plot_similarity_matrix((hog_distances_eucl))

'''
Setup the new threshold value (trial-error)
'''
if not color:
    threshold = 570
else:
    threshold = 570
'''
Compute sum in vertical axis
'''
eucl_dist_vert_sum = np.sum(np.abs(hog_distances_eucl), axis = 0)

'''
extract indices
'''
indices = np.where(eucl_dist_vert_sum > threshold)[0]

'''
Plot results
'''
# print(eucl_dist_vert_sum)
fig=plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.scatter([i for i in range(len(eucl_dist_vert_sum))],eucl_dist_vert_sum)
ax.plot([-1,41],[threshold,threshold], '-r')
ax.set_title("Sum of the distances to test images, per training images")
plt.show()
logging.info("Indices of training images > Threshold: " + str(indices))

'''
Show images
'''
plot_matrix(X_train[indices,:], color, my_color_map, h=1, w=len(indices))

With such a small number of pixels per cell (details of what happens in the HOG can be found in part 1 of this tutorial), the overall results remain: as expected for all eight blocks except personC's!

* The most distant training images are not all the same as before:
    - 2 Emma Stone images remain the furthest from all test images, 
    - 1 Bradley Cooper image becomes the third furthest. 

This parameter doesn't actually change a lot: overall, the distribution of the distances doesn't change much. However, it indicates once more the locality of the HOG representation, and the impact of the feature representation parameters on the subsequent tasks. 

* Another interesting consequence of this finer HOG is that personC and personD seem "more distant", globally, from the training samples. This is indicated by the reddish lower part of the left-side matrix above. The results comply with our intuition that personA and personB test images should be "closer" to the training set, as they show the same persons.

* Finally, it becomes much clearer with this finer HOG that some test images of personA appear further away from the training examples. For instance, images #1 and #9 of Emma Stone (vertical axis) don't seem to have any blue parts. This is interesting to put in perspective with the classification results obtained with the HOG representation before, where images 1 and 9 were among the misclassified ones, at least at the first attempt (prior to any optimization).



#### Identification of test images using k-NN

In this step, using k-nearest neighbors, the goal is to label the test images according to their nearest neighbours. 
Intuitively, each image on the vertical axis of the matrix above should get the label of its bluest set of neighbors (either personA (=0) or personB (=1)).

As we are familiar with pipelines from the previous section, we will implement a pipeline using the k-NN classifier in the HOG feature space. 
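
Before relying on the library classifier, the labeling rule itself can be sketched directly on a distance matrix. This is a minimal illustration with made-up distances and a hypothetical helper `knn_labels`; it is not the `sklearn` implementation used in the pipeline below.

```python
import numpy as np

def knn_labels(distances, train_labels, k=1):
    """Label each row (test image) by majority vote among its k closest columns."""
    nearest = np.argsort(distances, axis=1)[:, :k]   # indices of the k closest training images
    votes = train_labels[nearest]                    # their labels, shape (n_test, k)
    # Majority vote for binary labels {0, 1}; an exact tie (even k) falls to label 0 here.
    return (votes.mean(axis=1) > 0.5).astype(int)

# Toy distances: 3 test images (rows) vs 4 training images (columns).
toy_dist = np.array([[1.0, 2.0, 8.0, 9.0],
                     [7.0, 9.0, 1.5, 2.5],
                     [3.0, 6.0, 4.0, 5.0]])
toy_labels = np.array([0, 0, 1, 1])

print(knn_labels(toy_dist, toy_labels, k=1))  # [0 1 0]
print(knn_labels(toy_dist, toy_labels, k=3))  # [0 1 1]
```

The third toy image changes label between `k=1` and `k=3`, which is exactly why the choice of k deserves some care.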





##### Choosing k
**What k-value to take?**
This is often a critical question.
We could start with k=1, assuming that, given the small number of samples in the training set, the closest sample should provide the most appropriate label. 

**Can we do better?**
Definitely. One of the best ways is to assess different values of k through cross-validation, holding out part of the training data as a validation set. 
Although it is a spoiler for the next section, "Impress your TA's", we will use the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) technique here, restricted to a small grid over k and `pixels_per_cell`.

In [0]:
k=1


'''
Definition of a pipeline
'''
HOG_knn_pipeline = Pipeline([
                         ('hogify', HogTransformer(
                             orientations = 9,
                             pixels_per_cell = (16,16),
                             cells_per_block = (2,2),
                             block_norm='L2-Hys',
                             transform_sqrt=True, 
                             multichannel = color)
                         ),
                         ('classify', 
                          KNeighborsClassifier( n_neighbors = k, metric='euclidean')
                         )])


param_grid_knn = [
    {
        'hogify__pixels_per_cell': [(4,4),(16,16)],
        'classify__n_neighbors':[1,2,3]}
]

grid_search = GridSearchCV(HOG_knn_pipeline,
                           param_grid_knn, 
                           cv = 3, 
                           n_jobs = -1,
                           scoring = "accuracy",
                           verbose = 1, 
                           return_train_score = True)

grid_res = grid_search.fit(X_train, y_train)



`GridSearchCV` helps to choose systematically among different (hyper-)parameter values by automatically running the pipeline with each combination and reporting the best one according to a specified metric. Here, I choose accuracy as the metric. 

The accuracy is computed using cross-validation on the training set, **not** on the test set.
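
To picture what `cv = 3` means for our 40 training images, the sketch below builds three class-balanced validation folds by hand. It only illustrates the idea of stratified folds; it is not claimed to reproduce the exact fold assignment made by `sklearn`, and the variable names are illustrative.

```python
import numpy as np

labels = np.array([0] * 20 + [1] * 20)   # same layout as y_train: 20 personA, then 20 personB
n_folds = 3

# Split the indices of each class into n_folds chunks, then merge chunk i of every class.
per_class = [np.array_split(np.flatnonzero(labels == c), n_folds) for c in (0, 1)]
folds = [np.concatenate([chunks[i] for chunks in per_class]) for i in range(n_folds)]

for i, val_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(len(labels)), val_idx)
    # fold number, training-fold size, validation-fold size, per-class counts in validation
    print(i, len(train_idx), len(val_idx), np.bincount(labels[val_idx]))
```

Each candidate parameter combination is fitted on the training part of a fold and scored on the held-out part; the reported score is the mean over the three folds, so the test set is never used for the selection.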



##### Prediction and metrics

In [0]:
logging.info("Best parameters: " + str(grid_res.best_params_))
logging.info("Best scores: " + str(100* grid_res.best_score_ ) + "% ")

best_prediction_ab = grid_res.predict(X_test_ab)
logging.info('Percentage correct persons A, B: ' + str(100*np.sum(best_prediction_ab == y_test_ab)/len(y_test_ab)))
best_prediction_cd = grid_res.predict(X_test_cd)
logging.info('Percentage correct persons C, D: ' + str( 100*np.sum(best_prediction_cd == y_test_cd)/len(y_test_cd)))

In [0]:
show_missed(X_test_ab, y_test_ab, best_prediction_ab)
show_missed(X_test_cd, y_test_cd, best_prediction_cd)

As foreseen from the distance matrices shown above, the accuracy is very high for A and B, and personC is the most difficult to identify.

A key message here is that the results also vary with the hyperparameters. 



##### Closest and furthest nearest neighbors

After this optimization step on the HOG transformer and the k-NN hyperparameter selection, we can:
- recompute the pairwise distances and plot the resulting matrix, as before
- show the two closest images
- show the nearest neighbors with the largest distance in-between

In [0]:
'''
get hog transformed for X_test and X_train
'''
X_test_hog = grid_res.best_estimator_['hogify'].transform(X_test)
X_train_hog = grid_res.best_estimator_['hogify'].transform(X_train)

'''
Compute pairwise distances
'''
hog_distances_eucl = euclidean_distances(X_test_hog, X_train_hog)

'''
plot matrix
'''
plot_similarity_matrix(hog_distances_eucl)

'''
Using classifier kNN, get closest neighbours list
'''
closest_neighbors = grid_res.best_estimator_['classify'].kneighbors(X_test_hog, 10)
logging.info("\nImage # => Image Training idx @ distance \n")
for i in range(closest_neighbors[1].shape[0]):
    print("Image Test " + str(i) + " => closest neighbor: Image Training  " + str(closest_neighbors[1][i,0]) + "  @ " + str(np.round(closest_neighbors[0][i,0],3)))



From the results printed above, where the closest training image is shown for each test image, an interesting point to note is the min and max values of the nearest-neighbor distances:

- Image Test #17 => very close to #38, 0.776
- Image Test #35 => (not so) close to #19, 1.591

We clearly see that the distance gives a hint about the certainty of the identification: the distance is very low for test image #17, and high for test image #35. This indicates that this personD test image is actually pretty far from everything (considering the Euclidean distance). 

Looking at the images, it is actually clear to a human why those are considered the closest, and why the images appear much closer for personB (Bradley Cooper, top) than for Marc Blucas, personD (bottom).
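
The closest and furthest nearest neighbors do not have to be read off the printed list. The sketch below shows how they could be extracted from a `(distances, indices)` pair shaped like the output of `kneighbors`; the arrays are small made-up stand-ins (not the notebook's results) and the variable names are illustrative.

```python
import numpy as np

# Toy stand-in for the (distances, indices) pair returned by kneighbors:
# one row per test image, columns sorted from closest to furthest neighbor.
nn_dist = np.array([[1.10, 1.30],
                    [0.78, 0.95],
                    [1.59, 1.62]])
nn_idx = np.array([[3, 7],
                   [38, 21],
                   [19, 2]])

first_dist = nn_dist[:, 0]              # distance to the nearest neighbor of each test image
closest = int(np.argmin(first_dist))    # test image with the closest nearest neighbor
furthest = int(np.argmax(first_dist))   # test image with the furthest nearest neighbor

print(closest, nn_idx[closest, 0], first_dist[closest])     # 1 38 0.78
print(furthest, nn_idx[furthest, 0], first_dist[furthest])  # 2 19 1.59
```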


In [0]:
'''
Closest nearest neighbors of all
'''
plot_matrix([X_test[17,:],X_train[38,:]], color, my_color_map, h=1, w=2)
show_one_image_hog(17-10, personB, 'test')
show_one_image_hog(38-20, personB, 'training')



- personB: same pose, same scale, same viewpoint, same facial expression... The HOG will be similar!

In [0]:
'''
Furthest nearest neighbors of all
'''
plot_matrix([X_test[35,:],X_train[19,:]], color, my_color_map, h=1, w=2)
show_one_image_hog(35-30, personD, 'test')
show_one_image_hog(19, personA, 'training')

- The personD image has an unusual face pose, and almost a quarter of the image, on the right side, is background, leading to a vertical edge "in the middle". 
Unsurprisingly, this makes personD's image far from everything else in terms of HOG representation, and the closest image is determined by this face pose: the right eye and the vertical line are the most important elements of the representation. As a result, while the distance is larger than in the previous case (analysis of personB), an Emma Stone image becomes the closest to Marc Blucas' image. 

### PCA feature descriptors

We can repeat, for the PCA feature representation, the steps performed for the HOG feature representation. The same formalism is used.

#### Pre-process data

As usual now, and because of our current architecture, it is necessary to re-import the data using the `flatten = True` parameter. 

In [0]:
X_train = get_matrix_from_set(training_set, color, sq_size = sq_size, flatten = True)
X_test = get_matrix_from_set(test_set, color, sq_size = sq_size, flatten = True)
y_train = np.zeros((40,))
y_train[20:40] = 1
 
'''
For now, set up "0" for personC; "1" for personD !
'''
y_test = np.zeros((40,))
y_test[10:20] = 1
y_test[30:40] = 1


'''
AB >< CD
'''
X_test_ab = X_test[0:20,:]
y_test_ab = y_test[0:20]

X_test_cd = X_test[20:40,:]
y_test_cd = y_test[20:40]


We can visualize - once more - the different sets

In [0]:
logging.info("Training set (horizontal axis):")
logging.info("Training set PersonA [0 -> 19]:")
plot_matrix(X_train[0:20,:], color, my_color_map, h=1, w=20, transpose = False)
logging.info("Training set PersonB [20 -> 39]:")
plot_matrix(X_train[20:40,:], color, my_color_map, h=1, w=20, transpose = False)

logging.info("Test set (vertical axis):")
logging.info("Test set PersonA [0 -> 9]:")
plot_matrix(X_test[0:10, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonB [10 -> 19]:")
plot_matrix(X_test[10:20, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonC [20 -> 29]:")
plot_matrix(X_test[20:30, :], color, my_color_map, h=1, w=10, transpose = False)
logging.info("Test set PersonD [30 -> 39]:")
plot_matrix(X_test[30:40, :], color, my_color_map, h=1, w=10, transpose = False)


#### Compute pairwise distances

In [0]:
pcaify = None
pcaify = sklearn_decomposition_PCA(n_components = 35)

X_train_PCA = pcaify.fit_transform(X_train)
X_test_PCA = pcaify.transform(X_test)


# pca_distances_cos = cosine_distances(X_test_PCA, X_train_PCA)
pca_distances_eucl = euclidean_distances(X_test_PCA, X_train_PCA)

# plot_similarity_matrix((pca_distances_cos))
plot_similarity_matrix((pca_distances_eucl))


#### Analysis on the distances computed



In [0]:
logging.info("Expected look-alike macroscopic coloration")
plot_similarity_matrix(test_dist_mtx, show_numbers=True, norm_only=True, width=3, height=3, fontsize=14)
res=get_distances(pca_distances_eucl)

logging.info("Macroscopic view of the pairwise distances coloration")
plot_similarity_matrix(res, show_numbers=True, norm_only=True, width=4, height=4)

From a macroscopic view, personA and personB seem to confirm the expected behavior, specifically using the normed plot (on the right) or, even better, the tiny macro representation just above.
In the case of the PCA feature representation, this indicates that the variance in the personA and personB test images is explained in a similar fashion to the variance of (one or more) training images.

> The formalism behind the different plots was already explained above and is not repeated here. Please read the previous sections again if needed.

Similarly to the analysis performed for the classification task, the results for personC and personD differ notably from those of the HOG feature representation: 
- here, some personC images seem properly similar to personA, and some personD images seem properly similar to personB, without a clear indication that personC is globally much further from personA than personD is from personB.
Said differently, there may be some personC images close to personA. This wasn't really the case with the HOG feature representation.
- personD test images don't look as close to personB training images as they did with the HOG representation. 


Considering the remarks already made, and the care needed to interpret this "average distance", it does not directly follow that the results of the identification using k-NN will be worse for personD than with the HOG feature. 

---

Another point, perhaps more surprising -- but highlighting the differences between the two feature representations -- is which training images are the most dissimilar to the test images.

As we did before, let's sum the distances vertically, and show which image is "globally" the furthest. 
> Of course, this metric has its limitations: the sum of the distances may be inflated by one very high distance (or, conversely, reduced by one very small one). But this is precisely what we wish to show, and even if there are some pitfalls, it is a convenient way to illustrate our point.

In [0]:
if color:
    threshold = 355000
else:
    threshold = 200000
eucl_dist_vert_sum = np.sum(np.abs(pca_distances_eucl), axis = 0)
# print(eucl_dist_vert_sum)
fig=plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.scatter([i for i in range(len(eucl_dist_vert_sum))],eucl_dist_vert_sum)
ax.plot([-1,41],[threshold,threshold], '-r')
ax.set_title("Sum of the distances to test images, per training images")
plt.show()

indices = np.where(eucl_dist_vert_sum > threshold)[0]
logging.info("Indices of training images > Threshold: " + str(indices))


plot_matrix(X_train[indices,:], color, my_color_map, h=1, w=len(indices))

- The image that is the furthest from all test images is different from the one found with the previous feature representation. Note that the threshold is arbitrarily chosen.

It means that the combination of eigenface weights of this training image is the most distant from the combinations of the test images. The way the variance is explained in this #20 training image is different from the way it is explained in any test image.
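
The "combination of weights of the eigenfaces" can be made explicit with a few lines of `numpy`. The sketch below uses random vectors as stand-ins for the flattened images and a plain SVD instead of the `sklearn` PCA used in the notebook; the names are illustrative. Each image is centered, projected on the principal components, and the resulting weight vectors are then compared with the Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((8, 50))   # 8 toy "images" of 50 pixels each (random stand-ins)
test = rng.random((3, 50))

# PCA via SVD of the centered training data; rows of `components` play the role of eigenfaces.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:5]           # keep 5 components

# Each image is now described by its weights on the components.
train_w = (train - mean) @ components.T
test_w = (test - mean) @ components.T

# Pairwise Euclidean distances between weight vectors, shape (n_test, n_train).
dist = np.linalg.norm(test_w[:, None, :] - train_w[None, :, :], axis=2)
print(dist.shape)                       # (3, 8)
print(int(dist.sum(axis=0).argmax()))   # training image globally furthest from the test images
```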

#### Identification of test images using k-NN

In this step, using k-nearest neighbors, the goal is to label the test images according to their nearest neighbours. Intuitively, each image on the vertical axis of the matrix above should get the label of its bluest set of neighbors (either personA (=0) or personB (=1)).




##### Choosing k
As discussed in the HOG-based identification, choosing the right value for k usually involves several tests on a validation set. That's what we did previously. 

In this PCA-based identification task, we will see another way, more intuitive yet sensible.

When plotting the matrix of distances, many colors appear and it is not easy to determine which k would be most appropriate. The question is "which k is a good trade-off so that most of the images seem to be identified appropriately". 
Some thoughts: 
- k should not lead to an identification that is too noisy, 
- k should be kept small,
- k should be such that only the distances that **matter** are taken into account.

###### What does it mean? 
Put differently, we should select k such that only the neighbours that seem really relevant drive the decision. 

Let's say we consider a sample and its neighbours. The first 2 neighbours are really close and indicate "class 0", and the next three neighbours are all much further (yet closer than the remaining samples) and indicate "class 1". k should not take neighbours 3, 4 and 5 into account (not too much, at least), or the sample could be identified as "class 1" while it *obviously* should have been "class 0" thanks to the two closest. 
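
This scenario can be written down with toy numbers (the distances below are made up to match the description). The sketch also shows distance-weighted voting, which is an alternative not used in this notebook (it corresponds to the `weights='distance'` option of `KNeighborsClassifier`) and which lets a larger k still favour the two close neighbours.

```python
import numpy as np

# Neighbours of one sample, sorted by distance (toy values matching the scenario above).
dist = np.array([1.0, 1.2, 5.0, 5.5, 6.0])
labels = np.array([0, 0, 1, 1, 1])

def majority_vote(k):
    return int(np.bincount(labels[:k], minlength=2).argmax())

def distance_weighted_vote(k):
    # Each neighbour votes with weight 1/distance, so far neighbours count less.
    weights = 1.0 / dist[:k]
    return int(np.bincount(labels[:k], weights=weights, minlength=2).argmax())

print(majority_vote(2))           # 0
print(majority_vote(5))           # 1
print(distance_weighted_vote(5))  # 0
```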

######How to do that ?
Intuitively, we could look at the matrix above and evaluate "by the eye" the number of "most blues", on average. A little subtlety, however: what we are really interested in is the order of magnitude of the distance, hence we take the logarithms of the distances to visually represent better this order of magnitude. 

######On what data set should we do that ?
Well, definitely not on the test set. As we don't have many input samples, and as this is mostly an intuition rather than a rigorous approach, we decide to apply it to the full training set instead of a validation set. 



In [0]:
dist_training_training = euclidean_distances(X_train_PCA, X_train_PCA)
for i in range(dist_training_training.shape[0]):
    dist_training_training[i,i]= max(dist_training_training[i,:]) 

logging.info("Pairwise distance between training images\n(diagonal = max distances, for visualization)")
plot_similarity_matrix((dist_training_training), False, norm_only=True,cmap=plt.cm.jet)

logging.info("Pairwise distance in log scale between training images\n(diagonal = max distances, for visualization)")
plot_similarity_matrix(np.log(dist_training_training), False, norm_only=True, cmap=plt.cm.jet)

The first matrix shows the distances in regular scale, and the second shows them in log scale, which gives a better intuition of their order of magnitude.

Following this last plot, a parameter $k=1$ or $k=2$ seems an acceptable choice: most of the images seem to be closest (darkest blue) to at least one other corresponding image.
Of course, it could be argued that this is neither mathematically rigorous nor reliable. I agree, and this is why cross validation was used in the previous section. Nonetheless, I have presented here another, quick way of selecting this parameter.

#####Prediction

First, we try out some combinations, again using the *gridsearch* method, and we predict the labels of the test images.

In [0]:
k=2

'''
Definition of a pipeline
'''
PCA_knn_pipeline = Pipeline([
                         ('pcaify', sklearn_decomposition_PCA(n_components=35)
                         ),
                         ('classify', 
                          KNeighborsClassifier( n_neighbors = k, metric='euclidean')
                         )])


param_grid_knn = [
    {
        'pcaify__n_components': range(15,40,1),
        'classify__n_neighbors':[1,2,3]}
]

grid_search = GridSearchCV(PCA_knn_pipeline,
                           param_grid_knn, 
                           cv = 2, 
                           n_jobs = -1,
                           scoring = "accuracy",
                           verbose = 1, 
                           return_train_score = True)

'''
"Training"
'''
grid_res = grid_search.fit(X_train, y_train)
logging.info("Best parameters : " + str(grid_res.best_params_))
logging.info("Best scores (CV): " + str(100* grid_res.best_score_ ) + "% ")

Now, we are ready to use the considered "best" pipeline in order to predict the labels:
 - $p=18$, number of components for the PCA transformer, 
 - $k=1$, number of neighbors to label the test sample


In [0]:
'''
prediction
'''
best_prediction_ab = grid_res.predict(X_test_ab)
logging.info('Percentage correct persons A, B: ' + str(100*np.sum(best_prediction_ab == y_test_ab)/len(y_test_ab)))
best_prediction_cd = grid_res.predict(X_test_cd)
logging.info('Percentage correct persons C, D: ' + str( 100*np.sum(best_prediction_cd == y_test_cd)/len(y_test_cd)))


In [0]:
'''
Visualization of the images
'''
logging.info("Tests - Identification of personA and personB")
show_missed(X_test_ab, y_test_ab, best_prediction_ab)

logging.info("\n"*2)
logging.info("Tests - Identification of personC and personD")
show_missed(X_test_cd, y_test_cd, best_prediction_cd)

We have reused the `GridSearchCV` technique for the number of neighbours $k$, but also for the number of components $p$ used in the PCA representation.
* For instance, good accuracy can already be achieved with other parameters: 
    * $p=5$, $k=1$: test AB: 90%, test CD: 60% (cv = 2)
    * $p=20$, $k=1$: test AB: 100%, test CD: 65% (cv = 2)

Those parameters are found with GridSearch on larger batches. The considered best is the couple $(p,k) = (18,1)$, leading to test AB: 100% and test CD: 65%.

As a reminder, "considered best" does **not** refer to the accuracy on the test set, but to the score obtained during the cross-validation step.

It is important to note (again) that, the number of training samples being limited, cross validation may not deliver its full potential.
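To illustrate what "considered best" means, the sketch below runs a small grid search on synthetic data (a stand-in, not the notebook's images) and checks that `best_score_` is simply the highest mean cross-validation score. The names `X_demo`, `demo_pipeline` and `demo_grid` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# synthetic stand-in for 40 training samples (NOT the notebook's data)
X_demo, y_demo = make_classification(n_samples=40, n_features=30,
                                     n_informative=5, random_state=0)

demo_pipeline = Pipeline([
    ("pcaify", PCA(n_components=5)),
    ("classify", KNeighborsClassifier(metric="euclidean")),
])
demo_grid = {"pcaify__n_components": [2, 5, 10],
             "classify__n_neighbors": [1, 2, 3]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=2, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# one mean cross-validation score per parameter combination
mean_scores = demo_search.cv_results_["mean_test_score"]
for params, score in zip(demo_search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 3))

# the "best" combination is selected on these scores only; no test set is involved
print("best:", demo_search.best_params_, demo_search.best_score_)
```

With only two folds of about 20 samples each, these mean scores can be noisy, which is one way to see why a small training set limits what cross validation can deliver.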

With parameters $k=1$ and $p=18$, corresponding to the distances shown before, the identification results match the expectation:
- high accuracy on personA and personB, 
- much lower accuracy on personC and personD, which are further from all training images in our case. Nonetheless, the result is better than a random guess, luckily!

Comparing these results to the classification ones, there is no difference for personA and personB. For personC and personD, the misclassified images had indices [2, 3, 6, 7, 15, 16, 19]. In this identification step, the labeling errors (based on a single neighbour and 18 components) are [0, 3, 5, 6, 7, 12, 15]. They don't fully match, but we can still say that an image that is hard to identify tends to be hard to classify as well.
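The overlap between the two error lists quoted above can be quantified in a couple of lines. The Jaccard index used here is only one possible way to measure it.

```python
# indices of the misclassified personC / personD test images, as quoted above
classification_errors = {2, 3, 6, 7, 15, 16, 19}
identification_errors = {0, 3, 5, 6, 7, 12, 15}

common = sorted(classification_errors & identification_errors)
union = classification_errors | identification_errors
jaccard = len(common) / len(union)

print(common)   # [3, 6, 7, 15]
print(jaccard)  # 0.4
```

Four of the seven errors are shared by both tasks, which supports the "hard to identify, hard to classify" observation without making it a strict rule.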

#####Closest and Furthest nearest neighbors

As in the previous section, we can use this feature representation to show the closest and furthest nearest neighbours.



In [0]:
'''
get PCA transformed for X_test and X_train
'''
X_test_PCA = grid_res.best_estimator_['pcaify'].transform(X_test)
X_train_PCA = grid_res.best_estimator_['pcaify'].transform(X_train)

'''
Compute pairwise distances
'''
pca_distances_eucl = euclidean_distances(X_test_PCA, X_train_PCA)

'''
plot matrix
'''
plot_similarity_matrix(pca_distances_eucl)

'''
Using classifier kNN, get n closest neighbours list
'''
n=10
closest_neighbors = grid_res.best_estimator_['classify'].kneighbors(X_test_PCA, n)
logging.info("\nImage # => Image Training idx @ distance \n")
for i in range(closest_neighbors[1].shape[0]):
    print("Image Test " + str(i) + " => closest neighbor: Image Training  " + str(closest_neighbors[1][i,0]) + "  @ " + str(np.round(closest_neighbors[0][i,0],3)))


distance_max = np.max(closest_neighbors[0][:,0])
distance_max_index_test = np.where(closest_neighbors[0][:,0] == distance_max)[0][0]
distance_max_index_train = closest_neighbors[1][distance_max_index_test,0]

distance_min = np.min(closest_neighbors[0][:,0])
distance_min_index_test = np.where(closest_neighbors[0][:,0] == distance_min)[0][0]
distance_min_index_train = closest_neighbors[1][distance_min_index_test,0]
print("Largest  \"closest distance\" is: " + str(np.round(distance_max,0)) + " between (test image,training image) = (" + str(distance_max_index_test) + ", "+str(distance_max_index_train) + ")" )
print("Smallest \"closest distance\" is: " + str(np.round(distance_min,0)) + " between (test image,training image) = (" + str(distance_min_index_test) + ", "+str(distance_min_index_train) + ")" )





In [0]:
'''
Closest nearest neighbors of all
'''
plot_matrix([X_test[distance_min_index_test,:],X_train[distance_min_index_train,:]], color, my_color_map, h=1, w=2)

In [0]:
'''
Furthest nearest neighbors of all
'''
plot_matrix([X_test[distance_max_index_test,:],X_train[distance_max_index_train,:]], color, my_color_map, h=1, w=2)

This is a surprising result:
- the nearest neighbours with the shortest distance, using the PCA input, are definitely not what a human eye would have guessed: they are two different persons. 
It means that the variance from the mean image is explained in a similar fashion in both images. This result is confirmed by the darkest blue point of the normed distance matrix. 

- the nearest neighbours with the largest distance are -- without surprise now -- Jane Levy and Emma Stone. "Luckily", this still gives the correct label, but it also tends to indicate that the test image is quite different, PCA-feature wise, from the others. 



We could go on and look for other fun facts in those numbers, such as the test image that is the furthest from any training image.

###Identification - Conclusion

In this identification task, we computed a pairwise similarity score: the Euclidean distance. Using the distances from a test image to the training images, we labeled the test images according to their $k$ nearest neighbours.
We confirmed the role of the different parameters for both methods, as well as the results and trends that we could already observe in the previous tasks. 
We also saw that labeling with k-NN was at least "as easy" as the classification task (in terms of performance reached), specifically for personC and personD.

Of course, this is not the end of the story, and many more things could be achieved:
- analyzing in more depth the influence of the parameters of the feature representations, 
- for each test image, assessing the distribution of the distances to all other images, 
- within a class, assessing the distance between each pair of images
    * this would be linked to the notion of cluster 
- assess if there is a "predominant" image, an image that is close to many others
    * this actually would be the reverse of what we did when we observed which images were the furthest from all
- ...

This list is by no means exhaustive. 
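As a pointer for two of the ideas listed above (pairwise distances within a set, and a "predominant" image), here is a minimal sketch on random toy features. In the notebook, the distance matrix could instead come from `euclidean_distances` applied to the real feature matrix; the names `toy_features`, `predominant` and `most_isolated` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(6, 4))   # 6 hypothetical images, 4 features each

# pairwise Euclidean distances (n x n), zeros on the diagonal
diff = toy_features[:, None, :] - toy_features[None, :, :]
toy_dist = np.sqrt((diff ** 2).sum(axis=-1))

# average distance of each image to the other n - 1 images
mean_dist = toy_dist.sum(axis=1) / (len(toy_dist) - 1)

predominant = int(np.argmin(mean_dist))    # close to many others (a "medoid")
most_isolated = int(np.argmax(mean_dist))  # the reverse: furthest from all

print("predominant image  :", predominant, round(float(mean_dist[predominant]), 3))
print("most isolated image:", most_isolated, round(float(mean_dist[most_isolated]), 3))
```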


## Impress your TA's

> As said at the beginning of the classification section, I was normally exempted from the classification part but still decided to do it, out of personal interest and eagerness to learn. I hope it will also help to impress my TA's.

During the sections on classification and identification, a lot of work has been done to define a clean way to perform the tasks, using the `Transformer` and `Pipeline`. Also, `GridSearchCV` was introduced and used in the identification task to help choose the $k$ parameter with a reliable method.


However, until now, the performance of the classification has been "behind" that of the identification, specifically for the HOG representation. In this section, we will try to improve the performance of the HOG-based and PCA-based classification using the *gridsearch*. 
We will also try to better understand the classification results, specifically those of the HOG classifier, using two template images that we create on purpose.

> This part has been mostly done using `color = False` and `sq_size = 64`. There can be differences in the results if those parameters are changed.

####HOG Classification - Optimization
The results obtained above in [this cell](https://colab.research.google.com/drive/1OYq1-SZZURJ5uujmf3PTqdEx3SDGAqZQ#scrollTo=mkYFiPjIKHan&line=2&uniqifier=1) are not bad for a first attempt, but it is worth assessing whether we can do better!

> better: obtain better **accuracy** results on the test set. Of course, we don't want to tailor the parameters **for** the test set images. 

In order to optimize the results of the classification, we will first implement a **systematic** way of searching for optimal parameters on the training set, using [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):
- intrinsic cross-validation (the training set is successively split into different subset to allow cross validation)
- automatic parameter testing, according to the given sequence of possibilities to test. 

This is therefore much less of a "manual" process: the system automatically goes through all possible parameter combinations, and computes the metrics of the tested models on the cross-validation splits.
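To give a feel for what "all possible parameter combinations" means, the sketch below counts the candidates of a grid with the same shape as the HOG grid defined in the next cell. The classifiers are represented by name only, and `color` is assumed to be `False`.

```python
from itertools import product

# same shape as the HOG parameter grid of this section (classifiers by name only)
grid_shape = {
    "hogify__orientations": [9],
    "hogify__cells_per_block": [(2, 2), (3, 3)],
    "hogify__pixels_per_cell": [(4, 4), (8, 8), (16, 16)],
    "hogify__block_norm": ["L2-Hys", "L1", "L2"],
    "hogify__transform_sqrt": [False, True],
    "hogify__multichannel": [False],
    "classify": ["SGD", "SVC(C=0.1)", "SVC(C=1)"],
}

candidates = list(product(*grid_shape.values()))
n_folds = 4
total_fits = len(candidates) * n_folds

print(len(candidates))  # 108 parameter combinations
print(total_fits)       # 432 fits during cross validation
```

With the default `refit=True`, `GridSearchCV` additionally fits the best combination once more on the whole training set. The cost grows multiplicatively with every parameter added to the grid, which is worth keeping in mind before extending it.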


To perform this *gridsearch*, we reimplement the same steps as before. We speed up the process a little by retrieving the backup variables we set in the previous steps.

In [0]:
'''
Make sure of the inputs
'''
X_train = X_train_HOG_shuffled_back.copy()
y_train = y_train_HOG_shuffled_back.copy()

X_test_ab = X_test_HOG_ab_back.copy()
y_test_ab = y_test_HOG_ab_back.copy()

X_test_cd = X_test_HOG_cd_back.copy()
y_test_cd = y_test_HOG_cd_back.copy()



'''
Definition of a pipeline
'''
HOG_pipeline = Pipeline([
                         ('hogify', HogTransformer(
                             orientations = 9,
                             pixels_per_cell = (8,8),
                             cells_per_block = (2,2),
                             block_norm='L2',
                             transform_sqrt=True, 
                             multichannel = color)
                         ),
                         ('classify', SGDClassifier(
                             random_state = 42, 
                             max_iter = 1000, 
                             tol=1e-3)
                         )])


'''
Definition of the parameters grid
'''

param_grid_HOG = [
    {
        'hogify__orientations': [9],
        'hogify__cells_per_block': [(2, 2),(3, 3)],
        'hogify__pixels_per_cell': [(4,4),(8, 8),(16, 16)],
        'hogify__block_norm': ['L2-Hys','L1', 'L2'],
        'hogify__transform_sqrt': [False, True],
        'hogify__multichannel':[color],
        'classify': [
            SGDClassifier(random_state=42, max_iter=1000, tol=1e-5),
            svm.SVC(kernel='linear', C=0.1),
            svm.SVC(kernel='linear', C=1)]}
]

'''
Creation of the object GridSearch
'''
grid_search_HOG = GridSearchCV( HOG_pipeline,
                                param_grid_HOG, 
                                cv = 4, 
                                n_jobs = -1,
                                scoring = "accuracy",
                                verbose = 1, 
                                return_train_score = True)

'''
Train the model, trying out the combination
'''
grid_res_HOG = grid_search_HOG.fit(X_train, y_train)

# print(grid_res_HOG.best_estimator_)
print("\n"*3)
print("=="*40)
print("Best accuracy score (cross validation): " + str(100*grid_res_HOG.best_score_) + " %")
print("Summary of the search best parameters:")
print("orientations = ", grid_res_HOG.best_params_['hogify__orientations'])
print("cells_per_block = ", grid_res_HOG.best_params_['hogify__cells_per_block'])
print("pixels_per_cell = ", grid_res_HOG.best_params_['hogify__pixels_per_cell'])
print("block_norm = ", grid_res_HOG.best_params_['hogify__block_norm'])
print("transform_sqrt = ", grid_res_HOG.best_params_['hogify__transform_sqrt'])
print("classifier = ", grid_res_HOG.best_params_['classify'])
print("=="*40)


The best parameters found by our *gridsearch* lead to a major change in:
- `pixels_per_cell`: leading to a coarser cell definition. Spatially, it leads to a wider low-pass filtering. This was described in the first part of this notebook,
- `transform_sqrt`: which, according to the authors of the base paper, needs to be tried out experimentally to "decide", 
- `block_norm`: which, similarly, needs to be tested on a cross-validation set before being adopted.

The best classifier, among the ones tested, remains the SGD classifier already used in previous sections.
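The effect of `pixels_per_cell` on the "granularity" can be made tangible by counting the number of values in the HOG descriptor. The sketch below assumes square images and blocks that slide one cell at a time, as `skimage.feature.hog` does; whether the notebook's `HogTransformer` follows exactly this layout is an assumption, and `hog_length` is a made-up helper.

```python
def hog_length(image_size, pixels_per_cell, cells_per_block, orientations=9):
    """Descriptor length for a square image, with blocks sliding one cell at a time."""
    n_cells = image_size // pixels_per_cell       # cells along one side
    n_blocks = n_cells - cells_per_block + 1      # block positions along one side
    return n_blocks * n_blocks * cells_per_block * cells_per_block * orientations

lengths = {ppc: hog_length(64, ppc, cells_per_block=2) for ppc in (4, 8, 16)}
for ppc, length in lengths.items():
    print("pixels_per_cell =", (ppc, ppc), "->", length, "values")
# pixels_per_cell = (4, 4) -> 8100 values
# pixels_per_cell = (8, 8) -> 1764 values
# pixels_per_cell = (16, 16) -> 324 values
```

For a 64×64 image, going from 8×8 to 16×16 pixels per cell divides the descriptor length by more than five: each histogram then summarizes a larger area, hence the low-pass effect mentioned above.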

We can now simply use the *best* estimator found by the *gridsearch* to perform the prediction, and visualize the results (both the accuracy and the images themselves).

In [0]:
'''
predict AB
'''
best_prediction_ab = grid_res_HOG.predict(X_test_ab)
logging.info("Percentage correct    : " + str(100*np.sum(best_prediction_ab == y_test_ab)/len(y_test_ab)))
show_missed(X_test_ab, y_test_ab, best_prediction_ab)

my_plot_confusion_matrix(grid_res_HOG, X_test_ab, y_test_ab, ["Emma Stone", "Bradley Cooper"])


'''
predict CD
'''
best_prediction_cd = grid_res_HOG.predict(X_test_cd)
logging.info("Percentage correct    : " + str(100*np.sum(best_prediction_cd == y_test_cd)/len(y_test_cd)))
show_missed(X_test_cd, y_test_cd, best_prediction_cd)
my_plot_confusion_matrix(grid_res_HOG, X_test_cd, y_test_cd,["Jane Levy", "Marc Blucas"] ,["Emma Stone", "Bradley Cooper"])


After this optimization pass, we clearly see that the best classification result is 100% for personA and personB. This is much better than what we had originally (80%).
This confirms that, after optimization, we cannot (at least without deeper analysis) state that one feature representation is intrinsically better than the other. Most likely, it depends on other metrics as well.

Because of the new hyper-parameters, the decision boundary changes such that more personC and personD images are misclassified.

This is actually not surprising: the hyper-parameters are tuned on cross-validation sets, which contain only personA and personB images. There is no cross validation using personC or personD images, so their classification results are not expected to improve. 
In our specific case, we even see that the accuracy on C and D is basically that of a random classifier. Looking at the confusion matrix, it clearly appears that personC is the one (mainly) responsible for such a bad score. 
This behavior is exactly the one already discussed in the previous section, related to HOG locality and the *particular* descriptor of personA, with the oblique hair.

These optimization results also raise the question of how to tune the parameters with regard to the ultimate goal of the system being designed: how do we want the system to perform on a test set containing personC and personD images?

####PCA Classification - Optimization

While the results obtained earlier using the PCA feature representation are already very good, with 100% accuracy on the test sets of personA and personB, we can use the *gridsearch* technique to test different numbers of principal components, or even to try out different classifiers.


In [0]:
'''
Make sure of the inputs
'''
X_train_PCA = X_train_PCA_shuffle_back.copy()
y_train_PCA = y_train_PCA_shuffle_back.copy()
X_test_PCA_ab = X_test_PCA_ab_back.copy()
X_test_PCA_cd = X_test_PCA_cd_back.copy()
y_test_PCA_ab = y_test_PCA_ab_back.copy()
y_test_PCA_cd = y_test_PCA_cd_back.copy()

'''
Set up a grid of parameters to test
'''
param_grid_PCA = [
    {
        'pcaify__n_components': range(10,41,1),
        'classify': [
            SGDClassifier(random_state=42, max_iter=10000, tol=1e-5),
            svm.SVC(kernel='linear', C=1e-5, tol=1e-5 ),
            svm.SVC(kernel='linear', C=0.1, tol=1e-5 )
            ]}
]

'''
create the GridSearchCV object
'''
grid_search_PCA = GridSearchCV(PCA_pipeline,
                           param_grid_PCA, 
                           cv = 5, 
                           n_jobs = -1,
                           scoring = "accuracy",
                           verbose = 1, 
                           return_train_score = True)

grid_res_PCA = grid_search_PCA.fit(X_train_PCA, y_train_PCA)

# logging.info("Best Score            :" + str(grid_res_PCA.best_score_))
# logging.info("Best Parameters found :")
# logging.info(grid_res_PCA.best_params_)
print("\n"*3)
print("=="*40)
print("Best accuracy score (cross validation): " + str(100*grid_res_PCA.best_score_) + " %")
print("Summary of the search best parameters:")
print("n_components = ", grid_res_PCA.best_params_['pcaify__n_components'])
print("classifier = ", grid_res_PCA.best_params_['classify'])
print("=="*40)




Based on the 5-fold cross validation, the best accuracy score is 100% and, more interestingly, the number of components is now only 15 when using another classifier: a linear support vector machine with a very small regularization parameter (which actually leads to a really strong regularization, see [doc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)).
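The link between a small `C` and a strong regularization can be checked on toy data: the smaller `C`, the more the norm of the weight vector is penalized relative to the training errors. The data below are synthetic Gaussian blobs, not the notebook's PCA features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two synthetic classes of 20 samples each, in 5 dimensions
X_demo = np.vstack([rng.normal(-1.0, 1.0, size=(20, 5)),
                    rng.normal(1.0, 1.0, size=(20, 5))])
y_demo = np.array([0] * 20 + [1] * 20)

norms = {}
for C in (1e-5, 0.1, 1.0):
    clf = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(clf.coef_))   # norm of the weight vector w
    print("C =", C, "-> ||w|| =", norms[C])
```

A tiny `C` shrinks the weights towards zero, i.e. it yields a very wide margin and a heavily regularized decision function.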

We can now use the "best found" pipeline to predict the class of our persons.

In [0]:
'''
predict AB
'''
best_prediction_ab = grid_res_PCA.predict(X_test_PCA_ab)
logging.info("Percentage correct    : " + str(100*np.sum(best_prediction_ab == y_test_PCA_ab)/len(y_test_PCA_ab)))
show_missed(X_test_PCA_ab, y_test_PCA_ab, best_prediction_ab)
my_plot_confusion_matrix(grid_res_PCA, X_test_PCA_ab, y_test_PCA_ab,["Emma Stone", "Bradley Cooper"])

'''
predict CD
'''

best_prediction_cd = grid_res_PCA.predict(X_test_PCA_cd)
logging.info("Percentage correct    : " + str(100*np.sum(best_prediction_cd == y_test_PCA_cd)/len(y_test_PCA_cd)))
show_missed(X_test_PCA_cd, y_test_PCA_cd, best_prediction_cd)
my_plot_confusion_matrix(grid_res_PCA, X_test_PCA_cd, y_test_PCA_cd,["Jane Levy", "Marc Blucas"] ,["Emma Stone", "Bradley Cooper"])


After this optimization pass, we clearly see that the best classification result remains 100%, without surprise, for personA and personB. 
Unlike for the HOG-based classifier, the accuracy on personC and D hasn't changed and, what's more, the misclassified samples have not changed either. It tends to indicate that the eigenfaces beyond the 15th are not that useful for the classifier to differentiate the classes.

Similarly to what was said for the HOG classifier optimization, the fact that the accuracy does not improve for personC and personD is expected: the hyper-parameters are selected by cross validation, on folds containing only personA and personB images. When looking closer, it appears that, once again, it is the performance on personC that penalizes the overall accuracy on C and D: the classifier acts as a random classifier for personC. Here too, we already discussed this behavior at length in the previous section.


With this optimized pipeline, we reach the same accuracy on the personA and personB test set with only 15 components instead of 35. This is a nice result, as it means less computation in the end.


####Understanding our classifiers

Without going much deeper into the analysis, it may be useful to illustrate once more the differences between the classifiers (HOG-based vs PCA-based).
- HOG: an image is much more easily classified as personA if its HOG representation contains the oblique line characteristic of personA's haircut. This indicates at least two things:
    * the sensitivity of the classifier to "edge details" of the image. In this case, it is the haircut. It could also be, for instance, a shirt collar for a man, a beard, ... Such information is a visual characteristic and is deeply embedded in the HOG. It is sometimes useful, but it may also not be highly reliable, as we will see in the example below: a simple image with an oblique line.
    * the lack of variability of the training set images. That is, recognizing Emma Stone is almost reduced to recognizing this oblique line. Similarly, classifying as Bradley Cooper is reduced to recognizing a vertical line on the right side, due to the shape of his hair, forehead and face. 

To illustrate this, let's just create two images, using the function created in the first part of this notebook. 

In [0]:
'''
Construction of a new set, with two template images
'''

personA_template = create_image(64, 64, special="personA")
personB_template = create_image(64, 64, special="personB")

new_set={}
new_set[personA]=[personA_template]
new_set[personB]=[personB_template]
new_set[personC]=[]
new_set[personD]=[]


''' 
pre-processing
'''
new_X_HOG = get_matrix_from_set(new_set, color, sq_size, flatten=False)

'''
visualization
'''
plot_matrix(new_X_HOG, color, my_color_map, h=2, w=2)

'''
Prediction using best gridsearch output
'''
res_HOG = grid_res_HOG.predict(new_X_HOG)

logging.info("First  image labeled as class: " + str(res_HOG[0]).split(".")[0])
logging.info("Second image labeled as class: " + str(res_HOG[1]).split(".")[0])

We see that -- as expected -- the first image with the oblique line is classified as "0", Emma Stone, and the second is classified as "1", Bradley Cooper.
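Why an oblique line is so recognizable for a HOG-based classifier can be sketched with plain NumPy: the magnitude-weighted histogram of gradient orientations of an oblique edge peaks in a different bin than that of a vertical edge. This is a simplified, single-cell stand-in (no cells, blocks or normalization); it uses neither the notebook's `HogTransformer` nor `create_image`, and the two toy images are made up.

```python
import numpy as np

def orientation_histogram(img, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    gy, gx = np.gradient(img.astype(float))      # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    return hist

size = 64
rows, cols = np.indices((size, size))
oblique_edge = (cols > rows).astype(float)         # bright above the main diagonal
vertical_edge = (cols >= size // 2).astype(float)  # bright right half

oblique_bin = int(np.argmax(orientation_histogram(oblique_edge)))
vertical_bin = int(np.argmax(orientation_histogram(vertical_edge)))
print("dominant bin, oblique edge :", oblique_bin)   # 6, i.e. 120-140 degrees
print("dominant bin, vertical edge:", vertical_bin)  # 0, i.e. 0-20 degrees
```

A linear classifier trained on such histograms only needs to put weight on the right bins to separate the two templates, which is consistent with the behaviour observed above.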



- PCA: To better understand the PCA, we need to look at the eigenfaces and at the variance that each of them can explain. In the classification task, we covered in depth the rotation of an image of personC. The variance introduced by the rotation could not be well explained, leading the classifier to assign the wrong class. We saw that manually reducing this variance (= rotating the image) led to a correct (= expected) result from the classifier. As already discussed, rotation isn't the only thing affecting the results of the PCA-based classifier (background, lighting, ...).

For the sake of completeness, it is interesting to show the PCA-based classification results on the two *template* images created above for the HOG feature. 


In [0]:
'''
pre-processing
'''
new_X_PCA = get_matrix_from_set(new_set, color, sq_size, flatten=True)

'''
Visualization
'''
plot_matrix(new_X_PCA, color, my_color_map, h=2, w=2)

'''
Prediction
'''
res_PCA = grid_res_PCA.predict(new_X_PCA)

logging.info("First  image labeled as class: " + str(res_PCA[0]).split(".")[0])
logging.info("Second image labeled as class: " + str(res_PCA[1]).split(".")[0])

Clearly, this shows that the two classifiers don't work the same way. Classifying from a linear combination of the eigenfaces, the PCA-based classifier gives the **same** class to both "template" images, although they are visually really different.

While the PCA-based and HOG-based classifiers don't work the same way, misclassifications may be easier to interpret (at least in my opinion, based mostly on this project) with the HOG feature representation, which is essentially based on the edges in an image.



####Other optimizations

The results in classification and identification are not that bad, with 100% on the personA and personB test sets for all runs once optimized. However, we may want to make our predictions more robust, and include much more data. 

It is possible to "artificially" generate new data without too much effort. Indeed, we saw an example of a rotation (when trying to better understand the PCA-based classification). We could populate our current data set simply by generating new images:
- rotating every image several times, with different angles,
    * using different rotation centers
    * filling the uncovered parts of the rotated images with (background) pixels of different colors (color levels)
- translating every image by several pixels, in different directions
- adding some energy to the image, particularly for PCA, which is not robust to lighting
- hiding some parts of the images

This quickly increases the amount of available data and makes the classification results more robust (even if the results are already very good). One should however make sure not to overfit when training the classifier.
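The augmentations listed above can be sketched with NumPy only. This is a minimal illustration, assuming 2-D grayscale images with values in [0, 255]; the helpers `rotate_nearest`, `translate`, `brighten` and `occlude` are made up for this example and are not part of the notebook. The rotation uses a simple nearest-neighbour lookup around the image center, with a configurable fill value.

```python
import numpy as np

def rotate_nearest(img, angle_deg, fill=0.0):
    """Rotate a 2-D image about its centre (nearest neighbour), padding with `fill`."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.indices((h, w))
    # inverse mapping: for every output pixel, find the source pixel it comes from
    x_src = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    y_src = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi, yi = np.rint(x_src).astype(int), np.rint(y_src).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.full(img.shape, fill, dtype=float)
    out[valid] = img[yi[valid], xi[valid]]
    return out

def translate(img, dy, dx, fill=0.0):
    """Shift an image by (dy, dx) pixels, padding the uncovered border with `fill`."""
    h, w = img.shape
    out = np.full(img.shape, fill, dtype=float)
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def brighten(img, delta):
    """Add a constant to every pixel ("energy"), keeping values in [0, 255]."""
    return np.clip(img + delta, 0.0, 255.0)

def occlude(img, top, left, size, fill=0.0):
    """Hide a square patch of the image."""
    out = img.astype(float).copy()
    out[top:top + size, left:left + size] = fill
    return out

# toy "face": a 64x64 gradient pattern standing in for a real image
toy_face = (np.arange(64 * 64).reshape(64, 64) % 256).astype(float)

augmented = [rotate_nearest(toy_face, angle, fill=128.0) for angle in (-10, 10)]
augmented += [translate(toy_face, 3, -2), brighten(toy_face, 30),
              occlude(toy_face, 20, 20, 16)]
print(len(augmented), "new images generated from one original")
```

Each augmented image keeps the label of its original. To avoid an over-optimistic validation score, the augmented copies of an image should stay in the same cross-validation fold as the original.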

Last point: we explained previously why we did not scale the data, that is, why we did not normalize the variance of the features. In the context of getting the very best out of our data, we should repeat the experiment with scaling enabled, and check the resulting behavior.
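For reference, here is what two common kinds of scaling do to a feature matrix, sketched with NumPy on random stand-in data. Which variant (if any) suits the notebook's pipelines is left open; `sklearn` offers equivalents (`StandardScaler`, `MinMaxScaler`) that could be added as a pipeline step.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 255, size=(40, 6))   # stand-in for 40 flattened images

# standardization: every feature gets zero mean and unit variance
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
std[std == 0] = 1.0                          # guard against constant features
X_standardized = (X_demo - mean) / std

# min-max scaling: every feature is mapped to the range [0, 1]
# (assumes no feature is constant in this toy data)
x_min, x_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_minmax = (X_demo - x_min) / (x_max - x_min)

print(X_standardized.mean(axis=0).round(6), X_standardized.std(axis=0).round(6))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

In both cases the statistics (`mean`, `std`, `x_min`, `x_max`) should be computed on the training set only and reused as-is on the test images.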

##Discussion

Congrats! You are at the end of this tutorial, and if you've reached this step, you most likely understand the main ideas behind HOG, PCA, classification and identification.

A lot has already been said overall in this tutorial - we'll try to wrap this up.

###Summary of the activities
This tutorial was quite long, and it's even possible we forgot what we did... So let's refresh our memory!

1. we retrieved the data, carefully and really randomly. This leads to a training set of faces from personA (20) and personB (20), and a test set from personA (10) and personB (10), but also personC (10) and personD (10).

2. we spent quite some time on feature representation constructions:
    * HOG, where we detailed precisely how to compute the gradients and the histogram (we even built several toy images to understand the very essence of the histogram). We learnt what a cell is, what a block is, how to compute the normalization, ... and the influence of all these (hyper-)parameters. A homemade class was coded to reproduce the library results.
    * PCA, where we also went through the maths and saw three different ways to compute the principal components. We also compared results on examples with library results, and we detailed the effect of the number of components chosen in terms of error. In particular, we spent some time on analyzing the **explained variance**, **cumulative energy**, and **optimal number of principal components**. 

To give a bit of insight, we represented those quite high-dimensional features in 2D space using the t-SNE tool. 

All of that was done first, with the ultimate goal of building systems for classification, and for identification.

3. we built a classification system, first based on the HOG feature representation, then on the PCA feature representation. 
    - we progressively learnt:
        * the required pre-processing steps, including the building of the features,
        * the pipeline architecture, using Transformers
        * how to deal with python libraries, and specifically `sklearn` functions
    - the results were analyzed in depth:
        * HOG: we dug into the HOG representation of misclassified images
        * PCA: we succeeded in modifying (according to a well-established plan :-) ) an input image such that it could pass the classification tests.
    
In particular, we saw that it appears easier to interpret the results from HOG classification than from PCA classification. Also, without any optimization per se, the PCA-based classifier reached higher accuracy scores than the HOG-based classifier on the test set.

4. we built an identification system, first based on the HOG feature representation, then on the PCA feature representation. 
    - before this, we detailed the formalism to study the metric: the Euclidean distance between the representations
        - comparing different distances (Manhattan, cosine, ...), we saw that the Euclidean distance gave appropriate results
    - we showed, using a colorful matrix representation, that indeed, from a macroscopic perspective, personA test images were closer to personA training images, and likewise for personB. However, we saw funny things for personC and personD. 
        - with regard to the classification results obtained before, the computed distances confirmed what we had: personC is very far from personA in terms of HOG feature, while it's more "fuzzy" for the PCA feature. This is a very nice take-home message: not all representations lead to the same results, so there really is "engineering" behind such systems.
    - using a k-NN classifier, we assigned a label to personA, personB, personC and personD test images, and discussed the results. 
        * To assign this label, we needed to define the k parameter: the number of neighbors that take part in the labeling decision. We decided not to go for a fancy weighted model, but rather to use 
            * a gridsearch technique, which is already a form of optimization using cross validation
            * an intuitive technique, where the distances are shown in colors and, using a logarithmic scale, only a few neighbors seem of interest to label the image
        * We confirmed the classification results in the sense that it's often "easier" for personA and personB than for personC and personD.
    - for both representations, we showed the closest and furthest nearest neighbors of our dataset (between test image and training image)
        * This system/model can of course show the three closest neighbors, the three furthest, ... many improvements and extra work are suggested.
    
5. Beyond the optimization and improvement techniques used along the way, we presented Grid Search in a little more depth, and improved the original results obtained, in particular:
    * in terms of accuracy for the classification using HOG feature, 
    * in terms of number of useful components using PCA feature

We then showed the essential difference between the two classifiers through their results on two template images created specifically. Finally, we suggested different ways of populating our training set in order to make the classification and identification results more robust, without requiring the download of other faces.




###Main differences between PCA and HOG

In this section, we come back to a few key differences between both features we used in our tutorial

- HOG is based on gradients, their magnitude and orientation. Intuitively, it corresponds to the edges (cfr Obama pictures). It's therefore heavily impacted by rotation, translation, ... which makes the interpretation easy when looking at the visual representation of the HOG (Without this image of "bars", it's not always easy: see for instance personC classification results). Many parameters allow to tune the robustness to noise, to lighting condition, or the "granularity" of the representation. 

- PCA is based on finding the intrinsic directions of maximal variance in the images. It is a very well-known and useful technique for dimensionality reduction, keeping only the $p$ most important components. It can also be used for denoising. Discarding low-variance directions is however undesirable when the information needed to differentiate between classes carries little variance: reducing the dimensionality may then decrease the performance. 
The PCA technique is sensitive to lighting conditions, rotation, scale, background, ...
While the eigenfaces are interpretable in qualitative terms ("this eigenface tends to emphasize the contrast between this and this"), it is (as far as I'm concerned) less intuitive to fully interpret the results without analyzing the numbers more deeply.
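As a reminder of how the $p$ most important components can be selected, here is a minimal sketch of PCA through an SVD, with $p$ chosen from the cumulative explained variance. The data are a synthetic stand-in for flattened images, and `pca_fit`, `components_for` and the 95% target are assumptions made for illustration.

```python
import numpy as np

def pca_fit(X):
    """Return the mean, principal directions (rows) and explained-variance ratios."""
    mean = X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    var = s ** 2 / (len(X) - 1)
    return mean, Vt, var / var.sum()

def components_for(ratio, target=0.95):
    """Smallest p whose cumulative explained variance reaches `target`."""
    p = int(np.searchsorted(np.cumsum(ratio), target) + 1)
    return min(p, len(ratio))

# Toy data (assumption): 50 samples of 10 "pixels" driven by 2 latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(50, 10))

mean, components, ratio = pca_fit(X)
p = components_for(ratio, target=0.95)

Z = (X - mean) @ components[:p].T      # projection on the first p components
X_rec = Z @ components[:p] + mean      # reconstruction from p components only
print(p, np.round(ratio[:3], 4))
```

The reconstruction `X_rec` drops everything outside the first $p$ directions, which is what gives both the denoising effect and the risk of losing low-variance but discriminative information.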

In terms of accuracy (all results given after optimization), with `sq_size = 64` and `color = False`:
- A,B:     
    * HOG: 
        - classification: 100%
        - identification: 100%
    * PCA:
        - classification: 100%
        - identification: 100% 
- C,D: 
    * HOG: 
        - classification: 50%
        - identification: 70%
    * PCA:
        - classification: 65%
        - identification: 65% 


During all our classification/identification tasks, we observed, and explained why, the accuracy was worse for personC and personD than for personA and personB. This of course matches the intuition: even if the faces are visually similar for a human with respect to some physical aspects, those aspects may not be the ones captured by the features and the classification/identification systems. In particular, we saw that the HOG transform would consider Marc Blucas really similar to Bradley Cooper... as well as Jane Levy! At least, in comparison with Emma Stone, who has a remarkable oblique haircut.

To go even further, we could definitely follow the road of the "other optimizations" suggested (see [other optimizations](https://colab.research.google.com/drive/1OYq1-SZZURJ5uujmf3PTqdEx3SDGAqZQ#scrollTo=k-fU8twAA2aM&line=11&uniqifier=1)).

One lesson seems to be that HOG is particularly well suited to recognizing a specific pattern in an image, just like an obvious oblique haircut. Parameters of course allow deviating from that pattern, but in essence, that's what it is. PCA, on the other hand, may better grasp overall information from the data themselves. Depending on the goal of the application, the two features could lead to different results, or make the desired performance more or less easy to reach.
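The sensitivity of HOG to a dominant oblique pattern, and to its rotation, can be illustrated with a heavily simplified sketch: a single orientation histogram over the whole image, weighted by gradient magnitude. This is not a full HOG (no cells, no blocks, no block normalization), and `orientation_histogram` is a hypothetical helper written for this illustration.

```python
import numpy as np

def orientation_histogram(img, n_bins=9):
    """Histogram of unsigned gradient orientations (0-180 degrees), weighted by magnitude."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, 180.0), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy image (assumption): a single oblique edge, bright above the main diagonal.
n = 16
rows, cols = np.indices((n, n))
oblique = (cols > rows).astype(float)

h_original = orientation_histogram(oblique)
h_rotated = orientation_histogram(np.rot90(oblique))   # same edge, rotated by 90 degrees

print(np.argmax(h_original), np.argmax(h_rotated))
```

With 20-degree bins, the mass of the histogram moves from the bin containing 135 degrees to the bin containing 45 degrees once the image is rotated: the descriptor changes completely although the "content" is the same edge.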

This leads to the question: for a real system, what lesson should be drawn from such results?


### Classification or Identification

As is, there is no clear answer :-)
In the context of this tutorial, we reached pretty good results with both systems. 

Conceptually, an identification system does not really learn anything: it "simply" computes a **relevant** metric between the features, and gives the most appropriate label based on that. On the contrary, a classifier tries to find a boundary, with a "clear" separation between the classes. Both systems can achieve a lot, but the challenges are different:
- Identification is hard if the number of pairwise computations is immense.
- For identification, the metric to use as similarity measure may not be easy to find, specifically in high dimension (see the curse of dimensionality).
- For classification, there is moreover the challenge of finding the appropriate classifier and its adapted hyper-parameters.

What we saw in particular throughout this tutorial is that results on all systems differ because of intrinsic characteristics (biases) of the methods. A perfect example is identification with $k=1$: a single very close training image is enough for a test image to be properly labeled. However, this makes the result sensitive to "noise" such as the background or viewpoint (see [HOG](https://colab.research.google.com/drive/1OYq1-SZZURJ5uujmf3PTqdEx3SDGAqZQ#scrollTo=1esG7zSkqgX4&line=1&uniqifier=1) or [PCA](https://colab.research.google.com/drive/1OYq1-SZZURJ5uujmf3PTqdEx3SDGAqZQ#scrollTo=zMfCvQRhRx6g&line=3&uniqifier=1)).
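The sensitivity of $k=1$ to a single "noisy" training image can be reproduced on a toy example. In the sketch below, the 2D points stand in for feature vectors, one training image of class 1 happens to lie in the middle of class 0 (for instance because of a similar background), and `knn_label` is a hypothetical helper, not the function used earlier in the notebook.

```python
import numpy as np

def knn_label(train_X, train_y, query, k):
    """Majority label among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return int(labels[np.argmax(counts)])

# Toy features (assumption): two well-separated classes plus one outlier of class 1.
train_X = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2],   # class 0
    [3.0, 3.0], [3.2, 3.0], [3.0, 3.2],   # class 1
    [0.5, 0.5],                           # class 1, but lying among class 0
])
train_y = np.array([0, 0, 0, 1, 1, 1, 1])

query = np.array([0.45, 0.45])            # a class 0 test point, close to the outlier

print(knn_label(train_X, train_y, query, k=1))  # the single closest image decides
print(knn_label(train_X, train_y, query, k=3))  # the outlier is outvoted
```

With $k=1$ the outlier alone determines the label, while with $k=3$ it is outvoted by two class 0 neighbors.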

### Authentication system

Let's imagine a company wants to purchase our face detection system to perform authentication on its employees... What would our advice be?

The questions raised by an authentication system are much more complicated than they may seem. This answer focuses on the computer vision aspects.

First, obviously, it is currently very easy to fool the different systems (see the examples with the template images). Along the same lines, we must prevent someone from simply printing out a 1:1 scale picture of one of the employees to get access somewhere. A solution is to add an extra system verifying that there is movement, or some distance measurement ensuring that it is not a flat picture, ... This is outside the scope of the current discussion. 

Regarding the authentication system, the first thing to understand is the need, in particular in terms of penalties and scores.
> What is worse: false positives (authenticating X as Y, giving X the access rights of Y), or false negatives (not authenticating X as X)?

We saw that both systems (classification and identification) can reach an accuracy of 100% quite "easily", while the results on "unknown persons" (personC and personD) differ. My succinct advice would then be:
- make sure to have a large enough training set for each of the employees
- use a fairly high threshold on the system's confidence in its output
- use a combination of both feature representations
    * such a combination was not used in this notebook


##### Classification or Identification?
We can imagine a k-NN identification, as currently implemented, with a (low) threshold on the computed distance, so that only a face that is really close to another gets labelled. The "drawback" of such a system is that the training set needs to be large enough to decrease the number of false negatives: there needs to be a training image "really close" to the test image.
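A minimal sketch of such a thresholded identification is given below. It assumes a nearest-neighbor search with Euclidean distance on toy 2D features; the function `identify`, the threshold value `max_dist` and the convention of returning `None` for "unknown" are all illustrative assumptions.

```python
import numpy as np

def identify(train_X, train_labels, query, max_dist):
    """Return (label, distance) of the nearest training sample,
    or (None, distance) when even the nearest one is farther than `max_dist`."""
    dists = np.linalg.norm(train_X - query, axis=1)
    i = int(np.argmin(dists))
    if dists[i] > max_dist:
        return None, float(dists[i])       # reject: nobody is "really close"
    return train_labels[i], float(dists[i])

# Toy features (assumption): two images of personA, one of personB.
train_X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
train_labels = ["personA", "personA", "personB"]

known = identify(train_X, train_labels, np.array([0.1, 0.1]), max_dist=0.5)
unknown = identify(train_X, train_labels, np.array([2.5, 2.5]), max_dist=0.5)
print(known)
print(unknown)
```

The second query is closest to a personA image but is rejected because the distance exceeds the threshold. A genuine personA image taken under an unseen pose would be rejected in the same way, which is why a sparse training set hurts such a system.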

A priori, good results are also obtained in this prototype with the classifiers. This is most likely what we wish for:
- high accuracy on personA and personB, 
- "low accuracy" (= close to random) on personC and personD.
    * this is actually not entirely true if we look at the results on C and D separately. 

Definitely, for both classifiers and identification systems, the cases of personC and personD need to be carefully thought through, and it seems that a threshold on the system confidence score needs to be established. Again, this must be weighed against the false positive / false negative penalty scores.
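To illustrate how the threshold interacts with the penalties, here is a small sketch computing the false acceptance and false rejection rates for a few thresholds, then picking the threshold that minimizes a weighted cost. The confidence scores and the cost values are made up for the example, and `error_rates` and `best_threshold` are hypothetical helpers.

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """False acceptance rate and false rejection rate when accepting scores >= threshold."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    far = float(np.mean(impostor >= threshold))   # impostors wrongly accepted
    frr = float(np.mean(genuine < threshold))     # genuine users wrongly rejected
    return far, frr

def best_threshold(genuine_scores, impostor_scores, thresholds, cost_fp, cost_fn):
    """Threshold minimizing cost_fp * FAR + cost_fn * FRR over the candidates."""
    def cost(t):
        far, frr = error_rates(genuine_scores, impostor_scores, t)
        return cost_fp * far + cost_fn * frr
    return min(thresholds, key=cost)

# Made-up confidence scores (assumption), higher means "more confident".
genuine = [0.95, 0.90, 0.80, 0.70, 0.60]
impostor = [0.75, 0.50, 0.40, 0.30, 0.20]
thresholds = [0.5, 0.65, 0.8]

strict = best_threshold(genuine, impostor, thresholds, cost_fp=10, cost_fn=1)
lenient = best_threshold(genuine, impostor, thresholds, cost_fp=1, cost_fn=10)
print(strict, lenient)
```

When false positives are ten times more costly, the highest threshold is selected; when false negatives dominate, the lowest one is. The threshold therefore cannot be fixed before the penalties are known.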

Another possibility to try out would be to organize a vote between several classifiers/identification systems. Depending on the metrics to reach, this could lead to a very **robust** and reliable system!
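Such a vote was not implemented in this notebook. A minimal sketch could look as follows, assuming each system outputs one label per test image; `majority_vote` and its `min_agreement` parameter are hypothetical.

```python
from collections import Counter

def majority_vote(predictions, min_agreement=0.5):
    """Return the most frequent label if strictly more than `min_agreement`
    of the systems agree on it, otherwise None (no decision)."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) > min_agreement else None

# Hypothetical outputs of three systems (e.g. HOG classifier, PCA classifier, k-NN).
agree = majority_vote(["personA", "personA", "personB"])
split = majority_vote(["personA", "personB", "personC"])
strict = majority_vote(["personA", "personA", "personB"], min_agreement=0.9)
print(agree, split, strict)
```

Raising `min_agreement` trades false positives for false negatives, in the same way as a confidence threshold on a single system.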

##### Dataset sizing
Currently, the different sets are of course too small to guarantee any production-grade performance: overall, this tends to lower the confidence in the results. In all cases, increasing the training sets (and test sets) would allow:
- performing extensive and reliable cross-validation of the model parameters, 
- covering more poses, viewpoints, scales, light conditions, ... in order to avoid misclassification/misidentification, 
- potentially raising the confidence threshold even further.

This holds provided that some regularization mechanisms are implemented as well; otherwise, accuracy may drop in real life.


##### Which feature to rely on?
*Simpler* feature representations such as HOG already give very good results, provided they are fine-tuned. Besides, depending on the environment (light conditions, pose, ...), some features may be more robust.
- Can there be an LED indicating where the employee's gaze should point?
- Can they keep their glasses on or not?
    * is it supposed to work with / without glasses?
- Is the lighting purely artificial (and controlled), or is the system to be installed in a hall where natural light is abundant, leading to changing light conditions?

All those questions may lead to the use of different features: HOG is more robust to lighting if proper normalization is done, as already discussed.


---

Thanks!

Geoffroy Herbin, R0426473
