# Exploring the support vector machine classifier

### Importing dependencies and reading data

In [1]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity
import numpy as np
import random
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.svm import SVC 
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve

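# Load the FaceNet embeddings together with their metadata columns
df = pd.read_csv('./dataset/data.csv')
# Features: every column except the label and the metadata columns
data = df.drop(['person_ID', 'frame', 'stream', 'sequance'], axis=1)
# Labels: the identity of the person in each frame
target = df["person_ID"]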



### Preparing the data

All experiments in this section use the ChokePoint dataset. The data was first processed by the FaceNet neural network to produce a data frame containing the embedding of each face alongside identifying information: `person_ID`, `frame`, `stream` and `sequance`.

We use the verification protocol suggested by the authors of the dataset, in which the data is divided into two groups, G1 and G2. Each group serves in turn as the training set and as the evaluation set. For more information, please refer to http://arma.sourceforge.net/chokepoint/

We use case study 1, which is concerned with:
1. indoor scenes only
2. short time intervals

In [2]:
G1_streams = ["P1E_S1_C1", "P1E_S2_C2", "P1L_S1_C1", "P1L_S2_C2"]
G1_sequence = ["P1E_S1", "P1E_S2","P1L_S1", "P1L_S2"]
G2_streams = ["P1E_S3_C3", "P1L_S3_C3", "P1E_S4_C1", "P1L_S4_C1"]
G2_sequence = ["P1E_S3", "P1L_S3","P1E_S4", "P1L_S4"]

In [3]:
# Rows of each group, selected by camera stream
G1_data = df[df["stream"].isin(G1_streams)]
G1_indices = G1_data.index
# One array of row indices per sequence, restricted to the group's streams
G1_sequance_ind = [G1_data.index[G1_data["sequance"] == sequence]
                   for sequence in G1_sequence]
G2_data = df[df["stream"].isin(G2_streams)]
G2_indices = G2_data.index
G2_sequance_ind = [G2_data.index[G2_data["sequance"] == sequence]
                   for sequence in G2_sequence]
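
One detail worth keeping in mind: scikit-learn interprets the index arrays of a custom `cv` list as positional row numbers, while the cell above collects index labels. The two coincide here only because `pd.read_csv` without `index_col` produces a default 0..n-1 index. The following is a minimal sketch of how positional indices could be built explicitly; the toy array and the names `sequence_col`, `g1_sequences` and `positions_per_sequence` are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np

# Hypothetical toy column standing in for df["sequance"]
sequence_col = np.array(["P1E_S1", "P1E_S3", "P1E_S1", "P1L_S3", "P1E_S2"])
# Hypothetical subset of the G1 sequence names
g1_sequences = ["P1E_S1", "P1E_S2"]

# np.flatnonzero returns the positions where the mask is True,
# which is the form of index a custom cv split expects
positions_per_sequence = [np.flatnonzero(sequence_col == seq)
                          for seq in g1_sequences]

for seq, pos in zip(g1_sequences, positions_per_sequence):
    print(seq, pos.tolist())
```

If the data frame were ever filtered or re-indexed before this step, building positions this way would avoid a silent mismatch between labels and row numbers.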

A list of index pairs is prepared for cross-validation. It contains 32 (train, test) pairs: each sequence of G1 is used for training and tested against each sequence of G2 (16 pairs), and vice versa (16 pairs).

In [4]:
cv_G1 = [(G1_S, G2_S) for G1_S in G1_sequance_ind for G2_S in G2_sequance_ind]
cv_G2 = [(G2_S, G1_S) for G1_S in G1_sequance_ind for G2_S in G2_sequance_ind]
cv = cv_G1 + cv_G2
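
As a sanity check on the pairing logic, the sketch below rebuilds the same structure on toy data and verifies the number of splits and that no split shares rows between training and testing. The index lists and the names `g1_seq_idx`, `g2_seq_idx` and `toy_cv` are hypothetical and only illustrate the shape of the real `cv` list.

```python
from itertools import product

# Hypothetical toy index lists: four sequences per group, two rows each
g1_seq_idx = [[0, 1], [2, 3], [4, 5], [6, 7]]
g2_seq_idx = [[8, 9], [10, 11], [12, 13], [14, 15]]

# Train on one G1 sequence, test on one G2 sequence (4 x 4 = 16 pairs)
cv_g1 = [(g1, g2) for g1, g2 in product(g1_seq_idx, g2_seq_idx)]
# Swap the roles: train on G2, test on G1 (another 16 pairs)
cv_g2 = [(g2, g1) for g1, g2 in product(g1_seq_idx, g2_seq_idx)]
toy_cv = cv_g1 + cv_g2

# Every split should keep its training and test rows disjoint
all_disjoint = all(set(train).isdisjoint(test) for train, test in toy_cv)
print(len(toy_cv), all_disjoint)
```

The same two checks (`len(cv)` and disjointness of each pair) can be run on the real list before it is passed to the grid search.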

### Test 1: exploring the accuracy of the classifier when trained on all classes

In this test the classifier is tuned according to the protocol discussed above to find the best accuracy it can reach.

To better suit the data to our application, the classifier is trained on data from only one sequence at a time rather than on the entire group.

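The C parameter takes 10 linearly spaced values from 0.05 to 5. The gamma parameter takes 10 values of the form 10^x, with x linearly spaced from 0.01 down to 0, so gamma only ranges from about 1.023 to 1. Every combination is tested with three kernels: rbf, poly and sigmoid.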

In [5]:
param_grid = {'C': np.linspace(0.05, 5, 10), 
              'gamma': 10 ** np.linspace(0.01, 0, 10),
              'kernel': ['rbf', 'poly', 'sigmoid']}
grid_g = GridSearchCV(SVC(),param_grid,refit=False, n_jobs=16, cv=cv)
grid_g.fit(data, target)

scores_g = grid_g.cv_results_.get('mean_test_score').tolist()

print(grid_g.best_params_)
print('accuracy =', grid_g.best_score_)

{'C': 5.0, 'gamma': 1.0102862550356189, 'kernel': 'rbf'}
accuracy = 0.9879507552795811
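
To get a feel for the size of this search, the sketch below rebuilds the parameter values with NumPy alone and counts the candidates and the total number of model fits. The split count of 32 comes from the cross-validation list described earlier; the variable names are hypothetical.

```python
import numpy as np

# Same value ranges as the grid in the cell above
c_values = np.linspace(0.05, 5, 10)
gamma_values = 10 ** np.linspace(0.01, 0, 10)
kernels = ['rbf', 'poly', 'sigmoid']

# One candidate per (C, gamma, kernel) combination
n_candidates = len(c_values) * len(gamma_values) * len(kernels)
# Each candidate is fitted once per cross-validation split
n_splits = 32
n_fits = n_candidates * n_splits

print('gamma range:', gamma_values.min(), 'to', gamma_values.max())
print('candidates:', n_candidates, 'fits:', n_fits)
```

Printing the gamma range this way makes it easy to confirm whether the grid covers the values intended before launching a long search.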
