# Task B: SVM with HOG features

## 0. Load the preprocessed data
**We use the dataset preprocessed by image processing and PCA. It characterizes 4 different classes (tumor or not) based on 300 features, matching the loaded shape `(3000, 300)`.**

In [294]:
# Import necessary libraries
import pickle
import numpy as np
import pandas as pd

In [295]:
# Load preprocessed data with help of pickle.
with open('DataAfterProcess/images_AfterProcess.pickle', 'rb') as handle:
    X = pickle.load(handle)
    
with open('DataAfterProcess/label_AfterProcess.pickle', 'rb') as handle:
    y = pickle.load(handle)
    
# Check result.
print(X.shape, y.shape) 

(3000, 300) (3000,)
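
Beyond the shape check, it can help to confirm that the features and labels line up and to look at the class balance. The sketch below is only an illustration: `X_demo` and `y_demo` are randomly generated stand-ins with the same shape as the notebook's data, not the real pickled arrays.

```python
import numpy as np

# Stand-in data (assumption): random values shaped like the notebook's X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 300))
y_demo = rng.integers(0, 4, size=3000)

# Features and labels must describe the same number of samples.
same_length = X_demo.shape[0] == y_demo.shape[0]

# Count how many samples fall into each class label.
labels, counts = np.unique(y_demo, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))

# Check for missing values before fitting a model.
has_nan = bool(np.isnan(X_demo).any())

print(same_length, class_counts, has_nan)
```

Running the same three checks on the real `X` and `y` would show whether the four classes are balanced, which matters when reading accuracy later on.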


### Train Test Split

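**Split the data into a training set (90%) and a test set (10%), then split that training set again (90% / 10%) into a new training set and a validation set. Overall this gives 81% training, 9% validation and 10% test data. Note that the test set here comes from the dataset.zip file, not the test.zip file.**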

In [296]:
# Import necessary libraries
from sklearn.model_selection import train_test_split

In [297]:
# Split data into a training set and a test set (90% training and 10% testing data).
# Note: every random_state in this assignment is fixed to ensure reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size=0.1, random_state=3)

# Split training set into a new training set and a validation set (90% training and 10% validation data).
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train,y_train, 
                                                          test_size=0.1, random_state=3)
# Check result.
print(X_train_new.shape, y_train_new.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape) 
print('train set: {} | val set: {} | test set: {}'.format(round(len(y_train_new)/len(X),3), 
                                                          round(len(y_val)/len(X),3),
                                                          round(len(y_test)/len(X),3)))

(2430, 300) (2430,) (270, 300) (270,) (300, 300) (300,)
train set: 0.81 | val set: 0.09 | test set: 0.1
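
The split sizes follow directly from applying `test_size=0.1` twice: 10% of 3000 is 300, and 10% of the remaining 2700 is 270. A minimal sketch on dummy arrays (`X_demo` and `y_demo` are placeholders, not the real data) reproduces the same sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays (assumption) with the same shape as the notebook's data.
X_demo = np.zeros((3000, 300))
y_demo = np.arange(3000) % 4

# First split: hold out 10% of all samples as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=3)

# Second split: hold out 10% of the remaining samples as the validation set.
X_tr_new, X_va, y_tr_new, y_va = train_test_split(X_tr, y_tr, test_size=0.1, random_state=3)

sizes = (len(y_tr_new), len(y_va), len(y_te))
fractions = tuple(round(n / len(y_demo), 3) for n in sizes)
print(sizes, fractions)
```

Because the second split is taken from the 90% that remains, the validation share of the whole dataset is 0.9 × 0.1 = 0.09 rather than 0.1.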


## 1. Train an SVM model
**First, we train an SVM model without hyperparameter tuning so that it can be compared with the model found by GridSearchCV in the next step.**

In [298]:
# Import necessary libraries
from sklearn.svm import SVC

In [299]:
# Call the SVC() model from sklearn and fit the model to the training data.
# Note: every random_state in this assignment is fixed to ensure reproducibility.
svm = SVC(random_state=0, decision_function_shape='ovo')
svm.fit(X_train_new, y_train_new)

SVC(decision_function_shape='ovo', random_state=0)
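
To illustrate what `decision_function_shape='ovo'` changes: with 4 classes, a one-vs-one decision function has one column per class pair, i.e. 4 × 3 / 2 = 6 columns, while the predicted labels stay the same as with `'ovr'`. The sketch below uses synthetic blobs (`X_demo`, `y_demo` are assumptions for illustration), not the tumor data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 4-class data (assumption), standing in for the real features.
X_demo, y_demo = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)

clf_ovo = SVC(decision_function_shape='ovo', random_state=0).fit(X_demo, y_demo)
clf_ovr = SVC(decision_function_shape='ovr', random_state=0).fit(X_demo, y_demo)

# One column per pair of classes for 'ovo', one column per class for 'ovr'.
n_classes = len(clf_ovo.classes_)
ovo_shape = clf_ovo.decision_function(X_demo[:3]).shape
ovr_shape = clf_ovr.decision_function(X_demo[:3]).shape

# The setting only reshapes the decision function; predictions are identical.
same_labels = bool(np.array_equal(clf_ovo.predict(X_demo), clf_ovr.predict(X_demo)))
print(ovo_shape, ovr_shape, same_labels)
```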

### Model validation
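**Evaluate the baseline model with a confusion matrix, classification report, accuracy score and Cohen's kappa score. The cell below reports these metrics on the test set (the validation-set version is kept commented out), which makes the result directly comparable with the tuned model's test-set evaluation in section 3.**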

In [300]:
# Import necessary libraries
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, cohen_kappa_score

In [301]:
# # Get predictions from the model.
# y_pred = svm.predict(X_val)

# # Print model performance: Confusion matrix, accuracy score, classification report and cohen kappa score.
# print('Confusion matrix: \n', confusion_matrix(y_val, y_pred), '\n')
# print('Accuracy on validation set: ', accuracy_score(y_val, y_pred), '\n')
# print('Classification report: \n', classification_report(y_val, y_pred), '\n') 
# print('cohen_kappa score：',cohen_kappa_score(y_val, y_pred))


# Get predictions from the model.
y_pred = svm.predict(X_test)

# Print model performance: confusion matrix, accuracy score, classification report and Cohen's kappa score.
# These metrics are computed on the test split (X_test, 300 samples), so the label says "test set".
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, y_pred), '\n')
print('Classification report: \n', classification_report(y_test, y_pred), '\n')
print('cohen_kappa score：',cohen_kappa_score(y_test, y_pred))

Confusion matrix: 
 [[42  0  1  1]
 [ 5 73 16  0]
 [ 1  1 72  5]
 [ 0  0  1 82]] 

Accuracy on test set:  0.8966666666666666 

Classification report: 
               precision    recall  f1-score   support

           0       0.88      0.95      0.91        44
           1       0.99      0.78      0.87        94
           2       0.80      0.91      0.85        79
           3       0.93      0.99      0.96        83

    accuracy                           0.90       300
   macro avg       0.90      0.91      0.90       300
weighted avg       0.91      0.90      0.90       300
 

cohen_kappa score： 0.8601882197299979
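
As a cross-check, both the accuracy and Cohen's kappa can be recomputed by hand from the confusion matrix printed above. Kappa compares the observed agreement (the accuracy) with the agreement expected by chance from the row and column totals. The variable names here are illustrative only.

```python
import numpy as np

# Confusion matrix of the baseline model, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[42, 0, 1, 1],
               [5, 73, 16, 0],
               [1, 1, 72, 5],
               [0, 0, 1, 82]])

n = cm.sum()

# Observed agreement: fraction of samples on the diagonal (the accuracy).
p_observed = np.trace(cm) / n

# Chance agreement: sum over classes of (true share) * (predicted share).
p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2

# Cohen's kappa rescales accuracy so that chance-level agreement maps to 0.
kappa = (p_observed - p_expected) / (1 - p_expected)
print(p_observed, kappa)
```

The result agrees with the values reported by `accuracy_score` and `cohen_kappa_score` above.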


## 2. Hyperparameter tuning: GridSearchCV
**Let's tune the hyperparameters of the SVM with the help of GridSearchCV(). We search over the kernel type together with different values of the regularization parameter C.**

### Implement GridSearchCV

In [302]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV

In [303]:
# Create a dictionary called param_grid and fill out some parameters for C and kernel.
param_grid = {'C': [0.1,0.3,0.5,1,3,5,10,30,50],
              'kernel': ['poly', 'rbf', 'sigmoid']}
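
The size of this search can be worked out before fitting: 9 values of `C` times 3 kernels gives 27 candidates, and with 5-fold cross-validation that means 135 fits. A small sketch using scikit-learn's `ParameterGrid` (the name `param_grid_demo` is just a copy of the grid above for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, copied so this snippet is self-contained.
param_grid_demo = {'C': [0.1, 0.3, 0.5, 1, 3, 5, 10, 30, 50],
                   'kernel': ['poly', 'rbf', 'sigmoid']}

# ParameterGrid enumerates every combination of the listed values.
n_candidates = len(ParameterGrid(param_grid_demo))

# Assumption: 5 folds, which is what the fit log in this notebook reports.
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```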

In [304]:
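# Create a GridSearchCV() object and fit it to the full training split X_train (training plus validation samples).
# GridSearchCV runs its own 5-fold cross-validation, so no separate validation set is needed here.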
grid_SVMTaskB = GridSearchCV(SVC(decision_function_shape='ovo'), param_grid=param_grid, refit=True, verbose=2)
grid_SVMTaskB.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] END .................................C=0.1, kernel=poly; total time=   0.6s
[CV] END .................................C=0.1, kernel=poly; total time=   0.6s
[CV] END .................................C=0.1, kernel=poly; total time=   0.5s
[CV] END .................................C=0.1, kernel=poly; total time=   0.5s
[CV] END .................................C=0.1, kernel=poly; total time=   0.7s
[CV] END ..................................C=0.1, kernel=rbf; total time=   1.0s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.9s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.8s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.7s
[CV] END ..................................C=0.1, kernel=rbf; total time=   0.7s
[CV] END ..............................C=0.1, kernel=sigmoid; total time=   0.5s
[CV] END ..............................C=0.1, k

[CV] END ...............................C=10, kernel=sigmoid; total time=   0.2s
[CV] END ...............................C=10, kernel=sigmoid; total time=   0.3s
[CV] END ...............................C=10, kernel=sigmoid; total time=   0.2s
[CV] END ...............................C=10, kernel=sigmoid; total time=   0.2s
[CV] END ..................................C=30, kernel=poly; total time=   0.6s
[CV] END ..................................C=30, kernel=poly; total time=   0.6s
[CV] END ..................................C=30, kernel=poly; total time=   0.6s
[CV] END ..................................C=30, kernel=poly; total time=   0.5s
[CV] END ..................................C=30, kernel=poly; total time=   0.6s
[CV] END ...................................C=30, kernel=rbf; total time=   0.6s
[CV] END ...................................C=30, kernel=rbf; total time=   0.6s
[CV] END ...................................C=30, kernel=rbf; total time=   0.6s
[CV] END ...................

GridSearchCV(estimator=SVC(decision_function_shape='ovo'),
             param_grid={'C': [0.1, 0.3, 0.5, 1, 3, 5, 10, 30, 50],
                         'kernel': ['poly', 'rbf', 'sigmoid']},
             verbose=2)

### Check best hyperparameters

In [305]:
# Print best hyperparameters and score of grid model
print('Best hyperparameter by GridsearchCV():', grid_SVMTaskB.best_params_, '\n')
print('Best score by using this hyperparameter: %.4f' %(grid_SVMTaskB.best_score_), '\n\n')

# Print mean, std of grid model
means_grid = grid_SVMTaskB.cv_results_['mean_test_score']
stds_grid = grid_SVMTaskB.cv_results_['std_test_score']
params_grid = grid_SVMTaskB.cv_results_['params']
    
for mean, stdev, param in zip(means_grid, stds_grid, params_grid):
    print('For parameter %r: Mean score %.4f (with std %.4f)\n'% (param, mean, stdev))

Best hyperparameter by GridsearchCV(): {'C': 30, 'kernel': 'rbf'} 

Best score by using this hyperparameter: 0.9337 


For parameter {'C': 0.1, 'kernel': 'poly'}: Mean score 0.6867 (with std 0.0323)

For parameter {'C': 0.1, 'kernel': 'rbf'}: Mean score 0.7411 (with std 0.0168)

For parameter {'C': 0.1, 'kernel': 'sigmoid'}: Mean score 0.7263 (with std 0.0133)

For parameter {'C': 0.3, 'kernel': 'poly'}: Mean score 0.8393 (with std 0.0117)

For parameter {'C': 0.3, 'kernel': 'rbf'}: Mean score 0.8256 (with std 0.0175)

For parameter {'C': 0.3, 'kernel': 'sigmoid'}: Mean score 0.7700 (with std 0.0079)

For parameter {'C': 0.5, 'kernel': 'poly'}: Mean score 0.8681 (with std 0.0154)

For parameter {'C': 0.5, 'kernel': 'rbf'}: Mean score 0.8507 (with std 0.0218)

For parameter {'C': 0.5, 'kernel': 'sigmoid'}: Mean score 0.7800 (with std 0.0093)

For parameter {'C': 1, 'kernel': 'poly'}: Mean score 0.9022 (with std 0.0146)

For parameter {'C': 1, 'kernel': 'rbf'}: Mean score 0.8904 (with st
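
An alternative to printing one line per candidate is to load `cv_results_` into a pandas DataFrame and sort it by rank. The following is a sketch on synthetic data with a smaller grid (`X_demo`, `y_demo` and `grid_demo` are assumptions for illustration), not a rerun of the search above.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 4-class data (assumption) and a reduced grid to keep the demo fast.
X_demo, y_demo = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)
grid_demo = GridSearchCV(SVC(decision_function_shape='ovo'),
                         param_grid={'C': [0.1, 1, 10], 'kernel': ['poly', 'rbf']})
grid_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
results = pd.DataFrame(grid_demo.cv_results_)
columns = ['param_C', 'param_kernel', 'mean_test_score', 'std_test_score', 'rank_test_score']

# Best candidates first (rank 1 is the highest mean cross-validation score).
table = results[columns].sort_values('rank_test_score').reset_index(drop=True)
print(table)
```

With the notebook's own search, the same two lines applied to `grid_SVMTaskB.cv_results_` would list all 27 candidates in one table.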

## 3. Model Evaluation
**Evaluate the tuned model on the test set by checking the confusion matrix, classification report, accuracy score and Cohen's kappa score.**

In [306]:
# Get predictions from the grid model.
pred_grid_SVMTaskB = grid_SVMTaskB.predict(X_test)

# Print grid model performance: confusion matrix, accuracy score, classification report and Cohen's kappa score.
print('Confusion matrix: \n', confusion_matrix(y_test, pred_grid_SVMTaskB), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, pred_grid_SVMTaskB), '\n')
print('Classification report: \n', classification_report(y_test, pred_grid_SVMTaskB), '\n')
print('cohen_kappa score：',cohen_kappa_score(y_test, pred_grid_SVMTaskB))

Confusion matrix: 
 [[44  0  0  0]
 [ 1 86  7  0]
 [ 2  0 74  3]
 [ 0  0  0 83]] 

Accuracy on test set:  0.9566666666666667 

Classification report: 
               precision    recall  f1-score   support

           0       0.94      1.00      0.97        44
           1       1.00      0.91      0.96        94
           2       0.91      0.94      0.92        79
           3       0.97      1.00      0.98        83

    accuracy                           0.96       300
   macro avg       0.95      0.96      0.96       300
weighted avg       0.96      0.96      0.96       300
 

cohen_kappa score： 0.9411862285292033
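
The per-class figures in the classification report can be read straight off the confusion matrix: recall divides each diagonal entry by its row total, precision divides it by its column total. A short sketch using the matrix printed above (variable names are illustrative):

```python
import numpy as np

# Confusion matrix of the tuned model, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[44, 0, 0, 0],
               [1, 86, 7, 0],
               [2, 0, 74, 3],
               [0, 0, 0, 83]])

correct = np.diag(cm)

# Recall: correct predictions divided by the number of true samples per class.
recall = correct / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class.
precision = correct / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all samples.
accuracy = correct.sum() / cm.sum()
print(np.round(precision, 2), np.round(recall, 2), accuracy)
```

For example, class 1 has perfect precision (nothing else was predicted as class 1) but a recall of 86/94, because 8 of its samples were assigned to classes 0 and 2.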


## 4. Save model

In [307]:
# Save trained model with help of pickle.
with open('Model/SVMTaskB.pickle', 'wb') as handle:
    pickle.dump(grid_SVMTaskB, handle)
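
To use a saved model later, it can be loaded back with `pickle.load` and should give the same predictions as before saving. The round trip below is a sketch with a small stand-in model and a temporary file (`model_demo.pickle` is a hypothetical name), not the actual `Model/SVMTaskB.pickle`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small stand-in model (assumption) trained on synthetic 4-class data.
X_demo, y_demo = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)
model = SVC(decision_function_shape='ovo', random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.pickle')  # hypothetical file name

    # Save the fitted model.
    with open(path, 'wb') as handle:
        pickle.dump(model, handle)

    # Load it back into a new object.
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickled scikit-learn models are best loaded with the same scikit-learn version that created them, and pickle files should only be loaded from trusted sources.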