# Task B: SVM

## 0. Load the preprocessed data
**We will use the preprocessed dataset. It contains 3000 samples with 4096 features each; the labels fall into 4 classes (0 to 3), as the shape check and the classification reports below show.**

In [66]:
# Import necessary libraries
import pickle
import numpy as np
import pandas as pd

In [67]:
# Load preprocessed data with help of pickle.
with open('DataAfterProcess/images_AfterProcess.pickle', 'rb') as handle:
    X = pickle.load(handle)
    
with open('DataAfterProcess/label_AfterProcess.pickle', 'rb') as handle:
    y = pickle.load(handle)
    
# Check result.
print(X.shape, y.shape) 

(3000, 4096) (3000,)
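
The shape check above confirms the array sizes but says nothing about how the labels are distributed. A minimal sketch of a class-count check follows; `y_demo` is a small hypothetical label array standing in for the real `y`, which is not reproduced here.

```python
import numpy as np

# Hypothetical stand-in for the loaded label array (4 classes, 0 to 3).
y_demo = np.array([0, 1, 1, 2, 2, 2, 3, 3, 1, 0])

# Count how many samples fall into each class.
classes, counts = np.unique(y_demo, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```

Running the same two lines on the real `y` would show whether the classes are balanced before any model is trained.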


### Train Test Split

**Split data into a training set (90%) and a test set (10%). Note that the test set here comes from the dataset.zip file, not the test.zip file.**

In [68]:
# Import necessary libraries
from sklearn.model_selection import train_test_split

In [69]:
# Split data into a training set (90%) and a test set (10%).
# Note: random_state is set to 0 throughout this assignment to ensure reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.1, random_state=0)

# Check result.
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape) 
print('train set: {}  | test set: {}'.format(round(len(y_train)/len(X),3), 
                                             round(len(y_test)/len(X),3)))

(2700, 4096) (2700,) (300, 4096) (300,)
train set: 0.9  | test set: 0.1
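
The split above is purely random, so the class proportions in the 300-sample test set can drift from those in the full dataset. One possible variant, not used in this notebook, is a stratified split through the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`) with deliberately imbalanced classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 5 features, 4 imbalanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.repeat([0, 1, 2, 3], [20, 60, 60, 60])

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo
)

# Per-class counts in the training and test subsets.
print(np.bincount(y_tr), np.bincount(y_te))
```

Whether stratification changes the results reported here is not known; it mainly makes the per-class test support less dependent on the random seed.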


## 1. Train an SVM model
**First, we train an SVM model without hyperparameter tuning so that it can be compared with the model found by GridSearchCV in the next step.**

In [70]:
# Import necessary libraries
from sklearn.svm import SVC

In [78]:
# Create an SVC() model from sklearn and fit it to the training data.
# Note: random_state is set to 0 throughout this assignment to ensure reproducibility.
svm = SVC(C=1, random_state=0)
svm.fit(X_train, y_train)

SVC(C=1, random_state=0)
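
The printed representation above lists only the arguments that differ from the defaults, so the kernel this baseline uses is not visible. A short sketch of how to inspect the remaining settings with `get_params()`, assuming a scikit-learn version recent enough that `gamma` defaults to `'scale'`:

```python
from sklearn.svm import SVC

# Same constructor call as the baseline model; no fitting is needed
# to inspect the hyperparameters.
baseline = SVC(C=1, random_state=0)
params = baseline.get_params()

# Arguments left at their defaults are still reported here.
print(params['kernel'], params['gamma'], params['degree'])
```

This matters for the later comparison: the baseline runs with the default RBF kernel, so it differs from a `'poly'` model in the kernel as well as in C.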

### Model Evaluation
**Get predictions from the model, then print the confusion matrix, the classification report, the accuracy score and Cohen's kappa score.**

In [79]:
# Import necessary libraries
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, cohen_kappa_score

In [80]:
# Get predictions from the model.
y_pred = svm.predict(X_test)

# Print model performance: confusion matrix, accuracy score, classification report and Cohen's kappa.
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, y_pred), '\n')
print('Classification report: \n', classification_report(y_test, y_pred), '\n') 
print('cohen_kappa score：',cohen_kappa_score(y_test, y_pred))

Confusion matrix: 
 [[27  6 11  2]
 [ 3 67 17  4]
 [ 4 19 53  7]
 [ 0  1  0 79]] 

Accuracy on test set:  0.7533333333333333 

Classification report: 
               precision    recall  f1-score   support

           0       0.79      0.59      0.68        46
           1       0.72      0.74      0.73        91
           2       0.65      0.64      0.65        83
           3       0.86      0.99      0.92        80

    accuracy                           0.75       300
   macro avg       0.76      0.74      0.74       300
weighted avg       0.75      0.75      0.75       300
 

cohen_kappa score： 0.6630748216724844
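
Cohen's kappa corrects the accuracy for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the chance agreement computed from the row and column totals. The sketch below recomputes both numbers by hand from the confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the baseline model, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[27, 6, 11, 2],
               [3, 67, 17, 4],
               [4, 19, 53, 7],
               [0, 1, 0, 79]])

n = cm.sum()

# Observed agreement: share of samples on the diagonal (the accuracy).
p_observed = np.trace(cm) / n

# Chance agreement: product of the true-class and predicted-class
# marginals, summed over the classes.
row_totals = cm.sum(axis=1)
col_totals = cm.sum(axis=0)
p_expected = (row_totals * col_totals).sum() / n**2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(p_observed, 4), round(kappa, 4))
```

The hand calculation agrees with the accuracy and `cohen_kappa_score` values reported above, which is a quick consistency check on the printed matrix.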


## 2. Hyperparameter tuning: GridSearchCV
**Let's tune the hyperparameters with the help of GridSearchCV(). We fix the kernel to 'poly' and search over the regularization parameter C.**

In [39]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV

In [44]:
# Create a dictionary called param_grid with candidate values for C and a polynomial kernel.
param_grid = {'C': [0.5, 5, 50],
              'kernel': ['poly']}


In [45]:
# Create a GridSearchCV() object and fit it to the training data.
grid_SVMTaskA = GridSearchCV(SVC(), param_grid=param_grid, refit=True, verbose=2)
grid_SVMTaskA.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] END .................................C=0.5, kernel=poly; total time=   0.5s
[CV] END .................................C=0.5, kernel=poly; total time=   0.5s
[CV] END .................................C=0.5, kernel=poly; total time=   0.5s
[CV] END .................................C=0.5, kernel=poly; total time=   0.5s
[CV] END .................................C=0.5, kernel=poly; total time=   0.4s
[CV] END ...................................C=5, kernel=poly; total time=   0.5s
[CV] END ...................................C=5, kernel=poly; total time=   0.5s
[CV] END ...................................C=5, kernel=poly; total time=   0.5s
[CV] END ...................................C=5, kernel=poly; total time=   0.5s
[CV] END ...................................C=5, kernel=poly; total time=   0.6s
[CV] END ..................................C=50, kernel=poly; total time=   0.6s
[CV] END ..................................C=50, 

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.5, 5, 50], 'kernel': ['poly']}, verbose=2)

### Check best hyperparameters

In [46]:
# Print best hyperparameters of grid model
print(grid_SVMTaskA.best_params_)

{'C': 50, 'kernel': 'poly'}
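
`best_params_` reports only the winning combination. The cross-validated score of every candidate is stored in `cv_results_`, which helps judge how close the alternatives were. The sketch below shows one way to tabulate it, using a synthetic 4-class dataset from `make_classification` as a stand-in for the real training data; `grid_demo` and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data (4 classes, 10 features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=4, random_state=0
)

# Same grid as in the notebook, with the 5-fold default made explicit.
grid_demo = GridSearchCV(
    SVC(), param_grid={'C': [0.5, 5, 50], 'kernel': ['poly']}, cv=5
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the validation accuracy.
results = pd.DataFrame(grid_demo.cv_results_)[
    ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

Since the best value found above, C=50, is the largest value in the grid, a table like this could also indicate whether extending the grid toward larger C is worth trying.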


### Model Evaluation
**Get predictions from the grid model, then print the confusion matrix, the classification report, the accuracy score and Cohen's kappa score.**

In [47]:
# Get predictions from the grid model.
pred_grid_SVMTaskA = grid_SVMTaskA.predict(X_test)

# Print grid model performance: confusion matrix, accuracy score, classification report and Cohen's kappa.
print('Confusion matrix: \n', confusion_matrix(y_test, pred_grid_SVMTaskA), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, pred_grid_SVMTaskA), '\n')
print('Classification report: \n', classification_report(y_test, pred_grid_SVMTaskA), '\n')
print('cohen_kappa score：',cohen_kappa_score(y_test, pred_grid_SVMTaskA))

Confusion matrix: 
 [[32  4  8  2]
 [ 2 82  7  0]
 [ 3  7 70  3]
 [ 0  1  0 79]] 

Accuracy on test set:  0.8766666666666667 

Classification report: 
               precision    recall  f1-score   support

           0       0.86      0.70      0.77        46
           1       0.87      0.90      0.89        91
           2       0.82      0.84      0.83        83
           3       0.94      0.99      0.96        80

    accuracy                           0.88       300
   macro avg       0.88      0.86      0.86       300
weighted avg       0.88      0.88      0.87       300
 

cohen_kappa score： 0.8317391502069154


## 3. Save model

In [14]:
# Save trained model with help of pickle.
with open('Model/SVMTaskA.pickle', 'wb') as handle:
    pickle.dump(grid_SVMTaskA, handle)
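
A saved model is only useful if it can be restored and gives the same predictions. The following sketch shows a round-trip check with `pickle`, assuming a small stand-in model and a temporary file instead of the real `Model/` path.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Small stand-in model trained on synthetic two-class data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = SVC(kernel='poly', C=50).fit(X_demo, y_demo)

# Save to a temporary file, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'svm_demo.pickle')
    with open(path, 'wb') as handle:
        pickle.dump(model, handle)
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, and only from a trusted source.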