<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Support-Vector-Machines" data-toc-modified-id="Support-Vector-Machines-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Support Vector Machines</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Breast-Cancer-Support-Vector-Classifier" data-toc-modified-id="Breast-Cancer-Support-Vector-Classifier-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Breast Cancer Support Vector Classifier</a></span></li></ul></li><li><span><a href="#In-Class-Work" data-toc-modified-id="In-Class-Work-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>In Class Work</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Support Vector Machines

Slides: [link](https://docs.google.com/presentation/d/10SpukVqPx8zWaPgPH0gyyv_ZYRRnel1GkH93BRUZfTo/edit?usp=sharing)

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from IPython.display import Image  
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split

### Breast Cancer Support Vector Classifier

In [2]:
bc=pd.read_csv('breast_cancer_scikit_onehot_dataset.csv')

In [3]:
# Encode the target as binary: class 4 (malignant in the original Wisconsin coding) -> 1, otherwise 0
target = bc['class'].map(lambda x: 1 if x == 4 else 0)

In [4]:
predictor=bc.drop(columns=['class'])
predictor.head()

| | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 |
| 1 | 5.0 | 4.0 | 4.0 | 5.0 | 7.0 | 10.0 | 3.0 | 2.0 | 1.0 |
| 2 | 3.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 |
| 3 | 6.0 | 8.0 | 8.0 | 1.0 | 3.0 | 4.0 | 3.0 | 7.0 | 1.0 |
| 4 | 4.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 |


In [5]:
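# Standardize every feature to zero mean and unit variance (SVMs are sensitive to feature scale).
# Caveat: the scaler is fit on the full dataset before the split, so test-set statistics
# leak into training. Fitting it on the training rows only is the safer pattern.
scaler = StandardScaler()
predictor = scaler.fit_transform(predictor)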
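
One way to keep test-set statistics out of the scaler is to wrap scaling and the classifier in a single pipeline and fit it on the training split only. The following is a minimal sketch, not the notebook's own method: it uses synthetic data from `make_classification` as a stand-in for the breast cancer features, and the names `X_demo`, `y_demo`, and `leak_free_model` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 9 numeric features, binary target).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=9
)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=9
)

# The pipeline fits the scaler on the training rows only and reuses
# those training statistics when transforming rows at predict time.
leak_free_model = make_pipeline(
    StandardScaler(),
    SVC(C=1.0, kernel="rbf", class_weight="balanced"),
)
leak_free_model.fit(X_tr, y_tr)

fitted_scaler = leak_free_model.named_steps["standardscaler"]
means_match = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
test_accuracy = leak_free_model.score(X_te, y_te)

print("scaler means equal training means:", means_match)
print("test accuracy:", test_accuracy)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.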

In [6]:
# Split the dataset into a training set (70%) and a test set (30%);
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.3, random_state=9)

In [7]:
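# Fit an RBF-kernel support vector classifier.
# class_weight='balanced' weights each class inversely to its frequency.
from sklearn.svm import SVC
svclassifier = SVC(C=1.0, kernel='rbf', class_weight='balanced')
svclassifier.fit(X_train, y_train)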



SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [8]:
y_pred = svclassifier.predict(X_test)

In [9]:
svclassifier.support_vectors_.shape

(71, 9)
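
The shape above reads as "71 training rows act as support vectors, each with 9 features". A fitted `SVC` also exposes `n_support_` (the count per class) and `support_` (the row indices into the training data). The sketch below shows these attributes on synthetic stand-in data; `X_demo`, `y_demo`, and `demo_clf` are illustrative names, and the counts it prints will differ from the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 9 numeric features, binary target).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=9
)
demo_clf = SVC(C=1.0, kernel="rbf", class_weight="balanced").fit(X_demo, y_demo)

# One row per support vector, one column per feature.
print("support vector matrix shape:", demo_clf.support_vectors_.shape)

# Number of support vectors contributed by each class.
print("support vectors per class:", demo_clf.n_support_)

# support_ holds the training-row indices of the support vectors.
rows_match = np.allclose(X_demo[demo_clf.support_], demo_clf.support_vectors_)
print("support_ indexes the training rows:", rows_match)
```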

In [10]:
from sklearn.metrics import accuracy_score

print(f"The accuracy score is {accuracy_score(y_test, y_pred)}")

The accuracy score is 0.9532163742690059


In [11]:
print("f1 Score:",f1_score(y_test, y_pred))

f1 Score: 0.9444444444444445
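
To see where an F1 score comes from, it can help to rebuild it from the confusion matrix. This is a small worked example on hand-made labels (`y_true_demo` and `y_pred_demo` are invented for illustration, not taken from the dataset): precision and recall are computed from the four cells, and F1 is their harmonic mean.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hand-made labels: one false positive (index 2) and one false negative (index 6).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("manual F1:", f1_manual)
print("sklearn F1:", f1_score(y_true_demo, y_pred_demo))
```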


## In Class Work

- Use grid search to improve on the baseline model above.

In [20]:
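# Note: 'gamma' is ignored by the linear kernel and 'decision_function_shape'
# is ignored for binary problems, so several grid points are effectively duplicates.
param_grid = [{
    'C': [1, 5, 20],
    'kernel': ['linear', 'poly', 'rbf'],
    'gamma': ['scale', 2, 4, 5],
    'decision_function_shape': ['ovo', 'ovr'],
    'class_weight': ['balanced'],
}]

grid_search = GridSearchCV(svclassifier, param_grid, cv=5, scoring='f1', return_train_score=True)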

grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight='balanced',
                           coef0=0.0, decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [1, 5, 20], 'class_weight': ['balanced'],
                          'decision_function_shape': ['ovo', 'ovr'],
                          'gamma': ['scale', 2, 4, 5],
                          'kernel': ['linear', 'poly', 'rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='f1', verbose=0)
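
After fitting, the full cross-validation results are available in `cv_results_`, which loads cleanly into a DataFrame for ranking. The sketch below runs a deliberately small grid on synthetic stand-in data; `demo_grid`, `demo_search`, and `ranked` are illustrative names, and the scores will not match the notebook's.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 9 numeric features, binary target).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=9
)

# A small grid keeps the example fast: 2 x 2 = 4 candidates.
demo_grid = {"C": [1, 5], "kernel": ["linear", "rbf"]}
demo_search = GridSearchCV(
    SVC(class_weight="balanced"), demo_grid, cv=5, scoring="f1"
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; sort so the best-ranked candidate comes first.
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranked.to_string(index=False))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top candidates are meaningfully different or within fold-to-fold noise.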

In [21]:
grid_search.best_params_

{'C': 1,
 'class_weight': 'balanced',
 'decision_function_shape': 'ovo',
 'gamma': 'scale',
 'kernel': 'linear'}
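
Because the selected kernel is linear, the fitted model has one weight per feature in `coef_`, which gives a rough view of which features drive the decision. The following sketch illustrates this on synthetic stand-in data; the feature names are hypothetical placeholders, and for the notebook's model the analogous attribute would be `grid_search.best_estimator_.coef_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 9 numeric features, binary target).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=9
)
X_demo = StandardScaler().fit_transform(X_demo)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

linear_clf = SVC(C=1, kernel="linear", class_weight="balanced").fit(X_demo, y_demo)

# For a binary linear SVC, coef_ has shape (1, n_features).
weights = linear_clf.coef_.ravel()

# List features from largest to smallest absolute weight.
for idx in np.argsort(-np.abs(weights)):
    print(f"{feature_names[idx]:>10s}  {weights[idx]:+.3f}")

# The decision function is the weighted sum of the features plus the intercept.
manual_scores = X_demo @ weights + linear_clf.intercept_[0]
scores_match = np.allclose(manual_scores, linear_clf.decision_function(X_demo))
print("manual scores match decision_function:", scores_match)
```

Weights are only comparable across features when the features share a scale, which is why the data is standardized first. `coef_` is not available for non-linear kernels such as `rbf`.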

In [22]:
y_pred = grid_search.predict(X_test)

In [23]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9649122807017544


In [24]:
f1_score(y_test, y_pred)

0.9577464788732395

# Summary
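- SVMs are effective classifiers: here the baseline reached about 0.95 test accuracy (F1 0.94), and the grid-searched model about 0.96 (F1 0.96).
- However, they are not highly interpretable, especially with non-linear kernels.
- The underlying optimization (a convex quadratic program) is harder to understand than that of simpler models.
- They perform well on high-dimensional datasets with relatively few observations.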
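
For reference, the convex quadratic program mentioned above can be written down in its standard soft-margin primal form. This is the textbook formulation rather than anything specific to this notebook, and the symbols are assumptions introduced here: labels $y_i \in \{-1, +1\}$, feature map $\phi$ (the identity for a linear kernel), slack variables $\xi_i$, and the regularization parameter $C$ that corresponds to the `C` argument of `SVC`.

$$
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i \left( w^\top \phi(x_i) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \qquad i = 1, \dots, n.
\end{aligned}
$$

The objective is quadratic in $w$ and the constraints are linear, which is what makes the problem convex. A larger $C$ penalizes margin violations more heavily, so the grid over `C` trades margin width against training errors.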

