
# Evaluating SVM on Multiple Datasets


---

In this lab you will fit SVM classifiers to several datasets and compare them with logistic regression and kNN classifiers.

Your datasets folder has these three datasets to choose from for the lab:

**Breast cancer**

    from sklearn.datasets import load_breast_cancer
    
**Spambase**

    resource-datasets/spam

**Car evaluation**

    resource-datasets/car_evaluation

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

## A: Breast cancer data

### 1. Load and prepare the data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
- Determine the baseline for accuracy.
- Rescale the data.

In [4]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X = pd.DataFrame(data.data,columns=data.feature_names)
y = data.target

In [5]:
X.shape

(569, 30)
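The preparation checklist above can be sketched as follows. This is a minimal illustration rather than the lab's reference solution; the names `n_missing`, `baseline`, and `Xs` are introduced here for the example only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Missing values: total number of NaNs across all predictors
n_missing = int(X.isnull().sum().sum())
print('Missing values:', n_missing)

# Baseline accuracy: share of the most frequent class
baseline = y.value_counts(normalize=True).max()
print('Baseline accuracy:', round(baseline, 3))

# Rescale: zero mean and unit variance for every column
scaler = StandardScaler()
Xs = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
```

Fitting the scaler on the full dataset is fine for a first look, but during cross-validation the scaler should be refit on each training fold (for example by wrapping it in a pipeline) so that no information from the validation fold leaks into training.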

### 2. Build an SVM classifier on the data

For details on the SVM classifier, see [SVM-classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

- Initialize and train a linear SVM with the default settings. What is the average accuracy score with 5-fold cross validation?
- Repeat using a radial basis function (rbf) classifier. Compare the scores. Which one is better?
- Print the confusion matrix and classification report for your models.

- [Classification report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

- Confusion matrix:

 ```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```

In [6]:
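# cv is left at its default here; pass cv=5 explicitly to guarantee the 5-fold split asked for above
linsvm = LinearSVC()
print(cross_val_score(linsvm, X, y).mean())
rbfsvm = SVC()
print(cross_val_score(rbfsvm, X, y).mean())

# refit on the full data; the predictions below are in-sample (training-set) predictions
linsvm.fit(X, y)
pred_lin = linsvm.predict(X)
rbfsvm.fit(X, y)
pred_rbf = rbfsvm.predict(X)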



0.8928737773637634
0.9121720229777983
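The scores above come from unscaled features, although the preparation step asks for rescaling. One possible way to combine both, sketched here under the assumption that a `StandardScaler` inside a pipeline is acceptable for this lab, is to let cross-validation refit the scaler on every training fold. The `max_iter=10000` setting is an assumption made to give `LinearSVC` room to converge.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Each pipeline scales the data first, then fits the classifier
models = {
    'linear': make_pipeline(StandardScaler(), LinearSVC(max_iter=10000)),
    'rbf': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
}

scores = {}
for name, model in models.items():
    # explicit 5-fold cross-validation, mean accuracy over the folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```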




In [7]:
print(classification_report(y, pred_lin))
print('Confusion Matrix:')
pd.crosstab(y, pred_lin, rownames=['Actual'], colnames=['Predicted'], margins=True)

              precision    recall  f1-score   support

           0       0.92      0.90      0.91       212
           1       0.94      0.95      0.95       357

    accuracy                           0.93       569
   macro avg       0.93      0.93      0.93       569
weighted avg       0.93      0.93      0.93       569

Confusion Matrix:


Predicted    0    1  All
Actual
0          191   21  212
1           17  340  357
All        208  361  569


In [8]:
print(classification_report(y, pred_rbf))
print('Confusion Matrix:')
pd.crosstab(y, pred_rbf, rownames=['Actual'], colnames=['Predicted'], margins=True)

              precision    recall  f1-score   support

           0       0.97      0.82      0.89       212
           1       0.90      0.98      0.94       357

    accuracy                           0.92       569
   macro avg       0.93      0.90      0.91       569
weighted avg       0.93      0.92      0.92       569

Confusion Matrix:


Predicted    0    1  All
Actual
0          174   38  212
1            6  351  357
All        180  389  569
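The two reports above score each model on the same rows it was trained on, so they are likely to be optimistic. A hedged alternative is to build the report from out-of-fold predictions with `cross_val_predict`; the names `pred_cv` and `confusion` are chosen for this sketch only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Every prediction comes from a model that did not see that row during training
pred_cv = cross_val_predict(SVC(kernel='rbf'), X, y, cv=5)

print(classification_report(y, pred_cv))
confusion = pd.crosstab(y, pred_cv, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion)
```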


### 3. Tune the SVM classifiers with gridsearch

- Check in the documentation which parameters can be tuned in combination with different kernels.
- Create a further train-test split to obtain a hold-out validation set.
- Cross-validate scores.
- Examine confusion matrices and classification reports.
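As a rough guide to the first bullet: `C` applies to every kernel, `gamma` to the `rbf`, `poly`, and `sigmoid` kernels, and `degree` only to `poly`. The sketch below pairs each kernel with only the parameters it uses by passing a list of grids to `GridSearchCV`. The grid values, the pipeline step names `scale` and `svc`, and `random_state=0` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Hold-out set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# One dict per kernel, so each kernel is only combined with parameters it uses
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.01, 0.1, 1, 10]},
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [0.001, 0.01, 0.1]},
    {'svc__kernel': ['poly'], 'svc__C': [0.1, 1, 10], 'svc__degree': [2, 3]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print('CV accuracy:', round(search.best_score_, 3))
print('Hold-out accuracy:', round(search.score(X_test, y_test), 3))
```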

In [18]:
# hold out a stratified test set; the grid search only sees the training part
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# one estimator and one parameter grid per kernel
svm = [LinearSVC(loss='hinge'),
       SVC(kernel='rbf')]
svm_params = [{'C':np.logspace(-4,4,10)},
              {'C':np.logspace(-4,4,10),
               'gamma':np.logspace(-4,4,10)}]

gs = {}
for i in range(len(svm)):
    key = 'gs_{}'.format(i)
    gs[key] = GridSearchCV(svm[i], svm_params[i], verbose=1, n_jobs=2, cv=3)
    gs[key].fit(X_train, y_train)
    print(gs[key].best_params_)
    print(gs[key].best_score_)
    # evaluate the refitted best model on the hold-out set
    y_pred = gs[key].predict(X_test)
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:')
    print(pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True))

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  30 out of  30 | elapsed:    0.5s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


{'C': 0.0001}
0.9061032863849765
              precision    recall  f1-score   support

           0       0.98      0.83      0.90        53
           1       0.91      0.99      0.95        90

    accuracy                           0.93       143
   macro avg       0.94      0.91      0.92       143
weighted avg       0.93      0.93      0.93       143

Confusion Matrix:
Predicted   0   1  All
Actual                
0          44   9   53
1           1  89   90
All        45  98  143
Fitting 3 folds for each of 100 candidates, totalling 300 fits
{'C': 21.54434690031882, 'gamma': 0.0001}
0.9366197183098591
              precision    recall  f1-score   support

           0       0.94      0.92      0.93        53
           1       0.96      0.97      0.96        90

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143

Confusion Matrix:
Predicted   0   1  All
Actual         

[Parallel(n_jobs=2)]: Done 300 out of 300 | elapsed:    1.9s finished


### 4. Compare kNN and logistic regression on the dataset.


- Grid search for the optimal parameters.
- Cross-validate scores.
- Examine confusion matrices and classification reports.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# liblinear supports both the l1 and l2 penalties searched below
clf = [KNeighborsClassifier(),
       LogisticRegression(solver='liblinear')]
clf_params = [{'n_neighbors':[3,5,7,17,25,51],
               'weights':['uniform','distance']},
              {'C':np.logspace(-4,4,10),
               'penalty':['l1','l2']}]

gs = {}
for i in range(len(clf)):  # loop over clf, not the svm list from the previous step
    key = 'gs_{}'.format(i)
    gs[key] = GridSearchCV(clf[i], clf_params[i], verbose=1, n_jobs=2, cv=3)
    gs[key].fit(X_train, y_train)
    print(gs[key].best_params_)
    print(gs[key].best_score_)
    # evaluate the refitted best model on the hold-out set
    y_pred = gs[key].predict(X_test)
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:')
    print(pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True))

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.


### 5. Bonus: Consider different scores in the gridsearch
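One way to approach this bonus, sketched under assumptions, is to give `GridSearchCV` several scorers at once and choose which one drives model selection through `refit`. In this dataset class 0 is malignant, so the example builds a recall scorer with `pos_label=0`. The scorer name `recall_malignant` and the grid values are hypothetical choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.001, 0.01, 0.1]}

# Several metrics are tracked; 'recall_malignant' treats class 0 as the positive class
scoring = {
    'accuracy': 'accuracy',
    'recall_malignant': make_scorer(recall_score, pos_label=0),
    'roc_auc': 'roc_auc',
}

# refit names the metric used to pick and refit the best model
search = GridSearchCV(pipe, param_grid, cv=5, scoring=scoring, refit='recall_malignant')
search.fit(X, y)

print(search.best_params_)
for metric in scoring:
    mean_score = search.cv_results_['mean_test_' + metric][search.best_index_]
    print(metric, round(mean_score, 3))
```

Changing `refit` to another key of `scoring` selects the model by that metric instead, which makes it easy to see whether the preferred parameters depend on the score.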

## B: Car data

- Repeat the same steps as for the breast cancer data.
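Because the same workflow is repeated for every dataset, it may help to wrap the model comparison in a small function. The helper below, `compare_classifiers`, is a hypothetical name introduced for this sketch; it uses default hyperparameters (apart from `max_iter=1000` for logistic regression, an assumption to help convergence) and is demonstrated on the breast cancer data so that it runs without the lab's CSV files.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_classifiers(X, y, cv=5):
    """Return the mean cross-validated accuracy of each scaled classifier."""
    models = {
        'svm_rbf': SVC(kernel='rbf'),
        'knn': KNeighborsClassifier(),
        'logreg': LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, model in models.items():
        # scaling sits inside the pipeline, so it is refit on every training fold
        pipe = make_pipeline(StandardScaler(), model)
        results[name] = cross_val_score(pipe, X, y, cv=cv).mean()
    return results


X, y = load_breast_cancer(return_X_y=True)
results = compare_classifiers(X, y)
for name, score in results.items():
    print(name, round(score, 3))
```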

### 1. Load and prepare the data

In [None]:
car = pd.read_csv('../../../../resource-datasets/car_evaluation/car.csv')
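If the car data follows the UCI Car Evaluation layout, its predictors are categorical text levels rather than numbers, so they need encoding before an SVM can use them. The sketch below shows one-hot encoding inside a pipeline on a tiny made-up frame. The column names `buying`, `safety`, and `acceptability` and all values are hypothetical and may not match `car.csv`.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy frame; the real column names and levels in car.csv may differ
toy = pd.DataFrame({
    'buying': ['low', 'high', 'med', 'low', 'high', 'med'],
    'safety': ['high', 'low', 'med', 'med', 'low', 'high'],
    'acceptability': ['acc', 'unacc', 'acc', 'acc', 'unacc', 'acc'],
})

X_toy = toy.drop(columns='acceptability')
y_toy = toy['acceptability']

# Baseline: share of the most frequent class
print(y_toy.value_counts(normalize=True))

# One-hot encode every categorical column, then fit a linear-kernel SVM
model = make_pipeline(OneHotEncoder(handle_unknown='ignore'), SVC(kernel='linear'))
model.fit(X_toy, y_toy)
toy_pred = model.predict(X_toy)
print(toy_pred)
```

One-hot encoding treats the levels as unordered. If the levels have a natural order (for example low < med < high), an ordinal encoding is another option worth comparing.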

### 2. Build an SVM classifier

### 3. Grid search SVM

### 4. Compare with kNN and logistic regression

## C: Spam data

- Repeat the same steps as for the breast cancer data.

### 1. Load and prepare the data

In [None]:
spam = pd.read_csv('../../../../resource-datasets/spam/spambase.csv')
spam.head()

### 2. Build an SVM classifier

### 3. Grid search SVM

### 4. Compare to kNN and logistic regression