<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Evaluating SVM on Multiple Data Sets

_Authors: Kiefer Katovich (SF)_

---

In this lab, you'll explore several data sets with SVM classifiers and compare them to logistic regression and KNN classifiers. 

Your "datasets" folder contains the following three data sets to choose from for the lab:

**Breast Cancer**

    ./datasets/breast_cancer_wisconsin

**Spambase**

    ./datasets/spam

**Car Evaluation**

    ./datasets/car_evaluation


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1) Load the breast cancer data.

- Are there any missing values? If so, impute or clean the data set.
- Select a classification target and predictors.
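
As a minimal sketch of the two cleaning options named above, the snippet below uses a small hypothetical frame (`toy`, not the lab CSV) in which `?` marks a missing value. It shows how dropping rows and median imputation differ; which one suits the real data is a judgment call.

```python
import pandas as pd

# Hypothetical toy data standing in for the lab CSV; '?' marks a missing value.
toy = pd.DataFrame({
    'Bare_Nuclei': ['1', '10', '?', '4', '?'],
    'Class': [2, 4, 2, 4, 2],
})

# Coerce the column to numeric; anything unparseable ('?') becomes NaN.
toy['Bare_Nuclei'] = pd.to_numeric(toy['Bare_Nuclei'], errors='coerce')
n_missing = toy['Bare_Nuclei'].isnull().sum()

# Option 1: drop every row that has a missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps with the column median.
imputed = toy.fillna({'Bare_Nuclei': toy['Bare_Nuclei'].median()})

print(n_missing, len(dropped), len(imputed))
```

Dropping is simplest when few rows are affected; imputation keeps the sample size but adds values that were never observed.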

In [2]:
df = pd.read_csv('./datasets/breast_cancer_wisconsin/breast_cancer.csv', na_values='?')

In [3]:
df.head()

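| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |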


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    683 non-null float64
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: float64(1), int64(10)
memory usage: 60.1 KB


In [5]:
df['Class'].value_counts()

2    458
4    241
Name: Class, dtype: int64

In [10]:
# Executed after the dropna() cell (In [9]), so no missing values remain.
df.isnull().sum()

Sample_code_number             0
Clump_Thickness                0
Uniformity_of_Cell_Size        0
Uniformity_of_Cell_Shape       0
Marginal_Adhesion              0
Single_Epithelial_Cell_Size    0
Bare_Nuclei                    0
Bland_Chromatin                0
Normal_Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [9]:
# Drop the 16 rows with a missing Bare_Nuclei value (699 - 683).
df = df.dropna()

In [20]:
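# Target: 1 for Class 4, 0 for Class 2.
y = df['Class'].map(lambda x: 1 if x == 4 else 0)
# Predictors: every column except Class. Note that this keeps Sample_code_number,
# an identifier rather than a measurement; consider excluding it.
X = df.iloc[:, :-1]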

In [21]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
Xn = ss.fit_transform(X)
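
Scaling the full data set before cross-validation lets each validation fold influence the scaler. One way to avoid that, sketched below, is to put the scaler and classifier in a pipeline so the scaler is re-fit on the training folds only. The data here is scikit-learn's bundled breast cancer set, used as a stand-in (an assumption; it is not the CSV from this lab), and `C=0.1` is an illustrative value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer set.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Inside each CV split, the scaler is fit on the training folds only.
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.1))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(pipe_scores.mean())
```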

### 2) Build an SVM classifier on the data.

For details on the SVM classifier, [click here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

- What's the baseline for accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with five-fold cross-validation?
- Repeat using a radial basis function (RBF) kernel. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using training and testing data.

**Classification report**:

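```python
# Docs: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))
```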

**Confusion matrix**:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
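
The first three questions above can be approached roughly as follows. This is a sketch on scikit-learn's bundled breast cancer data (a stand-in, not the lab CSV), with default `C` and `gamma`; the names `X_demo`, `y_demo`, and `results` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer set.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Baseline: accuracy from always predicting the majority class.
baseline = np.bincount(y_demo).max() / float(len(y_demo))

# Mean five-fold accuracy for a linear and an RBF kernel.
results = {}
for kernel in ['linear', 'rbf']:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    results[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(baseline)
print(results)
```

A model is only useful if its cross-validated accuracy clearly exceeds the baseline.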

In [22]:
y.value_counts()/len(y)

0    0.650073
1    0.349927
Name: Class, dtype: float64

In [29]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Higher C penalizes errors more: narrower margin, lower bias, higher variance.
Cs = [0.001, 0.01, 0.1, 1, 10]
# Smaller gamma reduces variance but increases bias.
gammas = [0.001, 0.01, 0.1, 1]
degrees = [1, 2, 3, 4, 5]
# Note: a linear kernel ignores gamma and degree, so only C affects the scores here.
param_grid = {'C': Cs, 'gamma': gammas, 'degree': degrees}
grid_search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=5)
grid_search.fit(Xn, y)
grid_search.best_params_

{'C': 0.1, 'degree': 1, 'gamma': 0.001}
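
Because `gamma` and `degree` have no effect on a linear kernel, the reported values for them are arbitrary. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the lab data), is to pass `GridSearchCV` a list of grids so each kernel is searched only over the parameters it uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), small enough to search quickly.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One sub-grid per kernel: gamma is only searched for the RBF kernel.
param_grid = [
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_demo, y_demo)

# 4 linear candidates + 16 RBF candidates.
n_candidates = len(search.cv_results_['params'])
print(n_candidates)
print(search.best_params_)
```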

In [34]:
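from sklearn.model_selection import cross_val_score
# gamma=0.001 matches the value reported by the grid search above, although the
# linear kernel there ignores gamma. degree is omitted: only the poly kernel uses it.
SVM = svm.SVC(kernel='rbf', C=1, gamma=0.001)
accs = cross_val_score(SVM, Xn, y, cv=5)
print(accs)
print(np.mean(accs))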

[0.91970803 0.95620438 0.97810219 0.97810219 0.99259259]
0.9649418761827521
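
For the last step of this section, one possible approach is to hold out a test set and report the confusion matrix and classification report on both splits. The sketch below uses scikit-learn's bundled breast cancer data as a stand-in (an assumption); the 30% test size and `random_state` are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer set.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1))
model.fit(X_train, y_train)

# Confusion table and report for each split.
tables = {}
for name, X_part, y_part in [('train', X_train, y_train), ('test', X_test, y_test)]:
    y_pred = model.predict(X_part)
    tables[name] = pd.crosstab(y_part, y_pred, rownames=['Actual'],
                               colnames=['Predicted'], margins=True)
    print(name)
    print(tables[name])
    print(classification_report(y_part, y_pred))
```

A large gap between the training and testing reports would suggest overfitting.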


#### 2.A) Are there more false positives or false negatives? Is this good or bad?
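
As a starting point, false positives and false negatives can be read directly from scikit-learn's `confusion_matrix`. The labels below are hypothetical, hand-written values for illustration only. With the encoding used in this lab (1 = Class 4), a false negative is a Class 4 sample predicted as Class 2.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration; 1 is the positive class.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print('false positives:', fp)
print('false negatives:', fn)
```

Which error is worse depends on the application; in a screening setting, a missed positive is usually the costlier mistake.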

In [4]:
# A:

### 3) Perform the steps above with a different data set.

Repeat each step.

In [5]:
# A:

### 4) Compare SVM, KNN, and logistic regression using a data set.

You should:

- Use a grid search to find the optimal parameters for each model (for SVM, just grid search C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
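
The comparison above might be organized as a loop over the three models, each with its own small grid. This is a sketch on scikit-learn's bundled breast cancer data (a stand-in, not a lab data set); the grid values and the names `candidates` and `best_scores` are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer set.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Each entry pairs an estimator with the grid to search for it.
candidates = {
    'svm': (SVC(), {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']}),
    'knn': (KNeighborsClassifier(), {'clf__n_neighbors': [3, 5, 11, 21]}),
    'logreg': (LogisticRegression(max_iter=1000), {'clf__C': [0.1, 1, 10]}),
}

best_scores = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([('scale', StandardScaler()), ('clf', clf)])
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X_demo, y_demo)
    # best_score_ is the mean cross-validated accuracy of the best setting.
    best_scores[name] = search.best_score_
    print(name, search.best_params_, round(search.best_score_, 3))
```

Confusion matrices and classification reports for each `search.best_estimator_` can then be produced on a held-out split.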

**Bonus:**

Plot "learning curves" for the best model of each type. This is a great way to see how training/testing size affects the scores. Check out [the documentation](http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves) for how to use this function in scikit-learn.
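
A minimal sketch of computing the numbers behind a learning curve with scikit-learn's `learning_curve` follows. It again uses the bundled breast cancer data as a stand-in (an assumption), and the five training sizes are an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer set.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1))

# Score the model at five training-set sizes with five-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    model, X_demo, y_demo, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    shuffle=True, random_state=0)

# Average across folds: one training and one validation score per size.
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))
```

Passing `train_sizes` and the two mean arrays to `plt.plot` gives the curve itself; training and validation scores that converge suggest more data would add little.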

In [6]:
# A: