# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer
- Spambase
- Car evaluation
- Mushroom

For each of these a `.names` file is provided with details on the origin of data.

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import patsy
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.svm import SVC

sns.set_style('white')

%matplotlib inline

# Exercise 1: Breast Cancer



## 1.a: Load the Data
Use `pandas.read_csv` to load the data and assess the following:
- Are there any missing values? (how are they encoded? do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.

In [6]:
bc = pd.read_csv('../../assets/datasets/breast_cancer.csv')

In [13]:
bc.head()

Unnamed: 0,Sample_code_number,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [23]:
bc.Bare_Nuclei.replace('?', np.nan, inplace = True)
bc = bc.dropna()
bc.Mitoses.value_counts()

1     563
2      35
3      33
10     14
4      12
7       9
8       8
5       6
6       3
Name: Mitoses, dtype: int64

In [25]:
Y, X = patsy.dmatrices("""
Class ~ Clump_Thickness + Uniformity_of_Cell_Size + Uniformity_of_Cell_Shape + Marginal_Adhesion + Single_Epithelial_Cell_Size +
Bare_Nuclei + Bland_Chromatin + Normal_Nucleoli + Mitoses - 1
""", data = bc)

Y = np.ravel(Y)

In [27]:
# no kernel trick transformation:
# C, remember, is the "regularization" on penalty on msclassification errors.
# Higher C means you care about errors more, lower C means you care about
# maximizing the margin more.
svm_linear = SVC(kernel = 'linear', C = 10)

# kernel is the "radial basis function"
svm_rbf = SVC(kernel = 'rbf', C = 0.00001)

# kernel is some sigmoid transformation (maybe like logistic regression, not sure)
svm_sigmoid = SVC(kernel = 'sigmoid')

# kernel is a polynoial transformation.
# This is probably like (x1, x2) --> (x1, x2, x1**2, x2**2)
# degree is how many exponent terms per predictor you' want (here just 2)
svm_poly = SVC(kernel = 'poly', degree = 2)

In [29]:
svm = SVC()

param_grid = {
    'kernel': ['linear', 'rbf', 'sigmoid', 'poly'],
    'degree': [2, 3, 4, 5],
    'C': np.logspace(-5, 2, 50)
}

gs = GridSearchCV(svm, param_grid, cv = 5, verbose = 1)

gs.fit(X, Y)

Fitting 5 folds for each of 800 candidates, totalling 4000 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    0.5s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    2.2s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    5.0s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:    8.8s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:   13.3s
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:   18.9s
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:   25.4s
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed:   34.5s
[Parallel(n_jobs=1)]: Done 4000 out of 4000 | elapsed:  1.0min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ['linear', 'rbf', 'sigmoid', 'poly'], 'C': array([  1.00000e-05,   1.38950e-05,   1.93070e-05,   2.68270e-05,
         3.72759e-05,   5.17947e-05,   7.19686e-05,   1.00000e-04,
         1.38950e-04,   1.93070e-04,   2.68270e-04,   3.72759e-04,
         5.17947e-04,   7.19686e-0...70e+01,   3.72759e+01,   5.17947e+01,
         7.19686e+01,   1.00000e+02]), 'degree': [2, 3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

In [31]:
gs.best_params_

{'C': 0.037275937203149381, 'degree': 2, 'kernel': 'linear'}

Trying to: 
Maximize margin AND Minimize misclassification error
$$C\sum$$
Big C -> Care more about errors
Small C -> Care more about wider margin

## 1.b: Model Building

- What's the baseline for the accuracy?
- Initialize and train a linear svm. What's the average accuracy score with a 3-fold cross validation?
- Repeat using an rbf classifier. Compare the scores. Which one is better?
- Are your features normalized? if not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check** to decide which model is best, look at the average cross validation score. Are the scores significantly different from one another?

**Check:** Are there more false positives or false negatives? Is this good or bad?

## 1.c: Feature Selection

Use any of the strategies offered by `sklearn` to select the most important features.

Repeat the cross validation with only those 5 features. Does the score change?

## 1.d: Learning Curves

Learning curves are useful to study the behavior of training and test errors as a function of the number of datapoints available.

- Plot learning curves for train sizes between 10% and 100% (use StratifiedKFold with 5 folds as cross validation)
- What can you say about the dataset? do you need more data or do you need a better model?

##  1.e: Grid Ssearch

Use the grid_search function to explore different kernels and values for the C parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score

# Exercise 2
Now that you've completed steps 1.a through 1.e it's time to tackle some harder datasets. But before we do that, let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores

> Answer: see above

## 2.b: Confusion Matrix and Classification report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Prints confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes

> Answer: see above

## 2.c: Learning Curves
Implement a function `do_learning_curve(model, X, y, sizes)` that automates drawing the learning curves:
- Allow for sizes input
- Use 5-fold StratifiedKFold cross validation

> Answer: see above

## 2.d: Grid Search
Implement a function `do_grid_search(model, parameters)` that automates the grid search by doing:
- Calculate grid search
- Print best parameters
- Print best score
- Return best estimator


> Answer: see above

# Exercise 3
Using the functions above, analyze the Spambase dataset.

Notice that now you have many more features. Focus your attention on step C => feature selection

- Load the data and get to X, y
- Select the 15 best features
- Perform grid search to determine best model
- Display learning curves

# Exercise 4
Repeat steps 1.a - 1.e for the car dataset. Notice that now features are categorical, not numerical.
- Find a suitable way to encode them
- How does this change our modeling strategy?

Also notice that the target variable `acceptability` has 4 classes. How do we encode them?


# Bonus
Repeat steps 1.a - 1.e for the mushroom dataset. Notice that now features are categorical, not numerical. This dataset is quite large.
- How does this change our modeling strategy?
- Can we use feature selection to improve this?
