# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer
- Spambase
- Car evaluation
- Mushroom

For each of these a `.names` file is provided with details on the origin of data.

In [2]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score, train_test_split, StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.learning_curve import learning_curve

# Exercise 1: Breast Cancer



## 1.a: Load the Data
Use `pandas.read_csv` to load the data and assess the following:
- Are there any missing values? (how are they encoded? do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.

In [4]:
# Read in the dataset
data = pd.read_csv('breast_cancer.csv')

In [3]:
# Look at the head and dtypes
data.head()
data.dtypes
# There are 16 missing values
# Features are numeric
# The values are not normalized
# 2 classes in the target

Sample_code_number              int64
Clump_Thickness                 int64
Uniformity_of_Cell_Size         int64
Uniformity_of_Cell_Shape        int64
Marginal_Adhesion               int64
Single_Epithelial_Cell_Size     int64
Bare_Nuclei                    object
Bland_Chromatin                 int64
Normal_Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

In [4]:
# Convert Bare_Nuclei to numeric and change missing values to NaN
data.Bare_Nuclei = pd.to_numeric(data.Bare_Nuclei, errors = 'coerce')

In [5]:
# Drop the rows that contain a missing value in Bare_Nuclei
data.dropna(axis=0,how='any',inplace=True)

In [6]:
# Map the Class to be 0 for benign and 1 for Malignant
data.Class = data.Class.map({2:'0', 4:'1'})
data.Class = pd.to_numeric(data.Class, errors = 'coerce')

## 1.b: Model Building

- What's the baseline for the accuracy?
- Initialize and train a linear svm. What's the average accuracy score with a 3-fold cross validation?
- Repeat using an rbf classifier. Compare the scores. Which one is better?
- Are your features normalized? if not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check** to decide which model is best, look at the average cross validation score. Are the scores significantly different from one another?

In [7]:
# Set our feature and target variables
X = data[data.columns.values[1:10]]
y = data.Class

In [8]:
# Initialize a linear SVC and fit the data
model_linear = SVC(kernel='linear')
model_linear.fit(X,y)
# Check the model's score
cvscores = cross_val_score(model_linear,X,y,cv=3,n_jobs=1)
print "CV score: {:.4} +/- {:.4}".format(cvscores.mean(), cvscores.std())

CV score: 0.9649 +/- 0.01788


In [9]:
# Initialize a RBF SVC and fit the data
model_rbf = SVC(kernel='rbf')
model_rbf.fit(X,y)
# Check the model's score:
cvscores = cross_val_score(model_rbf,X,y,cv=3,n_jobs=1)
print "CV score: {:.4} +/- {:.4}".format(cvscores.mean(), cvscores.std())

CV score: 0.9576 +/- 0.02513


In [10]:
# Now we normalize our data and repeat our models
# Checked both l1 and l2 and saw that l2 performed better, so use l2 for our models below
X_norm = preprocessing.normalize(X,norm='l2')

In [11]:
# Initialize a linear SVC and fit the data
model_norm_linear = SVC(kernel='linear')
model_norm_linear.fit(X_norm,y)
# Check the model's score
cvscores = cross_val_score(model_norm_linear,X_norm,y,cv=3,n_jobs=1)
print "CV score: {:.4} +/- {:.4}".format(cvscores.mean(), cvscores.std())

CV score: 0.8741 +/- 0.02312


In [12]:
# Initialize a RBF SVC and fit the data
model_norm_rbf = SVC(kernel='rbf')
model_norm_rbf.fit(X_norm,y)
# Check the model's score:
cvscores = cross_val_score(model_norm_rbf,X_norm,y,cv=3,n_jobs=1)
print "CV score: {:.4} +/- {:.4}".format(cvscores.mean(), cvscores.std())

CV score: 0.8653 +/- 0.03071


In [13]:
# We see that after normalizing our features the score drops for both linear and RBF

In [14]:
# We see that the linear model performs slightly better than the RBF model for this data set (both non-normalized and normalized)
# Non-normalized feature columns perform better

In [15]:
# Print a confusion matrix and classification report for our best performing model:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
model_linear.fit(X_train,y_train)
predictions = model_linear.predict(X_test)
print confusion_matrix(y_test,predictions)
print classification_report(y_test,predictions,target_names=['Benign','Malignant'])

[[142   5]
 [  3  76]]
             precision    recall  f1-score   support

     Benign       0.98      0.97      0.97       147
  Malignant       0.94      0.96      0.95        79

avg / total       0.96      0.96      0.96       226



**Check:** Are there more false positives or false negatives? Is this good or bad?

In [16]:
# There are more FP (5) than FN (3)
# This is a good thing in this situation because it is worse to have a false negative and think the patient doesn't have cancer.
# With cancer it is important to catch it as early as possible to maximize survival chances.

## 1.c: Feature Selection

Use any of the strategies offered by `sklearn` to select the most important features.

Repeat the cross validation with only those 5 features. Does the score change?

In [17]:
# Define a function to select the k best features
def k_best(X,y,k):
    select = SelectKBest(chi2, k=k)
    selected_data = select.fit_transform(X,y)
    selected_cols = X.columns[select.get_support()]
    X_selected = pd.DataFrame(selected_data, columns=selected_cols)
    return X_selected

In [18]:
# Select the 5 best features from X
X_best = k_best(X,y,5)

In [19]:
# Repeat the cross validation with only those 5 features:
# Linear:
model_linear.fit(X_best,y)
cvscores = cross_val_score(model_linear,X_best,y,cv=3,n_jobs=1)
print "Linear CV score: {:.4} +/- {:.4}".format(cvscores.mean(), cvscores.std())
# RBF
model_rbf.fit(X_best,y)
cvscores = cross_val_score(model_rbf,X_best,y,cv=3,n_jobs=1)
print "RBF CV score: {:.4} +/- {:.4}".format(cvscores.mean(), cvscores.std())

Linear CV score: 0.959 +/- 0.02156
RBF CV score: 0.9576 +/- 0.01611


In [20]:
# There is not much change when we only use the 5 best features instead of all of them.
# Mean score for Linear drops a negligible amount.
# Mean score for RBF stays the same but std dev is smaller.

##  1.d: Grid Search

Use the grid_search function to explore different kernels and values for the C parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score

In [22]:
params = {'C':[.01,.1,1,10,100],'kernel':['linear','rbf']}
grid = GridSearchCV(SVC(),params,cv=3)
grid.fit(X,y)
print "The best parameters are: {}".format(grid.best_params_)
print "These parameters give a score of: {}".format(grid.best_score_)

The best parameters are: {'kernel': 'linear', 'C': 0.01}
These parameters give a score of: 0.966325036603


# Exercise 2
Now that you've completed steps 1.a through 1.d it's time to tackle some harder datasets. But before we do that, let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores

In [23]:
def do_cv(model,X,y,cv):
    model.fit(X,y)
    cvscores = cross_val_score(model,X,y,cv=cv,n_jobs=-1)
    print model
    score_stats = (cvscores.mean(),cvscores.std())
    print "CV score: {:.4} +/- {:.4}".format(*score_stats)
    return score_stats

## 2.b: Confusion Matrix and Classification report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Prints confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes

In [24]:
def do_cm_cr(model,X,y,names):
    X_train, X_test, y_train, y_test = train_test_split(X,y,stratify=y,test_size=.33,random_state=42)
    model.fit(X_train,y_train)
    predictions = model.predict(X_test)
    print confusion_matrix(y_test,predictions)
    print classification_report(y_test,predictions,target_names=names)

## 2.c: Grid Search
Implement a function `do_grid_search(model, parameters)` that automates the grid search by doing:
- Calculate grid search
- Print best parameters
- Print best score
- Return best estimator

In [25]:
def do_grid_search(model, X, y, parameters):
    grid = GridSearchCV(model,parameters,cv=3)
    grid.fit(X,y)
    print "The best parameters are: {}".format(grid.best_params_)
    print "These parameters give a score of: {}".format(grid.best_score_)
    return grid.best_estimator_

# Exercise 3
Using the functions above, analyze the Spambase dataset.

Notice that now you have many more features. Focus your attention on step C => feature selection

- Load the data and get to X, y
- Select the 15 best features
- Perform grid search to determine best model

In [5]:
# Load the dataset
spam = pd.read_csv('spambase.csv')

In [29]:
# Set the target and feature
X = spam[spam.columns.values[:-1]]
y = spam['class']

In [30]:
# Select the 15 best features
spam_best = k_best(X,y,15)

In [84]:
# Perform grid search to determine best model
params = {'C':[.01,.1,1,10,100],'kernel':['linear','rbf']}
do_grid_search(SVC(),spam_best,y,params)

The best parameters are: {'kernel': 'linear', 'C': 100}
These parameters give a score of: 0.896978917627


SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [None]:
# This takes a very long time to run
# Best params: linear, C=100
# Best score: 0.8969

In [None]:
# Perform a CV on this model
model = SVC(C=100,kernel='linear')
do_cv(model,spam_best,y,cv=5)

# Exercise 4
Repeat steps 1.a - 1.d for the car dataset. Notice that now features are categorical, not numerical.
- Find a suitable way to encode them
- How does this change our modeling strategy?

Also notice that the target variable `acceptability` has 4 classes. How do we encode them?


In [6]:
# Read in the dataset
cars = pd.read_csv('car.csv')

In [54]:
# Exploratory analysis
cars.head()
cars.dtypes
cars.buying.unique()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [55]:
# Set the target and features
X = cars[['buying','maint','doors','persons','lug_boot','safety']]
y = cars['acceptability']

In [59]:
# Make dummy variables and map the target
X_dum = pd.get_dummies(X,columns=['buying','maint','lug_boot','safety'])
y = y.map({'unacc':0,'acc':1,'good':2,'vgood':3})
# Fix the doors and persons columns
X_dum.doors = X_dum.doors.map({'2':2,'3':3,'4':4,'5more':5})
X_dum.persons = X_dum.persons.map({'2':2,'4':4,'more':5})

In [70]:
# Do a grid search to determine best modeling parameters
params = {'C':[.01,.1,1,10,100],'kernel':['linear','rbf']}
do_grid_search(SVC(),X_dum,y,params)

The best parameters are: {'kernel': 'rbf', 'C': 0.1}
These parameters give a score of: 0.740740740741


SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [75]:
# Select 10 best features and perform another grid search
X_best = k_best(X_dum,y,10)
do_grid_search(SVC(),X_best,y,params)

The best parameters are: {'kernel': 'rbf', 'C': 0.1}
These parameters give a score of: 0.732638888889


SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [76]:
# Select 5 best features and perform another grid search
X_best = k_best(X_dum,y,5)
do_grid_search(SVC(),X_best,y,params)

The best parameters are: {'kernel': 'rbf', 'C': 0.1}
These parameters give a score of: 0.738425925926


SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [78]:
# We see that we can get almost the same score using only 5 features

In [2]:
# Perform a CV on this model
model = SVC(C=.1,kernel='rbf')
do_cv(model,X_dum,y,cv=5)

NameError: name 'SVC' is not defined

# Bonus
Repeat steps 1.a - 1.d for the mushroom dataset. Notice that now features are categorical, not numerical. This dataset is quite large.
- How does this change our modeling strategy?
- Can we use feature selection to improve this?


In [7]:
# Read in the mushroom data
data = pd.read_csv('mushroom.csv')

In [55]:
# Map the class column (e = edible, p = poisonous)
data['class'] = data['class'].map({'e':0,'p':1})

In [56]:
# Look at the data
data.head()
# We see that every column has letters as values. 
# We are going to need to fix this numerical values to build a model.

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,0,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,0,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,1,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,0,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [65]:
# Create a map and apply it for each column to convert from letter values to numbers
for column in data.columns.values:
    column_dict = {}
    for cat,num in zip(data[column].unique(),range(len(data[column].unique()))):
        column_dict[cat] = num
    data[column] = data[column].map(column_dict)

In [70]:
# Set the target and features
X = data[data.columns.values[1:]]
y = data['class']

In [76]:
# Do a grid search to determine best modeling parameters
params = {'C':[.01,.1,1,10,100],'kernel':['linear','rbf']}
do_grid_search(SVC(),X,y,params)

The best parameters are: {'kernel': 'linear', 'C': 100}
These parameters give a score of: 0.935007385524


SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [None]:
# Create a model using these optimal parameters
grid_model = SVC(C=100,kernel='linear')

In [74]:
# Perform a cross validation on this model
do_cv(grid_model,X,y,5)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
CV score: 0.9508 +/- 0.06409


(0.95077769496906794, 0.064087491690281537)

In [75]:
# Print the confusion matrix and classification report for this model
do_cm_cr(grid_model,X,y,['edible','poisonous'])

[[1292    0]
 [   0 1389]]
             precision    recall  f1-score   support

     edible       1.00      1.00      1.00      1292
  poisonous       1.00      1.00      1.00      1389

avg / total       1.00      1.00      1.00      2681



In [80]:
# See what happens when we select only the 10 best features
X_best = k_best(X,y,10)

In [81]:
# Do a grid search to determine best modeling parameters
do_grid_search(SVC(),X_best,y,params)

The best parameters are: {'kernel': 'linear', 'C': 10}
These parameters give a score of: 0.926267848351


SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [82]:
# Create a model using these optimal parameters
grid_model = SVC(C=10,kernel='linear')

In [83]:
# Perform a cross validation on this model
do_cv(grid_model,X,y,5)

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
CV score: 0.9312 +/- 0.06976


(0.93120126450298224, 0.069758062881640062)

In [84]:
# Print the confusion matrix and classification report for this model
do_cm_cr(grid_model,X,y,['edible','poisonous'])

[[1292    0]
 [   0 1389]]
             precision    recall  f1-score   support

     edible       1.00      1.00      1.00      1292
  poisonous       1.00      1.00      1.00      1389

avg / total       1.00      1.00      1.00      2681

