# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer
2. Spambase
3. Car evaluation
4. Mushroom

For each of these, a `.names` file is provided with details on the origin of the data.

In [3]:
import pandas as pd

# Exercise 1: Breast Cancer



## 1.a: Load the Data
Use `pandas.read_csv` to load the data and assess the following:
- Are there any missing values? (how are they encoded? do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.

In [4]:
df_bc = pd.read_csv('./../../assets/datasets/breast_cancer.csv')

# df_bc.head()
# df_bc.info()
print("There are question marks in Bare_Nuclei per the value counts below. I am going to remove those rows since they are only 16 of 699 entries.")
print(df_bc.Bare_Nuclei.value_counts())

# Drop the rows whose Bare_Nuclei value is missing (encoded as '?'),
# then make the column numeric now that only digit strings remain.
df_bc = df_bc[df_bc.Bare_Nuclei != '?'].copy()
df_bc['Bare_Nuclei'] = df_bc['Bare_Nuclei'].astype(int)

df_bc.Bare_Nuclei.value_counts()

There are question marks in Bare_Nuclei per the value counts below. I am going to remove those rows since they are only 16 of 699 entries.
1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64


1     402
10    132
5      30
2      30
3      28
8      21
4      19
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64

## 1.b: Model Building

- What's the baseline for the accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 3-fold cross validation?
- Repeat using an RBF kernel. Compare the scores. Which one is better?
- Are your features normalized? If not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check:** To decide which model is best, look at the average cross validation score. Are the scores significantly different from one another?

In [5]:
df_bc.head()

Unnamed: 0,Sample_code_number,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [6]:
print("What's the baseline for the accuracy?")
# Share of the majority class (2 = benign), i.e. the accuracy of always predicting it
df_bc.Class.value_counts(normalize=True)[2]

What's the baseline for the accuracy? 

0.65007320644216693




In [None]:
# Initialize and train a linear SVM. What's the average accuracy score with 3-fold cross validation?

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Create X, y variables. Sample_code_number is an identifier, not a
# measurement, so it is dropped together with the target.
X = df_bc.drop(['Sample_code_number', 'Class'], axis=1)
y = df_bc['Class'].map({2: 0, 4: 1})  # 0 = benign, 1 = cancer

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

# Create and fit model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Average accuracy over 3 folds (cross_val_score refits a clone of the model on each fold)
avg_accuracy = np.mean(cross_val_score(model, X, y, cv=3))

print("Average accuracy of the linear SVM with 3-fold cross validation:")
avg_accuracy


In [None]:
# Are your features normalized? If not, try normalizing and repeat the test. Does the score improve?

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Linear kernel on the normalized features
model = SVC(kernel='linear')
print('Avg accuracy score for cv=3, normalized features, linear kernel: %.3f'
      % np.mean(cross_val_score(model, X_scaled, y, cv=3)))

# RBF kernel on the same features, for comparison
model = SVC(kernel='rbf')
print('Avg accuracy score for cv=3, normalized features, rbf kernel: %.3f'
      % np.mean(cross_val_score(model, X_scaled, y, cv=3)))

# Confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, test_size=0.33, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions, target_names=['benign', 'cancer']))

# Use a new name so the imported confusion_matrix function is not overwritten
cm = pd.DataFrame(confusion_matrix(y_test, predictions))
cm.columns = ['Predicted Benign', 'Predicted Cancer']
cm.index = ['benign', 'cancer']
cm

**Check:** Are there more false positives or false negatives? Is this good or bad?

## 1.c: Feature Selection

Use any of the strategies offered by `sklearn` to select the 5 most important features.

Repeat the cross validation with only those 5 features. Does the score change?

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Keep the 5 features with the highest chi-squared score, then rescale them
X2 = SelectKBest(chi2, k=5).fit_transform(X, y)
scaler = StandardScaler()
scaler.fit(X2)
X_scaled = scaler.transform(X2)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

print('The average accuracy score with 3-fold cross validation when features are normalized and the classifier is linear is:')
print(np.mean(cross_val_score(model, X_scaled, y, cv=3)))
print('The score went down a bit.')

## 1.d: Learning Curves

Learning curves are useful to study the behavior of training and test errors as a function of the number of datapoints available.

- Plot learning curves for train sizes between 10% and 100% (use StratifiedKFold with 5 folds as cross validation)
- What can you say about the dataset? do you need more data or do you need a better model?
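
No answer cell is given for this step, so here is a minimal sketch of how the curve values could be computed with `sklearn.model_selection.learning_curve`. The data is a synthetic stand-in built with `make_classification` (an assumption made so the snippet runs on its own); in the lab you would pass your own `X_scaled` and `y` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in the lab's X_scaled and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=42)

# 5 stratified folds, train sizes from 10% to 100% of each training split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_fractions = np.linspace(0.1, 1.0, 10)

sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel='linear'), X_demo, y_demo,
    train_sizes=train_fractions, cv=cv)

# Average over the 5 folds: one row of scores per train size
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mean, valid_mean):
    print('n=%3d  train=%.3f  validation=%.3f' % (n, tr, va))
```

Plotting `train_mean` and `valid_mean` against `sizes` gives the learning curves. As a rule of thumb, a validation score that is still rising at the largest size suggests more data would help, while two curves that have converged to a low score suggest trying a better model.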

## 1.e: Grid Search

Use `GridSearchCV` to explore different kernels and values for the `C` parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score

In [23]:
from sklearn.model_selection import GridSearchCV

# Search over kernel and C with a plain SVC as the base estimator
clf = SVC()
gs = GridSearchCV(clf, param_grid=[{'C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'kernel':['linear','rbf']}])
gs.fit(X_scaled, y)

In [None]:
# Print the best parameters and the best score
print(gs.best_params_)
print(gs.best_score_)

# Exercise 2
Now that you've completed steps 1.a through 1.e it's time to tackle some harder datasets. But before we do that, let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores
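
One possible shape for this helper is sketched below. It is an illustration rather than the reference solution: the default `cv=3` and the synthetic demo data from `make_classification` are assumptions added so the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def do_cv(model, X, y, cv=3):
    """Cross-validate `model`, print it, and print/return mean and std of the scores."""
    scores = cross_val_score(model, X, y, cv=cv)
    mean, std = np.mean(scores), np.std(scores)
    print(model)
    print('CV accuracy: %.3f +/- %.3f' % (mean, std))
    return mean, std


# Demo on synthetic data (assumption); in the lab pass your own X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
mean_acc, std_acc = do_cv(SVC(kernel='linear'), X_demo, y_demo, cv=3)
```

Returning the standard deviation alongside the mean makes it easier to judge whether two models really differ or whether the gap is within fold-to-fold noise.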

> Answer: see above

## 2.b: Confusion Matrix and Classification report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Prints confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes
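
A minimal sketch of such a helper follows. It assumes that `names` is ordered like the sorted class labels in `y`, and the demo data and the labels `'negative'`/`'positive'` are placeholders invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def do_cm_cr(model, X, y, names):
    """Fit on a stratified split, then print a labeled confusion matrix and report."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.33, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Fix the label order so rows/columns always line up with `names`
    labels = np.unique(y)
    cm = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=labels),
                      index=['actual ' + n for n in names],
                      columns=['predicted ' + n for n in names])
    print(cm)
    print(classification_report(y_test, y_pred, labels=labels, target_names=names))
    return cm


# Demo on synthetic data (assumption); in the lab pass your own X, y and class names.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
cm_demo = do_cm_cr(SVC(kernel='linear'), X_demo, y_demo, ['negative', 'positive'])
```

Returning the DataFrame is optional, but it lets a notebook cell display the matrix as a formatted table.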

> Answer: see above

## 2.c: Learning Curves
Implement a function `do_learning_curve(model, X, y, sizes)` that automates drawing the learning curves:
- Allow for sizes input
- Use 5-fold StratifiedKFold cross validation
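
The helper could look roughly like the sketch below. The `plot=True` keyword is a hypothetical extra that is not part of the requested signature; it is there only so the numbers can be inspected without opening a figure. The demo data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.svm import SVC


def do_learning_curve(model, X, y, sizes, plot=True):
    """Compute (and optionally draw) learning curves with 5-fold StratifiedKFold."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    n_train, train_scores, valid_scores = learning_curve(
        model, X, y, train_sizes=sizes, cv=cv)
    train_mean = train_scores.mean(axis=1)
    valid_mean = valid_scores.mean(axis=1)

    if plot:
        # Imported here so the numeric part works without matplotlib installed
        import matplotlib.pyplot as plt
        plt.figure()
        plt.plot(n_train, train_mean, 'o-', label='Training score')
        plt.plot(n_train, valid_mean, 'o-', label='Cross-validation score')
        plt.xlabel('Training examples')
        plt.ylabel('Accuracy')
        plt.legend(loc='best')

    return n_train, train_mean, valid_mean


# Demo on synthetic data (assumption), with plotting switched off
X_demo, y_demo = make_classification(n_samples=250, n_features=9, random_state=1)
n_train, train_mean, valid_mean = do_learning_curve(
    SVC(kernel='rbf'), X_demo, y_demo, sizes=[0.4, 0.6, 0.8, 1.0], plot=False)
print(n_train, valid_mean)
```

With the default `plot=True`, call `plt.show()` afterwards (or rely on the notebook's inline backend) to display the figure.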

> Answer: see above

## 2.d: Grid Search
Implement a function `do_grid_search(model, X, y, parameters)` that automates the grid search by doing:
- Calculate grid search
- Print best parameters
- Print best score
- Return best estimator
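
A sketch of one way to write it is shown below. Because a grid search needs data to fit on, this version takes `X` and `y` as arguments, matching how the function is called in Exercise 3. The `cv=3` default and the small demo grid are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def do_grid_search(model, X, y, parameters, cv=3):
    """Run a grid search, print the best parameters and score, return the best estimator."""
    gs = GridSearchCV(model, param_grid=parameters, cv=cv)
    gs.fit(X, y)
    print('Best parameters: %s' % (gs.best_params_,))
    print('Best score: %.3f' % gs.best_score_)
    return gs.best_estimator_


# Demo on synthetic data with a deliberately small grid (both are assumptions)
X_demo, y_demo = make_classification(n_samples=150, n_features=9, random_state=0)
demo_grid = [{'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}]
best_model = do_grid_search(SVC(), X_demo, y_demo, demo_grid)
```

The returned estimator has already been refit on all of `X` and `y` with the best parameters, so it can be passed straight to the other helpers.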


> Answer: see above

# Exercise 3
Using the functions above, analyze the Spambase dataset.

Notice that now you have many more features. Focus your attention on step 1.c (feature selection).

- Load the data and get to X, y
- Select the 15 best features
- Perform grid search to determine best model
- Display learning curves

In [None]:
# Load the data and get to X, y
df_spam = pd.read_csv('./../../assets/datasets/spambase.csv')
df_spam.head()
X = df_spam.drop('class', axis=1)
y = df_spam['class']
X.head()

In [None]:
# Select the 15 best features
X_Kbest = SelectKBest(chi2, k=15).fit_transform(X, y)
scaler = StandardScaler()
scaler.fit(X_Kbest)

X_scld = scaler.transform(X_Kbest)
X_scld.shape

In [None]:
# Perform grid search to determine best model
model = SVC()
do_cm_cr(model, X_scld, y, ['not spam', 'spam'])
parameters = [{'C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 'kernel':['linear','rbf']}]
do_grid_search(model, X_scld, y, parameters)

In [None]:

# Display learning curves
import matplotlib.pyplot as plt

# RBF kernel with C=10; gamma='auto' is kept explicit because newer
# scikit-learn releases default to gamma='scale'
model = SVC(C=10, kernel='rbf', gamma='auto')
do_learning_curve(model, X_scld, y, [.4, .5, .6, .7, .8, .9, 1.0])
plt.show()

# Exercise 4
Repeat steps 1.a - 1.e for the car dataset. Notice that now features are categorical, not numerical.
- Find a suitable way to encode them
- How does this change our modeling strategy?

Also notice that the target variable `acceptability` has 4 classes. How do we encode them?


In [4]:
# Load the Data
car = pd.read_csv('./../../assets/datasets/car.csv')
car.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [5]:
from sklearn.preprocessing import LabelEncoder
# Get dummies for X and make y categorical
X = pd.get_dummies(car.drop('acceptability', axis=1))
le = LabelEncoder()
y = le.fit_transform(car['acceptability'])

print(y[:5])
print(X.shape)

[2 2 2 2 2]
(1728, 21)
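
The one-hot encoding above treats every level as unrelated to the others. Since levels such as `low`, `med`, `high` have a natural order, an ordinal mapping is another option worth comparing. The sketch below uses a few hand-written toy rows, and the level orderings are assumptions that should be checked against the `.names` file.

```python
import pandas as pd

# Toy rows (assumption) reusing column names and levels of the car dataset
toy = pd.DataFrame({
    'buying': ['vhigh', 'high', 'med', 'low'],
    'safety': ['low', 'med', 'high', 'low'],
    'lug_boot': ['small', 'med', 'big', 'small'],
})

# Assumed ordering of the levels, from smallest to largest
level_order = {
    'buying': ['low', 'med', 'high', 'vhigh'],
    'safety': ['low', 'med', 'high'],
    'lug_boot': ['small', 'med', 'big'],
}

# Replace each level by its position in the ordered list
encoded = toy.copy()
for col, levels in level_order.items():
    encoded[col] = toy[col].map({level: i for i, level in enumerate(levels)})

print(encoded)
```

An ordinal encoding keeps one column per feature, which matters for a linear kernel because it assumes equal spacing between levels; one-hot encoding avoids that assumption at the cost of more columns.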


# Bonus
Repeat steps 1.a - 1.e for the mushroom dataset. Notice that now features are categorical, not numerical. This dataset is quite large.
- How does this change our modeling strategy?
- Can we use feature selection to improve this?
