Support Vector Machines and Cross-Validation
=====

Some of the code contained within this notebook is from Ch. 16 of *Data Science from Scratch* by J. Grus.



Example of feature-wise normalization. Note that the normalization terms (the per-feature minimum and range) are computed from the training data only, so normalized validation features can fall slightly outside [0, 1].

In [5]:
from __future__ import division
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import numpy as np

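# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
X, y = load_boston(return_X_y=True)

# print(type(X), X.shape, type(y), y.shape)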
N = int(0.8 * X.shape[0])
train_X = X[:N,:]
train_y = y[:N]
val_X = X[N:,:]
val_y = y[N:]


# Normalize each feature
mins_X = np.min(train_X, axis = 0)
train_X = train_X - mins_X
val_X = val_X - mins_X

maxs_X = np.max(train_X, axis = 0)
train_X = train_X / maxs_X
val_X = val_X / maxs_X

print(train_X[0,:])

print(np.min(train_X), np.max(train_X), np.min(val_X), np.max(val_X))

print(train_X.shape, val_X.shape)

reg = LinearRegression().fit(train_X, train_y)

# Mean absolute error on the training split, then on the validation split
print(mean_absolute_error(train_y, reg.predict(train_X)))
print(mean_absolute_error(val_y, reg.predict(val_X)))

[0.         0.18       0.07344184 0.         0.31481481 0.57750527
 0.64160659 0.26920314 0.         0.22755741 0.28723404 1.
 0.08967991]
0.0 1.0 -0.21613002146580806 1.0939457202505218
(404, 13) (102, 13)
3.310518415536489
4.730017250961031
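
The normalization above can be wrapped in two small helpers so that the statistics always come from the training split. This is a minimal sketch using only NumPy; the function names `fit_min_max` and `apply_min_max` and the toy arrays are illustrative assumptions, not part of the original notebook. It also shows why validation values can land outside [0, 1], as in the output above.

```python
import numpy as np

def fit_min_max(train_X):
    """Return the per-feature minimum and range, computed from training data only."""
    mins = np.min(train_X, axis=0)
    ranges = np.max(train_X, axis=0) - mins
    # Guard against constant features, which would otherwise divide by zero.
    ranges = np.where(ranges == 0, 1.0, ranges)
    return mins, ranges

def apply_min_max(X, mins, ranges):
    """Scale X with statistics that were fitted elsewhere (the training split)."""
    return (X - mins) / ranges

# Toy data: the validation rows deliberately exceed the training range.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
val = np.array([[0.0, 40.0], [4.0, 15.0]])

mins, ranges = fit_min_max(train)
train_scaled = apply_min_max(train, mins, ranges)
val_scaled = apply_min_max(val, mins, ranges)

print(train_scaled.min(), train_scaled.max())  # training values span exactly [0, 1]
print(val_scaled.min(), val_scaled.max())      # validation values may fall outside
```

Fitting the statistics on the training split alone keeps information about the validation data out of the preprocessing step.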


Train an SVM on two classes from the iris dataset.  Try to split off 20% as a validation dataset.

In [7]:
from sklearn import datasets
from sklearn import svm
import random
from sklearn.metrics import accuracy_score

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:100,:]  # we only take the first two classes.
y = iris.target[:100]

# randomize the data
mapping = list(range(X.shape[0]))  # a list, so random.shuffle can reorder it in place
random.shuffle(mapping)

# Remap data
X = X[mapping,:]
y = y[mapping]

clf = svm.SVC()
clf.fit(X, y)

print("Accuracy on training data", accuracy_score(y, clf.predict(X)))

# Create an 80-20 split and evaluate on the validation data
N = int(0.8 * X.shape[0])
train_X = X[:N,:]
train_y = y[:N]
val_X = X[N:,:]
val_y = y[N:]

clf = svm.SVC()
clf.fit(train_X, train_y)

print("Accuracy on training data", accuracy_score(train_y, clf.predict(train_X)))
print("Accuracy on validation data", accuracy_score(val_y, clf.predict(val_X)))





Accuracy on training data 1.0
Accuracy on training data 1.0
Accuracy on validation data 1.0
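
A shuffled 80-20 split like the one above can also be written with a seeded NumPy permutation, which makes the split reproducible from run to run. This is a sketch under assumptions: `shuffled_split`, its `val_fraction` and `seed` parameters, and the toy arrays are hypothetical names introduced here for illustration.

```python
import numpy as np

def shuffled_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle rows with a seeded permutation, then split into train and validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(X.shape[0])
    n_val = int(round(val_fraction * X.shape[0]))
    n_train = X.shape[0] - n_val
    train_idx, val_idx = order[:n_train], order[n_train:]
    # Index X and y with the same indices so rows and labels stay aligned.
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: row i is [2*i, 2*i + 1] and its label is i % 2.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

split_train_X, split_train_y, split_val_X, split_val_y = shuffled_split(X_demo, y_demo)
print(split_train_X.shape, split_val_X.shape)
```

Shuffling matters here because the first 100 iris rows are sorted by class; an unshuffled 80-20 split would put only one class in the validation set.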


Train an SVM on two digits from the digits dataset.  Evaluate using 10-fold cross-validation.

In [32]:
from sklearn import datasets
from sklearn import svm
import random
from sklearn.metrics import accuracy_score

# import some data to play with
X, y = datasets.load_digits(n_class = 2, return_X_y=True)

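# NOTE: load_digits already returns flattened 8x8 images, so X has shape (360, 64).
# Reshaping to 16*16 = 256 columns packs four images into each row, leaving only
# 90 rows that no longer line up with the 360 labels in y. The shuffle below then
# keeps just the first 90 labels, which likely explains the chance-level accuracies.
X = X.reshape((-1, 16*16))

print(np.sum(y))

# randomize the data
mapping = list(range(X.shape[0]))
random.shuffle(mapping)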

# Remap data
X = X[mapping,:]
y = y[mapping]

clf = svm.SVC()
clf.fit(X, y)

print("Accuracy on training data [full data] ", accuracy_score(y, clf.predict(X)))

# Create an 80-20 split and evaluate on the validation data
N = int(0.8 * X.shape[0])
train_X = X[:N,:]
train_y = y[:N]
val_X = X[N:,:]
val_y = y[N:]

clf = svm.SVC()
clf.fit(train_X, train_y)

print("Accuracy on training data [80 split]", accuracy_score(train_y, clf.predict(train_X)))
print("Accuracy on validation data [20 split]", accuracy_score(val_y, clf.predict(val_X)))

# Do 10-fold cross validation
k = 10

from sklearn.model_selection import KFold 
# https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
kf = KFold(n_splits=k) 
kf.get_n_splits(X)

fold_accuracies = []

for train_index, val_index in kf.split(X):

    train_X = X[train_index,:]
    train_y = y[train_index]
    val_X = X[val_index,:]
    val_y = y[val_index]
    
    clf = svm.SVC(C=1, gamma=0.0001)
    #clf = svm.SVC()
    clf.fit(train_X, train_y)
    
    fold_accuracies.append(accuracy_score(clf.predict(val_X), val_y))

print("Accuracy on validation folds")
print(fold_accuracies)

print("Avg. over all validation folds", sum(fold_accuracies) / k)

182
Accuracy on training data [full data]  1.0
Accuracy on training data [80 split] 1.0
Accuracy on validation data [20 split] 0.5
Accuracy on validation folds
[0.0, 0.5555555555555556, 0.6666666666666666, 0.5555555555555556, 0.5555555555555556, 0.4444444444444444, 0.4444444444444444, 0.3333333333333333, 0.6666666666666666, 0.7777777777777778]
Avg. over all validation folds 0.5
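
The near-chance accuracies above are consistent with a shape problem rather than a weak model. `load_digits` returns 8x8 images already flattened to 64 features, so for two classes `X` has shape (360, 64); reshaping to 256 columns leaves 90 rows for 360 labels, which matches the folds of 9 samples seen above. The sketch below reproduces that arithmetic with a placeholder array of the same shape. The zero-filled arrays and variable names are stand-ins, not the real data.

```python
import numpy as np

# Placeholder with the same shape as load_digits(n_class=2): 360 samples, 64 features.
n_samples, n_features = 360, 64
X_demo = np.zeros((n_samples, n_features))
y_demo = np.zeros(n_samples, dtype=int)

# The reshape used in the cell above: 256 columns per row.
X_bad = X_demo.reshape((-1, 16 * 16))
print(X_bad.shape, y_demo.shape)  # the row count no longer matches the label count

# Each reshaped row is this many consecutive images glued together.
images_per_row = (16 * 16) // n_features
print(images_per_row)

# Keeping one row per sample preserves the alignment with y.
X_ok = X_demo.reshape((n_samples, -1))
print(X_ok.shape)
```

A quick check such as `assert X.shape[0] == y.shape[0]` right after loading would catch this kind of mismatch before any model is trained.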


Perform a grid search over gamma and C for the digit classification task.

In [87]:
import random
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4],
                     'C': [0.1, 0.5, 1, 10, 50, 100, 1000]}]

# import some data to play with
X, y = datasets.load_digits(n_class = 2, return_X_y=True)

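# NOTE: load_digits already returns X with 64 columns (flattened 8x8 images);
# reshaping to 16*16 columns leaves 90 rows that no longer line up with the labels in y.
X = X.reshape((-1, 16*16))

# randomize the data
mapping = list(range(X.shape[0]))
random.shuffle(mapping)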

# Remap data
X = X[mapping,:]
y = y[mapping]

# Create an 80-20 split, then do 10-fold cv with grid search and evaluate on the validation data
N = int(0.8 * X.shape[0])
train_X = X[:N,:]
train_y = y[:N]
val_X = X[N:,:]
val_y = y[N:]

clf = GridSearchCV(SVC(), tuned_parameters, cv=10)
clf.fit(train_X, train_y)

print("Best parameters set found on development set:", clf.best_params_)

means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

# Pair each prediction with its true label for inspection
preds = clf.predict(val_X)
list(zip(preds, val_y))


Best parameters set found on development set: {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}
0.528 (+/-0.115) for {'kernel': 'rbf', 'C': 0.1, 'gamma': 0.01}
0.569 (+/-0.219) for {'kernel': 'rbf', 'C': 0.1, 'gamma': 0.001}
0.583 (+/-0.411) for {'kernel': 'rbf', 'C': 0.1, 'gamma': 0.0001}
0.528 (+/-0.115) for {'kernel': 'rbf', 'C': 0.5, 'gamma': 0.01}
0.569 (+/-0.219) for {'kernel': 'rbf', 'C': 0.5, 'gamma': 0.001}
0.583 (+/-0.388) for {'kernel': 'rbf', 'C': 0.5, 'gamma': 0.0001}
0.556 (+/-0.142) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.01}
0.569 (+/-0.275) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}
0.583 (+/-0.484) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}
0.556 (+/-0.142) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.01}
0.583 (+/-0.276) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}
0.653 (+/-0.473) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}
0.556 (+/-0.142) for {'kernel': 'rbf', 'C': 50, 'gamma': 0.01}
0.583 (+/-0.276) for {'kernel': 'rbf', 'C': 50, 'gamma': 0.001}
0.653 (+/-0.



[(0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 0),
 (0, 1),
 (0, 1),
 (0, 0),
 (1, 0),
 (1, 0),
 (0, 1),
 (0, 0),
 (0, 0),
 (0, 1),
 (0, 1),
 (0, 0),
 (0, 0),
 (0, 0)]
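
For comparison, here is a hedged sketch of the same grid search run on the digits features as `load_digits` returns them (64 columns, no reshape), using `train_test_split` for the shuffled 80-20 split. The reduced `C` grid, `random_state=0`, the stratified split, and the variable names are choices made for this illustration rather than settings taken from the notebook, and no results are claimed here.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# X keeps its original shape (n_samples, 64), so rows and labels stay aligned.
X_digits, y_digits = load_digits(n_class=2, return_X_y=True)

# Shuffled, stratified 80-20 split; the validation part is held out from the search.
dev_X, held_X, dev_y, held_y = train_test_split(
    X_digits, y_digits, test_size=0.2, shuffle=True,
    stratify=y_digits, random_state=0)

# A smaller grid than above, to keep the example quick.
param_grid = {"kernel": ["rbf"], "gamma": [1e-2, 1e-3, 1e-4], "C": [0.1, 1, 10]}

# 10-fold cross-validation over every parameter combination, on the 80% split only.
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(dev_X, dev_y)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best setting:", search.best_score_)
print("Held-out validation accuracy:", search.score(held_X, held_y))
```

Because `GridSearchCV` refits the best setting on the whole development split by default, `search.score` evaluates that refitted model on data the search never saw.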