# Support Vector Machines

The scikit-learn documentation has a nice write-up of support vector machines [here](http://scikit-learn.org/stable/modules/svm.html).  There are three different implementations: [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), which is the main one that we'll use, [NuSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC), which is a slightly different formulation, and [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC), which only supports linear kernels but is faster on large datasets.  

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# this is a new import
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score
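# train_test_split and GridSearchCV now live in sklearn.model_selection
# (the old cross_validation and grid_search modules were removed)
from sklearn.model_selection import train_test_split
from sklearn import datasets
import seaborn as sns
from sklearn.model_selection import GridSearchCV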

We're going to work with the by-now-nauseatingly-familiar iris dataset.  We'll build a binary model to predict species 2 vs species 3, using the first two features only, so that we can visualize it.

In [None]:
iris = datasets.load_iris()

X = iris.data
y = iris.target


X = X[y != 0, :2]
y = y[y != 0]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9)

Let's define a function that will train an SVM with a given kernel and a given value of the `C` and `gamma` parameters.  The function will return the trained model, the test predictions, and the test distances from the hyperplane.

In [None]:
def train_svm(kernel="linear", C=1.0, gamma="auto"):
    # gamma="auto" means 1 / n_features; the linear kernel ignores gamma

    svm = SVC(kernel=kernel, C=C, gamma=gamma)
    svm.fit(X_train, y_train)

    # predict on the test set
    y_preds = svm.predict(X_test)
    # get the distances from the hyperplane, the sign of which
    # is the prediction above
    y_dists = svm.decision_function(X_test)
    
    return (svm, y_preds, y_dists)
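To make the link between `decision_function` and `predict` concrete, here is a minimal sketch on the same two-feature iris subset. The names `X_demo`, `y_demo`, and `clf` are illustrative and not part of the notebook. For a binary `SVC`, a positive decision value should map to `clf.classes_[1]` and a negative one to `clf.classes_[0]`.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Same subset as above: drop species 0, keep the first two features.
iris_demo = datasets.load_iris()
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask, :2]
y_demo = iris_demo.target[mask]

clf = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)
dists = clf.decision_function(X_demo)  # signed value, one per sample
preds = clf.predict(X_demo)

# Rebuild the predictions from the sign of the decision function.
from_sign = np.where(dists > 0, clf.classes_[1], clf.classes_[0])
sign_matches = np.array_equal(preds, from_sign)
print(sign_matches)
```

The magnitude of each value says how far the point is from the boundary in the model's own units, which is why it is useful for ROC curves.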

Let's define a function that will take a trained SVM and plot the 2-dimensional decision function, along with the training and test points.

In [None]:
def plot_svm(svm, X, X_test):

    # plot all of the data points
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)
    # put an extra circle on top of the test points
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', zorder=10)

    # step size of the mesh
    h = 0.01
    # range of the mesh
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    z = svm.decision_function(np.column_stack((xx.ravel(), yy.ravel())))

    z = z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, z > 0, cmap=plt.cm.Paired)
    # this will plot the contour lines of the decision function
    plt.contour(xx, yy, z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
    levels=[-.5, 0, .5])

    plt.show()

First we'll train a linear SVM.

In [None]:
svm_linear, y_preds_linear, y_dists_linear = train_svm("linear", C=1.0)

In [None]:
y_preds_linear

In [None]:
y_dists_linear

We can see how many support vectors there are.  That is, how many points are on or inside of the margin.  This will return the number of support vectors of each class:

In [None]:
svm_linear.n_support_

In [None]:
# indices of the support vectors
svm_linear.support_

In [None]:
# the vectors themselves
svm_linear.support_vectors_[0:5]
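The three attributes above describe the same set of points in different ways. A small sketch, assuming a freshly fitted linear `SVC` on the two-feature iris subset (the `_demo` names are illustrative), checks that they agree with each other:

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris_demo = datasets.load_iris()
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask, :2]
y_demo = iris_demo.target[mask]

clf = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# n_support_ gives per-class counts; support_ gives row indices into X_demo.
n_per_class = clf.n_support_
total = len(clf.support_)

# support_vectors_ should be exactly the training rows picked out by support_.
same_rows = np.allclose(clf.support_vectors_, X_demo[clf.support_])
print(n_per_class, total, same_rows)
```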

In [None]:
svm_linear, y_preds_linear, y_dists_linear = train_svm("linear", C=1.0)
plot_svm(svm_linear, X, X_test)

Let's try changing the `C` parameter, which scikit-learn defines in the opposite sense to ISLR.  In ISLR, C is the budget for how much the margin may be violated, so a larger C allows more violations.  In scikit-learn, `C` is the penalty on violations, so a smaller `C` allows more of them and gives a wider margin.  Here we decrease `C` to 0.01:

In [None]:
# C is the inverse of how ISLR defines it
svm_linear, y_preds_linear, y_dists_linear = train_svm("linear", C=0.01)
plot_svm(svm_linear, X, X_test)
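One way to see the effect of `C` without a plot is to count support vectors. The sketch below is illustrative only (the `_demo` names and the particular `C` values are assumptions, not from the notebook). With a small `C` the margin is wide, so we would expect more points to fall on or inside it.

```python
from sklearn import datasets
from sklearn.svm import SVC

iris_demo = datasets.load_iris()
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask, :2]
y_demo = iris_demo.target[mask]

# Total number of support vectors for a few values of C.
sv_counts = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    sv_counts[C] = int(clf.n_support_.sum())

for C, n in sv_counts.items():
    print("C=%g: %d support vectors" % (C, n))
```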

In [None]:
param_grid = {"C":[1, 10, 100, 1000]}

svm = SVC(kernel="linear")
cv = GridSearchCV(svm, param_grid, cv=5, n_jobs=4, refit=True)
cv.fit(X_train, y_train)

In [None]:
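# grid_scores_ was removed from scikit-learn; cv_results_ holds the same information
pd.DataFrame(cv.cv_results_)[["params", "mean_test_score", "std_test_score"]]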
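After a grid search, the winning setting can be read off the fitted object. The following is a self-contained sketch, assuming the `sklearn.model_selection` version of `GridSearchCV`; the grid values and the `_demo` names are illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris_demo = datasets.load_iris()
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask, :2]
y_demo = iris_demo.target[mask]

c_values = [0.01, 0.1, 1, 10]
search = GridSearchCV(SVC(kernel="linear"), {"C": c_values}, cv=5)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]   # value of C with the best mean CV score
best_score = search.best_score_     # that mean cross-validated accuracy

# One row per candidate value of C.
summary = pd.DataFrame(search.cv_results_)[["param_C", "mean_test_score"]]
print(best_C, best_score)
print(summary)
```

Because `refit=True` is the default, `search.predict` then uses a model refit on all of the data passed to `fit` with the best `C`.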

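Let's fit an SVM with a polynomial kernel.  Note that `train_svm` does not pass a `degree` argument, so this uses scikit-learn's default `degree=3`; the value 2 below sets `gamma`, the scale factor applied to the inner product: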

In [None]:
svm_poly, y_preds_poly, y_dists_poly = train_svm("poly", C=1.0, gamma=2)
plot_svm(svm_poly, X, X_test)

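Let's fit an SVM with a polynomial kernel again, this time with `gamma=3`.  The degree stays at scikit-learn's default of 3, because `train_svm` does not pass a `degree` argument: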

In [None]:
svm_poly, y_preds_poly, y_dists_poly = train_svm("poly", C=1.0, gamma=3)
plot_svm(svm_poly, X, X_test)
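The degree of the polynomial kernel is a separate `SVC` argument from `gamma`. A minimal sketch of setting it explicitly, assuming we call `SVC` directly rather than going through `train_svm` (the `_demo` names are illustrative):

```python
from sklearn import datasets
from sklearn.svm import SVC

iris_demo = datasets.load_iris()
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask, :2]
y_demo = iris_demo.target[mask]

# Fit one model per polynomial degree and record training accuracy.
poly_models = {}
poly_train_acc = {}
for degree in [2, 3]:
    clf = SVC(kernel="poly", degree=degree, gamma="auto", C=1.0)
    clf.fit(X_demo, y_demo)
    poly_models[degree] = clf
    poly_train_acc[degree] = clf.score(X_demo, y_demo)

for degree, acc in poly_train_acc.items():
    print("degree=%d: training accuracy %.3f" % (degree, acc))
```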

Let's fit an SVM with an RBF kernel:

In [None]:
svm_rbf, y_preds_rbf, y_dists_rbf = train_svm("rbf", C=1.0, gamma=3)
plot_svm(svm_rbf, X, X_test)

Increasing the `gamma` parameter of the RBF kernel shrinks the region each training point influences, so the model relies on more and more local points, which increases the variance:

In [None]:
svm_rbf, y_preds_rbf, y_dists_rbf = train_svm("rbf", C=1.0, gamma=30)
plot_svm(svm_rbf, X, X_test)
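The variance increase can also be seen numerically by comparing training and held-out accuracy as `gamma` grows. This is a rough sketch under assumed settings (the split, the `gamma` values, and the `_demo` names are not from the notebook); the exact numbers will depend on the split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris_demo = datasets.load_iris()
mask = iris_demo.target != 0
X_demo = iris_demo.data[mask, :2]
y_demo = iris_demo.target[mask]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

rbf_train_acc = {}
rbf_test_acc = {}
for gamma in [0.1, 3.0, 30.0, 1000.0]:
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    rbf_train_acc[gamma] = clf.score(X_tr, y_tr)  # fit to the training data
    rbf_test_acc[gamma] = clf.score(X_te, y_te)   # generalization

for gamma in rbf_train_acc:
    print("gamma=%g: train %.3f, test %.3f"
          % (gamma, rbf_train_acc[gamma], rbf_test_acc[gamma]))
```

A widening gap between the two columns as `gamma` increases is the usual sign of overfitting.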

## Text Classification

SVMs are quite often used in text classification problems.  Here, we're going to work through an example (a modified version of [this](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html)) using the 20 newsgroups dataset.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

First we'll load the data:

In [None]:
data_train = fetch_20newsgroups(subset='train', categories=None,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=None,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

In [None]:
data_train.keys()

The data consists of postings, each from one of 20 "newsgroup" types.

In [None]:
len(data_train.data)

In [None]:
data_train.target_names

The first training post comes from the "rec.autos" newsgroup:

In [None]:
data_train.target_names[data_train.target[0]]

And has the following contents:

In [None]:
data_train.data[0]

Let's make a binary target variable to predict whether a given post is in the "sci.space" newsgroup:

In [None]:
space_target = (data_train.target==14).astype("int")
space_target_test = (data_test.target==14).astype("int")

One way to turn a blob of text into features or predictors (to "vectorize" it) is to simply count up the number of times each word appears.

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

The features are the number of times each word in the "vocabulary" appears in a given post.  Note that, by far, most words won't show up in most posts, so the matrix of predictors is very "sparse".

In [None]:
len(vectorizer.vocabulary_)

In [None]:
vectorizer.vocabulary_

In [None]:
X_train[0, :]

In [None]:
X_train[0, :].todense()

In [None]:
# column indices of the non-zero entries in the first row of the sparse matrix
non_zeroes = X_train[0, :].nonzero()[1]
non_zeroes
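The same steps are easier to follow on a corpus small enough to read. The sketch below uses two made-up documents (`toy_docs` is purely illustrative) and leaves out `stop_words` so that every token is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["rocket rocket moon", "moon orbit"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per doc

# vocabulary_ maps word -> column index; list the words in column order.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)             # ['moon', 'orbit', 'rocket']
print(toy_counts.toarray())  # [[1 0 2]
                             #  [1 1 0]]
print(toy_counts.nnz)        # 4 stored non-zero entries out of 6 cells
```

Only the non-zero counts are stored, which is what makes a vocabulary of many thousands of words practical.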

Let's define a function that will take a list of indices and print the words from the vocabulary dictionary that correspond to those indices:

In [None]:
def print_words(index_list):
    # convert to a set once so each membership test is fast
    index_set = set(index_list)
    for word, index in vectorizer.vocabulary_.items():
        if index in index_set:
            print(word)

In [None]:
print_words(non_zeroes)
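Looping over the whole vocabulary for every lookup works, but an index-to-word array makes the reverse lookup a single indexing step. A minimal sketch on a toy corpus, where `index_to_word` and `toy_docs` are hypothetical names introduced for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["rocket rocket moon", "moon orbit"]
toy_vectorizer = CountVectorizer().fit(toy_docs)

# Invert the word -> index mapping into an array indexed by column number.
index_to_word = np.empty(len(toy_vectorizer.vocabulary_), dtype=object)
for word, idx in toy_vectorizer.vocabulary_.items():
    index_to_word[idx] = word

# Words present in a new document: non-zero columns of its count row.
row = toy_vectorizer.transform(["moon orbit"])
words_in_row = sorted(index_to_word[row.nonzero()[1]].tolist())
print(words_in_row)
```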

Let's train a linear, penalized SVM:

In [None]:
svm = LinearSVC(penalty='l2', C=1.0)
svm.fit(X_train, space_target)

One way to see what the SVM is doing is to look at the coefficient of each word, sort the coefficients, and take the 25 largest.  This will tell us which words are most associated with "space" in the posts:

In [None]:
top25 = np.argsort(svm.coef_.ravel())[-25:]
top25

In [None]:
svm.coef_.ravel()[top25]

In [None]:
print_words(top25)
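The coefficient ranking is easier to sanity-check on a corpus where the answer is obvious. The sketch below uses six invented documents (three about space, three about cooking); the documents, labels, and variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

space_docs = ["rocket launch orbit", "orbit moon rocket", "moon launch shuttle"]
other_docs = ["bake bread oven", "oven roast chicken", "bread butter knife"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = space, 0 = not space

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(space_docs + other_docs)

toy_svm = LinearSVC(penalty="l2", C=1.0, random_state=0).fit(toy_X, toy_labels)

# Words in column order, then sort columns by coefficient (ascending).
toy_words = np.array(
    sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
)
order = np.argsort(toy_svm.coef_.ravel())
top3_words = toy_words[order[-3:]][::-1]  # largest coefficient first
print(top3_words)
```

Note that `np.argsort` sorts in ascending order, so the largest coefficients are at the end of the array.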

[http://en.wikipedia.org/wiki/Wally_Schirra](http://en.wikipedia.org/wiki/Wally_Schirra)

Let's see how we do on a test set:

In [None]:
preds = svm.predict(X_test)

In [None]:
pd.crosstab(index=space_target_test, columns=preds, rownames=['True'], colnames=['Predicted'])
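The cross-tabulation above is a confusion matrix, and the usual summary metrics follow from its four cells. Since only a small fraction of posts belong to any one newsgroup, accuracy alone can look good even for a weak model, so precision and recall are worth reading alongside it. A small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)  # [[4 1]
           #  [1 2]]

acc = accuracy_score(y_true_demo, y_pred_demo)    # (4 + 2) / 8 = 0.75
prec = precision_score(y_true_demo, y_pred_demo)  # 2 / (2 + 1)
rec = recall_score(y_true_demo, y_pred_demo)      # 2 / (2 + 1)
print(acc, prec, rec)
```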