# Support vector machines and non-linearly separable data

In this notebook, we show how SVMs can be used to find non-linear decision boundaries.

In [None]:
%matplotlib inline
import warnings
warnings.simplefilter('ignore')  # silence library warnings in the notebook output
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

import numpy as np
import matplotlib.pyplot as plt

import mglearn
from sklearn.datasets import make_blobs
from sklearn.datasets import load_breast_cancer

from sklearn.preprocessing import MinMaxScaler

## Non-linearly separable data 

Let's create a 2D data set that cannot be separated with a straight line (that is, the dataset is not linearly separable):

In [None]:
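# Four blobs whose labels are folded into two classes (y % 2): each class then
# consists of two clusters, so no single straight line can separate the classes.
X, y = make_blobs(n_samples=50, centers=4, random_state=0, cluster_std=0.60)
y = y % 2
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)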

We'll first try to train a linear model - logistic regression - on this dataset:

In [None]:
logreg = LogisticRegression().fit(X,y)
print("Accuracy obtained with logistic regression: {:.2f}".format(logreg.score(X,y)))

# Plot the logistic regression decision boundary:
mglearn.discrete_scatter(X[:,0],X[:,1], y)
mglearn.plots.plot_2d_separator(logreg, X, eps=.5, fill=True, alpha=0.3)
plt.title("Decision boundary for logistic regression")

Logistic regression does a poor job here!
We need an algorithm whose decision boundary is not restricted to a straight line.

## Kernelized support vector machines

A *kernelized support vector machine* uses a *kernel* function to consider non-linear combinations of the features while searching for a decision boundary, without having to construct those combinations explicitly. That is exactly what is needed in this case.
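To make the idea of "non-linear combinations of the features" concrete, here is a minimal sketch on a hypothetical XOR-style toy set (the names `X_toy`, `y_toy` and the data itself are assumptions for illustration, not part of the notebook's data). It compares a linear model on the raw features, the same model after adding the product of the two features by hand, and an RBF-kernel SVM on the raw features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical XOR-style toy data: the class is the sign of x0 * x1.
rng = np.random.RandomState(0)
X_toy = rng.uniform(-1, 1, size=(200, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

# A linear model on the raw features cannot follow this class boundary.
acc_raw = LogisticRegression().fit(X_toy, y_toy).score(X_toy, y_toy)

# Adding the product x0 * x1 as a third feature makes the classes
# separable by a linear model in the expanded feature space.
X_expanded = np.hstack([X_toy, X_toy[:, [0]] * X_toy[:, [1]]])
acc_expanded = LogisticRegression().fit(X_expanded, y_toy).score(X_expanded, y_toy)

# A kernelized SVM is trained on the raw features; the kernel supplies
# the non-linearity without an explicit feature expansion.
acc_kernel = SVC(kernel="rbf").fit(X_toy, y_toy).score(X_toy, y_toy)

print("Linear model, raw features:      {:.2f}".format(acc_raw))
print("Linear model, with x0*x1 added:  {:.2f}".format(acc_expanded))
print("RBF-kernel SVM, raw features:    {:.2f}".format(acc_kernel))
```

The hand-made product feature only works because we know the structure of this toy problem in advance; a kernel avoids having to guess which combinations are useful.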

In [None]:
svm = SVC()
svm.fit(X,y)
print("Accuracy obtained with a kernelized SVM: {:.2f}".format(svm.score(X,y)))

In [None]:
mglearn.discrete_scatter(X[:,0],X[:,1], y)
mglearn.plots.plot_2d_separator(svm, X, eps=.5, fill=True, alpha=0.15)

### Hyperparameters of the kernelized support vector machines

Three of the most important hyperparameters of the SVM are:

- kernel: common choices are "linear" (only allows linear decision boundaries), "poly" (considers all polynomial interactions up to a given degree, 3 by default) and "rbf". The last one is the "Radial Basis Function" kernel, which is very flexible and is the default.
- C: regularization parameter, which determines how strongly each misclassified or margin-violating datapoint is penalized. A larger C means weaker regularization and a more complex decision boundary (default: 1).
- gamma (used by the "rbf", "poly" and "sigmoid" kernels): controls the complexity of the decision boundary; the smaller gamma is, the larger the "area of influence" of each support vector. Therefore, a small value gives a simpler decision boundary. The default depends on the scikit-learn version: 1/n_features in older releases, and "scale", i.e. 1/(n_features * X.var()), since version 0.22.
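The role of gamma is easiest to see in the RBF kernel itself, which scores the similarity of two points as exp(-gamma * ||x - x'||^2). The sketch below computes that value by hand for two made-up points (`a` and `b` are illustrative assumptions) and checks it against `sklearn.metrics.pairwise.rbf_kernel`:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative points with squared Euclidean distance 2.
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])

for gamma in [0.1, 1, 10]:
    # RBF kernel by hand: exp(-gamma * squared distance)
    manual = np.exp(-gamma * np.sum((a - b) ** 2))
    # The same quantity computed by scikit-learn
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    print("gamma={:<4} manual={:.6f} sklearn={:.6f}".format(gamma, manual, library))
```

As gamma grows, the similarity between the two points drops towards zero, so each support vector only influences points very close to it.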

Below, we illustrate the effect of the C and the gamma parameter on the dataset above.

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15,10))

for ax, C in zip(axes, [0.1, 1, 1000]):
    for a, gamma in zip(ax,[0.1, 1, 10]):
        svm = SVC(C=C, gamma=gamma).fit(X,y)
        mglearn.discrete_scatter(X[:,0],X[:,1], y, ax=a)
        mglearn.plots.plot_2d_separator(svm, X, eps=.5, fill=True, alpha=0.3, ax=a)
        a.set_title("C={:.3f}, gamma={:.3f}".format(C,gamma))
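The plots can be complemented with numbers. As a rough sketch, the loop below fits the same grid of C and gamma values on a small toy set generated here (`X_toy` and `y_toy` are assumptions, not necessarily identical to the data above) and reports the number of support vectors and the training accuracy for each combination:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative toy data: four blobs folded into two classes.
X_toy, y_toy = make_blobs(n_samples=50, centers=4, random_state=0, cluster_std=0.60)
y_toy = y_toy % 2

results = []
for C in [0.1, 1, 1000]:
    for gamma in [0.1, 1, 10]:
        model = SVC(C=C, gamma=gamma).fit(X_toy, y_toy)
        n_sv = int(model.n_support_.sum())   # support vectors over both classes
        acc = model.score(X_toy, y_toy)      # accuracy on the training data
        results.append((C, gamma, n_sv, acc))
        print("C={:<6} gamma={:<4} support vectors={:<3} training accuracy={:.2f}".format(
            C, gamma, n_sv, acc))
```

Keep in mind that training accuracy alone says nothing about overfitting; a held-out set or cross-validation would be needed to choose between these settings.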

## SVM on cancer dataset

Let's apply a kernelized SVM to the breast cancer dataset that ships with scikit-learn:

In [None]:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                   stratify=cancer.target, 
                                                   random_state=42)

In [None]:
svc = SVC()
svc.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
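Before interpreting these scores, it can help to look at how differently the features are scaled. The following sketch (variable names such as `ranges` are illustrative) compares the spread of the narrowest and the widest feature in the dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Range (max - min) of every feature over the full dataset.
ranges = cancer.data.max(axis=0) - cancer.data.min(axis=0)
order = np.argsort(ranges)
smallest, largest = order[0], order[-1]

print("Narrowest feature: {} (range {:.4f})".format(
    cancer.feature_names[smallest], ranges[smallest]))
print("Widest feature:    {} (range {:.1f})".format(
    cancer.feature_names[largest], ranges[largest]))
print("Ratio of ranges:   {:.0f}".format(ranges[largest] / ranges[smallest]))
```

Because the RBF kernel is based on distances between samples, features with a large numeric range can dominate those distances.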

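Scaling is extremely important in this case! The features of this dataset are measured on very different scales, and kernelized SVMs are sensitive to that. We can rescale the data so that every feature ranges from 0 to 1 on the training set, using the MinMaxScaler class from sklearn: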

In [None]:
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test_scaled, y_test)))
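The scale-then-fit steps can also be bundled into a single estimator. This is a minimal sketch using `make_pipeline` from scikit-learn, as one possible alternative to scaling by hand (the name `pipe` is an assumption for illustration); the pipeline fits the scaler on the training data only and reuses it when scoring the test data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Scaler and classifier chained together: fit() learns the scaling from
# X_train, and score() applies that same scaling before predicting.
pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(X_train, y_train)

train_acc = pipe.score(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print("Accuracy on training set: {:.2f}".format(train_acc))
print("Accuracy on test set: {:.2f}".format(test_acc))
```

A pipeline like this is also convenient for cross-validation, since the scaler is then refit on each training fold and never sees the held-out fold.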