# Setup

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

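# Scikit-Learn ≥0.20 is required
import re
import sklearn
# Compare the numeric (major, minor) parts: a plain string comparison
# would misorder versions such as "0.9" and "0.20"
assert tuple(map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())) >= (0, 20)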

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "svm"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Support Vector Machines

"A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning
model, capable of performing linear or nonlinear classification, regression, and even
outlier detection. It is one of the most popular models in Machine Learning, and anyone
interested in Machine Learning should have it in their toolbox. SVMs are particularly
well suited for classification of complex but small- or medium-sized datasets."

# Linear SVM Classification

"The two classes can clearly be separated easily with a straight line (they are linearly separable).  
The left plot shows the decision boundaries of three possible linear classifiers. The  
model whose decision boundary is represented by the dashed line is so bad that it  
does not even separate the classes properly. The other two models work perfectly on  
this training set, but their decision boundaries come so close to the instances that  
these models will probably not perform as well on new instances. In contrast, the  
solid line in the plot on the right represents the decision boundary of an SVM classifier;  
this line not only separates the two classes but also stays as far away from the  
closest training instances as possible. You can think of an SVM classifier as fitting the  
widest possible street (represented by the parallel dashed lines) between the classes.  
This is called large margin classification."

![title](images/svm_1.png)

"SVMs are sensitive to the feature scales, as you can see in  
Figure 5-2: on the left plot, the vertical scale is much larger than the  
horizontal scale, so the widest possible street is close to horizontal.  
After feature scaling (e.g., using Scikit-Learn’s StandardScaler),  
the decision boundary looks much better (on the right plot)."

![title](images/svm_2.png)
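
The sensitivity to feature scales can be checked numerically. The following is a minimal sketch, not the code behind the figure: the four-point dataset `Xs`, `ys` is a hypothetical example in which the second feature spans a much larger range than the first, and `SVC(kernel="linear", C=100)` is used here only because it exposes the learned weights through `coef_`. To make the two models comparable, the weights of the model trained on raw data are re-expressed in standardized coordinates.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: feature 1 spans a much larger range than feature 0
Xs = np.array([[1, 50], [5, 20], [3, 80], [5, 60]], dtype=np.float64)
ys = np.array([0, 0, 1, 1])

# Linear SVM trained directly on the unscaled features
raw_clf = SVC(kernel="linear", C=100).fit(Xs, ys)

# Same model trained on standardized features
scaler = StandardScaler()
Xs_scaled = scaler.fit_transform(Xs)
scaled_clf = SVC(kernel="linear", C=100).fit(Xs_scaled, ys)

# Since x = mean + scale * z, the raw weights w correspond to w * scale
# in standardized coordinates z (the constant shift goes into the bias)
w_raw_in_std_units = raw_clf.coef_[0] * scaler.scale_
w_scaled = scaled_clf.coef_[0]

def unit(w):
    """Return the direction of w as a unit vector."""
    return w / np.linalg.norm(w)

print("Unscaled model, weight direction:", unit(w_raw_in_std_units))
print("Scaled model, weight direction:  ", unit(w_scaled))
```

On this toy data the unscaled model puts almost all of its weight on the large-scale feature, so its boundary is nearly horizontal, while the scaled model uses both features.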

## Soft Margin Classification

"If we strictly impose that all instances be off the street and on the right side, this is  
called hard margin classification. There are two main issues with hard margin classification.  
First, it only works if the data is linearly separable, and second it is quite sensitive  
to outliers."

![title](images/svm_3.png)
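
The outlier sensitivity of hard margin classification can be illustrated with a short sketch. Two assumptions are made here: Scikit-Learn has no explicit hard margin option, so a very large `C` (`1e9`) is used as an approximation, and the outlier at `[3.2, 0.8]` labeled as Iris setosa is a hypothetical instance added for illustration. The helper `margin_width` is also illustrative; it returns the street width `2 / ||w||`.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
petal = iris["data"][:, (2, 3)]  # petal length, petal width
target = iris["target"]

# Keep only setosa (0) and versicolor (1), which are linearly separable
mask = (target == 0) | (target == 1)
X_sep, y_sep = petal[mask], target[mask]

def margin_width(X_train, y_train):
    """Fit an (approximately) hard margin linear SVM and return the street width."""
    clf = SVC(kernel="linear", C=1e9).fit(X_train, y_train)
    return 2.0 / np.linalg.norm(clf.coef_[0])

# Hypothetical outlier: a "setosa" instance sitting close to the versicolor cluster
X_out = np.vstack([X_sep, [[3.2, 0.8]]])
y_out = np.append(y_sep, 0)

margin_clean = margin_width(X_sep, y_sep)
margin_outlier = margin_width(X_out, y_out)
print("Margin without outlier:", margin_clean)
print("Margin with outlier:   ", margin_outlier)
```

A single extra instance is enough to shrink the street considerably, which is why such a model is unlikely to generalize well.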

"To avoid these issues it is preferable to use a more flexible model. The objective is to  
find a good balance between keeping the street as large as possible and limiting the  
margin violations (i.e., instances that end up in the middle of the street or even on the  
wrong side). This is called soft margin classification."

"In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparameter:  
a smaller C value leads to a wider street but more margin violations. Figure 5-4  
shows the decision boundaries and margins of two soft margin SVM classifiers on a  
nonlinearly separable dataset. On the left, using a low C value the margin is quite   
large, but many instances end up on the street. On the right, using a high C value the  
classifier makes fewer margin violations but ends up with a smaller margin. However,  
it seems likely that the first classifier will generalize better: in fact even on this training  
set it makes fewer prediction errors, since most of the margin violations are  
actually on the correct side of the decision boundary."

![title](images/svm_4.png)

TIP: "If your SVM model is overfitting, you can try regularizing it by  
reducing C."
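
The effect of `C` can also be measured directly. This is a minimal sketch under stated assumptions: it uses `SVC(kernel="linear")` rather than `LinearSVC` because it solves the soft margin problem to a tight tolerance on this small dataset, the helper `fit_and_measure` is hypothetical, and a margin violation is counted as any instance whose signed score `t * f(x)` is below 1 (with a small numerical tolerance).

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_petal = iris["data"][:, (2, 3)]            # petal length, petal width
t = np.where(iris["target"] == 2, 1, -1)     # +1 for Iris virginica, -1 otherwise

def fit_and_measure(C):
    """Train a scaled linear SVM and return (street width, number of margin violations)."""
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svc", SVC(kernel="linear", C=C)),
    ])
    model.fit(X_petal, t)
    w = model.named_steps["svc"].coef_[0]
    width = 2.0 / np.linalg.norm(w)          # width measured in scaled feature space
    signed_scores = t * model.decision_function(X_petal)
    n_violations = int(np.sum(signed_scores < 1 - 1e-3))
    return width, n_violations

results = {C: fit_and_measure(C) for C in (1, 100)}
for C, (width, n_violations) in results.items():
    print(f"C={C}: street width={width:.3f}, margin violations={n_violations}")
```

Lowering `C` widens the street, which is the regularizing effect the tip refers to.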


"The following Scikit-Learn code loads the iris dataset, scales the features, and then  
trains a linear SVM model (using the LinearSVC class with C = 1 and the hinge loss  
function, described shortly) to detect Iris-Virginica flowers. The resulting model is  
represented on the left of Figure 5-4."

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
    ])

svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=1, loss='hinge', random_state=42))])

In [2]:
svm_clf.predict([[5.5, 1.7]])

array([1.])

NOTE: "Unlike Logistic Regression classifiers, SVM classifiers do not output  
probabilities for each class."
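
Instead of probabilities, a linear SVM exposes a signed score through `decision_function`: its sign gives the predicted class and its magnitude grows with the distance from the decision boundary. The sketch below refits the same pipeline as above so that it is self-contained; the variable names `sample`, `score` and `pred` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
    ])
svm_clf.fit(X, y)

sample = [[5.5, 1.7]]
score = svm_clf.decision_function(sample)  # signed score, positive means class 1
pred = svm_clf.predict(sample)
print("decision_function:", score, "predict:", pred)

# LinearSVC does not provide a predict_proba method
has_proba = hasattr(svm_clf.named_steps["linear_svc"], "predict_proba")
print("Has predict_proba:", has_proba)
```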

"Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it  
is much slower, especially with large training sets, so it is not recommended. Another  
option is to use the SGDClassifier class, with SGDClassifier(loss="hinge",  
alpha=1/(m*C)), where m is the number of training instances. This applies regular Stochastic Gradient Descent (see Chapter 4) to  
train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it  
can be useful to handle huge datasets that do not fit in memory (out-of-core training),  
or to handle online classification tasks."
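
The three options can be compared side by side. This is a rough sketch, not a benchmark: `m` is taken to be the number of training instances, the `max_iter`, `tol` and `random_state` values are assumptions chosen for reproducibility, and the `models` dictionary and `accuracies` name are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

X_scaled = StandardScaler().fit_transform(X)
C = 1
m = len(X_scaled)  # number of training instances

models = {
    "LinearSVC": LinearSVC(C=C, loss="hinge", random_state=42),
    "SVC(kernel='linear')": SVC(kernel="linear", C=C),
    # alpha plays the role of 1 / (m * C) in the SGD formulation
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=1 / (m * C),
                                   max_iter=1000, tol=1e-3, random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    accuracies[name] = model.score(X_scaled, y)  # training accuracy
    print(f"{name}: training accuracy = {accuracies[name]:.3f}")
```

On a dataset this small the speed differences are not visible; the point is only that the three classes train comparable linear classifiers.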

TIP: "The LinearSVC class regularizes the bias term, so you should center  
the training set first by subtracting its mean. This is automatic if  
you scale the data using the StandardScaler. Moreover, make sure  
you set the loss hyperparameter to "hinge", as it is not the default  
value. Finally, for better performance you should set the dual  
hyperparameter to False, unless there are more features than  
training instances (we will discuss duality later in the chapter)."
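
The sketch below checks two points from this tip. First, `StandardScaler` centers each feature, so the means are approximately zero after scaling. Second, to my knowledge current Scikit-Learn releases reject the combination `loss="hinge"` with `dual=False` for the default L2 penalty, so the two recommendations cannot be applied at the same time; the sketch treats this as an assumption and tests it with a `try`/`except`, then falls back to `loss="squared_hinge"` for the primal problem. The names `hinge_primal_ok` and `primal_clf` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_petal = iris["data"][:, (2, 3)]  # petal length, petal width
y_virginica = (iris["target"] == 2).astype(np.float64)

# StandardScaler subtracts the mean, so the training set is centered
X_scaled = StandardScaler().fit_transform(X_petal)
feature_means = X_scaled.mean(axis=0)
print("Feature means after scaling:", feature_means)

# Assumption under test: hinge loss is only available in the dual formulation
try:
    LinearSVC(C=1, loss="hinge", dual=False).fit(X_scaled, y_virginica)
    hinge_primal_ok = True
except ValueError as err:
    hinge_primal_ok = False
    print("loss='hinge' with dual=False was rejected:", err)

# The primal formulation works with the squared hinge loss
primal_clf = LinearSVC(C=1, loss="squared_hinge", dual=False, random_state=42)
primal_clf.fit(X_scaled, y_virginica)
print("Primal squared-hinge training accuracy:", primal_clf.score(X_scaled, y_virginica))
```

In practice this means choosing between the hinge loss (dual formulation) and `dual=False` (squared hinge loss), depending on which matters more for the task.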