<a href="https://colab.research.google.com/github/easypanda/Handson-ML2/blob/master/Support_Vector_Machines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TO BE FINISHED

# Support Vector Machines

A Support Vector Machine (SVM) is a powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. SVMs are particularly well suited for classification of complex small- or medium-sized datasets.

## Linear SVM Classification

You can think of an SVM classifier as fitting the widest possible street (represented by two parallel dashed lines) between the classes. This is called **large margin classification**. Notice that adding more training instances "off the street" will not affect the decision boundary at all: it is fully determined by the instances located on the edge of the street. These instances are called the **support vectors**.

/!\ : SVMs are sensitive to feature scales, so we should scale the features (for example with StandardScaler) before training.
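As a minimal sketch of what feature scaling does, the snippet below standardizes a tiny made-up dataset (the values in `X_demo` are hypothetical, chosen only so that the two columns have very different scales).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is much larger than the first
X_demo = np.array([[1.0, 50.0],
                   [5.0, 20.0],
                   [3.0, 80.0],
                   [5.0, 60.0]])

scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# After scaling, each column has (approximately) zero mean and unit variance
print(X_demo_scaled.mean(axis=0))
print(X_demo_scaled.std(axis=0))
```

Without this step, the feature with the largest scale would dominate the distance computations, and the widest street would mostly ignore the other feature.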

## Soft Margin Classification

If we strictly impose that all instances must be off the street and on the right side, this is called **hard margin classification**.
This has two main issues:
* It only works if the data is linearly separable.
* It is sensitive to outliers.

To avoid these issues, we can use a more flexible model that finds a good balance between keeping the street as wide as possible and limiting the *margin violations* (instances that end up in the middle of the street or even on the wrong side).
This is called **Soft Margin Classification**.

When creating an SVM model with Scikit-Learn, we can specify a number of hyperparameters, including C. If C is set to a low value, we get a wider margin but more margin violations, whereas a high C gives a narrower margin with fewer margin violations. Sometimes it is more suitable to accept more margin violations, because the model may then generalize better.

/!\ If the model is overfitting, we can try regularizing by reducing C.
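To make the effect of C concrete, here is a rough sketch that fits a linear SVM with a low and a high C and compares the resulting street widths. It uses SVC(kernel="linear") rather than LinearSVC, purely as a convenience for inspecting the support vectors; the two C values and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]                   # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)  # Iris virginica or not

margin_widths = {}
for C in (0.01, 100):
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="linear", C=C)),
    ])
    model.fit(X_iris, y_iris)
    svc = model.named_steps["svm_clf"]
    # The street is bounded by decision_function = -1 and +1, so its width is 2 / ||w||
    margin_widths[C] = 2 / np.linalg.norm(svc.coef_)
    print(f"C={C}: margin width={margin_widths[C]:.2f}, "
          f"support vectors={len(svc.support_)}")
```

The low-C model should report the wider street, which is the regularizing effect described above.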

In [0]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [0]:
iris = datasets.load_iris()
X = iris["data"][:,(2,3)] #Petal length and petal width columns
y = (iris["target"] == 2).astype(np.float64) #Just Iris virginica

In [0]:
svm_clf = Pipeline([
                    ("scaler",StandardScaler()),
                    ("linear_svc",LinearSVC(C=1,loss="hinge")),
])

In [4]:
svm_clf.fit(X,y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linear_svc',
                 LinearSVC(C=1, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

In [5]:
svm_clf.predict([[5.5,1.7]])

array([1.])

**/!\** Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
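What we can get instead is a signed score from `decision_function`, whose sign gives the predicted class. The sketch below rebuilds the same pipeline as above (with `random_state=42` added as an assumption, only for reproducibility) and checks that LinearSVC has no `predict_proba` method.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]                   # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)  # Iris virginica or not

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])
clf.fit(X_iris, y_iris)

# Signed score: positive means class 1, negative means class 0
scores = clf.decision_function([[5.5, 1.7]])
print(scores)

# LinearSVC exposes no probability estimates
has_proba = hasattr(clf.named_steps["linear_svc"], "predict_proba")
print(has_proba)
```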

We could also use the SVC class with a linear kernel (SVC(kernel="linear",C=1)) or the SGDClassifier class (SGDClassifier(loss="hinge",alpha=1/(m*C)), where m is the number of training instances). SGDClassifier does not converge as fast as the LinearSVC class, but it can be useful for online classification tasks or for huge datasets that do not fit in memory (out-of-core training).

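With LinearSVC, the loss hyperparameter should be set to "hinge", as it is not the default value. For better performance, the dual hyperparameter can be set to False when there are more training instances than features; note that Scikit-Learn only accepts dual=False with the default "squared_hinge" loss, not with loss="hinge".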
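The two alternatives mentioned above can be sketched as follows. The variable names and `random_state=42` are assumptions added for illustration; `m` is simply the number of training instances.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]                   # petal length, petal width
y_iris = (iris["target"] == 2).astype(np.float64)  # Iris virginica or not

scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)

m = len(X_iris_scaled)  # number of training instances
C = 1
svc_clf = SVC(kernel="linear", C=C)
sgd_clf = SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42)

new_flower = scaler.transform([[5.5, 1.7]])
predictions = {}
for name, clf in [("SVC", svc_clf), ("SGDClassifier", sgd_clf)]:
    clf.fit(X_iris_scaled, y_iris)
    predictions[name] = clf.predict(new_flower)[0]
    print(name, predictions[name])
```

Because SGDClassifier is trained with stochastic gradient descent, its decision boundary may differ slightly from the one found by SVC or LinearSVC.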

## Nonlinear SVM Classification

One approach to handling nonlinear datasets is to add more features, such as polynomial features.

In [0]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

In [0]:
X,y = make_moons(n_samples=100,noise=0.15)
polynomial_svm_clf = Pipeline([
                               ('poly_features',PolynomialFeatures(degree=3)),
                               ('scaler',StandardScaler()),
                               ('svm_clf',LinearSVC(C=10,loss="hinge"))
])

In [8]:
polynomial_svm_clf.fit(X,y)

Pipeline(memory=None,
         steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=True,
                                    interaction_only=False, order='C')),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 LinearSVC(C=10, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

## Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs). That said, at a low polynomial degree this method cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
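To get a feeling for how fast the number of features grows, this small sketch counts the columns produced by PolynomialFeatures for a few hypothetical combinations of input features and degree. The count (bias column included) matches the binomial coefficient C(n + d, d).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

feature_counts = {}
for n_features, degree in [(2, 3), (2, 10), (10, 3), (10, 10)]:
    X_dummy = np.zeros((1, n_features))  # only the shape matters here
    poly = PolynomialFeatures(degree=degree).fit(X_dummy)
    feature_counts[(n_features, degree)] = poly.n_output_features_
    print(f"{n_features} features, degree {degree}: "
          f"{poly.n_output_features_} polynomial features")
```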

But with SVMs, you can apply a technique called the **kernel trick**, which makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them.
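The idea behind the kernel trick can be checked by hand on a tiny example. For 2D vectors, a second-degree polynomial mapping `phi` (a helper written here only for illustration) satisfies phi(a)·phi(b) = (a·b)², so the dot product in the transformed space can be computed from the original vectors alone.

```python
import numpy as np

def phi(x):
    """Explicit second-degree polynomial mapping for a 2D vector."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

# Two arbitrary example vectors
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)   # dot product after transforming both vectors
kernel = (a @ b) ** 2        # same value, computed without transforming

print(explicit, kernel)
```

The SVM only ever needs such dot products, which is why it never has to build the high-dimensional features explicitly.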

In [0]:
from sklearn.svm import SVC

In [0]:
poly_kernel_svm_clf = Pipeline([
                                ("scaler",StandardScaler()),
                                ("svm_clf",SVC(kernel="poly",degree=3,coef0=1,C=5))
])

In [11]:
poly_kernel_svm_clf.fit(X,y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=5, break_ties=False, cache_size=200, class_weight=None,
                     coef0=1, decision_function_shape='ovr', degree=3,
                     gamma='scale', kernel='poly', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

This SVM classifier uses a third-degree polynomial kernel.
If the model is overfitting, we can decrease the polynomial degree; if it is underfitting, we can increase it.
The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
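For reference, Scikit-Learn's polynomial kernel is computed as (gamma · aᵀb + coef0)^degree. The sketch below checks this with `polynomial_kernel` on a small made-up matrix; `gamma=1.0` is an assumption for readability (the SVC above uses gamma='scale').

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Hypothetical instances, one per row
A = np.array([[1.0, 2.0],
              [0.5, -1.0]])
gamma = 1.0

# coef0=1: expanding (a.b + 1)^3 also yields lower-degree terms
K_with_offset = polynomial_kernel(A, A, degree=3, gamma=gamma, coef0=1)
manual_with_offset = (gamma * (A @ A.T) + 1) ** 3

# coef0=0: only the pure third-degree term (a.b)^3 remains
K_no_offset = polynomial_kernel(A, A, degree=3, gamma=gamma, coef0=0)
manual_no_offset = (gamma * (A @ A.T)) ** 3

print(K_with_offset)
print(K_no_offset)
```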

## Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a **similarity function** which measures how much each instance resembles a particular **landmark**.

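One similarity function is the Gaussian Radial Basis Function (RBF): $\phi_\gamma(\mathbf{x}, \ell) = \exp(-\gamma \lVert \mathbf{x} - \ell \rVert^2)$, where $\ell$ is the landmark.

This is a bell-shaped function varying from 0 (very far from the landmark) to 1 (at the landmark).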
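A minimal sketch of this similarity function, assuming the usual form exp(-gamma · ||x - landmark||²). The instance, the two landmarks, and gamma=0.3 are hypothetical values picked for illustration, and `gaussian_rbf` is a helper defined here, not a Scikit-Learn function.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity between x and a landmark: 1 at the landmark, near 0 far away."""
    x = np.atleast_1d(x)
    landmark = np.atleast_1d(landmark)
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

x = -1.0
gamma = 0.3
sim_near = gaussian_rbf(x, -2.0, gamma)   # landmark at distance 1
sim_far = gaussian_rbf(x, 1.0, gamma)     # landmark at distance 2
sim_self = gaussian_rbf(x, x, gamma)      # instance located on the landmark

print(round(sim_near, 2), round(sim_far, 2), sim_self)
```

The instance x = -1 would then be replaced by the two new features `sim_near` and `sim_far`.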

To select the landmarks, the simplest approach is to create a landmark at the location of each and every instance in the dataset. Doing that creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features will be transformed into a training set with m instances and m features (assuming that the original features have been dropped).
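Using every instance as a landmark can be sketched with `rbf_kernel` from Scikit-Learn, which computes the Gaussian RBF between all pairs of rows. The dataset parameters and `gamma=0.3` below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

# One landmark per training instance: row i holds the similarity of
# instance i to each of the m landmarks
X_similarity = rbf_kernel(X_moons, X_moons, gamma=0.3)

print(X_moons.shape)       # m instances, n original features
print(X_similarity.shape)  # m instances, m similarity features
```

With m = 100 this is harmless, but the transformed training set grows quadratically with the number of instances.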

## Gaussian RBF Kernel

Once again, the kernel trick does its SVM magic, making it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.

In [0]:
rbf_kernel_svm_clf = Pipeline([
                               ('scaler',StandardScaler()),
                               ("svm_clf",SVC(kernel="rbf",gamma=5,C=0.001))
])

In [13]:
rbf_kernel_svm_clf.fit(X,y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=0.001, break_ties=False, cache_size=200,
                     class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree=3, gamma=5,
                     kernel='rbf', max_iter=-1, probability=False,
                     random_state=None, shrinking=True, tol=0.001,
                     verbose=False))],
         verbose=False)

Increasing the hyperparameter gamma makes the bell-shaped curve narrower. As a result, each instance's range of influence is smaller, and the decision boundary ends up more irregular, wiggling around individual instances.

Conversely, a small gamma value makes the bell-shaped curve wider: instances have a larger range of influence, and the decision boundary ends up smoother.

So gamma acts like a regularization hyperparameter: if the model is overfitting, we should reduce it; if it is underfitting, we should increase it (similar to the C hyperparameter).
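As a rough illustration of how gamma and C interact, the sketch below trains the same pipeline with a few (gamma, C) pairs and prints the training accuracy. The pairs and the `random_state` are assumptions chosen for illustration, and training accuracy only shows how tightly the model fits the training set, not how well it generalizes.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

train_accuracy = {}
for gamma, C in [(0.1, 0.001), (0.1, 1000), (5, 0.001), (5, 1000)]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
    ])
    model.fit(X_moons, y_moons)
    # Accuracy on the training set itself: high values mean a tight fit
    train_accuracy[(gamma, C)] = model.score(X_moons, y_moons)
    print(f"gamma={gamma}, C={C}: training accuracy={train_accuracy[(gamma, C)]:.2f}")
```

To choose between such settings in practice, we would compare them on held-out data (for example with cross-validation) rather than on the training set.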

Other kernels exist but are used much more rarely, for example kernels specialized for specific data structures, such as string kernels for text documents or DNA sequences.

As a rule of thumb, we should always try the linear kernel first (LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or has plenty of features. If the training set is not too large, we should also try the Gaussian RBF kernel.

## Computational Complexity