# Chapter 5: Support Vector Machines

### Load the Iris Dataset, Scale the Features, Train a Linear SVM Model

In [3]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)] # petal length, petal width
y = (iris['target'] == 2).astype(np.float64) # Iris virginica

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
])

svm_clf.fit(X, y)

svm_clf.predict([[5.5, 1.7]])

# Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class


array([1.])
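
Since the classifier does not output class probabilities, the closest thing to a confidence value is the signed score returned by `decision_function`. The sketch below refits the same pipeline and prints that score; the second sample `[1.4, 0.2]` and the `random_state` are illustrative assumptions, not part of the original cell.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris['data'][:, (2, 3)]  # petal length, petal width
y_iris = (iris['target'] == 2).astype(np.float64)  # Iris virginica

clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge', random_state=42)),
])
clf.fit(X_iris, y_iris)

samples = [[5.5, 1.7], [1.4, 0.2]]  # second sample is an illustrative addition
# Signed score w.x + b: positive means the positive class (Iris virginica) is predicted
scores = clf.decision_function(samples)
labels = clf.predict(samples)
print(scores, labels)

# LinearSVC exposes no predict_proba method
print(hasattr(clf.named_steps['linear_svc'], 'predict_proba'))
```

The sign of each score matches the predicted label, and its magnitude grows with the distance from the decision boundary.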

### Nonlinear SVM Classification
- moons dataset: a toy dataset for binary classification in which the data points are shaped as two interleaving half circles

In [4]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15)
polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge'))
])

polynomial_svm_clf.fit(X, y)

### Polynomial Kernel
- At a low polynomial degree, adding polynomial features cannot deal with very complex datasets
- With a high polynomial degree, it creates a huge number of features, making the model too slow
- If your model is overfitting you can lower the polynomial degree
- The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them
- The kernel trick is implemented by the SVC class
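
To get a feel for how quickly explicit polynomial features multiply, the sketch below asks `PolynomialFeatures` how many output columns it would produce. The helper name `n_poly_features` and the chosen sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Number of columns PolynomialFeatures produces (bias column included)."""
    poly = PolynomialFeatures(degree=degree)
    poly.fit(np.zeros((1, n_features)))  # only the input width matters here
    return poly.n_output_features_

for n_features, degree in [(2, 3), (10, 3), (10, 5), (20, 5)]:
    print(n_features, degree, n_poly_features(n_features, degree))
```

The count grows combinatorially with both the number of input features and the degree, which is the cost the kernel trick avoids.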

In [7]:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
    # The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials
    # A common approach to finding the right hyperparameter values is to use grid search - it is often faster to do a coarse grid search first, then a finer grid search around the best values found
])
poly_kernel_svm_clf.fit(X, y)
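
The grid search mentioned in the comments can be sketched with `GridSearchCV`. The grid values, the 3-fold cross-validation, and the `random_state` below are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly')),
])

# Coarse grid; pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'svm_clf__degree': [2, 3, 4],
    'svm_clf__coef0': [0, 1],
    'svm_clf__C': [1, 5, 10],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_moons, y_moons)
print(grid_search.best_params_, grid_search.best_score_)
```

A second, finer grid can then be centered on `grid_search.best_params_`.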

### Similarity Features
- Another technique to tackle nonlinear problems is to add features computed using a similarity function
- A similarity function measures how much each instance resembles a particular landmark
- How to choose a landmark: the simplest approach is to create a landmark at the location of each and every instance in the dataset
- If your training set is very large, you end up with an equally large number of features
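
As a minimal sketch of the idea, the code below assumes the Gaussian RBF, exp(-gamma * ||x - landmark||^2), as the similarity function and places one landmark at every instance. The tiny 1D dataset, the `gamma` value, and the helper name `gaussian_rbf` are hypothetical.

```python
import numpy as np

def gaussian_rbf(X, landmark, gamma):
    """Similarity of each row of X to one landmark: exp(-gamma * ||x - landmark||^2)."""
    return np.exp(-gamma * np.sum((X - landmark) ** 2, axis=1))

X_1d = np.array([[-4.0], [-1.0], [0.0], [3.0]])  # hypothetical 1D training set
gamma = 0.3  # illustrative value

# One landmark per training instance: m instances produce m similarity features
X_sim = np.column_stack([gaussian_rbf(X_1d, landmark, gamma) for landmark in X_1d])
print(X_sim.shape)
print(np.round(X_sim, 3))
```

Each instance has similarity 1 to its own landmark, and the transformed set has as many features as there are training instances.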

### Gaussian RBF Kernel
- The kernel trick makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them
- Other kernels exist: string kernels are sometimes used when classifying text documents or DNA sequences
- Always try the linear kernel first, especially if the training set is large or has many features
- If the training set is not too large, you should also try the Gaussian RBF kernel

In [8]:
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001)),
    # Increasing gamma makes the bell-shaped curve narrower - each instance's range of influence is smaller
    # and the decision boundary ends up being more irregular, wiggling around individual instances.
    # A small gamma value makes the bell-shaped curve wider - instances have a larger range of influence
    # and the decision boundary ends up smoother.
])

rbf_kernel_svm_clf.fit(X, y)
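
The effect of gamma described in the comments can be checked directly on the kernel values. This sketch uses `rbf_kernel` from `sklearn.metrics.pairwise`; the reference point, the neighbor positions, and the two gamma values are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
neighbors = np.array([[0.5, 0.0], [1.0, 0.0], [2.0, 0.0]])  # at distances 0.5, 1 and 2

# rbf_kernel computes exp(-gamma * ||x - x'||^2)
influence = {gamma: rbf_kernel(x, neighbors, gamma=gamma)[0] for gamma in (0.1, 5)}

for gamma, similarities in influence.items():
    print(gamma, np.round(similarities, 4))
```

With the larger gamma, similarity falls off much faster with distance, so each instance influences a smaller neighborhood.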


    

### Computational Complexity

#### Linear SVC
- LinearSVC does not support the kernel trick
- LinearSVC training time complexity is roughly O(m × n), where m is the number of training instances and n is the number of features
- The algorithm takes longer if you require very high precision
- This is controlled by the tolerance hyperparameter (called tol in Scikit-Learn)
- In most classification tasks, the default tolerance is fine
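
A minimal sketch of where the tolerance is set, comparing a loose and a tight `tol` on the moons data. The helper name `fit_linear_svc`, the dataset size, `max_iter`, and the two tolerance values are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_moons, y_moons = make_moons(n_samples=1000, noise=0.15, random_state=42)

def fit_linear_svc(tol):
    """Fit the scaled LinearSVC pipeline with a given stopping tolerance."""
    clf = Pipeline([
        ('scaler', StandardScaler()),
        ('linear_svc', LinearSVC(C=1, loss='hinge', tol=tol,
                                 max_iter=10_000, random_state=42)),
    ])
    return clf.fit(X_moons, y_moons)

loose = fit_linear_svc(tol=1e-2)
tight = fit_linear_svc(tol=1e-6)

loose_acc = loose.score(X_moons, y_moons)
tight_acc = tight.score(X_moons, y_moons)
print(loose_acc, tight_acc)
```

On a small problem like this, the two training accuracies are expected to be close, which is consistent with the default tolerance being adequate for most tasks.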

#### SVC Class
- Based on the libsvm library, and supports the kernel trick
- Training time complexity is usually between O(m² × n) and O(m³ × n)
- It gets slow when the number of training instances gets large
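
A rough way to observe the difference is to time both classes as m grows. This is only a sketch: the helper name `fit_seconds`, the dataset sizes, and the hyperparameters are assumptions, timings depend on the machine, and a small two-feature dataset may not show the asymptotic trend clearly.

```python
import time
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

def fit_seconds(model, X, y):
    """Wall-clock seconds spent in model.fit."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

timings = []
for m in (500, 2000, 8000):
    X_m, y_m = make_moons(n_samples=m, noise=0.15, random_state=42)
    X_m = StandardScaler().fit_transform(X_m)
    linear_s = fit_seconds(
        LinearSVC(C=1, loss='hinge', max_iter=10_000, random_state=42), X_m, y_m)
    kernel_s = fit_seconds(SVC(kernel='rbf', gamma=5, C=1), X_m, y_m)
    timings.append((m, linear_s, kernel_s))
    print(f"m={m}: LinearSVC {linear_s:.4f}s, SVC {kernel_s:.4f}s")
```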

### SVM Regression
- SVMs support linear and nonlinear classification, and also linear and nonlinear regression
- SVM regression tries to fit as many instances as possible on the street while limiting margin violations
- The width of the street is controlled by the hyperparameter epsilon
- Adding more training instances within the margin does not affect the model's predictions; the model is said to be epsilon-insensitive

In [10]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
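
The cell above reuses the moons `X, y` from the earlier cells. To see the street more clearly, the sketch below fits the same kind of model on synthetic linear data and counts how many instances fall within epsilon of the prediction. The data-generating formula, the seed, and the `_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(42)
m = 50
X_lin = 2 * rng.random((m, 1))                     # one feature in [0, 2)
y_lin = 4 + 3 * X_lin[:, 0] + rng.normal(size=m)   # noisy linear target

svm_reg_demo = LinearSVR(epsilon=1.5, max_iter=10_000, random_state=42)
svm_reg_demo.fit(X_lin, y_lin)

# Instances whose absolute error is below epsilon lie on the street
residuals = np.abs(y_lin - svm_reg_demo.predict(X_lin))
on_street = residuals < svm_reg_demo.epsilon
print(f"{on_street.sum()} of {m} instances lie within epsilon of the prediction")
```

Instances inside the street incur no loss, which is why adding more of them leaves the fitted model unchanged.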

In [13]:
# Scikit-Learn's SVR class supports the kernel trick
# The SVR class is the regression equivalent of the SVC class
# The SVR class gets much too slow when the training set grows large
# SVMs can also be used for outlier detection

from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
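
For a regression target that suits a degree-2 polynomial kernel, the sketch below fits the same `SVR` configuration on synthetic quadratic data. The data-generating formula, the seed, and the `_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
m = 100
X_quad = 2 * rng.random((m, 1)) - 1  # one feature in [-1, 1)
y_quad = (0.2 + 0.1 * X_quad[:, 0] + 0.5 * X_quad[:, 0] ** 2
          + rng.normal(size=m) / 10)

svm_poly_reg_demo = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg_demo.fit(X_quad, y_quad)

# Only instances on or outside the epsilon-tube become support vectors
n_support = len(svm_poly_reg_demo.support_)
preds = svm_poly_reg_demo.predict([[0.0], [1.0]])
print(n_support, "support vectors out of", m)
print(preds)
```

A larger `epsilon` widens the tube and typically leaves fewer support vectors, while a larger `C` means less regularization.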

### Under the Hood