# Support Vector Machines
Support Vector Machines (SVMs) are powerful, flexible classifiers. They're known for the kernel trick, which is used to handle nonlinear input spaces. An SVM is a *discriminative classifier*, i.e. it simply finds a line or curve (in two dimensions) or a manifold (in higher dimensions) that divides the classes from one another.

Generally, SVMs are used for classification problems, but they can also be used for regression, and they can handle multiple continuous and categorical variables. The core idea of an SVM is to find the maximum-margin hyperplane (MMH) that best divides the dataset into classes.
![svm](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1526288453/index3_souoaz.png)

**Support Vectors**
Support vectors are the data points closest to the hyperplane. They determine where the separating line lies, because the margin is measured from them.

**Hyperplane**
A hyperplane is a decision boundary that separates objects belonging to different classes.

**Margin**
The margin is the gap between the two lines that pass through the closest points of each class. It is calculated as the perpendicular distance from the separating line to the support vectors. The larger the margin, the better.
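
To make these three terms concrete, here is a minimal sketch that fits a linear SVM on a tiny, made-up 2D dataset and reads the support vectors and margin width off the fitted model. The data points and the very large `C` (used to approximate a hard margin) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the separating hyperplane

# Width of the gap between the two margin lines
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points reported in `support_vectors_` influence the fitted hyperplane; moving any other point slightly (without crossing the margin) would leave the result unchanged.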

## How it works
The main objective is to separate the given dataset with a hyperplane that has the maximum possible margin between itself and its support vectors.

**Steps**
1. Generate hyperplanes that segregate the classes.
2. Select the hyperplane with the maximum margin to the support vectors.
![how svm works](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1526288454/index2_ub1uzd.png)
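
The two steps above are a conceptual picture. In practice the search is usually posed as a single optimization problem rather than an enumeration of candidate hyperplanes. As a sketch, the standard hard-margin formulation is shown below; the symbols are introduced here and are not part of the text above: $w$ is the normal vector of the hyperplane, $b$ its offset, $x_i$ a training point, and $y_i \in \{-1, +1\}$ its class label.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 \quad \text{for all } i
$$

The margin width equals $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ is the same as maximizing the margin. The constraints hold with equality exactly for the support vectors.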

## Kernel Trick
Some problems can't be solved using a linear hyperplane. In such a situation, an SVM uses the kernel trick to transform the input space into a higher-dimensional space. After the transformation, the data points are plotted on the $x$ and $z$ axes, where $z$ is the sum of the squares of $x$ and $y$: $z=x^2+y^2$. In this new space the classes can be separated linearly. The result looks like this:
![kernel trick](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1526288453/index_bnr4rx.png)
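
The effect of the $z=x^2+y^2$ mapping can be checked directly. The sketch below builds two concentric rings with scikit-learn's `make_circles` (the dataset and its parameters are assumptions chosen for illustration), adds $z$ as an explicit third feature, and compares a linear SVM before and after the lift. A real kernel SVM never builds the lifted features explicitly; this is only meant to show why the extra dimension helps.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in (x, y) separates them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit lift: add z = x^2 + y^2 as a third coordinate
z = (X ** 2).sum(axis=1)
X_lifted = np.column_stack([X, z])

# Training accuracy of a linear SVM in the original and the lifted space
flat_acc = SVC(kernel="linear").fit(X, y).score(X, y)
lifted_acc = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

print(f"Linear SVM in (x, y):    {flat_acc:.2f}")
print(f"Linear SVM in (x, y, z): {lifted_acc:.2f}")
```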

## SVM Kernels
The kernel trick converts a nonseparable problem into a separable one by adding more dimensions to it.

* **Linear Kernel** A linear kernel is the dot product of any two given observations, i.e. the sum of the products of each pair of input values
    * `K(x, xi) = sum(x * xi)`
* **Polynomial Kernel** A polynomial kernel is a more generalized form of the linear kernel - it can distinguish curved or nonlinear input spaces
    * `K(x, xi) = (1 + sum(x * xi))^d`
        * Where `d` is the degree of the polynomial. With `d=1` it is the linear kernel plus a constant
* **Radial Basis Function Kernel** The RBF kernel implicitly maps the input space into an infinite-dimensional space, which makes it a popular choice.
    * `K(x, xi) = exp(-gamma * sum((x - xi)^2))`
        * Where `gamma` is a positive parameter; values between 0 and 1 are commonly tried
        * A higher value of gamma fits the training dataset very closely and tends to overfit it
        * `gamma=0.1` is a reasonable starting value (scikit-learn's own default for `SVC` is `gamma='scale'`)
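
The three formulas above translate almost one-to-one into NumPy. The sketch below is illustrative only: the function names `k_linear`, `k_poly` and `k_rbf` and the two sample vectors are made up for this example. Each result is cross-checked against the corresponding helper in `sklearn.metrics.pairwise`, where `gamma=1, coef0=1` are passed to `polynomial_kernel` so that it matches the form written above.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel


def k_linear(x, xi):
    """Dot product of two observations."""
    return np.sum(x * xi)


def k_poly(x, xi, d=3):
    """Polynomial kernel of degree d: (1 + x . xi)^d."""
    return (1 + np.sum(x * xi)) ** d


def k_rbf(x, xi, gamma=0.1):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))


# Two hypothetical observations
x = np.array([1.0, 2.0, 3.0])
xi = np.array([0.5, -1.0, 2.0])

# scikit-learn's helpers expect 2D arrays of shape (n_samples, n_features)
X, Xi = x.reshape(1, -1), xi.reshape(1, -1)

print("linear:", k_linear(x, xi), linear_kernel(X, Xi)[0, 0])
print("poly:  ", k_poly(x, xi, d=3),
      polynomial_kernel(X, Xi, degree=3, gamma=1, coef0=1)[0, 0])
print("rbf:   ", k_rbf(x, xi, gamma=0.1), rbf_kernel(X, Xi, gamma=0.1)[0, 0])
```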

## Implementation
We will look at scikit-learn's breast cancer dataset, a popular binary classification problem. It comprises 30 features and a target (type of cancer) with two classes: malignant and benign.

##### Load the data

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
```

##### Explore the data

```python
# print the features
print(f"Features: {cancer.feature_names}")

# print the target values
print(f"\nLabels: {cancer.target_names}")
```

```text
Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

Labels: ['malignant' 'benign']
```


```python
# see the shape of the data
print(f"Shape: {cancer.data.shape}\n")

# check first 5 records
print(cancer.data[0:5])
```

```text
Shape: (569, 30)

[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-
```

##### Split the data

```python
from sklearn.model_selection import train_test_split

# hold out 30% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data,
    cancer.target,
    test_size=0.3,
    random_state=42,
)
```

##### Generate the model

```python
from sklearn import svm

# Create the classifier
clf = svm.SVC(kernel='linear')

# Train the model
clf.fit(X_train, y_train)

# Predict the response for the test set
y_pred = clf.predict(X_test)
```

##### Evaluate the model

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(f"Accuracy:  {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall:    {recall_score(y_test, y_pred)}")
```

```text
Accuracy:  0.9649122807017544
Precision: 0.9636363636363636
Recall:    0.9814814814814815
```


## Hyperparameter Tuning
* **Kernel:** The kernel transforms the input data into the form required to separate it. Tuning it means choosing a different kernel function.
    * Polynomial and RBF are useful for non-linear hyperplanes.
    * In some applications, a more complex kernel is necessary.
* **Regularization:** The regularization parameter `C` is the penalty applied to misclassified points (the error term). It tells the SVM optimizer how much error is tolerable, which is how you control the trade-off between the width of the margin and the misclassification term.
    * A smaller value of `C` creates a larger-margin hyperplane that tolerates more misclassified training points
    * A larger value of `C` creates a smaller-margin hyperplane that tries to classify every training point correctly
* **Gamma:** A lower value of gamma will loosely fit the training dataset, whereas a higher value of gamma will fit the training dataset very closely, which can cause overfitting.
    * A low value of gamma gives each training point a far-reaching influence, so distant points also shape the separation line
    * A high value of gamma limits each training point's influence to its close neighborhood, so only nearby points shape the separation line
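
One common way to explore these three hyperparameters together is a cross-validated grid search. The sketch below is one possible setup, not a recommended configuration: the grid values are arbitrary illustrative choices, and the `StandardScaler` step is an addition that is not part of the walkthrough above (it is included because RBF kernels are sensitive to feature scale and it keeps the search fast).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

# Scaling is an assumption added for this sketch (see the note above)
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; make_pipeline names the SVC step "svc"
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],  # has no effect on the linear kernel
}

# 5-fold cross-validation on the training split only
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

The test split is only touched once, after the search has picked its parameters, so the final score is not biased by the tuning itself.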

## Advantages
* Low memory use (only a subset of the training points, the support vectors, is used in the decision phase)
* Fast predictions
* Works well with a clear margin of separation and with high dimensional space
* Versatile, due to the kernel trick

## Disadvantages
* Not suitable for large datasets because of the long training time
* Works poorly with overlapping classes - tuning the soft-margin parameter `C` is very important
* Sensitive to the type of kernel used
* The results do not have a direct probabilistic interpretation