# Support Vector Machines

- Support Vector Machine (SVM) is a non-parametric supervised machine learning algorithm that is best suited to classification problems, although it also works well for regression problems.

- In an N-dimensional input space, the goal of SVM is to learn a `hyperplane` that best separates the input space into classes.
- A hyperplane is just a point in 1D space, a line in 2D space, and a plane in 3D space. In general, it is an (N-1)-dimensional flat surface in the N-dimensional input space.
- Multiple hyperplanes can exist in a given input space that separate the classes. The optimal hyperplane is the one, among all candidate hyperplanes, that separates the classes with the largest margin.

![image.png](attachment:image-2.png)

- Here, the distance between the hyperplane and the closest data point is called the `margin`.
- The margin is computed using only the closest points, called `support vectors`, as the perpendicular distance from these support vectors to the hyperplane.
- They are called `support vectors` because they support the construction of the hyperplane by determining its position and orientation.
- Thus, in SVM, the optimal hyperplane is learned from the training data by maximizing the margin.

- SVM is not limited to linear classification. SVMs can also perform non-linear classification efficiently by implicitly mapping their inputs into a high-dimensional feature space.
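
The margin and support vectors described above can be inspected on a fitted model. The following is a minimal sketch, assuming scikit-learn is installed and using a small made-up dataset; the very large `C` is an assumption used here to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0],
              [3.0, 1.0], [4.0, 0.0], [4.0, 2.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard margin when the data is separable.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane

# Perpendicular distance of every training point to the hyperplane.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin is the distance to the closest points, i.e. the support vectors.
margin = distances.min()

print("support vectors:\n", clf.support_vectors_)
print("margin:", round(float(margin), 3))
```

Only the points closest to the boundary should show up in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.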

## Hard Margin vs. Soft Margin

- We have already defined the margin as the perpendicular distance between the support vectors and the hyperplane.
- We also choose the hyperplane with the maximal margin as the optimal one.
- However, in SVM we have the notion of two different margins, viz. Hard Margin and Soft Margin.

- In SVM, the hyperplane is expected to separate the classes cleanly, such that all points belonging to one class lie on one side of the hyperplane and all points belonging to the other class lie on the other side. This is called the `hard margin`, which is the default state of SVM. It is only achievable when the classes are linearly separable.
- On the other hand, we can allow some data points to cross the boundary set by the hyperplane, i.e. we allow some misclassification of points by relaxing the hard constraints. This is called the `Soft Margin`.

![image.png](attachment:image-2.png)

fig. Soft Margin

- We incorporate the soft margin by introducing slack variables $\xi_i$, one per training point, which measure how far each point violates the margin, together with a regularization parameter C that controls how heavily these violations are penalized, i.e. C determines how much misclassification we allow.

- The smaller the value of C, the more violations of the boundary or misclassifications are allowed, and the less sensitive the model is to individual training points. This is the case of higher bias and lower variance.

- Conversely, the larger the value of C, the more sensitive the algorithm is to the training data, which gives lower bias and higher variance.

- In conclusion, a hard margin implementation corresponds to a very large value of C, and a soft margin implementation corresponds to a smaller value of C.
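
The role of C can be made concrete with a small calculation. The sketch below assumes the standard soft-margin objective $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ with $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$, which is not spelled out above; the data, the fixed hyperplane, and the function names `slack` and `soft_margin_objective` are hypothetical and chosen only for illustration.

```python
import numpy as np

def slack(X, y, w, b):
    """Slack xi_i = max(0, 1 - y_i * (w.x_i + b)) for every point."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(X, y, w, b, C):
    """0.5 * ||w||^2 + C * sum(xi_i), the quantity a soft-margin SVM minimizes."""
    return 0.5 * np.dot(w, w) + C * slack(X, y, w, b).sum()

# Hypothetical data: two points violate the margin of the hyperplane x1 = 2.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.5, 0.0],
              [2.5, 1.0], [3.0, 1.0], [4.0, 0.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
w, b = np.array([1.0, 0.0]), -2.0

xi = slack(X, y, w, b)
# xi == 0      : on or outside the margin
# 0 < xi <= 1  : inside the margin but still on the correct side
# xi > 1       : misclassified
print("slack:", xi)

for C in (0.1, 10.0):
    print("C =", C, "objective =", soft_margin_objective(X, y, w, b, C))
```

For the same hyperplane, the violation term is negligible when C is small and dominates when C is large, which is why a large C pushes the optimizer towards a boundary with no violations.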

## Kernel in SVM


- We have already mentioned above that SVM works not only with linearly separable data but also with non-linear data. It does so by mapping the lower-dimensional data into a higher-dimensional space in which the classes become linearly separable. This is achieved through the use of a so-called `Kernel`, a function that computes the inner product of two points in that higher-dimensional space without constructing the mapping explicitly.

- The learning of the hyperplane explained above is the result of using the linear kernel.

- So, let's explore the different kernels used in SVM.
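
Before looking at specific kernels, the idea of mapping data to a higher dimension can be illustrated with a tiny example. This is a sketch under stated assumptions: the 1D data, the feature map $\phi(x) = (x, x^2)$, and the helper `separable_by_threshold` are all made up for illustration.

```python
import numpy as np

# 1D data: the inner points are class -1, the outer points are class +1.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

def separable_by_threshold(x, y):
    """Check whether any single point (a 1D hyperplane) separates the classes."""
    xs = np.sort(x)
    thresholds = (xs[:-1] + xs[1:]) / 2.0
    for t in thresholds:
        for sign in (1, -1):
            if np.all(sign * np.sign(x - t) == y):
                return True
    return False

# Explicit (hypothetical) feature map into 2D: phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the mapped space, the line x^2 = 2.5 is a separating hyperplane.
w, b = np.array([0.0, 1.0]), -2.5
pred = np.sign(phi @ w + b)

print("separable in 1D:", separable_by_threshold(x, y))
print("separable in 2D:", bool(np.all(pred == y)))
```

A kernel achieves the same effect without ever building `phi`, because the SVM only needs inner products between points.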

### *1. Linear Kernel*

The linear kernel is the dot product between a new data point and a support vector.

For each data point $x_j$ in X, compute
$$ K(x_i, x_j) = x_i \cdot x_j $$

where $x_i$ is a support vector.
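
As a minimal NumPy sketch of the formula above, the function below computes the linear kernel between every data point and every support vector at once. The name `linear_kernel_matrix` and the sample values are assumptions made for illustration.

```python
import numpy as np

def linear_kernel_matrix(X, support_vectors):
    """K[j, i] = x_j . x_i for every data point x_j and support vector x_i."""
    return X @ support_vectors.T

# Hypothetical data points and support vectors.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
support_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])

K = linear_kernel_matrix(X, support_vectors)
print(K)
```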

### *2. Polynomial Kernel*
The polynomial kernel is given as follows.

For each data point $x_j$ in X, compute
$$ K(x_i, x_j) = (x_i \cdot x_j + 1)^d $$

where d is the degree of the polynomial. It is a hyperparameter, so it is not learned during training and must be chosen beforehand, for example by cross-validation.
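
The sketch below evaluates the polynomial kernel and, for $d = 2$ with 2D inputs, compares it with an explicit feature map to show that the kernel equals an inner product in a higher-dimensional space. The function names and the particular feature map `phi_degree2` are assumptions introduced for illustration.

```python
import numpy as np

def polynomial_kernel(x_i, x_j, d=2):
    """K(x_i, x_j) = (x_i . x_j + 1)^d."""
    return (np.dot(x_i, x_j) + 1.0) ** d

def phi_degree2(x):
    """Explicit 6-dimensional feature map for d = 2 and 2D inputs."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2])

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])

k_implicit = polynomial_kernel(x_i, x_j, d=2)             # kernel, no mapping built
k_explicit = np.dot(phi_degree2(x_i), phi_degree2(x_j))   # inner product after mapping

print(k_implicit, k_explicit)
```

Both values agree, but the kernel needs only one dot product in the original 2D space.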

### *3. Gaussian Kernel*
It is a general-purpose kernel, used when there is no prior knowledge about the data. It is given as:

$$ K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) $$

where $\sigma$ is the width of the Gaussian.
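
A minimal NumPy sketch of the Gaussian kernel, computed for all pairs of points at once. The function name `gaussian_kernel_matrix` and the sample points are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for all pairs of rows."""
    # Pairwise differences via broadcasting, shape (n, n, n_features).
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical points; the first two are at Euclidean distance 5.
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
K = gaussian_kernel_matrix(X, sigma=5.0)
print(K)
```

The matrix is symmetric with ones on the diagonal, since every point is at distance zero from itself, and the values decay towards zero as points move apart.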

### *4. Gaussian Radial basis function (RBF) Kernel*
The Gaussian RBF kernel is given as:

$$ K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) $$

where $\gamma > 0$ is a hyperparameter that defines how far the influence of a single training example reaches. This is the Gaussian kernel written with $\gamma = 1/(2\sigma^2)$. A higher value of gamma means that only points close to the candidate hyperplane are considered, and conversely, a lower value of gamma means that points farther away are considered as well.
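
The effect of gamma can be checked numerically. The sketch below assumes the form $\exp(-\gamma \|x_i - x_j\|^2)$ and the relation $\gamma = 1/(2\sigma^2)$; the function name `gaussian_rbf_kernel` and the sample points are hypothetical.

```python
import numpy as np

def gaussian_rbf_kernel(x_i, x_j, gamma=1.0):
    """K(x_i, x_j) = exp(-gamma * sum((x_i - x_j)^2))."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.sum(diff ** 2))

x_i = np.array([0.0, 0.0])
x_j = np.array([3.0, 4.0])   # squared distance to x_i is 25

# Same kernel written with sigma and with gamma = 1 / (2 * sigma^2).
sigma = 5.0
gamma = 1.0 / (2.0 * sigma ** 2)
k_sigma_form = np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))
k_gamma_form = gaussian_rbf_kernel(x_i, x_j, gamma=gamma)
print(k_sigma_form, k_gamma_form)

# Larger gamma makes the similarity fall off faster with distance.
for g in (0.01, 0.1, 1.0):
    print("gamma =", g, "K =", gaussian_rbf_kernel(x_i, x_j, gamma=g))
```

With a large gamma, only very close points have a kernel value noticeably above zero, so each training example influences a smaller region.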

### *5. Laplace RBF Kernel*
It is also a general-purpose kernel that is used when there is no prior knowledge about the data.

$$ K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\sigma}\right) $$
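
A minimal sketch of the Laplace RBF kernel, assuming that $\|\cdot\|$ in the formula denotes the Euclidean norm (some libraries use the L1 norm instead). The function name `laplace_rbf_kernel` and the sample points are assumptions made for illustration.

```python
import numpy as np

def laplace_rbf_kernel(x_i, x_j, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j|| / sigma), with the Euclidean norm."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-np.linalg.norm(diff) / sigma)

x_i = np.array([0.0, 0.0])
x_j = np.array([3.0, 4.0])   # Euclidean distance to x_i is 5

k = laplace_rbf_kernel(x_i, x_j, sigma=5.0)
print(k)
```

Compared with the Gaussian kernel, the distance is not squared, so the similarity decays more slowly for distant points.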

### *6. Hyperbolic Tangent Kernel*
It is given as:

$$ K(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c) $$

where $\kappa$ (slope) and $c$ (offset) are hyperparameters.
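
A minimal sketch of the hyperbolic tangent kernel. The function name `tanh_kernel`, the default values of `kappa` and `c`, and the sample points are assumptions made for illustration.

```python
import numpy as np

def tanh_kernel(x_i, x_j, kappa=0.1, c=-1.0):
    """K(x_i, x_j) = tanh(kappa * (x_i . x_j) + c)."""
    return np.tanh(kappa * np.dot(x_i, x_j) + c)

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])   # x_i . x_j = 11

k = tanh_kernel(x_i, x_j, kappa=0.1, c=-1.0)
print(k)
```

Unlike the Gaussian and Laplace kernels, the output lies in (-1, 1) and can be negative.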

## Pros and Cons

### Pros
- SVM works well when there is a clear margin of separation between the classes.

- SVM is effective in high-dimensional spaces.

- SVM still works in cases where the number of dimensions is larger than the number of samples.

- SVM is also a memory-efficient algorithm, since it uses only a subset of the observations, called support vectors, to determine the hyperplane. It is suitable for both classification and regression.

### Cons
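- SVM is not preferred for very large data sets, because training time grows quickly with the number of samples.

- SVM does not work well when the classes overlap heavily.

- SVM is prone to overfitting when the number of observations is much smaller than the number of features, unless the kernel and the regularization parameter are chosen carefully.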