## The hyperplane


![image.png](attachment:image.png)


A hyperplane is a flat boundary that divides a p-dimensional space into two parts; the boundary itself has p-1 dimensions. The name comes from the three-dimensional case, where the dividing boundary is an ordinary plane, and "hyperplane" generalizes that idea to any number of dimensions.

In a one-dimensional space, a single point is enough to divide the space.

In a two-dimensional space, we use a line to divide the space.


So, why do we split these spaces with a hyperplane?


![image-2.png](attachment:image-2.png)


We want to find the hyperplane (a line, in the graph above) that separates the space into two parts so that, for example, points above the line are classified as 0 and points below the line are classified as 1.
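
As a minimal sketch of this idea, a hyperplane can be written as `w . x + b = 0`, and the sign of `w . x + b` tells us on which side a point falls. The values of `w` and `b` below are hypothetical (they describe the line x2 = x1), as are the example points and the choice of which side maps to class 0.

```python
# Hypothetical hyperplane in 2-D: w[0]*x1 + w[1]*x2 + b = 0 (here, the line x2 = x1).
w = (-1.0, 1.0)
b = 0.0

def side(point):
    """Return the signed value of w . x + b for a point."""
    return sum(wj * xj for wj, xj in zip(w, point)) + b

def classify(point):
    """Points above the line (positive side) get class 0, the rest get class 1."""
    return 0 if side(point) > 0 else 1

# Illustrative points, not taken from the figure above.
points = [(1.0, 3.0), (2.0, 0.5), (-1.0, 0.0)]
labels = [classify(p) for p in points]
print(labels)
```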





## Maximal margin classifier

So we have a two-dimensional space of predictors and want to find the hyperplane that separates this space so that each class falls in a different part.

In a scenario where the data is perfectly separable, there are infinitely many hyperplanes that do this:

![image.png](attachment:image.png)


But which hyperplane should we choose, and why?

The hyperplane that is chosen is usually the one that is farthest from the training observations.

The steps to find this hyperplane are:

    1) Calculate the perpendicular distance of each training observation from the hyperplane
    
    2) The minimum of these distances is called the margin
    
    3) Choose the hyperplane with the maximum margin

So whichever hyperplane has the **maximum value of the margin** will be selected.
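
The three steps above can be sketched in a few lines. This is only an illustration: the toy data and the three candidate hyperplanes are made up, and a real fit solves an optimization problem over all possible hyperplanes instead of comparing a short hand-picked list.

```python
import math

# Toy, linearly separable training set: (x1, x2) with labels +1 / -1.
X = [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 4.0), (4.0, 3.0)]
y = [-1, -1, -1, 1, 1, 1]

# Hypothetical candidate hyperplanes (w, b); each one separates the two classes.
candidates = {
    "diagonal": ((1.0, 1.0), -4.0),    # x1 + x2 = 4
    "vertical": ((1.0, 0.0), -2.0),    # x1 = 2
    "horizontal": ((0.0, 1.0), -2.0),  # x2 = 2
}

def margin(w, b):
    """Steps 1 and 2: perpendicular distances, then their minimum."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    distances = [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) / norm
        for xi, yi in zip(X, y)
    ]
    return min(distances)  # would be negative if a point were on the wrong side

# Step 3: keep the candidate with the largest margin.
margins = {name: margin(w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```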


![image-2.png](attachment:image-2.png)



After choosing the hyperplane based on the criteria above, we look at which points lie on the margin boundaries.


These points are known as **support vectors**, because they are **supporting the margin boundaries**. The other points are no longer important: the classifier depends entirely on the support vectors, so moving any other point (as long as it does not cross the margin) leaves the hyperplane unchanged.


This is one way the technique differs from many conventional ML techniques, which use every training observation to fit the model.
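
To make the idea of support vectors concrete, here is a small sketch on toy data. It assumes the maximal margin hyperplane has already been found and scaled so that the closest points satisfy `y * f(x) = 1`; the values of `w` and `b` and the data are illustrative, not taken from the figures.

```python
# Toy, linearly separable training set with labels +1 / -1.
X = [(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 4.0), (4.0, 3.0)]
y = [-1, -1, -1, 1, 1, 1]

# Assumed already-fitted hyperplane x1 + x2 = 4, scaled so margin points give y * f(x) = 1.
w, b = (0.5, 0.5), -2.0

def f(x):
    """Decision value w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Support vectors are the points that sit exactly on a margin boundary.
support_vectors = [xi for xi, yi in zip(X, y) if abs(yi * f(xi) - 1.0) < 1e-9]
# Every other point is strictly farther away and does not influence the fit.
others = [xi for xi, yi in zip(X, y) if yi * f(xi) > 1.0 + 1e-9]
print(support_vectors)
```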


#### Limitations of maximal margin classifier


* Limitation 1: The maximal margin classifier cannot be used if the two classes are not separable by a linear hyperplane.


* Limitation 2: The maximal margin classifier is very sensitive to its support vectors: a single additional observation can lead to a dramatic shift in the hyperplane. This sensitivity can make the maximal margin classifier very prone to overfitting.


![image-3.png](attachment:image-3.png)



## Support vector classifier

This approach handles the limitations of the maximal margin classifier, that is: a hyperplane that is very sensitive to new data, and classes that cannot be perfectly separated by a hyperplane.


The support vector classifier is a **soft margin classifier**: we still have a hyperplane and its margins, but not every observation needs to be on the correct side of the margin.

For this, we consider not only the hyperplane, but also its margins and how the points are distributed inside or outside these margins.

The soft margin classifier will allow two types of errors to occur: 

    1) A point that is correctly classified but is on the wrong side of the margin
    2) A point that is misclassified because it is on the wrong side of the hyperplane


![image.png](attachment:image.png)


**REMEMBER**: the points that lie on the margin or on the wrong side of their margin are called the support vectors. Points on the correct side, outside the margin, are not relevant for classification. The hyperplane and its margins change only when the support vectors change.


So, how is a support vector classifier created?

    1) We define a misclassification budget (B)
    
    2) We limit the sum of the distances of the violating points (those on the wrong side of their margin) from their correct margins: (x1 + x2 + x3 + x4) < B, where the x values are those distances. 
    
    3) We then try to maximize the margin while staying within the budget.
    
    4) In practice, implementations usually use C (cost, a multiplier on the error term), which is inversely related to B. C is a hyperparameter that we tune with cross-validation to minimize the estimated test error.
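
The budget check in steps 1 and 2 can be sketched as follows. Everything here is an assumption made for illustration: the toy data, the unit-length `w`, the margin half-width `M`, and the budget `B`. A real solver would search for the hyperplane and margin instead of taking them as given.

```python
import math

# Toy data with overlapping classes (labels +1 / -1).
X = [(0.0, 0.0), (1.0, 1.0), (2.5, 2.5), (2.25, 2.25), (4.0, 4.0), (1.5, 1.5)]
y = [-1, -1, -1, 1, 1, 1]

# Hypothetical hyperplane x1 + x2 = 4 with unit-length w, margin half-width M, budget B.
w = (1 / math.sqrt(2), 1 / math.sqrt(2))
b = -2.0 * math.sqrt(2)
M = 1.0
B = 5.0

def signed_distance(x, label):
    """Distance to the hyperplane, positive when x is on its own class's side."""
    return label * (sum(wj * xj for wj, xj in zip(w, x)) + b)

# Distance of each point from its correct margin (0 when the point respects the margin).
violations = [max(0.0, M - signed_distance(xi, yi)) for xi, yi in zip(X, y)]
total_violation = sum(violations)
within_budget = total_violation < B

# The two error types allowed by the soft margin.
inside_margin_but_correct = sum(
    1 for xi, yi in zip(X, y) if 0 <= signed_distance(xi, yi) < M
)
misclassified = sum(1 for xi, yi in zip(X, y) if signed_distance(xi, yi) < 0)
print(total_violation, within_budget)
```

With a smaller budget, say `B = 3.0`, the same hyperplane and margin would no longer be allowed, and the margin would have to shrink or the hyperplane would have to move.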
    
    
**So what are the impacts of the C?**

1. When C is **small**, margins will be **wide** and there will be **many** support vectors and many misclassified observations

2. When C is **large**, margins will be **narrow** and there will be **fewer** support vectors and fewer misclassified observations

3. A low C value **helps prevent overfitting** and may give better test performance, although a C that is too low can underfit

4. We can search for the optimal C value with tuning (for example, cross-validation).
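
One way to see this effect is to fit the same linear classifier with several C values and count the support vectors. The sketch below assumes scikit-learn is available and uses its `SVC` class on synthetic, overlapping data from `make_blobs`; the dataset and the C values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic classes (illustrative data, fixed seed).
X, y = make_blobs(
    n_samples=200, centers=[(0, 0), (2, 2)], cluster_std=1.0, random_state=0
)

n_support = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_support[C] = int(model.n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors")
```

On data like this we would expect the count to drop as C grows, since a larger C penalizes margin violations more and narrows the margin.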



#### Limitations of support vector classifiers


* Limitation 1: The support vector classifier is a **linear** classifier, so it cannot classify data that is not linearly separable. When the data is non-linear, we use **kernel** methods, which lead to support vector machines.



## Support Vector Machines

The support vector machine (SVM) is an extension of the support vector classifier that uses **kernels** to create **non-linear** boundaries.

Kernels are functions that quantify the similarity between two observations. Some popular kernels include: linear, polynomial and radial.
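
As a quick illustration of why kernels matter, the sketch below compares a linear kernel with a radial kernel on data shaped like two concentric circles, which no straight line can separate. It assumes scikit-learn is available; the dataset (`make_circles`) and default settings are illustrative choices, and the scores are training accuracy only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same classifier, two different kernels.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(f"linear: {linear_acc:.2f}, radial: {rbf_acc:.2f}")
```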


* Linear kernel:

The linear kernel is the one used by the support vector classifier: it gives a linear hyperplane and linear margins.

![image.png](attachment:image.png)



* Polynomial kernel:

The polynomial kernel uses a power function to make the boundaries non-linear. The term d in the formula below (the exponent) is the degree of the polynomial and determines the flexibility of the decision boundary.

![image-2.png](attachment:image-2.png)
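
A minimal sketch of these two kernels, assuming the common forms K(x, z) = x . z for the linear kernel and K(x, z) = (1 + x . z)^d for the polynomial kernel. The formula in the image above may use different constants, so treat the exact expression as an assumption.

```python
def linear_kernel(x, z):
    """Inner product of two observations."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d):
    """Assumed form (1 + x . z) ** d; d is the degree of the polynomial."""
    return (1 + linear_kernel(x, z)) ** d

# Two illustrative observations.
x, z = (1.0, 2.0), (3.0, 0.5)
for d in (1, 2, 3):
    print(d, polynomial_kernel(x, z, d))
```

With d = 1 the polynomial kernel differs from the linear kernel only by a constant, so the boundary stays linear; higher degrees allow more curved boundaries.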



* Radial kernel:

The radial kernel is one of the most widely used kernels because it captures non-linearity in the data flexibly. In the formula below, γ is gamma. Some texts parameterize the kernel by sigma instead (commonly gamma = 1 / (2 * sigma^2)), so a large gamma corresponds to a small sigma. Gamma determines how far the influence of a single training observation reaches: a larger gamma gives a very tight, local kernel, so only very close data points affect the classification of a given point.

Gamma is also a **hyperparameter**, and we look for its optimal value with cross-validation.

![image-3.png](attachment:image-3.png)
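
The effect of gamma can be checked numerically. The sketch below assumes the common form K(x, z) = exp(-gamma * ||x - z||^2); the points and gamma values are made up for illustration.

```python
import math

def radial_kernel(x, z, gamma):
    """Assumed form exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

origin = (0.0, 0.0)
near, far = (0.5, 0.0), (3.0, 0.0)

similarities = {}
for gamma in (0.1, 10.0):
    similarities[gamma] = (
        radial_kernel(origin, near, gamma),
        radial_kernel(origin, far, gamma),
    )
    print(gamma, similarities[gamma])
```

With the small gamma, both the near and the far point still have noticeable similarity to the origin. With the large gamma, even the near point has little similarity and the far point has essentially none, which is the "only very close points matter" behavior described above.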





**It is therefore important to tune both parameters together: the cost value C (which controls how strongly margin violations are penalized) and, for radial kernels, gamma/sigma (which controls how far each observation's influence reaches).**
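
A joint search over C and gamma can be sketched with cross-validation. This example assumes scikit-learn is available and uses `GridSearchCV` with an `SVC` radial kernel on synthetic data from `make_moons`; the grid values, the dataset, and the 5-fold setting are arbitrary choices for illustration, not recommended defaults.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic non-linear data (illustrative, fixed seed).
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Candidate values for both hyperparameters; every combination is evaluated.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best combination.
print(search.best_params_, round(search.best_score_, 3))
```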