[Original Blog](https://blog.csdn.net/qq_31347869/article/details/88071930?spm=1001.2014.3001.5506)

A common machine learning pipeline: <br>
**Training Set -> Extract Feature Vectors -> Fit an Algorithm (Classifier: Decision Tree, kNN, ...) -> Outputs**

<h2>SVM</h2>

Support Vector Machines (SVMs) are a set of **supervised learning** methods used for classification, regression and outlier detection.
In its basic form an SVM is a **binary classification** model: it maps each instance to a point in space and looks for the line that best separates the two classes of points,
so that new points can be classified by the side of the line they fall on.
<span style="color:orange">SVMs are well suited to small and medium-sized datasets, and to non-linear and high-dimensional problems.</span>

The feature vector of each instance is mapped to a point in space (take 2D as an example).
The purpose of the SVM is to draw the line that best separates the two types of points,
so that points arriving in the future are also classified well by this line.

![svm_01](images/svm_01.png)


Question: How many lines can be drawn to separate the sample points?
> Infinitely many lines can be drawn; they differ in how well they separate the classes.
> Each such line is called a separating hyperplane.
> For example, the green line above is poor, the blue line is acceptable, and the red line looks better. The best line we hope to find is the "separating hyperplane with the largest margin".

Question: Why is it called a hyperplane?
> Because the sample features are often high-dimensional, and in that case the boundary that separates them is no longer a line but a hyperplane.

Question: What is the criterion for choosing this line?
> The SVM looks for the separating hyperplane that distinguishes the two categories and **maximizes the margin**.
> Such a hyperplane is the least affected when the samples are locally perturbed,
> so its classification result is the most robust and its generalization to unseen examples is the strongest.

Question: What is the margin?
> For any hyperplane, the data points on each side of it have a **minimum distance (perpendicular distance)** to it,
> and the sum of these two minimum distances is the margin.
> For example, the band-shaped area between the two dotted lines in the figure below is the margin,
> and each dotted line is determined by the points closest to the central solid line (that is, by the support vectors).
> In the first drawing the margin is relatively small. The second drawing gives a significantly larger margin, which is closer to our goal.

![svm_02](images/svm_02.png)
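
The margin definition above can be made concrete with a small NumPy sketch. The helper `margin_of_hyperplane` and the toy points are hypothetical, chosen only for illustration; the function assumes labels of +1 / -1 and a hyperplane that already separates the two classes (it does not check this).

```python
import numpy as np

def margin_of_hyperplane(w, b, X, y):
    """Return the margin of the hyperplane w.x + b = 0 for labelled points.

    The margin is the sum of the smallest perpendicular distances from the
    hyperplane to the points of each class (labels are assumed to be +1 / -1).
    """
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Perpendicular distance of every point to the hyperplane.
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances[y == 1].min() + distances[y == -1].min()

# Hypothetical toy data: two points per class.
X = [[1, 1], [2, 0], [2, 3], [3, 3]]
y = [-1, -1, 1, 1]

# Two candidate separating lines for the same data.
narrow = margin_of_hyperplane([0.0, 1.0], -2.0, X, y)  # horizontal line x2 = 2
wide = margin_of_hyperplane([1.0, 2.0], -5.5, X, y)    # line x1 + 2*x2 = 5.5
print(narrow, wide)
```

Both lines separate the toy data, but the second one leaves a wider band between the classes, which is the property the SVM optimizes for.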

Question: Why make the margin as large as possible?
> Because a hyperplane with a large margin is less sensitive to small perturbations of the samples, which makes the classifier more robust.

Question: What is a support vector?
> As can be seen from the figure above, the points on the dotted lines are all at the same distance from the separating hyperplane.
> In fact, only these points jointly determine the position of the hyperplane, so they are called "support vectors".
> This is also where the name "support vector machine" comes from.



<h2>Hard-Margin SVM</h2>

![svm_03](images/svm_03.png)

The separating hyperplane can be defined by the linear equation $w^T X + b = 0$, where:

- $w = \{ w_1, w_2, ..., w_d \}$ is the normal vector, which determines the direction of the hyperplane; $d$ is the number of features
- $X$ is a training sample
- $b$ is the displacement, which determines the distance between the hyperplane and the origin

The separating hyperplane is uniquely determined once the normal vector $w$ and the displacement $b$ are determined.
The distance from the separating hyperplane to the margin hyperplane on either side is $\frac{1}{||w||}$, so the total margin is $\frac{2}{||w||}$.

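With class labels $y_i \in \{+1, -1\}$, maximizing the margin subject to $y_i (w_0 + w_1x_1 + w_2x_2) \ge 1 , \forall i$ (written here for two features, with $w_0$ playing the role of $b$) becomes a constrained convex optimization problem.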
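
The derivation behind this step can be sketched as follows, assuming the standard hard-margin formulation with labels $y_i \in \{+1, -1\}$ and margin hyperplanes $w^T X + b = \pm 1$ (the factor $\frac{1}{2}$ and the squared norm are conventional choices that simplify differentiation, not something the original post states). The perpendicular distance from a sample $X_i$ to the hyperplane is

$$\frac{|w^T X_i + b|}{||w||}$$

which equals $\frac{1}{||w||}$ for samples lying on a margin hyperplane, so the margin is $\frac{2}{||w||}$. Maximizing it is equivalent to

$$\min_{w, b} \ \frac{1}{2} ||w||^2 \quad \text{subject to} \quad y_i (w^T X_i + b) \ge 1, \ \forall i$$

The objective is quadratic and the constraints are linear in $w$ and $b$, which is why the problem is convex.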

Applying the Lagrangian formulation together with the Karush-Kuhn-Tucker (KKT) conditions, the maximum-margin hyperplane (MMH) can be expressed as the following **"decision boundary"**:

$$d(X^T) = \sum_{i=1}^{l} y_i \alpha_i X_i X^T + b_0$$

This equation represents the **separating hyperplane with the maximum margin**:

- $l$ is the number of support vectors; most points are not support vectors, only those lying on the margin hyperplanes are,
  so the sum runs only over those points
- $X_i$ is the feature vector of the $i$-th support vector
- $y_i$ is the class label of $X_i$, either +1 or -1
- $X^T$ is the instance to be tested; substitute it into the equation to find out which category it belongs to
- $\alpha_i$ and $b_0$ are scalar parameters obtained from the optimization described above; $\alpha_i$ is a Lagrange multiplier

To classify a new test sample $X$, substitute it into the equation and use the sign of the result (positive or negative).
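
The decision rule above can be sketched in NumPy. The function name `decision_value`, the test point (3, 3) and all numeric values are assumptions for illustration: the support vectors, multipliers and $b_0$ are hand-derived to be consistent with the two-point example in the next section, not the output of a solver.

```python
import numpy as np

def decision_value(x, support_vectors, labels, alphas, b0):
    """Evaluate d(x) = sum_i y_i * alpha_i * (X_i . x) + b0."""
    support_vectors = np.asarray(support_vectors, dtype=float)
    labels = np.asarray(labels, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    x = np.asarray(x, dtype=float)
    # The sum runs over the support vectors only.
    return float(np.sum(labels * alphas * (support_vectors @ x)) + b0)

# Assumed values: two support vectors, one per class, with equal multipliers.
support_vectors = [[1, 1], [2, 3]]
labels = [-1, 1]
alphas = [0.4, 0.4]
b0 = -2.2

# Hypothetical test point; the sign of d(x) selects the class.
value = decision_value([3, 3], support_vectors, labels, alphas, b0)
predicted = 1 if value >= 0 else -1
print(value, predicted)
```

With these assumed parameters the two support vectors themselves evaluate to -1 and +1, as expected for points lying on the margin hyperplanes.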

<h2>SVM Application Examples</h2>

Suppose we already have two support vectors, (1,1) and (2,3). The normal vector is parallel to the segment joining them, $(2,3) - (1,1) = (1,2)$, so the weight can be written as $w = (a,2a)$. Substituting the coordinates of the two support vectors
into the formula $w^T x + w_0 = \pm 1$ (here $w_0$ denotes the displacement $b$) gives

$$a + 2a + w_0 = -1 \quad \text{(using point (1,1))}$$
$$2a + 6a + w_0 = 1 \quad \text{(using point (2,3))}$$

Solving these equations gives: <br>
- $a=\frac{2}{5}, w_0=-\frac{11}{5}$
- $w=(a,2a)=(\frac{2}{5}, \frac{4}{5})$
- The separating hyperplane is $x_1 + 2x_2 - 5.5 = 0$
- The point (2,0) can be used to check the classification: $2 + 2 \cdot 0 - 5.5 = -3.5 < 0$, so it falls on the same side as (1,1)
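
The same arithmetic can be checked numerically. This is a minimal sketch that treats $(a, w_0)$ as the unknowns of a 2x2 linear system; the variable names are illustrative only.

```python
import numpy as np

# Unknowns are (a, w0). Each row encodes one support-vector equation:
#   (1, 1): a*1 + 2a*1 + w0 = -1  ->  3a + w0 = -1
#   (2, 3): a*2 + 2a*3 + w0 = +1  ->  8a + w0 = +1
A = np.array([[3.0, 1.0],
              [8.0, 1.0]])
rhs = np.array([-1.0, 1.0])
a, w0 = np.linalg.solve(A, rhs)

w = np.array([a, 2 * a])
print(a, w0, w)

# Classify the test point (2, 0) by the sign of w.x + w0.
score = w @ np.array([2.0, 0.0]) + w0
print(score)
```

A negative score places (2,0) on the same side of the hyperplane as the support vector (1,1).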

In [None]:
# Implement an SVM with scikit-learn

from sklearn import svm

# define some points and their class labels
X = [[2, 0], [1, 1], [2, 3]]
y = [0, 0, 1]

# define a classifier
clf = svm.SVC(kernel='linear')  # SVC is the support vector classifier; the kernel is set to linear

# train the classifier
clf.fit(X, y)  # fit() builds the model: it computes the separating hyperplane and stores it in the classifier

print(clf)  # print the classifier and its parameters
print(clf.support_)  # print the indices of the support vectors
print(clf.support_vectors_)  # print the support vectors
print(clf.n_support_)  # print how many support vectors each class has
print(clf.predict([[2, 0]]))  # predict the class of the point (2, 0)
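
As a rough cross-check, the fitted model can be compared with the hand-derived hyperplane from the worked example. `coef_`, `intercept_` and `decision_function` are attributes and methods of a fitted linear-kernel `SVC` in scikit-learn. The expectation that they come out close to $w = (0.4, 0.8)$ and $w_0 = -2.2$ assumes that the default `C=1.0` does not change the solution for this tiny, separable dataset, and the solver's result is only approximate.

```python
from sklearn import svm

X = [[2, 0], [1, 1], [2, 3]]
y = [0, 0, 1]

clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# For a linear kernel, coef_ holds w and intercept_ holds the displacement of w.x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
print(w, b)

# decision_function returns w.x + b; a negative value means class 0, a positive value class 1.
score = clf.decision_function([[2, 0]])[0]
print(score)
```

If the assumption holds, the printed weights should agree with the hand calculation up to the solver tolerance.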

In [None]:

print(__doc__)

import numpy as np
import pylab as pl

