useful in solving both regression and classification tasks

# Hyperplane
1. 2 margin planes (in binary classification) lie on either side of this hyperplane
2. each plane passes through the point(s) of its class that lie nearest to the hyperplane
3. both of these planes are parallel to the hyperplane
4. <img src="svm-1.png" float="left"/>
5. these **points**, as seen above, are called support vectors for each class
    1. as seen above, there can exist *multiple support vectors for the same class*


# Linearly separable
it is advisable that SVM be used with linearly separable data
1. in the above image, the points are arranged in a linearly separable manner
2. linearly separable, as it sounds, means that there exists at least 1 line (or hyperplane, in case of >2D feature-space) that separates a given collection of points, such that points lying on one side belong to one class, and points lying on the other side belong to the other class
3. <img src="linearly_vs_non-linearly_separable.png" float="left"/> 
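
The definition above can be checked directly for a candidate line. The following is a minimal sketch, assuming a hypothetical helper `separates` and made-up toy points (neither comes from these notes): it tests whether a given line w.x + b = 0 keeps each class strictly on its own side.

```python
def separates(w, b, points, labels):
    """Return True if the line w.x + b = 0 puts every +1 point strictly on
    its positive side and every -1 point strictly on its negative side."""
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y * score <= 0:  # wrong side of (or exactly on) the line
            return False
    return True


# Hypothetical toy data: two clusters that a single line can split.
points = [(1.0, 1.0), (2.0, 2.5), (4.0, 0.5), (5.0, 1.0)]
labels = [+1, +1, -1, -1]
separable_case = separates((-1.0, 1.0), 1.0, points, labels)

# XOR-style layout: this candidate line fails (and in fact no line works).
xor_points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_labels = [+1, +1, -1, -1]
xor_case = separates((1.0, 1.0), -0.5, xor_points, xor_labels)
```

Note that this only verifies one candidate line; a point set is linearly separable if at least one such line passes the check.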


* the greater the margin, the more generalisable the SVM model is, i.e. the less prone it is to making errors during prediction


<font size="5">Norm of a vector refers to the L2-norm, i.e. the length of the vector, i.e. square-root of sum of squares of each component value.</font> 
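
As a quick illustration of this definition, here is a small sketch (the helper name `l2_norm` is only illustrative) that computes the L2-norm of a vector from its components:

```python
import math


def l2_norm(v):
    """Length of the vector: square-root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))


norm = l2_norm([3.0, 4.0])  # sqrt(3^2 + 4^2) = sqrt(25) = 5.0
```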



# Maths involving svm

1. w$^{\textrm{T}}$x + b = 0 : equation of the hyperplane; y = w$^{\textrm{T}}$x + b is the value (score) that the model assigns to a point x
3. <img src="svm-2.png" />
4. the distance from the hyperplane to the support vectors of a class is the margin value for that class's margin plane; the total margin is the distance between the 2 margin planes, which separates the points belonging to one class from the points that don't belong to this class.
5. in the above binary-classification example, let x$_{\textrm{m}}$ be a support vector on the margin plane y = w$^{\textrm{T}}$x + b = 1, and x$_{\textrm{n}}$ be a support vector on the margin plane y = w$^{\textrm{T}}$x + b = -1 (x$_{\textrm{1}}$ and x$_{\textrm{2}}$ are the features, as represented in the above image)
    1. w$^{\textrm{T}}$x$_{\textrm{m}}$ + b = 1 and w$^{\textrm{T}}$x$_{\textrm{n}}$ + b = -1 \
    <font color="red">don't get confused with the dimensionality of x and w here</font>: x can be 2D, i.e. $2\times1$, and w$^{\textrm{T}}$ is then a $1\times2$ vector, so that w$^{\textrm{T}}$x and the bias b are both scalars.
    2. consider a simple line equation: ax+by+c = 0, and 2 points P1(x1, y1) and P2(x2, y2), such that each of these points is a support vector of a different class
    3. this line represents the separating hyperplane
    4. d1(distance of hyperplane from P1) = **abs(**$\frac{ax1+by1+c}{\sqrt{a^2+b^2}}$**)** = abs(D1), d2(distance of hyperplane from P2) = **abs(**$\frac{ax2+by2+c}{\sqrt{a^2+b^2}}$**)** = abs(D2)
    5. we know that either (D1 > 0 and D2 < 0) or (D1 < 0 and D2 > 0), since the points lie on opposite sides of the line. Let's assume the former case, then d1+d2 = D1-D2 (since d1 = D1, d2 = -D2)
    6. hence D1-D2 = $\frac{a(x1-x2)+b(y1-y2)}{\sqrt{a^2+b^2}}$ = $\frac{w^T(X1-X2)}{||w||}$, where X1 = $\begin{bmatrix} x1 \\ y1 \end{bmatrix}$ and X2 = $\begin{bmatrix} x2 \\ y2 \end{bmatrix}$, and w = $\begin{bmatrix} a \\ b \end{bmatrix}$, which essentially makes w$^T$ = $\begin{bmatrix} a & b \end{bmatrix}$
    7. similarly, on subtracting the 2 margin-plane equations (w$^{\textrm{T}}$x$_{\textrm{m}}$ + b = 1 and w$^{\textrm{T}}$x$_{\textrm{n}}$ + b = -1) and dividing by the *norm of w*, we get: $\frac{w^T(x_m-x_n)}{||w||}$ = $\frac{2}{||w||}$
    8. the L.H.S. represents the distance between the support vectors, <font color="green">along the direction of the unit vector of w</font> (which is identical to the direction perpendicular to this hyperplane), i.e. the margin width, **and this changes as the vector w itself changes**. Since this is a quantity which we want to **maximise**, we need to maximise the R.H.S. as well, i.e. <font color="red">minimise ||w||</font>.
6. for a point satisfying the hyperplane equation:
   1. case-1: bias b = 0 
       1. in this case, w$^{\textrm{T}}$x = 0, which basically means that the vectors w and x are orthogonal(by definition)
   2. case-2: bias b $\ne$ 0
       1. consider that, in case-1, the origin is shifted by a vector a, i.e. the hyperplane through the origin is translated so that it passes through the point a
       2. hence x becomes x-a
       3. a point on the shifted hyperplane satisfies w$^{\textrm{T}}$(x-a) = 0, i.e. w$^{\textrm{T}}$x - w$^{\textrm{T}}$a = 0
       4. compare this with the general hyperplane equation w$^{\textrm{T}}$x + b = 0
       5. hence b = -w$^{\textrm{T}}$a, i.e. b fixes the offset of the hyperplane from the origin (its perpendicular distance from the origin is $\frac{|b|}{||w||}$). Shifting the optimum hyperplane parallel to itself, i.e. changing b, gives infinitely many hyperplanes that are still able to separate these linearly separable points, until they reach the support vectors
7. Hence, our optimisation function becomes:\
    max.$_{w, b}$ $\frac{2}{||w||}$, s.t. y$_i$ = $\begin{cases}+1 & w^Tx_i+b \ge 1 \\ -1 & w^Tx_i+b \le -1 \end{cases}$, which simplifies to $y_i.(w^Tx_i+b) \ge 1$ (for the 2nd case, multiply both sides by -1, i.e. y$_i$, which flips the inequality)
    1. this can be written as $\Rightarrow$ min.$_{w, b}$ $\left(\frac{||w||}{2}\right)$, subject to the same constraints
    2. $||w||$ is convex, but it contains a square-root and is not differentiable at w = 0, which makes it awkward to optimise
    3. hence, the function $\frac{||w||^2}{2}$ is used as the SVM objective function instead: it has the same minimiser, and it is a smooth, strictly convex quadratic, so together with the linear constraints it forms a quadratic program that standard convex solvers (or gradient-based methods) can drive to the optimal point.
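
The margin formula and the constraint derived above can be checked numerically. This is a minimal sketch on made-up numbers (the values of `w`, `b` and the points are assumptions chosen for easy arithmetic, not a trained model): it verifies that the distance between two support vectors, projected onto the unit vector of w, equals 2/||w||, and that every sample satisfies y.(w.x + b) >= 1.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def margin_width(w):
    """Distance between the planes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical hyperplane parameters, chosen so that ||w|| = 5.
w, b = (3.0, 4.0), -12.0

x_m = (3.0, 1.0)  # lies on w.x + b = +1  (9 + 4 - 12 = 1)
x_n = (1.0, 2.0)  # lies on w.x + b = -1  (3 + 8 - 12 = -1)

# Projection of (x_m - x_n) onto the unit vector w / ||w||.
diff = tuple(p - q for p, q in zip(x_m, x_n))
projected = dot(w, diff) / math.sqrt(dot(w, w))

# Hard-margin constraint y_i * (w.x_i + b) >= 1 for every sample.
X = [x_m, (4.0, 2.0), x_n, (0.0, 1.0)]
Y = [+1, +1, -1, -1]
feasible = all(y * (dot(w, x) + b) >= 1 for x, y in zip(X, Y))
```

Scaling `w` and `b` up by a common factor keeps the same hyperplane but shrinks `margin_width(w)`, which is why the constraint is normalised to 1 before minimising ||w||.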
    
# Incorporating deviation for misclassified samples
1. it may so happen that $\textrm{y}^{(\textrm{i})}$ = 1, but w$^{\textrm{T}}$x$^{(\textrm{i})}$+b is not $\ge$ 1, i.e. the sample lies inside the margin or, if w$^{\textrm{T}}$x$^{(\textrm{i})}$+b < 0, our SVM model has misclassified the sample.
2. hence, we need to define an error (slack) $\xi^{(\textrm{i})}$ = max(0, 1 - $\textrm{y}^{(\textrm{i})}$.(w$^{\textrm{T}}$x$^{(\textrm{i})}$+b)) for each sample i.
    1. this keeps $\xi^{(\textrm{i})} \ge 0$, hence the error-term for points that are correctly classified and lie on or beyond their margin plane is 0, and is not taken into consideration.
3. this captures the errors from the assumption that **data is linearly separable**
4. hence, now the optimisation function becomes : min.($\frac{||w||^2}{2}$ + C$\sum\limits_{i=1}^{n}\xi^{(\textrm{i})}$), for a total of n training samples, where C > 0 controls how strongly margin violations are penalised, subject to the constraints: 
$\xi^{(\textrm{i})} \ge 0$ and $\xi^{(\textrm{i})} \ge$ 1 - $\textrm{y}^{(\textrm{i})}$.(w$^{\textrm{T}}$x$^{(\textrm{i})}$+b).
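
The error term and the resulting objective can be evaluated by hand for a fixed w and b. The sketch below assumes hypothetical helper names (`slack`, `soft_margin_objective`) and toy data; it only evaluates the objective and does not optimise it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Error term: 0 when y * (w.x + b) >= 1, otherwise the size of the violation."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, X, Y, C):
    """||w||^2 / 2 + C * (sum of the error terms over all samples)."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical parameters and samples.
w, b = (1.0, 0.0), 0.0
X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.5, 1.0)]
Y = [+1, +1, -1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
# (2, 0):   beyond its margin plane          -> 0.0
# (0.5, 0): correct side, but inside margin  -> 0.5
# (-1, 0):  exactly on its margin plane      -> 0.0
# (0.5, 1): on the wrong side (misclassified)-> 1.5
objective = soft_margin_objective(w, b, X, Y, C=1.0)  # 0.5 + 1.0 * 2.0
```

A larger C makes each violation more expensive, pushing the solution towards the hard-margin case; a smaller C tolerates more violations in exchange for a wider margin.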

## Lagrangian method of multipliers
1. The idea used in the Lagrange multiplier method is that, at an optimal point, the gradient of the objective function f lines up either parallel or anti-parallel to the gradient of the constraint g. 
2. In such a case, one of the gradients is a scalar multiple of the other, i.e. $\nabla f = \lambda \nabla g$, where $\lambda$ is the Lagrange multiplier.
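
A small worked example of the gradient condition, using a problem chosen purely for illustration (it is not the SVM problem itself): minimise f(x, y) = x² + y² subject to g(x, y) = x + y - 1 = 0. The constrained minimum is at (0.5, 0.5), and the sketch checks that the two gradients are parallel there but not at another feasible point.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + y^2."""
    return (2.0 * x, 2.0 * y)


def grad_g(x, y):
    """Gradient of g(x, y) = x + y - 1 (constant everywhere)."""
    return (1.0, 1.0)


def cross_2d(u, v):
    """Zero exactly when the 2D vectors u and v are parallel or anti-parallel."""
    return u[0] * v[1] - u[1] * v[0]


# At the constrained optimum the gradients line up: grad_f = lam * grad_g.
opt = (0.5, 0.5)
cross_at_opt = cross_2d(grad_f(*opt), grad_g(*opt))
lam = grad_f(*opt)[0] / grad_g(*opt)[0]  # the Lagrange multiplier

# At another feasible point, e.g. (1, 0), they do not line up.
other = (1.0, 0.0)
cross_at_other = cross_2d(grad_f(*other), grad_g(*other))
```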

# Importance of support vectors
1. Deleting a support vector can change the position of the hyperplane, whereas deleting any other point leaves it unchanged. 
2. These are the only points that determine the hyperplane of our SVM.
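
This can be seen in the simplest possible setting. The sketch below assumes 1D, linearly separable data with every negative point to the left of every positive point; in that special case the maximum-margin boundary is just the midpoint between the two closest opposite-class points, which act as the support vectors. The function name and data are hypothetical.

```python
def max_margin_threshold_1d(neg, pos):
    """Maximum-margin boundary for 1D separable data.

    Assumes every negative point lies to the left of every positive point.
    Returns (threshold, (negative support vector, positive support vector)).
    """
    left, right = max(neg), min(pos)
    if left >= right:
        raise ValueError("classes are not separable in this orientation")
    return (left + right) / 2.0, (left, right)


neg = [0.0, 1.0, 2.0]
pos = [4.0, 6.0, 7.0]
threshold, support = max_margin_threshold_1d(neg, pos)

# Deleting points that are not support vectors leaves the boundary unchanged.
threshold_without_others, _ = max_margin_threshold_1d([1.0, 2.0], [4.0, 6.0])

# Deleting a support vector (2.0) moves the boundary.
threshold_without_sv, _ = max_margin_threshold_1d([0.0, 1.0], pos)
```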


# sklearn.svm.SVC
1. The fit time scales at least quadratically with the number of samples and may be <font color="red">impractical beyond tens of thousands of samples</font>. 
2. For **large datasets** consider using [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) or [sklearn.linear_model.SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) instead, possibly after a [sklearn.kernel_approximation.Nystroem](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.Nystroem.html#sklearn.kernel_approximation.Nystroem) transformer.


    
# SVM kernels
1. used to map a low dimensional feature-space to a higher dimensional one
2. primarily used so that data which is not linearly separable in the original space becomes linearly separable in the new one
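
A minimal sketch of this idea, using an explicit feature map rather than a real kernel function (the map φ(x) = (x, x²), the data, and the helper names are all assumptions for illustration): 1D points that no single threshold can split become linearly separable after the mapping. In practice a kernel computes inner products in the higher dimensional space without building the mapped features explicitly.

```python
def separable_1d(xs, ys):
    """True if some threshold t and orientation s classify every point correctly."""
    pts = sorted(xs)
    candidates = [pts[0] - 1.0, pts[-1] + 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    for t in candidates:
        for s in (+1, -1):
            if all(y * s * (x - t) > 0 for x, y in zip(xs, ys)):
                return True
    return False


def phi(x):
    """Hypothetical feature map from 1D to 2D: x -> (x, x^2)."""
    return (x, x * x)


# Inner points are +1, outer points are -1: no single threshold separates them.
xs = [-3.0, -2.0, -0.5, 0.5, 2.0, 3.0]
ys = [-1, -1, +1, +1, -1, -1]
separable_before = separable_1d(xs, ys)

# In the mapped space the line 2 - x2 = 0 (w = (0, -1), b = 2) separates them.
w, b = (0.0, -1.0), 2.0
mapped = [phi(x) for x in xs]
separable_after = all(
    y * (w[0] * z[0] + w[1] * z[1] + b) > 0 for z, y in zip(mapped, ys)
)
```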