### Support Vector Machines Explained

Support Vector Machines (SVMs) are powerful supervised learning models used for classification and
  regression tasks. They are particularly effective in high-dimensional spaces and cases where the number of
   dimensions is greater than the number of samples.

### The Goal: Finding the Optimal Hyperplane

The primary goal of an SVM is to find an optimal hyperplane that best separates the data points of
  different classes in a high-dimensional space. For a binary classification problem, if the data is
  linearly separable, there can be infinitely many hyperplanes that separate the classes. SVM aims to find
  the one that maximizes the margin between the closest data points of different classes.

#### Hyperplane Equation

In an n-dimensional space, a hyperplane can be defined by the equation:

  $$w \cdot x - b = 0$$


  Where:
   - $w$ is the weight vector (normal to the hyperplane).
   - $x$ is the input data point vector.
   - $b$ is the bias (or intercept) term.
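
As a small illustration, the sketch below evaluates $w \cdot x - b$ for a few points in two dimensions. The values of `w` and `b` and the helper name `hyperplane_value` are made up for this example; they do not come from a fitted model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def hyperplane_value(w, b, x):
    """Return w . x - b: zero on the hyperplane, positive on one side, negative on the other."""
    return dot(w, x) - b


# Hypothetical 2-D hyperplane: x1 + x2 - 3 = 0
w = [1.0, 1.0]
b = 3.0

for x in ([1.0, 2.0], [3.0, 3.0], [0.0, 0.0]):
    print(x, hyperplane_value(w, b, x))
```

The first point lies on the hyperplane, while the other two fall on opposite sides of it.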

#### The Margin

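The margin is the width of the empty band between the two classes: the distance from the hyperplane to
  the closest data points of one class plus the distance to the closest data points of the other. These
  closest data points are called support vectors. SVM seeks to maximize this margin.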

  For a data point $(x_i, y_i)$, where $y_i \in \{-1, 1\}$ is the class label, the SVM requires:


   - $w \cdot x_i - b \ge 1$ if $y_i = 1$
   - $w \cdot x_i - b \le -1$ if $y_i = -1$

  These can be combined into a single inequality:

  $$y_i (w \cdot x_i - b) \ge 1$$


  The distance from a point $x_i$ to the hyperplane is given by:

  $$\text{distance} = \frac{|w \cdot x_i - b|}{||w||}$$


  The support vectors lie on the hyperplanes $w \cdot x - b = 1$ and $w \cdot x - b = -1$. The distance
  between these two hyperplanes (the margin) is:

  $$\text{Margin} = \frac{2}{||w||}$$

  To maximize the margin, we need to minimize $||w||$, which is equivalent to minimizing
  $\frac{1}{2}||w||^2$.
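
The formulas above can be checked numerically. This is a minimal sketch under assumed values: the hyperplane `w`, `b` and the toy `points` are invented for illustration, and the function names are hypothetical helpers.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(w):
    """Euclidean norm ||w||."""
    return math.sqrt(dot(w, w))


def distance_to_hyperplane(w, b, x):
    """|w . x - b| / ||w||"""
    return abs(dot(w, x) - b) / norm(w)


def margin_width(w):
    """Distance between the hyperplanes w . x - b = 1 and w . x - b = -1."""
    return 2.0 / norm(w)


# Hypothetical separating hyperplane and toy data (assumed for illustration).
w, b = [1.0, 0.0], 0.0
points = [([1.0, 0.5], 1), ([2.0, -1.0], 1), ([-1.0, 0.0], -1), ([-3.0, 2.0], -1)]

# Every point should satisfy y_i (w . x_i - b) >= 1.
feasible = all(y * (dot(w, x) - b) >= 1 for x, y in points)
print("constraints satisfied:", feasible)
print("margin:", margin_width(w))
print("distance of [1.0, 0.5]:", distance_to_hyperplane(w, b, [1.0, 0.5]))
```

In this toy setup the point `[1.0, 0.5]` sits exactly on $w \cdot x - b = 1$, so its distance to the hyperplane is half the margin.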

#### Optimization Problem (Hard Margin SVM)

 For linearly separable data, the optimization problem is to minimize $\frac{1}{2}||w||^2$ subject to the
  constraint $y_i (w \cdot x_i - b) \ge 1$ for all $i$.

  This is a convex optimization problem that can be solved using Lagrange multipliers.
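
For reference, here is a sketch of the standard Lagrangian derivation, written with this document's sign convention ($w \cdot x - b$). It introduces one multiplier $\alpha_i \ge 0$ per constraint:

$$L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i - b) - 1 \right]$$

Setting the derivatives with respect to $w$ and $b$ to zero gives:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Substituting these back into $L$ yields the dual problem, which depends on the data only through dot products $x_i \cdot x_j$:

$$\text{maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to} \quad \alpha_i \ge 0, \; \sum_{i=1}^{n} \alpha_i y_i = 0$$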

#### Soft Margin SVM (Handling Non-linearly Separable Data)

In most real-world scenarios, data is not perfectly linearly separable. To handle this, the soft margin SVM
  introduces slack variables ($\xi_i \ge 0$) and a regularization parameter ($C$).

  The constraints become:

  $$y_i (w \cdot x_i - b) \ge 1 - \xi_i$$

  The objective function is modified to:

  $$\text{minimize} \quad \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \xi_i$$


  Where:
   - $C$ is a hyperparameter that controls the trade-off between maximizing the margin and minimizing the
     classification error on the training data. A small $C$ creates a larger margin but allows more margin
     violations, while a large $C$ creates a smaller margin with fewer violations.
   - $\xi_i$ (xi) are the slack variables, measuring how far data point $x_i$ violates the margin.
     If $\xi_i = 0$, the point is correctly classified and lies on or outside the margin. If $0 < \xi_i < 1$,
     the point is correctly classified but within the margin. If $\xi_i = 1$, the point lies exactly on the
     hyperplane, and if $\xi_i > 1$, the point is misclassified.
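
For a fixed hyperplane, the smallest slack that satisfies the constraint is $\xi_i = \max(0, 1 - y_i (w \cdot x_i - b))$. The sketch below computes it for a few points and evaluates the soft margin objective for two values of $C$. The hyperplane and data are assumed toy values, and `slack` and `soft_margin_objective` are hypothetical helper names.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y (w . x - b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) - b))


def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in data)


# Hypothetical hyperplane and toy data (assumed for illustration).
w, b = [1.0, 0.0], 0.0
data = [
    ([2.0, 0.0], 1),    # outside the margin
    ([0.5, 1.0], 1),    # correct side, inside the margin
    ([-0.5, 0.0], 1),   # wrong side of the hyperplane
    ([-2.0, 1.0], -1),  # outside the margin
]

for x, y in data:
    print(x, y, slack(w, b, x, y))

print("objective (C=1):", soft_margin_objective(w, b, data, C=1.0))
print("objective (C=10):", soft_margin_objective(w, b, data, C=10.0))
```

The same two violations cost far more under the larger $C$, which is why a large $C$ pushes the optimizer toward a narrower margin with fewer violations.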

#### The Kernel Trick

SVMs can effectively perform non-linear classification using the kernel trick. This involves mapping the
  input data into a higher-dimensional feature space where it might become linearly separable. Instead of
  explicitly transforming the data, a kernel function computes the dot product of the transformed vectors
  directly from the original inputs, so the higher-dimensional coordinates never have to be computed.

Common kernel functions include:
   - Linear Kernel: $K(x_i, x_j) = x_i \cdot x_j$
   - Polynomial Kernel: $K(x_i, x_j) = (\gamma x_i \cdot x_j + r)^d$
   - Radial Basis Function (RBF) / Gaussian Kernel: $K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)$

Here $\gamma > 0$ is a scale parameter, $r$ is a constant offset, and $d$ is the polynomial degree.
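
To make the kernel trick concrete, the sketch below implements the three kernels and checks that a degree-2 polynomial kernel on 2-D inputs gives the same value as a dot product in an explicit 3-D feature space. The default parameter values and the feature map `phi` are assumptions chosen for this example (`phi` only matches the polynomial kernel with $\gamma = 1$, $r = 0$, $d = 2$).

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(x, z) + r) ** d


def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def phi(x):
    """Explicit feature map for the 2-D polynomial kernel with gamma=1, r=0, d=2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 0.5]

# Kernel value computed in the original 2-D space ...
k = polynomial_kernel(x, z)
# ... matches the dot product in the 3-D feature space (up to rounding).
explicit = dot(phi(x), phi(z))
print(k, explicit)

# The RBF kernel equals 1 for identical points and decays with distance.
print(rbf_kernel(x, x), rbf_kernel(x, z))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so evaluating the kernel directly is the only practical option.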

### Functioning (Training and Prediction)

#### Training

1. Data Preparation: Prepare your labeled training data $(X, y)$.

2. Feature Scaling: It's often beneficial to scale your features, especially for RBF kernels, as SVM is
      sensitive to the magnitude of features.

3. Kernel Selection: Choose an appropriate kernel function (e.g., linear, RBF, polynomial) based on the
      nature of your data.

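4. Optimization: The SVM algorithm solves the optimization problem (either hard or soft margin) to find the
      optimal weight vector $w$ and bias $b$. In the dual formulation, this means finding the Lagrange
      multipliers $\alpha_i$, one per training point, from which $w = \sum_i \alpha_i y_i x_i$ follows in the
      linear case.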

5. Support Vectors Identification: During optimization, only the data points that lie on or within the
      margin (i.e., the support vectors) will have non-zero $\alpha_i$ values. These are the critical points
      that define the hyperplane.
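
The steps above describe solving for the multipliers $\alpha_i$. The sketch below deliberately swaps in a simpler technique: full-batch subgradient descent on the primal soft margin objective with a linear kernel. It therefore produces $w$ and $b$ directly and never computes $\alpha_i$ or identifies support vectors. The toy data, the learning rate `lr`, the epoch count, and all function names are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def standardize(X):
    """Scale each feature to zero mean and unit variance; also return the statistics."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(d)]
    stds = [s if s > 0 else 1.0 for s in stds]  # avoid division by zero for constant features
    scaled = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]
    return scaled, means, stds


def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y_i (w . x_i - b))) by subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) - b) < 1:  # point violates the margin
                grad_w = [g - C * yi * xij for g, xij in zip(grad_w, xi)]
                grad_b += C * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if dot(w, x) - b >= 0 else -1


# Toy, linearly separable data (made up for this example).
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]

X_scaled, means, stds = standardize(X)
w, b = train_linear_svm(X_scaled, y)
print("w =", w, "b =", b)
print([predict(w, b, x) for x in X_scaled])
```

A new point would need to be scaled with the same `means` and `stds` before calling `predict`. Solvers that work on the dual problem are what make kernels and support vector identification possible.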

#### Prediction

To classify a new data point $x_{new}$:


1. Calculate the decision function:
    $$f(x_{new}) = \text{sgn}(w \cdot x_{new} - b)$$
    or, in terms of support vectors and kernel function:
    $$f(x_{new}) = \text{sgn}\left(\sum_{i \in SV} \alpha_i y_i K(x_i, x_{new}) - b\right)$$
    Where $SV$ denotes the set of support vectors.
2. The value of $f(x_{new})$, either +1 or -1, is the predicted class label.
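
The support-vector form of the decision function can be sketched as follows, using the $w \cdot x - b$ sign convention from the hyperplane equation. The model here is hand-built rather than produced by a solver: the support vectors, the `alphas`, and `b` are assumed values chosen so that $w = \sum_i \alpha_i y_i x_i = [1, 0]$, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(x, z):
    return dot(x, z)


def decision_value(x_new, support_vectors, alphas, labels, b, kernel):
    """sum_i alpha_i * y_i * K(x_i, x_new) - b over the support vectors."""
    total = sum(a * y * kernel(sv, x_new)
                for sv, a, y in zip(support_vectors, alphas, labels))
    return total - b


def predict(x_new, support_vectors, alphas, labels, b, kernel):
    value = decision_value(x_new, support_vectors, alphas, labels, b, kernel)
    return 1 if value >= 0 else -1


# Hand-built model (assumed, not the output of a real solver):
# two support vectors with alpha = 0.5 give w = [1, 0] and b = 0.
support_vectors = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

for x_new in ([2.0, 3.0], [-0.5, 1.0]):
    print(x_new, predict(x_new, support_vectors, alphas, labels, b, linear_kernel))
```

Because only the support vectors enter the sum, prediction cost depends on their number rather than on the size of the full training set. Passing a different kernel function changes the decision boundary without changing this code.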

### Summary

SVMs are powerful due to their ability to handle high-dimensional data, their strong theoretical
  foundation (maximizing margin), and their flexibility with the kernel trick to model non-linear
  relationships. The choice of kernel and the regularization parameter $C$ are crucial for optimal
  performance.