# Support Vector Machine (SVM) - Training Explained

This document explains how Support Vector Machines (SVMs) are trained: the decision function, the hard-margin and soft-margin objectives, and gradient-based optimization of the primal objective, including stochastic gradient descent.

## 1. Decision Function

For binary classification, the SVM decision function is:

$$f(x) = w^\top x + b$$

Where:

- $w$: Weight vector
- $b$: Bias (intercept)
- $x$: Input feature vector

The prediction is given by:

$$\hat{y} = \text{sign}(f(x)) = \text{sign}(w^\top x + b)$$
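
As a minimal sketch of the two formulas above, the following NumPy snippet computes $f(x)$ and $\hat{y}$ for a few points. The function names and the example values of `w`, `b`, and `X` are illustrative assumptions, and mapping a score of exactly zero to $+1$ is a convention chosen here, not something the formulas prescribe.

```python
import numpy as np

def decision_function(X, w, b):
    """Return f(x) = w^T x + b for every row x of X."""
    return X @ w + b

def predict(X, w, b):
    """Return labels in {-1, +1}; a score of exactly 0 is mapped to +1 (assumed convention)."""
    return np.where(decision_function(X, w, b) >= 0, 1, -1)

# Hypothetical 2-D weights, bias, and inputs
w = np.array([2.0, -1.0])
b = -0.5
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.25, 0.0]])

scores = decision_function(X, w, b)  # [ 1.5, -1.5,  0.0]
labels = predict(X, w, b)            # [ 1, -1,  1]
```

The magnitude of a score reflects how far the point lies from the hyperplane (scaled by $\|w\|$); only its sign determines the predicted class.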

## 2. Objective Function (Hard Margin)

If the data is linearly separable, the SVM finds the separating hyperplane with the maximum margin. The margin width is $2 / \|w\|$, so maximizing it is equivalent to minimizing $\|w\|^2$:

$$\min_{w, b} \quad \frac{1}{2} \|w\|^2$$

Subject to:

$$y_i(w^\top x_i + b) \geq 1 \quad \forall i$$
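
To make the constraint concrete, here is a small sketch that checks whether a candidate $(w, b)$ is feasible for the hard-margin problem and reports the resulting margin width, using the standard result that the margin equals $2 / \|w\|$. The helper names and the toy data are assumptions made for illustration.

```python
import numpy as np

def satisfies_hard_margin(X, y, w, b):
    """True if y_i (w^T x_i + b) >= 1 holds for every sample."""
    return bool(np.all(y * (X @ w + b) >= 1))

def margin_width(w):
    """Distance between the hyperplanes f(x) = +1 and f(x) = -1."""
    return 2.0 / np.linalg.norm(w)

# Hypothetical separable toy data
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w_ok, b_ok = np.array([1.0, 0.0]), -1.0    # every constraint holds
w_bad, b_bad = np.array([0.5, 0.0]), -0.5  # smaller norm, but constraints are violated

feasible_ok = satisfies_hard_margin(X, y, w_ok, b_ok)     # True
feasible_bad = satisfies_hard_margin(X, y, w_bad, b_bad)  # False
width_ok = margin_width(w_ok)                             # 2.0
```

The second candidate has a smaller norm, which would mean a wider margin, but it is not a valid solution because some points fall inside that margin.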

## 3. Objective Function (Soft Margin)

When data is not perfectly separable, we introduce slack variables $\xi_i$ to allow margin violations:

$$\min_{w, b, \xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

Subject to:

$$y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
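
For a fixed $(w, b)$, the smallest slack that satisfies both constraints is $\xi_i = \max(0, 1 - y_i(w^\top x_i + b))$. The sketch below computes these values and the soft-margin objective; the function names, the toy data, and the choice $C = 1$ are illustrative assumptions.

```python
import numpy as np

def slack(X, y, w, b):
    """Smallest feasible slack per sample: max(0, 1 - y_i f(x_i))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(X, y, w, b, C):
    """0.5 * ||w||^2 + C * sum of slacks."""
    return 0.5 * (w @ w) + C * slack(X, y, w, b).sum()

# Hypothetical 2-D data; only the first coordinate matters for this w
X = np.array([[2.0, 0.0], [1.5, 0.0], [0.5, 0.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1])
w, b = np.array([1.0, 0.0]), -1.0

xi = slack(X, y, w, b)                           # [0.0, 0.5, 1.5, 0.0]
value = soft_margin_objective(X, y, w, b, C=1.0)  # 0.5 + 2.0 = 2.5
```

A slack of 0 means the point is on or outside the margin, a value in $(0, 1]$ means it lies inside the margin but on the correct side, and a value above 1 means it is misclassified.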

## 4. Gradient Descent Update Rule (Primal Form)

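Eliminating the slack variables (at the optimum, $\xi_i = \max(0, 1 - y_i(w^\top x_i + b))$) turns the soft-margin problem into an unconstrained objective with the hinge loss: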

$$L(w, b) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(w^\top x_i + b))$$

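The hinge loss is not differentiable at its kink, so we use a subgradient. For a single sample $(x_i, y_i)$, the regularization term plus that sample's hinge term gives:
- If $y_i(w^\top x_i + b) \geq 1$ (margin satisfied): $\nabla_w = w$, $\nabla_b = 0$
- If $y_i(w^\top x_i + b) < 1$ (margin violated): $\nabla_w = w - C y_i x_i$, $\nabla_b = -C y_i$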
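
Summing the hinge terms over all samples gives a batch subgradient of $L(w, b)$: the regularization term contributes $w$ once, and each margin-violating sample contributes $-C y_i x_i$ to $\nabla_w$ and $-C y_i$ to $\nabla_b$. The following is a minimal sketch of that computation followed by one gradient step; the function name, the toy data, and the values of `C` and `lr` are assumptions.

```python
import numpy as np

def objective_and_subgradient(X, y, w, b, C):
    """Return L(w, b) and a subgradient (grad_w, grad_b) over the whole dataset."""
    margins = y * (X @ w + b)
    violating = margins < 1  # samples with a non-zero hinge term
    loss = 0.5 * (w @ w) + C * np.maximum(0.0, 1.0 - margins).sum()
    grad_w = w - C * (y[violating, None] * X[violating]).sum(axis=0)
    grad_b = -C * y[violating].sum()
    return loss, grad_w, grad_b

# Hypothetical toy data
X = np.array([[2.0, 0.0], [1.5, 0.0], [0.5, 0.0], [0.0, 0.0]])
y = np.array([1, 1, 1, -1])
w, b = np.array([1.0, 0.0]), -1.0
C, lr = 1.0, 0.1

loss, grad_w, grad_b = objective_and_subgradient(X, y, w, b, C)
# loss = 2.5, grad_w = [-1, 0], grad_b = -2 (samples 2 and 3 violate the margin)

# One (sub)gradient descent step
w_new = w - lr * grad_w
b_new = b - lr * grad_b
new_loss, _, _ = objective_and_subgradient(X, y, w_new, b_new, C)
```

On this toy problem the step lowers the objective; in general, subgradient steps are not guaranteed to decrease the loss at every iteration, which is why a small or decaying learning rate is commonly used.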

## 5. Stochastic Gradient Descent (SGD)

Instead of computing the gradient over the whole dataset, we update $w$ and $b$ using one sample at a time.

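When a sample violates the margin, i.e. $y_i(w^\top x_i + b) < 1$ (this includes misclassified points), the update combines the L2 regularization gradient with the hinge loss gradient. The code writes the regularizer as $\lambda \|w\|^2$ (`lambda_param`) rather than $\frac{1}{2}\|w\|^2$ with a factor $C$ on the loss; this is a rescaled form of the same trade-off, where a larger $\lambda$ corresponds to a smaller $C$: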

```python
w -= lr * (2 * lambda_param * w - y_i * x_i)  # ← L2 regularization + hinge loss gradient
```

Otherwise, only the regularization term contributes to the gradient:

```python
w -= lr * (2 * lambda_param * w)  # L2 regularization only; the hinge term is zero
```

## 6. Regularization Gradient

To penalize large weights, SVM uses **L2 regularization**:

$$\frac{1}{2} \|w\|^2$$

Its gradient w.r.t. $w$ is:

$$\nabla_w = w$$

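The code in this document writes the penalty as $\lambda \|w\|^2$ instead, whose gradient is $2 \lambda w$, so the regularization step is implemented as: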

```python
w -= lr * (2 * lambda_param * w)
```
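
As a sanity check on the factor of 2, the sketch below compares the analytic gradient $2 \lambda w$ of the penalty $\lambda \|w\|^2$ against a central finite-difference estimate, and shows that one regularization-only step shrinks every weight by the factor $(1 - 2 \cdot \text{lr} \cdot \lambda)$. The helper names and numeric values are assumptions for illustration.

```python
import numpy as np

def l2_penalty(w, lambda_param):
    """Penalty lambda * ||w||^2, matching the scaling used in the update code."""
    return lambda_param * np.dot(w, w)

def l2_penalty_grad(w, lambda_param):
    """Analytic gradient of lambda * ||w||^2."""
    return 2 * lambda_param * w

def numerical_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w = np.array([0.5, -2.0, 3.0])
lambda_param, lr = 0.01, 0.1

analytic = l2_penalty_grad(w, lambda_param)
numeric = numerical_grad(lambda v: l2_penalty(v, lambda_param), w)
gradients_match = np.allclose(analytic, numeric, atol=1e-6)

# A regularization-only step is a uniform shrink of the weights
w_next = w - lr * analytic
shrink_matches = np.allclose(w_next, (1 - 2 * lr * lambda_param) * w)
```

This uniform shrinkage is why the L2 term is often described as weight decay.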

## Summary

| Component               | Explanation                                    |
| ------------------------ | ---------------------------------------------- |
| Decision Function        | Linear combination of weights and input        |
| Objective Function       | Maximize margin, minimize classification error |
| Constraints              | Ensure proper classification with margin       |
| Gradient Descent         | Optimize weights and bias using loss gradients |
| SGD                      | Efficient training via per-sample updates      |
| Regularization Gradient  | Shrinks weights to improve generalization      |

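Putting the pieces together, one pass of SGD over the dataset:

```python
for x_i, y_i in dataset:
    if y_i * (w @ x_i + b) >= 1:
        # Margin satisfied: only the L2 regularization gradient applies
        w -= lr * (2 * lambda_param * w)
    else:
        # Margin violated: add the hinge loss gradient (-y_i * x_i for w, -y_i for b)
        w -= lr * (2 * lambda_param * w - y_i * x_i)
        b += lr * y_i
```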
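
For a complete, runnable picture, here is a minimal sketch that wraps the per-sample update in a training function and fits it on a tiny separable dataset. The function name `train_linear_svm`, the hyperparameter defaults, and the toy data are assumptions. Because the decision function is $w^\top x + b$, the hinge gradient with respect to $b$ is $-y_i$, so the bias moves by `+ lr * y_i` on a margin violation.

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lambda_param=0.01, n_epochs=200):
    """Fit w and b with per-sample (stochastic) subgradient updates.

    Labels in y must be -1 or +1. Samples are visited in their given order.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) >= 1:
                # Margin satisfied: regularization gradient only
                w -= lr * (2 * lambda_param * w)
            else:
                # Margin violated: regularization plus hinge loss gradient
                w -= lr * (2 * lambda_param * w - y_i * x_i)
                b += lr * y_i
    return w, b

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.0, 3.0], [1.5, 2.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.0, -3.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

w, b = train_linear_svm(X, y)
predictions = np.where(X @ w + b >= 0, 1, -1)
accuracy = np.mean(predictions == y)  # fraction of training points classified correctly
```

In practice the samples are usually shuffled each epoch and the learning rate is decayed over time; both are omitted here to keep the sketch close to the loop above.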