# Non-linear classification

<hr>

**Feature transformation for non-linearly separable spaces**<br>

We can adapt linear classifiers to such data by mapping every example $x \in \mathbb{R}^d$ to a higher-dimensional feature vector $\phi(x) \in \mathbb{R}^p$, where $p > d$, chosen so that the training examples become linearly separable in the higher-dimensional space.

This mapping, $\phi(x)$, can be constructed in several ways. One way is to add polynomial terms of the original $d$ dimensions; another is to add interaction terms between the dimensions.

For example, given $X \in \mathbb{R}^2$

$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \in \mathbb{R}^2$

$\phi(X) = \begin{bmatrix} X_1 \\ X_2 \\ X_1^2 \\ X_2^2 \\ X_1 X_2 \end{bmatrix} \in \mathbb{R}^5$

<img alt="Feature Transformation" src="assets/non_linear_classification.png" width="800">
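
A minimal Python sketch of this mapping is shown below. The XOR-style data set and the hand-picked weight vector `theta` are illustrative assumptions, not part of the notes above. They show that labels which no line in $\mathbb{R}^2$ can separate become linearly separable once the interaction term $X_1 X_2$ is available.

```python
def phi(x):
    """Map a 2-D point (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]


# Assumed toy data: label +1 when the two coordinates have opposite signs.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [-1, -1, 1, 1]

# Hypothetical weights chosen by hand: only the interaction term is used.
theta = [0, 0, 0, 0, -1]


def predict(x):
    """Linear classifier in the transformed space: sign(theta . phi(x))."""
    score = sum(t * f for t, f in zip(theta, phi(x)))
    return 1 if score > 0 else -1


predictions = [predict(p) for p in points]
print(predictions == labels)
```

The classifier is still linear in $\phi(X)$; only the feature vector has changed.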

<hr>

**Non-linear Regression**

This feature transformation could easily be applied to transform linear regression to non-linear regression as well. 

Adding more polynomial terms can lead to overfitting: the model fits the training examples perfectly but fails to generalize. To prevent this, use cross-validation (*leave-one-out or k-fold*) to choose the $\phi(x)$ that minimizes the validation error.

<img alt="Adding Polynomial Terms to Linear Regression" src="assets/non_linear_regression.png" width="400">
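
The selection procedure can be sketched as follows. This is an illustrative example only: the synthetic quadratic data, the helper names (`fit`, `solve`, `loocv_error`), the tiny ridge term, and the range of candidate degrees are all assumptions, and the least-squares solver is written in plain Python just to keep the block self-contained.

```python
import random


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit(xs, ys, degree, ridge=1e-8):
    """Least squares on phi(x) = (1, x, ..., x^degree) via the normal equations.

    The small ridge term is an assumption added for numerical stability.
    """
    Phi = [[x ** k for k in range(degree + 1)] for x in xs]
    p = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(p)]
    return solve(A, b)


def predict(w, x):
    return sum(w_k * x ** k for k, w_k in enumerate(w))


def loocv_error(xs, ys, degree):
    """Leave-one-out mean squared validation error for one polynomial degree."""
    total = 0.0
    for i in range(len(xs)):
        w = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], degree)
        total += (predict(w, xs[i]) - ys[i]) ** 2
    return total / len(xs)


# Assumed synthetic data: a noisy quadratic on [-1, 1].
random.seed(0)
xs = [i / 10 - 1 for i in range(21)]
ys = [1 + 2 * x - 3 * x * x + random.gauss(0, 0.1) for x in xs]

errors = {d: loocv_error(xs, ys, d) for d in range(1, 7)}
best_degree = min(errors, key=errors.get)
print(best_degree, errors)
```

A straight line (degree 1) underfits this data, so its validation error is much larger than that of the quadratic; higher degrees are kept only if they lower the validation error further.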

<hr>

**Kernels for computational efficiency**

Applying polynomial features in classifiers or regressions can result in a very high-dimensional feature space, which may be computationally expensive. For example, for $x \in \mathbb{R}^d$, including all polynomial terms up to order 3 results in $d + \binom{d+2-1}{2} + \binom{d+3-1}{3}$ dimensions.
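
To get a feel for how quickly this grows, the count can be evaluated directly. The function name `num_poly_features` is a hypothetical helper; it sums the number of monomials of each degree from 1 up to the chosen order, which matches the expression above for order 3.

```python
from math import comb


def num_poly_features(d, order):
    """Number of monomials of degree 1..order in d variables."""
    return sum(comb(d + k - 1, k) for k in range(1, order + 1))


print(num_poly_features(2, 2))    # the R^2 -> R^5 example: 2 + 3 = 5
print(num_poly_features(100, 3))  # 100 + 5050 + 171700 = 176850
```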

*Perceptron algorithm*:

- Set $\theta = 0$
- Run through $i = 1, \dots, n$
  - If $y^{(i)} \theta \cdot \phi(x^{(i)}) \leq 0$, then update $\theta \leftarrow \theta + y^{(i)} \cdot \phi(x^{(i)})$
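
The steps above can be sketched in Python as follows. The feature map, the toy data, and the `epochs` parameter (repeating the pass over the data several times) are assumptions made for illustration; the algorithm as listed describes a single pass.

```python
def phi(x):
    """Assumed feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perceptron(xs, ys, epochs=10):
    """Perceptron on explicit features phi(x), starting from theta = 0."""
    theta = [0.0] * len(phi(xs[0]))
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            f = phi(x)
            if y * dot(theta, f) <= 0:                      # mistake
                theta = [t + y * fi for t, fi in zip(theta, f)]
    return theta


# Assumed toy data, not linearly separable in the original 2-D space.
xs = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
ys = [-1, -1, 1, 1]

theta = perceptron(xs, ys)
print(all(y * dot(theta, phi(x)) > 0 for x, y in zip(xs, ys)))
```

Note that every update works with the full vector $\phi(x^{(i)})$, which is the cost the kernel formulation avoids.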

Since $\theta$ starts at 0 and every update adds $y^{(j)} \cdot \phi(x^{(j)})$ for some misclassified example $j$, the final $\theta$ is a sum of these terms, each weighted by the number of mistakes made on that example:

$\theta = \sum_{j = 1}^{n} \alpha_j \cdot y^{(j)} \cdot \phi(x^{(j)})$, where $\alpha_j$ is the number of times example $j$ was misclassified (a 0/1 indicator when the data is passed through only once)

If we take the dot product of both sides with $\phi(x^{(i)})$, then

$\theta \cdot \phi(x^{(i)}) = \sum_{j = 1}^{n} \alpha_j \cdot y^{(j)} \cdot \phi(x^{(j)}) \cdot \phi(x^{(i)})$

where $\phi(x^{(j)}) \cdot \phi(x^{(i)})$ is the kernel function, $k(x^{(j)}, x^{(i)})$

$\therefore$ Instead of checking $y^{(i)} \theta \cdot \phi(x^{(i)}) \leq 0$ in the 3rd step, we can check the equivalent condition $y^{(i)} \cdot \sum_{j = 1}^{n} \alpha_j \cdot y^{(j)} \cdot k(x^{(j)}, x^{(i)}) \leq 0$ and, on a mistake, update $\alpha_i \leftarrow \alpha_i + 1$ instead of updating $\theta$. The feature vectors $\phi(x)$ then never need to be computed explicitly, only the kernel values.

*Kernel Perceptron algorithm*:

- Initialize $\alpha_1, \alpha_2, \dots, \alpha_n = 0$
- For $t = 1, \dots, T$
  - For $i = 1, \dots, n$
    - If $y^{(i)} \cdot \sum_{j = 1}^{n} \alpha_j \cdot y^{(j)} \cdot k(x^{(j)}, x^{(i)}) \leq 0$, update $\alpha_i \leftarrow \alpha_i + 1$
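
A minimal Python sketch of the kernel perceptron follows. The quadratic kernel $(1 + x \cdot x')^2$, the toy data, and the default `T=10` are assumptions chosen for illustration; any valid kernel function could be passed in instead. The mistake test uses only kernel evaluations, so $\phi$ is never formed.

```python
def quadratic_kernel(x, z):
    """Assumed example kernel: (1 + x.z)^2."""
    return (1 + sum(a * b for a, b in zip(x, z))) ** 2


def kernel_perceptron(xs, ys, kernel, T=10):
    """Return mistake counts alpha, one per training example."""
    n = len(xs)
    alpha = [0] * n
    for _ in range(T):
        for i in range(n):
            score = sum(alpha[j] * ys[j] * kernel(xs[j], xs[i]) for j in range(n))
            if ys[i] * score <= 0:      # mistake on example i
                alpha[i] += 1
    return alpha


def predict(x, xs, ys, alpha, kernel):
    """Classify a new point using only kernel values against training points."""
    score = sum(a * y * kernel(xj, x) for a, y, xj in zip(alpha, ys, xs))
    return 1 if score > 0 else -1


# Assumed toy data, not linearly separable in the original 2-D space.
xs = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
ys = [-1, -1, 1, 1]

alpha = kernel_perceptron(xs, ys, quadratic_kernel)
predictions = [predict(x, xs, ys, alpha, quadratic_kernel) for x in xs]
print(predictions == ys)
```

Each prediction costs $n$ kernel evaluations, independent of the dimension of the implicit feature space.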

****

**Feature engineering with kernels**

Composition rules of kernels:

1. $K(x, x') = 1$ is a kernel function
2. Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be any function and let $K(x, x')$ be a kernel. Then $\tilde K(x, x') = f(x) K(x, x') f(x')$ is also a kernel, with feature map $\tilde \phi(x) = f(x) \cdot \phi(x)$
3. $K(x, x') = K_1(x, x') + K_2(x, x')$ is also a kernel, given $K_1, K_2$ are kernels
4. $K(x, x') = K_1(x, x') \cdot K_2(x, x')$ is also a kernel, given $K_1, K_2$ are kernels
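
As a worked example of how these rules combine, the quadratic kernel $(1 + x \cdot x')^2$ can be built up step by step. This sketch assumes one fact not listed above: the linear kernel $K_{\text{lin}}(x, x') = x \cdot x'$ is a kernel, with feature map $\phi(x) = x$.

$K_a(x, x') = 1 + x \cdot x'$ is a kernel by rules 1 and 3

$K_b(x, x') = K_a(x, x') \cdot K_a(x, x') = (1 + x \cdot x')^2$ is a kernel by rule 4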

****

**Kernels and classifiers**

*Radial Basis Function (RBF) Kernel*

$K(x, x') = \exp (-\frac{1}{2} \Vert x - x' \Vert^2)$
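
The formula translates directly into code. The function name `rbf_kernel` is a hypothetical helper, and points are assumed to be plain sequences of numbers of equal length.

```python
import math


def rbf_kernel(x, z):
    """K(x, z) = exp(-0.5 * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-0.5 * sq_dist)


print(rbf_kernel((0, 0), (0, 0)))   # identical points give 1.0
print(rbf_kernel((0, 0), (3, 4)))   # distant points give a value close to 0
```

The value depends only on the distance between the two points: it equals 1 when they coincide and decays towards 0 as they move apart.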

*Random forest classifier*

Procedure:
- Draw a bootstrap sample (sampling with replacement) from the training set
- Build a (randomized) decision tree on that sample
- Repeat for N iterations
- Average the N predictions (ensemble)
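
The procedure can be sketched as below. To keep the example short, each "tree" is a depth-1 decision stump on one randomly chosen feature, which is a deliberate simplification and an assumption of this sketch, as are the toy data, the helper names, and `n_trees=25`. With labels in $\{-1, +1\}$, averaging the predictions and taking the sign is the same as a majority vote.

```python
import random


def fit_stump(xs, ys, rng):
    """Depth-1 tree: random feature, threshold with the fewest training errors."""
    f = rng.randrange(len(xs[0]))                       # randomization step
    values = sorted({x[f] for x in xs})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for t in thresholds:
        for sign in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys)
                         if (sign if x[f] > t else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, f, t, sign)
    return best[1:]


def stump_predict(stump, x):
    f, t, sign = stump
    return sign if x[f] > t else -sign


def fit_forest(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], rng))
    return forest


def forest_predict(forest, x):
    """Sign of the average prediction (majority vote over the trees)."""
    votes = sum(stump_predict(s, x) for s in forest)
    return 1 if votes > 0 else -1


# Assumed toy data: two well-separated groups.
xs = [(0, 1), (1, 0), (2, 3), (3, 2), (10, 11), (11, 10), (12, 13), (13, 12)]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]

forest = fit_forest(xs, ys)
print([forest_predict(forest, x) for x in xs] == ys)
```

An odd number of trees avoids tied votes in this two-class setting.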

# Basic code
A `minimal, reproducible example`