# Linear Classifiers & Perceptrons

<hr>

**A soft introduction to Machine Learning (ML): *Supervised Learning***<br>

*Example: Movie Recommender Problem*

Given a series of movies with ratings ($-1$ or $+1$, *dislike/like* for simplicity), build a feature vector ($X \in \mathbb {R}^d$) for each movie:

$x^{(1)} = \begin{bmatrix} 1 & 0 & 1 & 0 & \dots \end{bmatrix}^T$ describes the movie, where each binary entry indicates whether the movie has a given characteristic, e.g. *action, romance, Spielberg-directed*

$y^{(1)} = \begin{bmatrix} 1 \end{bmatrix}^T$, where $y \in \{-1, +1\}$, is the rating associated with the given movie

Given these feature vectors and their associated ratings, the ML model estimates the parameters that map the feature vectors to the ratings. A new movie comes with a new feature vector, and the model uses the estimated parameters to predict that movie's rating.
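
To make the setup concrete, here is a minimal sketch of how such a training set might be encoded. The feature names, the `featurize` helper, and the two movies are made up for illustration; only the binary-feature and $\pm 1$ rating conventions come from the text above.

```python
# Hypothetical feature names, loosely following the examples in the text.
FEATURES = ["action", "romance", "spielberg_directed"]


def featurize(traits):
    """Return a binary feature vector: 1 if the movie has the trait, else 0."""
    return [1 if name in traits else 0 for name in FEATURES]


# Made-up training data: (traits of the movie, rating in {-1, +1}).
training_movies = [
    ({"action", "spielberg_directed"}, +1),
    ({"romance"}, -1),
]

X = [featurize(traits) for traits, _ in training_movies]
y = [rating for _, rating in training_movies]

print(X)  # [[1, 0, 1], [0, 1, 0]]
print(y)  # [1, -1]
```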

<hr>

**Linear Classifiers**

A linear classifier is a function, $h(\mathbb{X})$, that maps a feature vector, $\mathbb{X}$, to its associated label, $\mathbb{Y}$. The classifier uses a hyperplane to divide the feature space, where the training data lie, into two half-spaces. Given a point $x$, the classifier $h(x)$ outputs a label according to which side of the hyperplane $x$ falls on.

To evaluate how well a linear classifier performs, we compute the training error on each example, $E_i (h) = \mathbb{1} (h(x^{(i)}) \neq y^{(i)})$, which is $1$ if the prediction is wrong and $0$ otherwise. The fraction of training errors is therefore $E_n(h) = \frac{1}{n} \sum_{i=1}^{n} E_i (h)$
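
The training error can be computed directly from its definition. The sketch below assumes a classifier is any Python function from a feature vector to a label; the toy classifier `h_first_feature` and the data are invented for illustration.

```python
def training_error(h, xs, ys):
    """Fraction of training examples where h(x) disagrees with the label y."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


# Toy classifier (an assumption for this example): predict +1 when the
# first feature is set, otherwise -1.
def h_first_feature(x):
    return 1 if x[0] == 1 else -1


xs = [[1, 0], [0, 1], [1, 1], [0, 0]]
ys = [1, -1, -1, -1]

# Only the third example is misclassified, so the error is 1/4.
print(training_error(h_first_feature, xs, ys))  # 0.25
```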

The set of candidate classifiers, $\mathbb{H}$, is called the hypothesis space; it contains all possible classifiers, $h \in \mathbb{H}$, for the given task.

A perceptron algorithm, $\hat h = \mathbb{A} (S_n, \mathbb{H})$, where $S_n$ is the training data, returns a classifier ($\hat h$) from the hypothesis space ($\mathbb{H}$) that best fits the training data.

The mathematical understanding of linear classifiers is as follows:

A linear decision boundary (a *hyperplane*) can be as simple as the set of points $X$ satisfying $\theta_1 X_1 + \theta_2 X_2 = 0$, or, in vector form, $\theta^T X = 0$, where $\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$ and $X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$. From here, we can observe that $\theta$ is orthogonal to every $X$ on this hyperplane, which passes through the origin.

A linear classifier whose decision boundary passes through the *origin* is expressed as $h(X; \theta) = \text{sign}(\theta^T X)$, where $\theta \in \mathbb{R}^d$; that is, it outputs $+1$ if $\theta^T X > 0$ and $-1$ otherwise.

To generalize to classifiers whose boundary does not pass through the origin, add an offset term, $\theta_0$, that shifts the hyperplane away from the origin. With $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}$ and $X = \begin{bmatrix} 1 \\ X_1 \\ X_2 \end{bmatrix}$, the boundary is $\theta^T X = \theta_0 + \theta_1 X_1 + \theta_2 X_2 = 0$, and in this augmented form $\theta$ continues to be orthogonal to every $X$ on the boundary.
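
As a rough sketch of the two equivalent forms, the code below implements the classifier once with an explicit offset and once with the offset folded into an augmented parameter vector. It assumes the $\pm 1$ label convention used for the ratings, and the parameter values are arbitrary.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(x, theta, theta_0=0.0):
    """Return +1 if theta . x + theta_0 > 0, otherwise -1."""
    return 1 if dot(theta, x) + theta_0 > 0 else -1


def predict_augmented(x, theta_aug):
    """Same classifier with theta_aug = [theta_0, theta_1, ...] and x prefixed by 1."""
    return 1 if dot(theta_aug, [1.0] + list(x)) > 0 else -1


# Arbitrary parameters chosen for illustration.
theta, theta_0 = [1.0, -1.0], -0.5
theta_aug = [theta_0] + theta

print(predict([2.0, 0.0], theta, theta_0))       # 1   (score  1.5)
print(predict([0.0, 1.0], theta, theta_0))       # -1  (score -1.5)
print(predict_augmented([2.0, 0.0], theta_aug))  # 1

# A point on the boundary: the augmented vectors are orthogonal.
print(dot(theta_aug, [1.0, 1.0, 0.5]))           # 0.0
```

Folding $\theta_0$ into the parameter vector is convenient because the through-origin formulas then apply unchanged to the augmented vectors.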

<img alt="Linear Classifier" src="assets/linear_classifier.png" width="400">

A full definition of linear classifiers is based on a clean separation of the observations, as follows:

Training examples $S_n = \{(x^{(i)}, y^{(i)}), i = 1, \dots, n\}$ are *linearly separable* if there exists a parameter vector $\hat \theta$ and offset parameter $\hat \theta_0$ such that $y^{(i)} \cdot (\hat \theta \cdot x^{(i)} + \hat \theta_0) > 0$ for all $i = 1, \dots, n$

Because the label $y^{(i)}$ multiplies the classifier's score in this condition, a label of $0$ would make the product $0$, so the condition could never hold. For this reason $0$ should not be used as a label.
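
The condition can be checked for any candidate parameters, as in the sketch below. Note that this only tests whether one given pair $(\theta, \theta_0)$ separates the data; it does not search for such a pair. The function name `separates` and the data points are assumptions for illustration.

```python
def separates(theta, theta_0, xs, ys):
    """True if y * (theta . x + theta_0) > 0 for every training example."""
    return all(
        y * (sum(t * xi for t, xi in zip(theta, x)) + theta_0) > 0
        for x, y in zip(xs, ys)
    )


# Made-up 2-D training set.
xs = [[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
ys = [1, 1, -1, -1]

print(separates([1.0, 1.0], 0.0, xs, ys))   # True
print(separates([1.0, -1.0], 0.0, xs, ys))  # False: first point lies on the boundary

# A label of 0 zeroes the product, so the strict inequality can never hold.
print(separates([1.0, 1.0], 0.0, [[2.0, 2.0]], [0]))  # False
```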

<hr>

**The Perceptron Algorithm**

$E_n (\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1} (y^{(i)} \cdot (\theta \cdot x^{(i)} + \theta_0) \leq 0)$

Procedure:
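1. Start with zero parameters, $\theta = 0$ (vector) and $\theta_0 = 0$
2. For $i = 1, \dots, n$, do:
    - If the prediction disagrees with the label, $y^{(i)}(\theta \cdot x^{(i)} + \theta_0) \leq 0$
    - Then update the parameters, $\theta = \theta + y^{(i)}x^{(i)}$ and $\theta_0 = \theta_0 + y^{(i)}$ (for a classifier through the origin, drop $\theta_0$)
    - Each update increases $y^{(i)} (\theta \cdot x^{(i)} + \theta_0)$ for the misclassified example, moving it toward the correct side of the boundary; the overall training error does not necessarily decrease at every step
        - $y^{(i)} ((\theta + y^{(i)}x^{(i)}) \cdot x^{(i)} + \theta_0 + y^{(i)}) - y^{(i)} (\theta \cdot x^{(i)} + \theta_0) > 0$
        - The left-hand side simplifies to $(y^{(i)})^2 \cdot (\lVert x^{(i)} \rVert^2 + 1) = \lVert x^{(i)} \rVert^2 + 1$, which is always positive
3. Repeat step 2 for up to $T$ passes over the training set, stopping early once a full pass makes no update
4. If the training data are linearly separable, then for a sufficiently large $T$ the algorithm converges to parameters that classify every training example correctly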
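
The procedure above can be sketched in plain Python as follows. This is an illustrative version with an offset term, not a reference implementation; the function name `perceptron`, the default `T=100`, the returned `converged` flag, and the data are all assumptions.

```python
def perceptron(xs, ys, T=100):
    """Run the perceptron with offset for up to T passes over the data.

    Returns (theta, theta_0, converged), where converged is True if a full
    pass over the training set completed without any update.
    """
    theta = [0.0] * len(xs[0])
    theta_0 = 0.0
    for _ in range(T):
        updated = False
        for x, y in zip(xs, ys):
            score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
            if y * score <= 0:  # misclassified, or exactly on the boundary
                theta = [t + y * xi for t, xi in zip(theta, x)]
                theta_0 += y
                updated = True
        if not updated:
            return theta, theta_0, True
    return theta, theta_0, False


# Made-up, linearly separable 2-D training set.
xs = [[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
ys = [1, 1, -1, -1]

theta, theta_0, converged = perceptron(xs, ys)
print(theta, theta_0, converged)  # [2.0, 2.0] 1.0 True
```

On this toy set a single update on the first example is enough. When the data are not linearly separable, the loop simply stops after $T$ passes and reports `converged` as `False`.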

<img alt="Perceptron" src="assets/perceptron.png" width="400">

<hr>

# Basic code
A `minimal, reproducible example`