# Linear classification

## Linear discriminant analysis (LDA)

Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant that finds a linear combination of features that discriminates between samples from two or more classes. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.

### Fisher's linear discriminant with equal class covariance

This geometric method makes no probabilistic assumptions; it relies only on distances. It looks for the linear projection direction $\mathbf{w}$ that maximizes the ratio of between-class to within-class variance, denoted $F(w)$. It should be considered a pedagogical, exploratory method. However, under a few assumptions it gives the same results as LDA.

Suppose two classes ($C_0, C_1$) of observations have means $\mu_0$, $\mu_1$ and the same covariance. The total within-class scatter matrix $S_W$ (an unnormalized "covariance" matrix) is given by:
\begin{align}
S_W &= \sum_{i\in C_0} (x_i - \mu_0)(x_i - \mu_0)^T + \sum_{j\in C_1} (x_j - \mu_1)(x_j -\mu_1)^T\\
    &= X_c^T X_c
\end{align}

Here, $X_c$ is the $(N \times P)$ matrix of data centered on their respective class means:

$$
X_c = 
\begin{bmatrix}
X_0 -  \mu_0 \\ X_1 -  \mu_1 
\end{bmatrix}
$$

Here, $X_0$ and $X_1$ are the $(N_0 \times P)$ and $(N_1 \times P)$ matrices of samples of classes $C_0$ and $C_1$, with $N = N_0 + N_1$.
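
As a quick numerical check of the two expressions for $S_W$, the following minimal sketch builds the matrix both ways with NumPy. The toy data, the random seed and the variable names (`X0`, `X1`, `S_W_sum`) are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy data: two classes in P = 2 dimensions
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))  # class C_0
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(60, 2))  # class C_1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# First form: sum of outer products of the centered samples
S_W_sum = (sum(np.outer(x - mu0, x - mu0) for x in X0)
           + sum(np.outer(x - mu1, x - mu1) for x in X1))

# Second form: stack the class-wise centered data and compute X_c^T X_c
Xc = np.vstack([X0 - mu0, X1 - mu1])
S_W = Xc.T @ Xc

print(np.allclose(S_W_sum, S_W))
```

The matrix form is the one to prefer in practice, since it avoids the Python-level loop over samples.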

Let $S_B$ be the between-class scatter matrix, given by

$$
S_B = (\mu_1 - \mu_0 )(\mu_1 - \mu_0 )^T
$$


The linear combination of features $w^T x$ has class means $w^T \mu_i$ for $i=0,1$ and within-class scatter $w^T X^T_c X_c w$. Fisher defined the separation between these two distributions as the ratio of the variance between the classes to the variance within the classes:

\begin{align}
F_{\text{Fisher}}(w) &= \frac{\sigma_{\text{between}}^2}{\sigma_{\text{within}}^2}\\
                     &= \frac{(w^T \mu_1 - w^T \mu_0)^2}{w^T  X^T_c X_c w}\\
                     &= \frac{(w^T (\mu_1 - \mu_0))^2}{w^T  X^T_c X_c w}\\ 
                     &= \frac{w^T (\mu_1 - \mu_0) (\mu_1 - \mu_0)^T w}{w^T X^T_c X_c w}\\
                     &= \frac{w^T S_B w}{w^T S_W w}
\end{align}
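
The ratio can be evaluated directly for any candidate direction. The sketch below is an illustration under assumed toy data; the helper name `fisher_ratio` is hypothetical. It computes $F_{\text{Fisher}}(w)$ from $S_B$ and $S_W$, then recomputes it from the one-dimensional projections $w^T x$ to show that the two forms agree and that the ratio does not depend on the scale of $w$.

```python
import numpy as np

def fisher_ratio(w, X0, X1):
    """Fisher criterion w^T S_B w / w^T S_W w for two classes (hypothetical helper)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Xc = np.vstack([X0 - mu0, X1 - mu1])
    S_W = Xc.T @ Xc                      # within-class scatter
    d = mu1 - mu0
    S_B = np.outer(d, d)                 # between-class scatter
    return (w @ S_B @ w) / (w @ S_W @ w)

# Hypothetical toy data
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(60, 2))

w = np.array([1.0, 0.0])                 # an arbitrary candidate direction
f_matrix = fisher_ratio(w, X0, X1)
f_scaled = fisher_ratio(5.0 * w, X0, X1)  # same direction, different magnitude

# Same quantity computed from the projected samples z = w^T x
z0, z1 = X0 @ w, X1 @ w
between = (z1.mean() - z0.mean()) ** 2
within = ((z0 - z0.mean()) ** 2).sum() + ((z1 - z1.mean()) ** 2).sum()
f_projected = between / within

print(f_matrix, f_scaled, f_projected)
```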

#### Theorem

In the two-class case, the maximum separation is obtained by projecting onto the direction $(\mu_1 - \mu_0)$ using the Mahalanobis metric $S_W^{-1}$:

$$
    w \propto S_W^{-1}(\mu_1 - \mu_0)
$$

#### Demonstration

Differentiating $F_{Fisher}(w)$ with respect to $w$ and setting the gradient to zero:

\begin{align*}
\nabla_{w}F_{Fisher}(w) &= 0\\
\nabla_{w}(\frac{w^T S_B w}{w^T S_W w}) &= 0\\
(w^T S_W w)(2 S_B w) - (w^T S_B w)(2 S_W w) &= 0\\
(w^T S_W w)(S_B w) &= (w^T S_B w)(S_W w)\\
S_B w &= \frac{w^T S_B w}{w^T S_W w}(S_W w)\\
S_B w &= \lambda (S_W w)\\
S_W^{-1}{S_B} w &= \lambda  w\\
\end{align*}

Since we do not care about the magnitude of $w$, only its direction, we replaced the scalar factor $(w^T S_B w) / (w^T S_W w)$ by $\lambda$. 

Note that this is an eigenvalue equation for the matrix $S_W^{-1} S_B$, which is in general not symmetric. The equivalent form $S_B w = \lambda S_W w$ is called a generalized eigenvalue problem.

However, in the two-class case (where $S_B = (\mu_1 - \mu_0 )(\mu_1 - \mu_0 )^T$), it is easy to 
show that $w = S_W^{-1}(\mu_1 - \mu_0)$ is an eigenvector of this equation:

\begin{align*}
S_W^{-1}(\mu_1 - \mu_0 )(\mu_1 - \mu_0 )^T w &= \lambda  w\\
S_W^{-1}(\mu_1 - \mu_0 )(\mu_1 - \mu_0 )^T S_W^{-1}(\mu_1 - \mu_0) &= \lambda  S_W^{-1}(\mu_1 
- \mu_0)\\
\end{align*}

Here $\lambda = (\mu_1 - \mu_0 )^T S_W^{-1}(\mu_1 - \mu_0)$ is a scalar, which leads to the result:
$$
    w \propto S_W^{-1}(\mu_1 - \mu_0)
$$
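
The eigenvector $w = S_W^{-1}(\mu_1 - \mu_0)$ and its eigenvalue $\lambda = (\mu_1 - \mu_0)^T S_W^{-1}(\mu_1 - \mu_0)$ from the demonstration can be checked numerically. The following is a minimal sketch on assumed toy data (the data, seed and variable names are illustrative): it verifies $S_W^{-1} S_B w = \lambda w$ and compares the Fisher ratio at $w$ with the ratio at a few random directions.

```python
import numpy as np

# Hypothetical toy data
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(60, 2))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Xc = np.vstack([X0 - mu0, X1 - mu1])
S_W = Xc.T @ Xc
d = mu1 - mu0
S_B = np.outer(d, d)

# Fisher direction: solve S_W w = (mu1 - mu0) rather than inverting S_W explicitly
w = np.linalg.solve(S_W, d)
lam = d @ w                                  # (mu1 - mu0)^T S_W^{-1} (mu1 - mu0)

# Left- and right-hand sides of S_W^{-1} S_B w = lambda w
lhs = np.linalg.solve(S_W, S_B @ w)
rhs = lam * w
print(np.allclose(lhs, rhs))

def ratio(v):
    """Fisher ratio v^T S_B v / v^T S_W v."""
    return (v @ S_B @ v) / (v @ S_W @ v)

# No random direction should give a larger ratio than w
random_ratios = [ratio(v) for v in rng.normal(size=(100, 2))]
print(ratio(w), max(random_ratios))
```

Using `np.linalg.solve` instead of forming the inverse is a common numerical choice; both give the same direction when $S_W$ is well conditioned.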

The threshold can be chosen as the projection of the midpoint between the two means, so that the decision boundary is the hyperplane halfway between the projected means:
$$
T = w \cdot \frac{1}{2}(\mu_1 + \mu_0)
$$
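
Putting the pieces together, the sketch below turns the Fisher direction into a two-class classifier. It assumes that the threshold is the projection of the midpoint $\frac{1}{2}(\mu_0 + \mu_1)$ between the two class means, and that a sample is assigned to $C_1$ when its projection exceeds that threshold. The toy data and the function name `predict` are hypothetical.

```python
import numpy as np

# Hypothetical toy data with reasonably well separated classes
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(60, 2))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Xc = np.vstack([X0 - mu0, X1 - mu1])
S_W = Xc.T @ Xc

w = np.linalg.solve(S_W, mu1 - mu0)   # Fisher direction
T = w @ (0.5 * (mu0 + mu1))           # projection of the midpoint of the means

def predict(X):
    """Assign class 1 when the projection w^T x exceeds the threshold T."""
    return (X @ w > T).astype(int)

X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(len(X0), dtype=int), np.ones(len(X1), dtype=int)])
accuracy = (predict(X) == y).mean()   # training accuracy on the toy data
print(accuracy)
```

Because $w$ points from $\mu_0$ towards $\mu_1$ in the $S_W^{-1}$ metric, the projected mean of $C_1$ always lies above $T$ and that of $C_0$ below it.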

## Linear classification with Statsmodels

## Linear classification with scikit-learn