Reference: Pattern Recognition and Machine Learning (Bishop)

# Fisher's linear discriminant

A __discriminant__ is a function that takes an instance and assigns it to a class label.

For simplicity, consider a binary classification.

* Dataset: $\{ \mathbf{x}_1,\ldots, \mathbf{x}_N \}$
* Classes: $C_1, C_2$,
* Objective:

    1. Project the dataset to $\mathbb{R}$ by $y = \mathbf{w}^T\mathbf{x}$ for some vector $\mathbf{w}$.
    1. Classify $\mathbf{x}$ as class $C_1$ if $y \geq -w_0$ and as class $C_2$ otherwise, where $-w_0$ is the threshold.
    
    
Idea:

* Try to maximize the difference of the projected class means $m_i = \mathbf{w}^T\mathbf{m}_i, i=1,2$, where $\mathbf{m}_i = \frac{1}{|C_i|}\sum_{n\in C_i} \mathbf{x}_n$.
    
* Try to minimize the total within-class variance of the projected data, $s_1^2 + s_2^2$, where $s_i^2 = \sum_{n\in C_i} (\mathbf{w}^T\mathbf{x}_n - m_i)^2$.


Find $\mathbf{w}$ maximizing the Fisher criterion $J(\mathbf{w}) = \displaystyle{\frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}}$.
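The notes above do not state the maximizer. A standard result (see Bishop) is that $J$ is maximized by $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$, where $\mathbf{S}_W = \sum_{i}\sum_{n\in C_i} (\mathbf{x}_n - \mathbf{m}_i)(\mathbf{x}_n - \mathbf{m}_i)^T$ is the within-class scatter matrix. The following is a minimal NumPy sketch under that assumption. The function names `fisher_direction` and `fisher_criterion` and the synthetic data are illustrative, not part of the reference.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return a unit vector w proportional to S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix (unnormalized sums, matching s_1^2 + s_2^2).
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """Evaluate J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2) on the projected data."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return between / within

# Synthetic two-class data with correlated features (illustrative only).
rng = np.random.default_rng(0)
cov = [[3.0, 2.0], [2.0, 3.0]]
X1 = rng.multivariate_normal([3.0, 1.0], cov, size=200)
X2 = rng.multivariate_normal([0.0, 0.0], cov, size=200)

w = fisher_direction(X1, X2)
J_fisher = fisher_criterion(w, X1, X2)

# For comparison: projecting onto the plain difference of class means.
d = X1.mean(axis=0) - X2.mean(axis=0)
J_means = fisher_criterion(d / np.linalg.norm(d), X1, X2)
print(J_fisher, J_means)
```

Because $J$ is invariant to rescaling $\mathbf{w}$, normalizing to unit length is only a convenience. `J_fisher` should be at least as large as `J_means`, since the plain mean difference ignores the within-class scatter.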


After finding $\mathbf{w}$, how do we choose the threshold $y_0 = -w_0$? One option is to assume that the projected instances $y_n = \mathbf{w}^T\mathbf{x}_n$ follow a Gaussian distribution within each class. We then estimate the mean and variance of each Gaussian and set $y_0$ to the point where the two fitted densities intersect.
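One way to sketch this threshold rule: equating the two fitted log-densities gives a quadratic equation in $y$, which reduces to a linear one when the variances are equal. The helper name `gaussian_threshold` is hypothetical, and choosing the root closest to the midpoint of the two means is an assumption made here, since unequal variances generally produce two intersection points. Class priors are ignored and both variances are assumed to be nonzero.

```python
import numpy as np

def gaussian_threshold(y1, y2):
    """Intersection of the two Gaussians fitted to projected samples y1, y2."""
    m1, m2 = y1.mean(), y2.mean()
    v1, v2 = y1.var(), y2.var()  # maximum likelihood variances

    # log N(y|m1,v1) - log N(y|m2,v2) = a*y^2 + b*y + c
    a = 0.5 / v2 - 0.5 / v1
    b = m1 / v1 - m2 / v2
    c = 0.5 * m2**2 / v2 - 0.5 * m1**2 / v1 + 0.5 * np.log(v2 / v1)

    if np.isclose(a, 0.0):
        # Equal variances: a single intersection, the midpoint of the means.
        return -c / b

    disc = max(b**2 - 4.0 * a * c, 0.0)
    roots = (-b + np.array([1.0, -1.0]) * np.sqrt(disc)) / (2.0 * a)
    # Assumption: keep the intersection nearest the midpoint of the means.
    midpoint = 0.5 * (m1 + m2)
    return roots[np.argmin(np.abs(roots - midpoint))]

y1 = np.array([3.0, 5.0])   # projected class C_1: mean 4, variance 1
y2 = np.array([-1.0, 1.0])  # projected class C_2: mean 0, variance 1
y0 = gaussian_threshold(y1, y2)  # equal variances, so y0 is the midpoint
print(y0)
```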

# Probabilistic generative models & Quadratic discriminant

By Bayes' rule, the posterior class probabilities can be written as a normalized exponential (softmax), the multiclass generalization of the logistic sigmoid:
$$ p(C_k|\mathbf{x}) = \displaystyle{\frac{p(\mathbf{x}|C_k)p(C_k)}{\sum_j p(\mathbf{x}|C_j)p(C_j)}} = \frac{e^{a_k}}{\sum_j e^{a_j}},$$

where $a_k = \ln\left[p(\mathbf{x}|C_k)p(C_k)\right]$.


Assume that the data are continuous and that the class-conditional densities are Gaussians sharing the same covariance matrix, $p(\mathbf{x}|C_k) = N(\mathbf{x} | \boldsymbol{\mu}_k, \Sigma)$.

The terms of $a_k$ that do not depend on $k$ cancel in the ratio above, so we can drop them and write $a_k = \mathbf{w}_k^T\mathbf{x} + w_{k0}$, where

* $\mathbf{w}_k = \Sigma^{-1} \boldsymbol{\mu}_k$

* $w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^T \Sigma^{-1} \boldsymbol{\mu}_k + \ln p(C_k)$

Note that $a_k$ is of the form $\boldsymbol{\mu}_k^T \Sigma^{-1} \mathbf{x} + c_k$, a linear function of $\mathbf{x}$.
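The formulas for $\mathbf{w}_k$ and $w_{k0}$ can be turned into a small NumPy sketch that evaluates the posteriors $p(C_k|\mathbf{x})$. The function name `lda_posteriors` and the example parameters are illustrative assumptions. Subtracting the row-wise maximum before exponentiating is a common numerical safeguard and does not change the result.

```python
import numpy as np

def lda_posteriors(X, means, cov, priors):
    """Posterior p(C_k | x) for Gaussian classes with a shared covariance.

    X: (N, D) inputs, means: (K, D), cov: (D, D), priors: (K,)
    """
    # Rows of W are w_k = Sigma^{-1} mu_k (cov is symmetric).
    W = np.linalg.solve(cov, means.T).T
    # w_k0 = -1/2 mu_k^T Sigma^{-1} mu_k + ln p(C_k)
    w0 = -0.5 * np.sum(means * W, axis=1) + np.log(priors)
    A = X @ W.T + w0                    # a_k for every input, shape (N, K)
    A -= A.max(axis=1, keepdims=True)   # stabilize the exponentials
    P = np.exp(A)
    return P / P.sum(axis=1, keepdims=True)

# Two classes placed symmetrically about the origin with equal priors.
means = np.array([[1.0, 0.0], [-1.0, 0.0]])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
priors = np.array([0.5, 0.5])

# By symmetry, the origin gets posterior 0.5 for each class.
print(lda_posteriors(np.array([[0.0, 0.0]]), means, cov, priors))
```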


If we allow $p(\mathbf{x}|C_k)$ to have its own covariance matrix $\Sigma_k$, then $a_k$ are not linear functions but quadratic functions of $\mathbf{x}$, giving rise to a __quadratic discriminant__.
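To make the quadratic dependence explicit, substitute $p(\mathbf{x}|C_k) = N(\mathbf{x}|\boldsymbol{\mu}_k, \Sigma_k)$ into $a_k = \ln\left[p(\mathbf{x}|C_k)p(C_k)\right]$ and drop the constant $-\frac{D}{2}\ln 2\pi$, where $D$ denotes the dimension of $\mathbf{x}$ (a symbol introduced here for this step only):

$$a_k = -\frac{1}{2}\mathbf{x}^T\Sigma_k^{-1}\mathbf{x} + \boldsymbol{\mu}_k^T\Sigma_k^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_k^T\Sigma_k^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\ln|\Sigma_k| + \ln p(C_k).$$

When every $\Sigma_k$ equals a common $\Sigma$, the first and fourth terms are the same for all classes and cancel in the ratio defining $p(C_k|\mathbf{x})$, leaving only the linear part.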


Now to estimate the prior probabilities $p(C_k)$ and the parameters $\boldsymbol{\mu}_k, \Sigma$ of the Gaussian distributions, we will find maximum likelihood solutions.

For simplicity, consider the two-class case.

Let $\pi = p(C_1)$ (so $p(C_2) = 1-\pi$) and suppose that we have a data set $(\mathbf{x}_n, t_n)$ for $n=1,\ldots,N$, where $t_n = 1$ if $\mathbf{x}_n$ is in class $C_1$ and $t_n = 0$ if $\mathbf{x}_n$ is in class $C_2$.

Then the likelihood function is
$$p(\mathbf{t}, \mathbf{x}_1,\ldots,\mathbf{x}_N \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{n=1}^N \left[ \pi N(\mathbf{x}_n | \boldsymbol{\mu}_1,\Sigma)\right]^{t_n} \left[ (1-\pi) N(\mathbf{x}_n | \boldsymbol{\mu}_2,\Sigma)\right]^{1-t_n},$$

where $\mathbf{t} = (t_1,\ldots,t_N)^T$.

* Maximizing the log of the likelihood function _w.r.t._ $\pi$, we get $\pi = |C_1|/N$.

* Maximizing the log of the likelihood function _w.r.t._ $\boldsymbol{\mu}_k$, we find that $\boldsymbol{\mu}_k$ is the mean of all instances belonging to class $C_k$.

* Maximizing the log of the likelihood function _w.r.t._ $\Sigma$, we get $\Sigma = \sum_{k=1}^2 \frac{|C_k|}{N} \mathbf{S}_k$, where $\mathbf{S}_k = \frac{1}{|C_k|} \sum_{n\in C_k} (\mathbf{x}_n - \boldsymbol{\mu}_k) (\mathbf{x}_n - \boldsymbol{\mu}_k)^T$.
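The three maximum likelihood estimates above translate directly into NumPy. This is a minimal sketch for the two-class case. The function name `fit_shared_covariance` and the toy data are illustrative, and both classes are assumed to be non-empty.

```python
import numpy as np

def fit_shared_covariance(X, t):
    """ML estimates (pi, mu_1, mu_2, Sigma) for two Gaussian classes.

    X: (N, D) inputs; t: (N,) labels with 1 for C_1 and 0 for C_2.
    """
    t = np.asarray(t).astype(bool)
    X1, X2 = X[t], X[~t]
    pi = X1.shape[0] / X.shape[0]                 # pi = |C_1| / N
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # class means
    # Per-class covariance S_k, normalized by |C_k|.
    S1 = (X1 - mu1).T @ (X1 - mu1) / X1.shape[0]
    S2 = (X2 - mu2).T @ (X2 - mu2) / X2.shape[0]
    # Shared covariance: average of S_1 and S_2 weighted by |C_k| / N.
    Sigma = pi * S1 + (1.0 - pi) * S2
    return pi, mu1, mu2, Sigma

# Toy data: four points in C_1, two points in C_2.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [5.0, 5.0], [7.0, 5.0]])
t = np.array([1, 1, 1, 1, 0, 0])
pi, mu1, mu2, Sigma = fit_shared_covariance(X, t)
print(pi, mu1, mu2, Sigma, sep="\n")
```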

## scikit-learn


```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
```

__Linear Discriminant Analysis__: A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. The fitted model can also be used to reduce the dimensionality of the input by projecting it to the most discriminative directions, using the transform method.


__Quadratic Discriminant Analysis__: A classifier with a quadratic decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class.
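A short usage sketch for both estimators, assuming default parameters and synthetic data generated with NumPy (the data and variable names are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Two well-separated Gaussian blobs in 2D (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2)),
])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

lda_acc = lda.score(X, y)          # training accuracy of the linear model
qda_acc = qda.score(X, y)          # training accuracy of the quadratic model
proba = lda.predict_proba(X[:3])   # posterior p(C_k | x) for three inputs
Z = lda.transform(X)               # projection onto the discriminant direction
print(lda_acc, qda_acc, Z.shape)
```

With two classes, `transform` projects onto a single direction, so `Z` has one column.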