#### Linear Discriminant Analysis

Logistic regression is a classification algorithm traditionally limited to two-class classification
problems. If you have more than two classes, Linear Discriminant Analysis (LDA) is the
preferred linear classification technique.

Logistic regression is a simple and powerful linear classification algorithm. It also has limitations
that suggest the need for alternative linear classification algorithms.

- Two-Class Problems. Logistic regression is intended for two-class or binary classification
problems. It can be extended for multiclass classification, but is rarely used for this purpose.
- Unstable With Well Separated Classes. Logistic regression can become unstable
when the classes are well separated.
- Unstable With Few Examples. Logistic regression can become unstable when there
are few examples from which to estimate the parameters.

Linear discriminant analysis addresses each of these points and is the go-to linear method
for multiclass classification problems. Even with binary classification problems, it is a good idea
to try both logistic regression and linear discriminant analysis.

The representation of LDA is straightforward. It consists of statistical properties of your
data, calculated for each class.

For a single input variable (x) this is the mean and the variance
of the variable for each class.

For multiple variables, these are the same properties calculated over the multivariate Gaussian,
namely the means and the covariance matrix (a multi-dimensional generalization of
variance). These statistical properties are estimated from your data and plugged into the LDA
equation to make predictions.

LDA makes some simplifying assumptions about your data:
- That your data is Gaussian, that each variable is shaped like a bell curve when plotted.
- That each attribute has the same variance, that values of each variable vary around the
mean by the same amount on average.

LDA makes predictions by estimating the probability that a new set of inputs belongs to each
class. The class that gets the highest probability is the output class and a prediction is made.
The model uses Bayes' theorem to estimate the probabilities.

As a general statement, we can state Bayes' theorem as follows:
\begin{equation}
\label{eq:bayes}
P(Y=k \mid \textbf{X}=\textbf{x}) = \frac{P(k) \, P(\textbf{x} \mid k)}{P(\textbf{x})}
\end{equation}
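
As a minimal sketch of how this rule turns priors and likelihoods into class probabilities, the snippet below evaluates it for a single input and two Gaussian classes. The class parameters, the priors and the helper name `posterior` are illustrative assumptions, not values taken from this tutorial.

```python
from statistics import NormalDist

def posterior(x, class_dists, priors):
    """Return P(Y=k | X=x) for every class k using Bayes' theorem."""
    # Numerator for each class: prior P(k) times the density P(x|k)
    joint = [p * d.pdf(x) for d, p in zip(class_dists, priors)]
    # Denominator P(x): the sum of the numerators over all classes
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical classes: N(0, 1) and N(2, 1) with equal priors
dists = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=2.0, sigma=1.0)]
probs = posterior(1.5, dists, [0.5, 0.5])
print(probs)  # class 1 gets the larger probability because 1.5 is closer to its mean
```

The probabilities sum to 1 because every numerator is divided by the same evidence term.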

Plugging the Gaussian density
into the above equation and simplifying, we end up with the equation below. It is no longer a
probability as we discard some terms. Instead it is called a discriminant function for class k. It
is calculated for each class k, and the class that has the largest discriminant value becomes the
output classification (Y = k):


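\begin{equation}
D_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))
\end{equation}

$D_k(x)$ is the discriminant function for class k given input x. The class mean $\mu_k$ (`mean_k`), the variance $\sigma^2$ (`sigma2`) and the class probability P(k)
are all estimated from your data. The ln() function is the natural logarithm.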
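
The simplification can be sketched for a single input variable. This is an outline under the stated assumptions (a Gaussian density for each class with its own mean $\mu_k$ and a variance $\sigma^2$ shared by all classes), not a full derivation. The class density is

\begin{equation}
P(x \mid k) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu_k)^2}{2\sigma^2}\right)
\end{equation}

Taking the natural logarithm of the numerator of Bayes' theorem and expanding the square gives

\begin{equation}
\ln\big(P(k)\,P(x \mid k)\big) = \ln(P(k)) - \frac{1}{2}\ln(2\pi\sigma^2) - \frac{x^2}{2\sigma^2} + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}
\end{equation}

The denominator $P(\textbf{x})$ and the two terms $-\frac{1}{2}\ln(2\pi\sigma^2)$ and $-\frac{x^2}{2\sigma^2}$ are identical for every class, so dropping them does not change which class scores highest. The terms that remain are

\begin{equation}
\frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))
\end{equation}

which is the discriminant function for class k.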

#### Preparing Data for LDA
This section lists some suggestions you may consider when preparing your data for use with
LDA.

- Classification Problems. This might go without saying, but LDA is intended for
classification problems where the output variable is categorical. LDA supports both binary
and multiclass classification.
- Gaussian Distribution. The standard implementation of the model assumes a Gaussian
distribution of the input variables. Consider reviewing the univariate distributions of each
attribute and using transforms to make them more Gaussian-looking (e.g. log and root
for exponential distributions and Box-Cox for skewed distributions).
- Remove Outliers. Consider removing outliers from your data. These can skew the basic
statistics used to separate classes in LDA, such as the mean and the standard deviation.
- Same Variance. LDA assumes that each input variable has the same variance. It is almost
always a good idea to standardize your data before using LDA so that it has a mean of 0
and a standard deviation of 1.
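
Two of these suggestions can be sketched with the standard library alone: a log transform for a skewed variable followed by standardization. The helper names and the sample values are hypothetical and only serve as an illustration.

```python
import math
from statistics import fmean, pstdev

def log_transform(values):
    """Compress a right-skewed, strictly positive variable with the natural log."""
    return [math.log(v) for v in values]

def standardize(values):
    """Rescale values to a mean of 0 and a standard deviation of 1."""
    mu = fmean(values)
    sd = pstdev(values)  # population standard deviation
    return [(v - mu) / sd for v in values]

skewed = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]  # hypothetical right-skewed attribute
prepared = standardize(log_transform(skewed))
print([round(v, 3) for v in prepared])
```

In practice the mean and standard deviation would be estimated on the training data only and then reused for any new data.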

#### Tutorial Overview

We are going to step through how to calculate an LDA model for a simple dataset with one input
and one output variable. This is the simplest case for LDA. This tutorial will cover:
1. Dataset: Introduce the dataset that we are going to model. We will use the same dataset
as the training and the test dataset.
2. Learning the Model: How to learn the LDA model from the dataset including all of
the statistics needed to make predictions.
3. Making Predictions: How to use the learned model to make predictions for each instance
in the training dataset.

Below is a simple contrived dataset with two columns: the input variable X and the
output class variable Y. All values for X were drawn from a Gaussian distribution and the class
variable Y has the value 0 or 1. The instances in the two classes were separated to make the
prediction problem simpler. All instances in class 0 were drawn from a Gaussian distribution
with a mean of 5 and a standard deviation of 1. All instances in class 1 were drawn from a
Gaussian distribution with a mean of 20 and a standard deviation of 1.
The classes do not interact and should be separable with a linear model like LDA. It is also
handy to know the actual statistical properties of the data because we can generate more test
instances later to see how well LDA has learned the model. Below is the complete dataset.

In [34]:
from io import StringIO
import pandas as pd
import numpy as np

In [17]:
dataset = StringIO("""X Y
4.667797637 0
5.509198779 0
4.702791608 0
5.956706641 0
5.738622413 0
5.027283325 0
4.805434058 0
4.425689143 0
5.009368635 0
5.116718815 0
6.370917709 0
2.895041947 0
4.666842365 0
5.602154638 0
4.902797978 0
5.032652964 0
4.083972925 0
4.875524106 0
4.732801047 0
5.385993407 0
20.74393514 1
21.41752855 1
20.57924186 1
20.7386947 1
19.44605384 1
18.36360265 1
19.90363232 1
19.10870851 1
18.18787593 1
19.71767611 1
19.09629027 1
20.52741312 1
20.63205608 1
19.86218119 1
21.34670569 1
20.333906 1
21.02714855 1
18.27536089 1
21.77371156 1
20.65953546 1
""")

In [18]:
# Normalize column names to lowercase, then load the space-separated data
def clean_cols(col):
    return col.lower().strip()
lda = pd.read_csv(dataset, sep=' ').rename(columns=clean_cols)
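
Because the generating distributions are known, further instances can be drawn in the same way. The sketch below assumes the stated parameters (means of 5 and 20, standard deviation of 1). The function name `make_dataset` and the seed are hypothetical, so the values will not match the dataset above.

```python
import random

def make_dataset(n_per_class=20, seed=1):
    """Draw n_per_class values of X for each class and pair each with its label."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    rows = []
    for label, mu in ((0, 5.0), (1, 20.0)):
        rows.extend((rng.gauss(mu, 1.0), label) for _ in range(n_per_class))
    return rows

sample = make_dataset()
print(sample[:3])
```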

The LDA model requires the estimation of the following statistics from the training data:
1. The mean of the input variable for each class.
2. The probability of an instance belonging to each class.
3. The variance of the input variable, pooled across the classes (for multiple inputs this becomes the covariance matrix).

In [32]:
# mean of each input for each class
mean_0 = lda.loc[lda.y.eq(0), 'x'].mean()
mean_1 = lda.loc[lda.y.eq(1), 'x'].mean()

mean_0, mean_1

(4.975415506999999, 20.087062921)

In [33]:
# probability of an instance belonging to each class (fraction of rows in that class)
p_y0 = lda.loc[lda.y.eq(0), 'y'].count()/ lda.shape[0]
p_y1 = lda.loc[lda.y.eq(1), 'y'].count()/ lda.shape[0]

p_y0, p_y1

(0.5, 0.5)

You can understand
the variance as the average squared difference of each instance from the mean. Because the difference is squared,
the variance is often written as sigma squared to reflect these units. It does not mean you need to square the
variance value when using it. We can calculate the variance for our dataset in two steps:
1. Calculate the squared difference of each input value from its group mean.
2. Calculate the mean of the squared differences.

In [35]:
def sqr_difference(df, class_value: int):
    """
    Calculates the sum of the squared difference for each class
    """
    mean_class = df.loc[df.y.eq(class_value), 'x'].mean()
    df_class = df.loc[df.y.eq(class_value)]
    return np.sum(
        np.square(df_class['x'] - mean_class)
    )

In [38]:
sqr_difference(lda, 1)

21.493167084411787

Next we can calculate the variance as the average squared difference from the mean:

variance = 1 / (count(x) - count(classes)) * sum(SquaredDifference(xi))

Here count(x) is the number of instances and count(classes) is the number of classes.

In [46]:
variance = 1/(lda.shape[0] -lda['y'].nunique()) * sum(sqr_difference(lda, i) for i in range(2))
variance

0.8329315056876604
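
The same pooled estimate can be cross-checked from per-class sample variances, since (n_k - 1) times the sample variance of class k equals that class's sum of squared differences. The sketch below uses a small hypothetical dataset rather than the tutorial data so the arithmetic is easy to follow by hand.

```python
from statistics import variance as sample_variance

# Hypothetical toy data: class label -> input values
groups = {0: [1.0, 2.0, 3.0], 1: [7.0, 9.0, 11.0]}

n_total = sum(len(xs) for xs in groups.values())
n_classes = len(groups)

# Sum of squared differences per class, recovered from the sample variance
sum_sq = sum((len(xs) - 1) * sample_variance(xs) for xs in groups.values())

pooled = sum_sq / (n_total - n_classes)
print(pooled)  # 2.5
```

Class 0 contributes a squared-difference sum of 2 and class 1 a sum of 8, so the pooled variance is 10 / (6 - 2) = 2.5.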

#### Making Predictions

Predictions are made by calculating the discriminant
function for each class and predicting the class with the largest value. The discriminant function
for a class given an input (x) is calculated using:


\begin{equation}
D_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))
\end{equation}

Here x is the input value, and the mean $\mu_k$, the variance $\sigma^2$ and the probability P(k) are the values calculated above for the class
we are discriminating. After calculating the discriminant value for each class, the class with
the largest discriminant value is taken as the prediction.

Let's step through the calculation of
the discriminant value of each class for the first instance. The first instance in the dataset is:
**X = 4.667797637 and Y = 0**


In [48]:
lda.loc[0,'x']

4.667797637
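
As a worked check of the arithmetic for this instance, the sketch below plugs the statistics estimated earlier (the two class means, the pooled variance and the class probabilities of 0.5) into the discriminant function using only the standard library. The dictionary layout is an illustrative choice.

```python
import math

x = 4.667797637                 # first instance in the dataset
variance = 0.8329315056876604   # pooled variance estimated above

# class label -> (class mean, class probability), as estimated above
stats = {0: (4.975415506999999, 0.5), 1: (20.087062921, 0.5)}

scores = {
    k: x * mean / variance - mean**2 / (2 * variance) + math.log(prior)
    for k, (mean, prior) in stats.items()
}
predicted = max(scores, key=scores.get)  # class with the largest discriminant value

print(scores)     # roughly {0: 12.33, 1: -130.33}
print(predicted)  # 0
```

The discriminant value for class 0 is far larger than for class 1, so the instance is assigned to class 0, which matches its label.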

In [50]:
def discriminant(x: float, class_value: int, var: float, mean: float, pb: float) -> float:
    """
    Calculates the discriminant value of input x for one class.
    class_value only labels the call site; it is not used in the calculation.
    """
    # x * mean/var - mean^2/(2 * var) + ln(P(k))
    return x * (mean / var) - (mean**2 / (2 * var)) + np.log(pb)

In [69]:
lda = lda.assign(disc_y0 = [discriminant(x, 0, variance, mean_0, p_y0) for x in lda['x']],
           disc_y1 = [discriminant(x, 1, variance, mean_1, p_y1) for x in lda['x']],
           # predict class 1 when its discriminant value is the larger of the two
           prediction = lambda df: df.loc[:, ['disc_y0', 'disc_y1']].idxmax(axis='columns').eq('disc_y1').astype(int)
                )

In [70]:
lda.sample(10).round(3)

|    | x | y | disc_y0 | disc_y1 | prediction |
|---:|---:|---:|---:|---:|---:|
| 25 | 18.364 | 1 | 94.140 | 199.955 | 1 |
| 31 | 20.527 | 1 | 107.065 | 252.137 | 1 |
| 23 | 20.739 | 1 | 108.327 | 257.233 | 1 |
| 29 | 19.718 | 1 | 102.228 | 232.610 | 1 |
| 11 | 2.895 | 0 | 1.740 | -173.087 | 0 |
| 16 | 4.084 | 0 | 8.842 | -144.414 | 0 |
| 2 | 4.703 | 0 | 12.538 | -129.491 | 0 |
| 35 | 20.334 | 1 | 105.909 | 247.471 | 1 |
| 22 | 20.579 | 1 | 107.374 | 253.387 | 1 |
| 3 | 5.957 | 0 | 20.028 | -99.251 | 0 |


If you compare the predictions to the dataset, you can see that LDA has achieved an accuracy
of 100% (no errors). This is not surprising given that the dataset was contrived so that the
groups for Y = 0 and Y = 1 were clearly separable.
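
One way to see why no errors occur is to compute the value of x at which the two discriminant functions are equal. Setting D_0(x) = D_1(x) and solving for x gives the boundary expression used below. This rearrangement is worked out here for illustration and is not part of the original walkthrough. The statistics are the ones estimated earlier.

```python
import math

mean_0, mean_1 = 4.975415506999999, 20.087062921
variance = 0.8329315056876604
p_y0, p_y1 = 0.5, 0.5

# x where D_0(x) == D_1(x); with equal class probabilities the log term is zero
# and the boundary is simply the midpoint of the two class means
boundary = (mean_0 + mean_1) / 2 - variance * math.log(p_y1 / p_y0) / (mean_1 - mean_0)
print(round(boundary, 3))  # 12.531
```

The largest class 0 value in the dataset is 6.370917709 and the smallest class 1 value is 18.18787593, so every instance falls on the correct side of this boundary.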