LDA in Python Walkthrough:

https://hands-on.cloud/implementation-of-linear-discriminant-analysis-lda-using-python/

# Discriminant Analysis Notes

### Summary:

Linear Discriminant Analysis (LDA) is used for both classification and dimensionality reduction. In classification, LDA finds linear combinations of features that best separate two or more classes by maximizing between-class variance while minimizing within-class variance.

You can compare this to PCA, but instead of finding the directions along which the data vary the most, LDA maximizes the separability among the known categories. Both LDA and PCA are dimensionality reduction techniques, but they serve different purposes.

### Why Discriminant Analysis:

When classes are well-separated, the parameter estimates for logistic regression are unstable: the fitted probabilities are pushed toward 0 or 1, so the coefficients keep growing instead of settling on a value. Linear discriminant analysis does not suffer from this problem.
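The instability can be seen in a small experiment. The sketch below is only an illustration on made-up, perfectly separated data: `fit_slope` is a hypothetical helper that runs plain gradient ascent on the logistic log-likelihood for a single slope with no intercept. It is not how a statistics library fits the model, but it shows the same symptom.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_slope(xs, ys, steps, lr=0.1):
    """Gradient ascent on the logistic log-likelihood for one slope w (no intercept)."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to w.
        grad = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))
        w += lr * grad
    return w

# Toy data (made up): every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# The estimate never settles: more iterations always give a larger slope.
for steps in (100, 1_000, 10_000):
    print(steps, round(fit_slope(xs, ys, steps), 2))
```

Because a steeper curve always fits separated data a little better, the slope has no finite best value, which is the instability described above.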

Linear discriminant analysis is also useful when you have more than two response classes, because it provides a low-dimensional view of the data.

### How Discriminant Analysis:

Imagine having an x variable and a y variable that both impact the success of a certain drug. LDA attempts to draw a new axis cutting through the x and y observations that maximizes the distance between success observations and failure observations.

It does this using two main criteria, which are then combined into a single ratio:

1. The new axis maximizes the distance between the means of the failure and success classes. The farther apart the means are, the easier it is to classify an observation as a success or a failure. Let's say we have $\mu_1$ for success and $\mu_2$ for failure that represent the means of each class.

2. The new axis minimizes the variation (scatter) among observations within each class. You can think of this as having a tight grouping on a shooting target. Let's say we have $s^2_1$ and $s^2_2$ that represent the variation of success observations and failure observations respectively.

3. You can now create a ratio $$\frac{(\mu_1 - \mu_2)^2}{s^2_1 + s^2_2}$$
    - we square the numerator to ensure the value stays positive
    - ideally we want a large numerator and a small denominator
    - by creating a ratio between the two criteria, we can accommodate scenarios where the classes barely differ along one variable but differ a lot along another.

This can then be simplified to the form below, where $d$ represents the distance between the means.

$$\frac{(d)^2}{s^2_1 + s^2_2}$$
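To make the ratio concrete, here is a minimal sketch that evaluates it for two candidate axes. The observations are made up, `fisher_ratio` is a hypothetical helper name, and $s^2$ is taken to be the sum of squared deviations from the class mean (with equal class sizes, using a variance instead would only rescale the ratio).

```python
from statistics import mean

def fisher_ratio(class_1, class_2, axis):
    """(mu_1 - mu_2)^2 / (s_1^2 + s_2^2) for 2-D points projected onto `axis`."""
    def project(points):
        # Position of each point along the candidate axis.
        return [x * axis[0] + y * axis[1] for x, y in points]

    def scatter(values):
        # Sum of squared deviations from the class mean.
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    p1, p2 = project(class_1), project(class_2)
    return (mean(p1) - mean(p2)) ** 2 / (scatter(p1) + scatter(p2))

# Made-up observations: (x, y) pairs for each outcome.
success = [(4.0, 2.0), (5.0, 3.0), (6.0, 2.5), (5.5, 3.5)]
failure = [(1.0, 2.2), (2.0, 3.1), (1.5, 2.4), (2.5, 3.3)]

ratio_x = fisher_ratio(success, failure, axis=(1.0, 0.0))  # project onto x
ratio_y = fisher_ratio(success, failure, axis=(0.0, 1.0))  # project onto y
print("ratio along x:", ratio_x)
print("ratio along y:", ratio_y)
```

In this toy data the classes differ along x but not along y, so the x axis scores far higher. LDA searches over all possible axes for the one with the largest ratio.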


Summary from another lecture:
The approach in Discriminant Analysis is to model the distribution of X separately in each of the classes of Y. From there, one uses Bayes theorem to flip things around and obtain Pr(Y|X).

When normal (Gaussian) distributions are used for each class, this leads to linear or quadratic discriminant analysis. The approach itself is quite general, however, and other distributions can be used as well.

### Handling 3+ Variables

It's almost the exact same process. One creates a new axis that maximizes the distance between the two class means while minimizing the scatter within each class. Remember that this axis is effectively just a linear combination of the variables.
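As a rough sketch of what "linear combination" means here, the code below uses made-up data with three variables and two classes. It relies on the standard two-class result that the best axis is proportional to $S_w^{-1}(\mu_1 - \mu_2)$, where $S_w$ is the within-class scatter matrix. That formula is not derived in these notes, and NumPy is assumed to be available.

```python
import numpy as np

# Made-up data: 3 variables per observation, two classes.
class_1 = np.array([[4.0, 2.0, 1.0], [5.0, 3.0, 1.5], [6.0, 2.5, 0.5], [5.5, 3.5, 1.2]])
class_2 = np.array([[1.0, 2.2, 1.1], [2.0, 3.1, 0.4], [1.5, 2.4, 1.6], [2.5, 3.3, 0.9]])

mu_1, mu_2 = class_1.mean(axis=0), class_2.mean(axis=0)

# Within-class scatter: summed outer products of deviations from each class mean.
s_w = (class_1 - mu_1).T @ (class_1 - mu_1) + (class_2 - mu_2).T @ (class_2 - mu_2)

# Two-class solution: w is proportional to S_w^{-1} (mu_1 - mu_2).
w = np.linalg.solve(s_w, mu_1 - mu_2)
w /= np.linalg.norm(w)  # unit length, for readability only

# Each observation's position on the new axis is a weighted sum of its 3 variables.
scores_1, scores_2 = class_1 @ w, class_2 @ w
print("weights:", w)
print("class 1 scores:", scores_1)
print("class 2 scores:", scores_2)
```

The three entries of `w` are the weights of the linear combination; adding more variables only makes `w` longer.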

### Handling 3+ Categories

Two things change, but just barely.
![My Image](LDA for 3 Categories.png)


1. We change how distance is measured among the means. We identify a point that is central to all of the data, then measure the distance from that central point to the center of each category. Those distances are maximized while the scatter within each category is minimized.
2. We create two axes to separate the data. Each of the 3 categories has its own central point, and three points define a plane, so two axes are enough to capture the separation among the categories.

This two-axis method is powerful because it works with any number of variables: however many variables there are, the separation among the three categories is still summarized along two axes.
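The sketch below illustrates both changes on simulated data with 3 categories and 4 variables. It uses the usual scatter-matrix formulation (within-category scatter `s_w`, between-category scatter `s_b`, and the eigenvectors of $S_w^{-1} S_b$), which these notes do not spell out, so treat the details as an assumption. The data, the seed, and the `1e-8` cutoff are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 3 categories, 4 variables, 20 observations per category.
centers = np.array([[0.0, 0.0, 0.0, 0.0], [3.0, 1.0, 0.0, 2.0], [1.0, 4.0, 2.0, 0.0]])
groups = [center + rng.normal(size=(20, 4)) for center in centers]

overall_mean = np.vstack(groups).mean(axis=0)  # the point central to all data

s_w = np.zeros((4, 4))  # scatter within each category
s_b = np.zeros((4, 4))  # scatter of category centers around the central point
for g in groups:
    mu = g.mean(axis=0)
    s_w += (g - mu).T @ (g - mu)
    diff = (mu - overall_mean).reshape(-1, 1)
    s_b += len(g) * (diff @ diff.T)

# Axes that maximize between-category scatter relative to within-category scatter.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
order = np.argsort(eigvals.real)[::-1]
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

# Only 2 eigenvalues are meaningfully above zero, despite there being 4 variables.
n_axes = int(np.sum(eigvals > 1e-8 * eigvals[0]))
print("useful axes:", n_axes)

ld_axes = eigvecs[:, :2]                 # LD1 and LD2
projected = np.vstack(groups) @ ld_axes  # every observation in 2 dimensions
print(projected.shape)
```

With K categories there are at most K - 1 useful axes, which is why three categories give two.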


### Compare/Contrast between PCA and LDA

- Both rank the new axes in order of importance
    - PC1 from PCA accounts for the most variation in the data, followed by PC2, PC3, and so on
    - LD1 accounts for the most variation between categories, followed by LD2, and so on
- Both methods allow you to see which variables (genes, in the original example) are driving the new axes
- Both try to reduce dimensionality
    - PCA looks at the variables with the most variation
    - LDA maximizes the distance between categories
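The difference is easiest to see on data where the largest spread is not the direction that separates the categories. The following is a small sketch on made-up 2-D data, assuming NumPy is available. PC1 is taken as the top eigenvector of the covariance matrix, and LD1 uses the two-class formula $S_w^{-1}(\mu_a - \mu_b)$, which is a standard result rather than something stated in these notes.

```python
import numpy as np

# Made-up data: the classes differ along x, but most of the spread is along y.
class_a = np.array([[0.0, -6.0], [0.2, -2.0], [-0.2, 2.0], [0.1, 6.0]])
class_b = np.array([[1.0, -6.0], [1.2, -2.0], [0.8, 2.0], [1.1, 6.0]])
data = np.vstack([class_a, class_b])

# PCA: top eigenvector of the covariance matrix (class labels are ignored).
cov = np.cov(data, rowvar=False)
pca_vals, pca_vecs = np.linalg.eigh(cov)
pc1 = pca_vecs[:, -1]  # eigh returns eigenvalues in ascending order

# LDA (two classes): direction proportional to S_w^{-1} (mu_a - mu_b).
mu_a, mu_b = class_a.mean(axis=0), class_b.mean(axis=0)
s_w = (class_a - mu_a).T @ (class_a - mu_a) + (class_b - mu_b).T @ (class_b - mu_b)
ld1 = np.linalg.solve(s_w, mu_a - mu_b)
ld1 /= np.linalg.norm(ld1)

print("PC1:", np.round(pc1, 3))
print("LD1:", np.round(ld1, 3))
```

Here PC1 points almost entirely along y, where the data vary most, while LD1 points almost entirely along x, where the classes differ.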



## Bayes Theorem

Bayes Theorem:
$$Pr(Y = k | X = x) = \frac{Pr(X = x | Y = k) \cdot Pr(Y = k)}{Pr(X = x)}$$

In words, Bayes theorem lets us swap the conditioning around. The probability that Y = k given X = x equals the probability that X = x given Y = k, multiplied by the probability that Y = k, and divided by the probability that X = x.
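A short numeric example may help. All of the probabilities below are made up, and Pr(X = x) is obtained by summing over the classes (the law of total probability), which is one common way to get the denominator.

```python
# Made-up probabilities for a two-class problem (success vs. failure).
prior = {"success": 0.3, "failure": 0.7}       # Pr(Y = k)
likelihood = {"success": 0.8, "failure": 0.2}  # Pr(X = x | Y = k)

# Pr(X = x): total probability of observing x across both classes.
evidence = sum(likelihood[k] * prior[k] for k in prior)

# Bayes theorem applied to each class.
posterior = {k: likelihood[k] * prior[k] / evidence for k in prior}
print(posterior)
```

Even though success has the lower prior, observing x shifts most of the probability toward it because x is much more likely under success.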

## Bayes Theorem in Discriminant Analysis

When writing Bayes Theorem for discriminant analysis, the notation changes to the form below. Note that $\pi$ is used here as a Greek variable name and does not represent its numerical value. $K$ is the total number of classes.

$Pr(Y=k | X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}$

* $f_k(x) = Pr(X=x|Y=k)$ is the **density** for X in class *k*.
    - Remember that the "density" of X is the height of the curve at x; the area under the curve over a range of x gives a probability.

* $\pi_k = Pr(Y=k)$ is the marginal or **prior** probability for class *k*.
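Putting the pieces together, here is a minimal sketch of the posterior calculation for one predictor. It assumes normal densities with a standard deviation shared by all classes (the LDA setting). The priors, means, and the helper names `normal_density` and `posterior` are hypothetical values chosen for illustration.

```python
import math

def normal_density(x, mean, sd):
    """Height of the normal curve at x (a density, not a probability)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior(x, priors, means, sd):
    """Pr(Y = k | X = x) for each class k, using a shared standard deviation."""
    weighted = [p * normal_density(x, m, sd) for p, m in zip(priors, means)]
    total = sum(weighted)  # the denominator: sum over all classes
    return [w / total for w in weighted]

# Hypothetical values: three classes, one predictor.
priors = [0.5, 0.3, 0.2]  # pi_k
means = [0.0, 2.0, 4.0]   # class means
sd = 1.0                  # shared by all classes

probs = posterior(1.8, priors, means, sd)
print([round(p, 3) for p in probs])
print("predicted class:", probs.index(max(probs)))
```

An observation is assigned to the class with the largest posterior, so both the prior and the density at x influence the decision.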