## Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis, or LDA, is a technique for classifying an object into one of two or more groups based on a set of features that describe the object. It does so by assigning the object to the group with the highest conditional probability. Remember Bayes’ rule? This is exactly that.

LDA is an alternative to logistic regression rather than an extension of it. Logistic regression models $Pr(Y = k\ |\ X = x)$ directly. LDA instead models the distribution of the predictors X separately in each of the response classes (i.e. given Y), and then uses Bayes’ theorem to flip these around into estimates for $Pr(Y = k\ |\ X = x)$.


**Using Bayes’ Theorem for Classification**

Suppose we want to classify an observation into one of $K$ classes, where $K \geq 2$. The response variable $Y$ can take on $K$ possible distinct and unordered values. Let $π_k$ represent the prior probability that a randomly chosen observation comes from the $k$th class, that is, the probability that a given observation is associated with the $k$th category of the response variable $Y$. Let $f_k(x) ≡ Pr(X = x\ |\ Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class. In other words, $f_k(x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X ≈ x$, and $f_k(x)$ is small if it is very unlikely that an observation in the $k$th class has $X ≈ x$. Then Bayes’ theorem states that

$$ Pr(Y = k\ |\ X = x) = \frac{π_k f_k(x)}{\sum_{l=1}^{K} π_l f_l(x)}$$
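To make the formula concrete, here is a minimal sketch in R. It assumes a single predictor and two classes whose densities are normal; the priors, means, and standard deviation are made-up values chosen for illustration, not estimates from any dataset.

```r
# Hypothetical two-class setup with one predictor (all numbers are illustrative).
prior <- c(class1 = 0.3, class2 = 0.7)   # pi_k: prior probabilities
mu    <- c(class1 = -1,  class2 = 1)     # assumed class means
sigma <- 1                               # assumed common standard deviation

# Posterior Pr(Y = k | X = x) via Bayes' theorem:
# the numerator is pi_k * f_k(x), normalised by the sum over all classes.
posterior <- function(x) {
  numer <- prior * dnorm(x, mean = mu, sd = sigma)
  numer / sum(numer)
}

post <- posterior(0.5)
post                      # one posterior probability per class
names(which.max(post))    # class with the highest posterior
```

The denominator only rescales the numerators so that they sum to one, so the class with the largest $π_k f_k(x)$ is also the class with the largest posterior.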

Using the notation from logistic regression, we write $p_k(X) = Pr(Y = k\ |\ X)$. Bayes’ theorem suggests that instead of computing $p_k(X)$ directly, we can plug estimates of $π_k$ and $f_k(X)$ into the formula given by Bayes’ theorem. Estimating $π_k$ is easy if we have a random sample of $Y$s from the population: we simply compute the fraction of the training observations that belong to the $k$th class.
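As a small illustration of the prior estimate, the sketch below computes class fractions from a hypothetical vector of training labels. The labels are invented for this example.

```r
# Hypothetical training labels (illustrative only).
y <- factor(c("Down", "Up", "Up", "Down", "Up", "Up", "Down", "Up"))

# Estimated prior for each class: n_k / n, the share of observations in class k.
prior_hat <- table(y) / length(y)
prior_hat
```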

Estimating $f_k(X)$ is more challenging unless we assume a simple form for these densities. We refer to $p_k(x)$ as the posterior probability that an observation $X = x$ belongs to the $k$th class, that is, the probability that the observation belongs to the $k$th class given its predictor value. The Bayes classifier, which assigns an observation to the class for which $p_k(x)$ is largest, has the lowest possible error rate of all classifiers.
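For reference, here is a sketch of the form LDA is usually taken to assume. The symbols $μ_k$, $Σ$ and $δ_k$ are not defined above and are introduced only for this illustration: each class density is multivariate normal with its own mean $μ_k$ and a covariance matrix $Σ$ shared by all classes, with $p$ predictors.

$$ f_k(x) = \frac{1}{(2\pi)^{p/2} |Σ|^{1/2}} \exp\left( -\frac{1}{2} (x - μ_k)^T Σ^{-1} (x - μ_k) \right) $$

Here $\pi$ inside $(2\pi)^{p/2}$ is the mathematical constant, not a prior. Plugging this density into Bayes’ theorem, taking logs, and dropping terms that do not depend on $k$ gives a function that is linear in $x$:

$$ δ_k(x) = x^T Σ^{-1} μ_k - \frac{1}{2} μ_k^T Σ^{-1} μ_k + \log π_k $$

Under these assumptions, assigning an observation to the class with the largest $δ_k(x)$ is equivalent to assigning it to the class with the largest $p_k(x)$.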


----
We will compare the performance of LDA with that of logistic regression, again using the Smarket dataset to predict Direction. We fit an LDA model using the lda() function, which is part of the MASS library. Notice that the syntax for lda() is identical to that of lm(), and to that of glm() except for the absence of the family option.

We are going to train the model on those observations that occurred before the year 2005, and predict for those observations during 2005.


In [None]:
library(ISLR)
attach(Smarket)

# train is a logical vector with 1,250 elements, one per observation in the data set.
# Elements for observations before 2005 are TRUE; those for observations in 2005 are FALSE.
train = Year < 2005

# A logical vector can be used to subset the rows of a data frame. The `!` operator
# negates each element of train, so Smarket[!train, ] picks out the rows for which
# train is FALSE, i.e. the observations with dates in 2005.
Smarket.2005 = Smarket[!train, ]

# Check the dimensions of Smarket.2005
dim(Smarket.2005)

# Save the Direction values corresponding to 2005 dates.
Direction.2005 = Direction[!train]

In [None]:
library(MASS)
# Note: the formula must use the ASCII tilde `~`.
lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit

The model output indicates that $\hat{π}_1$ = 0.492 and $\hat{π}_2$ = 0.508, i.e. 49.2% of the training observations correspond to days during which the market went down. The output also provides the group means. These are the averages of each predictor within each class, and are used by LDA as estimates of $μ_k$.

The group means suggest that there is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous 2 days’ returns to be positive on days when the market declines. The coefficients of linear discriminants output gives the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule. If −0.642 × Lag1 − 0.514 × Lag2 is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline. The plot() function produces plots of the linear discriminants, obtained by computing −0.642 × Lag1 − 0.514 × Lag2 for each of the training observations.

In [None]:
plot(lda.fit)
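The linear discriminant scores can also be computed by hand from the fitted coefficients. The sketch below refits the model so that it runs on its own. The centring step is an assumption about how predict() in MASS reports its scores (it appears to subtract the prior-weighted average of the group means first), so the final comparison is there to check that assumption.

```r
library(ISLR)   # Smarket data
library(MASS)   # lda()

train   <- Smarket$Year < 2005
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

# Training predictors as a numeric matrix.
X <- as.matrix(Smarket[train, c("Lag1", "Lag2")])

# Raw linear combination using the fitted coefficients (lda.fit$scaling).
raw.scores <- X %*% lda.fit$scaling

# Assumed centring: subtract the prior-weighted average of the group means
# before applying the coefficients.
centre     <- colSums(lda.fit$prior * lda.fit$means)
ld1.manual <- scale(X, center = centre, scale = FALSE) %*% lda.fit$scaling

# Compare with the scores returned by predict() on the training data.
ld1.pred <- predict(lda.fit, Smarket[train, ])$x
all.equal(as.numeric(ld1.manual), as.numeric(ld1.pred))
```

The raw and centred scores differ only by a constant shift, so they rank the observations identically.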

The predict() function returns a list with three elements. The first element, class, contains LDA’s predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class. Finally, x contains the linear discriminants.

In [None]:
lda.pred=predict(lda.fit, Smarket.2005)
names(lda.pred)

In [None]:
lda.class = lda.pred$class
table(lda.class, Direction.2005)

mean(lda.class == Direction.2005)
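For the comparison with logistic regression, a baseline can be fitted on the same predictors and the same training period. This is a minimal sketch that rebuilds the train/test split so that it runs on its own. The names glm.fit, glm.probs and glm.pred are chosen here for illustration.

```r
library(ISLR)   # Smarket data

train          <- Smarket$Year < 2005
Smarket.2005   <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]

# Logistic regression with the same two predictors and training period.
glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Smarket,
               family = binomial, subset = train)

# glm() models the probability of the second factor level, here "Up".
glm.probs <- predict(glm.fit, Smarket.2005, type = "response")
glm.pred  <- ifelse(glm.probs > 0.5, "Up", "Down")

# Confusion matrix and test accuracy, comparable to the LDA results.
table(glm.pred, Direction.2005)
mean(glm.pred == Direction.2005)
```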

Applying a 50% threshold to the posterior probabilities allows us to recreate the predictions contained in lda.pred$class. The first column of the posterior matrix corresponds to the class Down.

In [None]:
sum(lda.pred$posterior[ ,1]>=0.5)

sum(lda.pred$posterior[,1]<0.5)
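The counts above only show how many days fall on each side of the threshold. A sketch that rebuilds the class labels and compares them with lda.pred$class element by element is given below. It refits the model so that it runs on its own, and manual.class is a name introduced here for illustration.

```r
library(ISLR)   # Smarket data
library(MASS)   # lda()

train    <- Smarket$Year < 2005
lda.fit  <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket[!train, ])

# Select the posterior column by name: Pr(Down | x).
manual.class <- ifelse(lda.pred$posterior[, "Down"] >= 0.5, "Down", "Up")

# Should agree with the labels from predict(), barring exact ties at 0.5.
all(manual.class == as.character(lda.pred$class))
table(manual.class)
```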

In [None]:
lda.pred$posterior[1:20 ,1]
lda.class[1:20]

If we wanted to use a posterior probability threshold other than 50% to make predictions, we could easily do so. For instance, suppose that we wish to predict a market decrease only if we are very certain that the market will indeed decrease on that day, say, if the posterior probability is at least 90%.

In [None]:
sum(lda.pred$posterior[,1]>.9)

No days in 2005 meet that threshold! In fact, the greatest posterior probability of decrease in all of 2005 was 52.02%.
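The largest posterior probability of a decrease can be checked directly. As before, this minimal sketch refits the model so that it runs on its own.

```r
library(ISLR)   # Smarket data
library(MASS)   # lda()

train    <- Smarket$Year < 2005
lda.fit  <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket[!train, ])

# Largest posterior probability of "Down" across the 2005 observations.
max.down <- max(lda.pred$posterior[, "Down"])
max.down

# Number of days on which this probability reaches a given threshold.
sum(lda.pred$posterior[, "Down"] >= 0.9)
```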