# Classification

Qualitative variables take values in an unordered set 𝒞, such as:

• eye color ∈ {black, brown, blue, green}<br>
• email ∈ {spam, not-spam}<br>

• Given a feature vector X and a qualitative response Y taking values in the set 𝒞, the classification task is to build a function C(X) that takes the feature vector X as input and predicts its value for Y; i.e. C(X) ∈ 𝒞.<br>
• Often we are more interested in estimating the probabilities that X belongs to each category in 𝒞.

## Use case - Credit Card Default

![Fig](imgs/lin_reg_024.png)

Suppose that for the Default classification task we code the response as<br>
Y = 0 if No, Y = 1 if Yes.<br>

Can we simply perform a linear regression of Y on X and classify as Yes if Ŷ > 0.5?<br>

• In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis, which we discuss later.<br>

• Since in the population E(Y |X = x) = Pr(Y = 1|X = x), we might think that regression is perfect for this task.

However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.
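The out-of-range problem is easy to reproduce. The sketch below fits ordinary least squares to a small, made-up 0/1 response (the data and the name `linear_fit` are illustrative assumptions, not the Default data) and checks the fitted values.

```python
import numpy as np

# Made-up toy data: x stands in for a predictor such as balance,
# y is a 0/1 response that becomes more likely as x grows.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1], dtype=float)

# Least-squares fit of y on an intercept and x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def linear_fit(x_new):
    """Fitted value of the linear regression, read as an estimate of Pr(Y = 1 | x)."""
    return beta[0] + beta[1] * np.asarray(x_new, dtype=float)

fitted = linear_fit(x)
print("range of fitted values on the data:", fitted.min(), fitted.max())
print("fitted values at x = -5 and x = 15:", linear_fit([-5.0, 15.0]))
```

On this toy data the smallest fitted value is already negative inside the observed range, and extrapolating to x = 15 gives a value above one, so the fit cannot be read as a probability everywhere.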

Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.<br>

Y = 1 if stroke; 2 if drug overdose; 3 if epileptic seizure.<br>

This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.<br>

Linear regression is not appropriate here. Multiclass logistic regression or discriminant analysis is more appropriate.

## Logistic Regression

Let’s write p(X) = Pr(Y = 1|X) for short and consider using balance to predict default. Logistic regression uses the form


![Fig](imgs/lin_reg_025.png)

(e ≈ 2.71828 is a mathematical constant, Euler's number.)<br>
It is easy to see that no matter what values β0, β1 or X take, p(X) will have values between 0 and 1.<br>
A bit of rearrangement gives

![Fig](imgs/lin_reg_026.png)

This monotone transformation is called the log odds or logit transformation of p(X). (By log we mean the natural log, ln.)
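As a quick numerical check of both claims, here is a minimal sketch of the logistic form and its logit inverse. The coefficient values are hypothetical, chosen only for illustration.

```python
import numpy as np

def logistic(z):
    """e^z / (1 + e^z), written in the equivalent form 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds of a probability p in (0, 1), using the natural log."""
    return np.log(p / (1.0 - p))

beta0, beta1 = -3.0, 0.5                 # hypothetical coefficients
x = np.array([-20.0, 0.0, 6.0, 40.0])    # deliberately includes extreme values
p = logistic(beta0 + beta1 * x)

print(p)          # every value lies strictly between 0 and 1
print(logit(p))   # recovers beta0 + beta1 * x up to rounding
```

The logit of p(X) returns the linear predictor β0 + β1X, which is the sense in which the model is linear on the log odds scale.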

## Maximum likelihood

We use maximum likelihood to estimate the parameters.

![Fig](imgs/lin_reg_027.png)

This likelihood gives the probability of the observed zeros and ones in the data. We pick β0 and β1 to maximize the likelihood of the observed data.
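One way to carry out the maximization is sketched below. It uses Newton-Raphson steps on the log-likelihood, which is a common choice but an assumption here, since the text does not say which optimizer is used. The data are simulated and the function names are illustrative.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log of prod_i p(x_i)^y_i * (1 - p(x_i))^(1 - y_i) for the logistic model."""
    z = X @ beta
    # log p = z - log(1 + e^z) and log(1 - p) = -log(1 + e^z)
    return np.sum(y * z - np.logaddexp(0.0, z))

def fit_logistic(X, y, n_iter=25):
    """Maximise the log-likelihood with Newton-Raphson updates, starting at zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - p)                  # score vector
        hessian = -(X.T * (p * (1.0 - p))) @ X    # second derivative matrix
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta

# Simulated data with known coefficients (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])   # column of ones gives the intercept
true_beta = np.array([-1.0, 2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_logistic(X, y)
print("estimates:", beta_hat)
print("log-likelihood at estimates:", log_likelihood(beta_hat, X, y))
```

Because `X` is a general design matrix, the same sketch covers several predictors by adding columns.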

For example, running logistic regression on the credit card default problem gives the following:

![Fig](imgs/lin_reg_028.png)

Estimated probability of default with a balance of $1000:

![Fig](imgs/lin_reg_029.png)

Estimated probability of default with a balance of $2000:

![Fig](imgs/lin_reg_030.png)
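The two calculations can be reproduced by plugging a balance into the fitted logistic form. The coefficients are not written out in the text here, so the values below are an assumption: they are the estimates commonly quoted for this Default example (intercept about −10.6513, balance about 0.0055). Substitute the values from your own fit if they differ.

```python
import math

# Assumed coefficient estimates (not stated in the surrounding text).
BETA0_HAT = -10.6513   # intercept
BETA1_HAT = 0.0055     # coefficient on balance, per dollar

def default_probability(balance):
    """Estimated Pr(default = Yes | balance) under the logistic model."""
    z = BETA0_HAT + BETA1_HAT * balance
    return math.exp(z) / (1.0 + math.exp(z))

p_1000 = default_probability(1000)
p_2000 = default_probability(2000)
print(f"balance $1000: {p_1000:.4f}")
print(f"balance $2000: {p_2000:.4f}")
```

Under these assumed coefficients the probability rises from well under 1% at $1000 to more than one half at $2000, which shows how strongly nonlinear the relationship is on the probability scale.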

### Logistic Regression with several variables

![Fig](imgs/lin_reg_031.png)

## Discriminant Analysis

Here the approach is to model the distribution of X in each of
the classes separately, and then use Bayes' theorem to flip things
around and obtain Pr(Y |X).<br>
When we use normal (Gaussian) distributions for each class,
this leads to linear or quadratic discriminant analysis.<br>
However, this approach is quite general, and other distributions
can be used as well. We will focus on normal distributions.

Bayes' Theorem

![Fig](imgs/lin_reg_032.png)

One can write this slightly differently:

![Fig](imgs/lin_reg_033.png)

• f<sub>k</sub>(x) = Pr(X = x | Y = k) is the density for X in class k. Here we will use normal densities for these, separately in each class.<br>
• π<sub>k</sub> = Pr(Y = k) is the marginal or prior probability for class k.<br>
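The two ingredients above combine into a posterior as follows. This is a minimal sketch for a single predictor, assuming normal class densities with a shared standard deviation; the priors, means and function names are made up for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posterior(x, priors, mus, sigma):
    """Pr(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x), for every class k."""
    priors = np.asarray(priors, dtype=float)
    mus = np.asarray(mus, dtype=float)
    weighted = priors * normal_pdf(x, mus, sigma)   # numerator for each class
    return weighted / weighted.sum()                # divide by the shared denominator

# x = 0.5 is equally far from both class means, so the densities cancel
# and the posterior equals the prior.
post = posterior(x=0.5, priors=[0.7, 0.3], mus=[-1.0, 2.0], sigma=1.0)
print(post)
```

When a point is equidistant from the two class means, the data carry no information about the class and the posterior falls back on the prior.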

### Why Discriminant Analysis

• When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.<br>
• If n is small and the distribution of the predictors X is
approximately normal in each of the classes, the linear
discriminant model is again more stable than the logistic
regression model.<br>
• Linear discriminant analysis is popular when we have more
than two response classes, because it also provides
low-dimensional views of the data.<br>

The Gaussian density has the form:

![Fig](imgs/lin_reg_034.png)

Here µ<sub>k</sub> is the mean and σ<sub>k</sub><sup>2</sup> the variance in class k. We will assume that all the σ<sub>k</sub> = σ are the same.<br>
Plugging this into Bayes' formula, we get a rather complex expression for p<sub>k</sub>(x) = Pr(Y = k | X = x):<br>

![Fig](imgs/lin_reg_035.png)

To classify at the value X = x, we need to see which of the p<sub>k</sub>(x) is largest. Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score:

![Fig](imgs/lin_reg_036.png)

Note that δ<sub>k</sub>(x) is a linear function of x.<br>
If there are K = 2 classes and π<sub>1</sub> = π<sub>2</sub> = 0.5, then one can see that the decision boundary is at<br>
x = (µ<sub>1</sub> + µ<sub>2</sub>)/2
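The boundary claim can be checked numerically. The sketch below assumes the standard one-predictor LDA score, δ_k(x) = x·µ_k/σ² − µ_k²/(2σ²) + log π_k, and uses made-up means and a shared σ.

```python
import numpy as np

def discriminant_scores(x, mus, sigma, priors):
    """One-predictor LDA scores delta_k(x) for every class k (linear in x)."""
    mus = np.asarray(mus, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return x * mus / sigma**2 - mus**2 / (2.0 * sigma**2) + np.log(priors)

# Illustrative parameters: two classes with equal priors.
mus, sigma, priors = [-1.0, 3.0], 1.0, [0.5, 0.5]
boundary = (mus[0] + mus[1]) / 2.0

at_boundary = discriminant_scores(boundary, mus, sigma, priors)
left = discriminant_scores(boundary - 0.1, mus, sigma, priors)
right = discriminant_scores(boundary + 0.1, mus, sigma, priors)

print("scores at the boundary:", at_boundary)   # the two scores tie
print("class just left of it:", int(np.argmax(left)))
print("class just right of it:", int(np.argmax(right)))
```

The two scores tie exactly at the midpoint of the means, and the predicted class switches on either side of it. With unequal priors the log π_k terms no longer cancel and the boundary shifts toward the less likely class.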

### Example

![Fig](imgs/lin_reg_037.png)

Once we have estimates δ̂<sub>k</sub>(x), we can turn these into estimates for the class probabilities P̂r(Y = k | X = x). Classifying to the largest δ̂<sub>k</sub>(x) then amounts to classifying to the class for which P̂r(Y = k | X = x) is largest.

![Fig](imgs/lin_reg_038.png)
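A minimal end-to-end sketch of this step, for one predictor, is given below. It assumes the usual plug-in estimates (class proportions for the priors, class sample means, and a pooled variance with n − K in the denominator) and converts the scores to probabilities by exponentiating and normalizing. The data are simulated and all names are illustrative.

```python
import numpy as np

# Simulated training data: class 0 centred at -1, class 1 centred at 2.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1.0, 1.0, 60), rng.normal(2.0, 1.0, 40)])
y = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

# Plug-in estimates (assumed estimators, labels are 0..K-1).
classes = np.unique(y)
pi_hat = np.array([np.mean(y == k) for k in classes])      # class proportions
mu_hat = np.array([x[y == k].mean() for k in classes])     # class means
residuals = x - mu_hat[y]                                   # deviation from own class mean
sigma2_hat = np.sum(residuals**2) / (len(x) - len(classes)) # pooled variance

def class_probabilities(x_new):
    """Estimated Pr(Y = k | X = x) from the estimated discriminant scores."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))[:, None]
    delta = x_new * mu_hat / sigma2_hat - mu_hat**2 / (2.0 * sigma2_hat) + np.log(pi_hat)
    delta = delta - delta.max(axis=1, keepdims=True)   # stabilise the exponentials
    weights = np.exp(delta)
    return weights / weights.sum(axis=1, keepdims=True)

probs = class_probabilities([-2.0, 0.5, 3.0])
print(probs)                    # one row per query point, one column per class
print(probs.argmax(axis=1))     # same classes as taking the largest score
```

Subtracting the row maximum before exponentiating does not change the normalized probabilities, so the class with the largest score is always the class with the largest estimated probability.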

For a two-class problem, one can show that for LDA

![Fig](imgs/lin_reg_039.png)

So it has the same form as logistic regression.
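For a single predictor the argument is short. Assuming the one-predictor score δ_k(x) = x µ_k/σ² − µ_k²/(2σ²) + log π_k, and using the fact that p_k(x) is proportional to e^{δ_k(x)}, the log odds are the difference of the two scores (the symbols c_0 and c_1 are just labels for the resulting constants):

$$
\begin{aligned}
\log\frac{p_1(x)}{1 - p_1(x)} &= \log\frac{p_1(x)}{p_2(x)} = \delta_1(x) - \delta_2(x) \\
&= \underbrace{\log\frac{\pi_1}{\pi_2} - \frac{\mu_1^2 - \mu_2^2}{2\sigma^2}}_{c_0} + \underbrace{\frac{\mu_1 - \mu_2}{\sigma^2}}_{c_1}\, x .
\end{aligned}
$$

The log odds are linear in x, exactly as in logistic regression; only the way c_0 and c_1 are obtained differs.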

The difference is in how the parameters are estimated.
• Logistic regression uses the conditional likelihood based on
Pr(Y |X) (known as discriminative learning). <br>
• LDA uses the full likelihood based on Pr(X, Y ) (known as
generative learning). <br>
• Despite these differences, in practice the results are often
very similar.

• Logistic regression is very popular for classification, especially when K = 2.<br>
• LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. It is also useful when K > 2.