# Classification

Linear regression models assume a quantitative response $Y$.

But: often $Y$ is qualitative (categorical), e.g. eye color with the values "blue", "brown", "green".

$\Rightarrow$ Predicting a qualitative response $Y$ is called classification.

Widely used classifiers:
 - logistic regression
 - linear discriminant analysis (LDA)
 - K-nearest neighbors (KNN)
 - quadratic discriminant analysis (QDA)

## Why Not Linear Regression?

Again, linear regression models assume a quantitative response variable $Y$.

But: the classes of a qualitative response have no natural ordering, and the distances between the classes are not equal (or not even defined). Coding the classes as numbers (e.g. 1, 2, 3) would impose both.

Hence, quantitative $\neq$ qualitative!

Note: For a binary (two-level) qualitative response we can use a dummy variable approach!

$$Y=
\begin{cases}
0 & \text{False (group A)}\\
1 & \text{True (group B)}
\end{cases}$$

Assign an observation to group B if $\hat{y}>0.5$.

$$Pr(group=B|X)=X\hat{\beta}$$

But: there is no guarantee that $X\hat{\beta}\in [0,1]$, so the fitted values cannot always be interpreted as probabilities.
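The following is a minimal sketch of this problem on made-up data (the vectors `x` and `y` are assumptions for illustration only, not part of the lecture data). It fits a linear model to a 0/1 dummy response and inspects the range of the fitted values.

```r
# Hypothetical data: the response is 1 whenever x is positive
x <- seq(-5, 5, length.out = 200)
y <- as.numeric(x > 0)

# Linear probability model: regress the dummy variable on x
fit   <- lm(y ~ x)
p_lin <- fitted(fit)

# The fitted "probabilities" leave the interval [0, 1] at both ends
range(p_lin)
```

At the extremes of `x` the fitted values fall below 0 and above 1, which is exactly what a probability model should rule out.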

## Logistic Regression

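Goal: model $p(X) = Pr(group = B | X)$ such that it always lies between 0 and 1.

linear model (not restricted to $[0,1]$): 
 $$p(X)=\beta_0 + \beta_1 X$$
 
logistic function:
 $$p(X)=\frac{e^{\beta_0 + \beta_1 X}}{1+e^{\beta_0 + \beta_1 X}}$$
 
Hence, the odds and the log-odds (logit) are:
 $$ \begin{aligned}
 \frac{p(X)}{1-p(X)} &= e^{\beta_0 + \beta_1 X} \in (0,\infty) \\
 \log \left(\frac{p(X)}{1-p(X)}\right) &=  \beta_0 + \beta_1 X
 \end{aligned} $$
 - the logistic function $p(X)$ is S-shaped, while the log-odds are linear in $X$
 - estimation using maximum likelihood (ML)
 - logistic regression for $>2$ classes: e.g. with three classes A, B and C, the probability of the last class follows from the other two
 $$ Pr(group = C | X) = 1- Pr(group = A | X)- Pr(group = B | X)$$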
 
**Logistic regression models the conditional distribution of $Y$ given $X$ directly.**
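The relation between the logistic function, the odds and the log-odds can be checked numerically. The sketch below uses made-up coefficients `beta0` and `beta1` (assumptions for illustration, not estimates from any data).

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -1
beta1 <- 0.8

x   <- seq(-10, 10, by = 0.5)
eta <- beta0 + beta1 * x            # linear predictor

# Logistic function: maps the linear predictor into (0, 1)
p <- exp(eta) / (1 + exp(eta))

# Odds lie in (0, Inf); taking logs recovers the linear predictor
odds     <- p / (1 - p)
log_odds <- log(odds)

all(p > 0 & p < 1)                  # probabilities stay inside (0, 1)
all.equal(log_odds, eta)            # the log-odds are linear in x
all.equal(p, plogis(eta))           # base R's plogis() is the same function
```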


## Linear Discriminant Analysis (LDA)

 * model the distribution of the predictors $X$ separately in each of the classes
 * use Bayes' theorem to flip these around into estimates for $Pr(Y=k|X=x)$
 
### Why Not Use Logistic Regression?
 - if the classes are well separated, the logistic regression estimates are unstable
 - if $n$ is small and the distribution of the predictors $X$ is approx. normal in each class, LDA is more stable than logistic regression
 - LDA is popular if there are $>2$ classes
 
### Bayes Theorem for Classification Problems

 - $\pi_k\equiv Pr(Y=k)$ is the prior probability that a random observation belongs to class $k$
 - $f_k(X)\equiv Pr(X=x|Y=k)$ is the density of $X$ for an observation from class $k$
 
 $$
 \begin{aligned}
 p_k(X) &= Pr(Y=k|X=x)\\
 &= \frac{Pr(X=x\mid Y=k)Pr(Y=k)}{Pr(X=x)}\\
 &= \frac{\pi_k f_k(X)}{\sum_{l=1}^K \pi_l f_l(X)}
 \end{aligned}
 $$
 
Thus, $p_k(X)$ is the posterior probability.
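A small numerical sketch of the posterior formula, assuming one predictor, $K=2$ classes and normal densities with a common standard deviation. All numbers (`prior`, `mu`, `sigma`, `x0`) are made up for illustration.

```r
# Hypothetical setup: two classes, one predictor
prior <- c(0.375, 0.625)   # pi_k
mu    <- c(-3, 5)          # class means
sigma <- 4                 # common standard deviation
x0    <- 2                 # point at which to evaluate the posterior

# f_k(x0): normal density of each class evaluated at x0
f <- dnorm(x0, mean = mu, sd = sigma)

# Bayes' theorem: posterior = prior * density / sum over all classes
post <- prior * f / sum(prior * f)

post              # posterior probabilities, they sum to 1
which.max(post)   # class with the largest posterior
```

Here `x0` is closer to the mean of class 2, which also has the larger prior, so class 2 receives the larger posterior probability.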

Hence, for estimation we need to estimate $\pi_k$ from the sample (e.g. as the share of observations in class $k$) and make assumptions about the functional form of $f_k$.

The LDA formula is derived by plugging the assumed functional form into Bayes' theorem, then taking the logarithm and rearranging the terms!
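As a sketch of this derivation, assume (as LDA does) that $f_k$ is multivariate normal with class mean $\mu_k$ and a covariance matrix $\Sigma$ common to all classes. The symbols $\delta_k$ (discriminant function) and $c$ (constant) are notation introduced here for illustration.

$$
\begin{aligned}
f_k(x) &= \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right)\\
\log\left(\pi_k f_k(x)\right) &= \log \pi_k - \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + c\\
&= \underbrace{x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k}_{\delta_k(x)} - \frac{1}{2}x^T\Sigma^{-1}x + c
\end{aligned}
$$

with $c = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma|$. The terms $c$ and $-\frac{1}{2}x^T\Sigma^{-1}x$ do not depend on $k$, and the denominator of Bayes' theorem is the same for all classes. An observation is therefore assigned to the class with the largest $\delta_k(x)$, which is linear in $x$.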

### Sensitivity, Specificity and total error

* Sensitivity = e.g. percentage of true defaulters correctly identified
* Specificity = e.g. percentage of non-defaulters correctly identified

**True-positive rate = sensitivity**

**False-positive rate = 1-specificity**
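These quantities can be read off a confusion matrix. The sketch below uses two short made-up vectors (`truth` and `pred`, with 1 = defaulter and 0 = non-defaulter); they are assumptions for illustration only.

```r
# Hypothetical labels: 1 = defaulter, 0 = non-defaulter
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred  <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Confusion matrix: rows = predicted class, columns = true class
table(predicted = pred, actual = truth)

TP <- sum(pred == 1 & truth == 1)   # true positives
FN <- sum(pred == 0 & truth == 1)   # false negatives
TN <- sum(pred == 0 & truth == 0)   # true negatives
FP <- sum(pred == 1 & truth == 0)   # false positives

sensitivity <- TP / (TP + FN)       # true-positive rate
specificity <- TN / (TN + FP)
fpr         <- 1 - specificity      # false-positive rate
total_error <- mean(pred != truth)  # share of misclassified observations
```

With these vectors, 3 of the 4 defaulters and 4 of the 6 non-defaulters are classified correctly, so 3 of the 10 observations are misclassified in total.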

## Quadratic Discriminant Analysis (QDA)

In principle the same as LDA, but QDA assumes that **each class** has its **own covariance matrix**:

$$ X \mid Y=k \sim N(\mu_k, \Sigma_k) $$
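Plugging this class-specific normal density into Bayes' theorem and taking logs gives the following discriminant function (a sketch; $\delta_k$ is notation introduced here, and terms that are identical for all classes are dropped):

$$
\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\log|\Sigma_k| + \log \pi_k
$$

Because $\Sigma_k$ differs across classes, the term $x^T\Sigma_k^{-1}x$ no longer cancels when comparing classes, so $\delta_k(x)$ is quadratic in $x$. An observation is assigned to the class with the largest $\delta_k(x)$.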

### Bias-variance Trade-off

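* with $p$ predictors, estimating a common $\Sigma$ (LDA) entails estimating
  $$ \frac{p(p+1)}{2} $$
  parameters
  
* estimating a separate $\Sigma_k$ for each of the $K$ classes (QDA) entails estimating
  $$ K \cdot \frac{p(p+1)}{2} $$
  parameters
  
$\Rightarrow$ LDA is much less flexible than QDA (it assumes a common $\Sigma$ instead of $\Sigma_k$, thus LDA is linear in $X$) and therefore has lower variance.

Note: If the classes do not share a common covariance matrix, i.e. $\Sigma_k \neq \Sigma$ for some $k$, LDA can suffer from high bias!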
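To make the parameter counts concrete, here is the arithmetic for two illustrative cases with $K=2$ classes (the choice $p=50$ is an assumption made only for the example):

$$
\begin{aligned}
p=2:&\quad \frac{2\cdot 3}{2}=3 \ \text{(LDA)} \qquad 2\cdot\frac{2\cdot 3}{2}=6 \ \text{(QDA)}\\
p=50:&\quad \frac{50\cdot 51}{2}=1275 \ \text{(LDA)} \qquad 2\cdot\frac{50\cdot 51}{2}=2550 \ \text{(QDA)}
\end{aligned}
$$

The gap grows with both $p$ and $K$, which is why QDA needs considerably more observations than LDA to keep its variance under control.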

## Comparison of Classifiers

### Logistic Regression vs LDA

 * both produce linear decision boundaries
 * they differ in estimation procedure and assumptions
 
If the observations are Gaussian with a common covariance matrix $\Sigma$ in each class:
**LDA > Logistic Regression**

### K-nearest Neighbors

$$ Pr(Y=j|X=x_0)=\frac{1}{K} \sum_{i\in N_0} I(y_i=j) $$

KNN is a non-parametric approach: it makes no assumptions about the shape of the decision boundary.

It is useful if the decision boundary is highly nonlinear; in that case:

**KNN > LDA & Logistic Regression**

But: no coefficient estimates, so we cannot tell which predictors are important!
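The KNN estimate of $Pr(Y=j|X=x_0)$ can be sketched in a few lines of base R. The helper `knn_prob()` and the toy data below are hypothetical and only serve as illustration; Euclidean distance is assumed.

```r
# Hypothetical helper: KNN class probabilities at a single point x0
knn_prob <- function(X, y, x0, K) {
  d  <- sqrt(rowSums(sweep(X, 2, x0)^2))   # Euclidean distance to x0
  N0 <- order(d)[seq_len(K)]               # indices of the K nearest neighbors
  # share of each class among the neighbors: (1/K) * sum of I(y_i = j)
  table(factor(y[N0], levels = sort(unique(y)))) / K
}

# Toy data: three points per class
X <- rbind(c(0, 0), c(1, 0), c(0, 1),
           c(5, 5), c(6, 5), c(5, 6))
y <- c("A", "A", "A", "B", "B", "B")

p3 <- knn_prob(X, y, x0 = c(0.5, 0.5), K = 3)   # only class A neighbors
p5 <- knn_prob(X, y, x0 = c(0.5, 0.5), K = 5)   # 3 x class A, 2 x class B
```

With `K = 3` all neighbors of `x0` belong to class A. With `K = 5` two points of class B enter the neighborhood, so the estimated probability of class A drops to 3/5.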

### QDA as compromise

QDA is a compromise between the non-parametric KNN and the linear LDA/ Logistic Regression:
 - not as flexible as KNN
 - performs better than KNN with a limited sample size (number of observations)

## Exercises for LDA and QDA

Consider the following data generating process in which $n$ observations belong to one of two classes. There are two covariates, $x_1$ and $x_2$, drawn from a bivariate normal distribution with class-specific means. The class means are $\mu_1 = (-3 \quad 3)$ for class 1 and $\mu_2 = (5 \quad 5)$ for class 2, and $\Sigma_1 = \Sigma_2 = \Sigma$. Initially, you may set $\Sigma=\begin{pmatrix} 16 & -2 \\ -2 & 9 \end{pmatrix}$ and $n_1 =300$ and $n_2 =500$.
The goal of this exercise is to compare the performance of linear discriminant analysis and quadratic discriminant analysis when classifying observations.

In [12]:
library(MASS)       # provides lda(), qda() and mvrnorm()
# Note: library(mvtnorm) is not needed here, since mvrnorm() comes from MASS
# Note: the formula interface of lda()/qda() expects the data as a data frame
# class 1
n1 = 300
mu1= c(-3,3)
# class 2
n2 = 500
mu2= c(5,5)
# Variance-Covariance Matrix
covmat=matrix(c(16,-2,-2,9), nrow=2, ncol=2)
#total number of observations
N=n1+n2

a) Generate the covariates from a multivariate normal distribution using the $\mu_k$ and $\Sigma$ as described above and an indicator variable indicating class dependence for n observations and combine these in a data frame.

In [13]:
## a)
# DGP - for both classes
set.seed(123)
X1=mvrnorm(n = n1, mu = mu1, Sigma = covmat)
X2=mvrnorm(n = n2, mu = mu2, Sigma = covmat)
df1 <- data.frame(X1)
df1['class'] = 1
df2 <- data.frame(X2)
df2['class'] = 2
df  <- rbind(df1, df2)   # stack both classes row-wise (merge() is meant for joins)
rm(df1, df2, X1, X2)

b) Calculate the linear discriminant analysis and quadratic discriminant analysis, estimating all relevant quantities.

In [14]:
## b)
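# LDA: summary() only lists the components of the fitted object;
# print(mod_lda) shows priors, group means and discriminant coefficients
mod_lda  =  lda(class ~ X1 + X2, data=df)
summary(mod_lda)
# predicted classes are a factor with levels "1" and "2"; convert to 1 and 2
class_lfit  <- as.numeric(predict(mod_lda)$class)
# QDA: same steps with class-specific covariance matrices
mod_qda  =  qda(class ~ X1 + X2 , data=df)
summary(mod_qda)
class_qfit  <- as.numeric(predict(mod_qda)$class)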

        Length Class  Mode     
prior   2      -none- numeric  
counts  2      -none- numeric  
means   4      -none- numeric  
scaling 2      -none- numeric  
lev     2      -none- character
svd     1      -none- numeric  
N       1      -none- numeric  
call    3      -none- call     
terms   3      terms  call     
xlevels 0      -none- list     

        Length Class  Mode     
prior   2      -none- numeric  
counts  2      -none- numeric  
means   4      -none- numeric  
scaling 8      -none- numeric  
ldet    2      -none- numeric  
lev     2      -none- character
N       1      -none- numeric  
call    3      -none- call     
terms   3      terms  call     
xlevels 0      -none- list     
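Since `summary()` only lists the components of the fitted object, the estimated quantities themselves have to be read from those components. The sketch below regenerates data as in a) in a self-contained way (the object names `fit_lda` and `fit_qda` are chosen here for illustration) and assumes the `MASS` package is installed.

```r
library(MASS)   # mvrnorm(), lda(), qda()

set.seed(123)
covmat <- matrix(c(16, -2, -2, 9), nrow = 2)
X  <- rbind(mvrnorm(300, c(-3, 3), covmat),
            mvrnorm(500, c(5, 5), covmat))
df <- data.frame(X1 = X[, 1], X2 = X[, 2],
                 class = rep(c(1, 2), times = c(300, 500)))

fit_lda <- lda(class ~ X1 + X2, data = df)
fit_qda <- qda(class ~ X1 + X2, data = df)

fit_lda$prior     # estimated priors pi_k = n_k / n
fit_lda$means     # estimated class means mu_k (one row per class)
fit_lda$scaling   # coefficients of the linear discriminant
fit_qda$means     # QDA uses the same class means, but its own scaling per class
```

The priors are simply the class shares in the sample, here 300/800 and 500/800.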

c) Calculate the mean training error for both methods and compare.

In [15]:
## c)
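# Note: the classes are on a nominal scale, so the MSE is not appropriate:
#       it would treat the numeric class labels as meaningful distances
#       (e.g. class 1 vs. 3 twice as far apart as class 1 vs. 2).
#       For the same reason it is not appropriate to use OLS.
#       Instead we use the share of misclassified observations.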
MTE_LDA=sum(class_lfit!=df$class)/N
MTE_QDA=sum(class_qfit!=df$class)/N
print(MTE_LDA)
print(MTE_QDA)
print(MTE_LDA-MTE_QDA)

[1] 0.11875
[1] 0.12
[1] -0.00125


**Simulation Study**

a) Evaluate the difference between the two methods through calculating classification training error in a simulation study for 100 different samples.

In [16]:
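# a)
rm(list=ls())   # clear the workspace (loaded packages stay attached)
cat("\014")     # clear the console

MCN=100                 # number of Monte Carlo replications
# despite its name, MSE stores misclassification rates: column 1 = LDA, column 2 = QDA
MSE=matrix(NaN,MCN,2)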


n1 = 300
mu1= c(-3,3)
n2 = 500
mu2= c(5,5)
covmat=matrix(c(16,-2,-2,9), nrow=2, ncol=2)
N=n1+n2

set.seed(123)

for (i in 1:MCN){
  X1=mvrnorm(n = n1, mu = mu1, Sigma = covmat)
  X2=mvrnorm(n = n2, mu = mu2, Sigma = covmat)
  df1 <- data.frame(X1)
  df1['class'] = 1
  df2 <- data.frame(X2)
  df2['class'] = 2
  df  <- rbind(df1, df2)   # stack both classes row-wise (merge() is meant for joins)

  mod_lda  =  lda(class ~ X1 + X2, data=df)
  class_lfit  <- as.numeric(predict(mod_lda)$class)
  mod_qda  =  qda(class ~ X1 + X2 , data=df)
  class_qfit  <- as.numeric(predict(mod_qda)$class)

  MSE[i,1]=sum(class_lfit!=df$class)/N
  MSE[i,2]=sum(class_qfit!=df$class)/N
}

avg_MSE_LDA=mean(MSE[,1])
avg_MSE_QDA=mean(MSE[,2])

#par(mfrow=c(1,2))
#plot(MSE[,1], ylab="LDA")
#abline(h=avg_MSE_LDA, col="red")
#plot(MSE[,2], ylab="QDA")
#abline(h=avg_MSE_QDA, col="red")

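# Note: this is the difference of two averages, i.e. a single number,
#       so all summary statistics below coincide
summary(avg_MSE_LDA-avg_MSE_QDA)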



   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1e-04  -1e-04  -1e-04  -1e-04  -1e-04  -1e-04 
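To see how the difference varies across samples, one can summarise the per-replication differences instead of the difference of the two averages. The following is a self-contained sketch under the same data generating process; the helper `one_rep()` and the object names are hypothetical, and the `MASS` package is assumed to be installed.

```r
library(MASS)   # mvrnorm(), lda(), qda()

# Hypothetical helper: one replication, returns both training error rates
one_rep <- function(n1, n2, mu1, mu2, S1, S2) {
  X  <- rbind(mvrnorm(n1, mu1, S1), mvrnorm(n2, mu2, S2))
  df <- data.frame(X1 = X[, 1], X2 = X[, 2],
                   class = factor(rep(c(1, 2), times = c(n1, n2))))
  pred_lda <- predict(lda(class ~ X1 + X2, data = df))$class
  pred_qda <- predict(qda(class ~ X1 + X2, data = df))$class
  c(LDA = mean(pred_lda != df$class),
    QDA = mean(pred_qda != df$class))
}

set.seed(123)
covmat <- matrix(c(16, -2, -2, 9), nrow = 2)

# 100 replications; one row per replication, columns "LDA" and "QDA"
errs <- t(replicate(100, one_rep(300, 500, c(-3, 3), c(5, 5), covmat, covmat)))

# Per-replication difference in training error (LDA minus QDA)
err_diff <- errs[, "LDA"] - errs[, "QDA"]
summary(err_diff)
```

Passing two different covariance matrices as `S1` and `S2` turns the same helper into the setup of part b).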

b) Consider the theoretical properties of LDA and QDA that we discussed in the lecture: Which properties of the initial simulation setup could we manipulate in order to increase the difference between the classification error of LDA and QDA? Test your intuition by performing a suitable simulation study.

In [17]:
# b)
# Note: if the classes have different covariance matrices sigma 1 and sigma 2,
#       QDA is more accurate than LDA
#       From a theoretical perspective:
#       * if LDA's assumption that the K classes share a common covariance matrix
#          is badly off, then LDA --> high bias
#       * LDA is a much less flexible classifier than QDA --> lower variance
# Hence: Try with different covariance matrices

rm(list=ls())
cat("\014")

MCN=100
MSE=matrix(NaN,MCN,2)


n1 = 300
mu1= c(-3,3)
n2 = 500
mu2= c(5,5)
covmat_1=matrix(c(16,-2,-2,9), nrow=2, ncol=2)
covmat_2=matrix(c(10,-2,-2,5), nrow=2, ncol=2)
N=n1+n2

set.seed(123)

for (i in 1:MCN){
  X1=mvrnorm(n = n1, mu = mu1, Sigma = covmat_1)
  X2=mvrnorm(n = n2, mu = mu2, Sigma = covmat_2)
  df1 <- data.frame(X1)
  df1['class'] = 1
  df2 <- data.frame(X2)
  df2['class'] = 2
  df  <- rbind(df1, df2)   # stack both classes row-wise (merge() is meant for joins)

  mod_lda  =  lda(class ~ X1 + X2, data=df)
  class_lfit  <- as.numeric(predict(mod_lda)$class)
  mod_qda  =  qda(class ~ X1 + X2 , data=df)
  class_qfit  <- as.numeric(predict(mod_qda)$class)

  MSE[i,1]=sum(class_lfit!=df$class)/N
  MSE[i,2]=sum(class_qfit!=df$class)/N
}

avg_MSE_LDA=mean(MSE[,1])
avg_MSE_QDA=mean(MSE[,2])

#par(mfrow=c(1,2))
#plot(MSE[,1], ylab="LDA")
#abline(h=avg_MSE_LDA, col="red")
#plot(MSE[,2], ylab="QDA")
#abline(h=avg_MSE_QDA, col="red")

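# Note: this is the difference of two averages, i.e. a single number,
#       so all summary statistics below coincide
summary(avg_MSE_LDA-avg_MSE_QDA)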



    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002275 0.002275 0.002275 0.002275 0.002275 0.002275 