# Chapter 4: Classification

## Logistic Regression (for binary outcomes)

$$
p(X)=Pr(Y=1|X)
$$
logistic function:
$$
p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}
$$
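
A minimal numerical sketch of the logistic function, assuming NumPy. The coefficient values are made up for illustration. It also checks that the log odds $\log\left(\frac{p(X)}{1-p(X)}\right)$ recover the linear predictor $\beta_0+\beta_1X$.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), written as 1 / (1 + e^-(b0 + b1*x))."""
    z = beta0 + beta1 * np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Logit: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# Illustrative (made-up) coefficients and inputs.
x = np.array([-2.0, 0.0, 2.0])
p = logistic(x, beta0=0.5, beta1=1.5)

print(p)            # probabilities strictly between 0 and 1
print(log_odds(p))  # equals 0.5 + 1.5 * x up to rounding
```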

*confounding problem*

## Multinomial Logistic Regression

$$
\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}, \quad \text{for } k = 1, \dots, K - 1
$$

And specifically for the $K$th class (the baseline):

$$
\Pr(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}
$$

Then,

$$
\log\left(\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)}\right) = \beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p
$$

It shows that the **log odds** between any pair of classes is linear in the features.

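**Softmax coding** treats all $K$ classes symmetrically instead of selecting a baseline class:

$$
\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}, \quad \text{for } k = 1, \dots, K
$$

The log odds between classes $k$ and $k'$ are then

$$
\log\left(\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = k' \mid X = x)}\right) = (\beta_{k0}-\beta_{k'0}) + (\beta_{k1}-\beta_{k'1}) x_1 + \cdots + (\beta_{kp}-\beta_{k'p}) x_p
$$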
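
A minimal sketch of the softmax form, assuming NumPy. The coefficient matrix `B` and the input `x` are made-up values for illustration. The last row of `B` is set to zero, which reproduces the baseline coding with class $K$ as the baseline, so both codings can be checked with the same function.

```python
import numpy as np

def softmax_probs(x, B):
    """Class probabilities under softmax coding.

    B has shape (K, p + 1); row k holds (beta_k0, beta_k1, ..., beta_kp).
    """
    x = np.asarray(x, dtype=float)
    scores = B[:, 0] + B[:, 1:] @ x      # linear predictor for each class
    scores = scores - scores.max()       # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()

# Illustrative coefficients for K = 3 classes and p = 2 predictors.
B = np.array([[0.2, 1.0, -0.5],
              [-0.1, 0.3, 0.8],
              [0.0, 0.0, 0.0]])   # zero row: class 3 acts as the baseline
x = np.array([1.5, -0.7])

probs = softmax_probs(x, B)
diff = B[0] - B[1]
print(probs, probs.sum())
# Log odds between classes 1 and 2 equal the difference of linear predictors.
print(np.log(probs[0] / probs[1]), diff[0] + diff[1:] @ x)
```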

## Generative Models for Classification
Suppose that we wish to classify an observation into one of K classes.
Denote $\pi_k$ as the prior probability that a randomly chosen observation comes from the kth class, and let $f_k(X) = Pr(X|Y=k)$ be the density function of X for an observation that comes from the kth class. 

**Bayes’ theorem**:
$$
Pr(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)}
$$

$\pi_k$ is easy to estimate when the sample is drawn at random from the population: use the fraction of training observations that belong to the $k$th class. $f_k(x)$ can be estimated in multiple ways.
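
A small numerical sketch of Bayes' theorem as used here, assuming NumPy. The priors and the density values $f_k(x)$ at a single point $x$ are made-up numbers, not estimates from any data.

```python
import numpy as np

def posterior(priors, densities):
    """Pr(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    joint = np.asarray(priors, dtype=float) * np.asarray(densities, dtype=float)
    return joint / joint.sum()

# Hypothetical two-class example: priors pi_k and density values f_k(x).
post = posterior(priors=[0.7, 0.3], densities=[0.1, 0.4])
print(post)  # class 2 wins despite its smaller prior
```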

### 1. Linear Discriminant Analysis for p=1
***Assumption 1*** $f_k(x)$ is **normal or Gaussian**, $f_k(x) = \frac{1}{\sqrt{2\pi} \, \sigma_k} \exp\left( -\frac{1}{2} \left( \frac{x - \mu_k}{\sigma_k} \right)^2 \right)$.

***Assumption 2*** a **shared variance term** $\sigma^2$ across classes.

Then the *Bayes classifier* assigns the observation to the class for which $\delta_k(x)=x\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)$ is largest.
**Linear Discriminant Analysis** uses the following estimates:
$$
\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i 
$$
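$$
\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2
$$
$$
\hat{\pi}_k = n_k / n
$$

Here $n$ is the total number of training observations and $n_k$ is the number of training observations in the $k$th class.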

Then the estimated discriminant function is $\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_k)$, which is a linear function of $x$ (hence the name *linear* discriminant analysis).
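
The plug-in estimates can be sketched as follows, assuming NumPy. The tiny data set is made up for illustration, and the variance is pooled across classes with divisor $n - K$.

```python
import numpy as np

def fit_lda_1d(x, y):
    """Estimate class means, pooled variance, and priors for one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, K = len(x), len(classes)
    mu = np.array([x[y == k].mean() for k in classes])
    pi = np.array([(y == k).mean() for k in classes])   # n_k / n
    # Pooled within-class sum of squares, divided by n - K.
    ss = sum(((x[y == k] - m) ** 2).sum() for k, m in zip(classes, mu))
    sigma2 = ss / (n - K)
    return classes, mu, sigma2, pi

def lda_discriminants(x0, mu, sigma2, pi):
    """delta_k(x0) for every class; linear in x0."""
    return x0 * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

# Made-up training data: two classes, three observations each.
x = np.array([-1.2, -0.8, -1.0, 0.9, 1.1, 1.3])
y = np.array([0, 0, 0, 1, 1, 1])

classes, mu, sigma2, pi = fit_lda_1d(x, y)
pred = classes[np.argmax(lda_discriminants(0.4, mu, sigma2, pi))]
print(mu, sigma2, pi, pred)
```

With equal priors the decision boundary sits at the midpoint of the two estimated means, so a new point at 0.4 falls on the side of the second class.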

### 2. Linear Discriminant Analysis for p>1 (multiple predictors)
***Assumption*** $X = (X_1, X_2, \dots, X_p)$ is drawn from a **multivariate Gaussian (or multivariate normal) distribution**, with a class-specific mean vector $\boldsymbol{\mu}_k$ and a common covariance matrix $\boldsymbol{\Sigma}$. The multivariate Gaussian density is

$$
f(x) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (x - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x - \boldsymbol{\mu}) \right)
$$

The Bayes classifier assigns an observation $X = x$ to the class for which the following is largest:

$$
\delta_k(x) = x^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log \pi_k
$$
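
A sketch of the multivariate discriminant, assuming NumPy. The data are simulated, and the pooled covariance estimate with divisor $n - K$ is an assumption chosen by analogy with the $p = 1$ case.

```python
import numpy as np

def fit_lda(X, y):
    """Class means, pooled covariance (divisor n - K), and priors."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    mus = np.array([X[y == k].mean(axis=0) for k in classes])   # shape (K, p)
    pis = np.array([(y == k).mean() for k in classes])
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, mus):
        R = X[y == k] - mu
        Sigma += R.T @ R
    Sigma /= (n - K)
    return classes, mus, Sigma, pis

def lda_scores(x, mus, Sigma, pis):
    """delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k for all k."""
    A = np.linalg.solve(Sigma, mus.T)            # columns are S^-1 mu_k
    return x @ A - 0.5 * np.sum(mus.T * A, axis=0) + np.log(pis)

# Simulated data: two Gaussian classes sharing an identity covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)

classes, mus, Sigma, pis = fit_lda(X, y)
x_new = np.array([3.0, 3.0])
print(classes[np.argmax(lda_scores(x_new, mus, Sigma, pis))])
```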

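**Sensitivity** is the fraction of truly Yes observations that are predicted Yes; **specificity** is the fraction of truly No observations that are predicted No.

The *ROC curve* shows the tradeoff between the two types of errors as the classification threshold varies.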
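
A small sketch of how the threshold moves sensitivity and specificity in opposite directions, assuming NumPy. The labels and predicted probabilities are made up for illustration.

```python
import numpy as np

def sensitivity_specificity(y_true, prob_yes, threshold=0.5):
    """Return (sensitivity, specificity) when predicting Yes if prob_yes >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(prob_yes, dtype=float) >= threshold
    tp = np.sum(y_pred & y_true)        # predicted Yes, truly Yes
    tn = np.sum(~y_pred & ~y_true)      # predicted No, truly No
    return tp / y_true.sum(), tn / (~y_true).sum()

# Made-up labels (1 = Yes) and predicted probabilities of Yes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
prob_yes = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.35, 0.05]

for t in (0.5, 0.25):
    sens, spec = sensitivity_specificity(y_true, prob_yes, threshold=t)
    print(t, sens, spec)
```

Lowering the threshold from 0.5 to 0.25 raises sensitivity and lowers specificity on this toy data; sweeping the threshold over all values gives the points of the ROC curve.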


### 3. Quadratic Discriminant Analysis
***Assumption*** Each class has its own covariance matrix: $X \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ for an observation from the $k$th class.

$$
\begin{aligned}
\delta_k(x) 
&= -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \log |\Sigma_k| + \log \pi_k \\
&= -\frac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k 
   - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k 
   - \frac{1}{2} \log |\Sigma_k| + \log \pi_k
\end{aligned}
$$
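
A sketch that evaluates the QDA discriminant in both forms and confirms they agree, assuming NumPy. The class parameters are made-up values.

```python
import numpy as np

def qda_score(x, mu, Sigma, pi):
    """Compact form: -0.5 (x-mu)^T S^-1 (x-mu) - 0.5 log|S| + log pi."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * d @ np.linalg.solve(Sigma, d) - 0.5 * logdet + np.log(pi)

def qda_score_expanded(x, mu, Sigma, pi):
    """Expanded form, which makes the quadratic term in x explicit."""
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * x @ Sinv @ x + x @ Sinv @ mu
            - 0.5 * mu @ Sinv @ mu - 0.5 * logdet + np.log(pi))

# Hypothetical parameters (mu_k, Sigma_k, pi_k) for two classes.
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]]), 0.6),
    (np.array([2.0, 1.0]), np.array([[0.5, -0.1], [-0.1, 0.8]]), 0.4),
]
x = np.array([1.0, -0.5])

compact = np.array([qda_score(x, *prm) for prm in params])
expanded = np.array([qda_score_expanded(x, *prm) for prm in params])
print(compact, expanded, int(np.argmax(compact)))
```

Because $\Sigma_k$ differs by class, the $x^T \Sigma_k^{-1} x$ term does not cancel when comparing classes, which is what makes the decision boundary quadratic.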

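LDA or QDA? Think about the *bias-variance tradeoff*: LDA is the less flexible of the two, so it has lower variance, but it can suffer from high bias when the shared-covariance assumption is badly off.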

### 4. Naive Bayes
***Assumption*** Within the $k$th class, the $p$ predictors are independent: $f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$, where $f_{kj}$ is the density function of the $j$th predictor among observations in the $k$th class.

$$
\Pr(Y = k \mid X = x) = \frac{\pi_k \times f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1) \times f_{l2}(x_2) \times \cdots \times f_{lp}(x_p)}
$$

How to estimate the one-dimensional density function $f_{kj}$? Several options:

- If $X_j$ is quantitative, assume that $X_j \mid Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$. Within each class, the $j$th predictor is drawn from a (univariate) normal distribution.
- If $X_j$ is quantitative, use non-parametric methods to estimate $f_{kj}$.
- If $X_j$ is qualitative, then we can simply count the proportion of training observations for the jth predictor corresponding to each class.
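
The first option (a univariate normal for each quantitative predictor within each class) can be sketched as follows, assuming NumPy. The training data are made up, and the per-class variances use NumPy's default divisor $n_k$, which is a choice made for this illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class priors plus per-class, per-predictor means and variances."""
    classes = np.unique(y)
    pis = np.array([(y == k).mean() for k in classes])
    mus = np.array([X[y == k].mean(axis=0) for k in classes])    # shape (K, p)
    vars_ = np.array([X[y == k].var(axis=0) for k in classes])   # shape (K, p)
    return classes, pis, mus, vars_

def nb_posterior(x, pis, mus, vars_):
    """Pr(Y = k | X = x) using the product of univariate normal densities."""
    log_f = -0.5 * np.log(2 * np.pi * vars_) - 0.5 * (x - mus) ** 2 / vars_
    log_joint = np.log(pis) + log_f.sum(axis=1)   # log of pi_k * prod_j f_kj(x_j)
    log_joint = log_joint - log_joint.max()       # shift to avoid underflow
    w = np.exp(log_joint)
    return w / w.sum()

# Made-up training data: two quantitative predictors, two classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])

classes, pis, mus, vars_ = fit_gaussian_nb(X, y)
post = nb_posterior(np.array([1.1, 1.9]), pis, mus, vars_)
print(post, classes[np.argmax(post)])
```

Working with log densities and normalizing at the end keeps the product of $p$ small density values from underflowing.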

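## Generalized Linear Model
$$
\eta(E(Y \mid X_1, \dots, X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

Here $\eta$ is the *link function*, which transforms the mean of the response so that it is linear in the predictors.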
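
Logistic regression is one instance of this template. For a binary response, $E(Y \mid X) = \Pr(Y = 1 \mid X) = p(X)$, and solving the logistic function from the start of the chapter (written here with $p$ predictors) for the linear predictor shows that $\eta$ is the logit:

$$
\eta(\mu) = \log\left(\frac{\mu}{1-\mu}\right), \qquad \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

Linear regression corresponds to the identity choice $\eta(\mu) = \mu$.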


