# Methods of Classification on High-Dimensional Data

<hr>

**Discriminant Analysis** (Linear / Quadratic)<br>

1. Quadratic Discriminant Analysis
    
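    Assign $x$ to the class $c$ that maximizes the posterior $P(C = c | X = x)$, assuming a Gaussian with its own mean $\mu_c$ and covariance $\Sigma_c$ for each class. Up to a log-determinant term and a prior term, this means choosing the class with the smallest Mahalanobis distance from $x$ to $\mu_c$.
    
    For two classes, the decision boundary is the set where $P(C = 0 | X = x) = P(C = 1 | X = x)$, which is quadratic in $x$.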

    
2. Linear Discriminant Analysis

    Assumes the same covariance matrix for all classes, and therefore the decision boundaries are linear.
    
    If the covariance matrix is further assumed to be $\Sigma = \sigma^2 \cdot I$, the covariance is spherical, so (with equal class priors) an observation is assigned to the class whose mean is nearest in Euclidean distance. This is the nearest-centroid rule, the same assignment step that _K Means_ uses in the unsupervised setting; LDA itself remains supervised.
    

3. Reduced-rank LDA, aka Fisher's LDA

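    Idea: the class means $\mu_1, \dots, \mu_K \in R^p$ lie in an affine subspace of dimension at most $K - 1$ (usually $p \gg K$). Combine LDA with PCA, i.e. perform PCA on the class means after sphering the data with the common within-class covariance.
    
    The maximum number of linear discriminants is $\min(p, K-1)$.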
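
The Gaussian discriminant rules in the list above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names `fit_gaussians` and `discriminant_scores` and the synthetic two-class data are made up for the example. Each class score is $\log \pi_c - \frac{1}{2} \log |\Sigma_c| - \frac{1}{2} (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)$; using per-class covariances gives QDA, and replacing them with one pooled covariance gives LDA.

```python
import numpy as np

def fit_gaussians(X, y):
    """Estimate class priors, means, and covariances (hypothetical helper)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    covs = [np.cov(X[y == c], rowvar=False) for c in classes]
    return classes, priors, means, covs

def discriminant_scores(X, priors, means, covs):
    """Log-posterior of each class up to an additive constant, one column per class."""
    scores = []
    for prior, mu, cov in zip(priors, means, covs):
        diff = X - mu
        # Squared Mahalanobis distance of every row of X to the class mean.
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(np.log(prior) - 0.5 * logdet - 0.5 * maha)
    return np.column_stack(scores)

# Synthetic data (an assumption for illustration): two Gaussian classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

classes, priors, means, covs = fit_gaussians(X, y)

# QDA: each class keeps its own covariance matrix.
qda_pred = classes[discriminant_scores(X, priors, means, covs).argmax(axis=1)]

# LDA: all classes share one pooled within-class covariance matrix.
pooled = sum((np.sum(y == c) - 1) * cov for c, cov in zip(classes, covs))
pooled = pooled / (len(y) - len(classes))
lda_scores = discriminant_scores(X, priors, means, [pooled] * len(classes))
lda_pred = classes[lda_scores.argmax(axis=1)]

print("QDA accuracy:", (qda_pred == y).mean())
print("LDA accuracy:", (lda_pred == y).mean())
```

With a shared covariance the log-determinant term is identical for every class, so it cancels when scores are compared and the boundary becomes linear.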

<hr>

**Logistic Regression**

$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta^T X$, where $p = P(C = 1 | X)$ (Solve for $\beta_0$, $\beta$ with MLE)

$\therefore p = \frac{\exp(\beta_0 + \beta^T X)}{1 + \exp(\beta_0 + \beta^T X)}$

Choose $C = 1$ if $p > 0.5$

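Unlike discriminant analysis, logistic regression does not assume a Gaussian (or any other) distribution for $X$ within each class; it only assumes that the log-odds are linear in $X$.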
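
As a rough sketch of the MLE step, the log-likelihood can be maximized by plain gradient ascent. The function names `fit_logistic` and `predict_logistic`, the learning rate, the iteration count, and the synthetic data are all assumptions chosen for illustration; practical implementations typically use Newton-type solvers and regularization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient ascent on the mean Bernoulli log-likelihood (hypothetical helper)."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = sigmoid(beta0 + X @ beta)   # current estimate of P(C = 1 | x)
        resid = y - prob                   # gradient with respect to the linear predictor
        beta0 += lr * resid.mean()
        beta += lr * (X.T @ resid) / n
    return beta0, beta

def predict_logistic(X, beta0, beta):
    """Choose C = 1 when the fitted probability exceeds 0.5."""
    return (sigmoid(beta0 + X @ beta) > 0.5).astype(int)

# Synthetic, overlapping two-class data (an assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

beta0, beta = fit_logistic(X, y)
accuracy = (predict_logistic(X, beta0, beta) == y).mean()
print("beta0:", beta0, "beta:", beta, "accuracy:", accuracy)
```

The classes overlap on purpose: on perfectly separable data the unregularized MLE does not exist, because the likelihood keeps increasing as the coefficients grow.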

<hr>

**Support Vector Machines (SVM)**<br>

Find the hyperplane that maximizes the margin between binary classes. 

Given training data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in R^p$, $y_i \in \{-1, 1\}$, determine a hyperplane $w^T x - b = 0$ that maximizes the distance to the nearest point $x_i$ from each group.

If **perfect classification** is possible:

Maximize the margin $\frac{2}{\lVert w \rVert_2}$, or equivalently, minimize $\lVert w \rVert_2$ such that $y_i (w^T x_i - b) \geq 1$ for all $i$.

If **imperfect classification** (usually the case):

Maximize the margin AND minimize the total margin violation: minimize $\lVert w \rVert_2 + \lambda \sum_{i=1}^{n} \epsilon_i$, where $\epsilon_i \geq 0$ is the slack (classification error) of observation $i$, such that $y_i (w^T x_i - b) \geq 1 - \epsilon_i$ for all $i$.

SVM does not model a probability distribution for the data, and therefore does not assume the shape of the class distributions. It needs labeled data, and the formulation above handles binary classes only.
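
The soft-margin problem can be sketched with subgradient descent on the equivalent hinge-loss objective. Note the assumptions: this sketch minimizes $\frac{1}{2} \lVert w \rVert_2^2 + \lambda \sum_i \max(0, 1 - y_i (w^T x_i - b))$, which uses the squared norm (a common variant of the objective above), and the name `fit_linear_svm`, the step size, and the synthetic data are illustrative choices. Dedicated SVM solvers typically work on the dual problem instead.

```python
import numpy as np

def fit_linear_svm(X, y, lam=1.0, lr=0.001, n_iter=2000):
    """Subgradient descent on 0.5*||w||^2 + lam * sum of hinge losses (hypothetical helper)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w - b)
        viol = margins < 1                 # points with positive slack epsilon_i
        grad_w = w - lam * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = lam * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_svm(X, w, b):
    """Classify by the side of the hyperplane w^T x - b = 0."""
    return np.where(X @ w - b >= 0, 1.0, -1.0)

# Synthetic, well-separated binary data with labels in {-1, 1} (an assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

w, b = fit_linear_svm(X, y)
accuracy = (predict_svm(X, w, b) == y).mean()
print("w:", w, "b:", b, "accuracy:", accuracy)
```

The hinge term $\max(0, 1 - y_i (w^T x_i - b))$ is exactly the smallest slack $\epsilon_i$ that satisfies the constraint for observation $i$, which is why the constrained and unconstrained forms agree.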

# Basic code
A `minimal, reproducible example`