# Basic Regression

<hr>

## Simple Linear Regression

Linear regression looks for a linear relationship between one or more predictors $x_1, \dots, x_m$ and a response $y$.

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$

$= \beta_0 + \sum_{j=1}^{m} \beta_j x_j$

The total error can be expressed as a sum of squared errors (SSE):

$\sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta \cdot x_i))^2$

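where $x_i$ is the vector of predictor values for observation $i$, $y_i$ is its observed response, $\hat y_i$ is the model's estimate, and $\beta = (\beta_1, \dots, \beta_m)$.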

The best-fit regression line is the one whose coefficients $\beta_0$ and $\beta$ minimize the sum of squared errors. The SSE is a convex (concave-up) quadratic function of the coefficients, so its minimum can be found by setting its derivative to zero.

To find the value of each $\beta_j$ at which the SSE is at its minimum, we set the partial derivative with respect to each $\beta_j$ to zero and solve the resulting simultaneous equations.
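
For the single-predictor case, solving those simultaneous equations gives a closed form. The sketch below is one possible illustration, not a reference implementation: the function names `fit_simple_linear_regression` and `sse` and the five data points are made up for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0, beta1) minimizing the SSE for y = beta0 + beta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Setting dSSE/dbeta0 = 0 and dSSE/dbeta1 = 0 and solving the two
    # equations yields beta1 = S_xy / S_xx and beta0 = y_mean - beta1 * x_mean.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def sse(xs, ys, beta0, beta1):
    """Sum of squared errors of the line beta0 + beta1 * x on the data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_linear_regression(xs, ys)
best_sse = sse(xs, ys, beta0, beta1)
print(f"beta0={beta0:.2f}, beta1={beta1:.2f}, SSE={best_sse:.3f}")
```

Nudging either coefficient away from the fitted values increases the SSE, which is what the convexity argument predicts.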

****

## Measuring the quality of a model's fit

**Maximum Likelihood Estimation (MLE)**

The most basic measure of a model's quality is **likelihood**, i.e. 
given an estimated set of model parameters, $\theta$, what is the likelihood of witnessing these observations under this set of parameters?

$P(X | \theta)$ is the likelihood function, where $X$ is the observed data and $\theta$ is the set of model parameters. Searching over all candidate parameter sets, we select the one that gives the maximum likelihood.

*Example*

Suppose we have a model that produces estimates $y_1, \dots, y_n$ for observations $z_1, \dots, z_n$, and assume the errors $\epsilon_i = z_i - y_i$ are i.i.d. and normally distributed with an expectation of zero, such that $\epsilon_i \sim N(0,\sigma^2)$.

Then, the probability density or likelihood function for a single observation will be:

$P(X_i | \theta) = \frac{1}{\sigma \sqrt{2\pi}} \cdot e^{-\frac{(z_i-y_i)^2}{2 \sigma^2}}$

The joint density of all observations is then a product of the above:

$P(X | \theta) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \cdot e^{-\frac{(z_i-y_i)^2}{2 \sigma^2}}$

$= (\frac{1}{\sigma \sqrt{2\pi}})^n \cdot e^{-\frac{1}{2 \sigma^2}\sum_{i=1}^{n} (z_i-y_i)^2}$

$\theta$ is picked to maximize the likelihood. For a fixed $\sigma$, the likelihood shrinks as the sum of squared errors grows, so maximizing the likelihood is equivalent to minimizing the sum of squared errors:

$\max (\frac{1}{\sigma \sqrt{2\pi}})^n \cdot e^{-\frac{1}{2 \sigma^2}\sum_{i=1}^{n} (z_i-y_i)^2} \to \min \sum_{i=1}^{n} (z_i - y_i)^2$

$\therefore$ Minimizing the sum of squared errors ensures that maximum likelihood is achieved.
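
The equivalence can be checked numerically. The sketch below assumes a fixed, known $\sigma$ and a hypothetical one-parameter model $y_i = b \cdot x_i$; the helper names, the data, and the grid of candidate slopes are all assumptions made for illustration. It works with the log of the joint density above, $-n \ln(\sigma \sqrt{2\pi}) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (z_i - y_i)^2$, which has the same maximizer as the likelihood itself.

```python
import math


def gaussian_log_likelihood(observations, estimates, sigma):
    """Log of the joint normal density of the errors z_i - y_i."""
    n = len(observations)
    sq_err = sum((z - y) ** 2 for z, y in zip(observations, estimates))
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sq_err / (2 * sigma ** 2)


def sum_squared_errors(observations, estimates):
    return sum((z - y) ** 2 for z, y in zip(observations, estimates))


# Hypothetical data and a hypothetical model y = slope * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
zs = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 1.0  # assumed known and held fixed


def estimates_for(slope):
    return [slope * x for x in xs]


# Candidate parameter values to search over: 1.0, 1.1, ..., 3.0.
candidate_slopes = [1.0 + 0.1 * i for i in range(21)]

best_by_likelihood = max(
    candidate_slopes,
    key=lambda b: gaussian_log_likelihood(zs, estimates_for(b), sigma),
)
best_by_sse = min(
    candidate_slopes,
    key=lambda b: sum_squared_errors(zs, estimates_for(b)),
)
print(best_by_likelihood, best_by_sse)
```

Both searches pick the same candidate, since the log-likelihood is a constant minus a positive multiple of the SSE.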

Likelihood can also be used to compare two different models by using the likelihood ratio test and applying a hypothesis test.

****

**Akaike Information Criterion (AIC)**

AIC balances a model's maximized likelihood ($L^*$) against its simplicity ($k$, the number of parameters).

$\text{AIC} = 2k - 2\ln(L^*)$

where the penalty term, $2k$, helps to avoid overfitting by penalizing each additional parameter.

Models with a smaller AIC are preferred, so the criterion encourages fewer parameters $k$ while maximizing likelihood.

The classical AIC has its nice properties in the limit of infinitely many data points, so it works best when the number of data points $n$ is large relative to $k$. A corrected version of AIC can be applied to smaller data sets:

$AIC_c = AIC + \frac{2k(k+1)}{n-k-1} = 2k - 2\ln(L^*) + \frac{2k(k+1)}{n-k-1}$
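
As a quick arithmetic check of the two formulas, here is a minimal sketch. The function names and the values of `k`, `n`, and the maximized log-likelihood `log_l` are hypothetical, picked only to make the numbers easy to follow.

```python
def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L*), where log_likelihood is ln(L*)."""
    return 2 * k - 2 * log_likelihood


def aic_corrected(k, log_likelihood, n):
    """AICc = AIC + 2k(k+1) / (n - k - 1); needs n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(k, log_likelihood) + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical model: 3 parameters, 20 data points, ln(L*) = -35.
k, n, log_l = 3, 20, -35.0
aic_value = aic(k, log_l)              # 2*3 + 70 = 76
aicc_value = aic_corrected(k, log_l, n)  # 76 + 24/16 = 77.5
print(aic_value, aicc_value)
```

The correction term shrinks toward zero as $n$ grows, so the two criteria agree for large data sets.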

We can also compare the AIC scores of two different models and compute their relative likelihood, as in the following example:

- Model 1 with $AIC = 75$ vs Model 2 with $AIC = 80$
- The relative likelihood is characterized as:

    $e^{\frac{(AIC_1 - AIC_2)}{2}} = e^{\frac{(75 - 80)}{2}}  \approx 8.2\%$
    
    $\therefore$ Model 2 is only about 8.2% as likely as Model 1 to be the better model, so the first model is probably better.
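
The same calculation in code, as a small sketch (the function name `relative_likelihood` is an assumption for this example):

```python
import math


def relative_likelihood(aic_a, aic_b):
    """Relative likelihood of model b versus model a: exp((AIC_a - AIC_b) / 2)."""
    return math.exp((aic_a - aic_b) / 2)


rel = relative_likelihood(75, 80)
print(f"{rel:.1%}")  # prints 8.2%
```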
    
    
A similar criterion, the Bayesian Information Criterion (BIC), is a harsher version of AIC and has a stronger penalty term. Typically BIC is only used when there are more data points than parameters.

$AIC = 2k - 2\ln(L^*)$

$BIC = k \ln (n) - 2\ln(L^*)$

where $k \ln (n) > 2k$ whenever $n \geq 8$ (since $\ln(n) > 2$ once $n > e^2 \approx 7.39$), and therefore BIC generally encourages simpler models than AIC does.
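
To see where the two penalty terms cross over, here is a short sketch. The function names and the example values of `k`, `n`, and the log-likelihood are hypothetical.

```python
import math


def aic_penalty(k):
    return 2 * k


def bic_penalty(k, n):
    return k * math.log(n)


def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L*), where log_likelihood is ln(L*)."""
    return bic_penalty(k, n) - 2 * log_likelihood


k = 3
for n in (5, 7, 8, 100):
    harsher = bic_penalty(k, n) > aic_penalty(k)
    print(f"n={n}: BIC penalty {bic_penalty(k, n):.2f}, "
          f"AIC penalty {aic_penalty(k)}, BIC harsher: {harsher}")

# Hypothetical model: 3 parameters, 20 data points, ln(L*) = -35.
bic_value = bic(3, 20, -35.0)
```

For very small samples ($n \leq 7$) the BIC penalty is actually the lighter of the two; beyond that it is always heavier and keeps growing with $n$.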

****

## Generalized Linear Models

The relationship between two variables may not always be linear, but we can often transform the data so that the fit becomes linear.

<img alt="Non-linear Relationships" src="assets/non_linear_relationship.png" width="500" >

Some possible transformations:

- Quadratic regression: $y = a_0 + a_1 x_1 + a_2 x_1^2$
- Response transform: $\log(y) = a_0 + a_1 x_1 + \dots + a_m x_m$
- Box-Cox transformations
- Variable interactions: $y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2$
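
As one illustration of the response transform, the sketch below generates hypothetical data from $y = e^{1 + 0.5x}$, which is non-linear in $x$, and fits an ordinary least-squares line to $\log(y)$ instead. The helper `fit_line` and the synthetic data are assumptions made for this example; real data would include noise, so the recovered coefficients would only be approximate.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    a1 = s_xy / s_xx
    a0 = y_mean - a1 * x_mean
    return a0, a1


# Synthetic, noise-free data from an exponential relationship.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(1.0 + 0.5 * x) for x in xs]

# Transform the response so the relationship becomes linear in x.
log_ys = [math.log(y) for y in ys]
a0, a1 = fit_line(xs, log_ys)
print(f"log(y) = {a0:.3f} + {a1:.3f} * x")

# Predictions on the original scale undo the transform.
predicted_y = [math.exp(a0 + a1 * x) for x in xs]
```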

****

**Box-Cox Transformation**

****

**De-trending data**

****

## Principal Components Analysis (PCA)

****

**Interpreting regression coefficients in PCA**

****

**Eigenvalues and eigenvectors**

<hr>

# Basic code
A `minimal, reproducible example`