# Regression

Regression models predict unknown values from known ones. They do this by estimating a statistical relationship between a dependent (target) variable and one or more independent variables (features).

## Evaluating Regression

\begin{equation*}
\begin{split}
\text{Mean Squared Error (MSE)} &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2 \\
\text{Root Mean Squared Error (RMSE)} &= \sqrt{\text{MSE}} \\         
\text{Mean Absolute Error (MAE)} &= \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y_i}| \\
\text{Residual Sum of Squares (RSS)} &= \sum (y_i - \hat{y_i})^2 \\ 
\text{Total Sum of Squares (SST)} &= \sum (y_i - \bar{y})^2 \\        
\text{Regression Sum of Squares (SSR)} &= \sum (\hat{y_i} - \bar{y})^2 \\
\text{Sum of Squared Errors (SSE)} &= \sum (y_i - \hat{y_i})^2 = \text{RSS} \\
\text{R-Squared (}R^2) &= \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}
\end{split}
\label{eq:1} \tag{1}
\end{equation*}                           
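As a rough illustration, the metrics above can be computed directly from paired observed and predicted values. The sketch below uses only the Python standard library; the function name `regression_metrics` and the sample data are made up for this example. Note that $R^2$ is computed here as $1 - \text{RSS}/\text{SST}$, because the identity $\text{SST} = \text{SSR} + \text{SSE}$ (and therefore $\text{SSR}/\text{SST} = 1 - \text{SSE}/\text{SST}$) is only guaranteed for a least-squares fit that includes an intercept.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute common regression error metrics for paired observations."""
    n = len(y_true)
    y_bar = sum(y_true) / n  # mean of the observed values

    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    sst = sum((y - y_bar) ** 2 for y in y_true)
    ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_pred)
    mse = rss / n

    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n,
        "RSS": rss,
        "SST": sst,
        "SSR": ssr,
        "R2": 1 - rss / sst,
    }


# Made-up observations and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because these predictions were not produced by a least-squares fit, SSR/SST and $1 - \text{RSS}/\text{SST}$ differ for this data.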

## Simple Linear Regression

Commonly called ordinary least squares (OLS) regression, after the method used to fit it. When two variables $X$ (independent) and $Y$ (dependent) are linearly related, their relationship can be written in the form of a regression line:

\begin{equation*}
\hat{Y} = a + BX
\label{eq:2} \tag{2}
\end{equation*}

where $a$ (the intercept) and $B$ (the slope) are constants. The slope $B$ is defined as:

\begin{equation*}
B = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \rho \frac{\sigma_y}{\sigma_x} 
\label{eq:3} \tag{3}
\end{equation*}

where $\sigma_x$ and $\sigma_y$ are the standard deviations of $X$ and $Y$, and $\rho$, the Pearson correlation coefficient, is equal to:

\begin{equation*}
\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}
\label{eq:4} \tag{4}
\end{equation*}

The second constant, $a$, is defined as:

\begin{equation*}
a = \frac{\sum y_i}{n} - B \frac{\sum x_i}{n} = \bar{y} - B \bar{x}
\label{eq:5} \tag{5}
\end{equation*}
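The closed-form expressions for $B$, $\rho$, and $a$ translate almost line for line into code. The following is a minimal sketch using only the standard library; the function name `fit_simple_linear_regression` and the data are invented for illustration. It also checks numerically that the two forms of the slope (the ratio of sums and $\rho \, \sigma_y / \sigma_x$) agree.

```python
from math import sqrt
from statistics import stdev


def fit_simple_linear_regression(x, y):
    """Return intercept a, slope B, and Pearson correlation rho."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    B = s_xy / s_xx                 # slope as a ratio of sums
    a = y_bar - B * x_bar           # intercept from the means
    rho = s_xy / sqrt(s_xx * s_yy)  # Pearson correlation coefficient
    return a, B, rho


# Made-up data that is roughly linear, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.1, 7.9, 10.2]
a, B, rho = fit_simple_linear_regression(x, y)

# Second form of the slope: rho * sigma_y / sigma_x.
B_alt = rho * stdev(y) / stdev(x)
print(a, B, rho, B_alt)
```

The ratio $\sigma_y / \sigma_x$ is the same whether sample or population standard deviations are used, so either works in the second form.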

### Derivation

Let's first revisit what we know. We're given $n$ inputs and outputs $(x_i, y_i)$, so we can define our line of best fit as:

\begin{equation*}
\hat{y_i} = a + B x_i
\label{eq:6} \tag{6}
\end{equation*}

such that it minimizes the following cost function (RSS):

\begin{equation*}
\begin{split}
S &= \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \\
&= \sum_{i=1}^{n} (y_i - a - B x_i)^2
\end{split}
\label{eq:7} \tag{7}
\end{equation*}

Now, we want to solve for constants $a$ and $B$. To do this, we minimize $S$ by setting its partial derivatives with respect to $a$ and $B$ to zero. Let's start with $a$:

\begin{equation*}
\begin{split}
&\frac{\partial S}{\partial a} = \frac{\partial}{\partial a} \sum_{i=1}^{n} (y_i - a - B x_i)^2 = 0 \\
&\Rightarrow -2 \sum_{i=1}^{n} (y_i - a - B x_i) = 0
\end{split} 
\label{eq:8} \tag{8}
\end{equation*}

Dividing by $-2$, expanding the sum, and using $\sum_{i=1}^{n} a = na$, we get:

\begin{equation*}
\begin{split}
&\Rightarrow \sum_{i=1}^{n} y_i - na - B \sum_{i=1}^{n} x_i = 0 \\
&\Rightarrow a = \frac{\sum_{i=1}^{n} y_i - B \sum_{i=1}^{n} x_i}{n} \\
&\Rightarrow \boxed{a = \bar{y} - B \bar{x}}
\end{split} 
\label{eq:9} \tag{9}
\end{equation*}

Now solving for $B$, we have:

\begin{equation*}
\begin{split}
&\frac{\partial S}{\partial B} = \frac{\partial}{\partial B} \sum_{i=1}^{n} (y_i - a - B x_i)^2 = 0 \\
&\Rightarrow -2 \sum_{i=1}^{n} x_i (y_i - a - B x_i) = 0
\end{split}
\label{eq:10} \tag{10}
\end{equation*}

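Substituting $a = \bar{y} - B\bar{x}$ from Equation 9 and dividing by 2 gives the result below. The last step uses $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, so subtracting $\bar{x}\sum(y_i - \bar{y})$ from the numerator and $\bar{x}\sum(x_i - \bar{x})$ from the denominator changes neither: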

\begin{equation*}
\begin{split}
&\sum_{i=1}^{n} (\bar{y} x_i - x_i y_i - B \bar{x} x_i + B x_i^2) = 0 \\
\Rightarrow B &= \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} \\
&= \boxed{\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\end{split}
\label{eq:11} \tag{11}
\end{equation*}
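One way to sanity-check the derivation is to confirm numerically that both partial derivatives of $S$ vanish at the closed-form solution, and that nearby values of $a$ and $B$ give a larger $S$. The sketch below does this with made-up data; the helper names `rss` and `rss_gradient` are hypothetical.

```python
def rss(a, B, x, y):
    """Cost function S: the residual sum of squares for the line a + B*x."""
    return sum((yi - a - B * xi) ** 2 for xi, yi in zip(x, y))


def rss_gradient(a, B, x, y):
    """Partial derivatives of S with respect to a and B."""
    dS_da = -2 * sum(yi - a - B * xi for xi, yi in zip(x, y))
    dS_dB = -2 * sum(xi * (yi - a - B * xi) for xi, yi in zip(x, y))
    return dS_da, dS_dB


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.1, 7.9, 10.2]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Closed-form solution from the boxed results.
B = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - B * x_bar

grad = rss_gradient(a, B, x, y)  # both entries should be ~0
s_min = rss(a, B, x, y)

# Perturbing either constant should only increase the cost.
nearby = [rss(a + da, B + dB, x, y) for da in (-0.1, 0.1) for dB in (-0.1, 0.1)]
print(grad, s_min, min(nearby))
```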

## Multiple Linear Regression

Multiple linear regression extends simple linear regression from one independent variable to several. When $Y$ depends linearly on $m$ variables $x_1, x_2, \dots, x_m$, the model is:

\begin{equation*}
\hat{Y} = a + b_1 x_1 + b_2 x_2 + \dots + b_m x_m
\label{eq:12} \tag{12}
\end{equation*}
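The text does not say how the coefficients are estimated. One common approach, assumed here, is to minimize the residual sum of squares by solving the normal equations $(X^\top X)\,\beta = X^\top y$, where $X$ has a leading column of ones for the intercept. The sketch below does this in plain Python with a small Gaussian elimination routine; the function names and data are made up, and a numerical library would normally be used instead.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_multiple_linear_regression(X, y):
    """Return [a, b_1, ..., b_m] that minimize RSS via the normal equations."""
    rows = [[1.0] + list(row) for row in X]  # leading 1 models the intercept a
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)


# Made-up data generated exactly from y = 1 + 2*x_1 + 3*x_2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [9.0, 8.0, 19.0, 18.0, 29.0]
coefficients = fit_multiple_linear_regression(X, y)
print(coefficients)  # approximately [1, 2, 3], up to floating-point error
```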

## LASSO Regression

Linear regression models can overfit. LASSO (least absolute shrinkage and selection operator) regression, a form of **L1 regularization**, counters this by adding an L1 penalty proportional to the sum of the absolute values of the coefficients. The penalty can shrink some coefficients to exactly zero, which removes those features from the model, so LASSO also acts as a feature selection method. It works particularly well when only a small number of parameters are significant and the others have coefficients close to zero.
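To make the zeroing behaviour concrete, here is a minimal sketch of LASSO fitted by cyclic coordinate descent with soft-thresholding. Several things here are assumptions rather than part of the text above: the objective is scaled as $\frac{1}{2n}\text{RSS} + \lambda \sum_j |b_j|$ (one common convention), the intercept is left unpenalized by centering the data, and the function names, iteration count, and data are made up.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; return exactly 0.0 when |rho| <= lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * RSS + lam * sum(|b_j|) by cyclic coordinate descent."""
    n, m = len(X), len(X[0])
    # Center the data so the (unpenalized) intercept can be recovered afterwards.
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(m)] for row in X]
    yc = [yi - y_mean for yi in y]

    b = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(Xc[i][k] * b[k] for k in range(m) if k != j)
                rho += Xc[i][j] * (yc[i] - pred_without_j)
                z += Xc[i][j] ** 2
            b[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    a = y_mean - sum(b[j] * x_mean[j] for j in range(m))
    return a, b


# Made-up data: y = 3*x_1 + 0.2*x_2, so the second feature matters very little.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-3.2, -2.8, 2.8, 3.2]

a, b = lasso_coordinate_descent(X, y, lam=0.5)
print(a, b)  # the weak second coefficient is driven to exactly 0.0
```

With `lam=0.0` the same routine reduces to an ordinary least-squares fit, which makes it easy to compare the penalized and unpenalized coefficients.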

## Ridge Regression

Also known as **L2 regularization**. It's similar to LASSO regression except that, instead of eliminating features entirely, it shrinks their coefficients toward zero by adding an L2 penalty proportional to the sum of the squared coefficients. Since it's not getting rid of any features, only reducing their effect on the model, ridge regression isn't considered part of feature selection. It works particularly well when there are many significant parameters of similar magnitude.
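For a single feature, the ridge solution has a simple closed form, which makes the shrinkage easy to see. The sketch below assumes the objective $\sum (y_i - a - B x_i)^2 + \lambda B^2$ with an unpenalized intercept; setting its derivatives to zero gives $B = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2 + \lambda}$ and $a = \bar{y} - B\bar{x}$. The function name `fit_ridge_1d` and the data are made up for illustration.

```python
def fit_ridge_1d(x, y, lam):
    """Minimize sum((y_i - a - B*x_i)^2) + lam * B^2 for a single feature."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)

    B = s_xy / (s_xx + lam)  # lam in the denominator shrinks the slope
    a = y_bar - B * x_bar    # intercept is not penalized (an assumption)
    return a, B


# Made-up data lying exactly on y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

for lam in (0.0, 10.0, 1000.0):
    a, B = fit_ridge_1d(x, y, lam)
    print(lam, a, B)
```

As `lam` grows the slope moves toward zero but never reaches it, in contrast to the exact zeros an L1 penalty can produce.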

## Stepwise Regression

Stepwise regression builds a model incrementally. In its forward form, it starts with the simplest model and evaluates its performance, then adds one variable at a time, comparing each new model with the previous one, until adding variables no longer improves performance.
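The procedure can be sketched as a forward selection loop. The text does not specify how performance is measured or when to stop, so this example makes several assumptions: it scores each model by adjusted $R^2$, starts from an intercept-only model (adjusted $R^2 = 0$), fits by solving the normal equations, and stops when the best remaining variable improves the score by no more than a hypothetical `min_gain` threshold. All names and data are illustrative.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def adjusted_r2(X, y, features):
    """Fit least squares on the chosen feature columns; return adjusted R^2."""
    n, p = len(y), len(features)
    rows = [[1.0] + [row[j] for j in features] for row in X]
    k = p + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(XtX, Xty)
    y_bar = sum(y) / n
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(rows, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - (rss / (n - p - 1)) / (sst / (n - 1))


def forward_stepwise(X, y, min_gain=1e-6):
    """Greedily add the feature that most improves adjusted R^2."""
    selected = []
    best_score = 0.0  # adjusted R^2 of the intercept-only model
    remaining = list(range(len(X[0])))
    while remaining:
        scores = {j: adjusted_r2(X, y, selected + [j]) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score <= min_gain:
            break  # no candidate improves the model enough
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return selected, best_score


# Made-up data: y = 2 + 3*x_0 + 5*x_2; column 1 is an irrelevant feature.
X = [[1, 2, 1], [2, 1, 0], [3, 4, 0], [4, 3, 1],
     [5, 6, 1], [6, 5, 0], [7, 8, 1], [8, 7, 0]]
y = [10, 8, 11, 19, 22, 20, 28, 26]
selected, score = forward_stepwise(X, y)
print(selected, score)
```

Adjusted $R^2$ is used rather than plain $R^2$ because plain $R^2$ never decreases when a variable is added, so it cannot signal when to stop. A held-out validation error would be another reasonable choice.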