# Regression

Regression models predict unknown values from known ones. They do this by estimating a statistical relationship between an unknown (dependent) variable and known (independent) features.

## Multiple Linear Regression

Let $Y$ be the response (the dependent variable), $X$ be a vector of covariates, $\beta$ be a vector of parameters, and $\epsilon$ be the error term. For each observed $Y$, we assume:

\begin{equation*}
Y = X \beta + \epsilon
\label{eq:1} \tag{1}
\end{equation*}

### Assumptions

* **Linearity:** Assume the relationship between $X$ and the mean of $Y$ is linear. If we're not able to assume this, we can build a regression model using non-linear transformations of $X$, such as:

\begin{equation*}
Y = \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 \log(X_1)
\end{equation*}

* **Homoscedasticity:** Assume the variance of the residual is the same for any value of $X$. In other words, all observations of $Y$ are assumed to have the same, constant variance. If this assumption doesn't hold, we can assume a constant coefficient of variation instead of a constant variance, or apply a variance-stabilizing transformation to the data.

* **Independence:** Assume observations of $Y$ are independent of each other (or uncorrelated). In finance, this is a dangerous assumption to make, so understand the risks in doing so. If necessary, try using multivariate regression. 

* **Normality:** For any fixed value of $X$, $Y$ is normally distributed. Least squares point estimates can still be computed without this assumption, but the usual significance tests and confidence intervals rely on it. What if the tails are heavier than those of a normal distribution (e.g. $Y$ follows a $t$-distribution)? Robust regression, an alternative to least squares regression that down-weights outlying and influential observations, can be used to account for this.

* **No multicollinearity:** Assume the covariates aren't perfectly correlated. If they are, i.e. $X_1$ determines $X_2$ exactly, $X^T X$ is singular and the parameters can't be uniquely determined. Even when the correlation is strong but not perfect, the parameter estimates can have large uncertainties around them.
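
To make the multicollinearity point concrete, here is a minimal sketch using hypothetical toy data in which one covariate is an exact multiple of another. The variable names and values are made up for illustration. The design matrix has fewer independent columns than parameters, so least squares has no unique solution.

```python
import numpy as np

# Hypothetical toy data: x2 is exactly 2 * x1, so the two are perfectly correlated.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones_like(x1), x1, x2])  # intercept, x1, x2

rank = np.linalg.matrix_rank(X)    # number of linearly independent columns
corr = np.corrcoef(x1, x2)[0, 1]   # sample correlation between x1 and x2

print(f"columns: {X.shape[1]}, rank: {rank}, corr(x1, x2): {corr:.3f}")
```

A rank below the number of columns is the signal to drop or combine covariates before fitting.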

### Parameter Derivation

We can estimate the regression parameters by minimizing a loss function, in this case the sum of squared errors. Because this minimization can be solved analytically, the result is known as the closed-form solution of linear regression. The total sum of squared errors is:

\begin{equation*}
S(\beta) = \epsilon^T \epsilon = (Y - X \beta)^T (Y - X \beta)
\label{eq:2} \tag{2}
\end{equation*}

where $X$ is an $m \times n$ matrix, $Y$ is an $m \times 1$ vector, and $\beta$ is an $n \times 1$ vector. Knowing that the squared norm of a vector $v$ is the scalar $v^T v$, and that $(AB)^T = B^TA^T$ for conformable $A$ and $B$, we have:

\begin{equation*}
\begin{split}
S(\beta) &= (Y - X \beta)^T (Y - X \beta) \\
&= (Y^T - \beta^T X^T)(Y - X \beta) \\
&= Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta \\
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta
\end{split}
\label{eq:3} \tag{3}
\end{equation*}

To elaborate on that last line, $Y^T X \beta$ is a scalar since $(1 \times m) \times (m \times n) \times (n \times 1) = (1 \times 1)$. The transpose of a scalar is the scalar itself (i.e. $a^T = a$, where $a \in \mathbb{R}$), so we can rewrite $Y^T X \beta = (Y^T X \beta)^T = \beta^T X^T Y$.

$\newcommand{\R}{\mathbb{R}}$Now, to minimize the loss function, we set its derivative equal to 0. First, let's find the derivative of $S(\beta)$ (using [common vector derivatives](http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf)):

\begin{equation*}
\frac{dS(\beta)}{d\beta} = -2 X^T Y + 2 X^T X \beta
\label{eq:4} \tag{4}
\end{equation*}

Now, by setting $\frac{dS(\beta)}{d\beta} = 0$, we have:

\begin{equation*}
\begin{split}
&X^T X \beta = X^T Y \\
&\Rightarrow (X^T X)^{-1} X^T X \beta = (X^T X)^{-1} X^T Y \\
&\Rightarrow \boxed{\beta = (X^T X)^{-1} X^T Y}
\end{split}
\label{eq:5} \tag{5}
\end{equation*}
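
As a quick numerical check of equation (5), here is a minimal NumPy sketch. The toy data and variable names are hypothetical. It solves the linear system $X^T X \beta = X^T Y$ directly rather than forming the inverse, which is generally the more stable choice, and compares the result with NumPy's own least squares routine.

```python
import numpy as np

# Hypothetical toy data: m = 5 observations, n = 2 columns
# (a column of ones for the intercept plus one covariate).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
true_beta = np.array([1.0, 2.0])
Y = X @ true_beta  # noise-free, so the fit should recover true_beta

# Solve X^T X beta = X^T Y (the normal equations) without an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("normal equations:", beta_hat)
print("np.linalg.lstsq: ", beta_lstsq)
```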

## Simple Linear Regression

Simple linear regression is the special case with a single covariate, typically fitted by ordinary least squares (OLS). When two variables $x$ (independent) and $y$ (dependent) are linearly related, their relationship can be written in the form of a regression line:

\begin{equation*}
y = \alpha + \beta x
\label{eq:6} \tag{6}
\end{equation*}

where $\alpha$ (the intercept) and $\beta$ (the slope) are constants.

### Parameter Derivation

To estimate parameters $\alpha$ and $\beta$, let's first revisit what we know. We're given $n$ inputs and outputs $(x_i, y_i)$, so we can define our line of best fit as:

\begin{equation*}
y_i = \alpha + \beta x_i + \epsilon_i
\label{eq:7} \tag{7}
\end{equation*}

such that it minimizes the sum of squared errors:

\begin{equation*}
S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
\label{eq:8} \tag{8}
\end{equation*}

We can minimize $S$ by setting its partial derivatives with respect to $\alpha$ and $\beta$ to 0. Let's start with $\alpha$:

\begin{equation*}
\begin{split}
&\frac{\partial}{\partial \alpha} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = 0 \\
&\Rightarrow -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
\end{split} 
\label{eq:9} \tag{9}
\end{equation*}

Knowing that $\sum_{i=1}^{n} \alpha = n \alpha$ and after expanding the above, we get:

\begin{equation*}
\begin{split}
&\Rightarrow \sum_{i=1}^{n} y_i - n \alpha - \beta \sum_{i=1}^{n} x_i = 0 \\
&\Rightarrow \alpha = \frac{\sum_{i=1}^{n} y_i - \beta \sum_{i=1}^{n} x_i}{n} \\
&\Rightarrow \boxed{\alpha = \bar{y} - \beta \bar{x}}
\end{split} 
\label{eq:10} \tag{10}
\end{equation*}

Now, solving for $\beta$, we have:

\begin{equation*}
\begin{split}
&\frac{\partial}{\partial \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = 0 \\
&\Rightarrow -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0
\end{split}
\label{eq:11} \tag{11}
\end{equation*}

which, when combined with our definition of $\alpha$ above, gives:

\begin{equation*}
\begin{split}
&\sum_{i=1}^{n} (\bar{y} x_i - x_i y_i - \beta \bar{x} x_i + \beta x_i^2) = 0 \\
\Rightarrow \beta &= \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} \\
&= \boxed{\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\end{split}
\label{eq:12} \tag{12}
\end{equation*}

which is also equal to:

\begin{equation*}
\beta = \frac{\text{cov}(x,y)}{\text{var}(x)} = \rho_{xy} \frac{\sigma_y}{\sigma_x}
\label{eq:13} \tag{13}
\end{equation*}
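
The formulas in equations (10), (12), and (13) can be checked numerically. The sketch below uses hypothetical toy data and computes the slope three ways: from the sums in (12), from the covariance and variance, and from the correlation and standard deviations. The sample (`ddof=1`) versions of the variance and standard deviation are an assumption made here to match `np.cov`, whose default is also the sample covariance.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Equation (12): slope from centered sums.
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Equation (10): intercept from the means.
alpha = y_bar - beta * x_bar

# Equation (13): the same slope from cov/var and from rho * sigma_y / sigma_x.
beta_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
rho = np.corrcoef(x, y)[0, 1]
beta_rho = rho * np.std(y, ddof=1) / np.std(x, ddof=1)

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
print(f"beta via cov/var = {beta_cov:.4f}, via rho = {beta_rho:.4f}")
```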

### Significance tests

The $t$-statistic is used to test the significance of each parameter:

\begin{equation*}
t_{\hat{\beta}} = \frac{\hat{\beta} - \beta_0}{\sigma_{\hat{\beta}}}
\label{eq:14} \tag{14}
\end{equation*}

where $\hat{\beta}$ is the estimated value, $\beta_0$ is the hypothesized value tested against $\hat{\beta}$ (normally set to 0), and $\sigma_{\hat{\beta}}$ is the standard error of the estimator $\hat{\beta}$.
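
The sketch below computes the $t$-statistic for the slope of a simple linear regression on hypothetical toy data. The text above doesn't say how to obtain the denominator, so the code assumes the standard textbook formula for the slope: the residual variance (with $n - 2$ degrees of freedom) divided by $\sum (x_i - \bar{x})^2$, then square-rooted.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Least squares estimates from the simple linear regression formulas.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Residual variance with n - 2 degrees of freedom (two estimated parameters).
residuals = y - (alpha_hat + beta_hat * x)
s2 = np.sum(residuals ** 2) / (n - 2)

# Standard error of the slope (assumed textbook formula, not given in the text).
se_beta = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))

beta_0 = 0.0  # hypothesized value of the slope
t_stat = (beta_hat - beta_0) / se_beta

print(f"beta_hat = {beta_hat:.3f}, se = {se_beta:.4f}, t = {t_stat:.1f}")
```

The resulting statistic would then be compared against a $t$-distribution with $n - 2$ degrees of freedom.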

## LASSO Regression

Linear regression models can be subject to overfitting. LASSO regression, also known as **L1 regularization**, aims to fix this by eliminating features from the model, which makes it a form of feature selection in machine learning. It does this by adding an L1 penalty, proportional to the sum of the absolute values of the coefficients, to the loss function, which drives some coefficients to exactly zero. It works particularly well when a small number of parameters are significant and the others have coefficients close to zero.
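
To illustrate how an L1 penalty sets a coefficient to exactly zero, here is a minimal sketch that fits LASSO by cyclic coordinate descent with soft-thresholding, which is one common way to solve it and not necessarily what any particular library does. The objective $\frac{1}{2}\lVert y - Xb \rVert^2 + \lambda \sum_j |b_j|$, the penalty weight `lam`, the function names, and the toy data are all assumptions made for this example.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; return exactly 0 when |rho| <= lam."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum(|b_j|), one coordinate at a time."""
    n_features = X.shape[1]
    b = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ partial_residual
            b[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return b

# Hypothetical toy data with orthogonal columns: the first feature matters a lot,
# the second only slightly (true coefficients 3.0 and 0.1, no noise).
X = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [-1.0, 1.0],
              [-1.0, -1.0]])
y = X @ np.array([3.0, 0.1])

b_lasso = lasso_coordinate_descent(X, y, lam=1.0)
print("LASSO coefficients:", b_lasso)
```

In this example the weak second coefficient is removed entirely, while the strong first coefficient is kept but shrunk below its true value of 3.0.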

## Ridge Regression

Also known as **L2 regularization**. It's similar to LASSO regression except that, instead of eliminating features entirely, it shrinks their coefficients toward zero by adding an L2 penalty proportional to the sum of the squared coefficients. Since it's not getting rid of any features, only reducing their effect on the model, ridge regression isn't considered part of feature selection. It works particularly well when there are many significant parameters with close to the same value.
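
Unlike LASSO, ridge regression has a closed-form solution: adding the penalty $\lambda \lVert \beta \rVert^2$ to the least squares loss gives $\beta = (X^T X + \lambda I)^{-1} X^T Y$. The sketch below applies it to hypothetical toy data. The function name and the value of `lam` are assumptions, and for simplicity there is no intercept and every coefficient is penalized.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for the ridge coefficients."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical toy data with orthogonal columns and true coefficients 3.0 and 0.1.
X = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [-1.0, 1.0],
              [-1.0, -1.0]])
y = X @ np.array([3.0, 0.1])

b_ols = ridge_closed_form(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
b_ridge = ridge_closed_form(X, y, lam=1.0)  # lam > 0 shrinks every coefficient

print("OLS coefficients:  ", b_ols)
print("ridge coefficients:", b_ridge)
```

Both coefficients are scaled toward zero, but neither becomes exactly zero, so no feature is dropped.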

## Stepwise Regression

Stepwise regression (in its forward form) starts with the simplest model and evaluates its performance. It then adds another variable, evaluates the new model, and compares it to the previous one, repeating until the best-performing model is found.
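
The procedure can be sketched as a greedy loop. The text doesn't say which performance measure to use, so this example assumes adjusted $R^2$, starts from the intercept-only model, and stops as soon as no remaining variable improves the score. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept on the columns of X; return adjusted R-squared."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - k - 1)) / (sst / (n - 1))

def forward_stepwise(X, y):
    """Greedily add the covariate that most improves adjusted R-squared."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = 0.0  # adjusted R-squared of the intercept-only model
    while remaining:
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:
            break  # no candidate improves on the current model
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.01 * rng.normal(size=50)

selected, score = forward_stepwise(X, y)
print("selected columns:", selected, "adjusted R^2:", round(score, 4))
```

Other criteria, such as AIC or cross-validated error, could be swapped in for `adjusted_r2` without changing the loop.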

## Evaluating Regression

The following metrics are used to evaluate a regression model, where $\hat{y}_i$ is the predicted value and $\bar{y}$ is the mean of the observed values:

\begin{equation*}
\begin{split}
\text{Mean Squared Error (MSE)} &= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \\
\text{Root Mean Squared Error (RMSE)} &= \sqrt{\text{MSE}} \\
\text{Mean Absolute Error (MAE)} &= \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \\
\text{Residual Sum of Squares (RSS)} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\text{Total Sum of Squares (SST)} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\text{Regression Sum of Squares (SSR)} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
\text{Sum of Squares Error (SSE)} &= \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \text{RSS} \\
\text{R-Squared (}R^2) &= \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}
\end{split}
\label{eq:15} \tag{15}
\end{equation*} 
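
These metrics are straightforward to compute once predictions are available. The sketch below fits a line to hypothetical toy data by least squares and evaluates each quantity. The identity $\text{SST} = \text{SSR} + \text{RSS}$, which makes the two forms of $R^2$ agree, holds for least squares fits that include an intercept, as this one does.

```python
import numpy as np

# Hypothetical toy data, fitted by least squares with an intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
design = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ beta  # predicted values

mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
rss = np.sum((y - y_hat) ** 2)         # also called SSE
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
r2 = 1.0 - rss / sst

print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}, MAE = {mae:.4f}")
print(f"RSS = {rss:.4f}, SST = {sst:.4f}, SSR = {ssr:.4f}, R^2 = {r2:.4f}")
```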