# Linear Regression

Linear regression predicts an unknown value from known ones. It does this by modeling a statistical relationship between an unknown (dependent) variable and known (independent) features.

## Multiple Linear Regression

Let $Y$ be the variate (response), $X$ the covariates, $\beta$ a vector of parameters (or coefficients), and $\epsilon$ the residual error. For the observed $Y$, we make the following assumption:

\begin{equation*}
Y = X \beta + \epsilon
\label{eq:1} \tag{1}
\end{equation*}

where $X$ is an $m \times n$ matrix, $Y$ and $\epsilon$ are $m \times 1$ vectors, and $\beta$ is an $n \times 1$ vector:

$$
\left[\begin{array}{c}
Y_1 \\
Y_2 \\
\vdots \\
Y_m
\end{array}\right]
=
\left[\begin{array}{cccc}
X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\
X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
X_{m,1} & X_{m,2} & \cdots & X_{m,n}
\end{array}\right]
\left[\begin{array}{c}
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_n
\end{array}\right]
+
\left[\begin{array}{c}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_m
\end{array}\right]
\label{eq:2} \tag{2}
$$
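
To make the shapes concrete, here is a minimal NumPy sketch that simulates data from the model. The sizes, the parameter values, the noise scale, and the seed are all made up for illustration; the only point is that an $m \times n$ matrix times an $n \times 1$ vector gives an $m \times 1$ vector.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

m, n = 100, 3                                # m observations, n covariates (illustrative)
X = rng.normal(size=(m, n))                  # m x n matrix of covariates
beta = np.array([[1.5], [-2.0], [0.5]])      # n x 1 parameter vector (made-up values)
eps = rng.normal(scale=0.1, size=(m, 1))     # m x 1 residual vector

# (m x n) @ (n x 1) -> (m x 1), then add the m x 1 residual
Y = X @ beta + eps

print(X.shape, beta.shape, eps.shape, Y.shape)
```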

### Assumptions

* **Linearity:** The relationship between $Y$ and $X$ is linear. If we fit a linear model to a non-linear data set, the model fails to capture the trend and predicts poorly. Instead, we can build a regression model on non-linear transformations of $X$ (the model stays linear in $\beta$), such as:

\begin{equation*}
Y = \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 \log(X_1)
\end{equation*}

* **Independence:** The error terms are uncorrelated with each other (i.e. $E[\epsilon_i \epsilon_j]=0$ for $i, j \in [1,m]$ where $i \neq j$), and $\epsilon$ is independent of each $x_i$. In finance, this is a dangerous assumption to make, so understand the risks in doing so. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard errors, so confidence intervals and prediction intervals come out narrower than they should be and overstate the model's precision. If necessary, try using multivariate multiple linear regression.

* **Normality:** The residual is normally distributed with zero mean (i.e. $\epsilon_i \sim N(0, \sigma^2)$, so $E[\epsilon_i] = 0$), which means that, for any fixed value of $X$, $Y$ is normally distributed. Depending on how the model is used, this assumption can be relaxed: the least-squares point estimates don't require normality, but the usual confidence intervals do. If the residual isn't normally distributed, the confidence intervals may become too wide or too narrow. For instance, the errors can follow a heavier-tailed distribution than the normal (e.g. a $t$-distribution). Robust regression, an alternative to least squares regression that is less sensitive to influential observations, can be used to account for this.

* **Homoscedasticity:** The variance of the residual is constant, i.e. the same for any value of $X$ ($\text{var}(\epsilon_i) = \sigma^2$). Non-constant residual variance (often caused by outliers) gives some observations disproportionate weight, which affects the model's performance by unrealistically widening or shrinking the confidence interval. If the assumption of homoscedasticity doesn't hold, we can assume a constant coefficient of variation instead of a constant variance. We can also use a variance-stabilizing transformation on the data.

* **No perfect multicollinearity:** The covariates aren't perfectly correlated (i.e. $\rho(x_i, x_j) \neq \pm 1$, where $i \neq j$). If the covariates are perfectly collinear, i.e. $X_1$ determines $X_2$ exactly through a linear relationship, the parameters generally can't be estimated uniquely. When the covariates are highly but not perfectly correlated, the parameter estimates carry large uncertainties and the regression coefficients become very sensitive to small changes in the data. Some potential solutions are:

 * Removing highly correlated variables.
 * Linearly combining highly correlated variables.
 * Using PCA, partial least squares, LASSO, or ridge regression.
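
Two of the assumptions above are easy to illustrate numerically. The sketch below uses made-up data and parameter values; it first fits a model that is non-linear in $x$ but linear in $\beta$ (using the transformed features $x$, $x^2$, and $\log(x)$), then shows that a covariate that is an exact multiple of another leaves the covariate matrix rank-deficient. `np.linalg.lstsq` is used here only as a convenient least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# --- Linearity: non-linear in x, still linear in beta ---
x = rng.uniform(1.0, 5.0, size=200)               # keep x > 0 so log(x) is defined
true_beta = np.array([2.0, -0.3, 1.0])            # made-up parameter values
features = np.column_stack([x, x**2, np.log(x)])  # transformed covariates
y = features @ true_beta + rng.normal(scale=0.05, size=x.size)

beta_hat, *_ = np.linalg.lstsq(features, y, rcond=None)
print("estimated beta:", beta_hat)

# --- Perfect multicollinearity: second column is exactly 3x the first ---
collinear = np.column_stack([x, 3.0 * x])
rank = np.linalg.matrix_rank(collinear)
print("columns:", collinear.shape[1], "rank:", rank)  # rank < number of columns
```

When the rank is lower than the number of columns, the two coefficients can't be separated: any split of the combined effect between the two columns fits the data equally well.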

### Parameter Derivation

We can estimate the regression parameter $\beta$ by minimizing the loss function, which yields the closed-form solution of linear regression (the normal equation). The loss function, or the sum of squared errors, is:

\begin{equation*}
S(\beta) = \epsilon^2 = (Y - X \beta)^2
\label{eq:3} \tag{3}
\end{equation*}

Knowing that the square of a column vector $v$ is the scalar $v^2 = v^T v$, and that $(AB)^T = B^TA^T$ for conformable matrices $A$ and $B$, we have:

\begin{equation*}
\begin{split}
S(\beta) &= (Y - X \beta)^2 \\
&= (Y - X \beta)^T(Y - X \beta) \\
&= (Y^T - \beta^T X^T)(Y - X \beta) \\
&= Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta \\
&= Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta
\end{split}
\label{eq:4} \tag{4}
\end{equation*}

To expand on that last line, $Y^T X \beta$ is a scalar since $(1 \times m) \times (m \times n) \times (n \times 1) = 1 \times 1$. The transpose of a scalar is the same scalar (i.e. $A^T = A$, where $A \in \mathbb{R}$), so we can rewrite $Y^T X \beta$ as $(Y^T X \beta)^T = \beta^T X^T Y$.

$\newcommand{\R}{\mathbb{R}}$Now, to minimize the loss function, we set its derivative with respect to $\beta$ equal to 0. The derivative of $S(\beta)$ using [common vector derivatives](http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf) is:

\begin{equation*}
\frac{dS(\beta)}{d\beta} = - 2 X^T Y + 2 X^T X \beta
\label{eq:5} \tag{5}
\end{equation*}
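
As a sanity check on this derivative, the sketch below compares the analytic expression $-2 X^T Y + 2 X^T X \beta$ with a central finite-difference approximation of the sum of squared errors. The data are random and the step size `h` is an arbitrary choice; this is a numerical check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
m, n = 50, 4                    # illustrative sizes
X = rng.normal(size=(m, n))
Y = rng.normal(size=(m, 1))
beta = rng.normal(size=(n, 1))  # point at which the gradient is evaluated


def loss(b):
    """Sum of squared errors S(b) = (Y - X b)^T (Y - X b)."""
    r = Y - X @ b
    return float(np.sum(r**2))


# Analytic gradient from the derivation
analytic = -2 * X.T @ Y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate of beta at a time
h = 1e-6
numeric = np.zeros_like(beta)
for j in range(n):
    step = np.zeros_like(beta)
    step[j, 0] = h
    numeric[j, 0] = (loss(beta + step) - loss(beta - step)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```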

Finally, by setting $\frac{dS(\beta)}{d\beta} = 0$ and assuming $X^T X$ is invertible, we have:

\begin{equation*}
\begin{split}
&X^T X \beta = X^T Y \\
&\Rightarrow (X^T X)^{-1} X^T X \beta = (X^T X)^{-1} X^T Y \\
&\Rightarrow \boxed{\beta = (X^T X)^{-1} X^T Y}
\end{split}
\label{eq:6} \tag{6}
\end{equation*}
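
The closed-form estimate can be sketched in a few lines of NumPy. The data below are simulated with made-up parameter values, and the sketch solves the linear system $X^T X \beta = X^T Y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice rather than something the derivation requires. The result is compared against `np.linalg.lstsq` as an independent check.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
m, n = 200, 3                   # illustrative sizes
X = rng.normal(size=(m, n))
true_beta = np.array([[1.0], [-2.0], [0.5]])  # made-up values
Y = X @ true_beta + rng.normal(scale=0.1, size=(m, 1))

# Normal equation: solve (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Independent check with NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat.ravel())
print(beta_lstsq.ravel())
```

At the minimum, the residual $Y - X\beta$ is orthogonal to every column of $X$, which is just the condition $X^T (Y - X \beta) = 0$ rearranged.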

## Simple Linear Regression

Simple linear regression is the special case with a single covariate, typically fit by ordinary least squares (OLS). When two variables $x$ (independent) and $y$ (dependent) are linearly related, their relationship can be written in the form of a regression line:

\begin{equation*}
y = \alpha + \beta x
\label{eq:7} \tag{7}
\end{equation*}

where $\alpha$ (the intercept) and $\beta$ (the slope) are constants.

### Parameter Derivation

To estimate the parameters $\alpha$ and $\beta$, let's first revisit what we know. We're given $n$ input-output pairs $(x_i, y_i)$ (here $n$ counts observations, not covariates), so we can define our line of best fit as:

\begin{equation*}
y_i = \alpha + \beta x_i + \epsilon_i
\label{eq:8} \tag{8}
\end{equation*}

such that it minimizes the sum of squared errors:

\begin{equation*}
S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
\label{eq:9} \tag{9}
\end{equation*}

We can minimize $S$ by setting its partial derivatives with respect to $\alpha$ and $\beta$ to 0. Let's start with $\alpha$:

\begin{equation*}
\begin{split}
&\frac{\partial S}{\partial \alpha} = \frac{\partial}{\partial \alpha} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = 0 \\
&\Rightarrow -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
\end{split} 
\label{eq:10} \tag{10}
\end{equation*}

Dividing by $-2$, expanding the sum, and using $\sum_{i=1}^{n} \alpha = n \alpha$, we get:

\begin{equation*}
\begin{split}
&\Rightarrow \sum_{i=1}^{n} y_i - n \alpha - \beta \sum_{i=1}^{n} x_i = 0 \\
&\Rightarrow \alpha = \frac{\sum_{i=1}^{n} y_i - \beta \sum_{i=1}^{n} x_i}{n} \\
&\Rightarrow \boxed{\alpha = \bar{y} - \beta \bar{x}}
\end{split} 
\label{eq:11} \tag{11}
\end{equation*}

Now, solving for $\beta$, we have:

\begin{equation*}
\begin{split}
&\frac{\partial S}{\partial \beta} = \frac{\partial}{\partial \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = 0 \\
&\Rightarrow -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0
\end{split}
\label{eq:12} \tag{12}
\end{equation*}

which, when combined with our definition of $\alpha$ above, gives:

\begin{equation*}
\begin{split}
&\sum_{i=1}^{n} (\bar{y} x_i - x_i y_i - \beta \bar{x} x_i + \beta x_i^2) = 0 \\
\Rightarrow \beta &= \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} \\
&= \boxed{\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\end{split}
\label{eq:13} \tag{13}
\end{equation*}
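
The two boxed formulas translate directly into code. The sketch below uses a small made-up data set and compares the result with `np.polyfit` (degree 1), which fits the same least-squares line and returns the slope first, then the intercept.

```python
import numpy as np

# Made-up data, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of centered cross-products over sum of centered squares
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: alpha = y_bar - beta * x_bar
alpha = y_bar - beta * x_bar

# Cross-check against NumPy's degree-1 polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)

print(alpha, beta)
print(intercept, slope)
```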

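The last step uses $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ times either sum leaves the numerator and the denominator unchanged. Dividing the numerator and the denominator by $n - 1$ shows that this is also equal to: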

\begin{equation*}
\beta = \frac{\text{cov}(x,y)}{\text{var}(x)} = \rho_{xy} \frac{\sigma_y}{\sigma_x}
\label{eq:14} \tag{14}
\end{equation*}
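
These identities can be checked numerically. The sketch below simulates made-up data and computes the slope three ways: from the sums of centered products, from the sample covariance and variance, and from the sample correlation and standard deviations. It assumes the same `ddof=1` normalization throughout; any consistent choice works, since the normalizing factor cancels in each ratio.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
x = rng.normal(size=500)
y = 0.7 * x + rng.normal(scale=0.5, size=500)  # made-up linear relationship

# Slope from the least-squares formula
beta_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Slope as cov(x, y) / var(x)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
beta_cov = cov_xy / np.var(x, ddof=1)

# Slope as rho_xy * sigma_y / sigma_x
rho_xy = np.corrcoef(x, y)[0, 1]
beta_rho = rho_xy * np.std(y, ddof=1) / np.std(x, ddof=1)

print(beta_ols, beta_cov, beta_rho)  # the three values should agree
```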