# Ordinary Least Squares (OLS) Regression

Ordinary Least Squares (OLS) Regression is a foundational statistical method used to model the linear relationship between a dependent variable and one or more independent variables. The goal of OLS is to find the "best-fitting" line (or hyperplane in higher dimensions) through a set of data points by minimizing the sum of the squared differences between the observed values and the values predicted by the model.

## The Concept of Linear Regression

Imagine you have a set of data points, and you suspect there's a linear relationship between two variables, say, X and Y. For example, you might want to see if the number of hours studied (X) affects exam scores (Y).

The general idea of a linear relationship can be expressed as:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Where:
* $Y$: The dependent variable (the one you're trying to predict, e.g., exam score).
* $X$: The independent variable (the one you're using to predict Y, e.g., hours studied).
* $\beta_0$: The y-intercept (the expected value of Y when X is 0).
* $\beta_1$: The slope (the change in Y for a one-unit change in X).
* $\epsilon$: The error term (or disturbance), representing the difference between the observed value of Y and the value given by the true linear relationship $\beta_0 + \beta_1 X$. It accounts for all other factors influencing Y that are not captured by X, as well as random noise. The error term is unobservable; its sample counterpart, the residual, is defined below.

Our goal in OLS is to find the best estimates for $\beta_0$ and $\beta_1$, which we'll denote as $\hat{\beta_0}$ (beta-naught-hat) and $\hat{\beta_1}$ (beta-one-hat). These "hats" indicate that they are *estimates* derived from our sample data, not the true (and usually unknown) population parameters.

Once we have these estimates, our estimated regression line will be:

$$\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i$$

Where $\hat{Y_i}$ is the predicted value of the dependent variable for a given $X_i$.

## The "Least Squares" Principle

For each observed data point $(X_i, Y_i)$, there will be a difference between the actual observed value $Y_i$ and the predicted value $\hat{Y_i}$. This difference is called the **residual**, denoted as $e_i$:

$$e_i = Y_i - \hat{Y_i}$$

$$e_i = Y_i - (\hat{\beta_0} + \hat{\beta_1} X_i)$$

The core idea of Ordinary Least Squares is to find the values of $\hat{\beta_0}$ and $\hat{\beta_1}$ that **minimize the sum of the squared residuals (errors)**. Why squared errors?
* **To avoid cancellation:** If we just summed the errors, positive and negative errors could cancel each other out, leading to a sum close to zero even if individual errors are large. Squaring ensures all errors contribute positively to the total.
* **To penalize larger errors more:** Squaring gives more weight to larger errors, meaning the model tries harder to fit points that are far away from the line.

So, the objective function we want to minimize is the Sum of Squared Errors (SSE), often denoted as $S$:

$$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n} (Y_i - (\hat{\beta_0} + \hat{\beta_1} X_i))^2$$

Our task is to find $\hat{\beta_0}$ and $\hat{\beta_1}$ that minimize $S$.
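As a minimal sketch of this objective, the snippet below evaluates $S$ for two candidate lines. The data values and the helper name `sse` are made up for illustration and are not part of the derivation.

```python
import numpy as np

# Hypothetical data: hours studied (X) and exam scores (Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

def sse(b0, b1, x, y):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# Two candidate lines: the one with the smaller S fits these points better.
s_flat = sse(60.0, 0.0, X, Y)    # horizontal line at the mean of Y
s_sloped = sse(48.0, 4.0, X, Y)  # a line with positive slope
print(s_flat, s_sloped)
```

OLS searches over all possible pairs of intercept and slope for the pair that makes this quantity as small as possible.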

## Deriving the Formulas for Simple Linear Regression (One Independent Variable)

To find the values of $\hat{\beta_0}$ and $\hat{\beta_1}$ that minimize $S$, we use calculus. We take the partial derivatives of $S$ with respect to $\hat{\beta_0}$ and $\hat{\beta_1}$, set them equal to zero, and solve the resulting system of equations. These equations are known as the **Normal Equations**.

Let's start with $S$:

$$S = \sum_{i=1}^{n} (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)^2$$

### 1. Partial Derivative with respect to $\hat{\beta_0}$

$$\frac{\partial S}{\partial \hat{\beta_0}} = \frac{\partial}{\partial \hat{\beta_0}} \sum_{i=1}^{n} (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)^2$$

Using the chain rule, $\frac{\partial}{\partial x} (f(x))^2 = 2f(x) \cdot f'(x)$:

$$\frac{\partial S}{\partial \hat{\beta_0}} = \sum_{i=1}^{n} 2(Y_i - \hat{\beta_0} - \hat{\beta_1} X_i) \cdot (-1)$$

Set the derivative to zero. Because $S$ is a convex quadratic function of the coefficients, this stationary point is a minimum:

$$0 = -2 \sum_{i=1}^{n} (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)$$

Divide by -2:

$$0 = \sum_{i=1}^{n} (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)$$

Distribute the summation:

$$0 = \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \hat{\beta_0} - \sum_{i=1}^{n} \hat{\beta_1} X_i$$

Since $\hat{\beta_0}$ and $\hat{\beta_1}$ are constants with respect to the summation:

$$0 = \sum_{i=1}^{n} Y_i - n\hat{\beta_0} - \hat{\beta_1} \sum_{i=1}^{n} X_i$$

Rearrange to solve for $\hat{\beta_0}$:

$$n\hat{\beta_0} = \sum_{i=1}^{n} Y_i - \hat{\beta_1} \sum_{i=1}^{n} X_i$$

Divide by $n$:

$$\hat{\beta_0} = \frac{\sum_{i=1}^{n} Y_i}{n} - \hat{\beta_1} \frac{\sum_{i=1}^{n} X_i}{n}$$

We know that $\frac{\sum Y_i}{n} = \bar{Y}$ (mean of Y) and $\frac{\sum X_i}{n} = \bar{X}$ (mean of X).
So, the formula for $\hat{\beta_0}$ is:

```{math}
:label: equation-1
\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}
```

This equation tells us that the regression line passes through the point $(\bar{X}, \bar{Y})$. Let's prove this statement.

The equation we derived for the intercept $\hat{\beta_0}$ in simple linear regression is:

$$\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}$$

And the estimated regression line equation is:

$$\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} X_i$$

Let's take the estimated regression line equation and substitute the formula for $\hat{\beta_0}$ into it:

$$\hat{Y_i} = (\bar{Y} - \hat{\beta_1} \bar{X}) + \hat{\beta_1} X_i$$

Now, let's consider what happens if we plug in the mean of $X$ (which is $\bar{X}$) into this equation for $X_i$. What would the predicted value of $Y$ ($\hat{Y}$) be at that point?

Let $X_i = \bar{X}$:

$$\hat{Y}_{at\, \bar{X}} = \bar{Y} - \hat{\beta_1} \bar{X} + \hat{\beta_1} \bar{X}$$

Notice that the terms $-\hat{\beta_1} \bar{X}$ and $+\hat{\beta_1} \bar{X}$ cancel each other out:

$$\hat{Y}_{at\, \bar{X}} = \bar{Y}$$

This result means that when you input the average value of the independent variable ($\bar{X}$) into your OLS regression equation, the predicted value of the dependent variable ($\hat{Y}$) will be exactly the average value of the dependent variable ($\bar{Y}$).

In other words, whenever the model includes an intercept, the point $(\bar{X}, \bar{Y})$ *always* lies on the OLS regression line.
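This property can be checked numerically. The sketch below uses made-up data and relies on `np.polyfit` with degree 1 as a stand-in least-squares line fitter; any OLS routine with an intercept should behave the same way.

```python
import numpy as np

# Hypothetical data: hours studied (X) and exam scores (Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([52.0, 55.0, 61.0, 64.0, 68.0])

# np.polyfit with degree 1 performs a least-squares line fit and
# returns the coefficients highest power first: (slope, intercept).
b1_hat, b0_hat = np.polyfit(X, Y, 1)

# Evaluate the fitted line at the mean of X and compare with the mean of Y.
y_hat_at_mean = b0_hat + b1_hat * X.mean()
print(y_hat_at_mean, Y.mean())
```

Up to floating-point rounding, the two printed values agree.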

**What does this mean?**

1.  **The "Center of Gravity" of the Data:** You can think of $(\bar{X}, \bar{Y})$ as the "center of gravity" or the average point of your entire dataset. The OLS regression line is forced to pivot around this central point. No matter what the slope ($\hat{\beta_1}$) is, the line will always go through $(\bar{X}, \bar{Y})$.

2.  **Intuition for the Intercept:** The formula for the intercept, $\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}$, makes intuitive sense in this light. It effectively calculates what the Y-intercept needs to be so that, when combined with the calculated slope ($\hat{\beta_1}$), the line *must* pass through the point $(\bar{X}, \bar{Y})$.

3.  **Prediction at the Mean:** If you want to predict the value of Y at the average X, you don't need the slope and intercept explicitly; the prediction is simply the average Y. This highlights that the fitted line is anchored at the center of the data.

### 2. Partial Derivative with respect to $\hat{\beta_1}$

$$\frac{\partial S}{\partial \hat{\beta_1}} = \frac{\partial}{\partial \hat{\beta_1}} \sum_{i=1}^{n} (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)^2$$

Using the chain rule:

$$\frac{\partial S}{\partial \hat{\beta_1}} = \sum_{i=1}^{n} 2(Y_i - \hat{\beta_0} - \hat{\beta_1} X_i) \cdot (-X_i)$$

Set the derivative to zero (to minimize):

$$0 = -2 \sum_{i=1}^{n} X_i (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)$$

Divide by -2:

$$0 = \sum_{i=1}^{n} X_i (Y_i - \hat{\beta_0} - \hat{\beta_1} X_i)$$

Distribute $X_i$:

$$0 = \sum_{i=1}^{n} (X_i Y_i - \hat{\beta_0} X_i - \hat{\beta_1} X_i^2)$$

Distribute the summation:

```{math}
:label: equation-2
0 = \sum_{i=1}^{n} X_i Y_i - \hat{\beta_0} \sum_{i=1}^{n} X_i - \hat{\beta_1} \sum_{i=1}^{n} X_i^2
```

Now we have a system of two linear equations {eq}`equation-1` and {eq}`equation-2` with two unknowns ($\hat{\beta_0}$ and $\hat{\beta_1}$). We can substitute {eq}`equation-1` into {eq}`equation-2`.

Substitute $\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}$ into {eq}`equation-2`:

$$0 = \sum_{i=1}^{n} X_i Y_i - (\bar{Y} - \hat{\beta_1} \bar{X}) \sum_{i=1}^{n} X_i - \hat{\beta_1} \sum_{i=1}^{n} X_i^2$$

$$0 = \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i + \hat{\beta_1} \bar{X} \sum_{i=1}^{n} X_i - \hat{\beta_1} \sum_{i=1}^{n} X_i^2$$

Rearrange to isolate terms with $\hat{\beta_1}$:

$$\hat{\beta_1} \sum_{i=1}^{n} X_i^2 - \hat{\beta_1} \bar{X} \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i$$

Factor out $\hat{\beta_1}$ on the left side:

$$\hat{\beta_1} \left( \sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i$$

We know that $\sum_{i=1}^{n} X_i = n\bar{X}$. Substitute this into the equation:

$$\hat{\beta_1} \left( \sum_{i=1}^{n} X_i^2 - \bar{X} (n\bar{X}) \right) = \sum_{i=1}^{n} X_i Y_i - \bar{Y} (n\bar{X})$$

$$\hat{\beta_1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}$$

Finally, solve for $\hat{\beta_1}$:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$$

This is one common form of the formula for $\hat{\beta_1}$. It can also be expressed in terms of covariance and variance, which often provides more intuition:

Recall the definitions:
* Sample Covariance: $Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
* Sample Variance: $Var(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

Let's expand the centered sums that appear in these definitions and compare them with the numerator and denominator of the $\hat{\beta_1}$ formula:

**Numerator:**

$$\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (X_i Y_i - X_i \bar{Y} - \bar{X} Y_i + \bar{X}\bar{Y})$$
$$= \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i - \bar{X} \sum_{i=1}^{n} Y_i + \sum_{i=1}^{n} \bar{X}\bar{Y}$$
$$= \sum_{i=1}^{n} X_i Y_i - \bar{Y} (n\bar{X}) - \bar{X} (n\bar{Y}) + n\bar{X}\bar{Y}$$
$$= \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} - n\bar{X}\bar{Y} + n\bar{X}\bar{Y}$$
$$= \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}$$

This shows that the numerator of our $\hat{\beta_1}$ formula is indeed $\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$.

**Denominator:**

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i^2 - 2X_i\bar{X} + \bar{X}^2)$$
$$= \sum_{i=1}^{n} X_i^2 - 2\bar{X} \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \bar{X}^2$$
$$= \sum_{i=1}^{n} X_i^2 - 2\bar{X} (n\bar{X}) + n\bar{X}^2$$
$$= \sum_{i=1}^{n} X_i^2 - 2n\bar{X}^2 + n\bar{X}^2$$
$$= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$$

This shows that the denominator of our $\hat{\beta_1}$ formula is indeed $\sum_{i=1}^{n} (X_i - \bar{X})^2$.

Therefore, the formula for $\hat{\beta_1}$ can be elegantly written as:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$ 

Or, in terms of covariance and variance:

$$\hat{\beta_1} = \frac{(n-1)Cov(X, Y)}{(n-1)Var(X)} = \frac{Cov(X, Y)}{Var(X)}$$

So, for simple linear regression, the OLS estimators are:

$$\hat{\beta_1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{Cov(X, Y)}{Var(X)}$$

$$\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}$$
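The three equivalent expressions for the slope can be compared directly in code. This is a sketch on made-up data; the variable names are illustrative, and `np.cov` and `np.var` are called with `ddof=1` to match the $n-1$ definitions above.

```python
import numpy as np

# Hypothetical data: hours studied (X) and exam scores (Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([52.0, 55.0, 61.0, 64.0, 68.0])
n = len(X)
x_bar, y_bar = X.mean(), Y.mean()

# Slope from the centered sums.
b1_centered = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)

# Slope from the raw sums (algebraically identical).
b1_raw = (np.sum(X * Y) - n * x_bar * y_bar) / (np.sum(X ** 2) - n * x_bar ** 2)

# Slope as sample covariance over sample variance (both with n - 1).
b1_cov = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)

# Intercept from the slope and the two means.
b0_hat = y_bar - b1_centered * x_bar
print(b1_centered, b1_raw, b1_cov, b0_hat)
```

The centered form is usually the safer choice numerically, since the raw-sums form subtracts two potentially large numbers.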

## Multiple Linear Regression (Matrix Form)

When you have more than one independent variable (multiple linear regression), the derivations become more complex using summation notation. This is where matrix algebra simplifies things significantly.

The multiple linear regression model can be written as:

$$Y = X\beta + \epsilon$$

Where:
* $Y$: An $n \times 1$ column vector of observed dependent variable values.
    $Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$
* $X$: An $n \times (k+1)$ design matrix of independent variables. The first column is typically a column of ones (for the intercept term), and the subsequent $k$ columns are the values of the $k$ independent variables.
    $X = \begin{pmatrix} 1 & X_{11} & X_{12} & \dots & X_{1k} \\ 1 & X_{21} & X_{22} & \dots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \dots & X_{nk} \end{pmatrix}$
* $\beta$: A $(k+1) \times 1$ column vector of unknown regression coefficients ($\beta_0, \beta_1, \dots, \beta_k$).
    $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$
* $\epsilon$: An $n \times 1$ column vector of error terms.
    $\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$

The estimated regression equation in matrix form is:

$$\hat{Y} = X\hat{\beta}$$

Where $\hat{Y}$ is the $n \times 1$ vector of predicted values, and $\hat{\beta}$ is the $(k+1) \times 1$ vector of estimated coefficients.

The residuals vector is:

$$e = Y - \hat{Y} = Y - X\hat{\beta}$$
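To make the shapes concrete, here is a small sketch that builds a design matrix with a leading column of ones and computes predictions and residuals for an arbitrary coefficient vector. The data and the name `beta_candidate` are made up for illustration; the candidate is deliberately not the OLS solution.

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 2 predictors.
predictors = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
Y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack([np.ones(len(predictors)), predictors])

# An arbitrary candidate coefficient vector (not the OLS solution).
beta_candidate = np.array([0.0, 1.0, 1.0])

Y_hat = X @ beta_candidate   # predicted values, one per observation
e = Y - Y_hat                # residuals, one per observation
print(X.shape, Y_hat, e)
```

Here `X` has shape $n \times (k+1) = 5 \times 3$, matching the definition of the design matrix.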

Our objective is to minimize the sum of squared residuals, which in matrix form is:

$$S = e^T e = (Y - X\hat{\beta})^T (Y - X\hat{\beta})$$

Expanding this expression:

$$S = (Y^T - (X\hat{\beta})^T) (Y - X\hat{\beta})$$
$$S = (Y^T - \hat{\beta}^T X^T) (Y - X\hat{\beta})$$
$$S = Y^T Y - Y^T X\hat{\beta} - \hat{\beta}^T X^T Y + \hat{\beta}^T X^T X\hat{\beta}$$

Each of the middle two terms is a scalar (a $1 \times 1$ quantity), and the transpose of a scalar is itself. Therefore, $\hat{\beta}^T X^T Y = (Y^T X\hat{\beta})^T = Y^T X\hat{\beta}$.
So, we can combine the middle two terms:

$$S = Y^T Y - 2 Y^T X\hat{\beta} + \hat{\beta}^T X^T X\hat{\beta}$$

To find the $\hat{\beta}$ that minimizes $S$, we take the derivative of $S$ with respect to the vector $\hat{\beta}$ and set it to zero.

**Derivative rules for matrices (denominator layout convention):**
* $\frac{\partial (A\mathbf{x})}{\partial \mathbf{x}} = A^T$
* $\frac{\partial (\mathbf{x}^T A \mathbf{x})}{\partial \mathbf{x}} = (A + A^T)\mathbf{x}$ 
* Consequently, if A is symmetric, $\frac{\partial (\mathbf{x}^T A \mathbf{x})}{\partial \mathbf{x}} = 2A\mathbf{x}$
* NOTE: $A$ is a matrix and $\mathbf{x}$ is a column vector.
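These rules can be sanity-checked with finite differences. The sketch below tests the scalar special case of the first rule (the gradient of $\mathbf{c}^T \mathbf{x}$ is $\mathbf{c}$) and the quadratic-form rule. The helper `numerical_gradient` and the random test inputs are assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        grad[j] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # a generic (non-symmetric) matrix
c = rng.normal(size=3)
x = rng.normal(size=3)

# Linear form c^T x: the gradient should equal c.
grad_linear = numerical_gradient(lambda v: c @ v, x)

# Quadratic form x^T A x: the gradient should equal (A + A^T) x.
grad_quad = numerical_gradient(lambda v: v @ A @ v, x)

print(np.allclose(grad_linear, c, atol=1e-6))
print(np.allclose(grad_quad, (A + A.T) @ x, atol=1e-6))
```

A non-symmetric `A` is used on purpose, so the check exercises the general $(A + A^T)\mathbf{x}$ rule rather than only the symmetric shortcut.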

In our case, $X^T X$ is a symmetric matrix.

We differentiate $S$ term by term:

$$\frac{\partial S}{\partial \hat{\beta}} = \frac{\partial}{\partial \hat{\beta}} (Y^T Y - 2 Y^T X\hat{\beta} + \hat{\beta}^T X^T X\hat{\beta})$$

* $Y^T Y$ is a scalar constant with respect to $\hat{\beta}$ (it does not contain $\hat{\beta}$). The derivative of a constant is $0$.
* $\frac{\partial}{\partial \hat{\beta}} (-2 Y^T X\hat{\beta})$: we use the constant multiple rule: $-2 \frac{\partial}{\partial \hat{\beta}} (Y^T X\hat{\beta})$.
* Let $\mathbf{c}^T = Y^T X$. This is a $1 \times (k+1)$ row vector of constants. So, we are differentiating $\mathbf{c}^T \hat{\beta}$.
* Recall the derivative rule for a linear form: in denominator layout, $\frac{\partial (\mathbf{c}^T \mathbf{v})}{\partial \mathbf{v}} = \mathbf{c}$ (a column vector), whereas in numerator layout the same derivative is written as the row vector $\mathbf{c}^T$.
* **Crucially, since we adopted the denominator layout for scalar-by-vector derivatives earlier (resulting in a column vector), we need the column vector form:** if $f(\mathbf{v}) = \mathbf{c}^T \mathbf{v}$, then $\frac{\partial f}{\partial \mathbf{v}} = \mathbf{c}$.
* In our case, $\mathbf{c}^T = Y^T X$. So, the equivalent column vector is $(Y^T X)^T = X^T Y$.
* Therefore, $\frac{\partial}{\partial \hat{\beta}} (Y^T X\hat{\beta}) = X^T Y$.
* So, the second term becomes $-2 X^T Y$.

Now, let's consider the third term $\frac{\partial}{\partial \hat{\beta}} (\hat{\beta}^T X^T X\hat{\beta})$:
* This is a quadratic form of the type $\hat{\beta}^T A \hat{\beta}$, where $A = X^T X$.
* We've already established that $X^T X$ is symmetric.
* Recall the derivative rule for quadratic forms, using denominator layout for the gradient: $\frac{\partial (\mathbf{v}^T A \mathbf{v})}{\partial \mathbf{v}} = (A + A^T)\mathbf{v}$.
* Since $A = X^T X$ is symmetric, $A^T = A$.
* So, $\frac{\partial (\hat{\beta}^T X^T X\hat{\beta})}{\partial \hat{\beta}} = (X^T X + (X^T X)^T)\hat{\beta} = (X^T X + X^T X)\hat{\beta} = 2 X^T X \hat{\beta}$.
* This term becomes $2 X^T X \hat{\beta}$.

Here is the result:

$$\frac{\partial S}{\partial \hat{\beta}} = 0 - 2 X^T Y + 2 X^T X \hat{\beta}$$

Set the derivative to zero:

$$0 = -2 X^T Y + 2 X^T X \hat{\beta}$$

Rearrange the terms:

$$2 X^T X \hat{\beta} = 2 X^T Y$$

Divide by 2:

$$X^T X \hat{\beta} = X^T Y$$

This is the matrix form of the **Normal Equations**.

To solve for $\hat{\beta}$, we multiply both sides on the left by the inverse of $X^T X$. This inverse exists when $X$ has full column rank, which requires at least as many observations as coefficients ($n \geq k+1$) and no perfect multicollinearity among the independent variables.

$$(X^T X)^{-1} (X^T X) \hat{\beta} = (X^T X)^{-1} X^T Y$$

$$I \hat{\beta} = (X^T X)^{-1} X^T Y$$

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

This is the famous OLS estimator formula in matrix form for multiple linear regression. It directly gives you the vector of all estimated coefficients, including the intercept.
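A minimal NumPy sketch of this estimator follows, using made-up data. It solves the normal equations with `np.linalg.solve` rather than forming $(X^T X)^{-1}$ explicitly, which is a common numerical preference and not something the derivation requires, and it cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 2 predictors.
predictors = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
Y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])
X = np.column_stack([np.ones(len(predictors)), predictors])

# Solve the normal equations X^T X beta = X^T Y for beta.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At the solution, the residuals are orthogonal to every column of X,
# which is exactly what the normal equations state: X^T (Y - X beta) = 0.
e = Y - X @ beta_hat
print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq), np.allclose(X.T @ e, 0.0))
```

For poorly conditioned design matrices, routines based on QR or SVD factorizations (such as `np.linalg.lstsq`) are generally more stable than working with $X^T X$ directly.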

## Assumptions of OLS

The Gauss-Markov theorem guarantees that the OLS estimators are the Best Linear Unbiased Estimators (BLUE) when assumptions 1 through 6 below hold. Assumption 7 is not needed for BLUE, but it matters for exact inference in finite samples:

1.  **Linearity:** The relationship between the dependent variable and the independent variables is linear in the parameters.
2.  **Random Sampling:** The data is a random sample from the population.
3.  **No Perfect Multicollinearity:** There is no perfect linear relationship between the independent variables. (This ensures $(X^T X)^{-1}$ exists).
4.  **Zero Conditional Mean of Errors:** The expected value of the error term is zero for any given values of the independent variables ($E[\epsilon_i | X_i] = 0$). This implies that the independent variables are uncorrelated with the error term, although zero correlation alone is a weaker condition.
5.  **Homoscedasticity:** The variance of the error term is constant across all levels of the independent variables ($Var(\epsilon_i | X_i) = \sigma^2$).
6.  **No Autocorrelation:** The error terms are uncorrelated with each other ($Cov(\epsilon_i, \epsilon_j | X_i, X_j) = 0$ for $i \neq j$). This is particularly important for time-series data.
7.  **Normality of Errors (optional for BLUE, but important for inference):** The error terms are normally distributed ($\epsilon_i \sim N(0, \sigma^2)$). This assumption is crucial for performing exact hypothesis tests and constructing confidence intervals with correct coverage properties in finite samples. If the sample size is large enough, then under mild regularity conditions the Central Limit Theorem implies that the OLS estimators are approximately normally distributed even if the errors are not, allowing for asymptotically valid inference.
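Assumption 3 is the easiest to illustrate in code. In the sketch below, the second predictor is an exact multiple of the first, so the design matrix loses a rank and $X^T X$ cannot be inverted. The data and the tolerance passed to `np.linalg.matrix_rank` are illustrative choices.

```python
import numpy as np

# Hypothetical predictors where x2 is an exact linear function of x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones_like(x1), x1, x2])

# With perfect multicollinearity the rank is smaller than the number
# of columns, so X^T X is singular and the OLS formula has no unique solution.
rank = np.linalg.matrix_rank(X, tol=1e-8)
n_columns = X.shape[1]
print(rank, n_columns)
```

In practice the usual remedy is to drop one of the redundant variables, after which the remaining columns are linearly independent again.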