# **Ordinary Least Squares**

### **Multiple Linear Regression**

Multiple Linear Regression (MLR) is an extension of Simple Linear Regression (SLR) in that it uses multiple features in our model

From discussion: 
* In one model, $\hat{\theta}_1 = \frac{r \sigma_{speed}}{\sigma_{turn}}$ (the slope when predicting speed from turn)
* In a different model, $\hat{\theta}_{w,1} = \frac{r \sigma_{turn}}{\sigma_{speed}}$ (the slope when predicting turn from speed)
* To go from $\hat{\theta}_{w,1}$ to $\hat{\theta}_1$, we multiply by $\frac{\sigma_{speed}^2}{\sigma_{turn}^2}$, which converts the slope into the right units
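
As a quick numerical check of this conversion, the sketch below uses synthetic `turn` and `speed` arrays (made-up data, not the dataset from discussion) and confirms that multiplying one slope by the variance ratio recovers the other.

```python
import numpy as np

# Synthetic data for illustration only; the real discussion dataset is not used here.
rng = np.random.default_rng(0)
turn = rng.normal(size=200)
speed = 2.0 * turn + rng.normal(size=200)

r = np.corrcoef(turn, speed)[0, 1]

theta_1 = r * np.std(speed) / np.std(turn)    # slope when predicting speed from turn
theta_w1 = r * np.std(turn) / np.std(speed)   # slope when predicting turn from speed

# Multiply by sigma_speed^2 / sigma_turn^2 to convert between the two slopes
converted = theta_w1 * np.var(speed) / np.var(turn)
print(np.isclose(converted, theta_1))
```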


The multiple linear regression model takes on the form: 

$$ \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_p x_p $$

* Our predicted value of $y$, $\hat{y}$, is a linear combination of the individual **features** $x_i$ of a single observation, weighted by the parameters $\theta_i$

We define the **parameter vector** as the vector of all of our "weights" for each specific feature 
* We often incorporate our bias term here as well 

$$\theta = \begin{bmatrix}
           \theta_{0} \\
           \theta_{1} \\
           \vdots \\
           \theta_{p}
         \end{bmatrix}$$

We work with two vectors in OLS: 
1) A row vector corresponding to our observed data
2) A column vector containing the model parameters

The model's prediction is equivalent to the **dot (scalar)** product of the observation vector and parameter vector 

$$[1,\:x_{1},\:x_{2},\:x_{3},\:...,\:x_{p}] \theta = [1,\:x_{1},\:x_{2},\:x_{3},\:...,\:x_{p}] \begin{bmatrix}
           \theta_{0} \\
           \theta_{1} \\
           \vdots \\
           \theta_{p}
         \end{bmatrix} = \theta_0\:+\:\theta_1x_{1}\:+\:\theta_2 x_{2}\:+\:...\:+\:\theta_p x_{p}$$

Notice that we have inserted $1$ as the first value in the observation vector 
* When the dot product is computed, this $1$ will be multiplied with $\theta_0$ to give the intercept of the regression model 
* We call this $1$ entry the **intercept** or **bias** term 
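
The dot product above can be sketched in NumPy for a single observation. The parameter and feature values here are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

theta = np.array([1.0, 2.0, -0.5])   # [theta_0, theta_1, theta_2], hypothetical values
x = np.array([3.0, 4.0])             # one observation with p = 2 features

# Prepend the 1 so that theta_0 acts as the intercept
x_with_bias = np.concatenate(([1.0], x))

# 1*1.0 + 3*2.0 + 4*(-0.5)
y_hat = x_with_bias @ theta
print(y_hat)
```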

In statistics, fitting this kind of linear model with squared (L2) loss is called **Ordinary Least Squares (OLS)**
* The solution to OLS is the parameter vector $\hat{\theta}$ that minimizes the loss, also called the **least squares estimate**



### **Some of the Linear Algebra**

* We can express our linear regression in terms of matrices 

$$ \hat{Y} = X\Theta$$

* Here, $\hat{Y}$ is the **prediction vector** with $n$ elements, and contains the prediction made by the model for each one of our $n$ observations 
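* $X$ is the **design matrix**, with one row per observation and one column per feature (plus the column of $1$s for the intercept), so its shape is $n \times (p+1)$
* $\Theta$ is the **parameter vector** with $p+1$ elements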
* Note that our **true output** $Y$ is also a vector with $n$ elements 
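
A minimal sketch of the matrix form, assuming a small made-up feature table: we build the design matrix by stacking a column of ones next to the features, then get every prediction with one matrix-vector product.

```python
import numpy as np

# Hypothetical data: n = 3 observations, p = 2 features
features = np.array([[3.0, 4.0],
                     [1.0, 0.0],
                     [2.0, 2.0]])
n = features.shape[0]

# Design matrix: a column of ones (intercept) followed by the features
X = np.column_stack([np.ones(n), features])   # shape (n, p + 1)

theta = np.array([1.0, 2.0, -0.5])            # hypothetical parameter vector

Y_hat = X @ theta                             # one prediction per observation
print(X.shape, Y_hat)
```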

### **Loss Functions**

* For OLS, we define MSE (L2 loss) as the following: 

$$R(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} (||\mathbb{Y} - \hat{\mathbb{Y}}||_2)^2$$

or 

$$R(\theta) = \frac{1}{n} (||\mathbb{Y} - \mathbb{X} \theta||_2)^2$$

So, our new task is to find the parameter vector $\Theta$ that minimizes the cost function 

* We want to **minimize** the distance between the vectors of true values $Y$, and the predicted values $\hat{Y}$
* We want to minimize the **length** of the **residual** vector, defined as 

$$e = \mathbb{Y} - \mathbb{\hat{Y}} = \begin{bmatrix}
         y_1 - \hat{y}_1 \\
         y_2 - \hat{y}_2 \\
         \vdots \\
         y_n - \hat{y}_n
       \end{bmatrix}$$
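
To see that the summation form and the norm form of MSE agree, here is a small sketch with made-up true values and predictions.

```python
import numpy as np

Y = np.array([5.5, 2.0, 4.0])       # hypothetical true values
Y_hat = np.array([5.0, 3.0, 4.0])   # hypothetical predictions

e = Y - Y_hat                        # residual vector

mse_sum = np.mean(e ** 2)                       # (1/n) * sum of squared residuals
mse_norm = np.linalg.norm(e) ** 2 / len(Y)      # (1/n) * squared L2 norm of e

print(mse_sum, mse_norm)
```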


### **Terminology for MLR**

* $x$ is most often called 
    * Feature(s)
    * Independent variable(s)

* $y$ can be called an 
    * Output
    * Response 

* $\hat{y}$ can be called a
    * Prediction 

* $\theta$ can be called a
    * Weight(s)
    * Parameter(s)

* $\hat{\theta}$ can be called a
    * Estimator(s)
    * Optimal parameters 

A datapoint $(x,y)$ is also called an observation

### **Some of the Geometry**

* We can think of $\hat{Y}$ as a **linear combination of feature vectors**, scaled by the **parameters**

<img src="https://ds100.org/course-notes/ols/images/columns.png" alt="Image Alt Text" width="400" height="200">

$$\hat{\mathbb{Y}} =
\theta_0 \begin{bmatrix}
           1 \\
           1 \\
           \vdots \\
           1
         \end{bmatrix} + \theta_1 \begin{bmatrix}
           x_{11} \\
           x_{21} \\
           \vdots \\
           x_{n1}
         \end{bmatrix} + \ldots + \theta_p \begin{bmatrix}
           x_{1p} \\
           x_{2p} \\
           \vdots \\
           x_{np}
         \end{bmatrix}
         = \theta_0 \mathbb{X}_{:,\:1} + \theta_1 \mathbb{X}_{:,\:2} + \ldots + \theta_p \mathbb{X}_{:,\:p+1}$$
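
The column view can be checked directly: scaling each column of the design matrix by its parameter and summing gives the same vector as the matrix product. The values below are hypothetical, and note that NumPy columns are zero-indexed while the math above is one-indexed.

```python
import numpy as np

# Hypothetical design matrix (first column is the intercept column of ones)
X = np.array([[1.0, 3.0, 4.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 2.0]])
theta = np.array([1.0, 2.0, -0.5])

# Linear combination of the columns of X, weighted by the parameters
combo = sum(theta[j] * X[:, j] for j in range(X.shape[1]))

print(np.allclose(combo, X @ theta))
```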

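* The sum of residuals is $0$ whenever the model includes an intercept term; without one, the sum is generally not $0$
  * When the design matrix $X$ contains a column of all $1$s (the same intercept term for every observation), the residuals of the fitted model are orthogonal to that column, so they sum to $0$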
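
A small sketch of this point, using made-up data and `np.linalg.lstsq` as the fitting routine (the helper `residual_sum` is a hypothetical name introduced here): with an intercept column the residuals sum to roughly zero, and without it they generally do not.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical feature
y = np.array([3.0, 5.0, 7.0, 10.0])   # hypothetical outputs

def residual_sum(X, y):
    """Fit least squares on X and return the sum of the residuals."""
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum(y - X @ theta_hat)

X_with = np.column_stack([np.ones(len(x)), x])   # includes the intercept column
X_without = x.reshape(-1, 1)                     # no intercept column

sum_with = residual_sum(X_with, y)
sum_without = residual_sum(X_without, y)
print(sum_with, sum_without)
```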

Since $\hat{\mathbb{Y}} = \mathbb{X} \theta$ is a **linear combination** of the columns of $X$, we know that the **predictions are contained in the span** of $X$
* So we know that $\mathbb{\hat{Y}} \in \text{Span}(\mathbb{X})$

<img src="https://ds100.org/course-notes/ols/images/span.png" alt="Image Alt Text" width="400" height="200">

* Remember that the goal of model fitting is to generate predictions such that the distance between the vector of true values, $Y$, and the vector of predicted values, $\hat{Y}$, is minimized 
* This means **we want $\hat{Y}$ to be the vector in $\text{Span}(X)$ that is closest to $Y$**

<img src="https://ds100.org/course-notes/ols/images/residual.png" alt="Image Alt Text" width="400" height="200">

The vector in $\text{Span}(X)$ that is closest to $Y$ is always the **orthogonal projection** of $Y$ onto $\text{Span}(X)$
* Thus, we choose the $\theta$ that makes the **residual vector orthogonal to every vector in** $\text{Span}(X)$

### **The Least Squares Estimate**

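Any vector $\theta$ that minimizes MSE on a dataset must satisfy the **normal equation**, which comes from requiring the residual vector to be orthogonal to every column of $X$, i.e. $\mathbb{X}^T (\mathbb{Y} - \mathbb{X} \hat{\theta}) = 0$

$$\mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y}$$

When $\mathbb{X}^T \mathbb{X}$ is invertible, we can solve for $\hat{\theta}$ directly

$$\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1} \mathbb{X}^T \mathbb{Y}$$

We call this the **least squares estimate** of $\Theta$; this closed form is only valid if $\mathbb{X}^T \mathbb{X}$ is invertible 
* $\mathbb{X}^T \mathbb{X}$ is invertible exactly when $X$ has full column rank
    * This does NOT mean that $X$, $X^T$, or $Y$ need to be invertible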

* This also means that we cannot use this formula to find the optimal parameters under L1 loss, since the normal equation is derived by minimizing MSE 
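
The least squares estimate can be sketched in NumPy as follows. The data is synthetic and `true_theta` is a hypothetical parameter vector used only to generate it. Rather than forming the inverse explicitly, this sketch solves the linear system $\mathbb{X}^T \mathbb{X} \theta = \mathbb{X}^T \mathbb{Y}$ with `np.linalg.solve`, and cross-checks against `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(42)
n, p = 50, 2
features = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), features])       # design matrix with intercept column
true_theta = np.array([1.0, 2.0, -0.5])           # hypothetical "true" parameters
Y = X @ true_theta + rng.normal(scale=0.1, size=n)

# The closed form requires X to have full column rank
full_rank = np.linalg.matrix_rank(X) == X.shape[1]

# Solve (X^T X) theta = X^T Y instead of explicitly inverting X^T X
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's built-in least squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(full_rank, theta_hat, theta_lstsq)
```

Solving the system is generally preferred numerically over computing $(\mathbb{X}^T \mathbb{X})^{-1}$ and multiplying, though both express the same formula.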

### **Root Mean Squared Error**

* RMSE is very similar to MSE, but now we just take the square root!

$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2} $$

* A low RMSE indicates more "accurate" predictions - that there is lower average loss across the dataset

* Taking the square root like this also converts our value back into original, non-squared units of $y_i$, which is useful for understanding model performance
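
A minimal RMSE helper, assuming the true values and predictions are array-like and of equal length (the function name `rmse` and the example values are chosen here for illustration):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between true values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Squared residuals are 1, 0, 4, so RMSE = sqrt(5 / 3)
example = rmse([3, 5, 7], [2, 5, 9])
print(example)
```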

### **Multiple** $R^2$

* **Multiple** $R^2$, also referred to as the **coefficient of determination**, is the ratio of the variance of our **fitted values** $\hat{y}_i$ to the variance of our true values $y_i$
* It ranges from $0$ to $1$ and is effectively the *proportion* of variance in the observations that the **model explains**

$$R^2 = \frac{\text{variance of } \hat{y}_i}{\text{variance of } y_i} = \frac{\sigma^2_{\hat{y}}}{\sigma^2_y}$$

* We can interpret $R^2$ as the **square** of the correlation between $y$ and $\hat{y}$ (for OLS with an intercept term)
* As we add more features, $R^2$ on the training data never goes down, so a higher $R^2$ is not always a good thing
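
The sketch below computes $R^2$ three ways on synthetic data fit by OLS with an intercept: as the variance ratio above, as the squared correlation between $y$ and $\hat{y}$, and as one minus the ratio of residual to total sum of squares. The last form is not given in these notes; it is included as a commonly used equivalent in this setting.

```python
import numpy as np

# Synthetic data for illustration
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])               # OLS with an intercept term
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ theta_hat

r2_var = np.var(y_hat) / np.var(y)                 # variance of fitted / variance of true
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2         # squared correlation of y and y_hat
r2_resid = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_var, r2_corr, r2_resid)
```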

### **Properties of OLS**

1) When using the optimal parameter vector, our residuals $e = \mathbb{Y} - \hat{\mathbb{Y}}$ are orthogonal to $\text{Span}(\mathbb{X})$
* $\mathbb{X}^Te = 0$

2) For all linear models with an **intercept term**, the **sum of residuals is zero**
* $\sum_{i=1}^n e_i = 0$
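
Both properties can be checked numerically. In this sketch (made-up data, fit with `np.linalg.lstsq`), the first entry of $\mathbb{X}^T e$ is the dot product of the intercept column with the residuals, which is exactly the sum of the residuals, so property 2 is a special case of property 1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical feature
y = np.array([3.0, 5.0, 7.0, 10.0])   # hypothetical outputs

X = np.column_stack([np.ones(len(x)), x])          # intercept column + feature
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ theta_hat     # residual vector
Xt_e = X.T @ e            # should be (numerically) the zero vector

print(Xt_e, e.sum())
```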