# Introduction to linear models

Fall 2022: Peter Ralph

https://uodsci.github.io/dsci345

In [1]:
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15, 8)
import numpy as np
import pandas as pd
from dsci345 import pretty

rng = np.random.default_rng()

$$\renewcommand{\P}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\text{var}} \newcommand{\sd}{\text{sd}} \newcommand{\cov}{\text{cov}} \newcommand{\cor}{\text{cor}}$$
This is here so we can use `\P` and `\E` and `\var` and `\cov` and `\cor` and `\sd` in LaTeX below.

# A first linear model

If we suppose two random variables $X$ and $Y$ depend on each other,
perhaps the simplest set-up is that
$$ Y = a X + b + \epsilon, $$
where
- $a$ is the *slope*, i.e., how much $Y$ goes up by, on average, per unit of increase in $X$,
- $b$ is the *intercept*, i.e., the expected value of $Y$ when $X=0$ (if this makes sense),
- $\hat Y = a X + b$ is the *predicted value* of $Y$ given $X$,
- $\epsilon = Y - \hat Y$ is the *residual*, which must have mean zero for the interpretations of $a$ and $b$ above to hold.
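
To make the pieces of this model concrete, here is a minimal simulation sketch. The parameter values, sample size, seed, and the choice of normal distributions for $X$ and $\epsilon$ are all assumptions made for illustration; nothing above requires them.

```python
import numpy as np

rng = np.random.default_rng(123)  # seed chosen arbitrarily, for reproducibility

# Hypothetical values, chosen only for illustration
a, b = 2.0, 1.0   # slope and intercept
n = 10_000        # number of simulated observations

X = rng.normal(loc=3.0, scale=1.5, size=n)
epsilon = rng.normal(loc=0.0, scale=0.5, size=n)  # mean-zero noise
Y = a * X + b + epsilon

Y_hat = a * X + b       # predicted values of Y given X
residuals = Y - Y_hat   # recovers epsilon, up to floating-point error

print("mean of residuals:", residuals.mean())
```

The mean of the residuals should come out close to zero, since the noise was simulated with mean zero.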

# Prediction

Suppose $X$ and $Y$ are two related measurements,
which we treat as random variables having some joint distribution.
*Question:* If we know the value of $X$, how are we to best$^*$ predict the value of $Y$?

$^*$ where "best" means "with smallest mean squared error" (i.e., smallest MSE)

So: we'd like to define an estimator of $Y$,
which we'll call $\hat Y$,
and which will be a function of $X$ (and only $X$),
and we'd like that estimator to minimize
$$ \E[(Y - \hat Y)^2] \qquad \text{(the MSE)}.$$
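
With data in hand, the expectation in the MSE can be approximated by an average over observations. The sketch below compares two candidate predictors on simulated data; the helper name `mse`, the simulated relationship between $X$ and $Y$, and the two candidates are all assumptions for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Empirical mean squared error: the average of (y - y_hat)^2."""
    return np.mean((y - y_hat) ** 2)

rng = np.random.default_rng(1)  # arbitrary seed
n = 5_000
X = rng.normal(size=n)
Y = 2 * X + rng.normal(size=n)  # hypothetical relationship between X and Y

# Candidate 1: ignore X and always predict zero
mse_const = mse(Y, np.zeros_like(Y))
# Candidate 2: a predictor that is a function of X
mse_lin = mse(Y, 2 * X)

print("MSE ignoring X:", mse_const)
print("MSE using X:   ", mse_lin)
```

In this simulation the predictor that uses $X$ has a much smaller MSE, which is the sense in which it is "better".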

## First observation

If we're allowed to add a constant to our estimator,
then the mean of the optimal estimator, $\hat Y$, must match that of $Y$:
$$ \E[\hat Y] = \E[Y] . $$

*Note:* to see why the "if we're allowed to add a constant" condition matters,
suppose we're looking for an $a$ such that $\hat Y = \exp(a X)$;
then $\hat Y + m$ is *not* allowed, as it doesn't take the same form.

*Proof:*
Suppose that $\tilde Y$ is an estimator;
consider the MSE for $\tilde Y + m$, where $m$ is a number,
which we define to be $f(m)$:
$$\begin{aligned}
 f(m)
 &=
 \E[(Y - (\tilde Y + m))^2] \\
 &=
 \E[(Y - \tilde Y)^2]
 - 2 m \E[Y - \tilde Y]
 + m^2 .
\end{aligned}$$
We want to find the value of $m$ that minimizes $f(m)$,
so differentiate with respect to $m$:
$$\begin{aligned}
 f'(m)
 &=
 - 2\E[Y - \tilde Y]
 + 2 m .
\end{aligned}$$
Setting this equal to zero (and noting that $f''(m) = 2 > 0$, so the solution is a minimum), we find that $f'(m) = 0$ if
$$ m = \E[Y - \tilde Y] = \E[Y] - \E[\tilde Y]. $$

In conclusion, we've found out that if $\tilde Y$ is an estimator,
then $\hat Y = \tilde Y + \E[Y] - \E[\tilde Y]$ is at least as good an estimator,
and a strictly *better* one unless $\E[\tilde Y] = \E[Y]$ already.
Notice also that, by additivity of means,
$\E[\hat Y] = \E[Y]$.

So, any estimator of $Y$ can be improved (in the mean-squared-error sense)
by shifting it so that the mean of the estimator is equal to the mean of $Y$. 
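
The effect of this shift can be checked numerically. The following is a sketch under assumed, simulated data: the deliberately biased estimator `Y_tilde` and the distributions used are hypothetical. With sample averages in place of expectations, the MSE drops by exactly $m^2$, matching the expansion of $f(m)$ above.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n = 100_000
X = rng.exponential(size=n)
Y = 3 * X + 2 + rng.normal(size=n)  # hypothetical data-generating process

# A deliberately biased estimator: it leaves out the "+ 2"
Y_tilde = 3 * X

# Shift by m = mean(Y) - mean(Y_tilde), so the means match
m = Y.mean() - Y_tilde.mean()
Y_hat = Y_tilde + m

mse_tilde = np.mean((Y - Y_tilde) ** 2)
mse_hat = np.mean((Y - Y_hat) ** 2)

print("MSE before shifting:", mse_tilde)
print("MSE after shifting: ", mse_hat)
print("difference vs m^2:  ", mse_tilde - mse_hat, m ** 2)
```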

## Second observation

If the estimator $\hat Y$ is *linear*,
i.e.,
$$ \hat Y = a X + b $$
for some $a$ and $b$, then the MSE is minimized by taking
$$
a = \frac{\sd[Y]}{\sd[X]} \cor[X, Y] ,
$$
and $b$ is chosen so that $\E[\hat Y] = \E[Y]$:
$$ b = \E[Y] - a \E[X] .$$

In words, the slope, $a$,
is equal to the correlation between $X$ and $Y$,
but in units of standard deviations.
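
As a sanity check, the formulas for $a$ and $b$ can be compared against a least-squares line fit. This is a sketch on simulated data (the true slope, intercept, and noise level are assumptions for illustration), with `np.polyfit` used as an independent least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
n = 50_000
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = -0.5 * X + 4.0 + rng.normal(scale=1.0, size=n)  # hypothetical truth

# Slope: correlation, rescaled by the ratio of standard deviations
a = Y.std() / X.std() * np.corrcoef(X, Y)[0, 1]
# Intercept: chosen so that the mean of the prediction matches the mean of Y
b = Y.mean() - a * X.mean()

# Independent check: degree-1 least-squares fit (returns slope, then intercept)
slope, intercept = np.polyfit(X, Y, deg=1)

print("formula:", a, b)
print("polyfit:", slope, intercept)
```

The two pairs of numbers should agree up to floating-point error, and both should be close to the values used to simulate the data.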

*Proof:*
We'll do just the same thing as above.
First, subtracting the means from $X$ and $Y$ changes only the intercept, not the slope,
so we can assume $\E[X] = \E[Y] = 0$,
which (by the first observation) implies that $b = 0$,
and $\E[X^2] = \var[X]$ and $\E[Y^2] = \var[Y]$
and $\E[XY] = \cov[X,Y]$.
Now,
$$\begin{aligned}
    \text{(MSE)}
    &= \E[(Y - \hat Y)^2 ] \\
    &= \E[(Y - a X)^2 ] \\
    &= \E[Y^2 - 2 a X Y + a^2 X^2 ] \\
    &= \var[Y] - 2 a \cov[X, Y] + a^2 \var[X] .
\end{aligned}$$
If we differentiate this with respect to $a$
and set it equal to zero, i.e., $-2 \cov[X, Y] + 2 a \var[X] = 0$,
we find that the MSE is minimized at
$$ a = \frac{\cov[X,Y]}{\var[X]} . $$
If we now substitute $\cov[X,Y] = \sd[X]\sd[Y]\cor[X,Y]$ and use $\var[X] = \sd[X]^2$,
we get the form of $a$ above.
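
The minimization in this proof can also be seen by brute force: compute the empirical MSE over a grid of candidate slopes and see where it is smallest. This is a sketch on simulated, centered data; the true slope, grid range, and grid spacing are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2022)  # arbitrary seed
n = 20_000
X = rng.normal(size=n)
Y = 1.5 * X + rng.normal(size=n)  # hypothetical truth

# Center both variables, as in the proof, so that b = 0
X = X - X.mean()
Y = Y - Y.mean()

# Closed form: cov[X, Y] / var[X], using sample averages of centered data
a_star = np.mean(X * Y) / np.mean(X ** 2)

# Brute force: empirical MSE of Y - a X over a grid of slopes
a_grid = np.linspace(0, 3, 301)  # spacing 0.01
mse_grid = np.array([np.mean((Y - a_val * X) ** 2) for a_val in a_grid])
a_best = a_grid[np.argmin(mse_grid)]

# The derivative of the MSE with respect to a should vanish at a_star
derivative_at_a_star = -2 * np.mean(X * Y) + 2 * a_star * np.mean(X ** 2)

print("closed form:", a_star)
print("grid search:", a_best)
```

The grid minimizer should land within one grid step of the closed-form value.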