## Week04

Regression is a fundamental tool for studying relationships between random variables.
We will first study Simple Linear Regression because this model can serve as a foundation for most future regression techniques. 

Our goals this week are to:
1. Understand regression as a probabilistic model
2. Estimate the parameters of this model using a Residual Sum of Squares (RSS) strategy
3. Estimate the parameters of this model using a maximum likelihood strategy
4. Discuss geometric properties of these estimators

### Data Setup

Suppose we are given a dataset that contains $N$ pairs of points $D = [(x_{1},y_{1}),(x_{2},y_{2}),(x_{3},y_{3}),\cdots,(x_{N},y_{N})]$. We further assume that each data point $x_{i}$ was generated by a random variable $X_{i}$ and that each data point $y_{i}$ was generated by a random variable $Y_{i}$.

### Probabilistic and model form
**Simple Linear Regression** assumes the following conditional distribution for $Y_{i}$ given $x_{i}$

$$
Y_{i}|x_{i} \sim N(\beta_{0} + x_{i}\beta_{1},\sigma^2)
$$

The conditional mean of $Y_{i}$ is linearly related to $x_{i}$ through two parameters: an intercept $(\beta_{0})$ and a slope $(\beta_{1})$. A third parameter, $\sigma^{2}$, is used to express the variability around the conditional mean. Note that this setup assumes every $Y_{i}$ follows a normal distribution with the same parameter values $\beta_{0}$, $\beta_{1}$, and $\sigma^{2}$. Because $x_{i}$ is not necessarily the same for each $Y_{i}$, the mean of this normal distribution may differ from one observation to the next, while the variance stays the same.

Often, the equation 

$$
Y_{i}|x_{i} \sim N(\beta_{0} + x_{i}\beta_{1},\sigma^2)
$$

is written 

$$
Y_{i}|x_{i} \sim N(\mu(x_{i}),\sigma^2)
$$
where $\mu(x_{i}) = \beta_{0} + x_{i}\beta_{1}$ to emphasize that the Normal distribution is governed by two parameters and that our focus is on $\mu$ as a function of data points we collected.

When we write a regression model in terms of a single probability distribution or, in more complex cases, many probability distributions, it is called **probabilistic form**.
Probabilistic form highlights the distribution of our variable of interest ($Y$).

Another common way to write this relationship is 

\begin{align}
    y_{i}     &= \beta_{0} + x_{i}\beta_{1}+\epsilon_{i}\\
 \epsilon_{i} &\sim N(0,\sigma^{2})
\end{align}

This is called the **model form** of Simple Linear Regression (SLR).
Model form highlights the relationship between $Y$ and $X$, focusing less on the distribution of $Y$.
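
The two forms describe the same data-generating process. As a minimal sketch of this, the code below simulates data both ways. The parameter values, the sample size, the range of the $x$ values, and the random seed are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 1.0, 2.0, 0.5
N = 200

# Treat the x values as fixed, observed data points
x = rng.uniform(0.0, 10.0, size=N)

# Conditional mean mu(x_i) = beta0 + x_i * beta1
mu = beta0 + beta1 * x

# Model form: y_i = beta0 + x_i * beta1 + eps_i, with eps_i ~ N(0, sigma^2)
eps = rng.normal(loc=0.0, scale=sigma, size=N)
y_model_form = mu + eps

# Probabilistic form: Y_i | x_i ~ N(mu(x_i), sigma^2)
y_prob_form = rng.normal(loc=mu, scale=sigma, size=N)

# Both samples scatter around the same line with similar spread
print(np.std(y_model_form - mu), np.std(y_prob_form - mu))
```

The two samples are not identical draw for draw, because they use different random numbers, but they come from the same distribution.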

## Expected value and Variance

### Expected value
If $X \sim \mathcal{N}\left(\mu,\sigma^{2}\right)$ then the expected value of $X$ is 

\begin{align}
    E(X) = \mu
\end{align}

We can apply the above to compute the expected value of $Y|x$.
Because the conditional distribution of $Y_{i}$ given $x_{i}$ is normal with mean $\mu(x_{i}) = \beta_{0} + x_{i}\beta_{1}$, the expected value is

\begin{align}
    E(Y_{i} | x_{i}) = \mu(x_{i}) = \beta_{0} + x_{i}\beta_{1}
\end{align}


### Variance
If $X \sim \mathcal{N}\left(\mu,\sigma^{2}\right)$ then the variance of $X$ is 

\begin{align}
    Var(X) = \sigma^{2}
\end{align}

We can apply the above to compute the variance of $Y|x$.
Because the conditional distribution of $Y_{i}$ given $x_{i}$ is normal with mean $\mu(x_{i}) = \beta_{0} + x_{i}\beta_{1}$ and variance $\sigma^{2}$, the variance is

\begin{align}
    Var(Y_{i}|x_{i}) = \sigma^{2}
\end{align}
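
The expected value and variance above can be checked by simulation. The sketch below fixes a single data point $x_{i}$, draws many values of $Y_{i}|x_{i}$, and compares the sample mean and variance with $\beta_{0} + x_{i}\beta_{1}$ and $\sigma^{2}$. The parameter values, the value of $x_{i}$, and the number of draws are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility only

# Hypothetical parameter values and one fixed data point
beta0, beta1, sigma = 1.0, 2.0, 0.5
x_i = 3.0

# Many draws of Y_i | x_i ~ N(beta0 + x_i * beta1, sigma^2)
draws = rng.normal(loc=beta0 + beta1 * x_i, scale=sigma, size=100_000)

theoretical_mean = beta0 + beta1 * x_i  # E(Y_i | x_i)
theoretical_var = sigma**2              # Var(Y_i | x_i)

sample_mean = draws.mean()
sample_var = draws.var()

print("mean:    ", sample_mean, "vs", theoretical_mean)
print("variance:", sample_var, "vs", theoretical_var)
```

Changing `x_i` shifts the sample mean but leaves the sample variance close to $\sigma^{2}$, which matches the constant-variance assumption of the model.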


### Maximum likelihood

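To compute the likelihood function, we will assume that, given the $x$ values, the random variables $Y_{i}$ and $Y_{j}$ are independent for any $i \neq j$. This means that $p(Y_{j}|Y_{i}) = p(Y_{j})$, and so the joint density of the data is the product of the individual densities.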


The likelihood is 

\begin{align}
    \mathcal{L} &= \prod_{i=1}^{N} f(y_{i}|x_{i},\beta_{0},\beta_{1},\sigma^{2}) \\
    &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp \left\{- \frac{\left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2}}{2\sigma^{2}} \right\}\\
    \mathcal{L}(\beta_{0},\beta_{1},\sigma^{2})&= \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^{N} \exp \left\{- \sum_{i=1}^{N}\frac{\left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2}}{2\sigma^{2}} \right\}
\end{align}

and the loglikelihood is 

\begin{align}
   \ell \ell =  \log \left[\mathcal{L}(\beta_{0},\beta_{1},\sigma^{2}) \right]&= \log \left [ \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^{N} \exp \left\{- \sum_{i=1}^{N}\frac{\left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2}}{2\sigma^{2}} \right\} \right ]\\ 
   &= N\log\left(\frac{1}{\sqrt{2\pi}\sigma}\right) - \sum_{i=1}^{N}\frac{\left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2}}{2\sigma^{2}} \\
  \ell \ell(\beta_{0},\beta_{1},\sigma^{2})  &= -N\log\left(\sqrt{2\pi}\sigma\right) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2}
\end{align}

The above loglikelihood, when maximized over $\beta_{0}$, $\beta_{1}$, and $\sigma^{2}$, returns the maximum likelihood estimates of these parameters: $\hat \beta_{0}$, $\hat \beta_{1}$, and $\hat \sigma^{2}$.


With the above maximum likelihood estimates we can find the maximum likelihood estimate of the expected value of $Y_{i}$. In other words, we can compute the MLE of $E(Y_{i}|x_{i})$ as $\hat E(Y_{i}|x_{i}) = \hat \beta_{0} + x_{i}\hat \beta_{1}$.
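
As a sketch of how the maximization can be carried out, one standard approach is to set the partial derivatives of the loglikelihood to zero. The symbols $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_{i}$ and $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_{i}$ are introduced here for convenience, and we assume that the $x_{i}$ are not all equal.

\begin{align}
    \frac{\partial \ell \ell}{\partial \beta_{0}} &= \frac{1}{\sigma^{2}} \sum_{i=1}^{N} \left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right) = 0 \\
    \frac{\partial \ell \ell}{\partial \beta_{1}} &= \frac{1}{\sigma^{2}} \sum_{i=1}^{N} x_{i}\left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right) = 0 \\
    \frac{\partial \ell \ell}{\partial \sigma^{2}} &= -\frac{N}{2\sigma^{2}} + \frac{1}{2\sigma^{4}} \sum_{i=1}^{N} \left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2} = 0
\end{align}

Solving these three equations gives

\begin{align}
    \hat \beta_{1} &= \frac{\sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{N} (x_{i} - \bar{x})^{2}} \\
    \hat \beta_{0} &= \bar{y} - \hat \beta_{1}\bar{x} \\
    \hat \sigma^{2} &= \frac{1}{N} \sum_{i=1}^{N} \left( y_{i} - [\hat \beta_{0} + \hat \beta_{1}x_{i}] \right)^{2}
\end{align}

Note that $\sigma^{2}$ cancels from the first two equations, so $\hat \beta_{0}$ and $\hat \beta_{1}$ do not depend on the value of $\sigma^{2}$.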

### Residual Sum of Squares

In BSTA001 we talked about one intuitive way to find "optimal" $\beta_{0}$ and $\beta_{1}$ parameters.
That setup went as follows:

We would like to find parameters $\beta_{0}$ (the intercept) and $\beta_{1}$ (the slope) so that they are, in some sense, optimal. 
There are many different ways to define optimal. 
The most common method to define an optimal $\beta_{0}$ and $\beta_{1}$ for linear regression is least squares.

Given $N$ pairs $(x_{i},y_{i})$, least squares scores any candidate pair $(\beta_{0},\beta_{1})$ with the residual sum of squares

\begin{align}
    L(\beta_{0},\beta_{1}) =  \sum_{i=1}^{N} \left( y_{i} - \left[ \beta_{0} + \beta_{1}x_{i} \right]  \right)^{2}
\end{align}

We want to find $\beta_{0}$ and $\beta_{1}$ so that the sum of squared **vertical** distances between the points $(x_{i},y_{i})$ and our line is minimized.

That is, we wish to find the pair ($\beta_{0}^{*}$, $\beta_{1}^{*}$) such that 

\begin{align}
    L(\beta_{0}^{*},\beta_{1}^{*}) \le L(\beta_{0},\beta_{1})
\end{align}

for all pairs $(\beta_{0},\beta_{1})$. 
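
The sketch below computes a least squares solution for a small, made-up dataset. It uses the standard closed-form solution, obtained by setting the partial derivatives of $L$ to zero, and cross-checks it against NumPy's degree-1 polynomial fit. The data values and the helper name `rss` are hypothetical.

```python
import numpy as np

# Small made-up dataset, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def rss(beta0, beta1, x, y):
    """Residual sum of squares L(beta0, beta1)."""
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals**2)

# Closed-form least squares solution (requires the x values to not all be equal)
x_bar, y_bar = x.mean(), y.mean()
beta1_star = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_star = y_bar - beta1_star * x_bar

# Cross-check: np.polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(x, y, deg=1)

print("closed form:", beta0_star, beta1_star)
print("np.polyfit: ", intercept, slope)

# Any other pair should give a residual sum of squares at least as large
print(rss(beta0_star, beta1_star, x, y) <= rss(beta0_star + 0.1, beta1_star, x, y))
```

Perturbing either parameter away from the solution increases `rss`, which is the inequality $L(\beta_{0}^{*},\beta_{1}^{*}) \le L(\beta_{0},\beta_{1})$ checked at one alternative pair.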

Recall our equation for the loglikelihood 

\begin{align}
    \ell \ell(\beta_{0},\beta_{1},\sigma^{2})  &= -N\log\left(\sqrt{2\pi}\sigma\right) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \left( y_{i} - [\beta_{0} + \beta_{1}x_{i}] \right)^{2}
\end{align}

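We see the same term $L(\beta_{0},\beta_{1})$ appear in the loglikelihood.
However, in the loglikelihood that term is multiplied by the negative factor $-\frac{1}{2\sigma^{2}}$, and the remaining term, $-N\log\left(\sqrt{2\pi}\sigma\right)$, does not depend on $\beta_{0}$ or $\beta_{1}$.
So for any fixed $\sigma^{2} > 0$, maximizing the loglikelihood over $\beta_{0}$ and $\beta_{1}$ is the same as maximizing $-L(\beta_{0},\beta_{1})$. To maximize $-L(\beta_{0},\beta_{1})$ is equivalent to minimizing $L(\beta_{0},\beta_{1})$, and so the least squares solution and the maximum likelihood solution for $\beta_{0}$ and $\beta_{1}$ are the same.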
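
This equivalence can be checked numerically. The sketch below evaluates both the residual sum of squares and the loglikelihood on a grid of candidate $(\beta_{0},\beta_{1})$ pairs and confirms that the grid point that minimizes one is the grid point that maximizes the other, for two different fixed values of $\sigma^{2}$. The simulated data, the grid ranges, and the function names `rss` and `loglik` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, for reproducibility only

# Simulated data with hypothetical true values beta0 = 1, beta1 = 2, sigma = 0.5
x = rng.uniform(0.0, 10.0, size=50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=50)

def rss(beta0, beta1):
    """Residual sum of squares L(beta0, beta1)."""
    return np.sum((y - (beta0 + beta1 * x)) ** 2)

def loglik(beta0, beta1, sigma2):
    """Loglikelihood of the simple linear regression model."""
    n = len(y)
    return -n * np.log(np.sqrt(2 * np.pi * sigma2)) - rss(beta0, beta1) / (2 * sigma2)

# Grid of candidate parameter pairs
b0_grid = np.linspace(0.0, 2.0, 81)
b1_grid = np.linspace(1.5, 2.5, 81)

rss_grid = np.array([[rss(b0, b1) for b1 in b1_grid] for b0 in b0_grid])
idx_min_rss = np.unravel_index(np.argmin(rss_grid), rss_grid.shape)

# The maximizer of the loglikelihood should not depend on the fixed sigma^2
idx_max_ll = []
for sigma2 in (0.25, 4.0):
    ll_grid = np.array([[loglik(b0, b1, sigma2) for b1 in b1_grid] for b0 in b0_grid])
    idx_max_ll.append(np.unravel_index(np.argmax(ll_grid), ll_grid.shape))

print("argmin RSS:   ", b0_grid[idx_min_rss[0]], b1_grid[idx_min_rss[1]])
print("same location:", all(idx == idx_min_rss for idx in idx_max_ll))
```

A grid search is used here only to keep the check transparent; it locates the optimum to the resolution of the grid rather than exactly.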