# Simple Linear Regression



## Objectives



## Model Assumptions

Linear regression models make the following assumptions about the data and the process that generated them.

[Here](https://www.statisticssolutions.com/assumptions-of-linear-regression/) is a helpful resource on the assumptions of linear regression.

### Linearity

**The relationship between the target and predictor is linear.** Check this by drawing a scatter plot of your predictor and your target, and see if there is evidence that the relationship might not follow a straight line.

### Independence

**The errors are independent**. In other words: Knowing the error for one point doesn't tell you anything about the error for another.

### Normality

**The errors are normally distributed.** That is, smaller errors are more probable than larger errors, according to the familiar bell curve.

**How can I check for this?**
- Make a histogram of the residuals
- Build a QQ-Plot
- Check the Jarque-Bera or Omnibus p-value (from `statsmodels` output)
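
As a rough illustration of what the Jarque-Bera check computes, here is a minimal NumPy sketch. It assumes the usual form of the statistic, $JB = \frac{n}{6}\left(S^2 + \frac{(K - 3)^2}{4}\right)$ with sample skewness $S$ and kurtosis $K$, and uses the large-sample $\chi^2$ distribution with 2 degrees of freedom for the p-value. The function name and the simulated residuals are hypothetical, chosen only for this example.

```python
import numpy as np

def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    centered = e - e.mean()
    s2 = np.mean(centered ** 2)                 # population variance
    skew = np.mean(centered ** 3) / s2 ** 1.5   # sample skewness
    kurt = np.mean(centered ** 4) / s2 ** 2     # sample kurtosis (normal: 3)
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # For a chi-square distribution with 2 degrees of freedom,
    # P(X > jb) = exp(-jb / 2).
    p_value = np.exp(-jb / 2.0)
    return jb, p_value

# Simulated "residuals" (assumptions for illustration only).
rng = np.random.default_rng(0)
normal_resid = rng.normal(size=2000)        # bell-shaped
skewed_resid = rng.exponential(size=2000)   # strongly right-skewed

jb_normal, p_normal = jarque_bera(normal_resid)
jb_skewed, p_skewed = jarque_bera(skewed_resid)
print(f"normal errors: JB = {jb_normal:.2f}, p = {p_normal:.3f}")
print(f"skewed errors: JB = {jb_skewed:.2f}, p = {p_skewed:.3g}")
```

A small p-value is evidence *against* normality, so the skewed residuals should produce a far larger statistic and a p-value near zero.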

### Homoskedasticity

**The errors are homoskedastic.** That is, the errors have the same variance. 

(The Greek word $\sigma\kappa\epsilon\delta\acute{\alpha}\nu\nu\upsilon\mu\iota$ means "to scatter".)


**How can I check for this?**
- Make a scatter plot of the residuals against the predicted values (or the predictor) and look to see if they are more or less spread out at different places
- Conduct a formal test (e.g. Goldfeld-Quandt)
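
The idea behind a Goldfeld-Quandt style check can be sketched in a few lines of NumPy: sort the data by the predictor, fit a separate line to the lower and upper halves, and compare the residual variances. This is a simplified illustration under assumed, simulated data. It leaves out the p-value (which comes from an F distribution) and the optional step of dropping middle observations, and the function name is hypothetical.

```python
import numpy as np

def residual_variance_ratio(x, y):
    """Ratio of residual variances (upper half / lower half), ordered by x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    half = x.size // 2

    def residual_variance(xs, ys):
        m, b = np.polyfit(xs, ys, deg=1)     # slope, intercept
        resid = ys - (m * xs + b)
        return np.sum(resid ** 2) / (xs.size - 2)

    return residual_variance(x[half:], y[half:]) / residual_variance(x[:half], y[:half])

# Simulated data (assumption for illustration only).
rng = np.random.default_rng(42)
x = np.linspace(1, 10, 200)
y_homo = 3 + 2 * x + rng.normal(scale=1.0, size=x.size)        # constant spread
y_hetero = 3 + 2 * x + rng.normal(scale=0.5 * x, size=x.size)  # spread grows with x

ratio_homo = residual_variance_ratio(x, y_homo)
ratio_hetero = residual_variance_ratio(x, y_hetero)
print(f"homoskedastic ratio:   {ratio_homo:.2f}")
print(f"heteroskedastic ratio: {ratio_hetero:.2f}")
```

A ratio near 1 is consistent with homoskedastic errors; a ratio far from 1 suggests the spread of the errors changes with the predictor.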

### Violation of the Assumptions

No model is perfect, and your assumptions will never hold perfectly. If the violations of assumptions are severe, you can try adjusting the data so the assumptions will hold, such as by... 

- Transforming your data with a non-linear function (e.g. log)
- Only modeling a subset of your data
- Dropping outliers

These can make it harder to explain or interpret your model, but the trade-off may be worth it. Alternatively, you may be better off just using a different type of model (you will learn many).
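
To illustrate the first option, here is a small sketch of how a log transform can restore linearity. The data are simulated under an assumed exponential relationship with multiplicative noise, and the helper name is hypothetical.

```python
import numpy as np

def r_squared_of_line(x, y):
    """R^2 of the least-squares line fit of y on x."""
    m, b = np.polyfit(x, y, deg=1)
    resid = y - (m * x + b)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Simulated data: y grows exponentially in x (assumption for illustration).
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 100)
y = np.exp(0.8 * x + rng.normal(scale=0.2, size=x.size))

r2_raw = r_squared_of_line(x, y)          # straight line through a curve
r2_log = r_squared_of_line(x, np.log(y))  # log(y) is linear in x

print(f"R^2 on raw target: {r2_raw:.3f}")
print(f"R^2 on log target: {r2_log:.3f}")
```

The cost is interpretability: after the transform, the slope describes the change in $\log(y)$ per unit of $x$, not the change in $y$ itself.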

## Mathematical Proof

### Covariance
For two variables $X$ and $Y$, each with $n$ values:

$$\sigma_{XY} = \frac{\Sigma^n_{i = 1}(x_i - \mu_x)(y_i - \mu_y)}{n}$$

Note that the value of the covariance depends on the units and scale of X and Y, which can make interpretation difficult. What is wanted is a _standardized_ scale for covariance, hence: _correlation_.
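
A minimal NumPy sketch of the covariance formula, using the same $\frac{1}{n}$ normalization, shows this scale dependence. The height and weight values are made-up toy data and the function name is hypothetical.

```python
import numpy as np

def population_covariance(x, y):
    """Covariance with 1/n normalization: mean of the products of deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

# Made-up toy data: heights in meters, weights in kilograms.
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 66.0, 75.0, 84.0])

cov_m = population_covariance(height_m, weight_kg)
# The same heights expressed in centimeters: the covariance grows 100-fold,
# even though the relationship between the variables has not changed.
cov_cm = population_covariance(height_m * 100, weight_kg)

print(cov_m, cov_cm)
```

Note that `np.cov` divides by $n - 1$ by default; passing `bias=True` gives the $\frac{1}{n}$ version used here.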

### Correlation

Pearson Correlation:

$$r_P = \frac{\Sigma^n_{i = 1}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\Sigma^n_{i = 1}(x_i - \mu_x)^2\Sigma^n_{i = 1}(y_i -\mu_y)^2}}$$

Note that we are simply standardizing the covariance by the standard deviations of X and Y (the $n$'s cancel!).

We'll always have $-1 \leq r \leq 1$. (This was the point of standardizing by the standard deviations of X and Y.)

A correlation of -1 means that X and Y are perfectly negatively correlated, and a correlation of 1 means that X and Y are perfectly positively correlated.
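
The Pearson formula can be sketched directly in NumPy. The toy data below are made up for illustration and the function name is hypothetical; the example checks that rescaling a variable leaves $r$ unchanged and that an exactly linear, decreasing relationship gives $r = -1$.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from sums of deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Made-up toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = pearson_r(x, y)
r_rescaled = pearson_r(1000 * x, y)        # changing units leaves r unchanged
r_perfect_neg = pearson_r(x, -3 * x + 7)   # exactly linear, negative slope

print(r, r_rescaled, r_perfect_neg)
```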


### Model Specification

The solution for a simple regression best-fit line is as follows:

$$Y = b + m(X) + \epsilon$$

- slope: 

$$m = r_P\frac{\sigma_y}{\sigma_x} = \frac{cov(X, Y)}{var(X)}$$

- y-intercept:

$$b = \mu_y - m\mu_x$$
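
As a sanity check, the closed-form slope and intercept can be computed with NumPy and compared against `np.polyfit`. This is a sketch on made-up toy data, and the function name is hypothetical.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Slope and intercept from m = r * sigma_y / sigma_x, b = mu_y - m * mu_x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    m = r * y.std() / x.std()      # np.std uses the 1/n (population) form by default
    b = y.mean() - m * x.mean()
    return m, b

# Made-up toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

m, b = fit_simple_ols(x, y)

# Equivalent form of the slope: cov(X, Y) / var(X).
m_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Reference fit: np.polyfit returns coefficients from highest degree down.
m_ref, b_ref = np.polyfit(x, y, deg=1)

print(m, b)
```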

### Proof of Minimizing Error ($\epsilon$)

We demonstrate this by setting the derivative of the loss function, $\Sigma^n_{i=1}(y_i - (mx_i + b))^2$, equal to 0. **We shall see this calculus trick many times!**

For this purpose we consider the loss a function of its optimizing parameters $m$ and $b$. There are therefore two partial derivatives to consider. (We'll cover this in more depth later in the course.)

$$\frac{\partial}{\partial b}[\sum^n_{i=1}(y_i - mx_i - b)^2] = -2\sum^n_{i=1}(y_i - mx_i - b)$$

$$\frac{\partial}{\partial m}[\sum^n_{i=1}(y_i - mx_i - b)^2] = -2\sum^n_{i=1}x_i(y_i - mx_i - b)$$

- Let's set the first to 0:

$$-2\sum^n_{i=1}(y_i - mx_i - b) = 0$$

$$\sum^n_{i=1}(y_i - mx_i) = \sum^n_{i=1}b = nb$$

**So:** 

$$b = \frac{\sum^n_{i=1}(y_i - mx_i)}{n} = \mu_y - m\mu_x$$

- Let's set the second to 0:

$$-2\sum^n_{i=1}x_i(y_i - mx_i - b) = 0$$

$$\sum^n_{i=1}(x_iy_i - mx^2_i - bx_i) = 0$$

- Plugging in our previous result, we have:

$$\sum^n_{i=1}x_iy_i - (\frac{1}{n}\sum^n_{i=1}y_i - \frac{m}{n}\sum^n_{i=1}x_i)\sum^n_{i=1}x_i - m\sum^n_{i=1}x^2_i = 0$$

$$\sum^n_{i=1}x_iy_i - \frac{1}{n}\sum^n_{i=1}x_i\sum^n_{i=1}y_i + \frac{m}{n}(\sum^n_{i=1}x_i)^2 - m\sum^n_{i=1}x^2_i = 0$$

**So:** 

$$m = \frac{\sum^n_{i=1}x_iy_i - \frac{1}{n}\sum^n_{i=1}x_i\sum^n_{i=1}y_i}{\sum^n_{i=1}x^2_i - \frac{1}{n}(\sum^n_{i=1}x_i)^2} = \frac{n\times(\frac{1}{n}\sum^n_{i=1}x_iy_i - \frac{1}{n^2}\sum^n_{i=1}x_i\sum^n_{i=1}y_i)}{n\times(\frac{1}{n}\sum^n_{i=1}x^2_i - \mu^2_x)} = \frac{cov(X, Y)}{var(X)}$$

For more on the proof see [here](https://math.stackexchange.com/questions/716826/derivation-of-simple-linear-regression-parameters).
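
The result can also be checked numerically: at the closed-form $m$ and $b$, both partial derivatives of the sum of squared errors should vanish (up to floating-point error). The sketch below uses made-up toy data and a hypothetical helper name.

```python
import numpy as np

def sse_gradient(m, b, x, y):
    """Partial derivatives of sum((y - m*x - b)**2) with respect to m and b."""
    residuals = y - m * x - b
    d_m = -2.0 * np.sum(x * residuals)
    d_b = -2.0 * np.sum(residuals)
    return d_m, d_b

# Made-up toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form solution: m = cov(X, Y) / var(X), b = mu_y - m * mu_x.
m = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
b = y.mean() - m * x.mean()

grad_at_solution = sse_gradient(m, b, x, y)      # both entries approximately 0
grad_elsewhere = sse_gradient(m + 0.5, b, x, y)  # clearly nonzero off the optimum

print(grad_at_solution)
print(grad_elsewhere)
```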

## Coefficient Interpretation



## Model Evaluation 

### Coefficient of Determination

Very often a data scientist will calculate $R^2$, the *coefficient of determination*, as a measure of how well the model fits the data.

$R^2$ for a model is ultimately a _relational_ notion. It's a measure of goodness of fit _relative_ to a (bad) baseline model. This bad baseline model is simply the horizontal line $y = \mu_Y$, for dependent variable $Y$.

The actual calculation of $R^2$ is: 

$$R^2\equiv 1-\frac{\Sigma_i(y_i - \hat{y}_i)^2}{\Sigma_i(y_i - \bar{y})^2}$$

$R^2$ is a measure of how much variation in the dependent variable your model explains.
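
A short NumPy sketch makes the "relative to a baseline" reading concrete: predicting the mean of $Y$ for every point gives $R^2 = 0$, and the fitted line is scored against that. The toy data are made up and the function name is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.3, 5.6, 8.4, 9.7, 12.5])

m, b = np.polyfit(x, y, deg=1)
r2_model = r_squared(y, m * x + b)
r2_baseline = r_squared(y, np.full_like(y, y.mean()))  # horizontal line at the mean

print(r2_model, r2_baseline)
```

For simple linear regression with an intercept, $R^2$ also equals the square of the Pearson correlation between $X$ and $Y$.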

### Other Regression Statistics

- **F-statistic**: The F-test measures the significance of your model relative to an intercept-only model in which all slope coefficients are 0, i.e. relative to a model that says there is no linear relationship whatever between the predictors and the target.
- **Log-Likelihood**: This is the log of the likelihood, i.e. the probability (density) of seeing these data points, *given* the model parameter values. The higher this is, the more our data conform to our model and so the better our fit. AIC and BIC are related to the log-likelihood; we'll talk about those later.
- **coef**: These are the betas as calculated by the least-squares regression. We also have p-values and 95%-confidence intervals.
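- **Omnibus**: This is a test for error normality. The reported probability is a p-value: the chance of seeing a test statistic at least this extreme if the errors really were normally distributed. A small value is evidence against normality.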
- **Durbin-Watson**: This is a test for autocorrelation. We'll return to this topic in a future lecture. 
- **Jarque-Bera**: This is another test for error normality.
- **Cond. No.**: The condition number is a diagnostic for (near-)linear dependence among the predictors. Lower scores are better. When the predictors are strongly linearly related, we can run into problems of multicollinearity. For more on the condition number, see [here](https://stats.stackexchange.com/questions/168259/how-do-you-interpret-the-condition-number-of-a-correlation-matrix).
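
Several of these statistics can be reproduced by hand, which helps demystify the summary table. The sketch below uses NumPy on simulated data (an assumption for illustration) and the standard textbook formulas; it is not taken from any library's source code, so small differences from a given package's reported values are possible.

```python
import numpy as np

# Simulated data (assumption for illustration only).
rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=n)

# Design matrix with an intercept column, then the least-squares fit.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta = [intercept, slope]
resid = y - X @ beta

ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
k = 1  # number of predictors, not counting the intercept

# F-statistic: explained variation per predictor over unexplained variation
# per residual degree of freedom (comparison against the intercept-only model).
f_stat = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))

# Gaussian log-likelihood evaluated at the fitted coefficients, with the
# error variance estimated as ss_res / n.
sigma2 = ss_res / n
log_likelihood = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

# Durbin-Watson: values near 2 suggest little first-order autocorrelation.
durbin_watson = np.sum(np.diff(resid) ** 2) / ss_res

# Condition number: largest over smallest singular value of the design matrix.
cond_no = np.linalg.cond(X)

print(f"coef = {beta}, F = {f_stat:.1f}, logL = {log_likelihood:.1f}")
print(f"Durbin-Watson = {durbin_watson:.2f}, Cond. No. = {cond_no:.1f}")
```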