# Linear Regression

## Linear regression model definitions

- Population: $ E(Y) = \beta_0 + \beta_1 X $

- Model: $ Y = \beta_0 + \beta_1 X + \epsilon $, where the error $\epsilon = Y - E(Y)$ is a random variable (RV) with $\epsilon \sim N(0, \sigma^2)$

![Linear Regression assumptions](linear_regression.png)

*Figure 1: The statistical assumptions behind linear regression.  The model assumes that the error is normally distributed with constant variance.*
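To make the model definition concrete, here is a minimal simulation sketch. The parameter values (`beta_0`, `beta_1`, `sigma`) and the range of `x` are assumptions chosen only for illustration; they do not come from any dataset in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration.
beta_0, beta_1, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=5000)

# Error term: mean 0, constant standard deviation sigma.
# Note that NumPy's `scale` is the standard deviation, not the variance.
epsilon = rng.normal(loc=0.0, scale=sigma, size=x.size)

# Model: Y = beta_0 + beta_1 * X + epsilon
y = beta_0 + beta_1 * x + epsilon

# Population line: E(Y) = beta_0 + beta_1 * X
expected_y = beta_0 + beta_1 * x

# The error is the gap between an observation and the population line.
residual = y - expected_y

# Constant variance: the spread should look similar for small and large x.
spread_low_x = residual[x < 5].std()
spread_high_x = residual[x >= 5].std()

print(f"mean error: {residual.mean():.3f}")
print(f"spread for x < 5: {spread_low_x:.3f}, for x >= 5: {spread_high_x:.3f}")
```

The mean error should be close to 0 and both spreads close to `sigma`, which is what Figure 1 depicts.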

## Creating statistical (frequentist) models

![Pearson correlation table](pearson.png)

*Figure 2: Pearson correlation coefficient values and strengths of correlation.*

1. Compute the Pearson correlation coefficient $R$ to check for a linear relationship: `scipy.stats.pearsonr(ind_var, dep_var)`.
2. Define a cost function (e.g. sum of squared errors).
3. Run the fit (e.g. least squares) to minimize the cost function and generate $\hat{Y} = b_0 + b_1 X$: `statsmodels.formula.api.ols(formula, data).fit()`.
4. Plot the residuals to judge goodness of fit (a good fit shows no patterns in the residuals).
5. Check the coefficient of determination (R-squared) and the p-values to confirm that the model, rather than the error variance, accounts for most of the variation in the data. This is why ANOVA (`statsmodels.api.stats.anova_lm(model, typ=2)`, where `model` is the fitted result) is usually run together with the linear regression fit.
6. Use `prediction.summary_frame()`, where `prediction` is the result of calling `get_prediction(...)` on the fitted model, to get intervals. Prediction intervals (`obs_ci_*`) quantify the uncertainty of an individual value at a given $x$. Confidence intervals (`mean_ci_*`) quantify the uncertainty of the population mean at a given $x$.

Try this below to see if floor area can be used to model the home price in `homes.csv`:

Try linear regression on heights and GPAs in `gpa.csv`:

## Multiple regression

The workflow barely changes when adding predictors: the model simply gains one coefficient per additional predictor.  Let's try this on `Cars.csv`:

If two predictors appear related (for example, Pearson shows that they are correlated) and we suspect that the effect of one depends on the value of the other, we can include an interaction term (the product of the two predictors).  Let's explore `internetusage.csv`:
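Here is a sketch of an interaction term in a statsmodels formula, where `a:b` denotes the product of `a` and `b`. The columns of `internetusage.csv` are not shown in these notes, so the data is synthetic and the names `income`, `education`, and `hours_online` are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for internetusage.csv (hypothetical columns and values).
rng = np.random.default_rng(7)
n = 500
income = rng.normal(0, 1, size=n)
education = 0.6 * income + 0.8 * rng.normal(0, 1, size=n)  # correlated predictors
hours_online = (1.0 + 0.5 * income + 0.3 * education
                + 0.5 * income * education + rng.normal(0, 1, size=n))
usage = pd.DataFrame({"income": income, "education": education,
                      "hours_online": hours_online})

# Additive model: no interaction term.
additive = smf.ols("hours_online ~ income + education", data=usage).fit()

# Model with an interaction term; `income:education` is the product term.
with_interaction = smf.ols(
    "hours_online ~ income + education + income:education", data=usage
).fit()

# Equivalent model with the product computed by hand.
usage["income_x_education"] = usage["income"] * usage["education"]
manual = smf.ols(
    "hours_online ~ income + education + income_x_education", data=usage
).fit()

print(with_interaction.params)
print(f"R-squared without interaction: {additive.rsquared:.3f}")
print(f"R-squared with interaction:    {with_interaction.rsquared:.3f}")
```

The shorthand `income * education` in a formula expands to the same three terms (`income + education + income:education`).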

## Classification via regression

You can even create classification models via regression, by assigning numerical labels to categorical variables!

Pandas provides `pd.get_dummies` to facilitate this!  Try it on `Heating.csv`:
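Here is one possible sketch of the idea. The columns of `Heating.csv` are not shown in these notes, so the data is synthetic, and the names `income` and `system` (with categories `"gas"` and `"electric"`) are assumptions. The sketch assumes the categorical column is the class to predict: it encodes that column as 0/1 indicators, regresses one indicator on a numeric predictor, and classifies by thresholding the fitted value at 0.5.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for Heating.csv (hypothetical columns and values).
rng = np.random.default_rng(3)
n = 400
income = rng.uniform(2, 7, size=n)
system = np.where(income + rng.normal(0, 1, size=n) > 4.5, "gas", "electric")
heating = pd.DataFrame({"income": income, "system": system})

# One 0/1 indicator column per category: system_electric, system_gas.
# dtype=int gives integers rather than booleans.
encoded = pd.get_dummies(heating, columns=["system"], dtype=int)
print(encoded.head())

# Regress the 0/1 indicator on the numeric predictor.
model = smf.ols("system_gas ~ income", data=encoded).fit()

# Classify: a fitted value of at least 0.5 is labelled "gas".
predicted_label = np.where(model.fittedvalues >= 0.5, "gas", "electric")
accuracy = (predicted_label == heating["system"]).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Fitted values from a linear model can fall outside the range 0 to 1, so they are scores to threshold rather than true probabilities.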