# Multiple and Polynomial Regression

[Resource](https://harvard-iacs.github.io/2018-CS109A/labs/lab-4/solutions/)

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn import preprocessing
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

from pandas.plotting import scatter_matrix

import seaborn as sns

%matplotlib inline
```

# Learning Goals

* Implement arbitrary multiple regression models in both sklearn and statsmodels.
* Interpret the coefficient estimates produced by each model, including transformed and dummy variables.

`statsmodels` is focused on the **inference task**: estimate good values for the betas and quantify how certain you are about those estimates.

`sklearn` is focused on the **prediction task**: Given new data, guess what the response value is. As a result, `statsmodels` has lots of tools to discuss confidence, but isn't great at dealing with test sets. `sklearn` is great at test sets and validations, but can't really discuss uncertainty in the parameters or predictions. In short:
* `sklearn` is about putting a line through the data and predicting new values using that line. If the line gives good predictions on the test set, who cares about anything else?
* `statsmodels` assumes more about how the data were generated, and (if the assumptions are correct) can tell you about uncertainty in the results.

## Some terms
* **R-squared**: An interpretable summary of how well the model did. 1 is perfect, 0 matches a trivial baseline model that always predicts the mean response, and negative is worse than that trivial model.
* **F-statistic**: A value testing whether we're likely to see these results (or even stronger ones) if none of the predictors actually mattered.
* **Prob (F-statistic)**: The probability that we'd see these results (or even stronger ones) if none of the predictors actually mattered. If this probability is small, then either A) some combination of predictors actually matters or B) something rather unlikely has happened.
* **coef**: The estimate of each beta. Each estimate is reported alongside several related quantities:
    * **std err**: The amount we'd expect this value to wiggle if we re-did the data collection and re-ran our model. More data tends to make this wiggle smaller, but sometimes the collected data just isn't enough to pin down a particular value.
    * **t and P>|t|**: Similar to the F-statistic, but for a single coefficient. `t` is the coefficient divided by its standard error, and `P>|t|` is the probability of seeing a coefficient this far from zero (or even farther) if the given variable didn't actually matter. A small probability doesn't necessarily mean the variable matters.
    * **[0.025 0.975]**: Endpoints of the 95% confidence interval. The interval is built by a procedure that, if the model's assumptions hold, captures the true beta in 95% of repeated data collections, so it gives us an idea of where the true beta value might plausibly live.