# Additional Linear Modeling Schemes

Previously, we relied primarily on Ordinary Least Squares (OLS). OLS is a useful starting point because its diagnostics can tell us whether the noise in the data (the residuals of the "smooth" model) is normally distributed.

Once we have a regression model, we can run a few diagnostics and tests, many of which are already run by `statsmodels` as part of the `.summary()` function.  There are a few more diagnostics available as well, see: https://www.statsmodels.org/stable/examples/notebooks/generated/regression_diagnostics.html

However, sometimes our residuals are not normal, or their variance is not constant across the data (heteroscedasticity).  See: https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity

Sometimes, there is a correlation between the residuals, which happens quite often in time-series data. In this case, we may want to use Generalized Least Squares.

Other times, we might have too much data to throw out outliers by hand, and thus need to create a robust regression model using either Robust Least Squares or RANSAC.  RANSAC is actually used quite often in computer vision!

We'll explore all of these today using `../datasets_as/fat.csv`!

## `statsmodels` design matrices API

First, a little bit more about why we have `endog` and `exog`, and why we had the "machine learning" API.

For reference: https://www.statsmodels.org/stable/gettingstarted.html#design-matrices-endog-exog

Conveniently, this will help us as we explore our other regression modeling library, `sklearn` (scikit-learn)!

Set up the X (exog) and y (endog) matrices, then double-check that they work with OLS:

## Introducing `sklearn`

scikit-learn is a library full of machine learning schemes!  We don't cover the workings of the algorithms per se here, but we will walk through what each scheme aims to minimize.

See: https://scikit-learn.org/stable/modules/linear_model.html

Then, perform OLS using `sklearn` and see how it compares to OLS from `statsmodels`:

### Ridge, Lasso, and Elastic Net regression

Try out these techniques now:
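Here is a sketch comparing the three penalized fits against plain OLS on synthetic data in which only two of eight predictors matter. The `alpha` and `l1_ratio` values are arbitrary choices for illustration, not tuned recommendations; the objective functions in the comments follow the scikit-learn documentation linked above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.normal(size=(n, p))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=n)

models = {
    # minimizes ||y - Xw||^2
    "OLS": LinearRegression(),
    # minimizes ||y - Xw||^2 + alpha * ||w||_2^2
    "Ridge": Ridge(alpha=10.0),
    # minimizes (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1
    "Lasso": Lasso(alpha=0.3),
    # mixes both penalties; l1_ratio sets the share of the L1 term
    "ElasticNet": ElasticNet(alpha=0.3, l1_ratio=0.5),
}

coefs = {}
for name, model in models.items():
    # Penalties depend on the scale of each predictor, so standardize first
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X, y)
    coefs[name] = pipe[-1].coef_
    print(f"{name:>10}: {np.round(coefs[name], 2)}")
```

Ridge shrinks every coefficient toward zero without eliminating any, whereas the L1 term in Lasso and Elastic Net can set small coefficients exactly to zero, which acts as a form of variable selection.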

### RANSAC Robust Regression

Sometimes, we don't really know about the outliers other than that they exist.  So we can try using RANSAC: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ransac.html

Let's try using RANSAC on the GECAD wind turbine dataset:
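The wind turbine data is not reproduced here, so the following sketch uses synthetic data instead: a straight line plus a cluster of points that do not follow it. The `residual_threshold` of 2.0 is an assumed value chosen to fit this synthetic noise level; on real data it would need to be picked with care.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(5)
n, n_outliers = 200, 40

x = rng.uniform(0, 10, n)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=n)

# Replace the first 40 points with a cluster unrelated to the line
x[:n_outliers] = rng.uniform(0, 3, n_outliers)
y[:n_outliers] = rng.uniform(40, 60, n_outliers)
X = x.reshape(-1, 1)

ols = LinearRegression().fit(X, y)

# RANSAC repeatedly fits on small random subsets and keeps the model
# that agrees with the largest number of points (the "inliers")
ransac = RANSACRegressor(residual_threshold=2.0, random_state=0).fit(X, y)

print("true slope: 3.0")
print("OLS slope:   ", ols.coef_[0])
print("RANSAC slope:", ransac.estimator_.coef_[0])
print("points flagged as outliers:", int((~ransac.inlier_mask_).sum()))
```

The `inlier_mask_` attribute is often as useful as the fit itself, since it labels which observations the final model considers consistent with the line.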

## Frequentist error types

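Type I: false positive (rejecting a null hypothesis that is actually true)

Type II: false negative (failing to reject a null hypothesis that is actually false)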
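Both error rates can be estimated by simulation. The sketch below assumes a one-sample t-test of "mean equals 0" at significance level 0.05, with an invented effect size of 0.5 for the case where the null hypothesis is false.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
alpha = 0.05
n_trials, n = 2000, 30


def rejection_rate(true_mean):
    """Fraction of t-tests (H0: mean == 0) that reject at level alpha."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_trials, n))
    p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
    return np.mean(p_values < alpha)


# H0 is true here, so every rejection is a false positive (Type I)
type_1_rate = rejection_rate(true_mean=0.0)

# H0 is false here, so every failure to reject is a false negative (Type II)
type_2_rate = 1.0 - rejection_rate(true_mean=0.5)

print("Type I error rate: ", type_1_rate)
print("Type II error rate:", type_2_rate)
```

The Type I rate should sit close to `alpha` by construction, while the Type II rate depends on the effect size and the sample size: larger samples or larger effects make false negatives rarer.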