<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Regression Techniques

_Your one-stop-shop for keeping regression techniques and code handy_

---


We will split up the work for this notebook:
- Person 1: OLS (Linear Regression)
- Person 2: Ridge Regression
- Person 3: Lasso Regression
- Person 4: Elastic Net Regression
- Person 5: Least Angle Regression (LARS)
---

Regression models covered in this notebook:
- [OLS (Linear Regression)](#Linear-Regression)
- [Ridge Regression](#Ridge)
- [Lasso Regression](#LASSO)
- [Elastic Net Regression](#Elastic-Net)
- [Least Angle Regression (LARS)](#LARS)
---

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
%matplotlib inline



In [2]:
# import data from: http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
df = pd.read_csv('http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv')
df.drop(columns='Unnamed: 0', inplace=True) # drop extra index column
df.head()

| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [3]:
# split into testing and training sets
features = ['TV', 'radio', 'newspaper']
X = df[features] # feature matrix
y = df['sales'] # target vector

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

---
# Linear Regression
_Noelle_

When should you use the algorithm?
> **When to use**:
- When you are modeling linear data
- When you need a regression model that is extremely interpretable and simple
- When you need to see the influence of specific features
- When your model is not overfit (if so, try regularization - see next sections)  
Read more [here](https://statisticsbyjim.com/regression/choose-linear-nonlinear-regression/)  

What are some benefits of the model?
> **Benefits**:
- Computationally cheap, simple, fast, & easy
- Very interpretable
- Can get lots of statistics to help you understand the impact of features

> **Drawbacks**:  
- May not perform as well as other models
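- The relationship between the features and the target must be roughly linear, and the LINE assumptions should hold (Linearity, Independence of errors, Normality of errors, Equal variance of errors)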
- Cannot model very complicated relationships/curves in data
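
To make the interpretability point concrete, here is a minimal sketch that fits ordinary least squares with NumPy and reads the coefficients back. The data is synthetic and noise-free, and the names `X_demo`, `y_demo`, and `true_coefs` are made up for this illustration; they are not part of the Advertising dataset.

```python
import numpy as np

# Hypothetical synthetic data: 3 features with known coefficients and no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coefs = np.array([2.0, -1.0, 0.0])
true_intercept = 5.0
y_demo = X_demo @ true_coefs + true_intercept

# Add a column of ones so the intercept is estimated together with the slopes.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
solution, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept_hat, coefs_hat = solution[0], solution[1:]

# Each coefficient is the change in y for a one-unit change in that feature,
# holding the other features fixed.
for name, coef in zip(['x1', 'x2', 'x3'], coefs_hat):
    print(f'{name}: {coef:+.3f}')
```

Because the data has no noise, the fitted coefficients match the ones used to generate it. With real data they are estimates, and their reliability depends on the LINE assumptions.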

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [5]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

lin_reg = LinearRegression() # instantiate
lin_reg.fit(X_train, y_train) # fit

# simple evaluation function for scoring
def evaluate(y_train_true, y_train_pred, y_test_true, y_test_pred):
    print('Evaluation Metrics')
    print('---------------------------')
    print('Training r2:', r2_score(y_train_true, y_train_pred))
    print('Testing r2:', r2_score(y_test_true, y_test_pred))
    print(' ')
    print('Training MAE:', mean_absolute_error(y_train_true, y_train_pred))
    print('Testing MAE:', mean_absolute_error(y_test_true, y_test_pred))
    print(' ')
    print('Training MSE:', mean_squared_error(y_train_true, y_train_pred))
    print('Testing MSE:', mean_squared_error(y_test_true, y_test_pred))
    print(' ')
    print('Training RMSE:', np.sqrt(mean_squared_error(y_train_true, y_train_pred)))
    print('Testing RMSE:', np.sqrt(mean_squared_error(y_test_true, y_test_pred)))

In [6]:
evaluate(y_train, lin_reg.predict(X_train), y_test, lin_reg.predict(X_test))

Evaluation Metrics
---------------------------
Training r2: 0.8966445527601498
Testing r2: 0.8935163320163657
 
Training MAE: 1.2156188847355558
Testing MAE: 1.4023124989385078
 
Training MSE: 2.7678910780469734
Testing MSE: 2.880023730094194
 
Training RMSE: 1.6636980128758263
Testing RMSE: 1.6970632663793634


In [11]:
# using pipelines & Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# instantiate pipeline using Polynomial Features & Linear regression
pipe = Pipeline([
    ('poly', PolynomialFeatures()), # default degree is 2, I will leave this
    ('lin_reg', LinearRegression())
])

# fit pipeline
pipe.fit(X_train, y_train)

# score pipeline
evaluate(y_train, pipe.predict(X_train), y_test, pipe.predict(X_test))

Evaluation Metrics
---------------------------
Training r2: 0.9861997890896803
Testing r2: 0.9855413464588894
 
Training MAE: 0.4128744495535545
Testing MAE: 0.5137614721126593
 
Training MSE: 0.3695739477107388
Testing MSE: 0.3910577658726892
 
Training RMSE: 0.6079259393303915
Testing RMSE: 0.6253461168606463


---
# Ridge
_Matt_

When should you use the algorithm?  
> What is Ridge Regression?
- Ridge regression, like Lasso regression, is a regularization technique: a way to deal with overfitting, particularly with large datasets that have many features. More specifically, it accepts a small increase in bias in exchange for a decrease in variance.
- Ridge regression adds a penalty on the squared size of the coefficients, scaled by a penalty parameter (alpha), which shrinks our coefficients toward zero to simplify our model. The larger alpha, the greater the penalty. Note that, unlike in Lasso regression, in Ridge regression our coefficients shrink toward zero but are not set exactly to zero.
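
The shrinkage described above can be sketched with the closed-form ridge solution in NumPy. This is an illustration on synthetic, centered data (the names `ridge_coefs`, `X_demo`, and `y_demo` are hypothetical), not the scikit-learn implementation.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Closed-form ridge solution for centered data: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical synthetic data, centered so no intercept is needed.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
X_demo = X_demo - X_demo.mean(axis=0)
y_demo = X_demo @ np.array([4.0, 2.0, 0.5]) + rng.normal(scale=0.5, size=100)
y_demo = y_demo - y_demo.mean()

alphas = [0.0, 10.0, 100.0, 1000.0]
norms = []
for alpha in alphas:
    coefs = ridge_coefs(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(coefs))
    print(f'alpha={alpha:>6}: coefs={np.round(coefs, 3)}')
```

As alpha grows, the coefficient vector gets smaller, but every coefficient stays nonzero.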

What are some benefits of the model?
> What are some benefits of the model? When should you use the algorithm?
- **Helps us to avoid overfitting.** Another way to say this is that Ridge regression reduces variance by introducing bias into our estimates. We do this by adding a penalty term whose strength is set by alpha, as described above.
- **When we have large datasets.** Especially when we have "short and fat" data: lots of features relative to the number of rows. This is particularly relevant when we have computational constraints.
- **When we have highly correlated features.** Ridge regression keeps all of the features in the model, but the coefficient weight is spread across correlated features rather than assigned to just one of them.
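
As a small illustration of the correlated-features point, the sketch below uses an extreme, made-up case: two identical copies of one feature. Ordinary least squares has no unique solution here, while the ridge penalty splits the weight evenly. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = x - x.mean()

# Two identical copies of one feature: perfectly correlated columns.
X_dup = np.column_stack([x, x])
y_dup = 3.0 * x  # the true combined effect of the feature is 3

# Closed-form ridge solution: (X'X + alpha*I)^-1 X'y
alpha = 1.0
coefs = np.linalg.solve(X_dup.T @ X_dup + alpha * np.eye(2), X_dup.T @ y_dup)
print(np.round(coefs, 3))
```

Each copy receives about half of the total effect, and the two coefficients sum to slightly less than 3 because of the shrinkage.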

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [4]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# scale our data
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)
# instantiate
ridge = Ridge(alpha=10)
# fit
ridge.fit(Z_train, y_train)
print(f'train r2: {ridge.score(Z_train, y_train)}')
print(f'test r2: {ridge.score(Z_test, y_test)}')

train r2: 0.8932824104762538
test r2: 0.8905897730203325


> sources:
- https://www.quora.com/What-are-the-benefits-of-using-ridge-regression-over-ordinary-linear-regression
- https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/#three
- https://www.youtube.com/watch?v=OEU22e20tWw
- lesson 4.05-regularization

---
# LASSO
_Nate_

When should you use the algorithm?
> **ANSWER HERE**:  
> [LASSO](https://www.statisticshowto.com/lasso-regression/) regression should be used when a model has high [multicollinearity](https://www.statisticshowto.com/multicollinearity/) or when you have a lot of variables and want to automate feature selection.

What are some benefits of the model?
> **ANSWER HERE**:  
> It's great for feature selection, helps avoid overfitting, and retains interpretability.  
> [More pros and cons of LASSO regression](https://www.quora.com/What-are-the-pros-and-cons-of-lasso-regression)
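
Why LASSO can select features is easiest to see in a simplified setting. Assuming orthonormal features and the objective ½‖y − Xβ‖² + α‖β‖₁, each LASSO coefficient is the OLS coefficient pushed toward zero by α and clipped at zero (soft-thresholding). The sketch below uses made-up coefficient values. The helper `soft_threshold` is written for this illustration and is not a scikit-learn function.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; return 0.0 if it would cross zero."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

# Hypothetical OLS coefficients (not fitted from the Advertising data).
ols_coefs = {'TV': 4.0, 'radio': 3.0, 'newspaper': 0.125}
alpha = 0.25

lasso_coefs = {name: soft_threshold(c, alpha) for name, c in ols_coefs.items()}
print(lasso_coefs)
# {'TV': 3.75, 'radio': 2.75, 'newspaper': 0.0}
```

The small coefficient is set exactly to zero, which is the feature-selection behavior; the large ones are only reduced by alpha.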

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [5]:
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
# Set up a list of Lasso alphas to check.
l_alphas = np.logspace(-3, 1, 100)
# Cross-validate over our list of Lasso alphas.
lasso_cv = LassoCV(
    alphas=l_alphas,
    cv=5,
    max_iter=5000)
# Fit the cross-validated model; it picks the best Lasso alpha from the list.
lasso_cv.fit(Z_train, y_train);

In [7]:
# Here is the optimal value of alpha
print('alpha:', lasso_cv.alpha_)
# R2 Scores
print('Train:', lasso_cv.score(Z_train, y_train))
print('Test:', lasso_cv.score(Z_test, y_test))
# Coefficients
print('TV:', lasso_cv.coef_[0])
print('Radio:', lasso_cv.coef_[1])
print('Newspaper:', lasso_cv.coef_[2])

alpha: 0.07924828983539177
Train: 0.896105190390421
Test: 0.8950875719765927
TV: 3.7774284647455936
Radio: 2.7886904205062333
Newspaper: 0.0


---
# Elastic Net
_Colin_

### What is it?
Elastic Net adds both penalties from Ridge and LASSO models:
$$
\begin{eqnarray}
SSE + Ridge + Lasso &=& \sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2 + \alpha\left[\rho\sum_{j=1}^p |\beta_j| + (1-\rho)\sum_{j=1}^p \beta_j^2\right] \\
&=& \|\mathbf{y} - \mathbf{X}\beta\|^2 + \alpha\left(\rho\|\beta\|_1 + (1 - \rho)\|\beta\|^2\right)
\end{eqnarray}
$$
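- The rho (looks like p) variable is a hyperparameter that tunes how much of each penalty you want to use. In the formula above, rho weights the LASSO (L1) term and (1 - rho) weights the Ridge (L2) term; in scikit-learn it is the `l1_ratio` argument.
    - rho = 0.5 is 50% LASSO and 50% Ridge
    - rho = 0.7 is 70% LASSO and 30% Ridge
- Alpha dials the whole penalty up or down, so it functions the same way as alpha in Ridge or LASSO.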
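
The penalty term in the formula above can be computed directly, which makes the roles of alpha and rho easy to check. This is a minimal sketch of the formula as written here; the function name `elastic_net_penalty` and the example coefficients are made up. Note that scikit-learn's `ElasticNet` scales its objective slightly differently (it divides the squared error by twice the number of samples and halves the L2 term), so the numbers are not interchangeable.

```python
def elastic_net_penalty(beta, alpha, rho):
    """alpha * (rho * L1 norm + (1 - rho) * squared L2 norm) of the coefficients."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return alpha * (rho * l1 + (1 - rho) * l2_squared)

beta = [3.0, -4.0]  # hypothetical coefficients: L1 norm = 7, squared L2 norm = 25

for rho in (0.0, 0.5, 1.0):
    print(rho, elastic_net_penalty(beta, alpha=2.0, rho=rho))
# 0.0 50.0   (pure Ridge penalty)
# 0.5 32.0   (even mix)
# 1.0 14.0   (pure LASSO penalty)
```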

### Why and when would we use it?
<br>
<details>
        - "My preference for elastic net is rooted in my skepticism that one will confidently know that 𝐿1 or 𝐿2 is the true model." - some guy on stats.stackexchange.com
</details>
<br>   
<details>
        - "It's cute but you don't need it" - Tim Book
</details>
<br>
<details>
        - "Elastic net is always preferred over lasso & ridge regression because it solves the limitations of both methods, while also including each as special cases. So if the ridge or lasso solution is, indeed, the best, then any good model selection routine will identify that as part of the modeling process." - same guy on stats.stackexchange.com
</details>

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.pipeline import Pipeline

#preprocessing
ss = StandardScaler()
Z_train = ss.fit_transform(X_train)
Z_test = ss.transform(X_test)

In [9]:
# Hyperparameters
enet_alphas = np.logspace(-3, 2, 100)
rho = 0.5
# Instance
enet = ElasticNetCV(alphas=enet_alphas, l1_ratio=rho, cv=5)
# Fit
enet.fit(Z_train, y_train)
# rho = 0.5
print(enet.score(Z_train, y_train))
print(enet.score(Z_test, y_test))

0.8966161917256624
0.8936580334433731


---
# LARS
_Justin_

> sources:
- https://medium.com/acing-ai/what-is-least-angle-regression-lar-bb86756f01d0
- https://blog.echen.me/2011/04/21/a-mathematical-introduction-to-least-angle-regression/
- http://statweb.stanford.edu/~tibs/ftp/lars.pdf
- https://en.wikipedia.org/wiki/Least-angle_regression
- https://stats.stackexchange.com/questions/404984/writing-by-hand-first-steps-in-least-angle-regression-lars
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html
- https://www.quora.com/What-is-Least-Angle-Regression-and-when-should-it-be-used

When should you use the algorithm?
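> **ANSWER HERE**: When it is a linear problem, but there are many variables with different impacts and correlations. ALL PREDICTORS MUST BE STANDARDIZED FIRST (in the scikit-learn version used when this notebook was written, `Lars` did this by default through its `normalize` argument; that argument has since been deprecated and removed, so on newer versions scale the data yourself, for example with `StandardScaler`).
> - It starts with all coefficients at zero, so the initial residuals are $r = y - \bar{y}$, and picks the variable most correlated with those residuals.
> - It then increases that variable's coefficient until another variable is equally correlated with the residuals.
> - That variable is then added, and the coefficients of the active variables move together until yet another variable can be added.
> - This process continues until the model is big enough or has used all the features. At the end of the full path, the coefficients match the ordinary least squares fit.
>
> Stopping the path early leaves the remaining variables "turned off" (coefficients of zero), which tends to happen to variables that are highly correlated with ones already in the model and so add little on their own.
>
> It is closely related to forward stepwise regression.
>
> Stopping early will also help crack down on overfitting.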
![](https://miro.medium.com/max/1400/1*_uwNycdc2EjJlGBWZKUFZQ.jpeg)
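
The first step of the procedure described above can be sketched in a few lines of NumPy: standardize the predictors, form the initial residuals, and pick the predictor most correlated with them. The data is synthetic and all names are made up for illustration. This covers only the selection step, not the full LARS path that `sklearn.linear_model.Lars` computes.

```python
import numpy as np

# Hypothetical synthetic data: feature 1 has by far the strongest effect.
rng = np.random.default_rng(1)
n = 200
X_demo = rng.normal(size=(n, 3))
y_demo = 0.5 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)

# Standardize the predictors (mean 0, standard deviation 1).
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# With all coefficients at zero, the residuals are y minus its mean.
residual = y_demo - y_demo.mean()

# Correlation with the residuals, up to a scale factor shared by all features.
correlations = Z.T @ residual / n

# LARS would start moving the coefficient of this feature first.
first_feature = int(np.argmax(np.abs(correlations)))
print('correlations:', np.round(correlations, 3))
print('first feature to enter:', first_feature)
```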

What are some benefits of the model?
> **ANSWER HERE**: Easy to explain.  Efficient to run.  Starts as a simple model and increases complexity as it iterates over the variables.

Give us an example of how to use it in Python using the provided dataset
> **ANSWER BELOW**

In [10]:
from sklearn.linear_model import Lars
larsreg = Lars()
larsreg.fit(X_train, y_train)
lars_coefs = dict(zip(df.columns.tolist()[:-1], 
                      np.round(larsreg.coef_, 5)))
lars_coefs

{'TV': 0.04543, 'radio': 0.19146, 'newspaper': 0.00257}

In [11]:
print(f'LARS training: {larsreg.score(X_train, y_train):.4}')
print(f'LARS testing: {larsreg.score(X_test, y_test):.4}')

LARS training: 0.8966
LARS testing: 0.8935


In [12]:
# zip code for coefs dictionary adopted from:
# https://towardsdatascience.com/a-comparison-of-shrinkage-and-selection-methods-for-linear-regression-ee4dd3a71f16