# Ridge and Lasso Regression

### Table of Contents
1. [Getting started](#1.-Getting-started)
2. [Estimating coefficients](#2.-Estimating-coefficients)
3. [Model evaluation](#3.-Model-evaluation)
4. [Exercise: model comparison and cross-validation](#4.-Exercise:-model-comparison-and-cross-validation)
5. [Exercise: regression with a new dataset](#5.-Exercise:-regression-with-a-new-dataset)
6. [Reference](#6.-Reference)

### 1. Getting started

In [None]:
# importing packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler

In [None]:
# importing the data

adv = pd.read_csv('../data/advertising.csv') 
adv.head(5) # top 5 rows

In [None]:
adv['TV'].hist()

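This time, we are going to __scale__ our features. Recall from the feature engineering lesson that a common way to prepare data for regression is __standardization__, also known as __z-score normalization__. Scaling matters here because Ridge and Lasso penalize the size of the coefficients, so features measured on different scales would otherwise be penalized unevenly.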

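The idea is that for every column `x`, the transformed values `x'` are calculated as follows, where $x_{mean}$ is the column mean and $\sigma$ is the column standard deviation. Each resulting column has mean 0 and standard deviation 1; the shape of the distribution is unchanged, so the values are not made normally distributed: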

$x' = \frac{x - x_{mean}}{\sigma}$
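
As a quick sanity check, the formula can be applied by hand and compared with `StandardScaler`. This is a minimal sketch on made-up numbers (the `toy` frame is hypothetical, not the advertising data). Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, used only to illustrate the formula
toy = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (toy['x'] - toy['x'].mean()) / toy['x'].std(ddof=0)

# The same transformation with scikit-learn
scaled = StandardScaler().fit_transform(toy[['x']]).ravel()

print(np.allclose(manual, scaled))   # checks that the two results agree
print(scaled.mean(), scaled.std())   # mean and std of the scaled column
```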

In [None]:
scaler = StandardScaler()
columns_to_scale = ['TV', 'Radio', 'Newspaper']
scaled_column_names = [column + '_scaled' for column in columns_to_scale]
scaled_columns = pd.DataFrame(scaler.fit_transform(adv[columns_to_scale]),
                              columns = scaled_column_names)
adv = pd.concat([adv, scaled_columns], axis = 1)

In [None]:
adv.head()

In [None]:
adv['TV_scaled'].hist()

In [None]:
# Split data into train and test

train, test = train_test_split(adv,
                               test_size=0.3,
                               random_state=1)
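
Note that the scaler above was fit on the full dataset before this split, so the test rows influenced the scaling statistics. A common alternative is to fit the scaler on the training rows only and reuse it on the test rows. The following is a minimal sketch on synthetic data; the `demo` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two advertising features
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(100, 20, size=(50, 2)),
                    columns=['TV', 'Newspaper'])

demo_train, demo_test = train_test_split(demo, test_size=0.3, random_state=1)

# Fit on the training rows only, then apply the same transformation to both sets
demo_scaler = StandardScaler().fit(demo_train)
train_scaled = demo_scaler.transform(demo_train)
test_scaled = demo_scaler.transform(demo_test)

print(train_scaled.mean(axis=0))  # approximately 0 by construction
print(test_scaled.mean(axis=0))   # generally close to 0, but not exactly 0
```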

In [None]:
# train_test_split already returns DataFrames when given one,
# so this step leaves train and test unchanged

train = pd.DataFrame(data=train,
                     columns=adv.columns)

test = pd.DataFrame(data=test,
                    columns=adv.columns)

### 2. Estimating coefficients

In [None]:
# Fit a linear regression model using OLS

slm = LinearRegression()
slm.fit(train[['TV_scaled','Newspaper_scaled']],
        train['Sales']) # fit using only TV and Newspaper

In [None]:
# Evaluate the output

print(slm.intercept_)
print(slm.coef_)

In [None]:
# Fit a Ridge regression (L2 penalty; default alpha=1.0)

ridge = Ridge()
ridge.fit(train[['TV_scaled','Newspaper_scaled']],
          train['Sales']) # fit using only TV and Newspaper

In [None]:
# Evaluate the output

print(ridge.intercept_)
print(ridge.coef_)

In [None]:
# Fit a Lasso regression (L1 penalty; default alpha=1.0)

lasso = Lasso()
lasso.fit(train[['TV_scaled','Newspaper_scaled']],
          train['Sales']) # fit using only TV and Newspaper

In [None]:
# Evaluate the output

print(lasso.intercept_)
print(lasso.coef_)
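
`Ridge()` and `Lasso()` as called above use scikit-learn's default penalty strength, `alpha=1.0`. To see how the penalty acts on the coefficients, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` and the chosen `alpha` values are made up for illustration). Ridge shrinks coefficients toward zero, while Lasso can set some of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: only the first two features affect the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=200))

ols_demo = LinearRegression().fit(X_demo, y_demo)
print('OLS coefficients:', ols_demo.coef_)

# Refit with increasing penalty strength and compare the coefficients
for alpha in [0.1, 1.0, 10.0]:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    print('alpha =', alpha)
    print('  ridge:', ridge_demo.coef_)
    print('  lasso:', lasso_demo.coef_)
```

Comparing the printed rows for each `alpha` shows which coefficients each method shrinks and which ones Lasso drops entirely.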

### 3. Model evaluation

Now, we evaluate the models we've created using the __test__ dataset (the data the model hasn't yet seen).

1. Evaluate the predictions of the Ridge and Lasso models on the test dataset

In [None]:
# Ridge: predict Sales for the test set from TV and Newspaper
ridge_preds = ridge.predict(test[['TV_scaled','Newspaper_scaled']])

# RMSE of the Ridge predictions
np.sqrt(mean_squared_error(test['Sales'], ridge_preds))

In [None]:
# Lasso: predict Sales for the test set from TV and Newspaper
lasso_preds = lasso.predict(test[['TV_scaled','Newspaper_scaled']])

# RMSE of the Lasso predictions
np.sqrt(mean_squared_error(test['Sales'], lasso_preds))
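
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target. A minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` are hypothetical) shows that the manual calculation matches the `mean_squared_error` route used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values; every error is +1 or -1
y_true = np.array([10.0, 12.0, 15.0, 9.0])
y_pred = np.array([11.0, 11.0, 14.0, 10.0])

# Manual RMSE: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)  # both are 1.0
```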

2. Evaluate the models using cross-validation

In [None]:
ridge_cv_scores = cross_val_score(ridge,
                                  adv[['TV_scaled', 'Newspaper_scaled']], adv['Sales'],
                                  cv=5, scoring='neg_mean_squared_error')
# scores are negative MSE per fold: negate, take the root, then average the fold RMSEs
np.mean(np.sqrt(-ridge_cv_scores))

In [None]:
lasso_cv_scores = cross_val_score(lasso,
                                  adv[['TV_scaled', 'Newspaper_scaled']], adv['Sales'],
                                  cv=5, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-lasso_cv_scores))
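
Because `TV_scaled` and `Newspaper_scaled` were computed from the full dataset, each fold's held-out rows contributed to the scaling. One way to keep the scaling inside each fold is to wrap the scaler and the model in a pipeline. The following is a minimal sketch on synthetic data; `X_cv`, `y_cv`, and `pipe` are names made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features and a noisy linear target
rng = np.random.default_rng(0)
X_cv = rng.normal(50, 10, size=(100, 2))
y_cv = 0.05 * X_cv[:, 0] + 0.01 * X_cv[:, 1] + rng.normal(scale=1.0, size=100)

# The scaler is refit on the training part of every fold
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
neg_mse = cross_val_score(pipe, X_cv, y_cv,
                          cv=5, scoring='neg_mean_squared_error')

fold_rmse = np.sqrt(-neg_mse)   # one RMSE per fold
print(fold_rmse)
print(fold_rmse.mean())         # average RMSE across the 5 folds
```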

### 4. Exercise: model comparison and cross-validation

__(10 min.)__

1. Run all three types of multiple linear regression (OLS, Ridge, Lasso) with __all__ of your features.
  - Now that you've scaled your features, you don't need to use both the unscaled and scaled versions
  - Which coefficients have the largest magnitudes?
  - What does this suggest practically?


2. Calculate the 5-fold CV RMSE. Is it better or worse than before?

### 5. Exercise: regression with a new dataset

__(20 min.)__

1. Perform EDA on a new dataset: `credit.csv`
2. Determine your target variable and features
3. Select a model: Ridge, Lasso, or OLS
4. Justify your selections to your client

### 6. Reference

- [Ridge and lasso regression](http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf)
- [scikit-learn](http://scikit-learn.org/stable/) 
- [scatter plots](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html)
- [mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
- [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html)