## Regression

Regression, like classification, is a supervised machine learning problem; in regression, however, the target variable is continuous. It is also "a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors')." (from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>)

It is important to note that, whereas statistical regression analysis is largely descriptive, Data Science focuses on the predictive side of this method.

## Why is it important?
_"Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning."_ from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>

It is used to forecast any continuous variable:
- stock market
- salary prediction
- network traffic
- traffic
- etc.

## Tools
- Linear regression
- Ridge regression
- LASSO
- Bayesian regression
- Support Vector regression
- etc.

$\newcommand{\bs}[1]{\boldsymbol{#1}}$

## Variations on a Theme

The traditional linear problem is stated like this:
$$ y_i = \bs{x}_i \bs{\beta} $$
for every observation $i$, or more compactly
$$ \bs{y} = \bs{X}\bs{\beta} $$
where $ \bs{X} $ is the matrix of observed values (one row $\bs{x}_i$ per observation), $\bs{y}$ is the vector of observed output variables, and $\bs{\beta}$ is the weight vector which we want to find.

In ordinary least squares (OLS), we look for the $\bs{\beta}$ that minimizes a *loss function*, which is simply the sum of squares of the differences between the predicted values $\hat y_i = \bs{x}_i \bs{\beta}$ and the observed values $y_i$ (also called the sum of squared residuals or SSR):

$ \mathrm{Cost}(\bs{\beta}) = \mathrm{SSR}(\bs{\beta}) = \sum _i (\hat y_i - y_i)^{2} $.  
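
As a rough illustration of the SSR loss, here is a minimal sketch on synthetic data. The data, the names `X_demo`, `y_demo` and `ssr`, and the choice of `np.linalg.lstsq` as the solver are assumptions made for this example only; they are not part of the notebook's dataset or pipeline.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3*x plus a little noise.
rng = np.random.RandomState(0)
x_demo = rng.uniform(-1, 1, size=50)
y_demo = 2.0 + 3.0 * x_demo + 0.1 * rng.randn(50)

# Design matrix with a column of ones, so beta[0] plays the role of the intercept.
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# np.linalg.lstsq returns the weight vector that minimizes the sum of squared residuals.
beta, _, _, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)


def ssr(b):
    """Sum of squared residuals for the weight vector b."""
    return np.sum((X_demo @ b - y_demo) ** 2)


print("beta:", beta)
print("SSR at the least-squares beta:", ssr(beta))
print("SSR at a perturbed beta:", ssr(beta + np.array([0.1, -0.1])))
```

Any perturbation of the least-squares solution should give a larger SSR, which is what the last two lines let you check.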

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html">Ridge</a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">LASSO</a> and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html">Bayesian</a> regressions (and a couple more) are essentially simple <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">linear</a> regressions with a modified loss function.  
Ridge regression adds the sum of the squares of the weights with a constant multiplier to the loss, i.e.

$ \mathrm{Cost}(\bs{\beta}) = \sum _i (\hat y_i - y_i)^{2} + \alpha \sum _j \beta _j^{2}. $

LASSO adds the sum of the absolute values of the coefficients, i.e.

$ \mathrm{Cost}(\bs{\beta}) = \sum _i (\hat y_i - y_i)^{2} + \alpha \sum _j \vert \beta _j \vert. $

### Ok, but what is the point of this?

This technique is called **regularization**. In our case it is used to keep the model from **overfitting** the data (our greatest enemy, right before **the curse of dimensionality**): penalizing the size of the coefficients prevents them from growing too large. To illustrate this, we will use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html">*Boston housing dataset*</a>. (You should also check out <a href="https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/">this</a> for a more detailed discussion on Ridge and LASSO.)
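
To see the shrinking effect in isolation, here is a small sketch that minimizes the Ridge cost in closed form, $\bs{\beta} = (\bs{X}^T\bs{X} + \alpha \bs{I})^{-1}\bs{X}^T\bs{y}$, on synthetic data. The helper `ridge_fit`, the data and the alpha values are assumptions for illustration. There is no intercept term here, so this is not a reproduction of scikit-learn's `Ridge`, which by default fits an unpenalized intercept.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form minimizer of SSR + alpha * sum(beta_j ** 2), without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic, nearly collinear features (hypothetical data): a setting in which
# unregularized coefficients tend to become large and unstable.
rng = np.random.RandomState(0)
x1 = rng.randn(30)
x2 = x1 + 0.01 * rng.randn(30)
X_demo = np.column_stack([x1, x2])
y_demo = x1 + 0.1 * rng.randn(30)

alphas = [1e-6, 1e-2, 1.0, 100.0]
norms = [np.linalg.norm(ridge_fit(X_demo, y_demo, a)) for a in alphas]
for a, nrm in zip(alphas, norms):
    print("alpha=%g  ||beta||=%.4f" % (a, nrm))
```

The norm of the coefficient vector decreases as `alpha` grows, which is the sense in which the penalty keeps the weights small.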

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd

from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
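# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Keep only column 12 (LSTAT, % lower status of the population) as a 2-D column.
lsop_train = X_train[:, [12]]
lsop_test = X_test[:, [12]]
# Grid of x values used later to draw the fitted curves.
curve_x = np.linspace(-10, 50, num=300)[:, np.newaxis]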

In [None]:
plt.scatter(lsop_train, y_train)
plt.xlabel("% lower status of the population")
plt.ylabel("Median value of owner-occupied homes in $1000's")

## OLS

In [None]:
ols = Pipeline([('poly', PolynomialFeatures()), ('ols', LinearRegression())])
parameters = {'poly__degree': range(1,16)}
grid_search = GridSearchCV(ols, parameters, n_jobs=2, scoring='neg_mean_squared_error')

In [None]:
grid_search.fit(lsop_train, y_train)

In [None]:
print(grid_search.best_params_)
# Pipeline.score delegates to LinearRegression.score, i.e. R^2 on the test set (not MSE).
print(grid_search.best_estimator_.score(lsop_test, y_test))

For plotting purposes, explicitly create one pipeline per polynomial degree.

In [None]:
pipes = {}
for degree in range(1,16):
    pipes[degree] = Pipeline([('poly', PolynomialFeatures(degree=degree)), ('ols', LinearRegression())])
    pipes[degree].fit(lsop_train, y_train)

In [None]:
for degree, color in [(1,'g'), (2,'r'), (3,'y'), (5,'c'), (13,'m')]:
    plt.plot(curve_x, pipes[degree].predict(curve_x), color, lw=2, label=degree)
plt.scatter(lsop_train, y_train, s=10)
plt.xlabel("% lower status of the population")
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.ylim([0,60])
plt.xlim([-10,50])
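
Independently of the housing data, the following sketch on synthetic data shows why the training error alone cannot be used to pick the polynomial degree: with least squares, a higher degree never fits the training points worse, while the error on fresh data often stops improving or gets worse. The generating function, the noise level, the sample sizes and all names here (`make_data`, `syn_*`, `mse`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)


def make_data(n):
    """Draw n noisy samples of a smooth (hypothetical) target function."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.randn(n)
    return x, y


syn_x_train, syn_y_train = make_data(30)
syn_x_test, syn_y_test = make_data(200)


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)


train_mse, test_mse = {}, {}
for degree in [1, 3, 9]:
    # Least-squares polynomial fit of the given degree on the training sample.
    coeffs = np.polyfit(syn_x_train, syn_y_train, degree)
    train_mse[degree] = mse(coeffs, syn_x_train, syn_y_train)
    test_mse[degree] = mse(coeffs, syn_x_test, syn_y_test)
    print("degree=%d  train MSE=%.4f  test MSE=%.4f"
          % (degree, train_mse[degree], test_mse[degree]))
```

Comparing the two columns is the same idea that the cross-validated grid search automates: the degree is judged on data the model was not fitted on.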

## Ridge regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge = Pipeline([('poly', PolynomialFeatures(degree=5)), ('ridge', Ridge())])
params = {'ridge__alpha': np.logspace(-15, 13, 29)}
rgrid_search = GridSearchCV(ridge, params, n_jobs=2, scoring='neg_mean_squared_error')

In [None]:
rgrid_search.fit(lsop_train, y_train)

In [None]:
print(rgrid_search.best_params_)
# As above, score() on the fitted pipeline reports R^2 on the test set.
print(rgrid_search.best_estimator_.score(lsop_test, y_test))

Create the pipelines here too, to see how the regularization parameter "deforms" the degree-5 polynomial we saw in the previous plot.

In [None]:
pipes = {}
for alpha in np.logspace(-15, 13, 29):
    pipes[alpha] = Pipeline([('poly', PolynomialFeatures(degree=5)), ('ridge', Ridge(alpha=alpha))])
    pipes[alpha].fit(lsop_train, y_train)

In [None]:
for alpha, color in [(1e-13,'g'), (1e-1,'r'), (1e1,'y'), (1e2,'c'), (1e10,'m')]:
    plt.plot(curve_x, pipes[alpha].predict(curve_x), color, lw=2, label=alpha)
plt.scatter(lsop_train, y_train, s=10)
plt.xlabel("% lower status of the population")
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.ylim([0,60])
plt.xlim([-10,50])

## LASSO

LASSO stands for *least absolute shrinkage and selection operator*.

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso = Pipeline([('poly', PolynomialFeatures(degree=5)), ('lasso', Lasso(max_iter=10000))])
params = {'lasso__alpha': np.logspace(-5, 13, 19)}
lgrid_search = GridSearchCV(lasso, params, scoring='neg_mean_squared_error')
lgrid_search.fit(lsop_train, y_train)

In [None]:
lgrid_search.best_params_, -lgrid_search.score(lsop_test, y_test)

LASSO also works as a feature selection tool: when alpha is set high enough, it drives some coefficients to exactly zero. If we go overboard with this, however, it leads to **underfitting**, which is also bad.
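
Why exact zeros appear can be sketched in a simplified setting. Assuming the columns of $\bs{X}$ are orthonormal (an assumption made only for this example), the LASSO cost $\mathrm{SSR} + \alpha \sum \vert \beta \vert$ separates per coefficient, and its minimizer is the least-squares coefficient "soft-thresholded" by $\alpha / 2$. The function name `soft_threshold` and the coefficient values below are hypothetical. Note also that scikit-learn's `Lasso` scales the SSR term by $1/(2n)$, so its `alpha` values are not directly comparable to the ones used here.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries with |z| <= t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


# Hypothetical least-squares coefficients for an orthonormal design.
beta_ols = np.array([3.0, -0.5, 1.2, 0.05])

for alpha in [0.0, 0.5, 2.0, 10.0]:
    # Minimizer of sum((b - beta_ols) ** 2) + alpha * sum(abs(b)).
    beta_lasso = soft_threshold(beta_ols, alpha / 2.0)
    n_zero = int(np.sum(beta_lasso == 0))
    print("alpha=%-5g coefficients=%s zeros=%d" % (alpha, beta_lasso, n_zero))
```

Small coefficients are removed first as `alpha` increases (feature selection), and a large enough `alpha` zeroes out everything, which is the underfitting extreme.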

In [None]:
pipes = {}
coefs = pd.DataFrame()
for alpha in np.logspace(-5, 13, 19):
    pipes[alpha] = Pipeline([('poly', PolynomialFeatures(degree=5)), ('lasso', Lasso(max_iter=100000, alpha=alpha))])
    pipes[alpha].fit(lsop_train, y_train)
    # Skip coef_[0]: it belongs to the constant column added by PolynomialFeatures.
    coefs[alpha] = pipes[alpha].named_steps['lasso'].coef_[1:]

print(coefs.T)

In [None]:
for alpha, color in [(1e-5,'g'), (1e-2,'r'), (1e-1,'y'), (1e1,'c'), (1e7,'m')]:
    plt.plot(curve_x, pipes[alpha].predict(curve_x), color, lw=2, label=alpha)
plt.scatter(lsop_train, y_train, s=10)
plt.xlabel("% lower status of the population")
plt.ylabel("Median value of owner-occupied homes in $1000's")
plt.legend(loc='upper right')
plt.ylim([0,60])
plt.xlim([-10,50])

## Bayesian Ridge regression

In [None]:
from sklearn.linear_model import BayesianRidge

In [None]:
b_ridge = Pipeline([('poly', PolynomialFeatures(degree=5)), ('b_ridge', BayesianRidge())])
b_ridge.fit(lsop_train, y_train)

In [None]:
# mean_squared_error expects (y_true, y_pred); score() reports R^2.
mean_squared_error(y_test, b_ridge.predict(lsop_test)), b_ridge.score(lsop_test, y_test)

## Support Vector Regression

In [None]:
from sklearn.svm import SVR

In [None]:
svr = Pipeline([('svr', SVR())])
svr.fit(lsop_train, y_train)
# Same convention as above: (y_true, y_pred) for the MSE, R^2 from score().
mean_squared_error(y_test, svr.predict(lsop_test)), svr.score(lsop_test, y_test)