# Intro to Data Science 
## Part VII. - Regression and Embedding pipelines

### Table of contents

- #### Regression
    - <a href="#What-is-Regression?">Theory</a>
    - <a href="#Linear-regression---OLS">Linear regression</a>
    - <a href="#Ridge-regression">Ridge regression</a>
    - <a href="#LASSO">LASSO regression</a>
    - <a href="#Bayesian-Ridge-regression">Bayesian regression</a>
    - <a href="#Support-Vector-regression">Support Vector regression</a>
    - <a href="#XGBoost">XGBoost</a>

- #### Managing model lifecycle
    - <a href="#Reusing-trained-pipelines">Reusing trained pipelines</a>
        - <a href="#Saving-pipelines">Exporting pipelines</a>
        - <a href="#Loading-pipelines">Loading pipelines</a>
    - <a href="#Tracking-sklearn-models">Managing model lifecycle with MLFlow</a>
        - <a href="#What-is-MLFlow?">MLFlow Experiments</a>
        - <a href="#Tracking-Experiments">Tracking Experiments</a>
        - <a href="#Loading-saved-models">Saving and loading models</a>
    - <a href="#Track-and-save-regression-models">Track multiple experiments</a>

---

## What is Regression?
Regression, just like classification, is a supervised machine learning problem; in regression, however, the target variable is continuous. It is also _"a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a __dependent variable__ and one or more __independent variable__s (or 'predictors')."_ from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>

It is important to note that, whereas statistical regression analysis is mostly descriptive, Data Science focuses on the predictive side of this method.

## Why is it important?
_"Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning."_ from: <a href="https://en.wikipedia.org/wiki/Regression_analysis">Wiki</a>

It can be used to forecast any continuous variable, for example:
- stock prices
- salaries
- network traffic
- road traffic
- etc.

## Tools
- Linear regression
- Ridge regression
- LASSO
- Bayesian regression
- Support Vector regression
- etc.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

from sklearn.pipeline import Pipeline

In [None]:
def plot_pred(y, predicted):
    fig, ax = plt.subplots()
    ax.scatter(y, predicted, edgecolors='k')
    ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
    ax.set_xlabel('Measured')
    ax.set_ylabel('Predicted')
    plt.show()

    
def plot_boston(ax):
    ax.scatter(lsop_train, y_train, edgecolors='k', s=10)
    ax.set_xlabel("% lower status of the population")
    ax.set_ylabel("Median value of owner-occupied homes in $1000's")
    ax.set_xlim([-10,50])
    ax.set_ylim([0,60])
    
    
def plot_curve(estimator, param, values, ax):   
    for color, value in zip(colors, values):
        estimator = estimator.set_params(**{param: value}).fit(lsop_train, y_train)
        ax.plot(curve_x, estimator.predict(curve_x), '-', c=color, lw=2, label=value)
    plot_boston(ax)
    ax.legend(loc='upper right')

    
def show_score(model, X, y, cv=10, metric=None):
    # Cross-validated score; for 'neg_*' metrics, values closer to zero are better.
    scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
    return "Score: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2)

colors = ['g', 'r', 'y', 'c', 'm', 'b']

## Variations on a Theme

The traditional linear problem is stated like this:
$$ y_i = \bs{x}_i \bs{\beta} $$
for every observation $i$, or more compactly
$$ \bs{y} = \bs{X}\bs{\beta} $$
where $ \bs{X} $ is the matrix of observed values, $\bs{y}$ is the vector of observed output variables, and $\bs{\beta}$ is the weight vector that we want to find. 

In OLS, we try to find the $\bs{\beta}$ while minimizing a *loss function*, which is simply the sum of squares of the differences between the predicted and observed values (also called sum of squared residuals or SSR), 

$ \mathrm{Cost}(\bs{\beta}) = \mathrm{SSR}(\bs{\beta}) = \sum _i (\hat y_i - y_i)^{2} $.  
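To make the loss function concrete, here is a minimal sketch that computes the SSR by hand and finds the minimizing $\bs{\beta}$ with NumPy. The synthetic data and the helper name `ssr` are assumptions made for illustration; they are not part of the Boston example used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3*x plus a little noise (illustrative only).
x = rng.uniform(-1, 1, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Design matrix with a column of ones, so beta[0] plays the role of the intercept.
X = np.column_stack([np.ones_like(x), x])


def ssr(beta, X, y):
    """Sum of squared residuals for a given weight vector."""
    residuals = X @ beta - y
    return float(residuals @ residuals)


# np.linalg.lstsq returns the beta that minimizes the SSR.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("beta:", beta_ols)
print("SSR at the OLS solution:", ssr(beta_ols, X, y))
print("SSR at a perturbed beta:", ssr(beta_ols + np.array([0.1, -0.1]), X, y))
```

Any other weight vector, such as the perturbed one in the last line, gives a larger SSR than the least-squares solution.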

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html">Ridge</a>, <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">LASSO</a> and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html">Bayesian</a> regressions (and a couple more) are basically simple <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">linear</a> regressions, but with the loss function being modified.  
Ridge regression adds the sum of the squares of the weights, multiplied by a constant $\alpha$, to the loss, i.e.

$ \mathrm{Cost}(\bs{\beta}) = \sum _i (\hat y_i - y_i)^{2} + \alpha \sum _j \beta _j^{2}, $

where $i$ runs over the observations and $j$ over the coefficients. LASSO adds the sum of the absolute values of the coefficients instead, i.e.

$ \mathrm{Cost}(\bs{\beta}) = \sum _i (\hat y_i - y_i)^{2} + \alpha \sum _j \vert \beta _j \vert. $

### Ok, but what is the point of this?

This technique is called <a href="https://en.wikipedia.org/wiki/Regularization_(mathematics)">**regularization**</a>, and in our case its purpose is to prevent the model from **overfitting** the data (overfitting is our greatest enemy, right before **the curse of dimensionality**). Basically, it prevents the coefficients from growing too large. To illustrate this, we use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html">*boston dataset*</a>. It has ethical issues built in, but that also makes it a good opportunity to discover how datasets can include prejudice. (You should also check out <a href="https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/">this</a> for a more detailed discussion of Ridge and LASSO.)
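The shrinking effect on the coefficients can be checked with a few lines of NumPy. The sketch below minimizes the Ridge cost through the normal equations on a small synthetic sample; the data, the degree-9 basis and the helper `ridge_fit` are assumptions for illustration. Unlike scikit-learn's `Ridge`, this simplified version also penalizes the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small noisy sample and a degree-9 polynomial basis: a setup that overfits easily.
x = np.linspace(-1, 1, 15)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
X = np.vander(x, N=10, increasing=True)  # columns: 1, x, x^2, ..., x^9


def ridge_fit(X, y, alpha):
    """Minimize SSR + alpha * sum(beta^2) via the normal equations.

    For simplicity the intercept column is penalized as well.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


alphas = [1e-8, 1e-4, 1e-2, 1.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, a))) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:g}  ||beta||={n:.3f}")
```

The norm of the weight vector gets smaller as `alpha` grows, which is exactly the "coefficients cannot grow too large" behaviour described above.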

---

## Loading the boston dataset


In [None]:
# load_boston was removed in scikit-learn 1.2, so the data is read from its original source in the next cell.

In [None]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

X, y = data, target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lsop = X[:, 12][:, np.newaxis]
lsop_train = X_train[:, 12][:, np.newaxis]
lsop_test = X_test[:, 12][:, np.newaxis]

curve_x = np.linspace(-10, 50, num=300)[..., np.newaxis]

## Linear regression - OLS

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
ols = Pipeline([('poly', PolynomialFeatures()), 
                ('ols', LinearRegression())])
parameters = {'poly__degree': range(1,16)}
ols_grid = GridSearchCV(ols, 
                        parameters, 
                        cv=5,
                        n_jobs=2, 
                        scoring='neg_mean_squared_error')
ols_grid.fit(lsop_train, y_train)

In [None]:
ols_grid.best_estimator_

In [None]:
ols_grid.best_params_

In [None]:
show_score(ols_grid.best_estimator_, lsop_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = ols_grid.best_estimator_.predict(lsop_test)
plot_pred(y_test, y_hat)

Plot some example curves with different degrees.

In [None]:
fig, ax = plt.subplots()
plot_curve(ols, 'poly__degree', [1, 2, 3, 5, 13], ax)

## Ridge regression

In [None]:
import sklearn
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

In [None]:
ridge = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=5)), 
    ('ridge', Ridge())
])
params = {'ridge__alpha': np.logspace(-15, 13, 29)}
ridge_grid = GridSearchCV(ridge, 
                          params, 
                          cv=5,
                          n_jobs=2, 
                          scoring='neg_mean_squared_error')
ridge_grid.fit(lsop_train, y_train)

In [None]:
ridge_grid.best_params_

Available scorers are:

In [None]:
print(' -', '\n - '.join(sklearn.metrics.get_scorer_names()))

In [None]:
show_score(ridge_grid.best_estimator_, lsop_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = ridge_grid.best_estimator_.predict(lsop_test)
plot_pred(y_test, y_hat)

Plot some example curves to see how the regularization parameter "deforms" the degree-5 polynomial we saw in the previous plot.

In [None]:
fig, ax = plt.subplots()
plot_curve(ridge, 'ridge__alpha', [1e-13, 1e-6, 1e-1, 1e0, 1e2], ax)

## LASSO

LASSO stands for *least absolute shrinkage and selection operator*.

In [None]:
from sklearn.linear_model import Lasso

In [None]:
lasso = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=5)), 
    ('lasso', Lasso(max_iter=100_000))
])
params = {'lasso__alpha': np.logspace(-5, 13, 19)}
lasso_grid = GridSearchCV(lasso, 
                          params, 
                          cv=5,
                          scoring='neg_mean_squared_error')
lasso_grid.fit(lsop_train, y_train)

In [None]:
show_score(lasso_grid.best_estimator_, lsop_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = lasso_grid.best_estimator_.predict(lsop_test)
plot_pred(y_test, y_hat)

LASSO also works as a feature selection tool: when alpha is set high enough, it sets some coefficients to exactly zero, as the next cell shows. If we go overboard with this, however, it leads to **underfitting**, which is also bad.

In [None]:
coefs = pd.DataFrame()
pipe = Pipeline([('poly', PolynomialFeatures(degree=5)),
                 ('lasso', Lasso(max_iter=100_000))])

for alpha in np.logspace(-5, 13, 19):
    pipe = pipe.set_params(lasso__alpha=alpha).fit(lsop_train, y_train)
    coefs[alpha] = pipe.named_steps['lasso'].coef_[1:]

coefs.T
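Why do coefficients become exactly zero? One way to see it is coordinate descent with soft-thresholding, sketched below for the LASSO cost written earlier (SSR plus `alpha` times the sum of absolute values). The helper names and the synthetic data are assumptions for illustration. Note that scikit-learn's `Lasso` scales the squared-error term differently, so its `alpha` values are not directly comparable to the ones used here.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if it does not survive."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize sum((X @ beta - y)^2) + alpha * sum(|beta|), one coefficient at a time."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, alpha / 2.0) / col_sq[j]
    return beta


rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
# Only the first two features carry signal; the other three are pure noise.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

for alpha in [0.0, 100.0, 2000.0]:
    beta = lasso_coordinate_descent(X, y, alpha)
    print(f"alpha={alpha:g}", np.round(beta, 3))
```

With a moderate `alpha` the noise features are switched off while the informative ones are only shrunk; with a very large `alpha` everything is set to zero, which is the underfitting case.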

In [None]:
fig, ax = plt.subplots()
plot_curve(lasso, 'lasso__alpha', [1e-5, 1e-2, 1e-1, 1e1, 1e8], ax)

## Bayesian Ridge regression

Bayesian Ridge regression is very similar to regular Ridge regression, with one major difference: instead of setting an arbitrary strength for the $\ell_{2}$ regularization (the $\alpha$ in the Ridge cost above, often written as $\lambda$), the parameter is treated as a random variable and estimated from the data.
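To give a feeling for how the regularization strength can be estimated from the data, here is a simplified sketch of the classic evidence-maximization updates. It is an illustration under stated assumptions (centered synthetic data, no intercept, no hyperpriors), not the exact algorithm inside `BayesianRidge`; the function and variable names are hypothetical.

```python
import numpy as np


def bayesian_ridge_sketch(X, y, n_iter=100):
    """Estimate the ridge penalty from the data by evidence maximization.

    noise_precision  = 1 / variance of the observation noise
    weight_precision = 1 / variance of the Gaussian prior on the weights
    Their ratio plays the role of the ridge penalty.
    """
    n_samples, n_features = X.shape
    eigvals = np.linalg.eigvalsh(X.T @ X)
    noise_precision = 1.0 / np.var(y)
    weight_precision = 1.0
    for _ in range(n_iter):
        penalty = weight_precision / noise_precision
        # Posterior mean of the weights = ridge solution for the current penalty.
        mean = np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)
        # gamma: effective number of well-determined parameters.
        gamma = np.sum(eigvals / (eigvals + penalty))
        weight_precision = gamma / (mean @ mean)
        residual = y - X @ mean
        noise_precision = (n_samples - gamma) / (residual @ residual)
    return mean, float(noise_precision), float(weight_precision)


rng = np.random.default_rng(3)
true_weights = np.array([1.5, -2.0, 0.5, 0.0])
X = rng.normal(size=(200, 4))
y = X @ true_weights + rng.normal(scale=0.5, size=200)  # noise variance 0.25

weights, noise_precision, weight_precision = bayesian_ridge_sketch(X, y)
print("weights:", np.round(weights, 3))
print("estimated noise precision:", round(noise_precision, 3))
print("implied ridge penalty:", round(weight_precision / noise_precision, 3))
```

No penalty had to be chosen by hand: the loop alternates between solving a ridge problem and updating the two precisions until they settle.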

In [None]:
from sklearn.linear_model import BayesianRidge

In [None]:
bayes = Pipeline([('poly', PolynomialFeatures(degree=5)), 
                  ('bayes', BayesianRidge())])
params = {'bayes__alpha_1': np.logspace(-5, 5, 5),
          'bayes__alpha_2': np.logspace(-5, 13, 5),
          'bayes__lambda_1': np.logspace(-5, 13, 5),
          'bayes__lambda_2': np.logspace(-5, 13, 5)}
bayes_grid = GridSearchCV(bayes, 
                          params,
                          cv=5,
                          scoring='neg_mean_squared_error')
bayes_grid.fit(lsop_train, y_train)

In [None]:
show_score(bayes_grid.best_estimator_, lsop_test, y_test, metric='neg_mean_squared_error')

In [None]:
y_hat = bayes_grid.best_estimator_.predict(lsop_test)
plot_pred(y_test, y_hat)

In [None]:
fig, ax = plt.subplots()
plot_curve(bayes, 'bayes__alpha_1', [1e-5, 1e-2, 1e-1, 1e1, 1e2], ax)

## Support Vector regression

Support vector machines can be used for regression as well. The main ideas are to:
- reduce the number of required training points to the support vectors
- fit a linear model
- transform the data points into a higher-dimensional space, fit the linear model there, then map the fitted curve back to the original, lower-dimensional space
- use kernel functions instead of actually transforming the data
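The last point, often called the kernel trick, can be verified on a tiny example. The sketch below uses a degree-2 polynomial kernel on scalar inputs; the function names are made up for illustration.

```python
import numpy as np


def poly_kernel(x, z):
    """Degree-2 polynomial kernel for scalar inputs."""
    return (x * z + 1.0) ** 2


def explicit_features(x):
    """Feature map whose inner product reproduces poly_kernel."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])


x, z = 3.0, -0.5
via_kernel = poly_kernel(x, z)
via_features = explicit_features(x) @ explicit_features(z)
print(via_kernel, via_features)
```

Both numbers agree: the kernel returns the inner product in the 3-dimensional feature space without ever building the transformed vectors. For kernels such as `rbf` the corresponding feature space is infinite-dimensional, so evaluating the kernel is the only practical option.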

In [None]:
from sklearn.svm import SVR

In [None]:
svr = SVR(kernel='rbf', C=1e3, gamma=5e-5, degree=5)
svr.fit(lsop_train, y_train)
show_score(svr, lsop, y, metric='neg_mean_squared_error')

In [None]:
y_hat = svr.predict(lsop_test)
plot_pred(y_test, y_hat)

In [None]:
fig, ax = plt.subplots()
plot_curve(svr, 'kernel', ['linear', 'poly', 'rbf'], ax)

In [None]:
fig, ax = plt.subplots()
# 'degree' only affects the polynomial kernel, so switch to it first.
plot_curve(svr.set_params(kernel='poly'), 'degree', [2, 3, 4, 5], ax)

## [XGBoost](https://xgboost.readthedocs.io/en/latest/model.html)

XGBoost is short for **Extreme Gradient Boosting**, a gradient boosted tree method. Boosted trees are an **ensemble method**: training multiple trees on the same training set and combining them gives a more robust solution than a single tree. In this sense, boosted trees are the same kind of model as random forests; the difference comes from the training process. It is also important that XGBoost incorporates a **regularization term** in its objective function.

XGBoost uses additive training: in each step it adds one new tree to the ensemble, selecting the best tree each time. The best tree is the **simplest tree** (its tree structure score is minimal) **with the most information gain**.
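Additive training can be sketched in plain NumPy with the simplest possible trees, one-split "stumps". This is ordinary gradient boosting for squared loss, without XGBoost's regularized objective or structure score; all names and the synthetic data are assumptions for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single split on x that best fits the residual with two constants."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value = residual[left].mean()
        right_value = residual[~left].mean()
        error = (((residual[left] - left_value) ** 2).sum()
                 + ((residual[~left] - right_value) ** 2).sum())
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Each new stump is fit to what the current ensemble still gets wrong."""
    prediction = np.full_like(y, y.mean())
    history = [float(np.mean((y - prediction) ** 2))]
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(float(np.mean((y - prediction) ** 2)))
    return stumps, history


rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=100)

stumps, history = boost(x, y)
print("training MSE with 0 trees: ", round(history[0], 4))
print("training MSE with 50 trees:", round(history[-1], 4))
```

The training error goes down with every added tree, which is also why the number of trees and the learning rate have to be tuned to avoid overfitting.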

For a more detailed explanation, please consult these [slides](https://web.njit.edu/~usman/courses/cs675_spring20/BoostedTree.pdf) and this [tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/model.html), or this [wiki page](https://en.wikipedia.org/wiki/Gradient_boosting) on gradient boosting. Install it using the `conda install py-xgboost` command.

In [None]:
from xgboost.sklearn import XGBRegressor

In [None]:
xgb = XGBRegressor()
xgb.fit(lsop_train, y_train)
y_hat = xgb.predict(lsop_test)
show_score(xgb, lsop, y, metric='neg_mean_squared_error')

In [None]:
plot_pred(y_test, y_hat)

In [None]:
fig, ax = plt.subplots()
plot_curve(xgb, 'n_estimators', [1, 5, 10, 25, 100], ax)

---

## Managing model lifecycle

### Reusing trained pipelines

Trained pipelines can be used outside of the training program as well.

#### Saving pipelines

First, we have to `serialize` the models. This process will save the whole pipeline object into a file. After saving, we can freely move the file and read it in elsewhere.  
**Important:** the libraries used must have the same versions on the saving and the loading side.

In [None]:
import pickle

with open('xgboost_model.pickle', 'wb') as picklefile:
    pickle.dump(xgb, picklefile)

#### Loading pipelines

Loading and using the models is pretty easy - as long as we have the same libraries installed (and the same versions).

In [None]:
import pickle

with open('xgboost_model.pickle', 'rb') as picklefile:
    model = pickle.load(picklefile)

In [None]:
show_score(model, lsop, y, metric='neg_mean_squared_error')
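Since a pickle is only reliable with matching library versions, one option is to store the versions next to the model and compare them when loading. The following is a minimal sketch using only the standard library; the helper names, the file name and the package list are assumptions, and a plain dictionary stands in for a trained pipeline so that the example runs on its own.

```python
import os
import pickle
import platform
import tempfile
from importlib import metadata

# Assumption: the distributions the pickled model depends on.
PACKAGES = ("scikit-learn", "xgboost")


def current_versions(packages=PACKAGES):
    """Collect the Python version and the installed version of each package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def dump_with_versions(model, path):
    """Pickle the model together with the versions it was saved under."""
    with open(path, "wb") as picklefile:
        pickle.dump({"model": model, "versions": current_versions()}, picklefile)


def load_with_versions(path):
    """Unpickle the model and report every version that differs from the saved one."""
    with open(path, "rb") as picklefile:
        payload = pickle.load(picklefile)
    now = current_versions()
    mismatches = {name: (saved, now.get(name))
                  for name, saved in payload["versions"].items()
                  if now.get(name) != saved}
    return payload["model"], mismatches


# Stand-in for a trained pipeline.
toy_model = {"coef": [1.0, 2.0]}
path = os.path.join(tempfile.mkdtemp(), "model_with_versions.pickle")
dump_with_versions(toy_model, path)

restored, mismatches = load_with_versions(path)
print(restored, mismatches)
```

An empty `mismatches` dictionary means that the loading environment matches the saving one; otherwise it lists the saved and the current version of each differing library.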

### Tracking sklearn models

One of the typical mistakes that even experienced professionals make is training models without tracking all of their experiments. Once several combinations of pipeline items, parameters and models have been tried, it is hard to remember which one gave the best performance. To avoid this, a tracking solution can be used.

#### What is <a href="https://mlflow.org/">MLFlow</a>?

From its <a href="https://mlflow.org/docs/latest/index.html">documentation</a>:  

_"MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles four primary functions:_
- _Tracking experiments to record and compare parameters and results (MLflow Tracking)._
- _Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects)._
- _Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models)._
- _Providing a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations (MLflow Model Registry)."_


#### Tracking Experiments

In order to track your experiment, you have to:
- install the library with:
    ```bash
    pip install mlflow
    ```
- then start mlflow's tracking server with:
    ```bash
    mlflow ui
    ```
- and use the library to create and log your experiments
- once the tracking server is running you can follow your experiments at:
    ```
    localhost:5000
    ```

In [None]:
import mlflow
import mlflow.sklearn

In [None]:
with mlflow.start_run(run_name="xgboost-default"):
    xgb = XGBRegressor()
    xgb.fit(lsop_train, y_train)
    
    # Log parameter values
    for param, val in xgb.get_params().items():
        mlflow.log_param(param, val)
    
    # Log metrics of the run
    predictions = xgb.predict(lsop_test)
    r2 = r2_score(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    ev = explained_variance_score(y_test, predictions)
    
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("ev", ev)
    
    # Log pictures
    fig, ax = plt.subplots()
    plot_curve(xgb, 'n_estimators', [1, 5, 10, 25, 100], ax)
    fig.savefig('xgboost_default_model_curve.png', transparent=True)
    mlflow.log_artifact('xgboost_default_model_curve.png')
    
    # Log the model itself
    mlflow.sklearn.log_model(xgb, "xgboost_default_model")
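The three metrics logged in the run above can also be written out from their definitions, which helps when reading them on the UI. This is a NumPy sketch with a hypothetical helper name and toy numbers; in the notebook itself the scikit-learn functions do the same job.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """R^2, RMSE and explained variance computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residual ** 2))),
        "ev": float(1.0 - np.var(residual) / np.var(y_true)),
    }


y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Predictions that are always exactly 1 too high.
shifted = regression_metrics(y_true, y_true + 1.0)
print(shifted)
```

For these constantly shifted predictions the explained variance stays at 1 while R² drops below 1: explained variance ignores a systematic bias, R² does not. Logging both therefore gives a hint about where an error comes from.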

#### Loading saved models

Exported models can be loaded later. You have to check the logged model details on the UI in order to get the model path:
<img src="pics/mlflow_ui_model_details.png" width=500>

In [None]:
xgb_loaded = mlflow.sklearn.load_model("path/to/model/from/the/ui")
show_score(xgb_loaded, lsop, y, metric='neg_mean_squared_error')

### Track and save regression models

Use the pipelines we built previously to:
- track them using mlflow (kudos for using functions and/or loops)
- compare the results on the mlflow UI