# Boston Housing: model development with MLFlow

This notebook is intended to demonstrate the use of MLFlow for tracking experiments, comparing parameters and results, storing models and packaging code. A follow-on notebook will show how MLFlow can also be used to deploy the model as a prediction web service. 

The Boston housing dataset has been used as a simple use case for the trialling of various supervised regression techniques to find a strong performer while MLFlow tracks the experiments and packages the resultant models.  

**Goal:** *build a model for predicting house prices using the Boston housing dataset.* 

### Import libraries

In [23]:
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso

import mlflow
import mlflow.sklearn
import joblib
from urllib.parse import urlparse

import logging
import warnings

### Configure MLFlow

Configure MLFlow to store entities in a local SQLite database and artefacts in local file storage.
 - Entities: runs, parameters, metrics, tags, metadata etc. 
 - Artefacts: models, files, images etc.

In [3]:
# set mlflow to store entities in a SQLite database (artefacts stored in ./mlruns folder)
mlflow.set_tracking_uri('sqlite:///mlruns.db')

# alternatively set mlflow to use a file storage location for both artefacts and entities
#mlflow.set_tracking_uri('file:///C:/Users/eddlo/Python/Projects/MLFlow-housing/mlflow-housing/experimentation/mlruns')

**Create an experiment in MLFlow**

Set an experiment to track our model development for the housing dataset. You can either:
 - Create a new experiment
 - Set the experiment ID to an existing experiment

In [4]:
warnings.filterwarnings("ignore")

# create a new experiment
experiment_id = mlflow.create_experiment(name="Boston Housing")

# use an existing experiment
#experiment_id='1'

# print out config details for the experiment
experiment = mlflow.get_experiment(experiment_id)
mlflow.set_experiment(experiment_id=experiment_id)

print(f"Name: {experiment.name}")
print(f"Experiment_id: {experiment.experiment_id}")
print(f"Artifact Location: {experiment.artifact_location}")

2021/12/26 14:12:34 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2021/12/26 14:12:34 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

Name: Boston Housing
Experiment_id: 1
Artifact Location: ./mlruns/1


#### Start up MLFlow UI to review runs as you go

Start up a local tracking server and point it towards the SQLite db (for entities) and local file storage (for artefacts) using the mlflow CLI. Then, navigate to http://localhost:5000/ in a browser to see the MLFlow UI and compare models. 

In [5]:
!mlflow server --backend-store-uri 'sqlite:///mlruns.db' --default-artifact-root ./mlruns --host 0.0.0.0

'mlflow' is not recognized as an internal or external command,
operable program or batch file.


### Feature engineering

**Load data**

In [5]:
cols=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
df = pd.read_csv('datasets\housing.csv',sep=' ',skipinitialspace=True,header=None,names=cols)

In [6]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


**Data cleaning**

Filter dataset and extract useful features based on learnings in EDA

In [10]:
# filter for MEDV is not equal to 50
sel_df = df[df['MEDV'] != 50].copy()

In [11]:
sel_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


**Feature selection and train-test split**

Choose features based on knowledge gained through EDA.

Split the data, holding out 20% for final evaluation. The remaining 80% will be used for training and hyperameter tuning with cross-validation.

In [12]:
# include all features in X and use pipeline in next step to only keep valuable features
#X_cols = ['INDUS','RM','TAX','PTRATIO','LSTAT']
X_cols = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
y_col = 'MEDV'

X = sel_df[X_cols].copy()
y = sel_df[[y_col]].copy()

# train-test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

**Data pre-processing**

Looking ahead, we must ensure any new data input to the model has been pre-processed in the same way as for our train and test sets. This includes scaling it and generating polynomial features (where polynomial features have been used).

To include these data transforms as part of our model, sklearn's *pipelines* can be used. Data pre-processing steps are combined with an estimator into a pipeline. The pipeline is fit to the training data, and then can be applied to test / inference data. 

First initialise a pipeline and scale our chosen features using standard scaler. Other columns will be discarded.

In [18]:
# create a pipeline that scales the selected columns (discard other columns)
scaler_pipe = make_column_transformer(
    (make_pipeline(StandardScaler()), ['INDUS','RM','TAX','PTRATIO','LSTAT']))
scaler_pipe.fit(X_train)

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('standardscaler',
                                                  StandardScaler())]),
                                 ['INDUS', 'RM', 'TAX', 'PTRATIO', 'LSTAT'])])

In [20]:
# apply the pipeline to train and test sets
X_train_scaled = scaler_pipe.transform(X_train) 
X_test_scaled = scaler_pipe.transform(X_test)

In [41]:
X_train_scaled

array([[-0.47855844, -0.92824862, -0.5775791 , -1.59680549,  2.40109038],
       [-1.19389525,  0.92133058, -0.85092188, -1.40517905, -1.11727385],
       [ 1.00279695, -0.63237924,  1.53191405,  0.79852495,  1.29612619],
       ...,
       [-0.16867569, -0.31119482,  0.14143124, -0.35123366, -0.37262407],
       [-0.61757126, -0.29695512, -1.04107337, -0.30332705,  0.82842396],
       [-1.03171362, -0.31277701, -0.66671262, -0.92611296, -0.39947103]])

### Train, validate and test models

First create a function for consistent testing and evaluation throughout. R2 score will be used to evaluate the models.

In [29]:
# create a function for training models and logging results with mlflow
def model_train_val(name, feature_pipe, model, params, X_train, y_train, X_test, y_test, cv=4, scoring='r2'):
    
    with mlflow.start_run(run_name = name):
        
        # perform data pre-processing
        X_train_processed = feature_pipe.transform(X_train) 
        X_test_processed = feature_pipe.transform(X_test)
        
        # run gridsearch through dict of hyperparams and with k=4 CV folds 
        clf = GridSearchCV(model, param_grid=params, cv=cv, scoring=scoring, return_train_score=True)
        clf.fit(X_train_processed, y_train.values.ravel())
        print(f'Best params: {clf.best_params_}')
        print(f'Best CV score: {clf.best_score_}')
        print(f"Training set score: {clf.cv_results_['mean_train_score'][clf.best_index_]:2.3}")
        print(f'Test set score: {clf.best_estimator_.score(X_test_processed,y_test):2.3}')
        
        #log metrics and parameters used
        for k,v in clf.best_params_.items():
            mlflow.log_param(k,v)
        mlflow.log_metric("r2_train", clf.cv_results_['mean_train_score'][clf.best_index_])
        mlflow.log_metric("r2_test", clf.best_estimator_.score(X_test_processed,y_test))
        
        # construct prediction pipeline 
        predict_pipe = make_pipeline(feature_pipe, clf.best_estimator_)
        
        # register model
        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
        if tracking_url_type_store != "file":
            mlflow.sklearn.log_model(predict_pipe, "model", registered_model_name=name+'_pipeline')
        else:
            mlflow.sklearn.log_model(predict_pipe, "model")

**Dummy regressor**

Create a dummy model to benchmark the score.

In [24]:
# set up dummy regressor
dummy_model = DummyRegressor(strategy='mean')
name="DummyRegressor (mean)"
params = {}
model_train_val(name, scaler_pipe, dummy_model, params, X_train, y_train, X_test, y_test)

Best params: {}
Best CV score: -0.0058795234512875605
Training set score: 0.0
Test set score: -0.0158


Successfully registered model 'DummyRegressor (mean)_pipeline'.
2021/12/26 14:28:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: DummyRegressor (mean)_pipeline, version 1
Created version '1' of model 'DummyRegressor (mean)_pipeline'.


**Simple linear regression**

In [30]:
linreg_model = LinearRegression()
name="SimpleLinearRegression"
params = {}
model_train_val(name, scaler_pipe, linreg_model, params, X_train, y_train, X_test, y_test)

Best params: {}
Best CV score: 0.7018324502631647
Training set score: 0.734
Test set score: 0.72


Successfully registered model 'SimpleLinearRegression_pipeline'.
2021/12/26 14:33:38 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: SimpleLinearRegression_pipeline, version 1
Created version '1' of model 'SimpleLinearRegression_pipeline'.


**Ridge regression**

In [31]:
ridge_model = Ridge()
name = "RidgeRegression"
params = {'alpha':[0.01,0.02,0.04,0.08,0.16,0.32,0.64,1,2,4,8]}
model_train_val(name, scaler_pipe, ridge_model, params, X_train, y_train, X_test, y_test)

Best params: {'alpha': 8}
Best CV score: 0.7040355395274798
Training set score: 0.734
Test set score: 0.719


Successfully registered model 'RidgeRegression_pipeline'.
2021/12/26 14:34:29 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: RidgeRegression_pipeline, version 1
Created version '1' of model 'RidgeRegression_pipeline'.


**Lasso regression**

In [32]:
lasso_model = Lasso()
name = "LassoRegression"
params = {'alpha':[0.01,0.02,0.04,0.08,0.16,0.32,0.64,1,2,4,8]}
model_train_val(name, scaler_pipe, lasso_model, params, X_train, y_train, X_test, y_test)

Best params: {'alpha': 0.08}
Best CV score: 0.7021768334235038
Training set score: 0.734
Test set score: 0.717


Successfully registered model 'LassoRegression_pipeline'.
2021/12/26 14:34:40 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: LassoRegression_pipeline, version 1
Created version '1' of model 'LassoRegression_pipeline'.


**Polynomial regression**

In [33]:
degrees = [2,3,4]
for degree in degrees: 

    poly_pipe = make_pipeline(scaler_pipe, PolynomialFeatures(degree))
    poly_pipe.fit(X_train)
    
    print("----------------------------------")
    print(f"Degree: {degree}")
    name = f"PolynomialRegression({degree} degrees)"
    poly_model = LinearRegression()
    params={}
    model_train_val(name, poly_pipe, poly_model, params, X_train, y_train, X_test, y_test)

----------------------------------
Degree: 2
Best params: {}
Best CV score: 0.8586957466635642
Training set score: 0.876
Test set score: 0.795


Successfully registered model 'PolynomialRegression(2 degrees)_pipeline'.
2021/12/26 14:34:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRegression(2 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRegression(2 degrees)_pipeline'.


----------------------------------
Degree: 3
Best params: {}
Best CV score: 0.7271421260699754
Training set score: 0.882
Test set score: 0.783


Successfully registered model 'PolynomialRegression(3 degrees)_pipeline'.
2021/12/26 14:34:59 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRegression(3 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRegression(3 degrees)_pipeline'.


----------------------------------
Degree: 4
Best params: {}
Best CV score: -7.988512820588891
Training set score: 0.903
Test set score: -1.12


Successfully registered model 'PolynomialRegression(4 degrees)_pipeline'.
2021/12/26 14:35:04 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRegression(4 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRegression(4 degrees)_pipeline'.


**Polynomial regression with regularization**

It looks like introducing higher order features improves the performance of the model. However, models are scoring higher on training than test sets (overfitting / high variance). So lets introduce some regularization with the higher order features. 

In [34]:
degrees = [2,3,4,5]
for degree in degrees: 
    
    poly_pipe = make_pipeline(scaler_pipe, PolynomialFeatures(degree))
    poly_pipe.fit(X_train)
    
    print("----------------------------------")
    print(f"Degree: {degree}")
    name = f"PolynomialLassoRegression({degree} degrees)"
    poly_model = Lasso()
    params={'alpha':[0.01,0.02,0.04,0.08,0.16,0.32,0.64,1]}
    model_train_val(name, poly_pipe, lasso_model, params, X_train, y_train, X_test, y_test)

----------------------------------
Degree: 2
Best params: {'alpha': 0.02}
Best CV score: 0.8605507380562232
Training set score: 0.876
Test set score: 0.792


Successfully registered model 'PolynomialLassoRegression(2 degrees)_pipeline'.
2021/12/26 14:35:15 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialLassoRegression(2 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialLassoRegression(2 degrees)_pipeline'.


----------------------------------
Degree: 3
Best params: {'alpha': 0.04}
Best CV score: 0.8614797776232865
Training set score: 0.89
Test set score: 0.838


Successfully registered model 'PolynomialLassoRegression(3 degrees)_pipeline'.
2021/12/26 14:35:20 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialLassoRegression(3 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialLassoRegression(3 degrees)_pipeline'.


----------------------------------
Degree: 4
Best params: {'alpha': 0.08}
Best CV score: 0.8632205550168695
Training set score: 0.893
Test set score: 0.817


Successfully registered model 'PolynomialLassoRegression(4 degrees)_pipeline'.
2021/12/26 14:35:24 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialLassoRegression(4 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialLassoRegression(4 degrees)_pipeline'.


----------------------------------
Degree: 5
Best params: {'alpha': 0.08}
Best CV score: 0.8040759681883326
Training set score: 0.9
Test set score: 0.787


Successfully registered model 'PolynomialLassoRegression(5 degrees)_pipeline'.
2021/12/26 14:35:30 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialLassoRegression(5 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialLassoRegression(5 degrees)_pipeline'.


In [35]:
degrees = [2,3,4,5]
for degree in degrees: 
    
    poly_pipe = make_pipeline(scaler_pipe, PolynomialFeatures(degree))
    poly_pipe.fit(X_train)
    
    print("----------------------------------")
    print(f"Degree: {degree}")
    name = f"PolynomialRidgeRegression({degree} degrees)"
    poly_model = Ridge()
    params={'alpha':[0.01,0.02,0.04,0.08,0.16,0.32,0.64,1]}
    model_train_val(name, poly_pipe, lasso_model, params, X_train, y_train, X_test, y_test)

----------------------------------
Degree: 2
Best params: {'alpha': 0.02}
Best CV score: 0.8605507380562232
Training set score: 0.876
Test set score: 0.792


Successfully registered model 'PolynomialRidgeRegression(2 degrees)_pipeline'.
2021/12/26 14:35:43 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRidgeRegression(2 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRidgeRegression(2 degrees)_pipeline'.


----------------------------------
Degree: 3
Best params: {'alpha': 0.04}
Best CV score: 0.8614797776232865
Training set score: 0.89
Test set score: 0.838


Successfully registered model 'PolynomialRidgeRegression(3 degrees)_pipeline'.
2021/12/26 14:35:49 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRidgeRegression(3 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRidgeRegression(3 degrees)_pipeline'.


----------------------------------
Degree: 4
Best params: {'alpha': 0.08}
Best CV score: 0.8632205550168695
Training set score: 0.893
Test set score: 0.817


Successfully registered model 'PolynomialRidgeRegression(4 degrees)_pipeline'.
2021/12/26 14:35:53 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRidgeRegression(4 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRidgeRegression(4 degrees)_pipeline'.


----------------------------------
Degree: 5
Best params: {'alpha': 0.08}
Best CV score: 0.8040759681883326
Training set score: 0.9
Test set score: 0.787


Successfully registered model 'PolynomialRidgeRegression(5 degrees)_pipeline'.
2021/12/26 14:35:58 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: PolynomialRidgeRegression(5 degrees)_pipeline, version 1
Created version '1' of model 'PolynomialRidgeRegression(5 degrees)_pipeline'.


### Clean up in MLFlow

If you don't want to keep the experiment metadata or models, you can remove all MLFlow runs and experiments as a clean-up exercise.

In [34]:
mlflow.list_experiments()
print()
mlflow.list_run_infos(experiment_id=experiment_id)




[<RunInfo: artifact_uri='./mlruns/2/0d5ac0d52e7d442c92477e8373884beb/artifacts', end_time=1640169301505, experiment_id='2', lifecycle_stage='active', run_id='0d5ac0d52e7d442c92477e8373884beb', run_uuid='0d5ac0d52e7d442c92477e8373884beb', start_time=1640169298060, status='FINISHED', user_id='eddlo'>,
 <RunInfo: artifact_uri='./mlruns/2/05e95941ba024ab395d0ab0648c018c7/artifacts', end_time=1640169298022, experiment_id='2', lifecycle_stage='active', run_id='05e95941ba024ab395d0ab0648c018c7', run_uuid='05e95941ba024ab395d0ab0648c018c7', start_time=1640169295078, status='FINISHED', user_id='eddlo'>,
 <RunInfo: artifact_uri='./mlruns/2/2ce79d559d0546dd91bd26a252b0ac78/artifacts', end_time=1640169295057, experiment_id='2', lifecycle_stage='active', run_id='2ce79d559d0546dd91bd26a252b0ac78', run_uuid='2ce79d559d0546dd91bd26a252b0ac78', start_time=1640169292270, status='FINISHED', user_id='eddlo'>,
 <RunInfo: artifact_uri='./mlruns/2/ace73a69b6614032b3b8d0c9518a29c2/artifacts', end_time=1640169

Delete individual runs (pass run id)

In [None]:
mlflow.delete_run()

Delete current experiment (send the experiment and associated runs to the trash folder). These need to be manually deleted from the trash folder if you want to create a new experiment with the same name / id.

In [35]:
mlflow.delete_experiment(experiment_id=experiment_id)
!mlflow experiments delete --experiment-id {experiment_id}

Delete the default experiment that gets created when mlflow is imported

In [37]:
mlflow.delete_experiment(experiment_id='0')
!mlflow experiments delete --experiment-id 0

In [None]:
!mlflow gc --backend-store-uri 'sqlite:///mlruns.db'