# Objective

This notebook introduces mlflow tracking using a local folder and a sqlite database. The data is available on [Kaggle](https://www.kaggle.com): https://www.kaggle.com/datasets/harlfoxem/housesalesprediction.

Necessary package for mlflow: ```pip install mlflow```

Necessary package for using the web UI with sqlite: ```pip install sqlalchemy```

# Setup

In [1]:
import pandas as pd
import numpy as np

import xgboost as xgb
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

import mlflow

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# Regression Model

We first train a regression model using a Random Forest Regressor. The preprocessing applied here was explored in the notebook 01_linear_regression_xgboost; please refer to that notebook for further details.

## Read the Data

In [2]:
df = pd.read_csv('data/kc_house_data.csv')
df.head()

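| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |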


## Preprocessing
Here we define a function that performs the preprocessing explored in notebook 01_linear_regression_xgboost. For the purposes of this notebook we split the dataset into a training set and a validation set and omit a separate test set.

In [3]:
def preprocessing(df):
    
    # used columns
    columns = ['sqft_living','grade', 'sqft_above', 'sqft_living15',
           'bathrooms','view','sqft_basement','lat','long','waterfront',
           'yr_built','bedrooms']
    # Delete entry with 33 bedrooms
    df = df[df["bedrooms"] != 33]
    
    # Convert grade, view, waterfront to type object
    df[['grade','view','waterfront']] = df[['grade','view','waterfront']].astype('object')
    
    # Create training and validation set
    X_train, X_val, y_train, y_val = train_test_split(df[columns], df['price'], test_size=0.2, shuffle=True, random_state=42)
    print(f'train data shape: X - {X_train.shape}, y - {y_train.shape}')
    print(f'validation data shape: X - {X_val.shape}, y - {y_val.shape}')
    
    # log-transform the target variable
    y_train = np.log1p(y_train)
    y_val = np.log1p(y_val)
    
    # define categorical and numerical variables
    categorical = ['grade', 'view', 'waterfront']
    numerical = ['sqft_living', 'sqft_above', 'sqft_living15',
           'bathrooms','sqft_basement','lat','long',
           'yr_built','bedrooms']
    
    # one-hot encode categorical variables
    ohe = OneHotEncoder()
    X_train_cat = ohe.fit_transform(X_train[categorical]).toarray()
    X_val_cat = ohe.transform(X_val[categorical]).toarray()
    
    # define numerical columns 
    X_train_num = np.array(X_train[numerical])
    X_val_num = np.array(X_val[numerical])
    
    # concatenate numerical and categorical variables
    X_train = np.concatenate([X_train_cat, X_train_num], axis=1)
    X_val = np.concatenate([X_val_cat, X_val_num], axis=1)
    print('Shapes after one-hot encoding')
    print(f'X_train shape: {X_train.shape}, X_val shape {X_val.shape}')
    
    return X_train, X_val, y_train, y_val

In [4]:
X_train, X_val, y_train, y_val = preprocessing(df)

train data shape: X - (17289, 12), y - (17289,)
validation data shape: X - (4323, 12), y - (4323,)
Shapes after one-hot encoding
X_train shape: (17289, 28), X_val shape (4323, 28)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['grade','view','waterfront']] = df[['grade','view','waterfront']].astype('object')
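The warning above is pandas' `SettingWithCopyWarning`: the columns are assigned on a DataFrame that was itself produced by filtering another one, so pandas cannot tell whether the original frame was meant to change. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` after the filter. The small DataFrame `raw` and the names `clean` and `cat_cols` are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the housing data, for illustration only.
raw = pd.DataFrame({
    "bedrooms": [3, 33, 2],
    "grade": [7, 7, 6],
    "view": [0, 0, 0],
    "waterfront": [0, 0, 0],
})
cat_cols = ["grade", "view", "waterfront"]

# .copy() makes the filtered frame an independent object, so the
# assignment below modifies `clean` only and leaves `raw` untouched.
clean = raw[raw["bedrooms"] != 33].copy()
clean[cat_cols] = clean[cat_cols].astype("object")

print(clean.dtypes)
```

The results of the preprocessing are the same either way; the copy only makes the intent explicit.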


## Modelling
### Experiment Tracking using mlflow
Define a function for fitting and evaluating a Random Forest Regressor model.

In [5]:
def train_rf(X_train, y_train, X_val, y_val, n_estimators=100, max_depth=6):
    
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)
    
    # generate predictions
    y_pred_train = model.predict(X_train).reshape(-1,1)
    y_pred = model.predict(X_val).reshape(-1,1)
    
    # calculate errors
    rmse_train = mean_squared_error(y_pred_train, y_train, squared=False)
    rmse_val = mean_squared_error(y_pred, y_val, squared=False)
    print(f"rmse training: {rmse_train:.3f}\t rmse validation: {rmse_val:.3f}")

In [6]:
train_rf(X_train, y_train,X_val, y_val, n_estimators=100, max_depth=6)

rmse training: 0.219	 rmse validation: 0.229
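Two points about the reported RMSE are worth keeping in mind. First, it is measured on the `log1p` scale of the price, not in dollars. Second, the `squared=False` argument of `mean_squared_error` was deprecated in scikit-learn 1.4 and removed in 1.6, so the call may fail on newer installations. The sketch below computes the RMSE directly with NumPy instead; the helper `rmse` and the prediction offsets are assumptions made for illustration, while the three prices are taken from the table above.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error computed directly with NumPy."""
    # ravel() guards against (n, 1) vs. (n,) shapes, which would
    # otherwise broadcast to an (n, n) matrix of differences.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


prices = np.array([221900.0, 538000.0, 180000.0])
log_prices = np.log1p(prices)

# Hypothetical predictions on the log scale (made-up offsets).
log_preds = log_prices + np.array([0.1, -0.1, 0.0])

print(f"rmse on log scale: {rmse(log_prices, log_preds):.3f}")
print(f"rmse in dollars:   {rmse(prices, np.expm1(log_preds)):.0f}")
```

`np.expm1` is the inverse of `np.log1p`, so it maps predictions back to the original price scale.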


Now log several parameters and metrics using mlflow.

In [7]:
def train_rf(X_train, y_train, X_val, y_val, n_estimators=100, max_depth=6):
    
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)
    
    # generate predictions
    y_pred_train = model.predict(X_train).reshape(-1,1)
    y_pred = model.predict(X_val).reshape(-1,1)
    
    # calculate errors
    rmse_train = mean_squared_error(y_pred_train, y_train, squared=False)
    rmse_val = mean_squared_error(y_pred, y_val, squared=False)
    print(f"rmse training: {rmse_train:.3f}\t rmse validation: {rmse_val:.3f}")
    
    # Logging params and metrics to MLFlow
    mlflow.log_param('n_estimators', n_estimators)
    mlflow.log_param('max_depth', max_depth)
    mlflow.log_metric('rmse_val', rmse_val)
    mlflow.log_metric('rmse_train', rmse_train) 

In [8]:
train_rf(X_train, y_train, X_val, y_val, n_estimators=100, max_depth=6)

rmse training: 0.219	 rmse validation: 0.228


### Store logs in a local Folder

The easiest way to track our experiments is to save all the logged information in a local folder. We can do this by using ```with mlflow.start_run()``` and running our model inside the ```with``` statement. This creates a directory ```./mlruns```, where the runs and everything logged for them are saved. Every time we run code inside ```with mlflow.start_run()```, a new run is created under the same experiment. By default, runs go to the experiment named ```Default```, which has the ID ```0```.

In [11]:
with mlflow.start_run():
    train_rf(X_train, y_train, X_val, y_val, n_estimators=100, max_depth=6)

rmse training: 0.219	 rmse validation: 0.228


We perform a new run of the model, changing the ```max_depth``` parameter. If we run it as above, it is stored in the same experiment folder as the previous run, in this case the one with experiment_id=0.

In [12]:
with mlflow.start_run():
    train_rf(X_train, y_train, X_val, y_val, n_estimators=100, max_depth=None)

rmse training: 0.072	 rmse validation: 0.185


If we want to save our results under a different experiment, we can set one using ```mlflow.set_experiment()```. If the experiment exists, subsequent runs are saved under it; if it doesn't exist, it is created first.

Let us create a second model and log parameters and metrics as in the previous example, this time using XGBoost. We want to save the results under a new experiment. 

In [13]:
def train_xgb(X_train, y_train, X_val, y_val, 
              n_estimators=100,
              objective='reg:squarederror',
              learning_rate=0.3,
              min_child_weight=1,
              lambda_=1,
              gamma=0):
    
    # Initialize XGB with objective function
    parameters = {"objective": objective,
              "n_estimators": n_estimators,
              "eta": learning_rate,
              "lambda": lambda_,
              "gamma": gamma,
              "max_depth": None,
              "min_child_weight": min_child_weight,
              "verbosity": 0}

    
    model = xgb.XGBRegressor(**parameters)
    model.fit(X_train, y_train)
    
    # generate predictions
    y_pred_train = model.predict(X_train).reshape(-1,1)
    y_pred = model.predict(X_val).reshape(-1,1)
    
    # calculate errors
    rmse_train = mean_squared_error(y_pred_train, y_train, squared=False)
    rmse_val = mean_squared_error(y_pred, y_val, squared=False)
    print(f"rmse training: {rmse_train:.3f}\t rmse validation: {rmse_val:.3f}")
    
    # Logging params and metrics to MLFlow
    mlflow.log_param('n_estimators', n_estimators)
    mlflow.log_param('objective', objective)
    mlflow.log_param('lambda', lambda_)
    mlflow.log_param('gamma', gamma)
    mlflow.log_param('eta', learning_rate)
    mlflow.log_param('min_child_weight', min_child_weight)
    mlflow.log_metric('rmse_val', rmse_val)
    mlflow.log_metric('rmse_train', rmse_train)

In [14]:
# defining a new experiment
experiment_name = 'xgboost'
exp_id = mlflow.set_experiment(experiment_name=experiment_name)
with mlflow.start_run():
    train_xgb(X_train, y_train, X_val, y_val, learning_rate=0.1)

2022/05/26 19:13:02 INFO mlflow.tracking.fluent: Experiment with name 'xgboost' does not exist. Creating a new experiment.


rmse training: 0.156	 rmse validation: 0.178


We can also log a run to a specific existing experiment by passing its experiment_id to ```mlflow.start_run()```.

In [15]:
with mlflow.start_run(experiment_id=1):
    train_xgb(X_train, y_train, X_val, y_val, learning_rate=0.01)

rmse training: 4.606	 rmse validation: 4.612


Using ```tree mlruns``` we can see the structure of the folder, where our runs are stored.

In [17]:
!tree mlruns

[01;34mmlruns[00m
├── [01;34m0[00m
│   ├── [01;34m29ed490ed510441692142a030a392a37[00m
│   │   ├── [01;34martifacts[00m
│   │   ├── meta.yaml
│   │   ├── [01;34mmetrics[00m
│   │   │   ├── rmse_train
│   │   │   └── rmse_val
│   │   ├── [01;34mparams[00m
│   │   │   ├── max_depth
│   │   │   └── n_estimators
│   │   └── [01;34mtags[00m
│   │       ├── mlflow.source.git.commit
│   │       ├── mlflow.source.name
│   │       ├── mlflow.source.type
│   │       └── mlflow.user
│   ├── [01;34m5cb2dbf0b0584f81bd2c45b5d1bdadee[00m
│   │   ├── [01;34martifacts[00m
│   │   ├── meta.yaml
│   │   ├── [01;34mmetrics[00m
│   │   │   ├── rmse_train
│   │   │   └── rmse_val
│   │   ├── [01;34mparams[00m
│   │   │   ├── max_depth
│   │   │   └── n_estimators
│   │   └── [01;34mtags[00m
│   │       ├── mlflow.source.git.commit
│   │       ├── mlflow.source.name
│   │       ├── mlflow.source.type
│   │       └── mlflow.user
│   ├── [01;34mc95dfd97e79

### Store logs in Sqlite Database

An alternative to storing all the logged outputs in a local folder is to use a database. Here we will use ```sqlite```. To tell mlflow to store things in a database, we set the tracking URI with ```mlflow.set_tracking_uri("sqlite:///mlruns.db")```. Alternatively, we can export the environment variable: ```export MLFLOW_TRACKING_URI=sqlite:///mlruns.db```.

In [20]:
mlflow.set_tracking_uri("sqlite:///mlruns.db")

In [21]:
with mlflow.start_run(experiment_id=0):
    train_rf(X_train, y_train, X_val, y_val)

rmse training: 0.219	 rmse validation: 0.228


This creates a database ```mlruns.db```, where the logged parameters and metrics are stored. We can check the entries using sqlite in the terminal:
```
sqlite3 mlruns.db
-- show all tables in the database
sqlite> .tables
-- show all metrics
sqlite> SELECT * FROM metrics;
```
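The same inspection can be done from Python with the standard-library `sqlite3` module. The sketch below is self-contained and therefore runs against an in-memory database with a simplified, hypothetical stand-in for the `metrics` table; the helper `list_tables` is also an assumption for illustration. To inspect the real store, connect to the database file created by mlflow instead.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


# In-memory stand-in; the real mlflow schema has more tables and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (key TEXT, value REAL)")
conn.execute("INSERT INTO metrics VALUES ('rmse_val', 0.228)")

print(list_tables(conn))
print(conn.execute("SELECT * FROM metrics").fetchall())
```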

## Using the web ui for Visualization
In the first case, where the logged parameters and metrics are stored locally in ```./mlruns```, we can simply type ```mlflow ui``` in the terminal to view the results interactively. This is short for ```mlflow ui --backend-store-uri mlruns```.

In the second case, we need to change the ```backend-store-uri``` and use ```mlflow ui --backend-store-uri sqlite:///mlruns.db```. In order for this to work sqlalchemy needs to be installed.

Another useful parameter is ```--port```, which changes the port for the UI.

In [22]:
mlflow.end_run()