# Using MLflow and Evidently to Evaluate Data Drift

In this example, we will explore the MLflow integration with Evidently.

This notebook shows how you can use Evidently and MLflow to:
* calculate data drift for the model, performed as batch checks
* log data drift using MLflow Tracking
* explore the result using MLflow UI

Acknowledgments:
* The dataset used in the example is from: https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv
* Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
* More information about the dataset can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

## Getting Started
To run this tutorial:

1. Install MLflow
You can install MLflow with `pip install mlflow`, or install MLflow together with scikit-learn via `pip install mlflow[extras]`.
More details: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#id5

2. Install Evidently
You can install Evidently with `pip install evidently`.
If you also want to view Evidently dashboards in Jupyter, install the Jupyter nbextension:
`jupyter nbextension install --sys-prefix --symlink --overwrite --py evidently`
Then activate it:
`jupyter nbextension enable evidently --py --sys-prefix`
More details: https://docs.evidentlyai.com/install-evidently

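3. Download the data from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset and save it locally. The code below fetches the archive directly, so this step is only needed if you prefer to work from a local copy.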

In [1]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [2]:
import json
import pandas as pd
import requests
import zipfile
import io

from evidently.model_profile import Profile
from evidently.model_profile.sections import DataDriftProfileSection
from evidently.pipeline.column_mapping import ColumnMapping

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

In [3]:
#load data: download the archive and read day.csv straight from memory
response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip")
response.raise_for_status()  # fail early if the download did not succeed
with zipfile.ZipFile(io.BytesIO(response.content)) as arc:
    raw_data = pd.read_csv(arc.open("day.csv"), header=0, sep=',', parse_dates=['dteday'], index_col='dteday')

In [4]:
#observe data structure
raw_data.head()

dteday,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
2011-01-01,1,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
2011-01-02,2,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2011-01-03,3,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
2011-01-04,4,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
2011-01-05,5,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


In [5]:
#set column mapping for Evidently Profile
data_columns = ColumnMapping()
data_columns.numerical_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']
data_columns.categorical_features = ['holiday', 'workingday']
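
Before running the drift calculation, a quick check that every mapped name is an actual column can catch typos early. The sketch below is a minimal illustration using plain lists: the column names are copied from the `raw_data.head()` output above, and `missing_columns` is a hypothetical helper, not part of Evidently.

```python
# Column names as shown in the raw_data.head() output (the index, dteday, is excluded).
available_columns = [
    'instant', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
    'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt',
]

# The same feature lists that are assigned to the ColumnMapping above.
numerical_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']
categorical_features = ['holiday', 'workingday']


def missing_columns(features, columns):
    """Hypothetical helper: return the mapped features that are not present in columns."""
    return [feature for feature in features if feature not in columns]


missing = missing_columns(numerical_features + categorical_features, available_columns)
print(missing)
```

In the notebook itself, `list(raw_data.columns)` could be passed in place of the hard-coded list.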

In [8]:
#evaluate data drift with Evidently Profile
def eval_drift(reference, production, column_mapping):
    data_drift_profile = Profile(sections=[DataDriftProfileSection()])
    data_drift_profile.calculate(reference, production, column_mapping=column_mapping)
    report = data_drift_profile.json()
    json_report = json.loads(report)

    drifts = []
    # fall back to empty lists so that an unset feature group does not break the loop
    num_features = column_mapping.numerical_features if column_mapping.numerical_features else []
    cat_features = column_mapping.categorical_features if column_mapping.categorical_features else []

    # collect (feature, p_value) pairs from the drift report
    for feature in num_features + cat_features:
        drifts.append((feature, json_report['data_drift']['data']['metrics'][feature]['p_value']))
    return drifts
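
To make the lookup inside `eval_drift` concrete, here is a minimal sketch that runs the same extraction on a hand-written dictionary. The nested keys mirror the path used above (`data_drift`, `data`, `metrics`, feature name, `p_value`). The p-values are invented for illustration, and the 0.05 threshold and the `flag_drift` helper are assumptions rather than part of Evidently.

```python
# Hand-written stand-in for the parsed Evidently report; the p-values are made up.
sample_report = {
    "data_drift": {
        "data": {
            "metrics": {
                "temp": {"p_value": 0.012},
                "hum": {"p_value": 0.47},
                "holiday": {"p_value": 0.91},
            }
        }
    }
}


def extract_p_values(json_report, features):
    """Return (feature, p_value) pairs, following the same key path as eval_drift."""
    metrics = json_report["data_drift"]["data"]["metrics"]
    return [(feature, metrics[feature]["p_value"]) for feature in features]


def flag_drift(drifts, threshold=0.05):
    """Hypothetical helper: list the features whose p-value falls below the threshold."""
    return [feature for feature, p_value in drifts if p_value < threshold]


drifts = extract_p_values(sample_report, ["temp", "hum", "holiday"])
drifted = flag_drift(drifts)
print(drifts)
print(drifted)
```

A low p-value suggests that the feature's distribution in the production batch differs from the reference; which cut-off to use is a judgment call for each project.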

In [9]:
#set reference dates
reference_dates = ('2011-01-01 00:00:00','2011-01-28 23:00:00')

#set experiment batches dates
experiment_batches = [
    ('2011-01-01 00:00:00','2011-01-29 23:00:00'),
    ('2011-01-29 00:00:00','2011-02-07 23:00:00'),
    ('2011-02-07 00:00:00','2011-02-14 23:00:00'),
    ('2011-02-15 00:00:00','2011-02-21 23:00:00'),  
]
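
Because each window is an inclusive pair of start and end timestamps, it can be worth checking which batches share days with the reference period or with a neighbouring batch. The sketch below uses only the standard library. The dates are copied from the cell above, while `parse_window` and `windows_overlap` are hypothetical helpers introduced for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

reference_dates = ('2011-01-01 00:00:00', '2011-01-28 23:00:00')
experiment_batches = [
    ('2011-01-01 00:00:00', '2011-01-29 23:00:00'),
    ('2011-01-29 00:00:00', '2011-02-07 23:00:00'),
    ('2011-02-07 00:00:00', '2011-02-14 23:00:00'),
    ('2011-02-15 00:00:00', '2011-02-21 23:00:00'),
]


def parse_window(window):
    """Convert a (begin, end) pair of timestamp strings into datetime objects."""
    return tuple(datetime.strptime(ts, FMT) for ts in window)


def windows_overlap(first, second):
    """Hypothetical helper: True if two inclusive (begin, end) windows share any time."""
    a_begin, a_end = parse_window(first)
    b_begin, b_end = parse_window(second)
    return a_begin <= b_end and b_begin <= a_end


# Does each batch overlap the reference period?
overlaps_reference = [windows_overlap(reference_dates, batch) for batch in experiment_batches]

# Does each batch overlap the batch that follows it?
overlaps_next = [
    windows_overlap(current, following)
    for current, following in zip(experiment_batches, experiment_batches[1:])
]

print(overlaps_reference)
print(overlaps_next)
```

With these dates, the first batch contains the whole reference period plus one extra day, so its drift check compares the reference data largely with itself.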

In [8]:
#log into MLflow
client = MlflowClient()

#set experiment
mlflow.set_experiment('Data Drift Evaluation with Evidently')

#start new run
for date in experiment_batches:
    with mlflow.start_run() as run: # optionally pass run_name='test' to start_run() to name the run
        
        # Log parameters
        mlflow.log_param("begin", date[0])
        mlflow.log_param("end", date[1])

        # Log metrics
        metrics = eval_drift(raw_data.loc[reference_dates[0]:reference_dates[1]], 
                             raw_data.loc[date[0]:date[1]], 
                             column_mapping=data_columns)
        for feature in metrics:
            mlflow.log_metric(feature[0], round(feature[1], 3))

        print(run.info)

INFO: 'Data Drift Evaluation with Evidently' does not exist. Creating a new experiment
<RunInfo: artifact_uri='file:///Users/emeli/Dev/evidently_dev/evidently/tutorials/mlruns/1/0edacc2499a842c98f2865d4db4d5e69/artifacts', end_time=None, experiment_id='1', lifecycle_stage='active', run_id='0edacc2499a842c98f2865d4db4d5e69', run_uuid='0edacc2499a842c98f2865d4db4d5e69', start_time=1632492708903, status='RUNNING', user_id='emeli'>
<RunInfo: artifact_uri='file:///Users/emeli/Dev/evidently_dev/evidently/tutorials/mlruns/1/39968ca74f8a41cba5366449334f3b94/artifacts', end_time=None, experiment_id='1', lifecycle_stage='active', run_id='39968ca74f8a41cba5366449334f3b94', run_uuid='39968ca74f8a41cba5366449334f3b94', start_time=1632492709376, status='RUNNING', user_id='emeli'>
<RunInfo: artifact_uri='file:///Users/emeli/Dev/evidently_dev/evidently/tutorials/mlruns/1/026b744ff6a7457c80098a33ffab576a/artifacts', end_time=None, experiment_id='1', lifecycle_stage='active', run_id='026b744ff6a7457c800

In [None]:
#run MLflow UI (it is more convenient to run it directly from the terminal)
!mlflow ui