# Using MLflow and Evidently to Evaluate Data Drift

In this example, we will explore the MLflow integration with Evidently.

This notebook shows how you can use Evidently and MLflow to:
* calculate data drift for the model, performed as batch checks
* log data drift using MLflow Tracking
* explore the results in the MLflow UI

Acknowledgments:
* The dataset used in the example is from: https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv
* Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
* More information about the dataset can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

## Getting Started
To run this tutorial:

1. Install MLflow
You can install MLflow with `pip install mlflow`, or install MLflow with scikit-learn via `pip install mlflow[extras]`.
More details: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#id5

2. Install Evidently
You can install Evidently with `pip install evidently`.
If you also want the Evidently dashboard visualization in Jupyter, install the Jupyter nbextension:
`jupyter nbextension install --sys-prefix --symlink --overwrite --py evidently`
And activate it:
`jupyter nbextension enable evidently --py --sys-prefix`
More details: https://docs.evidentlyai.com/install-evidently 

3. Optionally, you can download the data from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset and save it locally, or skip this step and download the data with `requests` using the instructions below.

In [None]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [None]:
import json
import pandas as pd
import requests
import zipfile
import io

from evidently.pipeline.column_mapping import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

In [None]:
#load data
content = requests.get("https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip").content
with zipfile.ZipFile(io.BytesIO(content)) as arc:
    raw_data = pd.read_csv(arc.open("day.csv"), header=0, sep=',', parse_dates=['dteday'])
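
If you would rather keep a local copy than download the archive on every run, the downloaded bytes can be unpacked to disk once. The following is a minimal sketch using only the standard library; the helper name `extract_member` and the destination directory are illustrative assumptions, not part of the original notebook.

```python
import io
import zipfile
from pathlib import Path


def extract_member(zip_bytes: bytes, member: str, dest_dir: str) -> Path:
    """Write one file from an in-memory zip archive to dest_dir and return its path."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(member).name
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # archive.read() raises KeyError if the member is not in the archive
        target.write_bytes(archive.read(member))
    return target
```

With the `content` variable from the cell above, `extract_member(content, "day.csv", "data")` would write `data/day.csv`, which `pd.read_csv` can then read directly on later runs.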

In [None]:
#observe data structure
raw_data.head()

In [None]:
#set column mapping for the Evidently Report
data_columns = ColumnMapping()
data_columns.datetime = 'dteday'
data_columns.numerical_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']
data_columns.categorical_features = ['holiday', 'workingday']
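
Before running a report, it can help to confirm that every mapped column actually exists in the data. The sketch below is one way to do that with plain Python; `missing_columns` is a hypothetical helper, and the column list is typed out by hand from the `day.csv` header rather than read from the file.

```python
def missing_columns(available, datetime_col, numerical, categorical):
    """Return mapped column names that are absent from `available`, in mapping order."""
    mapped = [datetime_col, *numerical, *categorical]
    present = set(available)
    return [name for name in mapped if name not in present]


# Column names of day.csv from the UCI Bike Sharing dataset (typed by hand here)
day_columns = [
    "instant", "dteday", "season", "yr", "mnth", "holiday", "weekday",
    "workingday", "weathersit", "temp", "atemp", "hum", "windspeed",
    "casual", "registered", "cnt",
]

missing = missing_columns(
    day_columns,
    datetime_col="dteday",
    numerical=["weathersit", "temp", "atemp", "hum", "windspeed"],
    categorical=["holiday", "workingday"],
)
print(missing)  # an empty list means every mapped column exists
```

In the notebook itself, `raw_data.columns` could be passed as `available` instead of the hand-written list.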

In [None]:
#evaluate data drift with an Evidently Report
def eval_drift(reference, production, column_mapping):
    """
    Returns a list of (feature_name, drift_score) pairs.
    The drift score depends on the selected statistical test or distance;
    whether it counts as drift depends on the threshold.
    """
    data_drift_report = Report(metrics=[DataDriftPreset()])
    data_drift_report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)
    report = data_drift_report.as_dict()

    drifts = []

    # report["metrics"][1] is expected to hold the per-column drift table produced by DataDriftPreset
    for feature in column_mapping.numerical_features + column_mapping.categorical_features:
        drifts.append((feature, report["metrics"][1]["result"]["drift_by_columns"][feature]["drift_score"]))

    return drifts
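
The function above reads the drift table by its position in the report. A possible alternative is to look the entry up by name, which would not depend on the order of the metrics. The sketch below works on a hand-written dictionary that imitates the shape of `Report.as_dict()`; the `"metric"` key, the names `"DatasetDriftMetric"` and `"DataDriftTable"`, and all values are assumptions that may differ between Evidently versions, and `drift_scores` is a hypothetical helper.

```python
def drift_scores(report_dict, features, metric_name="DataDriftTable"):
    """Return (feature, drift_score) pairs from the first metric entry named metric_name."""
    for entry in report_dict["metrics"]:
        if entry.get("metric") == metric_name:
            by_column = entry["result"]["drift_by_columns"]
            return [(name, by_column[name]["drift_score"]) for name in features]
    raise KeyError(f"metric {metric_name!r} not found in report")


# Hand-written stand-in for Report.as_dict() output; the values are made up
example_report = {
    "metrics": [
        {"metric": "DatasetDriftMetric", "result": {"dataset_drift": False}},
        {
            "metric": "DataDriftTable",
            "result": {
                "drift_by_columns": {
                    "temp": {"drift_score": 0.12},
                    "hum": {"drift_score": 0.57},
                }
            },
        },
    ]
}

print(drift_scores(example_report, ["temp", "hum"]))
```

Printing `report["metrics"]` once for your installed version is a quick way to check whether these key names apply before relying on them.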

In [None]:
#set reference dates
reference_dates = ('2011-01-01 00:00:00','2011-01-28 23:00:00')

#set experiment batches dates
experiment_batches = [
    ('2011-01-01 00:00:00','2011-01-29 23:00:00'),
    ('2011-01-29 00:00:00','2011-02-07 23:00:00'),
    ('2011-02-07 00:00:00','2011-02-14 23:00:00'),
    ('2011-02-15 00:00:00','2011-02-21 23:00:00'),  
]
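
Note that pandas `Series.between` is inclusive on both ends by default, and `day.csv` has one row per day, so a day that appears as a boundary in two windows is counted in both batches. The sketch below, using only the standard library, lists which of the windows above share at least one instant; `overlapping_pairs` is a hypothetical helper written for this illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def overlapping_pairs(windows):
    """Return index pairs (i, j), i < j, of windows that share at least one instant.

    Both ends of every window are treated as inclusive.
    """
    parsed = [(datetime.strptime(a, FMT), datetime.strptime(b, FMT)) for a, b in windows]
    pairs = []
    for i in range(len(parsed)):
        for j in range(i + 1, len(parsed)):
            # Two closed intervals overlap when each starts before the other ends
            if parsed[i][0] <= parsed[j][1] and parsed[j][0] <= parsed[i][1]:
                pairs.append((i, j))
    return pairs


batches = [
    ("2011-01-01 00:00:00", "2011-01-29 23:00:00"),
    ("2011-01-29 00:00:00", "2011-02-07 23:00:00"),
    ("2011-02-07 00:00:00", "2011-02-14 23:00:00"),
    ("2011-02-15 00:00:00", "2011-02-21 23:00:00"),
]

print(overlapping_pairs(batches))  # [(0, 1), (1, 2)]
```

Here the first two pairs of consecutive windows share a boundary day (2011-01-29 and 2011-02-07), while the last two do not. The first batch also covers the whole reference period, so it can serve as a low-drift baseline run.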

In [None]:
#log into MLflow
client = MlflowClient()

#set experiment
mlflow.set_experiment('Data Drift Evaluation with Evidently')

#start new run
for date in experiment_batches:
    with mlflow.start_run() as run: # optionally pass run_name='test' to name the run
        
        # Log parameters
        mlflow.log_param("begin", date[0])
        mlflow.log_param("end", date[1])

        # Log metrics
        metrics = eval_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0], reference_dates[1])], 
                             raw_data.loc[raw_data.dteday.between(date[0], date[1])], 
                             column_mapping=data_columns)
        for feature in metrics:
            mlflow.log_metric(feature[0], round(feature[1], 3))

        print(run.info)
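
The loop above logs one metric per call. Since `mlflow.log_metrics` accepts a dictionary of names to values, the pairs returned by `eval_drift` could also be converted first and logged in a single call. A minimal sketch of that conversion, with a hypothetical helper `to_metric_dict` and made-up scores:

```python
def to_metric_dict(pairs, ndigits=3):
    """Convert (feature, score) pairs into a {name: rounded_score} dictionary."""
    return {name: round(float(score), ndigits) for name, score in pairs}


# Made-up scores standing in for the output of eval_drift
example = [("temp", 0.123456), ("hum", 0.5)]
print(to_metric_dict(example))  # {'temp': 0.123, 'hum': 0.5}
```

Inside the run context, `mlflow.log_metrics(to_metric_dict(metrics))` would then replace the inner `for` loop.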

In [None]:
#run MLflow UI (it is more convenient to run it directly from the terminal)
#!mlflow ui

# Support Evidently
Did you find the example useful? Star Evidently on GitHub to contribute back! This helps us continue creating free open-source tools for the community. https://github.com/evidentlyai/evidently