<a href="https://colab.research.google.com/github/cagBRT/MLOps/blob/main/EvidentlyAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Examining Data Drift over three weeks in the life of a model

This notebook uses the open-source library [Evidently](https://www.evidentlyai.com/).

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/MLOps.git cloned-repo
%cd cloned-repo

**Import and install EvidentlyAI**

In [None]:
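try:
    import evidently
except ImportError:
    # Install only when the package is missing. The imports below rely on
    # ColumnMapping and evidently.report; if they fail with the latest source
    # build, installing an older pinned release may be needed.
    !pip install git+https://github.com/evidentlyai/evidently.git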

**Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import requests
import zipfile
import io

from datetime import datetime, time
from sklearn import datasets, ensemble

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset, RegressionPreset

**Get the data**

Both hour.csv and day.csv have the following fields, except hr, which is not available in day.csv.

- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from [Web Link])
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit : weather situation
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
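
The min-max normalisation quoted for `temp` and `atemp` can be inverted to recover degrees Celsius. A minimal sketch, using the `t_min` and `t_max` values from the list above; the helper name `denormalize` is introduced here for illustration and is not part of the dataset or of Evidently:

```python
def denormalize(value, t_min, t_max):
    """Invert (t - t_min) / (t_max - t_min) to recover the original value."""
    return value * (t_max - t_min) + t_min

# Bounds quoted in the field descriptions above.
TEMP_MIN, TEMP_MAX = -8, 39
ATEMP_MIN, ATEMP_MAX = -16, 50

# A normalised value of 0.5 sits at the midpoint of each range.
temp_celsius = denormalize(0.5, TEMP_MIN, TEMP_MAX)
atemp_celsius = denormalize(0.5, ATEMP_MIN, ATEMP_MAX)
print(temp_celsius, atemp_celsius)
```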

In [None]:
content = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip").content
with zipfile.ZipFile(io.BytesIO(content)) as arc:
    raw_data = pd.read_csv(arc.open("hour.csv"), header=0, sep=',', parse_dates=['dteday'], index_col='dteday')


In [None]:
raw_data.index = raw_data.apply(
    lambda row: datetime.combine(row.name, time(hour=int(row['hr']))), axis = 1)
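
The row-wise `apply` above builds one timestamp per row by combining the date with the `hr` column. As an alternative sketch, the same result can be obtained with a single vectorised addition; the small `toy` frame below is a stand-in for `raw_data` and is an assumption made for illustration:

```python
import pandas as pd

# Tiny stand-in for the hourly data: a date index plus an 'hr' column.
toy = pd.DataFrame(
    {"hr": [0, 1, 23]},
    index=pd.to_datetime(["2011-01-01", "2011-01-01", "2011-01-02"]),
)

# Add the hour offset to every date in one vectorised operation.
toy.index = toy.index + pd.to_timedelta(toy["hr"].to_numpy(), unit="h")
print(toy.index)
```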

In [None]:
raw_data.shape

In [None]:
raw_data.head()

In [None]:
raw_data.columns

We want to predict the total number of bikes rented.<br>
The 'cnt' column will be the label.<br>

The 'casual' and 'registered' columns can be dropped for this prediction, because together they make up 'cnt'.

The numerical features for this model are: <br>
>temp, atemp, hum, windspeed, hr, weekday

The categorical features are:<br>
>season, holiday, workingday

In [None]:
target = 'cnt'
prediction = 'prediction'
numerical_features = ['temp', 'atemp', 'hum', 'windspeed', 'hr', 'weekday']
categorical_features = ['season', 'holiday', 'workingday']


The reference (past) data covers Jan 1 to Jan 28, 2011. <br>
The current (production) data covers Jan 29 to Feb 28, 2011.

In [None]:
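# .copy() gives independent DataFrames, so adding a prediction column later does not trigger SettingWithCopyWarning.
reference = raw_data.loc['2011-01-01 00:00:00':'2011-01-28 23:00:00'].copy()
current = raw_data.loc['2011-01-29 00:00:00':'2011-02-28 23:00:00'].copy()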
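
Label-based slicing with `.loc` on a `DatetimeIndex` includes both end points, which is why the reference window can end at 23:00 and the current window can start at 00:00 the next day without losing or repeating rows. A small sketch on a synthetic hourly index (the `toy` frame is an assumption for illustration, not the real data):

```python
import pandas as pd

# Hourly index spanning three days, standing in for the bike-sharing data.
idx = pd.date_range("2011-01-27 00:00:00", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"cnt": range(len(idx))}, index=idx)

first = toy.loc["2011-01-27 00:00:00":"2011-01-28 23:00:00"]
second = toy.loc["2011-01-29 00:00:00":"2011-01-29 23:00:00"]

# Both end points are included, so the two windows cover every row exactly once.
print(len(first), len(second), len(toy))
```

Because the slices are inclusive, two consecutive windows that share a boundary timestamp would both contain that row.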


In [None]:
reference.head()


The model we will use is a RandomForestRegressor, an ensemble of decision trees.
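
To make the "ensemble" part concrete, the sketch below fits a small forest on synthetic data (an assumption for illustration, not the bike-sharing data) and checks that the forest's prediction is the average of its individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Average the predictions of the individual trees by hand.
tree_mean = np.mean([t.predict(X) for t in forest.estimators_], axis=0)

# Compare the hand-computed average with the forest's own prediction.
matches = np.allclose(forest.predict(X), tree_mean)
print(matches)
```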


---

# Example Below

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Breast Cancer Dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Arrange Data into Features Matrix and Target Vector
X = df.loc[:, df.columns != 'target']
y = df.loc[:, 'target'].values

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

# Random Forests in `scikit-learn` (with N = 100)
n_estimators = 100
rf = RandomForestClassifier(n_estimators=n_estimators,
                            random_state=0)
rf.fit(X_train, Y_train)

In [None]:
from sklearn.metrics import accuracy_score

# Score each tree on the held-out test set. The individual estimators were
# fitted on plain arrays, so pass .values to avoid feature-name warnings.
estimatorAccuracy = []
for curEstimator, estimator in enumerate(rf.estimators_):
    estimatorAccuracy.append([curEstimator, accuracy_score(Y_test, estimator.predict(X_test.values))])

estimatorAccuracy = pd.DataFrame(estimatorAccuracy, columns=['estimatorNumber', 'Accuracy'])
estimatorAccuracy.sort_values(inplace=True, by='Accuracy', ascending=False)

# Keep the tree with the highest test accuracy.
bestDecisionTree = rf.estimators_[estimatorAccuracy.head(1)['estimatorNumber'].values[0]]


In [None]:
fn=data.feature_names
cn=data.target_names
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (4,4), dpi=800)
tree.plot_tree(rf.estimators_[0],
               feature_names = fn, 
               class_names=cn,
               filled = True);
fig.savefig('rf_individualtree.png')


In [None]:
# This may not be the best way to view each estimator, as each plot is small
fn=data.feature_names
cn=data.target_names
fig, axes = plt.subplots(nrows = 1,ncols = 5,figsize = (10,2), dpi=900)
for index in range(0, 5):
    tree.plot_tree(rf.estimators_[index],
                   feature_names = fn, 
                   class_names=cn,
                   filled = True,
                   ax = axes[index]);

    axes[index].set_title('Estimator: ' + str(index), fontsize = 11)
fig.savefig('rf_5trees.png')

# End of Example



---



In [None]:
regressor = ensemble.RandomForestRegressor(random_state = 0, n_estimators = 50)

In [None]:
regressor.fit(reference[numerical_features + categorical_features], reference[target])

In [None]:
ref_prediction = regressor.predict(reference[numerical_features + categorical_features])
current_prediction = regressor.predict(current[numerical_features + categorical_features])

In [None]:

reference['prediction'] = ref_prediction
current['prediction'] = current_prediction


In [None]:
column_mapping = ColumnMapping()

column_mapping.target = target
column_mapping.prediction = prediction
column_mapping.numerical_features = numerical_features
column_mapping.categorical_features = categorical_features

In [None]:
regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=reference, reference_data=None, column_mapping=column_mapping)

In [None]:
regression_perfomance.show()
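
A regression performance report summarises how far the predictions are from the target. The sketch below computes three common error metrics by hand on hypothetical numbers; it is not Evidently's implementation, and the exact set of metrics shown by `RegressionPreset` may differ between versions:

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

error = y_pred - y_true
mean_error = error.mean()                              # signed bias of the predictions
mae = np.abs(error).mean()                             # mean absolute error
mape = (np.abs(error) / np.abs(y_true)).mean() * 100   # mean absolute percentage error

print(mean_error, mae, mape)
```

A mean error near zero with a non-zero MAE, as here, means the model is not biased in one direction even though individual predictions are off.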


In [None]:
#regression_perfomance.save_html('reports/regression_performance_at_training.html')

**Week 1**

In [None]:
regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'], 
                          reference_data=reference,
                          column_mapping=column_mapping)

regression_perfomance.show()

In [None]:
#regression_perfomance.save_html('reports/regression_performance_after_week1.html')

In [None]:
target_drift = Report(metrics=[TargetDriftPreset()])
target_drift.run(current_data=current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'],
                 reference_data=reference,
                 column_mapping=column_mapping)

target_drift.show()

In [None]:
#target_drift.save_html('reports/target_drift_after_week1.html')

**Week 2**

In [None]:
regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=current.loc['2011-02-07 00:00:00':'2011-02-14 23:00:00'], 
                          reference_data=reference,
                          column_mapping=column_mapping)

regression_perfomance.show()


In [None]:
#regression_perfomance.save_html('reports/regression_performance_after_week2.html')

In [None]:
target_drift = Report(metrics=[TargetDriftPreset()])
target_drift.run(current_data=current.loc['2011-02-07 00:00:00':'2011-02-14 23:00:00'],
                 reference_data=reference,
                 column_mapping=column_mapping)

target_drift.show()


In [None]:
#target_drift.save_html('reports/target_drift_after_week2.html')


**Week 3**

In [None]:
regression_perfomance = Report(metrics=[RegressionPreset()])
regression_perfomance.run(current_data=current.loc['2011-02-15 00:00:00':'2011-02-21 23:00:00'], 
                          reference_data=reference,
                          column_mapping=column_mapping)

regression_perfomance.show()


In [None]:
#regression_perfomance.save_html('reports/regression_performance_after_week3.html')

In [None]:
target_drift = Report(metrics=[TargetDriftPreset()])
target_drift.run(current_data=current.loc['2011-02-15 00:00:00':'2011-02-21 23:00:00'],
                 reference_data=reference,
                 column_mapping=column_mapping)

target_drift.show()

In [None]:
#target_drift.save_html('reports/target_drift_after_week3.html')


**Data drift**

In [None]:
column_mapping = ColumnMapping()

column_mapping.numerical_features = numerical_features

In [None]:
data_drift = Report(metrics = [DataDriftPreset()])
data_drift.run(current_data = current.loc['2011-01-29 00:00:00':'2011-02-07 23:00:00'],
               reference_data = reference,
               column_mapping=column_mapping)

data_drift.show()
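
Data drift detection compares the distribution of each feature in the current window with its distribution in the reference window. As a minimal sketch of the idea (not Evidently's internals; which statistical test the preset applies depends on the library version and configuration), here is a two-sample Kolmogorov-Smirnov test from SciPy on synthetic samples:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic samples, for illustration only.
rng = np.random.default_rng(0)
reference_sample = rng.normal(loc=0.0, scale=1.0, size=500)
same_sample = rng.normal(loc=0.0, scale=1.0, size=500)      # same distribution
shifted_sample = rng.normal(loc=1.0, scale=1.0, size=500)   # mean shifted by 1

# A small p-value suggests the two samples come from different distributions.
p_same = ks_2samp(reference_sample, same_sample).pvalue
p_shifted = ks_2samp(reference_sample, shifted_sample).pvalue

print(f"same distribution: p={p_same:.3f}")
print(f"shifted mean:      p={p_shifted:.3g}")
```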


In [None]:
#data_drift.save_html("reports/data_drift_dashboard_after_week1.html")