### Data Drift & Model Drift Detection

#### Data Drift
Data drift (also called data shift) is a change in the data a model receives compared with the data it was trained on.
Data drift can refer to
+ changes in the input data
+ changes in the values of the features used to define or predict a target label
+ changes in the properties of the independent variables
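
As a rough illustration of what "changes in the values of the features" can mean in practice, the sketch below compares one numeric feature across two samples using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). The `age` values, the `DRIFT_THRESHOLD` cut-off, and the function name are assumptions made for this example; this is one of several possible drift measures, not a description of how any particular library computes drift.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for value in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_cur = bisect_right(cur, value) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Hypothetical 'age' values: training data vs. newer data that skews older.
train_age = [25, 31, 37, 40, 44, 52, 56, 57]
new_age = [48, 53, 58, 61, 63, 66, 70, 72]

DRIFT_THRESHOLD = 0.5  # assumed cut-off, for illustration only
score = ks_statistic(train_age, new_age)
drifted = score > DRIFT_THRESHOLD
print(f"KS statistic = {score:.3f}, drift flagged: {drifted}")
```

A statistic near 0 means the two samples are distributed similarly; a value near 1 means they barely overlap.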

#### Model Drift
Model drift refers to a decline in a model's performance over time, for example in accuracy or other measures of prediction quality.
ML models do not operate in a static environment, so their performance tends to deteriorate or decay as the data they see changes.
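
One simple way to watch for this kind of decay is to score the model on successive batches of labelled data and compare each score with a baseline. The sketch below assumes a hypothetical baseline accuracy of 0.90, made-up weekly batches, and an arbitrary tolerance of 0.05; none of these values come from the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def find_degraded_batches(baseline_accuracy, batches, tolerance=0.05):
    """Return (name, accuracy) for batches scoring more than `tolerance` below baseline."""
    degraded = []
    for name, y_true, y_pred in batches:
        acc = accuracy(y_true, y_pred)
        if baseline_accuracy - acc > tolerance:
            degraded.append((name, acc))
    return degraded


# Hypothetical labels and predictions collected over three weeks.
y_true = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
batches = [
    ("week_1", y_true, [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]),
    ("week_2", y_true, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    ("week_3", y_true, [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]),
]

degraded = find_degraded_batches(baseline_accuracy=0.90, batches=batches)
for name, acc in degraded:
    print(f"{name}: accuracy {acc:.2f} is below the baseline tolerance")
```

In a real deployment the true labels often arrive late, so a check like this usually runs with a delay.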

#### Deepchecks
+ Useful for detecting data drift, checking data integrity, evaluating model performance, etc.
+ pip install deepchecks

In [None]:
# RUNS OK in jupyter but not in VSCode

In [1]:
# Load Packages
import pandas as pd 
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Imports for building the model pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [3]:
# load data
df = pd.read_csv("data/bank-additional-full_encoded.csv")

In [4]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,0,0,0,0,0,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
1,57,1,0,1,1,0,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
2,37,1,0,1,0,1,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
3,40,2,0,2,0,0,0,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0
4,56,1,0,1,0,0,1,0,0,0,...,1,999,0,0,1.1,93.994,-36.4,4.857,5191.0,0


In [5]:
# Features & Labels
Xfeatures = df.drop('y',axis=1)
# Select last column of dataframe as a dataframe object
ylabels = df.iloc[: , -1:]

In [6]:
Xfeatures.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed'],
      dtype='object')

In [7]:
# Split Dataset
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=7)

### Requirements
+ Datasets
    - train,test data
+ Model

#### Components
+ Suites
+ Checks
+ Dataset

In [8]:
# Build the Model
pipe_lr = Pipeline(steps=[('sc',StandardScaler()),('lr',LogisticRegression())])

In [9]:
pipe_lr

In [10]:
# Fit the pipeline; ravel() flattens the one-column label DataFrame into the 1-D array scikit-learn expects
pipe_lr.fit(x_train,y_train.values.ravel())


In [11]:
# Accuracy on the test set
pipe_lr.score(x_test,y_test)

0.9105770008901837

### Using Deepchecks for Offline ML Data Drift Detection

In [12]:
import deepchecks

In [13]:
# List the names exposed by the deepchecks package
dir(deepchecks)

['BaseCheck',
 'BaseSuite',
 'CheckFailure',
 'CheckResult',
 'Condition',
 'ConditionCategory',
 'ConditionResult',
 'Context',
 'Dataset',
 'ModelComparisonCheck',
 'ModelComparisonSuite',
 'ModelOnlyBaseCheck',
 'ModelOnlyCheck',
 'SingleDatasetBaseCheck',
 'SingleDatasetCheck',
 'Suite',
 'SuiteResult',
 'TrainTestBaseCheck',
 'TrainTestCheck',
 '_SubstituteModule',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__original_module__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_init_module_attrs',
 'analytics',
 'core',
 'get_verbosity',
 'is_notebook',
 'matplotlib',
 'os',
 'pio',
 'pio_backends',
 'set_verbosity',
 'sys',
 'tabular',
 'types',
 'utils',
 'validate_latest_version',
 'version']

### Full Suite
+ Data Drift Detection
+ Model Performance / Confidence
+ Data Integrity Check
+ Label Ambiguity
+ Other checks

In [14]:
from deepchecks.tabular.suites import full_suite

In [15]:
# Create the Dataset Objects
ds_train = deepchecks.tabular.Dataset(df=x_train,label=y_train,cat_features=[])
ds_test = deepchecks.tabular.Dataset(df=x_test,label=y_test,cat_features=[])

In [16]:
# Create the suite
fsuite = full_suite()

In [17]:
results = fsuite.run(train_dataset=ds_train,test_dataset=ds_test,model=pipe_lr)

deepchecks - INFO - Calculating permutation feature importance. Expected to finish in 9 seconds


In [18]:
results

Accordion(children=(VBox(children=(HTML(value='\n<h1 id="summary_B710GSOGJCW4WLHQ4MME28EPE">Full Suite</h1>\n<…

#### Feature/Data Drift

In [19]:
from deepchecks.tabular.checks import FeatureDrift

In [20]:
check = FeatureDrift()

In [21]:
result = check.run(train_dataset=ds_train, test_dataset=ds_test, model=pipe_lr)

deepchecks - INFO - Calculating permutation feature importance. Expected to finish in 9 seconds


In [22]:
result

VBox(children=(HTML(value='<h4><b>Feature Drift</b></h4>'), HTML(value='<p>    Calculate drift between train d…

In [23]:
# Label Drift
from deepchecks.tabular.checks import LabelDrift
lcheck = LabelDrift()
lresult = lcheck.run(train_dataset=ds_train, test_dataset=ds_test)

In [24]:
lresult

VBox(children=(HTML(value='<h4><b>Label Drift</b></h4>'), HTML(value='<p>    Calculate label drift between tra…
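
To give a feel for what a label drift score measures, the sketch below computes the Population Stability Index (PSI) between the class proportions of two label samples. PSI is one common way to quantify a shift in a categorical distribution and is not necessarily the statistic Deepchecks reports. The label counts and function names are hypothetical.

```python
import math
from collections import Counter


def class_proportions(labels, classes, eps=1e-6):
    """Share of each class, floored at eps so the logarithm is always defined."""
    counts = Counter(labels)
    total = len(labels)
    return {c: max(counts.get(c, 0) / total, eps) for c in classes}


def psi(expected_labels, actual_labels):
    """Population Stability Index between two label samples."""
    if not expected_labels or not actual_labels:
        raise ValueError("both label samples must be non-empty")
    classes = sorted(set(expected_labels) | set(actual_labels))
    expected = class_proportions(expected_labels, classes)
    actual = class_proportions(actual_labels, classes)
    return sum(
        (actual[c] - expected[c]) * math.log(actual[c] / expected[c])
        for c in classes
    )


# Hypothetical binary labels: 10% positives at training time, 30% later on.
train_labels = [0] * 90 + [1] * 10
new_labels = [0] * 70 + [1] * 30

label_psi = psi(train_labels, new_labels)
print(f"Label PSI = {label_psi:.3f}")
```

A PSI of 0 means the class proportions are identical; a frequently quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, though the right threshold depends on the application.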

### Dataset Integrity Checks using Deepchecks
+ pip install deepchecks

#### Components
+ checks
+ suites
+ Dataset

In [25]:
# import pandas as pd
# import deepchecks

In [26]:
# Load Dataset
df = pd.read_csv("data/bank-additional-full_encoded.csv")

In [27]:
dir(deepchecks.tabular.suites)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'data_integrity',
 'default_suites',
 'full_suite',
 'model_evaluation',
 'production_suite',
 'train_test_validation']

In [28]:
# Note: older Deepchecks releases exposed this suite as single_dataset_integrity
from deepchecks.tabular.suites import data_integrity

In [29]:
# Create the data integrity suite and run it on the full dataframe
integrity = data_integrity()
integrity.run(df)



Accordion(children=(VBox(children=(HTML(value='\n<h1 id="summary_7NVLSRCFCE15C4KCK016T6K28">Data Integrity Sui…
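
The kinds of problems an integrity check looks for can be illustrated without any library. The sketch below counts missing values, duplicate rows, and single-value columns in a small list of records. The rows are hypothetical and the helper `integrity_report` is not part of Deepchecks; the real suite covers many more checks.

```python
def integrity_report(rows):
    """Summarise missing values, duplicate rows, and constant columns."""
    columns = list(rows[0].keys()) if rows else []

    # Missing values per column (None stands in for NaN/null here).
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

    # Count rows that exactly repeat an earlier row.
    seen = set()
    duplicate_rows = 0
    for r in rows:
        key = tuple(r.get(c) for c in columns)
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)

    # Columns holding a single value carry no information for a model.
    constant_columns = [c for c in columns if len({r.get(c) for r in rows}) == 1]

    return {
        "missing": missing,
        "duplicate_rows": duplicate_rows,
        "constant_columns": constant_columns,
    }


# Hypothetical records loosely shaped like the encoded bank dataset.
rows = [
    {"age": 56, "job": 0, "pdays": 999},
    {"age": 57, "job": 1, "pdays": 999},
    {"age": 37, "job": None, "pdays": 999},
    {"age": 56, "job": 0, "pdays": 999},
]

report = integrity_report(rows)
print(report)
```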

In [30]:
#### Thanks For Your Time
#### Jesus Saves @JCharisTech
#### Jesse E.Agbe(JCharis)