<a href="https://colab.research.google.com/github/felipemaiapolo/detectshift/blob/main/Regression2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset shift diagnostics with *DetectShift* in a regression task using categorical features

## Starting...

Installing *DetectShift*:

In [1]:
from IPython.display import clear_output
!pip install git+https://github.com/felipemaiapolo/detectshift
!pip install wget
clear_output()

Loading packages:

In [2]:
import detectshift as ds
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import wget

random_seed=42
np.random.seed(random_seed)

Loading the data used in the ENEM (standardized test) experiment from the paper ["A unified framework for dataset shift diagnostics"](https://arxiv.org/pdf/2205.08340.pdf). Click [here](https://github.com/felipemaiapolo/dataset_shift_diagnostics/blob/main/EXP5_digits_experiment.ipynb) and [here](https://github.com/felipemaiapolo/dataset_shift_diagnostics/tree/main/data) if you want to check how the data was generated.

In this demo, we use data from 2017 as the source and data from 2020 as the target.

*PS: The code used in this demo is not exactly the same as the code used in the paper, so we expect some variation, especially in the p-values. However, we expect the results to be qualitatively very similar.*


In [3]:
url="https://github.com/felipemaiapolo/detectshift/raw/main/demo/data/enem_data.npy"
filename = wget.download(url, out=None)

data=np.load(filename,allow_pickle=True).tolist()

In [4]:
Xs, ys = data['X'][17], data['y'][17]
Xt, yt = data['X'][20], data['y'][20]
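
The `.tolist()` call in the loading cell is needed because `np.save` stores a Python dictionary inside a zero-dimensional object array, and `np.load` returns that wrapper rather than the dictionary itself. The sketch below reproduces this with a toy dictionary; the keys and values are made up for illustration and are not the ENEM data.

```python
import os
import tempfile
import numpy as np

# Toy stand-in for the downloaded file: a dict keyed by field, then by year.
toy = {"X": {17: [[1, 2], [3, 4]], 20: [[5, 6]]},
       "y": {17: [0.1, 0.2], 20: [0.3]}}

path = os.path.join(tempfile.mkdtemp(), "toy_data.npy")
np.save(path, toy, allow_pickle=True)      # the dict is pickled inside a 0-d object array

loaded = np.load(path, allow_pickle=True)  # allow_pickle=True is required for object arrays
print(loaded.shape, loaded.dtype)          # () object

restored = loaded.tolist()                 # unwraps the 0-d array back into the dict
Xs_toy, ys_toy = restored["X"][17], restored["y"][17]
```

Because `allow_pickle=True` can execute arbitrary code while unpickling, it is best reserved for files from a source you trust.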

## Example 

In this example, we expect to detect all kinds of shift except label shift and concept shift of type 2 (just like in the paper).

We start by setting some parameters:

In [5]:
#DetectShift parameters
task='reg'
test=.1 #test size fraction
B=500 #number of permutations

Creating data (**please use the *prep_data* function to prepare your data**):

In [6]:
Xs_train, Xs_test, ys_train, ys_test, Zs_train, Zs_test, \
Xt_train, Xt_test, yt_train, yt_test, Zt_train, Zt_test = ds.tools.prep_data(Xs, ys, Xt, yt, test=test, task=task, random_state=random_seed)            
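
To make the role of the returned objects more concrete, here is a rough sketch of one way such a preparation step could work. It is not the *prep_data* implementation: the helper name `split_and_join` is hypothetical, and reading the `Z` objects as the joint sample (X, y) is an assumption based on how they are passed to the total-shift model later in this demo.

```python
import numpy as np
import pandas as pd

def split_and_join(X, y, test=0.1, random_state=0):
    """Hypothetical helper: shuffle, hold out a `test` fraction, and build Z = (X, y)."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    n_test = int(round(test * len(X)))
    parts = {}
    for name, rows in (("test", idx[:n_test]), ("train", idx[n_test:])):
        X_part = X.iloc[rows].reset_index(drop=True)
        y_part = y.iloc[rows].reset_index(drop=True)
        Z_part = pd.concat([X_part, y_part], axis=1)  # joint sample (X, y), an assumption
        parts[name] = (X_part, y_part, Z_part)
    return parts

# Toy data, made up for illustration
X_toy = pd.DataFrame({"x0": list("MFFFMFMFMF"), "x1": [3, 3, 1, 1, 3, 2, 2, 1, 3, 1]})
y_toy = pd.DataFrame({"y": np.linspace(0.0, 1.0, 10)})

parts = split_and_join(X_toy, y_toy, test=0.2, random_state=42)
X_tr, y_tr, Z_tr = parts["train"]
print(Z_tr.shape)  # (8, 3)
```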

Specifying indices of categorical features:

In [7]:
Xt_test.head()

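| | x0 | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|
| 0 | M | 3 | 1 | C | B | A |
| 1 | F | 3 | 1 | B | B | B |
| 2 | F | 1 | 2 | E | B | B |
| 3 | F | 1 | 1 | D | B | B |
| 4 | M | 3 | 1 | E | D | B |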


In [8]:
cat_features=[0,1,2,3,4,5]
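
Hard-coding the indices works here because every column is categorical. For wider tables, the list can be derived from the column types instead. The sketch below uses a hypothetical helper, `categorical_indices`, on a toy frame that mirrors the rows displayed above. Whether `x1` and `x2` are stored as integers or strings cannot be told from the display, so the toy frame assumes integers to show that integer-coded categories have to be named explicitly.

```python
import pandas as pd

def categorical_indices(df, also=()):
    """Hypothetical helper: positions of non-numeric columns, plus any named in `also`."""
    return [i for i, col in enumerate(df.columns)
            if col in also or not pd.api.types.is_numeric_dtype(df[col])]

# Toy frame mirroring the first rows of Xt_test (x1 and x2 assumed to be integers)
head = pd.DataFrame({
    "x0": ["M", "F", "F", "F", "M"],
    "x1": [3, 3, 1, 1, 3],
    "x2": [1, 1, 2, 1, 1],
    "x3": ["C", "B", "E", "D", "E"],
    "x4": ["B", "B", "B", "B", "D"],
    "x5": ["A", "B", "B", "B", "B"],
})

print(categorical_indices(head))                     # [0, 3, 4, 5]
print(categorical_indices(head, also=("x1", "x2")))  # [0, 1, 2, 3, 4, 5]
```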

Training models (in this case, we use CatBoost models with early stopping):

In [9]:
#Training classifiers to estimate the Radon-Nikodym (R-N) derivative
totshift_model = ds.tools.KL(boost=True, cat_features=cat_features)
totshift_model.fit(Zs_train, Zt_train)
covshift_model = ds.tools.KL(boost=True, cat_features=cat_features)
covshift_model.fit(Xs_train, Xt_train)
labshift_model = ds.tools.KL(boost=True)
labshift_model.fit(ys_train, yt_train)

#Estimating the conditional distribution
cd_model = ds.cdist.cde_reg(boost=True, cat_features=cat_features)
cd_model.fit(pd.concat([Xs_train, Xt_train], axis=0), 
            pd.concat([ys_train, yt_train], axis=0))
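
The idea behind training a classifier to estimate a Radon-Nikodym derivative can be illustrated without the library. With balanced classes, the log-odds of a classifier that separates samples of P from samples of Q approximate log(p(x)/q(x)), and averaging those log-odds over fresh samples of P gives an estimate of KL(P || Q). The sketch below is an illustration only: it swaps CatBoost for a plain Newton-Raphson logistic regression, uses two Gaussians for which the true KL is known, and the direction of the divergence is chosen for the example rather than taken from *DetectShift*.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Plain Newton-Raphson logistic regression with an intercept (illustration only)."""
    A = np.column_stack([np.ones(len(x)), x])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        grad = A.T @ (p - y)
        hess = A.T @ (A * (p * (1.0 - p))[:, None])
        w -= np.linalg.solve(hess, grad)
    return w

rng = np.random.default_rng(0)
n = 5000
# P = N(1, 1) gets label 1, Q = N(0, 1) gets label 0; the true KL(P || Q) is 0.5
p_train, q_train = rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n)
x = np.concatenate([p_train, q_train])
y = np.concatenate([np.ones(n), np.zeros(n)])
w = fit_logistic(x, y)

# With balanced classes the classifier log-odds estimate log(p(x) / q(x)),
# so averaging them over fresh draws from P estimates KL(P || Q).
p_test = rng.normal(1.0, 1.0, n)
kl_hat = np.mean(w[0] + w[1] * p_test)
print(f"estimated KL: {kl_hat:.3f} (true value 0.5)")
```

Evaluating the log-odds on held-out samples, as done here with `p_test`, mirrors the train/test separation used in this demo.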

Getting test statistics and p-values using the *ShiftDiagnostics* function (all at once):

In [10]:
out = ds.tests.ShiftDiagnostics(Xs_test, ys_test, Xt_test, yt_test,
                                totshift_model=totshift_model, covshift_model=covshift_model, labshift_model=labshift_model,
                                cd_model=cd_model, task=task, n_bins=10, B=B, verbose=True)

Calculating p-value for total shift...


100%|██████████| 500/500 [00:00<00:00, 1211.49it/s]



Calculating p-value for label shift...


100%|██████████| 500/500 [00:00<00:00, 1155.32it/s]



Calculating p-value for covariate shift...


100%|██████████| 500/500 [00:00<00:00, 1135.58it/s]



Calculating p-value for concept shift type 1...


100%|██████████| 500/500 [00:24<00:00, 20.07it/s]



Calculating p-value for concept shift type 2...


100%|██████████| 500/500 [00:10<00:00, 48.69it/s]


Visualizing the results:

In [11]:
pd.DataFrame(out).T.iloc[:,:2]

| | pval | kl |
|---|---|---|
| tot | 0.001996 | 0.270623 |
| lab | 0.001996 | 0.096877 |
| cov | 0.001996 | 0.164866 |
| conc1 | 0.001996 | 0.173746 |
| conc2 | 0.057884 | 0.105757 |
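
The `kl` column can be sanity-checked with the chain rule for the KL divergence, which splits the divergence between two joint distributions into a marginal term plus a conditional term. The short check below uses the values reported above. Pairing `conc1` with the label term and `conc2` with the covariate term is inferred from which sums match the total, so treat that mapping as an assumption.

```python
# KL estimates copied from the results table
kl = {"tot": 0.270623, "lab": 0.096877, "cov": 0.164866,
      "conc1": 0.173746, "conc2": 0.105757}

# Chain rule, factoring the joint as P(Y) P(X|Y): total = label + concept (type 1)
via_labels = kl["lab"] + kl["conc1"]
# Chain rule, factoring the joint as P(X) P(Y|X): total = covariate + concept (type 2)
via_covariates = kl["cov"] + kl["conc2"]

print(round(via_labels, 6), round(via_covariates, 6), kl["tot"])
# 0.270623 0.270623 0.270623
```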


Getting test statistics and p-values separately:

In [12]:
verbose = True

print("Calculating p-value for total shift...") 
tot = ds.tests.Permut(Zs_test, Zt_test, totshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for label shift...")
lab = ds.tests.Permut(ys_test, yt_test, labshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for covariate shift...")
cov = ds.tests.Permut(Xs_test, Xt_test, covshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 1...")
conc1 = ds.tests.LocalPermut(Xs_test, ys_test, Xt_test, yt_test, 
                             totshift_model, labshift_model=labshift_model, task=task, n_bins=10, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 2...")
conc2 = ds.tests.CondRand(Xs_test, ys_test, Xt_test, yt_test, 
                          cd_model, totshift_model, covshift_model, B=B, verbose = verbose)
    
out = {'tot':tot, 'lab':lab, 'cov':cov, 'conc1':conc1, 'conc2':conc2}

Calculating p-value for total shift...


100%|██████████| 500/500 [00:00<00:00, 1153.00it/s]



Calculating p-value for label shift...


100%|██████████| 500/500 [00:00<00:00, 1150.15it/s]



Calculating p-value for covariate shift...


100%|██████████| 500/500 [00:00<00:00, 1146.49it/s]



Calculating p-value for concept shift type 1...


100%|██████████| 500/500 [00:25<00:00, 19.97it/s]



Calculating p-value for concept shift type 2...


100%|██████████| 500/500 [00:10<00:00, 45.98it/s]


Visualizing the results:

In [13]:
pd.DataFrame(out).T.iloc[:,:2]

| | pval | kl |
|---|---|---|
| tot | 0.001996 | 0.270623 |
| lab | 0.001996 | 0.096877 |
| cov | 0.001996 | 0.164866 |
| conc1 | 0.001996 | 0.173746 |
| conc2 | 0.065868 | 0.105757 |
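
The reported p-values are consistent with the common permutation estimator (1 + number of permuted statistics at least as large as the observed one) / (B + 1). With B = 500, the smallest attainable value is 1/501, roughly 0.001996, and the two `conc2` values seen in this demo match 29/501 and 33/501, which is the kind of run-to-run variation expected from a randomized test. The sketch below illustrates the estimator with a simple difference in means as the statistic; this is a stand-in chosen for brevity, not the KL-based statistic used by *DetectShift*, and the helper name is hypothetical.

```python
import numpy as np

def permutation_pvalue(a, b, B=500, seed=0):
    """Permutation p-value for a difference in means, with the +1 correction."""
    rng = np.random.default_rng(seed)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(B):
        perm = rng.permutation(pooled)
        stat = perm[len(a):].mean() - perm[:len(a)].mean()
        hits += int(stat >= observed)
    return (1 + hits) / (B + 1)

rng = np.random.default_rng(42)
source = rng.normal(0.0, 1.0, 200)
shifted = rng.normal(3.0, 1.0, 200)   # strongly shifted toy target

p_shift = permutation_pvalue(source, shifted)
print(p_shift, 1 / 501)      # both about 0.001996
print(29 / 501, 33 / 501)    # about 0.057884 and 0.065868
```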
