<a href="https://colab.research.google.com/github/felipemaiapolo/detectshift/blob/main/demo/Classification2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset shift diagnostics with *DetectShift* in a multinomial classification task

## Starting...

Installing *DetectShift*:

In [12]:
from IPython.display import clear_output
!pip install detectshift
!pip install wget
clear_output()

Loading packages:

In [13]:
import detectshift as ds
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import wget

random_seed=42
np.random.seed(random_seed)

Loading the data used in the MNIST/USPS experiment from the paper ["A unified framework for dataset shift diagnostics"](https://arxiv.org/pdf/2205.08340.pdf). See [here](https://github.com/felipemaiapolo/dataset_shift_diagnostics/blob/main/EXP5_digits_experiment.ipynb) and [here](https://github.com/felipemaiapolo/dataset_shift_diagnostics/tree/main/data) to check how the data was generated.

In this demo, we only use the pure MNIST dataset and the even split.

*PS: The code used in this demo is not exactly the same as the code used in the paper, so we expect some variation, especially in the p-values. However, we expect the results to be qualitatively very similar.*

In [14]:
url="https://github.com/felipemaiapolo/detectshift/raw/main/demo/data/digits_data.npy"
filename = wget.download(url, out=None)

data=np.load(filename,allow_pickle=True).tolist()

In [15]:
Xs, ys = data['X'][0], data['y'][0]
Xt, yt = data['X'][-1], data['y'][-1]

## Example 

In this example, we expect to detect every kind of shift except label shift and concept shift of type 2 (just like in the paper).

We start by setting some parameters:

In [16]:
#DetectShift parameters
task='class'
test=.1 #test size fraction
B=500 #number of permutations
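
To give a feel for what `B` controls, here is a minimal sketch of how a permutation p-value is commonly computed from `B` permuted statistics. The helper `permutation_p_value` and the "+1" correction are assumptions made for illustration; they are not taken from the *DetectShift* source, which may use a different convention.

```python
import numpy as np

def permutation_p_value(observed, permuted_stats):
    """Hypothetical helper: share of permuted statistics at least as large
    as the observed one, with the usual +1 correction."""
    permuted_stats = np.asarray(permuted_stats)
    n_perm = permuted_stats.size
    return (1 + np.sum(permuted_stats >= observed)) / (n_perm + 1)

n_perm = 500  # same value as B above

# Extreme case 1: no permuted statistic reaches the observed one.
smallest = permutation_p_value(10.0, np.zeros(n_perm))

# Extreme case 2: every permuted statistic reaches the observed one.
largest = permutation_p_value(0.0, np.ones(n_perm))

print(smallest, largest)
```

Under this convention the smallest attainable p-value is 1/(B+1), so a larger `B` gives finer resolution at the cost of more computation.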

Creating data (**please use the *prep_data* function to prepare your data**):

In [17]:
Xs_train, Xs_test, ys_train, ys_test, Zs_train, Zs_test, \
Xt_train, Xt_test, yt_train, yt_test, Zt_train, Zt_test = ds.tools.prep_data(Xs, ys, Xt, yt, test=test, task=task, random_state=random_seed)            
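
As a rough mental model of the outputs above, the sketch below splits one toy dataset into train and test parts and builds a joint array `Z` by stacking features and labels. This is only an assumption about what the `Z*` objects represent (the pair (X, y)); the helper `split_and_stack` is hypothetical, and the way `prep_data` actually encodes labels or returns data frames is not reproduced here.

```python
import numpy as np

def split_and_stack(X, y, test=0.1, seed=42):
    """Hypothetical helper: random train/test split plus a joint array Z = [X, y]."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    n_test = int(round(test * n))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    Z = np.column_stack([X, y])  # features and label side by side
    return (X[train_idx], X[test_idx],
            y[train_idx], y[test_idx],
            Z[train_idx], Z[test_idx])

# Toy data: 100 rows, 3 features, 10 classes (illustrative only).
toy_rng = np.random.default_rng(0)
X_toy = toy_rng.normal(size=(100, 3))
y_toy = toy_rng.integers(0, 10, size=100)

X_tr, X_te, y_tr, y_te, Z_tr, Z_te = split_and_stack(X_toy, y_toy, test=0.1)
print(X_tr.shape, X_te.shape, Z_tr.shape, Z_te.shape)
```

In the demo itself, the same kind of split is applied to both the source and the target data, which is why twelve objects are returned.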

Training models (in this case, we use a CatBoost classifier with early stopping):

In [None]:
#Training classifiers to estimate the Radon-Nikodym (R-N) derivative, i.e., the density ratio
totshift_model = ds.tools.KL(boost=True)
totshift_model.fit(Zs_train, Zt_train)
covshift_model = ds.tools.KL(boost=True)
covshift_model.fit(Xs_train, Xt_train)

#Estimating the conditional distribution
cd_model = ds.cdist.cde_class(boost=True)
cd_model.fit(pd.concat([Xs_train, Xt_train], axis=0), 
             pd.concat([ys_train, yt_train], axis=0))
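
The idea of estimating a KL divergence with a classifier can be sketched on toy data. The example below is not the `ds.tools.KL` implementation: it swaps in a plain NumPy logistic regression for CatBoost, uses two 1-D Gaussians, and picks the direction KL(target || source) purely for illustration. With balanced classes, the classifier's logit approximates the log density ratio, and averaging it over held-out target points gives the estimate.

```python
import numpy as np

def fit_logistic(X, d, lr=1.0, n_iter=2000):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # P(domain = target | x)
        grad = p - d
        w -= lr * (X.T @ grad) / len(d)
        b -= lr * grad.mean()
    return w, b

toy_rng = np.random.default_rng(0)
n = 4000
xs_toy = toy_rng.normal(0.0, 1.0, size=(n, 1))  # "source": N(0, 1)
xt_toy = toy_rng.normal(1.0, 1.0, size=(n, 1))  # "target": N(1, 1)

# Train on the first half of each sample (0 = source, 1 = target).
half = n // 2
X_fit = np.vstack([xs_toy[:half], xt_toy[:half]])
d_fit = np.concatenate([np.zeros(half), np.ones(half)])
w, b = fit_logistic(X_fit, d_fit)

# With balanced classes, the logit estimates log(p_target(x) / p_source(x)).
log_ratio_heldout = xt_toy[half:] @ w + b
kl_hat = float(log_ratio_heldout.mean())

print(kl_hat)  # the exact KL between N(1, 1) and N(0, 1) is 0.5
```

Holding out part of the target sample for the final average mirrors the train/test separation used in this demo.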

Getting test statistics and p-values with the *ShiftDiagnostics* function (all at once):

In [None]:
out = ds.tests.ShiftDiagnostics(Xs_test, ys_test, Xt_test, yt_test,
                                totshift_model=totshift_model, covshift_model=covshift_model, labshift_model=None,
                                cd_model=cd_model, task=task, B=B, verbose=True)

Visualizing the results:

In [None]:
pd.DataFrame(out).T.iloc[:,:2]

Getting test statistics and p-values separately:

In [None]:
verbose = True

print("Calculating p-value for total shift...") 
tot = ds.tests.Permut(Zs_test, Zt_test, totshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for label shift...")
lab = ds.tests.PermutDiscrete(ys_test, yt_test, B=B, verbose = verbose)

print("\nCalculating p-value for covariate shift...")
cov = ds.tests.Permut(Xs_test, Xt_test, covshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 1...")
conc1 = ds.tests.LocalPermut(Xs_test, ys_test, Xt_test, yt_test, 
                             totshift_model, labshift_model=None, task=task, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 2...")
conc2 = ds.tests.CondRand(Xs_test, ys_test, Xt_test, yt_test, 
                          cd_model, totshift_model, covshift_model, B=B, verbose = verbose)
    
out = {'tot':tot, 'lab':lab, 'cov':cov, 'conc1':conc1, 'conc2':conc2}
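
For intuition about the label shift test, here is a minimal NumPy sketch of a permutation test on discrete labels. It is not the `PermutDiscrete` implementation: the statistic (a smoothed KL divergence between empirical label frequencies), the smoothing constant `eps`, and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

def label_kl(y_source, y_target, n_classes, eps=1e-6):
    """Smoothed KL(target || source) between empirical label frequencies."""
    p = np.bincount(y_source, minlength=n_classes) + eps
    q = np.bincount(y_target, minlength=n_classes) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(q * np.log(q / p)))

def label_shift_permutation_test(y_source, y_target, n_classes, n_perm=200, seed=0):
    """Hypothetical helper: returns (observed statistic, permutation p-value)."""
    rng = np.random.default_rng(seed)
    observed = label_kl(y_source, y_target, n_classes)
    pooled = np.concatenate([y_source, y_target])
    n_s = len(y_source)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # reassign labels to domains at random
        if label_kl(perm[:n_s], perm[n_s:], n_classes) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (n_perm + 1)

toy_rng = np.random.default_rng(42)
y_src = toy_rng.integers(0, 10, size=500)        # uniform over 10 classes
y_tgt_same = toy_rng.integers(0, 10, size=500)   # same label distribution
skewed = np.array([0.55] + [0.05] * 9)           # class 0 strongly over-represented
y_tgt_shift = toy_rng.choice(10, size=500, p=skewed)

kl_same, p_same = label_shift_permutation_test(y_src, y_tgt_same, n_classes=10)
kl_shift, p_shift = label_shift_permutation_test(y_src, y_tgt_shift, n_classes=10)
print(kl_same, p_same)
print(kl_shift, p_shift)
```

Because the labels are discrete, no model needs to be trained for this test, which is consistent with `labshift_model=None` in the calls above.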

Visualizing the results:

In [None]:
pd.DataFrame(out).T.iloc[:,:2]