<a href="https://colab.research.google.com/github/felipemaiapolo/detectshift/blob/main/demo/Classification1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset shift diagnostics with *DetectShift* in a binary classification task

## Getting started

Installing *DetectShift*:

In [1]:
from IPython.display import clear_output
!pip install detectshift
clear_output()

Loading packages:

In [2]:
import detectshift as ds
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

random_seed=0
np.random.seed(random_seed)

Defining a function to generate data. It generates data as in the first experiment of Section 3.1 of the paper ["A unified framework for dataset shift diagnostics"](https://arxiv.org/pdf/2205.08340.pdf): $\delta$ controls label shift, while $\gamma$ controls concept shift of type 1.

*Note: the code used in this demo is not exactly the same as the code used in the paper, so we expect some variation, especially in the p-values. However, we expect the results to be qualitatively very similar.*

In [3]:
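def GenData2(gamma, delta, n=1000):
    """Generate source (s) and target (t) samples.

    delta shifts P(y=1) in the target (label shift); gamma shifts the mean
    of X given y in the target (concept shift of type 1).
    """
    ys = np.random.binomial(1, .5, n)           # source labels
    yt = np.random.binomial(1, .5 + delta, n)   # target labels
    Xs = []
    Xt = []
    for i in range(n):
        # three features per point, each with unit variance
        Xs.append(np.random.normal(ys[i], 1, 3).tolist())
        Xt.append(np.random.normal(yt[i] + gamma, 1, 3).tolist())
    Xs = np.array(Xs)   # shape (n, 3)
    Xt = np.array(Xt)   # shape (n, 3)
    return Xs, ys, Xt, yt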
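
As a quick sanity check of what the generator above produces, the following sketch draws data from the same distributions in vectorized form and prints the label proportions and feature means. The function name `gen_data_vectorized` is ours (hypothetical, not part of *DetectShift*), and it uses NumPy's `default_rng`, so it will not reproduce the exact draws of `GenData2`.

```python
import numpy as np

def gen_data_vectorized(gamma, delta, n=1000, seed=0):
    """Hypothetical vectorized variant of GenData2 (same distributions, different draws)."""
    rng = np.random.default_rng(seed)
    ys = rng.binomial(1, 0.5, n)            # source labels: P(y=1) = 0.5
    yt = rng.binomial(1, 0.5 + delta, n)    # target labels: P(y=1) = 0.5 + delta
    Xs = rng.normal(ys[:, None], 1.0, size=(n, 3))          # source features ~ N(y, 1)
    Xt = rng.normal(yt[:, None] + gamma, 1.0, size=(n, 3))  # target features ~ N(y + gamma, 1)
    return Xs, ys, Xt, yt

Xs, ys, Xt, yt = gen_data_vectorized(gamma=1, delta=0.4, n=20000)
print(Xs.shape, Xt.shape)
print("P(y=1): source %.2f, target %.2f" % (ys.mean(), yt.mean()))
print("mean of X given y=0: source %.2f, target %.2f"
      % (Xs[ys == 0].mean(), Xt[yt == 0].mean()))
```

With $\gamma=1$ and $\delta=.4$, the target label proportion should be close to 0.9 and the target features for $y=0$ should be centered near 1 instead of 0.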

## Example 1

In this example, we induce concept shift of type 1 but no label shift.

We start by setting some parameters:

In [4]:
#DetectShift parameters
task='class'
test=.1 #test size fraction
B=500 #number of permutations

#Shift parameters
gamma=1
delta=0

Creating data (**please use the *prep_data* function to prepare your data**):

In [5]:
Xs, ys, Xt, yt = GenData2(gamma, delta) #"s" stands for "source" and "t" stands for "target"

Xs_train, Xs_test, ys_train, ys_test, Zs_train, Zs_test, \
Xt_train, Xt_test, yt_train, yt_test, Zt_train, Zt_test = ds.tools.prep_data(Xs, ys, Xt, yt, test=test, task=task, random_state=random_seed)            



Training models (in this case, we use a CatBoost classifier with early stopping):

In [6]:
#Training classifiers to estimate the Radon-Nikodym (R-N) derivative
totshift_model = ds.tools.KL(boost=True)
totshift_model.fit(Zs_train, Zt_train)
covshift_model = ds.tools.KL(boost=True)
covshift_model.fit(Xs_train, Xt_train)

#Estimating the conditional distribution
cd_model = ds.cdist.cde_class(boost=True)
cd_model.fit(pd.concat([Xs_train, Xt_train], axis=0), 
             pd.concat([ys_train, yt_train], axis=0))
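
To give some intuition for "training classifiers to estimate the R-N derivative": a probabilistic classifier trained to tell target points from source points yields the log density ratio through its log-odds (when both samples have the same size), and averaging that log-ratio over held-out target points estimates a KL divergence. The sketch below illustrates the idea on a toy problem, with a plain NumPy logistic regression swapped in for the classifier used in this demo. It is not the *DetectShift* implementation; the helper `fit_logistic`, the direction of the KL divergence, and the toy distributions (which mimic the $\gamma=1$ shift of the three features for a fixed label) are our assumptions.

```python
import numpy as np

def fit_logistic(Z, t, lr=0.5, n_iter=3000):
    """Hypothetical helper: unregularized logistic regression by gradient descent."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])   # append an intercept column
    w = np.zeros(Z1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z1 @ w))       # predicted P(target | z)
        w -= lr * Z1.T @ (p - t) / len(t)       # gradient step on the mean log-loss
    return w

rng = np.random.default_rng(0)
n, d, gamma = 2000, 3, 1.0
src_train = rng.normal(0.0, 1.0, (n, d))        # "source" sample
tgt_train = rng.normal(gamma, 1.0, (n, d))      # "target" sample, mean shifted by gamma
tgt_test = rng.normal(gamma, 1.0, (n, d))       # held-out target points

# Label source points 0 and target points 1, then fit the classifier
Z = np.vstack([src_train, tgt_train])
t = np.concatenate([np.zeros(n), np.ones(n)])
w = fit_logistic(Z, t)

# With balanced samples, the log-odds estimate log(p_target(z) / p_source(z))
log_ratio = tgt_test @ w[:-1] + w[-1]
kl_hat = log_ratio.mean()                       # estimate of KL(target || source)
kl_true = d * gamma**2 / 2                      # closed form for unit-variance Gaussians
print(f"estimated KL: {kl_hat:.3f}, closed form: {kl_true:.3f}")
```

Logistic regression is well specified here because the true log-ratio of two Gaussians with equal covariance is linear in $z$; a more flexible classifier is needed when it is not.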

Getting test statistics and p-values using the *ShiftDiagnostics* function (all at once):

In [None]:
out = ds.tests.ShiftDiagnostics(Xs_test, ys_test, Xt_test, yt_test,
                                totshift_model=totshift_model, covshift_model=covshift_model, labshift_model=None,
                                cd_model=cd_model, task=task, B=B, verbose=True)

Visualizing the results:

In [8]:
pd.DataFrame(out).T.iloc[:,:2]

| | pval | kl |
|---|---|---|
| tot | 0.001996 | 0.726824 |
| lab | 0.532934 | 0.003218 |
| cov | 0.001996 | 0.394171 |
| conc1 | 0.001996 | 0.723605 |
| conc2 | 0.003992 | 0.332653 |
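
The p-values in the table above are all multiples of $1/(B+1)$: with $B=500$, the smallest one is $1/501 \approx 0.001996$. This is consistent with the usual permutation p-value, $(1 + \#\{\text{permuted statistic} \geq \text{observed}\})/(B+1)$. The sketch below shows that generic recipe on a toy statistic (a difference in means). The functions `permutation_pvalue` and `mean_gap` are hypothetical, and we are assuming, not asserting, that *DetectShift* follows the same convention.

```python
import numpy as np

def permutation_pvalue(a, b, stat, B=500, seed=0):
    """Hypothetical helper: generic one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = stat(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)          # shuffle the group membership
        if stat(perm[:len(a)], perm[len(a):]) >= observed:
            count += 1
    return (1 + count) / (B + 1)

def mean_gap(a, b):
    """Toy test statistic: difference between the two sample means."""
    return b.mean() - a.mean()

rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, 100)
shifted = rng.normal(3.0, 1.0, 100)     # clearly shifted target
same = rng.normal(0.0, 1.0, 100)        # target drawn from the source distribution

p_shift = permutation_pvalue(source, shifted, mean_gap)
p_null = permutation_pvalue(source, same, mean_gap)
print(round(p_shift, 6), round(p_null, 6))
```

Under a clear shift no permutation reaches the observed statistic, so the p-value hits its floor of $1/(B+1)$; increasing $B$ lowers that floor at the cost of more computation.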


## Example 2

In this example, we induce label shift but no concept shift of type 1.

We start by setting some parameters:

In [9]:
#DetectShift parameters
task='class'
test=.1 #test size fraction
B=500 #number of permutations

#Shift parameters
gamma=0
delta=.4
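
With $\delta=.4$, `GenData2` draws source labels from a Bernoulli(0.5) distribution and target labels from a Bernoulli(0.9) distribution. For reference, the sketch below computes the population KL divergence between these two label distributions in both directions, since this demo does not state which direction *DetectShift* reports. The function `bernoulli_kl` is a hypothetical helper.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL( Bernoulli(p) || Bernoulli(q) ) in nats; hypothetical helper."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

delta = 0.4
p_source, p_target = 0.5, 0.5 + delta
kl_ts = bernoulli_kl(p_target, p_source)
kl_st = bernoulli_kl(p_source, p_target)
print(f"KL(target || source) = {kl_ts:.4f}")
print(f"KL(source || target) = {kl_st:.4f}")
```

An estimate computed from a small test split should be of the same order, though it will vary from sample to sample.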

Creating data (**please use the *prep_data* function to prepare your data**):

In [10]:
Xs, ys, Xt, yt = GenData2(gamma, delta)

Xs_train, Xs_test, ys_train, ys_test, Zs_train, Zs_test, \
Xt_train, Xt_test, yt_train, yt_test, Zt_train, Zt_test = ds.tools.prep_data(Xs, ys, Xt, yt, test=test, task=task, random_state=random_seed)            



Training models (in this case, we use $\ell_2$-regularized logistic regression):

In [11]:
#Training classifiers to estimate the R-N derivative
totshift_model = ds.tools.KL(boost=False, cv=5)
totshift_model.fit(Zs_train, Zt_train)
covshift_model = ds.tools.KL(boost=False, cv=5)
covshift_model.fit(Xs_train, Xt_train)

#Estimating the conditional distribution
cd_model = ds.cdist.cde_class(boost=False, cv=5)
cd_model.fit(pd.concat([Xs_train, Xt_train], axis=0), 
             pd.concat([ys_train, yt_train], axis=0))

Getting test statistics and p-values separately:

In [None]:
verbose = True

print("Calculating p-value for total shift...") 
tot = ds.tests.Permut(Zs_test, Zt_test, totshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for label shift...")
lab = ds.tests.PermutDiscrete(ys_test, yt_test, B=B, verbose = verbose)

print("\nCalculating p-value for covariate shift...")
cov = ds.tests.Permut(Xs_test, Xt_test, covshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 1...")
conc1 = ds.tests.LocalPermut(Xs_test, ys_test, Xt_test, yt_test, 
                             totshift_model, labshift_model=None, task=task, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 2...")
conc2 = ds.tests.CondRand(Xs_test, ys_test, Xt_test, yt_test, 
                          cd_model, totshift_model, covshift_model, B=B, verbose = verbose)
    
out = {'tot':tot, 'lab':lab, 'cov':cov, 'conc1':conc1, 'conc2':conc2}

Visualizing the results:

In [13]:
pd.DataFrame(out).T.iloc[:,:2]

| | pval | kl |
|---|---|---|
| tot | 0.001996 | 0.391496 |
| lab | 0.001996 | 0.43138 |
| cov | 0.001996 | 0.190247 |
| conc1 | 0.167665 | -0.039885 |
| conc2 | 0.001996 | 0.201249 |
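
The `kl` columns of both result tables satisfy, up to rounding, `conc1 = tot - lab` and `conc2 = tot - cov`, which mirrors the chain rule for the KL divergence (joint shift = marginal shift + conditional shift). This also explains how the `conc1` estimate above can be slightly negative: it is a difference of two estimates. The check below simply redoes that arithmetic with the numbers copied from the tables; reading the columns this way is our inference from the output, not a statement about the library's internals.

```python
# KL estimates copied from the two result tables in this demo
results = {
    "example 1": {"tot": 0.726824, "lab": 0.003218, "cov": 0.394171,
                  "conc1": 0.723605, "conc2": 0.332653},
    "example 2": {"tot": 0.391496, "lab": 0.431380, "cov": 0.190247,
                  "conc1": -0.039885, "conc2": 0.201249},
}

gaps = {}
for name, kl in results.items():
    gap1 = kl["tot"] - kl["lab"] - kl["conc1"]   # residual of tot = lab + conc1
    gap2 = kl["tot"] - kl["cov"] - kl["conc2"]   # residual of tot = cov + conc2
    gaps[name] = (gap1, gap2)
    print(f"{name}: tot-lab-conc1 = {gap1:+.6f}, tot-cov-conc2 = {gap2:+.6f}")
```

Both residuals are within rounding error of zero in each example.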
