<a href="https://colab.research.google.com/github/felipemaiapolo/detectshift/blob/main/Regression1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset shift diagnostics with *DetectShift* in a regression task

## Getting started

Installing *DetectShift*:

In [1]:
from IPython.display import clear_output
!pip install git+https://github.com/felipemaiapolo/detectshift
clear_output()

Loading packages:

In [2]:
import detectshift as ds
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

random_seed=42
np.random.seed(random_seed)

We define a function that generates data as in the second experiment of Section 3.1 of the paper ["A unified framework for dataset shift diagnostics"](https://arxiv.org/pdf/2205.08340.pdf). Here $\lambda$ controls covariate shift, while $\theta$ controls concept shift of type 2:

In [3]:
def GenData(theta, lamb, n = 1000):
    Xs = np.random.normal(0,1,n)
    ys = np.random.normal(Xs,1)
    Xs = Xs.reshape(-1,1)
    ys = ys.reshape(-1)
    Xt = np.random.normal(lamb,1,n)
    yt = np.random.normal(theta + Xt,1)
    Xt = Xt.reshape(-1,1)
    yt = yt.reshape(-1)
    return Xs, ys, Xt, yt
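
Written out, the sampling scheme in `GenData` is the following, with subscript $s$ for source and $t$ for target and the second argument of $N(\cdot, \cdot)$ denoting the variance. The marginal distributions of $y$ on the right are not in the code; they follow from adding two independent normal terms:

$$
\begin{aligned}
X_s &\sim N(0, 1), & y_s \mid X_s &\sim N(X_s, 1) &&\Rightarrow\; y_s \sim N(0, 2),\\
X_t &\sim N(\lambda, 1), & y_t \mid X_t &\sim N(\theta + X_t, 1) &&\Rightarrow\; y_t \sim N(\lambda + \theta, 2).
\end{aligned}
$$

So $\lambda \neq 0$ moves the distribution of $X$, $\theta \neq 0$ moves the distribution of $y$ given $X$, and either one moves the marginal distribution of $y$ unless $\lambda + \theta = 0$.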

## Example 1

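In this example, we induce covariate shift ($\lambda = 1$) but no concept shift of type 2 ($\theta = 0$). Because $y$ depends on $X$, shifting $X$ also shifts the distribution of $y$ and of $X$ given $y$, so label shift and concept shift of type 1 are expected as well.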

We start by setting some parameters:

In [4]:
#DetectShift parameters
task='reg'
test=.1 #test size fraction
B=500 #number of permutations

#Shift parameters
lamb=1
theta=0
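
Because the data are Gaussian with unit variances, the population KL divergences behind each type of shift have closed forms, which gives a reference point for the estimates. The sketch below assumes the decomposition total = covariate + concept type 2 = label + concept type 1 (the KL chain rule applied in the two possible orders) and assumes that the `kl` values reported by DetectShift estimate these quantities. `gaussian_shift_kls` is a hypothetical helper, not part of the library.

```python
def gaussian_shift_kls(lamb, theta):
    """Population KL terms for the Gaussian setup X_s~N(0,1), X_t~N(lamb,1),
    y|X ~ N(X,1) (source) and N(theta+X,1) (target).

    Uses KL(N(m1, v) || N(m2, v)) = (m1 - m2)**2 / (2 * v), which is symmetric
    in the two distributions when the variances are equal.
    """
    cov = lamb**2 / 2                # shift in X: means 0 vs lamb, variance 1
    conc2 = theta**2 / 2             # shift in y | X: means differ by theta, variance 1
    tot = cov + conc2                # chain rule on (X, then y | X)
    lab = (lamb + theta) ** 2 / 4    # shift in y: means 0 vs lamb + theta, variance 2
    conc1 = tot - lab                # chain rule on (y, then X | y)
    return {"tot": tot, "lab": lab, "cov": cov, "conc1": conc1, "conc2": conc2}

print(gaussian_shift_kls(1, 0))
# {'tot': 0.5, 'lab': 0.25, 'cov': 0.5, 'conc1': 0.25, 'conc2': 0.0}
```

For $\lambda = 1$ and $\theta = 0$, only the concept shift of type 2 is zero at the population level. Estimates from a test set of 100 points per domain will not match these values exactly.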

Creating data (**please use the *prep_data* function to prepare your data**):

In [5]:
Xs, ys, Xt, yt = GenData(theta, lamb) #"s" stands for "source" and "t" stands for "target"

Xs_train, Xs_test, ys_train, ys_test, Zs_train, Zs_test, \
Xt_train, Xt_test, yt_train, yt_test, Zt_train, Zt_test = ds.tools.prep_data(Xs, ys, Xt, yt, test=test, task=task, random_state=random_seed)            
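
Judging from how its outputs are used later, `prep_data` returns train/test splits of the features, the labels, and a joint object `Z` for each domain, as pandas objects. The sketch below shows one way such a split could look for a single domain, assuming `Z` is simply `X` with `y` appended as an extra column. `split_with_joint` is a hypothetical stand-in written for illustration, not the library's implementation, and the real function may differ in details.

```python
import numpy as np
import pandas as pd

def split_with_joint(X, y, test=0.1, random_state=42):
    """Hypothetical helper: random train/test split of X, y and Z = (X, y)."""
    X = pd.DataFrame(X, columns=[f"x{j}" for j in range(X.shape[1])])
    y = pd.DataFrame({"y": y})
    Z = pd.concat([X, y], axis=1)            # joint sample (assumed layout)
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))            # shuffle row positions
    n_test = int(round(test * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return (X.iloc[train_idx], X.iloc[test_idx],
            y.iloc[train_idx], y.iloc[test_idx],
            Z.iloc[train_idx], Z.iloc[test_idx])

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 1))
y_demo = rng.normal(X_demo[:, 0], 1.0)
X_tr, X_te, y_tr, y_te, Z_tr, Z_te = split_with_joint(X_demo, y_demo)
print(X_tr.shape, X_te.shape, Z_tr.shape, Z_te.shape)
```

With `test=.1` and 1000 rows, this leaves 100 rows per domain for the tests, which is worth keeping in mind when reading the p-values.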

Training models (in this case, we use CatBoost with early stopping):

In [6]:
#Training classifiers to estimate the Radon-Nikodym (R-N) derivative
totshift_model = ds.tools.KL(boost=True)
totshift_model.fit(Zs_train, Zt_train)
covshift_model = ds.tools.KL(boost=True)
covshift_model.fit(Xs_train, Xt_train)
labshift_model = ds.tools.KL(boost=True)
labshift_model.fit(ys_train, yt_train)

#Estimating the conditional distribution
cd_model = ds.cdist.cde_reg(boost=True)
cd_model.fit(pd.concat([Xs_train, Xt_train], axis=0), 
             pd.concat([ys_train, yt_train], axis=0))
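
The cell above trains classifiers to estimate the R-N (Radon-Nikodym) derivative. The idea can be sketched with a plain logistic regression: when the source and target training sets have equal size, the fitted logit approximates the log density ratio, and averaging it over held-out points gives a KL estimate. The code below is a minimal NumPy sketch under those assumptions. `fit_logistic` and `kl_from_logit` are hypothetical helpers, the KL direction (target relative to source) is chosen for illustration, and none of this is claimed to match how `ds.tools.KL` works internally.

```python
import numpy as np

def fit_logistic(X, labels, n_iter=25):
    """Unregularized logistic regression with intercept, fit by Newton's method."""
    A = np.column_stack([np.ones(len(X)), X])       # add intercept column
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))            # predicted P(target | x)
        grad = A.T @ (p - labels)
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        w -= np.linalg.solve(hess, grad)
    return w

def kl_from_logit(w, X_eval):
    """Mean logit over evaluation points; with balanced classes the logit
    approximates log(p_target(x) / p_source(x))."""
    A = np.column_stack([np.ones(len(X_eval)), X_eval])
    return float(np.mean(A @ w))

rng = np.random.default_rng(0)
n = 5000
Xs_demo = rng.normal(0.0, 1.0, size=(n, 1))         # source features
Xt_demo = rng.normal(1.0, 1.0, size=(n, 1))         # target features
w = fit_logistic(np.vstack([Xs_demo, Xt_demo]),
                 np.concatenate([np.zeros(n), np.ones(n)]))

Xt_holdout = rng.normal(1.0, 1.0, size=(n, 1))      # fresh target sample
kl_hat = kl_from_logit(w, Xt_holdout)
print(kl_hat)
```

For $N(1, 1)$ against $N(0, 1)$ the population value is $0.5$, so the printed estimate should land close to it.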

Getting test statistics and p-values using the *ShiftDiagnostics* function (all at once):

In [7]:
out = ds.tests.ShiftDiagnostics(Xs_test, ys_test, Xt_test, yt_test,
                                totshift_model=totshift_model, covshift_model=covshift_model, labshift_model=labshift_model,
                                cd_model=cd_model, task=task, n_bins=10, B=B, verbose=True)

Calculating p-value for total shift...
100%|██████████| 500/500 [00:00<00:00, 7313.42it/s]

Calculating p-value for label shift...
100%|██████████| 500/500 [00:00<00:00, 13423.58it/s]

Calculating p-value for covariate shift...
100%|██████████| 500/500 [00:00<00:00, 5502.92it/s]

Calculating p-value for concept shift type 1...
100%|██████████| 500/500 [00:05<00:00, 88.96it/s]

Calculating p-value for concept shift type 2...
100%|██████████| 500/500 [00:02<00:00, 207.62it/s]

Viewing the results:

In [8]:
pd.DataFrame(out).T.iloc[:,:2]

| | pval | kl |
|---|---|---|
| tot | 0.001996 | 0.339829 |
| lab | 0.001996 | 0.146041 |
| cov | 0.001996 | 0.364368 |
| conc1 | 0.001996 | 0.193788 |
| conc2 | 0.520958 | -0.024539 |
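
The value 0.001996 is 1/501, the smallest p-value a permutation test with `B=500` can return under the common $(1 + \text{count})/(B + 1)$ convention, where count is the number of permuted statistics at least as large as the observed one. The sketch below illustrates that convention with a simple stand-in statistic (an absolute difference in means). Both the convention and the statistic are assumptions made for illustration, not a description of DetectShift's internals.

```python
import numpy as np

def permutation_pvalue(stat_fn, a, b, B=500, seed=0):
    """Two-sample permutation p-value: (1 + #{permuted stat >= observed}) / (B + 1)."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)               # reshuffle sample membership
        if stat_fn(perm[:len(a)], perm[len(a):]) >= observed:
            count += 1
    return (1 + count) / (B + 1)

def mean_gap(a, b):
    """Stand-in test statistic (not a KL estimate)."""
    return abs(a.mean() - b.mean())

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 100)   # "source" sample
b = rng.normal(3.0, 1.0, 100)   # "target" sample with a large mean shift
p_shifted = permutation_pvalue(mean_gap, a, b, B=500)
print(round(p_shifted, 6))      # 0.001996, i.e. 1/501
```

A p-value at this floor therefore means that no permuted statistic reached the observed one. Increasing `B` lowers the floor.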


## Example 2

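In this example, we induce concept shift of type 2 ($\theta = 1$) but no covariate shift ($\lambda = 0$). Shifting $y$ given $X$ also shifts the marginal distribution of $y$ and the distribution of $X$ given $y$, so label shift and concept shift of type 1 are expected as well.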

We start by setting some parameters:

In [9]:
#DetectShift parameters
task='reg'
test=.1 #test size fraction
B=500 #number of permutations

#Shift parameters
lamb=0
theta=1

Creating data (**please use the *prep_data* function to prepare your data**):

In [10]:
Xs, ys, Xt, yt = GenData(theta, lamb) #"s" stands for "source" and "t" stands for "target"

Xs_train, Xs_test, ys_train, ys_test, Zs_train, Zs_test, \
Xt_train, Xt_test, yt_train, yt_test, Zt_train, Zt_test = ds.tools.prep_data(Xs, ys, Xt, yt, test=test, task=task, random_state=random_seed)            

Training models (in this case, we use $\ell_2$-regularized logistic regression for the classifiers, and ridge regression with normal errors for the conditional distribution model):

In [11]:
#Training classifiers to estimate the Radon-Nikodym (R-N) derivative
totshift_model = ds.tools.KL(boost=False, cv=5)
totshift_model.fit(Zs_train, Zt_train)
covshift_model = ds.tools.KL(boost=False, cv=5)
covshift_model.fit(Xs_train, Xt_train)
labshift_model = ds.tools.KL(boost=False, cv=5)
labshift_model.fit(ys_train, yt_train)

#Estimating the conditional distribution
cd_model = ds.cdist.cde_reg(boost=False, cv=5)
cd_model.fit(pd.concat([Xs_train, Xt_train], axis=0), 
             pd.concat([ys_train, yt_train], axis=0))
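
A "ridge regression + normal error" conditional model can be pictured as follows: fit a ridge regression for the conditional mean, estimate a single residual standard deviation, and treat $y$ given $X$ as normal around the prediction. The class below is a hypothetical NumPy sketch of that idea with a fixed penalty `alpha`; the cross-validation implied by `cv=5` is omitted. It is not the implementation of `ds.cdist.cde_reg`.

```python
import numpy as np

class RidgeNormalCDE:
    """Hypothetical conditional model: y | x ~ N(a + b'x, sigma^2)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), X])    # intercept + features
        penalty = self.alpha * np.eye(A.shape[1])
        penalty[0, 0] = 0.0                          # leave the intercept unpenalized
        self.coef_ = np.linalg.solve(A.T @ A + penalty, A.T @ y)
        resid = y - A @ self.coef_
        self.sigma_ = resid.std(ddof=A.shape[1])     # residual standard deviation
        return self

    def sample(self, X, rng):
        """Draw one y per row of X from the fitted normal model."""
        A = np.column_stack([np.ones(len(X)), X])
        return rng.normal(A @ self.coef_, self.sigma_)

rng = np.random.default_rng(0)
X_demo = rng.normal(0.0, 1.0, size=(2000, 1))
y_demo = rng.normal(X_demo[:, 0], 1.0)               # same form as the source data
cde = RidgeNormalCDE(alpha=1.0).fit(X_demo, y_demo)
y_new = cde.sample(X_demo, rng)
print(cde.coef_, cde.sigma_)
```

Being able to draw new labels given the features is what a conditional randomization test needs: it compares the observed statistic with statistics computed on resampled labels.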

Getting test statistics and p-values separately:

In [12]:
verbose = True

print("Calculating p-value for total shift...") 
tot = ds.tests.Permut(Zs_test, Zt_test, totshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for label shift...")
lab = ds.tests.Permut(ys_test, yt_test, labshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for covariate shift...")
cov = ds.tests.Permut(Xs_test, Xt_test, covshift_model, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 1...")
conc1 = ds.tests.LocalPermut(Xs_test, ys_test, Xt_test, yt_test, 
                             totshift_model, labshift_model=labshift_model, task=task, n_bins=10, B=B, verbose = verbose)

print("\nCalculating p-value for concept shift type 2...")
conc2 = ds.tests.CondRand(Xs_test, ys_test, Xt_test, yt_test, 
                          cd_model, totshift_model, covshift_model, B=B, verbose = verbose)
    
out = {'tot':tot, 'lab':lab, 'cov':cov, 'conc1':conc1, 'conc2':conc2}

Calculating p-value for total shift...
100%|██████████| 500/500 [00:00<00:00, 6581.26it/s]

Calculating p-value for label shift...
100%|██████████| 500/500 [00:00<00:00, 5036.46it/s]

Calculating p-value for covariate shift...
100%|██████████| 500/500 [00:00<00:00, 5228.72it/s]

Calculating p-value for concept shift type 1...
100%|██████████| 500/500 [00:08<00:00, 57.37it/s]

Calculating p-value for concept shift type 2...
100%|██████████| 500/500 [00:01<00:00, 296.05it/s]
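
`LocalPermut` takes an `n_bins` argument, which suggests that, for a continuous label, permutations are carried out within bins of $y$ so that the label distribution in each domain is roughly preserved. The sketch below shows one such local permutation step: pool the two samples, cut $y$ into quantile bins, and shuffle the source/target indicator inside each bin only. This is an assumption about the general technique. `local_permute_labels` is a hypothetical helper and may differ from what DetectShift does internally.

```python
import numpy as np

def local_permute_labels(y, domain, n_bins=10, rng=None):
    """Shuffle the domain indicator (0 = source, 1 = target) within quantile bins of y."""
    rng = np.random.default_rng() if rng is None else rng
    inner_q = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]   # interior quantile levels
    edges = np.quantile(y, inner_q)
    bins = np.digitize(y, edges)                        # bin index in 0..n_bins-1
    permuted = domain.copy()
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        permuted[idx] = rng.permutation(domain[idx])    # shuffle inside this bin only
    return permuted, bins

rng = np.random.default_rng(0)
y_pool = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
domain = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])
permuted, bins = local_permute_labels(y_pool, domain, n_bins=10, rng=rng)

# The number of target points in every bin is unchanged by the shuffle.
print(all(permuted[bins == b].sum() == domain[bins == b].sum() for b in np.unique(bins)))
```

Keeping the per-bin counts fixed is what separates this from the global permutation used for the total, label, and covariate shift tests.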


Viewing the results:

In [13]:
pd.DataFrame(out).T.iloc[:,:2]

| | pval | kl |
|---|---|---|
| tot | 0.001996 | 0.551234 |
| lab | 0.001996 | 0.297965 |
| cov | 0.89022 | -7e-06 |
| conc1 | 0.001996 | 0.253269 |
| conc2 | 0.001996 | 0.551241 |
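
In both result tables the `kl` column is additive: `conc1` equals `tot - lab` and `conc2` equals `tot - cov`, up to rounding. This also explains how a KL estimate can come out negative, as `conc2` does in Example 1. The check below uses the numbers printed in the two tables, copied by hand. That the concept shift estimates are obtained by subtraction is an inference from these numbers, not something the notebook states.

```python
# kl values copied from the two result tables in this notebook
results = {
    "example_1": {"tot": 0.339829, "lab": 0.146041, "cov": 0.364368,
                  "conc1": 0.193788, "conc2": -0.024539},
    "example_2": {"tot": 0.551234, "lab": 0.297965, "cov": -7e-06,
                  "conc1": 0.253269, "conc2": 0.551241},
}

TOL = 5e-6  # allowance for the rounding of the printed values

checks = {}
for name, kl in results.items():
    gap_conc1 = kl["conc1"] - (kl["tot"] - kl["lab"])   # label + concept type 1
    gap_conc2 = kl["conc2"] - (kl["tot"] - kl["cov"])   # covariate + concept type 2
    checks[name] = (abs(gap_conc1) < TOL, abs(gap_conc2) < TOL)
    print(name, checks[name])
# example_1 (True, True)
# example_2 (True, True)
```

In Example 2 the estimated covariate term is essentially zero, so the whole estimated total shift is attributed to concept shift of type 2.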
