# AutoEDA XShiftDetection Tutorial

## Introduction

Distributional shift occurs when the training and test data in a prediction problem follow different distributions. In this tutorial we introduce the `C2STShiftDetector` class, which detects and explains a change in the covariate (X) distribution, a phenomenon that we call XShift. XShift is one of the ways in which distributional shift can manifest, but not the only one.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline 

In [2]:
import autogluon.eda.analysis as eda
import autogluon.eda.auto as auto
import autogluon.eda.visualization as viz

In [3]:
from helpers import load_adult_data, sim_cov_shift
from sklearn import metrics
from autogluon.tabular import TabularPredictor
import plotnine as p9
import bisect

## Importing data

We will import the adult dataset and, in the analysis that follows, construct a version of it with covariate shift. To do so we need a feature that can be used to bias the training sample so that it is no longer representative of the test population. As the value counts below show, marital status has a good mix of married and never-married individuals, which makes it a potential candidate.

In [4]:
train, test = load_adult_data()

train.head()

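|    | age | workclass | fnlwgt | education | education-num | marital-status     | occupation        | relationship  | race  | sex    | capital-gain | capital-loss | hours-per-week | native-country | class |
|---:|----:|:----------|-------:|:----------|--------------:|:-------------------|:------------------|:--------------|:------|:-------|-------------:|-------------:|---------------:|:---------------|:------|
|  0 |  25 | Private   | 178478 | Bachelors |            13 | Never-married      | Tech-support      | Own-child     | White | Female |            0 |            0 |             40 | United-States  | <=50K |
|  1 |  23 | State-gov |  61743 | 5th-6th   |             3 | Never-married      | Transport-moving  | Not-in-family | White | Male   |            0 |            0 |             35 | United-States  | <=50K |
|  2 |  46 | Private   | 376789 | HS-grad   |             9 | Never-married      | Other-service     | Not-in-family | White | Male   |            0 |            0 |             15 | United-States  | <=50K |
|  3 |  55 | ?         | 200235 | HS-grad   |             9 | Married-civ-spouse | ?                 | Husband       | White | Male   |            0 |            0 |             50 | United-States  | >50K  |
|  4 |  36 | Private   | 224541 | 7th-8th   |             4 | Married-civ-spouse | Handlers-cleaners | Husband       | White | Male   |            0 |            0 |             40 | El-Salvador    | <=50K |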


In [5]:
train[['marital-status']].value_counts(normalize=True)

marital-status        
 Married-civ-spouse       0.458705
 Never-married            0.330586
 Divorced                 0.134338
 Separated                0.031787
 Widowed                  0.031377
 Married-spouse-absent    0.012592
 Married-AF-spouse        0.000614
dtype: float64

## Detecting XShift

First we apply the XShift detector to the original adult dataset. The detector is based on the Classifier Two-Sample Test (C2ST), hence the name `C2STShiftDetector`. The test trains a classifier to predict whether a sample belongs to the training set or the test set and measures its balanced accuracy. Here the balanced accuracy is 50.02%, so the test does not detect a substantial difference between the training and test X distributions. This is so close to 50% (random guessing) that we suspect the adult training and test sets are a random split of a single sample.

In [6]:
analysis_args = dict(
    train_data=train,
    test_data=test,
    label='class',
)

viz_args = dict(headers=True)

auto.analyze(
    **analysis_args,
    anlz_facets=[
        eda.shift.XShiftDetector(classifier_kwargs={'path': 'AutogluonModels'}),
    ],
    viz_facets=[
        viz.shift.XShiftSummary(**viz_args),
    ],
)



We did not detect a substantial difference between the training and test X distributions.
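
To make the idea behind the test concrete, here is a minimal sketch of a classifier two-sample test in plain Python. It is not the detector's implementation: the helper names, the toy categories, and the lookup-table "classifier" on a single categorical feature are all assumptions chosen so the example runs without any dependencies. Rows are labelled by origin (0 for training, 1 for test), pooled, and split in half; the classifier is fit on one half and scored by balanced accuracy on the other.

```python
import random
from collections import Counter, defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 is chance level for two classes."""
    hits, totals = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def c2st_balanced_accuracy(train_values, test_values, seed=0):
    """Toy C2ST on one categorical feature (hypothetical helper)."""
    rng = random.Random(seed)
    # Label each value by where it came from: 0 = training, 1 = test.
    pooled = [(v, 0) for v in train_values] + [(v, 1) for v in test_values]
    rng.shuffle(pooled)
    half = len(pooled) // 2
    fit_rows, eval_rows = pooled[:half], pooled[half:]

    # "Fit": count how often each category came from each sample.
    counts = defaultdict(Counter)
    for value, origin in fit_rows:
        counts[value][origin] += 1
    n_origin = Counter(origin for _, origin in fit_rows)

    def predict(value):
        # Compare within-sample frequencies so unequal sample sizes
        # do not bias the vote.
        c = counts.get(value, Counter())
        rate_train = c[0] / max(n_origin[0], 1)
        rate_test = c[1] / max(n_origin[1], 1)
        return 1 if rate_test > rate_train else 0

    y_true = [origin for _, origin in eval_rows]
    y_pred = [predict(value) for value, _ in eval_rows]
    return balanced_accuracy(y_true, y_pred)


rng = random.Random(1)
cats = ["married", "single", "divorced"]  # illustrative categories
same_a = rng.choices(cats, weights=[5, 3, 2], k=4000)
same_b = rng.choices(cats, weights=[5, 3, 2], k=4000)
shifted = rng.choices(cats, weights=[2, 5, 3], k=4000)

acc_same = c2st_balanced_accuracy(same_a, same_b)
acc_shift = c2st_balanced_accuracy(same_a, shifted)
print(f"no shift: {acc_same:.3f}, shift: {acc_shift:.3f}")
```

When both samples come from the same distribution, the score stays near 0.5 because the origin label carries no signal. When the category mix differs, the classifier can exploit it and the score rises above chance.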

## Simulating covariate shift

In this section we simulate covariate shift for the adult dataset. We look for a variable that has high enough entropy to be useful for biasing the training data and that also has some bearing on the final prediction. Marital status is one such variable, and the function `sim_cov_shift` creates a biased sample based on it.
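
The tutorial does not show the body of `sim_cov_shift`, so the following is only a sketch of one way such a biased sample could be drawn, not the helper's actual code. The function `biased_subsample`, the keep probabilities, and the synthetic rows are all hypothetical. Each row is kept with a probability that depends on its marital status, which changes the mix of that feature without touching the other columns of a kept row.

```python
import random


def biased_subsample(rows, column, value, keep_if_match, keep_otherwise, rng):
    """Keep each row with a probability that depends on row[column]."""
    kept = []
    for row in rows:
        p = keep_if_match if row[column] == value else keep_otherwise
        if rng.random() < p:
            kept.append(row)
    return kept


def share(rows, column, value):
    """Fraction of rows whose `column` equals `value`."""
    return sum(row[column] == value for row in rows) / len(rows)


rng = random.Random(0)
statuses = ["Married-civ-spouse", "Never-married", "Divorced"]
# Synthetic stand-in for the real data; the weights are illustrative.
population = [
    {"marital-status": s}
    for s in rng.choices(statuses, weights=[46, 33, 21], k=10_000)
]
half = len(population) // 2
train_rows, test_rows = population[:half], population[half:]

# Thin out married rows in training, thin out everyone else in test.
train_biased = biased_subsample(
    train_rows, "marital-status", "Married-civ-spouse", 0.5, 1.0, rng
)
test_biased = biased_subsample(
    test_rows, "marital-status", "Married-civ-spouse", 1.0, 0.4, rng
)

share_orig = share(population, "marital-status", "Married-civ-spouse")
share_train = share(train_biased, "marital-status", "Married-civ-spouse")
share_test = share(test_biased, "marital-status", "Married-civ-spouse")
print(f"original {share_orig:.2f}, train {share_train:.2f}, test {share_test:.2f}")
```

Because selection depends only on a covariate, the relationship between the covariates and the label within each kept row is left as it was, which is what distinguishes covariate shift from other kinds of shift.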

In [7]:
# pred = TabularPredictor(label='class', 
#                         verbosity=0, 
#                         problem_type='binary',
#                         path='AutogluonModels').fit(train)

In [8]:
# yhat = pred.predict(test)
# metrics.balanced_accuracy_score(test['class'], yhat)  # signature is (y_true, y_pred)

In [9]:
# pred.feature_importance(test)

In [10]:
train_cs, test_cs = sim_cov_shift(train, test)

We can see that the new training data underrepresents the 'Married-civ-spouse' status (about 34% of rows, down from about 46% in the original training data), while the new test data overrepresents it (about 65%).

In [11]:
train_cs.value_counts('marital-status',normalize=True)

marital-status
Never-married            0.403670
Married-civ-spouse       0.340509
Divorced                 0.163548
Separated                0.038353
Widowed                  0.037724
Married-spouse-absent    0.015235
Married-AF-spouse        0.000960
dtype: float64

In [12]:
test_cs.value_counts('marital-status',normalize=True)

marital-status
Married-civ-spouse       0.648721
Never-married            0.210682
Divorced                 0.090890
Widowed                  0.020323
Separated                0.019947
Married-spouse-absent    0.009009
Married-AF-spouse        0.000429
dtype: float64
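
The size of the simulated shift in `marital-status` can be summarized in a single number. The sketch below computes the total variation distance between the two distributions printed above. The choice of metric is ours, for illustration only; it is not something the detector reports.

```python
# Proportions copied from the two value_counts outputs above.
train_cs_share = {
    "Never-married": 0.403670,
    "Married-civ-spouse": 0.340509,
    "Divorced": 0.163548,
    "Separated": 0.038353,
    "Widowed": 0.037724,
    "Married-spouse-absent": 0.015235,
    "Married-AF-spouse": 0.000960,
}
test_cs_share = {
    "Married-civ-spouse": 0.648721,
    "Never-married": 0.210682,
    "Divorced": 0.090890,
    "Widowed": 0.020323,
    "Separated": 0.019947,
    "Married-spouse-absent": 0.009009,
    "Married-AF-spouse": 0.000429,
}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


tv = total_variation(train_cs_share, test_cs_share)
print(round(tv, 3))  # 0.308
```

A value of about 0.31 means that roughly 31% of the probability mass would have to move between categories to turn one distribution into the other. Almost all of it comes from 'Married-civ-spouse', whose share differs by about 0.31 between the two samples.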

We now train the XShift detector on the shifted data.

In [13]:
analysis_args = dict(
    train_data=train_cs,
    test_data=test_cs,
    label='class',
)

viz_args = dict(headers=True)

auto.analyze(
    **analysis_args,
    anlz_facets=[
        eda.shift.XShiftDetector(classifier_kwargs={'path': 'AutogluonModels'}),
    ],
    viz_facets=[
        viz.shift.XShiftSummary(**viz_args),
    ],
)



We detected a substantial difference between the training and test X distributions,
a type of distribution shift.

**Test results**: We can predict whether a sample is in the test vs. training set with a balanced_accuracy of
0.6524 with a p-value of 0.0010 (smaller than the threshold of 0.0100).
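
The output does not say how the p-value is obtained. One common approach for classifier two-sample tests is a permutation test, sketched below under that assumption. The function names, the number of permutations, and the toy labels are hypothetical. The origin labels are shuffled many times, and the p-value is the fraction of shuffles whose balanced accuracy is at least as high as the observed one, with the usual add-one correction.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    hits, totals = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def permutation_p_value(y_true, y_pred, n_permutations=999, seed=0):
    """Share of label shuffles scoring at least as well as the real labels."""
    rng = random.Random(seed)
    observed = balanced_accuracy(y_true, y_pred)
    shuffled = list(y_true)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if balanced_accuracy(shuffled, y_pred) >= observed:
            at_least_as_good += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (1 + at_least_as_good) / (1 + n_permutations)


# Toy labels: 200 training rows (0) followed by 200 test rows (1).
y_true = [0] * 200 + [1] * 200
# Predictions that are right for 65% of each class.
y_pred_shift = [0] * 130 + [1] * 70 + [1] * 130 + [0] * 70
# Predictions unrelated to the origin label.
y_pred_null = [i % 2 for i in range(400)]

p_shift = permutation_p_value(y_true, y_pred_shift)
p_null = permutation_p_value(y_true, y_pred_null)
print(f"shifted: p = {p_shift:.4f}, no signal: p = {p_null:.4f}")
```

With this estimator the smallest attainable p-value is 1 / (1 + n_permutations), so the resolution of the test depends on how many shuffles are run.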

**Feature importances**: The variables that are the most responsible for this shift are those with high feature importance:

|                |   importance |      stddev |     p_value |   n |    p99_high |      p99_low |
|:---------------|-------------:|------------:|------------:|----:|------------:|-------------:|
| marital-status |  0.118595    | 0.00888607  | 3.75415e-06 |   5 | 0.136891    |  0.100298    |
| relationship   |  0.0289664   | 0.00348383  | 2.46319e-05 |   5 | 0.0361397   |  0.0217932   |
| sex            |  0.00310872  | 0.00137286  | 0.00358182  |   5 | 0.00593546  |  0.000281973 |
| fnlwgt         |  0.00071082  | 0.00295516  | 0.309604    |   5 | 0.00679553  | -0.00537389  |
| occupation     |  0.000296461 | 0.00095199  | 0.26228     |   5 | 0.00225662  | -0.0016637   |
| capital-loss   |  0.000225193 | 0.000764239 | 0.272983    |   5 | 0.00179877  | -0.00134839  |
| education-num  |  0.000178889 | 0.00044286  | 0.208738    |   5 | 0.00109074  | -0.000732965 |
| native-country |  6.38637e-05 | 0.000993005 | 0.446303    |   5 | 0.00210847  | -0.00198075  |
| education      |  4.83311e-05 | 0.00090621  | 0.455411    |   5 | 0.00191423  | -0.00181757  |
| race           |  2.99717e-05 | 0.000283847 | 0.412473    |   5 | 0.000614416 | -0.000554472 |
| hours-per-week | -0.000184255 | 0.000757922 | 0.692203    |   5 | 0.00137632  | -0.00174483  |
| capital-gain   | -0.000529886 | 0.000769248 | 0.900832    |   5 | 0.001054    | -0.00211378  |
| workclass      | -0.000560566 | 0.000392398 | 0.983461    |   5 | 0.000247386 | -0.00136852  |
| age            | -0.0011886   | 0.00166521  | 0.907146    |   5 | 0.00224009  | -0.00461728  |
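
The columns of this table (`importance`, `stddev`, `n`, and so on) have the shape of a permutation-importance report, with `n = 5` presumably the number of shuffles. Assuming that reading, the sketch below shows the idea in plain Python. The helper names, the toy data, and the hard-coded classifier are hypothetical. A feature's importance is the average drop in score after its column is shuffled, which breaks the link between that feature and the train/test label.

```python
import random
import statistics


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, n_repeats=5, seed=0):
    """Mean and standard deviation of the score drop after shuffling `column`."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        values = [row[column] for row in rows]
        rng.shuffle(values)
        shuffled_rows = [{**row, column: v} for row, v in zip(rows, values)]
        drops.append(baseline - accuracy(predict, shuffled_rows, labels))
    return statistics.mean(drops), statistics.stdev(drops)


# Toy data: origin 0 = training, 1 = test; married rows are more common in test.
rng = random.Random(0)
rows, labels = [], []
for origin, p_married in [(0, 0.34), (1, 0.65)]:
    for _ in range(2000):
        rows.append({"married": rng.random() < p_married,
                     "age": rng.randint(18, 80)})
        labels.append(origin)


def predict(row):
    # Stand-in for a fitted train-vs-test classifier that uses only `married`.
    return 1 if row["married"] else 0


imp_married, sd_married = permutation_importance(predict, rows, labels, "married")
imp_age, sd_age = permutation_importance(predict, rows, labels, "age")
print(f"married: {imp_married:.3f} +/- {sd_married:.3f}, age: {imp_age:.3f}")
```

A feature the classifier never uses has an importance of exactly zero here. In the table above, features unrelated to the shift show small positive or negative values instead, which is consistent with noise across repeated shuffles of a fitted model.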