# Detecting Covariate Shift in Datasets

Regression models predict a response from predictor variables, and the model parameters are estimated from data. When the distribution of the predictors changes, the model and parameters fitted on an earlier batch of data may no longer be optimal for the data we are now seeing. This is known as **covariate shift**. How do we detect that the underlying data distribution has changed? A simple technique can do this, and we illustrate it in this notebook.

The problem context is as follows. We have two batches of data: the batch used to build the model currently in production, and the batch received since that model was deployed. The question is: **is the current batch of data different, in a distributional sense, from the one used to build the current model?** We will use machine learning to answer it. We tag the data from the batch used to build the production model as $0$ and the data received since then as $1$, and then train a classifier to discriminate between the two labels. If the classifier discriminates very well between the two batches, then **covariate shift** has occurred and we need to revisit modeling. If it cannot, for example because its accuracy is about $0.5$, then it performs no better than tossing a fair coin. In that case we conclude that no appreciable dataset shift has occurred and our current model will continue to serve us well.
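
The labelling scheme described above can be sketched in a few lines. This is a minimal illustration using scikit-learn and synthetic data, not the implementation behind any particular library. The function name `shift_score`, the choice of a random forest, the five-fold cross-validation, and the synthetic column names are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def shift_score(reference: pd.DataFrame, current: pd.DataFrame, seed: int = 0) -> float:
    """Cross-validated accuracy of a classifier that separates the two batches."""
    # Stack the batches and tag the reference batch as 0 and the new batch as 1.
    X = pd.concat([reference, current], ignore_index=True)
    y = np.concatenate([
        np.zeros(len(reference), dtype=int),
        np.ones(len(current), dtype=int),
    ])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # For a classifier, an integer cv gives stratified folds, so every fold
    # contains rows from both batches.
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    return float(scores.mean())


# Synthetic data (hypothetical): one batch from the same distribution as the
# reference, and one whose mean has moved by three standard deviations.
rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
ref = pd.DataFrame(rng.normal(0.0, 1.0, size=(500, 3)), columns=cols)
same = pd.DataFrame(rng.normal(0.0, 1.0, size=(500, 3)), columns=cols)
shifted = pd.DataFrame(rng.normal(3.0, 1.0, size=(500, 3)), columns=cols)

score_same = shift_score(ref, same)
score_shifted = shift_score(ref, shifted)
print("same distribution:   ", score_same)
print("shifted distribution:", score_shifted)
```

With equal-sized batches, a score near $0.5$ suggests the batches are hard to tell apart, while a score near $1$ suggests they are easy to separate.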

We illustrate this idea with the **California housing** dataset (available in the UCI machine learning repository). The machine learning task associated with the dataset is to predict the **median house value** from a set of predictors. The rest of the notebook applies the technique described above to this dataset.

## Read the data

In [None]:
import pandas as pd
fp = "cal_housing.csv"
df = pd.read_csv(fp)

In [None]:
# Keep only the predictors: covariate shift concerns their distribution, not the response.
req_cols = df.columns.tolist()
req_cols.remove("medianHouseValue")
df = df[req_cols]
df.dtypes

In [None]:
df["lat"].describe()

## Exploring the Data

When we plot the histogram of the **lat** variable (see below), we see two populations:

1. A group with **lat** values less than or equal to -119
2. A group with **lat** values greater than -119

Values near -119 are not valid latitudes, so this column appears to hold longitude despite its name; we keep the column name used in the file.

Let's pretend that the first group is the batch of data used to develop our current regression model, and that we have now received the second group as a new batch. Can we discriminate between the two? Let's develop a classifier and see.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
df["lat"].hist()

In [None]:
df1 = df.query("lat <= -119")

In [None]:
df2 = df.query("lat > -119")

## Use the dataset shift API

In [None]:
from arangopipe.arangopipe_analytics.rf_dataset_shift_detector import RF_DatasetShiftDetector


In [None]:
rfd = RF_DatasetShiftDetector()
# df1 plays the role of the batch used to build the model; df2 is the new batch.
score = rfd.detect_dataset_shift(df1, df2)
print("Dataset shift score:", score)

## Interpretation of the score reported by the shift detector
The API uses a classifier to discriminate between the datasets provided to it, and the score it reports is the accuracy of that classifier. Values close to $0.5$ indicate that the classifier is not able to discriminate between the two datasets. This can be interpreted as a situation where no discernible shift has occurred in the data since the last model deployment. Values close to $1$ indicate that dataset shift is discernible and that we may need to revisit modeling. How dataset shift affects the performance of the deployed model is problem dependent, so the score must be assessed in the context of a particular application. An experiment that tracks the loss of model accuracy against the observed score could suggest a threshold score beyond which model redevelopment is needed.
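
One caveat when reading an accuracy score: $0.5$ is the chance level only when the two batches are the same size. A classifier that always predicts the larger batch already achieves an accuracy equal to that batch's share of the rows. Whether the detector balances the batches internally is not stated here, so the sketch below only computes this trivial baseline for comparison. The helper name `chance_accuracy` is hypothetical and is not part of the API used above.

```python
def chance_accuracy(n_reference: int, n_current: int) -> float:
    """Accuracy obtained by always predicting the larger of the two batches."""
    total = n_reference + n_current
    if total == 0:
        raise ValueError("both batches are empty")
    return max(n_reference, n_current) / total


# Equal-sized batches: the trivial baseline is a coin toss.
print(chance_accuracy(500, 500))
# Unequal batches: the trivial baseline is the larger batch's share of the rows.
print(chance_accuracy(900, 100))
```

In the notebook this could be called as `chance_accuracy(len(df1), len(df2))`; a shift score is informative only to the extent that it exceeds this baseline.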