# Retrospective analysis of data leakage in a price prediction pipeline

This example revolves around an [ML pipeline for predicting the price of taxi rides](https://github.com/schelterlabs/arguseyes-example/blob/main/pipelines/mlflow-regression-nyctaxifare.py), based on a sample from the [New York City Taxi Fare Prediction](https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/data) dataset. The pipeline computes additional features, splits the data into a train set and a test set based on date information, and learns a **regression model to predict the fare of a ride** from attributes such as the **pickup time**, **dropoff time**, **trip_distance** and the **zip codes** of the pickup and dropoff locations.

When screening this pipeline on GitHub with this [configuration](https://github.com/schelterlabs/arguseyes-example/blob/main/mlflow-regression-nyctaxifare-dataleakage.yaml), ArgusEyes detects a **data leakage problem** in the pipeline. The screenshot shows the result of the [screening during the build triggered by a GitHub action](https://github.com/schelterlabs/arguseyes-example/actions/runs/3523396218/jobs/5907507086): **177 input tuples leaked from the train set to the test set**.

In the following, we show how to **leverage ArgusEyes to retrospectively analyze the pipeline run** (based on metadata and captured data artifacts), and **figure out the root cause of the data leakage issue**.

![data-leakage-screening-via-a-github-action](github-action-dataleakage-screening.png)

### Load the metadata and artifacts from the original run of the pipeline

ArgusEyes needs the run id of the MLflow run in which it stored the metadata and artifacts. (Note that we use a local run here for demo purposes.)

In [1]:
from arguseyes.retrospective import PipelineRun, DataLeakageRetrospective

In [2]:
run_id = 'b490f97b50244876b6d7b9ec89af6dbf'

run = PipelineRun(run_id=run_id)

### Interactively explore the dataflow plan and data of the pipeline run

We can view a dataflow plan of the pipeline, which highlights the input datasets, as well as the features and labels for the train and test data computed by the pipeline. We can interactively explore the pipeline data. Clicking on the pink data vertices provides us with details about the corresponding data.

<span style='color: red;'>[Note that this interactive widget is only shown when the notebook is actually executed in Jupyter; it is not rendered in the offline view on GitHub.]</span>

In [3]:
run.explore_data()


## Retrospective analysis of the data leakage issue

ArgusEyes allows us to instantiate a special `DataLeakageRetrospective`, which helps us analyze data leakage problems in a pipeline run.

In [4]:
retrospective = DataLeakageRetrospective(run)

### Materialize leaked tuples

We can compute the tuples that leaked between the train set and the test set.

In [5]:
leaked_data = retrospective.compute_leaked_tuples()
leaked_data

| | tpep_pickup_datetime | tpep_dropoff_datetime | fare_amount | pickup_zip | dropoff_zip | trip_distance |
|---|---|---|---|---|---|---|
| 0 | 2016-02-15 12:07:32 | 2016-02-15 12:40:20 | 38.5 | 10018 | 11371 | 11.547500 |
| 1 | 2016-02-15 01:22:13 | 2016-02-15 01:30:47 | 8.5 | 10020 | 10022 | 1.207500 |
| 2 | 2016-02-14 23:55:18 | 2016-02-15 00:03:40 | 8.5 | 10152 | 10001 | 1.630000 |
| 3 | 2016-02-15 13:03:55 | 2016-02-15 13:19:04 | 12.5 | 10017 | 10021 | 1.833846 |
| 4 | 2016-02-15 21:20:45 | 2016-02-15 21:27:58 | 7.5 | 10012 | 10119 | 2.232500 |
| ... | ... | ... | ... | ... | ... | ... |
| 172 | 2016-02-15 16:36:09 | 2016-02-15 16:49:11 | 9.5 | 10003 | 10119 | 1.784286 |
| 173 | 2016-02-15 13:26:35 | 2016-02-15 13:36:25 | 8.0 | 10001 | 10020 | 1.071667 |
| 174 | 2016-02-15 20:22:56 | 2016-02-15 20:48:52 | 30.5 | 11371 | 10103 | 10.305714 |
| 175 | 2016-02-15 14:24:35 | 2016-02-15 14:37:27 | 11.0 | 10025 | 10019 | 2.450000 |
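
For intuition, the notion of "leaked tuples" can be approximated in plain pandas as the set of rows that occur in both the train and the test input. The sketch below is only an illustration on toy data and is not necessarily how `compute_leaked_tuples()` works internally; the frames `train` and `test` and their values are hypothetical stand-ins.

```python
import pandas as pd

columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime", "fare_amount",
           "pickup_zip", "dropoff_zip", "trip_distance"]

# Toy stand-ins for the train and test inputs (hypothetical values).
train = pd.DataFrame([
    ["2016-02-14 23:55:18", "2016-02-15 00:03:40", 8.5, 10152, 10001, 1.63],
    ["2016-02-10 09:12:00", "2016-02-10 09:30:00", 11.0, 10025, 10019, 2.45],
], columns=columns)
test = pd.DataFrame([
    ["2016-02-14 23:55:18", "2016-02-15 00:03:40", 8.5, 10152, 10001, 1.63],
    ["2016-02-20 18:05:00", "2016-02-20 18:20:00", 9.5, 10003, 10119, 1.78],
], columns=columns)

# An inner join on all shared columns keeps only rows present in both frames.
# Duplicates are dropped first so that repeated rows do not multiply the result.
overlap = train.drop_duplicates().merge(test.drop_duplicates(), how="inner")
print(len(overlap))
```

A non-empty `overlap` means that at least one identical record is visible to the model during both training and evaluation.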


### Deep dive into leaked tuples

Next, we explore the leaked tuples in detail to find patterns that help us determine the root cause of the leakage.

In [6]:
leaked_data.trip_distance.describe()

count    177.000000
mean       2.954813
std        3.761105
min        0.480909
25%        1.047273
50%        1.470870
75%        2.530000
max       19.087500
Name: trip_distance, dtype: float64

In [7]:
leaked_data.tpep_pickup_datetime.dt.date.value_counts()

2016-02-15    175
2016-02-14      2
Name: tpep_pickup_datetime, dtype: int64

### Identifying the root cause of the leakage

All 177 leaked tuples share the same dropoff date (2016-02-15), while their pickup dates are spread over two days. This is a strong hint that the date-based train/test split does not separate the data cleanly around this date. Fixing the split will remove the data leakage issue in the pipeline.

In [8]:
leaked_data.tpep_dropoff_datetime.dt.date.value_counts()

2016-02-15    177
Name: tpep_dropoff_datetime, dtype: int64
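
The notebook does not show the pipeline's split code, so the exact bug is not known from this analysis alone. One hypothetical way to produce the pattern above is a date-based split that includes the boundary day on both sides. The sketch below uses toy data and an assumed cutoff of 2016-02-15 (taken from the leaked dropoff dates) to contrast such a split with a half-open one; the frame `rides` and all variable names are illustrative and do not come from the actual pipeline.

```python
import pandas as pd

# Toy rides (hypothetical values); only the dropoff time matters for the split.
rides = pd.DataFrame({
    "tpep_dropoff_datetime": pd.to_datetime([
        "2016-02-14 22:10:00",
        "2016-02-15 00:03:40",
        "2016-02-15 12:40:20",
        "2016-02-16 08:15:00",
    ]),
    "fare_amount": [6.0, 8.5, 38.5, 12.0],
})

cutoff = pd.Timestamp("2016-02-15")  # assumed split date
dropoff_day = rides["tpep_dropoff_datetime"].dt.normalize()

# Hypothetical faulty split: the boundary day is included on both sides.
faulty_train = rides[dropoff_day <= cutoff]
faulty_test = rides[dropoff_day >= cutoff]
leaked_index = faulty_train.index.intersection(faulty_test.index)

# Half-open split: every ride falls on exactly one side of the cutoff.
train = rides[rides["tpep_dropoff_datetime"] < cutoff]
test = rides[rides["tpep_dropoff_datetime"] >= cutoff]
clean_index = train.index.intersection(test.index)

print(len(leaked_index), len(clean_index))
```

In the faulty variant, every leaked row has a dropoff date equal to the cutoff day, which matches the pattern seen in the `value_counts()` output; the half-open variant assigns each ride to exactly one side.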