# [Data Leakage](https://www.kaggle.com/dansbecker/data-leakage)

## What is it?

Data leakage is one of the most important issues for a data scientist to understand.  
If you don't know how to prevent it, leakage will come up frequently, and it will ruin your models in the most subtle and dangerous ways.  
Specifically, leakage causes a model to look accurate until you start making decisions with the model, and then the model becomes very inaccurate.  
This tutorial will show you what leakage is and how to avoid it.  
There are two main types of leakage: **Leaky Predictors** and **Leaky Validation Strategies**.

### Leaky Predictors

This occurs when your predictors include data that will not be available at the time you make your predictions.  
For example, imagine that you want to predict who will catch pneumonia.  
The first few rows of your raw data might look like this:

![data_leakage1](img/data_leakage1.png)

People take antibiotic medicines after getting pneumonia in order to recover.  
So the raw data shows a strong relationship between those columns.  
But `took_antibiotic_medicine` is frequently changed after the value for `got_pneumonia` is determined.  
This is target leakage.  
The model would see that anyone who has a value of `False` for `took_antibiotic_medicine` didn't have pneumonia.  
Validation data comes from the same source, so the pattern will repeat itself in validation, and the model will have great validation (or cross-validation) scores.  
However, the model will be less accurate when subsequently deployed in the real world.  
To prevent this type of data leakage, exclude any variable updated (or created) after the target value is realized, because that data won't be available when we use the model to make new predictions.

![data_leakage2](img/data_leakage2.png)
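
As a minimal sketch of that exclusion step, the snippet below drops a post-target column before modeling. The toy rows, the `age` column, and the `recorded_after_target` list are hypothetical illustrations rather than the tutorial's data; in practice that list comes from knowing how and when each column was recorded.

```python
import pandas as pd

# Hypothetical toy data; only the two pneumonia-related column names
# come from the example above.
patients = pd.DataFrame({
    "age": [34, 61, 45, 70, 29],
    "took_antibiotic_medicine": [False, True, False, True, False],
    "got_pneumonia": [False, True, False, True, False],
})

# Columns created or updated after the target is known.
# This list is an assumption based on domain knowledge, not something
# the data can tell you by itself.
recorded_after_target = ["took_antibiotic_medicine"]

y = patients["got_pneumonia"]
X = patients.drop(columns=["got_pneumonia"] + recorded_after_target)

print(list(X.columns))
```

Only columns that would be known at prediction time remain in `X`.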

### Leaky Validation Strategies

A much different type of leak occurs when you aren't careful distinguishing training data from validation data.  
For example, this happens if you run preprocessing (like fitting an imputer for missing values) before calling `train_test_split`.  
Validation is meant to be a measure of how the model does on data it hasn't considered before.  
You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior.  
Your model will get very good validation scores, giving you great confidence in it, but perform poorly when you deploy it to make decisions.
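
The difference can be sketched on a tiny made-up array. This example assumes a recent scikit-learn, where the imputer class is `SimpleImputer`; the data values are invented for illustration. Fitting the imputer on all rows lets validation rows influence the fill value, while fitting it after the split uses training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up single-feature data with one missing value.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Leaky: the imputer is fit on every row, including rows that will
# later be used for validation.
leaky_imputer = SimpleImputer(strategy="mean").fit(X)
leaky_mean = leaky_imputer.statistics_[0]

# Cleaner: split first, then fit the imputer on the training rows only.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, random_state=0
)
clean_imputer = SimpleImputer(strategy="mean").fit(X_train)
clean_mean = clean_imputer.statistics_[0]

# The validation rows are transformed with statistics learned from training data.
X_train_imputed = clean_imputer.transform(X_train)
X_valid_imputed = clean_imputer.transform(X_valid)

print("Fill value from all rows:     ", leaky_mean)
print("Fill value from training rows:", clean_mean)
```

The two fill values differ because the first one has already "seen" the validation rows.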

### Preventing Leaky Predictors

There is no single solution that universally prevents leaky predictors.  
That being said, there are a few common strategies you can use.  
Leaky predictors frequently have high statistical correlations to the target.  
To screen for possible leaks, look for columns that are strongly correlated to your target.  
If you then build your model and the results are suspiciously accurate, there is a good chance of a leakage problem.
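
One way to sketch such a screen is with pandas correlations. The column names, values, and the `0.7` threshold below are all hypothetical; a high correlation is only a flag for closer inspection, not proof of leakage, and a legitimately strong predictor can trip the same check.

```python
import pandas as pd

# Hypothetical toy data: one column that is zero whenever the target is 0,
# and one column unrelated to the target.
toy = pd.DataFrame({
    "target": [1, 1, 1, 0, 0, 0, 1, 0],
    "leaky_feature": [5.0, 3.0, 8.0, 0.0, 0.0, 0.0, 2.0, 0.0],
    "ordinary_feature": [1.0, 4.0, 2.0, 3.0, 1.0, 4.0, 3.0, 2.0],
})

# Absolute Pearson correlation of each predictor with the target.
correlations = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)

# Arbitrary cutoff chosen for illustration only.
THRESHOLD = 0.7
suspects = correlations[correlations > THRESHOLD].index.tolist()

print(correlations)
print("Columns worth a closer look:", suspects)
```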

### Preventing Leaky Validation Strategies

If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.  
This is another place where scikit-learn pipelines make themselves useful.  
When using cross-validation, it's very helpful to use pipelines and do your preprocessing inside the pipeline.
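
A rough sketch of preprocessing inside a pipeline follows. The synthetic data, the median imputation strategy, and the forest settings are assumptions made for illustration. Because the imputer is a pipeline step, `cross_val_score` refits it on the training portion of each fold, so the held-out fold never influences the fill values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration: the target depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Knock out roughly 10% of the values so the imputer has work to do.
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer lives inside the pipeline, so it is fit per training fold.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Mean cross-validation accuracy:", scores.mean())
```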

## Now for the code:

We will use a small dataset about credit card applications, and we will build a model predicting which applications were accepted (stored in a variable called `card`).

In [8]:
import pandas as pd

data = pd.read_csv('input/credit_card_data.csv', true_values=['yes'], false_values=['no'])
data.head()

Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,True,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,True,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,True,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,True,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,True,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


In [9]:
data.shape

(1319, 12)

This can be considered a small dataset, so we'll use cross-validation to ensure accurate measures of model quality.

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = data.card
X = data.drop(['card'], axis=1)

# Using a pipeline is best practice, so it's included here even though
# the absence of preprocessing makes it unnecessary.
modeling_pipeline = make_pipeline(RandomForestClassifier())
cv_scores = cross_val_score(modeling_pipeline, X, y, scoring='accuracy')
print("Cross-validation accuracy: ")
print(cv_scores.mean())

Cross-validation accuracy: 
0.978774073307103


With experience, you'll find that it's very rare to find models that are accurate 98% of the time.  
It happens, but it's rare enough that we should inspect the data more closely to check for target leakage.  
Here is a summary of the data:  
* **card**: Dummy variable, 1 if application for credit card accepted, 0 if not
* **reports**: Number of major derogatory reports
* **age**: Age in years plus twelfths of a year
* **income**: Yearly income (divided by 10,000)
* **share**: Ratio of monthly credit card expenditure to yearly income
* **expenditure**: Average monthly credit card expenditure
* **owner**: 1 if owns their home, 0 if rent
* **selfemp**: 1 if self employed, 0 if not.
* **dependents**: 1 + number of dependents
* **months**: Months living at current address
* **majorcards**: Number of major credit cards held  
* **active**: Number of active credit accounts

A few variables look suspicious. For example, does `expenditure` mean expenditure on this card or on cards used before applying?  
At this point, basic data comparisons can be very helpful:

In [11]:
expenditures_cardholders = data.expenditure[data.card]
expenditures_not_cardholders = data.expenditure[~data.card]
((expenditures_cardholders == 0).mean())

0.020527859237536656

In [12]:
((expenditures_not_cardholders == 0).mean())

1.0

Everyone with `card == False` had no expenditures, while only 2% of those with `card == True` had no expenditures.  
It's not surprising that our model appeared to have a high accuracy.  
But this seems to be a data leak, where `expenditure` probably means expenditure *on the card they applied for*.  
Since `share` is partially determined by `expenditure`, it should be excluded too.  
The variables `active` and `majorcards` are a little less clear, but from the description, they may be affected.  
In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.

Now that this pitfall has presented itself, it's time to build a model that is more resistant to data leakage:

In [13]:
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)
cv_scores = cross_val_score(modeling_pipeline, X2, y, scoring='accuracy')
cv_scores.mean()

0.8036532753503142

The accuracy is lower but much more realistic (and believable).  
Data leakage can be a multi-million dollar mistake in many data science applications.  
Careful separation of training and validation data is a first step, and pipelines can help implement this separation.  
Leaky predictors are a more frequent issue, and harder to track down.  
A combination of caution, common sense and data exploration can help identify leaky predictors so you can remove them from your model.  
Review the data in your ongoing project.  
Are there any predictors that may cause leakage?  
As a hint, most datasets from Kaggle competitions don't have these variables.  
Once you get past those carefully curated datasets, this becomes a common issue.