We will learn what data leakage is and how to avoid it. If we don’t know how to prevent it, leaks will frequently occur and ruin our models in subtle and dangerous ways. It is, therefore, one of the most important concepts for all machine learning practitioners.

### What is Data Leakage?

Data leakage generally occurs when our training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even on the validation data), but the model will perform poorly in production.

In simple words, data leakage makes a machine learning model look very accurate until we start making predictions with it, at which point it becomes very inaccurate.

There are two main types of data leakage:
* target leakage
* train-test contamination

### Target leakage

Target leakage occurs when our predictors include data that will not be available at the time we make predictions. It’s important to think about target leakage in terms of the timing or chronological order in which data becomes available, and not just whether a feature helps make good predictions. For example, a value that is only recorded after the outcome is known cannot be used to predict that outcome.
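To make the timing idea concrete, here is a minimal sketch that sorts features by whether they were recorded before the prediction had to be made. The feature names, the dates, and the `split_by_availability` helper are all hypothetical and serve only as an illustration.

```python
from datetime import date

# Hypothetical metadata: the date on which each feature value was recorded
# for a single application. None of these names come from a real dataset.
feature_recorded_on = {
    "income": date(2023, 1, 10),
    "age": date(2023, 1, 10),
    "spend_on_new_card": date(2023, 3, 1),  # only exists after approval
}

# The moment at which the model has to make its prediction.
decision_date = date(2023, 1, 15)


def split_by_availability(recorded_on, cutoff):
    """Return (safe, leaky) feature names relative to the prediction cutoff."""
    safe = [name for name, day in recorded_on.items() if day <= cutoff]
    leaky = [name for name, day in recorded_on.items() if day > cutoff]
    return safe, leaky


safe_features, leaky_features = split_by_availability(feature_recorded_on, decision_date)
print("Safe to use:", safe_features)
print("Potential target leaks:", leaky_features)
```

In practice this kind of timing information rarely comes with the data, so it usually has to be reconstructed by asking how and when each column was collected.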

### Train-Test Contamination

A different type of leak occurs when we are not careful to distinguish training data from validation data. Validation is meant to be a measure of how well the model performs on data it has not seen before. We can subtly corrupt this process if the validation data affects preprocessing behaviour, for example by fitting an imputer or a scaler on the full dataset before splitting it. This is referred to as train-test contamination.
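The sketch below contrasts the two approaches on synthetic data, assuming scikit-learn is available. The data, the variable names, and the choice of `SimpleImputer`, `StandardScaler` and `LogisticRegression` are illustrative assumptions. The point is only that preprocessing placed inside a pipeline is re-fit on the training part of every fold, so validation rows never influence it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some missing values (purely illustrative).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Contaminated: the imputer and scaler see every row, including rows that
# later end up in the validation folds.
X_contaminated = StandardScaler().fit_transform(
    SimpleImputer(strategy="mean").fit_transform(X_demo)
)
contaminated_scores = cross_val_score(
    LogisticRegression(), X_contaminated, y_demo, cv=5, scoring="accuracy"
)

# Clean: preprocessing lives inside the pipeline, so cross_val_score re-fits
# it on the training portion of each fold only.
clean_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler(), LogisticRegression()
)
scores = cross_val_score(clean_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print("Contaminated CV accuracy: %f" % contaminated_scores.mean())
print("Clean CV accuracy: %f" % scores.mean())
```

On a toy problem like this the two numbers may be nearly identical. The gap tends to grow with more aggressive preprocessing, such as feature selection or target encoding, which is why keeping every fitted step inside the pipeline is the safer habit.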

### Data Leakage in Action

Here we will learn one way to detect and remove target leakage. We will use a dataset of credit card applications. The setup code below stores the information about each credit card application in a DataFrame `X`, which we will use to predict which applications were accepted, stored in a Series `y`.

```python
import pandas as pd

# Read the data, parsing 'yes'/'no' values as booleans
data = pd.read_csv('AER_credit_card_data.csv',
                   true_values=['yes'], false_values=['no'])
```

```python
# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)
```

```python
print("Number of rows in the dataset:", X.shape[0])
X.head()
```

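```
Number of rows in the dataset: 1319
```

|   | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |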


Since this is a small dataset, we will use cross-validation to ensure accurate measures of model quality:

```python
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
```

```python
# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())
```

```
Cross-validation accuracy: 0.980292
```


With experience, we will find that it is very rare to find models that are accurate `98%` of the time. It does happen, but it is uncommon enough that we should inspect the data more closely for target leakage. A good first step is to review what each variable in the data means.

Some variables seem suspicious. For example, does `expenditure` mean expenditure on this card or on cards used before applying? At this point, basic data comparisons can be very helpful:

```python
# y is a boolean Series, so it can be used directly as a row mask
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]
```

```python
print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))
```

```
Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02
```


As shown above, everyone who did not receive a card had no expenditures, while only `2%` of those who received a card had no expenditures. It is not surprising that our model appeared to have high accuracy. But this also looks like a case of target leakage, where `expenditure` probably means expenditure on the card they applied for.
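A manual comparison like this can be complemented by a rough automated screen: train a very simple model on each column alone and flag any column that predicts the target almost perfectly by itself. The sketch below is a heuristic of our own on synthetic stand-in data, not part of the original walkthrough. The column names and the `single_feature_scores` helper are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the credit card dataset); names are hypothetical.
rng = np.random.default_rng(0)
n_rows = 300
approved = rng.random(n_rows) < 0.5
toy_X = pd.DataFrame({
    "income": rng.normal(3.5, 1.0, n_rows),  # unrelated to the target here
    "age": rng.uniform(20, 60, n_rows),      # unrelated to the target here
    # Leaky by construction: non-zero only for approved applications
    "spend_on_card": np.where(approved, rng.uniform(1, 500, n_rows), 0.0),
})
toy_y = pd.Series(approved, name="approved")


def single_feature_scores(X, y, cv=5):
    """Cross-validated accuracy of a depth-1 tree trained on each column alone."""
    scores = {}
    for column in X.columns:
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        scores[column] = cross_val_score(
            stump, X[[column]], y, cv=cv, scoring="accuracy"
        ).mean()
    return scores


feature_scores = single_feature_scores(toy_X, toy_y)
for column, score in sorted(feature_scores.items(), key=lambda item: -item[1]):
    print(f"{column}: {score:.3f}")
```

A column that scores close to `100%` on its own is not proof of leakage, but it is a strong hint that we should find out when and how that column was recorded.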

Since `share` is partly determined by `expenditure`, it should be excluded too. The variables `active` and `majorcards` are a little less clear, but from their descriptions, they look worrisome. In most situations, it’s better to be safe than sorry if we can’t track down the people who created the data to find out more. We will run a model without target leakage as follows:

```python
# Drop leaky predictors from dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())
```

```
Cross-val accuracy: 0.830928
```


This accuracy is quite a bit lower, which can be disappointing. However, we can expect the model to be right about `83%` of the time when used on new applications, whereas the leaky model would likely do much worse than that, despite its higher apparent score in cross-validation.

Data leakage can be a million-dollar mistake in many machine learning tasks.