# Target Leakage in Machine Learning

© Yuriy Guts, 2018

## Example 02: Data Preparation Stage

In this example, we will explore how preprocessing the dataset before partitioning it can leak information about the test features into the training pipeline.
As a result, the model will score slightly better than it would under the more robust approach, where we derive the preprocessing parameters from the training subset only and then use them to transform the test set.

**Note**: This is a toy example on a rather small dataset, so the impact won't be large, but it is visible enough to illustrate the point.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, Imputer

### Read Data

Let's read the [Titanic](https://www.kaggle.com/c/titanic/data) dataset.

In [4]:
df = pd.read_csv('data/titanic-train.csv')

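We'll be using KNN, which relies on distances between numeric feature vectors, so let's encode the two categorical variables as numbers: sex as a binary flag and the port of embarkation as one-hot columns. We'll also add an indicator for missing age values.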

In [5]:
df['IsFemale'] = df['Sex'].map({'male': 0, 'female': 1})
df['IsAgeMissing'] = df['Age'].isnull()
df[['EmbarkedC', 'EmbarkedQ', 'EmbarkedS']] = pd.get_dummies(df['Embarked'])
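
As a minimal sketch of what the one-hot step produces, here is the same idea on a made-up column. The `embarked` series below is hypothetical toy data, not the Titanic file, and the `prefix` arguments are only used to get readable column names.

```python
import pandas as pd

# Hypothetical toy column standing in for the Titanic 'Embarked' feature.
embarked = pd.Series(['S', 'C', 'Q', 'S', None], name='Embarked')

# One indicator column per observed category, in sorted order.
# A missing value produces a row of all zeros (dummy_na is False by default).
dummies = pd.get_dummies(embarked, prefix='Embarked', prefix_sep='').astype(int)

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())
```

Each non-missing row has exactly one indicator set, so the distance between two passengers is unaffected by any arbitrary ordering of the ports.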

Let's leave only the simple features that are likely to carry the most signal.

In [6]:
df_X = df[['Pclass', 'IsFemale', 'Age', 'IsAgeMissing', 'SibSp', 'Parch', 'Fare', 'EmbarkedC', 'EmbarkedQ', 'EmbarkedS']].copy()
df_y = df['Survived'].copy()

### Preprocess Data

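**MISTAKE INCOMING!** Now we will transform the entire dataset before partitioning it into train and test. The imputation and scaling parameters are then computed from all rows, including the ones we will later evaluate on. This causes leakage, and the effect grows when the feature distributions differ noticeably between the training and evaluation partitions. In a real setting, we do not know the distribution of the test features when the model is built.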

In [7]:
mean_imputer = Imputer(missing_values='NaN', strategy='mean')
scaler = StandardScaler()

In [8]:
df_X['Age'] = mean_imputer.fit_transform(df_X[['Age']])
df_X[['Age', 'Fare']] = scaler.fit_transform(df_X[['Age', 'Fare']])

In [9]:
print(mean_imputer.statistics_)

[ 29.69911765]


In [10]:
print(scaler.mean_)
print(scaler.scale_)

[ 29.69911765  32.20420797]
[ 12.99471687  49.66553444]
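
To see why fitting on all rows is a problem, here is a minimal sketch on made-up numbers. The arrays `train_values` and `test_values` are hypothetical and are not taken from the Titanic data. The sketch shows that a mean computed over every row is a weighted average of the partition means, so it always reflects the test rows.

```python
import numpy as np

# Hypothetical feature values: the 30 "test" rows are shifted upward
# relative to the 70 "training" rows.
train_values = np.linspace(20.0, 40.0, 70)
test_values = np.linspace(30.0, 50.0, 30)
all_values = np.concatenate([train_values, test_values])

# Leaky: centering parameter computed from every row, including test rows.
leaky_mean = all_values.mean()
# Robust: centering parameter computed from training rows only.
train_mean = train_values.mean()

# The full-data mean is a weighted average of the two partition means,
# with weights equal to the partition sizes (70% and 30% here).
weighted = 0.7 * train_values.mean() + 0.3 * test_values.mean()

print('mean over all rows:     ', leaky_mean)
print('mean over training rows:', train_mean)
print('weighted partition mean:', weighted)
```

The same reasoning applies to the standard deviation used by the scaler and to the mean used by the imputer.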


Only now will we partition and train the model.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, random_state=12345)

In [12]:
print('X_train:', X_train.shape)
print('X_test: ', X_test.shape)
print('y_train:', y_train.shape)
print('y_test: ', y_test.shape)

X_train: (623, 10)
X_test:  (268, 10)
y_train: (623,)
y_test:  (268,)


### Train and Evaluate Model

In [13]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [14]:
y_test_pred = model.predict_proba(X_test)[:, -1]

In [15]:
log_loss_before = log_loss(y_test, y_test_pred)
auc_before = roc_auc_score(y_test, y_test_pred)

In [16]:
print('Test LogLoss:', log_loss_before)
print('Test AUC:    ', auc_before)

Test LogLoss: 4.03039167749
Test AUC:     0.755666208791
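
The log loss looks large next to an AUC of about 0.76. A likely explanation, given `n_neighbors=3` with uniform weights, is that the predicted probability can only be 0, 1/3, 2/3, or 1, and a confident wrong prediction is penalized very heavily. The sketch below computes the per-sample penalty by hand. The clipping constant `EPS` and the helper `clipped_log_loss_term` are assumptions made for illustration; the exact clipping used by `log_loss` depends on the scikit-learn version.

```python
import math

EPS = 1e-15  # assumed clipping constant, for illustration only


def clipped_log_loss_term(y_true, p_pred, eps=EPS):
    """Per-sample binary log loss with the probability clipped to [eps, 1 - eps]."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# With 3 uniformly weighted neighbors, the positive-class probability is k/3.
for k in range(4):
    p = k / 3
    print(f'p={p:.3f}  '
          f'loss if y=1: {clipped_log_loss_term(1, p):.3f}  '
          f'loss if y=0: {clipped_log_loss_term(0, p):.3f}')
```

Under this assumption, a few samples predicted with probability 0 or 1 for the wrong class are enough to dominate the average.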


## Removing Leakage

### Read Data

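Before starting over, let's compare the mean of the globally standardized `Age` column in each partition. The two values differ, so the scaling parameters fitted on the full dataset were influenced by the test rows.

In [17]: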
print(X_train['Age'].mean())
print(X_test['Age'].mean())

-0.045685896518762134
0.10620266242980987


Let's repeat our initial dataset preparation (missing indicator variable, one-hot encoding, feature selection):

In [18]:
df = pd.read_csv('data/titanic-train.csv')
df['IsFemale'] = df['Sex'].map({'male': 0, 'female': 1})
df['IsAgeMissing'] = df['Age'].isnull()
df[['EmbarkedC', 'EmbarkedQ', 'EmbarkedS']] = pd.get_dummies(df['Embarked'])
df_X = df[['Pclass', 'IsFemale', 'Age', 'IsAgeMissing', 'SibSp', 'Parch', 'Fare', 'EmbarkedC', 'EmbarkedQ', 'EmbarkedS']].copy()
df_y = df['Survived'].copy()

But now we'll partition first, then figure out the preprocessing.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, random_state=12345)

In [20]:
X_train = X_train.copy()
X_test = X_test.copy()
y_train = y_train.copy()
y_test = y_test.copy()

Learn the imputation and scaling parameters only on the training set...

In [21]:
mean_imputer = Imputer(missing_values='NaN', strategy='mean')
scaler = StandardScaler()

In [22]:
X_train['Age'] = mean_imputer.fit_transform(X_train[['Age']])
X_train[['Age', 'Fare']] = scaler.fit_transform(X_train[['Age', 'Fare']])

In [23]:
print(mean_imputer.statistics_)

[ 28.95791583]


In [24]:
print(scaler.mean_)
print(scaler.scale_)

[ 28.95791583  31.82662424]
[ 12.86442653  45.07671825]


...and use them to **transform** the test set.

In [25]:
X_test['Age'] = mean_imputer.transform(X_test[['Age']])
X_test[['Age', 'Fare']] = scaler.transform(X_test[['Age', 'Fare']])
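
The fit-on-train, transform-on-test pattern above can also be packaged so that it is hard to get wrong. The sketch below is one possible way to do it, assuming scikit-learn 0.20 or later, where `SimpleImputer` and `Pipeline` are available (the notebook itself uses the older `Imputer` class). The data is synthetic and hypothetical, and the step names `'impute'`, `'scale'`, and `'knn'` are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 200 rows, 2 numeric features,
# with roughly 10% of the first column set to missing.
rng = np.random.RandomState(12345)
X = rng.normal(loc=[30.0, 32.0], scale=[13.0, 50.0], size=(200, 2))
X[rng.rand(200) < 0.1, 0] = np.nan
y = (rng.rand(200) < 0.4).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=12345)

pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

# fit() learns the imputation and scaling parameters from the training rows only.
pipeline.fit(X_tr, y_tr)
# predict_proba() reuses those parameters to transform the test rows.
proba = pipeline.predict_proba(X_te)[:, -1]

# The imputer's learned means match the training-only column means.
print(pipeline.named_steps['impute'].statistics_)
print(np.nanmean(X_tr, axis=0))
```

Because the preprocessing steps live inside the estimator, the same object can be passed to cross-validation utilities, which then refit the imputer and scaler on each training fold.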

### Train and Evaluate Model

In [26]:
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [27]:
y_test_pred = model.predict_proba(X_test)[:, -1]

In [28]:
log_loss_after = log_loss(y_test, y_test_pred)
auc_after = roc_auc_score(y_test, y_test_pred)

In [29]:
print('Test LogLoss:', log_loss_after)
print('Test AUC:    ', auc_after)

Test LogLoss: 4.03341753651
Test AUC:     0.753147893773


## Evaluate the Impact of Leakage

In [30]:
print('LogLoss difference:', log_loss_after - log_loss_before)
print('AUC difference:    ', auc_after - auc_before)

LogLoss difference: 0.00302585901573
AUC difference:     -0.00251831501832
