# Target Leakage in Machine Learning

© Yuriy Guts, 2018

## Example 03: Model Stacking

In this example, we'll try to build a 2-level stacked ensemble and see how much leakage is introduced by using in-sample predictions vs. out-of-sample predictions as 1st-level model outputs. Let's use RF + KNN + SVM classifiers on level 1, and a logistic regression on level 2.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.svm import SVC

In [3]:
RANDOM_STATE = 12345

### Read Data

Let's read the [Titanic](https://www.kaggle.com/c/titanic/data) dataset.

In [4]:
df = pd.read_csv('data/titanic-train.csv')

### Preprocess Data

One-hot encode the categoricals.

In [5]:
df['IsFemale'] = df['Sex'].map({'male': 0, 'female': 1})
df['IsAgeMissing'] = df['Age'].isnull()
df[['EmbarkedC', 'EmbarkedQ', 'EmbarkedS']] = pd.get_dummies(df['Embarked'])

Let's leave only the simple features that are likely to carry the most signal.

In [6]:
df_X = df[['Pclass', 'IsFemale', 'Age', 'IsAgeMissing', 'SibSp', 'Parch', 'Fare', 'EmbarkedC', 'EmbarkedQ', 'EmbarkedS']].copy()
df_y = df['Survived'].copy()

We only want to measure the impact of the mistake of making in-sample predictions, so it's okay to make the mistake from example 02 here.

In [7]:
mean_imputer = Imputer(missing_values='NaN', strategy='mean')
scaler = StandardScaler()

In [8]:
df_X['Age'] = mean_imputer.fit_transform(df_X[['Age']])
df_X[['Age', 'Fare']] = scaler.fit_transform(df_X[['Age', 'Fare']])

### Make 1st-Level Predictions

**MISTAKE INCOMING!**: Rather than making out-of-fold predictions, we'll just make predictions on the training data.

In [9]:
svm = SVC(random_state=RANDOM_STATE)
knn = KNeighborsClassifier(n_neighbors=3)
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=RANDOM_STATE)

In [10]:
rf.fit(df_X, df_y)
knn.fit(df_X, df_y)
svm.fit(df_X, df_y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=12345, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
X_rf = knn.predict(df_X)
X_knn = knn.predict(df_X)
X_svm = svm.predict(df_X)

In [12]:
X_level2 = np.vstack([X_rf, X_knn, X_svm]).T

In [13]:
X_level2.shape

(891, 3)

### Make 2nd-Level Predictions

Now that we have in-sample predictions as our level 2 features, let's compute the CV scores for our level 2 (logistic regression) model.

In [14]:
lr = LogisticRegression(random_state=RANDOM_STATE)

In [15]:
cv_level2 = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

In [16]:
y_pred = cross_val_predict(lr, X_level2, df_y, cv=cv_level2)

In [17]:
auc_before = roc_auc_score(y_pred, df_y)

In [18]:
print('AUC before:', auc_before)

AUC before: 0.8737967914438503


## Removing Leakage

To reduce the leakage, let's make out-of-fold predictions on a separate CV split for our level 1 models.

In [19]:
cv_level1 = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE * 2)
cv_level2 = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

### Make 1st-Level Predictions

Now, instead of making in-sample predictions, we'll collect out-of-fold predictions from level 1 CV.

In [20]:
svm = SVC(random_state=RANDOM_STATE)
knn = KNeighborsClassifier(n_neighbors=3)
rf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=RANDOM_STATE)

In [21]:
X_rf_oof = cross_val_predict(rf, df_X, df_y, cv=cv_level1)
X_knn_oof = cross_val_predict(knn, df_X, df_y, cv=cv_level1)
X_svm_oof = cross_val_predict(svm, df_X, df_y, cv=cv_level1)

In [22]:
X_level2 = np.vstack([X_rf_oof, X_knn_oof, X_svm_oof]).T

In [23]:
X_level2.shape

(891, 3)

### Make 2nd-Level Predictions

We don't change anything about the 2nd-level model – our main improvement was about the 1st-level predictions.

In [24]:
lr = LogisticRegression(random_state=RANDOM_STATE)

In [25]:
y_pred = cross_val_predict(lr, X_level2, df_y, cv=cv_level2)

In [26]:
auc_after = roc_auc_score(y_pred, df_y)

In [27]:
print('AUC after:', auc_after)

AUC after: 0.8311195445920303


## Evaluate the Impact of Leakage

In [28]:
print('AUC difference:', auc_after - auc_before)

AUC difference: -0.04267724685181995
