<h3>
    Data Leakage
</h3>
<div style="width: 80%">
Data leakage occurs when information from outside the training dataset is used to create the model. This leads to overly optimistic performance estimates and, ultimately, poor generalization to new data. One common mistake is fitting a data-dependent preprocessing step, such as a scaler, on the full dataset before splitting it into training and testing sets.
</div>

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes # type: ignore
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
from sklearn.model_selection import KFold

In [3]:
np.random.seed(123)
X = np.linspace(0.1, 1.0, 10)  # 10 evenly spaced points; avoids float-step rounding in np.arange
y = np.random.randint(1, 10, 10)

In [4]:
print(X)
print(y)

[0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[3 3 7 2 4 7 2 1 2 1]


In [5]:
kf = KFold(n_splits=5)
print(kf)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

KFold(n_splits=5, random_state=None, shuffle=False)
Fold 0:
  Train: index=[2 3 4 5 6 7 8 9]
  Test:  index=[0 1]
Fold 1:
  Train: index=[0 1 4 5 6 7 8 9]
  Test:  index=[2 3]
Fold 2:
  Train: index=[0 1 2 3 6 7 8 9]
  Test:  index=[4 5]
Fold 3:
  Train: index=[0 1 2 3 4 5 8 9]
  Test:  index=[6 7]
Fold 4:
  Train: index=[0 1 2 3 4 5 6 7]
  Test:  index=[8 9]


In [13]:
diabetes = load_diabetes(scaled=False)
df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target
df

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,59.0,2.0,32.1,101.00,157.0,93.2,38.0,4.00,4.8598,87.0,151.0
1,48.0,1.0,21.6,87.00,183.0,103.2,70.0,3.00,3.8918,69.0,75.0
2,72.0,2.0,30.5,93.00,156.0,93.6,41.0,4.00,4.6728,85.0,141.0
3,24.0,1.0,25.3,84.00,198.0,131.4,40.0,5.00,4.8903,89.0,206.0
4,50.0,1.0,23.0,101.00,192.0,125.4,52.0,4.00,4.2905,80.0,135.0
...,...,...,...,...,...,...,...,...,...,...,...
437,60.0,2.0,28.2,112.00,185.0,113.8,42.0,4.00,4.9836,93.0,178.0
438,47.0,2.0,24.9,75.00,225.0,166.0,42.0,5.00,4.4427,102.0,104.0
439,60.0,2.0,24.9,99.67,162.0,106.6,43.0,3.77,4.1271,95.0,132.0
440,36.0,1.0,30.0,95.00,201.0,125.2,42.0,4.79,5.1299,85.0,220.0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB


In [53]:
df.describe()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,48.5181,1.468326,26.375792,94.647014,189.140271,115.43914,49.788462,4.070249,4.641411,91.260181,152.133484
std,13.109028,0.499561,4.418122,13.831283,34.608052,30.413081,12.934202,1.29045,0.522391,11.496335,77.093005
min,19.0,1.0,18.0,62.0,97.0,41.6,22.0,2.0,3.2581,58.0,25.0
25%,38.25,1.0,23.2,84.0,164.25,96.05,40.25,3.0,4.2767,83.25,87.0
50%,50.0,1.0,25.7,93.0,186.0,113.0,48.0,4.0,4.62005,91.0,140.5
75%,59.0,2.0,29.275,105.0,209.75,134.5,57.75,5.0,4.9972,98.0,211.5
max,79.0,2.0,42.2,133.0,301.0,242.4,99.0,9.09,6.107,124.0,346.0


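Here we intentionally introduce data leakage by fitting StandardScaler on the full dataset before splitting it, so the mean and standard deviation used for scaling are computed partly from rows that end up in the test set.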

In [58]:
X, y = load_diabetes(return_X_y=True, scaled=False)

# Apply StandardScaler to intentionally introduce data leakage
ss = StandardScaler().fit(X)
X_scaled = ss.transform(X)

# Now we can run train_test_split on the scaled data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=55)

# Now let's fit a model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate the model
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print("Data Leakage: Train RMSE:", train_rmse)
print("Data Leakage: Test RMSE:", test_rmse)

print("This code intentionally introduces data leakage by fitting StandardScaler on the entire dataset before splitting it into training and testing sets.")

Data Leakage: Train RMSE: 53.96778336707593
Data Leakage: Test RMSE: 51.9835533014626
This code intentionally introduces data leakage by fitting StandardScaler on the entire dataset before splitting it into training and testing sets.


In [47]:
# Let's say we want to use 'bmi' as our feature
X = df[["bmi"]]
y = df["target"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Fit polynomial features on the training set only, then transform the test set
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Now let's fit a model
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predictions
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)

# Evaluate the model
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)


Train RMSE: 62.04299423312175
Test RMSE: 63.914204061942534


In the StandardScaler example above, the scaler was fitted on the entire dataset before it was split into training and testing sets. This introduces data leakage because the test rows contributed to the mean and standard deviation used to scale the training data, so the test score is no longer measured on truly unseen data and may look unrealistically good. The polynomial-features cell, by contrast, splits first and fits the transformer on the training set only, so it does not leak.

To avoid data leakage, fit every preprocessing step (scaling, polynomial features) on the training set only and then apply the already-fitted transformation to the test set. The next cell does this for the full feature set:



In [60]:
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Apply StandardScaler to the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

# Now let's fit a model
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predictions
y_pred_train = model.predict(X_train_poly)
y_pred_test = model.predict(X_test_poly)

# Evaluate the model
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print("No Data Leakage: Train RMSE:", train_rmse)
print("No Data Leakage: Test RMSE:", test_rmse)

No Data Leakage: Train RMSE: 49.036517076606614
No Data Leakage: Test RMSE: 52.09479350651258
