<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/DataLeakage_from_Resampling_of_Imbalanced_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Leakage from Resampling of Imbalanced Datasets

Data leakage occurs when information that would not be available at prediction time is used when building the model.

In the resampling setting, a common pitfall is to resample the entire dataset before splitting it into train and test partitions. Note that resampling both the train and the test partitions after the split is an equivalent mistake.

Such processing leads to two issues:

1. the model will not be tested on a dataset whose class distribution is similar to the real use case. By resampling the entire dataset, both the training and testing sets may end up balanced, while **the model should be tested on the naturally imbalanced dataset to evaluate its potential bias**;

2. the resampling procedure might use information about samples in the dataset to either generate or select some of the samples. We might therefore use information from samples that later serve as testing samples, which is the typical data leakage issue.
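The first issue can be sketched with the standard library alone. The block below is a minimal illustration under stated assumptions: the toy labels, the hand-rolled `undersample` and `split` helpers, and the 25% test fraction are all hypothetical stand-ins, not the tools used later in this notebook.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical toy labels: 980 majority (0) and 20 minority (1) samples.
labels = [0] * 980 + [1] * 20
indices = list(range(len(labels)))


def undersample(idx, y, rng):
    """Hand-rolled random undersampling: keep every minority sample and
    draw the same number of majority samples at random."""
    minority = [i for i in idx if y[i] == 1]
    majority = [i for i in idx if y[i] == 0]
    return minority + rng.sample(majority, len(minority))


def split(idx, rng, test_fraction=0.25):
    """Shuffle the indices and return (train, test)."""
    shuffled = idx[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Wrong: resample first, then split. The test set is drawn from balanced data.
balanced = undersample(indices, labels, rng)
_, wrong_test = split(balanced, rng)

# Right: split first, then resample the training part only.
train, right_test = split(indices, rng)
train_balanced = undersample(train, labels, rng)

wrong_test_counts = Counter(labels[i] for i in wrong_test)
right_test_counts = Counter(labels[i] for i in right_test)
print("test set, resample-then-split:", dict(wrong_test_counts))
print("test set, split-then-resample:", dict(right_test_counts))
```

In the second variant the test set keeps the natural imbalance, and none of its samples took part in the resampling step.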




This notebook demonstrates the wrong and the right way to resample, and highlights the tools to use to avoid falling into the data leakage trap.

# Get the data

We will use the adult census dataset. <br>

For the sake of simplicity, we will only use the numerical features. <br>

Also, we will make the dataset more imbalanced to amplify the effect of the mistakes demonstrated below:

**from sklearn.datasets import fetch_openml**

Fetch dataset from openml by name or dataset id.

Datasets are uniquely identified by either an integer ID or by a combination of name and version (i.e. there might be multiple versions of the ‘iris’ dataset). Please give either name or data_id (not both). In case a name is given, a version can also be provided.

In [None]:
from sklearn.datasets import fetch_openml

The datasets can be found here:
https://openml.org/search?type=data&sort=runs&status=active

# Make the dataset imbalanced

**from imblearn.datasets import make_imbalance**

In [None]:
from imblearn.datasets import make_imbalance
from collections import Counter
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from imblearn.under_sampling import RandomUnderSampler

Turn a dataset into an imbalanced dataset with a specific sampling strategy.

In [None]:
X, y = fetch_openml(data_id=1119, as_frame=True, return_X_y=True, parser="auto")
X = X.select_dtypes(include="number")
X, y = make_imbalance(X, y, sampling_strategy={">50K": 300}, random_state=1)
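To make the effect of a dict-valued `sampling_strategy` concrete, here is a rough standard-library sketch of the idea: classes named in the dict are randomly cut down to the requested count, and every other class is left untouched. The helper `reduce_classes` and the toy labels are hypothetical and only mimic the behaviour; they are not how `make_imbalance` is implemented.

```python
import random
from collections import Counter


def reduce_classes(y, sampling_strategy, seed=0):
    """Return the sorted indices to keep: classes named in
    `sampling_strategy` are randomly reduced to the requested count,
    all other classes are kept whole."""
    rng = random.Random(seed)
    keep = []
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        target = sampling_strategy.get(label, len(idx))
        keep.extend(rng.sample(idx, target))
    return sorted(keep)


# Hypothetical toy labels, not the real census counts.
y_toy = ["<=50K"] * 750 + [">50K"] * 250
kept = reduce_classes(y_toy, {">50K": 30})
counts = Counter(y_toy[i] for i in kept)
print(counts)
```

Only the `>50K` class shrinks; the majority class keeps all of its samples, which is what makes the result more imbalanced than the input.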

**The Dataset**<br>
>age: the age of an individual<br>
fnlwgt: the final sampling weight assigned by the census, i.e. the number of people the record is estimated to represent<br>
education-num: the highest level of education achieved in numerical form.<br>
capital-gain: capital gains for an individual<br>
capital-loss: capital loss for an individual<br>
hours-per-week: the hours an individual has reported to work per week

In [None]:
X

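|  | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 |
| ... | ... | ... | ... | ... | ... | ... |
| 25015 | 52 | 254211 | 14 | 15024 | 0 | 60 |
| 25016 | 44 | 377018 | 11 | 0 | 0 | 40 |
| 25017 | 36 | 114605 | 11 | 0 | 0 | 40 |
| 25018 | 37 | 212005 | 11 | 0 | 0 | 40 |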


# Check the balancing ratio on this dataset:

A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values.

The target `y` is the income class (`<=50K` or `>50K`). Note that `fnlwgt` is one of the numerical features in `X`, not the label.

In [None]:
{key: value / len(y) for key, value in Counter(y).items()}

{'<=50K': 0.988009592326139, '>50K': 0.011990407673860911}
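The printed proportions can be reproduced from plain counts. In the sketch below the 300 minority samples come from the `make_imbalance` call, while the majority count of 24,720 is an assumption inferred from the printed ratio rather than read from the data.

```python
from collections import Counter

# Assumed counts: 300 minority samples (requested from make_imbalance) and a
# majority count inferred from the printed ratio, not read from the data.
y_counts_demo = ["<=50K"] * 24720 + [">50K"] * 300

counts = Counter(y_counts_demo)
proportions = {key: value / len(y_counts_demo) for key, value in counts.items()}
imbalance_ratio = counts["<=50K"] / counts[">50K"]

print(proportions)
print(f"majority:minority = {imbalance_ratio:.1f}:1")
```

Under these assumed counts, roughly one sample in 83 belongs to the minority class, so a classifier can reach very high plain accuracy while ignoring that class entirely.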

# Train and test data split

To later highlight some of the issues, we will set aside a left-out set that takes no part in cross-validation. We will only use it to check the cross-validated scores:

In [None]:
from sklearn.model_selection import train_test_split
X, X_left_out, y, y_left_out = train_test_split(X, y, stratify=y, random_state=0)
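The `stratify=y` argument keeps the class proportions the same on both sides of the split, which matters when the minority class is this small. A simplified standard-library sketch of the idea follows; `stratified_split` and the toy labels are hypothetical, and the 25% held-out fraction mirrors the default of `train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(y, test_fraction=0.25, seed=0):
    """Split indices so that each class sends about `test_fraction` of its
    samples to the held-out part (a simplified stand-in for stratify=y)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    kept, held_out = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_out = round(len(idx) * test_fraction)
        held_out.extend(idx[:n_out])
        kept.extend(idx[n_out:])
    return kept, held_out


# Hypothetical toy labels: 4% minority class.
y_toy = ["<=50K"] * 960 + [">50K"] * 40
kept, held_out = stratified_split(y_toy)
kept_counts = Counter(y_toy[i] for i in kept)
held_out_counts = Counter(y_toy[i] for i in held_out)
print("kept:    ", dict(kept_counts))
print("held out:", dict(held_out_counts))
```

Both parts end up with the same 4% minority share, so the left-out set reflects the natural imbalance of the data.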

**from sklearn.ensemble import HistGradientBoostingClassifier**

Use an sklearn.ensemble.HistGradientBoostingClassifier as a baseline classifier. <br>

- Train and check the performance of the classifier, without any preprocessing to alleviate the bias toward the majority class.

- Evaluate the generalization performance of the classifier via cross-validation:

In [None]:
model = HistGradientBoostingClassifier(random_state=0)
cv_results = cross_validate(
    model, X, y, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True,
    n_jobs=-1)
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}")

Balanced accuracy mean +/- std. dev.: 0.609 +/- 0.024


# Balanced Accuracy

**The classifier does not give good performance in terms of balanced accuracy mainly due to the class imbalance issue.**
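Balanced accuracy is the unweighted mean of the per-class recalls, which is why it exposes a model that ignores the minority class. The sketch below uses hypothetical labels with roughly this notebook's imbalance and a dummy prediction that always outputs the majority class; the helper functions are illustrative, not scikit-learn's implementation.

```python
def recall_for(label, y_true, y_pred):
    """Fraction of samples of class `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    classes = sorted(set(y_true))
    return sum(recall_for(c, y_true, y_pred) for c in classes) / len(classes)


# Hypothetical labels with about 1.2% positives.
y_true = ["<=50K"] * 988 + [">50K"] * 12
y_pred = ["<=50K"] * 1000  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy:          {accuracy:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

Plain accuracy rewards the dummy prediction, while balanced accuracy drops to chance level. A score of about 0.61 therefore means the baseline is only slightly better than always predicting the majority class.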

In the cross-validation, we stored the fitted classifier of each fold.<br>

We will show that evaluating these classifiers on the left-out data gives a similar score:

In [None]:
import numpy as np
from sklearn.metrics import balanced_accuracy_score
scores = []
for fold_id, cv_model in enumerate(cv_results["estimator"]):
    scores.append(
        balanced_accuracy_score(
            y_left_out, cv_model.predict(X_left_out)
        ))
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
)

Balanced accuracy mean +/- std. dev.: 0.628 +/- 0.009
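The per-fold models come from `return_estimator=True`, which makes `cross_validate` return one fitted estimator per fold under the `"estimator"` key. The following is a small self-contained sketch of that mechanism on synthetic data; the dataset, the `LogisticRegression` model and the variable names are assumptions chosen for speed, not the setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in data (not the census dataset): about 10% minority class.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_cv, X_held, y_cv, y_held = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

cv_demo = cross_validate(
    LogisticRegression(max_iter=1000), X_cv, y_cv, cv=5,
    scoring="balanced_accuracy", return_estimator=True,
)

# One fitted model per fold, each trained on a different 4/5 of X_cv.
held_scores = [
    balanced_accuracy_score(y_held, est.predict(X_held))
    for est in cv_demo["estimator"]
]
print(len(cv_demo["estimator"]), "fitted estimators")
print([round(s, 3) for s in held_scores])
```

Because none of these models ever saw the held-out samples, their scores on that set give an independent check of the cross-validated estimate.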


# The Wrong Pattern
**The wrong pattern** is to resample the entire dataset up front to alleviate the class imbalance issue. <br>

Here we use a sampler to balance the entire dataset and then check the performance of our classifier via cross-validation:

In [None]:
sampler = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = sampler.fit_resample(X, y)
model = HistGradientBoostingClassifier(random_state=0)
cv_results = cross_validate(
    model, X_resampled, y_resampled, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True,
    n_jobs=-1
)
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}"
)

Balanced accuracy mean +/- std. dev.: 0.724 +/- 0.042


The cross-validation performance looks good, but evaluating the classifiers on the left-out data shows a different picture:

In [None]:
scores = []
for fold_id, cv_model in enumerate(cv_results["estimator"]):
    scores.append(
        balanced_accuracy_score(
            y_left_out, cv_model.predict(X_left_out)
        )
    )
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
)

Balanced accuracy mean +/- std. dev.: 0.698 +/- 0.014


**The performance on the left-out data is worse than the cross-validated performance. The data leakage gave us overly optimistic results.**
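The leak is easiest to see with a different sampler than the one used above. This notebook uses `RandomUnderSampler`; the sketch below instead assumes random oversampling with replacement, written by hand with the standard library, and shows that copies of the same original sample end up on both sides of a later split. All sample ids are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical sample ids: 0-19 are the minority class, 20-999 the majority.
minority_ids = list(range(20))
majority_ids = list(range(20, 1000))

# Wrong order: oversample the minority (with replacement) on the full dataset...
oversampled = majority_ids + [rng.choice(minority_ids) for _ in majority_ids]

# ...then split. Copies of one original sample can land on both sides.
rng.shuffle(oversampled)
n_test = len(oversampled) // 4
test_ids, train_ids = oversampled[:n_test], oversampled[n_test:]

leaked = set(test_ids) & set(train_ids)
print(f"{len(leaked)} original samples appear in both train and test")
```

Every leaked id is a minority sample that the model has effectively seen during training, so the score on such a test split overstates how well the model generalizes.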





# The Correct Pattern
Use a Pipeline to avoid data leakage: the resampling is delegated to imbalanced-learn, which applies it to the training data of each fold only, and no manual steps are required.

In [None]:
from imblearn.pipeline import make_pipeline
model = make_pipeline(
    RandomUnderSampler(random_state=0),
    HistGradientBoostingClassifier(random_state=0)
)
cv_results = cross_validate(
    model, X, y, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True,
    n_jobs=-1
)
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}"
)

Balanced accuracy mean +/- std. dev.: 0.732 +/- 0.019
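Conceptually, putting the sampler inside the pipeline means that resampling happens within each fold, on the training part only, while the test part of the fold is scored as is. A hand-written sketch of that loop follows. It works on indices only, uses plain shuffled folds rather than the stratified folds scikit-learn would pick for a classifier, and all names and labels are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical toy labels: 950 majority (0) and 50 minority (1) samples.
y_toy = [0] * 950 + [1] * 50
indices = list(range(len(y_toy)))
rng.shuffle(indices)

n_folds = 5
folds = [indices[k::n_folds] for k in range(n_folds)]

fold_reports = []
for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]

    # Resampling happens here, inside the fold, on training indices only.
    minority = [i for i in train_idx if y_toy[i] == 1]
    majority = [i for i in train_idx if y_toy[i] == 0]
    train_balanced = minority + rng.sample(majority, len(minority))

    # A model would be fit on `train_balanced` and scored on `test_idx`,
    # which keeps the natural class imbalance.
    fold_reports.append(
        (Counter(y_toy[i] for i in train_balanced),
         Counter(y_toy[i] for i in test_idx))
    )

for train_counts, test_counts in fold_reports:
    print("train:", dict(train_counts), "| test:", dict(test_counts))
```

Each training set is balanced, each test fold stays imbalanced, and no test sample influences which training samples are selected.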


Observe that we get good performance as well.<br>

Evaluate the model from each cross-validation fold on the left-out data to check that the performance is similar.

In [None]:
scores = []
for fold_id, cv_model in enumerate(cv_results["estimator"]):
    scores.append(
        balanced_accuracy_score(
            y_left_out, cv_model.predict(X_left_out)
        )
    )
print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
)

Balanced accuracy mean +/- std. dev.: 0.762 +/- 0.018


After your machine learning model is built, it is advisable to **evaluate your metric on data that was NOT resampled**. <br>

Evaluating on data that was not resampled gives a more realistic estimate of your model's performance than evaluating on the resampled dataset. It is advisable to keep a version of the dataset that was not resampled.