# Oversampling can create spurious relationships

Most oversampling strategies assume some relationship between the features and the target. For example, SMOTE assumes that it can generate new minority samples by linearly interpolating between existing minority samples that are close together in feature space. This **creates** a relationship between the features and the target, even if there was none in the original data.
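
To make the interpolation idea concrete, here is a minimal sketch of how a SMOTE-like sampler places a synthetic point. It is an illustration, not the `imblearn` implementation: the array `X_min`, the helper `smote_like_sample`, and the choice of `k = 5` neighbours are all assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical minority-class samples: 20 points with 2 features.
X_min = rng.uniform(size=(20, 2))

# Find the k nearest minority neighbours of every minority sample.
# We ask for k + 1 because each point is its own nearest neighbour.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)
neighbours = idx[:, 1:]  # Drop the point itself (column 0).


def smote_like_sample(i):
    """Make one synthetic point on the segment from sample i to a random neighbour."""
    j = rng.choice(neighbours[i])
    lam = rng.uniform()  # Position along the segment, in [0, 1).
    return X_min[i] + lam * (X_min[j] - X_min[i])


# Generate 100 synthetic points, each starting from a random minority sample.
X_new = np.array([smote_like_sample(i) for i in rng.integers(len(X_min), size=100)])
print(X_new.shape)
```

Every synthetic point sits between two real minority samples, so the new points inherit, and reinforce, whatever local arrangement the minority samples happen to have.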

This does not mean that SMOTE is bad; it just means that you have to be careful with it, and with all oversampling strategies. Some suggestions for best practice:

- Know exactly what it does.
- Check the difference that oversampling makes.
- Consider simple strategies like fuzzing, e.g. Gaussian noise up-sampling, at least for comparison.
- Only ever apply data augmentation *after* splitting out validation and test sets. Be aware that this means you have to be very careful when applying k-fold cross-validation, for example as part of a hyperparameter tuning step.
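
As a point of comparison for the fuzzing suggestion, Gaussian noise up-sampling could be sketched as follows. The function name `gaussian_noise_upsample` and its parameters are hypothetical, and the noise `scale` is in feature units, so it is an assumption that needs choosing to suit your data.

```python
import numpy as np


def gaussian_noise_upsample(X, y, minority=1, scale=0.01, rng=None):
    """Balance a binary dataset by adding jittered copies of minority samples."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority]
    n_new = np.sum(y != minority) - len(X_min)  # Samples needed to balance.
    if n_new <= 0:
        return X, y
    # Resample minority rows with replacement, then jitter each copy.
    picks = rng.integers(len(X_min), size=n_new)
    X_new = X_min[picks] + rng.normal(scale=scale, size=(n_new, X.shape[1]))
    y_new = np.full(n_new, minority, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 5))
y = rng.binomial(n=1, p=0.1, size=1000)

X_up, y_up = gaussian_noise_upsample(X, y, scale=0.01, rng=0)
print(np.bincount(y), np.bincount(y_up))
```

Like any oversampler, this should only ever see the training portion of the data.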

## Make a dataset

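This highly imbalanced dataset (about 10% positives) is random: the features are uniform noise and the target is drawn independently of them, so it contains no predictable relationships.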

In [None]:
import numpy as np

rng = np.random.default_rng(42)

# The higher these numbers, the clearer the problem.
N = 10_000   # Number of samples.
M = 5        # Number of features.

X = rng.uniform(size=(N, M))
y = rng.binomial(n=1, p=0.1, size=N)
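
A quick sanity check on the claim that there is nothing to predict: this sketch rebuilds the same dataset and looks at the class balance and the correlation between each feature and the target. Pearson correlation only detects linear relationships, so treat it as a rough check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 10_000, 5
X = rng.uniform(size=(N, M))
y = rng.binomial(n=1, p=0.1, size=N)

# Fraction of samples in the minority (positive) class.
minority_fraction = y.mean()

# Pearson correlation between each feature and the binary target.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(M)])

print(f"Minority fraction: {minority_fraction:.3f}")
print("Feature-target correlations:", corr.round(3))
```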

## Fit and score a classifier

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

Let's try a random forest...

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

Other models perform similarly.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

Looks amazing... but remember how imbalanced the dataset is: about 90% of the labels are 0!

The ROC-AUC will not be fooled:

In [None]:
from sklearn.metrics import roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_prob)

And the `DummyClassifier`, whose default strategy will simply pick the majority class, makes it obvious that our model is bad.

In [None]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)
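
To see why plain accuracy flatters the models here, the following sketch compares the dummy's accuracy with the majority-class share of the test set, and with the balanced accuracy, which averages the recall of each class. The `random_state` values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(size=(10_000, 5))
y = rng.binomial(n=1, p=0.1, size=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "prior" is the default strategy in recent scikit-learn versions;
# its predictions are always the most frequent training class.
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
y_pred = dummy.predict(X_test)

accuracy = dummy.score(X_test, y_test)
majority_share = 1 - y_test.mean()  # Fraction of test labels equal to 0.
balanced = balanced_accuracy_score(y_test, y_pred)

print(confusion_matrix(y_test, y_pred))
print(f"Accuracy: {accuracy:.3f}, majority share: {majority_share:.3f}")
print(f"Balanced accuracy: {balanced:.3f}")
```

The accuracy is just the majority share, while the balanced accuracy of a classifier that never predicts the minority class is 0.5.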

## Now with oversampling

We will first oversample BEFORE splitting.

In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)

This (binary) dataset is now balanced, so the dummy classifier scores about 50%.

In [None]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)

Now with a random forest:

In [None]:
model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
y_prob = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_prob)

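Wow! Amazing. But the data are pure noise, so this score cannot reflect a real relationship; it is an artifact of oversampling before the split.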

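Logistic regression behaves as before, in the sense that it still finds no relationship (note that chance-level accuracy is now about 50%, not 90%):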

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
y_prob = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_prob)

## Oversampling after split

In [None]:
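# Split first, so that the test set contains only real samples.
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Then oversample the training set only.
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)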

In [None]:
dummy = DummyClassifier()
dummy.fit(X_res, y_res)
dummy.score(X_test, y_test)

In [None]:
model = RandomForestClassifier()
model.fit(X_res, y_res)
model.score(X_test, y_test)

In [None]:
y_prob = model.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, y_prob)

The ROC-AUC is back to about 0.5, the chance-level score we saw before oversampling.

## Why?

In [None]:
M = 2        # Number of features; only two, so that we can plot them.

X = rng.uniform(size=(N, M))
y = rng.binomial(n=1, p=0.1, size=N)

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

In [None]:
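import matplotlib.pyplot as plt

# With cmap='bwr', the majority class (0) is blue and the minority class (1) is red.
# Synthetic minority points lie on line segments between real minority samples.
plt.scatter(*X_res.T, c=y_res, s=5, cmap='bwr')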

## Open questions

- Should you over- or undersample before or after scaling? Let's check whether SMOTE changes the mean or standard deviation. See, for example, these papers, and probably lots of others: https://www.sciencedirect.com/science/article/pii/S1568494622009024 and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/. Intuitively, I think it's safer to scale first, because then the scaler only gets to see real data.

---

&copy; 2023 Matt Hall, licensed CC BY