# Splitting imbalanced data

In [22]:
from sklearn.datasets import make_classification

X, y = make_classification(weights=[0.85])

In [24]:
from sklearn.model_selection import train_test_split

for i in range(1000):
    _, _, _, y_test = train_test_split(X, y, random_state=i)
    # Print the seeds whose test set has no minority-class (label 1) samples.
    if (y_test == 1).sum() == 0:
        print(i)

212
255
279
282
351
366
425
733
913


So in this moderately imbalanced dataset, nine of the 1000 splits left **no samples** from the minority class in the smaller side of the split (the `test` set).
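The count of nine can be sanity-checked with a hypergeometric calculation. The sketch below assumes exactly 15 minority samples out of 100; the real count from `make_classification` varies slightly because of label noise. The 25-sample test set matches the default `test_size` of `train_test_split`. All variable names are illustrative.

```python
from math import comb

n_total = 100     # default n_samples of make_classification
n_minority = 15   # assumption: weights=[0.85] yields about 15 minority samples
n_test = 25       # default test_size=0.25 applied to 100 samples

# Probability that every one of the n_test samples comes from the majority class.
p_empty = comb(n_total - n_minority, n_test) / comb(n_total, n_test)

print(f"P(no minority samples in test) = {p_empty:.4f}")
print(f"Expected such splits per 1000  = {1000 * p_empty:.1f}")
```

Under these assumptions the probability is a little under 1%, which agrees with seeing about nine such splits in 1000 tries.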

Let's see what happens when we train and evaluate on one of these splits.

In [42]:
from sklearn.linear_model import LogisticRegression

# Seed 279 is one of the bad splits found above; also try 212 or 255
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=279)

model = LogisticRegression()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

In [43]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_hat))

              precision    recall  f1-score   support

           0       1.00      0.92      0.96        25
           1       0.00      0.00      0.00         0

    accuracy                           0.92        25
   macro avg       0.50      0.46      0.48        25
weighted avg       1.00      0.92      0.96        25



  _warn_prf(average, modifier, msg_start, len(result))


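Sometimes everything seems to work, except that every label in both `y_test` and `y_hat` belongs to the majority class.

If `y_hat` does contain minority labels, as it does here, we see an `UndefinedMetricWarning`: recall for class 1 is undefined because `y_test` has no true samples of that class. That warning is the hint that something is wrong.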

In [44]:
y_hat

array([0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

In [45]:
y_test

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])
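Since the silent case produces no warning at all, a cheap guard is to count the classes in the test labels before trusting any report. This is a minimal sketch, assuming integer labels in a NumPy array; `class_counts` is a hypothetical helper, and `y_test_example` stands in for the all-majority `y_test` shown above.

```python
import numpy as np

def class_counts(y, n_classes=2):
    """Return the number of samples for each label 0..n_classes-1."""
    return np.bincount(np.asarray(y), minlength=n_classes)

# Stand-in for the test labels above: 25 samples, all from the majority class.
y_test_example = np.zeros(25, dtype=int)

counts = class_counts(y_test_example)
missing = np.flatnonzero(counts == 0)  # labels with no samples at all

print(counts)   # [25  0]
print(missing)  # [1]
```

Any non-empty `missing` means at least one per-class metric for that split is meaningless.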

### But most of the time, we will just have very few minority examples in the test set and we might not notice anything especially bad.
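One common guard, not used in the cells above, is to stratify the split so that both sides keep roughly the same class proportions. The sketch below uses the `stratify` parameter of `train_test_split`. The fixed `random_state=0` in `make_classification` is an assumption added only to make the example reproducible, and `min_minority` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# random_state is fixed here only so the sketch is reproducible.
X, y = make_classification(weights=[0.85], random_state=0)

# Smallest number of minority-class test samples over 100 stratified splits.
min_minority = min(
    int(np.sum(train_test_split(X, y, stratify=y, random_state=i)[3] == 1))
    for i in range(100)
)

print(min_minority)
```

With stratification the minority count in the test set stays close to a quarter of the minority samples for every seed, so the empty-class case from the loop above should not occur.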