# Splitting imbalanced data

Let's look at how you can run into trouble when you split datasets with strong class imbalance.

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(weights=[0.85], random_state=1111)

In [None]:
from sklearn.model_selection import train_test_split

# Print every seed whose test set contains no minority-class samples.
for i in range(1000):
    _, _, _, y_test = train_test_split(X, y, random_state=i)
    if (y_test == 1).sum() == 0:
        print(i)

So in this moderately imbalanced dataset, 13 of the 1000 splits left **no samples** from the minority class on the small side (i.e. `test`) of the split. That is, the split went very badly about 1.3% of the time. I haven't measured how often it goes merely 'badly'; that would mean redoing the experiment and counting how often the minority class is represented worse than expected.
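The chance of an empty or under-represented test set can also be estimated analytically, since the number of minority samples that land in a random test set follows a hypergeometric distribution. The sketch below is only a rough cross-check, not a replacement for the experiment. It assumes 100 samples (the `make_classification` default), exactly 15 minority samples, and a 25-sample test set (the `train_test_split` default of `test_size=0.25`). The real minority count may differ slightly, because `make_classification` flips a small fraction of labels by default. The names `p_minority_in_test`, `p_none` and `p_below` are made up for this illustration.

```python
from math import comb

# Assumptions (not stated in the text): 100 samples, 15 of them in the
# minority class, and a 25-sample test set.
N_SAMPLES = 100
N_MINORITY = 15
N_TEST = 25


def p_minority_in_test(k, n_samples=N_SAMPLES, n_minority=N_MINORITY, n_test=N_TEST):
    """Hypergeometric probability of exactly k minority samples in the test set."""
    return (
        comb(n_minority, k)
        * comb(n_samples - n_minority, n_test - k)
        / comb(n_samples, n_test)
    )


# Probability that the test set contains no minority samples at all.
p_none = p_minority_in_test(0)

# Expected minority count in the test set under a purely random split.
expected = N_TEST * N_MINORITY / N_SAMPLES  # 3.75

# Probability of getting fewer minority samples than expected.
p_below = sum(p_minority_in_test(k) for k in range(N_MINORITY + 1) if k < expected)

print(f"P(no minority samples in test) = {p_none:.4f}")
print(f"P(fewer than {expected:g} in test)    = {p_below:.4f}")
```

Under these assumptions the first probability comes out on the order of 1%, which is in line with the handful of bad seeds found above.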

Let's see what happens when we train and evaluate on one of these bad splits.

In [None]:
from sklearn.linear_model import LogisticRegression

# Choose the first bad seed...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=142)

model = LogisticRegression()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_hat))

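Sometimes everything seems to work, except that every label in both `y_test` and `y_hat` belongs to the majority class, so the report says nothing at all about the minority class.

If there are minority labels in `y_hat`, then we will see an `UndefinedMetricWarning`, because recall for the minority class is undefined when `y_test` contains no true samples of it. That at least suggests something is wrong.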

In [None]:
y_hat

In [None]:
y_test
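Eyeballing the two arrays works for a small test set, but a tally is easier to read. As a minimal sketch, the class counts can be checked directly. The helper `label_counts` and the toy lists are hypothetical stand-ins for `y_test` and `y_hat`.

```python
from collections import Counter


def label_counts(labels):
    """Return {label: count} for an iterable of integer class labels."""
    return dict(sorted(Counter(int(label) for label in labels).items()))


# Toy stand-ins for a bad split: the true labels are all majority class,
# while the model predicts a single minority label.
y_true_example = [0, 0, 0, 0, 0]
y_pred_example = [0, 0, 1, 0, 0]

print(label_counts(y_true_example))  # {0: 5}
print(label_counts(y_pred_example))  # {0: 4, 1: 1}
```

In the notebook this would be called as `label_counts(y_test)` and `label_counts(y_hat)`. A missing key `1` in the first result means the test set cannot say anything about the minority class.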

### But most of the time, we will just have very few minority examples in the test set and we might not notice anything especially bad.
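One standard way to guard against this, which the cells above do not use, is to pass `stratify=y` to `train_test_split` so that each side of the split keeps roughly the original class proportions. The sketch below reuses the same dataset and is only an illustration of that option. The name `minority_counts` and the choice of 100 seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same dataset as at the top of the notebook.
X, y = make_classification(weights=[0.85], random_state=1111)

# Count the minority samples in the test set across many stratified splits.
minority_counts = []
for seed in range(100):
    _, _, _, y_test = train_test_split(X, y, stratify=y, random_state=seed)
    minority_counts.append(int((y_test == 1).sum()))

print(min(minority_counts), max(minority_counts))
```

With stratification the minority count in the test set should be essentially the same for every seed, so the empty test sets seen earlier should not occur. The test set is still small, though, so metrics for the minority class remain noisy.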