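**Imbalanced Dataset**:  
A dataset in which one class has far more samples than another. A model trained on it can reach high accuracy just by predicting the majority class most of the time.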

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


In [4]:
train = pd.read_csv('train.csv')
target = pd.read_csv('target.csv')

In [5]:
df = pd.concat([train, target], axis=1)

In [8]:
df['coppaRisk'].value_counts()

coppaRisk
False    6304
True      696
Name: count, dtype: int64

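The two classes are imbalanced: 6,304 rows are False and only 696 are True, roughly 9 to 1. One way to rebalance them is random undersampling: keep every minority-class row and draw an equally sized random sample from the majority class.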
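To put a number on the imbalance, a minimal sketch along these lines may help. It uses a synthetic Series with the same class counts as the output above instead of the real `df`; the names `labels`, `counts`, `proportions`, and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Synthetic stand-in for df['coppaRisk'], built from the counts printed above
# (an assumption for illustration; the real column comes from target.csv).
labels = pd.Series([False] * 6304 + [True] * 696, name="coppaRisk")

counts = labels.value_counts()                      # absolute count per class
proportions = labels.value_counts(normalize=True)   # share of rows per class
imbalance_ratio = counts.max() / counts.min()       # majority size / minority size

print(proportions.round(3))
print(f"majority:minority ratio = {imbalance_ratio:.2f}")
```

On the real data, replacing `labels` with `df['coppaRisk']` should give the same figures.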

In [10]:
# Split the rows by class: majority (False) and minority (True)
safe = df[df['coppaRisk'] == False]
unsafe = df[df['coppaRisk'] == True]
print(safe.shape, unsafe.shape)

(6304, 17) (696, 17)


In [11]:
# Undersample the majority class to the minority-class size (fixed seed for reproducibility)
safe_sample = safe.sample(unsafe.shape[0], random_state=42)
print(safe_sample.shape, unsafe.shape)

(696, 17) (696, 17)


In [12]:
# Stack the two equally sized groups; rows stay grouped by class (all False, then all True)
balanced_df = pd.concat([safe_sample, unsafe], axis=0)
print(balanced_df['coppaRisk'].value_counts())

coppaRisk
False    696
True     696
Name: count, dtype: int64


Both classes now have 696 rows each. The trade-off is that undersampling discards 5,608 of the 6,304 majority-class rows.
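The undersampling steps above could be wrapped into a small reusable helper, sketched here under a few assumptions. The function name `undersample` and the synthetic `demo` frame are hypothetical and not part of the original notebook. Unlike the cells above, this version also shuffles the result so the rows are not ordered by class, and it relies on `DataFrameGroupBy.sample`, which needs pandas 1.1 or later.

```python
import pandas as pd


def undersample(df, label_col, random_state=42):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows from each class without replacement.
    balanced = df.groupby(label_col).sample(n=n_min, random_state=random_state)
    # Shuffle the rows and rebuild a clean 0..n-1 index.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Synthetic data with the same class counts as the notebook (illustration only).
demo = pd.DataFrame({
    "feature": range(7000),
    "coppaRisk": [False] * 6304 + [True] * 696,
})

balanced = undersample(demo, "coppaRisk")
print(balanced["coppaRisk"].value_counts())
```

Because the helper sizes every class to the smallest one, it would also work for more than two classes, at the cost of discarding even more data.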