# Handling Imbalanced Datasets

**An imbalanced dataset occurs when the number of observations in one class is significantly higher than in the other(s). This imbalance can bias machine learning models toward the majority class, reducing their ability to correctly predict the minority class.**
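
To make the bias concrete, here is a minimal sketch under assumed conditions: a hypothetical 900/100 label split and scikit-learn's `DummyClassifier` standing in for a model that has collapsed onto the majority class. The labels and the placeholder feature matrix are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 900 majority (0) and 100 minority (1) samples
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((len(y), 1))  # placeholder features; the baseline ignores them

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred, pos_label=1)

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The baseline scores 0.90 accuracy while its recall on the minority class is 0.00, which is why accuracy alone can hide the problem on imbalanced data.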

## Resampling Methods
* **Random Oversampling (Upsampling):** Duplicate randomly chosen samples of the minority class (sampling with replacement).
* **Random Undersampling (Downsampling):** Remove samples from the majority class.

In [1]:
# Let's create some random samples to demonstrate

import numpy as np
import pandas as pd

np.random.seed(286)

n_total = 1000
class_zero_ratio = 0.9
n_class_zero = int(n_total * class_zero_ratio)
n_class_one = n_total - n_class_zero

print(n_class_zero, n_class_one)

900 100


**Creating the DataFrame with Imbalanced Classes**

We create one DataFrame per class, each with two features drawn from a normal distribution. The `target` column holds the class label. In this example the classes differ only in the mean of `feature1` (0 versus 1); `feature2` has the same distribution in both.

* For class zero, generate 900 data points.
* For class one, generate 100 data points.
* Concatenate the two DataFrames and reset the index.

In [12]:
feature1_class0 = np.random.normal(loc=0,
                                   scale=1,
                                   size=n_class_zero)
feature2_class0 = np.random.normal(loc=5,
                                   scale=1,
                                   size=n_class_zero)
class0 = pd.DataFrame({
            'feature1': feature1_class0,
            'feature2': feature2_class0,
            'target': np.zeros(n_class_zero, dtype=int)
})

feature1_class1 = np.random.normal(loc=1,
                                   scale=1,
                                   size=n_class_one)
feature2_class1 = np.random.normal(loc=5,
                                   scale=1,
                                   size=n_class_one)
class1 = pd.DataFrame({
            'feature1': feature1_class1,
            'feature2': feature2_class1,
            'target': np.ones(n_class_one, dtype=int)
})

df = pd.concat([class0, class1], axis=0).reset_index(drop=True)

print(df.head())
print(df['target'].value_counts())

   feature1  feature2  target
0 -0.413095  4.027042       0
1  0.180983  5.465918       0
2 -0.680601  4.596678       0
3  0.472531  4.721452       0
4 -0.514688  5.779424       0
target
0    900
1    100
Name: count, dtype: int64


## Method 1 - Random Oversampling (Upsampling)

Random oversampling balances the class distribution by sampling minority-class rows with replacement until the minority class matches the size of the majority class. No feature values are altered, so the minority class keeps approximately the same feature distribution; only the frequency of its rows increases, and with it the minority class's share of the combined dataset. This helps a model weight both classes more equally, but because the added rows are exact copies, it may increase the risk of overfitting to those repeated samples.

In [14]:
from sklearn.utils import resample

# Separate majority and minority classes
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

# Upsample minority class
df_minority_upsampled = resample(
                                df_minority,
                                replace=True,
                                n_samples=len(df_majority),
                                random_state=42
)

# Combine df_majority and df_minority_upsampled classes
df_upsampled = pd.concat([df_majority, df_minority_upsampled]).reset_index(drop=True) 

print(df_upsampled['target'].value_counts())

target
0    900
1    900
Name: count, dtype: int64
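
The claim that random oversampling only repeats existing rows can be checked directly. The following is a small sketch, assuming a hypothetical 100-row minority frame built with `np.random.default_rng` (not the notebook's data) and the same `resample` call pattern used above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Hypothetical minority class: 100 rows with two continuous features
minority = pd.DataFrame({
    "feature1": rng.normal(loc=1, scale=1, size=100),
    "feature2": rng.normal(loc=5, scale=1, size=100),
    "target": 1,
})

# Draw 900 rows with replacement, as in the upsampling step
upsampled = resample(minority, replace=True, n_samples=900, random_state=42)

n_rows = len(upsampled)
n_unique_rows = len(upsampled.drop_duplicates())
# The original index labels are kept, so repeats of a label are copies of one row
max_copies = int(upsampled.index.value_counts().max())

print(n_rows, n_unique_rows, max_copies)
```

The upsampled frame has 900 rows but at most 100 distinct ones, so no new information is created; a model simply sees each minority sample several times.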


## Method 2 - Random Undersampling (Downsampling)

Downsampling is a resampling technique used to handle imbalanced datasets by reducing the size of the majority class so that it matches the minority class. Instead of duplicating minority samples, it randomly removes samples from the majority class.

* **Goal:** Balance the dataset and prevent the model from being biased toward the majority class.

* **Effect on features:** The feature values themselves are not changed; only fewer majority samples are kept.

* **Advantage:** Simple and reduces training time.

* **Disadvantage:** May discard useful information from the majority class (in this example, 800 of its 900 rows are dropped), which can lead to underfitting.

In [15]:
# Build a fresh imbalanced dataset with the same 900/100 split
# (note: new seed, and the class-one means differ from the first example)
np.random.seed(0)

n_total = 1000
class_zero_ratio = 0.9
n_class_zero = int(n_total * class_zero_ratio)
n_class_one = n_total - n_class_zero

feature1_class0 = np.random.normal(loc=0, scale=1, size=n_class_zero)
feature2_class0 = np.random.normal(loc=5, scale=1, size=n_class_zero)
class0 = pd.DataFrame({
    'feature1': feature1_class0,
    'feature2': feature2_class0,
    'target': np.zeros(n_class_zero, dtype=int)
})

feature1_class1 = np.random.normal(loc=2, scale=1, size=n_class_one)
feature2_class1 = np.random.normal(loc=6, scale=1, size=n_class_one)
class1 = pd.DataFrame({
    'feature1': feature1_class1,
    'feature2': feature2_class1,
    'target': np.ones(n_class_one, dtype=int)
})

df = pd.concat([class0, class1], axis=0).reset_index(drop=True)

# Separate majority and minority classes
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

# Downsample majority class
df_majority_downsampled = resample(
    df_majority,
    replace=False,  # sample without replacement
    n_samples=len(df_minority),  # match minority class size
    random_state=42
)

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_minority, df_majority_downsampled]).reset_index(drop=True)

print(df_downsampled['target'].value_counts())

target
1    100
0    100
Name: count, dtype: int64
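
As an alternative to `sklearn.utils.resample`, the same downsampling can be sketched with pandas alone. This assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`); the frame `df_demo` and its single feature are hypothetical stand-ins for the dataset above.

```python
import pandas as pd

# Hypothetical imbalanced frame: 900 rows of class 0, 100 rows of class 1
df_demo = pd.DataFrame({
    "feature1": range(1000),
    "target": [0] * 900 + [1] * 100,
})

# Size of the smallest class
n_min = int(df_demo["target"].value_counts().min())

# Draw n_min rows from every class without replacement
df_balanced = (
    df_demo.groupby("target")
    .sample(n=n_min, random_state=42)
    .reset_index(drop=True)
)

counts = df_balanced["target"].value_counts()
print(counts)
```

One possible advantage of this form is that it generalizes to more than two classes without separating each class by hand: every class is reduced to the size of the smallest one.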
