Handling Imbalanced Datasets
1. Upsampling
2. Downsampling

What is imbalanced data?

- An imbalanced dataset is one in which the classes are not equally represented: one class (the majority class) has significantly more instances than the others (the minority classes). Models trained on such data tend to be biased towards the majority class and perform poorly on the minority class.

Uneven Distribution:

- In a binary classification problem, an imbalanced dataset might have 90% of instances belonging to one class and only 10% to the other.

Real-world Examples:
- Imbalanced datasets are common in fraud detection (few fraudulent transactions), medical diagnosis (rare diseases), and spam detection (most emails are not spam).

Impact on Models:

- Standard machine learning algorithms are often trained to optimize overall accuracy. In an imbalanced dataset, this can lead to models that perform well on the majority class but poorly on the minority class.
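
To make the accuracy problem concrete, the sketch below scores a hypothetical baseline that always predicts the majority class on a 90/10 split. The labels and the baseline are illustrative assumptions, not part of the dataset built later in this notebook.

```python
# Illustrative labels: 900 majority (0) and 100 minority (1) instances.
y_true = [0] * 900 + [1] * 100

# Hypothetical baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority-class recall: fraction of true minority instances that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print("Accuracy:", accuracy)
print("Minority recall:", minority_recall)
```

The baseline looks strong on accuracy alone while never identifying a single minority instance, which is why accuracy by itself is a poor guide on imbalanced data.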

In [3]:
import random

# Generate a single random integer between 1 and 100
random_number = random.randint(1, 100)
print("Random Integer between 1 and 100:", random_number)

# Generate 5 random integers between 10 and 50
random_numbers_list = [random.randint(10, 50) for _ in range(5)]
print("List of 5 random integers between 10 and 50:", random_numbers_list)

# Generate a random float between 0 and 1
random_float = random.random()
print("Random float between 0 and 1:", random_float)

# Generate a random float between 5.5 and 9.5
random_uniform = random.uniform(5.5, 9.5)
print("Random float between 5.5 and 9.5:", random_uniform)

Random Integer between 1 and 100: 43
List of 5 random integers between 10 and 50: [48, 24, 48, 23, 15]
Random float between 0 and 1: 0.40056518009923003
Random float between 5.5 and 9.5: 7.6744260545352985


In [4]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [5]:
n_class_0,n_class_1

(900, 100)

In [6]:
# Create the imbalanced dataframe, one class at a time

class_0 = pd.DataFrame({
    'Feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'Feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
    })

class_1 = pd.DataFrame({
    'Feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'Feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [7]:
df = pd.concat([class_0,class_1]).reset_index(drop=True)

In [8]:
df.head()

   Feature_1  Feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295  -0.252750       0
4  -0.578600  -0.292004       0


In [9]:
df.tail()

     Feature_1  Feature_2  target
995   1.376371   2.845701       1
996   2.239810   0.880077       1
997   1.131760   1.640703       1
998   2.902006   0.390305       1
999   2.697490   2.013570       1


In [10]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

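# Upsampling: increases the number of samples in the minority class, here by resampling existing minority rows until the class matches the majority.

# Downsampling: reduces the majority class to the size of the minority class by keeping only a random subset of its rows.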

In [11]:
## Upsampling

df_minority = df[df['target'] ==1]
df_majority = df[df["target"] ==0]

In [12]:
from sklearn.utils import resample

df_minority_upsampled = resample(df_minority, replace=True, ## Sample with replacement
                             n_samples=len(df_majority),
                             random_state=42
                            )

In [13]:
df_minority_upsampled.shape

(900, 3)

In [14]:
df_minority_upsampled.head()

     Feature_1  Feature_2  target
951   1.125854   1.843917       1
992   2.196570   1.397425       1
914   1.932170   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [15]:
df_upsampled = pd.concat([df_majority,df_minority_upsampled])

In [16]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
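
For intuition about what sampling with replacement does, here is a minimal standard-library sketch. It is not how `sklearn.utils.resample` is implemented; the row indices and variable names are assumptions chosen for illustration.

```python
import random

# Hypothetical row indices: 100 minority rows, to be grown to 900 to match the majority.
minority_indices = list(range(900, 1000))
target_size = 900

# Draw with replacement: the same row may be picked many times.
rng = random.Random(42)
upsampled_indices = rng.choices(minority_indices, k=target_size)

print("Upsampled size:", len(upsampled_indices))
print("Distinct rows used:", len(set(upsampled_indices)))
```

Because only 100 distinct rows exist, the upsampled class is made of repeated copies. No new information is created, so a model trained on it can overfit to the duplicated rows.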

Downsampling

In [18]:
## Downsampling

import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [19]:
## Split the rebuilt dataframe by class before resampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [20]:
from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False, # Sample without replacement
         n_samples=len(df_minority),
         random_state=42
        )

In [21]:
df_majority_downsampled.shape

(100, 3)

In [22]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [23]:
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64
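
Downsampling is usually done without replacement, so that every kept row is distinct. The standard-library sketch below illustrates the trade-off; the row indices and names are assumptions for illustration, not the notebook's actual dataframe.

```python
import random

# Hypothetical row indices: 900 majority rows, to be reduced to match 100 minority rows.
majority_indices = list(range(900))
target_size = 100

# Draw without replacement: each kept row appears exactly once.
rng = random.Random(42)
kept_indices = rng.sample(majority_indices, k=target_size)

discarded = len(majority_indices) - len(kept_indices)
print("Rows kept:", len(kept_indices))
print("Distinct rows kept:", len(set(kept_indices)))
print("Rows discarded:", discarded)
```

The classes end up balanced, but 800 of the 900 majority rows are thrown away, so downsampling trades information about the majority class for balance.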