Handling Imbalanced Datasets


Introduction to Handling Imbalanced Datasets
In this video, we will continue our series on feature engineering by discussing how to handle imbalanced datasets. Understanding why and how to address imbalanced datasets is crucial when developing machine learning projects.


What is an Imbalanced Dataset?
Consider a classification problem where the output is categorical. For example, in a binary classification problem, the output can be one of two categories.


Suppose we have 1000 data points, with the output being either "yes" or "no". If 900 data points are "yes" and 100 are "no", this represents a 9:1 ratio. This situation is called an imbalanced dataset because one class significantly outnumbers the other.
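As a quick check, the class counts and the imbalance ratio for this example can be computed with pandas. This is only a minimal sketch: the Series name `labels` is an assumption made for illustration and does not appear elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical labels matching the example: 900 "yes" and 100 "no"
labels = pd.Series(["yes"] * 900 + ["no"] * 100, name="target")

counts = labels.value_counts()                      # rows per class
proportions = labels.value_counts(normalize=True)   # share of each class
imbalance_ratio = counts.max() / counts.min()       # majority : minority

print(counts)
print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")
```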


Problems Caused by Imbalanced Datasets
When training a machine learning model on an imbalanced dataset, the model tends to be biased towards the majority class. This bias can reduce the model's ability to correctly predict the minority class, which is often the class of interest. Therefore, it is necessary to address the imbalance to improve model performance.
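One way to see this bias is to score a trivial baseline that always predicts the majority class. The sketch below assumes the same 9:1 split as the example above; `y_true` and `y_pred` are illustrative names, and no real model is trained.

```python
import numpy as np

# Hypothetical labels: 900 majority-class (0) and 100 minority-class (1) points
y_true = np.array([0] * 900 + [1] * 100)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of predictions that match the true label
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: fraction of true 1s that were predicted as 1
minority_recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"Accuracy: {accuracy:.2f}")
print(f"Minority recall: {minority_recall:.2f}")
```

The baseline is right on every majority-class point and wrong on every minority-class point, so a high accuracy score on its own says little about how well the class of interest is predicted.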


Techniques to Handle Imbalanced Datasets
Two common techniques to handle imbalanced datasets are:


Upsampling: Increasing the number of data points in the minority class.


Downsampling: Reducing the number of data points in the majority class.


We will first perform upsampling and then downsampling to understand both approaches.

In [None]:
import numpy as np
import pandas as pd

# Create an imbalanced dataset
np.random.seed(123)  # fix the seed for reproducibility

n_samples = 1000
class_0_ratio = 0.9  # fraction of samples in the majority class (0)
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

In [6]:
# Build one DataFrame per class: class 0 is centred at 0, class 1 at 2

class_0 = pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1] * n_class_1
})


In [10]:
df= pd.concat([class_0,class_1]).reset_index(drop=True)

In [11]:
df.head()

   feature_1  feature_2  target
0  -0.300232   0.667532       0
1  -0.632261   0.100458       0
2  -0.204317  -0.012610       0
3   0.213696   0.219907       0
4   1.033878   0.813623       0


In [13]:
df.tail()

     feature_1  feature_2  target
995   0.438134   4.540514       1
996   1.232181   1.917294       1
997   2.387223   2.444621       1
998   0.787082   3.896404       1
999   4.018714   2.237581       1


In [14]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

Note: df[mask] returns only the rows of the DataFrame where the boolean mask is True. For example, df[df['target'] == 1] keeps only the rows whose target value is 1.

UPSAMPLING

In [None]:
# Select only the rows of df where the target column is 1 (minority class)
df_minority = df[df['target'] == 1]
# Select only the rows of df where the target column is 0 (majority class)
df_majority = df[df['target'] == 0]

In [17]:
from sklearn.utils import resample
df_minority_upsampled = resample(
    df_minority,
    replace=True,  # sample with replacement
    n_samples=len(df_majority),  # match number in majority class
    random_state=42
)

df_minority → the minority-class rows (target = 1)


replace=True → sample with replacement, so the same row can be drawn more than once


n_samples=len(df_majority) → draw as many minority rows as there are majority rows


random_state=42 → reproducibility; the same rows are drawn on every run
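The effect of these parameters is easier to see on a tiny frame. The following is a minimal sketch, assuming scikit-learn is installed; the three-row DataFrame `toy` is hypothetical and exists only for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical three-row frame, used only to illustrate the parameters
toy = pd.DataFrame({"feature": [10, 20, 30], "target": [1, 1, 1]})

# With replacement we can draw more rows than exist, so rows must repeat
up = resample(toy, replace=True, n_samples=6, random_state=42)
print(len(up), "rows drawn from", up.index.nunique(), "distinct original rows")

# The same random_state reproduces exactly the same draw
up_again = resample(toy, replace=True, n_samples=6, random_state=42)
print("Same draw:", up.equals(up_again))

# Without replacement, asking for more rows than exist raises an error
try:
    resample(toy, replace=False, n_samples=6, random_state=42)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

This is why upsampling uses replace=True: the minority class has fewer rows than the requested sample size.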


In [18]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled],ignore_index=True)

In [19]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
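Note that upsampling with resample adds no new information: every extra minority row is a copy of an existing one. The sketch below counts those copies on a stand-in minority class of 100 rows; the seed and the names `minority` and `upsampled` are assumptions for illustration, not the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)  # illustrative seed

# Stand-in minority class: 100 rows drawn around mean 2
minority = pd.DataFrame({"feature_1": rng.normal(2, 1, 100), "target": 1})

# Upsample to 900 rows, as in the cell above
upsampled = resample(minority, replace=True, n_samples=900, random_state=42)

n_unique = len(upsampled.drop_duplicates())   # distinct rows that were drawn
n_repeats = len(upsampled) - n_unique         # rows that are exact copies

print("Distinct rows:", n_unique)
print("Repeated rows:", n_repeats)
```

Because at most 100 distinct rows exist, at least 800 of the 900 upsampled rows are repeats.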

DOWNSAMPLING

In [20]:
# Recreate the imbalanced DataFrame for the downsampling example

class_0 = pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1] * n_class_1
})
df= pd.concat([class_0,class_1]).reset_index(drop=True)
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [21]:
# Downsampling: split the data into minority and majority classes
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [23]:
from sklearn.utils import resample
df_majority_downsampled = resample(
    df_majority,
    replace=False,  # sample without replacement
    n_samples=len(df_minority),  # match number in minority class
    random_state=42
)

In [29]:
df_downsampled = pd.concat([df_majority_downsampled,df_minority],ignore_index=True)

In [30]:
df_downsampled['target'].value_counts()

target
0    100
1    100
Name: count, dtype: int64
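The upsampling and downsampling steps above can be sketched as one small helper. The function `rebalance` and its parameters are hypothetical (it is not part of the notebook or of scikit-learn), and it assumes a DataFrame with exactly two classes in the target column.

```python
import pandas as pd
from sklearn.utils import resample


def rebalance(df, target_col="target", strategy="upsample", random_state=42):
    """Balance a two-class DataFrame by resampling one of the classes.

    Hypothetical helper for illustration; assumes exactly two classes.
    """
    # value_counts sorts by frequency, so first = majority, last = minority
    counts = df[target_col].value_counts()
    majority_label, minority_label = counts.index[0], counts.index[-1]
    df_majority = df[df[target_col] == majority_label]
    df_minority = df[df[target_col] == minority_label]

    if strategy == "upsample":
        # Grow the minority class by sampling with replacement
        df_minority = resample(df_minority, replace=True,
                               n_samples=len(df_majority),
                               random_state=random_state)
    elif strategy == "downsample":
        # Shrink the majority class by sampling without replacement
        df_majority = resample(df_majority, replace=False,
                               n_samples=len(df_minority),
                               random_state=random_state)
    else:
        raise ValueError("strategy must be 'upsample' or 'downsample'")

    return pd.concat([df_majority, df_minority], ignore_index=True)


# Usage on a small made-up frame with an 8:2 class split
toy = pd.DataFrame({"feature_1": range(10), "target": [0] * 8 + [1] * 2})
up_counts = rebalance(toy, strategy="upsample")["target"].value_counts().to_dict()
down_counts = rebalance(toy, strategy="downsample")["target"].value_counts().to_dict()
print(up_counts)
print(down_counts)
```

Upsampling keeps every original row at the cost of repeated minority rows, while downsampling avoids repeats but discards majority rows.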