**Balancing a Dataset with Downsampling**

Imagine we have a dataset for a binary classification task where the class labels are imbalanced, and we want to downsample the majority class to balance the dataset.

In [33]:
import pandas as pd
from sklearn.utils import resample
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

High class has 7 instances


Low class has 6 instances
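The class counts stated above can be checked directly rather than counted by hand. The following is a minimal sketch that rebuilds the same DataFrame; the names `class_counts` and `imbalance_ratio` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Same toy dataset as in the cell above
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})

# Count how many rows belong to each class
class_counts = df['Class'].value_counts()
print(class_counts)

# Ratio of the largest class to the smallest class (1.0 means perfectly balanced)
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```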

In [59]:
# Separate majority and minority classes
df_high=df[df['Class']=='High']
df_low=df[df['Class']=='Low']


In [57]:
#Downsample majority class
df_high_downsampled=resample(df_high,replace=False,n_samples=len(df_low),random_state=42)

In [39]:
# Combine the downsampled majority class with the minority class
df_balanced =pd.concat([df_high_downsampled,df_low])

In [41]:
print(df_balanced['Class'].value_counts())

Class
High    6
Low     6
Name: count, dtype: int64
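Because the majority class is sampled without replacement (`replace=False`), downsampling discards rows instead of repeating them. The sketch below repeats the steps above and then looks up which `High` row was left out; `dropped_index` is an illustrative name, and the lookup assumes `resample` keeps the original DataFrame index.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})

df_high = df[df['Class'] == 'High']
df_low = df[df['Class'] == 'Low']

# Draw len(df_low) rows from the majority class without replacement
df_high_downsampled = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)

# Index labels present in the majority class but missing after downsampling
dropped_index = df_high.index.difference(df_high_downsampled.index)
print(df.loc[dropped_index])
```

With 7 `High` rows reduced to 6, exactly one row is discarded. On a real dataset the discarded share can be much larger, which is the main cost of downsampling.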


**Upsampling the Minority Class**

Let's use a dataset with a binary classification task where the minority class has fewer instances than the majority class, and perform upsampling on the minority class.

In [64]:
# Upsample the 'Low' class from the previous dataset to match the 'High' count
df_low_upsampled=resample(df_low,replace=True,n_samples=len(df_high),random_state=42)

In [71]:
import pandas as pd
from sklearn.utils import resample
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['Minority','Majority','Majority','Majority','Majority','Minority','Minority','Minority','Majority','Majority','Majority','Majority','Majority']
})

Majority class has 9 instances


Minority class has 4 instances

In [79]:
df_majority =df[df['Class']=='Majority']
df_minority=df[df['Class']=='Minority']

In [89]:
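# Upsample the minority class with replacement to match the majority count
df_minority_upsampled=resample(df_minority,replace=True,n_samples=len(df_majority),random_state=42)

In [93]:
# Combine the upsampled minority class with the majority class
df_balanced =pd.concat([df_minority_upsampled,df_majority])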

In [95]:
print(df_balanced['Class'].value_counts())

Class
Minority    9
Majority    9
Name: count, dtype: int64
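Upsampling with `replace=True` balances the counts by repeating existing minority rows, so the balanced dataset contains no new information. A minimal sketch of how to see this, assuming `resample` keeps the original index; `n_unique` and `n_repeats` are illustrative names.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['Minority', 'Majority', 'Majority', 'Majority', 'Majority',
              'Minority', 'Minority', 'Minority', 'Majority', 'Majority',
              'Majority', 'Majority', 'Majority']
})

df_majority = df[df['Class'] == 'Majority']
df_minority = df[df['Class'] == 'Minority']

# Draw 9 rows from the 4 minority rows, with replacement
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

# Distinct original rows that were drawn, and how many draws were repeats
n_unique = df_minority_upsampled.index.nunique()
n_repeats = len(df_minority_upsampled) - n_unique
print(f"{n_unique} distinct minority rows, {n_repeats} repeated draws")
```

Since only 4 distinct minority rows exist, at least 5 of the 9 drawn rows must be repeats. Heavy duplication like this can encourage a model to overfit those few points.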


**Balancing the Dataset with SMOTE (Synthetic Minority Over-sampling Technique)**

In [1]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


1. Use SMOTE to generate synthetic samples instead of duplicating existing ones.

2. Convert the categorical class labels into numeric form before applying SMOTE.

3. Apply SMOTE to balance the dataset.

4. Convert the numeric labels back to the original categorical labels.

5. Combine the resampled data into a final balanced dataset.

In [3]:
import pandas as pd
from imblearn.over_sampling import SMOTE
df=pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['Minority','Majority','Majority','Majority','Majority',
             'Minority','Minority','Minority','Majority','Majority',
             'Majority','Majority','Majority']
})

In [4]:
df['Class'] = df['Class'].map({'Majority': 0, 'Minority': 1})
X = df[['Age', 'Income']]
Y = df['Class']

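# Applying SMOTE; k_neighbors=3 because the minority class has only 4 samples,
# so each minority sample has at most 3 minority neighbours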
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=3)
X_resampled, Y_resampled = smote.fit_resample(X, Y)

# Remapping the class labels back to 'Majority' and 'Minority'
Y_resampled = Y_resampled.map({0: 'Majority', 1: 'Minority'})

# Creating the balanced DataFrame
df_balanced = pd.concat([X_resampled, Y_resampled], axis=1)

# Outputting the class distribution and the balanced DataFrame
print(df_balanced['Class'].value_counts())
print(df_balanced)


Class
Minority    9
Majority    9
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
5    35    3800  Minority
6    40    4000  Minority
7    45    4200  Minority
8    50    4300  Majority
9    55    4500  Majority
10   60    5000  Majority
11   65    5500  Majority
12   70    6000  Majority
13   40    4031  Minority
14   35    3831  Minority
15   44    4176  Minority
16   35    3826  Minority
17   41    4040  Minority
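Rows 13 to 17 in the output are the synthetic samples, and each falls between existing minority points instead of repeating one. The sketch below illustrates the interpolation idea behind SMOTE on two minority rows from the dataset. It is not imbalanced-learn's implementation: the neighbour is picked by hand instead of through a nearest-neighbour search, and the names `x_i`, `x_nn` and `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority samples from the dataset above, as (Age, Income)
x_i = np.array([40.0, 4000.0])   # the sample being oversampled
x_nn = np.array([45.0, 4200.0])  # a minority neighbour, chosen by hand here

# Random position along the line segment from x_i towards x_nn
lam = rng.random()  # uniform in [0, 1)

# Synthetic sample: linear interpolation between the two points
x_new = x_i + lam * (x_nn - x_i)
print(x_new)
```

Row 15 in the output above (Age 44, Income 4176) is consistent with a point on this segment, rounded to integers to match the input columns.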
