Handling an Imbalanced Dataset - suppose we have 1000 datapoints:
1000 = 900 (Yes) and 100 (No)
The ratio is 900:100, or 9:1, so this is an imbalanced dataset. A model trained on it tends to be biased towards the class with the most datapoints, so we rebalance the classes, here by making them equal in size.
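
To make the bias concrete, here is a minimal sketch (not part of the original notebook) of a "model" that always predicts the majority class on a 900/100 split. The names `y_true`, `y_pred`, `accuracy` and `minority_recall` are illustrative, and label 0 is assumed to be the majority class, as in the cells below.

```python
import numpy as np

# Hypothetical labels: 900 majority (0) and 100 minority (1) datapoints
y_true = np.array([0] * 900 + [1] * 100)

# A trivial baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Fraction of all datapoints predicted correctly
accuracy = (y_pred == y_true).mean()

# Fraction of minority datapoints that were correctly identified
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}, minority recall: {minority_recall:.2f}")
```

The baseline is right on 90% of the datapoints while never identifying a single minority example, which is why accuracy alone is a misleading measure on imbalanced data.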

For this we use two techniques:
1. Upsampling - increasing the number of datapoints in the minority class
2. Downsampling - decreasing the number of datapoints in the majority class

In [1]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [2]:
n_class_0,n_class_1

(900, 100)

In [3]:
# Create the dataframe with an imbalanced dataset
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [4]:
df = pd.concat([class_0,class_1]).reset_index(drop = True)

In [6]:
df.head()

   feature_1  feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295  -0.25275        0
4  -0.5786    -0.292004       0


In [7]:
df.tail()

     feature_1  feature_2  target
995   1.376371   2.845701       1
996   2.23981    0.880077       1
997   1.13176    1.640703       1
998   2.902006   0.390305       1
999   2.69749    2.01357        1


In [8]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

In [9]:
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [10]:
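# Perform upsampling: draw minority rows with replacement
# until the minority class is as large as the majority class
from sklearn.utils import resample
df_minority_upsampled = resample(
    df_minority,
    replace=True,                 # sample with replacement, so rows repeat
    n_samples=len(df_majority),   # match the majority class size
    random_state=42,              # reproducible draw
)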

In [12]:
df_upsampled = pd.concat([df_majority,df_minority_upsampled])

In [14]:
df_upsampled['target'].value_counts()


0    900
1    900
Name: target, dtype: int64
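
Upsampling with `resample(..., replace=True)` does not create new information: the 900 minority rows are repeated copies of the original 100. The sketch below checks this on a small stand-in dataframe; `demo`, `n_rows` and `n_unique` are illustrative names, not part of the original notebook. It also shuffles the result, since `pd.concat` leaves all majority rows ahead of all minority rows.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical stand-in for the notebook's dataframe: 900 vs 100 rows
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    'feature_1': rng.normal(size=1000),
    'target': [0] * 900 + [1] * 100,
})
demo_majority = demo[demo['target'] == 0]
demo_minority = demo[demo['target'] == 1]

# Upsample the minority class with replacement, as in the cell above
demo_minority_upsampled = resample(
    demo_minority,
    replace=True,
    n_samples=len(demo_majority),
    random_state=42,
)

# resample keeps the original index, so unique index values
# tell us how many distinct source rows were actually drawn
n_rows = len(demo_minority_upsampled)
n_unique = demo_minority_upsampled.index.nunique()
print(f"{n_rows} upsampled rows drawn from {n_unique} distinct rows")

# Shuffle so the classes are interleaved, then rebuild a clean index
demo_upsampled = (
    pd.concat([demo_majority, demo_minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```

Because the upsampled rows are duplicates, it is usually safer to split off a test set first and upsample only the training portion; otherwise copies of the same row can land on both sides of the split.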

In [20]:
## Perform downsampling - note that this is often less desirable, because we lose datapoints from the majority class

In [21]:
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [22]:
from sklearn.utils import resample
df_majority_downsampled = resample(
    df_majority,
    replace=False,                # sample without replacement, no repeats
    n_samples=len(df_minority),   # shrink to the minority class size
    random_state=42,              # reproducible draw
)

In [23]:
df_downsampled = pd.concat([df_minority , df_majority_downsampled])

In [24]:
df_downsampled['target'].value_counts()

1    100
0    100
Name: target, dtype: int64
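
The two procedures above differ only in the target size and in whether rows are drawn with replacement, so they can be folded into one helper. This is a sketch under that assumption; `balance_classes` and its `strategy` parameter are hypothetical names introduced here, not part of scikit-learn or the original notebook.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(df, target, strategy='up', random_state=42):
    """Return a copy of df in which every class in `target` has equal size.

    strategy='up'   resamples smaller classes with replacement up to the largest.
    strategy='down' resamples larger classes without replacement down to the smallest.
    """
    if strategy not in ('up', 'down'):
        raise ValueError("strategy must be 'up' or 'down'")

    counts = df[target].value_counts()
    n_target = counts.max() if strategy == 'up' else counts.min()

    parts = []
    for _, group in df.groupby(target):
        if len(group) == n_target:
            # Already the right size, keep the rows unchanged
            parts.append(group)
        else:
            parts.append(resample(
                group,
                replace=(strategy == 'up'),
                n_samples=n_target,
                random_state=random_state,
            ))
    return pd.concat(parts).reset_index(drop=True)


# Usage on a hypothetical 900/100 dataframe
example = pd.DataFrame({
    'feature_1': range(1000),
    'target': [0] * 900 + [1] * 100,
})
balanced_up = balance_classes(example, 'target', strategy='up')
balanced_down = balance_classes(example, 'target', strategy='down')
print(balanced_up['target'].value_counts())
print(balanced_down['target'].value_counts())
```

With `strategy='down'`, 800 of the 900 majority rows are discarded, which is the loss of datapoints mentioned above.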