
Imbalanced Datasets: Definition, Impact, and Handling Techniques

---
What is an imbalanced dataset?
* A dataset is imbalanced when the distribution of classes in the target variable is highly skewed, meaning one class (the majority) heavily outnumbers the other (the minority).

Why handle imbalanced data?
* Imbalanced datasets can bias a model toward the majority class: predicting the majority class almost every time already yields a low overall error, so the minority class is largely ignored.
* As a result, the model fails to learn meaningful patterns for the minority class.
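To make the bias concrete, here is a minimal sketch with hypothetical labels (900 majority, 100 minority, chosen only for illustration). It scores a "model" that always predicts the majority class: accuracy looks good, while recall on the minority class is zero.

```python
from collections import Counter

# Hypothetical labels: 900 majority (0) and 100 minority (1) samples.
y_true = [0] * 900 + [1] * 100

# A "model" that ignores the features and always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of true class-1 samples predicted as class 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / y_true.count(1)

print(f"accuracy = {accuracy:.2f}, minority recall = {minority_recall:.2f}")
# prints: accuracy = 0.90, minority recall = 0.00
```

This is why accuracy alone is a poor measure on imbalanced data: a 90% score here says nothing about the minority class.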

How to handle imbalanced data?
* Undersampling: randomly reduce the majority class samples to balance the dataset.
* Oversampling: increase the minority class samples, for example by randomly duplicating existing ones, to balance the dataset.
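The rest of this notebook walks through undersampling only, so here is a minimal sketch of random oversampling for comparison. The toy DataFrame and the `toy_*` names are hypothetical and exist only for this illustration. The idea is to draw minority rows with replacement until both classes have the same size.

```python
import pandas as pd
from collections import Counter

# Hypothetical toy data: 9 majority rows (target 0) and 3 minority rows (target 1).
toy = pd.DataFrame({
    "feature_0": range(12),
    "target": [0] * 9 + [1] * 3,
})

toy_majority = toy[toy["target"] == 0]
toy_minority = toy[toy["target"] == 1]

# Random oversampling: draw minority rows WITH replacement until the
# minority class is as large as the majority class.
toy_minority_oversampled = toy_minority.sample(
    n=len(toy_majority), replace=True, random_state=42
)

# Combine, then shuffle so the classes are not grouped in blocks.
toy_oversampled = (
    pd.concat([toy_majority, toy_minority_oversampled], axis=0)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print("Oversampled Class Distribution:", Counter(toy_oversampled["target"].tolist()))
```

Because the added rows are exact duplicates, oversampling is usually applied to the training split only. Otherwise copies of the same row can end up in both the training and the test set.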


In [1]:
# importing libraries
import pandas as pd
from sklearn.datasets import make_classification
from collections import Counter

* The 'make_classification' function from scikit-learn generates an artificial dataset for classification problems.
* The 'Counter' class helps count the occurrences of each class in the target variable.

In [2]:
# create an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

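* About 90% of samples belong to class 0.
* About 10% of samples belong to class 1.
* These proportions are approximate; the actual counts can differ slightly from 900 and 100.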
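The realized class counts need not match `weights` exactly. A likely reason is the `flip_y` parameter of `make_classification`, which defaults to 0.01 and assigns a random class to roughly that fraction of samples. The sketch below compares the default with `flip_y=0`; the variable names `y_default` and `y_no_flip` are introduced here for illustration.

```python
from collections import Counter
from sklearn.datasets import make_classification

# Default settings: flip_y=0.01 assigns a random class to about 1% of samples.
_, y_default = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=42
)

# flip_y=0 disables that label noise.
_, y_no_flip = make_classification(
    n_samples=1000, weights=[0.9, 0.1], flip_y=0, random_state=42
)

# Print the realized class proportions for both settings.
for name, labels in [("default flip_y", y_default), ("flip_y=0", y_no_flip)]:
    counts = Counter(labels.tolist())
    proportions = {cls: n / len(labels) for cls, n in sorted(counts.items())}
    print(name, proportions)
```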

In [3]:
print(X.shape,y.shape)

(1000, 20) (1000,)


In [5]:
df = pd.DataFrame(X,columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

This converts the feature matrix X into a pandas DataFrame and attaches the labels:
* X.shape[1] gives the number of features (columns) in X
* f'feature_{i}' dynamically names columns as feature_0, feature_1...
* df['target'] = y adds the target variable y as a new column named 'target'

In [7]:
print("Class Distribution",Counter(y))

Class Distribution Counter({np.int64(0): 897, np.int64(1): 103})


Undersampling the majority class

In [10]:
# separating majority and minority classes
df_majority = df[df['target']==0]
df_minority = df[df['target']==1]

In [11]:
# undersampling majority class to match minority class size
df_majority_undersampled = df_majority.sample(n=len(df_minority),random_state=42)

In [12]:
# combine undersampled majority and minority classes
df_balanced = pd.concat([df_majority_undersampled, df_minority], axis=0)

In [13]:
print("Balanced Class Distribution: ",Counter(df_balanced['target']))

Balanced Class Distribution:  Counter({0: 103, 1: 103})
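The separate, sample, and combine steps above can also be condensed into a single grouped call. This is a sketch of an alternative, not the approach used in the cells above: it relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later), and the names `min_count` and `df_balanced_grouped` are introduced here for illustration. It draws the same number of rows from every class, so it may pick different rows than the step-by-step version.

```python
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification

# Rebuild the same synthetic dataset so this cell runs on its own.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["target"] = y

# Size of the smallest class; every class is downsampled to this size.
min_count = int(df["target"].value_counts().min())

# Draw min_count rows from each class without replacement.
df_balanced_grouped = (
    df.groupby("target")
    .sample(n=min_count, random_state=42)
    .reset_index(drop=True)
)

print("Balanced Class Distribution:", Counter(df_balanced_grouped["target"].tolist()))
```

A side benefit of grouping is that it works unchanged when the target has more than two classes.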
