In [None]:
'''
Handling Imbalanced Datasets:

Imbalanced datasets:
An imbalanced dataset is one in which the classes are not represented equally.
For example, in a binary classification problem,
if 90% of the samples belong to class A and only 10% belong to class B, the dataset is imbalanced.
Imbalanced datasets are common in many real-world applications,
such as fraud detection, medical diagnosis, and anomaly detection.
Standard machine learning algorithms tend to favor the majority class when trained on such data,
because always predicting the majority class already keeps the overall error low.
This can result in high accuracy but poor recall and precision for the minority class, which is often the class of interest.

Strategies to handle imbalanced datasets:
1. Resampling Techniques:
   - **Oversampling**: Increase the number of instances in the minority class by duplicating existing samples or 
                        generating synthetic samples (e.g., using SMOTE).
   - **Undersampling**: Reduce the number of instances in the majority class by randomly removing samples.
   - **Combination**: Use a combination of oversampling and undersampling to balance the dataset.
2. Algorithmic Approaches:
   - **Cost-sensitive learning**: Modify the learning algorithm to take into account the class imbalance by 
                                  assigning different costs to misclassifications of different classes.
   - **Ensemble methods**: Use techniques like bagging or boosting that can help improve performance on imbalanced datasets.
3. Evaluation Metrics:
   - Accuracy alone is misleading on imbalanced datasets. Prefer precision, recall,
     F1-score, and area under the ROC curve (AUC-ROC) when evaluating model performance.


'''
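
To make the evaluation-metrics point above concrete, here is a minimal sketch of the "high accuracy, poor recall" effect. It assumes a hypothetical 900/100 label split and a stand-in "model" that always predicts the majority class; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 900 majority (0) and 100 minority (1) samples
y_true = np.array([0] * 900 + [1] * 100)

# A stand-in "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids the warning raised when no positive predictions exist
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The baseline reaches 90% accuracy without finding a single minority sample, which is why the minority-class metrics are the ones worth watching.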

In [None]:
'''
1. Upsampling (oversampling) the minority class by randomly resampling it with replacement.
   Note: this duplicates existing samples. It is not SMOTE (Synthetic Minority Over-sampling Technique),
   which generates new synthetic samples instead.
2. Downsampling (undersampling) the majority class by randomly removing samples.
'''
import numpy as np
import pandas as pd

# create a sample imbalanced dataset
np.random.seed(123) # for reproducibility
n_samples = 1000
class_0_ratio = 0.9 # 90% of the samples belong to class 0
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0 # 10% of the samples belong to class 1



In [3]:
n_class_0, n_class_1

(900, 100)

In [5]:
# Create one DataFrame per class; class 1 is centered at 1 instead of 0
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=1, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=1, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [7]:
df = pd.concat([class_0, class_1], ignore_index=True)
df.head()

   feature_1  feature_2  target
0   0.551302  -0.300232       0
1   0.419589  -0.632261       0
2   1.815652  -0.204317       0
3  -0.252750   0.213696       0
4  -0.292004   1.033878       0


In [8]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64
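
As an alternative to resampling, the cost-sensitive idea mentioned earlier can be sketched directly from these counts. The snippet below is an illustration only: it applies the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight="balanced"`. The variable names are hypothetical.

```python
import numpy as np

# Class counts taken from the value_counts() output above
counts = np.array([900, 100])
n_samples = counts.sum()
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = n_samples / (n_classes * counts)

for label, weight in enumerate(class_weights):
    print(f"class {label}: weight {weight:.3f}")
```

With these weights each class contributes the same total weight (500) to the loss. Estimators that accept a `class_weight` argument can take such values as a dictionary, for example `{0: 0.556, 1: 5.0}`.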

In [10]:
# Split the data by class; the minority class is oversampled (upsampled) in the next cell
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]


In [11]:
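from sklearn.utils import resample
# Randomly duplicate minority samples until they match the majority class size
df_minority_upsampled = resample(
    df_minority,
    replace=True,                # sample with replacement
    n_samples=len(df_majority),  # match the majority class size
    random_state=42,             # reproducible results
)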


In [14]:
df_minority_upsampled.head()

     feature_1  feature_2  target
951   0.831202   1.185775       1
992   0.495097   1.046499       1
914   1.600053   0.232972       1
971   2.066104  -0.197674       1
960   0.182489  -1.039363       1


In [None]:
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
df_upsampled['target'].value_counts()
# The classes are now balanced: 900 samples each



target
0    900
1    900
Name: count, dtype: int64
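
Random oversampling only repeats existing rows. For contrast, the synthetic-sample idea behind SMOTE can be sketched as follows. This is a simplified, NumPy-only illustration and not the imbalanced-learn implementation: the helper `smote_like_oversample`, its parameters, and the generated minority data are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_synthetic, k=5, seed=0):
    """Create synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration; not the imbalanced-learn API.
    """
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # Pairwise Euclidean distances between all minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    # Indices of the k nearest minority neighbours of each sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbours for each new point
    base_idx = rng.integers(0, n, size=n_synthetic)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_synthetic)]
    # Place the new point at a random position on the segment between the two
    gap = rng.random((n_synthetic, 1))
    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])

# Hypothetical minority class: 100 two-feature samples centered at 1
rng = np.random.default_rng(123)
X_min = rng.normal(loc=1, scale=1, size=(100, 2))

# Generate 800 synthetic samples to bring the minority class up to 900
X_syn = smote_like_oversample(X_min, n_synthetic=800)
print(X_syn.shape)
```

Because every synthetic point lies on a line segment between two real minority samples, the new rows are not exact copies, which may reduce the overfitting risk that comes with pure duplication.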

In [19]:
# Downsampling or Undersampling the majority class
df_majority_downsampled = resample(df_majority, 
                                    replace=False,    # sample without replacement
                                    n_samples=len(df_minority), 
                                    random_state=42)  # reproducible results

In [20]:
df_majority_downsampled.head()

     feature_1  feature_2  target
70    1.720920  -0.131240       0
827  -0.464899   0.253618       0
231  -0.969798  -1.096354       0
588  -0.704720   0.328862       0
39    1.012868   0.304062       0


In [21]:
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
df_downsampled['target'].value_counts()

target
0    100
1    100
Name: count, dtype: int64
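
The "combination" strategy listed at the top can be sketched with the same `resample` helper: shrink the majority class and grow the minority class until they meet at an intermediate size. The data below is regenerated so the snippet runs on its own, and `target_size = 300` is an arbitrary value chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Rebuild a small 900/100 imbalanced dataset (hypothetical single feature)
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
df_majority = df[df["target"] == 0]
df_minority = df[df["target"] == 1]

target_size = 300  # assumed intermediate class size, not from the notebook

# Undersample the majority class without replacement
df_majority_down = resample(df_majority, replace=False,
                            n_samples=target_size, random_state=42)
# Oversample the minority class with replacement
df_minority_up = resample(df_minority, replace=True,
                          n_samples=target_size, random_state=42)

# Concatenate and shuffle so the classes are interleaved
df_combined = (
    pd.concat([df_majority_down, df_minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(df_combined["target"].value_counts().to_dict())
```

Meeting in the middle discards fewer majority rows than pure undersampling and repeats each minority row fewer times than pure oversampling.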