<h1>Imbalanced Dataset</h1>


<h3><span style='color:yellow'>An imbalanced dataset is one in which the data samples are unevenly distributed across classes.</span></h3>

<h3><span style='color:yellow'>Strategies to address imbalanced datasets:</span></h3>

<ul style='font-size: 1.2em;'>
    <li> Oversampling: Increase the frequency of samples from underrepresented classes to match the representation of the majority class.</li>
    <li> Class Weighting: Modify the loss function by assigning higher weights to the minority classes, amplifying their influence during model training.</li>
    <li> Synthetic Data Generation: Create artificial data points for the minority class to balance the dataset.</li>
</ul>

<h3><span style='color:yellow'>The imbalanced dataset I'm using below is highly skewed, consisting of two classes: cats (8 samples) and dogs (2 samples).</span></h3>


In [63]:
# Importing libraries
import torch
import torchvision.datasets as datasets
import os
from torch.utils.data import DataLoader, WeightedRandomSampler
import torchvision.transforms as transforms
import torch.nn as nn

In [3]:
# Class weighting: after building the model, pass a `weight` tensor to the loss function.
# The dataset has two classes: 8 samples in the first class (cats) and 2 in the second (dogs).
# We assign a weight of 2 to the first class and a weight of 8 to the second class.
# The weights are written as floats so that they match the dtype of the model's logits.

# Two-class weighting:
loss = nn.CrossEntropyLoss(weight=torch.tensor([2.0, 8.0]))  # 8 is the weight for the minority class

# Two-class weighting v1 (normalized to sum to 1):
# 8 + 2 = 10, so 2/10 = 0.2 and 8/10 = 0.8
loss = nn.CrossEntropyLoss(weight=torch.tensor([0.2, 0.8]))  # 0.8 is the weight for the minority class

# Two-class weighting v2 (count ratios):
# cat/dog = 8/2 = 4 and dog/cat = 2/8 = 0.25; the smaller value weights the first (majority) class.
# Note: this is a 1:16 ratio, a stronger correction than the 1:4 ratio of the two variants above.
loss = nn.CrossEntropyLoss(weight=torch.tensor([0.25, 4.0]))  # 4 is the weight for the minority class
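
With the default `reduction='mean'`, `nn.CrossEntropyLoss` divides the weighted sum of per-sample losses by the sum of the weights of the targets in the batch, so only the ratio between the class weights matters, not their scale. The sketch below checks this on a small batch. The logits and targets are made-up values for illustration only, and `weighted_loss` is a hypothetical helper, not part of PyTorch.

```python
import torch
import torch.nn as nn

# Hypothetical logits for a batch of 4 samples and 2 classes (cats = 0, dogs = 1)
logits = torch.tensor([[2.0, 0.5], [0.3, 1.2], [1.5, 1.0], [0.1, 0.4]])
targets = torch.tensor([0, 1, 0, 1])


def weighted_loss(weights):
    """Evaluate the weighted cross-entropy on the fixed batch above."""
    criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights))
    return criterion(logits, targets)


loss_counts = weighted_loss([2.0, 8.0])      # 1:4 ratio
loss_normalized = weighted_loss([0.2, 0.8])  # 1:4 ratio, rescaled
loss_ratios = weighted_loss([0.25, 4.0])     # 1:16 ratio

print("counts:    ", loss_counts.item())
print("normalized:", loss_normalized.item())
print("ratios:    ", loss_ratios.item())
```

The first two values should agree up to floating-point error, while the third differs because it weights the minority class more heavily.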



<h3><span style='color:yellow'>What if we have four classes? The same idea generalizes: weight each class by the total number of samples across all classes, i.e., len(dataset), divided by the number of samples in that class.</span></h3>

<h3><span style='color:yellow'>For example, suppose we have four classes with 100, 20, 50, and 250 samples, 420 in total.</span></h3>

<h4>Weight for Class 1: Total samples / Number of samples of Class 1 = 420/100 = 4.2</h4>
<h4>Weight for Class 2: Total samples / Number of samples of Class 2 = 420/20 = 21</h4>
<h4>Weight for Class 3: Total samples / Number of samples of Class 3 = 420/50 = 8.4</h4>
<h4>Weight for Class 4: Total samples / Number of samples of Class 4 = 420/250 = 1.68</h4>


In [64]:
# Four-class weighting:
loss = nn.CrossEntropyLoss(weight=torch.tensor([4.2, 21, 8.4, 1.68]))
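
The four weights above can also be computed from the class counts instead of by hand. The following is a minimal sketch in plain Python; `inverse_frequency_weights` is a hypothetical helper name chosen for this example.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total samples / samples in that class."""
    total = sum(class_counts)
    return [total / count for count in class_counts]


class_counts = [100, 20, 50, 250]  # samples per class, as in the example above
class_weights = inverse_frequency_weights(class_counts)
print(class_weights)
```

The resulting list can be wrapped in `torch.tensor(class_weights)` and passed as the `weight` argument of `nn.CrossEntropyLoss`.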


In [68]:
# Oversampling
root_dir='/home/mohanad/learn/Pytorch/8- Dataset and DataLoader/datastes/imbalanced data'
def get_loader_with_sampling(root_dir, batch_size):
    image_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(root=root_dir, transform=image_transform)

    # Per-class weights, set by hand here. They should follow the class and sample
    # distribution, as demonstrated in the class weighting example above.
    class_weights = [0.2, 0.8]

    # One weight per sample, looked up from its label. dataset.targets holds the labels,
    # so no image has to be loaded and transformed just to read its class.
    sample_weights = [class_weights[label] for label in dataset.targets]

    # replacement=True means a single sample can be drawn more than once,
    # which is what oversamples the minority class.
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
    loader = DataLoader(dataset=dataset, batch_size=batch_size, sampler=sampler)
    return loader
    

def main():
    loader=get_loader_with_sampling(root_dir=root_dir,batch_size=4)
    for idx, (images,labels) in enumerate(loader):
        print(labels)

In [69]:
if __name__=='__main__':
    main()

tensor([0, 1, 1, 0])
tensor([1, 1, 0, 0])
tensor([0, 0])
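
Three small batches are too few to judge whether the sampler balances the classes. The sketch below reproduces the 8-versus-2 split with plain label tensors instead of images, which is an assumption made to keep it self-contained, and draws many indices from a `WeightedRandomSampler` to estimate the long-run class proportions. The seed and the number of draws are arbitrary choices.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Stand-in labels for the dataset: 8 cats (label 0) and 2 dogs (label 1)
labels = torch.tensor([0] * 8 + [1] * 2)

# Inverse-frequency class weights: total samples / samples per class
class_counts = torch.bincount(labels)
class_weights = len(labels) / class_counts.float()

# Each sample receives the weight of its class
sample_weights = class_weights[labels]

# A seeded generator keeps the draw reproducible
generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(
    sample_weights, num_samples=10_000, replacement=True, generator=generator
)

# Iterating the sampler yields dataset indices; map them back to labels
drawn_labels = labels[torch.tensor(list(sampler))]
dog_fraction = (drawn_labels == 1).float().mean().item()
print("fraction of dogs drawn:", dog_fraction)
```

Each class has the same total weight (8 × 1.25 for cats, 2 × 5 for dogs), so the fraction of dogs should come out close to one half even though dogs make up only 20% of the data.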


In [61]:
# Above, the class weights were set by hand; they can also be computed automatically from the class counts:
def get_loader_with_sampling(root_dir, batch_size):
    image_transform=transforms.Compose([
        transforms.Resize((224,224)),
        transforms.ToTensor(),
    ])

    dataset=datasets.ImageFolder(root=root_dir, transform=image_transform)

    # Count the samples in each class. dataset.targets holds the labels,
    # so the images themselves are not loaded.
    class_counts = [0] * len(dataset.classes)
    for label in dataset.targets:
        class_counts[label] += 1

    # Weight each class by total samples / samples in that class
    total_samples = sum(class_counts)
    class_weights = [total_samples / count for count in class_counts]

    # Assign each sample the weight of its class
    sample_weights = [class_weights[label] for label in dataset.targets]

    # Use WeightedRandomSampler
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
    loader = DataLoader(dataset=dataset, batch_size=batch_size, sampler=sampler)

    return loader


def main():
    loader=get_loader_with_sampling(root_dir=root_dir,batch_size=2)
    for idx, (images,labels) in enumerate(loader):
        print(labels)
        
    print("")
    num_cat=0
    num_dog=0
    for epoch in range(15):
        for image,label in loader:
            num_cat+=torch.sum(label==0)
            num_dog+=torch.sum(label==1)
    print("num_cat: ",num_cat)
    print("num_dog balanced: ",num_dog)

In [62]:
if __name__=='__main__':
    main()

tensor([1, 1])
tensor([0, 0])
tensor([1, 1])
tensor([0, 0])
tensor([1, 0])

num_cat:  tensor(70)
num_dog balanced:  tensor(80)
