In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
import os
import time

In [2]:
# Define a mock dataset
class RandomDataset(Dataset):
    def __init__(self, num_samples, img_size):
        """
        Args:
            num_samples (int): Number of samples in the dataset.
            img_size (tuple): Image size (C, H, W).
        """
        self.num_samples = num_samples
        self.img_size = img_size
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        # Generate a random image (tensor)
        image = torch.randn(self.img_size)
        # Generate a random "mask" or target (for the sake of example)
        target = torch.randint(0, 2, (1,))  # Binary target
        return image, target

# Parameters
num_samples = 1024  # Number of samples in the mock dataset
img_size = (3, 64, 64)  # Example image size (C, H, W)
batch_size = 64

# Create the dataset
dataset = RandomDataset(num_samples=num_samples, img_size=img_size)

cpu_count = os.cpu_count() or 1  # Fallback to 1 if os.cpu_count() returns None

for num_workers in range(cpu_count + 1):  # Iterate from 0 to cpu_count
    train_dl = DataLoader(dataset,
                          shuffle=True,
                          pin_memory=True,
                          batch_size=batch_size,
                          num_workers=num_workers)

    start = time.time()
    for epoch in range(10):  # Simulate 10 epochs
        for i, data in enumerate(train_dl, 0):
            pass  # Here you would process your data
    end = time.time()
    print(f"Finish with:{end - start} seconds, num_workers={num_workers}")


Finish with:1.0734808444976807 seconds, num_workers=0
Finish with:5.095129728317261 seconds, num_workers=1
Finish with:4.690190076828003 seconds, num_workers=2
Finish with:4.923038482666016 seconds, num_workers=3
Finish with:5.231919288635254 seconds, num_workers=4
Finish with:5.686760187149048 seconds, num_workers=5
Finish with:6.780835151672363 seconds, num_workers=6
Finish with:8.358283519744873 seconds, num_workers=7
Finish with:8.817540168762207 seconds, num_workers=8
Finish with:9.85634708404541 seconds, num_workers=9
Finish with:10.668752670288086 seconds, num_workers=10
Finish with:11.753221035003662 seconds, num_workers=11
Finish with:12.922778844833374 seconds, num_workers=12


In [3]:
for num_workers in range(cpu_count + 1):  # Iterate from 0 to cpu_count
    train_dl = DataLoader(dataset,
                          shuffle=True,
                          pin_memory=False,
                          batch_size=batch_size,
                          num_workers=num_workers)

    start = time.time()
    for epoch in range(10):  # Simulate 10 epochs
        for i, data in enumerate(train_dl, 0):
            pass  # Here you would process your data
    end = time.time()
    print(f"Finish with:{end - start} seconds, num_workers={num_workers}")

Finish with:1.0292959213256836 seconds, num_workers=0
Finish with:5.832638263702393 seconds, num_workers=1
Finish with:5.6890552043914795 seconds, num_workers=2
Finish with:5.761323690414429 seconds, num_workers=3
Finish with:5.874486684799194 seconds, num_workers=4
Finish with:6.3858232498168945 seconds, num_workers=5
Finish with:7.5590221881866455 seconds, num_workers=6
Finish with:8.2344229221344 seconds, num_workers=7
Finish with:8.8154878616333 seconds, num_workers=8
Finish with:9.214709758758545 seconds, num_workers=9
Finish with:9.995859861373901 seconds, num_workers=10
Finish with:10.611126184463501 seconds, num_workers=11
Finish with:12.081108093261719 seconds, num_workers=12


The results show a counterintuitive trend: the total time grows with the number of workers instead of shrinking. With pin_memory=True, the 10 epochs take about 1.07 seconds with num_workers=0, about 5.1 seconds with a single worker, and about 12.9 seconds with 12 workers. The pin_memory=False run follows the same pattern (about 1.03 seconds up to about 12.1 seconds), so pinning does not appear to be the main cost. Normally, adding workers is expected to parallelize data loading and reduce the total time up to a point that depends on how I/O-bound or CPU-bound the loading is.

Several factors could explain why the time increases with more workers here:

    Overhead from Too Many Workers: When loading and preprocessing a sample is lightweight (here it is just a call to torch.randn), the cost of starting and managing worker processes can outweigh the benefit of parallelization. This is especially true for this mock dataset, which generates every sample in memory, so there is essentially no I/O latency for the workers to hide.

    Contention: The loop goes up to os.cpu_count() workers, which counts logical CPUs, and the main process needs a core as well. As the worker count approaches or exceeds the number of available cores, the processes compete for CPU time, which can add context-switching overhead and reduce efficiency.

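    Implementation Details: DataLoader workers are separate processes, not threads. With the default persistent_workers=False, the workers are shut down at the end of each pass over the DataLoader and created again for the next one, so this benchmark pays the worker startup cost 10 times, once per epoch. The roughly linear growth in total time with the number of workers is consistent with a per-worker startup cost dominating a very short epoch.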

    Synchronization Overhead: Every batch produced by a worker has to be handed back to the main process through an inter-process queue. Because the loop body here does no work, that hand-off is pure overhead: there is no computation in the main process for the workers to overlap with.
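
One way to check the startup-cost explanation is to time each epoch separately, with and without persistent_workers. The following is a minimal sketch, not the benchmark used above: TinyRandomDataset and time_epochs are hypothetical names introduced here, and the sizes are arbitrary. If worker startup is the dominant cost, the later epochs with persistent_workers=True would be expected to be noticeably cheaper than the first.

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class TinyRandomDataset(Dataset):
    """Hypothetical stand-in for RandomDataset so the sketch is self-contained."""

    def __init__(self, num_samples=256, img_size=(3, 64, 64)):
        self.num_samples = num_samples
        self.img_size = img_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return torch.randn(self.img_size), torch.randint(0, 2, (1,))


def time_epochs(dataset, num_workers, persistent, epochs=3, batch_size=64):
    """Return a list with the wall-clock seconds spent on each epoch."""
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # persistent_workers is only accepted when num_workers > 0
        persistent_workers=persistent and num_workers > 0,
    )
    per_epoch = []
    for _ in range(epochs):
        start = time.perf_counter()
        for _batch in loader:
            pass  # no processing, as in the benchmark above
        per_epoch.append(time.perf_counter() - start)
    return per_epoch
```

Comparing time_epochs(dataset, 2, persistent=False) with time_epochs(dataset, 2, persistent=True) on your own machine shows how much of each epoch is startup rather than data loading.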

Given these results, zero worker processes is the most efficient choice for this specific scenario, where the mock dataset needs no disk I/O and almost no preprocessing. More generally, the number of workers should be tuned to the characteristics of the dataset, the cost of the preprocessing, and the hardware of the system.
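
To put a number on that conclusion, the printed timings can be compared against the num_workers=0 baseline. This is a small sketch using the pin_memory=True results above, rounded to two decimals; best_num_workers and slowdown are helper names made up for this example.

```python
# Seconds for 10 epochs, copied from the pin_memory=True run and rounded.
timings = {
    0: 1.07, 1: 5.10, 2: 4.69, 3: 4.92, 4: 5.23, 5: 5.69, 6: 6.78,
    7: 8.36, 8: 8.82, 9: 9.86, 10: 10.67, 11: 11.75, 12: 12.92,
}


def best_num_workers(timings):
    """Return the num_workers value with the smallest measured time."""
    return min(timings, key=timings.get)


def slowdown(timings, baseline=0):
    """Return each time as a multiple of the baseline setting's time."""
    base = timings[baseline]
    return {workers: seconds / base for workers, seconds in timings.items()}


print("fastest setting:", best_num_workers(timings))
for workers, factor in slowdown(timings).items():
    print(f"num_workers={workers}: {factor:.1f}x the baseline time")
```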

For real-world datasets, especially those involving complex preprocessing or significant disk I/O (like loading large images from disk), you might find a different optimal number of workers. It's always a good idea to conduct similar experiments with your actual dataset to find the best configuration.
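
A rough way to explore that case without a real dataset is to add artificial latency to each sample. The sketch below assumes a hypothetical SlowRandomDataset that sleeps for delay_s seconds per item as a stand-in for disk reads or image decoding; the delay value is arbitrary, and time_one_epoch is a helper introduced only for this illustration. It is not a substitute for measuring with your actual data.

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class SlowRandomDataset(Dataset):
    """Random images with a simulated per-sample loading delay (assumption)."""

    def __init__(self, num_samples, img_size, delay_s=0.002):
        self.num_samples = num_samples
        self.img_size = img_size
        self.delay_s = delay_s

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        time.sleep(self.delay_s)  # stand-in for disk I/O or decoding
        image = torch.randn(self.img_size)
        target = torch.randint(0, 2, (1,))
        return image, target


def time_one_epoch(dataset, num_workers, batch_size=64):
    """Return the wall-clock seconds for a single pass over the dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    for _batch in loader:
        pass
    return time.perf_counter() - start
```

Running time_one_epoch over a range of num_workers values, as in the cells above, gives a first impression of where extra workers start to pay off once each sample has a real cost.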