
Seems like different threads are reading some overlapping files in torch data loader #149

Open
sethiay opened this issue Feb 15, 2024 · 12 comments


sethiay commented Feb 15, 2024

Hey,

I am using the command below to run an I/O benchmark:

mpirun -np 8 ~/.local/bin/dlio_benchmark workload=unet3d ++workload.train.epochs=1 ++workload.workflow.profiling=True ++workload.profiling.profiler=iostat ++workload.profiling.iostat_devices=[md0] ++workload.dataset.data_folder="../unet3d-data-mounted/" ++workload.dataset.num_files_train=5000 ++workload.reader.batch_size=4 ++workload.dataset.record_length=149989507 ++workload.reader.read_threads=10 ++workload.reader.sample_shuffle=off

I am facing two issues:

  1. Seems like after this commit, different threads of the torch data loader are reading some overlapping files. I added a print statement here and found that some of the file names are printed more than once within the same epoch. If I understand correctly, there should be no overlap in the files read by different threads?
  2. The documentation says that the default for sample_shuffle is off, but:
    • I added a print statement here and found that the default is seed.
    • Also, I tried to explicitly set it to off as in the command above, but found that the check self._args.sample_shuffle != Shuffle.OFF here evaluates to true. I believe this check should be false.

Please look into the above issues and let me know if you need more info. Thanks!


sethiay commented Feb 20, 2024

Hey, just wanted to check if someone has had a chance to look into this?

@hariharan-devarajan
Collaborator

@sethiay I will try and look into this by the end of this week (probably on Saturday :) )

@hariharan-devarajan
Collaborator

@sethiay The code was fine. The configuration handling was incorrect: we were not converting the shuffle value from the config file into the enum.
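
For illustration, here is a minimal sketch of that kind of bug (the Shuffle name comes from the issue above; the enum values here are assumptions): comparing the raw config string against an enum member is never equal, so != Shuffle.OFF stays true even when shuffling is set to off.

```python
from enum import Enum

class Shuffle(Enum):       # hypothetical stand-in for DLIO's Shuffle enum
    OFF = "off"
    SEED = "seed"

sample_shuffle = "off"                     # raw string read from the config file
print(sample_shuffle != Shuffle.OFF)       # True: a str never equals an Enum member

sample_shuffle = Shuffle(sample_shuffle)   # convert the config value into the enum
print(sample_shuffle != Shuffle.OFF)       # False: the check now behaves as intended
```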

For your first question: DLIO currently uses PyTorch's random sampler, which generates a random order over the index range. In general, files should not be repeated as long as the number of files is divisible by (num_ranks * num_workers * batch_size). It seems the default approach using the random sampler doesn't shard the data across GPUs, so the recommended way is to write your own sampler. #154 shows this implementation. Can you check if this PR solves your issue?
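
For reference, a minimal sketch of that idea (this is not the #154 code; ShardedSampler and its arguments are made up): every rank builds the same seeded permutation and then yields only its own contiguous slice, so no two ranks read the same file within an epoch.

```python
import numpy as np
from torch.utils.data import Sampler

class ShardedSampler(Sampler):
    """Illustrative sampler: each rank yields its own shard of a shared permutation."""
    def __init__(self, num_samples, rank, world_size, seed=0):
        self.num_samples = num_samples
        self.rank = rank
        self.world_size = world_size
        self.seed = seed

    def __len__(self):
        return self.num_samples // self.world_size

    def __iter__(self):
        rng = np.random.default_rng(self.seed)        # same seed on every rank
        indices = rng.permutation(self.num_samples)   # identical order everywhere
        per_rank = self.num_samples // self.world_size
        start = self.rank * per_rank                  # this rank's contiguous slice
        for i in range(start, start + per_rank):
            yield int(indices[i])
```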

For your second point, that was the configuration bug. I fixed that in #154.


sethiay commented Feb 27, 2024

Thank you @hariharan-devarajan for the fixes!

Can you check if this PR solves your issue?

I tried it and I can still see some repeated files being read (though the number of repetitions has gone down: earlier I saw 7-8, now 2). Also, I couldn't understand why we are multiplying by self.epochs in end = ((self.rank + 1) * samples_per_gpu) * self.epochs. IIUC, the sampler should work without this multiplication?

@hariharan-devarajan
Collaborator

The sampler needs to keep generating indices up to a certain point. As it's our own sampler, if we don't take epochs into consideration it will just iterate over the dataset once and stop. I tried it without the epoch factor and it stopped after one epoch of data.
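
A toy run of the quoted loop (the numbers here are made up for illustration) shows what the epoch factor does for rank 0: it stretches the index range so the iterator keeps yielding for self.epochs passes instead of exhausting after one.

```python
# Hypothetical numbers: 8 samples, 2 ranks, rank 0, 3 epochs.
num_samples, size, rank, epochs = 8, 2, 0, 3
indices = list(range(num_samples))

samples_per_gpu = num_samples // size              # 4
start = rank * samples_per_gpu                     # 0
end = ((rank + 1) * samples_per_gpu) * epochs      # 12 with the factor, 4 without it

yielded = [indices[i % num_samples] for i in range(start, end)]
print(len(yielded))  # 12 = samples_per_gpu * epochs; dropping "* epochs" stops after 4
```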

@hariharan-devarajan
Collaborator

Also, it's surprising to me that you saw only 7-8 repeats. What I saw was that the PyTorch sampler was not sharding the dataset, so within an epoch it repeated samples across all MPI processes.

The PyTorch workers were unique, but the samples read per GPU were exactly the same.

If your total # of samples is a multiple of # workers * # GPUs, you should not see any repeats. I tested it with your 5000-file case, first with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.
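
A quick arithmetic check for the first of those configurations (assuming an even split of samples across GPUs and workers):

```python
total_samples = 5000
gpus, workers = 2, 10                          # 2 GPUs x 10 workers = 20 consumers
per_gpu = total_samples // gpus                # 2500 samples per GPU
per_worker = per_gpu // workers                # 250 samples per worker
assert total_samples % (gpus * workers) == 0   # divides evenly, so no wrap-around and no repeats
print(per_gpu, per_worker)                     # 2500 250
```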


sethiay commented Feb 27, 2024

IIUC, the DataLoader is recreated every epoch (and so is the sampler), and class dlio_sampler(Sampler): is not a batch sampler, i.e. it returns one index per next(dlio_sampler) call. So I think the below

        samples_per_gpu = self.num_samples // self.size
        start = self.rank * samples_per_gpu
        end = ((self.rank + 1) * samples_per_gpu) * self.epochs
        for i in range(start, end):
            yield indices[i % self.num_samples]

should be something like

        samples_per_gpu = self.num_samples // self.num_gpus
        start = self.rank * samples_per_gpu
        end = min(self.num_samples, ((self.rank + 1) * samples_per_gpu))
        for i in range(start, end):
            yield indices[i]


sethiay commented Feb 27, 2024

If your total # of samples is a multiple of # workers * # GPUs, you should not see any repeats. I tested it with your 5000-file case, first with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.

If my understanding is correct, the dataset is divided by the number of GPUs, i.e. if there are 5000 samples and 5 GPUs, then each GPU gets 1000 samples. Now, if batch_size and num_workers are set to, say, 2 and 10, then each GPU will have 10 worker threads/processes running, where each worker continuously reads samples in sets of 2 (internally each worker calls next(sampler) twice to get 2 indices and then calls dataset.__getitem__(index) for those two indices to get the two samples of a batch).

@hariharan-devarajan
Collaborator

set of 2

They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.
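
A minimal, self-contained sketch of that behavior (ToyDataset is made up for illustration): the Dataset's __getitem__ returns one sample per index, and it is the DataLoader that collates those single items into batches of batch_size.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):   # called with one index at a time, returns one sample
        return torch.tensor(idx)

if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2, shuffle=False)
    for batch in loader:
        print(batch)              # e.g. tensor([0, 1]): two single items collated into a batch
```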


sethiay commented Feb 28, 2024

set of 2

They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.

Okay!

Even then, shouldn't samples_per_gpu = self.num_samples // self.size ideally be computed as samples_per_gpu = self.num_samples // self.num_gpus? I am not an expert in PyTorch, so please correct my understanding if needed.

@hariharan-devarajan
Collaborator

size here is the comm_size, which is the number of GPUs.

@zhenghh04
Member

We have a recent fix, #160, which might also solve this issue.
