
Seems like different threads are reading some overlapping files in torch data loader #149

Open
sethiay opened this issue Feb 15, 2024 · 12 comments


sethiay commented Feb 15, 2024

Hey,

I am using the command below to run an I/O benchmark:

mpirun -np 8 ~/.local/bin/dlio_benchmark workload=unet3d ++workload.train.epochs=1 ++workload.workflow.profiling=True ++workload.profiling.profiler=iostat ++workload.profiling.iostat_devices=[md0] ++workload.dataset.data_folder="../unet3d-data-mounted/" ++workload.dataset.num_files_train=5000 ++workload.reader.batch_size=4 ++workload.dataset.record_length=149989507 ++workload.reader.read_threads=10 ++workload.reader.sample_shuffle=off

I am facing two issues:

  1. Seems like after this commit, different threads of the torch data loader are reading some overlapping files. I added a print statement here and found that some of the file names are printed more than once within the same epoch. If I understand correctly, there should be no overlap in the files read by different threads?
  2. The documentation says that the default for sample_shuffle is off, but:
    • I added a print statement here and found that the default is seed.
    • Also, I tried to explicitly set it to off as in the command above, but found that the check self._args.sample_shuffle != Shuffle.OFF here evaluates to true. I believe this check should be false.

Please look into the above issues and let me know if you need more info. Thanks!


sethiay commented Feb 20, 2024

Hey, just wanted to check if someone has had a chance to look into this?

@hariharan-devarajan
Collaborator

@sethiay I will try and look into this by the end of this week (probably on Saturday :) )

@hariharan-devarajan
Collaborator

@sethiay The code was fine. The configuration handling was incorrect: we were not converting the shuffle value from the config file into the enum.
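
For illustration, here is a minimal sketch of that kind of bug (the Shuffle name comes from the issue above; the enum values here are assumptions): comparing the raw config string against an enum member is never equal, so != Shuffle.OFF stays true even when shuffling is set to off.

```python
from enum import Enum

class Shuffle(Enum):       # hypothetical stand-in for DLIO's Shuffle enum
    OFF = "off"
    SEED = "seed"

sample_shuffle = "off"                     # raw string read from the config file
print(sample_shuffle != Shuffle.OFF)       # True: a str never equals an Enum member

sample_shuffle = Shuffle(sample_shuffle)   # convert the config value into the enum
print(sample_shuffle != Shuffle.OFF)       # False: the check now behaves as intended
```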

For your first question: DLIO currently uses PyTorch's random sampler, which generates a random order over the index range. In general, files should not be repeated as long as the number of files is divisible by (num_ranks * num_workers * batch_size). It seems the default approach using the random sampler doesn't shard the data across GPUs, so the recommended way is to write your own sampler. #154 shows this implementation. Can you check if this PR solves your issue?
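
For reference, a minimal sketch of that idea (this is not the #154 code; ShardedSampler and its arguments are made up): every rank builds the same seeded permutation and then yields only its own contiguous slice, so no two ranks read the same file within an epoch.

```python
import numpy as np
from torch.utils.data import Sampler

class ShardedSampler(Sampler):
    """Illustrative sampler: each rank yields its own shard of a shared permutation."""
    def __init__(self, num_samples, rank, world_size, seed=0):
        self.num_samples = num_samples
        self.rank = rank
        self.world_size = world_size
        self.seed = seed

    def __len__(self):
        return self.num_samples // self.world_size

    def __iter__(self):
        rng = np.random.default_rng(self.seed)        # same seed on every rank
        indices = rng.permutation(self.num_samples)   # identical order everywhere
        per_rank = self.num_samples // self.world_size
        start = self.rank * per_rank                  # this rank's contiguous slice
        for i in range(start, start + per_rank):
            yield int(indices[i])
```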

For your second point, that was the configuration bug. I fixed that in #154.


sethiay commented Feb 27, 2024

Thank you @hariharan-devarajan for the fixes!

Can you check if this PR solves your issue?

I tried it and I can still see some repeated files being read (though the number of repetitions has gone down: earlier I saw 7-8, now 2). Also, I couldn't understand why we are multiplying by self.epochs in end = ((self.rank + 1) * samples_per_gpu) * self.epochs. IIUC, the sampler should work without this multiplication?

@hariharan-devarajan
Collaborator

The sampler needs to keep generating indices up to a certain point. As it's our own sampler, if we don't take epochs into consideration it will just iterate over the dataset once and stop. I tried it without the epoch factor and it stopped after one epoch of data.
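
A toy run of the quoted loop (the numbers here are made up for illustration) shows what the epoch factor does for rank 0: it stretches the index range so the iterator keeps yielding for self.epochs passes instead of exhausting after one.

```python
# Hypothetical numbers: 8 samples, 2 ranks, rank 0, 3 epochs.
num_samples, size, rank, epochs = 8, 2, 0, 3
indices = list(range(num_samples))

samples_per_gpu = num_samples // size              # 4
start = rank * samples_per_gpu                     # 0
end = ((rank + 1) * samples_per_gpu) * epochs      # 12 with the factor, 4 without it

yielded = [indices[i % num_samples] for i in range(start, end)]
print(len(yielded))  # 12 = samples_per_gpu * epochs; dropping "* epochs" stops after 4
```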

@hariharan-devarajan
Collaborator

Also, it's surprising to me that you saw only 7-8 repeats. What I saw was that the PyTorch sampler was not sharding the dataset, so within an epoch it repeated samples across all MPI processes.

The PyTorch workers were unique, but the samples read per GPU were exactly the same.

If your total # of samples is a multiple of # workers * # GPUs, you should not see any repeats. I tested it with your 5000-file case, first with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.
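
A quick arithmetic check for the first of those configurations (assuming an even split of samples across GPUs and workers):

```python
total_samples = 5000
gpus, workers = 2, 10                          # 2 GPUs x 10 workers = 20 consumers
per_gpu = total_samples // gpus                # 2500 samples per GPU
per_worker = per_gpu // workers                # 250 samples per worker
assert total_samples % (gpus * workers) == 0   # divides evenly, so no wrap-around and no repeats
print(per_gpu, per_worker)                     # 2500 250
```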


sethiay commented Feb 27, 2024

IIUC, the DataLoader is recreated every epoch (and so is the sampler), and class dlio_sampler(Sampler): is not a batch sampler, i.e. it returns one index per next(dlio_sampler) call. So I think the below

        samples_per_gpu = self.num_samples // self.size
        start = self.rank * samples_per_gpu
        end = ((self.rank + 1) * samples_per_gpu) * self.epochs
        for i in range(start, end):
            yield indices[i % self.num_samples]

should be something like

        samples_per_gpu = self.num_samples // self.num_gpus
        start = self.rank * samples_per_gpu
        end = min(self.num_samples, ((self.rank + 1) * samples_per_gpu))
        for i in range(start, end):
            yield indices[i]


sethiay commented Feb 27, 2024

If your total # of samples is a multiple of # workers * # GPUs, you should not see any repeats. I tested it with your 5000-file case, first with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases I didn't see any repeats for 1 epoch.

If my understanding is correct, the dataset is divided by the number of GPUs, i.e. if there are 5000 samples and 5 GPUs, then each GPU gets 1000 samples. Now, if batch_size and num_workers are set to, say, 2 and 10, then each GPU will have 10 worker threads/processes running, where each worker continuously reads samples in sets of 2 (internally each worker calls next(sampler) twice to get 2 indices and then calls dataset.__getitem__(index) for those two indices to get the two samples of a batch).

@hariharan-devarajan
Collaborator

set of 2

They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.
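
A minimal, self-contained sketch of that behavior (ToyDataset is made up for illustration): the Dataset's __getitem__ returns one sample per index, and it is the DataLoader that collates those single items into batches of batch_size.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):   # called with one index at a time, returns one sample
        return torch.tensor(idx)

if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2, shuffle=False)
    for batch in loader:
        print(batch)              # e.g. tensor([0, 1]): two single items collated into a batch
```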


sethiay commented Feb 28, 2024

set of 2

They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.

Okay!

Even then, shouldn't samples_per_gpu = self.num_samples // self.size ideally be computed as samples_per_gpu = self.num_samples // self.num_gpus? I am not an expert in PyTorch, so please correct my understanding if needed.

@hariharan-devarajan
Collaborator

size here is the comm_size, which is the number of GPUs.

@zhenghh04
Member

We have a recent fix, #160, which might also solve this issue.
