Seems like different threads are reading some overlapping files in torch data loader #149
Hey, just wanted to check if someone got a chance to look into this?
@sethiay I will try to look into this by the end of this week (probably on Saturday :) )
@sethiay The code was fine. The configuration was incorrect: we were not converting the shuffle value from the conf file into the enum. For your question, DLIO currently uses Random Sampling from PyTorch, which generates a random order over the index range. In general, files should not be repeated as long as the number of files is divisible by (num_ranks * num_workers * batch_size). It seems the default Random Sampling doesn't shard the data across GPUs, so the recommended way is to write your own sampler. #154 shows this implementation. Can you check if this PR solves your issue? For your second point, that was the configuration bug; I fixed that in #154.
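For readers following along, here is a minimal sketch of what such a rank-sharded random sampler could look like. This is illustrative only: the class name, parameters, and sharding scheme are assumptions, not the actual code in #154.

```python
import torch
from torch.utils.data import Sampler

class ShardedRandomSampler(Sampler):
    """Each rank draws a disjoint 1/num_ranks shard of a shared permutation."""

    def __init__(self, num_samples, rank, num_ranks, epochs=1, seed=0):
        self.num_samples = num_samples
        self.rank = rank
        self.num_ranks = num_ranks
        self.epochs = epochs
        self.seed = seed

    def __iter__(self):
        # Iterate over all epochs: without the epoch factor, the sampler
        # would be exhausted after a single pass over the dataset.
        for epoch in range(self.epochs):
            g = torch.Generator()
            g.manual_seed(self.seed + epoch)  # identical permutation on every rank
            indices = torch.randperm(self.num_samples, generator=g).tolist()
            # Strided shard: rank r takes indices r, r+num_ranks, r+2*num_ranks, ...
            # so no sample is repeated across GPUs within an epoch.
            yield from indices[self.rank::self.num_ranks]

    def __len__(self):
        return self.epochs * (self.num_samples // self.num_ranks)
```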
Thank you @hariharan-devarajan for the fixes!
I tried, and I can still see some repeated files being read (though the number of repetitions has reduced; earlier I could see 7-8, but now 2). Also, I couldn't understand why we are multiplying by the number of epochs.
So the sampler needs to keep generating indices up to a certain point. As it's our own sampler, if we don't take epochs into consideration, it will just iterate over the dataset once and stop. I tried it without the epoch factor, and it stopped after one epoch of data.
Also, it's surprising to me that you saw only 7-8 repeats. What I saw was that the PyTorch sampler was not sharding the dataset; therefore, within an epoch it was repeating samples across all MPI processes. The PyTorch workers were unique, but the samples read per GPU were exactly the same. If your total # of samples is a multiple of (# workers * # GPUs), you should not see any repeats. I tested it with your 5000-file case, first with 2 GPUs and 10 workers, then with 10 GPUs and 2 workers. In both cases, for 1 epoch, I didn't see any repeats.
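To make the divisibility condition concrete, here is a quick check of the two test configurations mentioned above (a hypothetical helper, not DLIO code):

```python
# No-repeat condition: total samples must split evenly across all readers.
total_samples = 5000
for num_gpus, num_workers in [(2, 10), (10, 2)]:
    readers = num_gpus * num_workers      # 20 in both configurations
    per_reader = total_samples / readers  # 5000 / 20 = 250.0, an even split
    print(f"{num_gpus} GPUs x {num_workers} workers -> "
          f"{per_reader} samples/reader, even: {per_reader.is_integer()}")
```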
IIUC, DataLoader is recreated in every epoch (so is the sampler), and ... should be something like ...
If my understanding is correct, the dataset is divided by the number of GPUs, i.e. if there are 5000 samples and 5 GPUs, then each GPU will get 1000 samples. Now, if batch_size and num_workers are set to, say, 2 and 10, then each GPU will have 10 threads/processes running, where each thread/process will continuously read samples in sets of 2 (internally, each worker will call next(sampler) two times to get 2 indices and then call dataset.__getitem__(index) for the two indices to get the two samples of a batch).
They don't read a set of 2; they return one item each, and it's then batched by the Torch DataLoader.
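To observe this batching behavior directly, here is a small self-contained probe (hypothetical, not DLIO code) that logs which PyTorch worker served each sample index:

```python
from torch.utils.data import DataLoader, Dataset, get_worker_info

class ProbeDataset(Dataset):
    """Returns the index itself, logging the worker that served it."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        info = get_worker_info()
        worker_id = info.id if info is not None else -1  # -1 = main process
        print(f"worker {worker_id} read sample {index}")
        return index

if __name__ == "__main__":
    # Each __getitem__ call serves one index; the DataLoader collates
    # the returned items into batches of batch_size.
    loader = DataLoader(ProbeDataset(20), batch_size=2, num_workers=4)
    for batch in loader:
        pass
```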
Okay! Even then ...
size here is the comm_size, which is the number of GPUs
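For context, comm_size here is the MPI world size. A minimal sketch of how it is typically obtained (assuming mpi4py, as DLIO's MPI usage suggests; not a quote of the actual code):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()  # number of ranks, i.e. the number of GPUs here
rank = comm.Get_rank()  # this process's rank in [0, size)
print(f"rank {rank} of {size}")
```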
We have a recent fix, #160, which might also solve this issue.
Hey,
I am using the below command to run the I/O benchmark: ...
I am facing two issues:
1. `sample_shuffle` is off but `seed` ...
2. I set `sample_shuffle` to `off` as mentioned in the command above, but then found that this check, `self._args.sample_shuffle != Shuffle.OFF`, here is coming out to be `true`. I believe this check should be `false`.

Request you to look into the above issues and let me know if you need more info. Thanks!
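For illustration, here is a minimal sketch of the string-vs-enum mismatch described in the comments above (hypothetical enum definition; not the actual DLIO source):

```python
from enum import Enum

class Shuffle(Enum):
    OFF = "off"
    SEED = "seed"
    RANDOM = "random"

# Raw string read from the conf file, never converted to the enum:
sample_shuffle = "off"
print(sample_shuffle != Shuffle.OFF)  # True -- a string never equals an enum member

# After converting the conf value into the enum, the check behaves as expected:
sample_shuffle = Shuffle(sample_shuffle)
print(sample_shuffle != Shuffle.OFF)  # False
```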