ShardedDataParallel doesn't work with multiple nodes #397
Comments
So, the assert is telling you what's wrong: you need to find which parameter is not being used, or deactivate the buckets by passing reduce_buffer_size == 0 at init time.
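For reference, a minimal sketch of what "deactivate the buckets" could look like; the fairscale import paths and the reduce_buffer_size keyword are assumed to match the release discussed here, a process group is assumed to be already initialized, and the model/optimizer are placeholders:

```python
import torch
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

# Placeholder model and sharded optimizer; assumes torch.distributed is initialized.
model = torch.nn.Linear(32, 32)
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)

# reduce_buffer_size=0 deactivates the gradient reduce buckets, working around
# the "unused parameter" assert at the cost of per-tensor reductions.
ddp_model = ShardedDataParallel(model, optimizer, reduce_buffer_size=0)
```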
Adding a unit test so that future releases are not broken; sorry about that.
@kamo-naoyuki, could you confirm that this is not on master?
I tested the master version and the error disappears, but I got another error.
Note that torch.nn.DistributedDataParallel works.
Yes, of course, just not on CircleCI, so it's not visible here, but I did not hit this issue. There's no root parameter for reduce, which is strange. Can you test 0.1.7, just released?
Ah wait, there just aren't enough parameters in your example, so some shards are simply empty; that's why. Just try with a bigger model. I can add an assert for that, but I'm guessing it does not happen very often. DDP does not shard, which is why you were not seeing this.
Closing because the title does not really reflect that; I'll create an issue to assert when some shards are empty.
I got the same error with a large model.
It would be better to actually test my code; you seem to just read the code and guess at the reason.
Also note that
Obviously there are some bugs.
Has it occurred to you that maybe I cannot test your code on multiple nodes easily? fairscale is used all the time on multiple nodes, just not with your code, and that should not change much; this error was never raised before. Empty shards were a real corner case that your first example created. Now I don't know what happens, except that root is not a parameter for a reduce, so it makes little sense from the Python API side. A quick lookup surfaces NVIDIA/nccl#276. Could you try with the gloo backend?
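For reference, switching the backend is a one-line change in the process group setup; a sketch assuming the usual torch.distributed environment-variable rendezvous (the init_method and the RANK/WORLD_SIZE variables are placeholders for the actual launch script):

```python
import os
import torch.distributed as dist

# Same rendezvous as before, only the backend changes from "nccl" to "gloo";
# a gloo run helps separate a NCCL-specific failure from a fairscale bug.
dist.init_process_group(
    backend="gloo",
    init_method="env://",  # placeholder; any working rendezvous is fine
    world_size=int(os.environ["WORLD_SIZE"]),
    rank=int(os.environ["RANK"]),
)
```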
This we can agree on; FYI, NCCL bugs are also a thing.
No problem with your "local_rank" variable, by the way?
What do you mean? |
Gloo doesn't work.
I debugged why this happens. The size is referenced here: fairscale/fairscale/optim/oss.py, line 551 in a606e84,
but in this case I got len(self.buckets[device]) = 17. This comes from the following, since you don't check the length here: fairscale/fairscale/optim/oss.py, lines 596 to 597 in a606e84.
That in turn is caused by the following: fairscale/fairscale/optim/oss.py, lines 190 to 195 in a606e84.
There are ranks that don't have any parameters, so there are still empty lists; len(params) can be 0 in that case, which then leads to more entries than world_size.
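To illustrate the failure mode (an illustrative sketch only, not fairscale's actual partitioning code): when there are fewer trainable parameters than ranks, a greedy split across world_size leaves some ranks with an empty parameter list, and any logic that assumes one non-empty shard per rank then misbehaves.

```python
# Illustrative only: a greedy per-rank split of parameter sizes, not fairscale's code.
def partition(param_sizes, world_size):
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for size in param_sizes:
        dst = loads.index(min(loads))  # rank with the smallest load so far
        shards[dst].append(size)
        loads[dst] += size
    return shards

# 14 parameter tensors spread over 16 ranks: two shards end up empty,
# so len(params) == 0 on those ranks and the bucket/broadcast logic breaks.
shards = partition(param_sizes=[1] * 14, world_size=16)
print([len(s) for s in shards])  # [1, 1, ..., 1, 0, 0]
```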
"There are ranks which don't have parameters" which is the reason I mentioned above and why I wanted to add an assert which would have told you so (this will land after the week end #406). But thanks for the debugging ! |
I have to admit that the fact that it shows up as a NCCL error was, and still is, strange to me. I'm guessing the broadcast buckets are the reason; I should have guarded that earlier.
Ok to close?
How can I solve this issue?
Okay, I understood.
The model is large, but it has only 14 parameters.
🐛 Bug
ShardedDataParallel works with 8 GPUs x 1 node, but fails with the following error with 8 GPUs x 2 nodes.
However, there are obviously no unused parameters. Note that torch.nn.DistributedDataParallel works in the same environment.
To Reproduce
I used shared file system initialization for init_method.
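For context, shared-file-system initialization looks roughly like this; a sketch where the file path, world size, and rank are placeholders for the actual launch script:

```python
import os
import torch.distributed as dist

# Shared file-system rendezvous: every rank points at the same file on a
# filesystem visible to both nodes (path and sizes are placeholders).
dist.init_process_group(
    backend="nccl",
    init_method="file:///shared/nfs/ddp_init_file",
    world_size=16,  # 8 GPUs x 2 nodes
    rank=int(os.environ["RANK"]),
)
```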
Environment
Note that I tested both TCP and InfiniBand connections.