Is it possible to do parameter sharding for smaller models, e.g. a 615M-parameter encoder-decoder transformer?
I do see that data parallelism works as expected, so that part is good. But after reading the FSDP paper and blog posts, I expect parameter sharding to be happening in the background and therefore a reduction in my peak memory (only one layer should be unsharded at a time).
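To make the expectation concrete, here is a toy simulation (not fairseq/fairscale code; the layer sizes and the peak-tracking logic are made up for this sketch) of what resharding-after-forward should buy: every layer's parameters stay sharded across workers, and only the layer currently executing holds a full, gathered copy.

```python
# Toy simulation of FSDP's expected per-worker parameter memory
# (illustrative only; layer sizes below are hypothetical).

def fsdp_peak_params(layer_sizes, world_size):
    """Peak per-worker parameter count when every layer is sharded and
    layers are gathered one at a time (i.e. resharded after forward)."""
    sharded = sum(n // world_size for n in layer_sizes)  # always resident
    peak = 0
    for n in layer_sizes:
        # During this layer's forward, its full parameters are gathered;
        # its own shard is temporarily replaced by the full copy.
        resident = sharded - n // world_size + n
        peak = max(peak, resident)
    return peak

# Hypothetical 24-layer model, roughly matching my 12M/16M layer sizes.
layers = [16_000_000] * 12 + [12_000_000] * 12
print(sum(layers))                              # 336,000,000 unsharded
print(fsdp_peak_params(layers, world_size=2))   # 176,000,000 per worker
```

With 2-way sharding, the per-worker peak is roughly half the model plus one unsharded layer, which is the reduction I expected to see.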
I also noticed that the code sets reshard_after_forward to false for the "root" modules, so I tried circumventing that by reducing --min-params-to-wrap from the default 100M to 10M (see below for other things I've tried). That did trigger nested FSDP wrapping, but it did not help.
I have 2 RTX 3090 GPUs, and with only 615M parameters both of them are already fully saturated (2 × 24 GB VRAM).
Even with the most pessimistic calculation (when using Adam) I'd expect ~24 bytes per parameter (model params, gradients, Adam state at 12 bytes, plus activations, all in full fp32 precision), which works out to roughly 15 GB of memory required. What I'm observing instead is 2 × 23 GB being used (almost the full capacity of each RTX 3090), even though I'm using fp16, and even though I'm supposedly using parameter sharding.
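For reference, my back-of-the-envelope math (the per-parameter byte counts are the usual textbook figures, not measurements):

```python
# Pessimistic fp32 memory estimate for 615M parameters trained with Adam.
# Byte counts per parameter are standard rule-of-thumb figures.
P = 615e6
bytes_per_param = (
    4     # fp32 model parameters
    + 4   # fp32 gradients
    + 8   # Adam state: momentum + variance, fp32 each (12 B incl. grads)
    + 8   # generous allowance for activations and overhead -> 24 B total
)
total_gb = P * bytes_per_param / 1e9
print(f"unsharded total:    {total_gb:.1f} GB")      # ~14.8 GB
print(f"per GPU, 2-way shard: {total_gb / 2:.1f} GB")
```

So even without any sharding, ~15 GB should fit comfortably on one 24 GB card; with 2-way sharding of params, grads, and optimizer state, the per-GPU figure should be lower still.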
Am I missing some fundamental piece of knowledge here, or what explains this amount of VRAM bloat?
What have you tried?
Aside from the things mentioned above, I tried the following arguments, and none of them reduced my peak VRAM consumption:
--min-params-to-wrap reduced from the default 100M to 10M (my encoder/decoder layers are 12M/16M parameters), which forces FSDP to wrap the individual layers and not just the top-level model.
--checkpoint-activations - didn't give me any noticeable improvement either.
--model-parallel-size - this triggers the code path that uses the Megatron trainer, but: a) that code path is apparently not well tested (I had to remove some redundant arguments for the Megatron trainer), and b) it doesn't help either, i.e. my peak memory is unchanged.
I also tried using pipeline parallelism, but it clearly was not designed to be used on a single node with 2 GPUs, as it has certain assumptions baked into it.
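For what it's worth, my understanding of --min-params-to-wrap is that it is a simple parameter-count cutoff for size-based auto-wrapping. A minimal sketch (module names and counts are hypothetical; this is not the fairseq implementation):

```python
# Sketch of a size-based auto-wrap decision, as I understand
# --min-params-to-wrap: a submodule gets its own nested FSDP wrapper
# only if it holds at least `min_params` parameters.

def modules_to_wrap(param_counts, min_params):
    """param_counts: {module_name: n_params}. Returns names FSDP would wrap."""
    return [name for name, n in param_counts.items() if n >= min_params]

# Hypothetical layer sizes matching my 12M/16M encoder/decoder layers.
layers = {f"encoder.layer{i}": 16_000_000 for i in range(12)}
layers.update({f"decoder.layer{i}": 12_000_000 for i in range(12)})

print(modules_to_wrap(layers, 100_000_000))       # [] -> only the root is wrapped
print(len(modules_to_wrap(layers, 10_000_000)))   # 24 -> nested wrapping kicks in
```

This is why the 100M default leaves a 615M model with only the root wrapper, and why dropping it to 10M triggered nested wrapping of my 12M/16M layers; it just didn't change my peak memory.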
What's your environment?
I followed the installation instructions for the NLLB project here.