
Is parameter / model / tensor sharding actually supported in fairseq? (peak GPU memory not reduced at all) #5318

gordicaleksa opened this issue Sep 8, 2023 · 0 comments
Is it possible to do parameter sharding for smaller models, such as a 615M-parameter encoder-decoder transformer?

I do see that data parallelism works as expected, so that part is good. But after reading the FSDP paper and blog posts, I expect parameter sharding to happen in the background and therefore a reduction in my peak memory (only one layer should be unsharded at a time).

I also noticed that the code sets reshard_after_forward to False for the "root" module, so I tried circumventing that by reducing --min-params-to-wrap from the default 100M to 10M (also see below for other things I've tried). That did trigger nested FSDP wrapping but did not help.
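For reference, here is a minimal sketch (my own toy example using fairscale's FSDP directly, not fairseq code; module names and sizes are placeholders) of the nested wrapping pattern I expected --min-params-to-wrap to produce: each layer becomes its own FSDP unit that reshards after its forward pass, so only one layer's full parameters should live on a GPU at a time.

```python
# Minimal sketch, assuming torch.distributed is already initialized (NCCL).
# The toy module and sizes are placeholders; only the wrapping pattern matters.
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap


def build_sharded_model() -> FSDP:
    fsdp_kwargs = dict(reshard_after_forward=True, mixed_precision=True)
    with enable_wrap(wrapper_cls=FSDP, **fsdp_kwargs):
        # Each layer is wrapped individually, so its unsharded parameters only
        # materialize during that layer's own forward/backward pass.
        layers = nn.Sequential(*[wrap(nn.Linear(1024, 1024)) for _ in range(16)])
    # The root wrapper is the one fairseq leaves with reshard_after_forward=False.
    return FSDP(layers, reshard_after_forward=False, mixed_precision=True)
```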

I have 2 RTX 3090 GPUs (48 GB of VRAM in total), and with only 615M parameters both of them are already fully saturated.

Even with the most pessimistic calculation one would expect (when using Adam) about 24 bytes per parameter (12 bytes of Adam state, plus model parameters, gradients, and activations, all in full precision, i.e. fp32), which comes out to roughly 15 GB of memory. What I'm observing instead is 2 x 23 GB being used (almost the full capacity of each RTX 3090), even though I'm using fp16 and supposedly param sharding.
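Spelling out that arithmetic (my own back-of-the-envelope numbers, just restating the estimate above):

```python
# Back-of-the-envelope restatement of the estimate above (my numbers).
params = 615e6

# Pessimistic all-fp32 accounting, per parameter:
#   4 B weights + 4 B gradients + 12 B Adam state + ~4 B budgeted for activations
bytes_per_param = 4 + 4 + 12 + 4          # ~24 bytes / parameter

total_gb = params * bytes_per_param / 1e9
print(f"unsharded estimate: ~{total_gb:.1f} GB")        # ~14.8 GB
print(f"naive 2-way sharded: ~{total_gb / 2:.1f} GB")   # ~7.4 GB per GPU
# Observed instead: ~23 GB on each of the two GPUs.
```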

Am I missing some fundamental piece of knowledge here, or what is the explanation for this amount of VRAM bloat?

What have you tried?

Aside from the things mentioned above, I tried the following arguments; none of them reduced my peak VRAM consumption (an example invocation combining these flags is shown after the list):

  • --min-params-to-wrap from the default 100M down to 10M (my encoder/decoder layers are 12M/16M parameters), which forces FSDP to wrap the individual layers and not just the top-level model.
  • --checkpoint-activations - didn't give any noticeable improvement either.
  • --model-parallel-size - this triggers the code path that uses the Megatron trainer, but: a) that code path is apparently not well tested (I had to remove some redundant arguments passed to the Megatron trainer), and b) it doesn't help either, i.e. my peak memory stays the same.
  • I also tried pipeline parallelism, but it clearly was not designed for a single node with 2 GPUs; it has certain assumptions baked into it.
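For completeness, this is roughly the shape of the command I'm experimenting with. The dataset path, task, arch, and optimization settings below are placeholders; only the sharding-related flags (--ddp-backend fully_sharded, --fp16, --min-params-to-wrap, --checkpoint-activations) are the point.

```bash
fairseq-train $DATA_BIN \
  --task translation \
  --arch transformer \
  --ddp-backend fully_sharded \
  --fp16 \
  --min-params-to-wrap 10000000 \
  --checkpoint-activations \
  --optimizer adam \
  --lr 3e-4 \
  --max-tokens 4096
```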

What's your environment?

I followed the installation instructions for the NLLB project here.
