
Is parameter / model / tensor sharding actually supported in fairseq? (peak GPU memory not reduced at all) #5318

gordicaleksa opened this issue Sep 8, 2023 · 0 comments
Is it possible to do parameter sharding for smaller models, such as a 615M-parameter encoder-decoder transformer?

I do see that data parallelism works as expected, so that part is good. But after reading the FSDP paper and blog posts, I expect parameter sharding to happen in the background and therefore a reduction in my peak memory (only one layer should be unsharded at a time).

I also noticed that the code sets reshard_after_forward to False for the "root" module, so I tried circumventing that by reducing --min-params-to-wrap from the default 100M to 10M (also see below for other things I've tried). That did trigger nested FSDP wrapping but did not help.
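For reference, here is a minimal sketch (my own toy example using fairscale's FSDP directly, not fairseq code; module names and sizes are placeholders) of the nested wrapping pattern I expected --min-params-to-wrap to produce: each layer becomes its own FSDP unit that reshards after its forward pass, so only one layer's full parameters should live on a GPU at a time.

```python
# Minimal sketch, assuming torch.distributed is already initialized (NCCL).
# The toy module and sizes are placeholders; only the wrapping pattern matters.
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap


def build_sharded_model() -> FSDP:
    fsdp_kwargs = dict(reshard_after_forward=True, mixed_precision=True)
    with enable_wrap(wrapper_cls=FSDP, **fsdp_kwargs):
        # Each layer is wrapped individually, so its unsharded parameters only
        # materialize during that layer's own forward/backward pass.
        layers = nn.Sequential(*[wrap(nn.Linear(1024, 1024)) for _ in range(16)])
    # The root wrapper is the one fairseq leaves with reshard_after_forward=False.
    return FSDP(layers, reshard_after_forward=False, mixed_precision=True)
```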

I have 2 RTX 3090 GPUs (48 GB of VRAM in total), and with only 615M parameters both of them are already fully saturated.

Even with the most pessimistic calculation one would expect (when using Adam) about 24 bytes per parameter (12 bytes of Adam state, plus model parameters, gradients, and activations, all in full precision, i.e. fp32), which comes out to roughly 15 GB of memory. What I'm observing instead is 2 x 23 GB being used (almost the full capacity of each RTX 3090), even though I'm using fp16 and supposedly param sharding.
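Spelling out that arithmetic (my own back-of-the-envelope numbers, just restating the estimate above):

```python
# Back-of-the-envelope restatement of the estimate above (my numbers).
params = 615e6

# Pessimistic all-fp32 accounting, per parameter:
#   4 B weights + 4 B gradients + 12 B Adam state + ~4 B budgeted for activations
bytes_per_param = 4 + 4 + 12 + 4          # ~24 bytes / parameter

total_gb = params * bytes_per_param / 1e9
print(f"unsharded estimate: ~{total_gb:.1f} GB")        # ~14.8 GB
print(f"naive 2-way sharded: ~{total_gb / 2:.1f} GB")   # ~7.4 GB per GPU
# Observed instead: ~23 GB on each of the two GPUs.
```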

Am I missing some fundamental piece of knowledge here, or what is the explanation for this amount of VRAM bloat?

What have you tried?

Aside from the things mentioned above, I tried the following arguments; none of them reduced my peak VRAM consumption (an example invocation combining these flags is shown after the list):

  • --min-params-to-wrap from the default 100M down to 10M (my encoder/decoder layers are 12M/16M parameters), which forces FSDP to wrap the individual layers and not just the top-level model.
  • --checkpoint-activations - didn't give any noticeable improvement either.
  • --model-parallel-size - this triggers the code path that uses the Megatron trainer, but: a) that code path is apparently not well tested (I had to remove some redundant arguments passed to the Megatron trainer), and b) it doesn't help either, i.e. my peak memory stays the same.
  • I also tried pipeline parallelism, but it clearly was not designed for a single node with 2 GPUs; it has certain assumptions baked into it.
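For completeness, this is roughly the shape of the command I'm experimenting with. The dataset path, task, arch, and optimization settings below are placeholders; only the sharding-related flags (--ddp-backend fully_sharded, --fp16, --min-params-to-wrap, --checkpoint-activations) are the point.

```bash
fairseq-train $DATA_BIN \
  --task translation \
  --arch transformer \
  --ddp-backend fully_sharded \
  --fp16 \
  --min-params-to-wrap 10000000 \
  --checkpoint-activations \
  --optimizer adam \
  --lr 3e-4 \
  --max-tokens 4096
```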

What's your environment?

I followed the installation instructions for the NLLB project here.
