
save cpu mem by leveraging FSDP rank0 broadcasting #77

Merged (14 commits, Aug 11, 2023)

Conversation

@lchu-ibm (Contributor) commented Aug 1, 2023

What does this PR do?

For FSDP mode, this saves CPU memory by loading only one CPU copy of the model (on rank 0). This is especially useful for Llama 70B, where the current code would consume 2+ TB of CPU memory (70B parameters * 4 bytes * 8 ranks), which causes a CPU OOM.

Notes

  1. This requires the latest nightlies. I vaguely remember hitting various issues with sync_module_states + param_init_fn in the past, until the nightlies of the most recent months.
  2. I wasn't sure what the best general-purpose param_init_fn is for the current version, given the fast-evolving PRs around meta-device init (one option is shown in the sketch below this list); maybe @awgu can comment.
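
For illustration, here is a minimal sketch of the pattern, assuming a recent PyTorch nightly; the model id and the exact param_init_fn are illustrative, and the real change lives in llama_finetuning.py:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import LlamaConfig, LlamaForCausalLM

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
model_name = "meta-llama/Llama-2-70b-hf"  # illustrative model id

if rank == 0:
    # only rank 0 materializes a full CPU copy of the weights
    model = LlamaForCausalLM.from_pretrained(model_name)
else:
    # all other ranks build the model on the meta device, allocating no CPU memory
    llama_config = LlamaConfig.from_pretrained(model_name)
    with torch.device("meta"):
        model = LlamaForCausalLM(llama_config)

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,  # rank 0 broadcasts its real weights during wrapping
    param_init_fn=(
        # non-zero ranks: materialize meta params as empty GPU tensors to receive the broadcast
        (lambda module: module.to_empty(device=torch.device("cuda"), recurse=False))
        if rank != 0
        else None
    ),
)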

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

cc @HamidShojanazeri

@HamidShojanazeri (Contributor) left a comment

Thanks @lchu-ibm for the PR! Please refer to the inline comment.

@HamidShojanazeri (Contributor) commented Aug 3, 2023

Thanks @lchu-ibm for the updates. I would appreciate it if we could add similar comments from the code about this feature here, and here as well.

@chauhang (Contributor) left a comment

@lchu-ibm Thanks for this PR to address the CPU OOM issues for the 70B model. The code has some changes in the usage of "rank" and "local_rank". It would be good to test this on both single-host multi-GPU and multi-host multi-GPU to verify things work correctly in both cases. It would be great if you could run the tests and attach the logs as well.

@lchu-ibm (Contributor, Author) commented Aug 6, 2023

@chauhang Thanks for the suggestions! Please see my response in your inline comment on the rank fix. Also, I have just done a quick code cleanup by optimizing the imports, as the original code also had a bunch of unused imports.

    raise Exception("latest pytorch nightly build is required to run with low_cpu_fsdp config, "
                    "please install latest nightly.")
if rank == 0:
    model = LlamaForCausalLM.from_pretrained(
@rohan-varma (Contributor) commented:

can we figure out why torch.device("meta") init doesn't work here?

@lchu-ibm (Contributor, Author) replied:

@rohan-varma for non-0 ranks, we are using torch.device("meta") init.

@pacman100 commented:
Hello everyone, FYI: PRs huggingface/transformers#25107 and huggingface/accelerate#1777 will make model loading with transformers efficient enough to avoid CPU OOMs when using FSDP, without any code changes on the user side. I am currently testing that out with 70B on single-node and multi-node setups.

@HamidShojanazeri (Contributor) commented Aug 10, 2023

@pacman100 Thanks for the update, that would be very helpful and I will give it a try. Can you please elaborate a bit on the usage as well?

@pacman100 commented:
Hello, just using AutoModelForCausalLM.from_pretrained() should work as long as one is using the accelerate launcher with FSDP enabled. Basically, when FSDP is enabled with Accelerate, it sets the env variable ACCELERATE_USE_FSDP to True, and I am using that in the from_pretrained method:

import os
from distutils.util import strtobool  # or the equivalent helper in transformers.utils

def is_fsdp_enabled():
    return strtobool(os.environ.get("ACCELERATE_USE_FSDP", "False")) == 1

So, if you don't want to use the accelerate launcher, you can simply run export ACCELERATE_USE_FSDP=true and then have your own training loop wherein you properly use the FSDP class with sync_module_states=True.

How does it work?

  1. Have the model load on the meta device on all ranks.
  2. Load the state dict only on rank==0 and set the param values from meta to cpu for rank==0.
  3. For all other ranks, do torch.empty(*param.size(), dtype=dtype) for every parameter on the meta device.
  4. So, rank==0 will have loaded the model with the correct state dict while all other ranks will have random/0 weights.
  5. Set sync_module_states=True so that the FSDP object takes care of broadcasting them to all ranks before training starts (see the usage sketch below this list).
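
To make the flow concrete, here is a rough usage sketch under the above assumptions; the model id is a placeholder and the exact plugin fields may differ once the PRs land:

import os
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from transformers import AutoModelForCausalLM

# `accelerate launch` with FSDP enabled sets this automatically; exporting it by hand
# mimics that behaviour when using a custom launcher, as noted above.
os.environ["ACCELERATE_USE_FSDP"] = "true"

fsdp_plugin = FullyShardedDataParallelPlugin(sync_module_states=True)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# Steps 1-4: from_pretrained keeps the real state dict only on rank 0 and leaves the
# other ranks with empty weights materialized from the meta device.
model = AutoModelForCausalLM.from_pretrained("your-org/your-7b-model")  # placeholder id

# Step 5: prepare() wraps the model with FSDP; sync_module_states=True broadcasts
# rank 0's weights to every rank before training starts.
model = accelerator.prepare(model)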

here is the output on a 7B model:

accelerator.process_index=0 GPU Memory before entering the loading : 0
accelerator.process_index=0 GPU Memory consumed at the end of the loading (end-begin): 0
accelerator.process_index=0 GPU Peak Memory consumed during the loading (max-begin): 0
accelerator.process_index=0 GPU Total Peak Memory consumed during the loading (max): 0
accelerator.process_index=0 CPU Memory before entering the loading : 926
accelerator.process_index=0 CPU Memory consumed at the end of the loading (end-begin): 26415
accelerator.process_index=0 CPU Peak Memory consumed during the loading (max-begin): 31818
accelerator.process_index=0 CPU Total Peak Memory consumed during the loading (max): 32744
accelerator.process_index=0 model.lm_head.weight=Parameter containing:
tensor([[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
        [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
        [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
        ...,
        [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
        [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
        [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]],
       requires_grad=True)
accelerator.process_index=1 GPU Memory before entering the loading : 0
accelerator.process_index=1 GPU Memory consumed at the end of the loading (end-begin): 0
accelerator.process_index=1 GPU Peak Memory consumed during the loading (max-begin): 0
accelerator.process_index=1 GPU Total Peak Memory consumed during the loading (max): 0
accelerator.process_index=1 CPU Memory before entering the loading : 933
accelerator.process_index=1 CPU Memory consumed at the end of the loading (end-begin): 10
accelerator.process_index=1 CPU Peak Memory consumed during the loading (max-begin): 573
accelerator.process_index=1 CPU Total Peak Memory consumed during the loading (max): 1506
accelerator.process_index=1 model.lm_head.weight=Parameter containing:
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], requires_grad=True)
accelerator.process_index=0 GPU Memory before entering the prepare : 0
accelerator.process_index=0 GPU Memory consumed at the end of the prepare (end-begin): 13202
accelerator.process_index=0 GPU Peak Memory consumed during the prepare (max-begin): 15458
accelerator.process_index=0 GPU Total Peak Memory consumed during the prepare (max): 15458
accelerator.process_index=0 CPU Memory before entering the prepare : 27345
accelerator.process_index=0 CPU Memory consumed at the end of the prepare (end-begin): -26394
accelerator.process_index=0 CPU Peak Memory consumed during the prepare (max-begin): 0
accelerator.process_index=0 CPU Total Peak Memory consumed during the prepare (max): 27345
FullyShardedDataParallel(
  (_fsdp_wrapped_module): RWForCausalLM(
    (transformer): RWModel(
      (word_embeddings): Embedding(65024, 4544)
      (h): ModuleList(
        (0-31): 32 x FullyShardedDataParallel(
          (_fsdp_wrapped_module): DecoderLayer(
            (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
            (self_attention): Attention(
              (maybe_rotary): RotaryEmbedding()
              (query_key_value): Linear(in_features=4544, out_features=4672, bias=False)
              (dense): Linear(in_features=4544, out_features=4544, bias=False)
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
            (mlp): MLP(
              (dense_h_to_4h): Linear(in_features=4544, out_features=18176, bias=False)
              (act): GELU(approximate='none')
              (dense_4h_to_h): Linear(in_features=18176, out_features=4544, bias=False)
            )
          )
        )
      )
      (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
  )
)
accelerator.process_index=1 GPU Memory before entering the prepare : 0
accelerator.process_index=1 GPU Memory consumed at the end of the prepare (end-begin): 13202
accelerator.process_index=1 GPU Peak Memory consumed during the prepare (max-begin): 15458
accelerator.process_index=1 GPU Total Peak Memory consumed during the prepare (max): 15458
accelerator.process_index=1 CPU Memory before entering the prepare : 945
accelerator.process_index=1 CPU Memory consumed at the end of the prepare (end-begin): 4
accelerator.process_index=1 CPU Peak Memory consumed during the prepare (max-begin): 4
accelerator.process_index=1 CPU Total Peak Memory consumed during the prepare (max): 949
accelerator.process_index=1 model.lm_head.weight=Parameter containing:
tensor([[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
        [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
        [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
        ...,
        [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
        [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
        [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]],
       device='cuda:1', requires_grad=True)
accelerator.process_index=0 model.lm_head.weight=Parameter containing:
tensor([[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
        [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
        [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
        ...,
        [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
        [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
        [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]],
       device='cuda:0', requires_grad=True)
