
A100 80GB lora training out of memory #12

Open
leonary opened this issue Aug 7, 2024 · 13 comments

@leonary

leonary commented Aug 7, 2024

LoRA training runs out of GPU memory on a single A100 80GB card. Any help would be much appreciated.
LoRA rank is 16 and batch size is 1.

11.206656 parameters
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
08/08/2024 04:37:22 - INFO - __main__ - ***** Running training *****
08/08/2024 04:37:22 - INFO - __main__ -   Num Epochs = 300
08/08/2024 04:37:22 - INFO - __main__ -   Instantaneous batch size per device = 1
08/08/2024 04:37:22 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
08/08/2024 04:37:22 - INFO - __main__ -   Gradient Accumulation steps = 2
08/08/2024 04:37:22 - INFO - __main__ -   Total optimization steps = 300
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|          | 0/300 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/autodl-tmp/x-flux/train_flux_lora_deepspeed.py", line 301, in <module>
    main()
  File "/root/autodl-tmp/x-flux/train_flux_lora_deepspeed.py", line 231, in main
    model_pred = dit(img=x_t.to(weight_dtype),
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/autodl-tmp/x-flux/src/flux/model.py", line 213, in forward
    img = block(img, vec=vec, pe=pe)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/autodl-tmp/x-flux/src/flux/modules/layers.py", line 332, in forward
    qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/autodl-tmp/x-flux/wandb/offline-run-20240808_043714-dhyc07rv
wandb: Find logs at: ./wandb/offline-run-20240808_043714-dhyc07rv/logs
Traceback (most recent call last):
  File "/root/miniconda3/envs/3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/3.10/bin/python', 'train_flux_lora_deepspeed.py', '--config', 'train_configs/test_lora.yaml']' returned non-zero exit status 1.
@filliptm

filliptm commented Aug 8, 2024

Same here.

A6000, CUDA OOM at rank 4 and 512 resolution. Something's whack here.

@m-pektas

m-pektas commented Aug 8, 2024

I tried LoRA training with the flux-schnell and flux-dev models using train batch size 1, gradient_accumulation_steps 4, and rank 2 on a 40GB A100, but it still raised a CUDA out of memory exception.

@philz1337x

same here

@thavocado

thavocado commented Aug 9, 2024

You need to run accelerate config to configure DeepSpeed first using the settings in the README, or make an accelerate_config.yaml for a single GPU:

compute_environment: LOCAL_MACHINE
debug: false                                                                                               
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch using:

accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"

With default settings in the example files this requires 42,837 MiB VRAM.
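
If you want to check what your own run actually peaks at (the 42,837 MiB above is with the example defaults), here is a minimal sketch using PyTorch's built-in memory counters. It assumes you are willing to drop a helper like this into train_flux_lora_deepspeed.py and call it around a training step:

import torch

def log_vram(tag: str, device: int = 0) -> None:
    # Peak memory handed out by the caching allocator since the process started
    # (or since the last torch.cuda.reset_peak_memory_stats call).
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024**2
    # Peak including the allocator's cached-but-unused blocks; this is the
    # number that lines up most closely with what nvidia-smi reports.
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024**2
    # Free/total memory on the device as seen by the driver.
    free, total = torch.cuda.mem_get_info(device)
    print(f"[{tag}] peak allocated {peak_alloc:,.0f} MiB | "
          f"peak reserved {peak_reserved:,.0f} MiB | "
          f"free {free / 1024**2:,.0f} of {total / 1024**2:,.0f} MiB")

# hypothetical usage, e.g. right after the first optimizer step:
# log_vram("after step 1")

If peak reserved is already near 80,000 MiB on the very first step, one likely explanation is that accelerate is not picking up the DeepSpeed config at all.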

@arcanite24

@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM

@iamwangyabin

Please use DeepSpeed and set up Accelerate accordingly.
I trained a LoRA with rank 16; it needs just 40GB of VRAM.

@bghira

bghira commented Aug 9, 2024

really shouldn't need deepspeed though. i think it's because the VAE and T5 / CLIP are all loaded during training.
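
If that's the cause, one cheap thing to try is pushing the frozen helpers off the GPU once their outputs are computed. A rough sketch only (t5, clip and vae below are hypothetical names standing in for however the training script holds its frozen encoder/VAE objects):

import gc
import torch

def offload_frozen_models(*modules: torch.nn.Module) -> None:
    # Move frozen helper models (text encoders, VAE) to system RAM
    # and hand their cached VRAM back to the driver.
    for m in modules:
        m.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()

# hypothetical usage, after prompts and latents for the batch have been encoded:
# offload_frozen_models(t5, clip, vae)

T5-XXL alone is several GB in bf16, so this can free a meaningful slice of the 80GB before touching the transformer itself.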

@thavocado

@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM

@arcanite24 40GB of RAM here

@WarriorMama777

#12 (comment)
This information was very helpful, thank you.

I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:

  • img_size: 1024: out of memory
  • img_size: 512: ✅

It seems that training at 1024 resolution might need roughly 150GB of VRAM or more...(?)

@CCRss

CCRss commented Aug 11, 2024

Just 30 hours on an ADA 6000. Using the config from #12 (comment), is this fine or do I need to adjust something? 42GB used.

@bghira

bghira commented Aug 11, 2024

looks like the proper speed

@ZhePang

ZhePang commented Sep 12, 2024

#12 (comment) This information was very helpful, thank you.

I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:

  • img_size: 1024: out of memory
  • img_size: 512: ✅

It seems that training at 1024 resolution might need roughly 150GB of VRAM or more...(?)


I'm running on 8×A100 at 1024 img size and it requires around 61GB on each GPU.

@aiXia121

You need to run accelerate config to configure DeepSpeed first using the settings in the README, or make an accelerate_config.yaml for a single GPU:

compute_environment: LOCAL_MACHINE
debug: false                                                                                               
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch using:

accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"

With default settings in the example files this requires 42,837 MiB VRAM.

I'm running on 1×A100 at 1024 img size and it requires around 63GB (65147MiB) on the A100, even though I have pre-processed the data and only load the VAE and DiT models.
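
For anyone trying to reproduce that setup: the preprocessing described here boils down to encoding every caption once up front so the text encoders never have to sit in VRAM during training. A sketch under assumed names (t5 and clip are hypothetical callables returning embedding tensors, not the repo's actual API):

import torch

@torch.no_grad()
def cache_text_embeddings(captions, t5, clip, out_path="text_embeddings.pt"):
    # Encode all captions once and store the results on disk; during LoRA
    # training only the cached tensors are loaded, not the encoders.
    cache = []
    for caption in captions:
        cache.append({
            "caption": caption,
            "t5_emb": t5(caption).cpu(),      # hypothetical encoder call
            "clip_emb": clip(caption).cpu(),  # hypothetical encoder call
        })
    torch.save(cache, out_path)

# cache_text_embeddings(all_captions, t5, clip)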
