
A100 80GB lora training out of memory #12

Open
leonary opened this issue Aug 7, 2024 · 13 comments

@leonary

leonary commented Aug 7, 2024

LoRA training runs out of GPU memory on a single A100 80GB card. Any help would be much appreciated.
LoRA rank is 16 and batch size is 1.

11.206656 parameters
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
08/08/2024 04:37:22 - INFO - __main__ - ***** Running training *****
08/08/2024 04:37:22 - INFO - __main__ -   Num Epochs = 300
08/08/2024 04:37:22 - INFO - __main__ -   Instantaneous batch size per device = 1
08/08/2024 04:37:22 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
08/08/2024 04:37:22 - INFO - __main__ -   Gradient Accumulation steps = 2
08/08/2024 04:37:22 - INFO - __main__ -   Total optimization steps = 300
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|          | 0/300 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/autodl-tmp/x-flux/train_flux_lora_deepspeed.py", line 301, in <module>
    main()
  File "/root/autodl-tmp/x-flux/train_flux_lora_deepspeed.py", line 231, in main
    model_pred = dit(img=x_t.to(weight_dtype),
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/autodl-tmp/x-flux/src/flux/model.py", line 213, in forward
    img = block(img, vec=vec, pe=pe)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/autodl-tmp/x-flux/src/flux/modules/layers.py", line 332, in forward
    qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/autodl-tmp/x-flux/wandb/offline-run-20240808_043714-dhyc07rv
wandb: Find logs at: ./wandb/offline-run-20240808_043714-dhyc07rv/logs
Traceback (most recent call last):
  File "/root/miniconda3/envs/3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/3.10/bin/python', 'train_flux_lora_deepspeed.py', '--config', 'train_configs/test_lora.yaml']' returned non-zero exit status 1.
@filliptm

filliptm commented Aug 8, 2024

Same here.

A6000, CUDA OOM at rank 4 and 512 resolution. Something's whack here.

@m-pektas

m-pektas commented Aug 8, 2024

I tried LoRA training with the flux-schnell and flux-dev models using train batch size 1, gradient_accumulation_steps 4, and rank 2 on a 40GB A100, but it still raised a CUDA out of memory exception.

@philz1337x

same here

@thavocado

thavocado commented Aug 9, 2024

You need to run accelerate config to configure DeepSpeed first using the settings in the README, or make an accelerate_config.yaml for a single GPU:

compute_environment: LOCAL_MACHINE
debug: false                                                                                               
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch using:

accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"

With default settings in the example files this requires 42,837 MiB VRAM.
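
If you want to check what your own run actually peaks at (the 42,837 MiB above is with the example defaults), here is a minimal sketch using PyTorch's built-in memory counters. It assumes you are willing to drop a helper like this into train_flux_lora_deepspeed.py and call it around a training step:

import torch

def log_vram(tag: str, device: int = 0) -> None:
    # Peak memory handed out by the caching allocator since the process started
    # (or since the last torch.cuda.reset_peak_memory_stats call).
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024**2
    # Peak including the allocator's cached-but-unused blocks; this is the
    # number that lines up most closely with what nvidia-smi reports.
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024**2
    # Free/total memory on the device as seen by the driver.
    free, total = torch.cuda.mem_get_info(device)
    print(f"[{tag}] peak allocated {peak_alloc:,.0f} MiB | "
          f"peak reserved {peak_reserved:,.0f} MiB | "
          f"free {free / 1024**2:,.0f} of {total / 1024**2:,.0f} MiB")

# hypothetical usage, e.g. right after the first optimizer step:
# log_vram("after step 1")

If peak reserved is already near 80,000 MiB on the very first step, one likely explanation is that accelerate is not picking up the DeepSpeed config at all.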

@arcanite24

@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM

@iamwangyabin

Please use DeepSpeed and set up Accelerate accordingly.
I trained a LoRA with rank 16; it needs just 40GB of VRAM.

@bghira

bghira commented Aug 9, 2024

really shouldn't need deepspeed though. i think it's because the VAE and T5 / CLIP are all loaded during training.
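
If that's the cause, one cheap thing to try is pushing the frozen helpers off the GPU once their outputs are computed. A rough sketch only (t5, clip and vae below are hypothetical names standing in for however the training script holds its frozen encoder/VAE objects):

import gc
import torch

def offload_frozen_models(*modules: torch.nn.Module) -> None:
    # Move frozen helper models (text encoders, VAE) to system RAM
    # and hand their cached VRAM back to the driver.
    for m in modules:
        m.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()

# hypothetical usage, after prompts and latents for the batch have been encoded:
# offload_frozen_models(t5, clip, vae)

T5-XXL alone is several GB in bf16, so this can free a meaningful slice of the 80GB before touching the transformer itself.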

@thavocado

@thavocado How much RAM are you using? I'm getting a SIGKILL with ~50GB of RAM

@arcanite24 40GB of RAM here

@WarriorMama777

#12 (comment)
This information was very helpful, thank you.

I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:

  • img_size: 1024: out of memory
  • img_size: 512: ✅

It seems that training at 1024 resolution might need roughly 150GB of VRAM or more...(?)

@CCRss

CCRss commented Aug 11, 2024

Just 30 hours on an ADA 6000. Using the config from #12 (comment), is this fine or do I need to adjust something? 42GB used.

@bghira

bghira commented Aug 11, 2024

looks like the proper speed

@ZhePang

ZhePang commented Sep 12, 2024

#12 (comment) This information was very helpful, thank you.

I deployed 3 x A40 48GB GPUs on Runpod for training and got the following results:

  • img_size: 1024: out of memory
  • img_size: 512: ✅

It seems that training at 1024 resolution might need roughly 150GB of VRAM or more...(?)


I'm running on 8×A100 at 1024 img size and it requires around 61GB on each GPU.

@aiXia121

You need to run accelerate config to configure DeepSpeed first using the settings in the README, or make an accelerate_config.yaml for a single GPU:

compute_environment: LOCAL_MACHINE
debug: false                                                                                               
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch using:

accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora.yaml"

With default settings in the example files this requires 42,837 MiB VRAM.

I'm running on 1×A100 at 1024 img size and it requires around 63GB (65147MiB) on the A100, even though I have pre-processed the data and only load the VAE and DiT models.
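
For anyone trying to reproduce that setup: the preprocessing described here boils down to encoding every caption once up front so the text encoders never have to sit in VRAM during training. A sketch under assumed names (t5 and clip are hypothetical callables returning embedding tensors, not the repo's actual API):

import torch

@torch.no_grad()
def cache_text_embeddings(captions, t5, clip, out_path="text_embeddings.pt"):
    # Encode all captions once and store the results on disk; during LoRA
    # training only the cached tensors are loaded, not the encoders.
    cache = []
    for caption in captions:
        cache.append({
            "caption": caption,
            "t5_emb": t5(caption).cpu(),      # hypothetical encoder call
            "clip_emb": clip(caption).cpu(),  # hypothetical encoder call
        })
    torch.save(cache, out_path)

# cache_text_embeddings(all_captions, t5, clip)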
