train model CUDA out of memory #12
Comments
Hello, you can try fp16 for training.
Reduce the batch size. It is hardcoded to 16, but you can reduce it.
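For anyone landing here, a minimal sketch of what both suggestions (fp16 mixed precision plus a smaller batch size) look like in plain PyTorch; the model, data, and loss below are placeholders, not this repo's actual training code:

```python
# Minimal sketch: fp16 (mixed-precision) training with a reduced batch size
# in plain PyTorch. The model, data, and loss are stand-ins, not the actual
# training code of this repo.
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
model = torch.nn.Linear(64, 64).to(device)            # placeholder model
data = TensorDataset(torch.randn(128, 64), torch.randn(128, 64))
loader = DataLoader(data, batch_size=4)               # reduced from the hardcoded 16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                  # loss scaling for fp16 stability

for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):  # run the forward pass in fp16
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()                 # adjusts the scale factor for the next step
```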
@aoyang-hd @cswry @jfischoff I wanted to ask whether you ran it successfully on a single GPU. I'd appreciate it if you could reply.
Yes, I just had to reduce the batch size.
@jfischoff How long did it take you to complete the training? (●'◡'●)
I didn't run the complete training; I just did a test. I think it took 2 days on 8x A100.
Thank you for responding. 😊
Is there any way to train within 24 GB on an RTX 3090, even with a batch size of one?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 3; 23.69 GiB total capacity; 23.03 GiB already allocated; 21.69 MiB free; 23.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: 0%| | 2/35135 [00:29<144:16:07, 14.78s/it, loss=0.389, v_num=0, train/loss_simple_step=0.131, train/loss_vlb_step=0.000475, train/loss_step=0.131, global_step=0.000, train/loss_x0_step=0.335, train/loss_x0_from_tao_step=0.366, train/loss_noise_from_tao_step=0.00291, train/loss_net_step=0.704]
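Not sure if it helps in your case, but the allocator hint at the end of that traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF before PyTorch initializes CUDA. A minimal sketch follows; the 128 MiB split size is just an example value, not a tested recommendation:

```python
# Sketch: apply the max_split_size_mb hint from the OOM message. The env var
# must be set before torch initializes its CUDA context; 128 MiB is an
# example value, not a tested recommendation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the env var

print(torch.cuda.is_available())  # CUDA init now picks up the setting above
```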