
OOM error #8

Closed
taineleau opened this issue Aug 20, 2021 · 9 comments

@taineleau

Hi, I tried the seq2seq prefix-tuning and got:

RuntimeError: CUDA out of memory. Tried to allocate 1.20 GiB (GPU 0; 15.90 GiB total capacity; 4.63 GiB already allocated; 797.50 MiB free; 5.81 GiB reserved in total by PyTorch)

I ran the experiment on a 16GB GPU. Am I supposed to use a 32GB GPU instead? Thanks!

@XiangLi1999
Owner

Hi,

I used a 32GB GPU for the XSUM experiments. You could either switch to a GPU with more memory, or reduce the bsz and increase gradient_accumulation_steps (the idea is sketched below).
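
For what it's worth, here is a minimal PyTorch sketch of the gradient-accumulation idea; the model, data, and optimizer are placeholders, not this repo's training loop:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model / data / optimizer -- not the repo's objects.
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loader = [(torch.randn(4, 512, device=device),
           torch.randn(4, 512, device=device)) for _ in range(8)]

accum_steps = 4  # effective batch size = per-step bsz (4) * accum_steps (4) = 16

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()      # scale so accumulated grads match one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one optimizer update per accum_steps micro-batches
        optimizer.zero_grad()
```

Memory per step is set by the small per-step batch, while the optimizer still sees gradients equivalent to the larger effective batch.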

@StevenTang1998

Hi, I used one GPU (Tesla V100 SXM2, 32GB) and ran the command from the homepage.
However, I still hit the OOM problem with bsz=16 or bsz=12; if I set bsz to 8, the OOM disappears.
So, is the command on the homepage the one used to reproduce the paper?

@XiangLi1999
Owner

Hi,

It's the command to reproduce the paper's results.

Could you check that you have --fp16 yes and whether it actually turns on half precision? It should, and with half precision bsz=16 should fit.

Side note: I used a single GPU on AWS (I think it was an A100) to run all the XSUM experiments.

@StevenTang1998

I do have --fp16 yes; how can I check whether it actually turns on half precision?

Side note: the A100 on AWS has 40GB of GPU memory rather than 32GB.

@XiangLi1999
Owner

Maybe check whether your stdout contains this line: Using native 16bit precision.
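
(For reference, "native" 16-bit precision here means PyTorch's own automatic mixed precision. A rough sketch of what it amounts to, assuming a CUDA GPU and using a placeholder model/optimizer rather than this repo's code:)

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(16, 512, device="cuda")
y = torch.randn(16, 512, device="cuda")

with torch.cuda.amp.autocast():          # forward pass and loss run in fp16 where safe
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()            # loss scaling avoids fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```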

@StevenTang1998

Thanks!
My stdout does contain: Using native 16bit precision.
It's probably just the GPU memory, so I'll reduce the bsz to train the model.

By the way, what does --mid_dim mean?

@XiangLi1999
Owner

It's the dimension of the MLP's middle layer! (We use an MLP for re-parametrization; see the sketch below.)
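
Roughly, the prefix parameters are pushed through a small MLP whose hidden width is mid_dim. This is only a sketch with illustrative sizes, not the exact module in this repo:

```python
import torch
import torch.nn as nn

preseqlen, n_embd, mid_dim = 5, 1024, 512   # illustrative sizes

prefix_tokens = nn.Embedding(preseqlen, n_embd)   # small trainable prefix
reparam_mlp = nn.Sequential(
    nn.Linear(n_embd, mid_dim),   # --mid_dim sets the width of this middle layer
    nn.Tanh(),
    nn.Linear(mid_dim, n_embd),
)

prefix = reparam_mlp(prefix_tokens(torch.arange(preseqlen)))
print(prefix.shape)   # torch.Size([5, 1024])
```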

@StevenTang1998

Got it! Thanks for your answer!

@taineleau
Author

Thanks! My mistake.
I just figured out that I wasn't changing the bsz in the right place, which is why I still got OOM even with bsz=1.
