BART on CNN/DM: how to train on small GPU? #1413
Set your update-freq to 32 in your case and try --max-tokens 800 (32 GPUs * 2048 / 800 / 2 GPUs = 32ish). If this still doesn't work, then you need to modify the code by changing --max-target-positions 512 --max-source-positions 512 (this will filter out samples longer than 512). You can also train with a smaller batch (lower update-freq but a longer training).
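The batch-size arithmetic in the comment above can be spelled out as a quick sanity check (assuming, as the comment implies, that the reference fine-tuning run used 32 GPUs at --max-tokens 2048):

```python
# Keep tokens-per-update roughly constant when moving to a smaller setup:
#   tokens_per_update = n_gpus * max_tokens * update_freq
ref_tokens = 32 * 2048             # reference setup quoted above: 32 GPUs, 2048 tokens each
new_gpus, new_max_tokens = 2, 800  # the asker's 2 x 8GB GPUs at --max-tokens 800

update_freq = ref_tokens / (new_gpus * new_max_tokens)
print(round(update_freq))  # 41 -- the comment above rounds this down to "32ish"
```

The exact value is ~41, so the suggested 32 slightly shrinks the effective batch; either should be close enough for fine-tuning.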
Thanks for the fast answer! With
If I understood the paper correctly, it's normal, because BART takes 1024 tokens maximum, not 512 like BERT. And in CNN/DM there are a lot of samples with more than 800 tokens. If I try using
Also, I didn't understand what you meant by:
Do you mean reducing
As far as I understand, it will not change the real batch size, it will just change the accumulated batch size. But for my memory problem, it's the real batch size that matters. Did I misunderstand something? Thanks again for your help!
Actually, BART used 512 positions during pretraining. However, we initialized the model with 1024 positional embeddings -- positions 512-1024 were not updated during pretraining. During fine-tuning, we use all 1024 positional embeddings -- positions 512-1024 start to get updated in this phase. Looks like in your case an 8GB GPU won't even fit a single instance. You have to cut the pretrained model's position layers from 1024 to 512 (rewrite the pretrained model's state), then use --max-target-positions 512. This will for sure hurt performance on the CNN/DM dataset --- tons of instances are longer than 512. I did a brief tuning on CNN/DM; probably training with a smaller batch size but for longer (more than 30000 steps) won't hurt performance. You can try.
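The checkpoint surgery described above (cutting the positional embeddings from 1024 down to 512) is not shown anywhere in this thread; below is a minimal sketch of the idea, using plain nested lists to stand in for tensors. With a real fairseq checkpoint you would load `torch.load(ckpt)["model"]`, slice the same keys, and `torch.save` the result; the `pad_offset` of 2 reflects the extra padding-related rows fairseq's learned positional embeddings carry, but treat the exact key names and offset as assumptions to verify against your checkpoint.

```python
def truncate_positions(model_state, max_positions, pad_offset=2):
    """Keep only the first max_positions rows of every learned
    positional-embedding table in a (fairseq-style) state dict.
    pad_offset accounts for the extra padding-index rows."""
    for key, weight in model_state.items():
        if key.endswith("embed_positions.weight"):
            model_state[key] = weight[: max_positions + pad_offset]
    return model_state

# Toy stand-in: 1026-row "tensors" (1024 positions + 2 offset rows).
state = {
    "encoder.embed_positions.weight": [[0.0] * 4 for _ in range(1026)],
    "decoder.embed_positions.weight": [[0.0] * 4 for _ in range(1026)],
}
state = truncate_positions(state, max_positions=512)
print(len(state["encoder.embed_positions.weight"]))  # 514
```

The same slicing expression works unchanged on real `torch.Tensor` rows, since tensors support the identical `[:n]` syntax.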
Thanks for the kind explanation 👍 I can make the training run by specifying
However, my GPU does not support FP16... Just curious, did you try to train BART using memory-efficient FP16?
We did try memory-efficient FP16 on RoBERTa. With this setting, RoBERTa-base on bookwiki data gets ppl 4.00 (memory-efficient FP16) vs 3.90 (fp16), so we only used fp16 on BART.
Thanks for sharing your knowledge. It is helpful! I'm going to try this path (
I believe results are going to be higher this way than by truncating articles to 512 tokens because, as you mentioned, a lot of articles are longer than 512... What's your opinion about this?
sure. --memory-efficient-fp16 sounds better. |
Note that
Hi @colanim, I've also tried to train the model on multiple 24GB GPUs (the number of machines varying from 2 to 8). Since my GPUs do not support FP16, I trained the model without
Back to the point, have you tried training the model with
ps) Merry Christmas!
On my side, I trained BART on 4 x 11GB GPUs. But still, it was not enough, so I reduced the
With
It's a bit lower than normal BART, but that was expected given my parameters. I didn't try training the model with a lower number of
Merry Christmas :)
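Putting the flags discussed in this thread together, here is a hedged sketch of what a small-GPU fine-tuning command might look like. Every value is illustrative rather than a setting any commenter confirmed, and `cnn_dm-bin` / `bart.large/model.pt` are assumed paths for the binarized data and pretrained checkpoint:

```shell
# Illustrative only: flag values are assumptions, not the thread's exact settings.
fairseq-train cnn_dm-bin \
    --restore-file bart.large/model.pt \
    --task translation --source-lang source --target-lang target \
    --arch bart_large --layernorm-embedding \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --truncate-source --max-source-positions 928 --max-target-positions 928 \
    --max-tokens 1024 --update-freq 32 --required-batch-size-multiple 1 \
    --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 --clip-norm 0.1 \
    --lr 3e-05 --lr-scheduler polynomial_decay --total-num-update 20000 --warmup-updates 500 \
    --memory-efficient-fp16 \
    --reset-optimizer --reset-dataloader --reset-meters \
    --find-unused-parameters --skip-invalid-size-inputs-valid-test
```

The 928-position limits assume a checkpoint whose positional embeddings have already been truncated as described earlier in the thread; with an unmodified checkpoint you would leave them at 1024.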
Hi @colanim, I am wondering how long it took you to finetune BART on CNN/DM using 4 GPUs?
@zide05 It took quite long, I don't remember exactly but something like 24 hours |
@colanim I got this, thank you for your quick reply! |
@colanim |
My configuration:
I had to create a new model where I kept only the first 928 position tokens. I did it with:
@colanim May I ask one more question? |
My GPU has little memory, so I couldn't even fit a batch size of 1 when the sample length is 1024. By reducing the length to 928, each sample takes less space in memory and I can fit a batch size of 1. You can reduce it further, but you should expect a score decrease.
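A rough back-of-envelope for why shaving 1024 down to 928 helps, under the assumption that the L x L self-attention score matrices dominate activation memory (other activations scale only linearly in L):

```python
# Attention scores are (heads x L x L) per layer, so the memory they
# use scales quadratically with the sequence length L.
def attn_score_ratio(short_len, full_len):
    """Fraction of full-length attention-score memory used at short_len."""
    return (short_len / full_len) ** 2

print(round(attn_score_ratio(928, 1024), 2))  # 0.82 -> ~18% less score memory
print(round(attn_score_ratio(512, 1024), 2))  # 0.25 -> the drastic 512 option
```

So 928 buys a modest but sometimes decisive saving over 1024, while 512 quarters the score memory at the cost of truncating many CNN/DM articles.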
Hi, I would like to ask some questions about fine-tuning on CNNDM. |
Right
This is expected: training in FP16 mode requires less memory than FP32 mode.
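To put rough numbers on that, here is a sketch of weights-plus-optimizer memory, assuming roughly 400M parameters for BART-large and Adam's usual two state buffers per parameter. It deliberately ignores gradients and activations, which also halve in FP16 and are where plain `--fp16` gets most of its savings; the point of `--memory-efficient-fp16` is that it additionally drops the FP32 master copy and keeps optimizer state in FP16:

```python
def train_memory_gb(n_params, weight_bytes, adam_buf_bytes, master_bytes=0):
    """Weights + two Adam buffers (+ optional fp32 master copy), in GiB.
    Ignores gradients and activations, which also shrink in fp16."""
    return n_params * (weight_bytes + 2 * adam_buf_bytes + master_bytes) / 1024**3

N = 400_000_000  # assumed parameter count, roughly BART-large

print(round(train_memory_gb(N, 4, 4), 1))     # plain fp32:              ~4.5 GB
print(round(train_memory_gb(N, 2, 4, 4), 1))  # fp16 w/ fp32 master+Adam: ~5.2 GB
print(round(train_memory_gb(N, 2, 2), 1))     # all-fp16 optimizer state: ~2.2 GB
```

Note the middle line: mixed precision with a master copy can cost *more* on the optimizer side, which is exactly why a memory-efficient FP16 mode exists for small GPUs.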
Thanks for your quick answer! It helps a lot. |
Hi, now I've run into another problem during training.
Summary: Pull Request resolved: fairinternal/fairseq-py#1413 Test Plan: Imported from OSS Reviewed By: ngoyal2707 Differential Revision: D24833476 Pulled By: myleott fbshipit-source-id: 380ea7e05c7b188086b2b10c15120ea6636e0a3e
I'm trying to reproduce the CNN/DM results of BART.
Unfortunately, I don't have access to good GPUs. I only have access to 2 GPUs with 8GB of memory.
I updated the finetuning cmd accordingly (changing `UPDATE_FREQ`) for the number of GPUs. But I have an issue with GPU memory: I tried reducing `MAX_TOKENS` to `512` in order to make the data fit in my 8GB, but I receive the following error:
If I set `MAX_TOKENS` to `1024`, I get a `CUDA out of memory` error (expected).
What modification do I need to make to be able to finetune the model on small GPUs (8GB)?
@ngoyal2707 @yinhanliu