BART Pretraining Script #1899

Open · gyuwankim opened this issue Mar 24, 2020 · 21 comments

@gyuwankim

❓ Questions and Help

First of all, thanks for sharing the BART model checkpoints and the code to run them.

What is your question?

Could you provide the pretraining script used for the BART models?

I would like to train a BART model on my own language.
(Of course, I am aware of the mBART models that support other languages, but my target task is not MT, so I believe training BART on data from my target language only might be better.)
Although I could work out the configuration from the paper, it is easy to miss important training details that way.
A training script like the one for RoBERTa (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md) would be highly beneficial.

Thanks a lot in advance!

@huihuifan
Contributor

@ngoyal2707 @yinhanliu

@SunbowLiu

(quoting @gyuwankim's original post above)

Hello,

I am also very interested in training a customized BART. Have you got any updates?

@astariul
Contributor

astariul commented Jun 29, 2020

I'm also very interested in the pretraining script. Any update? @ngoyal2707 @yinhanliu

@shamanez

Hi, are there any updates on the BART pretraining script?

@avacaondata

I'm also highly interested in this. Is there any update?

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jul 21, 2021
@firdota

firdota commented Jul 22, 2021

any update?

@stale stale bot removed the stale label Jul 22, 2021
@thomas-li-sjtu

any update?

@MITRA-S

MITRA-S commented Mar 14, 2022

Any update please?

@mikelewis0

mikelewis0 commented Mar 16, 2022

Sorry for the (very) slow reply, this is actually the first time someone pointed me at this issue!

This command should set the hyperparameters from the original training run, though I haven't tested it with a recent fairseq, so you may need to fiddle with it a bit:
python -O train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --short-seq-prob 0.0 --arch denoising_large \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 --min-lr 1e-09 \
    --dropout 0.1 --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos --decoder-learned-pos \
    --rotate 0.0 --mask-random 0.1 --permute-sentences 1.0 --insert 0.0 --poisson-lambda 3.5 \
    --dataset-impl mmap --bpe gpt2 --num-workers 4

Hope that helps!

@salrowili

salrowili commented Mar 18, 2022

(quoting @mikelewis0's command above)

Thank you, Mike, for sharing the pre-training command. I think we should remove --min-lr 1e-09, because it causes training to finish before it even starts, and we should also omit --short-seq-prob 0.0. Note that the effective batch size with these settings is 6.4K sequences, not 8,192 as in RoBERTa: batch size = (max-tokens × update-freq × world-size) / tokens-per-sample = (3200 × 4 × 256) / 512 = 3,276,800 / 512 = 6,400. I also think the mask value should be 0.15 for the base model.
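
For reference, a quick sanity check of that arithmetic in shell; the 4096 value is only a hypothetical tweak to show one way of reaching a RoBERTa-like 8,192 sequences, not a setting from the original run:

# effective batch size in sequences per optimizer step:
#   (max-tokens * update-freq * world-size) / tokens-per-sample
echo $(( 3200 * 4 * 256 / 512 ))   # 6400 with the flags above
echo $(( 4096 * 4 * 256 / 512 ))   # 8192, e.g. with a hypothetical --max-tokens 4096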

@MITRA-S

MITRA-S commented Mar 18, 2022 via email

@BramVanroy

(quoting @mikelewis0's command above)

This command leads to many issues, as you may have suspected:

  • --arch denoising_large does not exist; bart_large might be a good alternative
  • --permute-sentences apparently can't be 1.0 (--permute-sentences: invalid int value: '1.0'); it has to be 1 instead
  • unrecognized arguments: --short-seq-prob 0.0 --min-lr 1e-09

So final command:

python train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --arch bart_base \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 \
    --dropout 0.1 --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos --decoder-learned-pos \
    --rotate 0.0 --mask-random 0.1 --permute-sentences 1 --insert 0.0 --poisson-lambda 3.5 \
    --dataset-impl mmap --bpe gpt2 --num-workers 4
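
For anyone trying to run this: $DATA must point to a binarized fairseq dataset. Below is a minimal data-preparation sketch adapted from the RoBERTa pretraining README linked at the top of this thread; the corpus paths are placeholders, and it assumes the English GPT-2 BPE vocabulary (for another language you would train your own BPE model and dictionary instead of using --bpe gpt2):

# GPT-2 BPE files used by --bpe gpt2
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

# BPE-encode the raw text splits (corpus/ paths are placeholders)
for SPLIT in train valid; do
  python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs corpus/${SPLIT}.raw \
    --outputs corpus/${SPLIT}.bpe \
    --keep-empty \
    --workers 16
done

# binarize into the mmap format expected by --dataset-impl mmap
fairseq-preprocess \
  --only-source \
  --srcdict dict.txt \
  --trainpref corpus/train.bpe \
  --validpref corpus/valid.bpe \
  --destdir data-bin/corpus \
  --workers 16

DATA=data-bin/corpus   # then run the training command above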

@erichans

erichans commented Oct 21, 2022

@mikelewis0, what is the average loss at the end of pre-training?

Could someone who has pre-trained BART share the loss and target language?
I am pre-training for Portuguese and would like to know if an average loss of around 2 is ok.

@mikelewis0

IIRC it was a little under 2 on English. It's not very meaningful to compare these numbers across languages, as it will be strongly influenced by how the data is tokenized.

@erichans

Yes, sure! I was curious how far the model could push the loss down, because I'm only pre-training for 5 epochs due to hardware limitations (5,927 steps per epoch = 29,635 training steps in total). That is far from the 500k steps you used, following the RoBERTa paper.

Thanks for the answer!

@salrowili

I have pre-trained both T5 and BART, and the final loss depends heavily on the corpus and the masking ratio you use. A larger corpus means the model needs more time to capture the contextual representation, so the loss tends to stay higher. Using a high masking ratio (>15%) will also increase your loss. It is also worth noting that the loss value says little about performance on downstream tasks. For example, you can pre-train BART on only 50 MB of text and the loss will be very low, but downstream performance will be very poor, because you need at least around 13 GB (similar to BERT) to capture enough contextual representation for effective transfer learning. With a masking ratio of 15%, the loss tends to end up below 0.5.

@jiaohuix

(quoting @BramVanroy's corrected command above)

Thank you for the script, I ran it successfully as well!

@StevenTang1998

@mikelewis0 Hi, Mike. I am a little confused about the sentence permutation in denoising_dataset.py. The full_stop_index only contains the eos token rather than the full stop token.

@ShiyuNee

(quoting @jiaohuix's reply above)

I'd like to know whether this pretraining is further pretraining, i.e., training on top of an already pretrained BART.

@PiotrNawrot

We've released nanoT5, which reproduces T5 pre-training (a model similar to BART) in PyTorch (rather than Flax).

You can take a look!

Any suggestions are more than welcome.
