BART Pretraining Script #1899

Open · gyuwankim opened this issue Mar 24, 2020 · 21 comments

@gyuwankim

❓ Questions and Help

First of all, thanks for sharing the BART model checkpoints and the code to run them.

What is your question?

Could you provide the pretraining script used for the BART models?

I would like to train a BART model on my own language.
(Of course, I am aware of the mBART models that support other languages, but my target task is not MT, so I believe training BART on data from my target language only might be better.)
Although I could work out the configuration from the paper, it is easy to miss important training details that way.
A training script like the one for RoBERTa (https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md) would be highly beneficial.

Thanks a lot in advance!

@huihuifan
Contributor

@ngoyal2707 @yinhanliu

@SunbowLiu

(quoting @gyuwankim's original post above)

Hello,

I am also very interested in training a customized BART. Have you got any updates?

@astariul
Contributor

astariul commented Jun 29, 2020

I'm also very interested in the pretraining script. Any update? @ngoyal2707 @yinhanliu

@shamanez

Hi, are there any updates on the BART pretraining script?

@avacaondata

I'm also highly interested in this. Is there any update?

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jul 21, 2021
@firdota

firdota commented Jul 22, 2021

any update?

@stale stale bot removed the stale label Jul 22, 2021
@thomas-li-sjtu

any update?

@MITRA-S

MITRA-S commented Mar 14, 2022

Any update please?

@mikelewis0

mikelewis0 commented Mar 16, 2022

Sorry for the (very) slow reply, this is actually the first time someone pointed me at this issue!

This command should set the hyperparameters from the original training run, though I haven't tested it with a recent fairseq, so you may need to fiddle with it a bit:
python -O train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --short-seq-prob 0.0 --arch denoising_large \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 --min-lr 1e-09 \
    --dropout 0.1 --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos --decoder-learned-pos \
    --rotate 0.0 --mask-random 0.1 --permute-sentences 1.0 --insert 0.0 --poisson-lambda 3.5 \
    --dataset-impl mmap --bpe gpt2 --num-workers 4

Hope that helps!

@salrowili

salrowili commented Mar 18, 2022

(quoting @mikelewis0's command above)

Thank you, Mike, for sharing the pre-training command. I think we should remove --min-lr 1e-09, because it causes training to finish before it even starts, and we should also omit --short-seq-prob 0.0. Note that the effective batch size with these settings is 6.4K sequences, not 8,192 as in RoBERTa: batch size = (max-tokens × update-freq × world-size) / tokens-per-sample = (3200 × 4 × 256) / 512 = 3,276,800 / 512 = 6,400. I also think the mask value should be 0.15 for the base model.
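
For reference, a quick sanity check of that arithmetic in shell; the 4096 value is only a hypothetical tweak to show one way of reaching a RoBERTa-like 8,192 sequences, not a setting from the original run:

# effective batch size in sequences per optimizer step:
#   (max-tokens * update-freq * world-size) / tokens-per-sample
echo $(( 3200 * 4 * 256 / 512 ))   # 6400 with the flags above
echo $(( 4096 * 4 * 256 / 512 ))   # 8192, e.g. with a hypothetical --max-tokens 4096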

@MITRA-S

MITRA-S commented Mar 18, 2022 via email

@BramVanroy

(quoting @mikelewis0's command above)

This command leads to many issues, as you may have suspected:

  • --arch denoising_large does not exist; bart_large might be a good alternative
  • --permute-sentences apparently can't be 1.0 (--permute-sentences: invalid int value: '1.0'); it has to be 1 instead
  • unrecognized arguments: --short-seq-prob 0.0 --min-lr 1e-09

So final command:

python train.py $DATA --fp16 --mask 0.3 --tokens-per-sample 512 \
    --total-num-update 500000 --max-update 500000 --warmup-updates 10000 \
    --task denoising --save-interval 1 --arch bart_base \
    --optimizer adam --lr-scheduler polynomial_decay --lr 0.0004 \
    --dropout 0.1 --criterion cross_entropy --max-tokens 3200 --weight-decay 0.01 \
    --attention-dropout 0.1 --share-all-embeddings --clip-norm 0.1 \
    --skip-invalid-size-inputs-valid-test --log-format json --log-interval 1000 \
    --save-interval-updates 5000 --keep-interval-updates 1 --update-freq 4 --seed 4 \
    --distributed-world-size 256 --distributed-port 54187 --no-epoch-checkpoints \
    --mask-length span-poisson --replace-length 1 --encoder-learned-pos --decoder-learned-pos \
    --rotate 0.0 --mask-random 0.1 --permute-sentences 1 --insert 0.0 --poisson-lambda 3.5 \
    --dataset-impl mmap --bpe gpt2 --num-workers 4
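
For anyone trying to run this: $DATA must point to a binarized fairseq dataset. Below is a minimal data-preparation sketch adapted from the RoBERTa pretraining README linked at the top of this thread; the corpus paths are placeholders, and it assumes the English GPT-2 BPE vocabulary (for another language you would train your own BPE model and dictionary instead of using --bpe gpt2):

# GPT-2 BPE files used by --bpe gpt2
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'

# BPE-encode the raw text splits (corpus/ paths are placeholders)
for SPLIT in train valid; do
  python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json encoder.json \
    --vocab-bpe vocab.bpe \
    --inputs corpus/${SPLIT}.raw \
    --outputs corpus/${SPLIT}.bpe \
    --keep-empty \
    --workers 16
done

# binarize into the mmap format expected by --dataset-impl mmap
fairseq-preprocess \
  --only-source \
  --srcdict dict.txt \
  --trainpref corpus/train.bpe \
  --validpref corpus/valid.bpe \
  --destdir data-bin/corpus \
  --workers 16

DATA=data-bin/corpus   # then run the training command above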

@erichans

erichans commented Oct 21, 2022

@mikelewis0, what is the average loss at the end of pre-training?

Could someone who has pre-trained BART share the loss and target language?
I am pre-training for Portuguese and would like to know if an average loss of around 2 is ok.

@mikelewis0

IIRC it was a little under 2 on English. It's not very meaningful to compare these numbers across languages, as it will be strongly influenced by how the data is tokenized.

@erichans

Yes, sure! I was curious how far the model could push the loss down, because I'm only pre-training for 5 epochs due to hardware limitations (5,927 steps per epoch = 29,635 training steps in total). That is far from the 500k steps you used, following the RoBERTa paper.

Thanks for the answer!

@salrowili

I have pre-trained both T5 and BART, and the final loss depends heavily on the corpus and the masking ratio you use. A larger corpus means the model needs more time to capture the contextual representation, so the loss tends to stay higher. Using a high masking ratio (>15%) will also increase your loss. It is also worth noting that the loss value says little about performance on downstream tasks. For example, you can pre-train BART on only 50 MB of text and the loss will be very low, but downstream performance will be very poor, because you need at least around 13 GB (similar to BERT) to capture enough contextual representation for effective transfer learning. With a masking ratio of 15%, the loss tends to end up below 0.5.

@jiaohuix

(quoting @BramVanroy's corrected command above)

Thank you for the script, I ran it successfully as well!

@StevenTang1998

@mikelewis0 Hi, Mike. I am a little confused about the sentence permutation in denoising_dataset.py. The full_stop_index only contains the eos token rather than the full stop token.

@ShiyuNee

(quoting @jiaohuix's reply above)

I'd like to know whether this pretraining is further pretraining, i.e., training on top of an already pretrained BART.

@PiotrNawrot

We've released nanoT5, which reproduces T5 pre-training (a model similar to BART) in PyTorch (rather than Flax).

You can take a look!

Any suggestions are more than welcome.
