
The result becomes 0 at the end of the second epoch when I pretrain a model with the MLM objective for Mongolian and Chinese #16

Closed
Julisa-test opened this issue Feb 25, 2019 · 4 comments

Comments

@Julisa-test

Hi @glample,

The MLM accuracy drops to 0 at the end of the second epoch when I pretrain a model with the MLM objective for Mongolian and Chinese (see the logs below). Is my preprocessing method inappropriate?

details:
python train.py --exp_name 'my_mnzh_mlm' --dump_path './dumped/' --exp_id '190225' --data_path './data/processed/mn-zh/' --lgs 'mn-zh' --clm_steps '' --mlm_steps 'mn,zh' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.2' --attention_dropout '0.2' --gelu_activation 'true' --batch_size '16' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '300000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'

INFO - 02/25/19 13:21:37 - 3:07:50 - ============ End of epoch 0 ============
INFO - 02/25/19 13:21:48 - 3:08:01 - epoch -> 0.000000
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mn_mlm_ppl -> 574.678424
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mn_mlm_acc -> 17.192429
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_zh_mlm_ppl -> 5591.294827
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_zh_mlm_acc -> 14.550473
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mlm_ppl -> 3082.986625
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mlm_acc -> 15.871451
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mn_mlm_ppl -> 436.168551
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mn_mlm_acc -> 13.728215
INFO - 02/25/19 13:21:48 - 3:08:01 - test_zh_mlm_ppl -> 32195.137737
INFO - 02/25/19 13:21:48 - 3:08:01 - test_zh_mlm_acc -> 7.138838
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mlm_ppl -> 16315.653144
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mlm_acc -> 10.433527

INFO - 02/25/19 16:29:17 - 6:15:30 - ============ End of epoch 1 ============
INFO - 02/25/19 16:29:28 - 6:15:41 - epoch -> 1.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mn_mlm_ppl -> 966.486405
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mn_mlm_acc -> 7.886435
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_zh_mlm_ppl -> 8967.092445
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_zh_mlm_acc -> 0.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mlm_ppl -> 4966.789425
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mlm_acc -> 3.943218
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mn_mlm_ppl -> 808.229061
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mn_mlm_acc -> 12.853917
INFO - 02/25/19 16:29:28 - 6:15:41 - test_zh_mlm_ppl -> 43495.881859
INFO - 02/25/19 16:29:28 - 6:15:41 - test_zh_mlm_acc -> 0.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mlm_ppl -> 22152.055460
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mlm_acc -> 6.426958

@glample
Contributor

glample commented Feb 25, 2019

Mmm, the model diverged; I guess this may be caused by your optimizer. Maybe try something like this to use a learning rate with linear warmup + decay (it's usually good for transformer training):

optimizer = adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0

Basically, the learning rate was too large in your experiment. attention_dropout = 0.2 is also quite high; maybe try 0 or 0.1 instead.
Also, a batch size of 16 is small: the larger the batch, the more stable training will be. In practice we use up to 32 GPUs to get a very large batch size.
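For what it's worth, here is a minimal sketch of what such a warmup + inverse-sqrt-decay schedule typically looks like. This only illustrates the general idea behind adam_inverse_sqrt, not the exact implementation in this repo; the warmup_updates and warmup_init_lr defaults below are assumptions (check src/optim.py for the real values):

```python
# Rough sketch of an inverse-sqrt learning-rate schedule with linear warmup.
# The warmup_updates / warmup_init_lr defaults are illustrative assumptions only.
def inverse_sqrt_lr(step, lr=1e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Learning rate at a given update step."""
    if step < warmup_updates:
        # Linear warmup: ramp from warmup_init_lr up to lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup: decay proportionally to 1 / sqrt(step).
    decay_factor = lr * warmup_updates ** 0.5
    return decay_factor * step ** -0.5

# The rate ramps up over the first warmup_updates steps, then slowly decays:
for s in (0, 2000, 4000, 30000, 300000):
    print(s, inverse_sqrt_lr(s))
```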

@Julisa-test
Author

Thank you for these details; I will try again based on your suggestions. A batch size of 16 is the largest value that fits in my machine's memory.

@Dolprimates

@glample
Which part corresponds to the warmup?

--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0

@glample
Contributor

glample commented Jul 12, 2019

Nothing in this command; it will use the default warmup here. Otherwise you can use:

--optimizer adam_inverse_sqrt,lr=0.00020,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001

where warmup_updates is the number of warmup steps.
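So with the flags above, the learning rate would ramp linearly over the first 30000 updates and then decay. Reusing the hypothetical inverse_sqrt_lr sketch from the earlier comment (with the same caveat that it only approximates the real schedule):

```python
# lr=0.0002 and warmup_updates=30000: the peak rate is reached at update 30000,
# after which the rate decays proportionally to 1 / sqrt(step).
for s in (0, 15000, 30000, 120000):
    print(s, inverse_sqrt_lr(s, lr=2e-4, warmup_updates=30000))
```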
