
Experience OOM error during evaluate_mt() #21

Closed
yilinyang7 opened this issue Mar 1, 2019 · 1 comment

Comments

@yilinyang7

Dear authors,
Thank you so much for your code. I'm trying to reproduce the supervised MT results on WMT14 en-de. Training works fine with a single GPU (and with multiple GPUs). However, after one epoch I frequently hit an OOM error during the evaluate_mt() step. Here is the script I used and the error message:

python train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en --lgs 'en-de' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0'
(--gpus just indicates the GPU id to use)

Traceback (most recent call last):
  File "train.py", line 325, in <module>
    main(params)
  File "train.py", line 300, in main
    scores = evaluator.run_all_evals(trainer)
  File "/nfs/eecs-fserv/share/yangyil/XLM/src/evaluation/evaluator.py", line 181, in run_all_evals
    self.evaluate_mt(scores, data_set, lang1, lang2, eval_bleu)
  File "/nfs/eecs-fserv/share/yangyil/XLM/src/evaluation/evaluator.py", line 377, in evaluate_mt
    word_scores, loss = decoder('predict', tensor=dec2, pred_mask=pred_mask, y=y, get_scores=True)
  File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 313, in forward
    return self.predict(**kwargs)
  File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 416, in predict
    scores, loss = self.pred_layer(masked_tensor, y, get_scores)
  File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 132, in forward
    loss = F.cross_entropy(scores, y, reduction='elementwise_mean')
  File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 975, in log_softmax
    return input.log_softmax(dim)
RuntimeError: CUDA error: out of memory

The OOM always happens inside F.cross_entropy(), although cross_entropy does not trigger it every time. Do you have any idea how to make this more stable?
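One workaround I'm experimenting with (just a sketch of a generic PyTorch pattern, not something in XLM; the helper name and chunk size below are made up for illustration) is to compute the evaluation loss over the prediction positions in chunks, so log_softmax never materializes the full (n_tokens, vocab_size) buffer at once:

import torch
import torch.nn.functional as F

def chunked_cross_entropy(scores, y, chunk_size=1024):
    # scores: (n_tokens, vocab_size), y: (n_tokens,)
    # Sum the loss per chunk, then divide by the total number of tokens so the
    # result matches F.cross_entropy(..., reduction='elementwise_mean').
    total_loss, n_tokens = 0.0, y.size(0)
    with torch.no_grad():  # evaluation only, no need to keep activations
        for start in range(0, n_tokens, chunk_size):
            s = scores[start:start + chunk_size]
            t = y[start:start + chunk_size]
            total_loss += F.cross_entropy(s, t, reduction='sum').item()
    return total_loss / max(n_tokens, 1)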

One more thing: I use PyTorch 0.4.1 and did not run into #15, but if I upgrade to 1.0.1 I hit another error, pytorch/pytorch#13273 (_queue_reduction() doesn't accept a torch.distributed.ProcessGroupNCCL object).

Best.
Yilin

@yilinyang7
Author

Sorry for the bother, I think I found out why.
When replicating the WMT14 En-De results, I used the preprocessing scripts from Fairseq, which filter out training sentences longer than 250 tokens. Because of that, training never sees very long sequences, but the validation set still contains some really long ones, and those are what cause the OOM during evaluation.
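For reference, the filtering I mean is roughly this (a rough Python equivalent of the Moses/fairseq clean-corpus step I ran, not the actual script; the real preprocessing also checks minimum length and length ratio):

def filter_parallel_corpus(src_path, tgt_path, out_src, out_tgt, max_len=250):
    # Drop parallel sentence pairs where either side is longer than max_len
    # whitespace tokens. The training data went through a step like this, but
    # the validation data did not, which is why very long sentences only show
    # up at evaluation time.
    with open(src_path, encoding='utf-8') as fs, open(tgt_path, encoding='utf-8') as ft, \
         open(out_src, 'w', encoding='utf-8') as fos, open(out_tgt, 'w', encoding='utf-8') as fot:
        for src, tgt in zip(fs, ft):
            if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
                fos.write(src)
                fot.write(tgt)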
