KeyError: 'validation/main/loss' in pytorch when using multiple GPUs in training #1723

Closed · Labels: Bug (bug should be fixed)
LCF2764 opened this issue Mar 23, 2020 · 9 comments

@LCF2764 commented Mar 23, 2020

Hi,
I am using the aishell recipe and running asr_train.py. With a single GPU it works well, but with 2 GPUs training stops at the end of the first epoch with the error KeyError: 'validation/main/loss':

/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
Exception in main training loop: 'validation/main/loss'
Traceback (most recent call last):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
    if entry.trigger(self):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
    value = float(stats[key])  # copy to CPU
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 368, in <module>
    main(sys.argv[1:])
  File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 355, in main
    train(args)
  File "/home/lcf/espnet/espnet/asr/pytorch_backend/asr.py", line 631, in train
    trainer.run()
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 376, in run
    six.reraise(*exc_info)
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
    if entry.trigger(self):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
    value = float(stats[key])  # copy to CPU
KeyError: 'validation/main/loss'
# Accounting: time=129 threads=1
# Ended (code 1) at Mon Mar 23 12:46:57 CST 2020, elapsed time 129 seconds

How to fix this?
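
For context, here is a minimal sketch (the toy model and tensors are made up, not from ESPnet) of what that first UserWarning describes: nn.DataParallel gathers the 0-dim scalar loss from each GPU into a 1-D vector, which usually needs to be reduced back to a scalar before it is logged.

    import torch
    import torch.nn as nn

    # Each replica returns a 0-dim scalar "loss"; nn.DataParallel gathers these
    # into a vector of length num_gpus, which triggers the UserWarning above.
    class ToyLossModel(nn.Module):
        def forward(self, x):
            return x.mean()  # 0-dim tensor per replica

    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(ToyLossModel()).cuda()
        loss = model(torch.randn(8, 4).cuda())  # shape: (num_gpus,)
        loss = loss.mean()                      # reduce to a scalar before reporting it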

@LCF2764 (Author) commented Mar 23, 2020

The problem seems to be in this code snippet (espnet/espnet/asr/pytorch_backend/asr.py, lines 571-575):

    # Save best models
    trainer.extend(snapshot_object(model, 'model.loss.best'),
                   trigger=training.triggers.MinValueTrigger('validation/main/loss'))
    if mtl_mode != 'ctc':
        trainer.extend(snapshot_object(model, 'model.acc.best'),
                       trigger=training.triggers.MaxValueTrigger('validation/main/acc'))

When I comment out this code, training runs normally, but then the best models are not saved and "main/loss_ctc", "main/loss_att", "main/acc", and "main/loss" no longer appear in the log file.
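
For reference, the traceback boils down to the lookup below failing because the key was never recorded for that epoch (the dict contents here are hypothetical):

    # Hypothetical summary of the logged observations for an epoch in which the
    # validation loss was never recorded (e.g. because reporting it failed under
    # multi-GPU training); only training-side keys are present.
    stats = {'main/loss': 95.4, 'main/acc': 0.12, 'epoch': 1}

    key = 'validation/main/loss'
    value = float(stats[key])  # KeyError: 'validation/main/loss', as in the traceback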

@sw005320 (Contributor)

What kind of multi-GPU environment are you using?
No one else has reported this error so far.

@LCF2764 (Author) commented Mar 23, 2020

Thanks for your reply! This is my environment:

- Ubuntu 16.04.5
- python version: `3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]`
- espnet version: `espnet 0.6.2`
- chainer version: `chainer 7.2.0`
- pytorch version: `pytorch 1.4.0`
- CUDA Version: 10.2
- GPU: 2 x GTX1080

@sw005320 (Contributor) commented Mar 23, 2020

OK. I'm concerned about the following warning:

/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '

Could you try it with an older version of pytorch?
Some interface changes may have happened in the latest pytorch.

@LCF2764 (Author) commented Mar 24, 2020

I have tried several older versions of pytorch, including:

  • pytorch 1.0.0
  • pytorch 1.0.1
  • pytorch 1.1.0
  • pytorch 1.2.0

but all of them show the same problem.

@sw005320 added the Bug (bug should be fixed) label on Mar 24, 2020
@sw005320 (Contributor)

Many thanks.
This looks like a bug.
I'll test it, but I cannot easily use multiple GPUs, so I'm not sure I can debug it.
Please keep posting updates.

@sw005320 (Contributor)

I just tested it and I did not get any errors.
My environment is as follows (could you try chainer 6.0.0?):

  • python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
  • espnet version: espnet 0.6.2
  • chainer version: chainer 6.0.0
  • pytorch version: pytorch 1.0.1.post2
  • Git hash: 41cfa571c273f4c68ce9826914ee03e28305702f
    • Commit date: Mon Mar 23 09:10:14 2020 -0400
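
If it helps to compare setups, a quick standalone check (not part of ESPnet) of which versions the training environment actually imports:

    # Confirm that the training job picks up the intended chainer / pytorch versions.
    import chainer
    import torch

    print("chainer:", chainer.__version__)  # 6.0.0 in the environment above
    print("torch:", torch.__version__)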

@LCF2764 (Author) commented Mar 25, 2020

Thank you very much!
After I installed chainer 6.0.0, the problem was solved.
Thanks!

@sw005320 (Contributor)

Cool!
This is a very good note.
We may need to stick with chainer 6.0.0.
I'd like to leave things as they are (because our default chainer version in the Makefile is 6.0.0), but if many people get stuck on this because of that, we will either ask them to pin chainer to 6.0.0 or fix this bug.
