KeyError: 'validation/main/loss' in pytorch when using multiple GPUs in training #1723

Closed · Labels: Bug (bug should be fixed)
LCF2764 opened this issue Mar 23, 2020 · 9 comments

@LCF2764 commented Mar 23, 2020

Hi,
I am using the aishell recipe and running asr_train.py. With a single GPU it works well, but with 2 GPUs training stops at the end of the first epoch with the error KeyError: 'validation/main/loss':

/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/matplotlib/tight_layout.py:231: UserWarning: tight_layout : falling back to Agg renderer
  warnings.warn("tight_layout : falling back to Agg renderer")
Exception in main training loop: 'validation/main/loss'
Traceback (most recent call last):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
    if entry.trigger(self):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
    value = float(stats[key])  # copy to CPU
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 368, in <module>
    main(sys.argv[1:])
  File "/home/lcf/espnet/egs/Magicspeech/asr3/../../../espnet/bin/asr_train.py", line 355, in main
    train(args)
  File "/home/lcf/espnet/espnet/asr/pytorch_backend/asr.py", line 631, in train
    trainer.run()
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 376, in run
    six.reraise(*exc_info)
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/trainer.py", line 345, in run
    if entry.trigger(self):
  File "/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/chainer/training/triggers/minmax_value_trigger.py", line 52, in __call__
    value = float(stats[key])  # copy to CPU
KeyError: 'validation/main/loss'
# Accounting: time=129 threads=1
# Ended (code 1) at Mon Mar 23 12:46:57 CST 2020, elapsed time 129 seconds

How to fix this?
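
For context, here is a minimal sketch (the toy model and tensors are made up, not from ESPnet) of what that first UserWarning describes: nn.DataParallel gathers the 0-dim scalar loss from each GPU into a 1-D vector, which usually needs to be reduced back to a scalar before it is logged.

    import torch
    import torch.nn as nn

    # Each replica returns a 0-dim scalar "loss"; nn.DataParallel gathers these
    # into a vector of length num_gpus, which triggers the UserWarning above.
    class ToyLossModel(nn.Module):
        def forward(self, x):
            return x.mean()  # 0-dim tensor per replica

    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(ToyLossModel()).cuda()
        loss = model(torch.randn(8, 4).cuda())  # shape: (num_gpus,)
        loss = loss.mean()                      # reduce to a scalar before reporting it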

@LCF2764 (Author) commented Mar 23, 2020

The problem seems to be in this code snippet (espnet/espnet/asr/pytorch_backend/asr.py, lines 571-575):

    # Save best models
    trainer.extend(snapshot_object(model, 'model.loss.best'),
                   trigger=training.triggers.MinValueTrigger('validation/main/loss'))
    if mtl_mode != 'ctc':
        trainer.extend(snapshot_object(model, 'model.acc.best'),
                       trigger=training.triggers.MaxValueTrigger('validation/main/acc'))

When I comment out this code, training runs normally, but then the best models are not saved and "main/loss_ctc", "main/loss_att", "main/acc", and "main/loss" no longer appear in the log file.
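
For reference, the traceback boils down to the lookup below failing because the key was never recorded for that epoch (the dict contents here are hypothetical):

    # Hypothetical summary of the logged observations for an epoch in which the
    # validation loss was never recorded (e.g. because reporting it failed under
    # multi-GPU training); only training-side keys are present.
    stats = {'main/loss': 95.4, 'main/acc': 0.12, 'epoch': 1}

    key = 'validation/main/loss'
    value = float(stats[key])  # KeyError: 'validation/main/loss', as in the traceback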

@sw005320 (Contributor)

What kind of multi-GPU environment are you using?
No one else has reported this error so far.

@LCF2764 (Author) commented Mar 23, 2020

Thanks for your reply! This is my environment:

- Ubuntu 16.04.5
- python version: `3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)  [GCC 7.3.0]`
- espnet version: `espnet 0.6.2`
- chainer version: `chainer 7.2.0`
- pytorch version: `pytorch 1.4.0`
- CUDA Version: 10.2
- GPU: 2 x GTX1080

@sw005320 (Contributor) commented Mar 23, 2020

OK. I'm concerned about the following warning:

/home/lcf/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '

Could you try it with an older version of pytorch?
Some interface changes may have happened in the latest pytorch.

@LCF2764 (Author) commented Mar 24, 2020

I have tried several older versions of pytorch, including:

  • pytorch 1.0.0
  • pytorch 1.0.1
  • pytorch 1.1.0
  • pytorch 1.2.0

but all of them show the same problem.

@sw005320 added the Bug (bug should be fixed) label on Mar 24, 2020
@sw005320 (Contributor)

Many thanks.
This looks like a bug.
I'll test it, but I cannot easily use multiple GPUs, so I'm not sure I can debug it.
Please keep posting updates.

@sw005320 (Contributor)

I just tested it and I did not get any errors.
My environment is as follows (could you try chainer 6.0.0?):

  • python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
  • espnet version: espnet 0.6.2
  • chainer version: chainer 6.0.0
  • pytorch version: pytorch 1.0.1.post2
  • Git hash: 41cfa571c273f4c68ce9826914ee03e28305702f
    • Commit date: Mon Mar 23 09:10:14 2020 -0400
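
If it helps to compare setups, a quick standalone check (not part of ESPnet) of which versions the training environment actually imports:

    # Confirm that the training job picks up the intended chainer / pytorch versions.
    import chainer
    import torch

    print("chainer:", chainer.__version__)  # 6.0.0 in the environment above
    print("torch:", torch.__version__)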

@LCF2764 (Author) commented Mar 25, 2020

Thank you very much!
After I installed chainer 6.0.0, the problem was solved.
Thanks!

@sw005320 (Contributor)

Cool!
This is a very good note.
We may need to stick with chainer 6.0.0.
I'd like to leave things as they are (because our default chainer version in the Makefile is 6.0.0), but if many people get stuck on this because of that, we will either ask them to pin chainer to 6.0.0 or fix this bug.
