INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Retval[0] does not have value #18

Closed
zhangshuaitao opened this issue Oct 20, 2017 · 14 comments

Comments

@zhangshuaitao

INFO:tensorflow:global step 109662: loss = 5.3843 (0.160 sec/step)
INFO:tensorflow:global step 109663: loss = 4.5832 (0.256 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Retval[0] does not have value
INFO:tensorflow:global step 109664: loss = 8.8361 (0.098 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "./train_seglink.py", line 275, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "./train_seglink.py", line 271, in main
train(train_op)
File "./train_seglink.py", line 260, in train
session_config = sess_config
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 759, in train
sv.saver.save(sess, sv.save_path, global_step=sv.global_step)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 296, in stop_on_exception
yield
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 494, in run
self.run_loop()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 994, in run_loop
self._sv.global_step])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[0] does not have value
zst@zst-robot1:~/zst/seglink-master$
My TensorFlow version is 1.2.1. I also tried running on 1.1.0, and the same error occurs.

@zhangshuaitao
Author

@dengdan

@dengdan
Owner

dengdan commented Oct 20, 2017

Please provide the command you are running, including the parameters.

@zhangshuaitao
Author

sudo python ./train_seglink.py --dataset_name=icdar2015 --dataset_dir=/home/zst/result --batch_size=1 --train_dir=/home/zst/zst/seglink-master/train_dir --checkpoint_path=/home/zst/zst/seglink

@zhangshuaitao
Author

@dengdan I changed the learning rate to 0.00001.

@dengdan
Owner

dengdan commented Oct 21, 2017

Why sudo?

@zhangshuaitao
Author

Because I use sudo to get Python 2.7. Without sudo, it uses the Anaconda Python.
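
A quick way to see which interpreter and which TensorFlow install each invocation picks up is to run the same small script once with sudo and once without, then compare the output. The sketch below is only an illustration: the helper name `describe_interpreter` is hypothetical and not part of this repository, and it assumes nothing beyond the standard library.

```python
import sys


def describe_interpreter(module_name="tensorflow"):
    """Report the running interpreter and where `module_name` is loaded from.

    Hypothetical helper for comparing `python` against `sudo python`.
    """
    info = {
        "executable": sys.executable,        # path of the interpreter binary
        "version": sys.version.split()[0],   # e.g. "2.7.12"
        "module_file": None,
        "module_version": None,
    }
    try:
        module = __import__(module_name)
    except ImportError:
        # The module is not installed for this interpreter.
        return info
    info["module_file"] = getattr(module, "__file__", None)
    info["module_version"] = getattr(module, "__version__", None)
    return info
```

Printing `describe_interpreter()` under both invocations shows whether they resolve to different binaries or to different TensorFlow versions.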

@zhangshuaitao
Author

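I have another question: roughly how many iterations does it take before training starts to converge? I have already run 500,000 iterations and it still has not converged.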

@dengdan
Owner

dengdan commented Oct 21, 2017

  1. Try not to use sudo to run Python 2.7; something might go wrong because of bin path differences.
  2. The loss will decrease at the very beginning of training if everything works well. The final loss will be around 1.0.
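
To check whether the loss is trending down toward that range, the training log can be summarized offline. The following is a minimal sketch that assumes log lines in the format shown at the top of this issue (`global step N: loss = X`). The names `parse_losses` and `mean_recent_loss` are hypothetical and not part of the repository.

```python
import re

# Matches lines such as:
# INFO:tensorflow:global step 109662: loss = 5.3843 (0.160 sec/step)
STEP_RE = re.compile(r"global step (\d+): loss = (\S+)")


def parse_losses(lines):
    """Return a list of (step, loss) pairs; lines without a loss are skipped."""
    pairs = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def mean_recent_loss(pairs, last_n):
    """Average the loss over the last `last_n` parsed steps, or None if empty."""
    recent = [loss for _, loss in pairs[-last_n:]]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

With batch_size=1 the per-step loss is noisy, so comparing averages over a few thousand steps at different points in training is more informative than looking at single values.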

@zhangshuaitao
Author

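@dengdan I followed your suggestion and installed TensorFlow 1.1.0 on another machine, but the problem is the same. However, each time before the error is raised, the corresponding checkpoint is saved automatically, and when I start training again it continues from the previous iteration count. Does this have a big impact on the results?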

@dengdan
Owner

dengdan commented Oct 22, 2017

Well, I am not sure about it, because the reason for the bug has not been figured out yet, and how the bug can be reproduced is also unknown.
But, do you mean that the training can be restarted and work well after the error?

@zhangshuaitao
Author

yes
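
Since training resumes from the saved checkpoint after a restart, one stopgap is a small wrapper that reruns the training command whenever it exits with an error. This is only a sketch under that assumption. It does not address the cause of the error, which is not identified in this thread, and the name `run_with_restarts` is hypothetical.

```python
import subprocess
import time


def run_with_restarts(command, max_restarts=5, delay_secs=0.0):
    """Run `command`, rerunning it after a non-zero exit.

    Stops on the first successful exit or once `max_restarts` reruns
    have been used. Returns (last_exit_code, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.call(command)
        if exit_code == 0 or attempts > max_restarts:
            return exit_code, attempts
        time.sleep(delay_secs)  # brief pause before the next attempt
```

The command would be passed as an argument list, for example `["python", "./train_seglink.py", ...]` with the same flags as in the command posted above.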

@zhangshuaitao
Author

When training reached step 500000, I found that fmean dropped, which is probably overfitting. Can I know the size of your training set? I guess the reason for the decline in fmean is that the training set is not large enough.

@dengdan
Owner

dengdan commented Oct 23, 2017

SynthText 0.8M, IC15 1000.

@dengdan closed this as completed Oct 27, 2017
@Donaghys

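Quoting @zhangshuaitao: "When training reached step 500000, I found that fmean dropped, which is probably overfitting. Can I know the size of your training set? I guess the reason for the decline in fmean is that the training set is not large enough."

Hello, did you manage to get the loss to decrease by tuning the parameters?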
