TimeoutError while running many workflows #4745
Comments
I'm tentatively putting this in milestone 1.6 because I think this might be problematic, as it reduces the reliability/robustness of the engine.
There's also a strange thing: in the daemon log file, I see different errors (for different nodes). E.g. I have this:
but
so no error... Any idea why? (Even the timing is different.) I'm confused! EDIT: a note on timing: there could be a 1 h shift due to UTC vs local time, and a few minutes' difference between the submission and the exception; still, it's not clear to me why the messages in the log file and in the process report are not the same.
I think I have observed the same error occur in at least one Jenkins run: #4718 (comment)
I've actually also encountered this error recently when running:

```
$ verdi process report 4142573
2021-02-08 18:13:09 [2855652 | REPORT]: [4142573|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<4142586>
2021-02-08 18:13:17 [2855653 | ERROR]: Traceback (most recent call last):
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/manage/external/rmq.py", line 201, in _continue
    result = yield super()._continue(communicator, pid, nowait, tag)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/process_comms.py", line 547, in _continue
    yield proc.step_until_terminated()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 1117, in step_until_terminated
    yield self.step()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 1108, in step
    self.transition_to(next_state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 318, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 332, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 302, in transition_to
    self._enter_next_state(new_state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 363, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERING_STATE, next_state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 281, in _fire_state_event
    callback(self, hook, state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 310, in <lambda>
    state_machine.StateEventHook.ENTERING_STATE, lambda _s, _h, state: self.on_entering(state)
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/processes/process.py", line 329, in on_entering
    super().on_entering(state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 612, in on_entering
    call_with_super_check(self.on_wait, state.data)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/utils.py", line 28, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/processes/workchains/workchain.py", line 260, in on_wait
    self.action_awaitables()
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/processes/workchains/workchain.py", line 274, in action_awaitables
    self.runner.call_on_process_finish(awaitable.pk, callback)
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/runners.py", line 311, in call_on_process_finish
    self._communicator.add_broadcast_subscriber(broadcast_filter, subscriber_identifier)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/communications.py", line 125, in add_broadcast_subscriber
    return self._communicator.add_broadcast_subscriber(converted, identifier)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 592, in add_broadcast_subscriber
    return self._run_task(coro)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 677, in _run_task
    return self.tornado_to_kiwi_future(self._create_task(coro)).result(timeout=self.TASK_TIMEOUT)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError
```

At the time I was running 2000 work chains concurrently. Over a couple of weeks of running thousands of calculations, I only hit the error 5 times.
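The final frame of the traceback is kiwipy waiting on `future.result(timeout=self.TASK_TIMEOUT)`. A minimal stdlib sketch of how that call raises when the task cannot finish within the timeout (the timeout and task durations here are illustrative, not kiwipy's actual values):

```python
import concurrent.futures
import time

def wait_with_timeout(timeout: float) -> str:
    """Submit a slow task and wait on its future with a timeout, as kiwipy's
    _run_task does with ``future.result(timeout=self.TASK_TIMEOUT)``."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # A task that takes longer than a too-short timeout.
        future = pool.submit(time.sleep, 0.5)
        try:
            future.result(timeout=timeout)
            return "completed"
        except concurrent.futures.TimeoutError:
            return "timed out"

print(wait_with_timeout(0.1))  # → timed out
print(wait_with_timeout(2.0))  # → completed
```

Under heavy load, the registration coroutine simply cannot be scheduled and completed within the fixed window, so the waiting future raises exactly this `concurrent.futures._base.TimeoutError`.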
Yeah, so this timeout is currently hardcoded to 5 seconds, which appears not to be long enough under heavy load. It could be set to a different value here: aiida-core/aiida/manage/manager.py, line 253 in e794287
(by passing it there). So I think I can tie this into my #4712 PR, to make it a configurable value (with a higher default).
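A minimal sketch of what making the value configurable could look like (the option name `rmq.task_timeout`, the default of 60 s, and `get_task_timeout` are all hypothetical illustrations, not aiida-core's actual config API):

```python
# Hypothetical sketch: resolve the task timeout from a config mapping
# instead of hardcoding 5 seconds, falling back to a higher default.
DEFAULT_TASK_TIMEOUT = 60.0  # seconds; illustrative default, higher than the old 5

def get_task_timeout(config: dict) -> float:
    """Return the configured task timeout, or the default when unset."""
    value = config.get("rmq.task_timeout", DEFAULT_TASK_TIMEOUT)
    return float(value)
```

The resolved value would then be handed to the communicator when it is constructed, so heavily loaded installations can raise it without patching the source.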
see 1a3a958
I think making it configurable can alleviate the problem. However, my question is: should this exception except the process? Could it just be ignored, or, if ignoring is not safe, could the process be paused instead, e.g. after a few failed attempts? That way users' workflows are not disrupted by a timeout; they just need to replay them when the load is lower.
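The "retry a few times, then pause rather than except" idea could be sketched roughly like this (all names are hypothetical and stand in for the engine's actual machinery):

```python
import time

def try_with_retries(action, attempts: int = 3, delay: float = 0.0) -> str:
    """Run ``action`` up to ``attempts`` times; if every attempt raises
    TimeoutError, report that the process should be paused, not excepted."""
    for _ in range(attempts):
        try:
            action()
            return "ok"
        except TimeoutError:
            if delay:
                time.sleep(delay)  # back off before retrying
    return "pause"  # give up for now; the user can replay when load drops
```

Under this scheme a transient RMQ slowdown costs at most a pause-and-replay, instead of putting every affected workflow into an excepted (terminal) state.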
I'll look into it, but I don't think it will be trivial.
See aiidateam/plumpy#209, which addresses the main timeout here, of the broadcast timing out. |
This exception is logged when the process is being created: https://github.com/aiidateam/plumpy/blob/b1bde82403be36a76525b0c6359a175a422c0c1c/plumpy/processes.py#L298-L299. It is also worth noting that if the RPC/broadcast subscribers do fail to register when the process is being created on the daemon, those processes will not be able to receive kill/pause/play/status messages. Anyway, I think this should be opened as a separate, specific issue.
When using an RMQ communicator, the broadcast can timeout on heavy loads to RMQ (for example see aiidateam/aiida-core#4745). This broadcast is not critical to the running of the process, and so a timeout should not except it. Also ensure the process PID is included in all log messages.
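The fix described in that commit message could look roughly like this (a sketch only, not the actual plumpy code; `communicator`, `subscriber` and `pid` stand in for the real objects):

```python
import logging

logger = logging.getLogger(__name__)

def register_broadcast_subscriber(communicator, subscriber, pid) -> bool:
    """Try to register a broadcast subscriber; treat a timeout as non-fatal,
    since the broadcast is not critical to running the process."""
    try:
        communicator.add_broadcast_subscriber(subscriber)
        return True
    except TimeoutError:
        # The process keeps running, but will not receive
        # kill/pause/play/status broadcasts until re-registered.
        logger.warning("Process<%s>: broadcast subscriber registration timed out", pid)
        return False
```

Including the PID in the log message makes it possible to tell, after the fact, which processes may be unreachable by control messages.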
@giovannipizzi @mbercx feel free to test and re-open if any issue persists
While running ~1000 workchains (see description [in this comment]), I got some TimeoutErrors.
These are the workflows that have a non-zero exit status. Some were due to convergence issues in QE, but a few were exceptions:
```
verdi process status 110696 111576 110019 128494 107304 107365 107405 107439 108306 108636 | grep Except
```
Here is their report:
```
for i in 112063 112118 110522 109485 109846 ; do verdi process report $i ; done
```
Note that they all happened within ~1 minute. It might have been some slowdown of the machine (I am not sure), but it means that when these timeouts occur, the engine "loses" the job rather than retrying later.
Your environment