Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TimeoutError while running many workflows #4745

Closed
giovannipizzi opened this issue Feb 11, 2021 · 12 comments · Fixed by #4776
Closed

TimeoutError while running many workflows #4745

giovannipizzi opened this issue Feb 11, 2021 · 12 comments · Fixed by #4776
Assignees
Labels
Milestone

Comments

@giovannipizzi
Copy link
Member

While running ~1000 workchains (see description [in this comment]), I got some TimeoutErrors.

These are the workflows that a non-zero error status. Some were for convergence issues of QE, but a few were exceptions:
verdi process status 110696 111576 110019 128494 107304 107365 107405 107439 108306 108636 | grep Except:

    └── PwBaseWorkChain<112063> Excepted [4:validate_resources]
    └── PwBaseWorkChain<112118> Excepted [6:while_(should_run_process)(1:run_process)]
    └── PwBaseWorkChain<110522> Excepted [5:if_(should_run_init)]
        └── PwCalculation<109485> Excepted
        └── PwCalculation<109846> Excepted

Here is their report:
for i in 112063 112118 110522 109485 109846 ; do verdi process report $i ; done

2021-02-10 22:10:17 [35132 | ERROR]: Traceback (most recent call last):
  File "/home/pizzi/git/aiida-core/aiida/manage/external/rmq.py", line 206, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
    await proc.step_until_terminated()
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1213, in step_until_terminated
    await self.step()
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1199, in step
    self.transition_to(next_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 337, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 353, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 322, in transition_to
    self._enter_next_state(new_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 301, in _fire_state_event
    callback(self, hook, state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 324, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
  File "/home/pizzi/git/aiida-core/aiida/engine/processes/process.py", line 389, in on_entered
    super().on_entered(from_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 695, in on_entered
    self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/communications.py", line 154, in broadcast_send
    return self._communicator.broadcast_send(body, sender, subject, correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
    self._communicator.broadcast_send(body=body, sender=sender, subject=subject, correlation_id=correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/pytray/aiothreads.py", line 155, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

2021-02-10 22:10:47 [35168 | ERROR]: Traceback (most recent call last):
  File "/home/pizzi/git/aiida-core/aiida/manage/external/rmq.py", line 206, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
    await proc.step_until_terminated()
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1213, in step_until_terminated
    await self.step()
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1199, in step
    self.transition_to(next_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 337, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 353, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 322, in transition_to
    self._enter_next_state(new_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 301, in _fire_state_event
    callback(self, hook, state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 324, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
  File "/home/pizzi/git/aiida-core/aiida/engine/processes/process.py", line 389, in on_entered
    super().on_entered(from_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 695, in on_entered
    self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/communications.py", line 154, in broadcast_send
    return self._communicator.broadcast_send(body, sender, subject, correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
    self._communicator.broadcast_send(body=body, sender=sender, subject=subject, correlation_id=correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/pytray/aiothreads.py", line 155, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

2021-02-10 22:09:05 [35027 | ERROR]: Traceback (most recent call last):
  File "/home/pizzi/git/aiida-core/aiida/manage/external/rmq.py", line 206, in _continue
    result = await super()._continue(communicator, pid, nowait, tag)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
    await proc.step_until_terminated()
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1213, in step_until_terminated
    await self.step()
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1199, in step
    self.transition_to(next_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 337, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 353, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 322, in transition_to
    self._enter_next_state(new_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 301, in _fire_state_event
    callback(self, hook, state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 324, in <lambda>
    lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
  File "/home/pizzi/git/aiida-core/aiida/engine/processes/process.py", line 389, in on_entered
    super().on_entered(from_state)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 695, in on_entered
    self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/communications.py", line 154, in broadcast_send
    return self._communicator.broadcast_send(body, sender, subject, correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
    self._communicator.broadcast_send(body=body, sender=sender, subject=subject, correlation_id=correlation_id)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/pytray/aiothreads.py", line 155, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

*** 109485: CalcJobState.SUBMITTING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 1 LOG MESSAGES:
+-> ERROR at 2021-02-10 22:10:36.148243+00:00
 | Traceback (most recent call last):
 |   File "/home/pizzi/git/aiida-core/aiida/manage/external/rmq.py", line 206, in _continue
 |     result = await super()._continue(communicator, pid, nowait, tag)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
 |     await proc.step_until_terminated()
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1213, in step_until_terminated
 |     await self.step()
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1199, in step
 |     self.transition_to(next_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 337, in transition_to
 |     self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 353, in transition_failed
 |     raise exception.with_traceback(trace)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 322, in transition_to
 |     self._enter_next_state(new_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
 |     self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 301, in _fire_state_event
 |     callback(self, hook, state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 324, in <lambda>
 |     lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
 |   File "/home/pizzi/git/aiida-core/aiida/engine/processes/process.py", line 389, in on_entered
 |     super().on_entered(from_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 695, in on_entered
 |     self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/communications.py", line 154, in broadcast_send
 |     return self._communicator.broadcast_send(body, sender, subject, correlation_id)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
 |     self._communicator.broadcast_send(body=body, sender=sender, subject=subject, correlation_id=correlation_id)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/pytray/aiothreads.py", line 155, in await_
 |     return self.await_submit(awaitable).result(timeout=self.task_timeout)
 |   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
 |     raise TimeoutError()
 | concurrent.futures._base.TimeoutError
*** 109846: CalcJobState.SUBMITTING
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 1 LOG MESSAGES:
+-> ERROR at 2021-02-10 22:10:45.169995+00:00
 | Traceback (most recent call last):
 |   File "/home/pizzi/git/aiida-core/aiida/manage/external/rmq.py", line 206, in _continue
 |     result = await super()._continue(communicator, pid, nowait, tag)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/process_comms.py", line 613, in _continue
 |     await proc.step_until_terminated()
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1213, in step_until_terminated
 |     await self.step()
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 1199, in step
 |     self.transition_to(next_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 337, in transition_to
 |     self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 353, in transition_failed
 |     raise exception.with_traceback(trace)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 322, in transition_to
 |     self._enter_next_state(new_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 388, in _enter_next_state
 |     self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 301, in _fire_state_event
 |     callback(self, hook, state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 324, in <lambda>
 |     lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
 |   File "/home/pizzi/git/aiida-core/aiida/engine/processes/process.py", line 389, in on_entered
 |     super().on_entered(from_state)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 695, in on_entered
 |     self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/communications.py", line 154, in broadcast_send
 |     return self._communicator.broadcast_send(body, sender, subject, correlation_id)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/kiwipy/rmq/threadcomms.py", line 227, in broadcast_send
 |     self._communicator.broadcast_send(body=body, sender=sender, subject=subject, correlation_id=correlation_id)
 |   File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/pytray/aiothreads.py", line 155, in await_
 |     return self.await_submit(awaitable).result(timeout=self.task_timeout)
 |   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
 |     raise TimeoutError()
 | concurrent.futures._base.TimeoutError

Note that they all happened within ~1 minute. It might have been some slowdown of the machine (I am not sure), but it means that when these timeouts occur, the engine "loses" the job rather than retrying later.

Your environment

  • Operating system: Mac OS X
  • Python version: 3.7.7
  • aiida-core version: develop (will be 1.6), commit 2e18f5b; plumpy==0.18.5a0
@giovannipizzi
Copy link
Member Author

I'm tentatively putting this in milestone 1.6 because I think this might be problematic as it reduces the reliability/robustness of the engine.

@giovannipizzi
Copy link
Member Author

giovannipizzi commented Feb 11, 2021

There's also a strange thing - in the daemon log file, I see different errors (for different nodes). E.g. I have this:

02/10/2021 11:09:57 PM <97679> plumpy.processes: [ERROR] Process<112337> failed to register as an RPC subscriber
Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/processes.py", line 296, in init
    identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/plumpy/communications.py", line 120, in add_rpc_subscriber
    return self._communicator.add_rpc_subscriber(converted, identifier)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/kiwipy/rmq/threadcomms.py", line 184, in add_rpc_subscriber
    self._communicator.add_rpc_subscriber(self._wrap_subscriber(subscriber), identifier)
  File "/home/pizzi/.virtualenvs/aiida-dev/lib/python3.7/site-packages/pytray/aiothreads.py", line 155, in await_
    return self.await_submit(awaitable).result(timeout=self.task_timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

but verdi process report 112337 has no log messages... In a similar weird way, if I e.g. look for 109485 in the log file (that is one with an error above), I only find this line in the logs:

02/10/2021 11:06:31 PM <97678> aiida.orm.nodes.process.workflow.workchain.WorkChainNode: [REPORT] [108861|PwBaseWorkChain|run_process]: launching PwCalculation<109485> iteration #1

so no error... Any idea why? (even the timing is different) I'm confused!

EDIT: (A note on timing: there could be a 1h shift due to UTC vs local time, and a few minutes difference between the submission and the exception - still it's not clear to me why the messages in the log file and in the process report are not the same)

@chrisjsewell
Copy link
Member

I think I have observed the same error occur in at least one Jenkins run: #4718 (comment)

@mbercx
Copy link
Member

mbercx commented Feb 17, 2021

I've actually also encountered this error recently when running using:

  • Ubuntu 16.04.7 LTS
  • Python 3.7.5
  • aiida-core==1.5.0:
$ verdi process report 4142573
2021-02-08 18:13:09 [2855652 | REPORT]: [4142573|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<4142586>
2021-02-08 18:13:17 [2855653 |  ERROR]: Traceback (most recent call last):
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/manage/external/rmq.py", line 201, in _continue
    result = yield super()._continue(communicator, pid, nowait, tag)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/process_comms.py", line 547, in _continue
    yield proc.step_until_terminated()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 1117, in step_until_terminated
    yield self.step()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 1108, in step
    self.transition_to(next_state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 318, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 332, in transition_failed
    raise exception.with_traceback(trace)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 302, in transition_to
    self._enter_next_state(new_state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 363, in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERING_STATE, next_state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/state_machine.py", line 281, in _fire_state_event
    callback(self, hook, state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 310, in <lambda>
    state_machine.StateEventHook.ENTERING_STATE, lambda _s, _h, state: self.on_entering(state)
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/processes/process.py", line 329, in on_entering
    super().on_entering(state)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/processes.py", line 612, in on_entering
    call_with_super_check(self.on_wait, state.data)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/base/utils.py", line 28, in call_with_super_check
    wrapped(*args, **kwargs)
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/processes/workchains/workchain.py", line 260, in on_wait
    self.action_awaitables()
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/processes/workchains/workchain.py", line 274, in action_awaitables
    self.runner.call_on_process_finish(awaitable.pk, callback)
  File "/home/aiida/code/aiida/env/3dd/aiida-core/aiida/engine/runners.py", line 311, in call_on_process_finish
    self._communicator.add_broadcast_subscriber(broadcast_filter, subscriber_identifier)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/plumpy/communications.py", line 125, in add_broadcast_subscriber
    return self._communicator.add_broadcast_subscriber(converted, identifier)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 592, in add_broadcast_subscriber
    return self._run_task(coro)
  File "/home/aiida/.virtualenvs/aiida_3dd/lib/python3.7/site-packages/kiwipy/rmq/communicator.py", line 677, in _run_task
    return self.tornado_to_kiwi_future(self._create_task(coro)).result(timeout=self.TASK_TIMEOUT)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 437, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError

At the time I was running 2000 work chains concurrently. In a couple of weeks, running thousands of calculations, I only had the error 5 times.

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 17, 2021

yeh so this timeout is currently hardcoded to 5 seconds, which appears is not long enough under heavy load.

This could be set to a different value here:

return kiwipy.rmq.RmqThreadCommunicator.connect(

(by passing async_task_timeout)

So I think I can tie this into my #4712 PR, to set it as a configurable value (with a higher default)

@chrisjsewell
Copy link
Member

see 1a3a958

@giovannipizzi
Copy link
Member Author

I think making it configurable can alleviate the problem. However, my question is: should this exception except the process? Can it just be ignored, or e.g. put the process in pause instead (if ignoring is not safe), e.g. after a few failed attempts? In this way workflows of users are not disrupted if there is a timeout, they just need to replay them when the load is lower.

@chrisjsewell
Copy link
Member

I'll look into it, but I don't think it will be trivial

@chrisjsewell
Copy link
Member

See aiidateam/plumpy#209, which addresses the main timeout here, of the broadcast timing out.

@chrisjsewell
Copy link
Member

There's also a strange thing - in the daemon log file, I see different errors (for different nodes). E.g. I have this
so no error... Any idea why? (even the timing is different) I'm confused!

This exception is logged when the process is being created: https://github.com/aiidateam/plumpy/blob/b1bde82403be36a76525b0c6359a175a422c0c1c/plumpy/processes.py#L298-L299.
Perhaps logging to the node has not been initialised at this point 🤷

It is also of note, if the RPC/Broadcast subscribes do fail to register when the process is being created on the daemon, it will mean that these processes will not be able to receive kill/pause/play/status messages.
Catching of the TimeoutError was added in aiidateam/plumpy#81
@sphuber/@muhrin do you think this is ideal behaviour or, if being recreated on the daemon, could/should this except so that it can be re-queued by RMQ?

Anyway I think this should be opened as a separate specific issue

chrisjsewell added a commit to aiidateam/plumpy that referenced this issue Feb 24, 2021
When using an RMQ communicator, the broadcast can timeout on heavy loads to RMQ
(for example see aiidateam/aiida-core#4745).
This broadcast is not critical to the running of the process,
and so a timeout should not except it.

Also ensure the process PID is included in all log messages.
@chrisjsewell chrisjsewell linked a pull request Feb 24, 2021 that will close this issue
@chrisjsewell
Copy link
Member

I've moved the logging issue to #4777 and #4776 will close the original issue

@chrisjsewell
Copy link
Member

@giovannipizzi @mbercx fell free to test and re-open if any issue persists

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants