Increase the default for `runner.poll.interval` config option to 60 #4150

sphuber · 2020-06-04T21:08:22Z

This option defines the `poll_interval` argument of all `Runner`
instances that are created to run processes. It determines the interval
at which the state of a subprocess that a `WorkChain` is waiting on is
checked for termination. Once the subprocess has terminated, a callback
is called to signal to the `WorkChain` that it can continue.

This polling is a backup mechanism in case the caller misses the
broadcast that the process sends when it terminates, which would cause
the caller to wait indefinitely. The original default of 1 second caused
unnecessary load on the CPUs as well as on the database, which had to be
queried for the process state on every poll. When running many
calculations this would spin the CPUs noticeably. Since this is supposed
to be a fail-safe mechanism that should only rarely be required, it is
fine to increase the interval significantly to reduce the load.
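
As a rough illustration of the trade-off described above (not AiiDA code), the sketch below counts how many state checks a single waiting caller makes for a given poll interval. The function name `count_state_queries` and the simple model of one database query per poll are assumptions made for this example.

```python
def count_state_queries(process_duration, poll_interval):
    """Simulate polling a child process that terminates after `process_duration` seconds.

    Hypothetical model: the first poll happens at t=0 and every poll costs one
    query. Returns a tuple (number of queries, time at which termination is seen).
    """
    elapsed = 0
    queries = 0
    while True:
        queries += 1  # one state query per poll
        if elapsed >= process_duration:
            return queries, elapsed
        elapsed += poll_interval  # wait one interval before the next poll


if __name__ == "__main__":
    for interval in (1, 60):
        queries, seen_at = count_state_queries(3600, interval)
        print(f"interval={interval}s: {queries} queries, termination seen at t={seen_at}s")
```

Under this model, a child process that runs for an hour costs 3601 checks at an interval of 1 second but only 61 at 60 seconds. The price is that the caller may notice termination up to one poll interval late.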

sphuber · 2020-06-04T21:09:58Z

I have the feeling we should test this in the wild before merging. Maybe we have silently been relying a lot on the polling backup mechanism, and by changing this now we might see an unexpected drop in throughput, albeit at a significantly lower CPU cost 😅 I will try to test it myself soon, but if anyone else could give it a spin as well, that'd be great. Until then I will put the PR on hold.

zhubonan · 2020-06-04T21:51:56Z

@sphuber Thanks. I guess the testing would be more applicable to a high-throughput case, right (actively running lots of processes)?

I have added logging to my production instance anyway and will see if I can find anything. It isn't that active though. There are many `CalcJob`s, but they finish rather slowly (I wish it were the opposite 😂!).

codecov · 2020-06-04T21:56:31Z

Codecov Report

Merging #4150 into develop will increase coverage by 0.01%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop    #4150      +/-   ##
===========================================
+ Coverage    78.89%   78.89%   +0.01%     
===========================================
  Files          467      467              
  Lines        34500    34500              
===========================================
+ Hits         27214    27215       +1     
+ Misses        7286     7285       -1

| Flag | Coverage Δ | |
|---|---|---|
| #django | `70.82% <ø> (+0.01%)` | ⬆️ |
| #sqlalchemy | `71.69% <ø> (ø)` | |

| Impacted Files | Coverage Δ | |
|---|---|---|
| aiida/manage/configuration/options.py | `75.00% <ø> (ø)` | |
| aiida/transports/plugins/local.py | `80.47% <0.00%> (+0.26%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f61dd19...8b401f0.

unkcpz · 2020-06-05T03:47:53Z

    def call_on_process_finish(self, pk, callback):
        """
        Callback to be called when the process of the given pk is terminated

        :param pk: the pk of the process
        :param callback: the function to be called upon process termination
        """
        process = load_node(pk=pk)
        self._poll_process(process, callback)

I am not entirely sure, but I think polling is the only mechanism here, rather than a backup for the rmq broadcast, since the `on_process_finished` callback (`workchain::action_awaitables`) is only registered in `_poll_calculation` and not with rmq.
Admittedly, `get_calculation_future` does use the rmq broker and sets up the poll mechanism as a backup, but this method is not called anywhere else in the code base.

Or did I get it wrong?
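
To illustrate the pattern being described, here is a minimal, self-contained sketch of a poll that reschedules itself until the process is terminated. It uses plain `asyncio` rather than the actual runner; `poll_process`, `is_terminated` and `demo` are hypothetical names for this example, not AiiDA's implementation.

```python
import asyncio


def poll_process(loop, is_terminated, callback, poll_interval):
    """Call `callback` once `is_terminated()` is true, otherwise poll again later."""
    if is_terminated():
        loop.call_soon(callback)
    else:
        # Nothing else notifies the caller here: the poll just reschedules itself.
        loop.call_later(poll_interval, poll_process, loop, is_terminated, callback, poll_interval)


async def demo():
    """Run the poll against a fake process that reports terminated on the third check."""
    loop = asyncio.get_running_loop()
    done = asyncio.Event()
    checks = {'count': 0}

    def is_terminated():
        checks['count'] += 1
        return checks['count'] >= 3

    poll_process(loop, is_terminated, done.set, poll_interval=0.01)
    await asyncio.wait_for(done.wait(), timeout=1)
    return checks['count']


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With only this path registered, the waiting caller can never continue sooner than the next poll, which is why the interval directly bounds the reaction time.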

sphuber · 2020-06-05T06:49:48Z

I think you are right @unkcpz. I have also checked the code and I cannot see where the parent subscribes to state transition messages from the child, so whether it continues relies solely on the polling mechanism. I wonder now if the subscription was broken by accident at some point or if it actually never made it into the code. Anyway, I will open a new issue for this and add a fix. This might actually give us an unexpected performance boost, since we were not relying on event-based processing at all.
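
For reference, the intended design (event-based notification with polling only as a fail-safe) could look roughly like the sketch below. This is an assumption-laden illustration in plain `asyncio`: `wait_for_termination`, the `broadcast` event and `demo` are hypothetical and do not reflect the actual runner or RabbitMQ code.

```python
import asyncio


async def wait_for_termination(broadcast, is_terminated, poll_interval):
    """Return 'broadcast' or 'poll', depending on which path noticed termination."""
    while True:
        if broadcast.is_set():
            return 'broadcast'
        if is_terminated():  # fail-safe: query the state directly
            return 'poll'
        try:
            # Sleep until the broadcast arrives, but at most one poll interval.
            await asyncio.wait_for(broadcast.wait(), timeout=poll_interval)
        except asyncio.TimeoutError:
            pass


async def demo():
    loop = asyncio.get_running_loop()

    # Case 1: the broadcast is delivered, so a long poll interval does not matter.
    delivered = asyncio.Event()
    loop.call_later(0.01, delivered.set)
    via_event = await wait_for_termination(delivered, lambda: False, poll_interval=5)

    # Case 2: the broadcast is missed; only the poll notices the terminated state.
    missed = asyncio.Event()
    state = {'terminated': False}
    loop.call_later(0.01, state.update, {'terminated': True})
    via_poll = await wait_for_termination(missed, lambda: state['terminated'], poll_interval=0.02)

    return via_event, via_poll


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the first case the caller continues as soon as the event arrives, independent of the poll interval. In the second case the caller is delayed by at most one interval, which is the behaviour a larger default relies on.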

unkcpz · 2020-06-05T06:59:07Z

Ha, yes. I would also suggest adding a logging message to the _poll_ function of CalculationFuture, even if it is not used elsewhere at the moment.
Anyway, this is a good fix for now. I encountered the same performance issues because of this, and now the CPU burden is lighter.

sphuber · 2020-06-17T20:39:35Z

@muhrin care to give this daunting PR a go? ;)

zhubonan

Looks OK to me. I have been running with a 60-second polling interval for a few days so far and have found no issues.

zhubonan · 2020-06-18T08:23:04Z

By the way, is there a plan to re-enable the broadcast mechanism for informing process completion, so the polling is for fail-safe as originally intended?

sphuber · 2020-06-18T08:58:29Z

> By the way, is there a plan to re-enable the broadcast mechanism for informing process completion, so the polling is for fail-safe as originally intended?

Already merged in #4154 😄

sphuber requested review from muhrin and zhubonan June 4, 2020 21:08

sphuber added the `pr/on-hold` label (PR should not be merged) Jun 4, 2020

sphuber force-pushed the fix/4149/runner-poll-interval-default branch from 567c972 to 7442aee June 4, 2020 21:40

sphuber force-pushed the fix/4149/runner-poll-interval-default branch from 7442aee to 187c0c4 June 5, 2020 16:50

sphuber force-pushed the fix/4149/runner-poll-interval-default branch from 187c0c4 to e605430 June 17, 2020 15:03

giovannipizzi approved these changes Jun 18, 2020

Merge branch 'develop' into fix/4149/runner-poll-interval-default

8b401f0

zhubonan approved these changes Jun 18, 2020

sphuber merged commit b807740 into aiidateam:develop Jun 18, 2020

sphuber deleted the fix/4149/runner-poll-interval-default branch June 18, 2020 09:24
