Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase the default for runner.poll.interval config option to 60 #4150

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jun 4, 2020

Fixes #4149

This option is used to define the poll_interval argument of all
Runner instances that are created to run all processes. It determines
the interval with which the state of a subprocess, that a WorkChain is
waiting on, is checked whether it is terminated. If that is the case, a
callback is called which will signal to the WorkChain that it can
continue.

This polling is a backup mechanism in case the broadcast by the process
when it terminates is missed by the caller, which would cause it to wait
indefinitely. The original default of 1 was causing unnecessary load on
the CPUs as well as the database that each time had to query for the
process state. When running many calculations this would spin the CPUs
noticeably. Since this is supposed to be a fail-safe mechanism and it
should only be required rarely, it is fine to increase the time
significantly to reduce the load.

@sphuber sphuber requested review from muhrin and zhubonan June 4, 2020 21:08
@sphuber
Copy link
Contributor Author

sphuber commented Jun 4, 2020

I have the feeling we should maybe test this in the wild before merging. Maybe we have silently been relying a lot on the polling backup mechanism and by changing this now, we would maybe see an unexpected drop in throughput, albeit at a significantly lower CPU cost 😅 I will try to test myself soon, but if anyone else could give it a spin as well, that'd be great. Until then I will put the PR as on hold.

@sphuber sphuber added the pr/on-hold PR should not be merged label Jun 4, 2020
@sphuber sphuber force-pushed the fix/4149/runner-poll-interval-default branch from 567c972 to 7442aee Compare June 4, 2020 21:40
@zhubonan
Copy link
Contributor

zhubonan commented Jun 4, 2020

@sphuber Thanks. I guess the testing would be more applicable for a high-throughput case right (actively doing lots of processes)?

I have added logging in my production instance anyway and will see if I can find any. It isn't that active though. There are many CalcJob, but they finish rather slowly (I wish it is the opposite 😂!).

@codecov
Copy link

codecov bot commented Jun 4, 2020

Codecov Report

Merging #4150 into develop will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4150      +/-   ##
===========================================
+ Coverage    78.89%   78.89%   +0.01%     
===========================================
  Files          467      467              
  Lines        34500    34500              
===========================================
+ Hits         27214    27215       +1     
+ Misses        7286     7285       -1     
Flag Coverage Δ
#django 70.82% <ø> (+0.01%) ⬆️
#sqlalchemy 71.69% <ø> (ø)
Impacted Files Coverage Δ
aiida/manage/configuration/options.py 75.00% <ø> (ø)
aiida/transports/plugins/local.py 80.47% <0.00%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f61dd19...8b401f0. Read the comment docs.

@unkcpz
Copy link
Member

unkcpz commented Jun 5, 2020

def call_on_process_finish(self, pk, callback):
        """
        Callback to be called when the process of the given pk is terminated

        :param pk: the pk of the process
        :param callback: the function to be called upon process termination
        """
        process = load_node(pk=pk)
        self._poll_process(process, callback)

I am not very sure but I think poll calculation is the only mechanism here rather than the backup for the rmq broadcast. Since the on_process_finished callback (workchain::action_awaitables) is only registered in _poll_calculation but not to rmq.
But for sure, get_calculation_futuire uses rmq broker and sets poll mechanism as backup, however this method is not called elsewhere in the code base.

Or did I get it wrong?

@sphuber
Copy link
Contributor Author

sphuber commented Jun 5, 2020

I think you are right @unkcpz . I have also checked the code and I cannot see where the parent is subscribing to state transition messages from the child, so it continuing relies solely on the polling mechanism. I wonder now if the subscribing was broken by accident at some point or if it actually never went in the code. Anyway, I will open a new issue for this and add a fix. This might actually give us an unexpected performance boost since we were actually not relying on event-based processing at all.

@unkcpz
Copy link
Member

unkcpz commented Jun 5, 2020

Ha, yes. I also suggest to add a logging message to _poll_ function of CalculationFuture, even it is not used elsewhere at the moment.
Anyway, this is a good fix now, I encountered the same performance issues because of this, now the cpu burden is lighter.

@sphuber sphuber force-pushed the fix/4149/runner-poll-interval-default branch from 7442aee to 187c0c4 Compare June 5, 2020 16:50
This option is used to define the `poll_interval` argument of all
`Runner` instances that are created to run all processes. It determines
the interval with which the state of a subprocess, that a `WorkChain` is
waiting on, is checked whether it is terminated. If that is the case, a
callback is called which will signal to the `WorkChain` that it can
continue.

This polling is a backup mechanism in case the broadcast by the process
when it terminates is missed by the caller, which would cause it to wait
indefinitely. The original default of 1 was causing unnecessary load on
the CPUs as well as the database that each time had to query for the
process state. When running many calculations this would spin the CPUs
noticeably. Since this is supposed to be a fail-safe mechanism and it
should only be required rarely, it is fine to increase the time
significantly to reduce the load.
@sphuber sphuber force-pushed the fix/4149/runner-poll-interval-default branch from 187c0c4 to e605430 Compare June 17, 2020 15:03
@sphuber
Copy link
Contributor Author

sphuber commented Jun 17, 2020

@muhrin care to give this daunting PR a go? ;)

Copy link
Contributor

@zhubonan zhubonan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me. I have been running with 60 seconds polling interval for a few days so far, no issue found.

@zhubonan
Copy link
Contributor

By the way, is there a plan to re-enable the broadcast mechanism for informing process completion, so the polling is for fail-safe as originally intended?

@sphuber
Copy link
Contributor Author

sphuber commented Jun 18, 2020

By the way, is there a plan to re-enable the broadcast mechanism for informing process completion, so the polling is for fail-safe as originally intended?

Already merged in #4154 😄

@sphuber sphuber merged commit b807740 into aiidateam:develop Jun 18, 2020
@sphuber sphuber deleted the fix/4149/runner-poll-interval-default branch June 18, 2020 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/on-hold PR should not be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Increase the runner.poll.interval default value
4 participants