
High idle CPU usage #378

Closed
MichaIng opened this issue Jul 21, 2021 · 15 comments
Labels
bug (Something is broken) · triage

Comments

@MichaIng
Contributor

MichaIng commented Jul 21, 2021

❓ I'm submitting a ...

  • 🐞 bug report
  • 🐣 feature request
  • ❓ question about the decisions made in the repository

🐞 Describe the bug. What is the current behavior?
The idle CPU load increased significantly with #199 and was multiplied further by #308. The new connection manager seems to imply a loop which significantly loads the CPU, compared to before. With the second PR, the timeout was reduced from 0.1 to 0.01, which implies roughly a tenfold increase in loop iterations and hence a multiplied CPU load. Now #311 addressed the regression of #199 in a better way, making the reduced timeout obsolete, as far as I understand. I just tried changing the 0.01 here back to 0.1, basically reverting #308, and indeed the CPU load goes back to more or less match the state of #199, or even lower.
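
The effect described above can be made visible with a rough, self-contained illustration (not cheroot's actual code): with nothing to select, a select() timeout of t makes the manager loop wake roughly 1/t times per second, so cutting t from 0.1 to 0.01 multiplies the idle wakeups, and hence the idle CPU work, by roughly ten.

```python
import selectors
import socket
import time

# Illustrative only, not cheroot's code: count idle select() wakeups.
sel = selectors.DefaultSelector()
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen()
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ)

def idle_wakeups(timeout, duration=0.5):
    """Count how often an idle select() loop wakes within `duration`."""
    wakeups = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        sel.select(timeout=timeout)  # always times out: no traffic
        wakeups += 1
    return wakeups

slow = idle_wakeups(0.1)   # roughly  5 wakeups in 0.5 s
fast = idle_wakeups(0.01)  # roughly 50 wakeups in 0.5 s
print(fast > slow)  # True: the shorter timeout wakes far more often
```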

❓ What is the motivation / use case for changing the behavior?
The absolute CPU load may be small, but especially on smaller SBCs this is quite an issue. So to keep cheroot/CherryPy a great HTTP server for embedded devices, it would be great to reduce idle CPU usage to a minimum. For reference: HTPC-Manager/HTPC-Manager#30

💡 To Reproduce

Steps to reproduce the behavior:

git clone https://github.com/HTPC-Manager/HTPC-Manager.git
cd HTPC-Manager
git checkout MichaIng-CherryPy
pip3 install -Ur requirements.txt
python3 Htpc.py
# Check CPU usage of that process
pip3 install -U cherrypy
python3 Htpc.py
# Check and compare CPU usage
pip3 install -U 'cheroot==8.0.0'
python3 Htpc.py
# Check and compare CPU usage
pip3 install -U 'cheroot==8.10'
python3 Htpc.py
# Check and compare CPU usage
pip3 install -U 'cheroot==8.4.2'
python3 Htpc.py
# Check and compare CPU usage
pip3 install -U 'cheroot'
# Adjust the path below according to your Python version
sed -i 's/timeout=0\.01/timeout=0\.1/' /usr/local/lib/python3.9/dist-packages/cheroot/connections.py
python3 Htpc.py
# Check and compare CPU usage

💡 Expected behavior

I expect a minimal CPU usage when the HTTP server is not accessed at all.

📋 Details

No logs; I basically measured via htop -d 10. Not a very precise tool for measuring process CPU usage, but the differences are very significant when watching how quickly the CPU time counter rises.

📋 Environment

  • Cheroot version: 8.5.2 and others (see above)
  • CherryPy version: 18.6.1
  • Python version: 3.9.2 and others
  • OS: Debian Bullseye and others
  • Browser: not applicable

📋 Additional context

There are no errors involved; the connection manager itself just checks for idle connections (as far as I understood) very often, which implies CPU usage.

@webknjaz
Member

I suppose it's a follow-up for cherrypy/cherrypy#1908, right?

@MichaIng
Contributor Author

Yep, at least this is what made me dig deeper into it. I only just recognised that it matches the symptoms exactly. Now I wonder whether we should merge those issues? 🤔
At least it fits better here in the cheroot repo than in the CherryPy repo.

@webknjaz
Member

@MichaIng I'm going to wait for @Blindfreddy to confirm this and if it's true, I could transfer his issue into Cheroot and maybe copy some details from this issue + keep that one because it has more comments.

@MichaIng
Contributor Author

Yes, makes sense. Sorry, I should have appended the findings there to avoid splitting the conversation 😅.

@webknjaz
Member

That's alright. Now we wait :)
I'd accept a PR but I'd like to get more feedback first because previous refactoring attempts caused regressions which I'd like to avoid (also, I'm not sure if we can come up with proper regression tests for the CI to catch problems in the future).

@MichaIng
Contributor Author

Yes, we should check carefully what side effects the restored timeout=0.1 may have. Probably there is a way to instead reduce the overhead implied by the reduced timeout (i.e. the increased loop rate) when there are no open connections, or so.

@webknjaz
Member

Interesting... We may consider making this:

  1. a configurable setting
  2. a dynamically adjusted value (depending on the load?) – not sure whether it'd be reasonable to have this baked in or pluggable as a third-party lib

@MichaIng
Contributor Author

MichaIng commented Jul 21, 2021

I think checking the load would itself only increase the load and complexity. The goal should be to reduce the work done within each loop iteration to a minimum rather than adding further magic. But that opinion is based on pretty minimal insight.

Here is the loop, which runs until a stop() is requested: https://github.com/cherrypy/cheroot/blob/e2cfa19/cheroot/connections.py#L208-L229

Within the loop, it collects and iterates through all connections, giving the select() call a timeout if no connection is found. Without this timeout, the select() call waits and blocks until a connection appears: https://docs.python.org/3/library/selectors.html#selectors.BaseSelector.select

If timeout > 0, this specifies the maximum wait time, in seconds. If timeout <= 0, the call won’t block, and will report the currently ready file objects. If timeout is None, the call will block until a monitored file object becomes ready.

This means that server stop requests would wait indefinitely as long as no connection happens, if I'm not wrong.

But the following note is interesting:

This method can return before any file object becomes ready or the timeout has elapsed if the current process receives a signal: in this case, an empty list will be returned.

So I guess it would be possible to run the _run() function asynchronously, without giving it a timeout, so the selector will always wait until an actual connection is ready to be selected. The stop() function could then, instead of setting the _stop_requested flag (or additionally), send a signal (close()?) to the selector. Then an empty list is returned, which can be used as the condition to exit the _run() function's loop.
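
The proposal above can be sketched roughly as follows (illustrative, not cheroot's actual code): block in select() with no timeout at all, and wake it explicitly on stop() via a socketpair registered alongside the real connections, i.e. the classic "self-pipe" trick, a common alternative to closing the selector from another thread.

```python
import selectors
import socket
import threading
import time

# Illustrative sketch: a blocking select loop with zero idle CPU usage
# that stop() can wake immediately via a registered socketpair.
class IdleFreeLoop:
    def __init__(self):
        self._sel = selectors.DefaultSelector()
        self._wake_r, self._wake_w = socket.socketpair()
        self._wake_r.setblocking(False)
        self._sel.register(self._wake_r, selectors.EVENT_READ)
        self._stop_requested = False

    def run(self):
        while not self._stop_requested:
            # timeout=None: select() blocks until a socket is ready,
            # so the loop burns no CPU while fully idle.
            for key, _ in self._sel.select(timeout=None):
                if key.fileobj is self._wake_r:
                    self._wake_r.recv(1)  # drain the wake-up byte
                # real connections would be dispatched here

    def stop(self):
        self._stop_requested = True
        self._wake_w.send(b"\x00")  # wake the blocked select() now

loop = IdleFreeLoop()
t = threading.Thread(target=loop.run)
t.start()
time.sleep(0.2)  # the loop sits idle inside select(), consuming no CPU
loop.stop()
t.join(timeout=2)
print(t.is_alive())  # False: the loop exited promptly without polling
```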

@liamstask
Contributor

I previously submitted #352 for this, but there was an issue on windows I was not able to reproduce.

@MichaIng
Contributor Author

MichaIng commented Jul 21, 2021

I'll see if I'm able to set up a test environment on Windows. It's not exactly what I had in mind, but it makes sense to align it with the already-present expiration timeout. But this means that stop()ing the connections will take up to this timeout, as stop() only sets the flag for this loop to exit but does not actively cancel the selection, does it? Probably this causes issues on some platforms when many short-lived connections are opened and then stopped again?

@liamstask
Contributor

this means that stop()ing the connections will take up to this timeout

This is correct. The original default value of 0.5 was selected to match the default poll interval in socketserver, but it is configurable.

Probably this causes issues on some platforms when many short-lived connections

I am not sure I understand: once select() wakes up (either as a result of the timeout or due to a connection becoming active), the loop checks whether stop has been requested before entering select() again, so the worst-case latency of stop() still corresponds to the expiration timeout. I.e., other connections waking up the selector can only cause it to detect the stop request earlier.
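
This latency argument can be sketched with a minimal, self-contained loop (illustrative, not cheroot's code): the stop flag is re-checked between select() calls, so the worst-case stop() latency is bounded by the select() timeout, and any activity that wakes select() early only shortens it.

```python
import selectors
import socket
import threading
import time

TIMEOUT = 0.2  # stand-in for the expiration interval

sel = selectors.DefaultSelector()
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen()
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ)

stop_requested = threading.Event()

def run():
    while not stop_requested.is_set():  # checked after every wakeup
        sel.select(timeout=TIMEOUT)     # wakes on activity or timeout

t = threading.Thread(target=run)
t.start()
time.sleep(0.05)        # let the loop enter select()
start = time.monotonic()
stop_requested.set()    # no wakeup sent: rely on the timeout alone
t.join()
latency = time.monotonic() - start
print(latency < TIMEOUT + 0.3)  # True: bounded by the select() timeout
```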

@MichaIng
Contributor Author

MichaIng commented Jul 22, 2021

i.e., other connections waking up the selector can only cause it to detect the stop request earlier.

Ah yes that is true. Then it totally makes sense as you suggest in #352:

  • Whenever a new connection comes in, the selector immediately wakes up and processes it.
  • When no new connection is incoming, the selector waits at most for expiration_interval to ensure that stale connections are closed accordingly.

Now I see only one issue:

  1. Let's say there was a connection A, which is not active anymore but has not yet reached expiration_interval.
  2. A new connection B comes in, so select() wakes up and processes it.
  3. As connection A has not yet reached expiration_interval, a new iteration of the while loop is done.
  4. The iteration blocks again at select(), while connection A hits expiration_interval.
  5. So in theory, stale connections can then stay around for up to twice the expiration_interval.

Theoretically, the remaining time until expiry could be stored and used as the select() timeout, so that, aside from the little processing time, the sum of two consecutive select() timeouts matches at most the expiration_interval. But then, the other question is whether there is any issue when connections are not "expired" (== unregistered + closed) for a longer time. Due to my lack of deep insight: is there any problem when old connections are not unregistered and closed until a new connection comes in, or after a much longer time, e.g. 300 seconds (the default on Apache2)?
EDIT: Ah, I forgot that stop() currently takes up to the select() timeout as well. So the timeout currently needs to stay below what is seen as an acceptable delay when requesting a server shutdown. The alternative here would be to call close() on the selector itself (not a connection) as part of stop(), to override the select() timeout, as mentioned above.

If an issue is seen when a sudden large number of new connections arrives while a large number of old connections has not yet been closed, then processing new ones and closing old ones could probably be merged into the same loop, to minimise the risk that both add up to reach a certain limit. Currently it's done in two separate loops through the same connection tuples, doing partly the same checks (conn !=/== self.server). Merging those could reduce the processing time of each loop iteration.
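
The "remaining time until expiry" idea above could be sketched with a hypothetical helper; all names here are made up for the illustration and are not cheroot's actual API.

```python
import time

def next_select_timeout(last_active_times, expiration_interval, now=None):
    """Seconds until the earliest tracked connection would expire,
    clamped to >= 0; the full interval when nothing is tracked."""
    now = time.monotonic() if now is None else now
    if not last_active_times:
        return expiration_interval
    earliest = min(last_active_times)
    return max(0.0, earliest + expiration_interval - now)

# A connection last active at t=100.0 with a 0.5 s expiration interval:
# at t=100.3 the next select() should block at most ~0.2 s, so two
# consecutive waits never exceed the expiration interval overall.
print(round(next_select_timeout([100.0], 0.5, now=100.3), 3))  # 0.2
print(next_select_timeout([], 0.5, now=100.3))                 # 0.5
```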

@MichaIng
Contributor Author

MichaIng commented Jan 3, 2022

This has now been merged here: #401
Testing/verifying the idle CPU usage benefits would be welcome.

On Windows systems, sadly a capped timeout needs to stay, as the connection handler's select() call on Windows does not return as soon as a connection arrives. The timeout however has been increased from 0.01 to 0.05 seconds, which seems to be a good balance between idle CPU usage and the maximum delay for processing an incoming connection. On Linux (e.g. Raspberry Pi) this doesn't play a role, since select() returns immediately once an incoming connection appears. The reason expiration_interval (defaulting to 0.5 seconds) needs to be kept as an upper bound on Linux as well is to ensure that expired connections are disconnected and removed in time, since new connections and expiry are handled within the same loop.
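
The platform-dependent choice described above can be summarised in a small sketch; the 0.05 s cap and 0.5 s expiration interval come from this discussion, while the function itself is illustrative rather than the literal cheroot source.

```python
import platform

WINDOWS_SELECT_TIMEOUT = 0.05  # raised from 0.01 per this discussion

def select_timeout(expiration_interval, system=None):
    """Pick the select() timeout for the connection-manager loop."""
    system = platform.system() if system is None else system
    if system == "Windows":
        # select() may not return promptly on new connections here,
        # so keep polling at a short, capped interval.
        return min(WINDOWS_SELECT_TIMEOUT, expiration_interval)
    # Elsewhere select() wakes immediately on activity, so block up to
    # the expiration interval to still expire stale connections in time.
    return expiration_interval

print(select_timeout(0.5, system="Windows"))  # 0.05
print(select_timeout(0.5, system="Linux"))    # 0.5
```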

@MichaIng
Contributor Author

Marking as closed, as #401 nearly restored the much lower idle load from before #199 and #308 were merged.
