Use eventlet green thread instead of regular thread #115

Merged
merged 2 commits into from
Jul 25, 2022

Conversation

drencrom
Contributor

This fixes the `greenlet.error: cannot switch to a different thread`
exception that leaves the server in deadlock.
@drencrom drencrom mentioned this pull request Jul 22, 2022
@peppepetra
Contributor

lgtm, thanks for looking into this.

@peppepetra peppepetra merged commit c0cb5eb into canonical:master Jul 25, 2022
@drencrom drencrom deleted the thread_patch branch July 25, 2022 11:39
lathiat added a commit to lathiat/prometheus-openstack-exporter that referenced this pull request Mar 27, 2024
Currently, slow running OpenStack API Requests (either stuck connecting
or still waiting for the actual response) from the periodic DataGatherer
task will block the HTTPServer connections from being processed.

Conversely, a stalled client of the HTTPServer (e.g. opening a telnet
session and not sending a request) will also block other HTTPServer
connections from being processed and also block the DataGatherer task
from running.

Observed Symptoms:
- Slow or failed prometheus requests
- Statistics not being updated as often as you would expect
- HTTP 500 responses and BrokenPipeError tracebacks being logged due to
  later trying to respond to prometheus clients which timed out and
  disconnected the socket.

Cause:
This happens because we intend to use the eventlet library for
asynchronous non-blocking I/O, but not all of the running code uses the
patched or "green" versions of the relevant standard libraries
(e.g. socket). As a result, one task sometimes blocks the other tasks
from running.

Fix this by ensuring the entire program is correctly using eventlet and
green patched functions by importing eventlet and using
eventlet.patcher.monkey_patch() before importing any other modules.

== History Lesson ==
There have been several incorrect attempts to solve this and some related
problems. To try to avoid any further such problems, I have
comprehensively documented the historical issues below, and why those
fixes did not work, both for my understanding and yours :)

1. eventlet implements asynchronous "non-blocking" socket I/O without
any code changes to the application and without using real pthreads by
using co-operative "green threads" from the greenlet library.

For this to work correctly, it needs to replace many standard libraries
(e.g. socket, time, threading) with an alternative implementation. This
applies to both our own code, and code within imported modules (e.g.
novaclient).

This does not happen automatically. The full details are at
https://eventlet.readthedocs.io/en/latest/patching.html but, in brief,
this can be done with 3 different methods:
- Explicitly importing the relevant modules from eventlet.green
- Automatically during a single import with eventlet.patcher.import_patched
- Automatically during all future imports with eventlet.patcher.monkey_patch

2. The original Issue canonical#112 found that the process deadlocked with the
following error: greenlet.error: cannot switch to a different thread

At the time, we used a native Python Thread for the DataGatherer class
and separately used the ForkingHTTPServer to allow both functions to
operate simultaneously.

We did not intend to use eventlet/green threads at all, however, the
python-cinderclient library incorrectly imports eventlet.sleep which
results in sometimes using green threads accidentally, hence the error.

We attempted to fix that in canonical#115 by importing the green version of
threading.Thread explicitly. This avoided the "cannot switch to a
different thread" issue by only using green threads and not mixing
Python threads and green threads in the same process.

3. After merging canonical#115 it was found that the HTTPServer loop never
co-operatively yielded to the DataGatherer's thread and the stats were
never updated.

To fix this, canonical#116 imported the green version of socket, asyncore and
time and also littered a few sleep(0) calls around to force co-operative
yielding at various points.

4. In canonical#124 we switched from ForkingHTTPServer to the normal HTTPServer
because sometimes it would fork too many servers and hit the process or
system-wide process limit.

Though not noted elsewhere, when I reproduce this issue by connecting
many clients using the tool `siege` to a server where I firewalled the
nova API connections, I can see that all of those processes are defunct
and not actually alive. This is most likely because the process is
blocked and the calls to waitpid which would reap them never happen.

Since we are not using the eventlet version of http.server.HTTPServer,
without the forked model we now block anytime we are handling a server
request.

Additionally, anytime the DataGatherer green thread calls out through
the OpenStack API libraries, it uses non-patched versions of
socket/requests/urllib3 and also blocks the HTTPServer which is now
inside the same process.

== Testing ==
To confirm that we now have a working solution, you can:

1. Block access to the Nova API (causes connect to hang for 120 seconds)
using this firewall command:
iptables -I OUTPUT -p tcp -m state --state NEW --dport 8774 -j DROP

2. Make many concurrent and repeated requests using siege:
while true; do siege http://172.16.0.30:9183/metrics -t 5s -c 5 -d 0.1; done

When testing with these changes, I never see a server or client
connection block, and all requests take a few milliseconds at most, even
when client requests are slow or a connection is opened to the server
without ever sending a request.

Fixes: canonical#112, canonical#115, canonical#116, canonical#124, canonical#126
lathiat added a commit to lathiat/prometheus-openstack-exporter that referenced this pull request Jun 10, 2024
Currently, slow running OpenStack API Requests (either stuck connecting
or still waiting for the actual response) from the periodic DataGatherer
task will block HTTPServer connections from being processed. Blocked
HTTPServer connections will also block both other connections and the
DataGatherer task.

Observed Symptoms:
- Slow or failed prometheus requests
- Statistics not being updated as often as you would expect
- HTTP 500 responses and BrokenPipeError tracebacks being logged due to
  later trying to respond to prometheus clients which timed out and
  disconnected the socket
- Hitting the forked process limit

This happens because in the current code, we are intending to use the
eventlet library for asynchronous non-blocking I/O, but we are not using
it correctly. All code within the main application and all imported
dependencies must import the special eventlet "green" versions of many
python libraries (e.g. socket, time, threading, SimpleHTTPServer, etc)
which yield to other green threads when they would have blocked waiting
for I/O or to sleep. Currently this does not always happen.

Fix this by importing eventlet and using eventlet.patcher.monkey_patch()
before importing any other modules. This will automatically intercept
all future imports (including those inside dependencies) and
automatically load the green versions of relevant libraries.

Documentation on correctly importing eventlet can be found here:
https://eventlet.readthedocs.io/en/latest/patching.html

A detailed and comprehensive analysis of the issue and multiple previous
attempts to fix it can be found in Issue canonical#130. If you intend to make
further related changes to the use of eventlet, threads or forked
processes please read the detailed history lesson available there.

Fixes: canonical#130, canonical#126, canonical#124, canonical#116, canonical#115, canonical#112
lathiat added a commit to lathiat/prometheus-openstack-exporter that referenced this pull request Jun 12, 2024
lathiat added a commit to lathiat/prometheus-openstack-exporter that referenced this pull request Jun 12, 2024