Stability issues with gunicorn --max-requests #8212

Closed
NotSqrt opened this issue Mar 8, 2024 · 11 comments · Fixed by #8271

@NotSqrt
Contributor

NotSqrt commented Mar 8, 2024

Describe the bug

When running an aiohttp app deployed with gunicorn and the --max-requests parameter (to avoid memory issues), I face ChunkedEncodingError and ConnectionError issues (with a python-requests client).

The logs also show Task was destroyed but it is pending! events.

To Reproduce

Server: same as in https://docs.aiohttp.org/en/stable/deployment.html#application

from aiohttp import web


async def index(request):
    return web.Response(text="Welcome home!")


my_web_app = web.Application()
my_web_app.router.add_get('/', index)

Command to run: gunicorn gunicorn_example:my_web_app --bind localhost:8080 --worker-class aiohttp.GunicornWebWorker --max-requests 2 --workers 2 (low value of max-requests to accentuate the problem)

Client: while curl http://127.0.0.1:8080 -o /dev/null --silent -w "%{http_code}"; do true; done or similar.
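
A rough python-requests equivalent of the curl loop (illustrative only; any client that hammers the endpoint will do):

# Illustrative reproduction loop with python-requests: hammer the server and
# report any ChunkedEncodingError / ConnectionError raised while workers recycle.
import requests

while True:
    try:
        resp = requests.get("http://127.0.0.1:8080/")
        print(resp.status_code)
    except (requests.exceptions.ChunkedEncodingError,
            requests.exceptions.ConnectionError) as exc:
        print(f"request failed: {exc!r}")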

Expected behavior

Worker shutdown / restart events should not have any impact on requests.

Logs/tracebacks

[2024-03-08 14:29:33 +0100] [425307] [INFO] Max requests, shutting down: <Worker 425307>
[2024-03-08 14:29:33 +0100] [425307] [INFO] Worker exiting (pid: 425307)
Exception ignored in: <coroutine object RequestHandler.start at 0x7f9e0eb37290>
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.10/envs/sx-api/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 600, in start
    self.transport.close()
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 706, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 753, in call_soon
    self._check_closed()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Task was destroyed but it is pending!
task: <Task pending name='Task-225' coro=<RequestHandler.start() done, defined at /home/user/.pyenv/versions/3.10/envs/sx-api/lib/python3.10/site-packages/aiohttp/web_protocol.py:481> wait_for=<Task pending name='Task-227' coro=<RequestHandler._handle_request() running at /home/user/.pyenv/versions/3.10/envs/sx-api/lib/python3.10/site-packages/aiohttp/web_protocol.py:442> cb=[Task.task_wakeup()]>>
sys:1: RuntimeWarning: coroutine 'RequestHandler._handle_request' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Task was destroyed but it is pending!
task: <Task pending name='Task-227' coro=<RequestHandler._handle_request() running at /home/user/.pyenv/versions/3.10/envs/sx-api/lib/python3.10/site-packages/aiohttp/web_protocol.py:442> cb=[Task.task_wakeup()]>

Python Version

$ python --version
3.10.12

aiohttp Version

$ python -m pip show aiohttp
3.9.3

multidict Version

$ python -m pip show multidict
6.0.4

yarl Version

$ python -m pip show yarl
1.9.2

OS

Ubuntu 22.04

Related component

Server

Additional context

No response

@NotSqrt NotSqrt added the bug label Mar 8, 2024
@Dreamsorcerer
Member

I don't think it's specific to gunicorn. I suspect because you are hammering it with requests, you manage to catch an edge case where a request is coming in during shutdown.

I'm short on time right now, but if you'd like to try and debug, start by adding print() into aiohttp.web_server.Server.connection_made()/pre_shutdown() methods.

My suspicion is that you might find a call to connection_made() happening after pre_shutdown() has happened.

If that turns out to be correct, then try adding an await asyncio.sleep(0) before the pre_shutdown() call in aiohttp.web_runner.BaseRunner.cleanup() and see if that causes the connection_made() to happen before.
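
If you'd rather not edit site-packages, something like this should give the same prints (a rough monkeypatch sketch; it just wraps whatever arguments the two methods take):

# Hypothetical debug patch: wrap Server.connection_made()/pre_shutdown() with
# timestamped prints, forwarding arguments unchanged.
import time
from aiohttp import web_server

_orig_connection_made = web_server.Server.connection_made
_orig_pre_shutdown = web_server.Server.pre_shutdown

def _traced_connection_made(self, *args, **kwargs):
    print(f"{time.monotonic():.6f} connection_made on server {id(self)}")
    return _orig_connection_made(self, *args, **kwargs)

def _traced_pre_shutdown(self, *args, **kwargs):
    print(f"{time.monotonic():.6f} pre_shutdown on server {id(self)}")
    return _orig_pre_shutdown(self, *args, **kwargs)

web_server.Server.connection_made = _traced_connection_made
web_server.Server.pre_shutdown = _traced_pre_shutdown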

@NotSqrt
Contributor Author

NotSqrt commented Mar 11, 2024

I confirm that connection_made can be called after pre_shutdown of the same process.

With your suggestion, I no longer have exceptions in the server logs (but I still have some dropped requests: potentially an edge case in gunicorn when all processes are booting up simultaneously, despite their backlog option).

         if self._server:  # If setup succeeded
+            await asyncio.sleep(0)
             self._server.pre_shutdown()

@Dreamsorcerer
Member

Alright, if you can figure out a test to add in a PR to reproduce, that'd be great (existing tests for shutdown are in https://github.com/aio-libs/aiohttp/blob/master/tests/test_run_app.py#L924). Otherwise, I'll try and give it a go when I have some time.

For the dropped connections, it could be that the connection has been established, but it hasn't yet started the handler, causing the server to close it again without actually handling it? Try changing the 0 to a higher number (0.1 maybe) and see if that makes a difference. If so, you could start debugging further by looking around this code:

while not self._force_close:
    if not self._messages:
        try:
            # wait for next request
            self._waiter = loop.create_future()
            await self._waiter
        except asyncio.CancelledError:
            break
        finally:
            self._waiter = None

or
if self._force_close or self._close:
    return

But, it's not so clear to me how we'd fix this one cleanly.

@NotSqrt
Contributor Author

NotSqrt commented Mar 12, 2024

Well, this is clearly out of my league...

I've tried looking at how to cleanly stop an asyncio server that was started with loop.create_server.
Ideally, when the gunicorn worker reaches max_requests, it should stop accepting new connections and do a proper shutdown.

What I've seen is that the RequestHandler.start method increments manager.requests_count at each request, while the GunicornWebWorker checks that value only in a loop with _wait_next_notify and call_later once per second.
So clearly gunicorn options --max-requests and --max-requests-jitter won't be strictly respected.
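
Roughly the shape of that polling, as I understand it (illustrative sketch with made-up attribute names, not the actual aiohttp or gunicorn source):

# Illustrative: the handler side bumps a shared counter per request, while the
# worker only samples it about once per second, so the limit can be overshot by
# however many requests arrive between two samples.
import asyncio

async def poll_max_requests(worker, interval: float = 1.0) -> None:
    while worker.alive:  # hypothetical attributes on a worker-like object
        if worker.max_requests and worker.requests_count >= worker.max_requests:
            worker.alive = False  # would trigger the graceful shutdown path
            break
        await asyncio.sleep(interval)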

It seems quite hard and implementation-specific to stop a server from accepting new connections...
I've seen an ongoing discussion here: python/cpython#113538 and https://stackoverflow.com/questions/43860469/long-running-asyncio-server-reload-certificate
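
For reference, the generic asyncio side of "stop accepting" looks roughly like this (sketch only; it closes the listening sockets, but connections that were already accepted keep running and still have to be tracked and shut down by the application, which is the hard part discussed in those links):

# Sketch: asyncio.Server.close() stops accepting new connections by closing the
# listening sockets; already-accepted connections keep running and must be shut
# down separately (which is what aiohttp's graceful shutdown is responsible for).
import asyncio

async def stop_accepting(server: asyncio.Server) -> None:
    server.close()               # no new connections are accepted after this
    await server.wait_closed()   # wait until the listening sockets are closed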

@Dreamsorcerer
Member

Dreamsorcerer commented Mar 12, 2024

What I've seen is that the RequestHandler.start method increments manager.requests_count at each request, while the GunicornWebWorker checks that value only in a loop with _wait_next_notify and call_later once per second. So clearly gunicorn options --max-requests and --max-requests-jitter won't be strictly respected.

Yes, I already noticed that; it's only an estimate, but that's likely all it needs to be (to be honest, I'm not sure I even see the point of the option; it seems like a PHP thing, where memory leaks and bad code are rather common).

It seems quite hard and implementation-specific to stop a server from accepting new connections... I've seen an ongoing discussion here: python/cpython#113538 and https://stackoverflow.com/questions/43860469/long-running-asyncio-server-reload-certificate

I think, as Guido says at the beginning of that thread, it's up to us to track the connections. Which is basically what we do. We don't want to actually force close the server, but gracefully close each connection.

My assumption is that once we stop the server from accepting incoming requests, if we yield to the event loop once, then any connections that were already incoming will have their connection_made() callbacks run. That seems to be true, given that the exceptions you encountered disappeared after an await asyncio.sleep(0).

Then when we try to close the connections, the reason active connections don't close automatically is in the code linked above. Specifically, the expected behaviour is that an idle connection will be waiting at:

await self._waiter

That waiter gets cancelled, which then breaks out of the loop and the connection is closed.

An active connection will be later in that loop and awaiting on the handler task:

resp, reset = await task

In this case, it just gets an _close attribute set which tells it to break out of the loop at the end (so it doesn't become an idle connection again).

My guess is that the problem occurs with the connection_made() getting called, but the code then yielding to the event loop before it reaches that loop and starts the handler task (or after entering the loop, but before data is received and it's woken up from the waiter?). At which point the connection is closed, and then the code resumes, sees that the connection should be closed and never starts the handler task. Again at a guess, it may be the data_received() callback that should initiate this, but because the connection is already labelled as closed, it just returns and skips processing anything:

if self._force_close or self._close:
    return

This is why I asked if increasing the sleep() call makes a difference, as it would allow some time for data_received() to be called properly, and would suggest that I'm along the right lines.
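
A minimal, self-contained illustration of the "yield to the event loop once" point (plain asyncio, not aiohttp code):

# A callback scheduled with call_soon() only runs once the current coroutine
# yields control, which is exactly what the added asyncio.sleep(0) achieves
# before pre_shutdown().
import asyncio

async def main() -> None:
    loop = asyncio.get_running_loop()
    ran = []
    loop.call_soon(ran.append, "connection_made")  # stands in for the pending callback
    assert ran == []            # not run yet: we have not yielded to the loop
    await asyncio.sleep(0)      # yield once
    assert ran == ["connection_made"]

asyncio.run(main())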

@NotSqrt
Contributor Author

NotSqrt commented Mar 13, 2024

Increasing the sleep to 0.1 or even 0.5 seems to improve stability a little, but dropped requests can still happen.

@Dreamsorcerer
Member

That could just be the delay to start every handler, depending on how many requests are queued up in that time.
What I'm thinking is that maybe we need to track whether a connection is new, and when it is, allow it 1 second or so to receive the request data from the client. If no data is received in that time, then we close the connection.
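
A very rough sketch of that idea (hypothetical helper, not aiohttp internals):

# Hypothetical: when shutting down, give a brand-new connection a short grace
# period to send its first request bytes; close it only if nothing arrives.
# Meant to be called from a protocol while the event loop is running.
import asyncio

def schedule_grace_close(transport: asyncio.BaseTransport, grace: float = 1.0) -> asyncio.TimerHandle:
    loop = asyncio.get_running_loop()
    return loop.call_later(grace, transport.close)

# The protocol's data_received() would cancel the returned handle once bytes arrive:
#     self._grace_close_handle.cancel()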

@Dreamsorcerer
Member

I don't think I can provide a full fix, it seems to be down to network race conditions with keep-alive connections.

At best, if everything else works perfectly, we can still have a situation where the server closes an idle keep-alive connection while there is a request in transit from the client. I don't think there's a way to cancel a close, so there's no way to avoid this situation. This is expected behaviour for HTTP, and for idempotent methods, Gunicorn should probably automatically retry the connection (maybe on a different process) without giving any errors to the end user:
https://www.rfc-editor.org/rfc/rfc9112.html#name-retrying-requests
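
For what it's worth, the reproduction client can also do that retry itself; with python-requests, something like this (client-side sketch, unrelated to aiohttp) retries idempotent GETs on connection errors:

# Sketch: retry idempotent GETs on connection errors on the client side,
# using requests with urllib3's Retry (urllib3 >= 1.26 for allowed_methods).
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, connect=3, read=3, backoff_factor=0.1,
              allowed_methods=frozenset({"GET", "HEAD"}))
session.mount("http://", HTTPAdapter(max_retries=retry))

resp = session.get("http://127.0.0.1:8080/")
print(resp.status_code, resp.text)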

@Dreamsorcerer
Member

I also can't reproduce the server-side warnings in a test...
I think we'll just add in the sleep(0) for now.

@Dreamsorcerer
Member

Dreamsorcerer commented Mar 30, 2024

If anyone wants to play with it, this was my attempt at a test, but I get the same result before and after the change (add it to the bottom of test_run_app.py):

    def test_shutdown_while_incoming_connections(
        self, aiohttp_unused_port: Callable[[], int]
    ) -> None:
        port = aiohttp_unused_port()
        results = ()

        async def test() -> None:
            await asyncio.sleep(1)
            async with ClientSession(connector=TCPConnector(limit=10)) as sess:
                async def req(path: str):
                    try:
                        async with sess.get(f"http://localhost:{port}{path}") as resp:
                            return await resp.text()
                    except Exception as exc:
                        return exc.__class__

                tasks = (req("/stop"), *(req("/") for _ in range(100)))
                nonlocal results
                stop, *results = await asyncio.gather(*tasks)

        async def handler(request: web.Request) -> web.Response:
            return web.Response(text="FOO")

        async def run_test(app: web.Application) -> None:
            nonlocal t
            t = asyncio.create_task(test())
            yield
            t.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await t

        t = None
        app = web.Application()
        app.cleanup_ctx.append(run_test)
        app.router.add_get("/", handler)
        app.router.add_get("/stop", self.stop)

        web.run_app(app, port=port, shutdown_timeout=10)
        assert len(results) == 100
        assert set(results) == {"FOO", ClientConnectorError}

@Dreamsorcerer
Member

I think, also, if integration with Gunicorn can allow it to be alerted when the server is initiating a shutdown, then Gunicorn could also just stop forwarding requests to that process immediately and avoid any dropped connections that way. I'm not familiar with what's available in Gunicorn, so it's something worth investigating if you're keen to improve on this.
