arbiter: don't log if handling SIGCHLD #3064

hydrargyrum · 2023-09-05T10:31:28Z

Logging when handling a signal is a bad practice
See https://docs.python.org/3/library/logging.html#thread-safety

Logging when handling a signal is a bad practice See https://docs.python.org/3/library/logging.html#thread-safety Fixes benoitc#2816

benoitc · 2023-09-05T11:00:52Z

Thanks but this is done on purpose like the comment says. It's not common. I prefer to keep it like this. This place is threadsafe also.

hydrargyrum · 2023-09-05T12:04:02Z

@benoitc what is done on purpose? raising exceptions and breaking logging system? Please see #2816 that many people encounter

benoitc · 2023-09-05T12:15:58Z

This code should be threadsafe. I suspect the error is happening before. We need first to understand why this code is re-entrant here.

Let keep it open since it is actually solving the cause. But I woud insist that we need to understand why this comes now.

benoitc · 2023-09-05T12:20:48Z

linking to #3050 I think we should revisit this change that introducde more logging.

@hydrargyrum if you happen to be around i'm on irc.

ja0x · 2023-10-10T14:06:58Z

Wanted to shine a little bit of info I found since I have encountered this bug as well. The issue isn't with thread-safety but rather with re-entrancy within the logging module and the underlying BufferedWriter (from IO module).

The issue occurs (usually on a slow disk) when the SIGCHLD handler interrupts the main process while it was in the process of writing its logging buffer. The handler (which is the same thread) will try to write to that buffer but will be unable because it doesn't allow reentrant code, it will then raise an exception.

If logger is set to write stderr to that same stream or file, it then sets off a cascading reentrant code error, eventually resulting in a recursion error.

It is a hard bug to reproduce, because the main thread has to interrupt itself exactly when it was writing the buffer.

Ref:
https://docs.python.org/3/library/io.html#reentrancy,
https://github.com/python/cpython/blob/e24f9ae7035cfb4c11c4a5c2820121fc22463f8b/Modules/_io/bufferedio.c#L295C11-L295C11

TomiBelan · 2023-10-31T22:42:30Z

@ja0x is correct. Python programs MUST NOT print or log anything inside a signal handler. https://stackoverflow.com/q/45680378 is another link that explains the root cause of this problem.

This gunicorn bug has a long and sad history.

Dec 2017: Issue #1663 was reported: In some cases (for example OOM) workers die without any error logged, or even crashloop.
Dec 2019: Issue #2215 was reported: Essentially the same as #1663, but with a reproducible example, and focused on missing logging.
Apr 2020: PR #2315 (ee1e5c1) was created by @tilgovi. It logs an error in the SIGCHLD handler if a worker exits with a nonzero code. And adds more logging to arbiter halt (that part is okay).
Sep 2020: PR #2419 (28df992, ca36d41) was created by @mildebrandt trying to fix issue #1663. It logs an error in the SIGCHLD handler if a worker exits with a nonzero code or is killed by a signal.
Dec 2020: PR #2475 (b695b49) was created by @aberres trying to fix issue #2215, and immediately merged by @benoitc. It logs a warning in the SIGCHLD handler if a worker is killed by a signal.
Apr 2021: gunicorn 20.1.0 was released containing PR #2475. Soon, issue #2564 was reported: Users started experiencing crashes with RuntimeError: reentrant call.
Oct 2021: PR #2667 was created by @BoasKluiving to revert PR #2475 in order to fix issue #2564.
Dec 2021: PR #2475 was reverted by @benoitc in commit 76f8da2 in order to fix issue #2564.
Jan 2022: PR #2667 was closed by @javabrett because he noticed that PR #2475 was reverted in the meantime.
Jun 2022: Issue #2816 was reported by @InfernalAzazel. AFAICT, it's actually just an exact duplicate of issue #2564, because the fix in commit 76f8da2 wasn't released.
Dec 2022: PR #2908 was created by myself. It logs a warning if a worker is killed by a signal, but the log is written later in the main loop, not directly in the SIGCHLD handler. I did not receive any response in the PR.
May 2023: PR #2315 and PR #2419 were merged by @benoitc, reintroducing the RuntimeError: reentrant call bug. Issue #2564 was closed (incorrectly).
Jul 2023: gunicorn 21.0.0 was released containing PRs #2315 and #2419.
Aug 2023: Issue #3050 was reported by @feliixx, who wants the log to be a warning, not an error.
Sep 2023: PR #3064 (this PR) was created by @hydrargyrum, trying to fix issue #2816 by skipping logging in the SIGCHLD handler. (FWIW: you missed one self.log.error call, and nobody except handle_chld calls reap_workers anyway.)

If I may be frank, I feel very disappointed and frustrated by the gunicorn project. It is a cruel irony that my PR #2908 which attempts to handle this correctly was ignored, but PRs #2315 and #2419 were merged, causing a regression.

But I'll try to swallow my bitter feelings and offer constructive advice.

My advice to @benoitc is to do this:

git revert -m 1 1fd1c82195 (revert both commits of Add additional logs when worker exits abnormally #2419)
git revert ee1e5c1928 (revert the bad half of More verbose exit logging #2315, keep the good half)
Review and (if I may be so bold) merge my PR Re-add and improve the warning for workers terminated due to a signal #2908
Reject PR arbiter: don't log if handling SIGCHLD #3064 (this PR)

Doing that fixes #2564 properly (at least I think so), fixes #3050, doesn't regress #2215, and removes the need for PR #3064 (this PR).

Hope that helps.

tilgovi · 2023-10-31T23:30:03Z

This is allextremely helpful and politely offered advice. I'll try to find time to push this along. I appreciate all your effort here and the exception patience with helping to sort this out.

Most of the maintainers are volunteers and we don't always find the sustained time to reconstruct all of this context, so gathering it in one place is immensely helpful.

aberres · 2023-11-01T06:14:18Z

@TomiBelan Thank you for the write-up. Since my failed approach three years ago I am silently hoping to get the issue fixed for good :)

tilgovi · 2023-12-28T00:33:30Z

I've rebased #2314, which moves reaping to the main thread. That should move the logging out of the signal handler. Any reviews or testing of that would be appreciated. I've assigned myself the related issues to follow up on any other changes afterwards.

piskvorky · 2024-02-02T14:38:13Z

Also bitten by this. I debugged and googled my way to this PR – I have nothing to add to @ja0x and @TomiBelan excellent analysis, so just a +1. I experience the same issue with reentrant logging inside Arbiter.reap_workers() during SIGCHLD https://github.com/benoitc/gunicorn/blob/master/gunicorn/arbiter.py#L242

To make matters worse, in my case gunicorn is auto-launched and monitored from supervisord, running a Flask app. After the unhandled exception due to RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>, the server port was not released correctly, and the newly restarted gunicorn arbiter process (launched by supervisor) keeps printing this forever:

[2024-01-25 06:50:00 +0000] [384109] [INFO] Starting gunicorn 21.2.0
[2024-01-25 06:50:00 +0000] [384109] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2024-01-25 06:50:00 +0000] [384109] [ERROR] Retrying in 1 second.
[2024-01-25 06:50:01 +0000] [384109] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2024-01-25 06:50:01 +0000] [384109] [ERROR] Retrying in 1 second.
…

I.e. the server is bricked after the [ERROR] Unhandled exception in main loop, somehow.

On a positive note, I've been using gunicorn for years and this is the first major issue. Kudos and huge thanks to the maintainers!

sylt · 2024-02-02T15:48:39Z

Without really checking what PRs were opened already, I posted a PR yesterday (#3148) with an attempt at solving various signal handling issues in gunicorn I had the misfortune to encounter. I couldn't see that any of the open proposals had used the same approach to solve the problem, so I thought people could have a look before I eventually close it.

randomir · 2024-08-17T18:31:56Z

@TomiBelan, thanks for the excellent summary. I just started digging myself when I discovered it.

In my case it's gunicorn (sometimes) dying on SIGHUP during a log rotate.

To maintainers, please note the seriousness of this bug, even if it's intermittent. And the fact it was re-introduced in the same version in which it was fixed (21.0.0) makes it open for 3+ years. 😞

arbiter: don't log if handling SIGCHLD

5286a10

Logging when handling a signal is a bad practice See https://docs.python.org/3/library/logging.html#thread-safety Fixes benoitc#2816

benoitc closed this Sep 5, 2023

benoitc reopened this Sep 5, 2023

benoitc self-assigned this Sep 5, 2023

tilgovi self-assigned this Dec 28, 2023

tilgovi added this to the 22.0 milestone Dec 29, 2023

hendrikmuhs mentioned this pull request Jul 9, 2024

Logging in signal handlers may be unsafe #3198

Open

This was referenced Sep 1, 2024

print is not thread safe, and should not be used in signal handlers celery/celery#9221

Closed

print is not thread-safe, so should not be used in signal handler celery/celery#9222

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arbiter: don't log if handling SIGCHLD #3064

arbiter: don't log if handling SIGCHLD #3064

hydrargyrum commented Sep 5, 2023

benoitc commented Sep 5, 2023 •

edited

Loading

hydrargyrum commented Sep 5, 2023 •

edited

Loading

benoitc commented Sep 5, 2023 •

edited

Loading

benoitc commented Sep 5, 2023

ja0x commented Oct 10, 2023 •

edited

Loading

TomiBelan commented Oct 31, 2023

tilgovi commented Oct 31, 2023

aberres commented Nov 1, 2023

tilgovi commented Dec 28, 2023

piskvorky commented Feb 2, 2024 •

edited

Loading

sylt commented Feb 2, 2024

randomir commented Aug 17, 2024

arbiter: don't log if handling SIGCHLD #3064

Are you sure you want to change the base?

arbiter: don't log if handling SIGCHLD #3064

Conversation

hydrargyrum commented Sep 5, 2023

benoitc commented Sep 5, 2023 • edited Loading

hydrargyrum commented Sep 5, 2023 • edited Loading

benoitc commented Sep 5, 2023 • edited Loading

benoitc commented Sep 5, 2023

ja0x commented Oct 10, 2023 • edited Loading

TomiBelan commented Oct 31, 2023

tilgovi commented Oct 31, 2023

aberres commented Nov 1, 2023

tilgovi commented Dec 28, 2023

piskvorky commented Feb 2, 2024 • edited Loading

sylt commented Feb 2, 2024

randomir commented Aug 17, 2024

benoitc commented Sep 5, 2023 •

edited

Loading

hydrargyrum commented Sep 5, 2023 •

edited

Loading

benoitc commented Sep 5, 2023 •

edited

Loading

ja0x commented Oct 10, 2023 •

edited

Loading

piskvorky commented Feb 2, 2024 •

edited

Loading