Reap workers in the main loop #2314

tilgovi · 2020-04-20T23:06:38Z

Handle SIGCHLD like every other signal, waking up the arbiter and handling the signal in the main loop rather than in the signal handler.

Take special care to reinstall the signal handler in case Python avoided doing so to prevent infinite recursion.

Clean up workers and call the worker_exit hook in only one place. When killing a worker, do not clean it up. The arbiter will now clean up the worker and invoke the hook when it reaps the worker.

With reaping handled in the main loop and kill_worker delegating responsibility for cleanup to the reaping loop, iterate over the workers dictionary everywhere else without concern for concurrent modification.

tilgovi · 2020-04-20T23:27:04Z

This is not safe as is due to this code that might cause Gunicorn to miss signals:

    def signal(self, sig, frame):
        if len(self.SIG_QUEUE) < 5:
            self.SIG_QUEUE.append(sig)
            self.wakeup()
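
To make the concern concrete, here is a minimal sketch of how a bounded queue like the one above can drop a signal. The `BoundedSignalQueue` class and its `dropped` counter are hypothetical and exist only for illustration; only the length check mirrors the snippet.

```python
import signal


class BoundedSignalQueue:
    """Hypothetical stand-in for the arbiter's signal bookkeeping."""

    MAX_PENDING = 5  # mirrors the `len(self.SIG_QUEUE) < 5` check above

    def __init__(self):
        self.sig_queue = []
        self.dropped = 0  # illustration only; the snippet above keeps no count

    def signal(self, sig, frame=None):
        if len(self.sig_queue) < self.MAX_PENDING:
            self.sig_queue.append(sig)
        else:
            # The queue is full, so this signal is silently discarded.
            self.dropped += 1


q = BoundedSignalQueue()
for _ in range(5):
    q.signal(signal.SIGHUP)   # fill the queue with other signals
q.signal(signal.SIGCHLD)      # a child exits while the queue is full
```

If SIGCHLD only ever reaches the main loop through this queue, the last call leaves no record that a child exited.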

tilgovi · 2020-04-21T01:09:56Z

Ahh, I think I may have identified why we had SIGCHLD separately handled?

Is it because some systems exhibit a behavior where not calling os.waitpid in the signal execution context causes infinite recursion?

http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch10lev1sec7.html

tilgovi · 2020-04-21T01:33:25Z

Python actually checks for the situation I linked to, taking care not to re-install the signal handler for SIGCHLD automatically if it might cause infinite recursion.

https://github.com/python/cpython/blob/62183b8d6d49e59c6a98bbdaa65b7ea1415abb7f/Modules/signalmodule.c#L334

I've modified the first commit to address that and commented the code to indicate this case.

benoitc

I made some comments, but I'm still not sure I understand the need for this patch. Can you clarify what it is trying to solve? Do we have a way to reproduce the issue?

gunicorn/arbiter.py

tilgovi · 2020-08-27T01:54:19Z

Can you clarify what it is trying to solve? Do we have a way to reproduce the issue?

Just code cleanup. Removing the duplicate cleanup code eliminates the concurrent modification of the workers dictionary, so we no longer need to call list() on it when we loop over it.
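
As a rough illustration of the concurrent-modification problem, the sketch below uses a plain dictionary and a hypothetical `reap` function standing in for any cleanup path that removes a worker while another loop is iterating. The names and pids are made up and are not Gunicorn's.

```python
workers = {101: "worker-a", 102: "worker-b", 103: "worker-c"}


def reap(pid):
    """Simulate a cleanup path that removes a worker mid-iteration."""
    workers.pop(pid, None)


# Iterating the live view while entries are removed raises RuntimeError.
try:
    for pid, worker in workers.items():
        reap(pid)
    failed = False
except RuntimeError:
    failed = True

# Iterating a snapshot is safe, at the cost of a copy on every loop.
workers = {101: "worker-a", 102: "worker-b", 103: "worker-c"}
for pid, worker in list(workers.items()):
    reap(pid)
```

When removal happens in only one place, outside any loop over the dictionary, the snapshot is no longer needed.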

tilgovi · 2020-09-13T20:25:32Z

I reworked this with a separate indicator flag that the arbiter uses to mark that it needs to reap workers in the next main loop iteration. This change ensures that the arbiter will always reap workers even if the signal queue is full. I also added better explanation for why it is good to re-install the handler.

Please take another look.
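
A minimal sketch of the flag idea, assuming a heavily simplified arbiter. The `FlagArbiter` class, its attribute names, and the stubbed `reap_workers` are hypothetical and not the code in this PR.

```python
import signal


class FlagArbiter:
    """Hypothetical, heavily simplified arbiter; not Gunicorn's actual code."""

    MAX_PENDING = 5  # assumed cap, mirroring the SIG_QUEUE check quoted earlier

    def __init__(self):
        self.sig_queue = []
        self.reap_pending = False
        self.reap_calls = 0

    def handle_signal(self, sig, frame=None):
        """Signal handler: record what happened and return quickly."""
        if sig == signal.SIGCHLD:
            # A boolean cannot overflow, so a child exit is never forgotten,
            # however many other signals are already queued.
            self.reap_pending = True
        elif len(self.sig_queue) < self.MAX_PENDING:
            self.sig_queue.append(sig)

    def reap_workers(self):
        """Placeholder for the real reaping logic (os.waitpid and cleanup)."""
        self.reap_calls += 1

    def run_once(self):
        """One main-loop iteration: reap first if flagged, then drain signals."""
        if self.reap_pending:
            self.reap_pending = False
            self.reap_workers()
        handled = list(self.sig_queue)
        self.sig_queue.clear()
        return handled
```

Because several SIGCHLD deliveries collapse into one flag, the reaping routine has to loop until no exited children remain rather than assume one child per signal.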

benoitc · 2020-09-17T15:23:53Z

The use of list(self.WORKERS.items()) was mainly there because .items() was not considered safe. When a worker dies, the list is updated concurrently.

Anyway, the patch looks fine. I need to test it on BSD systems, in particular OpenBSD, which I remember the special handling of the CHLD signal was added for.

Can it wait for the next release (let's make it next month)?

tilgovi · 2020-09-19T03:16:59Z

The use of list(self.WORKERS.items()) was mainly there because .items() was not considered safe. When a worker dies, the list is updated concurrently.

Not any more because that's what this patch changes!

Anyway, the patch looks fine. I need to test it on BSD systems, in particular OpenBSD, which I remember the special handling of the CHLD signal was added for.

I don't think any BSD has this problem. From what I read, it's most likely some versions of Solaris. In any case, setting the handler when it is already set is not a problem, so I'm not worried about regression here. The only possible change here would be to fix behavior on a system where Gunicorn was previously broken.

Can it wait for the next release (let's make it next month)?

Absolutely.

gunicorn/arbiter.py

tilgovi · 2023-12-27T22:38:33Z

@benoitc would you take another look here, please? I think it's a good idea to reap on the main thread. I think this approach is easier to understand, safer, reduces duplicate code, and eliminates concurrent modification of the workers list.

Handle SIGCHLD like every other signal, waking up the arbiter and handling the signal in the main loop rather than in the signal handler. Take special care to reinstall the signal handler since Python may not.

Clean up workers and call the worker_exit hook in only one place. When killing a worker, do not clean it up. The arbiter will now clean up the worker and invoke the hook when it reaps the worker. Ensure that all workers have their temporary watchdog files closed and that the arbiter does not exit or log about other child processes dying.

With reaping handled in the main loop and kill_worker delegating responsibility for cleanup to the reaping loop, iterate over the workers dictionary everywhere else without concern for concurrent modification.

tilgovi · 2023-12-29T04:24:35Z

gunicorn/arbiter.py

+                    if not worker:
+                        continue
+
+                    worker.tmp.close()


I thought it best to ensure that we only do anything below here if this is actually our worker. But I can make this a separate PR if that's desired.
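
For context, a reaping loop with this guard might look roughly like the sketch below. It is not the PR's code: `FakeTmp`, `FakeWorker`, and the injectable `waitpid` parameter are hypothetical and exist only so the sketch can run without real child processes. The `os.waitpid(-1, os.WNOHANG)` call is the standard non-blocking way to collect any exited child on POSIX systems.

```python
import os


class FakeTmp:
    """Hypothetical stand-in for a worker's temporary watchdog file."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeWorker:
    """Hypothetical worker object holding only the watchdog file."""

    def __init__(self):
        self.tmp = FakeTmp()


def reap_workers(workers, waitpid=os.waitpid):
    """Collect exited children; clean up only those that are known workers."""
    reaped = []
    while True:
        try:
            pid, status = waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no child processes at all
        if pid == 0:
            break  # children exist, but none has exited yet
        worker = workers.pop(pid, None)
        if not worker:
            continue  # some other child process; nothing of ours to clean up
        worker.tmp.close()
        reaped.append(pid)
    return reaped
```

Popping the worker before the guard means every later step can assume it is dealing with one of the arbiter's own workers.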

piskvorky · 2024-02-02T15:45:13Z

gunicorn/arbiter.py

    SIG_NAMES = dict(
        (getattr(signal, name), name[3:].lower()) for name in dir(signal)
-        if name[:3] == "SIG" and name[3] != "_"
+        if name[:3] == "SIG" and name[3] != "_" and name[3:] != "CLD"


This condition gave me pause; it's not obvious.

Suggested change

-        if name[:3] == "SIG" and name[3] != "_" and name[3:] != "CLD"
+        if name[:3] == "SIG" and name[3] != "_" and name[3:] != "CLD"  # SIGCLD is an obsolete name for SIGCHLD

At least according to https://man7.org/linux/man-pages/man7/signal.7.html

What was the motivation for singling it out here?

I thought I encountered an error without this (on Linux) but I'll double check.

@sylt's solution in #3148 seems a bit cleaner.
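
One plausible reason for the exclusion, not confirmed in this thread: on platforms where Python exposes SIGCLD as an alias with the same number as SIGCHLD, both names produce the same dictionary key, and whichever name is processed last wins. Since dir() returns names in sorted order, "SIGCLD" comes after "SIGCHLD". The sketch below wraps the comprehension in a hypothetical `build_sig_names` helper so the two variants can be compared.

```python
import signal


def build_sig_names(exclude=()):
    """Map signal numbers to lowercase names, as in the comprehension above.

    `exclude` is a hypothetical parameter added for this comparison.
    """
    return dict(
        (getattr(signal, name), name[3:].lower())
        for name in dir(signal)
        if name[:3] == "SIG" and name[3] != "_" and name[3:] not in exclude
    )


unfiltered = build_sig_names()
filtered = build_sig_names(exclude=("CLD",))

# Where SIGCLD exists, the unfiltered map may name the signal "cld".
print(unfiltered[signal.SIGCHLD], filtered[signal.SIGCHLD])
```

If handler methods are looked up by the mapped name, a "cld" entry would miss a handler written for "chld", which may be the error mentioned above.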

benoitc · 2024-02-02T18:50:57Z

@tilgovi I will check this patch. But one of the reasons we handled CHLD differently is that monitoring is separate from handling signals targeting the process itself. Having it all in the same signal loop means that this signal will be in the same queue.

sylt · 2024-02-02T20:23:16Z

I wasn't aware of this pr until now, but FYI, I've posted #3148 which also aims to handle SIGCHLD on the main thread, although in a slightly different way.

piskvorky · 2024-02-02T20:32:38Z

I've been running & testing #3148 (@sylt's PR), so far without problems. It seems more complete – perhaps the two PRs could be combined, gaining the benefit of both?

And then there's #2908 too of course.

tilgovi · 2024-02-03T16:16:38Z

@tilgovi I will check this patch. But one of the reasons we handled CHLD differently is that monitoring is separate from handling signals targeting the process itself. Having it all in the same signal loop means that this signal will be in the same queue.

We could put the SIGCHLD signal on the front of the queue, if what you want is just to handle it with higher priority, but I think it's a good idea to make the handler itself short and defer reaping to the main thread.
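
A rough sketch of the front-of-queue idea, assuming a `collections.deque` in place of the list. The `enqueue` function, the cap, and the choice to let SIGCHLD bypass the cap and be coalesced are all assumptions made for illustration, not behavior from this PR.

```python
import signal
from collections import deque

MAX_PENDING = 5  # assumed cap on queued signals other than SIGCHLD
sig_queue = deque()


def enqueue(sig):
    """Queue a signal; SIGCHLD jumps to the front and is never dropped."""
    if sig == signal.SIGCHLD:
        # Coalesce repeats: one pending SIGCHLD is enough to trigger reaping.
        if sig not in sig_queue:
            sig_queue.appendleft(sig)
    elif len(sig_queue) < MAX_PENDING:
        sig_queue.append(sig)


enqueue(signal.SIGHUP)
enqueue(signal.SIGTERM)
enqueue(signal.SIGCHLD)
enqueue(signal.SIGCHLD)  # coalesced with the pending SIGCHLD
```

The main loop would then pop from the left and see SIGCHLD before the signals that arrived earlier.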

I also had a version of this with just a boolean flag for whether to reap on wakeup, if for some reason you really don't want SIGCHLD in the queue at all. But I don't think I understand in what way this signal is for "monitoring" or different from "signals targeting the process itself."

Can you explain more about what concerns you?

tilgovi · 2024-02-03T16:20:30Z

I see on the other PR that one hesitation you seem to have is that you don't want to wait for children when exiting Gunicorn, but that already happens. While we reap children in the signal context, handling other signals is blocked.

tilgovi force-pushed the safe-reaping branch 2 times, most recently from 0a2653c to 0b5d41b Compare April 20, 2020 23:14

tilgovi marked this pull request as draft April 20, 2020 23:26

tilgovi force-pushed the safe-reaping branch from 0b5d41b to 014e912 Compare April 21, 2020 01:32

tilgovi marked this pull request as ready for review April 21, 2020 01:33

tilgovi force-pushed the safe-reaping branch from 014e912 to f59ea64 Compare April 21, 2020 01:34

tilgovi marked this pull request as draft May 3, 2020 22:42

benoitc reviewed Aug 26, 2020

tilgovi force-pushed the safe-reaping branch from f59ea64 to c5ca40e Compare September 13, 2020 20:23

tilgovi marked this pull request as ready for review September 13, 2020 20:26

tilgovi force-pushed the safe-reaping branch from c5ca40e to 8ef2e40 Compare September 13, 2020 22:42

tilgovi requested a review from benoitc March 25, 2023 01:18

tilgovi commented Mar 25, 2023

tilgovi commented Mar 25, 2023

tilgovi mentioned this pull request Mar 25, 2023

worker_abort not running on timeout #2284

Closed

benoitc added the Feature/Core label May 10, 2023

tilgovi force-pushed the safe-reaping branch from 8ef2e40 to 9347afe Compare December 27, 2023 22:32

tilgovi changed the title ~~Safe Worker Reaping~~ Reap workers in the main loop Dec 27, 2023

tilgovi force-pushed the safe-reaping branch 2 times, most recently from 4387671 to 7e279d2 Compare December 27, 2023 22:36

tilgovi force-pushed the safe-reaping branch from 7e279d2 to de33726 Compare December 27, 2023 22:59

tilgovi force-pushed the safe-reaping branch from de33726 to a526c00 Compare December 27, 2023 22:59

tilgovi mentioned this pull request Dec 28, 2023

arbiter: don't log if handling SIGCHLD #3064

Open

tilgovi commented Dec 29, 2023

tilgovi requested a review from javabrett December 29, 2023 04:25

tilgovi added this to the 22.0 milestone Dec 29, 2023

piskvorky reviewed Feb 2, 2024

benoitc mentioned this pull request Feb 3, 2024

arbiter: Handle SIGCHLD like all other signals + misc signal handling improvements #3148

Open
