Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGCHLD doesn't reap properly #44

Closed
vishvananda opened this issue Jan 7, 2016 · 8 comments
Closed

SIGCHLD doesn't reap properly #44

vishvananda opened this issue Jan 7, 2016 · 8 comments

Comments

@vishvananda
Copy link

If this is intended to be pid 1, it needs to properly reap orphan processes by handling SIGCHLD. There should be a handler for SIGCHLD that essentially does:

while (waitpid((pid_t)(-1), 0, WNOHANG) > 0) {};

more info here:

http://stackoverflow.com/questions/20688982/zombie-process-vs-orphan-process

@chriskuehl
Copy link
Contributor

Hi,

Does this look like the line you're looking for?
https://github.com/Yelp/dumb-init/blob/master/dumb-init.c#L219

We have a test which creates a zombie which is reowned by PID1 and reaped:
https://github.com/Yelp/dumb-init/blob/master/tests/test-zombies

If it's not working for you, can you provide a test case?

Thanks!

@vishvananda
Copy link
Author

Generally the practice is to wait on the pid of the child for the sigchild and then waitpid(-1) with WNOHANG for children to exit. It appears that your method of just doing waitpid(-1) every time will catch the exit code on the next iteration of the while loop, but I believe if a parent and child die at roughly the same time but the kernel sends you the parent pid first, your code will not clean up the child. This is generally the reason for doing a loop with WNOHANG. This is probably an edge case but worth considering.

@bukzor
Copy link
Contributor

bukzor commented Jan 7, 2016

@vishvananda I don't think there's an actual problem here, in any case that I can imagine.

  1. You're using dumb-init as pid1 in a container, and parent SIGCHILD arrives before its child's. We'll kill its process group and exit. The container will exit. There may be zombies for a split second, but all cleanups should run.
  2. You're using dumb-init as pid1 on a Real Machine, and parent SIGCHILD arrives before its child's. We'll kill its process group and exit. The machine will halt. Hopefully this is 100% analogous to case 1, but I don't know.
  3. You're using dumb-init as a subordinate (non pid1) process on a Real Machine, and parent SIGCHILD arrives before its child's. We'll kill its process group and exit. All cleanups should run, and any would-be zombies will be reaped by the real init process.

Can you present an actually problematic case?

@vishvananda
Copy link
Author

Fair points. I can't come up with a pressing issue that would be caused by this unless the main init process is failing for some reason. I just tend to go for the theoretically correct approach in case there is some case I can't think of. Happy to close this if you don't see it as an issue.

@bukzor
Copy link
Contributor

bukzor commented Jan 8, 2016

I can't see any case where things go wrong. Long running systems have no zombies, and any SIGTERM handlers (et al) get run in all cases. Any zombies that do occur will be reaped in a timely manner, somehow.

Closing this until we come up with a better counterexample.

@bukzor bukzor closed this as completed Jan 8, 2016
@bukzor
Copy link
Contributor

bukzor commented Jan 8, 2016

Doing a quick survey, systemd handles zombie reaping like so:

/* Let the kernel reap children for us */
(void) sigaction(SIGCHLD, &sa, NULL);

And upstart's strategy is pretty analogous:

/* Don't ignore SIGCHLD or SIGALRM, but don't respond to them
 * directly; it's enough that they interrupt the main loop and
 * get dealt with during it.
 */
nih_signal_set_handler (SIGCHLD, nih_signal_handler);

Neither uses WNOHANG, which would cause a busy loop.
We could do the same, except we want to take an action when a particular PID dies;
we signal a process group and die when our direct descendant dies.

We could have done all the work in a SIGCHLD handler, but then our main() function needs to run infinitely without busy looping, and also be able to be told to die by the SIGCHLD handler.
Doing it directly in the main loop is exactly the same I think, but a bit cleaner in our situation.

@vishvananda
Copy link
Author

no problem with doing it in the main loop if that is what you prefer. just wanted to comment that WNOHANG does not have to cause a busy loop since it returns -1 if there are no children that have exited. The loop will exit as soon as all of the waiting children have exited. This of course means reaping after your initial wait. This is how tini does it for example: https://github.com/krallin/tini/blob/master/src/tini.c#L348

@beekhof
Copy link

beekhof commented Apr 6, 2017

You're using dumb-init as pid1 in a container, and parent SIGCHILD arrives before its child's. We'll kill its process group and exit. The container will exit. There may be zombies for a split second, but all cleanups should run.

This assumes a very flat process tree which wont always be the case. If the "parent" in question is not a first-level child of dumb-init then the container has no need to exit and the zombies will persist for an extended period of time.

This matches the behaviour we have seen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants