Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(servstate): use WaitDelay to avoid Command.Wait blocking on stdin/out/err #275

Merged
merged 10 commits into from
Aug 18, 2023

Conversation

benhoyt
Copy link
Contributor

@benhoyt benhoyt commented Aug 15, 2023

Use os.exec's Cmd.WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err are blocked when copying due to a sub-subprocess holding onto them. Read more details about the issue in golang/go#23019 and the proposed solution (that was added in Go 1.20) in golang/go#50436.

This solves issue 149, where Patroni wasn't restarting properly even after a KILL signal was sent to it. I had originally mis-diagnosed this problem as an issue with Pebble not tracking the process tree of processes that daemonise and change their process group (which is still an issue, but is not causing this problem). The Patroni process wasn't being marked as finished at all due to being blocked on the cmd.Wait(). Patroni starts sub-processes and "forwards" stdin/out/err, so the copy goroutines block. Thankfully Go 1.20 introduced WaitDelay to allow you to easily work around this exact problem.

The fix itself is this one-liner:

s.cmd.WaitDelay = s.killDelay() // defaults to 5 seconds

This will really only be a problem for services, but we make the same change for exec and exec health checks as it won't hurt there either.

Also, as a drive-by, this PR also canonicalises some log messages: our style is to start with an uppercase letter (for logs, not errors) and to use "Cannot X" rather than "Error Xing".

Fixes #149.

Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if
the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are
blocked when copying due to a sub-subprocess holding onto them. Read
more details in these issues:

- golang/go#23019
- golang/go#50436

This isn't the original intent of kill-delay, but it seems reasonable
to reuse it in this context.

Fixes canonical#149
It's unlikely to be needed here, but it won't hurt either. Use a
smaller timeout (1s) as these are intended for short-running commands.
Our style is to start with an uppercase letter (for logs, not errors)
and to use "Cannot X" rather than "Error Xing".
internals/daemon/daemon.go Outdated Show resolved Hide resolved
internals/overlord/servstate/handlers.go Show resolved Hide resolved
@hpidcock
Copy link
Member

We might need special handling of exec.ErrWaitDelay

@benhoyt
Copy link
Contributor Author

benhoyt commented Aug 16, 2023

@hpidcock No, I don't think we need to handle ErrWaitDelay specially. Normally pebble services are terminated with a signal, so will have a "terminated" exit code and that will be the error reported. ErrWaitDelay is only returned if the wait-delay kicks in and the service terminates successfully (with a zero exit code) -- but we do want an error in that case as it means the children haven't cleaned up I/O properly. Additionally, in reaper.WaitCommand we expect a non-success err from cmd.Wait() due to the reaper already waiting for PIDs, so we always get a syscall error from "wait" or "waitid".

Also factor out setupEmptyServiceManager as the same code is used in 7
places.
Copy link
Contributor

@flotter flotter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I have one concern reusing killDelay - please check if my understand makes sense or not.

There is one subtlety which this new feature may create in the future - since a service restart were never able to restart child process that promoted themselves outside the process group (or session), in the past that may have forced people to someone find ways to kill everything or restart. What we are now enabling is leaving a child behind running, and killing the file descriptors previously used to capture the output (is my understanding correct)? If this is true, a service restart will cause the new primary process to log output without having the stdout of the previous children included (This is assuming the service knows how to deal with reusing the running child, else it may even start another one)?

internals/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
internals/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
@benhoyt
Copy link
Contributor Author

benhoyt commented Aug 17, 2023

There is one subtlety which this new feature may create in the future - since a service restart were never able to restart child process that promoted themselves outside the process group (or session), in the past that may have forced people to someone find ways to kill everything or restart. What we are now enabling is leaving a child behind running, and killing the file descriptors previously used to capture the output (is my understanding correct)? If this is true, a service restart will cause the new primary process to log output without having the stdout of the previous children included (This is assuming the service knows how to deal with reusing the running child, else it may even start another one)?

@flotter Yeah, that understanding is correct. However, I think the behaviour in this PR is strictly better than what we have now, which is causing a deadlock, and causing the charm (or whatever's running Pebble) to not see that the service stopped at all. People that were working around may have to change (or remove!) their workaround, but I think that's okay. Also, I think it was only the data team using Patroni that this affected (at least, that we knew about) -- and they know about this fix.

Copy link

@marco6 marco6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding my 2 cents

internals/overlord/servstate/handlers.go Outdated Show resolved Hide resolved
internals/daemon/api_logs.go Show resolved Hide resolved
@benhoyt benhoyt changed the title Use Command.WaitDelay to avoid Command.Wait blocking on stdin/out/err fix(servstate): Use WaitDelay to avoid Command.Wait blocking on stdin/out/err Aug 18, 2023
@benhoyt benhoyt changed the title fix(servstate): Use WaitDelay to avoid Command.Wait blocking on stdin/out/err fix(servstate): use WaitDelay to avoid Command.Wait blocking on stdin/out/err Aug 18, 2023
@benhoyt benhoyt merged commit 1a66abb into canonical:master Aug 18, 2023
13 checks passed
@benhoyt benhoyt deleted the add-wait-delay branch August 18, 2023 04:49
marceloneppel added a commit to canonical/postgresql-k8s-operator that referenced this pull request Aug 18, 2023
@benhoyt benhoyt mentioned this pull request Sep 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pebble service is not restarted on on-check-failure if the OS process was killed
5 participants