Running pebble exec makes both client and server hang on server shutdown #163

Closed
woky opened this issue Nov 14, 2022 · 5 comments · Fixed by #380
Comments

@woky
Contributor

woky commented Nov 14, 2022

In the latest master (a1f5223), when a pebble exec command is running, shutting down the server makes both the client and the server hang.

Steps to reproduce

The following steps are run in the pebble source tree.

In terminal 1:

$ PEBBLE=$PWD PEBBLE_DEBUG=1 go run ./cmd/pebble run --verbose

In terminal 2:

$ PEBBLE=$PWD PEBBLE_DEBUG=1 go run ./cmd/pebble exec bash    

Now shut down the server (in terminal 1) with Ctrl-C or by sending SIGINT/SIGTERM directly to the pebble process. The server will send SIGKILL to the bash child, but the child will remain in a zombie state. The client (terminal 2) will stop processing input, even escape sequences like ^C and ^Z, with the last characters printed by bash remaining on the output.

To stop both the client and the server, you can send SIGKILL to the server. The client will notice that the websocket connection is gone and exit with a failure.

It's not related to bash being started as an interactive shell. The same behavior is observed when running e.g. "sleep 999" instead of bash.

@benhoyt
Contributor

benhoyt commented Nov 14, 2022

Good bug report -- thanks. I've noticed this when testing as well.

Implementation note: we use exec.CommandContext for execing the process, so there's already a context we can cancel. We'll probably need to wire that up to a Stop method on cmdstate.CommandManager that kills the tombs / cancels the contexts on all the current executions. Currently we don't store the tomb/context on the execution object, so we may need to add that as a field. If the manager has a Stop() method, it looks like StateEngine.Stop will already call it on all managers that implement it.
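
For illustration, such a Stop method could look roughly like this (a minimal sketch; the executions map and tomb field are hypothetical, not the actual cmdstate types):

package cmdstate

import (
	"sync"

	"gopkg.in/tomb.v2"
)

// execution is a sketch of the per-exec state; the tomb would need to be
// stored here as a field.
type execution struct {
	tomb *tomb.Tomb
}

// CommandManager here only illustrates tracking in-flight executions.
type CommandManager struct {
	mu         sync.Mutex
	executions map[string]*execution
}

// Stop kills the tomb of every in-flight execution, which cancels the
// context passed to exec.CommandContext. StateEngine.Stop calls Stop on
// every manager that implements it.
func (m *CommandManager) Stop() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, e := range m.executions {
		e.tomb.Kill(nil)
	}
}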

@benhoyt
Contributor

benhoyt commented Oct 4, 2023

I spent an hour or so on this today, but this is going to require further investigation. Exec commands are tasks, and TaskRunner has a Stop method which is called correctly when the server exits. The task's tomb appears to be wired up correctly using tomb.Context and exec.CommandContext, so a SIGTERM is being sent to the process. But for some reason it's not exiting. Or is it exiting but the websockets aren't terminating correctly or something?
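
For reference, the wiring in question looks roughly like this (a simplified sketch, not the exact Pebble code):

package exectask

import (
	"context"
	"os/exec"

	"gopkg.in/tomb.v2"
)

// runWithTomb ties a command's lifetime to a task tomb: killing the tomb
// cancels the derived context, and exec.CommandContext then kills the
// process when that context is done.
func runWithTomb(t *tomb.Tomb, name string, args ...string) error {
	ctx := t.Context(context.Background()) // cancelled when the tomb is killed
	cmd := exec.CommandContext(ctx, name, args...)
	return cmd.Run()
}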

In any case, I'd like to look into this further, but have other things I need to work on now. In the meantime, if you can post specifics of your use case, or a workaround you've used, that might help others.

@weiiwang01

I think this happens because the reaper.Stop function stops the reaper loop without terminating all child processes and waiting for them to be reaped. This causes tasks like exec, which rely on reaper.WaitCommand, to wait forever, as they depend on the reaper loop to send the process's exit code before they can proceed.
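
In other words, WaitCommand blocks on a per-PID channel that only the reaper loop writes to; a simplified sketch of that dependency (the real code is in internals/reaper/reaper.go, and the names here are illustrative):

package reapersketch

import (
	"os/exec"
	"sync"
)

var (
	mutex sync.Mutex
	pids  = make(map[int]chan int) // PID -> exit-code channel, written by the reaper loop
)

// waitCommand mirrors the shape of reaper.WaitCommand: it blocks until the
// reaper loop reaps the PID and sends its exit code on the channel. If the
// loop has already been stopped, the receive below never returns.
func waitCommand(cmd *exec.Cmd) (int, error) {
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	ch := make(chan int, 1)
	mutex.Lock()
	pids[cmd.Process.Pid] = ch
	mutex.Unlock()

	exitCode := <-ch // hangs forever if the reaper loop is no longer running

	mutex.Lock()
	delete(pids, cmd.Process.Pid)
	mutex.Unlock()
	return exitCode, nil
}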

I created a patch that kills all child processes managed by the reaper and waits for them to exit, which seems to fix this issue.

diff --git a/internals/reaper/reaper.go b/internals/reaper/reaper.go
index fb1d7b6..f211ffd 100644
--- a/internals/reaper/reaper.go
+++ b/internals/reaper/reaper.go
@@ -21,6 +21,7 @@ import (
 	"os/exec"
 	"os/signal"
 	"sync"
+	"syscall"
 
 	"golang.org/x/sys/unix"
 	"gopkg.in/tomb.v2"
@@ -67,6 +68,23 @@ func Stop() error {
 	}
 	mutex.Unlock()
 
+	// Kill all child processes and wait for them to be reaped
+	mutex.Lock()
+	for len(pids) > 0 {
+		for p, ch := range pids {
+			syscall.Kill(p, syscall.SIGKILL)
+			mutex.Unlock()
+			// Wait for the reaper to finish reaping this process
+			code, ok := <-ch
+			if ok {
+				ch <- code
+			}
+			mutex.Lock()
+			break
+		}
+	}
+	mutex.Unlock()
+
 	reaperTomb.Kill(nil)
 	reaperTomb.Wait()
 	reaperTomb = tomb.Tomb{}
@@ -201,6 +219,7 @@ func WaitCommand(cmd *exec.Cmd) (int, error) {
 
 	// Wait for reaper to reap this PID and send us the exit code.
 	exitCode := <-ch
+	close(ch)
 
 	// Remove PID from waits map once we've received exit code from reaper.
 	mutex.Lock()

@weiiwang01

weiiwang01 commented Feb 29, 2024

As for use cases: we've identified an issue in our production environment where some workload containers are terminated during the execution of a command by the charm. This problem prevents the container from restarting until the terminationGracePeriodSeconds elapses, which is 300 seconds.

@benhoyt
Contributor

benhoyt commented Mar 6, 2024

Okay, I understand better why this is happening now (and thanks for your pointer to the reaper, @weiiwang01). It's because the ServiceManager is stopping the reaper early in the process, before TaskRunner has had a chance to abort the exec tasks (aborting an exec task sends SIGKILL to its pid via the tomb and command context). Logs showing the order here:

^C2024-03-06T03:50:11.043Z [pebble] Exiting on interrupt signal.
2024-03-06T03:50:11.043Z [pebble] DEBUG No services to stop.
2024-03-06T03:50:12.044Z [pebble] DEBUG state engine: stopping *servstate.ServiceManager
2024-03-06T03:50:12.045Z [pebble] DEBUG Reaper stopped.
2024-03-06T03:50:12.045Z [pebble] DEBUG state engine: stopping *logstate.LogManager
2024-03-06T03:50:12.045Z [pebble] DEBUG state engine: stopping *cmdstate.CommandManager
2024-03-06T03:50:12.045Z [pebble] DEBUG state engine: stopping *state.TaskRunner

Then when TaskRunner.Stop is called, it calls tomb.Kill and then tomb.Wait on each task (exec) tomb, and because the reaper's not running, the tomb.Wait hangs.

So I think the best fix is to move the reaper.Start and reaper.Stop to the top level, instead of putting them in servstate.NewManager and ServiceManager.Stop. But we'll have to be careful because of 1) tests, and 2) pebble enter (pebble enter needs to do the same thing).
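
A rough sketch of that ordering (hypothetical helper names; this assumes reaper.Start returns an error like reaper.Stop does):

package daemonsketch

import "github.com/canonical/pebble/internals/reaper"

// runCommand sketches the proposed ordering: start the reaper before any
// managers are created, and stop it only after the state engine (including
// TaskRunner) has shut down, so aborted exec tasks can still receive their
// exit codes from the reaper.
func runCommand() error {
	if err := reaper.Start(); err != nil { // assumed signature
		return err
	}

	err := runDaemon() // hypothetical stand-in for the real daemon lifecycle

	if stopErr := reaper.Stop(); err == nil {
		err = stopErr
	}
	return err
}

// runDaemon is a placeholder for creating the managers, running the daemon,
// and stopping the state engine.
func runDaemon() error {
	return nil
}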

This is closely related to #284, which suggests a similar fix.

benhoyt added a commit to benhoyt/pebble that referenced this issue Mar 11, 2024
For example, if an exec command is running when you Ctrl-C the Pebble
daemon, it hangs.

It's hanging because the ServiceManager is stopping the reaper early in
the process, before TaskRunner has had a chance to abort the exec tasks
(aborting an exec task sends SIGKILL to its pid via the tomb and
command context).

Then when TaskRunner.Stop is called, it calls tomb.Kill and then
tomb.Wait on each task (exec) tomb, and because the reaper's not
running, the tomb.Wait hangs.

So the fix is to move the reaper.Start and reaper.Stop to the top
level (inside the "run" command, which is also used for "enter"),
instead of putting them in servstate.NewManager and
ServiceManager.Stop.

After doing this, some of the tests also had to be modified to start
and stop the reaper.

Fixes canonical#163