Running pebble exec makes both client and server hang on server shutdown #163

Closed
woky opened this issue Nov 14, 2022 · 5 comments · Fixed by #380
Comments

@woky
Contributor

woky commented Nov 14, 2022

In the latest master (a1f5223), when a pebble exec command is running, shutting down the server makes both the client and the server hang.

Steps to reproduce

The following steps are run in the pebble source tree.

In terminal 1:

$ PEBBLE=$PWD PEBBLE_DEBUG=1 go run ./cmd/pebble run --verbose

In terminal 2:

$ PEBBLE=$PWD PEBBLE_DEBUG=1 go run ./cmd/pebble exec bash    

Now shut down the server (in terminal 1) with Ctrl-C or by sending SIGINT/SIGTERM directly to the pebble process. The server will send SIGKILL to the bash child, but the child will remain in a zombie state. The client (terminal 2) will stop processing input, even escape sequences like ^C and ^Z, with the last characters printed by bash remaining on the output.

To stop both the client and the server, you can send SIGKILL to the server. The client will notice that the websocket connection is gone and exit with a failure.

It's not related to bash being started as an interactive shell. The same behavior is observed when running e.g. "sleep 999" instead of bash.

@benhoyt
Contributor

benhoyt commented Nov 14, 2022

Good bug report -- thanks. I've noticed this when testing as well.

Implementation note: we use exec.CommandContext for execing the process, so there's already a context we can cancel. We'll probably need to wire that up to a Stop method on cmdstate.CommandManager that kills the tombs / cancels the contexts on all the current executions. Currently we don't store the tomb/context on the execution object, so we may need to add that as a field. If the manager has a Stop() method, it looks like StateEngine.Stop will already call it on all managers that implement it.
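
For illustration, such a Stop method could look roughly like this (a minimal sketch; the executions map and tomb field are hypothetical, not the actual cmdstate types):

package cmdstate

import (
	"sync"

	"gopkg.in/tomb.v2"
)

// execution is a sketch of the per-exec state; the tomb would need to be
// stored here as a field.
type execution struct {
	tomb *tomb.Tomb
}

// CommandManager here only illustrates tracking in-flight executions.
type CommandManager struct {
	mu         sync.Mutex
	executions map[string]*execution
}

// Stop kills the tomb of every in-flight execution, which cancels the
// context passed to exec.CommandContext. StateEngine.Stop calls Stop on
// every manager that implements it.
func (m *CommandManager) Stop() {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, e := range m.executions {
		e.tomb.Kill(nil)
	}
}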

@benhoyt
Contributor

benhoyt commented Oct 4, 2023

I spent an hour or so on this today, but this is going to require further investigation. Exec commands are tasks, and TaskRunner has a Stop method which is called correctly when the server exits. The task's tomb appears to be wired up correctly using tomb.Context and exec.CommandContext, so a SIGTERM is being sent to the process. But for some reason it's not exiting. Or is it exiting but the websockets aren't terminating correctly or something?
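
For reference, the wiring in question looks roughly like this (a simplified sketch, not the exact Pebble code):

package exectask

import (
	"context"
	"os/exec"

	"gopkg.in/tomb.v2"
)

// runWithTomb ties a command's lifetime to a task tomb: killing the tomb
// cancels the derived context, and exec.CommandContext then kills the
// process when that context is done.
func runWithTomb(t *tomb.Tomb, name string, args ...string) error {
	ctx := t.Context(context.Background()) // cancelled when the tomb is killed
	cmd := exec.CommandContext(ctx, name, args...)
	return cmd.Run()
}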

In any case, I'd like to look into this further, but have other things I need to work on now. In the meantime, if you can post specifics of your use case, or a workaround you've used, that might help others.

@weiiwang01

I think this happens because the reaper.Stop function stops the reaper loop without terminating all child processes and waiting for them to be reaped. This causes tasks like exec, which rely on reaper.WaitCommand, to wait forever, as they depend on the reaper loop to send the process's exit code before they can proceed.
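
In other words, WaitCommand blocks on a per-PID channel that only the reaper loop writes to; a simplified sketch of that dependency (the real code is in internals/reaper/reaper.go, and the names here are illustrative):

package reapersketch

import (
	"os/exec"
	"sync"
)

var (
	mutex sync.Mutex
	pids  = make(map[int]chan int) // PID -> exit-code channel, written by the reaper loop
)

// waitCommand mirrors the shape of reaper.WaitCommand: it blocks until the
// reaper loop reaps the PID and sends its exit code on the channel. If the
// loop has already been stopped, the receive below never returns.
func waitCommand(cmd *exec.Cmd) (int, error) {
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	ch := make(chan int, 1)
	mutex.Lock()
	pids[cmd.Process.Pid] = ch
	mutex.Unlock()

	exitCode := <-ch // hangs forever if the reaper loop is no longer running

	mutex.Lock()
	delete(pids, cmd.Process.Pid)
	mutex.Unlock()
	return exitCode, nil
}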

I created a patch that kills all child processes managed by the reaper and waits for them to exit, which seems to fix this issue.

diff --git a/internals/reaper/reaper.go b/internals/reaper/reaper.go
index fb1d7b6..f211ffd 100644
--- a/internals/reaper/reaper.go
+++ b/internals/reaper/reaper.go
@@ -21,6 +21,7 @@ import (
 	"os/exec"
 	"os/signal"
 	"sync"
+	"syscall"
 
 	"golang.org/x/sys/unix"
 	"gopkg.in/tomb.v2"
@@ -67,6 +68,23 @@ func Stop() error {
 	}
 	mutex.Unlock()
 
+	// Kill all child processes and wait for them to be reaped
+	mutex.Lock()
+	for len(pids) > 0 {
+		for p, ch := range pids {
+			syscall.Kill(p, syscall.SIGKILL)
+			mutex.Unlock()
+			// Wait for the reaper to finish reaping this process
+			code, ok := <-ch
+			if ok {
+				ch <- code
+			}
+			mutex.Lock()
+			break
+		}
+	}
+	mutex.Unlock()
+
 	reaperTomb.Kill(nil)
 	reaperTomb.Wait()
 	reaperTomb = tomb.Tomb{}
@@ -201,6 +219,7 @@ func WaitCommand(cmd *exec.Cmd) (int, error) {
 
 	// Wait for reaper to reap this PID and send us the exit code.
 	exitCode := <-ch
+	close(ch)
 
 	// Remove PID from waits map once we've received exit code from reaper.
 	mutex.Lock()

@weiiwang01

weiiwang01 commented Feb 29, 2024

As for use cases: we've identified an issue in our production environment where some workload containers are terminated during the execution of a command by the charm. This problem prevents the container from restarting until the terminationGracePeriodSeconds elapses, which is 300 seconds.

@benhoyt
Contributor

benhoyt commented Mar 6, 2024

Okay, I understand better why this is happening now (and thanks for your pointer to the reaper, @weiiwang01). It's because the ServiceManager is stopping the reaper early in the process, before TaskRunner has had a chance to abort the exec tasks (aborting an exec task sends SIGKILL to its pid via the tomb and command context). Logs showing the order here:

^C2024-03-06T03:50:11.043Z [pebble] Exiting on interrupt signal.
2024-03-06T03:50:11.043Z [pebble] DEBUG No services to stop.
2024-03-06T03:50:12.044Z [pebble] DEBUG state engine: stopping *servstate.ServiceManager
2024-03-06T03:50:12.045Z [pebble] DEBUG Reaper stopped.
2024-03-06T03:50:12.045Z [pebble] DEBUG state engine: stopping *logstate.LogManager
2024-03-06T03:50:12.045Z [pebble] DEBUG state engine: stopping *cmdstate.CommandManager
2024-03-06T03:50:12.045Z [pebble] DEBUG state engine: stopping *state.TaskRunner

Then when TaskRunner.Stop is called, it calls tomb.Kill and then tomb.Wait on each task (exec) tomb, and because the reaper's not running, the tomb.Wait hangs.

So I think the best fix is to move the reaper.Start and reaper.Stop to the top level, instead of putting them in servstate.NewManager and ServiceManager.Stop. But we'll have to be careful because of 1) tests, and 2) pebble enter (pebble enter needs to do the same thing).
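
A rough sketch of that ordering (hypothetical helper names; this assumes reaper.Start returns an error like reaper.Stop does):

package daemonsketch

import "github.com/canonical/pebble/internals/reaper"

// runCommand sketches the proposed ordering: start the reaper before any
// managers are created, and stop it only after the state engine (including
// TaskRunner) has shut down, so aborted exec tasks can still receive their
// exit codes from the reaper.
func runCommand() error {
	if err := reaper.Start(); err != nil { // assumed signature
		return err
	}

	err := runDaemon() // hypothetical stand-in for the real daemon lifecycle

	if stopErr := reaper.Stop(); err == nil {
		err = stopErr
	}
	return err
}

// runDaemon is a placeholder for creating the managers, running the daemon,
// and stopping the state engine.
func runDaemon() error {
	return nil
}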

This is closely related to #284, which suggests a similar fix.

benhoyt added a commit to benhoyt/pebble that referenced this issue Mar 11, 2024
For example, if an exec command is running when you Ctrl-C the Pebble
daemon, it hangs.

It's hanging because the ServiceManager is stopping the reaper early in
the process, before TaskRunner has had a chance to abort the exec tasks
(aborting an exec task sends SIGKILL to its pid via the tomb and
command context).

Then when TaskRunner.Stop is called, it calls tomb.Kill and then
tomb.Wait on each task (exec) tomb, and because the reaper's not
running, the tomb.Wait hangs.

So the fix is to move the reaper.Start and reaper.Stop to the top
level (inside the "run" command, which is also used for "enter"),
instead of putting them in servstate.NewManager and
ServiceManager.Stop.

After doing this, some of the tests also had to be modified to start
and stop the reaper.

Fixes canonical#163