
fix: propagate child process exit code #202

Merged
mtojek merged 2 commits into main from mtojek/fix-exit-code-propagation
May 21, 2026

Conversation

Member

@mtojek mtojek commented May 19, 2026

Fixes #190

Both LandJail.Run() and NSJailManager.Run() always returned nil, discarding the child process exit code. The boundary process exited 0 regardless of what the target command returned.

Changes

  • landjail/manager.go, nsjail_manager/manager.go: Capture the child process error via a buffered channel instead of discarding it in the goroutine. Run() now blocks on the channel after the select and returns the error (which includes *exec.ExitError with the correct code).
  • landjail/child.go, nsjail_manager/child.go: Return the raw *exec.ExitError from cmd.Run() instead of calling os.Exit() or wrapping it in fmt.Errorf(). Serpent's RunCommandError has Unwrap(), so errors.As can find *exec.ExitError through the entire chain.
  • cli/cli.go: The handler calls os.Exit(exitCode) when the child process exits with a non-zero code. All cleanup (proxy stop, etc.) has already happened inside Run() via defers. This ensures the correct exit code is propagated regardless of how the caller handles errors, both as a standalone binary and when embedded as a coder boundary subcommand (no changes needed in coder/coder).

How to test

Build and run:

go build -o ./boundary ./cmd/boundary/

# success → 0
./boundary --jail-type landjail -- true; echo $?

# false → 1
./boundary --jail-type landjail -- false; echo $?

# arbitrary exit code → 42
./boundary --jail-type landjail -- bash -c 'exit 42'; echo $?

# arbitrary exit code → 127
./boundary --jail-type landjail -- bash -c 'exit 127'; echo $?

# command not found → 1
./boundary --jail-type landjail -- no-such-cmd; echo $?

landjail requires kernel 6.7+ (Landlock V4). Use --jail-type nsjail with appropriate privileges on older kernels.


Generated by Coder Agents

@mtojek mtojek force-pushed the mtojek/fix-exit-code-propagation branch from 02622ca to 61d23c4 Compare May 19, 2026 11:34
Both LandJail.Run() and NSJailManager.Run() always returned nil,
discarding the child process exit code. The landjail child also
wrapped exit codes in fmt.Errorf() instead of calling os.Exit().

Changes:
- Add exitcode.Error type to carry exit codes through the error chain
- Fix landjail child to call os.Exit(exitCode), matching nsjail behavior
- Fix both managers to capture child errors via a channel and return
  exitcode.Error from Run()
- Fix main.go to extract exitcode.Error before defaulting to os.Exit(1)
- Change NSJailManager.RunChildProcess to return error (was void)

Fixes #190
@mtojek mtojek force-pushed the mtojek/fix-exit-code-propagation branch from 61d23c4 to cb56ca1 Compare May 19, 2026 11:38
@mtojek mtojek requested a review from SasSwart May 19, 2026 11:42
@mtojek mtojek marked this pull request as ready for review May 19, 2026 11:42
Contributor

@SasSwart SasSwart left a comment


Nothing blocking given this is easy to patch and a minor fix. But a few comments to consider nonetheless.

Comment thread cli/cli.go
Comment on lines +247 to +258

// If the child process exited with a non-zero code, exit
// with the same code directly. All cleanup (proxy, etc.)
// has already happened inside Run(). Exiting here ensures
// the correct code is propagated regardless of how the
// calling framework handles errors (standalone binary or
// embedded as a coder subcommand).
var exitErr *exec.ExitError
if errors.As(err, &exitErr) {
	os.Exit(exitErr.ExitCode())
}
return err
Contributor


This is a convenient solution, but it makes me nervous to have an os.Exit that is more than a single level of indirection from the entrypoint like this. Looking at this bit of code here, we don't know what cleanup would have happened on the return path between here and the entry point.

I think the proper solution is to do the error checking near the entry point both here and in the coder subcommand.

In practice, the risk is low and it's easy to patch later, so I don't think this blocks the PR. It's worth a mention though.

Member Author


Agreed it's not ideal. The problem is the embedded mode (coder boundary ...): we don't control coder's entrypoint, and serpent wraps our returned error in RunCommandError, so coder's main() just does os.Exit(1), losing the actual code.

To do it "properly" we'd need changes in coder/coder or serpent. This is a conscious tradeoff: all cleanup (proxy, iptables) already ran inside Run() via defers before the error returns, so the os.Exit is safe.

Let me know your thoughts!

Comment thread landjail/child.go Outdated
  // This is an unexpected error
  logger.Error("Command execution failed", "error", err)
- return fmt.Errorf("command execution failed: %v", err)
+ return err
Contributor


nit: I feel like the error wrapping here was useful. It provides a better single description of the failure path than disjoint debug logs that might be filtered out.

Member Author


Ok, brought it back 👍

Comment thread landjail/manager.go
// error is already buffered. In the signal path the child may still
// be running; return nil so deferred cleanup (iptables, proxy) can
// proceed before the process exits.
select {
Contributor


Asking for clarity:

why do we need a second select here instead of a three case select above?

select {
case sig := <-sigChan:
	// ...
case err := <-childErr:
	// ...
case <-ctx.Done():
	// ...
}

Member Author


When the child finishes, the goroutine sends on childErr AND calls defer cancel(), which closes ctx.Done(). So both channels are ready at roughly the same time. Go picks randomly between ready cases - if we land in ctx.Done() instead of childErr, we lose the exit code error and return nil.

Two selects avoid that: first one waits for signal or context cancellation, second one (non-blocking) drains the child result. In the ctx.Done path the error is already buffered so we always get it. In the signal path the child may still be running, so default: return nil lets deferred cleanup proceed.

Addresses review feedback: re-add error wrapping for ExitError
with fmt.Errorf and %w verb so the error message is descriptive
while preserving the *exec.ExitError type for errors.As().
@mtojek mtojek force-pushed the mtojek/fix-exit-code-propagation branch from 7b2e933 to 8b03bb7 Compare May 20, 2026 13:48
@mtojek mtojek merged commit 3e1e57b into main May 21, 2026
5 checks passed
@mtojek mtojek deleted the mtojek/fix-exit-code-propagation branch May 21, 2026 07:14

Development

Successfully merging this pull request may close these issues.

Boundary CLI doesn't propagate processes exit code

2 participants