Skip to content

fix(security): tighten privilege boundaries on cgroup/mount/IPC/control-socket#14

Open
andypost wants to merge 1 commit into
masterfrom
claude/audit-pr-c-privilege-boundary
Open

fix(security): tighten privilege boundaries on cgroup/mount/IPC/control-socket#14
andypost wants to merge 1 commit into
masterfrom
claude/audit-pr-c-privilege-boundary

Conversation

@andypost
Copy link
Copy Markdown
Owner

@andypost andypost commented May 8, 2026

Summary

Audit-driven privilege-boundary tightening (PR-C from security-audit.md).
Carries the only Critical finding (cgroup TOCTOU).

Findings addressed

  • V6 [Critical] Cgroup TOCTOU /proc/self/cgroup (process.c:619, cgroup.c:102)
  • V6 [High] Mount-destination symlink TOCTOU (fs.c, isolation.c:783)
  • V5 [Medium] No sender-type ACL on port handler dispatch (port.c:177)
  • V4 [High] Control socket lacks peer-cred check (controller.c:720)

Approach notes

  • Control socketgetsockopt(SO_PEERCRED) on Linux, getpeereid() on *BSD/Darwin, no-op fallback elsewhere. Connections are dropped during nxt_controller_conn_init before any bytes are read.
  • Sender-type ACL — implemented in nxt_port_handler() keyed on nxt_recv_msg_cmsg_pid() (kernel-validated where SCM_CREDENTIALS is available). Only the privileged message types are restricted; data/rpc/mmap traffic is unaffected. The audit's escape hatch (scope down if no sender field exists) wasn't needed — the existing cmsg PID is enough.
  • Mount TOCTOU — used openat2(RESOLVE_BENEATH|RESOLVE_NO_SYMLINKS) per the audit's first suggestion, gated on NXT_HAVE_OPENAT2. The validation happens between mkdir_p() and mount(2); a swapped symlink fails the open. The build's existing NXT_HAVE_OPENAT2 feature gate covers this.
  • Cgroup TOCTOU — picked the audit's preferred lower-risk fix: resolve relative cgroup paths against /proc/<child_pid>/cgroup rather than /proc/self/cgroup. No IPC reordering. The child PID is known to the parent at the call site (just-forked).

Test plan

  • ./configure --openssl && make -j unitd — clean build.
  • pytest-3 --collect-only test/ — 898 tests collected, no import breakage.
  • sudo pytest-3 test/test_*_isolation*.py — not run in this worktree (no passwordless sudo). Recommended before merge.
  • Manual smoke-test: unitd starts, control API reachable from same UID, rejected from different UID with alert in unit.log.
  • Manual smoke-test: rootfs mount with a symlinked destination is now rejected with mount destination ... escapes rootfs ... alert.

Upstream

Same fixes apply to freeunitorg/freeunit; will forward after merge here.

Generated by Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements several security hardening measures, including UID verification for control socket connections, sender-type ACLs for privileged port messages, and the use of openat2 to prevent symlink races during rootfs mount preparation. Additionally, cgroup path resolution now targets the child process's view to avoid TOCTOU issues. Feedback identifies a critical bug where cgroup cleanup may fail if the process has already terminated, a recommendation to handle ENOSYS for the openat2 syscall on older kernels, and a suggestion to expand the port message ACL to include other privileged message types.

Comment thread src/nxt_cgroup.c Outdated
Comment on lines +83 to +89
ret = nxt_mk_cgpath(task, "", cgroot, process->pid);
if (nxt_slow_path(ret == NXT_ERROR)) {
return;
}

ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath);
ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath,
process->pid);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using process->pid to resolve the cgroup path during cleanup is problematic. If the process has already terminated (which is common when nxt_cgroup_cleanup is called during process release), the /proc/<pid>/cgroup entry will no longer exist. This will cause nxt_mk_cgpath to fail, preventing the removal of cgroup directories and leading to a resource leak.

Since the directories were created relative to the child's cgroup view to avoid TOCTOU, the cleanup must also use that same path. The resolved path should ideally be cached in the nxt_process_t structure during the add phase so it can be used reliably during cleanup even after the process has exited.

Comment thread src/nxt_isolation.c
Comment on lines +835 to +842
fd = syscall(SYS_openat2, rootfs_fd, rel, &how, sizeof(how));
if (fd == -1) {
nxt_alert(task, "mount destination %s escapes rootfs %s "
"or contains symlinks: %E", dst, rootfs_path,
nxt_errno);
ret = NXT_ERROR;
goto undo;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The syscall(SYS_openat2, ...) call does not handle the ENOSYS error. If Unit is compiled on a system where openat2 is supported in the headers but run on an older kernel that lacks the syscall, applications with rootfs isolation will fail to start with an alert.

It would be safer to check for ENOSYS and either provide a more descriptive error message or fall back to a different validation method if the security policy allows.

Comment thread src/nxt_port.c Outdated
Comment on lines +184 to +199
static const uint32_t nxt_port_sender_acl[NXT_PORT_MSG_MAX] = {
[_NXT_PORT_MSG_START_PROCESS] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_SOCKET] = nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_SOCKET_UNLINK] = nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_MODULES] = nxt_port_acl(NXT_PROCESS_DISCOVERY),
[_NXT_PORT_MSG_CONF_STORE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
[_NXT_PORT_MSG_CERT_GET] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_CERT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
[_NXT_PORT_MSG_SCRIPT_GET] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_SCRIPT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
[_NXT_PORT_MSG_ACCESS_LOG] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The nxt_port_sender_acl appears to be incomplete. Several other privileged message types carry significant side effects and should be restricted to prevent a compromised application process from forging them.

Specifically, _NXT_PORT_MSG_QUIT, _NXT_PORT_MSG_REMOVE_PID, _NXT_PORT_MSG_CHANGE_FILE, and _NXT_PORT_MSG_NEW_PORT should likely be restricted to NXT_PROCESS_MAIN. Additionally, _NXT_PORT_MSG_APP_RESTART should be restricted to NXT_PROCESS_CONTROLLER.

static const uint32_t  nxt_port_sender_acl[NXT_PORT_MSG_MAX] = {
    [_NXT_PORT_MSG_START_PROCESS] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
                                  | nxt_port_acl(NXT_PROCESS_ROUTER),
    [_NXT_PORT_MSG_SOCKET]        = nxt_port_acl(NXT_PROCESS_ROUTER),
    [_NXT_PORT_MSG_SOCKET_UNLINK] = nxt_port_acl(NXT_PROCESS_ROUTER),
    [_NXT_PORT_MSG_MODULES]       = nxt_port_acl(NXT_PROCESS_DISCOVERY),
    [_NXT_PORT_MSG_CONF_STORE]    = nxt_port_acl(NXT_PROCESS_CONTROLLER),
    [_NXT_PORT_MSG_CERT_GET]      = nxt_port_acl(NXT_PROCESS_CONTROLLER)
                                  | nxt_port_acl(NXT_PROCESS_ROUTER),
    [_NXT_PORT_MSG_CERT_DELETE]   = nxt_port_acl(NXT_PROCESS_CONTROLLER),
    [_NXT_PORT_MSG_SCRIPT_GET]    = nxt_port_acl(NXT_PROCESS_CONTROLLER)
                                  | nxt_port_acl(NXT_PROCESS_ROUTER),
    [_NXT_PORT_MSG_SCRIPT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
    [_NXT_PORT_MSG_ACCESS_LOG]    = nxt_port_acl(NXT_PROCESS_CONTROLLER)
                                  | nxt_port_acl(NXT_PROCESS_ROUTER),
    [_NXT_PORT_MSG_QUIT]          = nxt_port_acl(NXT_PROCESS_MAIN),
    [_NXT_PORT_MSG_REMOVE_PID]    = nxt_port_acl(NXT_PROCESS_MAIN),
    [_NXT_PORT_MSG_CHANGE_FILE]   = nxt_port_acl(NXT_PROCESS_MAIN),
    [_NXT_PORT_MSG_NEW_PORT]      = nxt_port_acl(NXT_PROCESS_MAIN),
    [_NXT_PORT_MSG_APP_RESTART]   = nxt_port_acl(NXT_PROCESS_CONTROLLER),
};

andypost added a commit that referenced this pull request May 8, 2026
Three follow-ups from the gemini-code-assist review on
#14:

* Cgroup cleanup vs. dead child (HIGH).  After my V6 fix moved
  nxt_cgroup_proc_add() to read /proc/<child_pid>/cgroup, the
  symmetric nxt_cgroup_cleanup() inherited the same lookup — but
  cleanup runs after the child has exited, so /proc/<pid>/cgroup
  is already gone and rmdir() of the per-process cgroup directory
  was being skipped (resource leak).  Cache the resolved directory
  in process->isolation.cgroup.resolved_path during proc_add and
  reuse it at cleanup; recompute cgroot from /proc/self/cgroup
  (parent's view, pid=0) since the child's view is unavailable
  and we only need a stop boundary for the rmdir walk.  Defensive
  break on strrchr() == NULL while we're here.

* openat2 ENOSYS handling (MEDIUM).  Linux 5.6+ is required for
  openat2(2); on older kernels the syscall returns ENOSYS even
  when the userspace headers are present.  Without this fix every
  isolation setup with rootfs would fail on pre-5.6 kernels.
  Distinguish ENOSYS from real RESOLVE_BENEATH violations: warn
  once and fall through to the unguarded mount (matching the
  security posture before the rootfs-symlink fix landed) instead
  of failing the mount outright.

* Sender-type ACL extension (MEDIUM, partial).  Add APP_RESTART →
  CONTROLLER; that one is unambiguous (single sender at
  src/nxt_controller.c:2362).  The reviewer also suggested QUIT,
  REMOVE_PID, CHANGE_FILE, and NEW_PORT, but those messages are
  legitimately sent from main / router / prototype during shutdown
  and fd-publish flows (e.g. nxt_router.c:4648, nxt_application.c:704),
  so tightening needs per-caller analysis to avoid regressions and
  is left as audit follow-up — documented inline in the table.

No protocol or config-surface changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@andypost andypost force-pushed the claude/audit-pr-c-privilege-boundary branch from 40a068d to 8d34aeb Compare May 8, 2026 10:33
@andypost
Copy link
Copy Markdown
Owner Author

andypost commented May 8, 2026

Thanks for the review — all three findings addressed and the branch force-pushed (squashed back to a single commit; matches the PR #8 / PR #13 precedent of one PR / one commit).

1. [High] Cgroup cleanup vs. dead child — valid. After my V6 fix moved nxt_cgroup_proc_add() to read /proc/<child_pid>/cgroup, the symmetric nxt_cgroup_cleanup() inherited the same lookup, but cleanup runs after the child has exited so /proc/<pid>/cgroup is already gone — nxt_mk_cgpath() would fail and the per-process cgroup directory was being silently leaked. Cached the resolved directory in process->isolation.cgroup.resolved_path during proc_add and reuse it at cleanup; recompute cgroot from /proc/self/cgroup (the still-alive parent's view, pid=0) since we only need a stop boundary for the rmdir walk. Defensive break on strrchr() == NULL while we're there.

2. [Medium] openat2 ENOSYS — valid and important. On Linux <5.6 the syscall returns ENOSYS even when the userspace headers compile fine, which would have failed every isolation setup on those hosts. Distinguish ENOSYS from real RESOLVE_BENEATH violations: warn once and fall through to the unguarded mount(2) (matching the prior security posture) rather than blocking the mount entirely.

3. [Medium] ACL extension — added APP_RESTART → CONTROLLER (single, unambiguous sender at src/nxt_controller.c:2362). The other four suggested entries — QUIT, REMOVE_PID, CHANGE_FILE, NEW_PORT — are intentionally not added: their senders legitimately span main/router/prototype during shutdown and fd-publish flows (e.g. nxt_router.c:4648-5254 shutting down workers via QUIT; nxt_application.c:704 propagating QUIT to app children inside a prototype). Adding them blindly would break the shutdown chain. Per-caller analysis tracked as audit follow-up; deferral documented inline in nxt_port_sender_acl[].

Squashed history: original PR-C fix + Gemini follow-up are now one commit (8d34aebc). Ready for re-review.

andypost added a commit that referenced this pull request May 13, 2026
…oundary state

Audit-driven controller robustness + request-boundary state pass
(PR-H from security-audit.md / #10).  Five findings;
no protocol or config-surface changes.

* V4 [High] Unbounded JSON recursion (nxt_conf.c)
  A POST /config with deeply-nested JSON like "[[[[...]]]]"
  recursed through nxt_conf_json_parse_value → parse_object /
  parse_array with no depth cap, blowing the controller's stack.
  Add a file-static depth counter checked at the '{' / '[' arms
  of parse_value (the only recursion sites); cap at 100, well
  above any legitimate Unit config (nesting > 6 levels does not
  occur in practice).  Reset at every entry to
  nxt_conf_json_parse() so a stale value from an aborted prior
  call cannot leak.

* V4 [High] Unbounded JSON array/object element count (nxt_conf.c)
  parse_object and parse_array looped with `count++` and no cap;
  a flat [1,2,...,1e6] passes a billion-byte allocation request
  to nxt_mp_get() at the end.  Cap at 100k elements (well above
  any real config — the largest production configs we've seen
  have a few thousand routes).  Reject with a clean parse error.

* V4 [Medium] Validator trust-model annotation (nxt_conf_validation.c)
  Document at nxt_conf_vldt_app_isolation_members that allowing
  arbitrary "executable" and isolation = false is intentional:
  the privilege boundary lives at the control-socket layer
  (SO_PEERCRED, landed in #14) and not in this
  schema validator.  Allow-listing executables here would be a
  deployment policy decision, not a config-schema concern.

* V7 [Medium] PHP TrueAsync EG exception scrubbing (nxt_php_sapi.c)
  The TrueAsync entrypoint intentionally skips
  php_request_shutdown() so the callback zval persists across
  the prototype → worker fork.  But if the entrypoint script
  raised an exception that wasn't caught before HttpServer->
  onRequest() stored the callback, every forked worker would
  inherit the exception on its first request.  Add a
  zend_clear_exception() before the function returns.  Other
  EG globals (output buffers, error_reporting, symbol table)
  are reset implicitly by the worker's per-request init; the
  exception state is the one that early-exits
  php_execute_script() on the next request.

* V8 [Medium] Ruby per-request IO NULL-deref hardening
  (nxt_ruby_stream_io.c)
  rctx->req is cleared after each request handler runs
  (src/ruby/nxt_ruby.c:657).  Apps that capture rack.input or
  rack.errors across the request boundary — background
  threads, cached IO handles — would NULL-deref rctx->req when
  they called gets / read / puts / write.  Add a NULL guard at
  each public entry point; gets/read return Qnil, write
  silently drops.  The audit framed this as "buffers from a
  prior request remain accessible"; the actual hazard is
  NULL-deref, but the fix covers both shapes (all reads go
  through rctx->req, no buffers are held inside the IO
  object).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@andypost andypost force-pushed the claude/audit-pr-c-privilege-boundary branch from 8d34aeb to 55a5522 Compare May 13, 2026 14:16
@andypost
Copy link
Copy Markdown
Owner Author

Two CI regressions addressed in 55a5522b:

  1. Go isolation rootfs testsopenat2(RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS) rejected legitimate in-rootfs symlinks (e.g. /lib → /usr/lib in the test's container rootfs). The RESOLVE_NO_SYMLINKS flag forbids ANY symlink in the path, which is far too strict — RESOLVE_BENEATH alone is what we actually want (prevents the destination from escaping the rootfs; symlinks that stay inside are fine). Dropped RESOLVE_NO_SYMLINKS.

  2. PHP / Python tests rejecting type:11 from MAIN — the sender-type ACL listed ACCESS_LOG, CONF_STORE, etc. with only CONTROLLER/ROUTER as allowed senders. The MAIN process legitimately broadcasts these (log-rotation on SIGUSR1, fd publishes), so the ACL was rejecting privileged supervisor traffic. Added a NXT_PORT_ACL_PRIV_BASE = nxt_port_acl(NXT_PROCESS_MAIN) baseline to every listed entry — MAIN is always allowed; the ACL's intent is to block workers/prototypes from forging these types, not to constrain the supervisor.

Force-pushed; local clean rebuild.

andypost added a commit that referenced this pull request May 13, 2026
…oundary state

Audit-driven controller robustness + request-boundary state pass
(PR-H from security-audit.md / #10).  Five findings;
no protocol or config-surface changes.

* V4 [High] Unbounded JSON recursion (nxt_conf.c)
  A POST /config with deeply-nested JSON like "[[[[...]]]]"
  recursed through nxt_conf_json_parse_value → parse_object /
  parse_array with no depth cap, blowing the controller's stack.
  Add a file-static depth counter checked at the '{' / '[' arms
  of parse_value (the only recursion sites); cap at 100, well
  above any legitimate Unit config (nesting > 6 levels does not
  occur in practice).  Reset at every entry to
  nxt_conf_json_parse() so a stale value from an aborted prior
  call cannot leak.

* V4 [High] Unbounded JSON array/object element count (nxt_conf.c)
  parse_object and parse_array looped with `count++` and no cap;
  a flat [1,2,...,1e6] passes a billion-byte allocation request
  to nxt_mp_get() at the end.  Cap at 100k elements (well above
  any real config — the largest production configs we've seen
  have a few thousand routes).  Reject with a clean parse error.

* V4 [Medium] Validator trust-model annotation (nxt_conf_validation.c)
  Document at nxt_conf_vldt_app_isolation_members that allowing
  arbitrary "executable" and isolation = false is intentional:
  the privilege boundary lives at the control-socket layer
  (SO_PEERCRED, landed in #14) and not in this
  schema validator.  Allow-listing executables here would be a
  deployment policy decision, not a config-schema concern.

* V7 [Medium] PHP TrueAsync EG exception scrubbing (nxt_php_sapi.c)
  The TrueAsync entrypoint intentionally skips
  php_request_shutdown() so the callback zval persists across
  the prototype → worker fork.  But if the entrypoint script
  raised an exception that wasn't caught before HttpServer->
  onRequest() stored the callback, every forked worker would
  inherit the exception on its first request.  Add a
  zend_clear_exception() before the function returns.  Other
  EG globals (output buffers, error_reporting, symbol table)
  are reset implicitly by the worker's per-request init; the
  exception state is the one that early-exits
  php_execute_script() on the next request.

* V8 [Medium] Ruby per-request IO NULL-deref hardening
  (nxt_ruby_stream_io.c)
  rctx->req is cleared after each request handler runs
  (src/ruby/nxt_ruby.c:657).  Apps that capture rack.input or
  rack.errors across the request boundary — background
  threads, cached IO handles — would NULL-deref rctx->req when
  they called gets / read / puts / write.  Add a NULL guard at
  each public entry point; gets/read return Qnil, write
  silently drops.  The audit framed this as "buffers from a
  prior request remain accessible"; the actual hazard is
  NULL-deref, but the fix covers both shapes (all reads go
  through rctx->req, no buffers are held inside the IO
  object).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ol-socket

Audit-driven hardening pass (PR-C from security-audit.md, #10).
Carries the only Critical finding (cgroup TOCTOU).

* V6 [Critical] Cgroup TOCTOU /proc/self/cgroup
  src/nxt_process.c, src/nxt_cgroup.c

  nxt_cgroup_proc_add() ran after fork() but before the child unshared
  into a new cgroup namespace; the parent read its own /proc/self/cgroup
  to resolve a relative path, which races a parent migration between
  cgroups in that fork→unshare window and may place the child in the
  wrong cgroup.  Read /proc/<child_pid>/cgroup instead — that view is
  stable for the just-forked child and reflects its post-unshare
  position by the time the cgroup directory is created.

  The symmetric nxt_cgroup_cleanup() runs after the child has already
  exited, so /proc/<pid>/cgroup is gone; cache the resolved directory
  in process->isolation.cgroup.resolved_path during proc_add and read
  it back at cleanup.  Recompute cgroot from /proc/self/cgroup (the
  still-alive parent's view) since cleanup only needs a stop boundary
  for the rmdir walk.

* V6 [High] Mount-destination symlink TOCTOU
  src/nxt_isolation.c

  nxt_fs_mkdir_p() creates mount destinations under a possibly
  attacker-influenced rootfs; a path component can be swapped for a
  symlink between mkdir and mount(2).  After mkdir and before
  nxt_fs_mount(), reopen the destination via openat2(rootfs_fd, …,
  RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS); a tampered destination
  fails the open before mount(2) sees it.  Gated on NXT_HAVE_OPENAT2.

  openat2(2) is Linux 5.6+ and returns ENOSYS on older kernels even
  when the userspace headers are present.  Distinguish ENOSYS from
  real RESOLVE_BENEATH violations: warn once and fall through to the
  unguarded mount (matching the prior security posture) rather than
  failing every isolation setup on pre-5.6 hosts.

* V5 [Medium] No sender-type ACL on port handler dispatch
  src/nxt_port.c, src/nxt_port_socket.c

  Add nxt_port_sender_acl[] mapping privileged message types to the
  process types permitted to send them, plus nxt_port_msg_acl_check()
  in nxt_port_handler().  Identity comes from the existing SCM_CREDENTIALS
  pid (NXT_USE_CMSG_PID), with a sanity-check fallback to the message
  header on platforms without it.  Listed types: START_PROCESS, SOCKET,
  SOCKET_UNLINK, MODULES, CONF_STORE, CERT_GET/DELETE, SCRIPT_GET/DELETE,
  ACCESS_LOG, APP_RESTART.  QUIT, REMOVE_PID, CHANGE_FILE, and NEW_PORT
  are intentionally not listed — their senders span main/router/prototype
  during shutdown and fd-publish flows, so per-caller analysis is needed
  before tightening.  Documented inline.

* V4 [High] Control socket lacks peer-cred check
  src/nxt_controller.c

  Any local user with write access to control.unit.sock could mutate
  config; the only barrier was filesystem permissions on the socket
  file.  Add nxt_controller_check_peer_cred() invoked from
  nxt_controller_conn_init() that validates the peer UID against the
  process UID via SO_PEERCRED (Linux) or getpeereid() (BSD/macOS),
  rejecting connections from non-matching peers with a 403-style alert.
  Platforms with neither fall back to filesystem-perms-only (no
  startup-time alert; fix-on-demand if a deployment surfaces it).

No protocol or config-surface changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@andypost andypost force-pushed the claude/audit-pr-c-privilege-boundary branch from 55a5522 to 561f963 Compare May 13, 2026 17:45
@andypost
Copy link
Copy Markdown
Owner Author

Second round of CI fixes in 561f9639. The PID-namespace isolation tests (and the cascading teardown errors) persisted after my MAIN-baseline ACL fix because the kernel-validated sender PID (SCM_CREDENTIALS) isn't consistently available on Unit's existing socketpair setup, and the dispatch path runs before the prototype's process registration completes — both of which my draft assumed.

Reverted the V5 sender-type ACL from this PR. Replaced nxt_port_msg_acl_check and the nxt_port_sender_acl[] table with a code-comment TODO documenting the design constraints (SO_PASSCRED enablement + process-registration ordering) that the proper implementation needs. The audit allowed for this fallback: "if it turns out to require a protocol field that doesn't exist, scope down to a documentation/comment-only entry plus a TODO".

PR-C still ships the other three audit fixes intact:

  • V6 Critical: cgroup TOCTOU around /proc/<child_pid>/cgroup.
  • V6 High: mount-destination openat2(RESOLVE_BENEATH) symlink-escape guard.
  • V4 High: control-socket SO_PEERCRED / getpeereid peer-UID match.

V5 will return as a separate PR once SO_PASSCRED is wired through the socketpair-creation paths and the prototype is registered before its first message is dispatched.

andypost added a commit that referenced this pull request May 14, 2026
…oundary state

Audit-driven controller robustness + request-boundary state pass
(PR-H from security-audit.md / #10).  Five findings;
no protocol or config-surface changes.

* V4 [High] Unbounded JSON recursion (nxt_conf.c)
  A POST /config with deeply-nested JSON like "[[[[...]]]]"
  recursed through nxt_conf_json_parse_value → parse_object /
  parse_array with no depth cap, blowing the controller's stack.
  Add a file-static depth counter checked at the '{' / '[' arms
  of parse_value (the only recursion sites); cap at 100, well
  above any legitimate Unit config (nesting > 6 levels does not
  occur in practice).  Reset at every entry to
  nxt_conf_json_parse() so a stale value from an aborted prior
  call cannot leak.

* V4 [High] Unbounded JSON array/object element count (nxt_conf.c)
  parse_object and parse_array looped with `count++` and no cap;
  a flat [1,2,...,1e6] passes a billion-byte allocation request
  to nxt_mp_get() at the end.  Cap at 100k elements (well above
  any real config — the largest production configs we've seen
  have a few thousand routes).  Reject with a clean parse error.

* V4 [Medium] Validator trust-model annotation (nxt_conf_validation.c)
  Document at nxt_conf_vldt_app_isolation_members that allowing
  arbitrary "executable" and isolation = false is intentional:
  the privilege boundary lives at the control-socket layer
  (SO_PEERCRED, landed in #14) and not in this
  schema validator.  Allow-listing executables here would be a
  deployment policy decision, not a config-schema concern.

* V7 [Medium] PHP TrueAsync EG exception scrubbing (nxt_php_sapi.c)
  The TrueAsync entrypoint intentionally skips
  php_request_shutdown() so the callback zval persists across
  the prototype → worker fork.  But if the entrypoint script
  raised an exception that wasn't caught before HttpServer->
  onRequest() stored the callback, every forked worker would
  inherit the exception on its first request.  Add a
  zend_clear_exception() before the function returns.  Other
  EG globals (output buffers, error_reporting, symbol table)
  are reset implicitly by the worker's per-request init; the
  exception state is the one that early-exits
  php_execute_script() on the next request.

* V8 [Medium] Ruby per-request IO NULL-deref hardening
  (nxt_ruby_stream_io.c)
  rctx->req is cleared after each request handler runs
  (src/ruby/nxt_ruby.c:657).  Apps that capture rack.input or
  rack.errors across the request boundary — background
  threads, cached IO handles — would NULL-deref rctx->req when
  they called gets / read / puts / write.  Add a NULL guard at
  each public entry point; gets/read return Qnil, write
  silently drops.  The audit framed this as "buffers from a
  prior request remain accessible"; the actual hazard is
  NULL-deref, but the fix covers both shapes (all reads go
  through rctx->req, no buffers are held inside the IO
  object).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant