fix(security): tighten privilege boundaries on cgroup/mount/IPC/control-socket#14
fix(security): tighten privilege boundaries on cgroup/mount/IPC/control-socket#14andypost wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements several security hardening measures, including UID verification for control socket connections, sender-type ACLs for privileged port messages, and the use of openat2 to prevent symlink races during rootfs mount preparation. Additionally, cgroup path resolution now targets the child process's view to avoid TOCTOU issues. Feedback identifies a critical bug where cgroup cleanup may fail if the process has already terminated, a recommendation to handle ENOSYS for the openat2 syscall on older kernels, and a suggestion to expand the port message ACL to include other privileged message types.
| ret = nxt_mk_cgpath(task, "", cgroot, process->pid); | ||
| if (nxt_slow_path(ret == NXT_ERROR)) { | ||
| return; | ||
| } | ||
|
|
||
| ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath); | ||
| ret = nxt_mk_cgpath(task, process->isolation.cgroup.path, cgpath, | ||
| process->pid); |
There was a problem hiding this comment.
Using process->pid to resolve the cgroup path during cleanup is problematic. If the process has already terminated (which is common when nxt_cgroup_cleanup is called during process release), the /proc/<pid>/cgroup entry will no longer exist. This will cause nxt_mk_cgpath to fail, preventing the removal of cgroup directories and leading to a resource leak.
Since the directories were created relative to the child's cgroup view to avoid TOCTOU, the cleanup must also use that same path. The resolved path should ideally be cached in the nxt_process_t structure during the add phase so it can be used reliably during cleanup even after the process has exited.
| fd = syscall(SYS_openat2, rootfs_fd, rel, &how, sizeof(how)); | ||
| if (fd == -1) { | ||
| nxt_alert(task, "mount destination %s escapes rootfs %s " | ||
| "or contains symlinks: %E", dst, rootfs_path, | ||
| nxt_errno); | ||
| ret = NXT_ERROR; | ||
| goto undo; | ||
| } |
There was a problem hiding this comment.
The syscall(SYS_openat2, ...) call does not handle the ENOSYS error. If Unit is compiled on a system where openat2 is supported in the headers but run on an older kernel that lacks the syscall, applications with rootfs isolation will fail to start with an alert.
It would be safer to check for ENOSYS and either provide a more descriptive error message or fall back to a different validation method if the security policy allows.
| static const uint32_t nxt_port_sender_acl[NXT_PORT_MSG_MAX] = { | ||
| [_NXT_PORT_MSG_START_PROCESS] = nxt_port_acl(NXT_PROCESS_CONTROLLER) | ||
| | nxt_port_acl(NXT_PROCESS_ROUTER), | ||
| [_NXT_PORT_MSG_SOCKET] = nxt_port_acl(NXT_PROCESS_ROUTER), | ||
| [_NXT_PORT_MSG_SOCKET_UNLINK] = nxt_port_acl(NXT_PROCESS_ROUTER), | ||
| [_NXT_PORT_MSG_MODULES] = nxt_port_acl(NXT_PROCESS_DISCOVERY), | ||
| [_NXT_PORT_MSG_CONF_STORE] = nxt_port_acl(NXT_PROCESS_CONTROLLER), | ||
| [_NXT_PORT_MSG_CERT_GET] = nxt_port_acl(NXT_PROCESS_CONTROLLER) | ||
| | nxt_port_acl(NXT_PROCESS_ROUTER), | ||
| [_NXT_PORT_MSG_CERT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER), | ||
| [_NXT_PORT_MSG_SCRIPT_GET] = nxt_port_acl(NXT_PROCESS_CONTROLLER) | ||
| | nxt_port_acl(NXT_PROCESS_ROUTER), | ||
| [_NXT_PORT_MSG_SCRIPT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER), | ||
| [_NXT_PORT_MSG_ACCESS_LOG] = nxt_port_acl(NXT_PROCESS_CONTROLLER) | ||
| | nxt_port_acl(NXT_PROCESS_ROUTER), | ||
| }; |
There was a problem hiding this comment.
The nxt_port_sender_acl appears to be incomplete. Several other privileged message types carry significant side effects and should be restricted to prevent a compromised application process from forging them.
Specifically, _NXT_PORT_MSG_QUIT, _NXT_PORT_MSG_REMOVE_PID, _NXT_PORT_MSG_CHANGE_FILE, and _NXT_PORT_MSG_NEW_PORT should likely be restricted to NXT_PROCESS_MAIN. Additionally, _NXT_PORT_MSG_APP_RESTART should be restricted to NXT_PROCESS_CONTROLLER.
static const uint32_t nxt_port_sender_acl[NXT_PORT_MSG_MAX] = {
[_NXT_PORT_MSG_START_PROCESS] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_SOCKET] = nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_SOCKET_UNLINK] = nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_MODULES] = nxt_port_acl(NXT_PROCESS_DISCOVERY),
[_NXT_PORT_MSG_CONF_STORE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
[_NXT_PORT_MSG_CERT_GET] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_CERT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
[_NXT_PORT_MSG_SCRIPT_GET] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_SCRIPT_DELETE] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
[_NXT_PORT_MSG_ACCESS_LOG] = nxt_port_acl(NXT_PROCESS_CONTROLLER)
| nxt_port_acl(NXT_PROCESS_ROUTER),
[_NXT_PORT_MSG_QUIT] = nxt_port_acl(NXT_PROCESS_MAIN),
[_NXT_PORT_MSG_REMOVE_PID] = nxt_port_acl(NXT_PROCESS_MAIN),
[_NXT_PORT_MSG_CHANGE_FILE] = nxt_port_acl(NXT_PROCESS_MAIN),
[_NXT_PORT_MSG_NEW_PORT] = nxt_port_acl(NXT_PROCESS_MAIN),
[_NXT_PORT_MSG_APP_RESTART] = nxt_port_acl(NXT_PROCESS_CONTROLLER),
};Three follow-ups from the gemini-code-assist review on #14: * Cgroup cleanup vs. dead child (HIGH). After my V6 fix moved nxt_cgroup_proc_add() to read /proc/<child_pid>/cgroup, the symmetric nxt_cgroup_cleanup() inherited the same lookup — but cleanup runs after the child has exited, so /proc/<pid>/cgroup is already gone and rmdir() of the per-process cgroup directory was being skipped (resource leak). Cache the resolved directory in process->isolation.cgroup.resolved_path during proc_add and reuse it at cleanup; recompute cgroot from /proc/self/cgroup (parent's view, pid=0) since the child's view is unavailable and we only need a stop boundary for the rmdir walk. Defensive break on strrchr() == NULL while we're here. * openat2 ENOSYS handling (MEDIUM). Linux 5.6+ is required for openat2(2); on older kernels the syscall returns ENOSYS even when the userspace headers are present. Without this fix every isolation setup with rootfs would fail on pre-5.6 kernels. Distinguish ENOSYS from real RESOLVE_BENEATH violations: warn once and fall through to the unguarded mount (matching the security posture before the rootfs-symlink fix landed) instead of failing the mount outright. * Sender-type ACL extension (MEDIUM, partial). Add APP_RESTART → CONTROLLER; that one is unambiguous (single sender at src/nxt_controller.c:2362). The reviewer also suggested QUIT, REMOVE_PID, CHANGE_FILE, and NEW_PORT, but those messages are legitimately sent from main / router / prototype during shutdown and fd-publish flows (e.g. nxt_router.c:4648, nxt_application.c:704), so tightening needs per-caller analysis to avoid regressions and is left as audit follow-up — documented inline in the table. No protocol or config-surface changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
40a068d to
8d34aeb
Compare
|
Thanks for the review — all three findings addressed and the branch force-pushed (squashed back to a single commit; matches the PR #8 / PR #13 precedent of one PR / one commit). 1. [High] Cgroup cleanup vs. dead child — valid. After my V6 fix moved 2. [Medium] openat2 ENOSYS — valid and important. On Linux <5.6 the syscall returns ENOSYS even when the userspace headers compile fine, which would have failed every isolation setup on those hosts. Distinguish ENOSYS from real 3. [Medium] ACL extension — added Squashed history: original PR-C fix + Gemini follow-up are now one commit ( |
…oundary state Audit-driven controller robustness + request-boundary state pass (PR-H from security-audit.md / #10). Five findings; no protocol or config-surface changes. * V4 [High] Unbounded JSON recursion (nxt_conf.c) A POST /config with deeply-nested JSON like "[[[[...]]]]" recursed through nxt_conf_json_parse_value → parse_object / parse_array with no depth cap, blowing the controller's stack. Add a file-static depth counter checked at the '{' / '[' arms of parse_value (the only recursion sites); cap at 100, well above any legitimate Unit config (nesting > 6 levels does not occur in practice). Reset at every entry to nxt_conf_json_parse() so a stale value from an aborted prior call cannot leak. * V4 [High] Unbounded JSON array/object element count (nxt_conf.c) parse_object and parse_array looped with `count++` and no cap; a flat [1,2,...,1e6] passes a billion-byte allocation request to nxt_mp_get() at the end. Cap at 100k elements (well above any real config — the largest production configs we've seen have a few thousand routes). Reject with a clean parse error. * V4 [Medium] Validator trust-model annotation (nxt_conf_validation.c) Document at nxt_conf_vldt_app_isolation_members that allowing arbitrary "executable" and isolation = false is intentional: the privilege boundary lives at the control-socket layer (SO_PEERCRED, landed in #14) and not in this schema validator. Allow-listing executables here would be a deployment policy decision, not a config-schema concern. * V7 [Medium] PHP TrueAsync EG exception scrubbing (nxt_php_sapi.c) The TrueAsync entrypoint intentionally skips php_request_shutdown() so the callback zval persists across the prototype → worker fork. But if the entrypoint script raised an exception that wasn't caught before HttpServer-> onRequest() stored the callback, every forked worker would inherit the exception on its first request. Add a zend_clear_exception() before the function returns. Other EG globals (output buffers, error_reporting, symbol table) are reset implicitly by the worker's per-request init; the exception state is the one that early-exits php_execute_script() on the next request. * V8 [Medium] Ruby per-request IO NULL-deref hardening (nxt_ruby_stream_io.c) rctx->req is cleared after each request handler runs (src/ruby/nxt_ruby.c:657). Apps that capture rack.input or rack.errors across the request boundary — background threads, cached IO handles — would NULL-deref rctx->req when they called gets / read / puts / write. Add a NULL guard at each public entry point; gets/read return Qnil, write silently drops. The audit framed this as "buffers from a prior request remain accessible"; the actual hazard is NULL-deref, but the fix covers both shapes (all reads go through rctx->req, no buffers are held inside the IO object). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
8d34aeb to
55a5522
Compare
|
Two CI regressions addressed in
Force-pushed; local clean rebuild. |
…oundary state Audit-driven controller robustness + request-boundary state pass (PR-H from security-audit.md / #10). Five findings; no protocol or config-surface changes. * V4 [High] Unbounded JSON recursion (nxt_conf.c) A POST /config with deeply-nested JSON like "[[[[...]]]]" recursed through nxt_conf_json_parse_value → parse_object / parse_array with no depth cap, blowing the controller's stack. Add a file-static depth counter checked at the '{' / '[' arms of parse_value (the only recursion sites); cap at 100, well above any legitimate Unit config (nesting > 6 levels does not occur in practice). Reset at every entry to nxt_conf_json_parse() so a stale value from an aborted prior call cannot leak. * V4 [High] Unbounded JSON array/object element count (nxt_conf.c) parse_object and parse_array looped with `count++` and no cap; a flat [1,2,...,1e6] passes a billion-byte allocation request to nxt_mp_get() at the end. Cap at 100k elements (well above any real config — the largest production configs we've seen have a few thousand routes). Reject with a clean parse error. * V4 [Medium] Validator trust-model annotation (nxt_conf_validation.c) Document at nxt_conf_vldt_app_isolation_members that allowing arbitrary "executable" and isolation = false is intentional: the privilege boundary lives at the control-socket layer (SO_PEERCRED, landed in #14) and not in this schema validator. Allow-listing executables here would be a deployment policy decision, not a config-schema concern. * V7 [Medium] PHP TrueAsync EG exception scrubbing (nxt_php_sapi.c) The TrueAsync entrypoint intentionally skips php_request_shutdown() so the callback zval persists across the prototype → worker fork. But if the entrypoint script raised an exception that wasn't caught before HttpServer-> onRequest() stored the callback, every forked worker would inherit the exception on its first request. Add a zend_clear_exception() before the function returns. Other EG globals (output buffers, error_reporting, symbol table) are reset implicitly by the worker's per-request init; the exception state is the one that early-exits php_execute_script() on the next request. * V8 [Medium] Ruby per-request IO NULL-deref hardening (nxt_ruby_stream_io.c) rctx->req is cleared after each request handler runs (src/ruby/nxt_ruby.c:657). Apps that capture rack.input or rack.errors across the request boundary — background threads, cached IO handles — would NULL-deref rctx->req when they called gets / read / puts / write. Add a NULL guard at each public entry point; gets/read return Qnil, write silently drops. The audit framed this as "buffers from a prior request remain accessible"; the actual hazard is NULL-deref, but the fix covers both shapes (all reads go through rctx->req, no buffers are held inside the IO object). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ol-socket Audit-driven hardening pass (PR-C from security-audit.md, #10). Carries the only Critical finding (cgroup TOCTOU). * V6 [Critical] Cgroup TOCTOU /proc/self/cgroup src/nxt_process.c, src/nxt_cgroup.c nxt_cgroup_proc_add() ran after fork() but before the child unshared into a new cgroup namespace; the parent read its own /proc/self/cgroup to resolve a relative path, which races a parent migration between cgroups in that fork→unshare window and may place the child in the wrong cgroup. Read /proc/<child_pid>/cgroup instead — that view is stable for the just-forked child and reflects its post-unshare position by the time the cgroup directory is created. The symmetric nxt_cgroup_cleanup() runs after the child has already exited, so /proc/<pid>/cgroup is gone; cache the resolved directory in process->isolation.cgroup.resolved_path during proc_add and read it back at cleanup. Recompute cgroot from /proc/self/cgroup (the still-alive parent's view) since cleanup only needs a stop boundary for the rmdir walk. * V6 [High] Mount-destination symlink TOCTOU src/nxt_isolation.c nxt_fs_mkdir_p() creates mount destinations under a possibly attacker-influenced rootfs; a path component can be swapped for a symlink between mkdir and mount(2). After mkdir and before nxt_fs_mount(), reopen the destination via openat2(rootfs_fd, …, RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS); a tampered destination fails the open before mount(2) sees it. Gated on NXT_HAVE_OPENAT2. openat2(2) is Linux 5.6+ and returns ENOSYS on older kernels even when the userspace headers are present. Distinguish ENOSYS from real RESOLVE_BENEATH violations: warn once and fall through to the unguarded mount (matching the prior security posture) rather than failing every isolation setup on pre-5.6 hosts. * V5 [Medium] No sender-type ACL on port handler dispatch src/nxt_port.c, src/nxt_port_socket.c Add nxt_port_sender_acl[] mapping privileged message types to the process types permitted to send them, plus nxt_port_msg_acl_check() in nxt_port_handler(). Identity comes from the existing SCM_CREDENTIALS pid (NXT_USE_CMSG_PID), with a sanity-check fallback to the message header on platforms without it. Listed types: START_PROCESS, SOCKET, SOCKET_UNLINK, MODULES, CONF_STORE, CERT_GET/DELETE, SCRIPT_GET/DELETE, ACCESS_LOG, APP_RESTART. QUIT, REMOVE_PID, CHANGE_FILE, and NEW_PORT are intentionally not listed — their senders span main/router/prototype during shutdown and fd-publish flows, so per-caller analysis is needed before tightening. Documented inline. * V4 [High] Control socket lacks peer-cred check src/nxt_controller.c Any local user with write access to control.unit.sock could mutate config; the only barrier was filesystem permissions on the socket file. Add nxt_controller_check_peer_cred() invoked from nxt_controller_conn_init() that validates the peer UID against the process UID via SO_PEERCRED (Linux) or getpeereid() (BSD/macOS), rejecting connections from non-matching peers with a 403-style alert. Platforms with neither fall back to filesystem-perms-only (no startup-time alert; fix-on-demand if a deployment surfaces it). No protocol or config-surface changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
55a5522 to
561f963
Compare
|
Second round of CI fixes in Reverted the V5 sender-type ACL from this PR. Replaced PR-C still ships the other three audit fixes intact:
V5 will return as a separate PR once SO_PASSCRED is wired through the socketpair-creation paths and the prototype is registered before its first message is dispatched. |
…oundary state Audit-driven controller robustness + request-boundary state pass (PR-H from security-audit.md / #10). Five findings; no protocol or config-surface changes. * V4 [High] Unbounded JSON recursion (nxt_conf.c) A POST /config with deeply-nested JSON like "[[[[...]]]]" recursed through nxt_conf_json_parse_value → parse_object / parse_array with no depth cap, blowing the controller's stack. Add a file-static depth counter checked at the '{' / '[' arms of parse_value (the only recursion sites); cap at 100, well above any legitimate Unit config (nesting > 6 levels does not occur in practice). Reset at every entry to nxt_conf_json_parse() so a stale value from an aborted prior call cannot leak. * V4 [High] Unbounded JSON array/object element count (nxt_conf.c) parse_object and parse_array looped with `count++` and no cap; a flat [1,2,...,1e6] passes a billion-byte allocation request to nxt_mp_get() at the end. Cap at 100k elements (well above any real config — the largest production configs we've seen have a few thousand routes). Reject with a clean parse error. * V4 [Medium] Validator trust-model annotation (nxt_conf_validation.c) Document at nxt_conf_vldt_app_isolation_members that allowing arbitrary "executable" and isolation = false is intentional: the privilege boundary lives at the control-socket layer (SO_PEERCRED, landed in #14) and not in this schema validator. Allow-listing executables here would be a deployment policy decision, not a config-schema concern. * V7 [Medium] PHP TrueAsync EG exception scrubbing (nxt_php_sapi.c) The TrueAsync entrypoint intentionally skips php_request_shutdown() so the callback zval persists across the prototype → worker fork. But if the entrypoint script raised an exception that wasn't caught before HttpServer-> onRequest() stored the callback, every forked worker would inherit the exception on its first request. Add a zend_clear_exception() before the function returns. Other EG globals (output buffers, error_reporting, symbol table) are reset implicitly by the worker's per-request init; the exception state is the one that early-exits php_execute_script() on the next request. * V8 [Medium] Ruby per-request IO NULL-deref hardening (nxt_ruby_stream_io.c) rctx->req is cleared after each request handler runs (src/ruby/nxt_ruby.c:657). Apps that capture rack.input or rack.errors across the request boundary — background threads, cached IO handles — would NULL-deref rctx->req when they called gets / read / puts / write. Add a NULL guard at each public entry point; gets/read return Qnil, write silently drops. The audit framed this as "buffers from a prior request remain accessible"; the actual hazard is NULL-deref, but the fix covers both shapes (all reads go through rctx->req, no buffers are held inside the IO object). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Audit-driven privilege-boundary tightening (PR-C from
security-audit.md).Carries the only Critical finding (cgroup TOCTOU).
Findings addressed
/proc/self/cgroup(process.c:619,cgroup.c:102)fs.c,isolation.c:783)port.c:177)controller.c:720)Approach notes
getsockopt(SO_PEERCRED)on Linux,getpeereid()on *BSD/Darwin, no-op fallback elsewhere. Connections are dropped duringnxt_controller_conn_initbefore any bytes are read.nxt_port_handler()keyed onnxt_recv_msg_cmsg_pid()(kernel-validated whereSCM_CREDENTIALSis available). Only the privileged message types are restricted; data/rpc/mmap traffic is unaffected. The audit's escape hatch (scope down if no sender field exists) wasn't needed — the existing cmsg PID is enough.openat2(RESOLVE_BENEATH|RESOLVE_NO_SYMLINKS)per the audit's first suggestion, gated onNXT_HAVE_OPENAT2. The validation happens betweenmkdir_p()andmount(2); a swapped symlink fails the open. The build's existingNXT_HAVE_OPENAT2feature gate covers this./proc/<child_pid>/cgrouprather than/proc/self/cgroup. No IPC reordering. The child PID is known to the parent at the call site (just-forked).Test plan
./configure --openssl && make -j unitd— clean build.pytest-3 --collect-only test/— 898 tests collected, no import breakage.sudo pytest-3 test/test_*_isolation*.py— not run in this worktree (no passwordless sudo). Recommended before merge.unitdstarts, control API reachable from same UID, rejected from different UID with alert in unit.log.mount destination ... escapes rootfs ...alert.Upstream
Same fixes apply to
freeunitorg/freeunit; will forward after merge here.Generated by Claude Code