Skip to content

Security and robustness fixes from NIF audit#67

Merged
benoitc merged 10 commits into
mainfrom
security-fixes
May 30, 2026
Merged

Security and robustness fixes from NIF audit#67
benoitc merged 10 commits into
mainfrom
security-fixes

Conversation

@benoitc
Copy link
Copy Markdown
Owner

@benoitc benoitc commented May 30, 2026

Fixes the NIF/security audit findings, one commit per area. Each adds a regression test.

  1. binary_to_term decodes with ERL_NIF_BIN2TERM_SAFE (atom-exhaustion).
  2. Bounded conversion recursion + PyTuple_New/PyList_New NULL checks + dict-mutation bound.
  3. Zero-copy py_buffer won't relocate while a Python memoryview pins it (UAF).
  4. Callback suspend/resume lifetime: keep the worker, free prior result on resume, clear pending-callback TLS, dirty + bounded callback-pipe writes.
  5. OWN_GIL worker: a per-request failure no longer kills the worker loop; owngil_* dispatch is dirty with bounded I/O; SuspensionRequired resolved per-interpreter.
  6. Event-loop fd handles are opaque ids validated against a registry; fd_read/fd_write moved to dirty schedulers.
  7. py_state optional max_state_entries cap (default infinity) with atomic admission; py:stream/logging builders validate identifiers and escape literals.
  8. venv/installer commands run via spawn_executable with argv lists, not os:cmd.
  9. Low-severity batch: make_py_error NULL guards, embedded-NUL rejection, a leaked split method, a stray debug fprintf.

Notes: the py_state cap is opt-in (default infinity); Phase 4 frees the prior result on resume rather than guarding the toggling has_result; the fd registry is global rather than per-loop.

benoitc added 10 commits May 30, 2026 02:03
All enif_binary_to_term calls now pass ERL_NIF_BIN2TERM_SAFE so
attacker-influenced data (notably a Python "__etf__:<base64>" callback
result) cannot mint new non-GC'd atoms and exhaust the atom table.
Local pids/refs and existing atoms still round-trip; only brand-new
atoms, remote-node terms, and external funs are rejected.

Adds py_reentrant_SUITE:test_etf_decode_safe covering both the
non-breaking round-trip and the rejected-novel-atom case.
Cap nesting depth in py_to_term/term_to_py so a deeply nested term or
Python structure returns an error instead of overflowing the C stack and
crashing the node. Bound the dict->map loop against mid-iteration mutation
(use the actual filled count). NULL-check the argument PyTuple_New across
the call/eval/resume paths (py_nif, py_event_loop, py_callback, py_exec);
most sites were already guarded.

Adds py_SUITE:test_conversion_depth_guard (1M-deep term -> clean error,
node stays up).
py_buffer_write relocated and freed buf->data on growth even when a
Python memoryview (PyBuffer_getbuffer) held a raw pointer into it,
dangling the view (use-after-free, whole-node crash). Refuse to grow
while view_count > 0; the write returns an error instead.

Adds py_buffer_SUITE:buffer_grow_pinned_test (pinned grow -> error,
release -> write succeeds).
Keep the worker resource alive while a callback is suspended (it could
be GC'd mid-suspension -> use-after-free on resume); release it in the
suspended-state destructor and NULL-check before the replay deref. Free
any prior result_data before a resume stores a new one (result_data, not
the toggling has_result flag, is the real pending-result indicator, so a
duplicate/raced resume no longer leaks). Clear the pending-callback TLS at
the worker request boundary so a reused worker thread doesn't trip the
stale-TLS invariant. Run the callback-response pipe writes on dirty IO
schedulers with non-blocking, deadline-bounded writes so a stalled reader
or large payload can't wedge a scheduler or desync the framed protocol.
Also Py_XDECREF callback_args in the GIL-held destructor branch.

Adds py_reentrant_SUITE:test_reentrant_resume_stress.
A per-request dict allocation failure in a subinterpreter worker no
longer breaks (and permanently kills) the worker command loop while
holding the GIL; it now sends an error response and keeps serving.

The owngil_* dispatch NIFs run on dirty IO schedulers and use the
non-blocking, deadline-bounded read_with_timeout/write_all_with_deadline
helpers (cmd_pipe write end set O_NONBLOCK), so a stalled or dead worker
can't wedge a scheduler forever or desync the framed protocol.

SuspensionRequired is now resolved from the current interpreter's erlang
module (like ProcessError), avoiding cross-interpreter PyObject use under
OWN_GIL.

Adds py_owngil_features_SUITE:owngil_reentrant_multi_stress_test.
The asyncio reader/writer integration handed Python a raw fd_resource
pointer cast to an integer, then cast it back and dereferenced it in
remove/update/clear/release. A stale, duplicate, or fabricated id was a
double-free or arbitrary-pointer deref that crashed the node.

Hand out opaque ids tracked in a validating registry instead: producers
register and return an id; removing consumers take (and release) the
entry so a duplicate id is a no-op; non-removing consumers look up with a
temporary keep. An unknown id is a safe no-op or a clean ValueError.

fd_read/fd_write also moved to ERL_NIF_DIRTY_JOB_IO_BOUND.

Adds py_fd_ops_SUITE:fd_registry_invalid_id_test.
py_state gains an optional max_state_entries cap (default infinity, so
existing behavior is unchanged) enforced with atomic admission
(insert_new + update_counter, ets:take on remove) so Python-driven
state_set can't exhaust node memory. A reserved sentinel row holds the
count and is protected from caller corruption and hidden from keys/fetch.

The py:stream and setup_logging helpers that build Python source now
validate module/function/kwarg names as identifiers (rejecting injection
at positions where quoting is meaningless) and escape string-literal
values, including newlines and other control bytes.

Adds py_state_SUITE (cap accounting + sentinel) and
py_stream_SUITE:test_stream_rejects_injection.
ensure_venv and dependency installation built shell command strings and
ran them through os:cmd, with quote/1 single-quote wrapping that did not
escape embedded quotes -- a venv path/requirements/extras value could
break out and inject. Run the executables via
open_port({spawn_executable, Exe}, [{args, Args}, ...]) with literal
argument lists instead. For uv, VIRTUAL_ENV is passed via the port
{env, ...} option rather than a shell prefix.

Adds py_venv_SUITE:test_venv_path_metacharacters.
Guard make_py_error against a NULL message/type when a Python exception's
text isn't UTF-8-encodable (and against a NULL exception type). Reject
embedded NULs in binary_to_string so a module/func/attr/code name can't
be silently truncated at the NUL. Release the leaked split() method in
ReactorBuffer_split. Drop a stray debug fprintf on the normal worker
send path. Document that the worker-pool init/shutdown is gen_server
serialized (no lock needed).

Adds py_SUITE:test_embedded_nul_name_rejected.
owngil_reentrant_multi_stress_test (4 contexts x 20 iters x 6 deep) timed
out on slow CI runners. Reduce to 2 contexts x 5 iters x depth 4 with a
120s budget; it still drives concurrent reentrant suspend/resume across
distinct subinterpreters (the existing concurrent/nested reentrant tests
cover the same paths).
@benoitc benoitc merged commit 80c2106 into main May 30, 2026
17 checks passed
@benoitc benoitc deleted the security-fixes branch May 30, 2026 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant