## Summary
`leanpoint` crashes (or becomes unresponsive) when one or more upstreams stop responding at the TCP layer but do not actively reject connections. Each poll spawns a worker thread that eventually gets abandoned via `thread.detach()` when it misses the wall-clock deadline, but the socket remains open for the kernel's TCP retransmit window (~15 minutes on Linux). Over time the main process exhausts `RLIMIT_NOFILE` and can no longer open new sockets — inbound or outbound.
## Observed incident (devnet-4, 2026-04-23)
Public endpoints returned `UpstreamConnectFailure` for every upstream while the container appeared running. Snapshot inside the container:
- `/proc/$PID/fd/` held 1021 socket fds out of `RLIMIT_NOFILE=1024` (the Docker default).
- `ss -tn` showed ~960 sockets in `ESTABLISHED` toward the devnet clients and 63 in `CLOSE-WAIT`.
- HTTP `GET /` on the public listener hung (the accept loop could not get a new fd).
- Container logs showed sustained `WARN | Upstream (...) timed out after 5000ms — detaching thread` over the hours preceding the outage.
The `CLOSE-WAIT` count is the smoking gun: the remote side had sent `FIN`, but our worker never called `close()` because it was still blocked inside `recv()` on a socket the peer had half-closed.
## Root cause
`src/upstreams.zig::pollUpstreamThread` creates a fresh `std.http.Client` per poll and calls `lean_api.fetchSlots` synchronously. On Zig 0.14.1, `std.http.Client` does not expose `connect_timeout` or `read_timeout` — the existing `@hasField` guards in `poller.zig` / `server.zig` are no-ops on this version. To enforce a deadline, `pollUpstreams` waits up to `request_timeout_ms`, then just logs and moves on; any still-running worker is detached.
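For orientation, the detach path looks roughly like the following reconstruction (`pollOnce` / `workerMain` are hypothetical names; the real code in `src/upstreams.zig` differs in detail):

```zig
const std = @import("std");

fn workerMain(url: []const u8, done: *std.Thread.ResetEvent) void {
    defer done.set();
    // Would create a fresh std.http.Client and fetch synchronously, blocking
    // in recv() with no socket timeout; on Zig 0.14.1 there is no
    // read_timeout field for the @hasField guards to find.
    _ = url;
}

fn pollOnce(url: []const u8, request_timeout_ms: u64) !void {
    var done = std.Thread.ResetEvent{};
    const worker = try std.Thread.spawn(.{}, workerMain, .{ url, &done });

    done.timedWait(request_timeout_ms * std.time.ns_per_ms) catch {
        // Deadline missed: we stop waiting, but the thread and its open
        // socket live on until the blocking syscall returns on its own.
        std.log.warn("Upstream ({s}) timed out after {d}ms — detaching thread", .{ url, request_timeout_ms });
        worker.detach();
        return;
    };
    worker.join();
}
```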
Detaching does not cancel the thread. It keeps blocking in the underlying syscall with the socket open until:

- the peer sends `RST`/`FIN` (can be near-instant), or
- the kernel gives up on TCP retransmits (default ~15 min, but can be longer when SYN-ACKs arrive and then the remote stalls mid-stream).
For the 16-upstream devnet we poll every ~4s; even a small fraction of stuck workers leaks fds faster than they drain (16 upstreams at one poll per ~4s is roughly 345,000 requests a day, so if even 0.3% of them wedge on a peer that never resets, that alone covers the 1024-fd budget). At the default Docker `nofile=1024`, one day of mild upstream flakiness is enough to exhaust the limit.
## Fix (PR #1)
Apply `SO_RCVTIMEO` and `SO_SNDTIMEO` to the socket returned by `client.open()` via `std.posix.setsockopt` on `req.connection.?.stream.handle`. This bounds every blocking `recv`/`send` to `request_timeout_ms` regardless of peer behavior, so detached workers reliably self-terminate (and their `defer client.deinit()` closes the socket) within the configured deadline.
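In outline, the change looks something like this sketch (POSIX target assumed; the helper name `boundSocketIo` is illustrative, and the `std.posix.timeval` field names may vary by target):

```zig
const std = @import("std");

/// Bound all blocking I/O on `handle` (the fd from req.connection.?.stream.handle)
/// to `request_timeout_ms`. A sketch of the PR #1 approach, not the exact diff.
fn boundSocketIo(handle: std.posix.socket_t, request_timeout_ms: u64) !void {
    // SO_RCVTIMEO / SO_SNDTIMEO take a struct timeval.
    const tv: std.posix.timeval = .{
        .sec = @intCast(request_timeout_ms / 1000),
        .usec = @intCast((request_timeout_ms % 1000) * 1000),
    };
    const bytes = std.mem.asBytes(&tv);
    try std.posix.setsockopt(handle, std.posix.SOL.SOCKET, std.posix.SO.RCVTIMEO, bytes);
    try std.posix.setsockopt(handle, std.posix.SOL.SOCKET, std.posix.SO.SNDTIMEO, bytes);
    // From here on, a stalled recv()/send() fails with error.WouldBlock at the
    // deadline instead of blocking for the kernel's retransmit window.
}
```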
This addresses the observed pathology — accepted connections that go silent or are half-closed. Connect-phase black holes (SYN sent, no SYN-ACK) are out of scope here because handling them would require switching to non-blocking `connect` + `poll`; they did not show up in the incident.
## Mitigation applied on the running container
Restarted with `--ulimit nofile=65536:65536`. This only buys time; without the socket-level timeout the leak still grows, just more slowly.
## Follow-ups (not in this fix)
- Non-blocking connect with an explicit connect timeout, so connect-phase hangs also clean up promptly (see the sketch after this list).
- Health counter + `/healthz` that fails when open-fd usage crosses a threshold, so orchestrators restart before exhaustion.
- Track the currently-in-flight detached-worker count and log when it grows without bound.
- Revisit whether we need per-poll threads at all once socket-level timeouts exist — a sequential poll with hard timeouts would remove the detach/ref-count machinery entirely.
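For the non-blocking-connect follow-up, one possible shape (a sketch assuming Linux; `connectWithTimeout` is a hypothetical helper, not current code):

```zig
const std = @import("std");

fn connectWithTimeout(addr: std.net.Address, timeout_ms: i32) !std.posix.socket_t {
    const sock = try std.posix.socket(
        addr.any.family,
        std.posix.SOCK.STREAM | std.posix.SOCK.NONBLOCK,
        std.posix.IPPROTO.TCP,
    );
    errdefer std.posix.close(sock);

    std.posix.connect(sock, &addr.any, addr.getOsSockLen()) catch |err| switch (err) {
        error.WouldBlock => {}, // connect is in progress; wait for it below
        else => return err,
    };

    // Wait until the socket is writable (connect finished) or the deadline hits.
    var fds = [_]std.posix.pollfd{.{ .fd = sock, .events = std.posix.POLL.OUT, .revents = 0 }};
    if (try std.posix.poll(&fds, timeout_ms) == 0) return error.ConnectionTimedOut;

    // A failed asynchronous connect is reported via SO_ERROR; surface it.
    try std.posix.getsockoptError(sock);

    // The caller would clear O_NONBLOCK before handing the fd to blocking I/O
    // (and still apply the SO_RCVTIMEO/SO_SNDTIMEO bounds from the fix).
    return sock;
}
```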
## Reproduction
- Run `leanpoint` against an upstream that accepts TCP but never returns a response body (e.g. `nc -l 5055` in accept-and-block mode; a self-contained stand-in is sketched below).
- Watch `ls /proc/$PID/fd | wc -l` climb by one per `poll_interval_ms` until `nofile` is hit.
- The public listener stops accepting.
With the fix, the fd count stays flat: each worker returns within `request_timeout_ms` and releases its socket.
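As a self-contained stand-in for the `nc` upstream, a minimal Zig stub (hypothetical, not part of the repo) that accepts and then goes silent:

```zig
const std = @import("std");

// Accepts TCP connections on the poll port and never reads or writes,
// reproducing an upstream that goes silent mid-request. The stub leaks its
// side of each connection on purpose; that is fine for a short repro run.
pub fn main() !void {
    const addr = try std.net.Address.parseIp("127.0.0.1", 5055);
    var server = try addr.listen(.{ .reuse_address = true });
    defer server.deinit();

    while (true) {
        const conn = try server.accept();
        _ = conn; // keep it open and silent forever
    }
}
```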