## Summary
`leanpoint` crashes (or becomes unresponsive) when one or more upstreams stop responding at the TCP layer but do not actively reject connections. Each poll spawns a worker thread that eventually gets abandoned via `thread.detach()` when it misses the wall-clock deadline, but the socket remains open for the kernel's TCP retransmit window (~15 minutes on Linux). Over time the main process exhausts `RLIMIT_NOFILE` and can no longer open new sockets — inbound or outbound.
## Observed incident (devnet-4, 2026-04-23)
Public endpoints returned `UpstreamConnectFailure` for every upstream while the container appeared running. Snapshot inside the container:
- `/proc/$PID/fd/` held 1021 socket fds out of `RLIMIT_NOFILE=1024` (the Docker default).
- `ss -tn` showed ~960 sockets in `ESTABLISHED` toward the devnet clients and 63 in `CLOSE-WAIT`.
- HTTP `GET /` on the public listener hung (the accept loop could not get a new fd).
- Container logs showed sustained `WARN | Upstream (...) timed out after 5000ms — detaching thread` over the hours preceding the outage.
The `CLOSE-WAIT` count is the smoking gun: the remote side had sent `FIN`, but our worker never called `close()` because it was still blocked inside `recv()` on a socket the peer had half-closed.
## Root cause
`src/upstreams.zig::pollUpstreamThread` creates a fresh `std.http.Client` per poll and calls `lean_api.fetchSlots` synchronously. On Zig 0.14.1, `std.http.Client` does not expose `connect_timeout` or `read_timeout` — the existing `@hasField` guards in `poller.zig` / `server.zig` are no-ops on this version. To enforce a deadline, `pollUpstreams` waits up to `request_timeout_ms`, then just logs and moves on; any still-running worker is detached.
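For orientation, the detach path looks roughly like the following reconstruction (`pollOnce` / `workerMain` are hypothetical names; the real code in `src/upstreams.zig` differs in detail):

```zig
const std = @import("std");

fn workerMain(url: []const u8, done: *std.Thread.ResetEvent) void {
    defer done.set();
    // Would create a fresh std.http.Client and fetch synchronously, blocking
    // in recv() with no socket timeout; on Zig 0.14.1 there is no
    // read_timeout field for the @hasField guards to find.
    _ = url;
}

fn pollOnce(url: []const u8, request_timeout_ms: u64) !void {
    var done = std.Thread.ResetEvent{};
    const worker = try std.Thread.spawn(.{}, workerMain, .{ url, &done });

    done.timedWait(request_timeout_ms * std.time.ns_per_ms) catch {
        // Deadline missed: we stop waiting, but the thread and its open
        // socket live on until the blocking syscall returns on its own.
        std.log.warn("Upstream ({s}) timed out after {d}ms — detaching thread", .{ url, request_timeout_ms });
        worker.detach();
        return;
    };
    worker.join();
}
```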
Detaching does not cancel the thread. It keeps blocking in the underlying syscall with the socket open until:

- the peer sends `RST`/`FIN` (can be near-instant), or
- the kernel gives up on TCP retransmits (default ~15 min, but can be longer when SYN-ACKs arrive and then the remote stalls mid-stream).
For the 16-upstream devnet we poll every ~4s; even a small fraction of stuck workers leaks fds faster than they drain (16 upstreams at one poll per ~4s is roughly 345,000 requests a day, so if even 0.3% of them wedge on a peer that never resets, that alone covers the 1024-fd budget). At the default Docker `nofile=1024`, one day of mild upstream flakiness is enough to exhaust the limit.
## Fix (PR #1)
Apply `SO_RCVTIMEO` and `SO_SNDTIMEO` to the socket returned by `client.open()` via `std.posix.setsockopt` on `req.connection.?.stream.handle`. This bounds every blocking `recv`/`send` to `request_timeout_ms` regardless of peer behavior, so detached workers reliably self-terminate (and their `defer client.deinit()` closes the socket) within the configured deadline.
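In outline, the change looks something like this sketch (POSIX target assumed; the helper name `boundSocketIo` is illustrative, and the `std.posix.timeval` field names may vary by target):

```zig
const std = @import("std");

/// Bound all blocking I/O on `handle` (the fd from req.connection.?.stream.handle)
/// to `request_timeout_ms`. A sketch of the PR #1 approach, not the exact diff.
fn boundSocketIo(handle: std.posix.socket_t, request_timeout_ms: u64) !void {
    // SO_RCVTIMEO / SO_SNDTIMEO take a struct timeval.
    const tv: std.posix.timeval = .{
        .sec = @intCast(request_timeout_ms / 1000),
        .usec = @intCast((request_timeout_ms % 1000) * 1000),
    };
    const bytes = std.mem.asBytes(&tv);
    try std.posix.setsockopt(handle, std.posix.SOL.SOCKET, std.posix.SO.RCVTIMEO, bytes);
    try std.posix.setsockopt(handle, std.posix.SOL.SOCKET, std.posix.SO.SNDTIMEO, bytes);
    // From here on, a stalled recv()/send() fails with error.WouldBlock at the
    // deadline instead of blocking for the kernel's retransmit window.
}
```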
This addresses the observed pathology — accepted connections that go silent or are half-closed. Connect-phase black holes (SYN sent, no SYN-ACK) are out of scope here because handling them would require switching to non-blocking `connect` + `poll`; they did not show up in the incident.
## Mitigation applied on the running container
Restarted with `--ulimit nofile=65536:65536`. This only buys time; without the socket-level timeout the leak still grows, just more slowly.
## Follow-ups (not in this fix)
- Non-blocking connect with an explicit connect timeout, so connect-phase hangs also clean up promptly (see the sketch after this list).
- Health counter + `/healthz` that fails when open-fd usage crosses a threshold, so orchestrators restart before exhaustion.
- Track the currently-in-flight detached-worker count and log when it grows without bound.
- Revisit whether we need per-poll threads at all once socket-level timeouts exist — a sequential poll with hard timeouts would remove the detach/ref-count machinery entirely.
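For the non-blocking-connect follow-up, one possible shape (a sketch assuming Linux; `connectWithTimeout` is a hypothetical helper, not current code):

```zig
const std = @import("std");

fn connectWithTimeout(addr: std.net.Address, timeout_ms: i32) !std.posix.socket_t {
    const sock = try std.posix.socket(
        addr.any.family,
        std.posix.SOCK.STREAM | std.posix.SOCK.NONBLOCK,
        std.posix.IPPROTO.TCP,
    );
    errdefer std.posix.close(sock);

    std.posix.connect(sock, &addr.any, addr.getOsSockLen()) catch |err| switch (err) {
        error.WouldBlock => {}, // connect is in progress; wait for it below
        else => return err,
    };

    // Wait until the socket is writable (connect finished) or the deadline hits.
    var fds = [_]std.posix.pollfd{.{ .fd = sock, .events = std.posix.POLL.OUT, .revents = 0 }};
    if (try std.posix.poll(&fds, timeout_ms) == 0) return error.ConnectionTimedOut;

    // A failed asynchronous connect is reported via SO_ERROR; surface it.
    try std.posix.getsockoptError(sock);

    // The caller would clear O_NONBLOCK before handing the fd to blocking I/O
    // (and still apply the SO_RCVTIMEO/SO_SNDTIMEO bounds from the fix).
    return sock;
}
```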
## Reproduction
- Run `leanpoint` against an upstream that accepts TCP but never returns a response body (e.g. `nc -l 5055` in accept-and-block mode; a self-contained stand-in is sketched below).
- Watch `ls /proc/$PID/fd | wc -l` climb by one per `poll_interval_ms` until `nofile` is hit.
- The public listener stops accepting.
With the fix, the fd count stays flat: each worker returns within `request_timeout_ms` and releases its socket.
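As a self-contained stand-in for the `nc` upstream, a minimal Zig stub (hypothetical, not part of the repo) that accepts and then goes silent:

```zig
const std = @import("std");

// Accepts TCP connections on the poll port and never reads or writes,
// reproducing an upstream that goes silent mid-request. The stub leaks its
// side of each connection on purpose; that is fine for a short repro run.
pub fn main() !void {
    const addr = try std.net.Address.parseIp("127.0.0.1", 5055);
    var server = try addr.listen(.{ .reuse_address = true });
    defer server.deinit();

    while (true) {
        const conn = try server.accept();
        _ = conn; // keep it open and silent forever
    }
}
```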