fix: tighten WS heap defenses (cap=4, broadcast floor 32K) #61
Open
skialpine wants to merge 1 commit into
Under sustained multi-protocol load (4+ WS clients + BLE + USB) the
default WS client cap of 8 lets per-client queue + AsyncWebSocketMessage
allocations overcommit heap to the point where lwIP TX buffers starve
silently — WiFi reports status=CONNECTED, disc/rec never increment, but
packets drop. Two small knobs together close the gap:
* cleanupClients(4) — halves worst-case connection overcommit.
Real workloads (on-device UI + PRs/Decenza + occasional second
client) cap out at 3–4; 8 was always headroom for edge cases that
don't happen on this hardware.
* WS_BROADCAST_HEAP_FLOOR raised 25000 → 32000. The original 25K
floor sat too close to the lwIP starvation knee: a 4-client status
burst dipped free heap to ~22K, below where pbufs can be allocated.
32K keeps the post-burst trough in safe territory.
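The floor arithmetic above can be checked with a small standalone model. This is a sketch only: it assumes the gate is a plain free-heap comparison and that a burst starts with free heap exactly at the floor. `broadcastAllowed`, `worstCaseTrough`, and the `k...` constants are hypothetical names for illustration, not identifiers from the firmware.

```cpp
#include <cstddef>

// Figures quoted in this PR (bytes).
constexpr std::size_t kOldFloor = 25000;        // original WS_BROADCAST_HEAP_FLOOR
constexpr std::size_t kNewFloor = 32000;        // raised WS_BROADCAST_HEAP_FLOOR
constexpr std::size_t kObservedTrough = 22000;  // approximate trough seen with the old floor

// Hypothetical stand-in for the gate: allow a broadcast only while free heap
// is at or above the floor. Whether the firmware uses >= or > is an assumption.
constexpr bool broadcastAllowed(std::size_t freeHeap, std::size_t floor) {
  return freeHeap >= floor;
}

// Lowest free heap reached if a burst costing burstCost bytes starts with
// free heap exactly at the floor (the worst case the gate still lets through).
constexpr std::size_t worstCaseTrough(std::size_t floor, std::size_t burstCost) {
  return burstCost < floor ? floor - burstCost : 0;
}

// Burst cost implied by the old floor and the observed trough: about 3000 bytes.
constexpr std::size_t kImpliedBurst = kOldFloor - kObservedTrough;

static_assert(worstCaseTrough(kOldFloor, kImpliedBurst) == 22000,
              "old floor reproduces the observed trough");
static_assert(worstCaseTrough(kNewFloor, kImpliedBurst) == 29000,
              "same burst starting at the new floor");
static_assert(!broadcastAllowed(31999, kNewFloor),
              "a broadcast just under the floor is skipped");
```

Under these assumptions the same burst bottoms out near 29K instead of 22K. The model says nothing about where the lwIP starvation knee actually lies; that figure comes from the hardware soak.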
Validated on hardware: 2h+ soak with 4 ws_drop_repro generators + BLE
(Decenza) + USB serial, 0% ping loss, 0 reconnects, 0 panics, ~300
broadcasts skipped by the existing wsBroadcastHeapOk gate. minheap
stable across the run (no leak signature).
Holding for an 8h soak before merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 27, 2026
Summary
Two small knobs that together close the silent-WiFi-drop failure mode observed during multi-protocol soaks on top of #56's WS-OOM defense.
* websocket.cleanupClients(4): halve the WS client cap from the lib default (8) to 4
* WS_BROADCAST_HEAP_FLOOR raised 25 KB → 32 KB
Why
Under sustained multi-protocol load (4+ WS clients + BLE + USB), per-client queue + AsyncWebSocketMessage allocations overcommit heap until lwIP can't allocate pbufs. The symptom is nasty: wifi_status stays CONNECTED, disc/rec never increment, but packets drop silently. The on-device watchdog never trips because nothing has crashed.
The existing wsBroadcastHeapOk() gate (PR #58) prevents the bad_alloc → abort reboot, but at the original 25 KB floor a 4-client status burst still dipped free heap to ~22 KB, below the lwIP starvation knee. 32 KB keeps the post-burst trough in safe territory.
cap=8 was always overhead for edge cases. Real workloads (on-device UI + PRs/Decenza app + an occasional second viewer) cap out at 3–4 concurrent clients on this hardware.
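To make the effect of the cap concrete, here is a standalone model rather than the library code. It assumes the cap is enforced by closing the oldest connections whenever the client count exceeds the limit, and it uses a purely hypothetical per-client cost, since the PR does not state one. `ClientPool`, `enforceCap`, and `kAssumedPerClientBytes` are illustrative names.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical per-client worst-case heap cost (queue + message buffers).
// The PR does not give this figure; 4096 is a placeholder for illustration.
constexpr std::size_t kAssumedPerClientBytes = 4096;

// Minimal model of a capped client list; ids stand in for connections.
class ClientPool {
 public:
  void connect(int id) { clients_.push_back(id); }

  // Assumed policy: drop the oldest connections until the count fits the cap.
  // Returns how many clients were closed.
  std::size_t enforceCap(std::size_t maxClients) {
    std::size_t closed = 0;
    while (clients_.size() > maxClients) {
      clients_.pop_front();
      ++closed;
    }
    return closed;
  }

  std::size_t count() const { return clients_.size(); }

  // Worst-case heap committed to clients under the assumed per-client cost.
  std::size_t worstCaseBytes() const {
    return clients_.size() * kAssumedPerClientBytes;
  }

 private:
  std::deque<int> clients_;  // front = oldest connection
};
```

With eight clients connected, `enforceCap(4)` closes the four oldest and the modeled worst-case commitment halves. Whatever the real per-client cost is, the ratio is the same, because the commitment scales linearly with the client count.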
Validation
2h+ soak on hardware (skialpine/test/adc-lib+telemetry worktree with these edits applied):
* disc=0 rec=0 for the full run)
* bad_alloc, 0 AsyncTCP stalls
Holding for an 8h soak before merge.
What this doesn't fix
This is a brake, not a root-cause fix. The underlying pressure (large statusAll broadcasts × 4 clients allocating ~2.3 KB per burst) is still there; we're just guaranteeing the gate catches it. Follow-up will split immutable fields (firmware_version, protocol_version, reset_reason) and quasi-immutable fields (stall_count, last_stall_*, adc_recovery_count, soc_temp_max_c) out of the hot statusAll path into a one-shot session_info frame + on-change deltas. Targets ~37% payload reduction on the status broadcast.
Test plan
* pio run -e esp32s3)
* [ws] low heap … skip broadcast still fires correctly under load
🤖 Generated with Claude Code
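The follow-up split described under "What this doesn't fix" might look roughly like the sketch below. Everything except the field names is an assumption: the struct names, the field types, and the `DeltaTracker` helper are hypothetical, and the `last_stall_*` fields are left out because the PR does not enumerate them.

```cpp
#include <cstdint>
#include <string>

// Immutable for the life of a session: candidates for a one-shot
// session_info frame sent once per client connection.
struct SessionInfo {
  std::string firmware_version;
  std::string protocol_version;
  std::string reset_reason;
};

// Quasi-immutable values: candidates for on-change delta frames.
// Field types are assumptions made for this sketch.
struct SlowStatus {
  std::uint32_t stall_count = 0;
  std::uint32_t adc_recovery_count = 0;
  float soc_temp_max_c = 0.0f;

  bool operator==(const SlowStatus& other) const {
    return stall_count == other.stall_count &&
           adc_recovery_count == other.adc_recovery_count &&
           soc_temp_max_c == other.soc_temp_max_c;
  }
};

// Remembers the last values sent and reports whether a delta frame is due.
class DeltaTracker {
 public:
  // Returns true the first time it is called and whenever a field changed
  // since the last call that returned true.
  bool shouldSend(const SlowStatus& current) {
    if (sentOnce_ && current == last_) {
      return false;
    }
    last_ = current;
    sentOnce_ = true;
    return true;
  }

 private:
  SlowStatus last_;
  bool sentOnce_ = false;
};
```

In this shape the hot status broadcast would carry only fast-changing fields, and a `SlowStatus` delta would go out only when `shouldSend` returns true. How much payload that saves depends on the real frame layout, which this sketch does not model.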