Scale telemetry: SoC temp, weight-stall watchdog, reset reason by skialpine · Pull Request #57 · decentespresso/openscale

skialpine · 2026-05-25T18:28:49Z

Summary

Diagnostics for field reports of "weight stops being collected under sustained multi-protocol load" — the only recovery seen was a long battery-out cooldown (a quick power-cycle didn't help), pointing at a thermal/analog failure rather than firmware state. Adds telemetry to confirm/rule it out, with no behavior change to the weight/WiFi/BLE paths.

New fields in the /snapshot WS status frame (and serial logs):

- soc_temp_c / soc_temp_max_c: live and peak ESP32-S3 die temperature (temperatureRead()), sampled every 2 s.
- weight_stalled / stall_count / last_stall_ms / last_stall_temp_c: a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed for more than 8 s (a live cell dithers every sample), recording the die temperature at failure to correlate stalls with heat. The check is skipped during the deliberate ADC power-cycle recovery and throttled to one check per 250 ms.
- reset_reason: esp_reset_reason() at boot, so a brownout/panic/WDT reset is attributable.
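
For readers who want a concrete picture of these fields, here is a minimal, host-buildable sketch of how they could be serialized. The field names come from the list above; the struct, the function name `formatTelemetryJson`, the field types, and the exact JSON layout (for example `weight_stalled` as a JSON boolean) are assumptions for illustration and are not taken from the firmware.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical snapshot of the new telemetry fields. The names mirror the
// status-frame keys; the types are assumptions.
struct TelemetrySnapshot {
    float socTempC;          // soc_temp_c
    float socTempMaxC;       // soc_temp_max_c
    bool weightStalled;      // weight_stalled
    uint32_t stallCount;     // stall_count
    uint32_t lastStallMs;    // last_stall_ms
    float lastStallTempC;    // last_stall_temp_c
    const char* resetReason; // reset_reason
};

// Formats the snapshot as a JSON object. Returns an empty string if
// formatting fails or the output would not fit in the local buffer.
std::string formatTelemetryJson(const TelemetrySnapshot& t) {
    char buf[256];
    const int n = std::snprintf(
        buf, sizeof(buf),
        "{\"soc_temp_c\":%.1f,\"soc_temp_max_c\":%.1f,"
        "\"weight_stalled\":%s,\"stall_count\":%lu,"
        "\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,"
        "\"reset_reason\":\"%s\"}",
        t.socTempC, t.socTempMaxC,
        t.weightStalled ? "true" : "false",
        // Cast so the %lu specifier matches on every platform.
        static_cast<unsigned long>(t.stallCount),
        static_cast<unsigned long>(t.lastStallMs),
        t.lastStallTempC, t.resetReason);
    if (n < 0 || n >= static_cast<int>(sizeof(buf))) {
        return std::string();
    }
    return std::string(buf, static_cast<size_t>(n));
}
```

The explicit casts keep each printf specifier paired with an argument of the matching type, which is the kind of format/argument pairing the review in this PR checks for.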

Plus tools/thermal_load_test.sh: a 1-hour USB+WiFi+churn+mDNS soak that polls the telemetry (BT driven externally).

Threading: new cross-task scalars are volatile per CLAUDE.md; the status frame only reads them.

Test plan

- Builds for esp32s3; flashed; status frame reports the new fields.
- 1-hour multi-protocol soak to capture peak temp + any stall and the die temp at which it occurs.

🤖 Generated with Claude Code

Diagnostics for the field reports of "weight stops being collected under sustained load" (suspected thermal). Adds to the WS status frame and serial logs:
- soc_temp_c / soc_temp_max_c: live + peak ESP32-S3 die temperature (temperatureRead()), sampled every 2s on the main loop.
- weight_stalled + stall_count + last_stall_ms + last_stall_temp_c: a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed for >8s (a live cell dithers every sample), recording the die temp at the moment of the stall to correlate failures with heat. (This failure is not firmware-recoverable, so it's surfaced, not silently retried.)
- reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT reset is attributable instead of looking like a clean power-on.

Telemetry-only; no behavior change to the weight/WiFi/BLE paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…nitor)

Drives USB 10Hz + WS 10Hz + HTTP/WS churn + mDNS (BT driven externally) and polls the new temp/stall telemetry every ~60s, watching for the weight-stall failure and the die temp at which it occurs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Review follow-ups on the telemetry watchdog:
- Skip the stall check while b_adc_recovery_active (the ADS1232 power-cycle freezes the raw value by design); re-seed the window on resume so a genuine signal-timeout recovery isn't miscounted as a railed/frozen stall.
- Check every 250 ms instead of every loop iteration -- the ADC only produces ~10 samples/s, so polling getDebugInfo() at full loop rate (with its sqrt + dataset passes) just burns CPU/heat, which is counterproductive on the chip we're trying to characterize.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
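
The watchdog behavior described in this commit (skip during recovery, re-seed on resume, 250 ms throttle, 8 s threshold) can be sketched off-target roughly as follows. This is not the code in pureScale(): the `StallWatchdog` struct, its member names, the explicit `seeded` flag, and the choice to clear `stalled` when the raw value moves again are all assumptions made so the logic can be compiled and exercised on a host.

```cpp
#include <cstdint>

// Hypothetical, host-testable model of the weight-stall watchdog.
struct StallWatchdog {
    static constexpr uint32_t kCheckIntervalMs = 250;   // throttle
    static constexpr uint32_t kStallThresholdMs = 8000; // frozen for > 8 s

    bool seeded = false;       // false until a raw value has been recorded
    int32_t lastRaw = 0;       // last raw ADC value seen
    uint32_t lastChangeMs = 0; // when lastRaw last changed
    uint32_t lastCheckMs = 0;  // when the check last ran
    bool stalled = false;
    uint32_t stallCount = 0;
    uint32_t lastStallMs = 0;
    float lastStallTempC = 0.0f;

    // nowMs: millisecond tick; raw: current raw ADC value;
    // recoveryActive: true while the deliberate ADC power-cycle is running.
    void poll(uint32_t nowMs, int32_t raw, bool recoveryActive, float socTempC) {
        if (recoveryActive) {
            // The raw value is frozen by design here; drop the window so it
            // is re-seeded on resume instead of counting as a stall.
            seeded = false;
            return;
        }
        if (seeded && nowMs - lastCheckMs < kCheckIntervalMs) {
            return; // throttled: the ADC only produces ~10 samples/s
        }
        lastCheckMs = nowMs;
        if (!seeded || raw != lastRaw) {
            seeded = true;
            lastRaw = raw;
            lastChangeMs = nowMs;
            stalled = false; // assumption: a moving value clears the flag
            return;
        }
        if (!stalled && nowMs - lastChangeMs > kStallThresholdMs) {
            stalled = true;
            ++stallCount;
            lastStallMs = nowMs;
            lastStallTempC = socTempC; // die temp at the moment of the stall
        }
    }
};
```

Unsigned subtraction keeps the elapsed-time comparisons correct across a 32-bit tick rollover, which is why the sketch compares differences rather than absolute timestamps.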

skialpine · 2026-05-25T18:29:25Z

Code review

Reviewed the telemetry diff (deep bug scan + threading/CLAUDE.md check). Found one real issue, now fixed in this PR (commit 8e345d9):

Stall watchdog false-trip during ADC recovery — the watchdog read getDebugInfo().rawValue, which is frozen by definition during the firmware's own powerDown()/powerUp() recovery, so a genuine signal-timeout recovery would be miscounted as a railed/frozen stall (corrupting the very metric this adds). Fixed: skip the check while b_adc_recovery_active and re-seed the window on resume.
Also (cost): it ran getDebugInfo() (which does a sqrt + dataset passes) every loop iteration though only rawValue is used — wasteful on a chip we're characterizing for heat. Now throttled to 250 ms (the ADC only samples ~10/s).

Verified clean: printf format/arg pairing in both status frames; cross-task reads are volatile (benign torn read only); temperatureRead()/resetReasonStr() safe; at-rest false-positive risk is low (24-bit raw at SAMPLES=1 dithers every sample, so 8 s of byte-identical raw is a genuine freeze).

🤖 Generated with Claude Code

From the toolkit review of PR decentespresso#57:
- temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON and a frozen peak since NaN compares false); keep last valid + log once.
- g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the AsyncTCP status path); status frame casts it for printf.
- Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop keeps re-seeding the stall window so weight_stalled may never trip -- the climbing recovery count makes that failure mode visible. i_adc_recovery_count is now volatile (newly read cross-task).
- reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons (CPU_LOCKUP/USB/JTAG) stay attributable.
- Comment fixes: volatile cross-task rationale; stall-window re-seed wording + recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
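
The "unknown_<code>" fallback can be illustrated with a small host-side sketch. Only the fallback format comes from the commit message above. The function name `resetReasonLabel`, the label strings, and the `kRst*` constants are hypothetical; on target the switch would presumably use the ESP-IDF `ESP_RST_*` enumerators rather than the stand-in integers used here.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-ins for a few reset-reason codes so the sketch builds
// without ESP-IDF headers. Real firmware should use the ESP_RST_* values.
constexpr int kRstPowerOn = 1;
constexpr int kRstPanic = 4;
constexpr int kRstBrownout = 9;

// Maps a reset-reason code to a label. Codes without a mapping fall back to
// "unknown_<code>" so the numeric value is still visible in the telemetry.
std::string resetReasonLabel(int code) {
    switch (code) {
        case kRstPowerOn:  return "poweron";   // label text is an assumption
        case kRstPanic:    return "panic";     // label text is an assumption
        case kRstBrownout: return "brownout";  // label text is an assumption
        default: {
            char buf[32];
            std::snprintf(buf, sizeof(buf), "unknown_%d", code);
            return std::string(buf);
        }
    }
}
```

With this shape, a reset reason added by a newer IDF release shows up as, say, `unknown_15` instead of collapsing into a generic "unknown" string.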

- DURATION/IP/HOST now positional args; warn (don't silently skip) if no USB port.
- Telemetry monitor logs reset_reason + adc_recovery_count per line, waits a full status interval after (re)connect, tracks peak temp / stalls / recoveries / reboots across the whole run (so a firmware reset doesn't lose the peak), and prints a SUMMARY line with a PASS/FAIL verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Addresses the iteration-2 review findings on PR decentespresso#57:
- Status frame no longer reads the multi-field stopWatch object directly off the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The loop task now snapshots it into aligned volatiles (g_timerRunning/g_timerElapsed) that both status frames read.
- Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps counting truthfully over a long soak instead of saturating; update the WS format specifier %u -> %lu accordingly.
- SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON.
- Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed sentinel) at boot/rollover.
- README: document the new status-frame telemetry fields.
- thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status frames or a crashed load generator, and exit non-zero on FAIL so it works as a CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
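
As an illustration of the SoC temperature guard, here is a minimal sketch that keeps the last valid reading, tracks the peak, and rejects NaN and +/-inf with `std::isfinite`. The `SocTempTracker` struct and its members are hypothetical names; the firmware itself uses globals such as g_socTempC, and the "log once" behavior is modeled here only as a flag.

```cpp
#include <cmath>

// Hypothetical tracker for the live and peak die temperature.
struct SocTempTracker {
    float lastC = 0.0f;           // last valid reading
    float maxC = 0.0f;            // peak of all valid readings
    bool haveSample = false;      // true once a valid reading was stored
    bool invalidReported = false; // stands in for "log once"

    // Returns true if the reading was accepted. Non-finite readings are
    // dropped so they can neither reach the JSON nor freeze the peak
    // (any comparison against NaN evaluates to false).
    bool update(float readingC) {
        if (!std::isfinite(readingC)) {
            invalidReported = true; // real code would log the first time only
            return false;
        }
        lastC = readingC;
        if (!haveSample || readingC > maxC) {
            maxC = readingC;
        }
        haveSample = true;
        return true;
    }
};
```

Using `std::isfinite` rather than a NaN-only test matters for the peak: a single +inf reading would otherwise become a maximum that no later sample could exceed.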

- thermal_load_test.sh: close the silent-PASS holes a review found.
  - A flapping wedge (scale answers one frame per reconnect, resetting the consecutive-miss streak) now fails via a cumulative total_no_status counter, not just max_no_status_streak.
  - Each load generator's PID is captured and waited on individually so a never-started/crashed driver (non-zero exit) fails the run instead of being missed by a Traceback-only grep.
  - A run that never saw soc_temp_max_c (peak stuck at the -999 sentinel) also fails, since the thermal data the test exists to capture is absent.
- CLAUDE.md: add a "Fixing bugs you find along the way" section (pre-existing bugs get fixed in the same change, not deferred).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
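
The verdict rules listed in this commit live in a shell script; the sketch below restates them in C++ purely to make the decision logic explicit and testable. The struct names, the `soakPassed` function, and the two limit parameters are hypothetical, and the actual thresholds used by thermal_load_test.sh are not given in this PR. Only the counter names and the -999 sentinel are taken from the text above.

```cpp
#include <cstdint>

// Hypothetical summary of one soak run.
struct SoakStats {
    uint32_t maxNoStatusStreak; // longest run of consecutive missed frames
    uint32_t totalNoStatus;     // cumulative missed frames over the run
    uint32_t failedDrivers;     // load generators that exited non-zero
    float peakTempC;            // stays at the sentinel if never observed
};

// Hypothetical limits; the real script's thresholds are not stated here.
struct SoakLimits {
    uint32_t maxStreak;
    uint32_t maxTotal;
};

constexpr float kNoTempSentinel = -999.0f;

// Returns true only if none of the failure conditions apply.
bool soakPassed(const SoakStats& s, const SoakLimits& lim) {
    if (s.maxNoStatusStreak > lim.maxStreak) return false; // sustained loss
    if (s.totalNoStatus > lim.maxTotal) return false;      // flapping wedge
    if (s.failedDrivers > 0) return false;                 // crashed driver
    if (s.peakTempC <= kNoTempSentinel) return false;      // no thermal data
    return true;
}
```

A caller acting as a CI gate would turn the result into an exit status, for example `return soakPassed(stats, limits) ? 0 : 1;`. The cumulative check is what catches the flapping case: a scale that answers one frame per reconnect keeps the streak short while the total keeps growing.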

skialpine and others added 3 commits May 25, 2026 12:24

skialpine and others added 4 commits May 25, 2026 12:39

This was referenced May 25, 2026

fix: prevent WS-broadcast OOM crash under connection churn skialpine/openscale#1

Open

fix: prevent WS-broadcast OOM crash under connection churn #58

Merged

skialpine wants to merge 7 commits into decentespresso:main from skialpine:feat/scale-telemetry
