Scale telemetry: SoC temp, weight-stall watchdog, reset reason #57
Open
skialpine wants to merge 7 commits into
Diagnostics for the field "weight stops being collected under sustained load" reports (suspected thermal). Adds to the WS status frame and serial logs:
- soc_temp_c / soc_temp_max_c: live + peak ESP32-S3 die temperature (temperatureRead()), sampled every 2s on the main loop.
- weight_stalled + stall_count + last_stall_ms + last_stall_temp_c: a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed for >8s (a live cell dithers every sample), recording the die temp at the moment of the stall to correlate failures with heat. (This failure is not firmware-recoverable, so it's surfaced, not silently retried.)
- reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT reset is attributable instead of looking like a clean power-on.

Telemetry-only; no behavior change to the weight/WiFi/BLE paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
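The stall watchdog described in this commit can be sketched roughly as below. This is a host-compilable illustration, not the code in pureScale(): the struct and member names are hypothetical, and clearing the stalled flag when the raw value moves again is an assumption. Only the >8 s threshold and the recorded fields (stall count, stall time, die temperature at the stall) come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a frozen-value watchdog. A railed reading is also
// an unchanging reading, so one "did the raw value move?" test covers both.
struct StallWatchdog {
    static constexpr uint32_t kStallMs = 8000;  // ">8s" from the commit message

    int32_t  lastRaw = 0;
    uint32_t lastChangeMs = 0;
    bool     seeded = false;

    bool     weightStalled = false;   // would feed weight_stalled
    uint32_t stallCount = 0;          // would feed stall_count
    uint32_t lastStallMs = 0;         // would feed last_stall_ms
    float    lastStallTempC = 0.0f;   // would feed last_stall_temp_c

    void update(int32_t raw, uint32_t nowMs, float socTempC) {
        if (!seeded || raw != lastRaw) {
            // A live cell dithers, so any change restarts the window.
            lastRaw = raw;
            lastChangeMs = nowMs;
            seeded = true;
            weightStalled = false;    // assumption: flag clears on recovery
            return;
        }
        // Unsigned subtraction stays correct across a millis() rollover.
        if (!weightStalled && (nowMs - lastChangeMs) > kStallMs) {
            weightStalled = true;
            ++stallCount;
            lastStallMs = nowMs;
            lastStallTempC = socTempC;  // die temp at the moment of the stall
        }
        // Invariant: a raised flag implies at least one counted stall.
        assert(!weightStalled || stallCount > 0);
    }
};
```

Each stall is counted once, on the transition into the stalled state, so stallCount reflects distinct episodes rather than how many times the check ran.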
…nitor) Drives USB 10Hz + WS 10Hz + HTTP/WS churn + mDNS (BT driven externally) and polls the new temp/stall telemetry every ~60s, watching for the weight-stall failure and the die temp at which it occurs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Review follow-ups on the telemetry watchdog:
- Skip the stall check while b_adc_recovery_active (the ADS1232 power-cycle freezes the raw value by design); re-seed the window on resume so a genuine signal-timeout recovery isn't miscounted as a railed/frozen stall.
- Check every 250 ms instead of every loop iteration -- the ADC only produces ~10 samples/s, so polling getDebugInfo() at full loop rate (with its sqrt + dataset passes) just burns CPU/heat, which is counterproductive on the chip we're trying to characterize.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
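One way to picture the two follow-ups together is a small gate that decides whether the stall check should run at all. The sketch below is illustrative only: StallCheckGate, shouldCheck and takeReseed are hypothetical names, and the real firmware may structure this differently. The 250 ms interval and the "skip during recovery, re-seed on resume" behaviour are taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical gate in front of the stall check.
struct StallCheckGate {
    static constexpr uint32_t kCheckIntervalMs = 250;  // from the commit message

    uint32_t lastCheckMs = 0;
    bool wasInRecovery = false;
    bool reseedRequested = false;

    // Returns true when the caller should run the stall check now.
    bool shouldCheck(uint32_t nowMs, bool recoveryActive) {
        if (recoveryActive) {
            // The ADC power-cycle freezes the raw value by design: skip.
            wasInRecovery = true;
            return false;
        }
        if (wasInRecovery) {
            // First pass after recovery ends: ask for a fresh window.
            wasInRecovery = false;
            reseedRequested = true;
        }
        if (nowMs - lastCheckMs < kCheckIntervalMs) {
            return false;  // throttle: the ADC only yields ~10 samples/s
        }
        lastCheckMs = nowMs;
        return true;
    }

    // Returns true once after each recovery; the caller then restarts
    // its stall window instead of comparing against stale timestamps.
    bool takeReseed() {
        bool r = reseedRequested;
        reseedRequested = false;
        return r;
    }
};
```

Keeping the throttle and the recovery skip in one place means the expensive debug-info read only happens when the result can actually be trusted.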
Code review

Reviewed the telemetry diff (deep bug scan + threading/CLAUDE.md check). Found one real issue, now fixed in this PR (commit
Verified clean: printf format/arg pairing in both status frames; cross-task reads are

🤖 Generated with Claude Code
From the toolkit review of PR decentespresso#57:
- temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON and a frozen peak since NaN compares false); keep last valid + log once.
- g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the AsyncTCP status path); status frame casts it for printf.
- Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop keeps re-seeding the stall window so weight_stalled may never trip -- the climbing recovery count makes that failure mode visible. i_adc_recovery_count is now volatile (newly read cross-task).
- reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons (CPU_LOCKUP/USB/JTAG) stay attributable.
- Comment fixes: volatile cross-task rationale; stall-window re-seed wording + recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
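The "unknown_<code>" fallback can be illustrated with a small mapping function. This is a sketch under assumptions: the enum below is a stand-in rather than ESP-IDF's esp_reset_reason_t, its numeric values are illustrative and should be checked against the IDF version in use, and the strings for mapped reasons are hypothetical. Only the "unknown_<code>" shape comes from the commit message.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Stand-in codes for illustration only (not the ESP-IDF enum).
enum DemoResetReason { kDemoPowerOn = 1, kDemoPanic = 4, kDemoBrownout = 9 };

// Maps a reset code to a label; anything unmapped keeps its number so a
// reason added by a newer SDK is still attributable in the logs.
std::string resetReasonName(int code) {
    switch (code) {
        case kDemoPowerOn:  return "poweron";
        case kDemoPanic:    return "panic";
        case kDemoBrownout: return "brownout";
        default: {
            char buf[24];
            std::snprintf(buf, sizeof buf, "unknown_%d", code);
            return std::string(buf);
        }
    }
}
```

Emitting the raw number instead of a bare "unknown" keeps the mapping table from silently hiding reasons it does not yet know about.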
- DURATION/IP/HOST now positional args; warn (don't silently skip) if no USB port.
- Telemetry monitor logs reset_reason + adc_recovery_count per line, waits a full status interval after (re)connect, tracks peak temp / stalls / recoveries / reboots across the whole run (so a firmware reset doesn't lose the peak), and prints a SUMMARY line with a PASS/FAIL verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the iteration-2 review findings on PR decentespresso#57:
- Status frame no longer reads the multi-field stopWatch object directly off the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The loop task now snapshots it into aligned volatiles (g_timerRunning/g_timerElapsed) that both status frames read.
- Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps counting truthfully over a long soak instead of saturating; update the WS format specifier %u -> %lu accordingly.
- SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON.
- Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed sentinel) at boot/rollover.
- README: document the new status-frame telemetry fields.
- thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status frames or a crashed load generator, and exit non-zero on FAIL so it works as a CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
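The SoC temperature guard can be sketched as a tracker that only accepts finite readings. The struct and its members are hypothetical; the firmware keeps these values in globals such as g_socTempC, and its exact logic is not shown in this PR text. The point illustrated is why isfinite() is the stronger test: it rejects NaN and both infinities, whereas !isnan() would let +/-inf through to the JSON.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical live/peak tracker that keeps the last valid reading.
struct SocTempTracker {
    float current = 0.0f;
    float peak = 0.0f;
    bool haveSample = false;
    uint32_t rejected = 0;  // firmware would log once; here we just count

    // Returns true if the reading was accepted.
    bool accept(float reading) {
        if (!std::isfinite(reading)) {
            // NaN compares false against everything, so letting it in
            // would freeze the peak; inf would produce invalid JSON.
            ++rejected;
            return false;
        }
        current = reading;
        if (!haveSample || reading > peak) {
            peak = reading;
        }
        haveSample = true;
        // Invariant: the stored values are always safe to serialize.
        assert(std::isfinite(current) && std::isfinite(peak));
        return true;
    }
};
```

Seeding the peak from the first valid sample, rather than from 0, keeps the peak meaningful even if the die temperature starts below zero.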
- thermal_load_test.sh: close the silent-PASS holes a review found. A flapping wedge (scale answers one frame per reconnect, resetting the consecutive-miss streak) now fails via a cumulative total_no_status counter, not just max_no_status_streak. Each load generator's PID is captured and waited on individually so a never-started/crashed driver (non-zero exit) fails the run instead of being missed by a Traceback-only grep. A run that never saw soc_temp_max_c (peak stuck at the -999 sentinel) also fails, since the thermal data the test exists to capture is absent.
- CLAUDE.md: add "Fixing bugs you find along the way": pre-existing bugs get fixed in the same change, not deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
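The verdict logic described above lives in a shell script; the sketch below restates the idea in C++ purely to show why a cumulative miss counter is needed alongside the streak counter. All names and the threshold parameters are hypothetical, and the script's actual limits are not given here. Only the -999 peak sentinel and the three failure conditions come from the commit message.

```cpp
#include <cassert>

// Hypothetical soak-run bookkeeping.
struct SoakStats {
    static constexpr double kNoTempSentinel = -999.0;  // from the commit message

    int noStatusStreak = 0;
    int maxNoStatusStreak = 0;
    int totalNoStatus = 0;
    double peakTempC = kNoTempSentinel;

    void onPoll(bool gotStatus, double socTempMaxC) {
        if (!gotStatus) {
            ++noStatusStreak;
            ++totalNoStatus;
            if (noStatusStreak > maxNoStatusStreak) {
                maxNoStatusStreak = noStatusStreak;
            }
            return;
        }
        noStatusStreak = 0;  // one good frame resets the streak...
        if (socTempMaxC > peakTempC) {
            peakTempC = socTempMaxC;
        }
    }
};

// ...which is why the total is checked too: a flapping device never
// builds a long streak but still accumulates misses.
bool soakPassed(const SoakStats& s, int maxStreakAllowed, int maxTotalAllowed,
                bool allGeneratorsExitedCleanly) {
    assert(maxStreakAllowed >= 0 && maxTotalAllowed >= 0);
    return s.maxNoStatusStreak <= maxStreakAllowed
        && s.totalNoStatus <= maxTotalAllowed
        && s.peakTempC > SoakStats::kNoTempSentinel  // thermal data was seen
        && allGeneratorsExitedCleanly;
}
```

A run that alternates one missed poll with one good frame keeps the streak at 1 yet fails on the total, which is the flapping-wedge case the commit closes.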
This was referenced May 25, 2026
Summary
Diagnostics for field reports of "weight stops being collected under sustained multi-protocol load". The only recovery seen was a long battery-out cooldown (a quick power-cycle didn't help), which points at a thermal/analog failure rather than firmware state. This PR adds telemetry to confirm or rule that out, with no behavior change to the weight/WiFi/BLE paths.
New fields in the `/snapshot` WS status frame (and serial logs):
- `soc_temp_c` / `soc_temp_max_c`: live + peak ESP32-S3 die temp (`temperatureRead()`), sampled every 2 s.
- `weight_stalled` / `stall_count` / `last_stall_ms` / `last_stall_temp_c`: a watchdog in `pureScale()` that flags when the ADS1232 raw value is frozen/railed >8 s (a live cell dithers every sample), recording the die temp at failure to correlate stalls with heat. Skipped during the deliberate ADC power-cycle recovery; throttled to 250 ms.
- `reset_reason`: `esp_reset_reason()` at boot, so a brownout/panic/WDT reset is attributable.

Plus `tools/thermal_load_test.sh`: a 1-hour USB+WiFi+churn+mDNS soak that polls the telemetry (BT driven externally).

Threading: new cross-task scalars are `volatile` per CLAUDE.md; the status frame only reads them.

Test plan
🤖 Generated with Claude Code