4 changes: 4 additions & 0 deletions CLAUDE.md
@@ -147,6 +147,10 @@ This document is meant to evolve with the codebase. During a session, if you (Cl

If you fix a bug whose symptom is documented in the "When something is broken" table, leave the entry in place — it's still the right "first place to look" for the next person.

## Fixing bugs you find along the way

Pre-existing bugs get fixed too — "it was already there" is not a reason to defer. When you turn up a bug while working on something else (a review flags it, you read past it, a test surfaces it), fix it as part of the same change; a pre-existing bug is no less bad than a newly introduced one, and the person touching the code is the right person to fix it. The only exception is when the fix is genuinely a large, independent effort — then call it out explicitly and agree on a separate change, rather than silently leaving it in place.

## Don't

- Don't call I²C / SPI / blocking IO from the AsyncTCP task.
29 changes: 28 additions & 1 deletion README.md
@@ -182,10 +182,37 @@ Status frame shape:
"soft_sleep": false,
"events_enabled": true,
"rate_hz": 10,
"interval_ms": 100
"interval_ms": 100,
"soc_temp_c": 33.3,
"soc_temp_max_c": 41.2,
"weight_stalled": false,
"stall_count": 0,
"last_stall_ms": 0,
"last_stall_temp_c": 0.0,
"adc_recovery_count": 0,
"reset_reason": "poweron"
}
```

The trailing fields are diagnostic telemetry (added to investigate a thermal
"weight stops being collected" failure under sustained load):

- `soc_temp_c` / `soc_temp_max_c` — current and peak ESP32-S3 die temperature
(°C) since boot. `soc_temp_max_c` is `-100` until the first valid sample.
- `weight_stalled` — `true` while the load-cell raw value has been frozen/railed
for >8 s (readings have stopped), cleared when they resume.
- `stall_count` — number of stall events since boot; `last_stall_ms` is the
`millis()` of the most recent stall onset (`0` = none yet) and
`last_stall_temp_c` is the die temp at that moment (valid only when
`last_stall_ms != 0`).
- `adc_recovery_count` — number of ADC power-cycle recoveries since boot. A
climbing value is the signal for a perpetual-recovery loop (the case
`weight_stalled` is blind to).
- `reset_reason` — why the SoC last reset (`poweron`, `panic`, `brownout`,
`task_wdt`, …), so a reboot mid-soak is explained.

These reset on reboot (not persisted to NVS).
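
The validity rules above are easy to get wrong on the consumer side, so here is a minimal client-side sketch of them in C++. It is not firmware code: the `StatusTelemetry` struct and the helpers `hasPeakSocTemp`, `hasLastStallTemp`, and `looksLikeRecoveryLoop` are hypothetical names chosen for illustration, and the recovery-loop threshold is an assumption for the caller to tune. JSON parsing is left out on purpose.

```cpp
#include <cstdint>

// Hypothetical client-side mirror of the diagnostic fields in a status frame.
// Member names follow the JSON keys; this struct is not part of the firmware.
struct StatusTelemetry {
  float soc_temp_c;
  float soc_temp_max_c;
  bool weight_stalled;
  uint32_t stall_count;
  uint32_t last_stall_ms;
  float last_stall_temp_c;
  uint32_t adc_recovery_count;
};

// soc_temp_max_c stays at the -100 sentinel until the first valid sample.
bool hasPeakSocTemp(const StatusTelemetry &s) {
  return s.soc_temp_max_c > -100.0f;
}

// last_stall_temp_c is meaningful only once a stall has been recorded
// (last_stall_ms != 0); otherwise 0.0 means "no stall yet".
bool hasLastStallTemp(const StatusTelemetry &s) {
  return s.last_stall_ms != 0;
}

// Assumed heuristic, not firmware behavior: the recovery counter climbed by at
// least minRecoveries between two frames while weight_stalled stayed false in
// the later one. A counter that went backwards means a reboot in between.
bool looksLikeRecoveryLoop(const StatusTelemetry &earlier,
                           const StatusTelemetry &later,
                           uint32_t minRecoveries) {
  if (later.adc_recovery_count < earlier.adc_recovery_count) return false;
  const uint32_t delta = later.adc_recovery_count - earlier.adc_recovery_count;
  return !later.weight_stalled && delta >= minRecoveries;
}
```

A consumer would compare successive frames from the same boot and pair a `false` result from `looksLikeRecoveryLoop` caused by a backwards counter with a check of `reset_reason`, since all counters restart at zero after a reset.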

For backwards compatibility, WiFi only sends weight snapshots by default. A
client must send `events on` before periodic status, local scale button presses,
or power-off notifications are emitted. The event stream resets to off when the
39 changes: 38 additions & 1 deletion include/parameter.h
@@ -193,9 +193,46 @@ static const unsigned long ZERO_DISPLAY_MISMATCH_TIMEOUT = 1500;
static const float ZERO_DISPLAY_MISMATCH_THRESHOLD = 0.5;
static const uint8_t ADC_ERROR_RECOVERY_COUNT = 2;
static bool b_adc_recovery_active = false;
static uint8_t i_adc_recovery_count = 0;
// volatile: written on the main loop -- incremented on each ADC power-cycle
// recovery, reset to 0 by resetAdcRecoveryState() -- and read in the WS status
// frame (which can be built on the AsyncTCP task). uint32_t (not uint8_t) so a
// *perpetual* recovery loop -- the one failure mode the stall watchdog is blind
// to -- keeps counting truthfully over a long soak instead of saturating at 255.
static volatile uint32_t i_adc_recovery_count = 0;
//bool b_tempDisablePowerOff = true;

// Instrumentation for diagnosing the "weight stops being collected" failure
// under sustained load (suspected thermal). These are all written on the main
// loop and read by the WS status frame, which is built BOTH on the main loop
// (periodic) AND on the AsyncTCP task (command responses) -- so the read crosses
// a task boundary. volatile prevents the AsyncTCP reader caching a stale value
// (single aligned scalars => the load/store is atomic on Xtensa, no mutex
// needed). b_weightStalled is set by the pureScale() stall watchdog when the ADC
// raw value is frozen/railed.
volatile bool b_weightStalled = false;
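// volatile for the same cross-task reason; written once at boot in setup().
// Qualifier placement matters: the pointer itself is volatile here, not the
// characters it points to.
const char *volatile g_resetReason = "unknown";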
// Peak/last-event stats since boot (no NVS; reset on reboot, which g_resetReason
// then explains). g_socTempMaxC = highest SoC die temp seen. The *_stall_*
// fields capture the most recent stall so the failure is visible after the fact
// -- consumers must treat last_stall_temp_c as valid only when g_lastStallMs != 0
// (0.0 otherwise means "no stall yet", not a real 0 C reading).
volatile float g_socTempC = 0.0f; // latest SoC temperature (C)
volatile float g_socTempMaxC = -100.0f; // peak SoC temperature since boot (C); -100 = no valid sample yet
volatile uint32_t g_stallCount = 0; // number of weight-stall events since boot
volatile unsigned long g_lastStallMs = 0; // millis() of the last stall onset (0 = none)
volatile float g_lastStallTempC = 0.0f; // SoC temp when the last stall began (valid only if g_lastStallMs != 0)

// Snapshot of the stopWatch state, refreshed once per main-loop iteration. The
// WS status frame is built BOTH on the main loop AND on the AsyncTCP task
// (command responses); stopWatch is a multi-field object (running flag + start
// ts + accumulator) also mutated from BLE/USB, so reading it directly off the
// AsyncTCP task can tear (CLAUDE.md). The status frame reads these single
// aligned volatiles instead. g_timerElapsed carries stopWatch.elapsed() in its
// configured resolution (SECONDS) -- it is the WS "timer_seconds" field.
volatile bool g_timerRunning = false;
volatile unsigned long g_timerElapsed = 0;

bool b_negativeWeight = false;

bool b_weight_quick_zero = false; // optimization: show 0 immediately after tare
32 changes: 24 additions & 8 deletions include/websocket.h
@@ -183,22 +183,30 @@ void sendWebsocketRateInfo(AsyncWebSocketClient *client, const char *status) {
}

void sendWebsocketStatus(AsyncWebSocketClient *client, const char *status) {
client->printf("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu}",
client->printf("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu,\"soc_temp_c\":%.1f,\"soc_temp_max_c\":%.1f,\"weight_stalled\":%s,\"stall_count\":%lu,\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,\"adc_recovery_count\":%lu,\"reset_reason\":\"%s\"}",
status,
FIRMWARE_VER,
f_displayedValue,
millis(),
websocketBatteryPercent(),
f_batteryVoltage,
websocketIsCharging() ? "true" : "false",
stopWatch.isRunning() ? "true" : "false",
(unsigned long)stopWatch.elapsed(),
g_timerRunning ? "true" : "false",
g_timerElapsed,
b_u8g2Sleep ? "false" : "true",
b_websocketLowPowerEnabled ? "true" : "false",
b_softSleep ? "true" : "false",
b_websocketEventsEnabled ? "true" : "false",
websocketRateForInterval(weightWebsocketNotifyInterval),
weightWebsocketNotifyInterval);
weightWebsocketNotifyInterval,
g_socTempC,
g_socTempMaxC,
b_weightStalled ? "true" : "false",
(unsigned long)g_stallCount,
g_lastStallMs,
g_lastStallTempC,
(unsigned long)i_adc_recovery_count,
(const char *)g_resetReason);
}

// Broadcast via printfAll(): it holds the library's client-list mutex and
@@ -211,22 +219,30 @@ void sendWebsocketStatus(AsyncWebSocketClient *client, const char *status) {
// without blocking the others.
void sendWebsocketStatusAll(const char *status) {
if (!b_wifiEnabled || !b_websocketEventsEnabled || websocket.count() == 0) return;
websocket.printfAll("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu}",
websocket.printfAll("{\"type\":\"status\",\"status\":\"%s\",\"protocol_version\":1,\"firmware_version\":\"%s\",\"grams\":%.2f,\"ms\":%lu,\"battery_percent\":%d,\"battery_voltage\":%.2f,\"charging\":%s,\"timer_running\":%s,\"timer_seconds\":%lu,\"display_on\":%s,\"low_power\":%s,\"soft_sleep\":%s,\"events_enabled\":%s,\"rate_hz\":%lu,\"interval_ms\":%lu,\"soc_temp_c\":%.1f,\"soc_temp_max_c\":%.1f,\"weight_stalled\":%s,\"stall_count\":%lu,\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,\"adc_recovery_count\":%lu,\"reset_reason\":\"%s\"}",
status,
FIRMWARE_VER,
f_displayedValue,
millis(),
websocketBatteryPercent(),
f_batteryVoltage,
websocketIsCharging() ? "true" : "false",
stopWatch.isRunning() ? "true" : "false",
(unsigned long)stopWatch.elapsed(),
g_timerRunning ? "true" : "false",
g_timerElapsed,
b_u8g2Sleep ? "false" : "true",
b_websocketLowPowerEnabled ? "true" : "false",
b_softSleep ? "true" : "false",
b_websocketEventsEnabled ? "true" : "false",
websocketRateForInterval(weightWebsocketNotifyInterval),
weightWebsocketNotifyInterval);
weightWebsocketNotifyInterval,
g_socTempC,
g_socTempMaxC,
b_weightStalled ? "true" : "false",
(unsigned long)g_stallCount,
g_lastStallMs,
g_lastStallTempC,
(unsigned long)i_adc_recovery_count,
(const char *)g_resetReason);
}

void sendWebsocketWeightAll(float grams, unsigned long ms) {
125 changes: 122 additions & 3 deletions src/hds.ino
@@ -392,10 +392,39 @@ void wifi_init() {

MyUsbCallbacks usbCallbacks;

// Map esp_reset_reason() to a short string for boot logging + WS telemetry, so a
// spontaneous reset (brownout / panic / watchdog) is attributable instead of
// looking like a clean power-on.
const char *resetReasonStr(esp_reset_reason_t r) {
switch (r) {
case ESP_RST_POWERON: return "poweron";
case ESP_RST_EXT: return "ext";
case ESP_RST_SW: return "sw";
case ESP_RST_PANIC: return "panic";
case ESP_RST_INT_WDT: return "int_wdt";
case ESP_RST_TASK_WDT: return "task_wdt";
case ESP_RST_WDT: return "wdt";
case ESP_RST_DEEPSLEEP: return "deepsleep";
case ESP_RST_BROWNOUT: return "brownout";
case ESP_RST_SDIO: return "sdio";
default: {
// Don't collapse unmapped IDF reset codes (e.g. CPU_LOCKUP, USB, JTAG on
// newer IDF) to a bare "unknown" -- keep the numeric code so a new/rare
// reason is still attributable. Written once at boot, so a static buffer
// is safe.
static char buf[16];
snprintf(buf, sizeof(buf), "unknown_%d", (int)r);
return buf;
}
}
}

void setup() {
Serial.begin(115200);
while (!Serial) // Wait for the Serial port to initialize (typically used in Arduino to ensure the Serial monitor is ready)
;
g_resetReason = resetReasonStr(esp_reset_reason());
Serial.printf("[boot] reset_reason=%s\n", (const char *)g_resetReason);
if (!EEPROM.begin(512)) {
Serial.println("EEPROM init failed!");
while (1) {
@@ -932,6 +961,56 @@ void pureScale() {
t_lastScaleData = millis();
}

// Stall watchdog: a live load cell's raw 24-bit value dithers on every
// conversion (the ADS1232/HX711 runs ~10 samples/s at the configured rate). If
// the raw value is byte-identical for >8 s it's frozen or railed (a rail to 0
// freezes rawValue at the last good value via the driver's data>0 guard, so it
// still reads as "unchanged") -- the "weight stops being collected" failure
// (suspected thermal/analog) that an in-firmware ADC power-cycle can't fix.
// Surface it (flag + one-shot log) instead of silently streaming a stuck value.
// Skipped while a deliberate ADC power-cycle recovery is in progress (raw is
// frozen by definition then); the window is re-seeded on the first 250 ms poll
// after recovery clears (via the t_rawChange==0 sentinel). Blind spot: a
// *perpetual* recovery loop (recovery every ~5 s) keeps re-seeding so this flag
// may never trip -- the climbing adc_recovery_count in the status frame is the
// signal for that case. Checked every 250 ms (not every loop): the ADC only
// produces ~10 samples/s, so polling faster just burns CPU/heat.
{
static long lastRaw = 0x7FFFFFFFL;
static unsigned long t_rawChange = 0; // 0 = (re)seed window on next sample
static unsigned long t_stallCheck = 0;
if (b_adc_recovery_active) {
// Deliberate ADC power-cycle in progress: raw is frozen by design, not by
// the failure we detect. Re-seed the window so we don't false-trip when
// streaming resumes.
t_rawChange = 0;
} else if (millis() - t_stallCheck >= 250) {
unsigned long nowMs = millis();
if (nowMs == 0) nowMs = 1; // 0 is the reseed sentinel for t_rawChange; never store it as a real timestamp (boot/rollover)
t_stallCheck = nowMs;
long raw = scale.getDebugInfo().rawValue;
if (t_rawChange == 0) {
lastRaw = raw;
t_rawChange = nowMs;
} else if (raw != lastRaw) {
lastRaw = raw;
t_rawChange = nowMs;
if (b_weightStalled) {
b_weightStalled = false;
Serial.println("[adc] weight readings resumed");
}
} else if (!b_weightStalled && nowMs - t_rawChange > 8000) {
b_weightStalled = true;
g_stallCount++;
g_lastStallMs = nowMs;
g_lastStallTempC = g_socTempC;
Serial.printf("[adc] WEIGHT STALLED #%lu: raw frozen at %ld for >8s soc=%.1fC heap=%lu\n",
(unsigned long)g_stallCount, raw, g_lastStallTempC,
(unsigned long)ESP.getFreeHeap());
}
}
}

if (scale.update()) {
b_newDataReady = true;
t_lastScaleData = millis();
@@ -941,9 +1020,7 @@ void pureScale() {
millis() - t_lastScaleRecovery > 5000) {
Serial.println("Scale ADC timeout. Power cycling ADC.");
b_adc_recovery_active = true;
if (i_adc_recovery_count < 255) {
i_adc_recovery_count++;
}
i_adc_recovery_count++; // uint32_t: counts truthfully, won't wrap in any realistic runtime
scale.powerDown();
delay(5);
scale.powerUp();
@@ -1268,11 +1345,53 @@ void loop() {
// here on the loop task rather than racing peripheral drivers.
processWsPendingCmds();

// Snapshot the multi-field stopWatch into aligned volatiles on the loop task so
// the WS status frame (built on the AsyncTCP task for command responses) never
// reads stopWatch cross-task. Done after the drain above so a just-applied
// timer start/stop/zero is reflected. elapsed() is in the configured
// resolution (SECONDS).
g_timerRunning = stopWatch.isRunning();
g_timerElapsed = (unsigned long)stopWatch.elapsed();

if (b_powerOff){
shut_down_now_nobeep();
return;
}

// SoC-temperature sampler + peak tracking (diagnosing the suspected thermal
// stall). Runs every 2 s regardless of WiFi state or power-supply mode
// (USB/battery); prints a summary every 10 s so a serial capture during a
// stress run shows the temp trend, and feeds g_socTempC/Max into the WS
// status frame.
{
static unsigned long t_tempSample = 0, t_tempLog = 0;
unsigned long nowMs = millis();
if (nowMs - t_tempSample >= 2000) {
t_tempSample = nowMs;
float t = temperatureRead();
// temperatureRead() returns NaN if the SoC sensor is unavailable. Reject
// any non-finite value (NaN or +/-inf): NaN serializes as invalid JSON and
// a non-finite compare would freeze the peak. Keep the last valid value and
// log once so the failure is visible rather than silent.
if (isfinite(t)) {
g_socTempC = t;
if (t > g_socTempMaxC) g_socTempMaxC = t;
} else {
static bool tempFailLogged = false;
if (!tempFailLogged) {
tempFailLogged = true;
Serial.println("[temp] temperatureRead() returned a non-finite value -- SoC sensor unavailable");
}
}
if (nowMs - t_tempLog >= 10000) {
t_tempLog = nowMs;
Serial.printf("[temp] soc=%.1fC max=%.1fC stalls=%lu last_stall=%lums stall_temp=%.1fC heap=%lu\n",
g_socTempC, g_socTempMaxC, (unsigned long)g_stallCount,
g_lastStallMs, g_lastStallTempC, (unsigned long)ESP.getFreeHeap());
}
}
}

if (bleState == CONNECTED && b_requireHeartBeat && millis() - t_firstConnect > HEARTBEAT_TIMEOUT) {
if (millis() - t_heartBeat > HEARTBEAT_TIMEOUT) {
disconnectBLE();