Replies: 3 comments
Thanks for your thorough investigation. I suggest you provide a PR so I can check whether it's useful (which seems to be the case) and cherry-pick solutions while weighing code impact against usefulness.
I've split my results into several branches and prepared a series of PRs, to separate the topics from each other for easier review, if you don't mind.
Proposed PR series for backlog improvement
The investigation required reliable heap observability first. The PRs below are listed
PRs 1-5 are independent. PRs 6-10 form a dependent chain. PR 10 requires PRs 6-9.
Short overview of the individual PRs
PR 1 - ESP8266: fix heap metric functions and add OOM diagnostics
Two ESP8266 heap metric functions returned values that do not represent the technical
This PR changes both functions and adds diagnostic tooling built during the
Visible change: The SetOption130 heap timestamp on ESP8266 now includes a
PR 2 - Fix JsonParser NULL dereference on allocation failure
PR 3 - ESP8266 stack dump decoder
A shell script for decoding ESP8266 exception stack traces using firmware ELF files.
PR 4 - Firmware artifact timestamps
Appends a timestamp derived from the ELF build time to firmware artifact filenames
PR 5 - Rule cache early initialization
Rule sets stored with compression (
PRs 6-9 - Backlog fixes (dependent chain, submitted after PRs 1-5 land)
These address some of the causes of backlog heap drain in sequence; each builds on the previous merge:
PR 10 - Fast-lane queue (feature, opt-in build flag)
An optional two-lane queue architecture that lets time-critical commands, issued with
I will submit PRs sequentially, starting with PR 1. Happy to discuss scope, approach,
Cheers, J'Lo
Impressive work, thanks. Keep the PRs coming!
While investigating persistent heap exhaustion on a lightly loaded ESP8266 device
running Tasmota 15.4, I traced one possible cause to the backlog command queue.
Symptom: Free heap on an ESP8266-based device declined steadily from ~28 kB to
under 4 kB over 30-200 minutes under normal operation (MQTT, rules, periodic
telemetry). Issuing Backlog with no arguments (which clears the queue)
immediately restored heap. The device ran two nearly full rule sets (both compressed
to near the character limit), generating frequent command sequences via Backlog.
A third rule set carries temperature watches.
On an ESP32-based device, free heap declined from ~145 kB to under ~75 kB under
the same load. Issuing Backlog with no arguments (which clears the queue)
immediately restored heap. The device ran the same rule sets as the ESP8266-based device.
In my test setup, the ESP8266 device never ran longer than 5 hours before crashing.
Most crashes happened within the first hour of uptime.
The ESP32 device could run for 12 hours while the persistent heap usage built up, thanks to
more free heap at startup.
I tested with different MQTT message rates, with gaps between messages ranging from 2 s to 20 s.
Investigating ~200 hours of console logs and ~30 crash stack dumps, and compiling and trying
small code modifications, produced test results that could be distilled to a small number
of causes for heap drain. On my test system, the combined occurrence of those causes
eventually ended in heap fragmentation and heap drain.
From the beginning, one of the important questions was whether this is plain heap
drain (a classic memory leak), heavy heap fragmentation, or a combination of the two.
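One rough way to tell the two cases apart from logged heap samples is sketched below. It assumes each sample carries the total free heap and the largest free block; `HeapSample`, `fragmentation_percent`, `classify`, and the thresholds are hypothetical names and values chosen for illustration, not anything taken from the firmware or from my patches. The fragmentation figure is a deliberately simple estimate (the share of free heap that cannot be obtained in one allocation).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sample type: how these two numbers are read on the device is out of scope.
struct HeapSample {
    uint32_t free_bytes;     // total free heap at sample time
    uint32_t largest_block;  // largest single free block at sample time
};

// Simple estimate: percentage of free heap that is NOT available as one block.
unsigned fragmentation_percent(const HeapSample& s) {
    if (s.free_bytes == 0) return 0;
    uint32_t largest = s.largest_block < s.free_bytes ? s.largest_block : s.free_bytes;
    return 100u - static_cast<unsigned>((100ull * largest) / s.free_bytes);
}

enum class Trend { Stable, Drain, Fragmentation, Both };

// Heuristic: compare the first and last sample of a log window.
// min_drop_bytes and min_frag_rise are illustrative thresholds.
Trend classify(const std::vector<HeapSample>& samples,
               uint32_t min_drop_bytes, unsigned min_frag_rise) {
    if (samples.size() < 2) return Trend::Stable;
    const HeapSample& first = samples.front();
    const HeapSample& last = samples.back();

    bool drained = first.free_bytes > last.free_bytes &&
                   first.free_bytes - last.free_bytes >= min_drop_bytes;

    unsigned f0 = fragmentation_percent(first);
    unsigned f1 = fragmentation_percent(last);
    bool fragmented = f1 > f0 && f1 - f0 >= min_frag_rise;

    if (drained && fragmented) return Trend::Both;
    if (drained) return Trend::Drain;
    if (fragmented) return Trend::Fragmentation;
    return Trend::Stable;
}
```

A window where free heap falls from 28000 to 4000 bytes while the largest block stays close to the free total classifies as `Drain`; a window where free heap barely moves but the largest block shrinks from 26000 to 6000 bytes classifies as `Fragmentation`.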
After the series of patches, my ESP8266 test setup could run for hours on MQTT events
every 1 to 2 s, but over those hours it still suffers from heap fragmentation until it crashes.
So I consider the rules of the test setup too heavy for production use on this platform.
Nevertheless, they still form a good load-test framework for the subsystems
under test.
After the series of patches, the ESP32 test setup could run with MQTT message events
300 ms apart for several days of uptime without suffering from heap drain or heap
fragmentation.
By-products of those investigations were some small changes and fixes, as well as some
new tools I built for myself, that I would like to share with the community too:
from the platformio_override.ini file; those timestamps are identical to the build timestamp output from the console, because the timestamp is read robustly from the ELF file symbols. This keeps the console logs tied to the binary artefacts of the build process.
a roughly pre-cut console log and a pointer to the ELF file, to decode the addresses presented in the dump
my test setup
now showing the fragmentation as well as the largest available heap chunk, plus some heap statistics in the status output.
early during init
Causes found for heap drain by backlogs:
Timer reset during drain - every call to CommandHandler resets TasmotaGlobal.backlog_timer,
including calls that originate from within a backlog drain loop. The queue only drains when
there is a pause of at least SetOption34 milliseconds between commands. A sufficiently busy
event rule sequence can prevent the queue from ever draining, causing unbounded growth,
especially with MQTT events.
Global drain flags - backlog_nodelay and backlog_no_mqtt_response are set once per Backlog
invocation and remain set globally until the queue empties. Interleaved Backlog calls with
different flags (e.g. Backlog0 followed by Backlog2) corrupt the per-sequence behavior the
programmer intended and can also prevent the backlog queue from draining.
No queue size limit - there is no upper bound on queue memory consumption.
A runaway rule sequence can exhaust heap before any other measures trigger.
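The first and third causes can be pictured with a small simulation. This is a toy model only: the constants, `SimResult`, and `simulate` are made up for illustration and are not taken from the Tasmota sources or from my patches. It assumes a rule runs a direct command every 150 ms, appends two backlog entries once per second, and the drain needs a 200 ms pause.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy model only; none of these names or values come from the Tasmota sources.
constexpr uint32_t kDelayMs = 200;           // stands in for the required pause between commands
constexpr uint32_t kCommandPeriodMs = 150;   // a rule runs a direct command this often
constexpr uint32_t kBacklogPeriodMs = 1000;  // a rule appends backlog entries this often
constexpr std::size_t kEntriesPerBacklog = 2;

struct SimResult {
    std::size_t peak;       // largest queue length seen
    std::size_t remaining;  // entries still queued at the end
    std::size_t dropped;    // entries rejected by the size cap
};

// every_command_resets_timer: true models "every command pushes the timer out",
// false lets only the drain loop pace itself.
// max_queue: 0 means unbounded; otherwise entries beyond the cap are dropped.
SimResult simulate(bool every_command_resets_timer, uint32_t duration_ms,
                   std::size_t max_queue) {
    std::deque<int> queue;
    uint32_t backlog_timer = 0;  // earliest time the next entry may run
    SimResult r{0, 0, 0};
    int next_id = 0;

    for (uint32_t now = 0; now < duration_ms; ++now) {  // 1 ms ticks
        bool command_ran = false;
        if (now % kCommandPeriodMs == 0) command_ran = true;  // direct command
        if (now % kBacklogPeriodMs == 0) {                    // Backlog invocation
            command_ran = true;
            for (std::size_t i = 0; i < kEntriesPerBacklog; ++i) {
                if (max_queue != 0 && queue.size() >= max_queue) {
                    ++r.dropped;
                    continue;
                }
                queue.push_back(next_id++);
            }
        }
        if (command_ran && every_command_resets_timer) backlog_timer = now + kDelayMs;
        if (queue.size() > r.peak) r.peak = queue.size();

        // Drain one entry once the pause has elapsed, then wait again.
        if (!queue.empty() && now >= backlog_timer) {
            queue.pop_front();
            backlog_timer = now + kDelayMs;
        }
    }
    r.remaining = queue.size();
    return r;
}
```

With these made-up numbers, `simulate(true, 60000, 0)` never drains and ends with 120 queued entries after one simulated minute, while `simulate(false, 60000, 0)` never holds more than 2 entries. Passing a cap, as in `simulate(true, 60000, 16)`, bounds memory at 16 entries at the price of dropping the rest, which is the trade-off any queue size limit has to make.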
The three root causes above produce observable failures in combined rule/Backlog scenarios.
In contrast, the documented firmware behavior matches only in isolated single-sequence tests.
These symptoms have appeared in earlier issues; the documented mitigations (periodic restart,
reduced rule complexity) did not apply to my setup: crashes occurred well within any practical
restart interval, often within 20 to 240 minutes. My rule sets represent normal production use
for relay control of low to medium complexity.
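To make the global drain flags cause concrete, here is a minimal sketch comparing one shared flag set with flags stored per queue entry. It is a simplified model under stated assumptions: all invocations are queued before draining starts, the last invocation's flags overwrite the shared ones, and every type and function name (`DrainFlags`, `Invocation`, `Executed`, `drain_with_global_flags`, `drain_with_entry_flags`) is hypothetical. Per-entry flags are shown only as one conceivable remedy, not as a description of my patches.

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the firmware's data structures.
struct DrainFlags {
    bool nodelay;
    bool no_mqtt_response;
};

struct Invocation {  // one Backlog call: its flags plus its commands
    DrainFlags flags;
    std::vector<std::string> commands;
};

struct Executed {  // what a command actually ran with
    std::string command;
    DrainFlags flags;
};

// Shared flags: each invocation overwrites them, and the drain applies
// whatever is current to every entry, regardless of who queued it.
std::vector<Executed> drain_with_global_flags(const std::vector<Invocation>& calls) {
    std::deque<std::string> queue;
    DrainFlags global{false, false};
    for (const Invocation& call : calls) {
        global = call.flags;
        for (const std::string& c : call.commands) queue.push_back(c);
    }
    std::vector<Executed> out;
    while (!queue.empty()) {
        out.push_back(Executed{queue.front(), global});
        queue.pop_front();
    }
    return out;
}

// Per-entry flags: each command keeps the flags of the invocation that queued it.
std::vector<Executed> drain_with_entry_flags(const std::vector<Invocation>& calls) {
    std::deque<Executed> queue;
    for (const Invocation& call : calls) {
        for (const std::string& c : call.commands) queue.push_back(Executed{c, call.flags});
    }
    std::vector<Executed> out;
    while (!queue.empty()) {
        out.push_back(queue.front());
        queue.pop_front();
    }
    return out;
}
```

If a no-delay sequence is followed by a no-response sequence before the queue empties, the shared-flag variant runs the first sequence with the second sequence's flags, so it loses its no-delay behavior and gains suppressed responses it never asked for. The per-entry variant preserves what each sequence requested.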
I found and fixed the underlying causes. Before submitting the PR series, I'd like to know whether
there is interest in those fixes - the changes are measured and build-tested. I'd rather discuss
scope upfront than submit work that falls outside the project's current scope.
Reproduction rule sets are available, if useful for review. Log captures and stack traces
can be reproduced, but were not kept because of their sheer size (some hundreds to thousands of megabytes).
The investigation was done with the help of docker-tasmota for reproducible builds - thanks to Jason2866 and the whole contributor team for maintaining it.