Backlog queue drains heap under load - investigation and proposed fixes #24776

joluxer · 2026-05-21T08:06:09Z

While investigating persistent heap exhaustion on a lightly loaded ESP8266 device
running Tasmota 15.4, I traced one possible cause to the backlog command queue.

Symptom: Free heap on an ESP8266-based device declined steadily from ~28 kB to
under 4 kB over 30-200 minutes under normal operation (MQTT, rules, periodic
telemetry). Issuing Backlog with no arguments (which clears the queue)
immediately restored heap. The device ran two nearly full rule sets (both compressed
to near the character limit), generating frequent command sequences via Backlog.
A third rule set carries temperature watches.

On an ESP32-based device, free heap declined from ~145 kB to under ~75 kB under
the same load. Again, issuing Backlog with no arguments (which clears the queue)
immediately restored heap. The device ran the same rule sets as the ESP8266-based device.

The ESP8266 device never ran longer than 5 hours before crashing in my test setup.
Most crashes happened within the first hour of uptime.

The ESP32 device could run for 12 hours while the persistent heap usage built up, because it has more free
heap at startup.

I tested with different MQTT message rates, with gaps between messages ranging from 2 s to 20 s.

Roughly 200 hours of console logs, about 30 crash stack dumps, and compile-and-try cycles with
small code modifications produced test results that could be distilled into a small number
of causes for the heap drain. On my test system, the combined occurrence of these causes eventually
ended in heap fragmentation and heap drain.

One of the important questions from the beginning was whether this is plain heap
drain (a classic memory leak), heavy heap fragmentation, or a combination of both.

After the series of patches, my ESP8266 test setup can run for hours with MQTT events
every 1 to 2 s, but during those hours it still suffers from heap fragmentation until it crashes.
I therefore consider the rules of the test setup too heavy for production use on this platform.
Nevertheless, they still form a good load-test framework for the subsystems
under test.

After the series of patches, the ESP32 test setup can run with MQTT message events
300 ms apart for several days of uptime without suffering from heap drain or heap
fragmentation.

By-products of the investigation were some small changes and fixes, as well as some
new tools, that I would like to share with the community too:

- optional timestamps for the ELF files, maps, and binaries, as a developer opt-in
  in the platformio_override.ini file; these timestamps are identical to the
  build timestamp shown on the console, because the timestamp is read from the ELF
  file symbols in a robust way. They preserve the link between console logs and the binary
  artefacts of the build process.
- a crash stack dump decoder for the ESP8266 console output, which can be fed with
  a roughly pre-cut console log and a pointer to the ELF file to decode the
  addresses shown in the dump
- a nullptr dereference fix for the jsmn JSON parser, which struck quite often in
  my test setup
- a modification of the SetOption130 output on ESP8266 to be similar to that on ESP32,
  now showing the fragmentation as well as the largest available heap chunk; additionally,
  some heap statistics in the status output
- somewhat less heap fragmentation from the rules processor, by allocating decompression memory
  early during init

Causes found for heap drain by backlogs:

1. Timer reset during drain - every call to CommandHandler resets
   TasmotaGlobal.backlog_timer, including calls that originate from within a
   backlog drain loop. The queue only drains when there is a pause of at least
   SetOption34 milliseconds between commands. A sufficiently busy event rule sequence
   can prevent the queue from ever draining, causing unbounded growth, especially with
   MQTT events.
2. Global drain flags - backlog_nodelay and backlog_no_mqtt_response are set
   once per Backlog invocation and remain set globally until the queue empties.
   Interleaved Backlog calls with different flags (e.g. Backlog0 followed by
   Backlog2) corrupt the programmer-intended per-sequence behavior and can also prevent
   the backlog queue from draining.
3. No queue size limit - there is no upper bound on queue memory consumption.
   A runaway rule sequence can exhaust heap before any other measures trigger.
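
The first cause can be illustrated with a small host-side model. This is only a sketch, not Tasmota code: `ToyBacklog`, `simulate`, and all timing values are hypothetical, and the model assumes nothing beyond the behavior described above (every executed command pushes the drain deadline forward by the configured delay). The `loop_owns_timer` switch models the idea that only the drain loop may move the deadline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>

// Toy model of a delayed command queue. All names are hypothetical; this is
// not the Tasmota implementation, only the timing behavior described above.
struct ToyBacklog {
  std::deque<std::string> queue;
  uint32_t delay_ms;           // stands in for the SetOption34 delay
  uint32_t next_drain_ms = 0;  // stands in for TasmotaGlobal.backlog_timer
  bool loop_owns_timer;        // false: every command resets the timer

  ToyBacklog(uint32_t delay, bool owns)
      : delay_ms(delay), loop_owns_timer(owns) {}

  // Called for every executed command, whether it came from an external
  // event (MQTT, rule) or from the drain loop itself.
  void commandHandler(uint32_t now_ms, bool from_drain_loop) {
    if (!loop_owns_timer || from_drain_loop) {
      next_drain_ms = now_ms + delay_ms;
    }
  }

  // Called once per simulated millisecond; runs at most one queued entry.
  void loop(uint32_t now_ms) {
    if (!queue.empty() && now_ms >= next_drain_ms) {
      queue.pop_front();
      commandHandler(now_ms, true);
    }
  }
};

// Simulates 60 s: an external command every 150 ms (shorter than the 200 ms
// drain delay); every 10th event also queues two commands.
// Returns the number of entries still queued at the end.
size_t simulate(bool loop_owns_timer) {
  const uint32_t kDelayMs = 200, kEventGapMs = 150, kDurationMs = 60000;
  ToyBacklog backlog(kDelayMs, loop_owns_timer);
  uint32_t event_count = 0;
  for (uint32_t now = 0; now < kDurationMs; ++now) {
    if (now % kEventGapMs == 0) {
      backlog.commandHandler(now, false);  // direct command from an event
      if (event_count % 10 == 0) {
        backlog.queue.push_back("cmd1");
        backlog.queue.push_back("cmd2");
      }
      ++event_count;
    }
    backlog.loop(now);
  }
  return backlog.queue.size();
}
```

With these numbers, `simulate(false)` ends with 80 queued entries because the deadline is always pushed out before it is reached, while `simulate(true)` ends with an empty queue even though the arrival rate is identical.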

The three root causes above produce observable failures in combined rule/Backlog scenarios.

In contrast, the firmware behaves as documented only in isolated single-sequence tests.

These symptoms have appeared in earlier issues; the documented mitigations (periodic restart,
reduced rule complexity) did not apply to my setup: crashes occurred well within any practical
restart interval, often within 20 to 240 minutes. My rule sets represent normal production use
for relay control of low to medium complexity.

I found and fixed the underlying causes. Before submitting the PR series, I'd like to know whether
there is interest in those fixes - the changes are measured and build-tested. I'd rather discuss
scope upfront than submit work that falls outside the project's current scope.

Reproduction rule sets are available if useful for review. Log captures and stack traces
can be reproduced, but were not kept due to their sheer size (some hundreds to thousands of megabytes).

The investigation was done with the help of docker-tasmota for reproducible builds - thanks to Jason2866 and the whole contributor team for maintaining it.

arendst (Maintainer) · 2026-05-21T10:51:40Z

Thx for your thorough investigation.

I suggest you do provide a PR for me to chk if it's useful (which seems to be the case) and to cherry pick solutions while weighing the code impact over usefulness.

joluxer (Author) · 2026-05-21T13:12:33Z

I've split my results into several branches and prepared a series of PRs to separate the topics from each other for easier review, if you don't mind.

Proposed PR series for backlog improvement

The investigation required reliable heap observability first. The PRs below are listed
in submission order; technical details will follow in the description of each individual PR once it is submitted.

| # | Short description |
|---|---|
| 1 | ESP8266: fix heap metric functions; add opt-in OOM diagnostics, Status 4 extensions, Status 44 heap dump |
| 2 | Fix `JsonParser` NULL dereference on allocation failure |
| 3 | Tool: ESP8266 stack dump decoder |
| 4 | Build: firmware artifact timestamps from ELF |
| 5 | Rules: pre-populate rule cache before network stack init |
| 6 | Backlog: fix timer reset during drain (root cause nr. 1) |
| 7 | Backlog: contain flag mutation side-effects |
| 8 | Backlog: fix flag corruption across sequences |
| 9 | Backlog: queue introspection commands |
| 10 | Backlog: opt-in two-queue fast-lane (feature, build flag) |

PRs 1-5 are independent. PRs 6-10 form a dependent chain. PR 10 requires PRs 6-9.

Short overview of the individual PRs

PR 1 - ESP8266: fix heap metric functions and add OOM diagnostics

Two ESP8266 heap metric functions returned values that do not represent the actual
heap state in the same way as on ESP32. ESP_getMaxAllocHeap()
returned total free heap instead of the largest contiguous allocatable block.
ESP_getHeapFragmentation() read a struct field that is only valid immediately after
a heap walk, producing stale or zero values otherwise. Without accurate values, heap
fragmentation was not reliably observable during the investigation.
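
To make the difference between the two metrics concrete, here is a minimal sketch that derives them from a list of free block sizes. `HeapSummary` and `summarize` are hypothetical names, and the fragmentation formula (100 minus the largest block as a percentage of total free heap) is an assumption chosen for illustration; the functions in the PR may compute it differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of a heap walk; not a Tasmota or SDK structure.
struct HeapSummary {
  size_t total_free;           // sum of all free blocks
  size_t largest_block;        // largest single contiguous free block
  unsigned fragmentation_pct;  // assumed: 100 - largest * 100 / total
};

// Derives the metrics from the sizes of the free blocks found by a heap walk.
HeapSummary summarize(const std::vector<size_t>& free_blocks) {
  HeapSummary s{0, 0, 0};
  for (size_t block : free_blocks) {
    s.total_free += block;
    s.largest_block = std::max(s.largest_block, block);
  }
  if (s.total_free > 0) {
    s.fragmentation_pct = static_cast<unsigned>(
        100 - (100 * s.largest_block) / s.total_free);
  }
  return s;
}
```

For free blocks of 6000, 2000, 1000, and 1000 bytes, total free heap is 10000 bytes but the largest allocatable block is only 6000 bytes, which this formula reports as 40% fragmentation. A function that returns total free heap in place of the largest block hides exactly this gap.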

This PR changes both functions and adds diagnostic tooling built during the
investigation: OOM event counting and Status 44, which triggers a full heap block map
dump to serial. The heap map output was the primary tool for confirming the fragmentation
patterns described above.

Visible change: The SetOption130 heap timestamp on ESP8266 now includes a
fragmentation percentage field, matching the existing ESP32 format. Details and
before/after captures in the PR.

PR 2 - Fix JsonParser NULL dereference on allocation failure

JsonParser allocates with new[], which returns nullptr under -fno-exceptions
(standard ESP8266 build) instead of throwing. The allocation result is not checked,
producing a NULL dereference crash under memory pressure. One-line fix in
lib/default/jsmn-shadinger-1.0.
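
The pattern behind this fix can be sketched as follows. This is not the actual `JsonParser` code: `Token` and `TokenBuffer` are hypothetical, and the `simulate_failure` parameter exists only so that the failure path can be exercised on a host. On a hosted compiler plain `new[]` throws on failure, so the sketch uses `new (std::nothrow)` to get the null-returning behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical, simplified token type; not the jsmn token layout.
struct Token {
  int start;
  int end;
};

class TokenBuffer {
 public:
  // `simulate_failure` is for demonstration only: it forces the state that
  // a failed allocation leaves behind.
  explicit TokenBuffer(size_t count, bool simulate_failure = false)
      : tokens_(simulate_failure ? nullptr : new (std::nothrow) Token[count]),
        count_(tokens_ ? count : 0) {}
  ~TokenBuffer() { delete[] tokens_; }
  TokenBuffer(const TokenBuffer&) = delete;
  TokenBuffer& operator=(const TokenBuffer&) = delete;

  bool valid() const { return tokens_ != nullptr; }

  // Every access checks the allocation result before dereferencing.
  bool set(size_t index, int start, int end) {
    if (!tokens_ || index >= count_) return false;
    tokens_[index] = Token{start, end};
    return true;
  }

 private:
  Token* tokens_;  // declared before count_, which depends on it
  size_t count_;
};
```

A caller can then treat `valid() == false` as a parse failure under memory pressure instead of crashing on the first token write.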

PR 3 - ESP8266 stack dump decoder

A shell script for decoding ESP8266 exception stack traces using firmware ELF files.
Used during the investigation to make crash output actionable.

PR 4 - Firmware artifact timestamps

Appends a timestamp derived from the ELF build time to firmware artifact filenames
(.bin, .elf, .map.gz). Correlates captured logs (crash dumps, test results)
with the exact binary that produced them, and makes multiple builds distinguishable
without relying on filesystem timestamps.

PR 5 - Rule cache early initialization

Rule sets stored with compression (USE_UNISHOX_COMPRESSION, SetOption93 = 0)
decompress into heap-allocated cache objects on first rule evaluation - after WiFi,
MQTT, and WebServer have already allocated heap. These permanent objects land in the
middle of already-fragmented heap. Pre-populating the cache during early init (before
network stack allocation) drops heap fragmentation at boot from ~25-50% to ~1% on
devices with two or more active rule sets.

PRs 6-9 - Backlog fixes (dependent chain, submitted after PRs 1-5 land)

These address the causes of backlog heap drain in sequence; each builds on the previous merge:

- PR 6 - Fix timer reset during drain (root cause nr. 1): BacklogLoop retains timer
  ownership after each drain step, preventing external resets from stalling the queue.
- PR 7 - Centralize backlog state access: subsystem-internal state is currently
  reachable from multiple code paths with implicit cross-module side effects. This PR
  limits mutation to explicit access boundaries, making state transitions traceable and
  enabling the behavioral fixes that follow.
- PR 8 - Behavioral improvements: new SetOption166 (configurable drain window),
  per-entry command flags (root cause nr. 2), configurable queue byte limit (root cause nr. 3).
  Scheduling fixes: CmndDelay regression and NoDelay scoped as a one-shot flag
  per Backlog sequence.
- PR 9 - Queue introspection commands: live queue state is now observable without serial.
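
The two ideas in PR 8 that target root causes nr. 2 and nr. 3 can be sketched together: each queue entry carries its own flags, and the queue refuses entries once a byte budget is used up. Everything below is hypothetical (`EntryFlags`, `Entry`, `BoundedBacklog`, the accounting by command length); it illustrates the approach, not the code in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Hypothetical per-entry flags, replacing queue-wide global flags.
enum EntryFlags : uint8_t { kNone = 0, kNoDelay = 1, kNoMqttResponse = 2 };

struct Entry {
  std::string command;
  uint8_t flags;  // travels with the command it was queued with
};

class BoundedBacklog {
 public:
  explicit BoundedBacklog(size_t max_bytes) : max_bytes_(max_bytes) {}

  // Rejects the entry instead of growing past the byte budget.
  bool push(std::string command, uint8_t flags) {
    if (used_bytes_ + command.size() > max_bytes_) return false;
    used_bytes_ += command.size();
    entries_.push_back(Entry{std::move(command), flags});
    return true;
  }

  // Removes the oldest entry and returns it together with its own flags.
  bool pop(Entry& out) {
    if (entries_.empty()) return false;
    out = std::move(entries_.front());
    entries_.pop_front();
    used_bytes_ -= out.command.size();
    return true;
  }

  size_t size() const { return entries_.size(); }
  size_t usedBytes() const { return used_bytes_; }

 private:
  std::deque<Entry> entries_;
  size_t max_bytes_;
  size_t used_bytes_ = 0;
};
```

Because the flags are stored per entry, a sequence queued with `kNoDelay` cannot change how an earlier, still-queued sequence is executed, and a runaway producer gets a rejected `push` instead of exhausting the heap.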

PR 10 - Fast-lane queue (feature, opt-in build flag)

An optional two-lane queue architecture that lets time-critical commands, issued with
Backlog0/Backlog2, bypass the normal drain window. Off by default, opt-in at build time;
requires PRs 6-9. Proposed after PRs 6-9 land.
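
As a rough sketch of the two-lane idea, assuming that fast-lane entries are always served first and never wait for the drain window: `TwoLaneQueue` and its methods are hypothetical and say nothing about how the PR implements scheduling or ordering guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>

// Hypothetical two-lane queue: the fast lane ignores the drain window,
// the normal lane releases one entry per window.
class TwoLaneQueue {
 public:
  explicit TwoLaneQueue(uint32_t window_ms) : window_ms_(window_ms) {}

  void pushNormal(const std::string& cmd) { normal_.push_back(cmd); }
  void pushFast(const std::string& cmd) { fast_.push_back(cmd); }

  // Writes the next due command to `out` and returns true, or returns
  // false if nothing is due at `now_ms`.
  bool next(uint32_t now_ms, std::string& out) {
    if (!fast_.empty()) {
      out = fast_.front();
      fast_.pop_front();
      return true;  // does not touch the normal-lane deadline
    }
    if (!normal_.empty() && now_ms >= next_normal_ms_) {
      out = normal_.front();
      normal_.pop_front();
      next_normal_ms_ = now_ms + window_ms_;
      return true;
    }
    return false;
  }

 private:
  std::deque<std::string> fast_;
  std::deque<std::string> normal_;
  uint32_t window_ms_;
  uint32_t next_normal_ms_ = 0;
};
```

In this model a fast-lane command queued while the normal lane is waiting for its window is returned on the very next call to `next`, and the normal lane continues on its original schedule afterwards.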

I will submit PRs sequentially, starting with PR 1. Happy to discuss scope, approach,
or memory impact before individual PRs land. The reproduction rule set is available
on request, if helpful.

Cheers, J'Lo

s-hadinger (Collaborator) · 2026-05-28T17:58:50Z

Impressive work, thanks. Keep going sending PRs!
