fix(tui): bound previously-unbounded caches to prevent OOM on long sessions #2831
creates pkg/lrucache with a small generic LRU cache to prevent memory growth in long-running sessions. replaces the private lruCache in markdown with the shared package, and bounds the previously-unbounded render cache (64 entries) and renderedItems cache (500 entries) in editfile and messages components. also fixes a pre-existing data race in the syntax highlight and editfile caches: LRU.Get mutates the recency list, so call sites must hold a write lock, not a read lock. switches both call sites from sync.RWMutex to sync.Mutex and documents the gotcha on lrucache.Get.
@@ -71,37 +81,29 @@ type linePair struct {
}

func getOrCreateCache(toolCallID string) *toolRenderCache {
[LOW] Rendered result written to orphaned struct after LRU eviction
getOrCreateCache releases cacheMu before returning its pointer. If the LRU later evicts that entry (because 64 other tool-call IDs are Put between the two lock acquisitions in renderEditFile/countDiffLines), a subsequent getOrCreateCache for the same ID allocates a fresh *toolRenderCache. The original goroutine still holds the old pointer and writes c.rendered = result to the now-orphaned struct — the result never lands in the LRU and is lost on the next lookup.
Impact: Low — requires 64 concurrent evictions between two lock acquisitions in what is typically a single rendering goroutine. The consequence is an extra recomputation (cache miss), not data corruption or a crash. In practice the cap of 64 is well above the number of simultaneous active edit-file calls in a typical session.
Possible fix: After computing the result, call cache.Put(toolCall.ID, c) instead of (or in addition to) writing directly to c, so the rendered data is always associated with the live LRU entry, even if the old pointer was evicted.
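The suggested fix can be sketched as follows. This is a minimal illustration, not the code from this PR: the `lru` type is a tiny stand-in for `pkg/lrucache`, the `rendered` field layout is assumed, and `storeRendered` and `lookupRendered` are hypothetical helpers. The point is only that re-`Put`ting the struct after computing the result makes the result reachable from the LRU again, even if the entry was evicted between the two lock acquisitions.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// toolRenderCache mirrors the struct name in the diff; its fields are assumed.
type toolRenderCache struct{ rendered string }

// entry is what each list element stores.
type entry struct {
	key string
	val *toolRenderCache
}

// lru is a small stand-in for pkg/lrucache, not its real implementation.
type lru struct {
	max   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding *entry
}

func newLRU(max int) *lru {
	return &lru{max: max, order: list.New(), items: map[string]*list.Element{}}
}

func (l *lru) Get(k string) (*toolRenderCache, bool) {
	el, ok := l.items[k]
	if !ok {
		return nil, false
	}
	l.order.MoveToFront(el) // Get mutates the recency list
	return el.Value.(*entry).val, true
}

func (l *lru) Put(k string, v *toolRenderCache) {
	if el, ok := l.items[k]; ok {
		el.Value.(*entry).val = v
		l.order.MoveToFront(el)
		return
	}
	l.items[k] = l.order.PushFront(&entry{k, v})
	if l.order.Len() > l.max {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).key)
	}
}

var (
	cacheMu sync.Mutex
	cache   = newLRU(2) // tiny cap so eviction is easy to trigger here
)

// getOrCreateCache returns the live entry for id, creating it on a miss.
// The lock is released before the pointer is used by the caller.
func getOrCreateCache(id string) *toolRenderCache {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	if c, ok := cache.Get(id); ok {
		return c
	}
	c := &toolRenderCache{}
	cache.Put(id, c)
	return c
}

// storeRendered (hypothetical) writes the result and re-Puts the struct, so
// the result lands in the LRU even if c was evicted in the meantime.
func storeRendered(id string, c *toolRenderCache, result string) {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	c.rendered = result
	cache.Put(id, c)
}

// lookupRendered (hypothetical) returns the cached result, or "" on a miss.
func lookupRendered(id string) string {
	cacheMu.Lock()
	defer cacheMu.Unlock()
	if c, ok := cache.Get(id); ok {
		return c.rendered
	}
	return ""
}

func main() {
	c := getOrCreateCache("a")
	getOrCreateCache("b")
	getOrCreateCache("c") // cap is 2, so "a" is evicted here
	storeRendered("a", c, "diff for a")
	fmt.Println(lookupRendered("a")) // prints "diff for a"
}
```

Without the `cache.Put` call in `storeRendered`, the final lookup would miss and the render would be recomputed, which is the low-impact behaviour the comment describes.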
Summary
Bounds three previously-unbounded in-memory caches in the TUI to prevent
memory growth (and eventual OOM) on long sessions:
- `cache` (edit_file render results, keyed by tool call ID): `pkg/tui/components/tool/editfile/render.go`
- `renderedItems` (per-message rendered output): `pkg/tui/components/messages/messages.go`
- `syntaxHighlightCache` (already capped at 128, moved to shared package): `pkg/tui/components/markdown/fast_renderer.go`

To do this cleanly, the small private `lruCache` that already lived in the markdown package is promoted to a new shared package, `pkg/lrucache`, with proper tests and docs.
Why
On long coding sessions the agent accumulates many `edit_file` tool calls and many rendered messages. The first two caches above were `map[K]V`s with no eviction, so memory grew linearly with session length × output size. This matches reports of `cagent/docker-agent` getting OOM-killed on long sessions. The new caps trade a small amount of re-render CPU on misses for hard memory bounds.
What's in `pkg/lrucache`
A small generic `LRU[K, V]` (~100 lines) backed by `container/list`:
- `New(maxSize)`: clamped to ≥ 1
- `Get` / `Put` / `Delete` / `Clear` / `Len` / `Range`

Bonus: pre-existing data race fix
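To make the race concrete, here is a minimal sketch of an LRU with the shape described above (`New` clamping to 1, `Get`/`Put`/`Len` backed by `container/list`), together with a hypothetical `guarded` wrapper that serializes access with a plain `sync.Mutex`. It is an illustration under those assumptions, not the actual `pkg/lrucache` source; `Delete`, `Clear` and `Range` are omitted for brevity.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is the payload stored in each list element.
type entry[K comparable, V any] struct {
	key K
	val V
}

// LRU is an illustrative generic cache; the real pkg/lrucache may differ.
type LRU[K comparable, V any] struct {
	maxSize int
	order   *list.List          // front = most recently used
	items   map[K]*list.Element // key -> element whose Value is *entry[K, V]
}

// New returns an empty cache; maxSize is clamped to at least 1.
func New[K comparable, V any](maxSize int) *LRU[K, V] {
	if maxSize < 1 {
		maxSize = 1
	}
	return &LRU[K, V]{maxSize: maxSize, order: list.New(), items: make(map[K]*list.Element)}
}

// Get is NOT read-only: a hit moves the entry to the front of the recency
// list, so callers need an exclusive lock rather than RWMutex.RLock.
func (c *LRU[K, V]) Get(key K) (V, bool) {
	el, ok := c.items[key]
	if !ok {
		var zero V
		return zero, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry[K, V]).val, true
}

// Put inserts or updates key and evicts the least recently used entry
// once the cache holds more than maxSize entries.
func (c *LRU[K, V]) Put(key K, val V) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry[K, V]).val = val
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry[K, V]{key: key, val: val})
	if c.order.Len() > c.maxSize {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry[K, V]).key)
	}
}

// Len reports the number of cached entries.
func (c *LRU[K, V]) Len() int { return c.order.Len() }

// guarded (hypothetical) pairs the cache with a sync.Mutex so that every
// Get and Put runs under an exclusive lock.
type guarded struct {
	mu    sync.Mutex
	cache *LRU[string, string]
}

func (g *guarded) get(key string) (string, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.cache.Get(key)
}

func (g *guarded) put(key, val string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.cache.Put(key, val)
}

func main() {
	g := &guarded{cache: New[string, string](2)}
	g.put("a", "1")
	g.put("b", "2")

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); g.get("a") }() // concurrent hits on "a"
	}
	wg.Wait()

	g.put("c", "3") // "b" is now least recently used and is evicted
	_, okA := g.get("a")
	_, okB := g.get("b")
	fmt.Println(okA, okB, g.cache.Len()) // prints "true false 2"
}
```

If `guarded.get` took a read lock on a `sync.RWMutex` instead, several goroutines could run `MoveToFront` on the same list at once, which is the kind of access `go test -race` reports.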
While running the test suite under `-race`, I uncovered an existing race in the markdown syntax-highlight cache (and the same pattern in the new `editfile` call site I'd just written): both used `sync.RWMutex.RLock()` to call `LRU.Get()`, but `Get` mutates the recency list. Multiple goroutines holding read locks would race on `MoveToFront`.

Fixed by:
- Switching both call sites from `sync.RWMutex` to `sync.Mutex`
- Documenting the gotcha in `lrucache.Get`'s doc comment

`task test` (no `-race`) doesn't catch this, so it would have slipped through CI. Now `go test -race ./...` is clean for the touched packages.

What it does NOT do
- The session message history (`session.Session.Messages`) itself is still append-only and grows with session length. That's by design: the model needs full context. A future PR could add summarization/compaction.
- Other caches (e.g. `styleSeqCache`) are naturally bounded by their key domain (fixed set of styles, languages, chroma token types) and don't need this treatment.
Validation
- `task lint`: 0 issues across 1045 files
- `task test`: passes for all touched packages
- `go test -race ./pkg/lrucache/... ./pkg/tui/...`: clean
- `pkg/lrucache`

(Pre-existing failures in `pkg/teamloader/TestLoadExamples` for `dmr.yaml` and `unload_on_switch.yaml` require Docker Model Runner running locally; they are unrelated to this change and fail identically on `origin/main`.)