Skip to content

fix(tui): bound previously-unbounded caches to prevent OOM on long sessions#2831

Merged
dgageot merged 1 commit into
docker:mainfrom
dgageot:board/b821136e5f51759f
May 20, 2026
Merged

fix(tui): bound previously-unbounded caches to prevent OOM on long sessions#2831
dgageot merged 1 commit into
docker:mainfrom
dgageot:board/b821136e5f51759f

Conversation

@dgageot
Copy link
Copy Markdown
Member

@dgageot dgageot commented May 20, 2026

Summary

Bounds three previously-unbounded in-memory caches in the TUI to prevent
memory growth (and eventual OOM) on long sessions:

# Cache Where Cap
1 cache (edit_file render results, keyed by tool call ID) pkg/tui/components/tool/editfile/render.go 64 entries
2 renderedItems (per-message rendered output) pkg/tui/components/messages/messages.go 500 entries
3 syntaxHighlightCache (already capped at 128 — moved to shared package) pkg/tui/components/markdown/fast_renderer.go 128 entries

To do this cleanly, the small private lruCache that already lived in the
markdown package is promoted to a new shared package, pkg/lrucache,
with proper tests and docs.

Why

On long coding sessions the agent accumulates many edit_file tool calls
and many rendered messages. The two caches above were map[K]Vs with no
eviction, so memory grew linearly with session length × output size. This
matches reports of cagent/docker-agent getting OOM-killed on long
sessions. The new caps trade a small amount of re-render CPU on misses for
hard memory bounds.

What's in pkg/lrucache

A small generic LRU[K, V] (~100 lines) backed by container/list:

  • New(maxSize) — clamped to ≥ 1
  • Get / Put / Delete / Clear / Len / Range
  • 100% test coverage (testify)
  • Documented as not thread-safe — callers provide synchronization

Bonus: pre-existing data race fix

While running the test suite under -race, I uncovered an existing race in
the markdown syntax-highlight cache (and the same pattern in the new
editfile call site I'd just written): both used sync.RWMutex.RLock() to
call LRU.Get(), but Get mutates the recency list. Multiple goroutines
holding read locks would race on MoveToFront.

Fixed by:

  • Switching both call sites from sync.RWMutex to sync.Mutex
  • Documenting the gotcha on lrucache.Get's doc comment

task test (no -race) doesn't catch this, so it would have slipped
through CI. Now go test -race ./... is clean for the touched packages.

What it does NOT do

  • The conversation history (session.Session.Messages) itself is still
    append-only and grows with session length. That's by design — the model
    needs full context. A future PR could add summarization/compaction.
  • A few other map-backed caches in the TUI (e.g. styleSeqCache) are
    naturally bounded by their key domain (fixed set of styles, languages,
    chroma token types) and don't need this treatment.

Validation

  • task lint — 0 issues across 1045 files
  • task test — passes for all touched packages
  • go test -race ./pkg/lrucache/... ./pkg/tui/... — clean
  • ✅ 100% test coverage on pkg/lrucache

(Pre-existing failures in pkg/teamloader/TestLoadExamples for dmr.yaml
and unload_on_switch.yaml require Docker Model Runner running locally —
unrelated to this change, fail identically on origin/main.)

creates pkg/lrucache with a small generic LRU cache to prevent memory growth
in long-running sessions. replaces the private lruCache in markdown with
the shared package, and bounds the previously-unbounded render cache
(64 entries) and renderedItems cache (500 entries) in editfile and messages
components.

also fixes a pre-existing data race in the syntax highlight and editfile
caches: LRU.Get mutates the recency list, so call sites must hold a write
lock, not a read lock. switches both call sites from sync.RWMutex to
sync.Mutex and documents the gotcha on lrucache.Get.
@dgageot dgageot requested a review from a team as a code owner May 20, 2026 09:17
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

@@ -71,37 +81,29 @@ type linePair struct {
}

func getOrCreateCache(toolCallID string) *toolRenderCache {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Rendered result written to orphaned struct after LRU eviction

getOrCreateCache releases cacheMu before returning its pointer. If the LRU later evicts that entry (because 64 other tool-call IDs are Put between the two lock acquisitions in renderEditFile/countDiffLines), a subsequent getOrCreateCache for the same ID allocates a fresh *toolRenderCache. The original goroutine still holds the old pointer and writes c.rendered = result to the now-orphaned struct — the result never lands in the LRU and is lost on the next lookup.

Impact: Low — requires 64 concurrent evictions between two lock acquisitions in what is typically a single rendering goroutine. The consequence is an extra recomputation (cache miss), not data corruption or a crash. In practice the cap of 64 is well above the number of simultaneous active edit-file calls in a typical session.

Possible fix: After computing the result, call cache.Put(toolCall.ID, c) instead of (or in addition to) writing directly to c, so the rendered data is always associated with the live LRU entry, even if the old pointer was evicted.

@dgageot dgageot merged commit 16d940f into docker:main May 20, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants