
Inconsistent cache reuse and long delays despite stable request snapshots #20565

@rothnic

Description


I investigated a long-running session tree with severe intermittent delays and inconsistent prompt-cache reuse. The core issue is that delayed turns repeatedly occurred in both the primary session and delegated secondary sessions, but the strongest evidence does not support a simple explanation of "the system prompt changed on every slow turn."

The most concerning observation is that multiple delayed turns shared the same stored request-state snapshot hash while cache.read flipped between cache hit and cache miss. That suggests one of the following (the third possibility is illustrated in the sketch after this list):

  • cache reuse is unstable for effectively identical request state,
  • token/cache accounting is inconsistent,
  • or request construction differs in a way not reflected by the stored snapshot hash.
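To make that third possibility concrete, here is a minimal sketch. All names and the hashing scheme are my assumptions, not OpenCode's actual snapshot implementation: if the stored snapshot hash covers only a subset of the request, two requests can share a hash while the bytes a provider-side prompt cache keys on still differ.

```ts
import { createHash } from "node:crypto";

interface Request {
  system: string;
  messages: { role: string; content: string }[];
  tools: { name: string; description: string }[]; // e.g. ordering may vary per turn
}

// Assumed snapshot: hashes system prompt + messages, but NOT tool ordering.
function snapshotHash(req: Request): string {
  return createHash("sha256")
    .update(JSON.stringify({ system: req.system, messages: req.messages }))
    .digest("hex");
}

// What a provider-side prompt cache would actually key on: the full body.
function providerCacheKey(req: Request): string {
  return createHash("sha256").update(JSON.stringify(req)).digest("hex");
}

const base = {
  system: "You are a coding agent.",
  messages: [{ role: "user", content: "refactor module X" }],
};
const a: Request = {
  ...base,
  tools: [{ name: "bash", description: "run" }, { name: "edit", description: "edit" }],
};
const b: Request = {
  ...base,
  tools: [{ name: "edit", description: "edit" }, { name: "bash", description: "run" }],
};

console.log(snapshotHash(a) === snapshotHash(b));         // true  -> "same snapshot"
console.log(providerCacheKey(a) === providerCacheKey(b)); // false -> provider cache miss
```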

I also observed:

  • repeated delayed primary turns ranging from ~936s to ~2294s,
  • some delayed turns ending with MessageAbortedError,
  • negative input-token values on some cached turns (a possible arithmetic origin is sketched after this list),
  • sparse local logs that did not explain the slowdown window.
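On the negative input-token values: one way they could arise, purely as an assumption about the accounting and not a claim about OpenCode's actual code, is input being derived by subtracting cache-read tokens from a total that the subtraction can overshoot. The numbers below are taken from the 17:28:36 turn in example cluster 2.

```ts
// Assumption, for illustration only: if "input tokens" is derived as
// total prompt tokens minus cache-read tokens, an over-reported cache-read
// count drives the result negative.
const totalPromptTokens = 57_532; // hypothetical provider-reported total
const cacheReadTokens = 57_664;   // cache.read observed on the 17:28:36 turn

const derivedInput = totalPromptTokens - cacheReadTokens;
console.log(derivedInput); // -132, matching the negative value in the table below
```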

This feels related to the broader class of cache/accounting issues already discussed in issues like cache misses / low cache hit rate, dynamic system prompt invalidation, and negative token accounting, but this report adds one specific piece of evidence: same snapshot, mixed cache behavior during the slowdown period.

High-level aggregates

| Class     | Assistant turns | Avg duration | Max duration |
|-----------|-----------------|--------------|--------------|
| Primary   | 1221            | 32.9s        | 2293.7s      |
| Secondary | 209             | 49.4s        | 1515.9s      |

Error signals

| Signal                                       | Count observed | Notes                                                       |
|----------------------------------------------|----------------|-------------------------------------------------------------|
| MessageAbortedError in analyzed root session | 4              | At least one very long delayed primary turn ended this way  |
| Negative input-token values                  | observed       | Makes some token accounting unreliable                      |
| Sparse local logs                            | yes            | Available local logs did not explain the slowdown window    |

Stable snapshot / mixed cache evidence

Example cluster 1 - same snapshot across consecutive delayed primary turns:

| Time     | Snapshot stable? | Input tokens | Cache read | Result     |
|----------|------------------|--------------|------------|------------|
| 17:41:09 | yes              | 81,851       | 0          | tool-calls |
| 17:41:17 | yes              | 82,132       | 0          | tool-calls |
| 17:41:27 | yes              | 82,859       | 0          | tool-calls |
| 17:41:40 | yes              | 196          | 83,392     | tool-calls |
| 17:41:53 | yes              | -423         | 84,160     | tool-calls |
| 17:42:09 | yes              | 84,358       | 0          | tool-calls |

Example cluster 2 - same snapshot spanning secondary and primary turns:

| Time     | Class     | Snapshot stable? | Input tokens | Cache read | Result     |
|----------|-----------|------------------|--------------|------------|------------|
| 17:28:36 | Secondary | yes              | -132         | 57,664     | tool-calls |
| 17:29:56 | Secondary | yes              | 68,461       | 0          | tool-calls |
| 17:30:46 | Secondary | yes              | 69,245       | 0          | stop       |
| 17:30:46 | Primary   | yes              | 70,188       | 0          | tool-calls |
| 17:31:31 | Primary   | yes              | 72,065       | 0          | tool-calls |
| 17:32:26 | Secondary | yes              | 22,968       | 0          | tool-calls |
| 17:33:36 | Secondary | yes              | 11,868       | 23,360     | tool-calls |
| 17:33:50 | Secondary | yes              | 38,163       | 0          | tool-calls |

Delay pattern summary

| Observation                                              | Seen? | Notes                                                       |
|----------------------------------------------------------|-------|-------------------------------------------------------------|
| Repeated delayed primary turns over 15 minutes           | yes   | Multiple turns ranged from ~936s to ~2294s                  |
| Repeated delayed secondary turns                         | yes   | Some appear to be genuinely doing work, not necessarily hung |
| Delays only after compaction                             | no    | Slow turns appeared both before and after compaction        |
| Same request-state snapshot across multiple delayed turns | yes  | Important: slow clusters shared stable snapshots            |
| Cache hits and misses within same snapshot cluster       | yes   | Strong evidence against "prompt changed every turn"         |

What changed vs what stayed stable

Stayed stable during some misses:

  • stored request-state snapshot hash
  • overall result type (tool-calls repeatedly)
  • broad turn structure within the same cluster

Changed during some misses:

  • cache.read accounting (0 vs positive)
  • input-token accounting, including negative values
  • session branch (Primary vs Secondary) in some shared-snapshot clusters
  • occasional terminal outcome (stop vs tool-calls)

Why I think this is an OpenCode/core issue worth investigating

  • I could not confirm a dynamic system-prompt mutation on every slow turn.
  • The stored snapshot staying the same while cache behavior flips is the clearest anomaly.
  • The negative token values suggest accounting/reporting bugs may be obscuring the real failure mode.
  • Some primary turns ended in MessageAbortedError, which suggests provider/client abort handling may also be involved.

It would be very helpful to know (a diagnostic sketch follows this list):

  1. what exactly contributes to the stored snapshot hash,
  2. whether same-snapshot requests can still differ in provider-facing cache-relevant bytes,
  3. and whether cache.read = 0 in these cases means true cache miss, unreported cache state, or another accounting artifact.
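To help distinguish the possibilities in point 3, something like the following diagnostic could be added wherever the provider request is finally serialized. The hook shape and field names here are hypothetical, not an existing OpenCode surface: the idea is to log a hash of the exact provider-facing bytes next to the stored snapshot hash, so same-snapshot turns with differing bodies become visible and cache.read = 0 can be correlated with real byte changes.

```ts
import { createHash } from "node:crypto";

function bodyHash(body: unknown): string {
  return createHash("sha256").update(JSON.stringify(body)).digest("hex");
}

// Imagined call site, wherever the provider request body is finalized:
function logTurnDiagnostics(turn: {
  snapshotHash: string;  // the stored request-state snapshot hash
  providerBody: unknown; // the exact body sent to the provider
  cacheRead: number;     // reported cache.read tokens
  inputTokens: number;   // reported input tokens
}): void {
  console.log(
    JSON.stringify({
      snapshot: turn.snapshotHash,
      body: bodyHash(turn.providerBody),
      cacheRead: turn.cacheRead,
      input: turn.inputTokens,
    })
  );
}
```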

Plugins

Weave orchestration plugin, a local bash-wrapper plugin, and a command-safety plugin.

OpenCode version

0.0.0-master-202603312117

Steps to reproduce

  1. Run a long multi-turn session with a primary conversation that delegates into multiple secondary sessions.
  2. Let the session accumulate repeated tool-calls turns over time.
  3. Inspect stored session/message/part data for delayed turns.
  4. Group turns by stored request-state snapshot hash (see the analysis sketch after these steps).
  5. Observe that some repeated same-snapshot clusters alternate between cache.read > 0 and cache.read = 0, even while delays continue.
  6. In some cases, also observe negative input-token values or MessageAbortedError on delayed turns.
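Steps 4-5 can be scripted along these lines. The file name, field names, and layout below are hypothetical stand-ins for however session/message/part data is actually stored: group assistant turns by snapshot hash and flag clusters that mix cache.read > 0 and cache.read = 0.

```ts
import { readFileSync } from "node:fs";

interface Turn {
  time: string;
  snapshotHash: string;
  inputTokens: number;
  cacheRead: number;
}

// Assumed export of per-turn data to a flat JSON array.
const turns: Turn[] = JSON.parse(readFileSync("turns.json", "utf8"));

// Group turns by their stored request-state snapshot hash.
const bySnapshot = new Map<string, Turn[]>();
for (const t of turns) {
  const cluster = bySnapshot.get(t.snapshotHash) ?? [];
  cluster.push(t);
  bySnapshot.set(t.snapshotHash, cluster);
}

// Flag clusters with mixed cache behavior for identical snapshots.
for (const [hash, cluster] of bySnapshot) {
  const hits = cluster.filter((t) => t.cacheRead > 0).length;
  const misses = cluster.length - hits;
  if (hits > 0 && misses > 0) {
    console.log(`mixed cache behavior for snapshot ${hash}: ${hits} hits, ${misses} misses`);
    for (const t of cluster) {
      console.log(`  ${t.time} input=${t.inputTokens} cache.read=${t.cacheRead}`);
    }
  }
}
```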

Screenshot and/or share link

No share link included. I can provide more anonymized structured evidence if needed.

Operating System

macOS 26.2

Terminal

OpenCode CLI

Metadata

Labels

core: Anything pertaining to core functionality of the application (opencode server stuff)
Issue actions