Triage rollout/session friction audit findings #90

@cbusillo

Objective

Turn the 2026-05-22 read-only rollout/session friction audit into a durable recovery point. The audit looked for repeated agent mistakes, late Auto Review confusion, wasted tokens, duplicated instructions, skill misfires, memory/compaction issues, and places where the LLM lacked state it should have had.

This issue is an index and triage queue. It should keep the user from needing to hold the whole audit in working memory. Work each candidate one at a time by following the linked issue or creating a focused implementation issue when the destination is not already covered.

Finish Line

Each evidence-backed friction candidate from the audit is either linked to an existing implementation issue, split into a focused new issue, or explicitly parked as noise. No raw private rollout/session contents are copied into public issues.

Current Status

State: Active
Next action: Choose the next one-at-a-time follow-up: #46 prompt/skill injection duplication, #76 Auto Review stale/superseded result handling, or repo-management work under #85/#87/#51.
Blocked by: None
Waiting for: User choice of next fix/workstream.
Last verified: 2026-05-22 after implementing and validating the duplicate tool-output fix on local/cbusillo-overlay.

Scope

  • In: Redacted measurements from local rollout/session files, issue routing, one-at-a-time follow-up decisions, and durable links to existing work.
  • Out: Raw transcript dumps, secrets, private paths beyond broad source classes, immediate code changes, or changing skills/config without explicit approval.

Acceptance Criteria

Relationships

Parent context: #43

Existing destinations:

Validation

Read-only audit commands used local session/rollout data under:

  • Recent sessions: ~/.code/sessions/2026/05/13, ~/.code/sessions/2026/05/14, ~/.code/sessions/2026/05/22
  • Archived sessions: selected ~/.code/archived_sessions files with unusually large size or image payloads
  • Catalog: ~/.code/sessions/index/catalog.jsonl

The audit used the rollout-friction scanner where helpful, then manually discounted scanner hits that were only instruction/prose matches instead of real runtime friction.

Evidence Summary

1. Duplicate persisted function-call outputs

Candidate destination: reopen/extend #45, or create a new focused rollout persistence bug.

Evidence:

  • In three sampled sessions, identical function_call_output entries with the same call_id were persisted twice.
  • Sample 1: 14,850 outputs, 7,670 unique, 7,180 duplicate entries; duplicate output bytes were about 47.2% of recorded output bytes.
  • Sample 2: 922 outputs, 461 unique; duplicate output bytes were about 50.0%.
  • Sample 3: 118 outputs, 59 unique; duplicate output bytes were about 50.0%.
  • One inspected span showed the same tool result written twice about 200 ms apart with the same call_id.

Likely cause to investigate: rollout/event recording may persist both live event output and a later state snapshot item.
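
The duplicate counts above could be reproduced with a small read-only pass over one rollout file. This is a minimal sketch, not the audit's actual tooling: it assumes one JSON object per line, and that each record (or its "payload" object) carries "type", "call_id", and "output" fields. The function name is hypothetical.

```python
import json
from collections import Counter


def duplicate_output_stats(lines):
    """Count function_call_output records that repeat an earlier (call_id, output) pair."""
    seen = Counter()
    total = dup = total_bytes = dup_bytes = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(record, dict):
            continue
        # Assumption: fields sit at the top level or under a "payload" object.
        item = record.get("payload", record)
        if not isinstance(item, dict) or item.get("type") != "function_call_output":
            continue
        output = json.dumps(item.get("output"), sort_keys=True)
        size = len(output.encode("utf-8"))
        key = (item.get("call_id"), output)
        total += 1
        total_bytes += size
        if seen[key]:
            dup += 1
            dup_bytes += size
        seen[key] += 1
    share = dup_bytes / total_bytes if total_bytes else 0.0
    return {
        "outputs": total,
        "unique": total - dup,
        "duplicates": dup,
        "duplicate_byte_share": share,
    }
```

Passing an open file handle as lines keeps memory use flat on large rollouts, since records are read one at a time.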

2. Duplicated project-doc and skill blocks

Candidate destination: #46.

Evidence:

  • Recent session_meta.base_instructions repeatedly contained two --- project-doc --- sections and two ### Available skills blocks.
  • Base instruction payloads in the sampled sessions commonly ranged from about 20 KB to 52 KB before task-specific context.

Likely fix shape: de-duplicate prompt assembly sources and add a regression check that each project-doc and skill index block is injected once.
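
One possible shape for the regression check is to count the block markers in the assembled prompt. The sketch below assumes the two marker strings quoted in the evidence are stable and that each should appear at most once. The helper names are hypothetical, and a setup that legitimately injects several project docs would need a different rule.

```python
# Marker strings taken from the evidence above; treating them as stable is an assumption.
BLOCK_MARKERS = ("--- project-doc ---", "### Available skills")


def count_injected_blocks(base_instructions, markers=BLOCK_MARKERS):
    """Return how many times each marker occurs in the assembled instructions."""
    return {marker: base_instructions.count(marker) for marker in markers}


def assert_single_injection(base_instructions, markers=BLOCK_MARKERS):
    """Raise AssertionError if any marker appears more than once."""
    counts = count_injected_blocks(base_instructions, markers)
    repeated = {marker: n for marker, n in counts.items() if n > 1}
    if repeated:
        raise AssertionError(f"blocks injected more than once: {repeated}")
    return counts
```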

3. Auto Review loop volume and stale-current confusion

Candidate destinations: #76 and #50.

Evidence:

  • Local session catalog sample had 77 auto-review-like sessions totaling 7,707 messages.
  • 12 sessions exceeded 100 messages, 5 exceeded 500 messages, and 1 reached 1,618 messages.
  • Several large sessions had only one user message, which points to automated follow-up/resolve loops rather than interactive conversation.

Likely fix shape: snapshot-aware completion gating (#76), plus token/turn/progress guardrails (#50).
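
To make the two ideas concrete, here is a hedged sketch of how a follow-up loop might decide what to do with a review result. Every name in it (ReviewResult, snapshot_id, next_action, the default turn budget) is hypothetical and not taken from #76 or #50. It only illustrates discarding results from a superseded snapshot and stopping once a turn budget is spent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewResult:
    snapshot_id: str  # hypothetical: identifies the code state that was reviewed
    findings: int     # number of unresolved findings in this result


def next_action(result, current_snapshot_id, turns_used, max_turns=20):
    """Decide how an automated review loop should treat one result."""
    if result.snapshot_id != current_snapshot_id:
        return "discard"   # result describes a superseded snapshot
    if result.findings == 0:
        return "complete"  # current snapshot is clean
    if turns_used >= max_turns:
        return "stop"      # guardrail: budget spent, hand back to the user
    return "resolve"       # current result with findings and budget left
```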

4. Stale IDE inspection results repeated into context

Candidate destination: #50 if treated as automation guardrail friction, or a new focused validation-evidence routing issue.

Evidence:

  • A large sampled session contained stale inspection results with completion_reason: stale_results after waits around 9s to 36s.
  • The stale state included a project-changed-since-inspection reason and then appeared multiple times in later recorded context.

Likely fix shape: treat stale inspection output as compact invalid evidence and require rerun, rather than preserving bulky stale findings as reusable context.
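
As an illustration of "compact invalid evidence", a stale result could be reduced to a small marker before it is recorded. Only completion_reason: stale_results comes from the audit. The "stale_reason" field, the marker keys, and the function name are assumptions made for this sketch.

```python
def compact_inspection_result(result):
    """Replace a stale inspection result with a small marker; pass others through."""
    if result.get("completion_reason") != "stale_results":
        return result
    return {
        "completion_reason": "stale_results",
        # Hypothetical field holding why the result went stale.
        "stale_reason": result.get("stale_reason"),
        "usable_as_evidence": False,
        "next_step": "rerun inspection",
    }
```

Dropping the findings list here is deliberate: stale findings cannot be trusted, so keeping them only adds context weight.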

5. Raw image payload ballast

Candidate destination: #45 / #43 validation and regression evidence.

Evidence:

  • One archived session was about 234.7 MB and contained about 187.2 MB of base64 image payload across 208 image references.
  • Another archived session was about 157.5 MB and contained about 116.8 MB of base64 image payload across 137 image references.
  • A later, very large May 13 session still had image references but only about 1.6 MB of base64. This suggests image compaction has improved since the archived sessions were written, but the case should remain a regression fixture.

Likely fix shape: keep raw images out of reusable model history and use these sessions as validation cases.
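
For reuse as a validation case, the image share of a session file can be measured without loading the file whole. The sketch below assumes images are embedded as data:image/...;base64,... URLs, which is a guess about the rollout format. The function name is hypothetical.

```python
import re

# Assumption: images are stored inline as base64 data URLs.
DATA_URL = re.compile(r"data:image/[A-Za-z0-9.+-]+;base64,([A-Za-z0-9+/=]+)")


def image_payload_stats(lines):
    """Sum total bytes, image reference count, and base64 payload bytes."""
    total_bytes = image_refs = base64_bytes = 0
    for line in lines:
        total_bytes += len(line.encode("utf-8"))
        for match in DATA_URL.finditer(line):
            image_refs += 1
            base64_bytes += len(match.group(1))  # base64 text is ASCII
    return {
        "total_bytes": total_bytes,
        "image_refs": image_refs,
        "base64_bytes": base64_bytes,
    }
```

A regression check could then assert that base64_bytes stays under a chosen budget for reusable history.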

6. Session recall/search friction from oversized histories

Candidate destinations: #47, #45, and possibly #27.

Evidence:

  • A follow-up session investigating a destructive previous command had to search prior rollouts to reconstruct what happened.
  • Broad raw search over ~/.code was killed or abandoned because large encoded payloads made direct history search impractical.
  • The agent switched to extracting structured JSON message/tool fields to recover the relevant state.

Likely fix shape: better session indexing/summaries/citations so the LLM can retrieve prior operational facts without scanning massive raw rollout files.
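
The structured-extraction workaround can be sketched as a line-by-line filter that inspects only tool-call arguments, so large encoded payloads elsewhere in a record are never searched as text. The "function_call", "call_id", and "arguments" field names and the optional "payload" wrapper are assumptions about the rollout format, and find_tool_calls is a hypothetical helper.

```python
import json


def find_tool_calls(lines, needle, preview=200):
    """Yield (line_number, call_id, argument preview) for tool calls mentioning needle."""
    for number, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(record, dict):
            continue
        # Assumption: fields sit at the top level or under a "payload" object.
        item = record.get("payload", record)
        if not isinstance(item, dict) or item.get("type") != "function_call":
            continue
        arguments = str(item.get("arguments", ""))
        if needle in arguments:
            yield number, item.get("call_id"), arguments[:preview]
```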

7. Branch-guard preflight friction

Candidate destination: new small skill/workflow issue, or park as acceptable guard friction.

Evidence:

  • Blocked git switch appeared in 21 recent sessions.
  • The guard is useful, but agents repeatedly encounter it after issuing a blocked command instead of planning branch setup around the guard.

Likely fix shape: teach branch setup/preflight in GitHub workflow guidance or improve the guard message/action path.

8. Rollout-friction scanner false positives

Candidate destination: reopen #44 or create a scanner-quality follow-up.

Evidence:

  • The scanner flagged GraphQL/REST/rate-limit and Auto Review signals from skill docs, issue bodies, and instructions that merely mentioned those terms.
  • Manual review was needed to separate real runtime friction from self-referential prose.

Likely fix shape: classify structured runtime/tool outputs separately from instructions, issue bodies, and scanner-generated reports.
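
A minimal sketch of that separation, assuming each scanned record has been reduced to a dict with "type" and "text" keys. The set of runtime types and the function name are hypothetical and do not describe the scanner's current behavior.

```python
# Hypothetical record types counted as real runtime evidence.
RUNTIME_TYPES = frozenset({"function_call", "function_call_output"})


def split_scanner_hits(records, terms):
    """Split term matches into runtime hits and prose-only hits."""
    runtime, prose = [], []
    lowered = [term.lower() for term in terms]
    for record in records:
        text = str(record.get("text", "")).lower()
        matched = [term for term in lowered if term in text]
        if not matched:
            continue
        bucket = runtime if record.get("type") in RUNTIME_TYPES else prose
        bucket.append((record.get("type"), matched))
    return runtime, prose
```

Only the runtime list would count as friction. The prose list could still be reported separately so that nothing is silently dropped.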

Decisions

  • Keep raw rollout/session files private and local.
  • Use GitHub issues as the durable index; do not expect chat history to carry this audit.
  • Review and implement one candidate at a time.
  • Prefer existing issue destinations when the intent already overlaps; create new issues only for truly orphaned fixes.

Open Questions

Metadata

Labels

plan (Durable planning issue), plan:active (Current active plan)
