fix(harness+desktop): coalesce streaming text deltas, add detached turns, cross-surface invariant#6

Merged
OmGuptaIND merged 9 commits into main from OmGuptaIND/harness-tool-debug
Apr 22, 2026

Conversation

@OmGuptaIND
Contributor

Summary

  • Fixes 200-bubble rendering bug — codex streams ~200 per-token text events per reply; mirror was expanding each into its own SessionHistoryEntry, so users saw one word per message bubble. Coalesces on write (synthesizeHarnessTurn) and on read (readHarnessHistory) to heal legacy messages.jsonl files already on disk.
  • Adds detached/attached disconnect mode — new sessions.disconnectMode config (default attached) + Settings toggle lets a turn keep running in the background when you close the tab. Wall-clock timer (detachedTurnMaxMs, default 10m) prevents runaway turns; reconnect clears all pending timers. Spec: specs/features/DETACHED_TURNS.md.
  • Cross-surface render invariant test — desktop live stream, webhook runner (Telegram/Slack), and mirror/history each have their own event-stream → text logic. They agreed only by coincidence. New simulator-based test pins that all three surfaces produce identical text for the same event sequence (200-delta stress, 1000-delta stress, unicode surrogate-split, empty-deltas, text-after-tool_call). 23 checks.
  • Pre-existing work on the branch (title MCP tool, harness/codex-harness-session hardening, prompt-layer tweaks, deploy configs) rides along.

Test plan

  • pnpm --filter @anton/agent-core check:harness — all 77 checks green (9 mirror + 3 round-trip + 1 legacy-mirror + 23 cross-surface + pre-existing suites)
  • Typechecks clean across protocol, agent-config, agent-core, agent-server, desktop, cli
  • Known: mobile has a pre-existing RoutineStatusEvent cast issue (reproduces on branch HEAD with my changes reverted; not caused by this diff)
  • Manual verification pending: open the existing 200-bubble session in desktop and confirm the history renders as one bubble (read-side coalesce heals the legacy mirror)
  • Manual verification pending: toggle Settings → Behavior → "Keep running when I close the tab" on, start a turn, close the tab, reopen before 10 min — turn should still be running. Leave closed past 10 min — turn should auto-cancel.
  • Manual verification pending: confirm Telegram/Slack replies still work (webhook runner untouched; covered by cross-surface test)

Notes

  • No new protocol types; 'sessions' added to existing ConfigQueryMessage / ConfigUpdateMessage key union.
  • Desktop is optimistic-local + server-authoritative for the mode — setDisconnectMode(mode, {fromServer:true}) flag prevents echo loops.
  • Safeguards deferred to follow-up PRs: tool-call budget, destructive-tool ask_user gate when detached, hard Stop button on reconnect, per-session mode override.
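
To illustrate the echo-loop note above, here is a minimal sketch of an optimistic-local setter that skips the server round trip when the update came from the server. Only `setDisconnectMode` and the `fromServer` flag are named in this PR; the store factory, `getMode`, and the `sendToServer` callback are hypothetical stand-ins.

```typescript
type DisconnectMode = 'attached' | 'detached';

interface SetOptions {
  fromServer?: boolean;
}

// Hypothetical store factory; the real UIStore is not shown in this PR.
function createUiStoreSketch(sendToServer: (mode: DisconnectMode) => void) {
  let mode: DisconnectMode = 'attached';
  return {
    getMode: (): DisconnectMode => mode,
    setDisconnectMode(next: DisconnectMode, opts: SetOptions = {}): void {
      mode = next; // optimistic local update
      // Only user-initiated changes go to the server; hydration is not echoed back.
      if (!opts.fromServer) sendToServer(next);
    },
  };
}

const sentToServer: DisconnectMode[] = [];
const uiStore = createUiStoreSketch((m) => {
  sentToServer.push(m);
});

uiStore.setDisconnectMode('detached'); // user toggle: forwarded once
uiStore.setDisconnectMode('detached', { fromServer: true }); // hydration: local only
```

Without the flag, each hydration on auth_ok would trigger a config update, which would trigger another hydration.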

🤖 Generated with Claude Code

OmGuptaIND and others added 9 commits April 22, 2026 02:09
…rns, cross-surface render invariant

Three related bodies of work — the streaming-deltas fix was the user-
visible bug that triggered the rest, and the detached-turns feature
made us audit the full disconnect flow.

Streaming text delta coalesce (mirror.ts)
- CodexHarnessSession emits one `{type:'text'}` SessionEvent per
  `item/agentMessage/delta` notification (~200 per reply). The mirror
  synthesizer pushed each as a separate TextBlock, then
  readHarnessHistory expanded every block into its own
  SessionHistoryEntry. Users saw 200 single-word message bubbles
  stacked in the transcript.
- synthesizeHarnessTurn now merges consecutive text/thinking events
  into one block on the write side, and readHarnessHistory coalesces
  adjacent same-type blocks on the read side as a safety net that
  heals legacy messages.jsonl files already on disk. Tool boundaries
  still split the run correctly; `!last.isThinking && !last.toolName`
  gates the coalesce on read.
- Adds 3 mirror checks (streamed text/thinking deltas coalesce, tool-
  boundary does not coalesce across) + 1 legacy-mirror read check.
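
A rough sketch of the read-side coalesce described above, assuming a simplified block shape. `HistoryBlock` and `coalesceBlocks` are illustrative names, not the actual types in mirror.ts; only the `isThinking` / `toolName` gate is taken from the commit message.

```typescript
// Hypothetical block shape; field names follow the gate quoted in the commit message.
interface HistoryBlock {
  text: string;
  isThinking?: boolean;
  toolName?: string;
}

function coalesceBlocks(blocks: HistoryBlock[]): HistoryBlock[] {
  const out: HistoryBlock[] = [];
  for (const block of blocks) {
    const last = out.length > 0 ? out[out.length - 1] : undefined;
    if (
      last &&
      !last.toolName &&
      !block.toolName &&
      Boolean(last.isThinking) === Boolean(block.isThinking)
    ) {
      // Adjacent blocks of the same kind: extend the previous one.
      last.text += block.text;
    } else {
      // Tool blocks and kind changes start a new entry (copied so the input stays untouched).
      out.push({ ...block });
    }
  }
  return out;
}

const healed = coalesceBlocks([
  { text: 'Hel' },
  { text: 'lo' },
  { text: '', toolName: 'shell' },
  { text: 'done' },
]);
// healed holds three entries: "Hello", the tool block, and "done"
```

Because the function is idempotent, running it over an already-coalesced file leaves it unchanged, which is what makes it usable as a safety net for legacy mirrors.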

Detached/attached disconnect mode (spec + config + server + UI)
- New `sessions.disconnectMode: 'attached' | 'detached'` config field
  with `detachedTurnMaxMs` (default 10 min) budget. Default is
  attached — current behavior, no surprise cost. Detached lets the
  turn finish in the background when the tab closes; a wall-clock
  timer on the server cancels runaway turns if the client never comes
  back. Reconnect clears all pending detached budgets.
- `'sessions'` added to ConfigQueryMessage/ConfigUpdateMessage key
  union so the desktop can query + toggle the mode via the existing
  config protocol.
- server.ts ws.on('close') branches on mode: detached skips cancel,
  leaves activeTurns populated, schedules a per-session timer via
  scheduleDetachedTurnBudget. clearDetachedTurnBudget fires in the
  processMessage finally block so natural turn completion cleans up.
  Timer is .unref()'d so shutdown isn't blocked.
- Desktop Settings → Behavior → Autonomy gains "Keep running when I
  close the tab" toggle. UIStore hydrates the value from server on
  every auth_ok; setter echoes to server unless the update came from
  server hydration (fromServer flag prevents ping-pong).
- Spec: specs/features/DETACHED_TURNS.md documents the mode contract,
  safeguards still TODO (tool-call budget, destructive-tool ask_user
  gate, per-session override), and the deferred structural event-type
  split.
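
The close-handler branching and the wall-clock budget can be sketched roughly as follows. The names `scheduleDetachedTurnBudget` and `clearDetachedTurnBudget` come from the commit message; their signatures, the `detachedBudgets` map, and `onSocketClose` are assumptions made for illustration.

```typescript
type DisconnectMode = 'attached' | 'detached';
type TimerHandle = ReturnType<typeof setTimeout>;

// Hypothetical registry of pending budgets, keyed by session id.
const detachedBudgets = new Map<string, TimerHandle>();

function clearDetachedTurnBudget(sessionId: string): void {
  const timer = detachedBudgets.get(sessionId);
  if (timer !== undefined) {
    clearTimeout(timer);
    detachedBudgets.delete(sessionId);
  }
}

function scheduleDetachedTurnBudget(
  sessionId: string,
  maxMs: number,
  cancelTurn: (id: string) => void,
): void {
  clearDetachedTurnBudget(sessionId); // never stack two timers for one session
  const timer = setTimeout(() => {
    detachedBudgets.delete(sessionId);
    cancelTurn(sessionId);
  }, maxMs);
  // In Node the handle exposes unref(), so a pending budget does not block shutdown.
  (timer as unknown as { unref?: () => void }).unref?.();
  detachedBudgets.set(sessionId, timer);
}

// Assumed shape of the ws.on('close') branch: attached cancels now, detached defers.
function onSocketClose(
  mode: DisconnectMode,
  activeSessionIds: string[],
  cancelTurn: (id: string) => void,
  detachedTurnMaxMs: number = 10 * 60 * 1000,
): void {
  for (const id of activeSessionIds) {
    if (mode === 'detached') scheduleDetachedTurnBudget(id, detachedTurnMaxMs, cancelTurn);
    else cancelTurn(id);
  }
}
```

In this sketch, natural turn completion and reconnect would both call `clearDetachedTurnBudget`, so a timer only fires for turns that outlive the budget with no client attached.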

Cross-surface render invariant (check.ts)
- Desktop chatHandler.appendText, webhook agent-runner chunks.join,
  and mirror readHarnessHistory each implement their own "event
  stream → assistant text" logic. They agreed today only by
  coincidence. The 200-bubble bug was a case where mirror diverged
  from the other two; we fixed mirror but nothing stopped the next
  adapter change from causing a fresh divergence.
- New test block simulates all three surfaces against shared fixtures
  (single text, 200-delta stress, 1000-delta stress, unicode
  surrogate-pair split across deltas, empty-deltas interleaved,
  text-after-tool_call) and asserts: every surface's final assistant
  text run matches `expectedFinalText`; for no-tool fixtures, all
  three surfaces produce identical full bubble arrays. 23 checks.
- Pointer comment at each of the three surface sites so the next dev
  touching them sees the invariant.
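
As a toy version of the invariant, the sketch below runs three independent "event stream to final text run" implementations over shared fixtures and checks that they agree. The event shape and all three functions are simplified stand-ins written for this example, not the real chatHandler, agent-runner, or mirror code.

```typescript
type SessionEvent =
  | { type: 'text'; content: string }
  | { type: 'tool_call'; name: string };

const text = (content: string): SessionEvent => ({ type: 'text', content });
const tool = (name: string): SessionEvent => ({ type: 'tool_call', name });

// Desktop-style: append each delta to a running string; a tool call starts a new run.
function finalRunDesktop(events: SessionEvent[]): string {
  let run = '';
  for (const e of events) run = e.type === 'text' ? run + e.content : '';
  return run;
}

// Webhook-style: collect chunks and join once at the end.
function finalRunWebhook(events: SessionEvent[]): string {
  let chunks: string[] = [];
  for (const e of events) {
    if (e.type === 'text') chunks.push(e.content);
    else chunks = [];
  }
  return chunks.join('');
}

// Mirror-style: build blocks, coalescing adjacent text, then read the last block back.
function finalRunMirror(events: SessionEvent[]): string {
  const blocks: { kind: 'text' | 'tool'; text: string }[] = [];
  for (const e of events) {
    const last = blocks.length > 0 ? blocks[blocks.length - 1] : undefined;
    if (e.type === 'text' && last && last.kind === 'text') last.text += e.content;
    else if (e.type === 'text') blocks.push({ kind: 'text', text: e.content });
    else blocks.push({ kind: 'tool', text: '' });
  }
  const tail = blocks.pop();
  return tail && tail.kind === 'text' ? tail.text : '';
}

interface Fixture {
  name: string;
  events: SessionEvent[];
  expectedFinalText: string;
}

const words = Array.from({ length: 200 }, (_, i) => `w${i} `);
const fixtures: Fixture[] = [
  { name: '200-delta stress', events: words.map(text), expectedFinalText: words.join('') },
  { name: 'surrogate split', events: [text('\ud83d'), text('\ude00')], expectedFinalText: '\u{1F600}' },
  { name: 'empty deltas', events: [text('a'), text(''), text('b')], expectedFinalText: 'ab' },
  { name: 'text after tool_call', events: [text('before'), tool('shell'), text('after')], expectedFinalText: 'after' },
];

function surfacesAgree(f: Fixture): boolean {
  const outputs = [finalRunDesktop(f.events), finalRunWebhook(f.events), finalRunMirror(f.events)];
  return outputs.every((o) => o === f.expectedFinalText);
}
```

Comparing every surface against a shared `expectedFinalText`, rather than only against each other, also catches the case where all three drift in the same direction.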

Also included on the branch
- set-session-title MCP tool + server wiring for harness titling
- codex-harness-session + harness-session hardening (pre-existing)
- tool-registry + prompt-layers + factories tweaks (pre-existing)
- deploy ansible + huddle cloud-init updates (pre-existing)

Tests: 77 green (7 fixture + 5 prompt-layer + 4 registry + 5 snapshot
+ 15 identity + 5 memory-guidelines + 9 mirror + 3 round-trip + 1
legacy-mirror + 1 replay-seed + 23 cross-surface). Typechecks clean
across protocol/agent-config/agent-core/agent-server/desktop/cli;
mobile has a pre-existing RoutineStatusEvent cast issue that
reproduces on branch HEAD, not caused by this diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every pending ask_user swapped the composer out for the AskUserInline
card. That meant the user couldn't type anything else — the composer
UI was just gone, which is disorienting when the questions are
optional nudges rather than hard blockers.

- ChatInput.tsx now only takes over the composer for specialized
  cards (routine_create / routine_delete), where the explicit Confirm/
  Cancel buttons are the right UX. Generic ask_user leaves the
  composer visible.
- handleSend routes the user's typed text as the answer to every
  pending question when a generic ask_user is outstanding, so the
  server-side handler still unblocks cleanly. Single free-form string
  answer per question — matches what the AskUserInline "Or write your
  own answer" path already does.
- Added pendingAskUser + onAskUserSubmit to handleSend's useCallback
  deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AskUserInline rendered every unanswered question at once — a 3- or
4-question ask_user occupied the whole viewport with options and
free-text inputs stacked vertically. Now only the first unanswered
question is visible; answering it advances to the next. The progress
dots in the header already tracked this state, so they needed no change.

- Replaced `questions.map(... if (isAnswered) return null)` with a
  single `findIndex`-based render. Same submit/autosubmit flow; no
  state changes needed.
- Nothing else touched — the progress counter, done-state, and
  specialized routine cards above are unaffected.
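
The one-question-at-a-time selection can be sketched as below. `Question`, the `answers` record, and `currentQuestion` are hypothetical names; the component's real props are not shown in this PR.

```typescript
interface Question {
  id: string;
  prompt: string;
}

// Returns the first question without an answer, or undefined when all are answered.
function currentQuestion(
  questions: Question[],
  answers: Record<string, string>,
): Question | undefined {
  const idx = questions.findIndex((q) => !(q.id in answers));
  return idx === -1 ? undefined : questions[idx];
}

const askUserQuestions: Question[] = [
  { id: 'q1', prompt: 'Which environment?' },
  { id: 'q2', prompt: 'Run migrations?' },
];

const firstShown = currentQuestion(askUserQuestions, {});
const secondShown = currentQuestion(askUserQuestions, { q1: 'staging' });
const noneLeft = currentQuestion(askUserQuestions, { q1: 'staging', q2: 'yes' });
```

Since the visible question is derived from the answers alone, no extra "current index" state is needed, which is consistent with the "no state changes needed" note above.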

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConfirmDialog already uses ix--accent (the accent-card pattern from
the shared .ix-* interaction shell). AskUserInline was on ix--bordered,
which reads as a plain card rather than a prompt that needs attention.
Switching AskUserInline to ix--accent matches the pattern the design
system already establishes for agent interaction prompts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports the SessionFilesBar component from the design handoff
(anton-computer/project/session-files.jsx, topbar variant) into the
real app. Gives the task view a compact "Files" pill in the topbar
next to Usage / More options that opens a popover listing every
artifact produced in the current conversation.

- New SessionFilesBar.tsx under components/chat/. Subscribes to
  artifactStore, filters by the active conversation's sessionId so
  artifacts from other sessions don't leak in. Clicking a row calls
  setActiveArtifact + setArtifactPanelOpen — same path the existing
  ActionsGroup uses.
- Component renders nothing when there are no artifacts, so the pill
  only shows up once there's something to list.
- Thumbnails port the design's SessionFileThumb: SVG via innerHTML,
  HTML via sandboxed iframe scaled to 0.25, code via first-6-lines
  <pre>, docs via first heading + placeholder lines.
- Extension labels map renderType + language to TS/TSX/MD/SVG/HTML
  etc. fmtAgoShort + artifactTitle helpers keep the meta row tidy.
- CSS appended to index.css. Tokens used (--bg-elev-1, --border,
  --border-strong, --text, --text-2/3/4, --accent, --accent-dim)
  already exist in Anton's theme, so no new variables needed.
- Injected into App.tsx's workspace-topbar__actions between the
  existing chat-only gate (activeView === 'chat' && hasMessages)
  and Usage/MoreHorizontal buttons. Placement matches the screenshot
  in the handoff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anton's model can already call the `publish` tool to push content to a
public URL. Previously there was no user-in-the-loop — the model's call
published immediately, which is the wrong trust posture for a tool
that produces a public artifact URL. This wires the tool through a
new specialized PublishConfirmCard so the user always confirms and
picks the final slug before anything goes live.

Server / tool
- buildPublishTool(deps) now takes { domain, askUser } instead of a
  bare domain string. When askUser is wired (all desktop contexts),
  execute() first fires an ask_user prompt with metadata.type =
  'publish_confirm' carrying { title, contentType, language,
  suggestedSlug, domain }. The card's answer is the final slug
  string; empty string means cancel.
- Description updated: model should call publish with a suggested
  slug; do NOT pre-ask via ask_user — this tool has its own gate.
- Fallback path preserved for non-desktop callers (evals/runner.ts
  has no human, no ask_user handler): publishes directly, same as
  before. Keeps eval scripts working.
- Slug validation: VALID_SLUG.test on the user's answer, fall back
  to the suggestedSlug if they type garbage. executePublish itself
  also throws on bad slugs, so double-protected.
- factories.ts passes ctx.onAskUser through — same pattern as the
  existing routine-factory approval gate.
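
The slug decision described above reduces to a small pure function. This is a sketch under assumptions: the anchored regex is inferred from the `[a-zA-Z0-9_-]+` rule stated for the card, and `decideSlug` / `SlugDecision` are illustrative names rather than the tool's actual internals.

```typescript
// Assumed definition; the PR only states that valid slugs match [a-zA-Z0-9_-]+.
const VALID_SLUG = /^[a-zA-Z0-9_-]+$/;

type SlugDecision = { action: 'publish'; slug: string } | { action: 'cancel' };

// `answer` is the card's reply; undefined models callers with no askUser wired (e.g. evals).
function decideSlug(answer: string | undefined, suggestedSlug: string): SlugDecision {
  if (answer === undefined) return { action: 'publish', slug: suggestedSlug };
  if (answer === '') return { action: 'cancel' }; // empty string means the user cancelled
  // Invalid input falls back to the model's suggestion instead of failing the call.
  return { action: 'publish', slug: VALID_SLUG.test(answer) ? answer : suggestedSlug };
}

const confirmed = decideSlug('my-report_v2', 'report');
const fallback = decideSlug('has spaces!', 'report');
const cancelled = decideSlug('', 'report');
const noHuman = decideSlug(undefined, 'report');
```

In the real tool the answer would arrive asynchronously from the ask_user prompt; keeping the decision itself synchronous makes the cancel and fallback branches easy to check in isolation.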

Desktop
- AskUserInline.tsx: new PublishConfirmMeta + isPublishConfirm
  detector + PublishConfirmCard. Reuses .routine-confirm styles so
  no new CSS needed; shows Title / Type / Domain / editable Slug /
  live public URL preview. Confirm button is disabled when the slug
  isn't a valid [a-zA-Z0-9_-]+ string.
- Card submits the slug on confirm, empty string on cancel, matching
  what the publish tool's execute() expects.
- ChatInput.tsx: publish_confirm added to the specialized-card
  allowlist so the composer yields to it (same behavior as
  routine_create / routine_delete). Generic ask_user still leaves
  the composer visible.

Existing "Publish" button in the artifact panel is untouched — it
goes through `publish_artifact` on a different server channel and
was already user-initiated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SessionFilesBar: filter fallback `return true` leaked orphan artifacts
(anything with no conversationId) into every session's Files popover,
not just the active one. Tightened to strict match: artifacts with no
conversationId, or with a conversationId that doesn't match the
active session/conversation, are now excluded.
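
A minimal sketch of the tightened filter, assuming artifacts carry an optional `conversationId`. `Artifact` and `artifactsForActiveSession` are illustrative names, not the real artifactStore types.

```typescript
interface Artifact {
  id: string;
  conversationId?: string;
}

// Strict match: an artifact is listed only if its conversationId is one of the active ids.
// Orphans (no conversationId) are excluded instead of falling through to `true`.
function artifactsForActiveSession(all: Artifact[], activeIds: string[]): Artifact[] {
  return all.filter(
    (a) => a.conversationId !== undefined && activeIds.includes(a.conversationId),
  );
}

const visibleArtifacts = artifactsForActiveSession(
  [
    { id: 'a1', conversationId: 'sess-1' },
    { id: 'a2', conversationId: 'sess-2' },
    { id: 'a3' }, // orphan
  ],
  ['sess-1'],
);
// visibleArtifacts contains only a1
```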

ChatInput auto-answer: when a multi-question generic ask_user was
pending, typing into the composer and sending would populate every
question's answer with the same string — the model would see
`{ "Q1": "foo", "Q2": "foo", "Q3": "foo" }`. Restricted the auto-
answer shortcut to single-question ask_user only. Multi-question
prompts fall through to the normal send path so the AskUserInline
card (rendered elsewhere in the transcript) can collect each
answer separately.

Both caught on review pass over PR #6; no runtime reports. Typecheck
+ harness tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My earlier "stop ask_user from hiding the composer" commit removed
the composer-replacement for generic ask_user but never added an
inline render anywhere else, so a generic ask_user would set
pendingAskUser on the client, show the "Ask User" tool label in the
transcript, and then render nothing at all. The user couldn't see or
answer the question.

RoutineChat now renders AskUserInline in its own
.chat-shell__ask-user block between MessageList and the composer,
gated on !isSpecializedCard (routine_create / routine_delete /
publish_confirm still take over the composer via ChatInput —
rendering there would double-show the card).

The inline block reuses the transcript's 760px max-width so the card
aligns with the message column. No changes to the specialized-card
flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the "keep composer visible while ask_user is pending"
attempt from earlier in this PR. The new behavior users asked for
is the original behavior: while an ask_user is pending, the card
replaces the composer entirely so the user can't type until they
answer. Matches Claude Code's interactive prompts.

Undoes:
- RoutineChat inline chat-shell__ask-user render + import
- ChatInput's isSpecializedCard gate on the composer replacement —
  every ask_user now takes over the composer again
- ChatInput.handleSend's auto-answer-on-send branch — unreachable
  now that composer is always replaced while pending
- .chat-shell__ask-user CSS rule

Kept from the earlier passes:
- AskUserInline one-question-at-a-time rendering (still the right
  UX improvement)
- ix--accent variant (style upgrade)
- Specialized routine/publish detection inside AskUserInline
  (those still show their dedicated cards — no change)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@OmGuptaIND OmGuptaIND merged commit 33634c0 into main Apr 22, 2026
OmGuptaIND added a commit that referenced this pull request Apr 22, 2026
### Other
- fix(harness+desktop): coalesce streaming text deltas, add detached turns, cross-surface invariant (#6)
- feat(settings): mark Claude CLI as coming soon, simplify provider form (#5)
- fix(harness+server): MCP shim hardening + SessionRegistry (#4)
- fix(caddy): preserve /health and /status paths upstream to sidecar (#3)