Skip to content

LLM-1285: Web fetch tool#1

Merged
pmateusz merged 12 commits into
masterfrom
mateusz.polnik-llm-1284-web-fetch-tool
Apr 9, 2026
Merged

LLM-1285: Web fetch tool#1
pmateusz merged 12 commits into
masterfrom
mateusz.polnik-llm-1284-web-fetch-tool

Conversation

@pmateusz
Copy link
Copy Markdown
Contributor

@pmateusz pmateusz commented Apr 8, 2026

Summary

Adds a web_fetch tool to kimchi-code as an in-repo extension (extensions/web-fetch/). The tool fetches web pages — including JavaScript-rendered SPAs — and converts HTML to clean markdown for the coding agent to consume.

What it does

  • Fetches any URL using Playwright (headless Chromium) as the primary fetcher, with automatic fallback to native fetch() when Playwright/Chromium is not installed
  • Converts HTML to markdown (default), plain text, or raw HTML via Turndown
  • Browser pool — single Chromium instance with lazy init, tab reuse (~100ms vs 1-2s cold start), idle timeout (60s), and crash recovery
  • Session-scoped cache with 15-min TTL and LRU eviction (100 entries) — enables drill-down workflows without re-fetching
  • SSRF protection — blocks private IPs (10.x, 172.16-31.x, 192.168.x, 127.x, 169.254.x, ::1, fc00::/7), localhost, and cloud metadata endpoints
  • Response size limit (5MB), output truncation, configurable timeout (default 30s, max 120s)
  • Strips script/style/meta tags before conversion
  • Follows redirects automatically, reports final URL in metadata

Tool parameters

Parameter Type Default Description
url string (required) URL to fetch (http:// or https://)
format markdown | text | html markdown Output format
timeout number 30 Timeout in seconds (max 120)

@pmateusz pmateusz marked this pull request as draft April 8, 2026 20:23
pmateusz added 8 commits April 9, 2026 09:50
…b-fetch tool

Implement Phase 4 of the web-fetch extension: Playwright as primary fetcher
with native fetch fallback when unavailable. Browser pool manages a single
Chromium instance with lazy init, reuse, idle timeout, and crash recovery.
Class-based BrowserPool design with constructor injection for testability.
…ed cache

Cache output was built using string.replace("Cache: miss", "Cache: hit") on the
full output including page body — if the page contained the literal string
"Cache: miss", the cached version was corrupted. Now builds cache output by
swapping the metadata array element before joining.

Also fixes unsafe `err as FetchError` cast with proper instanceof narrowing,
extracts EMPTY_DETAILS constant, and adds session-scoped cache with TTL,
cache integration tests, and session_shutdown cleanup.
Only set playwrightAvailable=false when the error indicates Chromium is
genuinely not installed (e.g. "Executable doesn't exist"). Transient
errors like ENOMEM or file locks now allow retry on the next getBrowser()
call instead of permanently disabling Playwright for the session.
Wrap domino.createDocument in try/catch so malformed HTML returns a
fallback string instead of crashing. Classify Playwright TimeoutError
by error name rather than relying solely on message substring matching.
…tests into verify

Converts the cache module from module-level state with free functions to a
Cache class with constructor-injected options (ttlMs, maxEntries), matching the
BrowserPool pattern. Adds LRU eviction at 100 entries, vitest to the extension,
and a test:extensions script in the root verify pipeline.
- Clamp timeout parameter to [0, 120] to prevent negative values
  reaching Playwright/native fetch
- Call shutdownBrowserPool() on session_shutdown alongside cache clear
- Add comment clarifying truncated responses are cached intentionally
@wiz-3b6633f35c
Copy link
Copy Markdown

wiz-3b6633f35c Bot commented Apr 9, 2026

Wiz Scan Summary

Scanner Findings
Vulnerability Finding Vulnerabilities -
Data Finding Sensitive Data -
Secret Finding Secrets -
IaC Misconfiguration IaC Misconfigurations -
SAST Finding SAST Findings 1 Medium 2 Low
Software Management Finding Software Management Findings -
Total 1 Medium 2 Low

View scan details in Wiz

To detect these findings earlier in the dev lifecycle, try using Wiz Code VS Code Extension.

Comment thread extensions/web-fetch/integration.test.ts
Comment thread extensions/web-fetch/integration.test.ts
@pmateusz pmateusz marked this pull request as ready for review April 9, 2026 10:36

let response: Response;
try {
response = await fetch(url, {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium SAST Finding

Server-Side Request Forgery (SSRF) via User-Controlled URLs in Node.js HTTP Client Libraries (CWE-918)
Analyzed by AI as Inconclusive

More Details

This rule detects instances where user-controlled URLs are passed directly to HTTP client libraries, leading to a Server-Side Request Forgery (SSRF) vulnerability. SSRF allows an attacker to abuse server functionality to make requests to internal systems or services that are not directly exposed to the internet. This can result in unauthorized access to sensitive data, internal network reconnaissance, and potential pivoting into the internal network.

Attribute Value
Impact Medium
Likelihood Medium

AI Analysis

The code snippet shows a fetch() call with a url parameter, but there's insufficient context to determine if this URL comes from user input or is properly validated. The snippet doesn't show the source of the url variable or any validation logic. Without seeing how url is populated or if there are security controls elsewhere in the function, it's impossible to definitively classify this as exploitable or safe.

Remediation

Server-Side Request Forgery (SSRF) is a vulnerability that allows an attacker to force the server to make unintended requests to other systems or internal resources. This can lead to unauthorized access, data theft, and potentially further compromise of the internal network.

To mitigate SSRF, user input should never be directly passed to HTTP client libraries. Instead, implement strict input validation and whitelisting to ensure that only trusted URLs are allowed. This can be done by maintaining a list of approved domains or URLs, and rejecting any user input that does not match this whitelist.

Code examples:

// VULNERABLE CODE - User input is directly passed to the HTTP client library
const axios = require('axios');
const userUrl = req.query.url;

axios.get(userUrl)
.then(response => {
// Handle response
})
.catch(error => {
// Handle error
});

// SECURE CODE - User input is validated against a whitelist before making the request
const axios = require('axios');
const userUrl = req.query.url;
const allowedUrls = ['https://example.com', 'https://example.org'];

if (allowedUrls.includes(userUrl)) {
axios.get(userUrl)
.then(response => {
// Handle response
})
.catch(error => {
// Handle error
});
} else {
// Reject the request with an appropriate error message
}

Additional recommendations:

  • Follow the principle of least privilege and only grant the minimum required permissions to the server.
  • Implement strict input validation and sanitization for all user input, not just URLs.
  • Keep software up-to-date and apply security patches promptly.
  • Adhere to the OWASP Server-Side Request Forgery Prevention Cheat Sheet and other relevant security standards.
  • Consider using a secure web gateway or web application firewall (WAF) to provide an additional layer of protection.

Rule ID: WS-I007-JAVASCRIPT-00019


To ignore this finding as an exception, reply to this conversation with #wiz_ignore reason

If you'd like to ignore this finding in all future scans, add an exception in the .wiz file (learn more) or create an Ignore Rule (learn more).


To get more details on how to remediate this issue using AI, reply to this conversation with #wiz remediate

@pmateusz pmateusz merged commit 440431c into master Apr 9, 2026
5 checks passed
@pmateusz pmateusz deleted the mateusz.polnik-llm-1284-web-fetch-tool branch April 22, 2026 15:09
zilvinasu added a commit that referenced this pull request May 5, 2026
…rompt

When a user pasted an image and submitted the prompt, the user-message
component rendered only the typed text — the image silently rode along to
the model with no visible reference, and there was no way for the user to
say "look at the second image" in their prompt.

Match Claude's UX: prepend a `[Image #N]` marker per attached image to the
text we forward to the model, with N being a session-wide running counter.
A turn that pastes one image and types "is this a dog?" now becomes
`[Image #1] is this a dog?` in both the rendered transcript and the model's
input. Two-image turns become `[Image #N] [Image #M] <prompt>`. Empty
prompts with images become just the markers, so the user message renders
at all (UserMessageComponent is skipped when text is empty).

The counter:
- starts at 1 per session;
- resets on `session_start`;
- runs continuously across all turns and across all attached image sources
  (paste queue + any `event.images` an upstream extension may have
  populated), so reference numbers stay stable for the entire conversation.

Verified against 8 input scenarios (single paste + text, multi-paste,
empty text, whitespace-only text, pre-attached images, mixed sources,
no-images bail-out, counter persistence across turns) — all pass. Existing
749-test suite still green.

Co-Authored-By: Kimchi <noreply@kimchi.dev>
zilvinasu added a commit that referenced this pull request May 6, 2026
…bytes, not system icon (#148)

## Summary

Two related bugs in the paste-image flow, fixed together:

1. **Wrong bytes attached to the model.** When a user copied an image
file in Finder (`Cmd+C` on the file) and pasted into the Kimchi prompt,
the harness either bailed with *"No image found on clipboard"* or
attached the **macOS generic-PNG-document icon** (a 1024×1024 flat
document glyph) instead of the file's actual contents. Vision models
correctly described what they got — *"a generic PNG file icon"* — but
the user's intent was lost.
2. **Pasted images had no textual reference in the prompt.** The
user-message component renders only the typed text, so a turn with one
paste and a caption rendered as just *"is this a dog?"* with the image
riding along invisibly. There was no way for the user to write *"compare
image 2 to image 3"*.

This PR fixes both: the harness now reads the file's real bytes from
disk when a Finder file URL is on the pasteboard, and it prepends
Claude-style `[Image #N]` markers to the submitted text so each attached
image has a stable reference.

## Bug 1 — root cause and fix

A plain Finder `Cmd+C` of a file puts only `public.file-url` on the
macOS pasteboard. The native clipboard addon
(`@mariozechner/clipboard-darwin-*`, a wrapper around `clipboard-rs`)
only consults `public.png` / `Apple PNG pasteboard type` for
`hasImage()`, so:

- Clean Finder file copy → `hasImage()` is false → handler bailed with
*"No image found"*.
- When some intermediary (Get Info → click icon → `Cmd+C`, or a
clipboard manager) had re-rendered the file URL into a 1024×1024 NSImage
system-icon rep → `hasImage()` returned true and `getImageBinary()`
cheerfully handed back that icon as if it were user content.

In `readClipboardImage()`:

1. Read `clipboard.availableFormats()` first (synchronous, cheap).
2. If `public.file-url` is present **and** we're on macOS, resolve the
path via `osascript -e 'POSIX path of (the clipboard as «class furl»)'`.
3. Validate the path: image-typed extension
(`.png`/`.jpg`/`.jpeg`/`.gif`/`.webp`), regular file, ≤ 50 MB, readable.
4. Return the file's bytes with the correct mime type — skipping the
addon's image rep entirely.
5. Fall through to the addon for screenshots, image-data-only paste, and
non-macOS platforms (unchanged behavior).

The `availableFormats()` gate is essential: AppleScript will coerce an
arbitrary text clipboard (`"hello"`) into a fake `/hello` path, so we
only invoke it when a real file URL is advertised.

## Bug 2 — `[Image #N]` markers

In the `input` event handler, count incoming + pending images and
prepend `[Image #N]` markers to `event.text` (one marker per image,
separated by spaces). The counter:

- starts at 1 per session;
- resets on `session_start`;
- runs continuously across all turns and across all attached image
sources (paste queue + any `event.images` an upstream extension may have
populated).

A turn that pastes one image and types *"is this a dog?"* now becomes
`[Image #1] is this a dog?` in both the rendered transcript and the
model's input. Two-image turns become `[Image #N] [Image #M] <prompt>`.
Empty prompts with images become just the markers (so the user message
renders at all — `UserMessageComponent` is otherwise skipped on empty
text).

## Verification

**Bug 1 — pasteboard scenarios** (5/5):

| scenario | result |
|---|---|
| File-URL only (the bug) | reads `dog copy.png` from disk: **752 670
bytes**, `image/png` ✅ |
| Real PNG bytes (screenshot-like) | addon path: 665 009 bytes ✅ |
| Plain text `"hello"` | `null` (does **not** fabricate `/hello` as a
path) ✅ |
| File URL → non-image (`.txt`) | `null` ✅ |
| File URL → missing file | `null` ✅ |

**Bug 2 — marker scenarios** (8/8):

| input | output text | counter after |
|---|---|---|
| 1 paste + `"is this a dog?"` | `[Image #1] is this a dog?` | 1 |
| follow-up paste + `"what about now?"` | `[Image #2] what about now?` |
2 |
| 2 pastes + `"compare these"` | `[Image #3] [Image #4] compare these` |
4 |
| empty text + 1 paste | `[Image #5]` | 5 |
| whitespace-only text + 1 paste | `[Image #6]` | 6 |
| pre-attached `event.images` + caption | `[Image #7] see attached` | 7
|
| 2 pre-attached + 1 pending + caption | `[Image #8] [Image #9] [Image
#10] look` | 10 |
| no images + plain text | bail out, counter unchanged | 10 |

Plus:
- `pnpm run check` (biome lint + tsc) — clean (one pre-existing
`lsp/client.ts` warning is unrelated).
- `pnpm test` — 749 tests across 40 files, all pass.
- `pnpm run build:binary` + `pnpm run install:local` — clean.

## What else changed

The PR also lands the supporting infrastructure that the new extension
needs:

- **`src/extensions/clipboard-image.ts`** — new file. Hosts both the
file-URL-aware paste handler and the `[Image #N]` marker injection.
- **`src/cli.ts`** — wires `clipboardImageExtension` into the default
extension list.
- **`src/extensions/ui.ts`** — exports `setPasteImageHandler` and
`setPendingImageIndicator` so the new extension can hook the editor.
- **`src/components/editor.ts`** —
`PromptEditor.setPendingImageIndicator()` and a render path that pins a
right-aligned status segment to the first prompt row, with content-width
recomputed so typed text never collides with the indicator.
- **`package.json`** — adds the prebuilt
`@mariozechner/clipboard-{darwin,linux,win32}-*` addons as
`optionalDependencies` so install picks the right platform binary.
- **`.gitignore`** — ignores `kimchi-session-*.html` viewers produced by
`/export`.

## Status

Draft. Opening for early visibility while I'm:

- still validating end-to-end with the running binary in everyday paste
flows;
- considering whether the same file-URL-prefer treatment should also
apply on Linux (Wayland exposes file lists via `wl-paste --list-types` →
`text/uri-list`, X11 via `xclip -selection clipboard -t TARGETS` —
straightforward but currently out of scope).

Reviews welcome on the macOS code path, the marker UX (`[Image #N]`
placement and counter scope), and the editor indicator changes; the rest
is mechanical.

Closes LLM-1499

Co-Authored-By: Kimchi <noreply@kimchi.dev>

---
### Kimchi Summary
### What changed
Adds support for pasting images directly from the clipboard into chat
turns. When images are pasted, a right-aligned indicator appears in the
prompt editor showing the number and total size of pending attachments,
and `[Image #N]` markers are prepended to the submitted text for
multimodal model compatibility.

### Why
Enables users to attach screenshots and image files to conversations
without saving them to disk first, matching the UX of multimodal chat
interfaces like Claude.

### Key changes
- **`src/extensions/clipboard-image.ts`**: New extension implementing
cross-platform clipboard image reading via native NAPI bindings
(`@mariozechner/clipboard-*` packages). Handles platform detection
(Darwin/Linux/Windows), macOS file URL fallback via AppleScript, 50 MB
size limits, and `[Image #N]` marker generation.
- **`src/components/editor.ts`**: Extended `PromptEditor` with
`setPendingImageIndicator()` and `computeContentWidth()` to reserve
right-aligned space for status indicators without colliding with typed
text; updates rendering to wrap content earlier when indicators are
active.
- **`src/extensions/ui.ts`**: Added `setPasteImageHandler()` and
`setPendingImageIndicator()` bridge functions to wire clipboard events
into the editor component.
- **`package.json`**: Added nine platform-specific
`optionalDependencies` for clipboard native addons covering Darwin
(arm64/x64/universal), Linux (arm64/x64 with glibc/musl), and Windows
(arm64/x64).
- **`.gitignore`**: Added ignore rule for `kimchi-session-*.html` files
generated by the `/export` command.

### Impact
- **Dependencies**: Requires the matching
`@mariozechner/clipboard-<platform>-<arch>` optional dependency;
gracefully degrades with a warning if unavailable, unsupported
(Termux/headless), or if the active model lacks image input
capabilities.
- **Size limits**: Pasted images are capped at 50 MB raw bytes to
prevent excessive memory usage and API payload sizes.
- ** UX**: Images are held in a pending queue until message submission,
allowing multiple pastes before sending; the indicator updates in
real-time to show accumulated size.

---------

Co-authored-by: Kimchi <noreply@kimchi.dev>
igor-susic1 added a commit that referenced this pull request May 9, 2026
  Tier 0 of docs/ferment-review.md tracked six correctness items the
  post-review audit said had to land before any further feature work.
  This commit closes all six, plus a Critical bug from the original
  docs/ferment-audit.md that survived the entire rewrite and was
  surfaced by the new property-based test on its first real run.

  T0#1  handleRefinePhase running-step destruction (audit doc #4)
  ─────────────────────────────────────────────────────────────────
  refine_phase rebuilds the entire steps array, so allowing it through
  while a step is `running` would silently delete the in-flight step
  and the subagent's later complete_step would hit STEP_NOT_FOUND.
  The bug pre-dated this branch and was already documented in the
  project's own audit; it survived the rewrite anyway.

  Added a guard in handleRefinePhase + a new STEP_RUNNING TransitionError
  variant carrying { phaseId, runningStepId, runningStepIndex,
  runningDescription, message }. Three new state-machine tests cover
  reject-when-running, allow-after-running-completes, and the mixed
  terminal/pending case.

  T0#5  Distinguish unavailable judge from grade "B"
  ─────────────────────────────────────────────────────────────────
  When the judge LLM is unreachable or returns unparseable JSON, every
  grader used to return `{ grade: "B", rationale: "Judge unavailable —
  assumed good." }`. The self-improvement loop's strategy-change
  triggers fire only on grades C+, so a fully-down judge looked like a
  string of Bs and the loop never adapted.

  Added an `unavailable: true` flag to JudgeGrade (placeholder grade
  preserved for legacy consumers). Three downstream consumers updated:

    - stats.computeStats: skip unavailable grades when computing
      averageGrade and averageStepGrade so judge outages don't inflate
      the metric
    - self-improve.evaluatePhaseFeedback: short-circuit with a neutral
      adjustment, all triggers off, even if synthetic deltas were present
    - tools/phases.complete_phase: skip the corrective-step request for
      unavailable D/F grades — we don't trust the rationale enough to
      spend an extra LLM call on a fix
    - format.formatFermentStatus: append "?" to the grade tag (e.g.
      "📊 B?") so the UI distinguishes "judged B" from "unjudged"

  Two new self-improve tests cover the short-circuit and that a synthetic
  deltas array is also ignored when unavailable.

  T0#6  Lock the snapshot file with proper-lockfile
  ─────────────────────────────────────────────────────────────────
  Two subagents in a parallel group calling complete_step concurrently
  used to race on the snapshot file. The dependency was already in
  package.json; only used by skills-manager, never wired for ferments.

  Added FermentEventStore.withLock(id, fn) wrapping every mutation
  entry: writeWithEvents, create, delete, addDecision, addMemory,
  updateWorktree, migrateLegacy. proper-lockfile's lockSync API
  doesn't accept retries (the async API does), so a small hand-rolled
  exponential-backoff retry loop covers contention. Stale-recovery
  release errors are swallowed to keep the mutation result intact.

  Two new tests cover lock release between successive writes and that
  sibling mutations don't deadlock from the same process.

  T0#4  Property-based fold-equivalence test
  ─────────────────────────────────────────────────────────────────
  Generates random valid command sequences (respecting state-machine
  guards and the new STEP_RUNNING guard) using a seeded xorshift32
  RNG, applies each via the production mutation path
  (applyCommand + commandToEvents + writeWithEvents), and after every
  command asserts:

    stateHash(eventStore.get(id)) === stateHash(parentStorage.get(id))

  This is the contract the four state-bearing layers — state machine,
  command-mapper, event fold, snapshot store — must collectively
  satisfy. It is the Tier-0 #4 item from docs/ferment-review.md and
  was originally promised as the test that would have caught §4.2 and
  §4.3 the day they were written.

  Added 5 deterministic seeds (1, 7, 42, 137, 31337), each running a
  40-step sequence. The legalCommands generator weights pause at 10%
  and abandon-from-paused at 10% so sequences don't short-circuit to
  terminal status. Non-PAUSED sequences exercise scope, activate_phase,
  activate_phase_group, refine_phase, start_step, complete_step,
  skip_step, fail_step, complete_phase, skip_phase, add_decision,
  add_memory, pause/resume, and grade-bearing variants.

  Bonus: audit doc finding #1 (Critical, unfixed)
  ─────────────────────────────────────────────────────────────────
  The property test caught a bug on its first real run that the audit
  flagged but no commit had addressed: handleActivatePhaseGroup did
  NOT deactivate a previously-active non-group phase. Trace:

    1. activate_phase(P3)              → P3 active, status=running
    2. activate_phase_group(group=1)   → group activates, but P3 STAYS
                                          active (no demotion sweep)
    3. snapshot has 3 active phases (G1A, G1B, P3); a fresh fold has
       2 (G1A, G1B). The "active phases are exactly the target group"
       invariant is broken in the snapshot.

  Added a deactivation sweep mirroring handleActivatePhase's pattern:
  demote any active phase whose groupIndex doesn't match the target
  group. New state-machine test "deactivates a previously-active
  non-group phase (audit #1)" covers it.

  Top-level scope-field drift, also caught by the property test
  ─────────────────────────────────────────────────────────────────
  The state machine writes both Ferment.scoping.{goal,criteria,
  constraints} (typed answers + confirmedAt) AND the denormalized
  top-level Ferment.{goal, successCriteria, constraints} (plain
  strings/arrays). The fold only reproduced the typed sub-record, so
  every scope command left the fold's top-level fields undefined while
  the snapshot had the real values.

  Added an optional `plain` field to ScopingGoalSetPayload,
  ScopingCriteriaSetPayload, ScopingConstraintsSetPayload. Mapper
  populates it from `post.{goal,successCriteria,constraints}`. Fold
  applies it to the corresponding top-level slot, falling back to
  `p.{goal,criteria,constraints}.answer` for legacy events.

  Refactor: applyEvent extracted as exported applyFermentEvent
  ─────────────────────────────────────────────────────────────────
  Already shipped earlier in §4.3 work — calling it out here because
  the property test depends on it. The fold and the mapper now share
  one source of truth for "what does this event do", which is the
  preconditon that makes hash-chain equivalence even tractable.

  Verification
  ─────────────────────────────────────────────────────────────────
  - pnpm run typecheck — 0 errors across 297 files
  - pnpm run lint     — 0 errors across 304 files
  - pnpm run test     — 89 files, 1608 passed, 3 pre-existing skips,
                        0 failures (up from 1595)

  Net new tests this commit: 13
  - 3 in state-machine.test.ts (refine_phase running-step guard)
  - 1 in state-machine.test.ts (audit #1 deactivation sweep)
  - 2 in self-improve.test.ts (unavailable short-circuit)
  - 2 in event-store.test.ts (lock release / no nested deadlock)
  - 5 in event-store.property.test.ts (seeded random sequences)
igor-susic1 added a commit that referenced this pull request May 11, 2026
  Tier 0 of docs/ferment-review.md tracked six correctness items the
  post-review audit said had to land before any further feature work.
  This commit closes all six, plus a Critical bug from the original
  docs/ferment-audit.md that survived the entire rewrite and was
  surfaced by the new property-based test on its first real run.

  T0#1  handleRefinePhase running-step destruction (audit doc #4)
  ─────────────────────────────────────────────────────────────────
  refine_phase rebuilds the entire steps array, so allowing it through
  while a step is `running` would silently delete the in-flight step
  and the subagent's later complete_step would hit STEP_NOT_FOUND.
  The bug pre-dated this branch and was already documented in the
  project's own audit; it survived the rewrite anyway.

  Added a guard in handleRefinePhase + a new STEP_RUNNING TransitionError
  variant carrying { phaseId, runningStepId, runningStepIndex,
  runningDescription, message }. Three new state-machine tests cover
  reject-when-running, allow-after-running-completes, and the mixed
  terminal/pending case.

  T0#5  Distinguish unavailable judge from grade "B"
  ─────────────────────────────────────────────────────────────────
  When the judge LLM is unreachable or returns unparseable JSON, every
  grader used to return `{ grade: "B", rationale: "Judge unavailable —
  assumed good." }`. The self-improvement loop's strategy-change
  triggers fire only on grades C+, so a fully-down judge looked like a
  string of Bs and the loop never adapted.

  Added an `unavailable: true` flag to JudgeGrade (placeholder grade
  preserved for legacy consumers). Three downstream consumers updated:

    - stats.computeStats: skip unavailable grades when computing
      averageGrade and averageStepGrade so judge outages don't inflate
      the metric
    - self-improve.evaluatePhaseFeedback: short-circuit with a neutral
      adjustment, all triggers off, even if synthetic deltas were present
    - tools/phases.complete_phase: skip the corrective-step request for
      unavailable D/F grades — we don't trust the rationale enough to
      spend an extra LLM call on a fix
    - format.formatFermentStatus: append "?" to the grade tag (e.g.
      "📊 B?") so the UI distinguishes "judged B" from "unjudged"

  Two new self-improve tests cover the short-circuit and that a synthetic
  deltas array is also ignored when unavailable.

  T0#6  Lock the snapshot file with proper-lockfile
  ─────────────────────────────────────────────────────────────────
  Two subagents in a parallel group calling complete_step concurrently
  used to race on the snapshot file. The dependency was already in
  package.json; only used by skills-manager, never wired for ferments.

  Added FermentEventStore.withLock(id, fn) wrapping every mutation
  entry: writeWithEvents, create, delete, addDecision, addMemory,
  updateWorktree, migrateLegacy. proper-lockfile's lockSync API
  doesn't accept retries (the async API does), so a small hand-rolled
  exponential-backoff retry loop covers contention. Stale-recovery
  release errors are swallowed to keep the mutation result intact.

  Two new tests cover lock release between successive writes and that
  sibling mutations don't deadlock from the same process.

  T0#4  Property-based fold-equivalence test
  ─────────────────────────────────────────────────────────────────
  Generates random valid command sequences (respecting state-machine
  guards and the new STEP_RUNNING guard) using a seeded xorshift32
  RNG, applies each via the production mutation path
  (applyCommand + commandToEvents + writeWithEvents), and after every
  command asserts:

    stateHash(eventStore.get(id)) === stateHash(parentStorage.get(id))

  This is the contract the four state-bearing layers — state machine,
  command-mapper, event fold, snapshot store — must collectively
  satisfy. It is the Tier-0 #4 item from docs/ferment-review.md and
  was originally promised as the test that would have caught §4.2 and
  §4.3 the day they were written.

  Added 5 deterministic seeds (1, 7, 42, 137, 31337), each running a
  40-step sequence. The legalCommands generator weights pause at 10%
  and abandon-from-paused at 10% so sequences don't short-circuit to
  terminal status. Non-PAUSED sequences exercise scope, activate_phase,
  activate_phase_group, refine_phase, start_step, complete_step,
  skip_step, fail_step, complete_phase, skip_phase, add_decision,
  add_memory, pause/resume, and grade-bearing variants.

  Bonus: audit doc finding #1 (Critical, unfixed)
  ─────────────────────────────────────────────────────────────────
  The property test caught a bug on its first real run that the audit
  flagged but no commit had addressed: handleActivatePhaseGroup did
  NOT deactivate a previously-active non-group phase. Trace:

    1. activate_phase(P3)              → P3 active, status=running
    2. activate_phase_group(group=1)   → group activates, but P3 STAYS
                                          active (no demotion sweep)
    3. snapshot has 3 active phases (G1A, G1B, P3); a fresh fold has
       2 (G1A, G1B). The "active phases are exactly the target group"
       invariant is broken in the snapshot.

  Added a deactivation sweep mirroring handleActivatePhase's pattern:
  demote any active phase whose groupIndex doesn't match the target
  group. New state-machine test "deactivates a previously-active
  non-group phase (audit #1)" covers it.

  Top-level scope-field drift, also caught by the property test
  ─────────────────────────────────────────────────────────────────
  The state machine writes both Ferment.scoping.{goal,criteria,
  constraints} (typed answers + confirmedAt) AND the denormalized
  top-level Ferment.{goal, successCriteria, constraints} (plain
  strings/arrays). The fold only reproduced the typed sub-record, so
  every scope command left the fold's top-level fields undefined while
  the snapshot had the real values.

  Added an optional `plain` field to ScopingGoalSetPayload,
  ScopingCriteriaSetPayload, ScopingConstraintsSetPayload. Mapper
  populates it from `post.{goal,successCriteria,constraints}`. Fold
  applies it to the corresponding top-level slot, falling back to
  `p.{goal,criteria,constraints}.answer` for legacy events.

  Refactor: applyEvent extracted as exported applyFermentEvent
  ─────────────────────────────────────────────────────────────────
  Already shipped earlier in §4.3 work — calling it out here because
  the property test depends on it. The fold and the mapper now share
  one source of truth for "what does this event do", which is the
  preconditon that makes hash-chain equivalence even tractable.

  Verification
  ─────────────────────────────────────────────────────────────────
  - pnpm run typecheck — 0 errors across 297 files
  - pnpm run lint     — 0 errors across 304 files
  - pnpm run test     — 89 files, 1608 passed, 3 pre-existing skips,
                        0 failures (up from 1595)

  Net new tests this commit: 13
  - 3 in state-machine.test.ts (refine_phase running-step guard)
  - 1 in state-machine.test.ts (audit #1 deactivation sweep)
  - 2 in self-improve.test.ts (unavailable short-circuit)
  - 2 in event-store.test.ts (lock release / no nested deadlock)
  - 5 in event-store.property.test.ts (seeded random sequences)
mindtraveller added a commit that referenced this pull request May 12, 2026
…ssification (#225)

## Summary

Two related issues blocked `jetbrains_get_all_open_file_paths` (and
other MCP direct tools) from being callable in a kimchi-dev session,
even when the JetBrains MCP server was running and reachable.

1. **First-run registration gap** — when the on-disk `mcp-cache.json`
had a stale `configHash` for a server with `directTools: true`, the
cached metadata was rejected, no direct tools registered at module load,
and the bootstrap-on-connect path only emitted *"available after
restart"*. A naive first-time user (or anyone who edited
`.kimchi/mcp.json`) saw a permanent **Tool not found** failure.
2. **Plan-mode classifier blind spot** — `permissions/taxonomy.ts` only
matched the read-verb (`get`/`list`/etc.) at position 0 of the tool
name. MCP direct tools arrive prefix-flattened
(`jetbrains_get_all_open_file_paths`), so the verb sits at position 1
and every read-only MCP tool got classified `unknown` → stripped from
the plan-mode active set → reported as *"not available"* or *"not
found"* depending on timing.

This PR fixes both. After landing, the user-reported flow works on the
first run, in plan mode, without a restart.

## Root cause walkthrough

### Why the cache rejection happens

- `src/extensions/mcp-adapter/index.ts` at module load runs
`resolveDirectTools(earlyConfig, earlyCache, …)` (line 75 pre-patch).
For each server with `directTools: true`, it iterates the cache entry
and registers one `pi.registerTool({ … })` per cached tool.
- `direct-tools.ts:83` skips the server when
`!isServerCacheValid(serverCache, definition)`.
- `isServerCacheValid` (`metadata-cache.ts:106`) compares
`entry.configHash` to `computeServerHash(definition)`.
`computeServerHash` only hashes identity-affecting fields (`command`,
`args`, `env`, `cwd`, `url`, `headers`, `auth`, `bearerToken`,
`bearerTokenEnv`, `exposeResources`, `excludeTools`).
- The cache file `~/.config/kimchi/harness/mcp-cache.json` is **shared
across every project** (path derives from `getAgentDir()`), so an entry
written by a different project's `mcp.json` lingers and silently fails
the hash check.
- The bootstrap path at `init.ts:171-203` could re-cache after a
successful connect, but it only fired the *"available after restart"*
notification; direct tools register synchronously at module load, so
they only became visible on the **next** restart. Combined with the
unconditional `if (!state) return []` guard in `registerAndActivate`,
even adding a runtime registration callback wouldn't have worked.

### Why plan mode hid the registered tools

- Reproduced from the session log: call #0 returned *"Plan mode: tool
jetbrains_get_all_open_file_paths is not available. Use /permissions
mode default to enable writes."* and call #1 returned *"Tool
jetbrains_get_all_open_file_paths not found"*.
- `permissions/index.ts:192` (`applyPlanModeTools`) runs at session
start and rebuilds the active set to **only** tools where
`isReadOnlyTool(name)` is true or `name` is in `PLAN_MODE_TOOLS`.
- `taxonomy.ts:20` had `READ_ONLY_NAME_HINT =
/^(read|get|list|search|query|describe|find|grep|ls|loki_|view|show)/i`.
For `jetbrains_get_all_open_file_paths`, the regex anchors at `^` and
only sees `jetbrains` → no match → classified `unknown` → not read-only
→ dropped from the plan-mode active set.

## Changes

### `src/extensions/mcp-adapter/metadata-cache.ts`

- `overwriteMetadataCache(cache)` — writes the on-disk cache verbatim
(no merge with existing). `saveMetadataCache` always merges, so it
cannot delete entries; the new function fills that gap.
- `purgeStaleEntries(cache, mcpServers)` — drops entries whose server
**is** in the current `mcp.json` but whose `configHash` no longer
matches the live definition. **Orphan entries (servers not in the
current project's config) are preserved** because the cache is shared
across projects; dropping them would silently nuke teammates' caches.

### `src/extensions/mcp-adapter/index.ts`

- Run `purgeStaleEntries` at module load right after
`loadMetadataCache()`. If any entry was removed, persist via
`overwriteMetadataCache` and continue with the cleaned cache.
- `registerAndActivate(specs, opts?)` now accepts a `markDynamic` flag
(default `true`, preserves existing proxy describe/search callers). When
`false`, the new tool names are **not** added to
`state.dynamicToolNames`, so the `pi.on("input", …)` hook leaves them in
the active set across turns.
- Removed the early `if (!state) return []` guard so the new bootstrap
callback can run from inside `initializeMcp` before its promise
resolves. The lazy `() => state` closure inside
`createDirectToolExecutor` resolves the state correctly at call time.
`state.dynamicToolNames` is only touched when both `markDynamic` and
`state` are truthy.
- Added `registerBootstrappedDirectTools(specs)` — thin wrapper that
calls `registerAndActivate(specs, { markDynamic: false })`. Passed as
the third argument to `initializeMcp`.

### `src/extensions/mcp-adapter/init.ts`

- `initializeMcp` learned an optional third argument:
`registerBootstrappedDirectTools?: (specs: DirectToolSpec[]) =>
string[]`.
- After `loadMetadataCache()` inside `initializeMcp`, run
`purgeStaleEntries` again and persist via `overwriteMetadataCache`.
Covers callers that don't go through `index.ts` module-load (tests,
embedded use).
- After the bootstrap loop at `init.ts:171-203`, for each successfully
bootstrapped server name, resolve fresh specs via
`resolveDirectTools(config, freshCache, prefix, envOverride)` filtered
to `bootstrapped`, then invoke the callback. On non-empty injection,
emit the new *"MCP: N direct tool(s) from … are now available"*
notification; fall back to the previous *"available after restart"*
message when the callback is absent or returns empty.

### `src/extensions/permissions/taxonomy.ts`

- Added `READ_ONLY_VERB_SEGMENTS` (`get`, `list`, `search`, `query`,
`describe`, `find`, `grep`, `ls`, `view`, `show`, `preview`, `inspect`,
plus `read`).
- `hasReadOnlyVerbSegment(toolName)` splits on `_` and checks if **any
segment after the first** is an exact match for a read-verb. The first
segment is intentionally skipped so a server namespace like `list` or
`get` doesn't blanket-mark every tool under it as read-only.
- `classifyTool` consults the new helper as a fallback after
`STATIC_CATEGORIES`, the `mcp__` branch, and the original
`READ_ONLY_NAME_HINT` start-anchored regex.

Worked out by inspection of the `_refreshToolRegistry` flow in the
bundled pi-coding-agent — tools must live in `_toolRegistry` (active
set) **and** match `currentContext.tools` at agent-loop dispatch
(`agent-loop.js:332`) or the agent returns `Tool X not found` before any
extension `tool_call` hook runs.

## Verified classifications

| Tool | Before | After | Reason |
| --------------------------------------------- | --------- | ---------
| ------------------------------------------- |
| `jetbrains_get_all_open_file_paths` | `unknown` | `readOnly`| verb
`get` in segment 1 |
| `jetbrains_get_run_configurations` | `unknown` | `readOnly`| verb
`get` in segment 1 |
| `jetbrains_xdebug_get_stack` | `unknown` | `readOnly`| verb `get` in
segment 2 |
| `supabase_list_tables`, `supabase_search_docs`| `unknown` |
`readOnly`| verb `list`/`search` in segment 1 |
| `jetbrains_execute_run_configuration` | `unknown` | `unknown` | no
read-verb segment — stays blocked |
| `jetbrains_build_project` | `unknown` | `unknown` | no read-verb
segment — stays blocked |
| `jetbrains_create_new_file` | `unknown` | `unknown` | `create` is not
in the read-verb set |
| `jetbrains_rename_refactoring` | `unknown` | `unknown` | `rename` is
not in the read-verb set |
| `playwright_browser_click` | `unknown` | `unknown` | no read-verb
segment |
| `mcp`, `bash`, `read` | unchanged | unchanged | hit
`STATIC_CATEGORIES` |

## Test plan

- [x] `pnpm run typecheck` — clean.
- [x] `pnpm run lint` — clean.
- [x] `pnpm test` — 1891/1891 pass.
- [x] Dry-run `purgeStaleEntries` against
`~/.config/kimchi/harness/mcp-cache.json` with the project's current
`.kimchi/mcp.json` — removes only the stale `jetbrains` entry, preserves
`supabase`, `sentry`, `playwright` orphans.
- [x] Sanity-tested classifier against the table above (all expectations
matched).
- [x] Manual smoke: with JetBrains IDE listening on `127.0.0.1:64342`,
stop `pnpm dev`, `jq 'del(.servers.jetbrains)'
~/.config/kimchi/harness/mcp-cache.json` to force the bootstrap path,
start `pnpm dev`, ask for `jetbrains_get_all_open_file_paths` in plan
mode → expect the call to succeed (cache rewritten with matching hash,
tool registered in-session, plan-mode allow-list now includes it).
- [x] Manual smoke: kill JetBrains, repeat — expect the *"available
after restart"* fallback notify and a clean `mcp({ tool:
"get_all_open_file_paths", server: "jetbrains" })` *"server not
available"* error.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---
### Kimchi Summary
### What changed
Fixes stale MCP metadata cache entries blocking direct-tool registration
on startup, enables immediate registration of bootstrapped direct tools
without a restart, and corrects the permission taxonomy for flattened
MCP tool names.

### Why
When a server definition changed, the on-disk cache retained an old
`configHash` that caused `resolveDirectTools` to silently reject every
tool on load, forcing users to restart. Additionally, the permissions
layer only looked for read-only verbs at the start of a tool name, so
flattened direct-tool names like `jetbrains_get_all_open_file_paths`
were incorrectly left as `unknown`.

### Key changes
- `metadata-cache.ts`
- Adds `purgeStaleEntries` to drop cache entries whose `configHash` no
longer matches the current server definition, while preserving orphans
from other projects.
- Adds `overwriteMetadataCache` for atomic replace-without-merge writes,
with best-effort temp-file cleanup and swallowed I/O errors so a bad
cache never crashes the extension host.
- Refactors cache I/O into `atomicWriteCache`, `loadExistingCache`,
`cleanupTempFile`, and `isValidCache`.
- `index.ts`
  - Purges stale cache and overwrites it during early extension load.
- Extends `registerAndActivate` with an optional `{ markDynamic?:
boolean }` flag; when `markDynamic` is `true` and `state` is not yet
initialized, registration is skipped to prevent leaked dynamic tools.
- Introduces `registerBootstrappedDirectTools` (always `markDynamic:
false`) and passes it into `initializeMcp` so bootstrapped tools persist
across turns.
- `init.ts`
- Mirrors the stale-cache purge inside `initializeMcp` for non-Pi entry
points.
- After successful bootstrap, immediately calls
`registerBootstrappedDirectTools` with resolved direct-tool specs
instead of showing a "restart required" notification.
- `taxonomy.ts` and `taxonomy.test.ts`
- Adds `hasReadOnlyVerbSegment`, which scans underscore-separated
segments after the first (server prefix) to classify namespaced direct
tools (e.g., `supabase_list_tables`) as `readOnly`.
- `cache-resolver.integration.test.ts` (new)
- Adds integration tests exercising stale-entry purge,
overwrite-vs-merge semantics, end-to-end stale→purge→resolve flow, and
`excludeTools` / `directTools=false` interactions.

### Impact
- Bootstrapped MCP direct tools are now available in the current session
without restarting.
- Dynamic-proxy tools registered before `state` is ready are no longer
leaked.
- Cache file corruption or write failures are handled gracefully.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant