feat(parser): lazy-load infrastructure + Python tree-sitter parser (#933 phase 3) #957

Merged: gfargo merged 1 commit into `main` from `feat/933-phase3-lazyload-python` on May 14, 2026

@gfargo gfargo commented May 14, 2026

Summary

First lazy-loaded language. Tree-sitter parser for Python is pulled from a manifest-pinned CDN URL into the user's cache dir on opt-in (`COCO_PREFETCH=py`); falls through to the regex parser cleanly when the cache is empty (no surprise network calls). End-to-end proven via smoke test:

```
$ COCO_PREFETCH=py tsx
· Python: downloading https://cdn.jsdelivr.net/npm/tree-sitter-python@0.23.6/tree-sitter-python.wasm…
✓ Python parser cached (447 KB) at ~/.cache/coco/tree-sitter/tree-sitter-python.wasm

OUTPUT: Updated Python `src/p.py`. added: parse_request(). removed: legacy_handler(). +1/-1 lines.
```

New modules

  • `cache.ts` — cache dir resolution (honors `COCO_CACHE_DIR` / `XDG_CACHE_HOME` / platform defaults; Windows-aware).
  • `manifest.ts` — version + SHA-256 pin per language. Python: `tree-sitter-python@0.23.6` from jsdelivr (~447 KB).
  • `download.ts` — fetch → SHA verify → atomic write. Returns typed `DownloadOutcome` instead of throwing.
  • `prefetch.ts` — `COCO_PREFETCH` env var orchestrator. Aliases (`py` / `python` / `all`), skip-if-cached, error-tolerant.
  • `pythonTreeSitterParser.ts` — per-line AST extraction mirroring the TS parser.
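
The cache-directory precedence noted for `cache.ts` could look roughly like the sketch below. `resolveCacheDir` and its parameters are hypothetical (the real module presumably reads `process.env` and `os.homedir()` directly), and appending a `coco` segment under `XDG_CACHE_HOME` / `LOCALAPPDATA` is an assumption inferred from the platform defaults quoted in this PR.

```typescript
import * as path from "node:path";

type Env = Record<string, string | undefined>;

// Hypothetical helper: env, platform, and homeDir are parameters so the
// priority order can be exercised without touching real process state.
function resolveCacheDir(env: Env, platform: string, homeDir: string): string {
  const isWindows = platform === "win32";
  const p = isWindows ? path.win32 : path.posix;

  // 1. Direct override (handy for CI and tests).
  if (env.COCO_CACHE_DIR) return env.COCO_CACHE_DIR;

  // 2. Platform cache root from the environment.
  //    Assumption: a "coco" segment is appended under the root.
  if (isWindows && env.LOCALAPPDATA) return p.join(env.LOCALAPPDATA, "coco", "Cache");
  if (!isWindows && env.XDG_CACHE_HOME) return p.join(env.XDG_CACHE_HOME, "coco");

  // 3. Platform default.
  return isWindows
    ? p.join(homeDir, "AppData", "Local", "coco", "Cache")
    : p.join(homeDir, ".cache", "coco");
}
```

Passing the environment in as a plain object keeps the function pure, which is one way to test precedence without mutating `process.env`.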

Wiring changes

  • `runtime.ts` extended to know about lazy-loaded languages; `resolveWasmLocations` includes the cache path alongside bundled paths.
  • `structuralParserRegistry`: `py` chain becomes `[treeSitterPythonParser, regexPy]`.
  • `src/index.ts` adds a `runPrefetchFromEnv()` hook at CLI startup. No-op when `COCO_PREFETCH` is unset.

Test isolation note

Each test file that touches the cache sets a unique `COCO_CACHE_DIR` under `os.tmpdir()` — keeps parallel jest workers from racing on `~/.cache/coco/tree-sitter/`.

Side fix

Bumped the `buildScenarioFixtures` "produces a fixture per commit" test to 15s. Default 5s was being exceeded under parallel load now that tree-sitter init runs on every test exercising `summarizeLargeFiles`. Real heavy work (temp git repo, several commits, walk-the-log) — 15s budget is comfortable.

Test plan

  • `npx tsc --noEmit` → 0 errors
  • `npm run test:jest` → 1670/1670 pass (8 of 9 consecutive runs clean)
  • `npx eslint` on touched files → clean
  • Manual: `COCO_PREFETCH=py` downloads + caches + Python parser produces expected output

Out of scope (later phases)

  • 5/6 — Rust + Go via the same lazy-load path
  • 7 — first-use Y/n prompt, `coco cache` subcommand, eval baseline, telemetry

Refs #933.

…phase 3)

First lazy-loaded language lands. Pulls down a manifest-pinned
.wasm from a CDN into the user's cache dir on opt-in; falls
through to the regex parser when the cache is empty (no surprise
network calls). Mirrors the design discussed on the issue:
bundle TS/JS, lazy-load Python / Rust / Go.

## Pieces

### New modules under `src/lib/parsers/default/__tree_sitter__/`

- **`cache.ts`** — cache directory resolution. Checks three
  sources in priority order (two env-var overrides, then a default):
    1. `COCO_CACHE_DIR` (direct override; useful for CI + tests)
    2. `XDG_CACHE_HOME` on Unix / `LOCALAPPDATA` on Windows
    3. Platform default (`~/.cache/coco` on Unix,
       `%USERPROFILE%/AppData/Local/coco/Cache` on Windows)
  Exports `getCachedWasmPath(language)` so the runtime can resolve
  a language id to its cached `.wasm` location.
- **`manifest.ts`** — the source of truth for lazy-loadable parsers.
  Pins explicit version + SHA-256 per language. Python ships:
    - jsdelivr CDN URL: `tree-sitter-python@0.23.6/tree-sitter-python.wasm`
    - SHA-256: `8c93692fb368...`
    - ~447 KB
  Updating a parser is a deliberate manifest edit; the supply-chain
  surface stays small + reviewable.
- **`download.ts`** — `fetch → verify SHA-256 → atomic-write`
  (temp file + rename so a crash never leaves a partial `.wasm`).
  Returns a typed `DownloadOutcome` instead of throwing; the
  orchestrator processes multiple languages without crashing on
  one failure. Discriminates between `network`, `sha-mismatch`,
  `write-failed` outcomes for actionable error messages.
- **`prefetch.ts`** — reads the `COCO_PREFETCH` env var, parses
  aliases (`py` / `python` / `all`), skips already-cached
  languages, downloads the rest serially. `runPrefetchFromEnv` is
  the entrypoint called from the CLI startup hook. Tolerates
  unknown language tokens with a warning to stderr.
- **`pythonTreeSitterParser.ts`** — per-line AST extraction
  mirroring the TS parser. Recognizes `function_definition`,
  `class_definition`, decorated definitions, PEP 695 type
  aliases, and ALL_CAPS module-level constants. Underscore-
  prefixed names → `exported: false` (Python convention).
  Surrenders when the cached `.wasm` isn't loaded — registry
  chain falls through to the regex parser cleanly.
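
As an illustration of the `download.ts` flow (fetch, verify SHA-256, atomic write, typed outcome), here is a minimal sketch. The `downloadVerified` name, the injected `fetchBytes` parameter, and the `message` / `path` / `bytes` fields are assumptions for the example; only the three failure reasons and the `expected` / `actual` fields on a hash mismatch come from this PR.

```typescript
import { createHash } from "node:crypto";
import { promises as fs } from "node:fs";
import * as path from "node:path";

type DownloadOutcome =
  | { ok: true; path: string; bytes: number }
  | { ok: false; reason: "network"; message: string }
  | { ok: false; reason: "sha-mismatch"; expected: string; actual: string }
  | { ok: false; reason: "write-failed"; message: string };

// Hypothetical fetcher signature so callers (and tests) can inject bytes.
type FetchBytes = (url: string) => Promise<Uint8Array>;

async function downloadVerified(
  url: string,
  expectedSha256: string,
  destPath: string,
  fetchBytes: FetchBytes,
): Promise<DownloadOutcome> {
  let bytes: Uint8Array;
  try {
    bytes = await fetchBytes(url);
  } catch (err) {
    return { ok: false, reason: "network", message: String(err) };
  }

  // Refuse to write anything whose hash does not match the pinned value.
  const actual = createHash("sha256").update(bytes).digest("hex");
  if (actual !== expectedSha256) {
    return { ok: false, reason: "sha-mismatch", expected: expectedSha256, actual };
  }

  // Write to a temp file in the same directory, then rename into place,
  // so a crash never leaves a partial file at destPath.
  const tmpPath = `${destPath}.${process.pid}.tmp`;
  try {
    await fs.mkdir(path.dirname(destPath), { recursive: true });
    await fs.writeFile(tmpPath, bytes);
    await fs.rename(tmpPath, destPath);
  } catch (err) {
    await fs.rm(tmpPath, { force: true }).catch(() => undefined);
    return { ok: false, reason: "write-failed", message: String(err) };
  }
  return { ok: true, path: destPath, bytes: bytes.byteLength };
}
```

Returning a discriminated union instead of throwing lets an orchestrator loop over several languages and report each failure without a `try` / `catch` per call.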

### Updates

- **`runtime.ts`** — `TreeSitterLanguageId` gains `'python'`.
  `resolveWasmLocations` now includes the cache path for
  lazy-loaded languages alongside the bundled-language paths.
  `getTreeSitterParser` already checks `existsSync` per language,
  so it transparently handles "cached vs missing" without
  branching.
- **`structuralParserRegistry.ts`** — `py` chain becomes
  `[treeSitterPythonParser, regexPy]`. Tree-sitter is preferred;
  regex stays in the chain as the lossless fallback.
- **`src/index.ts`** — CLI startup gains a `runPrefetchFromEnv()`
  call before yargs takes over. No-op when `COCO_PREFETCH` is
  unset, so the typical path pays zero overhead. Errors are
  logged but non-fatal — the subcommand still runs (with regex
  fallback for the affected language).
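
The fallthrough behaviour of the `py` chain can be sketched as follows. The `StructuralParser` shape, `runChain`, and both stand-in parsers are hypothetical; the real registry types are not shown in this PR. The sketch only illustrates the idea that a parser "surrenders" by returning `undefined` and the next entry in the chain takes over.

```typescript
// Hypothetical shapes; the real registry types are not shown in this PR.
type Summary = { definitions: string[] };
type StructuralParser = {
  name: string;
  // Returning undefined means "surrender": the next parser in the chain runs.
  parse: (source: string) => Summary | undefined;
};

function runChain(
  chain: StructuralParser[],
  source: string,
): { parser: string; summary: Summary } | undefined {
  for (const parser of chain) {
    const summary = parser.parse(source);
    if (summary !== undefined) return { parser: parser.name, summary };
  }
  return undefined;
}

// Stand-in for the tree-sitter parser: surrenders while the .wasm is not loaded.
function makeTreeSitterStub(isWasmLoaded: () => boolean): StructuralParser {
  return {
    name: "treeSitterPythonParser",
    parse: () => (isWasmLoaded() ? { definitions: ["<from AST>"] } : undefined),
  };
}

// Stand-in regex parser: collects names from `def name(` lines.
const regexPy: StructuralParser = {
  name: "regexPy",
  parse: (source) => {
    const definitions: string[] = [];
    const pattern = /^\s*def\s+([A-Za-z_]\w*)\s*\(/gm;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) definitions.push(match[1]);
    return { definitions };
  },
};
```

With an empty cache, `runChain([makeTreeSitterStub(() => false), regexPy], source)` is answered by `regexPy`; once the stub reports the `.wasm` as loaded, the first entry wins.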

## Usage

Single-shot opt-in:
```
COCO_PREFETCH=py coco commit
```

The first invocation downloads + caches Python (~447 KB,
sub-second on a typical connection). Every subsequent
invocation reuses the cached file silently. Tree-sitter runs
for `.py` / `.pyi` files in the diff.

Multi-language:
```
COCO_PREFETCH=py,rs,go coco commit
COCO_PREFETCH=all coco commit
```

(`rs` and `go` aren't in the manifest yet — phases 5 / 6 land
those.)
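
A rough sketch of how a `COCO_PREFETCH` value might be resolved against the manifest. `parsePrefetchTokens` is a hypothetical stand-in, not the real `prefetch.ts` parser. The alias table combines the aliases named in this thread, and case-insensitive matching is an assumption.

```typescript
// Alias table assembled from the aliases named in this thread; the real
// grammar lives in prefetch.ts and may differ.
const ALIASES: Record<string, string> = {
  py: "python", python: "python",
  rs: "rust", rust: "rust",
  go: "go", golang: "go",
};

type ParsedPrefetch = { languages: string[]; unknown: string[] };

// Hypothetical stand-in for the env parser; not the real signature.
function parsePrefetchTokens(
  raw: string | undefined,
  manifestLanguages: string[],
): ParsedPrefetch {
  const languages: string[] = [];
  const unknown: string[] = [];
  const add = (lang: string) => {
    if (!languages.includes(lang)) languages.push(lang); // dedupe
  };

  for (const token of (raw ?? "").split(",").map((t) => t.trim().toLowerCase())) {
    if (token === "") continue;
    if (token === "all") {
      manifestLanguages.forEach(add);
      continue;
    }
    const lang = ALIASES[token];
    // A token only resolves if its language is actually in the manifest.
    if (lang !== undefined && manifestLanguages.includes(lang)) add(lang);
    else if (!unknown.includes(token)) unknown.push(token);
  }
  return { languages, unknown };
}
```

With a manifest that only contains `python`, `parsePrefetchTokens("py,rs,go", ["python"])` resolves `python` and reports `rs` and `go` as unknown, which is where a stderr warning would be emitted.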

## Tests

- `cache.test.ts` — env var precedence, platform-specific paths
- `download.test.ts` — happy path, SHA mismatch refusal,
  network error, throw-from-fetch handling, formatter output
- `prefetch.test.ts` — env parsing (aliases, `all`, dedupe,
  unknown tokens), already-cached short-circuit, download
  success/failure recording

Test isolation: each test file sets a unique `COCO_CACHE_DIR`
under `os.tmpdir()` so parallel jest workers don't race on the
shared `~/.cache/coco/tree-sitter/` (the dir gets cleaned at
file exit).
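
One way to get the per-file isolation described above is a small helper like this. `useIsolatedCacheDir` is a hypothetical name; the actual test files may inline the same steps in `beforeAll` / `afterAll`.

```typescript
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

// Hypothetical test helper: points COCO_CACHE_DIR at a fresh temp dir and
// returns a cleanup function that removes it and restores the old value.
function useIsolatedCacheDir(label: string): { dir: string; cleanup: () => void } {
  const previous = process.env.COCO_CACHE_DIR;
  // mkdtempSync appends random characters, so each call gets a unique dir.
  const dir = mkdtempSync(path.join(tmpdir(), `coco-${label}-`));
  process.env.COCO_CACHE_DIR = dir;

  const cleanup = () => {
    rmSync(dir, { recursive: true, force: true });
    if (previous === undefined) delete process.env.COCO_CACHE_DIR;
    else process.env.COCO_CACHE_DIR = previous;
  };
  return { dir, cleanup };
}
```

In a jest file this would typically be called once in `beforeAll`, with `cleanup` invoked from `afterAll`.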

Side fix: bumped the `buildScenarioFixtures` "produces a fixture
per commit" test to a 15s timeout. The default 5s was getting
exceeded under load now that the tree-sitter parser chain runs
on every test that exercises `summarizeLargeFiles`. The test
spawns a real temp git repo, runs several `git commit`s, and
walks the log — genuinely heavy. 15s is comfortably under any
reasonable CI runtime.

## Validation

- `npx tsc --noEmit` → 0 errors
- `npm run test:jest` → 1670/1670 pass on 8 of 9 consecutive
  runs (the 1 flake, under parallel load, is in a pre-existing
  scenario test); see also the `bin/_phase3-load-smoke`-style
  manual check below.
- `npx eslint` on touched files → clean
- Manual: `COCO_PREFETCH=py tsx bin/<smoke>.ts` downloads,
  verifies hash, caches, and the Python parser then returns
  `Updated Python … added: parse_request(). removed:
  legacy_handler().` on a sample diff.

## Out of scope (queued for later phases)

- **Phase 5/6** — Rust + Go via the same lazy-load path
- **Phase 7** — first-use interactive Y/n prompt (TTY-gated),
  `coco cache list/prefetch/clear` subcommand, eval-harness
  baseline comparison, telemetry on download/verify failures

Refs #933.
@gfargo gfargo merged commit 9e5c571 into main May 14, 2026
6 of 8 checks passed
@gfargo gfargo deleted the feat/933-phase3-lazyload-python branch May 14, 2026 19:38

const PY_BYTES = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, 0xde, 0xad, 0xbe, 0xef])
// SHA-256 of the PY_BYTES fixture above — recomputed if PY_BYTES changes.
const PY_HASH = 'b600e3f7b5cc87bc0f00020de1d51f557e628e659eb02fd5e18aac9871a3e479'
const line = formatDownloadOutcome('python', {
ok: false,
reason: 'sha-mismatch',
expected: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
actual: 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb',
displayName: 'Python',
version: '0.23.6',
wasmUrl: 'https://cdn.jsdelivr.net/npm/tree-sitter-python@0.23.6/tree-sitter-python.wasm',
sha256: '8c93692fb368e288a5824cee55773c9b3602804f513bda48c97661e52e9c2da2',
gfargo added a commit that referenced this pull request May 14, 2026
…phase 7) (#959)

Closes the tree-sitter integration feature (#933). Lazy-loaded
parsers gain a first-class CLI surface for cache management;
verbose mode surfaces a discoverability hint when the fast path
falls through to regex.

## New `coco cache` subcommands

Extends the existing `coco cache` (diff-summary info / clear)
with three tree-sitter subcommands:

- `coco cache parsers` — show every manifest language with its
  current cache status (cached size or "not cached" + the
  fetched-size estimate), version pin, and source URL. Footer
  summarizes total disk usage + quick-reference commands.
- `coco cache prefetch [languages...]` — download specific
  parsers (e.g. `coco cache prefetch py rs go` or
  `coco cache prefetch all`). When invoked with no args AND
  stdin is a TTY, opens an interactive checkbox picker. In
  non-interactive contexts (CI, pipes), no-arg invocations error
  out with usage hints instead of hanging on a prompt.
- `coco cache clear-parsers` — wipe `~/.cache/coco/tree-sitter/`.
  Idempotent; reports a per-language ✓ for each removed file.

Aliases mirror `COCO_PREFETCH` env grammar: `py` / `python`,
`rs` / `rust`, `go` / `golang`, `all`.
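
The no-argument behaviour of the prefetch subcommand (picker on a TTY, usage error otherwise) can be sketched as a small decision function. `decidePrefetchMode`, the `PrefetchMode` type, and the usage wording are hypothetical; `process.stdin.isTTY` is the standard Node.js property a caller would pass in.

```typescript
type PrefetchMode =
  | { kind: "explicit"; tokens: string[] }
  | { kind: "interactive" }
  | { kind: "usage-error"; message: string };

// Hypothetical helper: decides how the prefetch subcommand should proceed.
function decidePrefetchMode(tokens: string[], stdinIsTTY: boolean): PrefetchMode {
  // Explicit languages always win, whether or not a TTY is attached.
  if (tokens.length > 0) return { kind: "explicit", tokens };
  // No args on a TTY: safe to open the checkbox picker.
  if (stdinIsTTY) return { kind: "interactive" };
  // No args in CI or a pipe: fail fast instead of hanging on a prompt.
  return {
    kind: "usage-error",
    message: "No languages given. Usage: coco cache prefetch <py|rs|go|all>",
  };
}
```

A caller would invoke it as `decidePrefetchMode(languages, Boolean(process.stdin.isTTY))` and set a non-zero exit code on `usage-error`.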

## Surrender telemetry

In verbose mode, when the language-aware fast path is enabled and
the parser chain falls through to LLM, emit a discoverability
hint:

  `Tree-sitter parser surrendered for 'python'; using regex
   fallback. Hint: `coco cache parsers` to inspect,
   `coco cache prefetch python` to enable.`

Quiet on the default path; visible only when the user is
debugging summary quality. Hint copy adapts: bundled-language
surrenders (`ts` / `js`) point at `coco cache prefetch all`
because TS / TSX wasms are always shipped (the surrender is from
a parser-init failure, not a missing download); lazy-loaded
languages get a per-language prefetch hint.
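
A sketch of how the hint text could adapt per language. `surrenderHint` is a hypothetical helper, and the wording used for bundled languages is an assumption modelled on the Python message quoted above.

```typescript
// Bundled ids per the note above; everything else is treated as lazy-loaded.
const BUNDLED = new Set(["ts", "js"]);

// Hypothetical helper; the exact wording for bundled languages is an assumption.
function surrenderHint(language: string, verbose: boolean): string | undefined {
  if (!verbose) return undefined; // quiet on the default path
  const prefetchTarget = BUNDLED.has(language) ? "all" : language;
  return (
    `Tree-sitter parser surrendered for '${language}'; using regex fallback. ` +
    `Hint: \`coco cache parsers\` to inspect, ` +
    `\`coco cache prefetch ${prefetchTarget}\` to enable.`
  );
}
```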

## Implementation

### `cache.ts` (lazy-load cache module)

- New `getCachedParserStatus(language)` returns
  `{ language, cached, path, bytes?, mtime? }` for the table
  renderer + interactive picker.
- New `clearCachedParser(language)` unlinks the cached .wasm.
  Idempotent; returns `true` when a file was actually removed.
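
The two helpers might look something like the sketch below. The function names here are deliberately different from the real exports, the explicit `cacheDir` parameter is an assumption made to keep the example self-contained, and the `tree-sitter-<language>.wasm` file name follows the cached Python path shown in the phase 3 smoke output.

```typescript
import { statSync, unlinkSync } from "node:fs";
import * as path from "node:path";

type CachedParserStatus = {
  language: string;
  cached: boolean;
  path: string;
  bytes?: number;
  mtime?: Date;
};

// Hypothetical: cacheDir is passed in rather than resolved from the environment.
function cachedParserStatus(cacheDir: string, language: string): CachedParserStatus {
  const wasmPath = path.join(cacheDir, `tree-sitter-${language}.wasm`);
  try {
    const stat = statSync(wasmPath);
    return { language, cached: true, path: wasmPath, bytes: stat.size, mtime: stat.mtime };
  } catch {
    // Missing (or unreadable) file is reported as "not cached".
    return { language, cached: false, path: wasmPath };
  }
}

// Idempotent: returns true only when a file was actually removed.
function clearParser(cacheDir: string, language: string): boolean {
  const wasmPath = path.join(cacheDir, `tree-sitter-${language}.wasm`);
  try {
    unlinkSync(wasmPath);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return false;
    throw err; // permission errors and the like should still surface
  }
}
```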

### `structuralParserRegistry.ts`

- New `hasTreeSitterParser(language)` lets the LLM fallthrough
  path know whether a tree-sitter parser is registered for the
  language — used by the surrender-telemetry hint. Doesn't
  expose internals; the caller just needs the boolean.

### `summarizeLargeFiles.ts`

- Surrender-telemetry block fires after the registry returns
  undefined and BEFORE the cache lookup. Only emits when the
  chain includes a tree-sitter parser, so regex-only languages
  don't get a misleading hint.

### `commands/cache/`

- `config.ts` gains the `CACHE_SUBCOMMANDS` enum and a positional
  `[languages..]` for prefetch. Yargs validates the subcommand
  set; unknown tokens get caught by the language resolver.
- `handler.ts` adds three new branches:
  - `parsers` calls `renderParsersTable`
  - `prefetch` resolves tokens via `parsePrefetchEnv`
    (reusing the env-var grammar), prompts when interactive,
    and delegates to `prefetchTreeSitterParsers`. Failed
    downloads → `process.exitCode = 1`.
  - `clear-parsers` walks every manifest entry, calls
    `clearCachedParser`, reports per-language status.

### `inquirerPrompts.ts`

- New `checkboxPrompt` helper. Same dynamic-import shim as the
  other prompts; reuses the codebase's standard pattern for ESM
  inquirer modules under ts-jest.

## Tests

4 new test cases in `handler.test.ts` cover the new subcommands:
`parsers` lists every manifest language, `prefetch` warns on
unknown tokens, `clear-parsers` reports no-op when empty AND
removes cached files when present.

Test isolation: each test sets `COCO_CACHE_DIR` to the same tmp
dir the existing tests use for `XDG_CACHE_HOME`, so the
tree-sitter cache lives inside the per-test sandbox.

## Manual validation

```
$ COCO_CACHE_DIR=/tmp/coco-phase7-smoke coco cache parsers
Tree-sitter parser cache

  Python   not cached          (448.0 KB when fetched)
  Rust     not cached          (1.05 MB when fetched)
  Go       not cached          (212.1 KB when fetched)

  cached: 0/3  total on disk: 0 B

$ coco cache prefetch py
· Python: downloading https://cdn.jsdelivr.net/.../tree-sitter-python.wasm…
✓ Python parser cached (447 KB)

Summary: 1 downloaded · 0 already cached · 0 failed

$ coco cache clear-parsers
✓ cleared Python

Cleared 1 parser(s) from ~/.cache/coco/tree-sitter/
```

## Validation

- `npx tsc --noEmit` → 0 errors
- `npm run test:jest` → 1674/1674 pass (3 of 4 consecutive runs
  clean, 1 flake on the pre-existing scenarioInputs timeout
  pattern)
- `npx eslint` on touched files → clean
- Manual: all four subcommands round-trip cleanly

## Out of scope (genuine future work)

- **Eval-harness side-by-side regex-vs-tree-sitter comparison
  in the report output**. Today the eval reports per-fixture
  outcomes but doesn't discriminate WHICH parser produced each
  summary. Surfacing the regex vs. tree-sitter delta requires
  registry injection at eval time (the harness builds its own
  parser chain instead of using the global). Reasonable
  follow-up; not gating on #933 closure.

## #933 status: feature complete

| Phase | Status |
|---|---|
| 1.0 — Registry abstraction | ✓ #950 |
| 1.1 — TS/TSX bundled | ✓ #955 |
| 2 — Polish + ESM jest + arrow-fn fixture | ✓ #956 |
| 3 — Lazy-load infra + Python | ✓ #957 |
| 5 — Rust | ✓ #958 |
| 6 — Go | ✓ #958 |
| **7 — Cache CLI + telemetry** | **this PR** |

Closes #933.