
feat(parser): lazy-loaded Rust + Go tree-sitter parsers (#933 phases 5 + 6) #958

Merged
gfargo merged 1 commit into main from feat/933-phase5-6-rust-go
May 14, 2026

@gfargo gfargo commented May 14, 2026

Summary

Two new languages on the lazy-load path established in phase 3. Mechanical extension — same shape as Python, no new infrastructure.

| Language | Package | Bundle (jsdelivr) | Trigger |
|---|---|---|---|
| Rust | `tree-sitter-rust@0.24.0` | ~1.1 MB | `COCO_PREFETCH=rs` |
| Go | `tree-sitter-go@0.25.0` | ~212 KB | `COCO_PREFETCH=go` |

Both pinned with sha-256 hashes verified at download time.
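
As an illustration of what "verified at download time" can look like, here is a minimal sketch using Node's built-in `crypto` module. The `ManifestEntry` shape mirrors the fields listed in this PR (display name, version, URL, SHA-256), but the interface and both function names are assumptions made for the example, not the project's actual code.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest entry shape, modelled on the fields this PR lists.
interface ManifestEntry {
  displayName: string;
  version: string;
  wasmUrl: string;
  sha256: string; // hex digest pinned in the manifest
}

// Returns the lowercase hex SHA-256 digest of a byte buffer.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Throws when the downloaded bytes do not match the pinned digest, so a
// corrupted or tampered file is rejected before it is written to the cache.
function verifyDownload(entry: ManifestEntry, bytes: Uint8Array): void {
  const actual = sha256Hex(bytes);
  if (actual !== entry.sha256.toLowerCase()) {
    throw new Error(
      `${entry.displayName}: sha-256 mismatch (expected ${entry.sha256}, got ${actual})`
    );
  }
}
```

Hashing the full buffer before anything touches disk keeps the cache directory free of unverified files; a streaming hash would be the natural variation for larger downloads.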

End-to-end verified

```
$ COCO_PREFETCH=rs,go tsx <smoke>
· Rust: downloading https://cdn.jsdelivr.net/npm/tree-sitter-rust@0.24.0/...
✓ Rust parser cached (1077 KB)
· Go: downloading https://cdn.jsdelivr.net/npm/tree-sitter-go@0.25.0/...
✓ Go parser cached (212 KB)

RUST: Updated Rust `src/widget.rs`. added: new_widget(), class Widget,
impl Base for Widget. removed: legacy_widget(). +3/-1 lines.
GO: Updated Go `widget.go`. added: class Widget, New(), Widget.Name().
removed: LegacyName(). +3/-1 lines.
```

Rust correctly distinguishes `impl Base for Widget` from a bare struct. Go correctly renders method receivers as `Receiver.method`.

What lands

  • Manifest entries for Rust + Go (URL, version, SHA-256, size).
  • Type unions extended: `LazyTreeSitterLanguageId`, `TreeSitterLanguageId`.
  • Runtime cache paths: `resolveWasmLocations` routes both through `getCachedWasmPath()`.
  • Two new extractors: `rustTreeSitterParser.ts` (function_item, struct_item, enum_item, trait_item, impl_item, type_item, const_item, static_item, mod_item — with visibility-modifier detection that handles `pub(crate)`, `pub(super)`, `pub(in path)` uniformly) and `goTreeSitterParser.ts` (function_declaration, method_declaration with receiver-to-type walking, type_declaration discriminating struct / interface / alias, var_declaration / const_declaration including block forms).
  • Registry chains: `[treeSitterRustParser, regexRs]` and `[treeSitterGoParser, regexGo]`.
  • Prefetch aliases: `rs` / `rust` / `go` / `golang`. `COCO_PREFETCH=all` now covers all three lazy-loaded languages.

Side fix

Bumped the second `scenarioInputs` test ("extracts the same byte-identical fixture set across runs") to a 15s timeout. Phase 3 caught the sibling test; this catches the partner. Same rationale — spins up real temp git repos serially, shell-out cost adds up under parallel jest load.

Test plan

  • `npx tsc --noEmit` → 0 errors
  • `npm run test:jest` → 1670/1670 pass (4 of 5 consecutive runs clean; 1 scenarioInputs flake)
  • `npx eslint` on touched files → clean
  • Manual: `COCO_PREFETCH=rs,go` downloads + caches + both parsers produce expected output

#933 status

After this merges: 1.0 ✓ · 1.1 ✓ · 2 ✓ · 3 ✓ · 5 ✓ · 6 ✓

Only phase 7 (polish) remains: first-use Y/n prompt, `coco cache` subcommand, eval-harness baseline, telemetry on download failures.

Refs #933.

…5 + 6)

Two new languages on the lazy-load path established in phase 3.
Mechanical extension: same shape as Python, no new infra. Users
opt in via `COCO_PREFETCH=rs` (or `rust`) and `COCO_PREFETCH=go`
(or `golang`); the first invocation downloads, verifies, and caches
the .wasm, and every subsequent run reuses it. Falls through to the
regex parser cleanly when the cache is empty.

## Manifest additions

- **Rust**: `tree-sitter-rust@0.24.0` from jsdelivr (~1.1 MB).
- **Go**: `tree-sitter-go@0.25.0` from jsdelivr (~212 KB).

Both pinned with sha-256 hashes verified at download time.

## Type-union extensions

Single line each:
- `LazyTreeSitterLanguageId` → adds `'rust' | 'go'`
- `TreeSitterLanguageId` → adds `'rust' | 'go'`
- `resolveWasmLocations` → routes both through `getCachedWasmPath()`

## New extractors

### `rustTreeSitterParser.ts`

Per-line AST walk recognizing:
- `function_item` → function (with optional pub visibility)
- `struct_item`, `enum_item`, `trait_item` → class/enum/trait
- `impl_item` → impl (`Trait for Type` and bare `Type` shapes)
- `type_item` → type aliases
- `const_item` / `static_item` → ALL_CAPS constants
- `mod_item` → module declarations

Visibility detection via the `visibility_modifier` child node, which
picks up `pub`, `pub(crate)`, `pub(super)`, `pub(in path)` uniformly
without regex-level pattern enumeration.
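
A rough sketch of the child-node check described above. `SyntaxNodeLike` is a hypothetical structural stand-in for a tree-sitter node (the real extractor presumably works on the parser library's own node type), and `rustVisibility` is an illustrative name, not a function from this PR.

```typescript
// Minimal structural stand-in for a tree-sitter syntax node (assumption).
interface SyntaxNodeLike {
  type: string;
  text: string;
  children: SyntaxNodeLike[];
}

// Returns the visibility text ("pub", "pub(crate)", ...) or undefined for a
// private item. Every visibility form arrives as one visibility_modifier
// child, so no per-form pattern list is needed.
function rustVisibility(item: SyntaxNodeLike): string | undefined {
  return item.children.find((c) => c.type === "visibility_modifier")?.text;
}
```

Because the grammar already groups all `pub(...)` variants under one node type, the extractor only has to ask whether the child exists.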

### `goTreeSitterParser.ts`

Per-line AST walk recognizing:
- `function_declaration` → function (export gate: uppercase
  first letter, matching Go convention)
- `method_declaration` → method, rendered as `Receiver.method`.
  Walks receiver → `parameter_declaration` → `pointer_type` |
  `type_identifier` to extract the bare type name. Mirrors the
  regex extractor's existing convention.
- `type_declaration` → `class` for struct, `interface` for
  interface, `type` for aliases. Discriminates by inspecting
  the inner `type_spec` shape.
- `var_declaration` / `const_declaration` → const (works for
  both single-line and block-form `var ( ... )`).
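
The receiver walk and the export gate can be sketched as follows. This is an illustration only: `SyntaxNodeLike` is a hypothetical stand-in for a tree-sitter node, and `receiverTypeName`, `isExportedGoName`, and `renderMethod` are names invented for the example.

```typescript
// Minimal structural stand-in for a tree-sitter syntax node (assumption).
interface SyntaxNodeLike {
  type: string;
  text: string;
  children: SyntaxNodeLike[];
}

// Walks receiver -> parameter_declaration -> pointer_type | type_identifier
// and returns the bare type name, e.g. "Widget" for `(w *Widget)`.
function receiverTypeName(receiver: SyntaxNodeLike): string | undefined {
  const param = receiver.children.find((c) => c.type === "parameter_declaration");
  if (!param) return undefined;
  for (const child of param.children) {
    if (child.type === "type_identifier") return child.text;
    if (child.type === "pointer_type") {
      return child.children.find((c) => c.type === "type_identifier")?.text;
    }
  }
  return undefined;
}

// Go exports an identifier when its first character is an upper-case letter.
function isExportedGoName(name: string): boolean {
  return /^\p{Lu}/u.test(name);
}

// Renders a method in the `Receiver.method` style shown in the smoke output.
function renderMethod(receiver: SyntaxNodeLike, methodName: string): string | undefined {
  const typeName = receiverTypeName(receiver);
  return typeName ? `${typeName}.${methodName}()` : undefined;
}
```

Stripping the pointer keeps `(w Widget)` and `(w *Widget)` methods grouped under the same type name.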

## Registry chains

- `rs: [treeSitterRustParser, regexRs]`
- `go: [treeSitterGoParser, regexGo]`

Tree-sitter preferred when wasm is cached; regex stays as the
lossless fallback.
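
The chain semantics can be pictured with a small sketch. It assumes a parser signals "surrender" by returning `undefined`, which matches how the follow-up commit describes the registry; the types and the `runChain` helper are hypothetical simplifications, not the registry's real interface.

```typescript
// Assumption: a parser returns undefined to surrender, letting the next
// parser in the chain take over.
type StructuralParser = (source: string) => string | undefined;

// Runs parsers in order and returns the first non-undefined result.
function runChain(chain: StructuralParser[], source: string): string | undefined {
  for (const parser of chain) {
    const result = parser(source);
    if (result !== undefined) return result;
  }
  return undefined; // every parser surrendered
}

// Stand-ins for the two chain members: tree-sitter surrenders when its wasm
// is not cached, and the regex parser then handles the file.
const makeTreeSitterStub = (wasmCached: boolean): StructuralParser =>
  (source) => (wasmCached ? `tree-sitter: ${source.length} chars` : undefined);
const regexStub: StructuralParser = (source) => `regex: ${source.length} chars`;
```

With this shape, adding a language is a matter of registering an ordered array, which is consistent with the "mechanical extension" framing above.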

## Prefetch alias additions

`rs` / `rust` → `rust`, `go` / `golang` → `go`.

`COCO_PREFETCH=all` now resolves to all three lazy-loaded
languages (python + rust + go).
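
One possible shape for the alias resolution, shown only as a sketch. The comma-separated form comes from the `COCO_PREFETCH=rs,go` smoke run, and the `py` / `python` aliases are the ones listed in the follow-up commit; the function name `resolvePrefetchTokens`, the case-insensitive matching, and the returned `unknown` list are assumptions for illustration.

```typescript
type LazyLanguage = "python" | "rust" | "go";

const ALL_LAZY_LANGUAGES: LazyLanguage[] = ["python", "rust", "go"];

const PREFETCH_ALIASES: Record<string, LazyLanguage> = {
  py: "python",
  python: "python",
  rs: "rust",
  rust: "rust",
  go: "go",
  golang: "go",
};

// Splits a comma-separated value, expands `all`, maps aliases to canonical
// ids, and collects anything unrecognised so the caller can warn about it.
function resolvePrefetchTokens(raw: string): { languages: LazyLanguage[]; unknown: string[] } {
  const languages = new Set<LazyLanguage>();
  const unknown: string[] = [];
  const tokens = raw
    .split(",")
    .map((t) => t.trim().toLowerCase()) // lowercasing is an assumption
    .filter((t) => t.length > 0);
  for (const token of tokens) {
    if (token === "all") {
      ALL_LAZY_LANGUAGES.forEach((l) => languages.add(l));
      continue;
    }
    const language = PREFETCH_ALIASES[token];
    if (language) languages.add(language);
    else unknown.push(token);
  }
  return { languages: Array.from(languages), unknown };
}
```

Using a `Set` means `rs,rust` resolves to a single download instead of two.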

## Side fix

Bumped the second `scenarioInputs` test ("extracts the same
byte-identical fixture set across runs") to a 15s timeout, for the
same reason as the first one; this version also spins up TWO temp
repos serially, so the cost is roughly double. Phase 3 only got
the first test; this catches the sibling.

## Validation

- `npx tsc --noEmit` → 0 errors
- `npm run test:jest` → 1670/1670 pass (4 of 5 consecutive runs
  clean; the 1 flake was scenarioInputs hitting the 15s budget
  under heavy parallel load, a pre-existing pattern)
- `npx eslint` on touched files → clean
- Manual end-to-end:
  ```
  $ COCO_PREFETCH=rs,go tsx <smoke>
  · Rust: downloading https://cdn.jsdelivr.net/npm/tree-sitter-rust@0.24.0/...
  ✓ Rust parser cached (1077 KB)
  · Go: downloading https://cdn.jsdelivr.net/npm/tree-sitter-go@0.25.0/...
  ✓ Go parser cached (212 KB)

  RUST: added: new_widget(), class Widget, impl Base for Widget.
        removed: legacy_widget().
  GO:   added: class Widget, New(), Widget.Name().
        removed: LegacyName().
  ```
  Both correctly classify struct vs. impl Trait-for-Type vs.
  function vs. method-with-receiver.

## Out of scope (phase 7)

- First-use interactive Y/n prompt
- `coco cache list / prefetch / clear` subcommand
- Eval-harness baseline comparison (regex vs. tree-sitter
  output diffs surfaced in the report)
- Telemetry on download / verify failures

#933 phases 1.0, 1.1, 2, 3, 5, 6 all merged. Phase 7 (polish) is
the only piece left for this issue.

Refs #933.
@gfargo gfargo merged commit 58333c4 into main May 14, 2026
5 of 7 checks passed
@gfargo gfargo deleted the feat/933-phase5-6-rust-go branch May 14, 2026 20:00
displayName: 'Rust',
version: '0.24.0',
wasmUrl: 'https://cdn.jsdelivr.net/npm/tree-sitter-rust@0.24.0/tree-sitter-rust.wasm',
sha256: 'f65f354215611fd94ad34134b3427eb3d58cbb745df7b6509ba722184db73d57',
displayName: 'Go',
version: '0.25.0',
wasmUrl: 'https://cdn.jsdelivr.net/npm/tree-sitter-go@0.25.0/tree-sitter-go.wasm',
sha256: '9504573f352b20be7f2f1911754d710622aedc15afff16d5ed8fb5645681aee7',
gfargo added a commit that referenced this pull request May 14, 2026
…phase 7) (#959)

Closes the tree-sitter integration feature (#933). Lazy-loaded
parsers gain a first-class CLI surface for cache management;
verbose mode surfaces a discoverability hint when the fast path
falls through to regex.

## New `coco cache` subcommands

Extends the existing `coco cache` (diff-summary info / clear)
with three tree-sitter subcommands:

- `coco cache parsers`: show every manifest language with its
  current cache status (cached size, or "not cached" plus the
  fetched-size estimate), version pin, and source URL. Footer
  summarizes total disk usage and quick-reference commands.
- `coco cache prefetch [languages...]`: download specific
  parsers (e.g. `coco cache prefetch py rs go` or
  `coco cache prefetch all`). When invoked with no args AND
  stdin is a TTY, opens an interactive checkbox picker. In
  non-interactive contexts (CI, pipes), no-arg invocations error
  out with usage hints instead of hanging on a prompt.
- `coco cache clear-parsers`: wipe `~/.cache/coco/tree-sitter/`.
  Idempotent; reports a per-language ✓ for each removed file.

Aliases mirror the `COCO_PREFETCH` env grammar: `py` / `python`,
`rs` / `rust`, `go` / `golang`, `all`.
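
The no-arg behaviour of `prefetch` boils down to a three-way decision, sketched below. `PrefetchPlan`, `planPrefetch`, and the usage-message wording are hypothetical; in a real handler the TTY flag would typically come from `process.stdin.isTTY`.

```typescript
type PrefetchPlan =
  | { kind: "download"; tokens: string[] }
  | { kind: "prompt" }
  | { kind: "usage-error"; message: string };

// Decides what `prefetch` should do before any network or prompt work starts.
function planPrefetch(tokens: string[], stdinIsTTY: boolean): PrefetchPlan {
  // Explicit languages always win, whether or not a terminal is attached.
  if (tokens.length > 0) return { kind: "download", tokens };
  // No args on a terminal: safe to open the interactive picker.
  if (stdinIsTTY) return { kind: "prompt" };
  // No args in CI or a pipe: fail fast instead of waiting on a prompt.
  return {
    kind: "usage-error",
    message: "No languages given. Try: coco cache prefetch py rs go, or coco cache prefetch all",
  };
}
```

Keeping the decision in a pure function makes the CI path easy to unit-test without faking a terminal.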

## Surrender telemetry

In verbose mode, when the language-aware fast path is enabled and
the parser chain falls through to LLM, emit a discoverability
hint:

  "Tree-sitter parser surrendered for 'python'; using regex
   fallback. Hint: `coco cache parsers` to inspect,
   `coco cache prefetch python` to enable."

Quiet on the default path; visible only when the user is
debugging summary quality. Hint copy adapts: bundled-language
surrenders (`ts` / `js`) point at `coco cache prefetch all`
because TS / TSX wasms are always shipped (the surrender is from
a parser-init failure, not a missing download); lazy-loaded
languages get a per-language prefetch hint.
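
A sketch of how the hint copy could adapt. The message text follows the example quoted above; the `surrenderHint` function, its options object, and the `BUNDLED_LANGUAGES` set are illustrative assumptions rather than the code in `summarizeLargeFiles.ts`.

```typescript
// Languages whose wasm ships with the package, per the description above.
const BUNDLED_LANGUAGES = new Set(["ts", "js"]);

interface SurrenderContext {
  verbose: boolean;
  hasTreeSitterParser: boolean; // false for regex-only languages
}

// Returns the hint to print, or undefined when nothing should be emitted.
function surrenderHint(language: string, ctx: SurrenderContext): string | undefined {
  // Stay quiet on the default path and for languages with no tree-sitter parser.
  if (!ctx.verbose || !ctx.hasTreeSitterParser) return undefined;
  // Bundled languages point at `all`; lazy-loaded ones get their own name.
  const target = BUNDLED_LANGUAGES.has(language) ? "all" : language;
  return (
    `Tree-sitter parser surrendered for '${language}'; using regex fallback. ` +
    "Hint: `coco cache parsers` to inspect, " +
    `\`coco cache prefetch ${target}\` to enable.`
  );
}
```

Returning `undefined` rather than an empty string lets the caller skip logging with a single check.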

## Implementation

### `cache.ts` (lazy-load cache module)

- New `getCachedParserStatus(language)` returns
  `{ language, cached, path, bytes?, mtime? }` for the table
  renderer + interactive picker.
- New `clearCachedParser(language)` unlinks the cached .wasm.
  Idempotent; returns `true` when a file was actually removed.
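
To make the two helpers concrete, here is a minimal sketch built on Node's `fs` module. It differs from the described API in taking an explicit `cacheDir` argument so it can run against a temporary directory, and the `tree-sitter-<language>.wasm` file naming is an assumption; only the returned field names come from the description above.

```typescript
import { mkdtempSync, statSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface CachedParserStatus {
  language: string;
  cached: boolean;
  path: string;
  bytes?: number;
  mtime?: Date;
}

// Assumption: one file per language, named tree-sitter-<language>.wasm.
function cachedWasmPath(cacheDir: string, language: string): string {
  return join(cacheDir, `tree-sitter-${language}.wasm`);
}

function isMissingFile(err: unknown): boolean {
  return (err as { code?: string }).code === "ENOENT";
}

// Reports whether the parser is cached, plus size and mtime when it is.
function getCachedParserStatus(cacheDir: string, language: string): CachedParserStatus {
  const path = cachedWasmPath(cacheDir, language);
  try {
    const stat = statSync(path);
    return { language, cached: true, path, bytes: stat.size, mtime: stat.mtime };
  } catch (err) {
    if (isMissingFile(err)) return { language, cached: false, path };
    throw err;
  }
}

// Idempotent: returns true only when a file was actually removed.
function clearCachedParser(cacheDir: string, language: string): boolean {
  try {
    unlinkSync(cachedWasmPath(cacheDir, language));
    return true;
  } catch (err) {
    if (isMissingFile(err)) return false;
    throw err;
  }
}

// Usage against a throwaway directory with a 16-byte placeholder file.
const demoDir = mkdtempSync(join(tmpdir(), "coco-cache-sketch-"));
writeFileSync(cachedWasmPath(demoDir, "go"), new Uint8Array(16));
const statusBefore = getCachedParserStatus(demoDir, "go");
const firstClear = clearCachedParser(demoDir, "go");
const secondClear = clearCachedParser(demoDir, "go");
const statusAfter = getCachedParserStatus(demoDir, "go");
```

Treating a missing file as a normal outcome rather than an error is what makes the clear operation safe to repeat.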

### `structuralParserRegistry.ts`

- New `hasTreeSitterParser(language)` lets the LLM fallthrough
  path know whether a tree-sitter parser is registered for the
  language; it is used by the surrender-telemetry hint. Doesn't
  expose internals; the caller just needs the boolean.

### `summarizeLargeFiles.ts`

- Surrender-telemetry block fires after the registry returns
  undefined and BEFORE the cache lookup. Only emits when the
  chain includes a tree-sitter parser, so regex-only languages
  don't get a misleading hint.

### `commands/cache/`

- `config.ts` gains the `CACHE_SUBCOMMANDS` enum and a positional
  `[languages..]` for prefetch. Yargs validates the subcommand
  set; unknown tokens get caught by the language resolver.
- `handler.ts` adds three new branches:
  - `parsers` calls `renderParsersTable`
  - `prefetch` resolves tokens via `parsePrefetchEnv`
    (reusing the env-var grammar), prompts when interactive,
    and delegates to `prefetchTreeSitterParsers`. Failed
    downloads → `process.exitCode = 1`.
  - `clear-parsers` walks every manifest entry, calls
    `clearCachedParser`, reports per-language status.

### `inquirerPrompts.ts`

- New `checkboxPrompt` helper. Same dynamic-import shim as the
  other prompts; reuses the codebase's standard pattern for ESM
  inquirer modules under ts-jest.

## Tests

4 new test cases in `handler.test.ts` cover the new subcommands:
`parsers` lists every manifest language, `prefetch` warns on
unknown tokens, `clear-parsers` reports no-op when empty AND
removes cached files when present.

Test isolation: each test sets `COCO_CACHE_DIR` to the same tmp
dir the existing tests use for `XDG_CACHE_HOME`, so the
tree-sitter cache lives inside the per-test sandbox.

## Manual validation

```
$ COCO_CACHE_DIR=/tmp/coco-phase7-smoke coco cache parsers
Tree-sitter parser cache

  Python   not cached          (448.0 KB when fetched)
  Rust     not cached          (1.05 MB when fetched)
  Go       not cached          (212.1 KB when fetched)

  cached: 0/3  total on disk: 0 B

$ coco cache prefetch py
· Python: downloading https://cdn.jsdelivr.net/.../tree-sitter-python.wasm…
✓ Python parser cached (447 KB)

Summary: 1 downloaded · 0 already cached · 0 failed

$ coco cache clear-parsers
✓ cleared Python

Cleared 1 parser(s) from ~/.cache/coco/tree-sitter/
```

## Validation

- `npx tsc --noEmit` → 0 errors
- `npm run test:jest` → 1674/1674 pass (3 of 4 consecutive runs
  clean, 1 flake on the pre-existing scenarioInputs timeout
  pattern)
- `npx eslint` on touched files → clean
- Manual: all four subcommands round-trip cleanly

## Out of scope (genuine future work)

- **Eval-harness side-by-side regex-vs-tree-sitter comparison
  in the report output**. Today the eval reports per-fixture
  outcomes but doesn't discriminate WHICH parser produced each
  summary. Surfacing the regex vs. tree-sitter delta requires
  registry injection at eval time (the harness builds its own
  parser chain instead of using the global). Reasonable
  follow-up; not gating on #933 closure.

## #933 status: feature complete

| Phase | Status |
|---|---|
| 1.0 — Registry abstraction | ✓ #950 |
| 1.1 — TS/TSX bundled | ✓ #955 |
| 2 — Polish + ESM jest + arrow-fn fixture | ✓ #956 |
| 3 — Lazy-load infra + Python | ✓ #957 |
| 5 — Rust | ✓ #958 |
| 6 — Go | ✓ #958 |
| **7 — Cache CLI + telemetry** | **this PR** |

Closes #933.