feat(chunker): migrate to official tree-sitter via cgo#80

Open
dvcdsys wants to merge 2 commits into develop from feat/chunker-cgo-treesitter

@dvcdsys (Owner) commented Jun 7, 2026

What

Replaces the pure-Go gotreesitter parser with the official tree-sitter (github.com/tree-sitter/go-tree-sitter v0.25) compiled via cgo.

Why

gotreesitter produced ERROR trees and ~650× slowdowns on large valid TypeScript files and had a C enum GLR regression. Measured on vscode files:

| File | gotreesitter | official (cgo) |
| --- | --- | --- |
| editorOptions.ts (250 KB) | 8.5 s → ERROR | 13 ms → 0 errors |
| extHostTypeConverters.ts | 23.8 s → ERROR | 9 ms → 0 errors |
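
For reference, the speedup factors implied by these timings can be recomputed with a few lines of Go. This is only arithmetic on the numbers quoted above; the helper name `speedup` is made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// speedup returns how many times longer `slow` took than `fast`.
// Hypothetical helper, used only for this illustration.
func speedup(slow, fast time.Duration) float64 {
	return float64(slow) / float64(fast)
}

func main() {
	// Timings copied from the measurements quoted in this description.
	fmt.Printf("editorOptions.ts: %.0fx\n",
		speedup(8500*time.Millisecond, 13*time.Millisecond))
	fmt.Printf("extHostTypeConverters.ts: %.0fx\n",
		speedup(23800*time.Millisecond, 9*time.Millisecond))
}
```

The first row is where the ~650× figure comes from; the second file's ratio is considerably larger.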

This was also the root cause of the prod 100 GB OOM when indexing microsoft/vscode (see [project_gotreesitter_oom] memory / earlier #76/#78 work): the C parser frees memory per tree, so its footprint stays bounded.

How

Grammars (31 languages, no regression):

  • 25 consumed as upstream Go modules — their bindings/go compiles parser.c via cgo from the module cache, so the C stays out of this repo.
  • 6 vendored in-tree under internal/chunker/tsgrammars/ because no usable Go binding exists: markdown, objc, scss, solidity (no bindings/go), sql (no committed generated parser.c), r (binding omits its external scanner → link failure). See tsgrammars/vendor.sh for the pins.

Chunker: moves to the new ts API (Kind() instead of Type(lang); a ParseWithOptions progress callback enforces the wall-clock deadline, replacing the deprecated cancellation flag; trees and parsers are released with Close()). The registry wraps the C TSLanguage via ts.NewLanguage, and languages without a grammar degrade to sliding-window instead of erroring. The SQL node map is updated to the DerekStride grammar (create_table/create_function rather than the *_statement names the old grammar exposed). The C-enum regression test is un-skipped and now passes.
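
The deadline-by-callback and sliding-window fallback ideas can be sketched in plain Go without cgo. This is a minimal illustration, not the code in this PR: `parseFunc`, `deadlineCallback`, `slidingWindow`, `chunk`, and the window sizes are all hypothetical names and values, and the toy parser stands in for the real tree-sitter binding. Falling back to sliding-window on a timed-out parse is an assumption of the sketch; the description above only states the fallback for languages without a grammar.

```go
package main

import (
	"fmt"
	"time"
)

// parseFunc stands in for a grammar-backed parser (hypothetical type).
// It is expected to poll shouldStop and abort once it returns true.
type parseFunc func(src []byte, shouldStop func() bool) (symbols []string, ok bool)

// deadlineCallback returns a progress callback that reports true once the
// wall-clock deadline has passed.
func deadlineCallback(deadline time.Time) func() bool {
	return func() bool { return time.Now().After(deadline) }
}

// slidingWindow splits src into chunks of at most `size` bytes, with
// consecutive chunks overlapping by `overlap` bytes.
func slidingWindow(src []byte, size, overlap int) [][]byte {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	var chunks [][]byte
	step := size - overlap
	for start := 0; start < len(src); start += step {
		end := start + size
		if end > len(src) {
			end = len(src)
		}
		chunks = append(chunks, src[start:end])
		if end == len(src) {
			break
		}
	}
	return chunks
}

// chunk uses the registered parser when one exists and it finishes in time.
// Otherwise it degrades to sliding-window chunks instead of returning an error.
// (Treating a timeout like a missing grammar is an assumption of this sketch.)
func chunk(registry map[string]parseFunc, lang string, src []byte, timeout time.Duration) (mode string, n int) {
	if parse, found := registry[lang]; found {
		if syms, ok := parse(src, deadlineCallback(time.Now().Add(timeout))); ok {
			return "ast", len(syms)
		}
	}
	return "sliding-window", len(slidingWindow(src, 16, 4))
}

func main() {
	registry := map[string]parseFunc{
		// Toy parser: yields one symbol unless it is told to stop.
		"go": func(src []byte, shouldStop func() bool) ([]string, bool) {
			if shouldStop() {
				return nil, false
			}
			return []string{"main"}, true
		},
	}
	src := []byte("package main\n\nfunc main() {}\n")
	fmt.Println(chunk(registry, "go", src, time.Second))    // registered language
	fmt.Println(chunk(registry, "cobol", src, time.Second)) // no grammar registered
}
```

Polling a callback lets the parser check the deadline at its own safe points, so a runaway parse is cut off without a shared cancellation flag.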

Build — Docker users notice nothing: both Dockerfiles already use the musl-based golang:1.25-alpine builder, so we add gcc musl-dev and link statically against musl (-linkmode external -extldflags -static, tags osusergo netgo). The binary stays loader-free:

  • CPU image stays on distroless/static-debian12 (verified: file reports "statically linked", container boots, /health returns {"status":"ok"}, all 31 languages register).
  • CUDA image stays on distroless/cc-debian13 (the static binary runs there unchanged).
  • Trade-off (agreed): binary grows ~41 MB → ~78 MB from the compiled grammar tables.
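
The /health check mentioned above could be scripted roughly as follows. This is a sketch under assumptions: `healthy` and `stubServer` are hypothetical helpers, and the stub server only stands in for a running container so the example is runnable on its own. The only detail taken from this PR is the expected body, {"status":"ok"}.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthy reports whether GET baseURL/health answers 200 with {"status":"ok"}.
func healthy(baseURL string) (bool, error) {
	resp, err := http.Get(baseURL + "/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Status string `json:"status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return resp.StatusCode == http.StatusOK && body.Status == "ok", nil
}

// stubServer imitates the health endpoint so the sketch runs without a
// container. Against a real image, pass the container's base URL instead.
func stubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status":"ok"}`)
	}))
}

func main() {
	srv := stubServer()
	defer srv.Close()

	ok, err := healthy(srv.URL)
	fmt.Println(ok, err)
}
```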

Testing

  • go test -race ./... — all packages green, 0 FAIL. All 31 languages validated by TestRegistry_NodeNamesMatchAST.
  • CLI path: full reindex of this repo (540 files) and of the vscode editor subtree (1271 files) with the local llama embedder. editorOptions.ts — which used to fall to sliding-window with zero symbols — now yields real symbols (cix def IEditorOptions[type], EditorBooleanOption/ApplyUpdateResult[class]).
  • External-projects path: full reindex of github.com/microsoft/vscode@main via POST /api/v1/projects/{hash}/reindex — the exact prod path that OOM'd. RSS stayed bounded ~2.1 GB (was 100 GB) throughout, productively chunking+embedding, cleanly force-stoppable. (Stopped before completion — full index is hours on a local embedder.)

Notes / follow-ups

  • The CUDA image must be built on the amd64 builder (make scout-cuda) — not buildable in this environment.
  • Observed pre-existing SQLITE_BUSY job-claim contention when a cix watch daemon runs concurrently with an index job — unrelated to this change, worth a separate look.

🤖 Generated with Claude Code

dvcdsys and others added 2 commits June 7, 2026 14:14
Replace the pure-Go gotreesitter parser with the official tree-sitter
(github.com/tree-sitter/go-tree-sitter v0.25) compiled via cgo. gotreesitter
produced ERROR trees and ~650x slowdowns on large valid TypeScript files (e.g.
vscode editorOptions.ts: 8.5s -> ERROR vs official 13ms -> 0 errors) and had a
C `enum` GLR regression; the official parser fixes both.

Grammars: 25 languages are consumed as upstream Go modules (their bindings/go
package compiles parser.c via cgo from the module cache, so the C stays out of
this repo). Six holdouts with no usable Go binding are vendored in-tree under
internal/chunker/tsgrammars/ (markdown, objc, scss, solidity, r — binding omits
its external scanner; sql — no committed generated parser.c). See vendor.sh.

- chunker.go: new ts API (Kind() vs Type(lang), ParseWithOptions progress
  callback for the wall-clock deadline replacing the deprecated cancellation
  flag, tree/parser Close()); registry factories wrap the C TSLanguage via
  ts.NewLanguage; languages without a vendored/module grammar degrade to
  sliding-window instead of erroring.
- sql node map updated to the DerekStride grammar (create_table/create_function
  rather than the *_statement names the old grammar exposed).
- C enum regression test un-skipped (now passes); all 31 languages validated by
  TestRegistry_NodeNamesMatchAST.

Build stays a static binary (cgo links the grammar C statically); CPU image
remains distroless/static. Binary grows ~41MB -> ~78MB from the compiled
grammar tables.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The chunker now uses cgo (tree-sitter grammars are C), so the images can no
longer build with CGO_ENABLED=0. Both Dockerfiles already use the musl-based
golang:1.25-alpine builder, so add gcc+musl-dev and link statically against
musl (-linkmode external -extldflags -static, tags osusergo netgo). The result
is still a loader-free static binary:

- CPU image stays on distroless/static-debian12 (verified: `file` reports
  "statically linked", container boots, /health returns {"status":"ok"}, all 31
  chunker languages register).
- CUDA image stays on distroless/cc-debian13; the static cix-server runs there
  unchanged (its glibc is only for the llama-server sidecar).

No runtime/behaviour change for operators — same base images, same healthcheck.
The binary grows ~41MB -> ~78MB from the compiled grammar tables.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>