feat(ingest): gate Phase-1 CGO parse on backend — ASTWalker serves file-level extracts (S4)#445
Conversation
…le-level extracts (S4) S4 (mache-33de70): until now the tree-sitter parse ran unconditionally in Phase-1 (engine_treesitter.go), BEFORE walker selection, so mounting a pre-parsed _ast db saved zero CGO — the parse + sitter ExtractContext / ExtractGoImports / ExtractFileLevelRefs always ran. This gates all of it on `e.astWalker == nil`: - parallel-worker Phase-1: skip parser.SetLanguage + ParseCtx + the three sitter extracts when an ASTWalker backend is present - processTreeSitterResult AST arm: serve context / imports / file-level-refs from SQL (ExtractContext, ExtractGoImports, ExtractFileLevelRefs — the last landed in #443) instead - single-file path (the mache-2y9w SIGSEGV source): same gate So on the ASTWalker path NO CGO runs in ingest — which also removes the mache-2y9w tree-sitter SIGSEGV surface for that path. Proven by the existing projection parity gate, now made genuine: emit an _imports table into sitterToASTDB (derived from the same parse, as leyline would) so the AST path's ExtractGoImports has the import set qualified-call resolution needs. TestASTQueryParity now ingests the fixture through the ASTWalker with the CGO parse SKIPPED and asserts byte-for-byte identical projection (nodes, children, content/context, callers via imports, Properties) vs the sitter path. Full ingest suite + -race green. SCOPE / honest limit: this is the engine MECHANISM. Production source ingestion does not call SetASTWalker yet (only the parity tests do), so CGO still runs in the default `mache <dir>` path until a follow-up wires SelectWalker / an _ast-db-detection step into the source-ingest entry. This PR makes that activation a one-line change with a parity-proven engine beneath it. Claude-Session: https://claude.ai/code/session_01FhfjttLFYPzgs3aBG45eBT
find_smells (advisory)Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables; fan_out_skew — 3 finding(s) in changed files
Rules: see |
There was a problem hiding this comment.
Pull request overview
This PR advances the ingest pipeline’s CGO-removal campaign by making Phase-1 tree-sitter parsing and related “file-level extracts” conditional on the selected backend. When an ASTWalker backend is present, Phase-1 skips the tree-sitter parse work and Phase-2 serves context, Go imports, and file-level refs from the mounted SQL _ast database instead; parity is reinforced by extending the AST parity test DB to include _imports.
Changes:
- Gate Phase-1 tree-sitter parse + sitter-based context/imports/file-level-ref extraction on
e.astWalker == nil(parallel + single-file paths). - On the ASTWalker path in
processTreeSitterResult, populatecontext/ Goimports/file-level refsvia SQL-backed extractors. - Extend the parity-test helper DB schema to emit/populate
_importsfrom the same sitter parse (Go), matching ASTWalker’s import-resolution expectations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/ingest/engine_treesitter.go | Gates tree-sitter parse/extracts on backend and serves equivalent extracts from AST SQL when using ASTWalker. |
| internal/ingest/ast_query_parity_test.go | Adds _imports table to the synthetic AST DB and populates it for Go using sitter-derived imports to preserve parity under the new gating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… activation plan The CGO-removal step-4 commitment point's ENGINE work has landed and is byte-parity-proven, but the ADR still read "gated on mache-33dc5f". Update it to reflect reality and to record where activation goes (answering "where should SetASTWalker be used?"): - Interface widening (S3, #431/#432/#433) + Phase-1 parse gating (S4, #445): the Engine skips the CGO parse and serves context/imports/file-level-refs from SQL when an ASTWalker backend is present; TestASTQueryParity proves byte-identical projection with the parse skipped. - What remains is activation + deletion, NOT the engine. Activation: insert a leyline-parse-first step in the read-only source-ingest fast path (cmd/mount.go, the SchemaUsesTreeSitter branch), then SetASTWalker / SelectWalker — because mache is its own parser at ingest time today, there is no _ast db to read until leyline produces one. Gated on leyline availability; absent -> today's tree-sitter path. ARCHITECTURE.md: same status correction in the standalone-path note + ADR table. The standalone "five rules" list is unchanged — duplicate_code (added this cycle) needs _ast, so it's a paired-path rule, not standalone. Claude-Session: https://claude.ai/code/session_01FhfjttLFYPzgs3aBG45eBT
…ree guard (Copilot #445) Two valid review catches: 1. The Phase-1 worker still called runtime.LockOSThread() + sitter.NewParser() (itself a CGO call) before the e.astWalker==nil gate, so the "no CGO runs in ingest" claim wasn't literally true on the AST path. Both are now gated on e.astWalker==nil — with an ASTWalker backend the worker allocates no parser and takes no thread pin, so the AST path is genuinely CGO-free. 2. The sequential single-file path lacked the parallel worker's nil-tree guard: ParseCtx can return (nil, nil) for empty input, which processTreeSitterResult would deref as result.tree.RootNode() and panic. Mirrored the guard so it routes to the BROKEN_ fallback instead. Build, full internal/ingest suite, and -race (parity + engine + reingest) all green. Claude-Session: https://claude.ai/code/session_01FhfjttLFYPzgs3aBG45eBT
… activation plan (#446) The CGO-removal step-4 commitment point's ENGINE work has landed and is byte-parity-proven, but the ADR still read "gated on mache-33dc5f". Update it to reflect reality and to record where activation goes (answering "where should SetASTWalker be used?"): - Interface widening (S3, #431/#432/#433) + Phase-1 parse gating (S4, #445): the Engine skips the CGO parse and serves context/imports/file-level-refs from SQL when an ASTWalker backend is present; TestASTQueryParity proves byte-identical projection with the parse skipped. - What remains is activation + deletion, NOT the engine. Activation: insert a leyline-parse-first step in the read-only source-ingest fast path (cmd/mount.go, the SchemaUsesTreeSitter branch), then SetASTWalker / SelectWalker — because mache is its own parser at ingest time today, there is no _ast db to read until leyline produces one. Gated on leyline availability; absent -> today's tree-sitter path. ARCHITECTURE.md: same status correction in the standalone-path note + ADR table. The standalone "five rules" list is unchanged — duplicate_code (added this cycle) needs _ast, so it's a paired-path rule, not standalone. Claude-Session: https://claude.ai/code/session_01FhfjttLFYPzgs3aBG45eBT
What
S4 (
mache-33de70) — the commitment point of the CGO-removal campaign. Until now the tree-sitter parse ran unconditionally in Phase-1, before walker selection, so mounting a pre-parsed_astdb saved zero CGO: the parse + sitterExtractContext/ExtractGoImports/ExtractFileLevelRefsalways ran.This gates all of it on
e.astWalker == nil:parser.SetLanguage+ParseCtx+ the three sitter extracts when an ASTWalker backend is presentprocessTreeSitterResultAST arm: servecontext/imports/file-level-refsfrom SQL instead (ExtractContext,ExtractGoImports,ExtractFileLevelRefs— the last landed in feat(ingest): ASTWalker.ExtractFileLevelRefs — pure-Go file-level ref parity (S4 prep) #443)mache-2y9wSIGSEGV source): same gateOn the ASTWalker path no CGO runs in ingest — which also removes the
mache-2y9wtree-sitter SIGSEGV surface for that path.Proof
The existing projection parity gate, now made genuine:
sitterToASTDBemits an_importstable (derived from the same parse, as leyline would) so the AST path'sExtractGoImportshas the import set qualified-call resolution needs.TestASTQueryParitynow ingests the fixture through the ASTWalker with the CGO parse skipped and asserts byte-for-byte identical projection (nodes, children, content/context, callers-via-imports, Properties) vs the sitter path. Fullinternal/ingestsuite +-racegreen.Scope / honest limit
This is the engine mechanism. Production source ingestion does not call
SetASTWalkeryet (only the parity tests do), so CGO still runs in the defaultmache <dir>path until a follow-up wiresSelectWalker/ an_ast-db-detection step into the source-ingest entry. This PR makes that activation a near one-line change with a parity-proven engine beneath it.🤖 Generated with Claude Code