Symbol extractor doesn't track string-literal boundaries: code inside C# """...""" raw strings (and other multi-line strings) becomes phantom symbols — 52 fake class App rows on cdidx itself, top map entrypoint is a Lua fixture string #177
Summary
The C# symbol extractor processes each file line by line with no awareness of string-literal boundaries, so any code-shaped line inside a raw string literal ("""...""") is treated as a real declaration. This is most common in test fixtures: cdidx's own tests/CodeIndex.Tests/ directory has 50+ phantom symbols across DbReaderTests.cs, ReferenceExtractorTests.cs, SymbolExtractorTests.cs, McpServerTests.cs, etc.
The same shape applies to other languages with multi-line strings (Python """...""", Rust r#"..."#, JavaScript template literals `...`, etc.) but C# is the worst case because:
C# 11+ raw string literals ("""...""") make embedding multi-language code fixtures convenient and ubiquitous in test files.
C#'s function regex is greedy enough to match shapes like def login(...) (Python) or function main() (Lua) that appear inside the string.
Concrete impact on cdidx:
symbols App --exact --lang csharp returns 52 rows. None are real C# classes — all 52 are public class App written inside C# raw string fixtures used to test indexing.
map's entrypoint heuristic ranks a function main from tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210 as the #1 entrypoint of the whole repo, because the Lua test fixture string function main() io.write("world") end got captured as a C# function symbol and matched the C# Main/main name hint.
outline of any test file with raw-string fixtures becomes unreadable — the outline lists def login(...), public class App, void main(), def run(), etc. interleaved with real [Fact] test methods.
definition App returns 5 phantom locations and 0 real ones.
unused flags real symbols whose only "callers" are these phantoms.
Repro
1. 52 phantom class App rows on cdidx itself
Spot-check the first row at tests/CodeIndex.Tests/DbReaderTests.cs:258:

    InsertIndexedFile("src/app.cs", "csharp", """
        public class App
        {
            public bool Read() => OperatingSystem.IsWindows() || ...;
        }
        """);

public class App is inside a raw string literal that the test passes to InsertIndexedFile as fixture content. The C# extractor sees public class App on a line, matches the class regex, and records class App at that line. Same shape repeats 51 more times across the test files.
2. outline shows phantom symbols inline with real ones

    "$CDIDX" outline tests/CodeIndex.Tests/ReferenceExtractorTests.cs --db /tmp/codeindex.db | head -25
    # # tests/CodeIndex.Tests/ReferenceExtractorTests.cs (csharp, 338 lines, 50 symbols)
    #   1  using CodeIndex.Indexer;
    #   3  namespace CodeIndex.Tests;
    #   9  public class ReferenceExtractorTests
    #  12  public void Extract_PythonCall_AssignsCallerContainer() : void
    #  14  const string content = """ : string
    #  15  def login(user, password): : def    ← phantom (Python def inside C# string)
    #  29  public void Extract_CsharpDefinitionLine_DoesNotBecomeReference() : void
    #  31  const string content = """ : string
    #  32  public class App                    ← phantom (C# class inside C# string)
    #  34  public void Run() : void            ← phantom
    #  52  public class Query                  ← phantom
    #  54  public void Run() : void            ← phantom
    # ...

Of the 50 reported symbols, ~20 are real [Fact] methods + the test class + import + namespace; the rest are phantoms from string literal contents.
3. map ranks a fixture-string function main as the top entrypoint

    "$CDIDX" map --db /tmp/codeindex.db | tail -10
    # Entrypoints:
    #   function main             tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210  [score 5]  ← Lua fixture string!
    #   function IsProjectPathArg src/CodeIndex/Program.cs:95                            [score 4]
    #   function RunMcp           src/CodeIndex/Program.cs:98                            [score 4]
    #   function ShowError        src/CodeIndex/Program.cs:106                           [score 4]
    #   class App                 tests/CodeIndex.Tests/DbReaderTests.cs:258             [score 4]  ← also a phantom

tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210:

    [Fact]
    public void Extract_LuaCall_DetectsReferences()
    {
        const string content = """
            function main()
                io.write("world")
                ...
            end
            """;
        var symbols = SymbolExtractor.Extract(1, "lua", content);
        ...
    }

The phantom function main (Lua content inside a C# raw string) matches the C# extractor's name=main heuristic and the file gets the highest entrypoint score. That's the #1 entrypoint the AI consumer is told about.
4. definition App returns only phantoms

    "$CDIDX" definition App --db /tmp/codeindex.db --lang csharp --exact
    # → 52 rows, all from raw string literals; no real C# class App exists in cdidx.

If a real class App were ever added to the codebase, finding it via definition would require manually filtering out 52+ phantom locations.
Root cause
src/CodeIndex/Indexer/SymbolExtractor.cs:441-452 is the extraction loop:
The loop has no concept of string-literal context. Every line is matched against every language pattern. C# raw string literals ("""...."""), Python triple-quoted strings, Rust raw strings (r#"..."#), JS template literals — all leak their contents into the symbol table as if they were code.
C# is the most affected language because (a) the function regex is permissive (the patterns at SymbolExtractor.cs:81/94/100 accept lots of leading-token shapes) and (b) C# 11 raw string literals are the canonical idiom for embedding code fixtures.
The same pattern affects other languages whose patterns happen to match content inside multi-line strings. Python's class regex matches class App: inside Python """...""" strings; Rust's fn regex matches fn foo() { inside r#"..."#; etc.
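As a rough illustration of the failure mode only, the sketch below runs a per-line class regex over a file that contains a raw-string fixture. It is a simplified stand-in, not the code in SymbolExtractor.cs: NaiveExtractor, ClassPattern and ExtractClasses are hypothetical names, and the regex is much simpler than the real patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, simplified stand-in for a line-by-line extractor.
public static class NaiveExtractor
{
    // Assumed pattern for illustration; the real cdidx patterns are not reproduced here.
    private static readonly Regex ClassPattern =
        new Regex(@"^\s*(?:public\s+)?class\s+(\w+)", RegexOptions.Compiled);

    // Matches every line against the pattern with no notion of string context.
    // Returned line numbers are 1-based.
    public static List<(int Line, string Name)> ExtractClasses(string content)
    {
        var found = new List<(int Line, string Name)>();
        var lines = content.Split('\n');
        for (var i = 0; i < lines.Length; i++)
        {
            var match = ClassPattern.Match(lines[i]);
            if (match.Success) found.Add((i + 1, match.Groups[1].Value));
        }
        return found;
    }
}
```

Fed a file whose line 1 is a real `public class Real` and whose line 4 is `public class App` inside a `"""` fixture, this sketch reports both, which is the phantom-symbol shape described above.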
Why it matters
symbols / definition / inspect / outline / map lie. AI consumers asking "where is App defined?" get phantom answers pointing into string fixtures.
Entrypoint heuristic is unreliable. map's top entrypoint can be a fixture string in a test file, which is meaningless for first-pass orientation.
Hot/unused signals are polluted. Phantom symbols count toward unused, may match real symbol names elsewhere, and add noise to hotspots rankings (especially for short/common names like App, Query, Run, Counter, main).
Self-test risk: cdidx's own dogfooding workflow (CLAUDE.md says "the reviewer sees exactly what the code actually does") is partially undermined when a code-search tool finds 52 fake class App declarations inside its own test data and indexes them as real.
Silent. No warning, no degraded-confidence flag.
Suggested direction
Three approaches, increasing in complexity. Each is independently useful.
Phase 1 — strip C# raw string literal contents before extraction (cheap, language-specific)
Track """+ ... """+ (where the count of quotes opening = count of quotes closing) on a stateful pass over lines. When inside a raw string, replace the line with a placeholder (or skip the pattern matching for those lines). Scope of change: SymbolExtractor.Extract for lang == "csharp".

    // Pseudocode
    var inRawStringFor = 0; // 0 = not in, N = waiting for N consecutive `"`
    for (int i = 0; i < lines.Length; i++)
    {
        var line = lines[i];
        if (inRawStringFor > 0)
        {
            if (LineClosesRawString(line, inRawStringFor)) inRawStringFor = 0;
            continue; // skip pattern matching while inside a raw string body
        }
        if (TryDetectRawStringOpening(line, out var quoteCount, out var closesOnSameLine))
        {
            if (!closesOnSameLine) inRawStringFor = quoteCount;
            continue;
        }
        // existing per-pattern matching ...
    }

This eliminates the phantom-symbols class entirely for C# at the cost of one stateful counter and a couple of helper functions. Same approach generalizes to Python triple-quoted strings ("""...""" and '''...''') and Rust raw strings (r#"..."#).
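One possible shape for the two helpers named in the Phase 1 pseudocode is sketched below. The helper names come from the pseudocode, but their bodies, the RawStringScanner class, QuoteRunLength and CodeLineIndices are assumptions made for illustration. The sketch deliberately ignores comments, verbatim strings such as `@""""`, and a closing line that opens another raw string.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers; names and signatures are assumptions, not cdidx code.
public static class RawStringScanner
{
    // Length of the run of '"' characters starting at index start.
    private static int QuoteRunLength(string line, int start)
    {
        var i = start;
        while (i < line.Length && line[i] == '"') i++;
        return i - start;
    }

    // Finds the first run of three or more quotes on the line. Reports the run
    // length and whether a run at least that long appears later on the same line.
    public static bool TryDetectRawStringOpening(string line, out int quoteCount, out bool closesOnSameLine)
    {
        quoteCount = 0;
        closesOnSameLine = false;
        for (var i = 0; i < line.Length; i++)
        {
            if (line[i] != '"') continue;
            var run = QuoteRunLength(line, i);
            if (run < 3) { i += run - 1; continue; } // ordinary quote(s), keep scanning
            quoteCount = run;
            for (var j = i + run; j < line.Length; j++)
            {
                if (line[j] != '"') continue;
                var closing = QuoteRunLength(line, j);
                if (closing >= run) { closesOnSameLine = true; break; }
                j += closing - 1;
            }
            return true;
        }
        return false;
    }

    // In a multi-line raw string the closing delimiter sits on its own line,
    // preceded only by whitespace, so only the start of the trimmed line is checked.
    public static bool LineClosesRawString(string line, int quoteCount)
    {
        return QuoteRunLength(line.TrimStart(), 0) >= quoteCount;
    }

    // Mirrors the pseudocode loop: returns zero-based indices of lines that
    // would still reach pattern matching.
    public static List<int> CodeLineIndices(string[] lines)
    {
        var kept = new List<int>();
        var inRawStringFor = 0;
        for (var i = 0; i < lines.Length; i++)
        {
            if (inRawStringFor > 0)
            {
                if (LineClosesRawString(lines[i], inRawStringFor)) inRawStringFor = 0;
                continue;
            }
            if (TryDetectRawStringOpening(lines[i], out var quoteCount, out var closesOnSameLine))
            {
                if (!closesOnSameLine) inRawStringFor = quoteCount;
                continue;
            }
            kept.Add(i);
        }
        return kept;
    }
}
```

Because the closing check compares run lengths, a `"""` line inside a four-quote raw string is treated as content rather than as the end of the literal. Like the pseudocode, this skips the opening line as well, so a declaration such as `const string content = """` would no longer be matched; whether that is acceptable is a design choice for the real fix.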
Phase 2 — generic string-literal awareness across languages
Generalize Phase 1 to a pluggable per-language "string state machine" that knows each language's multi-line string syntax. The extractor invokes it before pattern matching to mask out string contents.
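A minimal sketch of what such a pluggable contract could look like, with Python triple-quoted strings as the only implemented language. Every name here (IStringStateMachine, PythonTripleQuoteMachine, StringMasking) is hypothetical, and the Python tracker is simplified: it ignores escapes, comments, and triple quotes that appear inside ordinary one-line strings.

```csharp
#nullable enable
using System;

// Hypothetical per-language contract; not an existing cdidx type.
public interface IStringStateMachine
{
    // Feeds the next source line. Returns true when the line starts inside a
    // multi-line string and should be hidden from pattern matching.
    bool Advance(string line);
}

// Tracks Python """...""" and '''...''' strings across lines.
public sealed class PythonTripleQuoteMachine : IStringStateMachine
{
    private static readonly string[] Delimiters = { "\"\"\"", "'''" };
    private string? _open; // delimiter of the string currently open, or null

    public bool Advance(string line)
    {
        var startedInside = _open != null;
        var pos = 0;
        while (pos < line.Length)
        {
            if (_open != null)
            {
                // Look for the matching closing delimiter.
                var close = line.IndexOf(_open, pos, StringComparison.Ordinal);
                if (close < 0) break;
                pos = close + _open.Length;
                _open = null;
            }
            else
            {
                // Look for the earliest opening delimiter of either kind.
                var best = -1;
                string? which = null;
                foreach (var d in Delimiters)
                {
                    var idx = line.IndexOf(d, pos, StringComparison.Ordinal);
                    if (idx >= 0 && (best < 0 || idx < best)) { best = idx; which = d; }
                }
                if (which == null) break;
                _open = which;
                pos = best + which.Length;
            }
        }
        return startedInside;
    }
}

public static class StringMasking
{
    // Assumed registry keyed by language id; only Python is sketched here.
    public static IStringStateMachine? Create(string lang) =>
        lang == "python" ? new PythonTripleQuoteMachine() : null;

    // Replaces masked lines with empty strings so line numbers stay stable.
    public static string[] Mask(string lang, string[] lines)
    {
        var machine = Create(lang);
        if (machine == null) return lines;
        var result = new string[lines.Length];
        for (var i = 0; i < lines.Length; i++)
            result[i] = machine.Advance(lines[i]) ? "" : lines[i];
        return result;
    }
}
```

Masking to empty lines rather than removing them keeps the recorded line numbers of real symbols unchanged, so the existing per-line patterns could run on the masked array without other changes.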
Phase 3 — heuristic fallback (immediate, no extractor change)
For commands that surface symbols to humans / AI (map entrypoints, outline, definition ranking), add a post-processing filter: a symbol whose definition line is inside a """....""" block (detected by scanning the surrounding line context) gets demoted in ranking or marked with a lower-confidence flag. Less robust than Phase 1 but doesn't require re-indexing.
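The surrounding-context check could be as simple as the sketch below, which toggles an "inside" flag for each preceding line that carries an odd number of triple-quote runs. PhantomFilter and LooksInsideRawString are hypothetical names, and the heuristic is intentionally crude: it treats every run of three or more quotes as a delimiter, so a `"""` content line inside a four-quote raw string would mislead it.

```csharp
using System;

// Hypothetical post-processing helper; not part of cdidx.
public static class PhantomFilter
{
    // Counts runs of three or more consecutive '"' characters in a line.
    private static int CountQuoteRuns(string line)
    {
        var runs = 0;
        var i = 0;
        while (i < line.Length)
        {
            if (line[i] != '"') { i++; continue; }
            var start = i;
            while (i < line.Length && line[i] == '"') i++;
            if (i - start >= 3) runs++;
        }
        return runs;
    }

    // symbolLine is 1-based. Returns true when the lines above it leave a
    // raw string open, meaning the symbol is likely a phantom.
    public static bool LooksInsideRawString(string[] lines, int symbolLine)
    {
        var inside = false;
        for (var i = 0; i < symbolLine - 1 && i < lines.Length; i++)
        {
            // A line with an odd number of delimiter runs opens or closes a string.
            if (CountQuoteRuns(lines[i]) % 2 == 1) inside = !inside;
        }
        return inside;
    }
}
```

A caller such as the map entrypoint ranking could use the result to lower a score or attach a lower-confidence flag; how that is surfaced is left open here.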
Expected impact after fix
On cdidx itself:
symbols App --exact --lang csharp --count drops from 52 to 0 (no real class App exists).
outline tests/CodeIndex.Tests/ReferenceExtractorTests.cs shows ~20 real symbols instead of 50.
map's top entrypoint is no longer a Lua fixture string.
definition / inspect results stop pointing into string literals.
On real-world C# corpora (Roslyn, ASP.NET Core), test files like Roslyn's *Tests.cs that embed C#/VB source as fixture strings would similarly stop generating phantom symbols.
Scope
src/CodeIndex/Indexer/SymbolExtractor.cs — add raw-string-aware preprocessing pass for C# (and Python / Rust as Phase 2).
src/CodeIndex/Indexer/ReferenceExtractor.cs — verify whether reference extraction has the same problem (it likely does — Run() calls inside string fixtures probably get recorded as references). The same preprocessing helper can be shared.
tests/CodeIndex.Tests/SymbolExtractorTests.cs — fixtures that confirm:
class Foo { /* code */ } inside """...""" is NOT extracted as a class.
def foo(): inside Python """...""" is NOT extracted.
Real C# classes / methods on adjacent lines ARE still extracted (no regression).
Both """....""" (raw string with 3 quotes) and """"..."""" (with 4+ quotes — C# allows arbitrary count) are handled.
This bug belongs to the same family as the extractor's other regex false positives: per-line regex extraction that does not account for context, here string-literal context. It is the most impactful instance because it compounds across every test file that uses raw-string fixtures.
Environment
cdidx: v1.10.0 (installed via install.sh).
Repro corpus: Widthdom/CodeIndex@main itself.
Platform: linux-x64 container.
Filed from a cloud Claude Code session per CLOUD_BOOTSTRAP_PROMPT.md.
Related
if/for/while/switch blocks as function definitions #154: JS/TS keyword false positives (regex false-positive family, but at the line level not the string-literal level).
else if, case const Class() etc. as function if / function Class - negative lookahead missing else / case / more #163: Dart else if / case const Class() false positives (also line-level regex).
impl Trait for Struct captures the trait name as a class (hundreds of fake class Future / class From / class Default rows on tokio), and unsafe impl ... for ... is dropped entirely #169: Rust impl Trait for Struct mis-captures (also line-level).