Symbol extractor doesn't track string-literal boundaries: code inside C# """...""" raw strings (and other multi-line strings) becomes phantom symbols — 52 fake class App rows on cdidx itself, top map entrypoint is a Lua fixture string #177

@Widthdom

Summary

The C# symbol extractor processes each file line by line with no awareness of string-literal boundaries, so any code-shaped line inside a raw string literal ("""...""") is treated as a real declaration. This is most common in test fixtures: cdidx's own tests/CodeIndex.Tests/ directory has 50+ phantom symbols across DbReaderTests.cs, ReferenceExtractorTests.cs, SymbolExtractorTests.cs, McpServerTests.cs, etc.

The same shape applies to other languages with multi-line strings (Python """...""", Rust r#"..."#, JavaScript template literals `...`, etc.) but C# is the worst case because:

  1. C# 11+ raw string literals ("""...""") make embedding multi-language code fixtures convenient and ubiquitous in test files.
  2. C#'s function regex is greedy enough to match shapes like def login(...) (Python) or function main() (Lua) that appear inside the string.

Concrete impact on cdidx:

  • symbols App --exact --lang csharp returns 52 rows. None are real C# classes — all 52 are public class App written inside C# raw string fixtures used to test indexing.
  • map's entrypoint heuristic ranks a function main from tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210 as the #1 entrypoint of the whole repo, because the Lua test fixture string function main() io.write("world") end got captured as a C# function symbol and matched the C# Main/main name hint.
  • outline of any test file with raw-string fixtures becomes unreadable — the outline lists def login(...), public class App, void main(), def run(), etc. interleaved with real [Fact] test methods.
  • definition App returns 52 phantom locations and 0 real ones.
  • unused flags real symbols whose only "callers" are these phantoms.

Repro

curl -fsSL https://raw.githubusercontent.com/Widthdom/CodeIndex/main/install.sh | bash
CDIDX=/root/.local/bin/cdidx

git clone https://github.com/Widthdom/CodeIndex /tmp/codeindex-src
"$CDIDX" /tmp/codeindex-src --db /tmp/codeindex.db

1. 52 phantom class App rows on cdidx itself

"$CDIDX" symbols App --db /tmp/codeindex.db --lang csharp --exact --count --limit 99999
# → 52

Spot-check the first row at tests/CodeIndex.Tests/DbReaderTests.cs:258:

InsertIndexedFile("src/app.cs", "csharp",
    """
    public class App
    {
        public bool Read() => OperatingSystem.IsWindows() || ...;
    }
    """);

public class App is inside a raw string literal that the test passes to InsertIndexedFile as fixture content. The C# extractor sees public class App on a line, matches the class regex, and records class App at that line. Same shape repeats 51 more times across the test files.

2. outline shows phantom symbols inline with real ones

"$CDIDX" outline tests/CodeIndex.Tests/ReferenceExtractorTests.cs --db /tmp/codeindex.db | head -25
# # tests/CodeIndex.Tests/ReferenceExtractorTests.cs  (csharp, 338 lines, 50 symbols)
#       1  using CodeIndex.Indexer;
#       3  namespace CodeIndex.Tests;
#       9  public class ReferenceExtractorTests
#      12      public void Extract_PythonCall_AssignsCallerContainer() : void
#      14      const string content = """ : string
#      15      def login(user, password): : def                    ← phantom (Python def inside C# string)
#      29      public void Extract_CsharpDefinitionLine_DoesNotBecomeReference() : void
#      31      const string content = """ : string
#      32      public class App                                    ← phantom (C# class inside C# string)
#      34      public void Run() : void                            ← phantom
#      52      public class Query                                  ← phantom
#      54      public void Run() : void                            ← phantom
#      ...

Of the 50 reported symbols, ~20 are real [Fact] methods + the test class + import + namespace; the rest are phantoms from string literal contents.

3. map ranks a fixture-string function main as the top entrypoint

"$CDIDX" map --db /tmp/codeindex.db | tail -10
# Entrypoints:
#   function   main                     tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210  [score 5]   ← Lua fixture string!
#   function   IsProjectPathArg         src/CodeIndex/Program.cs:95  [score 4]
#   function   RunMcp                   src/CodeIndex/Program.cs:98  [score 4]
#   function   ShowError                src/CodeIndex/Program.cs:106  [score 4]
#   class      App                      tests/CodeIndex.Tests/DbReaderTests.cs:258  [score 4]            ← also a phantom

tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210:

[Fact]
public void Extract_LuaCall_DetectsReferences()
{
    const string content = """
        function main()
            io.write("world")
            ...
        end
        """;
    var symbols = SymbolExtractor.Extract(1, "lua", content);
    ...
}

The phantom function main (Lua content inside a C# raw string) is recorded as a C# function symbol, matches the entrypoint heuristic's Main/main name hint, and receives the highest entrypoint score. That's the #1 entrypoint the AI consumer is told about.

4. definition App returns only phantoms

"$CDIDX" definition App --db /tmp/codeindex.db --lang csharp --exact
# → 52 rows, all from raw string literals; no real C# class App exists in cdidx.

If a real class App were ever added to the codebase, finding it via definition would require manually filtering out 52+ phantom locations.

Root cause

src/CodeIndex/Indexer/SymbolExtractor.cs:441-452 is the extraction loop:

var lines = content.Split('\n');
var symbols = new List<SymbolRecord>();
for (int i = 0; i < lines.Length; i++)
{
    var line = lines[i];
    var matchLine = lang == "csharp" ? StripLeadingCSharpAttributeLists(line) : line;
    foreach (var pattern in patterns)
    {
        var match = pattern.Regex.Match(matchLine);
        if (!match.Success) continue;
        ...

The loop has no concept of string-literal context. Every line is matched against every pattern for the file's language, whether or not the line sits inside a string. C# raw string literals ("""..."""), Python triple-quoted strings, Rust raw strings (r#"..."#), and JS template literals all leak their contents into the symbol table as if they were code.

C# is the most affected language because (a) the function regex is permissive (the patterns at SymbolExtractor.cs:81/94/100 accept lots of leading-token shapes) and (b) C# 11 raw string literals are the canonical idiom for embedding code fixtures.

The same pattern affects other languages whose patterns happen to match content inside multi-line strings. Python's class regex matches class App: inside Python """...""" strings; Rust's fn regex matches fn foo() { inside r#"..."#; etc.
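The failure mode can be reproduced in a few lines. The following is a minimal Python sketch, not the project's C# code: CLASS_RE is a hypothetical, simplified stand-in for the class pattern in SymbolExtractor.cs, and extract_classes mimics only the shape of the loop above (split on newlines, match each line, keep no string state).

```python
import re

# Hypothetical, simplified stand-in for a C# class pattern. The real patterns
# in SymbolExtractor.cs are more permissive than this one.
CLASS_RE = re.compile(r"^\s*(?:public|internal|private)?\s*class\s+(\w+)")

# A C# test file that passes a fixture to a helper as a raw string literal.
SOURCE = '''\
public class DbReaderTests
{
    void Seed() => InsertIndexedFile("src/app.cs", "csharp",
        """
        public class App
        {
        }
        """);
}
'''

def extract_classes(content):
    """Match every line against the pattern, with no string-literal state."""
    found = []
    for number, line in enumerate(content.split("\n"), start=1):
        match = CLASS_RE.match(line)
        if match:
            found.append((number, match.group(1)))
    return found

print(extract_classes(SOURCE))  # [(1, 'DbReaderTests'), (5, 'App')]
```

Line 5 is fixture text, yet it is reported alongside the real class on line 1, which is the same shape as the 52 class App rows.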

Why it matters

  • symbols / definition / inspect / outline / map lie. AI consumers asking "where is App defined?" get phantom answers pointing into string fixtures.
  • Entrypoint heuristic is unreliable. map's top entrypoint can be a fixture string in a test file — meaningless for first-pass orientation.
  • Hot/unused signals are polluted. Phantom symbols count toward unused, may match real symbol names elsewhere, and add noise to hotspots rankings (especially for short/common names like App, Query, Run, Counter, main).
  • Self-test risk: cdidx's own dogfooding workflow (CLAUDE.md says "the reviewer sees exactly what the code actually does") is partially undermined when a code-search tool finds 52 fake class App declarations inside its own test data and indexes them as real.
  • Silent. No warning, no degraded-confidence flag.

Suggested direction

Three approaches. Phases 1 and 2 increase in complexity; Phase 3 is a stopgap that needs no extractor change. Each is independently useful.

Phase 1 — strip C# raw string literal contents before extraction (cheap, language-specific)

Track """+ ... """+ (where the count of quotes opening = count of quotes closing) on a stateful pass over lines. When inside a raw string, replace the line with a placeholder (or skip the pattern matching for those lines). Scope of change: SymbolExtractor.Extract for lang == "csharp".

// Pseudocode
var inRawStringFor = 0;  // 0 = not in, N = waiting for N consecutive `"`
for (int i = 0; i < lines.Length; i++)
{
    var line = lines[i];
    if (inRawStringFor > 0) {
        if (LineClosesRawString(line, inRawStringFor)) inRawStringFor = 0;
        continue;  // skip pattern matching while inside a raw string body
    }
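    if (TryDetectRawStringOpening(line, out var quoteCount, out var openIndex, out var closesOnSameLine)) {
        if (!closesOnSameLine) inRawStringFor = quoteCount;
        // Keep the code before the opening quotes (e.g. `const string content =`)
        // so a real declaration on the opening line is still matched below.
        line = line[..openIndex];
    }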
    // existing per-pattern matching ...
}

This eliminates raw-string phantom symbols for C# at the cost of one stateful counter and a couple of helper functions. It does not cover C# verbatim strings (@"..."), which can also span lines and would need their own rule. The same approach generalizes to Python triple-quoted strings ("""...""" and '''...''') and Rust raw strings (r#"..."#).
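To make the quote-counting concrete, here is a runnable Python sketch of the Phase 1 idea. It is an illustration under assumptions, not the proposed C# patch: the name mask_csharp_raw_strings is hypothetical, and the sketch ignores comments, verbatim strings, and ordinary string literals that happen to contain three or more consecutive quotes. It blanks text inside raw strings but keeps text outside them, so line numbers and any declaration on the opening line are preserved.

```python
import re

QUOTE_RUN = re.compile(r'"{3,}')  # three or more consecutive double quotes

def mask_csharp_raw_strings(content):
    """Return the lines of `content` with raw-string bodies blanked out.

    The number of lines is preserved, so symbol line numbers stay valid.
    """
    masked = []
    open_quotes = 0  # 0 = outside a raw string, N = inside one opened by N quotes
    for line in content.split("\n"):
        kept = []
        pos = 0
        while True:
            if open_quotes == 0:
                opening = QUOTE_RUN.search(line, pos)
                if opening is None:
                    kept.append(line[pos:])  # rest of the line is ordinary code
                    break
                kept.append(line[pos:opening.start()])
                open_quotes = len(opening.group())
                pos = opening.end()
            else:
                # The literal closes at the first run of the same quote count.
                closing = line.find('"' * open_quotes, pos)
                if closing == -1:
                    break  # still inside the literal at end of line
                pos = closing + open_quotes
                open_quotes = 0
        masked.append("".join(kept))
    return masked

SAMPLE = '''\
public class RealTests
{
    const string content = """
        public class App
        """;
    public void Run() { var s = """inline"""; }
}
'''

for text in mask_csharp_raw_strings(SAMPLE):
    print(repr(text))
```

Running the existing patterns over the masked lines instead of the raw ones would leave public class RealTests and public void Run() visible while the fixture line public class App becomes empty. A literal opened with four quotes is only closed by four quotes, so an embedded """ does not end it early.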

Phase 2 — generic string-literal awareness across languages

Generalize Phase 1 to a pluggable per-language "string state machine" that knows each language's multi-line string syntax. The extractor invokes it before pattern matching to mask out string contents.
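One possible shape for such a pluggable table, sketched in Python rather than C#. Everything here is an assumption for illustration: the RULES table, the rule format (an opening pattern plus a function that derives the closing delimiter from the opening match), and the name mask_multiline_strings are hypothetical. The rules are deliberately naive: they ignore escapes, comments, nested template-literal expressions, and Rust raw strings written without hashes.

```python
import re

# Hypothetical rule table. Each rule is (opening pattern, function that maps
# the opening match to the literal closing delimiter).
RULES = {
    "csharp": [(re.compile(r'"{3,}'), lambda m: m.group())],
    "python": [(re.compile(r"\"{3}|'{3}"), lambda m: m.group())],
    "rust": [(re.compile(r'r(#+)"'), lambda m: '"' + m.group(1))],
    "javascript": [(re.compile(r"`"), lambda m: "`")],
}

def mask_multiline_strings(lang, content):
    """Blank out multi-line string bodies for `lang`, preserving line count."""
    rules = RULES.get(lang, [])  # unknown languages pass through unchanged
    masked, closer = [], None
    for line in content.split("\n"):
        kept, pos = [], 0
        while True:
            if closer is None:
                hits = [(m, close) for rx, close in rules
                        if (m := rx.search(line, pos))]
                if not hits:
                    kept.append(line[pos:])
                    break
                # The earliest opening delimiter on the line wins.
                m, close = min(hits, key=lambda h: h[0].start())
                kept.append(line[pos:m.start()])
                closer, pos = close(m), m.end()
            else:
                end = line.find(closer, pos)
                if end == -1:
                    break  # literal continues on the next line
                pos, closer = end + len(closer), None
        masked.append("".join(kept))
    return masked

py = 'FIXTURE = """\nclass App:\n    pass\n"""\nclass Real:\n    pass'
print(mask_multiline_strings("python", py))
# ['FIXTURE = ', '', '', '', 'class Real:', '    pass']

rs = 'let s = r#"\nfn foo() {}\n"#;\nfn real() {}'
print(mask_multiline_strings("rust", rs))
# ['let s = ', '', ';', 'fn real() {}']
```

Keeping the per-language knowledge in data rather than in the loop means a new language only needs a new table entry, and the same masking helper could be shared by symbol and reference extraction.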

Phase 3 — heuristic fallback (immediate, no extractor change)

For commands that surface symbols to humans / AI (map entrypoints, outline, definition ranking), add a post-processing filter: a symbol whose definition line is inside a """...""" block (detected by scanning the surrounding line context) gets demoted in ranking or marked with a lower-confidence flag. Less robust than Phase 1 but doesn't require re-indexing.
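A rough Python sketch of what such a filter could look like. The function names, the (name, line, score) tuple layout, and the penalty value are all hypothetical; cdidx's actual scoring is not shown in this issue. The detection is a parity heuristic: if an odd number of triple-quote runs appears before a line, the line is assumed to sit inside a raw string. It ignores differing quote counts and quotes in comments, which is why it suits demotion better than deletion.

```python
import re

QUOTE_RUN = re.compile(r'"{3,}')

def inside_triple_quotes(lines, line_number):
    """Heuristic: True if an odd number of quote runs precede the 1-based line."""
    runs_before = sum(len(QUOTE_RUN.findall(text))
                      for text in lines[: line_number - 1])
    return runs_before % 2 == 1

def demote_phantoms(symbols, lines, penalty=10):
    """Lower the score of symbols that appear to sit inside a raw string.

    `symbols` is a list of (name, line_number, score) tuples for one file.
    Returns (name, line_number, adjusted_score, suspect) sorted best-first.
    """
    ranked = []
    for name, line_number, score in symbols:
        suspect = inside_triple_quotes(lines, line_number)
        ranked.append((name, line_number,
                       score - penalty if suspect else score, suspect))
    return sorted(ranked, key=lambda entry: -entry[2])

FILE_LINES = '''\
[Fact]
public void Extract_LuaCall_DetectsReferences()
{
    const string content = """
        function main()
        end
        """;
}
'''.split("\n")

symbols = [("main", 5, 5), ("Extract_LuaCall_DetectsReferences", 2, 4)]
for entry in demote_phantoms(symbols, FILE_LINES):
    print(entry)
# ('Extract_LuaCall_DetectsReferences', 2, 4, False)
# ('main', 5, -5, True)
```

Because the filter only needs the file text and the stored line number, it could run at query time against an existing index.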

Expected impact after fix

On cdidx itself:

  • symbols App --exact --lang csharp --count drops from 52 to 0 (no real class App exists).
  • outline tests/CodeIndex.Tests/ReferenceExtractorTests.cs shows ~20 real symbols instead of 50.
  • map's top entrypoint is no longer a Lua fixture string.
  • definition / inspect results stop pointing into string literals.

On real-world C# corpora (Roslyn, ASP.NET Core), test files such as Roslyn's *Tests.cs that embed C#/VB source in raw string fixtures should similarly stop generating phantom symbols.

Scope

  • src/CodeIndex/Indexer/SymbolExtractor.cs — add raw-string-aware preprocessing pass for C# (and Python / Rust as Phase 2).
  • src/CodeIndex/Indexer/ReferenceExtractor.cs — verify whether reference extraction has the same problem (it likely does — Run() calls inside string fixtures probably get recorded as references). The same preprocessing helper can be shared.
  • tests/CodeIndex.Tests/SymbolExtractorTests.cs — fixtures that confirm:
    • class Foo { /* code */ } inside """...""" is NOT extracted as a class.
    • def foo(): inside Python """...""" is NOT extracted.
    • Real C# classes / methods on adjacent lines ARE still extracted (no regression).
    • Both """...""" (raw string with 3 quotes) and """"..."""" (4 or more quotes; C# allows an arbitrary count) are handled.

Related

Same family — per-line regex extraction that doesn't account for string-literal context. This issue is the most impactful instance because it compounds across every test file that uses raw-string fixtures.

Environment

  • cdidx: v1.10.0 (installed via install.sh).
  • Repro corpus: Widthdom/CodeIndex@main itself.
  • Platform: linux-x64 container.
  • Filed from a cloud Claude Code session per CLOUD_BOOTSTRAP_PROMPT.md.
