Symbol extractor doesn't track string-literal boundaries: code inside C# `"""..."""` raw strings (and other multi-line strings) becomes phantom symbols — 52 fake `class App` rows on cdidx itself, top `map` entrypoint is a Lua fixture string

## Summary

The C# symbol extractor processes the file line-by-line with no awareness of string-literal boundaries, so any code-shaped line **inside a raw string literal** (`"""..."""`) is treated as a real declaration. This is most common in test fixtures: cdidx's own `tests/CodeIndex.Tests/` directory has 50+ phantom symbols across `DbReaderTests.cs`, `ReferenceExtractorTests.cs`, `SymbolExtractorTests.cs`, `McpServerTests.cs`, etc.

The same shape applies to other languages with multi-line strings (Python `"""..."""`, Rust `r#"..."#`, JavaScript template literals `` `...` ``, etc.) but C# is the worst case because:

1. C# 11+ raw string literals (`"""..."""`) make embedding multi-language code fixtures convenient and ubiquitous in test files.
2. C#'s function regex is permissive enough to match shapes like `def login(...)` (Python) or `function main()` (Lua) that appear inside the string.

Concrete impact on cdidx:

- `symbols App --exact --lang csharp` returns **52** rows. None are real C# classes — all 52 are `public class App` written inside C# raw string fixtures used to test indexing.
- `map`'s entrypoint heuristic ranks a `function main` from `tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210` as the **#1 entrypoint** of the whole repo, because the Lua test fixture string `function main() io.write("world") end` got captured as a C# function symbol and matched the C# `Main`/`main` name hint.
- `outline` of any test file with raw-string fixtures becomes unreadable — the outline lists `def login(...)`, `public class App`, `void main()`, `def run()`, etc. interleaved with real `[Fact]` test methods.
- `definition App` returns 52 phantom locations and 0 real ones.
- `unused` flags real symbols whose only "callers" are these phantoms.

## Repro

```bash
curl -fsSL https://raw.githubusercontent.com/Widthdom/CodeIndex/main/install.sh | bash
CDIDX=/root/.local/bin/cdidx

git clone https://github.com/Widthdom/CodeIndex /tmp/codeindex-src
"$CDIDX" /tmp/codeindex-src --db /tmp/codeindex.db
```

### 1. 52 phantom `class App` rows on cdidx itself

```bash
"$CDIDX" symbols App --db /tmp/codeindex.db --lang csharp --exact --count --limit 99999
# → 52
```

Spot-check the first row at `tests/CodeIndex.Tests/DbReaderTests.cs:258`:

```csharp
InsertIndexedFile("src/app.cs", "csharp",
    """
    public class App
    {
        public bool Read() => OperatingSystem.IsWindows() || ...;
    }
    """);
```

`public class App` is **inside** a raw string literal that the test passes to `InsertIndexedFile` as fixture content. The C# extractor sees `public class App` on a line, matches the class regex, and records `class App` at that line. The same shape repeats 51 more times across the test files.

### 2. `outline` shows phantom symbols inline with real ones

```bash
"$CDIDX" outline tests/CodeIndex.Tests/ReferenceExtractorTests.cs --db /tmp/codeindex.db | head -25
# # tests/CodeIndex.Tests/ReferenceExtractorTests.cs  (csharp, 338 lines, 50 symbols)
#       1  using CodeIndex.Indexer;
#       3  namespace CodeIndex.Tests;
#       9  public class ReferenceExtractorTests
#      12      public void Extract_PythonCall_AssignsCallerContainer() : void
#      14      const string content = """ : string
#      15      def login(user, password): : def                    ← phantom (Python def inside C# string)
#      29      public void Extract_CsharpDefinitionLine_DoesNotBecomeReference() : void
#      31      const string content = """ : string
#      32      public class App                                    ← phantom (C# class inside C# string)
#      34      public void Run() : void                            ← phantom
#      52      public class Query                                  ← phantom
#      54      public void Run() : void                            ← phantom
#      ...
```

Of the 50 reported symbols, ~20 are real `[Fact]` methods + the test class + import + namespace; the rest are phantoms from string literal contents.

### 3. `map` ranks a fixture-string `function main` as the top entrypoint

```bash
"$CDIDX" map --db /tmp/codeindex.db | tail -10
# Entrypoints:
#   function   main                     tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210  [score 5]   ← Lua fixture string!
#   function   IsProjectPathArg         src/CodeIndex/Program.cs:95  [score 4]
#   function   RunMcp                   src/CodeIndex/Program.cs:98  [score 4]
#   function   ShowError                src/CodeIndex/Program.cs:106  [score 4]
#   class      App                      tests/CodeIndex.Tests/DbReaderTests.cs:258  [score 4]            ← also a phantom
```

`tests/CodeIndex.Tests/ReferenceExtractorTests.cs:210`:

```csharp
[Fact]
public void Extract_LuaCall_DetectsReferences()
{
    const string content = """
        function main()
            io.write("world")
            ...
        end
        """;
    var symbols = SymbolExtractor.Extract(1, "lua", content);
    ...
}
```

The phantom `function main` (Lua content inside a C# raw string) is recorded as a C# function symbol, its name matches `map`'s `main` entrypoint hint, and it receives the highest entrypoint score. That is the **#1 entrypoint** reported to an AI consumer.

### 4. `definition App` returns only phantoms

```bash
"$CDIDX" definition App --db /tmp/codeindex.db --lang csharp --exact
# → 52 rows, all from raw string literals; no real C# class App exists in cdidx.
```

If a real `class App` were ever added to the codebase, finding it via `definition` would require manually filtering out 52+ phantom locations.

## Root cause

`src/CodeIndex/Indexer/SymbolExtractor.cs:441-452` is the extraction loop:

```csharp
var lines = content.Split('\n');
var symbols = new List<SymbolRecord>();
for (int i = 0; i < lines.Length; i++)
{
    var line = lines[i];
    var matchLine = lang == "csharp" ? StripLeadingCSharpAttributeLists(line) : line;
    foreach (var pattern in patterns)
    {
        var match = pattern.Regex.Match(matchLine);
        if (!match.Success) continue;
        ...
```

The loop has no concept of string-literal context. Every line is matched against every pattern for the file's language. C# raw string literals (`"""..."""`), Python triple-quoted strings, Rust raw strings (`r#"..."#`), and JS template literals all leak their contents into the symbol table as if they were code.

C# is the most affected language because (a) the function regex is permissive (the patterns at `SymbolExtractor.cs:81/94/100` accept lots of leading-token shapes) and (b) C# 11 raw string literals are the canonical idiom for embedding code fixtures.

The same pattern affects other languages whose patterns happen to match content inside multi-line strings. Python's class regex matches `class App:` inside Python `"""..."""` strings; Rust's `fn` regex matches `fn foo() {` inside `r#"..."#`; etc.
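The failure mode can be reproduced in isolation. The sketch below is a minimal illustration, not the project's code: `NaiveLineExtractor` and its single `ClassPattern` regex are hypothetical stand-ins for the real loop and patterns in `SymbolExtractor.cs`, which are more elaborate.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveLineExtractor
{
    // Hypothetical stand-in for one extractor pattern (assumption, not the real regex).
    private static readonly Regex ClassPattern =
        new Regex(@"^\s*(?:public\s+)?class\s+(\w+)");

    // Matches each line independently, with no notion of string-literal state.
    // Returns 1-based line numbers paired with the captured class name.
    public static List<(int Line, string Name)> Extract(string content)
    {
        var found = new List<(int Line, string Name)>();
        var lines = content.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            var match = ClassPattern.Match(lines[i]);
            if (match.Success) found.Add((i + 1, match.Groups[1].Value));
        }
        return found;
    }
}
```

Given a file with a real `public class RealTests` on line 1 and a fixture line `public class App` on line 4 inside a raw string, `Extract` reports both, because nothing in the loop distinguishes code from string contents.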

## Why it matters

- **`symbols` / `definition` / `inspect` / `outline` / `map` lie.** AI consumers asking "where is `App` defined?" get phantom answers pointing into string fixtures.
- **Entrypoint heuristic is unreliable.** `map`'s top entrypoint can be a fixture string in a test file — meaningless for first-pass orientation.
- **Hot/unused signals are polluted.** Phantom symbols count toward `unused`, may match real symbol names elsewhere, and add noise to `hotspots` rankings (especially for short/common names like `App`, `Query`, `Run`, `Counter`, `main`).
- **Self-test risk:** cdidx's own dogfooding workflow (CLAUDE.md says "the reviewer sees exactly what the code actually does") is partially undermined when a code-search tool finds 52 fake `class App` declarations inside its own test data and indexes them as real.
- **Silent.** No warning, no degraded-confidence flag.

## Suggested direction

Three approaches. Each is independently useful: Phases 1 and 2 increase in complexity, and Phase 3 is a stopgap that needs no extractor change.

### Phase 1 — strip C# raw string literal contents before extraction (cheap, language-specific)

Track raw-string delimiters (a run of three or more `"` characters, where the closing run has the same length as the opening run) in a stateful pass over `lines`. When inside a raw string, replace the line with a placeholder (or skip the pattern matching for those lines). Scope of change: `SymbolExtractor.Extract` for `lang == "csharp"`.

```csharp
// Pseudocode
var inRawStringFor = 0;  // 0 = not in, N = waiting for N consecutive `"`
for (int i = 0; i < lines.Length; i++)
{
    var line = lines[i];
    if (inRawStringFor > 0) {
        if (LineClosesRawString(line, inRawStringFor)) inRawStringFor = 0;
        continue;  // skip pattern matching while inside a raw string body
    }
    if (TryDetectRawStringOpening(line, out var quoteCount, out var closesOnSameLine)) {
        if (!closesOnSameLine) inRawStringFor = quoteCount;
        continue;
    }
    // existing per-pattern matching ...
}
```

This eliminates raw-string phantom symbols for C# at the cost of one stateful counter and a couple of helper functions. The same approach generalizes to Python triple-quoted strings (`"""..."""` and `'''...'''`) and Rust raw strings (`r#"..."#`).
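The two helpers named in the pseudocode could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `RawStringMask`, `Compute`, `LeadingQuoteRun`, and `TrailingOpenerLength` are hypothetical names. It relies on two properties of multi-line C# raw strings: the opening quote run is the last thing on its line, and the closing run is the first non-whitespace on its line. It does not handle comments or verbatim strings such as `@""""`, which would need extra checks. Unlike the pseudocode above, the opening line is left unmasked, so a declaration such as `const string content = """` is still matched.

```csharp
using System;

public static class RawStringMask
{
    // inside[i] is true when lines[i] is a body line of a multi-line raw
    // string, or the line that carries its closing delimiter.
    public static bool[] Compute(string[] lines)
    {
        var inside = new bool[lines.Length];
        int openQuotes = 0; // 0 = not inside; N = waiting for a line starting with N+ quotes
        for (int i = 0; i < lines.Length; i++)
        {
            if (openQuotes > 0)
            {
                inside[i] = true;
                if (LeadingQuoteRun(lines[i]) >= openQuotes) openQuotes = 0;
                continue;
            }
            openQuotes = TrailingOpenerLength(lines[i]);
        }
        return inside;
    }

    // Number of consecutive quotes after the leading whitespace.
    private static int LeadingQuoteRun(string line)
    {
        int i = 0;
        while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
        int start = i;
        while (i < line.Length && line[i] == '"') i++;
        return i - start;
    }

    // Length of a 3+ quote run that ends the line (a multi-line opener), else 0.
    private static int TrailingOpenerLength(string line)
    {
        int pos = 0;
        while (pos < line.Length)
        {
            if (line[pos] != '"') { pos++; continue; }
            int runStart = pos;
            while (pos < line.Length && line[pos] == '"') pos++;
            int run = pos - runStart;
            if (run < 3) continue; // ordinary string quotes, not a raw delimiter
            if (string.IsNullOrWhiteSpace(line.Substring(pos))) return run;
            // Otherwise assume a single-line raw string and skip past its closing run.
            int close = line.IndexOf(new string('"', run), pos, StringComparison.Ordinal);
            if (close < 0) return 0;
            pos = close + run;
        }
        return 0;
    }
}
```

The extraction loop would then skip pattern matching for every index where `inside[i]` is true and leave all other lines untouched.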

### Phase 2 — generic string-literal awareness across languages

Generalize Phase 1 to a pluggable per-language "string state machine" that knows each language's multi-line string syntax. The extractor invokes it before pattern matching to mask out string contents.
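One possible shape for that plug-in point is sketched below. Everything here is an assumption made for illustration: the `IStringBodyMasker` interface, the `DelimiterToggleMasker` class, and the `StringMaskers.ByLanguage` registry do not exist in cdidx. The toggle masker only suits languages whose multi-line strings open and close with the same fixed token (Python's `"""` and `'''`). It ignores comments, escapes, and ordinary strings that contain the token. C# raw strings and Rust `r#"..."#` have variable-length delimiters and would need their own implementations.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical plug-in interface: one implementation per language.
public interface IStringBodyMasker
{
    // Returns true for each line that begins inside a multi-line string.
    bool[] Mask(string[] lines);
}

// Toggles state on fixed, non-empty delimiter tokens.
public sealed class DelimiterToggleMasker : IStringBodyMasker
{
    private readonly string[] _delimiters;

    public DelimiterToggleMasker(params string[] delimiters) => _delimiters = delimiters;

    public bool[] Mask(string[] lines)
    {
        var inside = new bool[lines.Length];
        int open = -1; // index into _delimiters of the open string, or -1 when outside
        for (int i = 0; i < lines.Length; i++)
        {
            inside[i] = open >= 0;
            int pos = 0;
            while (pos <= lines[i].Length)
            {
                if (open >= 0)
                {
                    // Look for the matching closing token.
                    int close = lines[i].IndexOf(_delimiters[open], pos, StringComparison.Ordinal);
                    if (close < 0) break;
                    pos = close + _delimiters[open].Length;
                    open = -1;
                }
                else
                {
                    // Look for the earliest opening token on the rest of the line.
                    int best = -1, bestAt = int.MaxValue;
                    for (int d = 0; d < _delimiters.Length; d++)
                    {
                        int at = lines[i].IndexOf(_delimiters[d], pos, StringComparison.Ordinal);
                        if (at >= 0 && at < bestAt) { best = d; bestAt = at; }
                    }
                    if (best < 0) break;
                    pos = bestAt + _delimiters[best].Length;
                    open = best;
                }
            }
        }
        return inside;
    }
}

public static class StringMaskers
{
    // Hypothetical registry keyed by the extractor's language id.
    public static readonly Dictionary<string, IStringBodyMasker> ByLanguage =
        new Dictionary<string, IStringBodyMasker>
        {
            ["python"] = new DelimiterToggleMasker("\"\"\"", "'''"),
        };
}
```

With a registry like this, the extractor would look up a masker by language before the pattern loop and fall back to matching every line when no masker is registered.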

### Phase 3 — heuristic fallback (immediate, no extractor change)

For commands that surface symbols to humans or AI consumers (`map` entrypoints, `outline`, `definition` ranking), add a post-processing filter: a symbol whose definition line is **inside a `"""..."""` block** (detected by scanning the surrounding line context) gets demoted in ranking or marked with a lower-confidence flag. This is less robust than Phase 1 but doesn't require re-indexing.
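A cheap version of that context scan is a parity check: count the `"""` tokens on the lines above the symbol, and treat an odd count as "probably inside a string". The sketch below assumes the file's lines are available at query time and that symbol line numbers are 1-based. `PhantomHeuristic` and `IsLikelyInsideTripleQuotes` are hypothetical names. Delimiters of six or more quotes, or `"""` appearing in comments, will mislead it, so it only suits demotion, not hard filtering.

```csharp
using System;

public static class PhantomHeuristic
{
    // True when an odd number of """ tokens appears above the symbol's line,
    // which suggests the line sits inside an unclosed triple-quoted block.
    public static bool IsLikelyInsideTripleQuotes(string[] fileLines, int symbolLine)
    {
        int delimiters = 0;
        int limit = Math.Min(symbolLine - 1, fileLines.Length);
        for (int i = 0; i < limit; i++)
            delimiters += CountOccurrences(fileLines[i], "\"\"\"");
        return delimiters % 2 == 1;
    }

    // Counts non-overlapping occurrences of token in text.
    private static int CountOccurrences(string text, string token)
    {
        int count = 0, pos = 0;
        while ((pos = text.IndexOf(token, pos, StringComparison.Ordinal)) >= 0)
        {
            count++;
            pos += token.Length;
        }
        return count;
    }
}
```

A command such as `map` could call this for each candidate and lower the rank of any symbol for which it returns true.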

## Expected impact after fix

On cdidx itself:

- `symbols App --exact --lang csharp --count` drops from **52** to **0** (no real `class App` exists).
- `outline tests/CodeIndex.Tests/ReferenceExtractorTests.cs` shows ~20 real symbols instead of 50.
- `map`'s top entrypoint is no longer a Lua fixture string.
- `definition` / `inspect` results stop pointing into string literals.

On real-world C# corpora (Roslyn, ASP.NET Core), test files like `Roslyn`'s `*Tests.cs` that embed C#/VB source as fixture strings would similarly stop generating phantom symbols.

## Scope

- `src/CodeIndex/Indexer/SymbolExtractor.cs` — add raw-string-aware preprocessing pass for C# (and Python / Rust as Phase 2).
- `src/CodeIndex/Indexer/ReferenceExtractor.cs` — verify whether reference extraction has the same problem (it likely does — `Run()` calls inside string fixtures probably get recorded as references). The same preprocessing helper can be shared.
- `tests/CodeIndex.Tests/SymbolExtractorTests.cs` — fixtures that confirm:
  - `class Foo { /* code */ }` inside `"""..."""` is NOT extracted as a class.
  - `def foo():` inside Python `"""..."""` is NOT extracted.
  - Real C# classes / methods on adjacent lines ARE still extracted (no regression).
  - Both `"""..."""` (raw string with 3 quotes) and `""""...""""` (with 4+ quotes; C# allows an arbitrary count) are handled.

## Related

- #154 — JS/TS keyword false positives (regex false-positive family, but at the line level not the string-literal level).
- #163 — Dart `else if` / `case const Class()` false positives (also line-level regex).
- #169 — Rust `impl Trait for Struct` mis-captures (also line-level).

All belong to the same family: per-line regex extraction that lacks surrounding context. This issue is the most impactful instance because the missing context is the string-literal boundary, and it compounds across every test file that uses raw-string fixtures.

## Environment

- cdidx: v1.10.0 (installed via `install.sh`).
- Repro corpus: `Widthdom/CodeIndex@main` itself.
- Platform: linux-x64 container.
- Filed from a cloud Claude Code session per `CLOUD_BOOTSTRAP_PROMPT.md`.

