C# wrapped constructor initializer : base(...) / : this(...) leaks phantom function base / function this symbols #331

@Widthdom

Summary

When a C# constructor's : base(...) or : this(...) initializer is placed on its own wrapped line (common in Allman-style formatting), SymbolExtractor treats that initializer line as a method declaration and emits a phantom function symbol named base or this. The C# method regex's returnType character class includes :, so a line consisting of whitespace, :, base or this, and a parenthesized argument list matches the method pattern with returnType=":" and name="base" or name="this". Only the wrapped form is affected: in the same-line form public Foo() : base(x) { }, the line is claimed by a preceding constructor match and the initializer is never matched on its own.

Repro

CDIDX=/root/.local/bin/cdidx
mkdir -p /tmp/dogfood/cs-ctor-chain && cat > /tmp/dogfood/cs-ctor-chain/C.cs <<'EOF'
namespace CtorChain;

public class Base
{
    public Base() { }
    public Base(int x) { }
    public Base(string s, int n) { }
}

public class Derived : Base
{
    // `: base(...)` on same line
    public Derived(int x) : base(x) { }

    // `: base(...)` on next line
    public Derived(string s)
        : base(s, 0)
    {
    }

    // `: this(...)` chain
    public Derived() : this(0) { }

    // `: this(...)` on next line
    public Derived(int a, int b)
        : this(a)
    {
    }

    // Expression-bodied constructor
    public Derived(double d) : base("d", (int)d) => System.Console.WriteLine(d);
}
EOF
"$CDIDX" index /tmp/dogfood/cs-ctor-chain
"$CDIDX" symbols --path "cs-ctor-chain/*" --kind function

Observed (phantoms mixed in with real ctors):

function   base                                     C.cs:17-19
function   this                                     C.cs:26-28

base and this are reserved C# keywords, so neither can be a method name (short of the verbatim @base / @this form). Any function base or function this row is a false positive.
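A minimal Python sketch of a check for such rows, in case it is useful as a regression guard. It assumes the symbols output keeps the three whitespace-separated columns seen above (kind, name, location); the helper name phantom_rows is hypothetical and not part of cdidx.

```python
# Names that can never be real C# method names (reserved keywords).
KEYWORD_NAMES = {"base", "this"}


def phantom_rows(symbols_output):
    """Return (name, location) for function rows whose name is a keyword."""
    hits = []
    for row in symbols_output.splitlines():
        fields = row.split()
        # Assumed layout: kind, name, location, separated by whitespace.
        if len(fields) == 3 and fields[0] == "function" and fields[1] in KEYWORD_NAMES:
            hits.append((fields[1], fields[2]))
    return hits


# The two rows reported in this issue.
observed = """\
function   base                                     C.cs:17-19
function   this                                     C.cs:26-28
"""
print(phantom_rows(observed))  # [('base', 'C.cs:17-19'), ('this', 'C.cs:26-28')]
```

Once the extractor is fixed, the same check run against the fixture's output should return an empty list.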

Suspected root cause

src/CodeIndex/Indexer/SymbolExtractor.cs:94 — the C# method-declaration row:

@"^\s*(?:(?:public|private|protected|internal|static|virtual|override|sealed|abstract|async|extern|new|unsafe|partial)\s+)*(?<returnType>\([^)]+\)|(?:global::)?[\w?.<>\[\],:]+)\s+(?<name>\w+)\s*(?:<[^>]+>)?\s*\((?<paren>[^)]*)\)"

The returnType character class is [\w?.<>\[\],:]+ — it explicitly includes :. On a wrapped initializer line like : base(s, 0):

  • The leading : is consumed as returnType.
  • base is consumed as name (the pattern has no exclusion for reserved keywords).
  • (s, 0) is consumed as paren.
  • The full method pattern matches.
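The steps above can be reproduced outside the indexer. The sketch below is a Python transcription of the quoted pattern, not the extractor's own code: the only change is that .NET's (?<name>...) groups are written as (?P<name>...), and the helper name describe is hypothetical.

```python
import re

# Python transcription of the C# method row quoted in this issue.
CSHARP_METHOD = re.compile(
    r"^\s*(?:(?:public|private|protected|internal|static|virtual|override|sealed"
    r"|abstract|async|extern|new|unsafe|partial)\s+)*"
    r"(?P<returnType>\([^)]+\)|(?:global::)?[\w?.<>\[\],:]+)\s+"
    r"(?P<name>\w+)\s*(?:<[^>]+>)?\s*\((?P<paren>[^)]*)\)"
)


def describe(line):
    """Return (returnType, name, paren) if the line matches, else None."""
    m = CSHARP_METHOD.match(line)
    return (m["returnType"], m["name"], m["paren"]) if m else None


# The two wrapped initializer lines from the fixture.
print(describe("        : base(s, 0)"))  # (':', 'base', 's, 0')
print(describe("        : this(a)"))     # (':', 'this', 'a')
```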

The : most likely entered the char class to support global:: prefixes, but those are already covered by the explicit (?:global::)? prefix. For that purpose the stray : is redundant, and it is exactly what lets the wrapped initializer line match.

Suggested direction

Two approaches, either sufficient:

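(A) Drop : from the returnType char class. Change :94 from [\w?.<>\[\],:]+ to [\w?.<>\[\],]+. The global:: prefix is already handled by the explicit (?:global::)? prefix group. A quick audit of the other C# rows shows no other pattern depends on : inside returnType. One caveat: a return type qualified with an extern alias (alias::Type) currently matches only because of that :, so it would need its own optional prefix, for example (?:\w+::)?, if it must keep matching. Simplest, lowest-risk fix.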
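A rough way to compare the current and proposed char classes, again using a Python transcription of the pattern rather than the extractor itself. The helper names and the sample lines other than the two initializers are made up for illustration; MyAlias stands for a hypothetical extern alias.

```python
import re

MODIFIERS = (
    r"(?:(?:public|private|protected|internal|static|virtual|override|sealed"
    r"|abstract|async|extern|new|unsafe|partial)\s+)*"
)


def method_pattern(return_type_class):
    """Build the method row with the given returnType character class."""
    return re.compile(
        r"^\s*" + MODIFIERS
        + r"(?P<returnType>\([^)]+\)|(?:global::)?" + return_type_class + r"+)\s+"
        + r"(?P<name>\w+)\s*(?:<[^>]+>)?\s*\((?P<paren>[^)]*)\)"
    )


CURRENT = method_pattern(r"[\w?.<>\[\],:]")   # as quoted in this issue
PROPOSED = method_pattern(r"[\w?.<>\[\],]")   # approach (A): ':' removed


def name_of(pattern, line):
    """Return the captured method name, or None if the line does not match."""
    m = pattern.match(line)
    return m["name"] if m else None


samples = [
    "        : base(s, 0)",                         # phantom today
    "        : this(a)",                            # phantom today
    "    public int Add(int a, int b)",             # ordinary method
    "    public global::System.String Describe()",  # global:: return type
    "    public MyAlias::Widget Build()",           # extern-alias return type
]
for line in samples:
    print(line.strip(), "|", name_of(CURRENT, line), "|", name_of(PROPOSED, line))
```

In this sketch the proposed class stops matching both initializer lines while Add and Describe are still captured. The alias-qualified Build line stops matching as well, which is the one behavioral difference worth deciding on before dropping the :.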

(B) Skip lines whose first non-whitespace character is :. Add a guard at the extraction loop (:441-452) that short-circuits lines matching ^\s*:. Also catches any future regex that becomes permissive in the same direction. Slightly safer but adds a per-line check.
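The guard in (B) could look roughly like the Python sketch below. The real extraction loop is C# and is not shown in this issue, so the function name lines_to_scan and the loop shape are assumptions; only the ^\s*: test comes from the text above.

```python
import re

# A line whose first non-whitespace character is ':'.
INITIALIZER_CONTINUATION = re.compile(r"^\s*:")


def lines_to_scan(source):
    """Yield (line_number, line) pairs, skipping lines that start with ':'."""
    for number, line in enumerate(source.splitlines(), start=1):
        if INITIALIZER_CONTINUATION.match(line):
            continue  # wrapped ': base(...)' / ': this(...)' continuation
        yield number, line


# The wrapped constructor from the fixture.
fixture = """\
    public Derived(string s)
        : base(s, 0)
    {
    }
"""
kept = [number for number, _ in lines_to_scan(fixture)]
print(kept)  # [1, 3, 4]
```

The same guard would also skip other continuation lines that start with :, such as a wrapped base-type list or the else branch of a wrapped conditional expression. None of those are declarations, so skipping them should be harmless.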

Preferred: (A). It removes the root cause from the regex where it originated.

An auxiliary hardening worth considering regardless of which approach is picked: exclude base and this (and new, also a reserved keyword that is never a method name) from the name capture with a negative lookahead, (?<name>(?!(?:base|this|new)\b)\w+), so that a future change that makes the pattern more permissive cannot reintroduce this class of phantom.
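A small Python sketch of the lookahead, applied to a transcription of the quoted pattern with the : deliberately left in the char class, so that only the lookahead is under test. The keyword list includes new as the prose suggests; the sample lines other than the initializer are invented for illustration.

```python
import re

# Method row with a keyword-excluding lookahead on the name capture.
HARDENED = re.compile(
    r"^\s*(?:(?:public|private|protected|internal|static|virtual|override|sealed"
    r"|abstract|async|extern|new|unsafe|partial)\s+)*"
    r"(?P<returnType>\([^)]+\)|(?:global::)?[\w?.<>\[\],:]+)\s+"
    r"(?P<name>(?!(?:base|this|new)\b)\w+)\s*(?:<[^>]+>)?\s*\((?P<paren>[^)]*)\)"
)


def name_of(line):
    """Return the captured method name, or None if the line does not match."""
    m = HARDENED.match(line)
    return m["name"] if m else None


print(name_of("        : base(s, 0)"))            # None
print(name_of("        : this(a)"))               # None
print(name_of("        return new(1);"))          # None
print(name_of("    public int baseline(int x)"))  # baseline
```

The \b matters: without it, legitimate names that merely start with a keyword, such as baseline, would be rejected too.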

Cross-language note

Only C# has a method regex with : in the returnType char class; I spot-checked the Java, TypeScript, and Rust rows in SymbolExtractor.cs and none permit : there. Languages whose method syntax genuinely uses : (TypeScript return-type annotations, Python) use dedicated regexes, not a returnType char class on the declarator.

Scope

  • Affected: src/CodeIndex/Indexer/SymbolExtractor.cs:94 (C# method regex), downstream consumers (symbols, definition, references, callers, callees, inspect).
  • Not affected: same-line : base(...) / : this(...) (because the preceding method regex already consumed the constructor on that line — though note this is a happy accident, not a designed interaction).
  • Expression-bodied constructors with same-line initializer (public Ctor(d) : base(x) => ...) are also not affected — same "already consumed" accident.
  • Ctors and dtors themselves are captured correctly; this is purely a phantom-emission bug.
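To see how the method row alone behaves on the same-line forms, the Python transcription of the quoted pattern can be run against the fixture lines. This covers the pattern in isolation only; which row in SymbolExtractor actually claims these lines is not visible from this issue.

```python
import re

# Python transcription of the C# method row quoted in this issue.
CSHARP_METHOD = re.compile(
    r"^\s*(?:(?:public|private|protected|internal|static|virtual|override|sealed"
    r"|abstract|async|extern|new|unsafe|partial)\s+)*"
    r"(?P<returnType>\([^)]+\)|(?:global::)?[\w?.<>\[\],:]+)\s+"
    r"(?P<name>\w+)\s*(?:<[^>]+>)?\s*\((?P<paren>[^)]*)\)"
)


def captured_name(line):
    """Return the name the method row captures for this line, or None."""
    m = CSHARP_METHOD.match(line)
    return m["name"] if m else None


same_line = [
    "    public Derived(int x) : base(x) { }",
    "    public Derived() : this(0) { }",
]
wrapped = "        : base(s, 0)"

for line in same_line:
    print(captured_name(line))  # Derived
print(captured_name(wrapped))   # base
```

Because the pattern is anchored with ^\s*, a match can only begin at the start of the line, so in this sketch the same-line forms yield Derived (with public landing in returnType) and never base or this. The wrapped line is the only one where the initializer sits at the start of a line.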

Related

Environment

  • cdidx: v1.10.0 (/root/.local/bin/cdidx)
  • OS: Linux 4.4.0
  • Fixture: /tmp/dogfood/cs-ctor-chain/C.cs
