C#/Java: nested-generic call sites new Dict<K, List<V>>() / Foo<Bar<int>>() are silently dropped from the reference index — the >> tail breaks the generic-arg regex #263

@Widthdom

Summary

Both ConstructorCallRegex and CallRegex in ReferenceExtractor.cs accept an optional generic argument block before the trailing (, but the inner character class is [^>\n]+, which stops at the first >. For any call site whose generic arguments contain a nested generic — i.e. whose argument list closes with >> or >>> — neither regex matches. No reference row is emitted.

Affected call forms (all idiomatic C# / Java):

  • new Dictionary<string, List<int>>() — constructor reference to Dictionary and inner type List both missed.
  • new List<Dictionary<string, int>>() — constructor reference to List missed.
  • new Dictionary<int, Dictionary<string, List<int>>>() — triple-nested, missed.
  • Helper.DoWork<List<int>>() — generic method call reference to DoWork missed.
  • Helper.Process<Dictionary<string, int>>() — generic method call reference to Process missed.

Sibling issue #222 covers the symbol-extraction side: method definitions whose return type contains a space-in-generic-args get dropped. This issue is the analogous bug on the reference-extraction side: call sites whose generic args contain a nested generic get dropped. Different regex, different code path, different symptom (0 references vs 0 definitions), same underlying shape.

Repro

CDIDX=/root/.local/bin/cdidx
mkdir -p /tmp/dogfood/cs-nested-gen
cat > /tmp/dogfood/cs-nested-gen/N.cs <<'EOF'
using System.Collections.Generic;
namespace Demo;

public class Builder
{
    public void Build()
    {
        // Plain ctor — works
        var a = new List<int>();
        var b = new Dictionary<string, int>();
        // Nested generic ctors — all dropped
        var c = new Dictionary<string, List<int>>();
        var d = new List<Dictionary<string, int>>();
        var e = new Dictionary<int, Dictionary<string, List<int>>>();
        // Generic method calls with nested type args
        Helper.DoWork<List<int>>();
        Helper.Process<Dictionary<string, int>>();
    }
}
public static class Helper
{
    public static void DoWork<T>() { }
    public static void Process<T>() { }
}
EOF
"$CDIDX" index /tmp/dogfood/cs-nested-gen --rebuild
"$CDIDX" references Dictionary --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
"$CDIDX" references List       --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
"$CDIDX" references DoWork     --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
"$CDIDX" references Process    --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact

Observed (actual):

--- references Dictionary ---
instantiate  Dictionary  N.cs:11:21  in Build
call         Dictionary  N.cs:11:21  in Build
(2 references in 1 files)                ← only the flat `Dictionary<string, int>` usage; 3 nested usages dropped

--- references List ---
instantiate  List        N.cs:10:21  in Build
call         List        N.cs:10:21  in Build
(2 references in 1 files)                ← only the flat `List<int>` usage; 3 nested usages dropped

--- references DoWork ---
No references found.                     ← `DoWork<List<int>>()` dropped

--- references Process ---
No references found.                     ← `Process<Dictionary<string,int>>()` dropped

Expected: at least one reference row per call site, including the nested-generic forms.

Suspected root cause (from reading the source)

src/CodeIndex/Indexer/ReferenceExtractor.cs:75-76:

private static readonly Regex ConstructorCallRegex = new(
    @"\bnew\s+(?<name>[A-Za-z_]\w*)(?:<[^>\n]+>)?\s*\(",
    RegexOptions.Compiled);
private static readonly Regex CallRegex = new(
    @"(?<![\w$])(?<name>[A-Za-z_]\w*)(?:<[^>\n]+>)?\s*\(",
    RegexOptions.Compiled);

The generic-arg subgroup (?:<[^>\n]+>)? has two problems:

  1. [^>\n]+ forbids > inside the angle brackets, so the outer group can never span a nested <...>.
  2. Regex backtracking cannot rescue the match: for Dictionary<string, List<int>>(, the engine runs [^>\n]+ greedily over string, List<int, matches the first >, and then looks for \s*\(. The next character is the second >, not (, so this attempt fails. Backtracking to a shorter [^>\n]+ does not help either: the character after the shortened run is then not >, so the group's closing > cannot match. The optional group can also match nothing, but then the regex expects \s*\( immediately after Dictionary, which fails because the next char is <. Attempts starting at the inner identifiers (List, int) fail the same way, because a > always sits between them and the paren. End result: no match anywhere on the line.

CallRegex for Helper.DoWork<List<int>>() fails the same way, so generic method calls with nested type args are also dropped.

String-literal pre-erasure (StringLiteralRegex at :71-73) and comment stripping (PrepareLine) don't interact here — the call is plain code with no strings or comments.

Contrast the flat case: new Dictionary<string, int>() (one level) does match, under both ConstructorCallRegex and CallRegex (hence the instantiate and call rows in the observed output), because [^>\n]+ consumes string, int, then > matches, then ( matches, with no trailing > in the way.
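The failure is easy to confirm in isolation. The following is a minimal sketch, not code from the repository: the two patterns are copied from the snippet above, while the CurrentPatterns class and the sample lines are made up for illustration (assumes .NET 6+ for top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Sample lines taken from the repro file; the harness itself is illustrative only.
string[] samples =
{
    "var b = new Dictionary<string, int>();",        // flat: both patterns match
    "var c = new Dictionary<string, List<int>>();",  // nested: neither matches
    "Helper.DoWork<List<int>>();",                   // nested method call: neither matches
};

foreach (var line in samples)
{
    bool ctor = CurrentPatterns.Ctor.IsMatch(line);
    bool call = CurrentPatterns.Call.IsMatch(line);
    Console.WriteLine($"ctor={ctor} call={call}  {line}");
}

// Hypothetical holder class; the real fields live in ReferenceExtractor.cs.
static class CurrentPatterns
{
    public static readonly Regex Ctor = new(
        @"\bnew\s+(?<name>[A-Za-z_]\w*)(?:<[^>\n]+>)?\s*\(");

    public static readonly Regex Call = new(
        @"(?<![\w$])(?<name>[A-Za-z_]\w*)(?:<[^>\n]+>)?\s*\(");
}
```

Only the first sample reports a match for either pattern; the two nested-generic lines report none, which is consistent with the empty reference rows in the repro.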

Suggested direction

Replace the single-pass [^>\n]+ with a small bracket-balancing helper. One option is to stay in regex: .NET has no recursive patterns, but its balancing groups can track nesting depth, at the cost of an awkward pattern. A simpler option is to replace these two regexes with a procedural lexer step that:

  1. Finds each [A-Za-z_]\w* identifier on the prepared line.
  2. If the next non-space char is <, consumes a balanced <...> region by counting < / > depth, stopping at newline.
  3. Checks whether the next non-space char after the balanced region is (. If so, emit a reference.

Sketch:

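// Skips a balanced <...> block starting at s[i]. Returns true with i just past
// the closing '>' (or with i unchanged when s[i] is not '<'). Returns false and
// leaves i untouched when the block is unbalanced or runs into a newline.
private static bool TrySkipBalancedAngles(string s, ref int i)
{
    if (i >= s.Length || s[i] != '<') return true; // no generics, fine
    int depth = 0;
    for (int k = i; k < s.Length && s[k] != '\n'; k++)
    {
        if (s[k] == '<') depth++;
        else if (s[k] == '>' && --depth == 0) { i = k + 1; return true; }
    }
    return false; // unbalanced, or hit end of line first
}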

Invoke from a single reference-scan loop that replaces both ConstructorCallRegex and CallRegex. This handles arbitrary nesting depth without regex engine backtracking pathology.
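A rough sketch of what such a scan loop could look like, following the three steps listed above. Everything here is hypothetical: CallScanner, Scan, and the (Name, IsCtor) tuple are illustrative names, not existing cdidx APIs, and the angle-skipping helper is repeated so the block stands alone. Identifier matching is ASCII-only for brevity, whereas \w in .NET also covers Unicode letters.

```csharp
using System;
using System.Collections.Generic;

var line = "var e = new Dictionary<int, Dictionary<string, List<int>>>(); Helper.DoWork<List<int>>();";
foreach (var (name, isCtor) in CallScanner.Scan(line))
    Console.WriteLine($"{(isCtor ? "instantiate" : "call")}  {name}");
// prints:
//   instantiate  Dictionary
//   call  DoWork

static class CallScanner
{
    // Returns every identifier followed by an optional balanced <...> block and '('.
    public static List<(string Name, bool IsCtor)> Scan(string line)
    {
        var hits = new List<(string Name, bool IsCtor)>();
        bool afterNew = false; // true while the previous token was the keyword "new"
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (!IsIdentStart(c))
            {
                i++;
                if (!char.IsWhiteSpace(c)) afterNew = false;
                // Digits and '$' glue onto what follows ("1abc", "$x"); skip that run,
                // mirroring the (?<![\w$]) lookbehind in CallRegex.
                if (char.IsDigit(c) || c == '$')
                    while (i < line.Length && (IsIdentPart(line[i]) || line[i] == '$')) i++;
                continue;
            }

            int start = i;
            while (i < line.Length && IsIdentPart(line[i])) i++;
            string name = line.Substring(start, i - start);
            bool isCtor = afterNew;
            afterNew = name == "new";

            int j = SkipSpaces(line, i);
            if (TrySkipBalancedAngles(line, ref j))
            {
                j = SkipSpaces(line, j);
                if (j < line.Length && line[j] == '(') hits.Add((name, isCtor));
            }
            // i stays right after the identifier, so names inside <...> are still visited.
        }
        return hits;
    }

    // Advances i past a balanced <...> block; leaves i untouched on failure.
    static bool TrySkipBalancedAngles(string s, ref int i)
    {
        if (i >= s.Length || s[i] != '<') return true; // no generics
        int depth = 0;
        for (int k = i; k < s.Length && s[k] != '\n'; k++)
        {
            if (s[k] == '<') depth++;
            else if (s[k] == '>' && --depth == 0) { i = k + 1; return true; }
        }
        return false;
    }

    static int SkipSpaces(string s, int i)
    {
        while (i < s.Length && (s[i] == ' ' || s[i] == '\t')) i++;
        return i;
    }

    static bool IsIdentStart(char c) =>
        c == '_' || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');

    static bool IsIdentPart(char c) => IsIdentStart(c) || (c >= '0' && c <= '9');
}
```

Like the current regexes, this sketch does not filter keywords (if, while) and will misread a comparison such as a < b && c > (d) as a generic call; whatever filtering ReferenceExtractor applies today would still be needed.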

Alternative, regex-only: allow one or two levels of nesting by hand, e.g.

private static readonly Regex CallRegex = new(
    @"(?<![\w$])(?<name>[A-Za-z_]\w*)(?:<(?:[^<>\n]|<[^<>\n]*>)*>)?\s*\(",
    RegexOptions.Compiled);

This admits two-level nesting: the inner <[^<>\n]*> handles one nested <...>, and the outer alternation lets it repeat alongside plain characters. ConstructorCallRegex would need the same generic-arg subpattern. It is an 80/20 fix that handles Dictionary<string, List<int>> and Foo<Bar<int>> without introducing catastrophic-backtracking risk, since the two alternatives can never start on the same character. Three-deep nesting (Dictionary<int, Dictionary<string, List<int>>>) would still fail, which is worth documenting as a known limitation if taking the regex-only path.
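To make the depth limit concrete, here is a small comparison sketch. TwoLevel is the pattern proposed above; Balanced is an assumed variant (not from the repository) that uses a .NET balancing group, here named d, to count nesting depth. GenericCallPatterns and the sample strings are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

string[] samples =
{
    "new Dictionary<string, List<int>>()",                   // two levels
    "new Dictionary<int, Dictionary<string, List<int>>>()",  // three levels
};

foreach (var s in samples)
{
    bool twoLevel = GenericCallPatterns.TwoLevel.IsMatch(s);
    bool balanced = GenericCallPatterns.Balanced.IsMatch(s);
    Console.WriteLine($"twoLevel={twoLevel} balanced={balanced}  {s}");
}

static class GenericCallPatterns
{
    // Two-level pattern as proposed in this issue.
    public static readonly Regex TwoLevel = new(
        @"(?<![\w$])(?<name>[A-Za-z_]\w*)(?:<(?:[^<>\n]|<[^<>\n]*>)*>)?\s*\(");

    // Assumed alternative: each inner '<' pushes onto group "d", each inner '>' pops it,
    // and (?(d)(?!)) rejects the match if any '<' is still open at the closing '>'.
    public static readonly Regex Balanced = new(
        @"(?<![\w$])(?<name>[A-Za-z_]\w*)" +
        @"(?:<(?>[^<>\n]+|<(?<d>)|>(?<-d>))*(?(d)(?!))>)?\s*\(");
}
```

Both patterns match the two-level sample, while only the balancing-group variant matches the three-level one. The trade-off is readability: the balancing-group form handles any depth but is noticeably harder to review than either the two-level regex or a procedural scan.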

Why it matters

  • Types such as Task<Result<T, E>>, Dictionary<K, V>, Tuple<...>, Func<...>, and List<KeyValuePair<K, V>> appear all over typical .NET code, often with further generics nested inside them. Any impact / references / callers question about Dictionary or List misses the nested-generic call sites, which can be a substantial fraction of the real usage.
  • Dependency-injection registration patterns (services.AddSingleton<IFoo<Bar>>()) and factory methods (Create<Repository<Customer>>()) are frequently nested-generic call sites, so DI-heavy codebases can end up with badly under-reported reference graphs.
  • unused false positives. Generic classes/interfaces whose only usage is via nested construction (Repository<Order> only constructed as new List<Repository<Order>>()) look unused.
  • Silent gap. No warning is emitted — references Dictionary returning only 2 rows in a codebase with hundreds of Dictionary<,> usages looks plausible to a user who doesn't know the pattern.

Cross-language note

Java has exactly the same pattern: new HashMap<String, List<Integer>>(), CompletableFuture<List<Response>> construction. The same regex change (or procedural replacement) fixes both languages in one shot since they share CallRegex. Kotlin (listOf<Map<String, Int>>()), Scala (Seq[Map[K, V]] — different bracket style, separate concern), and TypeScript (new Map<string, Array<number>>()) also benefit.

The Rust counterpart (Vec::<HashMap<K, V>>::new()) is orthogonal — Rust uses turbofish ::<...> which isn't matched by the current regex at all, and is a separate gap from this one.

Scope

  • src/CodeIndex/Indexer/ReferenceExtractor.cs — either depth-aware angle-balancing or 2-level-capable regex rewrite for CallRegex and ConstructorCallRegex.
  • tests/CodeIndex.Tests/ReferenceExtractorTests.cs — fixtures as in the repro plus Java equivalents.
  • DEVELOPER_GUIDE.md language-pattern reference table — update the "known limitations" row if choosing the 2-level-only fix.
  • CLAUDE.md design-decision section on reference extraction — mention nested-generic handling.

Related

  • #222: sibling issue on the symbol-extraction side (method definitions dropped).

Environment

  • cdidx v1.10.0 (installed via install.sh to /root/.local/bin/cdidx).
  • Platform: linux-x64 container.
  • Filed from a cloud Claude Code session per CLOUD_BOOTSTRAP_PROMPT.md.
