C#/Java: nested-generic call sites new Dict<K, List<V>>() / Foo<Bar<int>>() are silently dropped from the reference index — the >> tail breaks the generic-arg regex #263
Summary
Both ConstructorCallRegex and CallRegex in ReferenceExtractor.cs accept an optional generic argument block before the trailing (, but the inner character class is [^>\n]+, which stops at the first >. For any call site whose generic arguments contain a nested generic (that is, whose argument list closes with >> or >>>), neither regex matches and no reference row is emitted.
Affected call forms (all idiomatic C# / Java):
new Dictionary<string, List<int>>(): constructor reference to Dictionary and inner type List both missed.
new List<Dictionary<string, int>>(): List missed.
new Dictionary<int, Dictionary<string, List<int>>>(): triple-nested, missed.
Helper.DoWork<List<int>>(): generic method call reference to DoWork missed.
Helper.Process<Dictionary<string, int>>(): generic method call reference to Process missed.
Sibling issue #222 covers the symbol-extraction side: method definitions whose return type contains a space-in-generic-args get dropped. This issue is the analogous bug on the reference-extraction side: call sites whose generic args contain a nested generic get dropped. Different regex, different code path, different symptom (0 references vs 0 definitions), same underlying shape.
Repro
```sh
CDIDX=/root/.local/bin/cdidx
mkdir -p /tmp/dogfood/cs-nested-gen
cat > /tmp/dogfood/cs-nested-gen/N.cs <<'EOF'
using System.Collections.Generic;

namespace Demo;

public class Builder
{
    public void Build()
    {
        // Plain ctor: works
        var a = new List<int>();
        var b = new Dictionary<string, int>();

        // Nested generic ctors: all dropped
        var c = new Dictionary<string, List<int>>();
        var d = new List<Dictionary<string, int>>();
        var e = new Dictionary<int, Dictionary<string, List<int>>>();

        // Generic method calls with nested type args
        Helper.DoWork<List<int>>();
        Helper.Process<Dictionary<string, int>>();
    }
}

public static class Helper
{
    public static void DoWork<T>() { }
    public static void Process<T>() { }
}
EOF
"$CDIDX" index /tmp/dogfood/cs-nested-gen --rebuild
"$CDIDX" references Dictionary --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
"$CDIDX" references List --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
"$CDIDX" references DoWork --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
"$CDIDX" references Process --db /tmp/dogfood/cs-nested-gen/.cdidx/codeindex.db --exact
```
Observed (actual):
--- references Dictionary ---
instantiate Dictionary N.cs:11:21 in Build
call Dictionary N.cs:11:21 in Build
(2 references in 1 files) ← only the flat `Dictionary<string, int>` usage; 3 nested usages dropped
--- references List ---
instantiate List N.cs:10:21 in Build
call List N.cs:10:21 in Build
(2 references in 1 files) ← only the flat `List<int>` usage; 3 nested usages dropped
--- references DoWork ---
No references found. ← `DoWork<List<int>>()` dropped
--- references Process ---
No references found. ← `Process<Dictionary<string,int>>()` dropped
Expected: at least one reference row per call site, including the nested-generic forms.
Suspected root cause (from reading the source)
src/CodeIndex/Indexer/ReferenceExtractor.cs:75-76: the generic-arg subgroup (?:<[^>\n]+>)? has two problems:
[^>\n]+ forbids > inside the angle brackets, so the outer group can never span a nested <...>.
Regex backtracking cannot rescue the match: for Dictionary<string, List<int>>(, the engine tries [^>\n]+ greedily, settles on string, List<int, matches the next >, and then looks for \s*\(. The next character is >, not (, so the match fails. Shorter backtracks of [^>\n]+ still leave at least one > before the paren. The optional group can also match zero characters — but then the regex expects \s*\( immediately after Dictionary, which fails because the next char is <. End result: no match anywhere on the line.
CallRegex for Helper.DoWork<List<int>>() fails the same way, so generic method calls with nested type args are also dropped.
String-literal pre-erasure (StringLiteralRegex at :71-73) and comment stripping (PrepareLine) don't interact here — the call is plain code with no strings or comments.
Note the contrast with CallRegex's flat case: new Dictionary<string, int>() (one level) does match because [^>\n]+ consumes string, int, then > matches, then ( matches — no trailing > in the way.
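The flat-versus-nested contrast can be checked in isolation. The sketch below is not the real CallRegex: the identifier prefix and the trailing \s*\( are a simplified, hypothetical stand-in, and only the generic-arg subgroup (?:<[^>\n]+>)? is taken from the text above. Under that assumption, the flat call should match and the nested ones should not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlatGenericDemo
{
    // Hypothetical simplified pattern: identifier, optional generic block, "(".
    // Only the subgroup (?:<[^>\n]+>)? comes from the issue; the rest is assumed.
    private static readonly Regex FlatCall =
        new Regex(@"\b([A-Za-z_]\w*)(?:<[^>\n]+>)?\s*\(");

    public static bool Matches(string line) => FlatCall.IsMatch(line);

    public static void Main()
    {
        string[] lines =
        {
            "var b = new Dictionary<string, int>();",        // flat generic
            "var c = new Dictionary<string, List<int>>();",  // nested generic
            "Helper.DoWork<List<int>>();",                   // nested generic method call
        };
        foreach (var line in lines)
            Console.WriteLine($"{Matches(line),-5} {line}");
    }
}
```

In the nested cases every candidate start position ends with a > where the pattern needs (, which is the failure described above.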
Suggested direction
Replace the single-pass [^>\n]+ with a small bracket-balancing helper. One option is a recursive pattern (supported by .NET regex via balancing groups, but awkward); a simpler option is to switch these two regexes to a procedural lexer step that:
Finds each [A-Za-z_]\w* identifier on the prepared line.
If the next non-space char is <, consumes a balanced <...> region by counting < / > depth, stopping at newline.
Checks whether the next non-space char after the balanced region is (. If so, emit a reference.
Sketch:
```csharp
private static bool TrySkipBalancedAngles(string s, ref int i)
{
    if (i >= s.Length || s[i] != '<') return true; // no generics, fine
    int depth = 0;
    while (i < s.Length)
    {
        char c = s[i++];
        if (c == '<') depth++;
        else if (c == '>')
        {
            depth--;
            if (depth == 0) return true;
        }
        else if (c == '\n') return false;
    }
    return false;
}
```
Invoke from a single reference-scan loop that replaces both ConstructorCallRegex and CallRegex. This handles arbitrary nesting depth without regex engine backtracking pathology.
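One way such a scan loop might look is sketched below. BalancedCallScanner and FindCallNames are hypothetical names, not existing cdidx APIs. The sketch returns bare names rather than reference rows, and it leaves out keyword filtering (for example `if (`), constructor versus call classification, and references to inner type arguments, all of which the real extractor would still need.

```csharp
using System;
using System.Collections.Generic;

public static class BalancedCallScanner
{
    // Same helper as in the issue's sketch: skip a balanced <...> region, if any.
    private static bool TrySkipBalancedAngles(string s, ref int i)
    {
        if (i >= s.Length || s[i] != '<') return true; // no generics, fine
        int depth = 0;
        while (i < s.Length)
        {
            char c = s[i++];
            if (c == '<') depth++;
            else if (c == '>') { depth--; if (depth == 0) return true; }
            else if (c == '\n') return false;
        }
        return false;
    }

    private static bool IsIdentChar(char c) => char.IsLetterOrDigit(c) || c == '_';

    // Hypothetical scan loop: returns each identifier that is followed by an
    // optional balanced generic-argument block and then "(".
    public static List<string> FindCallNames(string line)
    {
        var names = new List<string>();
        int i = 0;
        while (i < line.Length)
        {
            if (!IsIdentChar(line[i])) { i++; continue; }
            int start = i;
            while (i < line.Length && IsIdentChar(line[i])) i++;
            if (char.IsDigit(line[start])) continue; // numeric literal, not an identifier

            int j = i; // look ahead without moving the main cursor
            while (j < line.Length && line[j] == ' ') j++;
            if (!TrySkipBalancedAngles(line, ref j)) continue;
            while (j < line.Length && line[j] == ' ') j++;
            if (j < line.Length && line[j] == '(')
                names.Add(line.Substring(start, i - start));
        }
        return names;
    }

    public static void Main()
    {
        foreach (var name in FindCallNames("var e = new Dictionary<int, Dictionary<string, List<int>>>();"))
            Console.WriteLine(name);
    }
}
```

Because the look-ahead uses a separate cursor, identifiers inside the angle brackets are still visited by the main loop. They are rejected here only because they are followed by > or , rather than (. A comparison such as `a < b > (c)` would be a false positive in this sketch.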
Alternative, regex-only: allow one or two levels of nesting by hand, e.g.
This admits two-level nesting (the inner <[^<>\n]*> handles one nested <...>, the outer alternation stitches them together). It is an 80/20 fix that handles Dictionary<string, List<int>> and Foo<Bar<int>> without introducing catastrophic-backtracking risk. Three-deep (Dictionary<int, Dictionary<string, List<int>>>) would still fail — worth documenting as a known limitation if taking the regex-only path.
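The pattern that originally accompanied this alternative did not survive in the text, so the following is a reconstruction from the description only (an inner <[^<>\n]*> stitched in by an outer alternation), not the author's exact regex. The identifier prefix and trailing \s*\( are again a simplified, hypothetical stand-in for the real CallRegex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoLevelGenericDemo
{
    // Assumed reconstruction: the generic block may contain plain characters
    // or one level of nested <...>. Each alternative starts with a different
    // character class, so there is no ambiguous overlap to backtrack through.
    private static readonly Regex TwoLevelCall =
        new Regex(@"\b([A-Za-z_]\w*)(?:<(?:[^<>\n]|<[^<>\n]*>)*>)?\s*\(");

    // Returns the matched call name, or null when nothing matches.
    public static string FirstCallName(string line)
    {
        Match m = TwoLevelCall.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string[] lines =
        {
            "new Dictionary<string, List<int>>()",                   // two levels
            "new HashMap<String, List<Integer>>()",                  // Java form, two levels
            "new Dictionary<int, Dictionary<string, List<int>>>()",  // three levels
        };
        foreach (var line in lines)
            Console.WriteLine($"{FirstCallName(line) ?? "(no match)"} <- {line}");
    }
}
```

The three-level line produces no match under this pattern, which is the limitation noted above.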
Why it matters
Task<Result<T, E>>, Dictionary<K, V>, Tuple<...>, Func<...>, List<KeyValuePair<K, V>> are constructed all over typical .NET code. Any impact / references / callers question about Dictionary or List misses the nested-generic call sites — which are a large fraction of the real usage.
Dependency-injection registration patterns (services.AddSingleton<IFoo<Bar>>()) and factory methods (Create<Repository<Customer>>()) are almost exclusively nested-generic call sites. DI-heavy codebases have hugely under-reported reference graphs.
unused false positives. Generic classes/interfaces whose only usage is via nested construction (Repository<Order> only constructed as new List<Repository<Order>>()) look unused.
Silent gap. No warning is emitted — references Dictionary returning only 2 rows in a codebase with hundreds of Dictionary<,> usages looks plausible to a user who doesn't know the pattern.
Cross-language note
Java has exactly the same pattern: new HashMap<String, List<Integer>>(), CompletableFuture<List<Response>> construction. The same regex change (or procedural replacement) fixes both languages in one shot since they share CallRegex. Kotlin (listOf<Map<String, Int>>()), Scala (Seq[Map[K, V]] — different bracket style, separate concern), and TypeScript (new Map<string, Array<number>>()) also benefit.
The Rust counterpart (Vec::<HashMap<K, V>>::new()) is orthogonal — Rust uses turbofish ::<...> which isn't matched by the current regex at all, and is a separate gap from this one.
Scope
src/CodeIndex/Indexer/ReferenceExtractor.cs — either depth-aware angle-balancing or 2-level-capable regex rewrite for CallRegex and ConstructorCallRegex.
tests/CodeIndex.Tests/ReferenceExtractorTests.cs — fixtures as in the repro plus Java equivalents.
DEVELOPER_GUIDE.md language-pattern reference table — update the "known limitations" row if choosing the 2-level-only fix.
CLAUDE.md design-decision section on reference extraction — mention nested-generic handling.
Related
... Task<Result<A, B>>, Dictionary<K, V>) are silently dropped - idiomatic .NET formatting is effectively unindexed #222: C# method return types with space-in-generic-args dropped from symbol extraction. Sibling bug on the definition side; same shape, different regex.
... IgnoredCallNames is language-global; same extractor architecture.
... nameof(X.Y) / typeof(T) / default(T) arguments are silently dropped from the reference index #253: nameof / typeof / default argument dropout (same extractor, different no-parens family).
... is / as patterns, attributes) never registered as references #256: type-position references (same extractor, different no-parens family).
Environment
... install.sh to /root/.local/bin/cdidx).
... CLOUD_BOOTSTRAP_PROMPT.md.