ReferenceExtractor drops legitimate calls in all languages because language-specific keyword lists are applied globally #200

@Widthdom

Summary

ReferenceExtractor.IgnoredCallNames is a single HashSet<string> that bundles together keywords from many languages — Makefile (all, clean, install, build, run, help), Gradle/Groovy (apply, task, dependencies, repositories, version, description, group, ext, …), Haskell (putStrLn, putStr, print, Just, Nothing, Left, Right, True, False), R (library, cat, paste, paste0, sprintf, stop, warning, message, tryCatch, …), Terraform (resource, data, variable, module, provider, …), Shell (echo, cd, set, unset, export, source, eval, exec, test, read, …), and "other languages" (require, import, include, raise, lambda).

The set is consulted once per call match regardless of the file's language (ReferenceExtractor.cs:158), so e.g. run() / build() / install() / clean() / help() / apply() / task() / require() / print() / library() / cat() / cd() / read() / test() / source() calls in Python / JavaScript / TypeScript / C# / Go / Rust / Java / Kotlin / Ruby / C / C++ / PHP / Swift / Dart / Scala / Elixir / F# code are silently dropped from symbol_references.

This makes references, callers, callees, and impact return wrong / missing results for any function that happens to share a name with a keyword from an unrelated language.

Repro

mkdir -p /tmp/dogfood/keyword_test && cd /tmp/dogfood/keyword_test
cat > a.py <<'EOF'
def caller():
    run()
    build()
    install()
    clean()
    help()
    print()
    require()
    notexcluded()
    apply()
    task()
EOF

/root/.local/bin/cdidx index .
sqlite3 .cdidx/codeindex.db \
  "SELECT symbol_name, line, reference_kind, container_name FROM symbol_references ORDER BY line;"

Observed:

notexcluded|9|call|caller

Only 1 of 10 calls indexed. The other 9 are silently dropped because run, build, install, clean, help, print, require, apply, task all live in the global IgnoredCallNames set.
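
The gap can also be checked programmatically instead of by eye. The following is a minimal sketch, assuming only the table and column names used in the query above and the 'call' value seen in the observed row; the helper name missing_calls is hypothetical and not part of cdidx.

```python
import sqlite3

# The ten calls written in a.py in the repro above.
EXPECTED_CALLS = (
    "run", "build", "install", "clean", "help",
    "print", "require", "notexcluded", "apply", "task",
)


def missing_calls(conn, expected=EXPECTED_CALLS):
    """Return the expected call names that have no 'call' row in symbol_references."""
    rows = conn.execute(
        "SELECT symbol_name FROM symbol_references WHERE reference_kind = 'call'"
    ).fetchall()
    indexed = {name for (name,) in rows}
    return sorted(set(expected) - indexed)


if __name__ == "__main__":
    # Path taken from the repro; run this from /tmp/dogfood/keyword_test.
    conn = sqlite3.connect(".cdidx/codeindex.db")
    print(missing_calls(conn))
    conn.close()
```

Against the database produced by the repro, this should list the nine dropped names; after a fix it should return an empty list.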

Downstream effect — none of these work as users would expect:

/root/.local/bin/cdidx callers run --exact --json
/root/.local/bin/cdidx callees caller --exact --json
/root/.local/bin/cdidx references print --exact --json
/root/.local/bin/cdidx impact apply --json

All four return zero / wrong results even though the calls obviously exist in indexed Python source.

Suspected root cause (from reading the source)

src/CodeIndex/Indexer/ReferenceExtractor.cs:

  • Lines 22–69: IgnoredCallNames is a single HashSet<string> that mixes Makefile, Gradle, Terraform, R, PowerShell, Haskell, Shell, F#, Java, C#, "Other languages" entries.
  • Lines 155–164 (the CallRegex loop): every match goes through a single IgnoredCallNames.Contains(name) check that is not scoped by language, i.e. there is no language is "..." guard around it.

So the per-language comments above each block in the set are aspirational only — at runtime there is no per-language scoping. A Makefile target name and a Python builtin and a Haskell prelude name are all blocked uniformly across all 31 graph-supported languages.

For comparison, EventSubscriptionRegex at line 149 is properly gated by if (language is "csharp"). The keyword exclusion needs the same treatment.
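
To make the failure mode concrete, here is a small Python model of the ungated check. It is a sketch for illustration, not the C# source: the regex is a simplified stand-in (the real CallRegex is not quoted in this issue), the set is abbreviated, and extract_calls_ungated is a hypothetical name.

```python
import re

# Abbreviated stand-in for IgnoredCallNames: entries from several
# unrelated languages live in one flat set.
IGNORED_CALL_NAMES = {
    "all", "clean", "install", "build", "run", "help",  # Makefile
    "apply", "task",                                    # Gradle
    "putStrLn", "print",                                # Haskell
    "require", "import", "include",                     # "other languages"
}

# Hypothetical, simplified call pattern: an identifier followed by "(".
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def extract_calls_ungated(source, language):
    """Model of the reported behaviour: `language` is accepted but never consulted."""
    return [
        m.group(1)
        for m in CALL_RE.finditer(source)
        if m.group(1) not in IGNORED_CALL_NAMES
    ]
```

With this model, extract_calls_ungated("run()\nprint()\nnotexcluded()\n", "python") keeps only notexcluded, and passing "makefile" instead gives the same answer, which is the symptom described above.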

Suggested fix

Replace the single IgnoredCallNames HashSet with a small lookup keyed by language (or a base set of truly universal control-flow / declaration keywords plus per-language overlays merged at filter time). At minimum:

  • all, clean, install, build, run, help should only apply to language is "makefile".
  • apply, plugins, dependencies, repositories, allprojects, subprojects, task, buildscript, ext, group, version, description should only apply to gradle.
  • resource, data, variable, output, locals, module, provider, terraform, required_providers, backend should only apply to terraform.
  • library, cat, paste, paste0, sprintf, stop, warning, message, invisible, tryCatch, withCallingHandlers, next, break, repeat should only apply to r.
  • putStrLn, putStr, print, Just, Nothing, Left, Right, True, False, data, newtype, instance, deriving, infixl, infixr, infix, qualified, hiding, forall should only apply to haskell.
  • Shell built-ins (echo, exit, cd, set, unset, export, source, eval, exec, test, read, shift, trap, local, declare, readonly) should only apply to shell. Currently shell is not even in SupportedLanguages (lines 12–20), so this whole block is dead weight that only ever blocks calls in the other 31 languages it was never meant for.
  • "Other languages" (require, import, include, raise, lambda): require blocks legitimate Ruby require calls, Node require() calls, and any function named require; print (in the Haskell list above) blocks Python's print everywhere. These need careful per-language scoping.
  • Genuinely cross-language keywords (if, else, for, while, switch, catch, lock, do, try, when, sizeof, typeof, return, throw, nameof, await, using, new, class, struct, record, interface, enum, delegate, event, namespace, def, function, func) can stay in a shared base set.
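
The base-set-plus-overlay shape suggested above could look roughly like this. It is a Python sketch of the idea only: the actual change would be in C#, the names BASE_IGNORED, PER_LANGUAGE_IGNORED and is_ignored are hypothetical, and the language identifier "python" used in the usage note is an assumption about how the indexer names languages. The word lists are copied from the bullets above.

```python
# Keywords that are safe to ignore in every language (from the last bullet).
BASE_IGNORED = frozenset(
    "if else for while switch catch lock do try when sizeof typeof return "
    "throw nameof await using new class struct record interface enum "
    "delegate event namespace def function func".split()
)

# Overlays that apply only when the file is in the named language.
PER_LANGUAGE_IGNORED = {
    "makefile": frozenset("all clean install build run help".split()),
    "gradle": frozenset(
        "apply plugins dependencies repositories allprojects subprojects "
        "task buildscript ext group version description".split()
    ),
    "terraform": frozenset(
        "resource data variable output locals module provider terraform "
        "required_providers backend".split()
    ),
    "r": frozenset(
        "library cat paste paste0 sprintf stop warning message invisible "
        "tryCatch withCallingHandlers next break repeat".split()
    ),
    "haskell": frozenset(
        "putStrLn putStr print Just Nothing Left Right True False data "
        "newtype instance deriving infixl infixr infix qualified hiding "
        "forall".split()
    ),
    "shell": frozenset(
        "echo exit cd set unset export source eval exec test read shift "
        "trap local declare readonly".split()
    ),
    # require / import / include / raise / lambda are left out on purpose:
    # they still need a per-language decision.
}

_EMPTY = frozenset()


def is_ignored(name, language):
    """True if `name` should be skipped as a call in a file of `language`."""
    return name in BASE_IGNORED or name in PER_LANGUAGE_IGNORED.get(language, _EMPTY)
```

Under these assumptions, is_ignored("run", "makefile") is true while is_ignored("run", "python") is false, and a language with no overlay falls back to the base set alone.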

Tests should cover at least one positive case per language family — e.g. Python print() must produce a reference row, JS require('x') must produce one, Ruby require 'x' must produce one, C# task.Run() must produce a Run call ref, etc.
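
One way to keep those positive cases in a single place is a table-driven check. The sketch below is in Python and is deliberately independent of the real extractor: failing_cases takes any callable extract(source, language) that returns the call names it found. The helper name and the language identifiers other than "csharp" are assumptions; the cases themselves are the ones listed above.

```python
# (language, source snippet, call name that must appear in the references)
POSITIVE_CASES = [
    ("python", "print()", "print"),
    ("javascript", "require('x')", "require"),
    ("ruby", "require 'x'", "require"),
    ("csharp", "task.Run()", "Run"),
]


def failing_cases(extract, cases=POSITIVE_CASES):
    """Return the cases whose expected call name is missing from extract(source, language)."""
    return [
        (language, source, name)
        for language, source, name in cases
        if name not in extract(source, language)
    ]
```

A regression test would then assert that failing_cases(real_extractor) is empty, where real_extractor is a thin adapter around whatever the project exposes for extraction.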

Environment

  • cdidx v1.10.0 (installed via official install.sh to /root/.local/bin/cdidx)
  • Linux, .NET 8

https://claude.ai/code/session_01Bi6Vn3v37ViFbJroJkpUe3
