Feature: add call graph to analysis output (Jedi + CodeQL, provenance-tracked) #22

@rahlk

Description

Is your feature request related to a problem? Please describe.

The Python analyzer currently emits only a symbol table — there's no call graph. Downstream consumers (the python-sdk's path/caller/callee queries, dead-code analysis, attack-surface mapping, retrieval-augmented codebase tooling) all need caller→callee information. The Java analyzer already produces one; until the Python analyzer does too, SDK code can't be language-uniform.

Jedi alone covers a narrow slice of call sites (locally-typed, statically obvious). A lot of edges — dynamic dispatch, framework callbacks, calls into third-party libraries — go unresolved. We need a richer resolver, and a way to record which backend(s) resolved each edge so consumers can decide how much to trust them.

Describe the solution you'd like

Add a call_graph: List[PyCallEdge] field to PyApplication, where each PyCallEdge has source / target (both PyCallable.signature strings), weight, and provenance: List[Literal["jedi","codeql","joern"]]. Nodes are the existing PyCallable entries — no separate vertex type. External / RPC / third-party callees surface as ghost nodes downstream via to_digraph, so those edges are preserved rather than dropped.
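A minimal sketch of the proposed shape, for discussion only. Plain dataclasses stand in for the Pydantic models so the snippet has no dependencies. Treating weight as an integer call-site count with a default of 1 is an assumption (the type is not fixed above), and the signature strings are made-up examples.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Literal

# Backends that may claim an edge; "joern" is a placeholder only.
Backend = Literal["jedi", "codeql", "joern"]


@dataclass
class PyCallEdge:
    source: str  # PyCallable.signature of the caller
    target: str  # PyCallable.signature of the callee (may be external)
    weight: int = 1  # assumption: number of call sites behind this edge
    provenance: List[Backend] = field(default_factory=list)


@dataclass
class PyApplication:
    # The existing symbol-table fields are omitted from this sketch.
    call_graph: List[PyCallEdge] = field(default_factory=list)


app = PyApplication(
    call_graph=[
        PyCallEdge("pkg.mod.handler", "pkg.mod.Service.__init__", provenance=["jedi"]),
        PyCallEdge("pkg.mod.handler", "requests.get", weight=2, provenance=["codeql"]),
    ]
)

# A flat edge list serializes to plain dicts and lists.
payload = asdict(app)
```

The second edge targets a callee with no PyCallable in the symbol table; it is kept in the list as-is and would only become a ghost node once adapted to a graph.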

Pipeline per run:

  1. Build symbol table (Jedi, as today).
  2. If --codeql is enabled: auto-download the CodeQL CLI into <cache_dir>/codeql/bin/, install codeql/python-all via codeql pack install into a per-project qlpack, execute the call-edge query, and augment PyCallsite.callee_signature in place for sites Jedi couldn't resolve.
  3. Heuristic constructor fallback for sites neither Jedi nor CodeQL caught (most often classes nested inside functions/methods).
  4. Derive Jedi-side edges from the fully-augmented symbol table.
  5. Merge with CodeQL edges; for edges both backends saw, the provenance lists are unioned.
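Step 5 can be sketched roughly as below. The function name merge_edges and the (source, target) tuple input are hypothetical, not the implementation on the branch. How weight should combine across backends is not specified above, so the sketch leaves it out and shows only the provenance union.

```python
from typing import Dict, Iterable, List, Tuple

EdgeKey = Tuple[str, str]  # (source signature, target signature)


def merge_edges(
    jedi_edges: Iterable[EdgeKey],
    codeql_edges: Iterable[EdgeKey],
) -> List[dict]:
    """Merge per-backend edges, unioning provenance for shared edges."""
    merged: Dict[EdgeKey, dict] = {}
    for backend, edges in (("jedi", jedi_edges), ("codeql", codeql_edges)):
        for source, target in edges:
            entry = merged.setdefault(
                (source, target),
                {"source": source, "target": target, "provenance": []},
            )
            # Record each backend at most once per edge.
            if backend not in entry["provenance"]:
                entry["provenance"].append(backend)
    return list(merged.values())


jedi = [("app.main", "app.util.load"), ("app.main", "app.Service.__init__")]
codeql = [("app.main", "app.util.load"), ("app.main", "requests.get")]

merged = merge_edges(jedi, codeql)
by_key = {(e["source"], e["target"]): e["provenance"] for e in merged}
```

Keying on the (source, target) pair keeps one record per edge, so a consumer can filter on provenance (for example, trusting only edges that both backends agree on).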

An SDK-facing to_digraph / from_digraph adapter rehydrates the persisted edges into a networkx.DiGraph for path/caller/callee queries.
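To illustrate the rehydration idea without pulling in a dependency, here is a sketch in which plain dict-of-set adjacency maps stand in for networkx.DiGraph. The names build_adjacency and has_path are hypothetical. The point is only how endpoints missing from the symbol table become ghost nodes, and how caller, callee, and path queries then work over them.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(
    edges: Iterable[Tuple[str, str]],
    known_signatures: Iterable[str],
):
    """Rehydrate a flat edge list; unknown endpoints become ghost nodes."""
    callees: Dict[str, Set[str]] = {sig: set() for sig in known_signatures}
    callers: Dict[str, Set[str]] = {sig: set() for sig in callees}
    ghosts: Set[str] = set()
    for source, target in edges:
        for node in (source, target):
            if node not in callees:
                # No PyCallable for this signature: keep it as a ghost node.
                ghosts.add(node)
                callees[node] = set()
                callers[node] = set()
        callees[source].add(target)
        callers[target].add(source)
    return callees, callers, ghosts


def has_path(callees: Dict[str, Set[str]], start: str, goal: str) -> bool:
    """Breadth-first search along caller-to-callee edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in callees.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


known = {"app.main", "app.util.load"}
edges = [("app.main", "app.util.load"), ("app.util.load", "requests.get")]
callees, callers, ghosts = build_adjacency(edges, known)
```

Here requests.get has no symbol-table entry, yet the edge into it survives, so a query such as "can app.main reach requests.get" still has an answer.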

Describe alternatives you've considered

  • Symbol-table only, build the call graph in the SDK. Pushes the resolution problem onto every consumer and forces each to re-implement the Jedi/CodeQL plumbing. Rejected.
  • Promote PyClass to a separate call-graph node type so A() lands on the class. Doubles the node taxonomy. Rejected in favor of <class>.__init__ as the constructor target, which keeps nodes uniform because method PyCallables already live in the symbol table.
  • Drop edges to external/library callees. Loses RPC / framework / third-party calls, which are exactly the most interesting edges for security/architecture analysis. Rejected; ghost nodes solve this.
  • Use an nx.DiGraph in the persisted schema directly. Not Pydantic-serializable, and would force every consumer onto networkx. Rejected; persist a flat edge list and adapt at the boundaries.
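For the <class>.__init__ choice, a constructor fallback could look roughly like the sketch below. The helper constructor_target and its scope-walking heuristic are hypothetical, and the dotted format for classes nested inside functions (app.factory.build.Local) is an assumption about how signatures are spelled. The actual heuristic on the branch may differ.

```python
from typing import Optional, Set


def constructor_target(
    callee_name: str,
    enclosing_signature: str,
    class_signatures: Set[str],
) -> Optional[str]:
    """Map an unresolved call `Name()` to `<class>.__init__`.

    Tries the innermost scope first (classes nested inside the calling
    function or method), then walks outward one dotted segment at a time.
    """
    scope = enclosing_signature
    while True:
        candidate = f"{scope}.{callee_name}" if scope else callee_name
        if candidate in class_signatures:
            return f"{candidate}.__init__"
        if not scope:
            return None  # nothing matched; leave the call site unresolved
        scope = scope.rpartition(".")[0]


classes = {"app.factory.build.Local", "app.Service"}

nested = constructor_target("Local", "app.factory.build", classes)
outer = constructor_target("Service", "app.factory.build", classes)
missing = constructor_target("Missing", "app.factory.build", classes)
```

Because the result is an ordinary method signature, the edge lands on the same kind of node as every other call, and no class-level vertex type is needed.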

Additional context

  • Mirrors the Java analyzer's CallableVertex + CallDependency shape, adapted to Python (signature-keyed, no separate vertex type).
  • Implementation is on feat/add-codeql-call-graph (committed retroactively for provenance).
  • Breaking change: --analysis-level is removed. The call graph is built unconditionally; CodeQL participation is controlled by the existing --codeql / --no-codeql flag.
  • Follow-ups (out of scope here): auto-synthesizing PyCallable entries for implicit __init__s on Pydantic/dataclass-style classes (currently surfaces as ghost constructor targets); a Joern backend (placeholder in the provenance enum only).
