diff --git a/CHANGELOG.md b/CHANGELOG.md index 77f7e28..efc9c90 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,28 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [0.1.14] - 2026-05-13 + +### Added +- **Call graph in analysis output**: `PyApplication.call_graph: List[PyCallEdge]`. Every run now produces a call graph in addition to the symbol table. Edges carry `source`, `target` (both `PyCallable.signature`), `weight`, and `provenance` (`jedi` / `codeql` / `joern`). +- **`call_graph` module** (`codeanalyzer.semantic_analysis.call_graph`) with `to_digraph` / `from_digraph` networkx adapters, `jedi_call_graph_edges`, and `merge_edges`. Endpoints absent from the symbol table become ghost nodes so RPC / third-party / framework edges are preserved. +- **CodeQL Python query** rewritten against the CodeQL Python library (was Java idioms before). Resolves direct calls and constructor calls via `ClassValue.lookup("__init__")`, using the modern `Value.getACall()` predicate (CodeQL Python 7.x). +- **`augment_call_sites`**: when `--codeql` is enabled, CodeQL backfills `PyCallsite.callee_signature` entries Jedi left unresolved. +- **`resolve_unresolved_constructors`**: heuristic fallback that walks the symbol table by class short-name and scope to fill in constructor sites neither Jedi nor CodeQL resolved (common for classes nested inside functions/methods). Synthesizes `.__init__` signatures. +- **`iter_classes_in_symbol_table`**: full recursive walker over classes — including inner classes, classes nested in functions, and classes nested in class methods. + +### Changed +- **BREAKING**: Removed `--analysis-level` / `analysis_level`. The call graph is built unconditionally; use `--codeql/--no-codeql` to control CodeQL participation. Jedi-derived edges are always available. 
+- **Jedi constructor calls now resolve to `.__init__`** (was: bare ``). When `script.infer()` returns a class, the qualified name is rewritten to point at the constructor — matching where method `PyCallable`s actually live in the symbol table. `PyCallsite.is_constructor_call` now reflects Jedi's type inference (was: `method_name == "__init__"`, only true for explicit `obj.__init__()` calls). +- **`_call_sites` scope correctness**: replaced naive `ast.walk` with `_iter_calls_in_scope`, which stops at nested `FunctionDef` / `AsyncFunctionDef` / `ClassDef` bodies (those have their own `PyCallable.call_sites`). Decorators, default arguments, return annotations, base classes and class keyword args are still walked since they execute in the enclosing scope. Previously, outer functions over-attributed every call from every nested definition. +- CodeQL CLI binary is now downloaded into `/codeql/bin/` (per-project, respecting `--cache-dir`) and discovered before any CodeQL operation — including when the database cache is reused. The downloaded archive is removed after extraction. +- `CodeQLQueryRunner` now accepts the resolved binary path instead of relying on `PATH`. The temporary `.ql` file is written **inside** a per-project qlpack (`/codeql/qlpack/`) whose `codeql/python-all` dependency is resolved once via `codeql pack install`, eliminating the lockfile / search-path gymnastics. + +### Fixed +- **`zipfile` extraction dropped Unix permissions** on the CodeQL CLI launcher, causing `PermissionError` on first query run. Entries are now extracted with their stored `external_attr` mode applied, plus a defensive `chmod +x` on the resolved binary. +- **`rglob("codeql")` matched the bundled `codeql/codeql/` directory** before the launcher file, returning a directory instead of an executable. Both `CodeQLLoader` and `_ensure_codeql_bin` now filter to `is_file()`. 
+- **`CodeQLQueryRunner` crashed on subprocess errors** with `'NoneType' object has no attribute 'stderr'` because `communicate()` returns `None` for stderr when the process is started with `stderr=None`. The runner now captures `stderr=PIPE` and decodes the bytes safely. + ## [0.1.13] - 2025-07-22 ### Improved diff --git a/README.md b/README.md index 02fd0a7..f7c4b1a 100644 --- a/README.md +++ b/README.md @@ -80,7 +80,6 @@ To view the available options and commands, run `codeanalyzer --help`. You shoul │ * --input -i PATH Path to the project root directory. [default: None] [required] │ │ --output -o PATH Output directory for artifacts. [default: None] │ │ --format -f [json|msgpack] Output format: json or msgpack. [default: json] │ -│ --analysis-level -a INTEGER 1: symbol table, 2: call graph. [default: 1] │ │ --codeql --no-codeql Enable CodeQL-based analysis. [default: no-codeql] │ │ --eager --lazy Enable eager or lazy analysis. Defaults to lazy. [default: lazy] │ │ --cache-dir -c PATH Directory to store analysis cache. [default: None] │ @@ -112,25 +111,15 @@ To view the available options and commands, run `codeanalyzer --help`. You shoul This will save the analysis results in `analysis.msgpack` in the specified directory. -3. **Toggle analysis levels with `--analysis-level`:** - ```bash - codeanalyzer --input ./my-python-project --analysis-level 1 # Symbol table only - ``` - Call graph analysis can be enabled by setting the level to `2`: - ```bash - codeanalyzer --input ./my-python-project --analysis-level 2 # Symbol table + Call graph - ``` - ***Note: The `--analysis-level=2` is not yet implemented in this version.*** - -4. **Analysis with CodeQL enabled:** +3. **Analysis with CodeQL enabled:** ```bash codeanalyzer --input ./my-python-project --codeql ``` - This will perform CodeQL-based analysis in addition to the standard symbol table generation. + Every run produces a symbol table **and** a call graph. By default, edges come from Jedi's lexical analysis. 
Adding `--codeql` resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges. CodeQL also backfills resolved callees on Jedi-emitted call sites where Jedi couldn't resolve them. - ***Note: Not yet fully implemented. Please refrain from using this option until further notice.*** + ***Note: CodeQL integration is experimental. The CLI is downloaded into `/codeql/` on first use and reused thereafter.*** -5. **Eager analysis with custom cache directory:** +4. **Eager analysis with custom cache directory:** ```bash codeanalyzer --input ./my-python-project --eager --cache-dir /path/to/custom-cache ``` @@ -138,7 +127,7 @@ To view the available options and commands, run `codeanalyzer --help`. You shoul If you provide --cache-dir, the cache will be stored in that directory. If not specified, it defaults to `.codeanalyzer` in the current working directory (`$PWD`). -6. **Quiet mode (minimal output):** +5. **Quiet mode (minimal output):** ```bash codeanalyzer --input /path/to/my-python-project --quiet ``` @@ -236,7 +225,6 @@ To view the available options and commands, run `codeanalyzer --help`. You shoul │ * --input -i PATH Path to the project root directory. [default: None] [required] │ │ --output -o PATH Output directory for artifacts. [default: None] │ │ --format -f [json|msgpack] Output format: json or msgpack. [default: json]. │ -│ --analysis-level -a INTEGER 1: symbol table, 2: call graph. [default: 1] │ │ --codeql --no-codeql Enable CodeQL-based analysis. [default: no-codeql] │ │ --eager --lazy Enable eager or lazy analysis. Defaults to lazy. [default: lazy] │ │ --cache-dir -c PATH Directory to store analysis cache. [default: None] │ @@ -261,25 +249,15 @@ To view the available options and commands, run `codeanalyzer --help`. You shoul Now, you can find the analysis results in `analysis.json` in the specified directory. -2. 
**Toggle analysis levels with `--analysis-level`:** - ```bash - codeanalyzer --input ./my-python-project --analysis-level 1 # Symbol table only - ``` - Call graph analysis can be enabled by setting the level to `2`: - ```bash - codeanalyzer --input ./my-python-project --analysis-level 2 # Symbol table + Call graph - ``` - ***Note: The `--analysis-level=2` is not yet implemented in this version.*** - -3. **Analysis with CodeQL enabled:** +2. **Analysis with CodeQL enabled:** ```bash codeanalyzer --input ./my-python-project --codeql ``` - This will perform CodeQL-based analysis in addition to the standard symbol table generation. + Every run produces a symbol table **and** a call graph. By default, edges come from Jedi's lexical analysis. Adding `--codeql` resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges. CodeQL also backfills resolved callees on Jedi-emitted call sites where Jedi couldn't resolve them. - ***Note: Not yet fully implemented. Please refrain from using this option until further notice.*** + ***Note: CodeQL integration is experimental. The CLI is downloaded into `/codeql/` on first use and reused thereafter.*** -4. **Eager analysis with custom cache directory:** +3. **Eager analysis with custom cache directory:** ```bash codeanalyzer --input ./my-python-project --eager --cache-dir /path/to/custom-cache ``` @@ -287,7 +265,7 @@ To view the available options and commands, run `codeanalyzer --help`. You shoul If you provide --cache-dir, the cache will be stored in that directory. If not specified, it defaults to `.codeanalyzer` in the current working directory (`$PWD`). -5. **Save output in msgpack format:** +4. 
**Save output in msgpack format:** ```bash codeanalyzer --input ./my-python-project --output /path/to/analysis-results --format msgpack ``` diff --git a/codeanalyzer/__main__.py b/codeanalyzer/__main__.py index f761b92..19e7f2a 100644 --- a/codeanalyzer/__main__.py +++ b/codeanalyzer/__main__.py @@ -27,10 +27,6 @@ def main( case_sensitive=False, ), ] = OutputFormat.JSON, - analysis_level: Annotated[ - int, - typer.Option("-a", "--analysis-level", help="1: symbol table, 2: call graph."), - ] = 1, using_codeql: Annotated[ bool, typer.Option("--codeql/--no-codeql", help="Enable CodeQL-based analysis.") ] = False, @@ -82,7 +78,6 @@ def main( input=input, output=output, format=format, - analysis_level=analysis_level, using_codeql=using_codeql, using_ray=using_ray, rebuild_analysis=rebuild_analysis, diff --git a/codeanalyzer/core.py b/codeanalyzer/core.py index b54f7e6..b8cfcca 100644 --- a/codeanalyzer/core.py +++ b/codeanalyzer/core.py @@ -9,7 +9,14 @@ import ray from codeanalyzer.utils import logger from codeanalyzer.schema import PyApplication, PyModule, model_dump_json, model_validate_json +from codeanalyzer.schema.py_schema import PyCallEdge +from codeanalyzer.semantic_analysis.call_graph import ( + jedi_call_graph_edges, + merge_edges, + resolve_unresolved_constructors, +) from codeanalyzer.semantic_analysis.codeql import CodeQLLoader +from codeanalyzer.semantic_analysis.codeql.codeql_analysis import CodeQL from codeanalyzer.semantic_analysis.codeql.codeql_exceptions import CodeQLExceptions from codeanalyzer.syntactic_analysis.exceptions import SymbolTableBuilderRayError from codeanalyzer.syntactic_analysis.symbol_table_builder import SymbolTableBuilder @@ -49,7 +56,6 @@ class Codeanalyzer: def __init__(self, options: AnalysisOptions) -> None: self.options = options - self.analysis_depth = options.analysis_level self.project_dir = Path(options.input).resolve() self.skip_tests = options.skip_tests self.using_codeql = options.using_codeql @@ -60,6 +66,7 @@ def 
__init__(self, options: AnalysisOptions) -> None: self.clear_cache = options.clear_cache self.db_path: Optional[Path] = None self.codeql_bin: Optional[Path] = None + self.codeql_packs_dir: Optional[Path] = None self.virtualenv: Optional[Path] = None self.using_ray: bool = options.using_ray self.file_name: Optional[Path] = options.file_name @@ -292,6 +299,15 @@ def __enter__(self) -> "Codeanalyzer": if self.using_codeql: logger.info(f"(Re-)initializing CodeQL analysis for {self.project_dir}") + + # Resolve the CLI binary before anything else uses it: DB build + # below needs it, and so does every subsequent query run. + self.codeql_bin = self._ensure_codeql_bin() + # Download the standard query library pack (idempotent). The + # CLI install ships only the language extractors; the + # ``codeql/python-all`` library pack must be fetched separately. + self.codeql_packs_dir = self._ensure_codeql_packs(self.codeql_bin) + cache_root = self.cache_dir / "codeql" cache_root.mkdir(parents=True, exist_ok=True) self.db_path = cache_root / f"{self.project_dir.name}-db" @@ -310,19 +326,6 @@ def is_cache_valid() -> bool: if self.rebuild_analysis or not is_cache_valid(): logger.info("Creating new CodeQL database...") - codeql_in_path = shutil.which("codeql") - if codeql_in_path: - self.codeql_bin = Path(codeql_in_path) - else: - self.codeql_bin = CodeQLLoader.download_and_extract_codeql( - self.cache_dir / "codeql" / "bin" - ) - - if not shutil.which(str(self.codeql_bin)): - raise FileNotFoundError( - f"CodeQL binary not executable: {self.codeql_bin}" - ) - cmd = [ str(self.codeql_bin), "database", @@ -375,8 +378,27 @@ def analyze(self) -> PyApplication: # Build symbol table from cached application if available (if no available, the build a new one) symbol_table = self._build_symbol_table(cached_pyapplication.symbol_table if cached_pyapplication else {}) + # Build the call graph in four steps: + # 1. Run CodeQL (when enabled). 
Produces resolved edges with + # ``provenance=["codeql"]`` and augments ``PyCallsite``s + # in-place — filling ``callee_signature`` for sites Jedi + # couldn't resolve. + # 2. Heuristic fallback for constructor calls neither Jedi nor + # CodeQL could resolve (commonly classes nested inside + # functions). Walks the symbol table by class short-name + + # scope and writes ``.__init__`` into the site. + # 3. Derive Jedi edges from the now-fully-augmented symbol + # table — these reflect every resolution the symbol table + # contains, regardless of which pass put it there. + # 4. Merge with CodeQL edges; provenance unions for edges both + # backends saw. + codeql_edges = self._get_call_graph(symbol_table, augment_sites=True) + resolve_unresolved_constructors(symbol_table) + jedi_edges = jedi_call_graph_edges(symbol_table) + call_graph = merge_edges(jedi_edges, codeql_edges) + # Recreate pyapplication - app = PyApplication.builder().symbol_table(symbol_table).build() + app = PyApplication.builder().symbol_table(symbol_table).call_graph(call_graph).build() # Save to cache self._save_analysis_cache(app, cache_file) @@ -579,7 +601,120 @@ def _build_symbol_table(self, cached_symbol_table: Optional[Dict[str, PyModule]] logger.info("✅ Symbol table generation complete.") return symbol_table - def _get_call_graph(self) -> Dict[str, Any]: - """Retrieve call graph from CodeQL database.""" - logger.warning("Call graph extraction not yet implemented.") - return {} \ No newline at end of file + def _ensure_codeql_packs(self, codeql_bin: Path) -> Path: + """Materialize a qlpack that depends on ``codeql/python-all``. + + The CodeQL CLI install ships only the language extractors — query + library packs (and their transitive dependencies like + ``codeql/concepts``) must be resolved separately. 
The canonical + way is to declare the dependency in a ``qlpack.yml`` and run + ``codeql pack install`` in that directory; CodeQL writes a + ``codeql-pack.lock.yml`` and downloads everything needed. + + We do this once per project under ``/codeql/qlpack/`` + and return that directory. The query runner then writes its + temporary ``.ql`` file inside this pack — colocation makes + ``import python`` resolve without any ``--additional-packs`` or + ``--search-path`` gymnastics. + """ + pack_dir = self.cache_dir / "codeql" / "qlpack" + pack_dir.mkdir(parents=True, exist_ok=True) + qlpack_yml = pack_dir / "qlpack.yml" + lock_file = pack_dir / "codeql-pack.lock.yml" + + if not qlpack_yml.exists(): + qlpack_yml.write_text( + "name: codeanalyzer-deps\n" + "version: 1.0.0\n" + "dependencies:\n" + ' codeql/python-all: "*"\n' + ) + + if lock_file.exists(): + logger.debug(f"CodeQL pack dependencies already installed in {pack_dir}") + return pack_dir + + logger.info(f"Installing CodeQL pack dependencies in {pack_dir}.") + proc = subprocess.Popen( + [str(codeql_bin), "pack", "install", str(pack_dir)], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + ) + _, err = proc.communicate() + if proc.returncode != 0: + raise CodeQLExceptions.CodeQLDatabaseBuildException( + f"Failed to install CodeQL pack dependencies:\n" + f"{(err or b'').decode(errors='replace')}" + ) + return pack_dir + + def _ensure_codeql_bin(self) -> Path: + """Locate (or download) the CodeQL CLI binary into the project cache. + + Resolution order: + 1. An existing binary inside ``/codeql/bin/`` — + reused across runs on the same project. + 2. ``codeql`` already on the user's PATH — picked up verbatim. + 3. Otherwise, download into ``/codeql/bin/``. + + The project-local cache is preferred over PATH so the version we + installed earlier wins over whatever the OS ships — keeps behavior + deterministic when the user has both. 
+ """ + bin_root = self.cache_dir / "codeql" / "bin" + bin_root.mkdir(parents=True, exist_ok=True) + + existing = next( + (p for p in bin_root.rglob("codeql") if p.is_file()), + None, + ) + if existing and os.access(existing, os.X_OK): + logger.debug(f"Reusing cached CodeQL CLI at {existing}") + return existing.resolve() + + on_path = shutil.which("codeql") + if on_path: + logger.debug(f"Using CodeQL CLI from PATH at {on_path}") + return Path(on_path) + + logger.info(f"CodeQL CLI not found; downloading into {bin_root}.") + downloaded = CodeQLLoader.download_and_extract_codeql(bin_root) + if not downloaded.exists() or not os.access(downloaded, os.X_OK): + raise FileNotFoundError( + f"CodeQL binary not executable after download: {downloaded}" + ) + return downloaded + + def _get_call_graph( + self, + symbol_table: Dict[str, PyModule], + augment_sites: bool = False, + ) -> List[PyCallEdge]: + """Build CodeQL-resolved call edges and optionally augment sites. + + Returns an empty list when CodeQL isn't enabled or the database + isn't available. Edges carry ``provenance=["codeql"]`` — merge + with Jedi-derived edges via ``call_graph.merge_edges``. + + When ``augment_sites`` is True, also mutates + ``PyCallable.call_sites`` in the symbol table to backfill + ``callee_signature`` for sites Jedi couldn't resolve. The single + CodeQL query is shared (cached on the ``CodeQL`` instance) so + this costs no extra DB work. 
+ """ + if not self.using_codeql or self.db_path is None: + return [] + try: + cq = CodeQL( + self.project_dir, + self.db_path, + codeql_bin=self.codeql_bin, + codeql_packs_dir=self.codeql_packs_dir, + ) + edges = cq.build_call_graph_edges(symbol_table) + if augment_sites: + cq.augment_call_sites(symbol_table) + return edges + except Exception as exc: + logger.warning(f"CodeQL call-graph extraction failed: {exc}") + return [] \ No newline at end of file diff --git a/codeanalyzer/options/options.py b/codeanalyzer/options/options.py index 0378cb3..1602d45 100644 --- a/codeanalyzer/options/options.py +++ b/codeanalyzer/options/options.py @@ -14,7 +14,6 @@ class AnalysisOptions: input: Path output: Optional[Path] = None format: OutputFormat = OutputFormat.JSON - analysis_level: int = 1 using_codeql: bool = False using_ray: bool = False rebuild_analysis: bool = False diff --git a/codeanalyzer/schema/py_schema.py b/codeanalyzer/schema/py_schema.py index 62f3a8d..8bef391 100644 --- a/codeanalyzer/schema/py_schema.py +++ b/codeanalyzer/schema/py_schema.py @@ -339,9 +339,29 @@ class PyModule(BaseModel): file_size: Optional[int] = None +@builder +@msgpk +class PyCallEdge(BaseModel): + """Identity-only call-graph edge with weight. + + Mirrors Java's ``CallDependency``. ``source`` and ``target`` are + ``PyCallable.signature`` strings — nodes of the graph are the existing + ``PyCallable`` entries in the symbol table, not a separate vertex type. + Rich per-call metadata (receiver, arguments, location, ...) lives on + ``PyCallsite`` inside the source ``PyCallable.call_sites``. 
+ """ + + source: str # caller's PyCallable.signature + target: str # callee's PyCallable.signature + type: Literal["CALL_DEP"] = "CALL_DEP" + weight: int = 1 + provenance: List[Literal["jedi", "codeql", "joern"]] = [] + + @builder @msgpk class PyApplication(BaseModel): """Represents a Python application.""" symbol_table: Dict[str, PyModule] + call_graph: List[PyCallEdge] = [] diff --git a/codeanalyzer/semantic_analysis/call_graph.py b/codeanalyzer/semantic_analysis/call_graph.py new file mode 100644 index 0000000..f9fa941 --- /dev/null +++ b/codeanalyzer/semantic_analysis/call_graph.py @@ -0,0 +1,266 @@ +################################################################################ +# Copyright IBM Corporation 2025 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +################################################################################ + +"""Adapters between the persisted call-graph schema and ``networkx``. + +The schema persists the call graph as ``List[PyCallEdge]`` with signatures +referencing ``PyCallable`` entries already in the symbol table. These +helpers rehydrate it into a ``networkx.DiGraph`` for in-process queries +(paths, callers, callees) and reduce a built ``DiGraph`` back to the +serializable edge list. 
+""" + +from collections import Counter +from typing import Dict, Iterator, List, Tuple + +import networkx as nx + +from codeanalyzer.schema.py_schema import ( + PyApplication, + PyCallable, + PyCallEdge, + PyClass, + PyModule, +) + + +def _walk_class_callables(cls: PyClass) -> Iterator[PyCallable]: + for method in cls.methods.values(): + yield from _walk_callable(method) + for inner in cls.inner_classes.values(): + yield from _walk_class_callables(inner) + + +def _walk_callable(c: PyCallable) -> Iterator[PyCallable]: + yield c + for inner in c.inner_callables.values(): + yield from _walk_callable(inner) + for inner_cls in c.inner_classes.values(): + yield from _walk_class_callables(inner_cls) + + +def _walk_module_callables(module: PyModule) -> Iterator[PyCallable]: + for fn in module.functions.values(): + yield from _walk_callable(fn) + for cls in module.classes.values(): + yield from _walk_class_callables(cls) + + +def iter_callables_in_symbol_table( + symbol_table: Dict[str, PyModule], +) -> Iterator[PyCallable]: + """Yield every ``PyCallable`` in a symbol table, recursively.""" + for module in symbol_table.values(): + yield from _walk_module_callables(module) + + +def _walk_classes_in_class(cls: PyClass) -> Iterator[PyClass]: + yield cls + for inner in cls.inner_classes.values(): + yield from _walk_classes_in_class(inner) + # Classes can live inside methods (e.g. a factory method that defines + # a helper class). Recurse through every method's callable subtree. 
+ for method in cls.methods.values(): + yield from _walk_classes_in_callable(method) + + +def _walk_classes_in_callable(c: PyCallable) -> Iterator[PyClass]: + for inner_cls in c.inner_classes.values(): + yield from _walk_classes_in_class(inner_cls) + for inner in c.inner_callables.values(): + yield from _walk_classes_in_callable(inner) + + +def iter_classes_in_symbol_table( + symbol_table: Dict[str, PyModule], +) -> Iterator[PyClass]: + """Yield every ``PyClass`` in a symbol table, recursively — including + inner classes, classes nested in functions, and classes nested in + class methods.""" + for module in symbol_table.values(): + for cls in module.classes.values(): + yield from _walk_classes_in_class(cls) + for fn in module.functions.values(): + yield from _walk_classes_in_callable(fn) + + +def iter_callables(app: PyApplication) -> Iterator[PyCallable]: + """Yield every ``PyCallable`` in the application, recursively.""" + yield from iter_callables_in_symbol_table(app.symbol_table) + + +def callables_by_signature(app: PyApplication) -> Dict[str, PyCallable]: + """Flat ``signature -> PyCallable`` index for O(1) node lookup.""" + return {c.signature: c for c in iter_callables(app)} + + +def to_digraph(app: PyApplication) -> nx.DiGraph: + """Build a ``networkx.DiGraph`` from a ``PyApplication``. + + Nodes are keyed by ``PyCallable.signature``. Nodes for in-source + callables carry a ``callable`` attribute holding the full + ``PyCallable`` and ``ghost=False``. Endpoints referenced by edges + but absent from the symbol table — RPC targets, third-party + libraries, framework callbacks, dynamically resolved callees — are + added as **ghost** nodes (``callable=None``, ``ghost=True``) so the + edges are preserved. + + Edges carry ``type``, ``weight``, and ``provenance`` attributes. 
+ """ + g = nx.DiGraph() + by_sig = callables_by_signature(app) + for sig, c in by_sig.items(): + g.add_node(sig, callable=c, ghost=False) + for e in app.call_graph: + for sig in (e.source, e.target): + if sig not in g.nodes: + g.add_node(sig, callable=None, ghost=True) + g.add_edge( + e.source, + e.target, + type=e.type, + weight=e.weight, + provenance=list(e.provenance), + ) + return g + + +def from_digraph(g: nx.DiGraph) -> list: + """Reduce a ``DiGraph`` to the persisted ``List[PyCallEdge]`` form. + + Only edges are extracted; nodes are not serialized here — they are + expected to already exist as ``PyCallable`` entries in the symbol + table. Edge attributes default to ``CALL_DEP`` / weight 1 / empty + provenance when missing. + """ + edges = [] + for src, dst, data in g.edges(data=True): + edges.append( + PyCallEdge( + source=src, + target=dst, + type=data.get("type", "CALL_DEP"), + weight=int(data.get("weight", 1)), + provenance=list(data.get("provenance", [])), + ) + ) + return edges + + +def jedi_call_graph_edges( + symbol_table: Dict[str, PyModule], +) -> List[PyCallEdge]: + """Derive ``PyCallEdge`` entries from Jedi's per-callable ``call_sites``. + + For every ``PyCallable`` in the symbol table, each ``PyCallsite`` whose + ``callee_signature`` is resolved (non-empty) contributes an edge + ``caller.signature -> site.callee_signature``. Sites where Jedi failed + to resolve the callee (``callee_signature`` is ``None`` or empty) are + skipped — they have no anchor to put on the graph. + + Edges are coalesced on ``(source, target)``: ``weight`` is the count of + matching sites. Provenance is always ``["jedi"]``; combine with + CodeQL-derived edges via ``merge_edges``. 
+ """ + counts: Counter = Counter() + for caller in iter_callables_in_symbol_table(symbol_table): + for site in caller.call_sites: + if not site.callee_signature: + continue + counts[(caller.signature, site.callee_signature)] += 1 + + return [ + PyCallEdge(source=src, target=dst, weight=n, provenance=["jedi"]) + for (src, dst), n in counts.items() + ] + + +def resolve_unresolved_constructors(symbol_table: Dict[str, PyModule]) -> int: + """Fill in ``PyCallsite.callee_signature`` for unresolved constructor sites. + + When both Jedi and CodeQL fail to resolve a constructor call (commonly + for classes nested inside functions or methods, where static-analysis + points-to is weakest), Jedi still flags the site as + ``is_constructor_call=True`` with ``method_name`` set to the class's + short name. This pass does the resolution heuristically: + + 1. Build a ``short_name -> [PyClass]`` index from all classes in the + symbol table. + 2. For each unresolved constructor site under a caller ``C``, look up + candidates by ``site.method_name`` and prefer the class whose + ``signature`` is the longest prefix-ancestor of ``C.signature`` — + this approximates Python's LEGB scoping for nested classes. + 3. Set ``callee_signature = f"{class.signature}.__init__"``. + + Returns the number of sites resolved. Best-effort; sites with no + matching class or ambiguous candidates with no scope tiebreaker are + left as-is. + """ + by_name: Dict[str, List[PyClass]] = {} + for cls in iter_classes_in_symbol_table(symbol_table): + by_name.setdefault(cls.name, []).append(cls) + + resolved = 0 + for caller in iter_callables_in_symbol_table(symbol_table): + for site in caller.call_sites: + if not site.is_constructor_call or site.callee_signature: + continue + candidates = by_name.get(site.method_name) + if not candidates: + continue + + # Prefer the class whose signature is the longest prefix of + # the caller's signature (closest enclosing scope). 
+ def scope_score(c: PyClass, _caller_sig: str = caller.signature) -> int: + cls_sig = c.signature + parent_sig = cls_sig.rsplit(".", 1)[0] if "." in cls_sig else "" + # Score = length of parent_sig if it's a prefix of caller's + # signature, else -1 (not in scope, lowest priority). + if parent_sig and _caller_sig.startswith(parent_sig): + return len(parent_sig) + # Module-level class (parent_sig is the module path) — give + # it a base score so it still wins over no match. + return 0 if not parent_sig else -1 + + best = max(candidates, key=scope_score) + if scope_score(best) < 0: + # No candidate is reachable from caller's scope. + continue + + site.callee_signature = f"{best.signature}.__init__" + resolved += 1 + + return resolved + + +def merge_edges(*edge_lists: list) -> list: + """Merge multiple ``List[PyCallEdge]`` into one. + + Edges with the same ``(source, target)`` are coalesced: weights sum, + provenance is the sorted union. Useful for combining edges produced + by different backends (e.g. Jedi + CodeQL). + """ + by_key: Dict[Tuple[str, str], PyCallEdge] = {} + for edges in edge_lists: + for e in edges: + k = (e.source, e.target) + if k in by_key: + cur = by_key[k] + cur.weight += e.weight + cur.provenance = sorted(set(cur.provenance) | set(e.provenance)) + else: + by_key[k] = e.model_copy() + return list(by_key.values()) diff --git a/codeanalyzer/semantic_analysis/codeql/codeql_analysis.py b/codeanalyzer/semantic_analysis/codeql/codeql_analysis.py index 5445d6b..0c0e046 100644 --- a/codeanalyzer/semantic_analysis/codeql/codeql_analysis.py +++ b/codeanalyzer/semantic_analysis/codeql/codeql_analysis.py @@ -20,13 +20,16 @@ for Python projects and execute queries against them. 
""" +from collections import Counter from pathlib import Path -from typing import Union +from typing import Any, Dict, Iterator, List, Tuple, Union -from networkx import DiGraph from pandas import DataFrame +from codeanalyzer.schema.py_schema import PyCallEdge, PyModule +from codeanalyzer.semantic_analysis.call_graph import iter_callables_in_symbol_table from codeanalyzer.semantic_analysis.codeql.codeql_query_runner import CodeQLQueryRunner +from codeanalyzer.utils import logger class CodeQL: @@ -40,94 +43,258 @@ class CodeQL: temp_db (TemporaryDirectory or None): The temporary directory object if a temporary database was created. """ - def __init__(self, project_dir: Union[str, Path], db_path: Path) -> None: + def __init__( + self, + project_dir: Union[str, Path], + db_path: Path, + codeql_bin: Union[str, Path, None] = None, + codeql_packs_dir: Union[str, Path, None] = None, + ) -> None: self.project_dir = project_dir self.db_path = db_path + self.codeql_bin = codeql_bin + self.codeql_packs_dir = codeql_packs_dir + self._cached_df: "DataFrame | None" = None - def _build_call_graph(self) -> DiGraph: - """Builds the call graph of the application. + def _query_call_edges(self) -> DataFrame: + """Runs the CodeQL query that emits one row per resolved call site. - Returns: - DiGraph: A directed graph representing the call graph of the application. - """ - query = [] + The query is written against CodeQL's Python library (``import python``). + It returns physical location handles for both endpoints so the + downstream post-processor can join into Jedi's existing + ``PyCallable.signature`` space via ``(file_path, start_line)`` — + no signature normalization required. - # Add import - query += ["import python"] + Filters: + * Caller must be a ``Function`` (skip module-level / class-body + calls — they have no ``PyCallable`` to anchor to). 
+ * Callee may resolve to anything (in-source or library stub); + non-application callees become **ghost** nodes downstream so + RPC / third-party / framework edges are preserved. - # Add Call edges between caller and callee and filter to only capture application methods. - query += [ - "from Method caller, Method callee", + Returns: + DataFrame: one row per resolved (caller, callee, call-site) + triple. Duplicate ``(caller_file, caller_start_line, + callee_file, callee_start_line)`` tuples represent multiple + call sites in the same caller targeting the same callee and + are coalesced into a single ``PyCallEdge`` (weight = count) + by the post-processor. + """ + query = [ + "/**", + " * @name Python call-graph edges", + " * @description One row per resolved call site: caller, callee,", + " * and the call-expression location.", + " * @kind table", + " * @id py/codeanalyzer/call-graph-edges", + " */", + "import python", + # ``FunctionValue`` / ``ClassValue`` / the ``pointsTo`` predicate + # live in ObjectAPI, which ``import python`` only brings in as a + # private import — they aren't re-exported. Pull them in + # explicitly. + "import semmle.python.objects.ObjectAPI", + "", + # ``Value.getACall()`` is the modern call-resolution API in + # codeql/python-all 7.x — it returns the ``CallNode`` (CFG) + # whose target was resolved to that ``Value``. Cleaner than + # poking at ``pointsTo`` directly. + "from CallNode call, Function caller, FunctionValue calleeVal", "where", - "caller.fromSource() and", - "callee.fromSource() and", - "caller.calls(callee)", + " call.getScope() = caller and", + " (", + # Direct function / bound-method call: foo() or obj.foo() + " call = calleeVal.getACall()", + " or", + # Constructor call: A(...) resolves to a ClassValue; the actual + # callee is the class's __init__ (via MRO lookup so subclasses + # without an explicit __init__ still resolve to the inherited one). 
+ " exists(ClassValue clsVal |", + " call = clsVal.getACall() and", + ' clsVal.lookup("__init__") = calleeVal', + " )", + " )", "select", + # --- Caller endpoint --- (joins to PyCallable via file + start_line) + " caller.getLocation().getFile().getAbsolutePath(),", + " caller.getLocation().getStartLine(),", + " caller.getQualifiedName(),", + # --- Callee endpoint --- (file/line may live in a library stub; + # post-processor classifies as in-source or ghost) + " calleeVal.getScope().getLocation().getFile().getAbsolutePath(),", + " calleeVal.getScope().getLocation().getStartLine(),", + " calleeVal.getQualifiedName(),", + # --- Call-site location --- (for PyCallsite augmentation) + " call.getLocation().getStartLine(),", + " call.getLocation().getStartColumn(),", + " call.getLocation().getEndLine(),", + " call.getLocation().getEndColumn()", + # ``is_constructor`` is derived in the post-processor by + # checking whether ``callee_qname`` ends in ``.__init__``; + # avoids QL's restrictive ``if-then-else`` typing here. 
] - - # Caller metadata - query += [ - "caller.getFile().getAbsolutePath(),", - '"[" + caller.getBody().getLocation().getStartLine() + ", " + caller.getBody().getLocation().getEndLine() + "]", //Caller body slice indices', - "caller.getQualifiedName(), // Caller's fullsignature", - "caller.getAModifier(), // caller's method modifier", - "caller.paramsString(), // caller's method parameter types", - "caller.getReturnType().toString(), // Caller's return type", - "caller.getDeclaringType().getQualifiedName(), // Caller's class", - "caller.getDeclaringType().getAModifier(), // Caller's class modifier", - ] - - # Callee metadata - query += [ - "callee.getFile().getAbsolutePath(),", - '"[" + callee.getBody().getLocation().getStartLine() + ", " + callee.getBody().getLocation().getEndLine() + "]", //Caller body slice indices', - "callee.getQualifiedName(), // Caller's fullsignature", - "callee.getAModifier(), // callee's method modifier", - "callee.paramsString(), // callee's method parameter types", - "callee.getReturnType().toString(), // Caller's return type", - "callee.getDeclaringType().getQualifiedName(), // Caller's class", - "callee.getDeclaringType().getAModifier() // Caller's class modifier", - ] + if self._cached_df is not None: + return self._cached_df query_string = "\n".join(query) - # Execute the query using the CodeQLQueryRunner context manager - with CodeQLQueryRunner(self.db_path) as query: - query_result: DataFrame = query.execute( + with CodeQLQueryRunner( + self.db_path, + codeql_bin=self.codeql_bin, + codeql_packs_dir=self.codeql_packs_dir, + ) as runner: + df: DataFrame = runner.execute( query_string, column_names=[ - # Caller Columns "caller_file", - "caller_body_slice_index", - "caller_signature", - "caller_modifier", - "caller_params", - "caller_return_type", - "caller_class_signature", - "caller_class_modifier", - # Callee Columns + "caller_start_line", + "caller_qname", "callee_file", - "callee_body_slice_index", - "callee_signature", - 
"callee_modifier", - "callee_params", - "callee_return_type", - "callee_class_signature", - "callee_class_modifier", + "callee_start_line", + "callee_qname", + "call_start_line", + "call_start_column", + "call_end_line", + "call_end_column", ], ) - - # Process the query results into JMethod instances - callgraph: DiGraph = self.__process_call_edges_to_callgraph(query_result) - return callgraph + self._cached_df = df + return df @staticmethod - def __process_call_edges_to_callgraph(query_result: DataFrame) -> DiGraph: - """Processes call edges from query results into a call graph. + def _build_callable_location_index( + symbol_table: Dict[str, PyModule], + ) -> Dict[Tuple[str, int], "PyCallable"]: + """Build ``(absolute_file_path, start_line) -> PyCallable`` from Jedi. + + Paths are resolved so they match CodeQL's ``getAbsolutePath()`` + regardless of symlinks or the current working directory. + """ + from codeanalyzer.schema.py_schema import PyCallable # local to avoid cycle + + index: Dict[Tuple[str, int], PyCallable] = {} + for c in iter_callables_in_symbol_table(symbol_table): + try: + abs_path = str(Path(c.path).resolve()) + except (OSError, RuntimeError): + abs_path = c.path + index[(abs_path, c.start_line)] = c + return index + + def _iter_resolved_rows( + self, symbol_table: Dict[str, PyModule] + ) -> "Iterator[Tuple[str, str, Any]]": + """Yield ``(source_sig, target_sig, row)`` for every CodeQL row. + + Rows whose caller can't be matched to a ``PyCallable`` in the + symbol table are skipped. Callee misses fall back to + ``row.callee_qname`` (ghost). Used by both edge construction and + call-site augmentation so a single CodeQL query feeds both. 
+ """ + df = self._query_call_edges() + if df.empty: + return + location_index = self._build_callable_location_index(symbol_table) + + skipped_unknown_caller = 0 + ghost_callees = 0 + for row in df.itertuples(index=False): + caller_key = (row.caller_file, int(row.caller_start_line)) + caller = location_index.get(caller_key) + if caller is None: + skipped_unknown_caller += 1 + continue + + callee_key = (row.callee_file, int(row.callee_start_line)) + callee = location_index.get(callee_key) + if callee is not None: + target_sig = callee.signature + else: + target_sig = row.callee_qname + ghost_callees += 1 - Args: - query_result (DataFrame): The DataFrame containing call edge information. + yield caller.signature, target_sig, row + + if skipped_unknown_caller: + logger.debug( + f"CodeQL: skipped {skipped_unknown_caller} rows whose caller " + f"was not in Jedi's symbol table." + ) + if ghost_callees: + logger.debug( + f"CodeQL: {ghost_callees} rows resolved to ghost (external) callees." + ) + + def build_call_graph_edges( + self, symbol_table: Dict[str, PyModule] + ) -> List[PyCallEdge]: + """Run the CodeQL query and turn each row into a ``PyCallEdge``. + + Edges are coalesced on ``(source, target)`` — ``weight`` is the + number of distinct call sites in the caller targeting the callee. + Provenance is always ``["codeql"]``; combine with Jedi-derived + edges via ``call_graph.merge_edges``. + """ + edge_counts: Counter = Counter() + for source_sig, target_sig, _row in self._iter_resolved_rows(symbol_table): + edge_counts[(source_sig, target_sig)] += 1 + + return [ + PyCallEdge( + source=src, + target=dst, + weight=count, + provenance=["codeql"], + ) + for (src, dst), count in edge_counts.items() + ] + + def augment_call_sites(self, symbol_table: Dict[str, PyModule]) -> int: + """Backfill ``PyCallsite.callee_signature`` using CodeQL resolution. 
+ + Walks every CodeQL row, locates the matching ``PyCallsite`` inside + the caller's ``PyCallable.call_sites`` by call-expression line range + (``start_line``, ``end_line``), and fills in ``callee_signature`` + **only when Jedi left it empty**. Existing Jedi-resolved signatures + are kept (Jedi sees lexical context CodeQL can't, e.g. closures). + + Match is by line range — column matching is brittle across the two + tools' 0- vs 1-based conventions. Ambiguity on a single line + (e.g. ``a.b().c()``) resolves to the first matching site, which is + an acceptable approximation given how rarely Jedi misses callees + on chained call lines. Returns: - DiGraph: A directed graph representing the call graph of the application. + Number of ``PyCallsite`` entries augmented. """ + location_index = self._build_callable_location_index(symbol_table) + df = self._query_call_edges() + if df.empty: + return 0 + + augmented = 0 + for row in df.itertuples(index=False): + caller_key = (row.caller_file, int(row.caller_start_line)) + caller = location_index.get(caller_key) + if caller is None: + continue + + callee_key = (row.callee_file, int(row.callee_start_line)) + callee = location_index.get(callee_key) + resolved_sig = callee.signature if callee is not None else row.callee_qname + + call_start = int(row.call_start_line) + call_end = int(row.call_end_line) + for site in caller.call_sites: + if site.start_line != call_start or site.end_line != call_end: + continue + if not site.callee_signature: + site.callee_signature = resolved_sig + augmented += 1 + break + + if augmented: + logger.debug( + f"CodeQL: augmented {augmented} PyCallsite.callee_signature entries." 
+ ) + return augmented diff --git a/codeanalyzer/semantic_analysis/codeql/codeql_loader.py b/codeanalyzer/semantic_analysis/codeql/codeql_loader.py index eb1faf9..dc95e0b 100644 --- a/codeanalyzer/semantic_analysis/codeql/codeql_loader.py +++ b/codeanalyzer/semantic_analysis/codeql/codeql_loader.py @@ -1,4 +1,6 @@ +import os import platform +import stat import zipfile from pathlib import Path @@ -52,12 +54,38 @@ def download_and_extract_codeql(cls, temp_dir: Path) -> Path: extract_dir = temp_dir / filename.replace(".zip", "") extract_dir.mkdir(exist_ok=True) - print(f"Extracting CodeQL CLI to {extract_dir}") + logger.info(f"Extracting CodeQL CLI to {extract_dir}") + # zipfile.extractall drops Unix permissions (the executable bit), so + # we extract entries manually and copy each one's stored mode onto + # the file system. Without this, the CodeQL launcher script can't + # be executed and the next subprocess.Popen raises PermissionError. with zipfile.ZipFile(archive_path, "r") as zip_ref: - zip_ref.extractall(extract_dir) + for info in zip_ref.infolist(): + extracted_path = zip_ref.extract(info, extract_dir) + stored_mode = info.external_attr >> 16 + if stored_mode: + os.chmod(extracted_path, stored_mode) - codeql_bin = next(extract_dir.rglob("codeql"), None) - if not codeql_bin or not codeql_bin.exists(): + # Archive is no longer needed once extracted. + try: + archive_path.unlink() + except OSError as exc: + logger.warning(f"Could not remove CodeQL archive {archive_path}: {exc}") + + # rglob("codeql") returns both the launcher file *and* an internal + # directory of the same name (CodeQL ships its own runtime under + # ``codeql/codeql/``); insist on a regular file so we never bind to + # the directory. 
+ codeql_bin = next( + (p for p in extract_dir.rglob("codeql") if p.is_file()), + None, + ) + if not codeql_bin: raise FileNotFoundError("CodeQL binary not found in extracted contents.") + # Belt-and-suspenders: ensure the binary is executable even if the + # archive entry's mode was zero (some older zip producers omit it). + st = codeql_bin.stat() + codeql_bin.chmod(st.st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH) + return codeql_bin.resolve() diff --git a/codeanalyzer/semantic_analysis/codeql/codeql_query_runner.py b/codeanalyzer/semantic_analysis/codeql/codeql_query_runner.py index ce15a98..17eb368 100644 --- a/codeanalyzer/semantic_analysis/codeql/codeql_query_runner.py +++ b/codeanalyzer/semantic_analysis/codeql/codeql_query_runner.py @@ -40,9 +40,13 @@ class CodeQLQueryRunner: Args: database_path (str): The path to the CodeQL database. + codeql_bin (str | Path | None): Absolute path to the CodeQL CLI + binary. When ``None``, falls back to whatever ``codeql`` is on + ``PATH``. Attributes: database_path (Path): The path to the CodeQL database. + codeql_bin (str): Resolved binary path or the literal ``"codeql"``. temp_file_path (Path): The path to the temporary query file. csv_output_file (Path): The path to the CSV output file. temp_bqrs_file_path (Path): The path to the temporary bqrs file. @@ -52,39 +56,46 @@ class CodeQLQueryRunner: CodeQLQueryExecutionException: If there is an error executing the query. """ - def __init__(self, database_path: str): + def __init__(self, database_path: str, codeql_bin=None, codeql_packs_dir=None): self.database_path: Path = Path(database_path) + self.codeql_bin: str = str(codeql_bin) if codeql_bin else "codeql" + self.codeql_packs_dir = ( + Path(codeql_packs_dir) if codeql_packs_dir is not None else None + ) self.temp_file_path: Path = None def __enter__(self): - """Context entry that creates temporary files to execute a CodeQL query. - - Returns: - CodeQLQueryRunner: The instance of the class. 
- - Note: - This method creates temporary files to hold the query and store their paths. + """Context entry that prepares paths to execute a CodeQL query. + + The ``.ql`` file is written **inside the prepared qlpack + directory** (``codeql_packs_dir``) so ``import python`` resolves + against that pack's installed dependencies — no + ``--additional-packs`` or ``--search-path`` needed. The CSV / + BQRS output files live in ``tempfile`` because they're transient + per-query artifacts. """ - - # Create a temporary file to hold the query and store its path - temp_file = tempfile.NamedTemporaryFile("w", delete=False, suffix=".ql") + # CSV and BQRS files are transient per-query — fine in /tmp. csv_file = tempfile.NamedTemporaryFile("w", delete=False, suffix=".csv") bqrs_file = tempfile.NamedTemporaryFile("w", delete=False, suffix=".bqrs") - self.temp_file_path = Path(temp_file.name) self.csv_output_file = Path(csv_file.name) self.temp_bqrs_file_path = Path(bqrs_file.name) - - # Let's close the files, we'll reopen them by path when needed. - temp_file.close() - bqrs_file.close() csv_file.close() + bqrs_file.close() - # Create a temporary qlpack.yml file - self.temp_qlpack_file = self.temp_file_path.parent / "qlpack.yml" - with self.temp_qlpack_file.open("w") as f: - f.write("name: temp\n") - f.write("version: 1.0.0\n") - f.write("libraryPathDependencies: codeql/java-all\n") + # The .ql file MUST live inside the prepared qlpack so its + # ``import python`` resolves via that pack's lock file. Writing + # outside the pack means CodeQL falls back to a default + # search-path that doesn't include downloaded library packs. + if self.codeql_packs_dir is None: + raise RuntimeError( + "CodeQLQueryRunner requires codeql_packs_dir — the directory " + "of an installed qlpack that depends on codeql/python-all." 
+ ) + ql_file = tempfile.NamedTemporaryFile( + "w", delete=False, suffix=".ql", dir=str(self.codeql_packs_dir) + ) + self.temp_file_path = Path(ql_file.name) + ql_file.close() return self @@ -108,32 +119,41 @@ def execute(self, query_string: str, column_names: List[str]) -> DataFrame: # Write the query to the temp file so we can execute it. self.temp_file_path.write_text(query_string) - # Construct and execute the CodeQL CLI command asking for a JSON output. + # The .ql file sits inside the qlpack directory whose lock file + # already resolves ``codeql/python-all`` and its transitive + # dependencies. ``codeql query run`` auto-discovers the enclosing + # qlpack — no extra flags required. codeql_query_cmd = shlex.split( - f"codeql query run {self.temp_file_path} --database={self.database_path} --output={self.temp_bqrs_file_path}", + f"{shlex.quote(self.codeql_bin)} query run {self.temp_file_path} " + f"--database={self.database_path} " + f"--output={self.temp_bqrs_file_path}", posix=False, ) - call = subprocess.Popen(codeql_query_cmd, stdout=None, stderr=None) + call = subprocess.Popen( + codeql_query_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE + ) _, err = call.communicate() if call.returncode != 0: raise CodeQLExceptions.CodeQLQueryExecutionException( - f"Error executing query: {err.stderr}" + f"Error executing query: {(err or b'').decode(errors='replace')}" ) # Convert the bqrs file to a CSV file bqrs2csv_command = shlex.split( - f"codeql bqrs decode --format=csv --output={self.csv_output_file} {self.temp_bqrs_file_path}", + f"{shlex.quote(self.codeql_bin)} bqrs decode --format=csv --output={self.csv_output_file} {self.temp_bqrs_file_path}", posix=False, ) # Read the CSV file content and cast it to a DataFrame - call = subprocess.Popen(bqrs2csv_command, stdout=None, stderr=None) + call = subprocess.Popen( + bqrs2csv_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE + ) _, err = call.communicate() if call.returncode != 0: raise 
CodeQLExceptions.CodeQLQueryExecutionException( - f"Error executing query: {err.stderr}" + f"Error decoding bqrs: {(err or b'').decode(errors='replace')}" ) else: return pd.read_csv( @@ -161,5 +181,5 @@ def __exit__(self, exc_type, exc_val, exc_tb): if self.csv_output_file and self.csv_output_file.exists(): self.csv_output_file.unlink() - if self.temp_qlpack_file and self.temp_qlpack_file.exists(): - self.temp_qlpack_file.unlink() + if self.temp_bqrs_file_path and self.temp_bqrs_file_path.exists(): + self.temp_bqrs_file_path.unlink() diff --git a/codeanalyzer/semantic_analysis/wala/__init__.py b/codeanalyzer/semantic_analysis/wala/__init__.py deleted file mode 100644 index e68623a..0000000 --- a/codeanalyzer/semantic_analysis/wala/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -################################################################################ -# Copyright IBM Corporation 2025 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-################################################################################ diff --git a/codeanalyzer/syntactic_analysis/symbol_table_builder.py b/codeanalyzer/syntactic_analysis/symbol_table_builder.py index c7b83b4..468bb9d 100644 --- a/codeanalyzer/syntactic_analysis/symbol_table_builder.py +++ b/codeanalyzer/syntactic_analysis/symbol_table_builder.py @@ -4,7 +4,7 @@ from ast import AST, ClassDef from io import StringIO from pathlib import Path -from typing import Dict, List, Optional, Union +from typing import Dict, List, Optional, Tuple, Union import jedi from jedi.api import Script @@ -71,6 +71,32 @@ def _infer_qualified_name(script: Script, line: int, column: int) -> Optional[st pass return None + @staticmethod + def _infer_callee( + script: Script, line: int, column: int + ) -> Tuple[Optional[str], bool]: + """Infer ``(qualified_name, is_class)`` at a call expression. + + When the callee resolves to a class (e.g. ``A()``), the qualified + name is normalized to ``.__init__`` so it joins to the + ``PyCallable`` entry for the constructor in the symbol table — + classes themselves are not ``PyCallable``s, so without this + rewrite every constructor call would surface as a ghost node in + the call graph. + """ + try: + definitions = script.infer(line=line, column=column) + if not definitions: + return None, False + d = definitions[0] + is_class = (d.type == "class") + full = d.full_name + if is_class and full: + full = f"{full}.__init__" + return full, is_class + except Exception: + return None, False + def build_pymodule_from_file(self, py_file: Path) -> PyModule: """Builds a PyModule from a Python file. @@ -485,6 +511,63 @@ def _accessed_symbols( symbols.append(symbol) return symbols + @staticmethod + def _iter_calls_in_scope(fn_node: ast.AST): + """Yield ``ast.Call`` nodes belonging to ``fn_node``'s own scope. 
+ + Naive ``ast.walk`` descends into nested ``FunctionDef`` / ``ClassDef`` + bodies, attributing their calls to the outer function — wrong, since + those nested definitions have their own ``PyCallable`` entries + (built recursively by ``_callables``/``_add_class``) and own + ``call_sites`` lists. + + Decorators, default arguments, return-type annotations, base + classes and class-level keyword args ARE evaluated in the + enclosing scope, so calls in those subtrees stay attributed to + ``fn_node``. Bodies of nested defs/classes are skipped. Lambdas, + comprehensions and inline conditionals don't get their own + ``PyCallable`` so their internals stay attributed to the enclosing + function. + """ + + def walk(node: ast.AST): + if isinstance(node, ast.Call): + yield node + + if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)): + # Decorators, defaults, return annotations run in + # enclosing scope. Body and arg names run in inner scope. + for dec in node.decorator_list: + yield from walk(dec) + for default in node.args.defaults: + yield from walk(default) + for default in node.args.kw_defaults: + if default is not None: + yield from walk(default) + if node.returns is not None: + yield from walk(node.returns) + return + + if isinstance(node, ast.ClassDef): + # Decorators, bases, and keyword args run in enclosing scope. + # Body runs in class scope. + for dec in node.decorator_list: + yield from walk(dec) + for base in node.bases: + yield from walk(base) + for kw in node.keywords: + yield from walk(kw.value) + return + + for child in ast.iter_child_nodes(node): + yield from walk(child) + + for stmt in getattr(fn_node, "body", []): + yield from walk(stmt) + # Decorators / defaults / returns of fn_node itself are evaluated + # in the ENCLOSING scope, so they belong to fn_node's parent, not + # fn_node. Don't yield them here. 
+ def _call_sites(self, fn_node: ast.FunctionDef, script: Script) -> List[PyCallsite]: """ Finds all call sites made from within the function using Jedi for type inference. @@ -498,14 +581,14 @@ def _call_sites(self, fn_node: ast.FunctionDef, script: Script) -> List[PyCallsi """ call_sites: List[PyCallsite] = [] - for node in ast.walk(fn_node): + for node in self._iter_calls_in_scope(fn_node): if not isinstance(node, ast.Call): continue func_expr = node.func method_name = "" - callee_signature = self._infer_qualified_name( + callee_signature, is_constructor = self._infer_callee( script, node.lineno, node.col_offset ) return_type = self._infer_type(script, node.lineno, node.col_offset) @@ -535,7 +618,7 @@ def _call_sites(self, fn_node: ast.FunctionDef, script: Script) -> List[PyCallsi .argument_types(argument_types) .return_type(return_type) .callee_signature(callee_signature) - .is_constructor_call(method_name == "__init__") + .is_constructor_call(is_constructor) .start_line(getattr(node, "lineno", -1)) .start_column(getattr(node, "col_offset", -1)) .end_line(getattr(node, "end_lineno", -1)) diff --git a/pyproject.toml b/pyproject.toml index a735f67..549855f 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "codeanalyzer-python" -version = "0.1.13" +version = "0.1.14" description = "Static Analysis on Python source code using Jedi, CodeQL and Treesitter." readme = "README.md" authors = [ diff --git a/test/test_cli.py b/test/test_cli.py index d341cfb..b4ba50d 100644 --- a/test/test_cli.py +++ b/test/test_cli.py @@ -20,8 +20,6 @@ def test_cli_call_symbol_table_with_json(cli_runner, whole_applications__xarray) str(whole_applications__xarray), "--output", str(output_dir), - "--analysis-level", - "1", "--ray", "--no-codeql", "--cache-dir",