Level-2: taint/dataflow via Pysa (parallel to codeanalyzer-ts#1)

## Summary

Add **taint / dataflow analysis** as a level-2 enrichment to the Python backend, using **[Pysa](https://pyre-check.org/docs/pysa-basics/)** (Meta's static taint analysis, built on the Pyre type checker). This is the Python parallel to **codellm-devkit/codeanalyzer-ts#1** (which uses [Jelly](https://github.com/cs-au-dk/jelly) for the TypeScript backend) and would **supersede the current experimental CodeQL integration**.

Pysa provides mature interprocedural **source → sink** taint tracking, with sources, sinks, and sanitizers declared in `taint.config` and `.pysa` model files. That is exactly the level-2 capability we want, and one the Jedi/tree-sitter front-end doesn't provide today.

## Why Pysa

- Mature, production-grade interprocedural taint (Meta uses it at scale).
- **Ships as a pre-compiled native binary** bundled in the [`pyre-check` PyPI wheel](https://pypi.org/project/pyre-check/) — no OCaml toolchain needed at runtime, same distribution shape as a bundled analyzer binary.
- Models-as-data (`taint.config` + `.pysa`) → sources/sinks extensible without code changes.

## Constraints / risks (eyes open)

Unlike a pure-Python library, Pysa is an **external engine**, so:

- **Sidecar, not in-process.** The wrapper has to invoke the Pyre/Pysa binary as a subprocess; it can't be imported as a library. This is the "external tool" integration model (like Joern/CodeQL), not an embeddable pure-Python one.
- **Linux + macOS only — no Windows.** Pyre is officially [Linux/macOS](https://pyre-check.org/docs/installation/); Windows is WSL-only/unsupported. Level-2 taint would be unavailable on native Windows.
- **The binary alone isn't enough to run.** Pysa needs a configured Pyre project: `.pyre_configuration`, `taint.config`, `.pysa` model files, and **typeshed** third-party stubs. "Bundling Pysa" = binary **+** that model/stub/config ecosystem.
- Sizable native binary (tens of MB).
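
To make the "binary + config ecosystem" point concrete, here is a minimal sketch of scaffolding a Pyre project directory. `write_pyre_project` and the `taint_models` directory name are hypothetical; `source_directories` and `taint_models_path` are the `.pyre_configuration` keys this sketch assumes are enough for a first headless run (typeshed and `search_path` handling are left out).

```python
import json
from pathlib import Path


def write_pyre_project(
    root: Path, source_dir: str = ".", models_dir: str = "taint_models"
) -> Path:
    """Write a minimal .pyre_configuration next to a taint model directory.

    Hypothetical helper: a real wrapper would also copy taint.config and
    *.pysa files into `models_dir` and deal with typeshed stubs.
    """
    root.mkdir(parents=True, exist_ok=True)
    (root / models_dir).mkdir(exist_ok=True)
    config = {
        # Directories Pyre type-checks and Pysa analyzes.
        "source_directories": [source_dir],
        # Where Pysa looks for taint.config and *.pysa model files.
        "taint_models_path": [models_dir],
    }
    config_path = root / ".pyre_configuration"
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config_path
```

Generating the project in a scratch directory (rather than writing into the user's repo) would keep the analysis side-effect free, at the cost of pointing `source_directories` back at the analyzed tree.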

## Distribution approach

- Add `pyre-check` as an **optional extra** (e.g. `pip install codeanalyzer-python[taint]`) so the base install — including Windows — is unaffected.
- Gate level-2 taint on platform/availability; emit a clear "Pysa unavailable on this platform" message and fall back to the Jedi call graph when `pyre`/`pysa` can't run.
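
The gating step could look roughly like the sketch below. `pysa_unavailable_reason` is a hypothetical name, and checking for a `pyre` executable on `PATH` is an assumption about how the `[taint]` extra exposes the tool; the caller would log the returned message and fall back to the Jedi call graph.

```python
import shutil
import sys
from typing import Callable, Optional


def pysa_unavailable_reason(
    platform: str = sys.platform,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return None if Pysa looks runnable, else a human-readable reason.

    `platform` and `which` are injectable so the check can be unit-tested
    without depending on the host machine.
    """
    if platform.startswith("win"):
        return "Pysa unavailable on this platform (native Windows is unsupported)"
    if which("pyre") is None:
        return "Pysa unavailable: 'pyre' not found on PATH (install the [taint] extra)"
    return None
```

Returning a reason string instead of a boolean lets the CLI print the same message it used to decide on the fallback.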

## Configurable taint models (sources / sinks / sanitizers as args)

Pysa is **models-as-data by design**: `taint.config` declares the *kinds* (named sources/sinks/sanitizers + the rules connecting them) and `.pysa` files *bind* those kinds to fully-qualified Python symbols. The wrapper's job is to surface that through the `codeanalyzer` CLI and forward it to Pysa, layered over a bundled default pack.

**CLI surface (extends the existing `codeanalyzer` CLI):**
```text
--taint                    enable taint (or implied by the level-2 flag)
--taint-config <path>      extra taint.config JSON ('-' = stdin)
--pysa-models <dir|file>   extra .pysa model files
--source <spec>            inline source     (repeatable)
--sink <spec>              inline sink        (repeatable)
--sanitizer <spec>         inline sanitizer   (repeatable)
--taint-builtins <on|off>  load the bundled default pack (default on)
```
Precedence (later extends/overrides earlier): **built-in pack → `--taint-config`/`--pysa-models` → inline flags**. Inline `--source/--sink/--sanitizer` get **compiled into a generated `taint.config` + `.pysa` fragment** the wrapper writes into the Pyre project before invoking Pysa.
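
One possible reading of that precedence, sketched for the `taint.config` side only: layers are merged in order, with a later layer replacing an earlier entry that has the same identity. Keying sources and sinks by `name` and rules by `code` is an assumption of this sketch, and `merge_taint_configs` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Identity key per taint.config section (an assumption of this sketch).
_SECTION_KEYS = {"sources": "name", "sinks": "name", "rules": "code"}

TaintConfig = Dict[str, List[Dict[str, Any]]]


def merge_taint_configs(*layers: TaintConfig) -> TaintConfig:
    """Merge taint.config layers; later layers extend or override earlier ones."""
    merged: Dict[str, Dict[Any, Dict[str, Any]]] = {s: {} for s in _SECTION_KEYS}
    for layer in layers:
        for section, key in _SECTION_KEYS.items():
            for entry in layer.get(section, []):
                # Same identity key: the later layer wins.
                merged[section][entry[key]] = entry
    return {section: list(entries.values()) for section, entries in merged.items()}
```

With `--taint-builtins off`, the built-in layer would simply be left out of the call, so the user-supplied layers fully replace it.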

**The two model layers (Pysa-native):**
`taint.config` declares the *kinds* and the rules linking them:
```json
{
  "sources": [{ "name": "UserControlled" }],
  "sinks":   [{ "name": "RemoteCodeExecution" }],
  "rules":   [{ "name": "RCE", "code": 5001,
               "sources": ["UserControlled"], "sinks": ["RemoteCodeExecution"],
               "message_format": "User input reaches a command execution sink" }]
}
```
```python
# *.pysa — binds kinds to fully-qualified symbols via Taint* annotations
def os.system(command: TaintSink[RemoteCodeExecution]): ...
django.http.request.HttpRequest.GET: TaintSource[UserControlled] = ...
def html.escape(s) -> Sanitize: ...
```

**`match` identity:** Pysa names elements by **fully-qualified Python name** (`os.system`, `django.http.request.HttpRequest.GET`), resolved through **Pyre's type environment** — covering stdlib/third-party (via **typeshed** stubs) *and* first-party functions. So an inline spec is a qualified name + kind, e.g. `--sink "os.system#0:RemoteCodeExecution"`. (Typeshed stubs are the Python analog of Jelly's access paths in codeanalyzer-ts#1.)
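
A minimal sketch of parsing such an inline spec follows. The grammar is an assumption extrapolated from the single example above: `<qualified.name>[#<index>]:<Kind>`, with `#<index>` read as a zero-based positional parameter index. `TaintSpec` and `parse_spec` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed grammar: qualified.name, optional "#<param index>", then ":<Kind>".
_SPEC = re.compile(
    r"^(?P<symbol>[A-Za-z_][\w.]*)(?:#(?P<index>\d+))?:(?P<kind>[A-Za-z_]\w*)$"
)


@dataclass(frozen=True)
class TaintSpec:
    symbol: str                 # fully-qualified Python name
    param_index: Optional[int]  # None when the spec targets the symbol itself
    kind: str                   # a kind declared in taint.config


def parse_spec(spec: str) -> TaintSpec:
    """Parse an inline --source/--sink/--sanitizer spec string."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"invalid taint spec: {spec!r}")
    index = match.group("index")
    return TaintSpec(
        symbol=match.group("symbol"),
        param_index=int(index) if index is not None else None,
        kind=match.group("kind"),
    )
```

Turning a `TaintSpec` into a `.pysa` line would still need the callable's parameter names, which the wrapper would have to look up (for example from the existing symbol table) because Pysa models are written against the real signature.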

- [ ] Built-in default pack: a base `taint.config` + `.pysa` set (common web sources; exec/SQL/path sinks; escapers).
- [ ] Compile inline `--source/--sink/--sanitizer` → generated `.pysa`/`taint.config`; merge precedence + `--taint-builtins off` to fully replace.
- [ ] `--taint-config -` (stdin) so the python-sdk can forward user-defined models without temp files; surface the matched rule `code`/name per reported flow.

## Tasks

- [ ] Spike: run Pysa on a small fixture; capture its output (issues JSON / taint flows) and the minimal config (`.pyre_configuration` + `taint.config` + a few `.pysa` models) needed to drive it headless.
- [ ] Bundle/depend on `pyre-check` as an optional `[taint]` extra; vendor a base `taint.config` + `.pysa` model set + typeshed stubs.
- [ ] Add a level-2 enrichment path that invokes Pysa and maps its findings into a `taint_flows` section of the analysis output (source/sink models, locations, path, sanitized flag) — aligned with the `taint_flows` schema proposed in codeanalyzer-ts#1 so both backends emit a consistent shape.
- [ ] Optionally merge Pysa-derived call edges into the existing `call_graph` (provenance `pysa`), reconciled to the symbol-table identity keys.
- [ ] **Retire the experimental CodeQL integration** (superseded by Pysa for the taint/enrichment use case).
- [ ] Platform/fallback handling (no native Windows) + docs.
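
As a starting point for the mapping task, the sketch below converts Pysa issue records into `taint_flows` entries. Both sides are assumptions: the issue field names (`code`, `name`, `path`, `line`, `column`, `stop_line`, `stop_column`, `define`, `description`) should be confirmed in the spike, and the output shape is a placeholder until the schema from codeanalyzer-ts#1 is settled. Source and sink kinds are looked up from the `taint.config` rule by `code` rather than parsed out of the message text.

```python
from typing import Any, Dict, List


def to_taint_flows(
    issues: List[Dict[str, Any]], taint_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Map Pysa issue records to a placeholder taint_flows shape."""
    rules = {rule["code"]: rule for rule in taint_config.get("rules", [])}
    flows = []
    for issue in issues:
        rule = rules.get(issue["code"], {})
        flows.append({
            "rule": {"code": issue["code"], "name": issue.get("name")},
            # Kinds come from the rule definition, not the issue text.
            "source_kinds": rule.get("sources", []),
            "sink_kinds": rule.get("sinks", []),
            "location": {
                "path": issue.get("path"),
                "start_line": issue.get("line"),
                "start_column": issue.get("column"),
                "end_line": issue.get("stop_line"),
                "end_column": issue.get("stop_column"),
            },
            "callable": issue.get("define"),
            "message": issue.get("description"),
            "provenance": "pysa",
        })
    return flows
```

The trace path and sanitized flag from the task description are not covered here; they would likely need the detailed results Pysa writes to disk rather than the issue summary.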

## Implementation guide & learning path

Pysa is an external engine, so the work is mostly *driving it headless + model management + output mapping*, not writing a dataflow engine.

**Build order (each independently testable):**
1. **Headless spike** — stand up a minimal Pyre project (`.pyre_configuration` + `taint.config` + a couple `.pysa` models) and run `pyre analyze` on a fixture; capture the issues JSON. Goal: one source→sink flow detected from the CLI, no wrapper yet.
2. Wrapper: depend on `pyre-check` (optional `[taint]` extra), invoke `pyre analyze` as a subprocess, parse the issues JSON.
3. Default model pack + the configurable-models CLI (above).
4. Map Pysa findings → the shared `taint_flows` schema (aligned with codeanalyzer-ts#1).
5. Optional: merge Pysa call edges into `call_graph` (provenance `pysa`).
6. Platform gating + graceful fallback (no native Windows) + docs.
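
Step 2 might be sketched as below. `build_analyze_command` and `run_pysa` are hypothetical helpers. The sketch assumes `pyre analyze` prints its issues as a JSON list on stdout when no results directory is given, and that `--noninteractive`, `--no-verify`, and `--save-results-to` behave as in the Pysa docs; the spike in step 1 should confirm both before this is relied on.

```python
import json
import subprocess
from pathlib import Path
from typing import Any, Dict, List, Optional


def build_analyze_command(
    results_dir: Optional[Path] = None, verify_models: bool = True
) -> List[str]:
    """Build the `pyre analyze` argument list without running it."""
    cmd = ["pyre", "--noninteractive", "analyze"]
    if not verify_models:
        # Skip model verification; useful while iterating on .pysa files.
        cmd.append("--no-verify")
    if results_dir is not None:
        cmd += ["--save-results-to", str(results_dir)]
    return cmd


def run_pysa(project_root: Path, timeout_s: float = 600.0) -> List[Dict[str, Any]]:
    """Run Pysa in `project_root` and return the parsed issues list.

    Assumes issues are emitted as JSON on stdout (no results directory).
    """
    proc = subprocess.run(
        build_analyze_command(),
        cwd=project_root,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"pyre analyze failed ({proc.returncode}): {proc.stderr}")
    return json.loads(proc.stdout)
```

Keeping command construction separate from execution means the argument list can be unit-tested on any platform, including ones where Pysa itself cannot run.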

**Concepts / docs to read first:**
- **[Pysa basics](https://pyre-check.org/docs/pysa-basics/)** and **[Running Pysa](https://pyre-check.org/docs/pysa-running/)** — the `taint.config` kinds/rules and the `.pysa` model DSL (`TaintSource`/`TaintSink`/`Sanitize`).
- **Pysa model generators / tips** — how Pysa models large frameworks; informs the default pack.
- **Taint as source→sink graph reachability / IFDS** — Reps–Horwitz–Sagiv, *"Precise Interprocedural Dataflow Analysis via Graph Reachability"* (POPL '95); the shared conceptual core with codeanalyzer-ts#1.
- **Pyre's type environment + typeshed stubs** — Pysa resolves models by fully-qualified name through Pyre's types; understand how stubs supply signatures for un-analyzed libraries.
- **Meta's [Pysa engineering post](https://engineering.fb.com/2020/08/07/security/pysa/)** — the big picture (summaries / iterative analysis).

## Alternatives considered

- **[Scalpel](https://github.com/SMAT-Lab/Scalpel) / [PyCG](https://arxiv.org/pdf/2103.00587)** — pure-Python call graph + alias analysis (the Python analog to Jelly). **Embeddable in-process, all platforms**, but you build the taint layer yourself. This is the better fit if a single self-contained, all-platform binary is the hard requirement; Pysa is the move when mature out-of-the-box taint outweighs the sidecar + Windows-drop cost. Worth keeping on the table if the Windows gap becomes blocking.

---

Parallel of **codellm-devkit/codeanalyzer-ts#1**. Tracks the Python backend's level-2 (taint/dataflow) direction.


