Level-2: taint/dataflow via Pysa (parallel to codeanalyzer-ts#1) #31

@rahlk

Summary

Add taint / dataflow analysis as a level-2 enrichment to the Python backend, using Pysa (Meta's static taint analysis, built on the Pyre type checker). This is the Python parallel to codellm-devkit/codeanalyzer-ts#1 (which uses Jelly for the TypeScript backend) and would supersede the current experimental CodeQL integration.

Pysa provides mature interprocedural source → sink taint tracking (sources, sinks, and sanitizers are declared in taint.config and .pysa model files). That is exactly the level-2 capability we want, and the Jedi/tree-sitter front-end doesn't provide it today.

Why Pysa

  • Mature, production-grade interprocedural taint (Meta uses it at scale).
  • Ships as a pre-compiled native binary bundled in the pyre-check PyPI wheel — no OCaml toolchain needed at runtime, same distribution shape as a bundled analyzer binary.
  • Models-as-data (taint.config + .pysa) → sources/sinks extensible without code changes.

Constraints / risks (eyes open)

Unlike a pure-Python library, Pysa is an external engine, so:

  • Sidecar, not in-process. The Pyre/Pysa binary is shelled out to — it can't be imported as a library. This is the "external tool" integration model (like Joern/CodeQL), not an embeddable pure-Python one.
  • Linux + macOS only — no Windows. Pyre is officially Linux/macOS; Windows is WSL-only/unsupported. Level-2 taint would be unavailable on native Windows.
  • The binary alone isn't enough to run. Pysa needs a configured Pyre project: .pyre_configuration, taint.config, .pysa model files, and typeshed third-party stubs. "Bundling Pysa" = binary + that model/stub/config ecosystem.
  • Sizable native binary (tens of MB).
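
To make the "binary alone isn't enough" point concrete, here is a minimal sketch of the files a headless run would need on disk. It assumes the .pyre_configuration keys source_directories and taint_models_path; the taint_models directory name, the base.pysa file name, and the fixture.get_user_input source are hypothetical placeholders, and typeshed stubs are assumed to come from the pyre-check install rather than being written here.

```python
import json
from pathlib import Path


def write_minimal_pysa_project(root: Path) -> None:
    """Write .pyre_configuration, taint.config, and one .pysa file under root."""
    models = root / "taint_models"  # hypothetical directory name
    models.mkdir(parents=True, exist_ok=True)

    # Pyre project config: where the sources live and where the models live.
    pyre_config = {
        "source_directories": ["."],
        "taint_models_path": ["taint_models"],
    }
    (root / ".pyre_configuration").write_text(json.dumps(pyre_config, indent=2))

    # Kinds and the rule linking them (same shape as the example in this issue).
    taint_config = {
        "sources": [{"name": "UserControlled"}],
        "sinks": [{"name": "RemoteCodeExecution"}],
        "rules": [
            {
                "name": "RCE",
                "code": 5001,
                "sources": ["UserControlled"],
                "sinks": ["RemoteCodeExecution"],
                "message_format": "User input reaches a command execution sink",
            }
        ],
    }
    (models / "taint.config").write_text(json.dumps(taint_config, indent=2))

    # Models binding the kinds to symbols. fixture.get_user_input is a
    # hypothetical first-party function used only for the spike.
    pysa_models = (
        "def os.system(command: TaintSink[RemoteCodeExecution]): ...\n"
        "def fixture.get_user_input() -> TaintSource[UserControlled]: ...\n"
    )
    (models / "base.pysa").write_text(pysa_models)
```

Keeping the scaffold in one function should make it easy to point at a temp directory during the spike and later swap the literals for the bundled default pack.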

Distribution approach

  • Add pyre-check as an optional extra (e.g. pip install codeanalyzer-python[taint]) so the base install — including Windows — is unaffected.
  • Gate level-2 taint on platform/availability; emit a clear "Pysa unavailable on this platform" message and fall back to the Jedi call graph when pyre/pysa can't run.
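
A rough sketch of the gating check described above. The function name and message wording are illustrative only; the sketch assumes the pyre executable lands on PATH when the optional extra is installed, and that native Windows reports a sys.platform starting with "win".

```python
import shutil
import sys
from typing import Callable, Optional


def pysa_unavailable_reason(
    platform: str = sys.platform,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a human-readable reason if Pysa cannot run, else None."""
    # Native Windows is not supported by Pyre (WSL reports "linux").
    if platform.startswith("win"):
        return "Pysa unavailable on this platform (native Windows is not supported)"
    # Assumption: installing the [taint] extra puts `pyre` on PATH.
    if which("pyre") is None:
        return "Pysa unavailable: 'pyre' executable not found (install the [taint] extra)"
    return None
```

A caller could log the returned reason and continue with the Jedi call graph only, so level-2 taint degrades without failing the whole analysis.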

Configurable taint models (sources / sinks / sanitizers as args)

Pysa is models-as-data by design: taint.config declares the kinds (named sources/sinks/sanitizers + the rules connecting them) and .pysa files bind those kinds to fully-qualified Python symbols. The wrapper's job is to surface that through the codeanalyzer CLI and forward it to Pysa, layered over a bundled default pack.

CLI surface (extends the existing codeanalyzer CLI):

--taint                    enable taint (or implied by the level-2 flag)
--taint-config <path>      extra taint.config JSON ('-' = stdin)
--pysa-models <dir|file>   extra .pysa model files
--source <spec>            inline source     (repeatable)
--sink <spec>              inline sink        (repeatable)
--sanitizer <spec>         inline sanitizer   (repeatable)
--taint-builtins <on|off>  load the bundled default pack (default on)
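
For illustration, the flag set above could be declared roughly like this. The sketch uses argparse purely as an assumption; the existing codeanalyzer CLI may be built on a different framework, and the help strings are paraphrased from the list above.

```python
import argparse


def build_taint_parser() -> argparse.ArgumentParser:
    """Declare only the proposed taint flags (not the full codeanalyzer CLI)."""
    parser = argparse.ArgumentParser(prog="codeanalyzer")
    parser.add_argument("--taint", action="store_true", help="enable taint analysis")
    parser.add_argument("--taint-config", metavar="PATH",
                        help="extra taint.config JSON ('-' = stdin)")
    parser.add_argument("--pysa-models", metavar="DIR_OR_FILE",
                        help="extra .pysa model files")
    # Repeatable inline specs: each occurrence appends to a list.
    parser.add_argument("--source", action="append", default=[], metavar="SPEC")
    parser.add_argument("--sink", action="append", default=[], metavar="SPEC")
    parser.add_argument("--sanitizer", action="append", default=[], metavar="SPEC")
    parser.add_argument("--taint-builtins", choices=["on", "off"], default="on",
                        help="load the bundled default pack")
    return parser


args = build_taint_parser().parse_args(
    ["--taint", "--sink", "os.system#0:RemoteCodeExecution", "--taint-builtins", "off"]
)
```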

Precedence (later extends/overrides earlier): built-in pack → --taint-config/--pysa-models → inline flags. Inline --source/--sink/--sanitizer get compiled into a generated taint.config + .pysa fragment the wrapper writes into the Pyre project before invoking Pysa.
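
One possible reading of "later extends/overrides earlier" for taint.config, as a sketch: entries are keyed by their "name" within each section, so a later layer adds new names and replaces same-named ones. This only handles sections that are lists of named objects (sources, sinks, rules); how other taint.config keys should merge is left open here.

```python
from typing import Any, Dict, List

Config = Dict[str, List[Dict[str, Any]]]


def merge_taint_configs(*layers: Config) -> Config:
    """Merge layers in order; a later entry with the same name wins."""
    merged: Dict[str, Dict[str, Dict[str, Any]]] = {}
    for layer in layers:  # earliest layer first: built-in, user files, inline
        for section, entries in layer.items():
            by_name = merged.setdefault(section, {})
            for entry in entries:
                by_name[entry["name"]] = entry
    return {section: list(by_name.values()) for section, by_name in merged.items()}
```

With --taint-builtins off, the wrapper would simply leave the built-in layer out of the call.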

The two model layers (Pysa-native):

# taint.config — declares the *kinds* + the rules linking them
{ "sources": [{ "name": "UserControlled" }],
  "sinks":   [{ "name": "RemoteCodeExecution" }],
  "rules":   [{ "name": "RCE", "code": 5001,
               "sources": ["UserControlled"], "sinks": ["RemoteCodeExecution"],
               "message_format": "User input reaches a command execution sink" }] }
# *.pysa — binds kinds to fully-qualified symbols via Taint* annotations
def os.system(command: TaintSink[RemoteCodeExecution]): ...
django.http.request.HttpRequest.GET: TaintSource[UserControlled] = ...
def html.escape(s) -> Sanitize: ...

Match identity: Pysa names elements by fully-qualified Python name (os.system, django.http.request.HttpRequest.GET), resolved through Pyre's type environment, which covers stdlib/third-party code (via typeshed stubs) as well as first-party functions. So an inline spec is a qualified name plus a kind, e.g. --sink "os.system#0:RemoteCodeExecution". (Typeshed stubs are the Python analog of Jelly's access paths in codeanalyzer-ts#1.)
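
A sketch of how an inline spec like the one above might be parsed and rendered into a .pysa line. The issue does not define the grammar, so this assumes the form name[#index]:kind, with #index read as an optional zero-based parameter position. Because .pysa models refer to parameters by name, the renderer takes the parameter name as an input; how the wrapper looks that name up is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InlineSpec:
    qualified_name: str
    parameter_index: Optional[int]  # assumed meaning of "#N"
    kind: str


def parse_inline_spec(text: str) -> InlineSpec:
    """Parse 'qualified.name[#index]:Kind' (assumed grammar)."""
    target, sep, kind = text.rpartition(":")
    if not sep or not target or not kind:
        raise ValueError(f"expected 'name[#index]:kind', got {text!r}")
    name, hash_sep, index = target.partition("#")
    if hash_sep and not index.isdigit():
        raise ValueError(f"parameter index must be a non-negative integer: {text!r}")
    return InlineSpec(name, int(index) if hash_sep else None, kind)


def render_sink_model(spec: InlineSpec, parameter_name: str) -> str:
    """Render a one-parameter TaintSink model line for a .pysa file."""
    return f"def {spec.qualified_name}({parameter_name}: TaintSink[{spec.kind}]): ..."


spec = parse_inline_spec("os.system#0:RemoteCodeExecution")
model_line = render_sink_model(spec, "command")
```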

  • Built-in default pack: a base taint.config + .pysa set (common web sources; exec/SQL/path sinks; escapers).
  • Compile inline --source/--sink/--sanitizer into a generated .pysa/taint.config fragment; implement the merge precedence, with --taint-builtins off to fully replace the default pack.
  • --taint-config - (stdin) so the python-sdk can forward user-defined models without temp files; surface the matched rule code/name per reported flow.

Tasks

  • Spike: run Pysa on a small fixture; capture its output (issues JSON / taint flows) and the minimal config (.pyre_configuration + taint.config + a few .pysa models) needed to drive it headless.
  • Bundle/depend on pyre-check as an optional [taint] extra; vendor a base taint.config + .pysa model set + typeshed stubs.
  • Add a level-2 enrichment path that invokes Pysa and maps its findings into a taint_flows section of the analysis output (source/sink models, locations, path, sanitized flag) — aligned with the taint_flows schema proposed in codeanalyzer-ts#1 so both backends emit a consistent shape.
  • Optionally merge Pysa-derived call edges into the existing call_graph (provenance pysa), reconciled to the symbol-table identity keys.
  • Retire the experimental CodeQL integration (superseded by Pysa for the taint/enrichment use case).
  • Platform/fallback handling (no native Windows) + docs.
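
As a starting point for the mapping task, a sketch that turns Pysa issue records into taint_flows entries. Both sides are assumptions: the issue field names (code, name, description, path, line, column, define) should be confirmed against real output during the spike, and the taint_flows shape below is a placeholder until the schema from codeanalyzer-ts#1 is settled. Source/sink traces and the sanitized flag are omitted because this sketch reads only the per-issue summary.

```python
from typing import Any, Dict, List


def to_taint_flows(issues: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map Pysa issue dicts to a placeholder taint_flows shape."""
    flows = []
    for issue in issues:
        flows.append(
            {
                # Matched rule, so each reported flow carries its code/name.
                "rule": {"code": issue.get("code"), "name": issue.get("name")},
                "message": issue.get("description"),
                "location": {
                    "path": issue.get("path"),
                    "line": issue.get("line"),
                    "column": issue.get("column"),
                },
                # Callable in which the flow was reported (assumed field "define").
                "callable": issue.get("define"),
                "provenance": "pysa",
            }
        )
    return flows
```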

Implementation guide & learning path

Pysa is an external engine, so the work is mostly driving it headless + model management + output mapping, not writing a dataflow engine.

Build order (each independently testable):

  1. Headless spike — stand up a minimal Pyre project (.pyre_configuration + taint.config + a couple .pysa models) and run pyre analyze on a fixture; capture the issues JSON. Goal: one source→sink flow detected from the CLI, no wrapper yet.
  2. Wrapper: depend on pyre-check (optional [taint] extra), invoke pyre analyze as a subprocess, parse the issues JSON.
  3. Default model pack + the configurable-models CLI (above).
  4. Map Pysa findings → the shared taint_flows schema (aligned with codeanalyzer-ts#1).
  5. Optional: merge Pysa call edges into call_graph (provenance pysa).
  6. Platform gating + graceful fallback (no native Windows) + docs.
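
Steps 1 and 2 could start from something like the sketch below: shell out to pyre analyze in the project directory and parse stdout as the issues JSON. It assumes the issues are printed to stdout as a JSON list and deliberately does not rely on the exit code, since that behaviour should be checked in the spike. The runner parameter is a test seam, not part of any Pyre API.

```python
import json
import subprocess
from pathlib import Path
from typing import Any, Callable, Dict, List


def run_pysa(
    project_dir: Path,
    runner: Callable[..., Any] = subprocess.run,
) -> List[Dict[str, Any]]:
    """Invoke `pyre analyze` as a sidecar process and return the parsed issues."""
    completed = runner(
        ["pyre", "analyze"],
        cwd=str(project_dir),   # directory containing .pyre_configuration
        capture_output=True,
        text=True,
        check=False,            # exit-code semantics are not assumed here
    )
    try:
        issues = json.loads(completed.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError(
            f"pyre analyze did not emit JSON (exit {completed.returncode}): "
            f"{completed.stderr.strip()}"
        ) from exc
    if not isinstance(issues, list):
        raise RuntimeError("expected a JSON list of issues from pyre analyze")
    return issues
```

Injecting the runner keeps the wrapper unit-testable on machines where Pysa cannot run, including native Windows CI.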

Concepts / docs to read first:

  • Pysa basics and Running Pysa — the taint.config kinds/rules and the .pysa model DSL (TaintSource/TaintSink/Sanitize).
  • Pysa model generators / tips — how Pysa models large frameworks; informs the default pack.
  • Taint as source→sink graph reachability / IFDS — Reps–Horwitz–Sagiv, "Precise Interprocedural Dataflow Analysis via Graph Reachability" (POPL '95); the shared conceptual core with codeanalyzer-ts#1.
  • Pyre's type environment + typeshed stubs — Pysa resolves models by fully-qualified name through Pyre's types; understand how stubs supply signatures for un-analyzed libraries.
  • Meta's Pysa engineering post — the big picture (summaries / iterative analysis).

Alternatives considered

  • Scalpel / PyCG — pure-Python call graph + alias analysis (the Python analog to Jelly). Embeddable in-process, all platforms, but you build the taint layer yourself. This is the better fit if a single self-contained, all-platform binary is the hard requirement; Pysa is the move when mature out-of-the-box taint outweighs the sidecar + Windows-drop cost. Worth keeping on the table if the Windows gap becomes blocking.

Parallel of codellm-devkit/codeanalyzer-ts#1. Tracks the Python backend's level-2 (taint/dataflow) direction.
