Summary
Add taint / dataflow analysis as a level-2 enrichment to the Python backend, using Pysa (Meta's static taint analysis, built on the Pyre type checker). This is the Python parallel to codellm-devkit/codeanalyzer-ts#1 (which uses Jelly for the TypeScript backend) and would supersede the current experimental CodeQL integration.
Pysa performs mature interprocedural source → sink taint tracking (sources, sinks, and sanitizers are declared in taint.config plus .pysa model files). That is exactly the level-2 capability we want, and the Jedi/tree-sitter front-end does not provide it today.
Why Pysa
- Mature, production-grade interprocedural taint (Meta uses it at scale).
- Ships as a pre-compiled native binary bundled in the pyre-check PyPI wheel: no OCaml toolchain needed at runtime, same distribution shape as a bundled analyzer binary.
- Models-as-data (taint.config + .pysa) → sources/sinks extensible without code changes.
Constraints / risks (eyes open)
Unlike a pure-Python library, Pysa is an external engine, so:
- Sidecar, not in-process. The Pyre/Pysa binary is shelled out to — it can't be imported as a library. This is the "external tool" integration model (like Joern/CodeQL), not an embeddable pure-Python one.
- Linux + macOS only — no Windows. Pyre is officially Linux/macOS; Windows is WSL-only/unsupported. Level-2 taint would be unavailable on native Windows.
- The binary alone isn't enough to run. Pysa needs a configured Pyre project: .pyre_configuration, taint.config, .pysa model files, and typeshed third-party stubs. "Bundling Pysa" = binary + that model/stub/config ecosystem.
- Sizable native binary (tens of MB).
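To make the "binary alone isn't enough" point concrete, here is a minimal sketch of the scaffold a wrapper would need to write before invoking Pysa. The helper name `write_pysa_project`, the `taint_models` directory, and the `models.pysa` file name are hypothetical. `source_directories` and `taint_models_path` are the `.pyre_configuration` keys as I understand them from the Pyre docs; verify them against the pinned pyre-check version. Typeshed stubs are not handled here.

```python
import json
from pathlib import Path


def write_pysa_project(root: Path, source_dir: str, taint_config: dict, models: str) -> Path:
    """Write a minimal Pyre project scaffold under `root` (hypothetical helper)."""
    models_dir = root / "taint_models"  # assumed layout, not mandated by Pysa
    models_dir.mkdir(parents=True, exist_ok=True)

    # Keys assumed from the Pyre configuration docs; confirm for your version.
    pyre_configuration = {
        "source_directories": [source_dir],
        "taint_models_path": [str(models_dir)],
    }
    (root / ".pyre_configuration").write_text(json.dumps(pyre_configuration, indent=2))
    (models_dir / "taint.config").write_text(json.dumps(taint_config, indent=2))
    (models_dir / "models.pysa").write_text(models)
    return models_dir
```

Writing the scaffold into a throwaway directory keeps the user's repository free of Pyre files, at the cost of regenerating it on every run.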
Distribution approach
- Add pyre-check as an optional extra (e.g. pip install codeanalyzer-python[taint]) so the base install, including Windows, is unaffected.
- Gate level-2 taint on platform/availability; emit a clear "Pysa unavailable on this platform" message and fall back to the Jedi call graph when pyre/pysa can't run.
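The gating described above could look roughly like the sketch below. The function name `pysa_availability` and the message wording are illustrative, and it assumes the pyre-check wheel puts a `pyre` executable on PATH.

```python
import shutil
import sys
from typing import Optional, Tuple


def pysa_availability(platform: str = sys.platform,
                      pyre_path: Optional[str] = None) -> Tuple[bool, str]:
    """Return (available, message) for level-2 taint on this machine."""
    # sys.platform is "win32" on native Windows; WSL reports "linux".
    if platform.startswith("win"):
        return False, ("Pysa unavailable on this platform (native Windows); "
                       "falling back to the Jedi call graph.")
    # Look the executable up on PATH unless the caller supplied a path.
    pyre = pyre_path or shutil.which("pyre")
    if pyre is None:
        return False, ("Pysa unavailable: 'pyre' not found "
                       "(install codeanalyzer-python[taint]); "
                       "falling back to the Jedi call graph.")
    return True, f"Pysa available via {pyre}"
```

Returning a message alongside the boolean lets the CLI print one clear line and continue with the level-1 output instead of failing the whole run.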
Configurable taint models (sources / sinks / sanitizers as args)
Pysa is models-as-data by design: taint.config declares the kinds (named sources and sinks, plus the rules connecting them) and .pysa files bind those kinds to fully-qualified Python symbols and mark sanitizers. The wrapper's job is to surface that through the codeanalyzer CLI and forward it to Pysa, layered over a bundled default pack.
CLI surface (extends the existing codeanalyzer CLI):
  --taint                      enable taint (or implied by the level-2 flag)
  --taint-config <path>        extra taint.config JSON ('-' = stdin)
  --pysa-models <dir|file>     extra .pysa model files
  --source <spec>              inline source (repeatable)
  --sink <spec>                inline sink (repeatable)
  --sanitizer <spec>           inline sanitizer (repeatable)
  --taint-builtins <on|off>    load the bundled default pack (default on)
Precedence (later extends/overrides earlier): built-in pack → --taint-config/--pysa-models → inline flags. Inline --source/--sink/--sanitizer get compiled into a generated taint.config + .pysa fragment the wrapper writes into the Pyre project before invoking Pysa.
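A minimal sketch of that precedence for the taint.config layer follows. It assumes entries are identified by their `name` field, so a later layer replaces a same-named entry and appends new ones. The function name `merge_taint_configs` is hypothetical, and only the `sources`, `sinks`, and `rules` sections are handled.

```python
from typing import Dict


def merge_taint_configs(*layers: dict) -> dict:
    """Merge taint.config dicts; later layers extend or override earlier ones."""
    merged: Dict[str, Dict[str, dict]] = {"sources": {}, "sinks": {}, "rules": {}}
    for layer in layers:
        for section, by_name in merged.items():
            for entry in layer.get(section, []):
                # Same name in a later layer wins; new names are appended.
                by_name[entry["name"]] = entry
    return {section: list(by_name.values()) for section, by_name in merged.items()}
```

Under this scheme, `--taint-builtins off` amounts to leaving the built-in pack out of the argument list, e.g. `merge_taint_configs(user_config, inline_config)`.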
The two model layers (Pysa-native):
# taint.config — declares the *kinds* + the rules linking them
{ "sources": [{ "name": "UserControlled" }],
  "sinks": [{ "name": "RemoteCodeExecution" }],
  "rules": [{ "name": "RCE", "code": 5001,
              "sources": ["UserControlled"], "sinks": ["RemoteCodeExecution"],
              "message_format": "User input reaches a command execution sink" }] }
# *.pysa — binds kinds to fully-qualified symbols via Taint* annotations
def os.system(command: TaintSink[RemoteCodeExecution]): ...
django.http.request.HttpRequest.GET: TaintSource[UserControlled] = ...
def html.escape(s) -> Sanitize: ...
match identity: Pysa names elements by fully-qualified Python name (os.system, django.http.request.HttpRequest.GET), resolved through Pyre's type environment — covering stdlib/third-party (via typeshed stubs) and first-party functions. So an inline spec is a qualified name + kind, e.g. --sink "os.system#0:RemoteCodeExecution". (Typeshed stubs are the Python analog of Jelly's access paths in codeanalyzer-ts#1.)
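As an illustration of the inline spec, the sketch below parses strings shaped like the `os.system#0:RemoteCodeExecution` example. The grammar `<qualified.name>[#<index>]:<Kind>`, the reading of `#0` as a positional parameter index, and the names `InlineSpec` and `parse_inline_spec` are all assumptions made for this sketch; the issue does not define the format further.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InlineSpec:
    qualified_name: str
    parameter_index: Optional[int]  # None when the spec has no '#<index>' part
    kind: str


def parse_inline_spec(spec: str) -> InlineSpec:
    """Parse '<qualified.name>[#<index>]:<Kind>' (assumed grammar)."""
    target, sep, kind = spec.rpartition(":")
    if not sep or not target or not kind:
        raise ValueError(f"expected '<qualified.name>[#<index>]:<Kind>', got {spec!r}")
    name, hash_sep, index = target.partition("#")
    if not name or (hash_sep and not index.isdigit()):
        raise ValueError(f"invalid target in inline spec: {target!r}")
    return InlineSpec(name, int(index) if hash_sep else None, kind)
```

Turning a parsed spec into a .pysa model still needs the parameter name (a .pysa stub annotates named parameters), which would have to come from the resolved signature; that step is not shown.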
Tasks
Implementation guide & learning path
Pysa is an external engine, so the work is mostly driving it headless + model management + output mapping, not writing a dataflow engine.
Build order (each independently testable):
- Headless spike: stand up a minimal Pyre project (.pyre_configuration + taint.config + a couple .pysa models) and run pyre analyze on a fixture; capture the issues JSON. Goal: one source→sink flow detected from the CLI, no wrapper yet.
- Wrapper: depend on pyre-check (optional [taint] extra), invoke pyre analyze as a subprocess, parse the issues JSON.
- Default model pack + the configurable-models CLI (above).
- Map Pysa findings → the shared taint_flows schema (aligned with codeanalyzer-ts#1).
- Optional: merge Pysa call edges into call_graph (provenance pysa).
- Platform gating + graceful fallback (no native Windows) + docs.
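The wrapper and mapping steps above might be sketched as follows. This assumes `pyre --noninteractive analyze` prints a JSON list of issues on stdout with keys such as `code`, `name`, `description`, `define`, `path`, `line`, and `column`; the headless spike should confirm that shape. The `taint_flows` field names are placeholders, since the shared schema is still being defined in codeanalyzer-ts#1.

```python
import json
import subprocess
from typing import List


def run_pysa(project_root: str, timeout_s: int = 600) -> List[dict]:
    """Invoke `pyre analyze` in project_root and return the parsed issues."""
    proc = subprocess.run(
        ["pyre", "--noninteractive", "analyze"],
        cwd=project_root, capture_output=True, text=True, timeout=timeout_s,
    )
    # The exit code is not treated as failure here: it is unverified whether
    # Pyre exits non-zero when issues are found. Empty stdout is the error case.
    if not proc.stdout.strip():
        raise RuntimeError(
            f"pyre analyze produced no output (exit {proc.returncode}): {proc.stderr[-500:]}"
        )
    return json.loads(proc.stdout)


def to_taint_flows(issues: List[dict]) -> List[dict]:
    """Map Pysa issues onto a placeholder taint_flows shape."""
    return [
        {
            "rule_code": issue.get("code"),
            "rule_name": issue.get("name"),
            "message": issue.get("description"),
            "function": issue.get("define"),
            "location": {
                "path": issue.get("path"),
                "line": issue.get("line"),
                "column": issue.get("column"),
            },
            "provenance": "pysa",
        }
        for issue in issues
    ]
```

Keeping the subprocess call separate from the mapping means the mapping can be unit-tested against captured JSON fixtures on any platform, including Windows, where Pysa itself cannot run.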
Concepts / docs to read first:
- Pysa basics and Running Pysa: the taint.config kinds/rules and the .pysa model DSL (TaintSource/TaintSink/Sanitize).
- Pysa model generators / tips — how Pysa models large frameworks; informs the default pack.
- Taint as source→sink graph reachability / IFDS — Reps–Horwitz–Sagiv, "Precise Interprocedural Dataflow Analysis via Graph Reachability" (POPL '95); the shared conceptual core with codeanalyzer-ts#1.
- Pyre's type environment + typeshed stubs — Pysa resolves models by fully-qualified name through Pyre's types; understand how stubs supply signatures for un-analyzed libraries.
- Meta's Pysa engineering post — the big picture (summaries / iterative analysis).
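For orientation before the reading, the "taint as graph reachability" idea can be shown with a toy sketch. It is a plain breadth-first search over a hand-written dataflow graph, not IFDS and not Pysa's summary-based algorithm: it has no calling contexts, no types, and no access paths. The node names and the function `find_taint_path` are made up for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def find_taint_path(edges: Dict[str, List[str]],
                    sources: Iterable[str],
                    sinks: Set[str],
                    sanitizers: Set[str]) -> Optional[List[str]]:
    """BFS from the sources; return the first path reaching a sink, or None."""
    queue = deque([s] for s in sources)
    seen: Set[str] = set()
    while queue:
        path = queue.popleft()
        node = path[-1]
        # Taint does not flow through sanitizers or already-visited nodes.
        if node in seen or node in sanitizers:
            continue
        seen.add(node)
        if node in sinks:
            return path
        for successor in edges.get(node, []):
            queue.append(path + [successor])
    return None
```

A real engine differs mainly in how the edges are computed (interprocedurally, per calling context) rather than in this final reachability question.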
Alternatives considered
- Scalpel / PyCG — pure-Python call graph + alias analysis (the Python analog to Jelly). Embeddable in-process, all platforms, but you build the taint layer yourself. This is the better fit if a single self-contained, all-platform binary is the hard requirement; Pysa is the move when mature out-of-the-box taint outweighs the sidecar + Windows-drop cost. Worth keeping on the table if the Windows gap becomes blocking.
Parallel of codellm-devkit/codeanalyzer-ts#1. Tracks the Python backend's level-2 (taint/dataflow) direction.
Configurable taint models (tasks)
- […] taint.config + .pysa set (common web sources; exec/SQL/path sinks; escapers).
- --source/--sink/--sanitizer → generated .pysa/taint.config; merge precedence + --taint-builtins off to fully replace.
- --taint-config - (stdin) so the python-sdk can forward user-defined models without temp files; surface the matched rule code/name per reported flow.
Tasks
- […] (.pyre_configuration + taint.config + a few .pysa models) needed to drive it headless.
- […] pyre-check as an optional [taint] extra; vendor a base taint.config + .pysa model set + typeshed stubs.
- […] taint_flows section of the analysis output (source/sink models, locations, path, sanitized flag), aligned with the taint_flows schema proposed in codeanalyzer-ts#1 so both backends emit a consistent shape.
- […] call_graph (provenance pysa), reconciled to the symbol-table identity keys.