Summary
Add a separate red-teaming workflow to the testbench that exercises agents under test with adversarial inputs and grades the agent's defenses. v1 ships a promptfoo backend behind a pluggable RedteamAdapter abstraction so alternative backends (garak, PyRIT) can be added later without changes to the entry point or config schema.
Motivation
The testbench today only validates functional correctness of agents — it sends curated test cases via A2A, evaluates the responses with RAGAS-style metrics, and publishes scores to Grafana. It has no answer to the security and safety questions teams now have to ship against:
- Will the agent leak PII or system prompts when prompted adversarially?
- Will it follow indirect prompt injection from tool outputs?
- Will it abuse its tool/function-call permissions (BOLA, BFLA, excessive agency)?
- Will it produce harmful, biased, or out-of-policy content?
- Will it be jailbroken by standard attack patterns?
Adding red-teaming closes this gap. It lets teams catch these vulnerabilities in CI alongside their functional eval, see findings in the same Grafana stack via OTLP, and gate releases when high-severity issues regress.
Why a separate workflow (not a phase in pipeline.py)
Promptfoo's data model is fundamentally different from that of the existing pipeline:
- The functional pipeline runs curated Step objects with hand-written input + expected reference + tool_calls.
- Red-teaming generates adversarial inputs at run time. There is no reference. Graders are plugin-specific (PII detector, jailbreak detector, etc.), not RAGAS metrics.
Forcing both into the same Experiment / EvaluatedExperiment schema would distort both. Instead, the red-teaming workflow is its own entry point (scripts/run_redteam.py) and its own Testkube template, sharing only the agent target URL and (optionally) the OTLP endpoint with the functional pipeline.
Why an adapter abstraction (not just promptfoo)
Promptfoo is the most mature LLM red-team CLI today, but it is the Node.js outlier in a Python codebase. Two of the strongest alternatives, garak and PyRIT, are Python-native libraries. The RedteamAdapter Protocol is designed to work cleanly for both subprocess-based (promptfoo) and library-based (garak, PyRIT) backends, so we are neither locked into promptfoo's Node toolchain nor dependent on its upstream stability.
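As a rough illustration of what such an abstraction could look like, here is a minimal sketch using typing.Protocol. Only the names RedteamAdapter, Finding, and RedteamResults come from this issue; the fields, the method signatures (list_plugins, run), and the StubAdapter class are assumptions made up for the example, not the planned design.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Finding:
    """One normalized result. Field names are illustrative assumptions."""
    plugin: str
    severity: str
    passed: bool


@dataclass
class RedteamResults:
    """Backend-agnostic container. Shape is an assumption."""
    backend: str
    findings: list[Finding] = field(default_factory=list)


@runtime_checkable
class RedteamAdapter(Protocol):
    """Structural interface: any class with these members qualifies."""
    name: str

    def list_plugins(self) -> list[str]: ...

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults: ...


class StubAdapter:
    """Hypothetical in-process backend that returns canned findings."""
    name = "stub"

    def list_plugins(self) -> list[str]:
        return ["pii", "jailbreak"]

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults:
        unknown = set(plugins) - set(self.list_plugins())
        if unknown:
            raise ValueError(f"unknown plugins: {sorted(unknown)}")
        # A real adapter would attack target_url here; the stub marks every plugin as passed.
        found = [Finding(plugin=p, severity="high", passed=True) for p in plugins]
        return RedteamResults(backend=self.name, findings=found)
```

Because the Protocol is structural, a subprocess-based adapter and a library-based adapter would both satisfy it without inheriting from a shared base class; the entry point only needs the two methods and the name.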
Scope
v1 (this issue):
- RedteamAdapter Protocol + normalized data models (Finding, RedteamResults, ...).
- One backend implementation: PromptfooAdapter (subprocess to npx promptfoo redteam run).
- Pydantic-validated redteam-config.yaml schema mirroring PipelineConfig shape.
- OTLP publisher emitting redteam_* gauge metrics labeled with experiment_name, execution_id, execution_number, backend, plugin, severity.
- Testkube TestWorkflowTemplate + concrete TestWorkflow.
- Docker image extended with Node.js ≥ 18 and pinned promptfoo.
- E2E test against the Tilt weather-agent.
- User picks plugins from the configured backend's catalog (--list-plugins discovers what's available).
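To make the metric labeling in the list above concrete, the following sketch groups findings into one gauge data point per (plugin, severity) pair and attaches the six labels named in the scope. The metric name redteam_findings_count, the dict layout, and the to_gauge_points helper are hypothetical; the issue only specifies a redteam_* prefix and the label set, and a real publisher would hand these values to an OTLP exporter rather than return dicts.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Trimmed-down finding; only the fields needed for labeling (assumed)."""
    plugin: str
    severity: str


def to_gauge_points(findings, *, experiment_name, execution_id,
                    execution_number, backend):
    """Return one gauge data point per (plugin, severity) combination."""
    counts = Counter((f.plugin, f.severity) for f in findings)
    points = []
    for (plugin, severity), count in sorted(counts.items()):
        points.append({
            "name": "redteam_findings_count",  # hypothetical metric name
            "value": count,
            "labels": {
                "experiment_name": experiment_name,
                "execution_id": execution_id,
                "execution_number": execution_number,
                "backend": backend,
                "plugin": plugin,
                "severity": severity,
            },
        })
    return points
```

Keeping plugin and severity as labels rather than encoding them in the metric name is what would let a dashboard filter by experiment_name and plugin without knowing the plugin catalog in advance.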
Explicitly out of scope for v1:
- Multi-backend in a single run.
- Custom user-authored plugins.
- A second adapter (garak / PyRIT). The abstraction is built for it; the implementation is a follow-up.
- Cross-run finding diffing (handled by Grafana over OTLP).
- Auto-remediation suggestions.
- Replaying a specific finding deterministically.
Documents
- docs/superpowers/specs/2026-05-06-red-teaming-promptfoo-design.md
- docs/superpowers/plans/2026-05-06-red-teaming-promptfoo.md (on branch feat/red-teaming-workflow)
The plan is broken into 14 TDD-shaped tasks with concrete code in every step.
Acceptance Criteria
- uv run python3 scripts/run_redteam.py examples/redteam-config.yaml runs end-to-end against the Tilt weather-agent and writes data/redteam/findings.json, data/redteam/native-report.html, and data/redteam/summary.html.
- uv run python3 scripts/run_redteam.py --list-plugins --backend promptfoo prints the catalog of available plugins.
- [...] redteam_* metrics, filterable by experiment_name and plugin.
- Exit codes: 0 (clean), 1 (config invalid OR findings exceed fail_on.severity), 2 (backend execution failed).
- uv run poe check passes (tests, mypy, bandit, ruff).
- E2E test (tests_e2e/test_redteam_e2e.py) passes against Tilt.
- [...] kubectl testkube run tw example-redteam-workflow --watch.
Related
Functional evaluation pipeline (unchanged by this work): scripts/pipeline.py.
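The 0 / 1 / 2 exit-code contract listed under Acceptance Criteria could be sketched roughly as below. The severity ordering, the function name exit_code, and the reading of "exceed" as "at or above the threshold" are all assumptions for illustration; the issue does not define them.

```python
from __future__ import annotations

from enum import IntEnum
from typing import Iterable

# Assumed ordering from least to most severe; not specified in the issue.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


class ExitCode(IntEnum):
    CLEAN = 0
    CONFIG_OR_GATE = 1   # config invalid OR findings trip fail_on.severity
    BACKEND_FAILED = 2


def exit_code(finding_severities: Iterable[str],
              fail_on_severity: str | None,
              *, config_ok: bool = True,
              backend_ok: bool = True) -> ExitCode:
    """Map a run outcome to a process exit code."""
    if not config_ok:
        return ExitCode.CONFIG_OR_GATE
    if not backend_ok:
        return ExitCode.BACKEND_FAILED
    if fail_on_severity is not None:
        threshold = SEVERITY_ORDER.index(fail_on_severity)
        # Assumption: a finding at the threshold severity already fails the gate.
        if any(SEVERITY_ORDER.index(s) >= threshold for s in finding_severities):
            return ExitCode.CONFIG_OR_GATE
    return ExitCode.CLEAN
```

Under these assumptions, exit_code(["low", "medium"], "high") yields 0, while exit_code(["high"], "high") yields 1, so a CI job can gate a release on the process status alone.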