Add red-teaming workflow (promptfoo) to the testbench #40

## Summary

Add a separate red-teaming workflow to the testbench that exercises agents under test with adversarial inputs and grades the agent's defenses. v1 ships a [promptfoo](https://github.com/promptfoo/promptfoo) backend behind a pluggable `RedteamAdapter` abstraction so alternative backends (garak, PyRIT) can be added later without changes to the entry point or config schema.

## Motivation

The testbench today validates only the **functional** correctness of agents: it sends curated test cases via A2A, evaluates the responses with RAGAS-style metrics, and publishes scores to Grafana. It cannot answer the security and safety questions teams now have to address before shipping:

- Will the agent leak PII or system prompts when prompted adversarially?
- Will it follow indirect prompt injection from tool outputs?
- Will it abuse its tool/function-call permissions (BOLA: broken object-level authorization, BFLA: broken function-level authorization, excessive agency)?
- Will it produce harmful, biased, or out-of-policy content?
- Will it succumb to jailbreaks under standard attack patterns?

Adding red-teaming closes this gap. It lets teams catch these vulnerabilities in CI alongside their functional eval, see findings in the same Grafana stack via OTLP, and gate releases when high-severity issues regress.

## Why a separate workflow (not a phase in `pipeline.py`)

Promptfoo's data model is fundamentally different from that of the existing pipeline:

- The functional pipeline runs **curated** `Step` objects with hand-written `input` + expected `reference` + `tool_calls`.
- Red-teaming **generates** adversarial inputs at run time. There is no `reference`. Graders are plugin-specific (PII detector, jailbreak detector, etc.), not RAGAS metrics.

Forcing both into the same `Experiment` / `EvaluatedExperiment` schema would distort both. Instead, the red-teaming workflow is its own entry point (`scripts/run_redteam.py`) and its own Testkube template, sharing only the agent target URL and (optionally) the OTLP endpoint with the functional pipeline.

## Why an adapter abstraction (not just promptfoo)

Promptfoo is the most mature LLM red-team CLI today, but it is the Node.js outlier in a Python codebase. Two of the strongest alternatives, [garak](https://github.com/NVIDIA/garak) and [PyRIT](https://github.com/Azure/PyRIT), are Python-native libraries. The `RedteamAdapter` Protocol is designed to work cleanly for both subprocess-based (promptfoo) and library-based (garak, PyRIT) backends, so we are neither locked into promptfoo's Node toolchain nor dependent on its upstream stability.
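
To make the adapter idea concrete, here is a minimal sketch of how one Protocol could cover both a subprocess-based and an in-process backend. Only the names `RedteamAdapter`, `Finding`, and `RedteamResults` come from this issue; every field, method signature, the `SubprocessAdapter` / `InProcessAdapter` classes, and the `run_redteam` helper are illustrative assumptions, not the planned implementation.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass, field
from typing import Callable, Protocol, runtime_checkable


@dataclass(frozen=True)
class Finding:
    # Field names are assumptions made for this sketch.
    plugin: str
    severity: str


@dataclass
class RedteamResults:
    backend: str
    findings: list[Finding] = field(default_factory=list)


@runtime_checkable
class RedteamAdapter(Protocol):
    """Hypothetical shape of the adapter contract."""

    name: str

    def list_plugins(self) -> list[str]: ...

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults: ...


class SubprocessAdapter:
    """CLI-style backend: runs a command and parses its stdout."""

    def __init__(
        self,
        name: str,
        command: list[str],
        plugins: list[str],
        parse: Callable[[str], list[Finding]],
        runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    ) -> None:
        self.name = name
        self._command = command
        self._plugins = plugins
        self._parse = parse
        self._runner = runner  # injectable so the sketch can be tested without Node

    def list_plugins(self) -> list[str]:
        return list(self._plugins)

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults:
        # How target_url and plugins reach the CLI is backend-specific and
        # deliberately omitted here; no CLI flags are assumed.
        completed = self._runner(
            self._command, capture_output=True, text=True, check=False
        )
        return RedteamResults(self.name, self._parse(completed.stdout))


class InProcessAdapter:
    """Library-style backend stand-in that produces findings in-process."""

    name = "in-process-demo"

    def list_plugins(self) -> list[str]:
        return ["demo-plugin"]

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults:
        return RedteamResults(self.name, [Finding(p, "low") for p in plugins])


def run_redteam(
    adapter: RedteamAdapter, target_url: str, plugins: list[str]
) -> RedteamResults:
    """Entry-point logic that depends only on the Protocol."""
    unknown = sorted(set(plugins) - set(adapter.list_plugins()))
    if unknown:
        raise ValueError(f"unknown plugins for {adapter.name}: {unknown}")
    return adapter.run(target_url, plugins)
```

Because `run_redteam` touches only the Protocol surface, swapping the backend would not require changes to the entry point, which is the property the abstraction is meant to preserve.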

## Scope

**v1 (this issue):**
- `RedteamAdapter` Protocol + normalized data models (`Finding`, `RedteamResults`, ...).
- One backend implementation: `PromptfooAdapter` (subprocess to `npx promptfoo redteam run`).
- Pydantic-validated `redteam-config.yaml` schema mirroring `PipelineConfig` shape.
- OTLP publisher emitting `redteam_*` gauge metrics labeled with `experiment_name`, `execution_id`, `execution_number`, `backend`, `plugin`, `severity`.
- Testkube `TestWorkflowTemplate` + concrete `TestWorkflow`.
- Docker image extended with Node.js ≥ 18 and pinned promptfoo.
- E2E test against the Tilt weather-agent.
- User picks plugins from the configured backend's catalog (`--list-plugins` discovers what's available).
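
As a rough illustration of the metric labeling described in the list above, the following sketch groups findings into one gauge data point per `(plugin, severity)` pair and attaches the run-level labels. The metric name `redteam_findings`, the `Finding` fields, and the `gauge_points` helper are assumptions for this example; handing the points to an actual OTLP exporter is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    # Minimal stand-in; the real model likely carries more fields.
    plugin: str
    severity: str


def gauge_points(
    findings: list[Finding],
    *,
    experiment_name: str,
    execution_id: str,
    execution_number: int,
    backend: str,
) -> list[dict]:
    """Return one gauge data point per (plugin, severity) combination."""
    counts = Counter((f.plugin, f.severity) for f in findings)
    return [
        {
            "name": "redteam_findings",  # hypothetical member of the redteam_* family
            "value": count,
            "attributes": {
                "experiment_name": experiment_name,
                "execution_id": execution_id,
                "execution_number": execution_number,
                "backend": backend,
                "plugin": plugin,
                "severity": severity,
            },
        }
        for (plugin, severity), count in sorted(counts.items())
    ]
```

Keeping `plugin` and `severity` as labels rather than baking them into metric names is what would let Grafana filter and compare runs without a new metric per plugin.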

**Explicitly out of scope for v1:**
- Multi-backend in a single run.
- Custom user-authored plugins.
- A second adapter (garak / PyRIT). The abstraction is built for it; the implementation is a follow-up.
- Cross-run finding diffing (handled by Grafana over OTLP).
- Auto-remediation suggestions.
- Replaying a specific finding deterministically.

## Documents

- **Design spec:** [`docs/superpowers/specs/2026-05-06-red-teaming-promptfoo-design.md`](../blob/main/docs/superpowers/specs/2026-05-06-red-teaming-promptfoo-design.md)
- **Implementation plan:** [`docs/superpowers/plans/2026-05-06-red-teaming-promptfoo.md`](../blob/feat/red-teaming-workflow/docs/superpowers/plans/2026-05-06-red-teaming-promptfoo.md) (on branch `feat/red-teaming-workflow`)

The plan is broken into 14 TDD-shaped tasks with concrete code in every step.

## Acceptance Criteria

- [ ] `uv run python3 scripts/run_redteam.py examples/redteam-config.yaml` runs end-to-end against the Tilt weather-agent and writes `data/redteam/findings.json`, `data/redteam/native-report.html`, and `data/redteam/summary.html`.
- [ ] `uv run python3 scripts/run_redteam.py --list-plugins --backend promptfoo` prints the catalog of available plugins.
- [ ] Findings are visible in Grafana as `redteam_*` metrics, filterable by `experiment_name` and `plugin`.
- [ ] Exit code semantics implemented and tested: `0` (clean), `1` (config invalid OR findings exceed `fail_on.severity`), `2` (backend execution failed).
- [ ] `uv run poe check` passes (tests, mypy, bandit, ruff).
- [ ] E2E test (`tests_e2e/test_redteam_e2e.py`) passes against Tilt.
- [ ] Testkube workflow runs successfully in the cluster: `kubectl testkube run tw example-redteam-workflow --watch`.
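
The exit code criterion above can be sketched as a small mapping function. The severity scale, the exception names, and the reading of "exceed" as "at or above the threshold" are all assumptions made for illustration; the design spec is the authority on the actual semantics.

```python
from enum import IntEnum
from typing import Callable

# Assumed ordering from least to most severe.
SEVERITY_ORDER = ("low", "medium", "high", "critical")


class ExitCode(IntEnum):
    CLEAN = 0
    CONFIG_OR_THRESHOLD = 1  # config invalid OR findings reach fail_on.severity
    BACKEND_FAILED = 2


class ConfigError(Exception):
    """Hypothetical error for an invalid red-team config."""


class BackendError(Exception):
    """Hypothetical error for a failed backend execution."""


def exit_code_for(run: Callable[[], list[str]], fail_on: str) -> ExitCode:
    """Run the backend and translate the outcome into an exit code.

    `run` returns the severities of all findings, or raises one of the
    errors above.
    """
    try:
        if fail_on not in SEVERITY_ORDER:
            raise ConfigError(f"unknown severity: {fail_on!r}")
        severities = run()
    except ConfigError:
        return ExitCode.CONFIG_OR_THRESHOLD
    except BackendError:
        return ExitCode.BACKEND_FAILED

    threshold = SEVERITY_ORDER.index(fail_on)
    # Assumption: a finding at or above the threshold fails the run.
    if any(SEVERITY_ORDER.index(s) >= threshold for s in severities):
        return ExitCode.CONFIG_OR_THRESHOLD
    return ExitCode.CLEAN
```

A caller would typically finish with `sys.exit(exit_code_for(...))`, so CI can distinguish a failed security gate (`1`) from an infrastructure problem (`2`).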

## Related

Functional evaluation pipeline (unchanged by this work): `scripts/pipeline.py`.
