Add red-teaming workflow (promptfoo) to the testbench #40

@fmallmann

Summary

Add a separate red-teaming workflow to the testbench that exercises agents under test with adversarial inputs and grades the agent's defenses. v1 ships a promptfoo backend behind a pluggable RedteamAdapter abstraction so alternative backends (garak, PyRIT) can be added later without changes to the entry point or config schema.

Motivation

The testbench today validates only the functional correctness of agents: it sends curated test cases via A2A, evaluates the responses with RAGAS-style metrics, and publishes scores to Grafana. It has no answer to the security and safety questions teams now have to address before shipping:

  • Will the agent leak PII or system prompts when prompted adversarially?
  • Will it follow indirect prompt injection from tool outputs?
  • Will it abuse its tool/function-call permissions (BOLA, BFLA, excessive agency)?
  • Will it produce harmful, biased, or out-of-policy content?
  • Will it jailbreak under standard attack patterns?

Adding red-teaming closes this gap. It lets teams catch these vulnerabilities in CI alongside their functional eval, see findings in the same Grafana stack via OTLP, and gate releases when high-severity issues regress.

Why a separate workflow (not a phase in pipeline.py)

Promptfoo's data model differs fundamentally from that of the existing functional pipeline:

  • The functional pipeline runs curated Step objects with hand-written input + expected reference + tool_calls.
  • Red-teaming generates adversarial inputs at run time. There is no reference. Graders are plugin-specific (PII detector, jailbreak detector, etc.), not RAGAS metrics.

Forcing both into the same Experiment / EvaluatedExperiment schema would distort both. Instead, the red-teaming workflow is its own entry point (scripts/run_redteam.py) and its own Testkube template, sharing only the agent target URL and (optionally) the OTLP endpoint with the functional pipeline.

Why an adapter abstraction (not just promptfoo)

Promptfoo is the most mature LLM red-team CLI today, but it is the Node.js outlier in a Python codebase. Two of the strongest alternatives — garak and PyRIT — are Python-native libraries. The RedteamAdapter Protocol is designed to work cleanly for both subprocess-based (promptfoo) and library-based (garak, PyRIT) backends, so we are not locked into promptfoo's Node toolchain or upstream stability.
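To make the adapter idea concrete, here is a minimal sketch of what such a Protocol could look like. Only the names RedteamAdapter, Finding, and RedteamResults come from this issue; every field, method name, and signature below is an assumption for illustration, and FakeLibraryAdapter is a hypothetical in-process stand-in, not a real garak or PyRIT binding.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Finding:
    """One normalized finding. Field names are assumptions, not the final model."""
    plugin: str
    severity: str
    prompt: str
    response: str


@dataclass
class RedteamResults:
    """Backend-independent result container (assumed shape)."""
    backend: str
    findings: list[Finding] = field(default_factory=list)


@runtime_checkable
class RedteamAdapter(Protocol):
    """Structural interface: any class with these members qualifies."""
    name: str

    def list_plugins(self) -> list[str]: ...

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults: ...


class FakeLibraryAdapter:
    """Hypothetical library-style backend that runs in-process.

    A subprocess-style backend would implement the same two methods but
    shell out to a CLI inside run() and parse its report.
    """
    name = "fake"

    def list_plugins(self) -> list[str]:
        return ["pii", "jailbreak"]

    def run(self, target_url: str, plugins: list[str]) -> RedteamResults:
        unknown = sorted(set(plugins) - set(self.list_plugins()))
        if unknown:
            raise ValueError(f"unknown plugins: {unknown}")
        findings = [
            Finding(plugin=p, severity="low",
                    prompt="sample adversarial input",
                    response="sample agent reply")
            for p in plugins
        ]
        return RedteamResults(backend=self.name, findings=findings)
```

Because the Protocol is structural, the entry point can depend on RedteamAdapter alone; neither style of backend needs to inherit from a shared base class.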

Scope

v1 (this issue):

  • RedteamAdapter Protocol + normalized data models (Finding, RedteamResults, ...).
  • One backend implementation: PromptfooAdapter (subprocess to npx promptfoo redteam run).
  • Pydantic-validated redteam-config.yaml schema mirroring PipelineConfig shape.
  • OTLP publisher emitting redteam_* gauge metrics labeled with experiment_name, execution_id, execution_number, backend, plugin, severity.
  • Testkube TestWorkflowTemplate + concrete TestWorkflow.
  • Docker image extended with Node.js ≥ 18 and pinned promptfoo.
  • E2E test against the Tilt weather-agent.
  • User picks plugins from the configured backend's catalog (--list-plugins discovers what's available).
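As a rough illustration of the metric labeling described in the list above, the sketch below aggregates findings into one gauge data point per (plugin, severity) pair. The label keys come from this issue; the metric name redteam_findings, the GaugePoint type, and the dict-shaped findings are assumptions, and actual OTLP export is deliberately left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GaugePoint:
    """Hypothetical in-memory gauge sample, prior to any OTLP export."""
    name: str
    value: int
    attributes: dict[str, str]


def build_gauge_points(
    findings: list[dict[str, str]],
    *,
    experiment_name: str,
    execution_id: str,
    execution_number: int,
    backend: str,
) -> list[GaugePoint]:
    """Count findings per (plugin, severity) and attach the shared labels."""
    counts = Counter((f["plugin"], f["severity"]) for f in findings)
    base = {
        "experiment_name": experiment_name,
        "execution_id": execution_id,
        "execution_number": str(execution_number),
        "backend": backend,
    }
    return [
        GaugePoint(
            name="redteam_findings",  # assumed name under the redteam_* prefix
            value=count,
            attributes={**base, "plugin": plugin, "severity": severity},
        )
        for (plugin, severity), count in sorted(counts.items())
    ]
```

Keeping plugin and severity as labels rather than encoding them in the metric name is what would let Grafana filter and group findings across runs.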

Explicitly out of scope for v1:

  • Multi-backend in a single run.
  • Custom user-authored plugins.
  • A second adapter (garak / PyRIT). The abstraction is built for it; the implementation is a follow-up.
  • Cross-run finding diffing (handled by Grafana over OTLP).
  • Auto-remediation suggestions.
  • Replaying a specific finding deterministically.

Documents

The plan is broken into 14 TDD-shaped tasks with concrete code in every step.

Acceptance Criteria

  • uv run python3 scripts/run_redteam.py examples/redteam-config.yaml runs end-to-end against the Tilt weather-agent and writes data/redteam/findings.json, data/redteam/native-report.html, and data/redteam/summary.html.
  • uv run python3 scripts/run_redteam.py --list-plugins --backend promptfoo prints the catalog of available plugins.
  • Findings are visible in Grafana as redteam_* metrics, filterable by experiment_name and plugin.
  • Exit code semantics implemented and tested: 0 (clean), 1 (config invalid OR findings exceed fail_on.severity), 2 (backend execution failed).
  • uv run poe check passes (tests, mypy, bandit, ruff).
  • E2E test (tests_e2e/test_redteam_e2e.py) passes against Tilt.
  • Testkube workflow runs successfully in the cluster: kubectl testkube run tw example-redteam-workflow --watch.
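The exit code criterion above could be implemented along these lines. This is a sketch under stated assumptions: the severity scale (low, medium, high, critical), the exception names, and the reading of "exceed" as "at or above the fail_on.severity threshold" are all assumptions that the final design may settle differently.

```python
from typing import Callable, Optional

# Assumed severity scale, ordered from least to most severe.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


class ConfigInvalid(Exception):
    """Hypothetical error raised when redteam-config.yaml fails validation."""


class BackendFailed(Exception):
    """Hypothetical error raised when the backend itself cannot complete."""


def run_and_exit_code(run: Callable[[], list[str]], fail_on: Optional[str]) -> int:
    """Map one run to an exit code: 0 clean, 1 config/threshold, 2 backend.

    `run` returns the severity of each finding. Assumption: a finding at or
    above `fail_on` counts as exceeding the threshold.
    """
    try:
        severities = run()
    except ConfigInvalid:
        return 1
    except BackendFailed:
        return 2
    if fail_on is None:
        return 0  # no gate configured, findings are report-only
    threshold = SEVERITY_RANK[fail_on]
    return 1 if any(SEVERITY_RANK[s] >= threshold for s in severities) else 0


def broken_backend() -> list[str]:
    """Example run callable that simulates a backend crash."""
    raise BackendFailed("backend process exited non-zero")


def bad_config() -> list[str]:
    """Example run callable that simulates an invalid config."""
    raise ConfigInvalid("unknown plugin in config")
```

Keeping backend failure (2) distinct from a threshold breach (1) would let CI tell "the agent is vulnerable" apart from "the scan did not run".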

Related

Functional evaluation pipeline (unchanged by this work): scripts/pipeline.py.
