AgentFT

Agent Flow Test - Pytest for AI agents

Available on PyPI · Python 3.11+ · MIT License

AgentFT (Agent Flow Test) is a pytest-style evaluation framework for AI agents. It provides composable primitives for tasks, scenarios, agents, and judges, with production-ready infrastructure for concurrency, retries, observability, and regression control.

What's New

  • Secure declarative config mode: --config-json --strict-config
  • Distributed local orchestration: aft orchestrate for sharded multi-process runs + merge
  • Retry policy by error type with richer error taxonomy
  • Task-level vs judge-level aggregation with configurable task pass rules
  • Ranking support: aft rank (Elo)
  • Artifact migration tooling: aft migrate
  • Columnar export API (DuckDB/Parquet)
  • Environment/episode scenario toolkit and additional preset benchmark suites
  • Structured external event sinks for run lifecycle telemetry
  • Interactive report drilldown with filtering
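The "retry policy by error type" idea above can be illustrated with a small self-contained sketch. The policy table and helper below are hypothetical, written to show the concept; AgentFT's actual error taxonomy and retry API may differ.

```python
import asyncio

# Hypothetical per-error-type retry budget; AgentFT's real taxonomy
# and defaults may differ.
RETRY_BUDGET = {
    TimeoutError: 3,      # transient: worth retrying a few times
    ConnectionError: 2,   # transient network failure
    ValueError: 0,        # deterministic failure: do not retry
}

async def run_with_retries(coro_factory):
    """Await a coroutine, retrying per the error-type budget."""
    attempts = {}
    while True:
        try:
            return await coro_factory()
        except Exception as exc:
            budget = RETRY_BUDGET.get(type(exc), 0)
            used = attempts.get(type(exc), 0)
            if used >= budget:
                raise  # budget exhausted for this error type
            attempts[type(exc)] = used + 1
            await asyncio.sleep(0)  # backoff would go here
```

The key point is that the retry decision keys on the exception type, so deterministic failures fail fast while transient ones are retried.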

Features

  • Pytest-like simplicity: Minimal boilerplate, clear abstractions
  • Production-ready execution: Async runner, retries, rate limiting, timeouts, fail-fast modes
  • Flexible parallelism: Independent controls for agents, tasks, and judges
  • Deterministic reproducibility: Seeded shuffle + deterministic sharding
  • Strong artifact model: JSONL or SQLite backends with schema versioning and checkpoints
  • Regression workflows: Compare, gate, rejudge, analyze, rank, merge
  • Extensibility: Plugin registry for agents, scenarios, and judges
  • Scenario breadth: List/CSV/JSONL/HuggingFace + rollout-based episode scenarios
  • Governance support: API stability policy and governance/compliance guidance

Quick Start

Install from PyPI:

pip install agentft

Run an example:

git clone https://github.com/Geddydukes/agentflowtest
cd agentflowtest
aft run --config examples/config_example.py

Or use in code:

from agentft import RunConfig, run
from agentft.presets import build_math_basic_scenario, ExactMatchJudge

class MyAgent:
    # Identifying metadata recorded in run artifacts and reports.
    name = "my_agent"
    version = "0.1.0"
    provider_key = None

    # Lifecycle hooks invoked by the runner.
    async def setup(self):
        pass

    async def reset(self):
        pass

    async def teardown(self):
        pass

    async def run_task(self, task, context=None):
        # Return the agent's answer for one task.
        return {"response": "42"}

config = RunConfig(
    name="quick_test",
    agents=[MyAgent()],
    scenarios=[build_math_basic_scenario()],
    judges=[ExactMatchJudge()],
)

results = run(config)
print(f"Passed: {sum(r.passed for r in results)}/{len(results)}")

Core CLI Commands

aft run

aft run --config path/to/config.py

Strict non-executable mode:

aft run --config-json path/to/config.json --strict-config

Useful overrides:

aft run --config examples/config_example.py \
  --max-tasks-parallel 8 \
  --max-agents-parallel 2 \
  --max-judges-parallel 4 \
  --artifact-backend sqlite \
  --shard-count 4 --shard-index 1
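The seeded shuffle plus deterministic sharding combination can be sketched in a few lines. This mirrors the concept behind `--shard-count`/`--shard-index`, not necessarily AgentFT's internal algorithm:

```python
import random

def shard_tasks(task_ids, shard_count, shard_index, seed=123):
    """Deterministically shuffle tasks, then take a round-robin slice.

    With a fixed seed, every process computes the same shuffled order,
    so shard i always receives the same subset no matter where it runs.
    """
    order = list(task_ids)
    random.Random(seed).shuffle(order)      # seeded => reproducible
    return order[shard_index::shard_count]  # round-robin slice

# Shards partition the task set: no overlap, nothing dropped.
tasks = [f"task_{i}" for i in range(10)]
shards = [shard_tasks(tasks, 4, i) for i in range(4)]
```

Because the shuffle is seeded, rerunning any single shard reproduces exactly the same task subset.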

aft summary

aft summary --run-dir runs/my_run/

Task-level rollup mode:

aft summary --run-dir runs/my_run/ --aggregation-level task --task-pass-rule majority
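The task-level rollup with a majority pass rule can be sketched as follows; the data shapes here are illustrative, not AgentFT's schema:

```python
from collections import defaultdict

def task_pass_rollup(judge_verdicts, rule="majority"):
    """Roll per-judge verdicts up to one pass/fail per task.

    judge_verdicts: list of (task_id, passed) pairs, one per verdict.
    rule: "majority" (more than half pass) or "all" (every judge passes).
    """
    by_task = defaultdict(list)
    for task_id, passed in judge_verdicts:
        by_task[task_id].append(passed)
    result = {}
    for task_id, verdicts in by_task.items():
        if rule == "majority":
            result[task_id] = sum(verdicts) > len(verdicts) / 2
        else:  # "all"
            result[task_id] = all(verdicts)
    return result

verdicts = [("t1", True), ("t1", True), ("t1", False),
            ("t2", False), ("t2", False), ("t2", True)]
```

Under "majority", two of three passing verdicts make the task pass; under "all", a single failing judge fails it.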

aft compare

aft compare --run-a runs/base/ --run-b runs/candidate/

aft gate

aft gate --run-a runs/base/ --run-b runs/candidate/ --max-regressions 0
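The gate concept (fail the candidate if it regresses on tasks the baseline passed, beyond an allowed budget) can be sketched like this; the data shapes are illustrative:

```python
def gate(base_results, cand_results, max_regressions=0):
    """Return True if the candidate run passes the regression gate.

    A regression is a task that passed in the base run but fails in
    the candidate. Both arguments map task_id -> passed (bool).
    """
    regressions = [t for t, passed in base_results.items()
                   if passed and not cand_results.get(t, False)]
    return len(regressions) <= max_regressions

base = {"t1": True, "t2": True, "t3": False}
cand = {"t1": True, "t2": False, "t3": True}
```

Note that new passes (t3 here) do not offset regressions: the gate counts only tasks that went from pass to fail.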

aft analyze

aft analyze --run-dir runs/my_run/ --output-json runs/my_run/analytics.json

aft rejudge

aft rejudge --run-dir runs/my_run/ --config examples/config_example.py

aft merge

aft merge --run-dirs runs/shard0/ runs/shard1/ --output-dir runs/merged/
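At its core, merging sharded JSONL results is a concatenation of the per-shard record streams. The sketch below shows that core idea only; real run directories also carry metadata, traces, and checkpoints that a merge must reconcile:

```python
import json
import tempfile
from pathlib import Path

def merge_jsonl(shard_files, out_file):
    """Concatenate JSONL result records from several shard files."""
    records = []
    for path in shard_files:
        with open(path) as f:
            records.extend(json.loads(line) for line in f if line.strip())
    with open(out_file, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return records

# Demo with two synthetic shard files in a temp directory
# (field names are hypothetical).
tmp = Path(tempfile.mkdtemp())
(tmp / "shard0.jsonl").write_text('{"task": "t1", "passed": true}\n')
(tmp / "shard1.jsonl").write_text('{"task": "t2", "passed": false}\n')
merged = merge_jsonl([tmp / "shard0.jsonl", tmp / "shard1.jsonl"],
                     tmp / "merged.jsonl")
```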

aft orchestrate

aft orchestrate --config examples/config_example.py --shards 4 --output-dir runs/merged_orchestrated/

aft rank

aft rank --run-dir runs/my_run/
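`aft rank` produces Elo rankings; the standard Elo update that underlies such rankings looks like this (K-factor and starting ratings are illustrative, not AgentFT's configured values):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update for one pairwise comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    Returns the new (rating_a, rating_b).
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two agents start at 1000; agent A wins one comparison.
a, b = elo_update(1000.0, 1000.0, 1.0)
```

The update is zero-sum: whatever rating A gains, B loses, so the rating pool stays constant across comparisons.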

aft migrate

aft migrate --run-dir runs/legacy_run/ --target-schema 1.1.0

Run Artifacts

Each run creates runs/<run_id>/ containing:

  • results.jsonl or artifacts.db
  • traces.jsonl
  • run_metadata.json
  • run_checkpoint.json
  • agent_outputs.jsonl (JSONL backend cache, optional)
  • cached_outputs table in artifacts.db (SQLite backend cache, optional)
  • report.html
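Since `results.jsonl` is line-delimited JSON, it can be read back with the standard library. The field names in this sketch are illustrative, not a schema guarantee:

```python
import json
import tempfile
from pathlib import Path

def load_results(run_dir):
    """Load result records from results.jsonl in a run directory."""
    path = Path(run_dir) / "results.jsonl"
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo against a synthetic run directory (fields are hypothetical).
run_dir = Path(tempfile.mkdtemp())
(run_dir / "results.jsonl").write_text(
    '{"task_id": "t1", "agent": "my_agent", "passed": true}\n'
    '{"task_id": "t2", "agent": "my_agent", "passed": false}\n'
)
results = load_results(run_dir)
```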

Smoke Test Visualization (2026-02-11)

Recent 5-run local validation matrix:

| Run | Mode | Path | Result | Notes |
|-----|------|------|--------|-------|
| R1 | Baseline | runs/smoke_r1_baseline-5a5f7d3c | 4/4 (100%) | Control run |
| R2 | Parallel + Shuffle | runs/smoke_r2_parallel-0e22f470 | 4/4 (100%) | max_agents_parallel=2, max_tasks_parallel=4, seed=123 |
| R3 | SQLite + Cache | runs/smoke_r3_sqlite_cache-5890a078 | 4/4 (100%) | SQLite artifacts and cached outputs |
| R4 | Resume/Checkpoint | runs/smoke_r4_resume-e3bb2645 | 1/1 -> 4/4 (100%) | Early stop then resumed to completion |
| R5 | Orchestrated Shards | runs/smoke_r5_orchestrated | 4/4 (100%) | 2 shards + merged output |

Validation checks:

  • summary on R1: pass rate 100%
  • compare R1 vs R2: delta 0.00%, regressions 0
  • gate R1 vs R2: passed
  • analyze on R3: pass rate 100%, macro scenario rate 100%
  • rejudge on cached outputs: 4/4 passed

Pass-rate bars:

  • R1: ██████████ 100%
  • R2: ██████████ 100%
  • R3: ██████████ 100%
  • R4: ██████████ 100% (after resume)
  • R5: ██████████ 100%

Extended Functionality Guide

A comprehensive guide for all updated functionality is available at:

  • docs/UPDATED_FUNCTIONALITY.md

Testing

AgentFT includes 90+ automated tests across runner behavior, storage backends, reporting, CLI, presets, plugins, and resilience checks.

pip install -e ".[dev]"
PYTHONPATH=src pytest -q

Optional columnar export dependency:

pip install -e ".[columnar]"

Project Status

AgentFT is in active development.

Current version: 0.1.0
