Goal
Empirically verify that the long-horizon safeguards (spawn limit, depth limit, repeat-call abort, per-turn input ceiling) catch runaway shapes without blocking legitimate work.
Why
The safeguards are default-permissive: the defaults are wide enough that normal use shouldn't hit them, and override fields exist on `AgentDefinition` for cases that legitimately need more headroom. But the only way to know whether the defaults are calibrated correctly is to run real workloads and see what happens.
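As a reference point, the four limits and their defaults can be sketched as a single config object. This is a hypothetical sketch: the field names, `effective_limits` helper, and override wiring are assumptions, but the numeric ceilings are the ones this plan exercises.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SafeguardLimits:
    # Field names are assumptions; the defaults are the values under test.
    max_child_spawns: int = 20                # spawn limit
    max_delegation_depth: int = 7             # depth limit below root
    max_repeat_calls: int = 5                 # identical calls before abort
    max_per_turn_input_tokens: int = 200_000  # per-turn input ceiling

def effective_limits(**overrides: int) -> SafeguardLimits:
    """Permissive defaults, with per-agent overrides layered on top."""
    return replace(SafeguardLimits(), **overrides)
```

An agent that legitimately needs more headroom would carry its overrides on its definition and resolve them through something like `effective_limits(max_child_spawns=50)`.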
Test scenarios
Legitimate-work (must pass without aborting)
- Big-document research. Agent reads a 150K-token document and summarizes it (single big turn, under 200K per-turn cap).
- Slow steady work. Coding agent runs 200 small turns over 4 hours.
- Wide fan-out. Parent spawns 12 children to verify different aspects of a feature (within spawn cap of 20).
- Flaky-tool retry. Agent retries a failing tool 4 times with same args before succeeding (under repeat-call limit of 5).
- Heavy provider usage. Agent on Claude Max runs through 60% of subscription quota over 4 hours.
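Several of these scenarios sit deliberately just under a ceiling (150K under the 200K per-turn cap, 4 retries under the repeat limit of 5). The per-turn ceiling itself can be sketched as a pre-flight guard; the function and exception names here are assumptions, not the real API.

```python
class PerTurnInputExceeded(Exception):
    """Refusal surfaced to the parent as per_turn_input_exceeded."""

def check_per_turn_input(input_tokens: int, ceiling: int = 200_000) -> None:
    # Pre-flight guard: refuse the turn before it reaches the provider.
    if input_tokens > ceiling:
        raise PerTurnInputExceeded(
            f"per_turn_input_exceeded: {input_tokens} > {ceiling} tokens"
        )
```

Under this shape, the 150K-token big-document turn passes untouched, while an oversized turn is refused before any provider call is made.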
Runaway (must abort cleanly with a structured error to parent)
- Stuck loop. Agent calls `bash("ls /tmp")` 10 times in a row → aborts at turn 5–6 with `stop_reason: "repeat_call_limit"`.
- Spawn explosion. Misconfigured prompt spawns 100 verifier children → blocks at child 21 with `Spawn limit exceeded`.
- Recursive delegation. Parent → child → grandchild → ... → infinite. Caught by depth limit at 7 levels below root.
- Single oversized turn. Default agent given a prompt that would push 500K input tokens → refused with `per_turn_input_exceeded`.
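The stuck-loop abort hinges on a simple detector. A minimal sketch, assuming the call history is a list of (tool name, serialized args) pairs and the limit-of-5 semantics above; the names are hypothetical:

```python
class RepeatCallAbort(Exception):
    """Structured abort surfaced as stop_reason: "repeat_call_limit"."""

def check_repeat_calls(history: list[tuple[str, str]], limit: int = 5) -> None:
    # Abort once the last `limit` calls are identical (same tool, same args).
    # Four identical retries followed by a success never trips this, which is
    # what keeps the flaky-tool scenario legitimate.
    if len(history) >= limit and len(set(history[-limit:])) == 1:
        tool, args = history[-1]
        raise RepeatCallAbort(f"repeat_call_limit: {tool}({args}) repeated {limit}x")
```

Checking identity over a trailing window, rather than a cumulative count, is what distinguishes a stuck loop from a long run that merely reuses a tool often.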
Tuning loop
After running each scenario:
- If a legitimate scenario fails (false positive): raise the relevant default.
- If a runaway scenario passes through: tighten the detector or add a new one.
Acceptance
- Each scenario has a measured outcome (pass/fail + observed metrics).
- Findings written up in `docs/` (or wherever validation reports go).
- Defaults adjusted if calibration was off.
- Scenarios that exposed real bugs become regression tests in `tests/atn/test_long_horizon_safeguards.py`.
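To illustrate the regression-test shape, here is a self-contained sketch for the spawn-explosion scenario. The `SpawnTracker` stand-in is hypothetical (the real bookkeeping presumably lives daemon-side); only the cap of 20 and the refusal at child 21 come from the scenario.

```python
class SpawnLimitExceeded(Exception):
    pass

class SpawnTracker:
    """Stand-in for daemon-side spawn bookkeeping; names are assumptions."""

    def __init__(self, limit: int = 20):
        self.limit = limit
        self.children = 0

    def spawn_child(self) -> int:
        if self.children >= self.limit:
            raise SpawnLimitExceeded(
                f"Spawn limit exceeded at child {self.children + 1}"
            )
        self.children += 1
        return self.children

def test_spawn_explosion_blocks_at_child_21():
    tracker = SpawnTracker(limit=20)
    for _ in range(20):              # first 20 children succeed
        tracker.spawn_child()
    try:
        tracker.spawn_child()        # child 21 must be refused
        assert False, "expected SpawnLimitExceeded"
    except SpawnLimitExceeded:
        pass
```

A scenario graduates to a test like this only once its failure mode has actually been observed, keeping the suite grounded in measured behavior rather than speculation.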
Notes
- Lighter-weight than full pytest fixtures: the legitimate-work scenarios may be manual runs the user observes in the daemon log.
- Don't gate the milestone close on perfect coverage. The goal is calibration confidence, not 100% test fidelity.