
Validate long-horizon safeguards: legitimate-work and runaway test plan #22

Description

Goal

Empirically verify that the long-horizon safeguards (spawn limit, depth limit, repeat-call abort, per-turn input ceiling) catch runaway shapes without blocking legitimate work.

Why

The safeguards are default-permissive: the defaults are wide enough that normal use shouldn't hit them, and override fields exist on AgentDefinition for cases that legitimately need more headroom. But the only way to know whether the defaults are calibrated correctly is to run real workloads and see what happens.
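As a rough sketch of what that could look like (field names and defaults here are assumptions inferred from the scenarios below, not the actual AgentDefinition API):

```python
# Hypothetical sketch only; field names and defaults are inferred from the
# scenarios below, not taken from the real AgentDefinition implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafeguardOverrides:
    """Per-agent overrides; None means 'use the permissive default'."""
    max_spawned_children: Optional[int] = None   # assumed default: 20
    max_delegation_depth: Optional[int] = None   # assumed default: 7 levels below root
    max_repeat_calls: Optional[int] = None       # identical tool+args calls; assumed default: 5
    max_turn_input_tokens: Optional[int] = None  # per-turn input ceiling; assumed default: 200K
```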

Test scenarios

Legitimate-work (must pass without aborting)

  1. Big-document research. Agent reads a 150K-token document and summarizes it (single big turn, under 200K per-turn cap).
  2. Slow steady work. Coding agent runs 200 small turns over 4 hours.
  3. Wide fan-out. Parent spawns 12 children to verify different aspects of a feature (within spawn cap of 20).
  4. Flaky-tool retry. Agent retries a failing tool 4 times with the same args before succeeding (under the repeat-call limit of 5; see the sketch after this list).
  5. Heavy provider usage. Agent on Claude Max runs through 60% of subscription quota over 4 hours.
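A minimal sketch of how scenario 4 might be scripted once a harness exists; `run_scenario` and the result fields are hypothetical stand-ins, and the legitimate-work scenarios may well stay manual runs as the notes below suggest:

```python
# Hypothetical harness call; run_scenario and the result fields are placeholders.
def test_flaky_tool_retry_stays_under_repeat_call_limit():
    result = run_scenario(
        agent="coding-agent",
        script="retry a failing tool 4 times with identical args, then succeed",
    )
    # A legitimate-work scenario must finish without any safeguard firing.
    assert result.completed
    assert result.stop_reason is None
    # Capture the observed metric for the calibration write-up.
    assert result.max_identical_tool_calls == 4
```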

Runaway (must abort cleanly with a structured error to parent)

  1. Stuck loop. Agent calls bash("ls /tmp") 10 times in a row → aborts at turn 5–6 with stop_reason: "repeat_call_limit" (sketched after this list).
  2. Spawn explosion. Misconfigured prompt spawns 100 verifier children → blocks at child 21 with Spawn limit exceeded.
  3. Recursive delegation. Parent → child → grandchild → ... → infinite. Caught by depth limit at 7 levels below root.
  4. Single oversized turn. Default agent given a prompt that would push 500K input tokens → refused with per_turn_input_exceeded.
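The runaway side would be asserted the other way around. Again a sketch against the same hypothetical harness, with the stop-reason string taken from scenario 1 above:

```python
# Hypothetical harness call; the structured-error shape is an assumption.
def test_stuck_loop_aborts_with_structured_error():
    result = run_scenario(
        agent="default-agent",
        script='call bash("ls /tmp") 10 times in a row',
    )
    assert not result.completed
    assert result.stop_reason == "repeat_call_limit"
    # The abort should fire near the limit of 5, not after all 10 calls.
    assert 5 <= result.turns_executed <= 6
```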

Tuning loop

After running each scenario:

  • If a legitimate scenario fails (false positive): raise the relevant default.
  • If a runaway scenario slips through (false negative): tighten the detector or add a new one.

Acceptance

  • Each scenario has a measured outcome (pass/fail + observed metrics).
  • Findings written up in docs/ (or wherever validation reports go).
  • Defaults adjusted if calibration was off.
  • Scenarios that exposed real bugs become regression tests in tests/atn/test_long_horizon_safeguards.py (a possible skeleton is sketched below).
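One possible shape for those promoted regression tests; the scenario names, the runner, and all stop-reason strings except the two quoted in the scenarios above are assumptions:

```python
# Sketch for tests/atn/test_long_horizon_safeguards.py; run_scenario and the
# stop-reason strings are assumptions, not a finalized harness API.
import pytest

RUNAWAY_CASES = [
    ("stuck_loop", "repeat_call_limit"),
    ("spawn_explosion", "spawn_limit_exceeded"),
    ("recursive_delegation", "depth_limit_exceeded"),
    ("oversized_turn", "per_turn_input_exceeded"),
]


@pytest.mark.parametrize("scenario, expected_stop_reason", RUNAWAY_CASES)
def test_runaway_scenario_aborts_cleanly(scenario, expected_stop_reason):
    result = run_scenario(name=scenario)
    assert not result.completed
    assert result.stop_reason == expected_stop_reason
```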

Notes

  • This can stay lighter-weight than full pytest fixtures: the legitimate-work scenarios may be manual runs the user observes in the daemon log.
  • Don't gate the milestone close on perfect coverage. The goal is calibration confidence, not 100% test fidelity.
