[deployment] Cost guardrail threshold-cross alert hook (webhook / callback) #70

@dep0we

Background

Cost guardrail warnings fire at the configured thresholds (default 50% / 80% of daily / monthly cap). The warning is logged as a JSONL record in <agent>/log/YYYY-MM/YYYY-MM-DD.jsonl with severity INFO/WARN (verified — atomic_agents/agent.py:1707-1714).

The JSONL log is the only delivery channel. There is no webhook, no callback, and no integration point for an external alerter (Telegram bot, Slack webhook, PagerDuty, email).

Why it matters

The reason warnings exist is to give the operator time to react before the cap blocks runs. Burying them in a JSONL file the operator reads only when they remember to check defeats the purpose.

For a personal-deployment use case (Dan's gizmo), Telegram alerting is non-negotiable — silent cron failures and silent threshold-crosses are both known operational pain. For a SaaS use case, alerting needs to route per tenant.

What to change

  1. Config schema — add optional alert_hooks block to model.md cost_guardrails:
    cost_guardrails:
      daily_cap_usd: 5.00
      monthly_cap_usd: 100.00
      warning_thresholds: [0.50, 0.80]
      alert_hooks:
        - type: webhook
          url: https://hooks.slack.com/...
          on: [threshold_cross, cap_blocked]
        - type: webhook
          url: https://api.telegram.org/bot.../sendMessage
          on: [threshold_cross, cap_blocked]
          template: "atomic-agents alert: {agent} at {pct}% of {period} cap"
  2. Runtime: _fire_cost_warning() in agent.py walks alert_hooks and dispatches to each hook subscribed to the event:
    • JSONL log entry as today (don't regress)
    • Webhook POST with JSON payload {agent, period, pct, threshold, severity, ts}
    • Failure handling: alerter timeout/error logged, doesn't block the agent run
  3. Library hook (orthogonal) — AtomicAgent accepts an optional on_cost_alert: Callable[[CostAlert], None] parameter for programmatic use. Hub-wrapped invocations register their own callable; bare-CLI uses the YAML config.
  4. Spec doc update: docs/spec/05-cost-guardrails.md documents the alert hook contract.
  5. Sample — Caldwell model.md shows commented-out Telegram webhook example.
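
The runtime and library-hook steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in agent.py: the CostAlert fields mirror the payload listed in step 2, while dispatch_cost_alert, post_webhook, the 3-second timeout, and the "any delivery succeeded" meaning of the return value are all hypothetical choices made for the sketch.

```python
import json
import logging
import urllib.request
from dataclasses import asdict, dataclass
from typing import Callable, Optional

log = logging.getLogger("cost_alerts")


@dataclass
class CostAlert:
    # Field names follow the payload in step 2; the types are assumptions.
    agent: str
    period: str       # "daily" or "monthly"
    pct: float        # percent of the cap consumed
    threshold: float  # fraction that was crossed, e.g. 0.80
    severity: str     # "INFO" or "WARN"
    ts: str           # timestamp string


def post_webhook(url: str, payload: dict, timeout: float = 3.0) -> None:
    """POST the payload as JSON; the timeout keeps a slow alerter from hanging the run."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def dispatch_cost_alert(
    alert: CostAlert,
    event: str,  # "threshold_cross" or "cap_blocked"
    hooks: list,
    on_cost_alert: Optional[Callable[[CostAlert], None]] = None,
    post: Callable[[str, dict], None] = post_webhook,
) -> bool:
    """Best-effort delivery. Returns True if at least one delivery succeeded
    (one possible meaning for cost_alert_dispatched; an assumption)."""
    delivered = False
    payload = asdict(alert)
    for hook in hooks:
        if hook.get("type") != "webhook" or event not in hook.get("on", []):
            continue
        try:
            post(hook["url"], payload)
            delivered = True
        except Exception as exc:  # log and continue; never block the agent run
            log.warning("cost alert webhook failed for %s: %s", hook.get("url"), exc)
    if on_cost_alert is not None:
        try:
            on_cost_alert(alert)
            delivered = True
        except Exception as exc:
            log.warning("on_cost_alert callback failed: %s", exc)
    return delivered
```

Passing the poster in as a parameter is only a convenience for tests: a fake that raises TimeoutError can stand in for a hung endpoint without touching the network.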

Acceptance

  • model.md parses alert_hooks correctly (with and without — backward compatible)
  • Webhook POST happens at threshold cross + at cap-blocked event, payload schema is documented
  • Webhook failure does NOT block the agent run
  • Programmatic on_cost_alert hook fires for both events
  • Tests cover: hook fires once per threshold per day (not on every run), hook failure is logged, webhook timeout doesn't hang the run
  • New JSONL field cost_alert_dispatched (true/false) on the warning record, for audit
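
The "once per threshold per day" criterion implies some record of which thresholds have already fired in the current period. A minimal sketch of that bookkeeping, assuming an in-memory ledger keyed by agent and period (ThresholdLedger and newly_crossed are hypothetical names; a real implementation would need to persist this state across runs, since cron-style invocations do not share memory):

```python
from typing import Dict, Iterable, List, Set, Tuple


def newly_crossed(
    spend_usd: float,
    cap_usd: float,
    thresholds: Iterable[float],
    already_fired: Set[float],
) -> List[float]:
    """Thresholds reached by the current spend that have not fired yet."""
    if cap_usd <= 0:
        return []
    ratio = spend_usd / cap_usd
    return [t for t in sorted(thresholds) if ratio >= t and t not in already_fired]


class ThresholdLedger:
    """Tracks fired thresholds per (agent, period key), e.g. ("demo-agent", "2026-05-08")."""

    def __init__(self) -> None:
        self._fired: Dict[Tuple[str, str], Set[float]] = {}

    def check(
        self,
        agent: str,
        period_key: str,
        spend_usd: float,
        cap_usd: float,
        thresholds: Iterable[float],
    ) -> List[float]:
        fired = self._fired.setdefault((agent, period_key), set())
        fresh = newly_crossed(spend_usd, cap_usd, thresholds, fired)
        fired.update(fresh)  # remember, so later runs the same day stay quiet
        return fresh


ledger = ThresholdLedger()
first_run = ledger.check("demo-agent", "2026-05-08", 2.60, 5.00, [0.50, 0.80])
second_run = ledger.check("demo-agent", "2026-05-08", 2.90, 5.00, [0.50, 0.80])
third_run = ledger.check("demo-agent", "2026-05-08", 4.10, 5.00, [0.50, 0.80])
next_day = ledger.check("demo-agent", "2026-05-09", 2.60, 5.00, [0.50, 0.80])
```

With the default thresholds and a 5.00 USD daily cap, the first run reports the 50% crossing, the second run reports nothing, the third reports the 80% crossing, and a new day key starts fresh.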

Open questions

  • Telegram webhook needs chat_id per operator — is that in alert_hooks.url (URL has chat_id baked in) or a separate field? (URL probably; standard Telegram pattern)
  • Per-agent vs per-deployment alert config: today guardrails are per-agent. Hooks probably want to be per-agent too (different agents → different routing) but with a global default at deployment level (atomic-agents.toml or env). Defer global default until needed.
  • Webhook retry policy: probably "best effort, don't retry, don't block". Since the agent will not retry, the receiving alerter is responsible for not dropping what it is sent. Document.
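
To make the chat_id question concrete, here is a sketch of how a templated hook might build its request body under either answer. Everything here is an assumption for illustration: telegram_body and render_template are hypothetical helpers, the separate chat_id hook field is one of the two options being debated, and the token and chat id values are placeholders. Telegram's sendMessage does require chat_id and text; whether chat_id in the query string combines cleanly with a JSON body should be checked against the Bot API documentation.

```python
from urllib.parse import parse_qs, urlsplit


def render_template(template: str, fields: dict) -> str:
    """Fill {agent}, {pct}, {period} style placeholders; extra fields are ignored."""
    return template.format(**fields)


def telegram_body(hook: dict, fields: dict) -> dict:
    """Build a sendMessage-style body for a hook that carries a template."""
    body = {"text": render_template(hook["template"], fields)}
    chat_id_in_url = "chat_id" in parse_qs(urlsplit(hook["url"]).query)
    if not chat_id_in_url:
        # Option B: a separate (hypothetical) chat_id field on the hook.
        if "chat_id" not in hook:
            raise ValueError("chat_id must be in the URL or set on the hook")
        body["chat_id"] = hook["chat_id"]
    return body


TEMPLATE = "atomic-agents alert: {agent} at {pct}% of {period} cap"
fields = {"agent": "demo-agent", "pct": 80, "period": "daily", "severity": "WARN"}

# Option A: chat_id baked into the URL (placeholder token and id).
hook_a = {
    "url": "https://api.telegram.org/botTOKEN/sendMessage?chat_id=12345",
    "template": TEMPLATE,
}
# Option B: chat_id as its own field.
hook_b = {
    "url": "https://api.telegram.org/botTOKEN/sendMessage",
    "template": TEMPLATE,
    "chat_id": "12345",
}

body_a = telegram_body(hook_a, fields)
body_b = telegram_body(hook_b, fields)
```

Option A keeps the schema at type/url/on/template with no Telegram-specific field, which is the main argument for the "URL probably" answer above.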

Context

  • Surfaced in deployment-readiness review (2026-05-08), gap E
  • Telegram alerting is non-negotiable for Dan's gizmo deployment
  • Pattern reference: similar webhook/callback hooks would be useful for other framework events (run_failed, dream_completed, eval_failed) — track here as future-but-not-this-PR

Metadata

    Labels

    deployment (Deployment, install, upgrade, and operational runbook gaps); enhancement (New feature or request)