chore(cf-cost): stretch aibtc-heartbeat schedule to 3600s by whoabuddy · Pull Request #3 · aibtcdev/agent-runtime

whoabuddy · 2026-05-04T18:52:27Z

Pairs with arc0btc/agents-love-bitcoin# (cf-cost cleanup campaign, PR 3 — heartbeat consolidation).

Summary

The ALB-side PR 3 change makes `/api/me/inbox-status` refresh `agent_index.last_active_at` (coalesced, ≤ 1 GlobalDO write per agent per 60s window). With the runtime poll cadence at 600s, that's a tighter liveness signal than the dedicated `aibtc-heartbeat` task at 360s ever provided — so the heartbeat stops earning its cost.

This PR adds a fixture documenting the recommended stretched setting (3600s = 1h) without removing the heartbeat entirely. Keeps it as a backstop for cases where the inbox-status path goes quiet for an unusual reason.

Operator action per VM

```bash
~/.bun/bin/bun run src/cli.ts schedule-create \
--config ~/.config/agent-runtime/.host.json \
--file fixtures/aibtc-heartbeat.schedule.json
```

`upsertRecurringSchedule` updates in place, so re-running on a VM that already has an `aibtc-heartbeat` schedule rolls it to 3600s without leaving stale state.

Apply order: ALB PR 3 first (so `last_active_at` keeps refreshing automatically), then this fixture cutover on each VM.

Expected effect

`aibtc-heartbeat` task volume per VM: ~240/day → ~24/day (-90%).
Eliminates ~5 × 240 = 1,200 dispatch-loop iterations/day across the current 5-agent fleet, with proportional savings in Workers requests + AIBTC API writes.

Test plan

`bunx tsc --noEmit` clean
After ALB PR 3 deploys: apply on Spark and confirm `schedule-tick` reports the new 3600s next-run timestamp
2h post-cutover: confirm `last_active_at` for that agent stays within 1h of `now()` via ALB DO query
No operator alarm on agent liveness

🤖 Generated with Claude Code

Pairs with arc0btc/agents-love-bitcoin PR 3 (cf-cost cleanup campaign, heartbeat consolidation). The ALB-side change makes `/api/me/inbox-status` refresh `agent_index.last_active_at` on every poll, so the dedicated `aibtc-heartbeat` task at 360s stops earning its cost — runtime polling at 600s is already a tighter liveness signal than the heartbeat ever was. Stretching to 3600s (1h) keeps the heartbeat as a backstop for agents whose ALB inbox-status path goes quiet for an unusual reason, without the per-VM Workers + DO + AIBTC API write fan-out that 360s incurred. Operator action per VM: ```bash ~/.bun/bin/bun run src/cli.ts schedule-create \ --config ~/.config/agent-runtime/<agent>.host.json \ --file fixtures/aibtc-heartbeat.schedule.json ``` `upsertRecurringSchedule` updates the row in place, so re-running on a VM that already has an `aibtc-heartbeat` schedule rolls it to 3600s without leaving stale state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

whoabuddy · 2026-05-04T19:17:54Z

Closing — paired with reverted ALB PR (arc0btc/agents-love-bitcoin#19 → reverted by #20).

The heartbeat consolidation was scoped wrong as part of the ALB cost-cleanup campaign:

`aibtc-checkin` hits aibtc.com via an externally-installed `aibtc-heartbeat` CLI, not ALB. Stretching the cadence reduces aibtc.com Workers cost, not ALB's.
ALB has no liveness column to keep warm — the `agent_index.last_active_at` field that the original plan referenced was never read by any ALB consumer.

Stretched-heartbeat cadence still has merit if aibtc.com itself wants to reduce Workers cost from the per-VM 360s polling. That's an aibtc.com cost-cleanup track, separate from this campaign.

The fixture in this PR (`fixtures/aibtc-heartbeat.schedule.json` at 3600s) can be cherry-picked into a future PR if the operator decides to apply it independently — the file content stands on its own.

…it flag, host requirement, write-back guard, idempotency key Addresses 7 inline review findings (Copilot + secret-mars + arc0btc) on PR #5 — all substantive items resolved in-tree, no follow-ups left. **Catch widening (Copilot #4, secret-mars [blocking-risk], arc0btc #1)** `runSubstrateIntakeTick` previously only caught `claimNextJob`. Both `jobRowToTaskInput` and `enqueueTask` ran outside the try, so a malformed JobRow or a local SQLite lock contention escaped as a thrown error from `runOnce`. Each call site now has its own catch with a distinct skip reason — `job-parse-fail` and `local-enqueue-fail` — so the contract's "NEVER throws" guarantee is real. The substrate job stays under lease in both failure modes; `releaseExpiredLeases` reconciles on the next cycle. **`substrateDbInitialized` retry on credential fail (Copilot #1, all 3 reviewers seconded)** The flag was set to `true` BEFORE `createSubstrateConnection` succeeded. A transient credential read miss at first tick (e.g. encrypted blob not yet available) permanently disabled substrate intake for the process lifetime — the contract's documented `[substrate] skip reason=credential-fail` log line would never re-fire. Flag now sets only inside the success branch; next tick retries. **`runSubstrateWriteBack` swallows transient PG blips (arc0btc #3)** Was the only substrate fn whose Postgres calls were unguarded — a mid-write-back connection reset would propagate out of `runOnce` even though the local task already finished cleanly. Wrapped both `completeJob` and `failJob` in a single try; new `[substrate] write-back error=<msg> jobs.id=<id>` log line on transient fail. Lease recovery still reconciles eventually. **Default host removed (Copilot #3, arc0btc seconded)** `host` no longer defaults to a hard-coded private IP. When `substrate.enabled: true`, an explicit `substrate.host` is required — `createSubstrateConnection` throws a clear error if unset. Closes the "dev/test slot misconfigured at enabled:true accidentally connects to prod" footgun. **Self-contained log fallbacks (secret-mars #5)** The empty-else branches on `completeJob`/`failJob` `{ ok: false }` results relied on the substrate-db package emitting its own `[substrate] complete-epoch-mismatch ...` log. If that package's log format ever changes, those branches went silent. Added `[substrate] complete-failed jobs.id=<id> epoch=<n>` and `[substrate] fail-failed ...` fallbacks so this code is self-contained. **Idempotency-key threading (arc0btc #2 design, secret-mars idempotency follow-up)** Closes the "side-effecting jobs can execute twice when a lease expires mid-flight" hazard structurally — the fence on `jobs.claim_epoch` only protects the status write, not the action. `jobRowToTaskInput` now threads `payload.idempotency_key = "substrate-<job_id>-e<claim_epoch>"` so downstream side-effecting handlers (email send, PR open, tx broadcast) can dedup against their own per-handler key store. Bounds the *consequence* from "action runs twice" to "action runs once, second attempt no-ops." Substrate tasks also enqueue with `priority: 1` so they execute on the same or next tick — the lease window now tracks real execution latency instead of waiting behind lower-priority work, which shrinks the "lease expires while task waits in local queue" window further. **Silent null claim (Copilot #5)** Contract said null-claim is silent; impl logged `[substrate] claim ... result=none` every tick. Dropped the log — quiet-tick visibility lives in successful-claim and idle-dispatch event lines, not substrate tick-rate noise. **Other nits (Copilot #2, arc0btc #5)** - Removed unused `getTaskById` import in `substrate.ts`. - Documented `substrate` block's shallow-merge behavior in `types.ts` (unlike `profiles`/`adapters` which deep-merge). Slots that extend a base and want to flip only `isLeaseRecoveryOwner: true` must repeat the whole substrate block. Deferred (already declared out of scope by reviewers): - `pg_advisory_lock` guard on `isLeaseRecoveryOwner` race (secret-mars [suggestion]) — multi-slot mis-config protection is a follow-up. - `_substrate_*` payload-injection vs strict validators (secret-mars [suggestion]) — naming is defensive prefixing already. Contract block in src/substrate.ts updated with all new log lines and a new "Side-effect duplicate-execution guard" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l/lease-recovery paths Closes the test-gap finding from arc0btc #4 and secret-mars: prior suite covered only `resolveSubstrateConfig`, `jobRowToTaskInput`, and two write-back *early-return* guards via a fakeDb that threw if touched. Claim, complete, fail, epoch-mismatch no-op, transient write-back throw, and lease recovery — the "highest correctness surface" per the PR's own Commit-2 description — were untested. Rewires the file to use `mock.module("@genesis-works/substrate-db", ...)` with mock fns per substrate-db export. Local DB is a real `openDb(:memory:)` so `enqueueTask` exercises the actual sqlite write path. console.{info,error,warn} are spied per-test for log-line assertions. New tests (18, alongside 10 existing for 28 total): - `runSubstrateIntakeTick`: - happy claim → enqueues local task, returns `claimed=true` with correct epoch + jobId, emits `[substrate] claim ...` log - null claim → silent (no `[substrate] claim` log line; just `claimed: false`) - db-unreachable → catches `claimNextJob` throw, logs `reason=db-unreachable` - local-enqueue-fail → forces `enqueueTask` to throw by closing the local db, asserts catch fires + log line includes `reason=local-enqueue-fail jobs.id=<id>` (proves Copilot #4 / arc0btc #1 fix is wired) - `runSubstrateWriteBack`: - non-substrate source no-op (existing — kept) - missing `_substrate_job_id` no-op (existing — kept) - complete happy → calls `completeJob` with `claim_epoch` arg, `[substrate] complete jobs.id=... epoch=...` log - complete `{ ok: false }` → emits self-contained fallback `[substrate] complete-failed` log (proves secret-mars #5 fix) - complete throws → catches transient PG blip, emits `[substrate] write-back error=... jobs.id=...` (proves arc0btc #3 fix is wired — write-back no longer propagates out of `runOnce`) - fail happy → calls `failJob`, logs reason snippet - fail `{ ok: false }` → emits `[substrate] fail-failed` fallback - `runSubstrateLeaseRecovery`: - released > 0 → logs `[substrate] lease-recovery released=<n>` - released = 0 → no log (quiet success) - throws → catches, logs `[substrate] lease-recovery error=...`, does NOT propagate - `createSubstrateConnection`: - throws when `substrate.host` is unset (proves Copilot #3 fix — no implicit default to a hard-coded private IP) - throws on whitespace-only host - `jobRowToTaskInput`: - asserts `payload.idempotency_key = "substrate-<job_id>-e<epoch>"` (proves idempotency-key threading is wired) - asserts `priority = 1` - asserts `_substrate_*` payload fields threaded - asserts `max_attempts = 1` (substrate handles retry) - existing source / subject-fallback tests kept Header comment rewritten to match what's actually covered (Copilot #7). Unused `mock`/`beforeEach`/`afterEach` imports replaced with the ones actually used (Copilot #6 — though now they ARE used, so the prior nit naturally resolves). Test count: 143 total (28 substrate + 115 existing) — up from 125. `bunx tsc --noEmit` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(substrate): opt-in substrate intake — config, credential, claim, enqueue Adds @genesis-works/substrate-db as a dependency and wires a substrate dispatch intake path into the runOnce tick. Opt-in per slot: substrate is disabled by default (config.substrate.enabled=false). Flipping this flag live changes zero behavior on any slot that has not set it. Config keys (host config substrate block): substrate.enabled — bool, default false substrate.credential — string, credential id (NEVER plaintext password) substrate.kinds — string[], job kinds this slot handles substrate.slotId — string, this slot's claim identifier substrate.leaseSecs — number, default 300 (5 minutes) substrate.host/port/database/user — Postgres connection params On each tick (when enabled): 1. Credential resolved via existing encrypted-blob pattern. 2. claimNextJob(slotId, kinds) called against shared Postgres. Log: "[substrate] claim slot=<id> kinds=<list> result=<jobs.id|none>" 3. Claimed job converted to TaskInput (source="substrate:<jobId>"). 4. TaskInput enqueued as a local runtime task for next-tick execution. Failure conditions (all logged, none crash): - substrate unreachable → "[substrate] skip reason=db-unreachable error=..." - credential fail → "[substrate] skip reason=credential-fail error=..." - claimNextJob null → "[substrate] claim ... result=none" (silent no-op) Pinned substrate-db SHA: d458200b008efe8be3d37912a71b990d73ff2b17 (arc0btc/substrate-db — private, deployed alongside agent-runtime on slots) Co-Authored-By: Claude <noreply@anthropic.com> * feat(substrate): lease semantics — epoch-fenced write-back, max_attempts, lease recovery Adds the epoch-fenced write-back hook and lease recovery cadence to the substrate intake adapter. Write-back (after finalizeTaskAttempt): - If task.source starts with "substrate:", calls runSubstrateWriteBack. - Reads _substrate_job_id and _substrate_claim_epoch from task payload. - Calls completeJob(epoch) on success, failJob(epoch) on failure. - Epoch mismatch (stale executor woke after lease recovery) → logged no-op: "[substrate] complete-epoch-mismatch ... stale executor, ignored" - Pre-fencing row (epoch=0, expected>0) → distinct log line: "pre-fencing row, ignored" - completeJob/failJob return WriteBackResult — conflict is logged, not thrown. Lease recovery: - Runs releaseExpiredLeases on the substrate DB when isLeaseRecoveryOwner=true. - Cadence: leaseRecoveryCadenceSecs (default 60s) using a module-level timer. - ONE nominated owner runs this — not all slots — to avoid stampede. - Log: "[substrate] lease-recovery released=<n>" - Lease expiry measured against Postgres now() — NTP drift irrelevant. Log lines contract (stable, parseable): [substrate] claim slot=<id> kinds=<list> result=<jobs.id|none> [substrate] complete jobs.id=<id> epoch=<n> [substrate] fail jobs.id=<id> epoch=<n> reason=<...> [substrate] lease-recovery released=<n> [substrate] skip reason=credential-fail error=<...> [substrate] skip reason=db-unreachable error=<...> [substrate] complete-epoch-mismatch jobs.id=<id> expected=<n> actual=<m> — ... Tests: 125/125 pass (10 new substrate tests + 115 existing). Co-Authored-By: Claude <noreply@anthropic.com> * fix(substrate): review-feedback correctness pass — catch widening, init flag, host requirement, write-back guard, idempotency key Addresses 7 inline review findings (Copilot + secret-mars + arc0btc) on PR #5 — all substantive items resolved in-tree, no follow-ups left. **Catch widening (Copilot #4, secret-mars [blocking-risk], arc0btc #1)** `runSubstrateIntakeTick` previously only caught `claimNextJob`. Both `jobRowToTaskInput` and `enqueueTask` ran outside the try, so a malformed JobRow or a local SQLite lock contention escaped as a thrown error from `runOnce`. Each call site now has its own catch with a distinct skip reason — `job-parse-fail` and `local-enqueue-fail` — so the contract's "NEVER throws" guarantee is real. The substrate job stays under lease in both failure modes; `releaseExpiredLeases` reconciles on the next cycle. **`substrateDbInitialized` retry on credential fail (Copilot #1, all 3 reviewers seconded)** The flag was set to `true` BEFORE `createSubstrateConnection` succeeded. A transient credential read miss at first tick (e.g. encrypted blob not yet available) permanently disabled substrate intake for the process lifetime — the contract's documented `[substrate] skip reason=credential-fail` log line would never re-fire. Flag now sets only inside the success branch; next tick retries. **`runSubstrateWriteBack` swallows transient PG blips (arc0btc #3)** Was the only substrate fn whose Postgres calls were unguarded — a mid-write-back connection reset would propagate out of `runOnce` even though the local task already finished cleanly. Wrapped both `completeJob` and `failJob` in a single try; new `[substrate] write-back error=<msg> jobs.id=<id>` log line on transient fail. Lease recovery still reconciles eventually. **Default host removed (Copilot #3, arc0btc seconded)** `host` no longer defaults to a hard-coded private IP. When `substrate.enabled: true`, an explicit `substrate.host` is required — `createSubstrateConnection` throws a clear error if unset. Closes the "dev/test slot misconfigured at enabled:true accidentally connects to prod" footgun. **Self-contained log fallbacks (secret-mars #5)** The empty-else branches on `completeJob`/`failJob` `{ ok: false }` results relied on the substrate-db package emitting its own `[substrate] complete-epoch-mismatch ...` log. If that package's log format ever changes, those branches went silent. Added `[substrate] complete-failed jobs.id=<id> epoch=<n>` and `[substrate] fail-failed ...` fallbacks so this code is self-contained. **Idempotency-key threading (arc0btc #2 design, secret-mars idempotency follow-up)** Closes the "side-effecting jobs can execute twice when a lease expires mid-flight" hazard structurally — the fence on `jobs.claim_epoch` only protects the status write, not the action. `jobRowToTaskInput` now threads `payload.idempotency_key = "substrate-<job_id>-e<claim_epoch>"` so downstream side-effecting handlers (email send, PR open, tx broadcast) can dedup against their own per-handler key store. Bounds the *consequence* from "action runs twice" to "action runs once, second attempt no-ops." Substrate tasks also enqueue with `priority: 1` so they execute on the same or next tick — the lease window now tracks real execution latency instead of waiting behind lower-priority work, which shrinks the "lease expires while task waits in local queue" window further. **Silent null claim (Copilot #5)** Contract said null-claim is silent; impl logged `[substrate] claim ... result=none` every tick. Dropped the log — quiet-tick visibility lives in successful-claim and idle-dispatch event lines, not substrate tick-rate noise. **Other nits (Copilot #2, arc0btc #5)** - Removed unused `getTaskById` import in `substrate.ts`. - Documented `substrate` block's shallow-merge behavior in `types.ts` (unlike `profiles`/`adapters` which deep-merge). Slots that extend a base and want to flip only `isLeaseRecoveryOwner: true` must repeat the whole substrate block. Deferred (already declared out of scope by reviewers): - `pg_advisory_lock` guard on `isLeaseRecoveryOwner` race (secret-mars [suggestion]) — multi-slot mis-config protection is a follow-up. - `_substrate_*` payload-injection vs strict validators (secret-mars [suggestion]) — naming is defensive prefixing already. Contract block in src/substrate.ts updated with all new log lines and a new "Side-effect duplicate-execution guard" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(substrate): expand coverage via mock.module — claim/complete/fail/lease-recovery paths Closes the test-gap finding from arc0btc #4 and secret-mars: prior suite covered only `resolveSubstrateConfig`, `jobRowToTaskInput`, and two write-back *early-return* guards via a fakeDb that threw if touched. Claim, complete, fail, epoch-mismatch no-op, transient write-back throw, and lease recovery — the "highest correctness surface" per the PR's own Commit-2 description — were untested. Rewires the file to use `mock.module("@genesis-works/substrate-db", ...)` with mock fns per substrate-db export. Local DB is a real `openDb(:memory:)` so `enqueueTask` exercises the actual sqlite write path. console.{info,error,warn} are spied per-test for log-line assertions. New tests (18, alongside 10 existing for 28 total): - `runSubstrateIntakeTick`: - happy claim → enqueues local task, returns `claimed=true` with correct epoch + jobId, emits `[substrate] claim ...` log - null claim → silent (no `[substrate] claim` log line; just `claimed: false`) - db-unreachable → catches `claimNextJob` throw, logs `reason=db-unreachable` - local-enqueue-fail → forces `enqueueTask` to throw by closing the local db, asserts catch fires + log line includes `reason=local-enqueue-fail jobs.id=<id>` (proves Copilot #4 / arc0btc #1 fix is wired) - `runSubstrateWriteBack`: - non-substrate source no-op (existing — kept) - missing `_substrate_job_id` no-op (existing — kept) - complete happy → calls `completeJob` with `claim_epoch` arg, `[substrate] complete jobs.id=... epoch=...` log - complete `{ ok: false }` → emits self-contained fallback `[substrate] complete-failed` log (proves secret-mars #5 fix) - complete throws → catches transient PG blip, emits `[substrate] write-back error=... jobs.id=...` (proves arc0btc #3 fix is wired — write-back no longer propagates out of `runOnce`) - fail happy → calls `failJob`, logs reason snippet - fail `{ ok: false }` → emits `[substrate] fail-failed` fallback - `runSubstrateLeaseRecovery`: - released > 0 → logs `[substrate] lease-recovery released=<n>` - released = 0 → no log (quiet success) - throws → catches, logs `[substrate] lease-recovery error=...`, does NOT propagate - `createSubstrateConnection`: - throws when `substrate.host` is unset (proves Copilot #3 fix — no implicit default to a hard-coded private IP) - throws on whitespace-only host - `jobRowToTaskInput`: - asserts `payload.idempotency_key = "substrate-<job_id>-e<epoch>"` (proves idempotency-key threading is wired) - asserts `priority = 1` - asserts `_substrate_*` payload fields threaded - asserts `max_attempts = 1` (substrate handles retry) - existing source / subject-fallback tests kept Header comment rewritten to match what's actually covered (Copilot #7). Unused `mock`/`beforeEach`/`afterEach` imports replaced with the ones actually used (Copilot #6 — though now they ARE used, so the prior nit naturally resolves). Test count: 143 total (28 substrate + 115 existing) — up from 125. `bunx tsc --noEmit` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jason Schrader <jason@joinfreehold.com> Co-authored-by: Claude <noreply@anthropic.com>

This was referenced May 4, 2026

chore(cf-cost): refresh last_active_at from inbox-status path arc0btc/agents-love-bitcoin#19

Merged

revert(cf-cost): remove PR 3 heartbeat code; ALB has no liveness column arc0btc/agents-love-bitcoin#20

Merged

whoabuddy closed this May 4, 2026

whoabuddy deleted the chore/cf-cost-stretch-heartbeat branch May 4, 2026 19:17

secret-mars mentioned this pull request May 30, 2026

feat(substrate): opt-in substrate dispatch intake (Phase 5) #5

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(cf-cost): stretch aibtc-heartbeat schedule to 3600s#3

chore(cf-cost): stretch aibtc-heartbeat schedule to 3600s#3
whoabuddy wants to merge 1 commit into
mainfrom
chore/cf-cost-stretch-heartbeat

whoabuddy commented May 4, 2026

Uh oh!

whoabuddy commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

whoabuddy commented May 4, 2026

Summary

Operator action per VM

Expected effect

Test plan

Uh oh!

whoabuddy commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant