Skip to content

feat(alb-email-poll): hit /api/me/inbox-status before full inbox fetch#2

Merged
whoabuddy merged 1 commit into
mainfrom
feat/alb-inbox-status-wakeup
May 4, 2026
Merged

feat(alb-email-poll): hit /api/me/inbox-status before full inbox fetch#2
whoabuddy merged 1 commit into
mainfrom
feat/alb-inbox-status-wakeup

Conversation

@whoabuddy
Copy link
Copy Markdown
Contributor

Pairs with arc0btc/agents-love-bitcoin#18 (cf-cost cleanup campaign, PR 2 — `unread_count` + `/api/me/inbox-status`).

Summary

Idle inbox polls (the common case) currently scan the full ALB inbox table on every run. With 5 active agents on 600s polling cadence that's ~720 scans/day; at 10K agents it's 1.5M.

This adds a wake-up bit:

  • New `getInboxStatus()` calls `GET /api/me/inbox-status` with the same BIP-137/322 sig pattern as `listInbox()`. Returns `{ unread, total }` from a single 1-row stat lookup on the DO.
  • Before `listInbox`, fetch inbox-status. When `unread === 0`, emit a `completed` outcome with the wake-up summary and exit. The full inbox fetch only runs on real arrival.
  • Backwards-compatible: a 404 from the new endpoint (older ALB version that doesn't expose it yet) falls through to the legacy path. Safe to deploy this runtime change before or after ALB PR 2 lands.

Expected effect

At 5 active agents (current fleet): 720 inbox scans/day → ~5/agent/day = 25/day on real arrivals (~96% reduction). At 10K agents: 1.5M → 50K/day.

Test plan

  • `bunx tsc --noEmit` clean
  • Run on Spark or Forge once ALB PR 2 deploys; confirm a poll with no inbound mail emits `Polled ALB inbox-status: 0 unread` and exits without the heavy listInbox call
  • Send a test email; confirm the next poll falls through to listInbox and triages the message normally

🤖 Generated with Claude Code

Pairs with arc0btc/agents-love-bitcoin PR 2 (cf-cost cleanup campaign,
PR 2 — unread_count + /api/me/inbox-status). Idle polls (the common case)
now read a single 1-row stat from AgentDO instead of scanning the inbox
table on every call.

- New `getInboxStatus()` calls `GET /api/me/inbox-status` with the same
  BIP-137/322 sig pattern as `listInbox()`.
- Before listInbox, fetch inbox-status. When `unread === 0`, emit a
  "completed" outcome with the wake-up summary and exit. The full inbox
  fetch runs only on actual mail arrival.
- Backwards compat: if the new endpoint returns 404 (older ALB version),
  fall through to the legacy listInbox path. Safe to deploy this runtime
  change before or after ALB PR 2 merges.

At 5 active agents on 600s polling cadence, this drops AgentDO inbox
scans from 720/day to ~5/agent/day = 25/day on real arrivals — a ~97%
reduction. At 10K agents, 1.5M scans/day → 50K.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@whoabuddy whoabuddy merged commit a4487af into main May 4, 2026
@whoabuddy whoabuddy deleted the feat/alb-inbox-status-wakeup branch May 4, 2026 19:28
whoabuddy added a commit that referenced this pull request May 30, 2026
…it flag, host requirement, write-back guard, idempotency key

Addresses 7 inline review findings (Copilot + secret-mars + arc0btc) on
PR #5 — all substantive items resolved in-tree, no follow-ups left.

**Catch widening (Copilot #4, secret-mars [blocking-risk], arc0btc #1)**

`runSubstrateIntakeTick` previously only caught `claimNextJob`. Both
`jobRowToTaskInput` and `enqueueTask` ran outside the try, so a malformed
JobRow or a local SQLite lock contention escaped as a thrown error from
`runOnce`. Each call site now has its own catch with a distinct skip
reason — `job-parse-fail` and `local-enqueue-fail` — so the contract's
"NEVER throws" guarantee is real. The substrate job stays under lease in
both failure modes; `releaseExpiredLeases` reconciles on the next cycle.

**`substrateDbInitialized` retry on credential fail (Copilot #1, all 3
reviewers seconded)**

The flag was set to `true` BEFORE `createSubstrateConnection` succeeded.
A transient credential read miss at first tick (e.g. encrypted blob not
yet available) permanently disabled substrate intake for the process
lifetime — the contract's documented `[substrate] skip
reason=credential-fail` log line would never re-fire. Flag now sets only
inside the success branch; next tick retries.

**`runSubstrateWriteBack` swallows transient PG blips (arc0btc #3)**

Was the only substrate fn whose Postgres calls were unguarded — a
mid-write-back connection reset would propagate out of `runOnce` even
though the local task already finished cleanly. Wrapped both
`completeJob` and `failJob` in a single try; new
`[substrate] write-back error=<msg> jobs.id=<id>` log line on transient
fail. Lease recovery still reconciles eventually.

**Default host removed (Copilot #3, arc0btc seconded)**

`host` no longer defaults to a hard-coded private IP. When
`substrate.enabled: true`, an explicit `substrate.host` is required —
`createSubstrateConnection` throws a clear error if unset. Closes the
"dev/test slot misconfigured at enabled:true accidentally connects to
prod" footgun.

**Self-contained log fallbacks (secret-mars #5)**

The empty-else branches on `completeJob`/`failJob` `{ ok: false }`
results relied on the substrate-db package emitting its own
`[substrate] complete-epoch-mismatch ...` log. If that package's log
format ever changes, those branches went silent. Added
`[substrate] complete-failed jobs.id=<id> epoch=<n>` and
`[substrate] fail-failed ...` fallbacks so this code is self-contained.

**Idempotency-key threading (arc0btc #2 design, secret-mars idempotency
follow-up)**

Closes the "side-effecting jobs can execute twice when a lease expires
mid-flight" hazard structurally — the fence on `jobs.claim_epoch` only
protects the status write, not the action. `jobRowToTaskInput` now
threads `payload.idempotency_key = "substrate-<job_id>-e<claim_epoch>"`
so downstream side-effecting handlers (email send, PR open, tx
broadcast) can dedup against their own per-handler key store. Bounds the
*consequence* from "action runs twice" to "action runs once, second
attempt no-ops."

Substrate tasks also enqueue with `priority: 1` so they execute on the
same or next tick — the lease window now tracks real execution latency
instead of waiting behind lower-priority work, which shrinks the "lease
expires while task waits in local queue" window further.

**Silent null claim (Copilot #5)**

Contract said null-claim is silent; impl logged `[substrate] claim ...
result=none` every tick. Dropped the log — quiet-tick visibility lives
in successful-claim and idle-dispatch event lines, not substrate
tick-rate noise.

**Other nits (Copilot #2, arc0btc #5)**

- Removed unused `getTaskById` import in `substrate.ts`.
- Documented `substrate` block's shallow-merge behavior in `types.ts`
  (unlike `profiles`/`adapters` which deep-merge). Slots that extend a
  base and want to flip only `isLeaseRecoveryOwner: true` must repeat
  the whole substrate block.

Deferred (already declared out of scope by reviewers):
- `pg_advisory_lock` guard on `isLeaseRecoveryOwner` race (secret-mars
  [suggestion]) — multi-slot mis-config protection is a follow-up.
- `_substrate_*` payload-injection vs strict validators (secret-mars
  [suggestion]) — naming is defensive prefixing already.

Contract block in src/substrate.ts updated with all new log lines and a
new "Side-effect duplicate-execution guard" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
whoabuddy added a commit that referenced this pull request May 30, 2026
* feat(substrate): opt-in substrate intake — config, credential, claim, enqueue

Adds @genesis-works/substrate-db as a dependency and wires a substrate
dispatch intake path into the runOnce tick. Opt-in per slot: substrate
is disabled by default (config.substrate.enabled=false). Flipping this
flag live changes zero behavior on any slot that has not set it.

Config keys (host config substrate block):
  substrate.enabled   — bool, default false
  substrate.credential — string, credential id (NEVER plaintext password)
  substrate.kinds      — string[], job kinds this slot handles
  substrate.slotId     — string, this slot's claim identifier
  substrate.leaseSecs  — number, default 300 (5 minutes)
  substrate.host/port/database/user — Postgres connection params

On each tick (when enabled):
1. Credential resolved via existing encrypted-blob pattern.
2. claimNextJob(slotId, kinds) called against shared Postgres.
   Log: "[substrate] claim slot=<id> kinds=<list> result=<jobs.id|none>"
3. Claimed job converted to TaskInput (source="substrate:<jobId>").
4. TaskInput enqueued as a local runtime task for next-tick execution.

Failure conditions (all logged, none crash):
  - substrate unreachable → "[substrate] skip reason=db-unreachable error=..."
  - credential fail       → "[substrate] skip reason=credential-fail error=..."
  - claimNextJob null     → "[substrate] claim ... result=none" (silent no-op)

Pinned substrate-db SHA: d458200b008efe8be3d37912a71b990d73ff2b17
(arc0btc/substrate-db — private, deployed alongside agent-runtime on slots)

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(substrate): lease semantics — epoch-fenced write-back, max_attempts, lease recovery

Adds the epoch-fenced write-back hook and lease recovery cadence to the
substrate intake adapter.

Write-back (after finalizeTaskAttempt):
- If task.source starts with "substrate:", calls runSubstrateWriteBack.
- Reads _substrate_job_id and _substrate_claim_epoch from task payload.
- Calls completeJob(epoch) on success, failJob(epoch) on failure.
- Epoch mismatch (stale executor woke after lease recovery) → logged
  no-op: "[substrate] complete-epoch-mismatch ... stale executor, ignored"
- Pre-fencing row (epoch=0, expected>0) → distinct log line: "pre-fencing row, ignored"
- completeJob/failJob return WriteBackResult — conflict is logged, not thrown.

Lease recovery:
- Runs releaseExpiredLeases on the substrate DB when isLeaseRecoveryOwner=true.
- Cadence: leaseRecoveryCadenceSecs (default 60s) using a module-level timer.
- ONE nominated owner runs this — not all slots — to avoid stampede.
- Log: "[substrate] lease-recovery released=<n>"
- Lease expiry measured against Postgres now() — NTP drift irrelevant.

Log lines contract (stable, parseable):
  [substrate] claim slot=<id> kinds=<list> result=<jobs.id|none>
  [substrate] complete jobs.id=<id> epoch=<n>
  [substrate] fail jobs.id=<id> epoch=<n> reason=<...>
  [substrate] lease-recovery released=<n>
  [substrate] skip reason=credential-fail error=<...>
  [substrate] skip reason=db-unreachable error=<...>
  [substrate] complete-epoch-mismatch jobs.id=<id> expected=<n> actual=<m> — ...

Tests: 125/125 pass (10 new substrate tests + 115 existing).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(substrate): review-feedback correctness pass — catch widening, init flag, host requirement, write-back guard, idempotency key

Addresses 7 inline review findings (Copilot + secret-mars + arc0btc) on
PR #5 — all substantive items resolved in-tree, no follow-ups left.

**Catch widening (Copilot #4, secret-mars [blocking-risk], arc0btc #1)**

`runSubstrateIntakeTick` previously only caught `claimNextJob`. Both
`jobRowToTaskInput` and `enqueueTask` ran outside the try, so a malformed
JobRow or a local SQLite lock contention escaped as a thrown error from
`runOnce`. Each call site now has its own catch with a distinct skip
reason — `job-parse-fail` and `local-enqueue-fail` — so the contract's
"NEVER throws" guarantee is real. The substrate job stays under lease in
both failure modes; `releaseExpiredLeases` reconciles on the next cycle.

**`substrateDbInitialized` retry on credential fail (Copilot #1, all 3
reviewers seconded)**

The flag was set to `true` BEFORE `createSubstrateConnection` succeeded.
A transient credential read miss at first tick (e.g. encrypted blob not
yet available) permanently disabled substrate intake for the process
lifetime — the contract's documented `[substrate] skip
reason=credential-fail` log line would never re-fire. Flag now sets only
inside the success branch; next tick retries.

**`runSubstrateWriteBack` swallows transient PG blips (arc0btc #3)**

Was the only substrate fn whose Postgres calls were unguarded — a
mid-write-back connection reset would propagate out of `runOnce` even
though the local task already finished cleanly. Wrapped both
`completeJob` and `failJob` in a single try; new
`[substrate] write-back error=<msg> jobs.id=<id>` log line on transient
fail. Lease recovery still reconciles eventually.

**Default host removed (Copilot #3, arc0btc seconded)**

`host` no longer defaults to a hard-coded private IP. When
`substrate.enabled: true`, an explicit `substrate.host` is required —
`createSubstrateConnection` throws a clear error if unset. Closes the
"dev/test slot misconfigured at enabled:true accidentally connects to
prod" footgun.

**Self-contained log fallbacks (secret-mars #5)**

The empty-else branches on `completeJob`/`failJob` `{ ok: false }`
results relied on the substrate-db package emitting its own
`[substrate] complete-epoch-mismatch ...` log. If that package's log
format ever changes, those branches went silent. Added
`[substrate] complete-failed jobs.id=<id> epoch=<n>` and
`[substrate] fail-failed ...` fallbacks so this code is self-contained.

**Idempotency-key threading (arc0btc #2 design, secret-mars idempotency
follow-up)**

Closes the "side-effecting jobs can execute twice when a lease expires
mid-flight" hazard structurally — the fence on `jobs.claim_epoch` only
protects the status write, not the action. `jobRowToTaskInput` now
threads `payload.idempotency_key = "substrate-<job_id>-e<claim_epoch>"`
so downstream side-effecting handlers (email send, PR open, tx
broadcast) can dedup against their own per-handler key store. Bounds the
*consequence* from "action runs twice" to "action runs once, second
attempt no-ops."

Substrate tasks also enqueue with `priority: 1` so they execute on the
same or next tick — the lease window now tracks real execution latency
instead of waiting behind lower-priority work, which shrinks the "lease
expires while task waits in local queue" window further.

**Silent null claim (Copilot #5)**

Contract said null-claim is silent; impl logged `[substrate] claim ...
result=none` every tick. Dropped the log — quiet-tick visibility lives
in successful-claim and idle-dispatch event lines, not substrate
tick-rate noise.

**Other nits (Copilot #2, arc0btc #5)**

- Removed unused `getTaskById` import in `substrate.ts`.
- Documented `substrate` block's shallow-merge behavior in `types.ts`
  (unlike `profiles`/`adapters` which deep-merge). Slots that extend a
  base and want to flip only `isLeaseRecoveryOwner: true` must repeat
  the whole substrate block.

Deferred (already declared out of scope by reviewers):
- `pg_advisory_lock` guard on `isLeaseRecoveryOwner` race (secret-mars
  [suggestion]) — multi-slot mis-config protection is a follow-up.
- `_substrate_*` payload-injection vs strict validators (secret-mars
  [suggestion]) — naming is defensive prefixing already.

Contract block in src/substrate.ts updated with all new log lines and a
new "Side-effect duplicate-execution guard" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(substrate): expand coverage via mock.module — claim/complete/fail/lease-recovery paths

Closes the test-gap finding from arc0btc #4 and secret-mars: prior
suite covered only `resolveSubstrateConfig`, `jobRowToTaskInput`, and
two write-back *early-return* guards via a fakeDb that threw if
touched. Claim, complete, fail, epoch-mismatch no-op, transient
write-back throw, and lease recovery — the "highest correctness
surface" per the PR's own Commit-2 description — were untested.

Rewires the file to use `mock.module("@genesis-works/substrate-db", ...)`
with mock fns per substrate-db export. Local DB is a real
`openDb(:memory:)` so `enqueueTask` exercises the actual sqlite write
path. console.{info,error,warn} are spied per-test for log-line
assertions.

New tests (18, alongside 10 existing for 28 total):

- `runSubstrateIntakeTick`:
  - happy claim → enqueues local task, returns `claimed=true` with
    correct epoch + jobId, emits `[substrate] claim ...` log
  - null claim → silent (no `[substrate] claim` log line; just
    `claimed: false`)
  - db-unreachable → catches `claimNextJob` throw, logs
    `reason=db-unreachable`
  - local-enqueue-fail → forces `enqueueTask` to throw by closing the
    local db, asserts catch fires + log line includes
    `reason=local-enqueue-fail jobs.id=<id>` (proves Copilot #4 /
    arc0btc #1 fix is wired)

- `runSubstrateWriteBack`:
  - non-substrate source no-op (existing — kept)
  - missing `_substrate_job_id` no-op (existing — kept)
  - complete happy → calls `completeJob` with `claim_epoch` arg,
    `[substrate] complete jobs.id=... epoch=...` log
  - complete `{ ok: false }` → emits self-contained fallback
    `[substrate] complete-failed` log (proves secret-mars #5 fix)
  - complete throws → catches transient PG blip, emits
    `[substrate] write-back error=... jobs.id=...` (proves arc0btc #3
    fix is wired — write-back no longer propagates out of `runOnce`)
  - fail happy → calls `failJob`, logs reason snippet
  - fail `{ ok: false }` → emits `[substrate] fail-failed` fallback

- `runSubstrateLeaseRecovery`:
  - released > 0 → logs `[substrate] lease-recovery released=<n>`
  - released = 0 → no log (quiet success)
  - throws → catches, logs `[substrate] lease-recovery error=...`,
    does NOT propagate

- `createSubstrateConnection`:
  - throws when `substrate.host` is unset (proves Copilot #3 fix —
    no implicit default to a hard-coded private IP)
  - throws on whitespace-only host

- `jobRowToTaskInput`:
  - asserts `payload.idempotency_key = "substrate-<job_id>-e<epoch>"`
    (proves idempotency-key threading is wired)
  - asserts `priority = 1`
  - asserts `_substrate_*` payload fields threaded
  - asserts `max_attempts = 1` (substrate handles retry)
  - existing source / subject-fallback tests kept

Header comment rewritten to match what's actually covered (Copilot #7).
Unused `mock`/`beforeEach`/`afterEach` imports replaced with the ones
actually used (Copilot #6 — though now they ARE used, so the prior nit
naturally resolves).

Test count: 143 total (28 substrate + 115 existing) — up from 125.
`bunx tsc --noEmit` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jason Schrader <jason@joinfreehold.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant