fix(identity): stop amplifying Hiro egress throttling into 25s lookup-failed (#939) #951

Merged: biwasxyz merged 1 commit into main from fix/identity-lookup-throttle-hardening-939 on Jun 1, 2026

@biwasxyz biwasxyz commented Jun 1, 2026

Problem (#939)

When Hiro rate-limits Cloudflare's shared egress IPs (429), the synchronous identity/BNS lookups on POST /api/identity/{addr}/refresh and the enrichment branch of GET /api/agents/{addr} turned a transient upstream blip into a ~25s hang returning lookup-failed — even though direct Hiro calls answer in <500ms. Downstream (aibtc.news identity-gate, 3s budget) then 503s and loops.

Verified live while writing this: when Hiro is healthy, the path is fast and correct (/refresh 2.4s, idOutcome/bnsOutcome positive; direct Hiro holdings 0.66s). So this is defensive hardening for throttle windows, not a correctness bug — Hiro throttling our egress is the trigger; our code amplifying it into a 25s outage is what this fixes.

Two compounding causes

  1. Amplification: detectAgentIdentity treated any non-ok holdings response (incl. 429/5xx) as "holdings unavailable" and fell back to the O(N) legacy scan, firing 5+ more call-read requests at the same throttled upstream. One rate-limited call became a multi-second storm still ending in failure.
  2. Long per-attempt timeout: each call used an 8s timeout × retries, so a single hung call alone burned ~16s, past the consumer budget.

Fix

  • Legacy scan now triggers only on a genuine 404. 429/5xx fail fast as lookup-failed with the existing short-TTL negative cache, so the next request retries cleanly instead of storming.
  • Add configurable perAttemptTimeoutMs to stacksApiFetch (default 8s preserved) and thread SYNC_PER_ATTEMPT_TIMEOUT_MS (3.5s) + reduced retries through the synchronous holdings / get-token-uri / BNS get-primary calls. Worst case on a throttled window drops from ~25s to a sub-second fast-fail.
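The status-code policy in the first bullet can be sketched as a small pure function. This is an illustration only, not the code in detectAgentIdentity: the function name, the decision labels other than lookup-failed, and the handling of non-404 4xx codes (treated here as fast failures) are assumptions.

```typescript
// Hypothetical sketch: the names below are illustrative, not the repository's API.
// Only the 404 vs. 429/5xx policy comes from the PR description.
type HoldingsDecision = "use-holdings" | "legacy-scan" | "lookup-failed";

function classifyHoldingsStatus(status: number): HoldingsDecision {
  // A 2xx response means the holdings endpoint answered; use its payload.
  if (status >= 200 && status < 300) return "use-holdings";
  // A genuine 404 means the endpoint cannot serve the lookup,
  // so falling back to the legacy scan is still reasonable.
  if (status === 404) return "legacy-scan";
  // 429 and 5xx (and, by assumption, any other status) fail fast
  // instead of sending more requests to a throttled upstream.
  return "lookup-failed";
}
```

Keeping the decision in one pure function makes the policy easy to unit test without any network stubs.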

Tests

New lib/identity/__tests__/detection.test.ts: 429 and 5xx fail fast without the legacy scan; 404 still falls back to it; a holdings hit resolves positive. Full affected suite green.

Not in scope

The heavier architectural option (B in the issue — move enrichment fully async/background) is deferred; this is the surgical fast-fail that resolves the measured 25s symptom.

…-failed (#939)

When Hiro rate-limits Cloudflare's shared egress IPs (429), the synchronous
identity/BNS lookups on POST /api/identity/{addr}/refresh and the enrichment
branch of GET /api/agents/{addr} were turning a transient upstream blip into a
~25s hang that returned lookup-failed — even though direct Hiro calls answer in
<500ms. Two compounding causes:

1. Amplification: detectAgentIdentity treated ANY non-ok holdings response
   (including 429/5xx) as "holdings unavailable" and fell back to the O(N)
   legacy scan — firing 5+ more call-read requests at the same throttled
   upstream, each with its own retry budget. One rate-limited call became a
   multi-second storm that still ended in failure.
2. Long per-attempt timeout: each call used an 8s timeout × retries, so a
   single hung call alone burned ~16s — far past a consumer's ~3s budget
   (aibtc.news identity-gate 503s and loops when we exceed it).

Hardening (we can't stop Hiro throttling our egress, but we stop amplifying it):
- Legacy scan now only triggers on a genuine 404 (endpoint can't serve the
  lookup). 429/5xx fail fast as lookup-failed with the existing short-TTL
  negative cache, so the next request retries cleanly instead of storming.
- Add a configurable perAttemptTimeoutMs to stacksApiFetch (default 8s) and
  thread SYNC_PER_ATTEMPT_TIMEOUT_MS (3.5s) + reduced retries (1) through the
  synchronous holdings, get-token-uri, and BNS get-primary calls. Worst case on
  a throttled window drops from ~25s to a sub-second fast-fail.
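
A per-attempt timeout of this kind is commonly built on AbortController. The sketch below is a minimal illustration under stated assumptions, not the actual stacksApiFetch: the function name, the injected fetch parameter, the default retry count, and the choice to return any HTTP response (including 429) without retrying are all hypothetical. Only the perAttemptTimeoutMs option name and its 8s default come from the description above.

```typescript
// Hypothetical sketch of a per-attempt timeout wrapper; not the real stacksApiFetch.
type FetchLike<T> = (url: string, init: { signal: AbortSignal }) => Promise<T>;

interface AttemptOptions {
  perAttemptTimeoutMs?: number; // 8000 by default, matching the preserved default
  retries?: number;             // extra attempts after the first (assumed meaning)
}

async function fetchWithPerAttemptTimeout<T>(
  fetchImpl: FetchLike<T>,
  url: string,
  opts: AttemptOptions = {},
): Promise<T> {
  const timeoutMs = opts.perAttemptTimeoutMs ?? 8000;
  const retries = opts.retries ?? 1; // assumed default, for illustration only
  let lastError: unknown;

  for (let attempt = 0; attempt <= retries; attempt++) {
    // Each attempt gets its own controller so one timeout does not poison the next.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      // Any response, including a 429, is returned to the caller immediately.
      return await fetchImpl(url, { signal: controller.signal });
    } catch (err) {
      lastError = err; // timeout or network error: try again if attempts remain
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

In this sketch a hung upstream costs at most `(retries + 1) × perAttemptTimeoutMs` of wall-clock time, which is why lowering both knobs on the synchronous path shortens the worst case.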

This is defensive hardening, not a correctness fix: when Hiro is healthy the
path is fast and correct (verified live: /refresh 2.4s, idOutcome/bnsOutcome
positive). It only changes behavior during Hiro throttle windows.

Tests: new detection suite asserts 429/5xx fail fast WITHOUT the legacy scan,
404 still falls back to it, and a holdings hit resolves positive.
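
The assertions described here can be sketched with hand-rolled stubs that count legacy-scan calls. This is not the content of detection.test.ts: the detectWithStubs function, the response shape, and the "negative" outcome label are hypothetical stand-ins chosen to keep the example self-contained.

```typescript
// Hypothetical stand-in for the detection flow; the real suite exercises detectAgentIdentity.
type Outcome = "positive" | "negative" | "lookup-failed";

interface HoldingsResponse {
  status: number;
  tokenIds: number[]; // assumed payload shape, for illustration only
}

async function detectWithStubs(
  fetchHoldings: () => Promise<HoldingsResponse>,
  legacyScan: () => Promise<boolean>,
): Promise<Outcome> {
  const res = await fetchHoldings();
  // Only a genuine 404 is allowed to trigger the legacy scan.
  if (res.status === 404) return (await legacyScan()) ? "positive" : "negative";
  // 429/5xx (any other non-2xx) fail fast without touching the upstream again.
  if (res.status < 200 || res.status >= 300) return "lookup-failed";
  return res.tokenIds.length > 0 ? "positive" : "negative";
}

// Runs one scenario and reports how many times the legacy scan was invoked.
async function runScenario(status: number, tokenIds: number[] = []) {
  let scanCalls = 0;
  const outcome = await detectWithStubs(
    async () => ({ status, tokenIds }),
    async () => {
      scanCalls++;
      return false;
    },
  );
  return { outcome, scanCalls };
}
```

Counting stub invocations, rather than only checking the outcome, is what guards against the amplification regression: a 429 must produce lookup-failed with zero scan calls.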
@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | landing-page | ad9704a | Jun 01 2026, 01:18 PM |

@biwasxyz biwasxyz merged commit dc06dac into main Jun 1, 2026
8 checks passed
@biwasxyz biwasxyz deleted the fix/identity-lookup-throttle-hardening-939 branch June 1, 2026 13:34