fix(identity): stop amplifying Hiro egress throttling into 25s lookup-failed (#939) #951
Merged
…-failed (#939)

When Hiro rate-limits Cloudflare's shared egress IPs (429), the synchronous identity/BNS lookups on POST /api/identity/{addr}/refresh and the enrichment branch of GET /api/agents/{addr} were turning a transient upstream blip into a ~25s hang that returned lookup-failed, even though direct Hiro calls answer in <500ms.

Two compounding causes:

1. Amplification: detectAgentIdentity treated ANY non-ok holdings response (including 429/5xx) as "holdings unavailable" and fell back to the O(N) legacy scan, firing 5+ more call-read requests at the same throttled upstream, each with its own retry budget. One rate-limited call became a multi-second storm that still ended in failure.
2. Long per-attempt timeout: each call used an 8s timeout × retries, so a single hung call alone burned ~16s, far past a consumer's ~3s budget (aibtc.news identity-gate 503s and loops when we exceed it).

Hardening (we can't stop Hiro throttling our egress, but we stop amplifying it):

- Legacy scan now only triggers on a genuine 404 (endpoint can't serve the lookup). 429/5xx fail fast as lookup-failed with the existing short-TTL negative cache, so the next request retries cleanly instead of storming.
- Add a configurable perAttemptTimeoutMs to stacksApiFetch (default 8s) and thread SYNC_PER_ATTEMPT_TIMEOUT_MS (3.5s) + reduced retries (1) through the synchronous holdings, get-token-uri, and BNS get-primary calls. Worst case on a throttled window drops from ~25s to a sub-second fast-fail.

This is defensive hardening, not a correctness fix: when Hiro is healthy the path is fast and correct (verified live: /refresh 2.4s, idOutcome/bnsOutcome positive). It only changes behavior during Hiro throttle windows.

Tests: new detection suite asserts 429/5xx fail fast WITHOUT the legacy scan, 404 still falls back to it, and a holdings hit resolves positive.
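The fallback rule in the commit message above can be sketched as a small status classifier. This is a minimal illustration, not the repository's code: the type and function names (`HoldingsDecision`, `decideAfterHoldings`) are hypothetical, and treating every other non-ok status the same way as 429/5xx is an assumption of the sketch, since the message only names 404, 429, and 5xx.

```typescript
// Hypothetical names; only the 404 vs. 429/5xx rule comes from the commit message.
type HoldingsDecision = "use-holdings" | "legacy-scan" | "lookup-failed";

function decideAfterHoldings(status: number): HoldingsDecision {
  // A 2xx holdings response is used directly.
  if (status >= 200 && status < 300) return "use-holdings";

  // A genuine 404 means the endpoint cannot serve this lookup,
  // so the slower legacy scan is still worth trying.
  if (status === 404) return "legacy-scan";

  // 429 and 5xx mean the upstream is throttled or unhealthy. Firing more
  // requests at it would only amplify the problem, so fail fast instead.
  // Assumption: any other non-ok status is also treated as a fast failure.
  return "lookup-failed";
}

// Example: a throttled response no longer triggers the O(N) scan.
console.log(decideAfterHoldings(429));
console.log(decideAfterHoldings(404));
```

Keeping the decision in one pure function makes the behavior easy to assert in a unit test without mocking the network.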
Deploying with
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs | landing-page | ad9704a | Jun 01 2026, 01:18 PM |
Problem (#939)
When Hiro rate-limits Cloudflare's shared egress IPs (429), the synchronous identity/BNS lookups on POST /api/identity/{addr}/refresh and the enrichment branch of GET /api/agents/{addr} turned a transient upstream blip into a ~25s hang returning lookup-failed, even though direct Hiro calls answer in <500ms. Downstream (aibtc.news identity-gate, 3s budget) then 503s and loops.

Verified live while writing this: when Hiro is healthy, the path is fast and correct (/refresh 2.4s, idOutcome/bnsOutcome positive; direct Hiro holdings 0.66s). So this is defensive hardening for throttle windows, not a correctness bug: Hiro throttling our egress is the trigger; our code amplifying it into a 25s outage is what this fixes.

Two compounding causes
1. Amplification: detectAgentIdentity treated any non-ok holdings response (incl. 429/5xx) as "holdings unavailable" and fell back to the O(N) legacy scan, firing 5+ more call-read requests at the same throttled upstream. One rate-limited call became a multi-second storm still ending in failure.
2. Long per-attempt timeout: each call used an 8s timeout × retries, so a single hung call alone burned ~16s, far past a consumer's ~3s budget.

Fix

- The legacy scan now triggers only on a genuine 404 (the endpoint can't serve the lookup). 429/5xx fail fast as lookup-failed with the existing short-TTL negative cache, so the next request retries cleanly instead of storming.
- Add a configurable perAttemptTimeoutMs to stacksApiFetch (default 8s preserved) and thread SYNC_PER_ATTEMPT_TIMEOUT_MS (3.5s) + reduced retries through the synchronous holdings / get-token-uri / BNS get-primary calls. Worst case on a throttled window drops from ~25s to a sub-second fast-fail.

Tests
New lib/identity/__tests__/detection.test.ts: 429 and 5xx fail fast without the legacy scan; 404 still falls back to it; a holdings hit resolves positive. Full affected suite green.

Not in scope

The heavier architectural option (B in the issue: move enrichment fully async/background) is deferred; this is the surgical fast-fail that resolves the measured 25s symptom.
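For reference, the per-attempt timeout half of the surgical fast-fail could look roughly like the sketch below. It is not the actual stacksApiFetch: `fetchWithAttemptTimeout`, `Fetcher`, `AttemptOptions`, and `AttemptResult` are hypothetical names, the fetcher is injected so the example runs without a network, and `retries` is assumed to mean extra attempts after the first. Only the idea of a configurable perAttemptTimeoutMs with a reduced retry count comes from the description.

```typescript
// Sketch only: names and result shape are illustrative assumptions.
type Fetcher = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

interface AttemptOptions {
  perAttemptTimeoutMs: number; // e.g. 3500 on the synchronous path
  retries: number;             // assumed: extra attempts after the first
}

type AttemptResult =
  | { ok: true; status: number }
  | { ok: false; reason: "timeout" | "http-error" | "network-error"; status?: number };

async function fetchWithAttemptTimeout(
  fetcher: Fetcher,
  url: string,
  opts: AttemptOptions,
): Promise<AttemptResult> {
  let last: AttemptResult = { ok: false, reason: "network-error" };
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    // Each attempt gets its own abort signal and its own deadline.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.perAttemptTimeoutMs);
    try {
      const res = await fetcher(url, { signal: controller.signal });
      if (res.ok) return { ok: true, status: res.status };
      last = { ok: false, reason: "http-error", status: res.status };
    } catch {
      // If our own timer fired, report a timeout; otherwise a network error.
      last = { ok: false, reason: controller.signal.aborted ? "timeout" : "network-error" };
    } finally {
      clearTimeout(timer);
    }
  }
  return last;
}
```

With this shape, the time a single hung call can consume is bounded by perAttemptTimeoutMs multiplied by the number of attempts, which is why lowering both values shortens the worst case on the synchronous path.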