revert: block app.listen() until tenant registry is ready (#4062) #4063
Merged
Summary
Reverts #4062. Awaiting tenant warmup before app.listen() made Fly's deploy-time health-check timeout (300s) fire before the new machines became healthy, so the deploy itself failed instead of just the smoke test. That is a worse outcome than the original problem (smoke flake but production healthy).
What was tried
warmup() and awaited it before app.listen(). Made Fly's health check time out before the listener bound. Deploy failed with "Unrecoverable error: timeout reached waiting for health checks to pass".
Tenant registry init is taking >300s on a fresh Fly machine for reasons that aren't visible from the workflow logs (need flyctl logs --app adcp-docs for the boot-time stdout). Until we know why, blocking listen is the wrong tool.
Where to go from here
Options:
flyctl logs --app adcp-docs during a deploy, find the actual long pole. Could be DNS, JWKS validation against an unreachable host, framework-server build, etc.
For now, restoring main to the #4060 state. At least the deploy completes, even if the smoke still flakes.
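If the boot-time stdout turns out not to say where the time goes, one possible way to find the long pole is to time each init step and log it. The sketch below is illustrative only: `timed`, `longPole`, `StepTiming`, and the `[boot]` log format are hypothetical names, not part of this codebase, and the step labels would have to be matched to whatever tenant registry init actually does.

```typescript
// Hypothetical helper, not from this repo: records how long one boot step takes.
type StepTiming = { label: string; ms: number; ok: boolean };

async function timed<T>(
  label: string,
  step: () => Promise<T>,
  timings: StepTiming[],
): Promise<T> {
  const start = Date.now();
  let ok = false;
  try {
    const result = await step();
    ok = true;
    return result;
  } finally {
    const ms = Date.now() - start;
    timings.push({ label, ms, ok });
    // One line per step on stdout so it lands in the machine's boot logs.
    // The "[boot]" prefix is an assumed convention, chosen only for grepping.
    console.log(`[boot] ${label} ${ok ? "ok" : "failed"} in ${ms}ms`);
  }
}

// Returns the slowest recorded step, or undefined if nothing was recorded.
function longPole(timings: StepTiming[]): StepTiming | undefined {
  return timings.reduce<StepTiming | undefined>(
    (worst, t) => (worst === undefined || t.ms > worst.ms ? t : worst),
    undefined,
  );
}
```

Each suspected step (DNS, JWKS validation, framework-server build) would be wrapped in its own `timed(...)` call sharing one `timings` array. Failed steps are recorded too, since a step that hangs and then errors is as likely to be the long pole as a slow success.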
Test plan
🤖 Generated with Claude Code