Algod: New health endpoint (k8s /ready endpoint) #4844
Conversation
Codecov Report
    @@            Coverage Diff             @@
    ##           master    #4844      +/-   ##
    ==========================================
    - Coverage   53.71%   53.69%   -0.02%
    ==========================================
      Files         444      445       +1
      Lines       55669    55682      +13
    ==========================================
    - Hits       29904    29901       -3
    - Misses     23439    23447       +8
    - Partials    2326     2334       +8
... and 10 files with indirect coverage changes
daemon/algod/api/algod.oas2.json (Outdated)

    "schemes": [
      "http"
    ],
    "summary": "Returns OK if healthy and fully caught up.",
I'm not sure if this is 100% true... I think we would still return a 200 if we never used catchpoint catchup (using slow round-by-round catchup via the catchup service). In that case we would be "ready", but still on, for example, round 10 out of 24 million.
Oh boy, I am not so familiar with "slow round" mode.

> using slow round by round catchup via the catchup service

I assume slow round-by-round catchup is the opposite of fast catchup. In that case, we should be fine even if we are way behind the latest round, right?
So what would happen if we use this node to send some transactions, or do some other operations...?
By "slow round" mode I just mean starting the node from genesis without using catchpoint catchup.
In this case the node operations should be fine in all the ways that justify a 200, but I wouldn't say the node is "fully caught up".
I see; I was reading from the issue description of #4223 that:

> the node really isn't capable of accepting transactions, looking up blocks, or accounts or doing much of anything until its fully caught up.

So that is the motivation for returning 200 only if the node is fully caught up: it serves like a health endpoint, but a little more than that, as a readiness endpoint.
Just to be clear: if doing a regular from-genesis catchup, this new endpoint should NOT return a 200 until the node is fully caught up. Whether fast catchup was used or not shouldn't matter; the readiness handler criteria should be the same.
Right, for a regular round-by-round catchup, it will always return 400 until the node reaches the latest round.

Fast catchup is similar; consider the following two phases before a node reaches the latest round:

- Still in fast catchup against some catchpoint: by `node.status` there exists a catchpoint, so the handler returns a 400 error.
- Catching up the residual rounds between the catchpoint and the latest round: still a 400 error, since sync time != 0. In this phase the handler behaves the same as for the normal "from-genesis" catchup stated above.

Such that the handler's behavior holds.
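The decision rule discussed above can be sketched in Go. This is an illustrative simplification, not algod's actual code: `NodeStatus`, its field names, and `readyStatusCode` are hypothetical stand-ins for the real status struct the handler inspects.

```go
package main

import "fmt"

// NodeStatus holds the two fields the readiness check inspects.
// This is a simplified stand-in; the real algod field names differ.
type NodeStatus struct {
	CatchupTime int64  // time spent syncing; 0 once the node is caught up
	Catchpoint  string // non-empty while a catchpoint (fast) catchup is in progress
}

// readyStatusCode sketches the readiness decision from the discussion:
// return 200 only when sync time is 0 and no catchpoint catchup is in
// progress; otherwise return 400.
func readyStatusCode(s NodeStatus) int {
	if s.CatchupTime == 0 && s.Catchpoint == "" {
		return 200 // fully caught up: ready
	}
	return 400 // still catching up (round-by-round or fast catchup)
}

func main() {
	fmt.Println(readyStatusCode(NodeStatus{CatchupTime: 0, Catchpoint: ""}))          // caught up
	fmt.Println(readyStatusCode(NodeStatus{CatchupTime: 0, Catchpoint: "1000#abcd"})) // fast catchup in progress
	fmt.Println(readyStatusCode(NodeStatus{CatchupTime: 42, Catchpoint: ""}))         // round-by-round catchup
}
```

Note that both catchup modes fall through to the same 400 branch, which is why the handler criteria are identical regardless of how the node is syncing.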
Perfect, thanks. I'm testing locally now.
This'll be nice as I'll be able to scale the nodes trivially and not worry about them receiving requests before they're ready.
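For that kind of scaling, wiring the endpoint into a Kubernetes pod spec would look roughly like this. This is an illustrative probe config, not part of the PR; the port assumes algod's default REST listener, and the timing values are arbitrary and should be tuned per deployment.

```yaml
# Illustrative Kubernetes readiness probe pointing at the new endpoint.
# Port and timings are assumptions; adjust to your deployment.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

With this in place, the service only routes traffic to pods whose node has fully caught up.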
Other than the logging level comment, glad to be aware of this. I'll stamp.
Great!
Summary
Implements #4223.
Following the discussion and the Kubernetes probe model, we separate the probes and features: the existing `/health` endpoint is left as-is, while this PR adds `/ready` to serve the purpose of a readiness probe, namely confirming the node is caught up with the chain state.
Test Plan
Unit tests are added under `common`, just like the `v2` server API: start a mock node and confirm the API logic is correct. Before catchup, `/ready` should return an error; after catchup, `/ready` should return 200 OK.

Questions (on some uncertainties) and my answers (thinking out loud)

This `/ready` endpoint, as of 2023/02/21, only serves as a fast-catchup finality check. Do we want to additionally enable the node to indicate catching up with the latest round (the residual rounds after catchpoint catchup)? I believe this is not hard to achieve: setting a heuristic catchup-time threshold in the endpoint logic should work. But this would also incur a false-positive case: imagine a node that has just started against a network and is catching up round by round; `/ready` would return true in that case, but the node is really not ready. This is reflected in the current implementation as of 2023/03/07: we return 200 only when the sync time is 0.0 and there is no catchpoint in the status.