
Algod: New health endpoint (k8s /ready endpoint) #4844

Merged
merged 70 commits into master from ready-endpoint on Mar 31, 2023

Conversation

Contributor

@ahangsu ahangsu commented Nov 30, 2022

Summary

Implements #4223.

According to some discussion, and following the Kubernetes probe model, we separate the following probes and features:

  • liveness probe: /health
  • readiness probe: the chain state is fully caught up (fast catchup is finished and round-by-round catchup has negligible sync time).
  • startup probe: all SQLite migrations are finished; once this is done we should be able to start fast catchup.

This PR serves as the readiness probe, namely confirming that the node is caught up with the chain state.
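
For orientation, here is a hedged sketch (not part of this PR) of how the liveness and readiness probes could be wired up via the Kubernetes Go API; the port and timing values are illustrative assumptions, and the field names follow recent k8s.io/api releases:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// algodProbes sketches the probe wiring discussed above: /health answers
// liveness, /ready answers readiness once the node is fully caught up.
// Port 8080 and the period/threshold values are assumptions for illustration.
func algodProbes() (liveness, readiness corev1.Probe) {
	liveness = corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/health", Port: intstr.FromInt(8080)},
		},
		PeriodSeconds: 10,
	}
	readiness = corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/ready", Port: intstr.FromInt(8080)},
		},
		PeriodSeconds:    10,
		FailureThreshold: 3,
	}
	// A startup probe (SQLite migrations finished) is discussed above but has
	// no dedicated endpoint in this PR, so it is not sketched here.
	return liveness, readiness
}

func main() { _, _ = algodProbes() }
```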

Test Plan

  • Mock test in common: just like the v2 server API tests, start a mock node and confirm the API logic is correct.
  • E2E test.
    • Roll up a network, with a primary node generating catchpoints (with catchpoint files) over a sufficient number of rounds.
    • Let the network proceed for a few rounds, and obtain the latest catchpoint.
    • Introduce a new node to the network, and start fast catchup against the latest catchpoint.
    • While catching up, /ready should return an error; after catchup, /ready should return 200 OK (a minimal polling sketch follows this list).
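
A minimal sketch of that E2E readiness check, assuming a plain HTTP poll against the new node's REST address (the helper and address below are hypothetical, not part of the actual test harness):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForReady polls /ready until it returns 200 OK or the context expires.
// Any non-200 response or connection error is treated as "not ready yet".
func waitForReady(ctx context.Context, baseURL string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		resp, err := http.Get(baseURL + "/ready")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // node reports it is fully caught up
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("node not ready: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// The address is an assumption for illustration; use the new node's REST endpoint.
	if err := waitForReady(ctx, "http://localhost:8080"); err != nil {
		panic(err)
	}
	fmt.Println("node is ready")
}
```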

Questions (on some uncertainties) and my answers (thinking out loud)

  • As of 2023/02/21, this /ready endpoint only serves as a fast-catchup finality check. Do we want to additionally enable the node to indicate that it is still catching up with the latest round (the residual rounds after catchpoint catchup)?

    I believe this is not hard to achieve: setting a heuristic sync-time threshold in the endpoint logic should work. But this would also introduce a false positive:
    (imagine a node that has just started against a network and is catching up round by round; /ready would return success in this case, but the node is not actually ready).

    This is reflected in the current implementation as of 2023/03/07: we return 200 only when the sync time is 0.0 and there is no catchpoint in the status (see the sketch below).
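
A minimal sketch of that check, with a hypothetical status struct standing in for the node status the real handler consults (field and function names here are illustrative, not algod's actual types); per the discussion further down, a not-ready node answers 400:

```go
package common

import "net/http"

// nodeStatus stands in for the status the real handler reads from the node.
type nodeStatus struct {
	Catchpoint  string  // non-empty while a fast (catchpoint) catchup is running
	CatchupTime float64 // sync time in seconds; non-zero while catching up round by round
}

// readyHandler returns 200 only when there is no catchpoint in the status and
// the sync time is zero, i.e. the node is fully caught up; otherwise it
// returns an error status, matching the behavior described in this PR.
func readyHandler(status func() nodeStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := status()
		if s.Catchpoint != "" || s.CatchupTime != 0 {
			w.WriteHeader(http.StatusBadRequest) // still catching up
			return
		}
		w.WriteHeader(http.StatusOK) // fully caught up
	}
}
```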


codecov bot commented Nov 30, 2022

Codecov Report

Merging #4844 (3d949d3) into master (b7234a3) will decrease coverage by 0.02%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #4844      +/-   ##
==========================================
- Coverage   53.71%   53.69%   -0.02%     
==========================================
  Files         444      445       +1     
  Lines       55669    55682      +13     
==========================================
- Hits        29904    29901       -3     
- Misses      23439    23447       +8     
- Partials     2326     2334       +8     
Impacted Files                                     Coverage Δ
daemon/algod/api/server/v2/test/helpers.go         75.67% <ø> (ø)
node/follower_node.go                               26.08% <ø> (ø)
daemon/algod/api/server/common/test/helpers.go      84.61% <84.61%> (ø)
catchup/service.go                                  70.11% <100.00%> (+1.17%) ⬆️

... and 10 files with indirect coverage changes


@ahangsu ahangsu force-pushed the ready-endpoint branch 4 times, most recently from 448c4a9 to 34841aa on November 30, 2022 19:50
daemon/algod/api/algod.oas2.json (outdated review thread)
"schemes": [
"http"
],
"summary": "Returns OK if healthy and fully caught up.",
Contributor

I'm not sure if this is 100% true...I think we would still return a 200 if we never used catchpoint catchup (using slow round by round catchup via the catchup service). In that case we would be "ready", but still on round 10 out of 24 million for example.

Contributor Author

Oh boy, I am not so familiar with "slow round" mode.

using slow round by round catchup via the catchup service

I assume slow round-by-round catchup is the opposite of fast catchup. In that case, we should be fine if we are way behind the latest round, right?

So what would happen if we used this node to send some transactions, or do some other operations...?

Contributor

By "slow round" mode I just mean starting the node from genesis without using catchpoint catchup.

In this case the node operations should be fine in all the ways that justify a 200, but I wouldn't say the node is "fully caught up".

Contributor Author

I see, I was reading from the issue description of #4223 that:

the node really isn't capable of accepting transactions, looking up blocks, or accounts or doing much of anything until its fully caught up.

So that is the motivation for returning 200 only if the node is fully caught up, such that it serves like a health endpoint, but a little more than that, as a readiness endpoint.

Just to be clear - if doing a regular from-genesis catchup this new endpoint should NOT return a 200 until it's fully caught-up. Whether fast catchup was used or not shouldn't matter. The readiness handler criteria should be the same.

Contributor Author

@ahangsu ahangsu Mar 22, 2023

Right, for a regular round-by-round catchup, it will always return 400 until your node reaches the latest round.

Fast catchup is similar; consider the following 2 phases before a node reaches the latest round:

  1. still in fast catchup against some catchpoint: node.status reports a catchpoint, so 400 error.
  2. catching up the residual rounds between the catchpoint and the latest round: still a 400 error, since sync time != 0.
    The handler's behavior here is the same as for the normal "from-genesis" catchup stated above.

So the handler's behavior holds in both cases.

Perfect, thanks. I'm testing locally now.
This'll be nice as I'll be able to scale the nodes trivially and not worry about them receiving requests before they're ready.

cmd/algod/main.go (outdated review thread)
bbroder-algo previously approved these changes Mar 30, 2023
Eric-Warehime previously approved these changes Mar 30, 2023
catchup/service.go (outdated review thread)
bbroder-algo previously approved these changes Mar 30, 2023
Eric-Warehime previously approved these changes Mar 31, 2023
excalq previously approved these changes Mar 31, 2023
Contributor

@excalq excalq left a comment

Other than the logging level comment, glad to be aware of this. I'll stamp.

Contributor

@excalq excalq left a comment

Great!

Development

Successfully merging this pull request may close these issues.

New health endpoint should be added that only returns success if node is fully caught-up.
7 participants