
Algod: New health endpoint (k8s /ready endpoint) #4844

Merged
merged 70 commits into master from ready-endpoint on Mar 31, 2023

Conversation

Contributor

@ahangsu ahangsu commented Nov 30, 2022

Summary

Implements #4223.

According to some discussion, and following the Kubernetes probe model, we separate the following probes and features:

  • liveness probe: /health
  • readiness probe: the chain state is fully caught up (fast catchup is finished and round-by-round catchup has negligible sync time).
  • startup probe: all SQLite migrations are finished; once this is done we should be able to start fast catchup.

This PR serves as the readiness probe, namely confirming that the node is caught up with the chain state.
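
For orientation, here is a hedged sketch (not part of this PR) of how the liveness and readiness probes could be wired up via the Kubernetes Go API; the port and timing values are illustrative assumptions, and the field names follow recent k8s.io/api releases:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// algodProbes sketches the probe wiring discussed above: /health answers
// liveness, /ready answers readiness once the node is fully caught up.
// Port 8080 and the period/threshold values are assumptions for illustration.
func algodProbes() (liveness, readiness corev1.Probe) {
	liveness = corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/health", Port: intstr.FromInt(8080)},
		},
		PeriodSeconds: 10,
	}
	readiness = corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/ready", Port: intstr.FromInt(8080)},
		},
		PeriodSeconds:    10,
		FailureThreshold: 3,
	}
	// A startup probe (SQLite migrations finished) is discussed above but has
	// no dedicated endpoint in this PR, so it is not sketched here.
	return liveness, readiness
}

func main() { _, _ = algodProbes() }
```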

Test Plan

  • Mock test in common: just like the v2 server API tests, start a mock node and confirm the API logic is correct.
  • E2E test.
    • Roll up a network, with a primary node generating catchpoints (with catchpoint files) over a sufficient number of rounds.
    • Let the network proceed for a few rounds, and obtain the latest catchpoint.
    • Introduce a new node to the network, and start fast catchup against the latest catchpoint.
    • While catching up, /ready should return an error; after catchup, /ready should return 200 OK (a minimal polling sketch follows this list).
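
A minimal sketch of that E2E readiness check, assuming a plain HTTP poll against the new node's REST address (the helper and address below are hypothetical, not part of the actual test harness):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForReady polls /ready until it returns 200 OK or the context expires.
// Any non-200 response or connection error is treated as "not ready yet".
func waitForReady(ctx context.Context, baseURL string) error {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		resp, err := http.Get(baseURL + "/ready")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // node reports it is fully caught up
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("node not ready: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// The address is an assumption for illustration; use the new node's REST endpoint.
	if err := waitForReady(ctx, "http://localhost:8080"); err != nil {
		panic(err)
	}
	fmt.Println("node is ready")
}
```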

Questions (on some uncertainties) and my answers (thinking out loud)

  • As of 2023/02/21, this /ready endpoint only serves as a fast-catchup finality check. Do we want to additionally enable the node to indicate that it is still catching up with the latest round (the residual rounds after catchpoint catchup)?

    I believe this is not hard to achieve: setting a heuristic sync-time threshold in the endpoint logic should work. But this would also introduce a false positive:
    (imagine a node that has just started against a network and is catching up round by round; /ready would return success in this case, but the node is not actually ready).

    This is reflected in the current implementation as of 2023/03/07: we return 200 only when the sync time is 0.0 and there is no catchpoint in the status (see the sketch below).
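
A minimal sketch of that check, with a hypothetical status struct standing in for the node status the real handler consults (field and function names here are illustrative, not algod's actual types); per the discussion further down, a not-ready node answers 400:

```go
package common

import "net/http"

// nodeStatus stands in for the status the real handler reads from the node.
type nodeStatus struct {
	Catchpoint  string  // non-empty while a fast (catchpoint) catchup is running
	CatchupTime float64 // sync time in seconds; non-zero while catching up round by round
}

// readyHandler returns 200 only when there is no catchpoint in the status and
// the sync time is zero, i.e. the node is fully caught up; otherwise it
// returns an error status, matching the behavior described in this PR.
func readyHandler(status func() nodeStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		s := status()
		if s.Catchpoint != "" || s.CatchupTime != 0 {
			w.WriteHeader(http.StatusBadRequest) // still catching up
			return
		}
		w.WriteHeader(http.StatusOK) // fully caught up
	}
}
```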


codecov bot commented Nov 30, 2022

Codecov Report

Merging #4844 (3d949d3) into master (b7234a3) will decrease coverage by 0.02%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #4844      +/-   ##
==========================================
- Coverage   53.71%   53.69%   -0.02%     
==========================================
  Files         444      445       +1     
  Lines       55669    55682      +13     
==========================================
- Hits        29904    29901       -3     
- Misses      23439    23447       +8     
- Partials     2326     2334       +8     
Impacted Files                                     Coverage Δ
daemon/algod/api/server/v2/test/helpers.go         75.67% <ø> (ø)
node/follower_node.go                               26.08% <ø> (ø)
daemon/algod/api/server/common/test/helpers.go      84.61% <84.61%> (ø)
catchup/service.go                                  70.11% <100.00%> (+1.17%) ⬆️

... and 10 files with indirect coverage changes


@ahangsu ahangsu force-pushed the ready-endpoint branch 4 times, most recently from 448c4a9 to 34841aa on November 30, 2022 19:50
daemon/algod/api/algod.oas2.json (outdated review thread)
"schemes": [
"http"
],
"summary": "Returns OK if healthy and fully caught up.",
Contributor

I'm not sure if this is 100% true...I think we would still return a 200 if we never used catchpoint catchup (using slow round by round catchup via the catchup service). In that case we would be "ready", but still on round 10 out of 24 million for example.

Contributor Author

Oh boy, I am not so familiar with "slow round" mode.

using slow round by round catchup via the catchup service

I assume slow round-by-round catchup is the opposite of fast catchup. In that case, we should be fine if we are way behind the latest round, right?

So what would happen if we used this node to send some transactions, or do some other operations...?

Contributor

By "slow round" mode I just mean starting the node from genesis without using catchpoint catchup.

In this case the node operations should be fine in all the ways that justify a 200, but I wouldn't say the node is "fully caught up".

Contributor Author

I see, I was reading from the issue description of #4223 that:

the node really isn't capable of accepting transactions, looking up blocks, or accounts or doing much of anything until its fully caught up.

So that is the motivation for returning 200 only if the node is fully caught up, such that it serves like a health endpoint, but a little more than that, as a readiness endpoint.

Just to be clear - if doing a regular from-genesis catchup this new endpoint should NOT return a 200 until it's fully caught-up. Whether fast catchup was used or not shouldn't matter. The readiness handler criteria should be the same.

Contributor Author

@ahangsu ahangsu Mar 22, 2023

Right, for a regular round-by-round catchup, it will always return 400 until your node reaches the latest round.

Fast catchup is similar; consider the following 2 phases before a node reaches the latest round:

  1. still in fast catchup against some catchpoint: node.status reports a catchpoint, so 400 error.
  2. catching up the residual rounds between the catchpoint and the latest round: still a 400 error, since sync time != 0.
    The handler's behavior here is the same as for the normal "from-genesis" catchup stated above.

So the handler's behavior holds in both cases.

Perfect, thanks. I'm testing locally now.
This'll be nice as I'll be able to scale the nodes trivially and not worry about them receiving requests before they're ready.

cmd/algod/main.go (outdated review thread)
bbroder-algo previously approved these changes Mar 30, 2023
Eric-Warehime previously approved these changes Mar 30, 2023
catchup/service.go (outdated review thread)
bbroder-algo previously approved these changes Mar 30, 2023
Eric-Warehime previously approved these changes Mar 31, 2023
excalq previously approved these changes Mar 31, 2023
Contributor

@excalq excalq left a comment

Other than the logging level comment, glad to be aware of this. I'll stamp.

Contributor

@excalq excalq left a comment

Great!

Development

Successfully merging this pull request may close these issues.

New health endpoint should be added that only returns success if node is fully caught-up.
7 participants