Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network: Support simple liveness check via http on gossip server port. #5944

Merged
merged 3 commits into from Mar 7, 2024

Conversation

gmalouf
Copy link
Contributor

@gmalouf gmalouf commented Feb 26, 2024

Summary

This PR adds a simple /status http endpoint on the gossip server port, enough to ensure it is open. Continue to use the /ready endpoint on the API port for readiness checks (i.e. is the node caught up?).

closes #5208

Test Plan

Added new health check tests, existing tests should pass without issue. In addition, started node up locally with NetAddress set, confirmed can reach /status on the specified port.

rpcs/healthService.go Outdated Show resolved Hide resolved
Copy link

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 55.73%. Comparing base (298e53a) to head (0bbb162).

Files Patch % Lines
node/follower_node.go 0.00% 1 Missing and 1 partial ⚠️
node/node.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5944      +/-   ##
==========================================
+ Coverage   55.72%   55.73%   +0.01%     
==========================================
  Files         488      489       +1     
  Lines       68101    68111      +10     
==========================================
+ Hits        37950    37963      +13     
+ Misses      27583    27575       -8     
- Partials     2568     2573       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gmalouf gmalouf marked this pull request as ready for review February 26, 2024 19:13
rpcs/healthService.go Outdated Show resolved Hide resolved
jasonpaulos
jasonpaulos previously approved these changes Feb 26, 2024
node/node.go Outdated Show resolved Hide resolved
rpcs/healthService.go Outdated Show resolved Hide resolved
rpcs/healthService.go Show resolved Hide resolved
@pbennett
Copy link

Doesn't the /ready endpoint already provide this?

@gmalouf
Copy link
Contributor Author

gmalouf commented Feb 28, 2024

I think Pavel's argument is that if the node is catching up, it is not fully operational and showing a 200 might be deceiving (we were discussing having this return a 503 with a message around the catchup near term). The original issue #5208 requested this for a case where the http port is not available (and hence the /ready endpoint, which is somewhat heavy for this).

@pbennett
Copy link

But the ready endpoint's point is to only return a 200 when the node is actually 'caught up' and ready for accepting transactions. It's what is used to determine 'load-balancer readiness' for eg. The 'health' endpoint is for 'process is alive' - the 'ready' endpoint is it's caught up and ready to accept comms. ie for k8s I have health check set to health endpoint and I have readiness check (what determines what is in load-balancer active set) go to 'ready' endpoint.
So it seems like the existing endpoint already covers what you need or does there really need to be a third type of ready?

@gmalouf
Copy link
Contributor Author

gmalouf commented Feb 28, 2024

This is definitely a gray area/debatable topic - what defines okay here - port is open, or that the node is functioning?

I think the ask from the issue this is meant to address is to give external folks for say an API service a way to know the gossip port is up when the http port isn't enabled publicly. Separate discussion around that exact use-case/why it's important.

Folks then chimed in that a simple 200/port is open response would be great, which is what I implemented originally. I guess the justification is that gossip is not actually working when the node is catching up, so a non-200 response on the status for the gossip port seems more appropriate than saying "everything's great".

Copy link
Member

@jasonpaulos jasonpaulos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a basic, always-200 response, looks good

rpcs/healthService.go Show resolved Hide resolved
algorandskiy

This comment was marked as duplicate.

Copy link
Contributor

@algorandskiy algorandskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not enough to satisfy all people but good enough for a simple check "node is running". I think this requires some followup work to answer question "relay node is healthy"

@gmalouf gmalouf merged commit 86ae7e6 into algorand:master Mar 7, 2024
18 checks passed
@gmalouf gmalouf deleted the health-checks-gossip-port branch March 7, 2024 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Permissionless health endpoint for gossip service
5 participants