New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Network: Support simple liveness check via http on gossip server port. #5944
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5944 +/- ##
==========================================
+ Coverage 55.72% 55.73% +0.01%
==========================================
Files 488 489 +1
Lines 68101 68111 +10
==========================================
+ Hits 37950 37963 +13
+ Misses 27583 27575 -8
- Partials 2568 2573 +5 ☔ View full report in Codecov by Sentry. |
Doesn't the /ready endpoint already provide this? |
I think Pavel's argument is that if the node is catching up, it is not fully operational and showing a 200 might be deceiving (we were discussing having this return a 503 with a message around the catchup near term). The original issue #5208 requested this for a case where the http port is not available (and hence the /ready endpoint, which is somewhat heavy for this). |
But the ready endpoint's point is to only return a 200 when the node is actually 'caught up' and ready for accepting transactions. It's what is used to determine 'load-balancer readiness' for eg. The 'health' endpoint is for 'process is alive' - the 'ready' endpoint is it's caught up and ready to accept comms. ie for k8s I have health check set to health endpoint and I have readiness check (what determines what is in load-balancer active set) go to 'ready' endpoint. |
This is definitely a gray area/debatable topic - what defines okay here - port is open, or that the node is functioning? I think the ask from the issue this is meant to address is to give external folks for say an API service a way to know the gossip port is up when the http port isn't enabled publicly. Separate discussion around that exact use-case/why it's important. Folks then chimed in that a simple 200/port is open response would be great, which is what I implemented originally. I guess the justification is that gossip is not actually working when the node is catching up, so a non-200 response on the status for the gossip port seems more appropriate than saying "everything's great". |
5f92542
to
de3286c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a basic, always-200 response, looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not enough to satisfy all people but good enough for a simple check "node is running". I think this requires some followup work to answer question "relay node is healthy"
Summary
This PR adds a simple /status http endpoint on the gossip server port, enough to ensure it is open. Continue to use the /ready endpoint on the API port for readiness checks (i.e. is the node caught up?).
closes #5208
Test Plan
Added new health check tests, existing tests should pass without issue. In addition, started node up locally with NetAddress set, confirmed can reach
/status
on the specified port.