Add hubble_relay_pool_peer_connection_status metric #28217
Thanks for contributing. Such a metric sounds pretty useful to me. If I understand correctly, on each call to any relay api method you are updating the I can suggest a different approach. Instead of binding this to API calls, create a single timer goroutine that will periodically traverse peers list and report gauge metric of Benefits:
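A minimal sketch of the timer-goroutine approach suggested above, assuming simplified stand-in types. `ConnState`, `Peer`, `StatusGauge`, `countByState`, and `reportPeriodically` are hypothetical names invented for illustration and are not taken from the Cilium code base. Real code would more likely use the state type from `google.golang.org/grpc/connectivity` and a Prometheus gauge vector in place of the map-backed gauge shown here.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// ConnState is a hypothetical stand-in for a gRPC connectivity state.
type ConnState string

const (
	Ready            ConnState = "READY"
	Connecting       ConnState = "CONNECTING"
	TransientFailure ConnState = "TRANSIENT_FAILURE"
)

// Peer is a hypothetical, simplified peer record.
type Peer struct {
	Name  string
	State ConnState
}

// StatusGauge is a hypothetical stand-in for a gauge labelled by state.
type StatusGauge struct {
	mu     sync.Mutex
	values map[ConnState]int
}

func NewStatusGauge() *StatusGauge {
	return &StatusGauge{values: map[ConnState]int{}}
}

// Set replaces all values at once, so a state that no longer has any
// peers reads as zero instead of keeping a stale count.
func (g *StatusGauge) Set(counts map[ConnState]int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values = counts
}

// Get returns the number of peers last reported in the given state.
func (g *StatusGauge) Get(s ConnState) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[s]
}

// countByState aggregates peers into a per-state count.
func countByState(peers []Peer) map[ConnState]int {
	counts := make(map[ConnState]int)
	for _, p := range peers {
		counts[p.State]++
	}
	return counts
}

// reportPeriodically refreshes the gauge on every tick until ctx is
// cancelled. It is decoupled from API calls: nothing here runs on the
// request path.
func reportPeriodically(ctx context.Context, interval time.Duration, list func() []Peer, g *StatusGauge) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			g.Set(countByState(list()))
		}
	}
}
```

A caller would start it once, for example with `go reportPeriodically(ctx, 5*time.Second, listPeers, gauge)`, where `listPeers` and the interval are again assumptions. With a real labelled gauge, states that drop to zero peers would need to be set to zero explicitly rather than relying on a map replacement.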
Commit 9943a754f49c227673b10776b060ea7325a5c27f does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
Overall a good job. Thanks for implementing this. I have one suggestion: report the real gRPC peer connection status instead of a pre-aggregated available/unavailable value.
One more suggestion. This metric reports the connection status between relay and peers. It is implemented as part of observer/Server, but the Server has a different responsibility: it handles client requests. The metric fits better inside PeerManager https://github.com/cilium/cilium/blob/main/pkg/hubble/relay/pool/manager.go#L59
I think this PR would also close #27890 (I had just started looking into that issue when I discovered this PR). One input I have: it might be worth adding the peer name and/or address to the metric. This would make the implementation slightly more complicated and would increase the cardinality of the metric, but I think for most cluster sizes this should be fine, and it would allow cluster operators to see from the metric itself which peer is affected without having to dig through logs.
I believe that the key is in "most cluster sizes". For relatively small clusters that's not a problem, but it may become a problem in larger clusters with thousands of nodes (or with dynamic node allocation/cluster autoscalers). This also changes the feature from a metric that allows detecting whether anything is or was wrong into a detailed status history of each peer. If there's a decision to add the node name/address to the label set, then it should be configurable, so that these labels can be enabled or disabled to match the user's cluster configuration. For the general use case I would rather stay with a gauge reporting the number of peers in a given connection state and then refer to the logs for details on unavailable peers.
Fair point, maybe adding peer information is too much, feel free to just ignore my input. :) My impression was that if the number of nodes in your cluster causes high cardinality in your metrics, you'll need to be very careful about what and how you scrape anyway. But you're right, this metric should also work in these cases without having to jump through hoops. So an aggregated gauge is probably good enough.
I agree that this code fits better into the PeerManager code. Moved it there already.
The unit tests that I wrote for PeerManager use time.Sleep to test the actual code that exports this metric. I don't fully like it, but I don't see time being mocked anywhere in the current unit tests, so I didn't try to reinvent the wheel here. The timer used in PeerManager could be mocked, and the https://github.com/benbjohnson/clock library is usually a good pick for that.
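One way to avoid time.Sleep in such tests, sketched here under stated assumptions, is to hand the reporting loop its tick channel instead of letting it create a timer. This is a plain-channel alternative to the clock-mocking library mentioned above, not that library's API. `runReporter` and `driveManually` are hypothetical names, and the loop is a simplified stand-in for whatever PeerManager actually does.

```go
package main

import "time"

// runReporter refreshes the reported value on every tick and signals on
// updated after each refresh. It returns once ticks is closed.
// In production, ticks could be the C channel of a time.Ticker.
func runReporter(ticks <-chan time.Time, countPeers func() int, set func(int), updated chan<- struct{}) {
	for range ticks {
		set(countPeers())
		updated <- struct{}{}
	}
}

// driveManually shows how a test could step the reporter one tick at a
// time without sleeping. Unbuffered channels order every write to peers
// and seen with respect to the reporter goroutine, so there is no race.
func driveManually() []int {
	ticks := make(chan time.Time)
	updated := make(chan struct{})

	peers := 0
	var seen []int

	go runReporter(
		ticks,
		func() int { return peers },
		func(v int) { seen = append(seen, v) },
		updated,
	)

	for _, n := range []int{3, 5, 0} {
		peers = n            // change the simulated peer count
		ticks <- time.Time{} // fire one tick by hand
		<-updated            // wait until the reporter has processed it
	}
	close(ticks)
	return seen
}
```

The test then asserts on the values recorded after each hand-fired tick, so its duration no longer depends on the reporting interval.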
LGTM! Great job! Thanks for contributing!
Thanks @siwiutki!
Commit 4f423267aebc412adec5ebefdd89534dacfb2ee9 does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
@siwiutki can you please squash the two commits into one?
The second commit was a merge commit to sync with the current main branch. I synced with a rebase instead of a merge, so it should be fine with one commit now.
This change adds a new gauge metric to hubble-relay that measures the connection status to all peers. The metric tracks the number of peers in each possible connection status.

The current set of metrics is not enough to accurately measure the availability of hubble-relay. They measure the status of gRPC calls, but if, for instance, all peers are unreachable when GetFlows is called, the gRPC call still succeeds and returns an "OK" status while the response contains no flows, rendering it useless. This new metric is introduced to cover such cases.

Signed-off-by: Michal Siwinski <siwy@google.com>
/test
Re-running, as the failure seems unrelated to the patch.
Awesome work, lgtm!
This change adds a new gauge metric to hubble-relay that measures the connection status to all peers. The metric tracks the number of peers in each possible connection status.
The current set of metrics is not enough to accurately measure the availability of hubble-relay. They measure the status of gRPC calls, but if, for instance, all peers are unreachable when GetFlows is called, the gRPC call still succeeds and returns an "OK" status while the response contains no flows, rendering it useless. This new metric is introduced to cover such cases.
Fixes #27890