Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hubble_relay_pool_peer_connection_status metric #28217

Merged
merged 1 commit into from
Oct 6, 2023
Merged

Conversation

siwiutki
Copy link
Contributor

@siwiutki siwiutki commented Sep 18, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

This change adds a new gauge metric to hubble-relay measuring the connectiion status to all peers. Metric keeps track of number of peers for each possible connectiion status

The current set of metrics is not enough to accurately measure the availability of hubble-relay. They measure the status of gRPC calls, but, for instance, in case all peers are unreachable when GetFlows is called, even though gRPC call will succeed and return "OK" status, the response will come with no flows gathered, rendering it useless. This new metric is introduced to cover such cases.

Fixes #27890

Added hubble_relay_pool_peer_connection_status metric for measuring the connection status of all peers. Metric keeps track of number of peers for each possible connectiion status.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 18, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Sep 18, 2023
@siwiutki siwiutki marked this pull request as ready for review September 19, 2023 07:49
@siwiutki siwiutki requested a review from a team as a code owner September 19, 2023 07:49
@marqc
Copy link
Contributor

marqc commented Sep 19, 2023

Thanks for contributing. Such a metric sounds pretty useful to me.

If I understand correctly, on each call to any relay api method you are updating the hubble_relay_observer_unavailable_nodes gauge. This seems quite odd that it's labeled with the API method name, as node status is not related to the calling method. This is a state of peers list.

I can suggest a different approach. Instead of binding this to API calls, create a single timer goroutine that will periodically traverse peers list and report gauge metric of hubble_relay_observer_nodes with a state label reflecting possible nodes states https://github.com/cilium/cilium/blob/main/api/v1/relay/relay.proto#L22

Benefits:

  • metric will reflect the current relay peers states even when there are no API calls
  • metric will reflect all nodes and their states. not only unavailable nodes (better for troubleshooting peer discovery problems)

@maintainer-s-little-helper
Copy link

Commit 9943a754f49c227673b10776b060ea7325a5c27f does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 23, 2023
@maintainer-s-little-helper
Copy link

Commit 9943a754f49c227673b10776b060ea7325a5c27f does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@siwiutki siwiutki changed the title Add hubble_relay_observer_nodes_unavailable metric Add hubble_relay_observer_nodes_status metric Sep 23, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Sep 23, 2023
@siwiutki siwiutki marked this pull request as ready for review September 25, 2023 07:30
Copy link
Contributor

@marqc marqc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a good job. Thanks for implementing this. I have one suggestion to report real grpc peer connection status instead of pre-aggregated as available/non-available.

pkg/hubble/relay/observer/server.go Outdated Show resolved Hide resolved
pkg/hubble/relay/observer/server.go Outdated Show resolved Hide resolved
pkg/hubble/relay/observer/server.go Outdated Show resolved Hide resolved
pkg/hubble/relay/observer/server.go Outdated Show resolved Hide resolved
@marqc
Copy link
Contributor

marqc commented Sep 25, 2023

Overall a good job. Thanks for implementing this. I have one suggestion to report real grpc peer connection status instead of pre-aggregated as available/non-available.

One more suggestion. This metric reports the connection status between relay and peers. It is implemented as part of observer/Server but actually the Server has different responsibilities as it is rather responsible for handling client requests. It fits better to be implemented inside PeerManager https://github.com/cilium/cilium/blob/main/pkg/hubble/relay/pool/manager.go#L59

@kaworu kaworu added kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. sig/hubble Impacts hubble server or relay labels Sep 25, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 25, 2023
@glrf
Copy link
Contributor

glrf commented Sep 26, 2023

I think this PR would also close: #27890 (I actually just started looking into that issue before I discovered this PR)

One input I have: It might be worth it to add the peer name and/or address to the metric. This would make the implementation slightly more complicated and would increase the cardinality of the metric, but I think for most cluster sizes this should be fine and it would allow cluster operators to easily see from the metric itself which peer is affected without having to dig through logs.

@marqc
Copy link
Contributor

marqc commented Sep 27, 2023

@glrf

One input I have: It might be worth it to add the peer name and/or address to the metric. This would make the implementation slightly more complicated and would increase the cardinality of the metric, but I think for most cluster sizes this should be fine and it would allow cluster operators to easily see from the metric itself which peer is affected without having to dig through logs.

I believe that the key is in "most cluster sizes". For relatively small clusters that's not a problem, but it may become a problem with larger clusters with thousand nodes (or with dynamic node allocation/cluster autoscalers). This also changes the feature from metric that allows detecting if anything is/was wrong into the detailed status history of each peer. If there's a decision to add a node name/address into labels set then this should be configurable to enable/disable these labels to match the user's cluster configuration. For a general use case I would rather stay just with gauge reporting number of peers in a given connection state and then relate to logs for details on unavailable peers.

@rolinh
Copy link
Member

rolinh commented Sep 27, 2023

Thanks for your contribution @siwiutki ! It sounds like a very useful feature indeed.
I also share @marqc opinion that the logic should be implemented in the peer manager, which is the component that manages connections to peers rather than in the server.

@glrf
Copy link
Contributor

glrf commented Sep 27, 2023

I believe that the key is in "most cluster sizes". For relatively small clusters that's not a problem, but it may become a problem with larger clusters with thousand nodes (or with dynamic node allocation/cluster autoscalers). This also changes the feature from metric that allows detecting if anything is/was wrong into the detailed status history of each peer. If there's a decision to add a node name/address into labels set then this should be configurable to enable/disable these labels to match the user's cluster configuration. For a general use case I would rather stay just with gauge reporting number of peers in a given connection state and then relate to logs for details on unavailable peers.

Fair point, maybe adding peer information is too much, feel free to just ignore my input. :)

My impression was that if the number of nodes in your cluster will cause high cardinality in your metrics, you'll need to be very careful what and how you scrape metrics anyways. But you're right, this metric should also work in these cases without having to jump through hoops. So an aggregated gauge is probably good enough.

@siwiutki siwiutki changed the title Add hubble_relay_observer_nodes_status metric Add hubble_relay_pool_peer_connection_status metric Sep 27, 2023
Copy link
Contributor Author

@siwiutki siwiutki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this code fits better into the PeerManager code. Moved it there already.

Unit tests that I wrote for PeerManager utilise time.Sleep to test the actual code that exports this metric. I don't fully like it, but I don't see time being mocked anywhere in the current unit tests, so I didn't try to reinvent the wheel here. Timer used in PeerManager could be mocked and usually https://github.com/benbjohnson/clock library is a good pick for that.

Copy link
Contributor

@marqc marqc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great job! Thanks for contributing!

pkg/hubble/relay/pool/manager.go Outdated Show resolved Hide resolved
pkg/hubble/relay/pool/manager_test.go Show resolved Hide resolved
Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @siwiutki 👋 and thanks for the PR!

Couple of comment, but overall LGTM. Agree with @glrf that testing NIL_CONNECTION would be an improvement.

pkg/hubble/relay/pool/manager.go Outdated Show resolved Hide resolved
pkg/hubble/relay/pool/manager_test.go Outdated Show resolved Hide resolved
Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @siwiutki!

@maintainer-s-little-helper
Copy link

Commit 4f423267aebc412adec5ebefdd89534dacfb2ee9 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Oct 2, 2023
@maintainer-s-little-helper
Copy link

Commit 4f423267aebc412adec5ebefdd89534dacfb2ee9 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@kaworu
Copy link
Member

kaworu commented Oct 2, 2023

@siwiutki can you please squash the two commits into one please?

Copy link
Contributor Author

@siwiutki siwiutki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second commit was a merge commit to be in sync with current main branch. Synced it with rebase instead of merge, so should be fine with 1 commit now.

@siwiutki

This comment was marked as resolved.

This change adds a new gauge metric to hubble-relay measuring the connectiion status to all peers. Metric keeps track of number of peers for each possible connectiion status

The current set of metrics is not enough to accurately measure the availability of hubble-relay. They measure the status of gRPC calls, but, for instance, in case all peers are unreachable when GetFlows is called, even though gRPC call will succeed and return "OK" status, the response will come with no flows gathered, rendering it useless. This new metric is introduced to cover such cases.

Signed-off-by: Michal Siwinski <siwy@google.com>
@kaworu
Copy link
Member

kaworu commented Oct 4, 2023

/test

@kaworu
Copy link
Member

kaworu commented Oct 6, 2023

Integration Tests failed with

Unable to find image 'cilium/docs-builder:latest' locally
docker: Error response from daemon: manifest for cilium/docs-builder:latest not found: manifest unknown: manifest unknown.

re-running as it seems unrelated to the patch.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 6, 2023
@ti-mo ti-mo merged commit e50b3c3 into cilium:main Oct 6, 2023
59 of 61 checks passed
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work, lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CFP: Hubble Relay Peer Metrics
6 participants