Add hubble_relay_pool_peer_connection_status metric #28217

siwiutki · 2023-09-18T20:45:37Z

This change adds a new gauge metric to hubble-relay that measures the connection status of all peers. The metric tracks the number of peers in each possible connection status.

The current set of metrics is not enough to accurately measure the availability of hubble-relay. They measure the status of gRPC calls, but if, for instance, all peers are unreachable when GetFlows is called, the gRPC call still succeeds and returns an "OK" status while the response contains no flows, rendering it useless. This new metric is introduced to cover such cases.
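
As a rough illustration of what such a gauge carries (this is not the code from this PR), the sketch below counts peers per connection status using only the standard library. The `peer` type and `countByStatus` function are hypothetical, and the status names are assumed to follow gRPC connectivity state names plus the NIL_CONNECTION value mentioned later in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// peer is a hypothetical stand-in for a relay peer. State holds a
// connection status name, or "" when the peer has no connection at all.
type peer struct {
	Name  string
	State string
}

// countByStatus returns the number of peers in each connection status.
// Peers without a connection are counted under "NIL_CONNECTION".
func countByStatus(peers []peer) map[string]int {
	counts := make(map[string]int)
	for _, p := range peers {
		status := p.State
		if status == "" {
			status = "NIL_CONNECTION"
		}
		counts[status]++
	}
	return counts
}

func main() {
	peers := []peer{
		{Name: "node-a", State: "READY"},
		{Name: "node-b", State: "TRANSIENT_FAILURE"},
		{Name: "node-c", State: "READY"},
		{Name: "node-d"}, // no connection yet
	}
	counts := countByStatus(peers)

	// Sort the statuses so the printed order is stable.
	statuses := make([]string, 0, len(counts))
	for s := range counts {
		statuses = append(statuses, s)
	}
	sort.Strings(statuses)
	for _, s := range statuses {
		fmt.Printf("status=%s peers=%d\n", s, counts[s])
	}
}
```

With a gauge like this, a GetFlows call that returns "OK" while every peer sits outside the ready status still shows up as a drop in the ready count.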

Fixes #27890

Added the hubble_relay_pool_peer_connection_status metric for measuring the connection status of all peers. The metric tracks the number of peers in each possible connection status.

marqc · 2023-09-19T08:01:41Z

Thanks for contributing. Such a metric sounds pretty useful to me.

If I understand correctly, you update the hubble_relay_observer_unavailable_nodes gauge on each call to any relay API method. It seems odd that the gauge is labeled with the API method name, since node status is not related to the calling method; it is a property of the peers list.

I can suggest a different approach. Instead of binding this to API calls, create a single timer goroutine that periodically traverses the peers list and reports a hubble_relay_observer_nodes gauge with a state label reflecting the possible node states: https://github.com/cilium/cilium/blob/main/api/v1/relay/relay.proto#L22

Benefits:

metric will reflect the current relay peers states even when there are no API calls
metric will reflect all nodes and their states, not only unavailable nodes (better for troubleshooting peer discovery problems)
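
A minimal sketch of the timer-goroutine approach suggested above, using only the standard library. Everything here is hypothetical: `peerLister`, `gauge`, `report`, and `run` are illustrative names, not Cilium APIs, and the state names are placeholders in the style of the linked proto. A real implementation would presumably use a Prometheus gauge vector instead of the in-memory `gauge`.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// peerLister is a hypothetical view of the relay's peer list.
type peerLister interface {
	// States returns the current state name of every known peer.
	States() []string
}

// gauge is a hypothetical in-memory stand-in for a gauge with a "state" label.
type gauge struct {
	mu     sync.Mutex
	values map[string]int
}

// replace swaps in a fresh snapshot, so states with no peers drop back to zero.
func (g *gauge) replace(values map[string]int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values = values
}

// get returns the last reported peer count for a state.
func (g *gauge) get(state string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[state]
}

// report takes one snapshot of the peer list and stores per-state counts.
func report(pl peerLister, g *gauge) {
	counts := make(map[string]int)
	for _, s := range pl.States() {
		counts[s]++
	}
	g.replace(counts)
}

// run reports on every tick until ctx is cancelled, independent of API calls.
func run(ctx context.Context, pl peerLister, g *gauge, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			report(pl, g)
		}
	}
}

// staticPeers is a fixed peer list used only for this demonstration.
type staticPeers []string

func (s staticPeers) States() []string { return s }

func main() {
	peers := staticPeers{"NODE_CONNECTED", "NODE_CONNECTED", "NODE_UNAVAILABLE"}
	g := &gauge{}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	run(ctx, peers, g, 10*time.Millisecond)
	fmt.Println("connected:", g.get("NODE_CONNECTED"), "unavailable:", g.get("NODE_UNAVAILABLE"))
}
```

Because each tick replaces the whole snapshot, the gauge reflects every state of the current peer list even when no client calls the relay API.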

maintainer-s-little-helper · 2023-09-23T14:56:27Z

Commit 9943a754f49c227673b10776b060ea7325a5c27f does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

marqc · 2023-09-25T09:07:29Z

Overall a good job. Thanks for implementing this. I have one suggestion: report the real gRPC peer connection status instead of pre-aggregating it as available/unavailable.

One more suggestion. This metric reports the connection status between relay and its peers. It is implemented as part of observer/Server, but the Server is responsible for handling client requests rather than managing peer connections. The metric fits better inside PeerManager: https://github.com/cilium/cilium/blob/main/pkg/hubble/relay/pool/manager.go#L59

glrf · 2023-09-26T09:11:48Z

I think this PR would also close: #27890 (I actually just started looking into that issue before I discovered this PR)

One input I have: It might be worth it to add the peer name and/or address to the metric. This would make the implementation slightly more complicated and would increase the cardinality of the metric, but I think for most cluster sizes this should be fine and it would allow cluster operators to easily see from the metric itself which peer is affected without having to dig through logs.

marqc · 2023-09-27T08:40:23Z

@glrf

> One input I have: It might be worth it to add the peer name and/or address to the metric. This would make the implementation slightly more complicated and would increase the cardinality of the metric, but I think for most cluster sizes this should be fine and it would allow cluster operators to easily see from the metric itself which peer is affected without having to dig through logs.

I believe that the key is in "most cluster sizes". For relatively small clusters that's not a problem, but it may become a problem for larger clusters with thousands of nodes (or with dynamic node allocation/cluster autoscalers). This also changes the feature from a metric that allows detecting whether anything is or was wrong into a detailed status history of each peer. If there's a decision to add a node name/address to the label set, then these labels should be configurable (enabled or disabled) to match the user's cluster configuration. For the general use case I would rather stay with just a gauge reporting the number of peers in a given connection state and then refer to logs for details on unavailable peers.
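
To make the cardinality trade-off concrete, here is a small sketch that counts how many distinct series a gauge would export with and without a per-peer label. The `peer` type, the `seriesKeys` function, and the `perPeer` switch are hypothetical and only illustrate the idea of an opt-in label; the status names are assumed.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for a relay peer.
type peer struct {
	Name   string
	Status string
}

// seriesKeys returns the distinct label sets a gauge would export, mapped to
// the number of peers behind each one. With perPeer=false the only label is
// the status; with perPeer=true the peer name is added as a second label.
func seriesKeys(peers []peer, perPeer bool) map[string]int {
	series := make(map[string]int)
	for _, p := range peers {
		key := "status=" + p.Status
		if perPeer {
			key += ",peer=" + p.Name
		}
		series[key]++
	}
	return series
}

func main() {
	// A 1000-node cluster in which every hundredth node is failing.
	peers := make([]peer, 0, 1000)
	for i := 0; i < 1000; i++ {
		status := "READY"
		if i%100 == 0 {
			status = "TRANSIENT_FAILURE"
		}
		peers = append(peers, peer{Name: fmt.Sprintf("node-%d", i), Status: status})
	}
	fmt.Println("aggregated series:", len(seriesKeys(peers, false)))
	fmt.Println("per-peer series:", len(seriesKeys(peers, true)))
}
```

In this example the aggregated gauge stays at 2 series while the per-peer variant exports 1000, and with autoscaling the per-peer set keeps growing as node names churn.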

rolinh · 2023-09-27T09:07:48Z

Thanks for your contribution @siwiutki ! It sounds like a very useful feature indeed.
I also share @marqc's opinion that the logic should be implemented in the peer manager, which is the component that manages connections to peers, rather than in the server.

glrf · 2023-09-27T09:11:47Z

> I believe that the key is in "most cluster sizes". For relatively small clusters that's not a problem, but it may become a problem for larger clusters with thousands of nodes (or with dynamic node allocation/cluster autoscalers). This also changes the feature from a metric that allows detecting whether anything is or was wrong into a detailed status history of each peer. If there's a decision to add a node name/address to the label set, then these labels should be configurable (enabled or disabled) to match the user's cluster configuration. For the general use case I would rather stay with just a gauge reporting the number of peers in a given connection state and then refer to logs for details on unavailable peers.

Fair point, maybe adding peer information is too much, feel free to just ignore my input. :)

My impression was that if the number of nodes in your cluster will cause high cardinality in your metrics, you'll need to be very careful what and how you scrape metrics anyways. But you're right, this metric should also work in these cases without having to jump through hoops. So an aggregated gauge is probably good enough.

siwiutki

I agree that this code fits better into the PeerManager code. Moved it there already.

The unit tests that I wrote for PeerManager use time.Sleep to test the actual code that exports this metric. I don't fully like it, but I don't see time being mocked anywhere in the current unit tests, so I didn't try to reinvent the wheel here. The timer used in PeerManager could be mocked; the https://github.com/benbjohnson/clock library is usually a good pick for that.
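
For reference, one way to avoid time.Sleep in such tests without any extra dependency is to inject the tick channel. This is only a sketch under that assumption, not how PeerManager is written: `reporter` and `driveTicks` are hypothetical names, and in production the channel would come from something like `time.NewTicker(interval).C`.

```go
package main

import (
	"fmt"
	"time"
)

// reporter is a hypothetical component that does some work on every tick.
// The tick source is injected so tests can drive it by hand.
type reporter struct {
	ticks  <-chan time.Time
	stop   <-chan struct{}
	onTick func()
}

// run calls onTick once per received tick until stop is closed.
func (r *reporter) run() {
	for {
		select {
		case <-r.stop:
			return
		case <-r.ticks:
			r.onTick()
		}
	}
}

// driveTicks starts a reporter, feeds it n manual ticks, stops it, and
// returns how many times onTick ran. No sleeping is involved.
func driveTicks(n int) int {
	ticks := make(chan time.Time) // unbuffered: a send returns once run has received it
	stop := make(chan struct{})
	done := make(chan struct{})
	calls := 0
	r := &reporter{ticks: ticks, stop: stop, onTick: func() { calls++ }}

	go func() {
		defer close(done)
		r.run()
	}()
	for i := 0; i < n; i++ {
		ticks <- time.Now()
	}
	close(stop)
	<-done // wait for run to return so reading calls is safe
	return calls
}

func main() {
	fmt.Println("reports:", driveTicks(2)) // prints "reports: 2"
}
```

A mock clock library achieves the same effect while keeping the ticker inside the component; the injected channel is just the smallest version of that idea.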

marqc

LGTM! Great job! Thanks for contributing!

kaworu

Hi @siwiutki 👋 and thanks for the PR!

A couple of comments, but overall LGTM. I agree with @glrf that testing NIL_CONNECTION would be an improvement.

kaworu

Thanks @siwiutki!

maintainer-s-little-helper · 2023-10-02T12:06:11Z

Commit 4f423267aebc412adec5ebefdd89534dacfb2ee9 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

kaworu · 2023-10-02T13:42:01Z

@siwiutki can you please squash the two commits into one?

siwiutki

The second commit was a merge commit to stay in sync with the current main branch. I synced with a rebase instead of a merge, so it should be fine with one commit now.

This change adds a new gauge metric to hubble-relay that measures the connection status of all peers. The metric tracks the number of peers in each possible connection status.

The current set of metrics is not enough to accurately measure the availability of hubble-relay. They measure the status of gRPC calls, but if, for instance, all peers are unreachable when GetFlows is called, the gRPC call still succeeds and returns an "OK" status while the response contains no flows, rendering it useless. This new metric is introduced to cover such cases.

Signed-off-by: Michal Siwinski <siwy@google.com>

kaworu · 2023-10-04T15:15:13Z

/test

kaworu · 2023-10-06T09:44:16Z

Integration Tests failed with

Unable to find image 'cilium/docs-builder:latest' locally
docker: Error response from daemon: manifest for cilium/docs-builder:latest not found: manifest unknown: manifest unknown.

re-running as it seems unrelated to the patch.

rolinh

Awesome work, lgtm!

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) Sep 18, 2023

github-actions bot added the kind/community-contribution label (This was a contribution made by a community member.) Sep 18, 2023

siwiutki force-pushed the main branch from 08f8ffe to 1e69446 on September 18, 2023 22:07

siwiutki marked this pull request as ready for review September 19, 2023 07:49

siwiutki requested a review from a team as a code owner September 19, 2023 07:49

siwiutki requested a review from kaworu September 19, 2023 07:49

siwiutki marked this pull request as draft September 19, 2023 12:15

siwiutki force-pushed the main branch from 1e69446 to 9943a75 on September 23, 2023 14:56

maintainer-s-little-helper bot added the dont-merge/needs-sign-off label (The author needs to add signoff to their commits before merge.) Sep 23, 2023

siwiutki force-pushed the main branch from 9943a75 to 6ec7f86 on September 23, 2023 14:57

siwiutki changed the title from "Add hubble_relay_observer_nodes_unavailable metric" to "Add hubble_relay_observer_nodes_status metric" Sep 23, 2023

siwiutki force-pushed the main branch from 6ec7f86 to 19be790 on September 23, 2023 16:49

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off label (The author needs to add signoff to their commits before merge.) Sep 23, 2023

siwiutki marked this pull request as ready for review September 25, 2023 07:30

marqc reviewed Sep 25, 2023

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) Sep 25, 2023

siwiutki mentioned this pull request Sep 25, 2023

[WIP] Add Hubble test suite kubernetes/perf-tests#2312

Closed

siwiutki force-pushed the main branch from 19be790 to 5e07d95 on September 27, 2023 18:19

siwiutki force-pushed the main branch from 9ab16db to f646966 on September 27, 2023 18:22

siwiutki changed the title from "Add hubble_relay_observer_nodes_status metric" to "Add hubble_relay_pool_peer_connection_status metric" Sep 27, 2023

siwiutki commented Sep 27, 2023

marqc approved these changes Sep 28, 2023

siwiutki force-pushed the main branch from f646966 to 689fadb on September 28, 2023 08:19

glrf approved these changes Sep 29, 2023

kaworu reviewed Sep 29, 2023

siwiutki force-pushed the main branch from 689fadb to ba373b7 on October 2, 2023 07:07

kaworu approved these changes Oct 2, 2023

siwiutki force-pushed the main branch from ba373b7 to 8efa5a7 on October 2, 2023 10:18

maintainer-s-little-helper bot added the dont-merge/needs-sign-off label (The author needs to add signoff to their commits before merge.) Oct 2, 2023

siwiutki force-pushed the main branch from 4f42326 to 74bf6d5 on October 2, 2023 12:06

siwiutki force-pushed the main branch from 74bf6d5 to 874cb16 on October 2, 2023 12:08

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off label (The author needs to add signoff to their commits before merge.) Oct 2, 2023

siwiutki force-pushed the main branch from 874cb16 to a756103 on October 2, 2023 12:09

siwiutki force-pushed the main branch from a756103 to 03a5b12 on October 2, 2023 14:24

siwiutki commented Oct 2, 2023

siwiutki force-pushed the main branch from 03a5b12 to 29cd279 on October 3, 2023 15:15

siwiutki force-pushed the main branch from 29cd279 to 1e637c9 on October 4, 2023 07:55

maintainer-s-little-helper bot added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge.) Oct 6, 2023

ti-mo merged commit e50b3c3 into cilium:main Oct 6, 2023
59 of 61 checks passed

rolinh approved these changes Oct 6, 2023

siwiutki mentioned this pull request Feb 16, 2024

Add siwiutki to organization members cilium/community#86

Merged
