
Add flows per second information to Hubble status #28205

Merged
1 commit merged into cilium:main on Oct 2, 2023

Conversation

@glrf (Contributor) commented Sep 18, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

Preparation for cilium/hubble#952. This PR adds a new field flows_rate to the Hubble ServerStatus that returns the approximate rate of seen flows per second over the last minute. This allows us to report more up-to-date flows-per-second information in hubble status.

It's "approximate" as we calculate the rate by counting all flow events in the ring buffer that happened in the last minute. If all events in the ring buffer happened in the last minute, the buffer may not cover the full minute (older events may already have been evicted), so we calculate the rate since the oldest flow event in the ring buffer instead.

There are other ways to implement this. For example, we could also add a field that returns the number of seen events in the last minute, but that would again complicate things if not all events fit in the ring buffer. That's why I thought server-side rate calculation would be the easiest.

Lastly, this implementation assumes that entries in the ring buffer are in order. I'm not entirely sure this is correct. If not I would need to slightly change the implementation.
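For illustration, a minimal Go sketch of the approach described above, assuming the buffered flow timestamps can be read oldest-first; approxFlowRate and the plain timestamp slice are hypothetical stand-ins, not the actual Hubble observer API:

```go
package main

import (
	"fmt"
	"time"
)

// approxFlowRate estimates flows/s over the given window by counting buffered
// flows newer than now-window. When every buffered flow falls inside the
// window (i.e. the buffer may not cover a full window), it falls back to the
// span since the oldest buffered flow.
func approxFlowRate(timestamps []time.Time, now time.Time, window time.Duration) float64 {
	if len(timestamps) == 0 {
		return 0
	}
	cutoff := now.Add(-window)
	count := 0
	// Timestamps are assumed to be in chronological order, oldest first.
	for i := len(timestamps) - 1; i >= 0; i-- {
		if timestamps[i].Before(cutoff) {
			break
		}
		count++
	}
	span := window
	if count == len(timestamps) {
		// The whole buffer is inside the window: use the span since the
		// oldest buffered flow instead of the full window.
		span = now.Sub(timestamps[0])
	}
	if span <= 0 {
		return 0
	}
	return float64(count) / span.Seconds()
}

func main() {
	now := time.Now()
	flows := []time.Time{now.Add(-90 * time.Second), now.Add(-30 * time.Second), now.Add(-10 * time.Second)}
	fmt.Printf("~%.2f flows/s over the last minute\n", approxFlowRate(flows, now, time.Minute))
}
```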

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 18, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Sep 18, 2023
@glrf glrf marked this pull request as ready for review September 19, 2023 08:31
@glrf glrf requested review from a team as code owners September 19, 2023 08:31
@glrf glrf requested a review from kaworu September 19, 2023 08:31
@lambdanis lambdanis added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels Sep 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Sep 20, 2023
@lambdanis (Contributor) left a comment

Looks good to me, but I'm not sure about all edge cases, so I'll leave it for someone else to review.

Lastly, this implementation assumes that entries in the ring buffer are in order. I'm not entirely sure this is correct. If not I would need to slightly change the implementation.

I believe the observer GetFlows implementation relies on the same assumption. Someone correct me if I'm wrong, but for an approximate rate I think it's safe to assume that.

api/v1/observer/observer.proto: review thread (outdated, resolved)
pkg/hubble/observer/local_observer.go: review thread (outdated, resolved)
@lambdanis lambdanis added kind/community-contribution This was a contribution made by a community member. and removed kind/community-contribution This was a contribution made by a community member. labels Sep 20, 2023
@lambdanis lambdanis self-assigned this Sep 21, 2023
We add a `flows_rate` field to the Hubble `ServerStatus` that returns
the approximate rate of seen flows per second over the last minute.

It's "approximate" as we calculate the rate by counting all flow events
in the ring buffer that happened in the last minute. If all events in
the ring buffer happened in the last minute, we can't calculate the
rate over the last minute, so we calculate the rate since the oldest
flow event in the ring buffer.

Signed-off-by: Fabian Fischer <fabian.fischer@isovalent.com>
@kaworu (Member) left a comment

Thanks for the PR @glrf, I left a small comment about multi-node context but overall LGTM. Also great to see the added tests 👍 Some general comments:

  1. Although Flows per second implementation (hubble#952) suggests reporting the flow rate over the last minute in its Solution section, I'm unsure whether it will be useful enough to deduce a trend in most cases. IMO something akin to the UNIX load average would be more useful.
  2. Related to 1., this implementation is fundamentally limited to reporting at most hubble-event-buffer-capacity flows (4095 by default). Assuming all the ring buffer flows are within the last minute, we'll report at most ~68 flows/sec, which seems like a low ceiling to me.
  3. As @rolinh suggested I think we should report the flow rate through metrics in addition to hubble status.

Hooking into OnDecodedFlow instead of counting from the ring buffer would be both more flexible (i.e. count over the last x, y, z minutes like loadavg) and less sensitive to hubble-event-buffer-capacity, at the cost of a small overhead for every flow. Maybe we could even use (or reuse) a Prometheus Counter directly, although I'm not familiar enough with the metrics subsystem to say for sure (e.g. what happens when Hubble metrics are disabled in that case). Thoughts? cc @lambdanis @chancez
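For reference, a rough Go sketch of the loadavg-style idea: a per-flow hook (e.g. OnDecodedFlow) increments a counter, and a background ticker folds the per-interval delta into exponentially decaying 1/5/15-minute averages. All names here are illustrative assumptions, not Hubble's metrics or observer API:

```go
package flowrate

import (
	"math"
	"sync/atomic"
	"time"
)

// ewmaRates keeps UNIX-loadavg-style exponentially decaying flow rates.
// Hypothetical sketch: Observe would be called from a per-flow hook, and
// tick from a single background goroutine.
type ewmaRates struct {
	flows atomic.Uint64    // total flows observed so far
	last  uint64           // counter value at the previous tick
	rates [3]float64       // 1, 5 and 15 minute averages (flows/s)
	taus  [3]time.Duration // decay time constants
}

func newEWMARates() *ewmaRates {
	return &ewmaRates{taus: [3]time.Duration{time.Minute, 5 * time.Minute, 15 * time.Minute}}
}

// Observe records a single decoded flow.
func (e *ewmaRates) Observe() { e.flows.Add(1) }

// tick folds the flows seen during the last interval into the decaying averages.
func (e *ewmaRates) tick(interval time.Duration) {
	cur := e.flows.Load()
	sample := float64(cur-e.last) / interval.Seconds() // instantaneous flows/s
	e.last = cur
	for i, tau := range e.taus {
		alpha := 1 - math.Exp(-interval.Seconds()/tau.Seconds())
		e.rates[i] += alpha * (sample - e.rates[i])
	}
}
```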

api/v1/observer/observer.proto: review thread (resolved)
pkg/hubble/observer/local_observer.go: review thread (resolved)
@glrf (Contributor, Author) commented Sep 25, 2023

  2. Related to 1., this implementation is fundamentally limited to reporting at most hubble-event-buffer-capacity flows (4095 by default). Assuming all the ring buffer flows are within the last minute, we'll report at most ~68 flows/sec, which seems like a low ceiling to me.

Not quite. If all events in the ring buffer happened in the last minute, we calculate the rate since the oldest flow event in the ring buffer. Not quite accurate, but IMO probably close enough.

  3. As @rolinh suggested I think we should report the flow rate through metrics in addition to hubble status.

A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt that was unrelated to this PR. I can add that as well, though. I'm surprised we don't expose such a counter already.

Hooking into OnDecodedFlow instead of counting from the ring buffer would be both more flexible (i.e. count over the last x, y, z minutes like loadavg) and less sensitive to hubble-event-buffer-capacity, at the cost of a small overhead for every flow. Maybe we could even use (or reuse) a Prometheus Counter directly, although I'm not familiar enough with the metrics subsystem to say for sure (e.g. what happens when Hubble metrics are disabled in that case). Thoughts?

My first intuition was also to count and compute the rate separately. However, that isn't as straightforward as it seems. AFAIK we can't just reuse a Prometheus counter, as it really is just a counter and can't report how many of the events happened in the last minute. So we'd need to construct a separate data structure to keep track of this. The easiest one I could come up with, without having to store all events a second time, is to keep a counter per second over the last minute, which would then allow us to report the one-minute average with reasonable accuracy and a fixed memory overhead.

I tried to keep it simple and reuse the information we already have, but I can also implement the more accurate, but also more complex, approach if you think it's worth it.
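A minimal sketch of the per-second counter idea described above, assuming a fixed 60-bucket window protected by a mutex; the type and method names are hypothetical, not the actual implementation:

```go
package flowrate

import (
	"sync"
	"time"
)

// secondBuckets keeps one counter per second for the last minute, giving a
// one-minute flow-rate average with fixed memory.
type secondBuckets struct {
	mu      sync.Mutex
	buckets [60]uint64
	lastSec int64 // unix second of the most recent increment
}

// Inc records one flow at time now.
func (s *secondBuckets) Inc(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	sec := now.Unix()
	// Zero out buckets for any seconds skipped since the last increment so
	// that counts from a minute ago are not reused.
	for t := s.lastSec + 1; t <= sec && t-s.lastSec <= 60; t++ {
		s.buckets[t%60] = 0
	}
	s.lastSec = sec
	s.buckets[sec%60]++
}

// Rate returns the average flows/s over the last minute. It underestimates
// during the first minute after startup, which is acceptable for a status readout.
func (s *secondBuckets) Rate(now time.Time) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var total uint64
	sec := now.Unix()
	for i := int64(0); i < 60; i++ {
		t := sec - i
		if t > s.lastSec {
			continue // nothing was recorded after the last increment
		}
		if s.lastSec-t >= 60 {
			break // older buckets may hold stale data; stop counting
		}
		total += s.buckets[t%60]
	}
	return float64(total) / 60
}
```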

@kaworu (Member) commented Sep 25, 2023

[…] If all events in the ring buffer happened in the last minute, we calculate the rate since the oldest flow event in the ring buffer. Not quite accurate, but IMO probably close enough.

Ah thanks, I missed that; it's better than I thought then 👍

A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt that was unrelated to this PR. I can add that as well, though. I'm surprised we don't expose such a counter already.

👍

My first intuition was also to count and compute the rate separately. However, that isn't as straightforward as it seems. AFAIK we can't just reuse a Prometheus counter, as it really is just a counter and can't report how many of the events happened in the last minute. So we'd need to construct a separate data structure to keep track of this. The easiest one I could come up with, without having to store all events a second time, is to keep a counter per second over the last minute, which would then allow us to report the one-minute average with reasonable accuracy and a fixed memory overhead.

Maybe there's a more appropriate Prometheus type than a Counter, or we could use exponentially decaying counters?

I tried to keep it simple and reuse the information we already have, but I can also implement the more accurate, but also more complex, approach if you think it's worth it.

Makes complete sense, let's wait for more feedback to see where we want to go, cc @rolinh @glibsm

@lambdanis (Contributor) commented

  3. As @rolinh suggested I think we should report the flow rate through metrics in addition to hubble status.

A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt that was unrelated to this PR. I can add that as well, though. I'm surprised we don't expose such a counter already.

There is hubble_flows_processed_total, counting all processed flows. I believe it provides what's needed: it can be used to compute the flow rate in Prometheus. One issue is that it has to be explicitly enabled, like all Hubble metrics, which might not be obvious. So having the rate in Hubble status too is definitely useful; I would just make sure that it's consistent with what the metric exposes in all cases.
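(For reference, once that metric is enabled, the rate can be derived with a standard PromQL expression along the lines of `sum(rate(hubble_flows_processed_total[1m]))`; the exact aggregation and label selectors depend on the deployment.)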

@chancez (Contributor) commented Sep 25, 2023

While I like the idea, I'm not sure how useful this will be. Generally you want to know the rate over time, since it will likely change quite a bit based on deploys, etc. I would almost rather make the client responsible, because it's hard to provide the "correct" value here: it depends heavily on the use case, which the client knows much better than the server.

If we did want to do this, I agree that we could probably do it without reading the ring buffer and instead maintain some state for this purpose. The hubble_flows_processed_total metric would provide what we need, but Prometheus metrics don't expose a way to retrieve the stored value, so it's probably simpler to just add another atomic counter for this purpose.

Calculating the rate is a bit tricky, since we need to track the counter value over time (probably via a goroutine timer) and then calculate the rate by comparing previous values to newer values (computing the deltas).
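A compact Go sketch of that counter-plus-timer approach, assuming a single background goroutine samples the counter and the result is stored as float bits for atomic reads; the names are illustrative, not the actual implementation:

```go
package flowrate

import (
	"math"
	"sync/atomic"
	"time"
)

// deltaRate derives flows/s from the delta between consecutive samples of an
// atomic counter.
type deltaRate struct {
	flows atomic.Uint64 // incremented once per processed flow
	prev  uint64        // counter value at the previous sample
	rate  atomic.Uint64 // last computed flows/s, stored as math.Float64bits
}

// Observe records a single processed flow.
func (d *deltaRate) Observe() { d.flows.Add(1) }

// run samples the counter on every tick; call it from one background goroutine.
func (d *deltaRate) run(interval time.Duration) {
	for range time.Tick(interval) {
		cur := d.flows.Load()
		perSec := float64(cur-d.prev) / interval.Seconds()
		d.prev = cur
		d.rate.Store(math.Float64bits(perSec))
	}
}

// Rate returns the most recently computed flows/s.
func (d *deltaRate) Rate() float64 { return math.Float64frombits(d.rate.Load()) }
```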

@glrf (Contributor, Author) commented Sep 26, 2023

So in this case we'd first need to decide if we really want to implement this feature.

My 2 cents: I see the point that this has questionable benefits for cluster operators and that they are better off using metrics. However, we already report Flows/s in hubble status, which is currently just the average over the complete uptime. I think we should try to improve that reported number. I don't think it needs to be 100% accurate, but it should be something a bit closer to the actual rate. That would help in demos and give a better first impression.
But if the consensus is we don't need that, feel free to close this PR and cilium/hubble#952 as won't do.

Next, if we want to implement this, we need to agree on an implementation. I agree the current approach isn't particularly elegant, but it's simple and doesn't introduce any overhead for the write path. I'm happy to implement this in a more elegant way, but I don't want to over-engineer this.
So I'd like to have a consensus on what the implementation needs to do. Should it report the rate averaged over a minute or something else? Should it report multiple rates to show a trend or not? Should it be more performant? Should it be more accurate?

@chancez (Contributor) commented Sep 26, 2023

Should it report multiple rates to show a trend or not?

I kinda like this idea. Like the loadavg. We could show the 1/5/15 minute flow rates (or similar). What do people think?

@lambdanis (Contributor) commented

I see the point that this has questionable benefits for cluster operators and that they are better off using metrics. However, we already report Flows/s in hubble status, which is currently just the average over the complete uptime. I think we should try to improve that reported number. I don't think it needs to be 100% accurate, but it should be something a bit closer to the actual rate. That would help in demos and give a better first impression.

IMO it's worth having this number in Hubble status. It can be helpful when metrics are disabled or the metrics system is down, or when you just want to get the live rate from the CLI without caring about the whole metrics pipeline and PromQL.

Next, if we want to implement this, we need to agree on an implementation. I agree the current approach isn't particularly elegant, but it's simple and doesn't introduce any overhead for the write path. I'm happy to implement this in a more elegant way, but I don't want to over-engineer this.
So I'd like to have a consensus on what the implementation needs to do. Should it report the rate averaged over a minute or something else? Should it report multiple rates to show a trend or not? Should it be more performant? Should it be more accurate?

The 1-minute average makes sense to me, it seems reasonable for reporting the live rate. Multiple averages can be helpful too, although then reading rates from the ring buffer can become very confusing. So I would rather implement these with timers, although this smells like over-engineering the problem to me.

@michi-covalent (Contributor) commented

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 28, 2023
@aanm aanm added dont-merge/waiting-for-review and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Sep 29, 2023
@aanm aanm requested a review from lambdanis September 29, 2023 18:53
@aanm aanm added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/waiting-for-review labels Sep 29, 2023
@lmb (Contributor) commented Oct 2, 2023

Please resolve all conversations, otherwise I can't merge the change.

@lmb lmb added dont-merge/bad-bot To prevent MLH from marking ready-to-merge. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Oct 2, 2023
@kaworu (Member) commented Oct 2, 2023

Please resolve all conversations, otherwise I can't merge the change.

Looks like all conversations are resolved, marking back as ready-to-merge.

@kaworu kaworu added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/bad-bot To prevent MLH from marking ready-to-merge. labels Oct 2, 2023
@lmb lmb merged commit 81adc9c into cilium:main Oct 2, 2023
60 of 61 checks passed
Labels
kind/community-contribution, ready-to-merge, release-note/minor, sig/hubble
7 participants