
Add flows per second information to Hubble status #28205

Merged
1 commit merged into cilium:main on Oct 2, 2023

Conversation

@glrf (Contributor) commented Sep 18, 2023

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

Preparation for cilium/hubble#952. This PR adds a new field flows_rate to the Hubble ServerStatus that returns the approximate rate of seen flows per second over the last minute. This allows us to report more up-to-date flows-per-second information in hubble status.

It's "approximate" as we calculate the rate by counting all flow events in the ring buffer that happened in the last minute. If all events in the ring buffer happened in the last minute, the buffer may not cover the full minute (older events may already have been evicted), so we calculate the rate since the oldest flow event in the ring buffer instead.

There are other ways to implement this. For example, we could also add a field that returns the number of seen events in the last minute, but that would again complicate things if not all events fit in the ring buffer. That's why I thought server-side rate calculation would be the easiest.

Lastly, this implementation assumes that entries in the ring buffer are in order. I'm not entirely sure this is correct. If not I would need to slightly change the implementation.
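For illustration, a minimal Go sketch of the approach described above, assuming the buffered flow timestamps can be read oldest-first; approxFlowRate and the plain timestamp slice are hypothetical stand-ins, not the actual Hubble observer API:

```go
package main

import (
	"fmt"
	"time"
)

// approxFlowRate estimates flows/s over the given window by counting buffered
// flows newer than now-window. When every buffered flow falls inside the
// window (i.e. the buffer may not cover a full window), it falls back to the
// span since the oldest buffered flow.
func approxFlowRate(timestamps []time.Time, now time.Time, window time.Duration) float64 {
	if len(timestamps) == 0 {
		return 0
	}
	cutoff := now.Add(-window)
	count := 0
	// Timestamps are assumed to be in chronological order, oldest first.
	for i := len(timestamps) - 1; i >= 0; i-- {
		if timestamps[i].Before(cutoff) {
			break
		}
		count++
	}
	span := window
	if count == len(timestamps) {
		// The whole buffer is inside the window: use the span since the
		// oldest buffered flow instead of the full window.
		span = now.Sub(timestamps[0])
	}
	if span <= 0 {
		return 0
	}
	return float64(count) / span.Seconds()
}

func main() {
	now := time.Now()
	flows := []time.Time{now.Add(-90 * time.Second), now.Add(-30 * time.Second), now.Add(-10 * time.Second)}
	fmt.Printf("~%.2f flows/s over the last minute\n", approxFlowRate(flows, now, time.Minute))
}
```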

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 18, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Sep 18, 2023
@glrf glrf marked this pull request as ready for review September 19, 2023 08:31
@glrf glrf requested review from a team as code owners September 19, 2023 08:31
@glrf glrf requested a review from kaworu September 19, 2023 08:31
@lambdanis lambdanis added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels Sep 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Sep 20, 2023
@lambdanis (Contributor) left a comment

Looks good to me, but I'm not sure about all edge cases, so I'll leave it for someone else to review.

Lastly, this implementation assumes that entries in the ring buffer are in order. I'm not entirely sure this is correct. If not I would need to slightly change the implementation.

I believe the observer GetFlows implementation relies on the same assumption. Someone correct me if I'm wrong, but for an approximate rate I think it's safe to assume that.

api/v1/observer/observer.proto: review thread (outdated, resolved)
pkg/hubble/observer/local_observer.go: review thread (outdated, resolved)
@lambdanis lambdanis added kind/community-contribution This was a contribution made by a community member. and removed kind/community-contribution This was a contribution made by a community member. labels Sep 20, 2023
@lambdanis lambdanis self-assigned this Sep 21, 2023
We add a `flows_rate` field to the Hubble `ServerStatus` that returns
the approximate rate of seen flows per second over the last minute.

It's "approximate" as we calculate the rate by counting all flow events
in the ring buffer that happened in the last minute. If all events in
the ring buffer happened in the last minute, we can't calculate the
rate over the last minute, so we calculate the rate since the oldest
flow event in the ring buffer.

Signed-off-by: Fabian Fischer <fabian.fischer@isovalent.com>
@kaworu (Member) left a comment

Thanks for the PR @glrf, I left a small comment about multi-node context but overall LGTM. Also great to see the added tests 👍 Some general comments:

  1. Although Flows per second implementation (hubble#952) suggests reporting the flow rate over the last minute in its Solution section, I'm unsure whether it will be useful enough to deduce a trend in most cases. IMO something akin to the UNIX load average would be more useful.
  2. Related to 1., this implementation is fundamentally limited to reporting at most hubble-event-buffer-capacity flows (4095 by default). Assuming all the ring buffer flows are within the last minute, we'll report at most ~68 flows/sec, which seems like a low ceiling to me.
  3. As @rolinh suggested I think we should report the flow rate through metrics in addition to hubble status.

Hooking into OnDecodedFlow instead of counting from the ring buffer would be both more flexible (i.e. count over the last x, y, z minutes like loadavg) and less sensitive to hubble-event-buffer-capacity, at the cost of a small overhead for every flow. Maybe we could even use (or reuse) a Prometheus Counter directly, although I'm not familiar enough with the metrics subsystem to say for sure (e.g. what happens when Hubble metrics are disabled in that case). Thoughts? cc @lambdanis @chancez
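For reference, a rough Go sketch of the loadavg-style idea: a per-flow hook (e.g. OnDecodedFlow) increments a counter, and a background ticker folds the per-interval delta into exponentially decaying 1/5/15-minute averages. All names here are illustrative assumptions, not Hubble's metrics or observer API:

```go
package flowrate

import (
	"math"
	"sync/atomic"
	"time"
)

// ewmaRates keeps UNIX-loadavg-style exponentially decaying flow rates.
// Hypothetical sketch: Observe would be called from a per-flow hook, and
// tick from a single background goroutine.
type ewmaRates struct {
	flows atomic.Uint64    // total flows observed so far
	last  uint64           // counter value at the previous tick
	rates [3]float64       // 1, 5 and 15 minute averages (flows/s)
	taus  [3]time.Duration // decay time constants
}

func newEWMARates() *ewmaRates {
	return &ewmaRates{taus: [3]time.Duration{time.Minute, 5 * time.Minute, 15 * time.Minute}}
}

// Observe records a single decoded flow.
func (e *ewmaRates) Observe() { e.flows.Add(1) }

// tick folds the flows seen during the last interval into the decaying averages.
func (e *ewmaRates) tick(interval time.Duration) {
	cur := e.flows.Load()
	sample := float64(cur-e.last) / interval.Seconds() // instantaneous flows/s
	e.last = cur
	for i, tau := range e.taus {
		alpha := 1 - math.Exp(-interval.Seconds()/tau.Seconds())
		e.rates[i] += alpha * (sample - e.rates[i])
	}
}
```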

api/v1/observer/observer.proto: review thread (resolved)
pkg/hubble/observer/local_observer.go: review thread (resolved)
@glrf (Contributor, Author) commented Sep 25, 2023

  2. Related to 1., this implementation is fundamentally limited to reporting at most hubble-event-buffer-capacity flows (4095 by default). Assuming all the ring buffer flows are within the last minute, we'll report at most ~68 flows/sec, which seems like a low ceiling to me.

Not quite. If all events in the ring buffer happened in the last minute, we calculate the rate since the oldest flow event in the ring buffer. Not quite accurate, but IMO probably close enough.

  3. As @rolinh suggested I think we should report the flow rate through metrics in addition to hubble status.

A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt that was unrelated to this PR. I can add that as well, though. I'm surprised we don't expose such a counter already.

Hooking into OnDecodedFlow instead of counting from the ring buffer would be both more flexible (i.e. count over the last x, y, z minutes like loadavg) and less sensitive to hubble-event-buffer-capacity, at the cost of a small overhead for every flow. Maybe we could even use (or reuse) a Prometheus Counter directly, although I'm not familiar enough with the metrics subsystem to say for sure (e.g. what happens when Hubble metrics are disabled in that case). Thoughts?

My first intuition was also to count and compute the rate separately. However, that isn't as straightforward as it seems. AFAIK we can't just reuse a Prometheus counter, as it really is just a counter and can't report how many of the events happened in the last minute. So we'd need to construct a separate data structure to keep track of this. The easiest one I could come up with, without having to store all events a second time, is to keep a counter per second over the last minute, which would then allow us to report the one-minute average with reasonable accuracy and a fixed memory overhead.

I tried to keep it simple and reuse the information we already have, but I can also implement the more accurate, but also more complex, approach if you think it's worth it.
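A minimal sketch of the per-second counter idea described above, assuming a fixed 60-bucket window protected by a mutex; the type and method names are hypothetical, not the actual implementation:

```go
package flowrate

import (
	"sync"
	"time"
)

// secondBuckets keeps one counter per second for the last minute, giving a
// one-minute flow-rate average with fixed memory.
type secondBuckets struct {
	mu      sync.Mutex
	buckets [60]uint64
	lastSec int64 // unix second of the most recent increment
}

// Inc records one flow at time now.
func (s *secondBuckets) Inc(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	sec := now.Unix()
	// Zero out buckets for any seconds skipped since the last increment so
	// that counts from a minute ago are not reused.
	for t := s.lastSec + 1; t <= sec && t-s.lastSec <= 60; t++ {
		s.buckets[t%60] = 0
	}
	s.lastSec = sec
	s.buckets[sec%60]++
}

// Rate returns the average flows/s over the last minute. It underestimates
// during the first minute after startup, which is acceptable for a status readout.
func (s *secondBuckets) Rate(now time.Time) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var total uint64
	sec := now.Unix()
	for i := int64(0); i < 60; i++ {
		t := sec - i
		if t > s.lastSec {
			continue // nothing was recorded after the last increment
		}
		if s.lastSec-t >= 60 {
			break // older buckets may hold stale data; stop counting
		}
		total += s.buckets[t%60]
	}
	return float64(total) / 60
}
```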

@kaworu (Member) commented Sep 25, 2023

[…] If all events in the ring buffer happened in the last minute, we calculate the rate since the oldest flow event in the ring buffer. Not quite accurate, but IMO probably close enough.

Ah thanks, I missed that; it's better than I thought then 👍

A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt that was unrelated to this PR. I can add that as well, though. I'm surprised we don't expose such a counter already.

👍

My first intuition was also to count and compute the rate separately. However, that isn't as straightforward as it seems. AFAIK we can't just reuse a Prometheus counter, as it really is just a counter and can't report how many of the events happened in the last minute. So we'd need to construct a separate data structure to keep track of this. The easiest one I could come up with, without having to store all events a second time, is to keep a counter per second over the last minute, which would then allow us to report the one-minute average with reasonable accuracy and a fixed memory overhead.

Maybe there's a more appropriate Prometheus type than a Counter, or we could use exponentially decaying counters?

I tried to keep it simple and reuse the information we already have, but I can also implement the more accurate, but also more complex, approach if you think it's worth it.

Makes complete sense, let's wait for more feedback to see where we want to go, cc @rolinh @glibsm

@lambdanis (Contributor) commented

  3. As @rolinh suggested I think we should report the flow rate through metrics in addition to hubble status.

A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt that was unrelated to this PR. I can add that as well, though. I'm surprised we don't expose such a counter already.

There is hubble_flows_processed_total, counting all processed flows. I believe it provides what's needed: it can be used to compute the flow rate in Prometheus. One issue is that it has to be explicitly enabled, like all Hubble metrics, which might not be obvious. So having the rate in Hubble status too is definitely useful; I would just make sure that it's consistent with what the metric exposes in all cases.
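(For reference, once that metric is enabled, the rate can be derived with a standard PromQL expression along the lines of `sum(rate(hubble_flows_processed_total[1m]))`; the exact aggregation and label selectors depend on the deployment.)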

@chancez (Contributor) commented Sep 25, 2023

While I like the idea, I'm not sure how useful this will be. Generally you want to know the rate over time, since it will likely change quite a bit based on deploys, etc. I would almost rather make the client responsible, because it's hard to provide the "correct" value here: it depends heavily on the use case, which the client knows much better than the server.

If we did want to do this, I agree that we could probably do it without reading the ring buffer and instead maintain some state for this purpose. The hubble_flows_processed_total metric would provide what we need, but Prometheus metrics don't expose a way to retrieve the stored value, so it's probably simpler to just add another atomic counter for this purpose.

Calculating the rate is a bit tricky, since we need to track the counter value over time (probably via a goroutine timer) and then calculate the rate by comparing previous values to newer values (computing the deltas).
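A compact Go sketch of that counter-plus-timer approach, assuming a single background goroutine samples the counter and the result is stored as float bits for atomic reads; the names are illustrative, not the actual implementation:

```go
package flowrate

import (
	"math"
	"sync/atomic"
	"time"
)

// deltaRate derives flows/s from the delta between consecutive samples of an
// atomic counter.
type deltaRate struct {
	flows atomic.Uint64 // incremented once per processed flow
	prev  uint64        // counter value at the previous sample
	rate  atomic.Uint64 // last computed flows/s, stored as math.Float64bits
}

// Observe records a single processed flow.
func (d *deltaRate) Observe() { d.flows.Add(1) }

// run samples the counter on every tick; call it from one background goroutine.
func (d *deltaRate) run(interval time.Duration) {
	for range time.Tick(interval) {
		cur := d.flows.Load()
		perSec := float64(cur-d.prev) / interval.Seconds()
		d.prev = cur
		d.rate.Store(math.Float64bits(perSec))
	}
}

// Rate returns the most recently computed flows/s.
func (d *deltaRate) Rate() float64 { return math.Float64frombits(d.rate.Load()) }
```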

@glrf (Contributor, Author) commented Sep 26, 2023

So in this case we'd first need to decide if we really want to implement this feature.

My 2 cents: I see the point that this has questionable benefits for cluster operators and that they are better off using metrics. However, we already report Flows/s in hubble status, which is currently just the average over the complete uptime. I think we should try to improve that reported number. I don't think it needs to be 100% accurate, but it should be something a bit closer to the actual rate. That would help in demos and give a better first impression.
But if the consensus is we don't need that, feel free to close this PR and cilium/hubble#952 as won't do.

Next, if we want to implement this, we need to agree on an implementation. I agree the current approach isn't particularly elegant, but it's simple and doesn't introduce any overhead for the write path. I'm happy to implement this in a more elegant way, but I don't want to over-engineer this.
So I'd like to have a consensus on what the implementation needs to do. Should it report the rate averaged over a minute or something else? Should it report multiple rates to show a trend or not? Should it be more performant? Should it be more accurate?

@chancez (Contributor) commented Sep 26, 2023

Should it report multiple rates to show a trend or not?

I kinda like this idea. Like the loadavg. We could show the 1/5/15 minute flow rates (or similar). What do people think?

@lambdanis (Contributor) commented

I see the point that this has questionable benefits for cluster operators and that they are better off using metrics. However, we already report Flows/s in hubble status, which is currently just the average over the complete uptime. I think we should try to improve that reported number. I don't think it needs to be 100% accurate, but it should be something a bit closer to the actual rate. That would help in demos and give a better first impression.

IMO it's worth having this number in Hubble status. It can be helpful when metrics are disabled or the metrics system is down, or when you just want to get the live rate from the CLI without caring about the whole metrics pipeline and PromQL.

Next, if we want to implement this, we need to agree on an implementation. I agree the current approach isn't particularly elegant, but it's simple and doesn't introduce any overhead for the write path. I'm happy to implement this in a more elegant way, but I don't want to over-engineer this.
So I'd like to have a consensus on what the implementation needs to do. Should it report the rate averaged over a minute or something else? Should it report multiple rates to show a trend or not? Should it be more performant? Should it be more accurate?

The 1-minute average makes sense to me, it seems reasonable for reporting the live rate. Multiple averages can be helpful too, although then reading rates from the ring buffer can become very confusing. So I would rather implement these with timers, although this smells like over-engineering the problem to me.

@michi-covalent (Contributor) commented

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 28, 2023
@aanm aanm added dont-merge/waiting-for-review and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Sep 29, 2023
@aanm aanm requested a review from lambdanis September 29, 2023 18:53
@aanm aanm added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/waiting-for-review labels Sep 29, 2023
@lmb (Contributor) commented Oct 2, 2023

Please resolve all conversations, otherwise I can't merge the change.

@lmb lmb added dont-merge/bad-bot To prevent MLH from marking ready-to-merge. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Oct 2, 2023
@kaworu (Member) commented Oct 2, 2023

Please resolve all conversations, otherwise I can't merge the change.

Looks like all conversations are resolved, marking back as ready-to-merge.

@kaworu kaworu added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/bad-bot To prevent MLH from marking ready-to-merge. labels Oct 2, 2023
@lmb lmb merged commit 81adc9c into cilium:main Oct 2, 2023
60 of 61 checks passed
Labels
kind/community-contribution, ready-to-merge, release-note/minor, sig/hubble
7 participants