Add flows per second information to Hubble status #28205
Conversation
a0f4ae8 to b5247c2 (force-pushed)
Looks good to me, but I'm not sure about all edge cases, so I'll leave it for someone else to review.

> Lastly, this implementation assumes that entries in the ring buffer are in order. I'm not entirely sure this is correct. If not I would need to slightly change the implementation.

I believe the observer `GetFlows` implementation relies on the same assumption. Someone correct me if I'm wrong, but for an approximate rate I think it's safe to assume that.
We add a `flows_rate` field to the Hubble `ServerStatus` that returns the approximate rate of seen flows per second over the last minute. It's "approximate" as we calculate the rate by counting all flow events in the ring buffer that happened in the last minute. If all events in the ring buffer happened in the last minute, we can't calculate the rate over the last minute, so we calculate the rate since the oldest flow event in the ring buffer. Signed-off-by: Fabian Fischer <fabian.fischer@isovalent.com>
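For illustration, a minimal sketch of the ring-buffer counting described above could look like the following. The `ringEntry` type and `flowsRate` function are hypothetical stand-ins, not the actual Hubble types or the code in this PR, and it assumes entries are ordered newest to oldest.

```go
package flowrate

import "time"

// ringEntry is a hypothetical stand-in for a buffered flow event.
type ringEntry struct {
	timestamp time.Time
}

// flowsRate counts the flow events that happened in the last minute and
// divides by the observed window. If every buffered event is younger than
// one minute, the window shrinks to the age of the oldest buffered event.
// Entries are assumed to be ordered newest to oldest.
func flowsRate(entries []ringEntry, now time.Time) float64 {
	if len(entries) == 0 {
		return 0
	}
	cutoff := now.Add(-time.Minute)
	count := 0
	oldest := now
	for _, e := range entries {
		if e.timestamp.Before(cutoff) {
			break // everything past this point is older than one minute
		}
		count++
		oldest = e.timestamp
	}
	window := time.Minute
	if count == len(entries) {
		// The whole buffer is younger than a minute: rate since the oldest event.
		window = now.Sub(oldest)
	}
	if window <= 0 {
		return 0
	}
	return float64(count) / window.Seconds()
}
```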
b5247c2 to 4018408 (force-pushed)
Thanks for the PR @glrf, left a small comment about multi-node context but overall LGTM. Also great to see the added tests 👍 Some general comments:

- Although Flows per second implementation hubble#952 suggests reporting the flow rate over the last minute in the Solution section, I'm unsure whether it will be useful enough to deduce a trend in most cases. IMO something akin to the UNIX load average would be more useful.
- Related to 1., this implementation is fundamentally limited to reporting at most `hubble-event-buffer-capacity` flows, which is 4095 by default. Assuming all the ring buffer flows are within the last minute, we'd report ~68 flows/sec, which seems like a low ceiling to me.
- As @rolinh suggested, I think we should report the flow rate through metrics in addition to `hubble status`.

Hooking into `OnDecodedFlow` instead of counting from the ring buffer would be both more flexible (i.e. count over the last x, y, z minutes like loadavg) and less sensitive to `hubble-event-buffer-capacity`, at the cost of a small overhead for every flow. Maybe we could even use (or reuse) a Prometheus Counter directly, although I'm not familiar enough with the metrics subsystem to say for sure (e.g. what happens when Hubble metrics are disabled in that case). Thoughts? cc @lambdanis @chancez
Not quite. If all events in the ring buffer happened in the last minute, we calculate the rate since the oldest flow event in the ring buffer. Not quite accurate, but IMO probably close enough.
A metric for that should be trivial as Prometheus handles the rate calculation for us, but I felt like that seemed unrelated. But I can add that as well. I'm surprised we don't expose such a counter already.
My first intuition was also to count and compute the rate separately. However, that isn't as straightforward as it seems. AFAIK we can't just reuse a Prometheus counter, as it really is just a counter and can't report how many of the events happened in the last minute. So we'd need to construct a separate data structure to keep track of this. The easiest one I could come up with without having to store all events a second time is to have a counter per second of the last minute, which would allow us to report the average over one minute with reasonable accuracy and a fixed memory overhead. I tried to keep it simple and reuse the information we already have, but I can also implement the more accurate, but also more complex, approach if you think it's worth it.
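For reference, a rough sketch of the "counter per second of the last minute" idea described above; the `secondBuckets` name and locking scheme are illustrative assumptions, not code from this PR.

```go
package flowrate

import (
	"sync"
	"time"
)

// secondBuckets keeps one counter per second of the last minute, giving a
// fixed memory overhead regardless of flow volume.
type secondBuckets struct {
	mu      sync.Mutex
	buckets [60]uint64 // events counted in each one-second slot
	stamps  [60]int64  // unix second each slot currently represents
}

// Inc records one flow event at time t (e.g. called from the flow decode path).
func (s *secondBuckets) Inc(t time.Time) {
	sec := t.Unix()
	i := sec % 60
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stamps[i] != sec {
		// Slot still holds a count from a previous minute: reset before reuse.
		s.buckets[i] = 0
		s.stamps[i] = sec
	}
	s.buckets[i]++
}

// Rate returns the average flows per second over the last minute.
func (s *secondBuckets) Rate(now time.Time) float64 {
	cutoff := now.Unix() - 60
	s.mu.Lock()
	defer s.mu.Unlock()
	var total uint64
	for i, stamp := range s.stamps {
		if stamp > cutoff {
			total += s.buckets[i]
		}
	}
	return float64(total) / 60
}
```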
Ah thanks I missed that, it's better than I thought then 👍
👍
Maybe there's a more appropriate Prometheus type than a Counter, or we could use exponentially decaying counters?
Makes complete sense, let's wait for more feedback to see where we want to go, cc @rolinh @glibsm
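As a rough illustration of the exponentially decaying counter idea mentioned above (a loadavg-style average), a sketch under assumed names and an assumed tick interval could look like this; it is not based on anything in Hubble.

```go
package flowrate

import (
	"math"
	"sync/atomic"
	"time"
)

// ewmaRate keeps an exponentially decaying flows-per-second average,
// similar in spirit to the UNIX load average.
type ewmaRate struct {
	count uint64  // flows seen since the last tick, incremented on the write path
	rate  float64 // decayed flows-per-second estimate, updated only by the ticker
	alpha float64 // smoothing factor derived from the averaging window
}

// newEWMARate derives the smoothing factor for a given averaging window
// (e.g. 1m, 5m, 15m) and tick interval (e.g. 5s).
func newEWMARate(window, tick time.Duration) *ewmaRate {
	return &ewmaRate{alpha: 1 - math.Exp(-tick.Seconds()/window.Seconds())}
}

// Inc records one decoded flow.
func (e *ewmaRate) Inc() { atomic.AddUint64(&e.count, 1) }

// tick folds the flows seen since the last tick into the decayed average;
// it would be driven by a time.Ticker in a background goroutine.
func (e *ewmaRate) tick(interval time.Duration) {
	n := atomic.SwapUint64(&e.count, 0)
	instant := float64(n) / interval.Seconds()
	e.rate += e.alpha * (instant - e.rate)
}
```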
There is `hubble_flows_processed_total` counting all processed flows. I believe it provides what's needed; it can be used to compute the flows rate in Prometheus. One issue is that it has to be explicitly enabled like all Hubble metrics, which might not be obvious. So having the rate in Hubble status too is definitely useful, I would just make sure that it's consistent with what the metric exposes in all cases.
While I like the idea, I'm not sure how useful this will be. Generally you want to know the rate over time since it will likely change quite a bit based on deploys, etc. I would almost rather make the client responsible, because it's hard to actually provide the "correct" value here, as it depends heavily on the use-case, which the client knows much better than the server.

If we did want to do this, I agree that we could probably do it without reading the ring buffer and maintain some state for these purposes. Calculating the rate is a bit tricky, since we need to track the counter value over time (probably via a goroutine timer), and then we can calculate the rate by comparing previous values to newer values (and computing the deltas).
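A minimal sketch of that snapshot-and-delta approach, assuming a hypothetical `rateTracker` driven by a background goroutine (not actual Cilium code; synchronization around reading the computed rate is omitted for brevity):

```go
package flowrate

import (
	"sync/atomic"
	"time"
)

// rateTracker derives a flows-per-second rate from periodic snapshots of a
// cumulative counter, instead of reading the ring buffer.
type rateTracker struct {
	total    atomic.Uint64 // cumulative flow count, incremented on the write path
	lastSeen uint64        // counter value at the previous sample
	lastTime time.Time     // when the previous sample was taken
	rate     float64       // most recently computed flows per second
}

// run samples the counter on a timer and computes the rate from the delta
// between consecutive samples.
func (r *rateTracker) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	r.lastTime = time.Now()
	for {
		select {
		case now := <-ticker.C:
			cur := r.total.Load()
			elapsed := now.Sub(r.lastTime).Seconds()
			if elapsed > 0 {
				r.rate = float64(cur-r.lastSeen) / elapsed
			}
			r.lastSeen, r.lastTime = cur, now
		case <-stop:
			return
		}
	}
}
```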
So in this case we'd first need to decide if we really want to implement this feature. My 2 cents: I see the point that this has questionable benefits for cluster operators and they are better off using metrics. However, we already report Flows/s in

Next, if we want to implement this, we need to agree on an implementation. I agree the current approach isn't particularly elegant, but it's simple and doesn't introduce any overhead for the write path. I'm happy to implement this in a more elegant way, but I don't want to over-engineer this.
I kinda like this idea. Like the loadavg. We could show the 1/5/15 minute flow rates (or similar). What do people think?
IMO it's worth having this number in Hubble status. It can be helpful when metrics are disabled or the metrics system is down, or you just want to get the live rate from the CLI without caring about the whole metrics pipeline and promql.
The 1-minute average makes sense to me, it seems reasonable for reporting the live rate. Multiple averages can be helpful too, although then reading rates from the ring buffer can become very confusing. So I would rather implement these with timers, although this smells like over-engineering the problem to me.
/test
Please resolve all conversations, otherwise I can't merge the change.
Looks like all conversations are resolved, marking back as
Please ensure your pull request adheres to the following guidelines: include a description and a `Fixes: #XXX` line if the commit addresses a particular GitHub issue.
Preparation for cilium/hubble#952. This PR adds a new field `flows_rate` to the Hubble `ServerStatus` that returns the approximate rate of seen flows per second over the last minute. This allows us to report more up-to-date flows-per-second information in `hubble status`.

It's "approximate" as we calculate the rate by counting all flow events in the ring buffer that happened in the last minute. If all events in the ring buffer happened in the last minute, we can't calculate the rate over the last minute, so we calculate the rate since the oldest flow event in the ring buffer.

There are other ways to implement this. For example, we could also add a field that returns the number of seen events in the last minute, but that would again complicate things if not all events fit in the ring buffer. That's why I thought server-side rate calculation would be the easiest.
Lastly, this implementation assumes that entries in the ring buffer are in order. I'm not entirely sure this is correct. If not I would need to slightly change the implementation.