Gorouter emits way too many metrics (metric frequency proportional to request frequency) #159
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/137018375 The labels on this GitHub issue will be updated when the story is started.
Hello @holgero, sorry to hear about your metrics system falling over. The currently recommended approach is to use a nozzle in front of the Loggregator firehose to filter/batch/count metrics before sending them to your metrics analysis/warehouse. The gorouter already emits many metrics, and we are concerned about adding additional counters for two reasons:
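(For illustration, a minimal sketch of such a nozzle, assuming the noaa consumer and sonde-go events packages; the Doppler URL, token, subscription ID, metric-name filter, and 30-second window are placeholders, and a real nozzle would forward the summaries to a metrics warehouse instead of printing them.)

```go
package main

import (
	"crypto/tls"
	"fmt"
	"strings"
	"time"

	"github.com/cloudfoundry/noaa/consumer"
	"github.com/cloudfoundry/sonde-go/events"
)

func main() {
	// Placeholder endpoint and token; a real nozzle would obtain these from UAA/CF.
	cons := consumer.New("wss://doppler.example.com:443", &tls.Config{InsecureSkipVerify: true}, nil)
	msgs, errs := cons.Firehose("latency-aggregator", "bearer PLACEHOLDER-TOKEN")

	type agg struct {
		count int64
		sum   float64
	}
	window := map[string]*agg{} // metric name -> aggregate for the current window
	ticker := time.NewTicker(30 * time.Second)

	for {
		select {
		case env := <-msgs:
			// Only aggregate the gorouter latency metrics discussed in this issue;
			// everything else would be passed through unchanged (omitted here).
			if env.GetEventType() != events.Envelope_ValueMetric || env.GetOrigin() != "gorouter" {
				continue
			}
			vm := env.GetValueMetric()
			name := vm.GetName()
			if name != "route_lookup_time" && !strings.HasPrefix(name, "latency") {
				continue
			}
			a, ok := window[name]
			if !ok {
				a = &agg{}
				window[name] = a
			}
			a.count++
			a.sum += vm.GetValue()
		case err := <-errs:
			fmt.Println("firehose error:", err)
		case <-ticker.C:
			// Emit one summary per metric per window instead of one point per request.
			for name, a := range window {
				fmt.Printf("%s: count=%d avg=%.2f\n", name, a.count, a.sum/float64(a.count))
			}
			window = map[string]*agg{}
		}
	}
}
```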
You've got me thinking though; it's a reasonable argument that emitting counters means that metrics data doesn't increase with scale. I'd be open to the idea as long as it came with no impact on router performance. @shashwathi Thoughts?
@Nino-K and I discussed this idea and came up with the following:
@shalako @abbyachau How should we proceed? Regards
I'll prioritize the story to explore the answers to the questions we have.
Hi @holgero, given the age of this issue we are considering closing it. Let us know if this is still an issue you are facing. Feel free to reopen this issue and let us know. Thanks!
I stumbled upon this issue while analysing the amount of logging we had from our PCF installation. Thanks,
Hi all, I’d like to restart the discussion on this topic, since these metrics are causing enormous load on our Loggregator system. Previously, we were able to filter the aforementioned metrics in our nozzle. However, our landscape has grown and the amount of metrics is threatening Loggregator stability and causing significant costs. As @metskem’s comment shows, there are other installations that have issues with these metrics as well. I was wondering whether it’d be possible to reduce the frequency of these metrics being emitted or to allow disabling them entirely. Thanks!
Adding to my previous comment to address any performance concerns: Since it's already possible to get detailed and aggregated latency info from the Thanks and cheers,
Hi @jkbschmid, I am supportive of either (1) allowing users to configure the frequency with which these metrics are emitted or (2) allowing a user to turn them off entirely. My preference is option 1 (which could also include option 2 in it!). However, our team does not have the bandwidth for this work at the moment. If you are willing and able to PR it in, it would be much appreciated. Thanks,
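(As an illustration of what options 1 and 2 could look like inside gorouter, here is a hedged sketch; the type, field, and method names are hypothetical and not gorouter's actual metrics API.)

```go
package metrics

import "sync/atomic"

// LatencySender is a stand-in for whatever component emits the per-request
// ValueMetric today; the name is hypothetical.
type LatencySender interface {
	SendLatency(name string, valueMs float64)
}

// SampledReporter wraps a sender and supports both options discussed above:
// turning per-request metrics off entirely, or thinning them out.
type SampledReporter struct {
	Disabled  bool   // option 2: drop per-request metrics completely
	EmitEvery uint64 // option 1: forward only every Nth sample (1 = current behaviour)
	sender    LatencySender
	seen      uint64
}

func NewSampledReporter(s LatencySender, disabled bool, emitEvery uint64) *SampledReporter {
	if emitEvery == 0 {
		emitEvery = 1
	}
	return &SampledReporter{Disabled: disabled, EmitEvery: emitEvery, sender: s}
}

// CaptureRoutingResponseLatency mirrors the kind of per-request hook gorouter
// calls today (the real method name may differ).
func (r *SampledReporter) CaptureRoutingResponseLatency(valueMs float64) {
	if r.Disabled {
		return
	}
	if atomic.AddUint64(&r.seen, 1)%r.EmitEvery != 0 {
		return
	}
	r.sender.SendLatency("latency", valueMs)
}
```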
Hi @ameowlia,
I am currently looking into this and I saw that there are two more envelopes emitted per request, which are even much larger (in bytes) than the simple metrics mentioned above: https://github.com/cloudfoundry/gorouter/blob/main/handlers/httpstartstop.go#L63 Would it make sense to also switch off sending these envelopes with the same config used to switch off the per-request metrics? @ameowlia: Do you know what these envelopes may be used for?
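(A rough, hypothetical sketch of what such a switch could look like around the handler; the EmitHTTPStartStop field and emitEnvelope callback are placeholders, not the actual gorouter/dropsonde API.)

```go
package handlers

import "net/http"

// Config is a hypothetical stand-in for gorouter's configuration struct.
type Config struct {
	// EmitHTTPStartStop would control whether the per-request HttpStartStop
	// envelope is sent at all (the real property name may differ).
	EmitHTTPStartStop bool
}

type httpStartStopHandler struct {
	cfg          Config
	next         http.Handler
	emitEnvelope func(r *http.Request, statusCode int) // placeholder for the actual emit call
}

func (h *httpStartStopHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if !h.cfg.EmitHTTPStartStop {
		// Skip the envelope entirely; the request is still proxied as usual.
		h.next.ServeHTTP(w, r)
		return
	}
	rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
	h.next.ServeHTTP(rec, r)
	h.emitEnvelope(r, rec.status)
}

// statusRecorder captures the response status so it can be attached to the envelope.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}
```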
cloudfoundry#159 The metrics latency, latency.component and route_lookup_time are sent for each request and cause significant load on gorouter and loggregator. In our measurements the CPU load was reduced by 40% in both gorouter and doppler VMs. This is a huge gain at large scale. Note that the latency metric is still available at the varz endpoint.
cloudfoundry#159 Switching off these events reduces the CPU load on gorouter and doppler VMs significantly. These events are of type "timer".
cloudfoundry#159 Switching off these events reduces the CPU load on gorouter and doppler VMs significantly. These events are of type "timer". Note that they should not be switched off when the app-autoscaler is used.
Four years later we released a PR fix from the community: https://github.com/cloudfoundry/routing-release/releases/tag/0.213.0 Thanks for your patience 😅
When we put some load on our Cloud Foundry installation, we noticed an increase in the number of reported metrics, which managed to overload our Riemann installation.
Digging into the issue, I found out that the bulk of the reported metrics is reported by the gorouter under the names `latency` and `route_lookup_time`. Looking into the code revealed that these metrics (and `latency.<component>`) are reported for each request that is handled by the gorouter. This means that when the traffic increases, the number of reported metrics also increases. And if you are unlucky, it leads to a situation where your monitoring fails at the moment when you need it the most.
The proposal: Wouldn't it be better to instead report only an average (perhaps min/max/avg) of these times at fixed intervals (like once or twice a minute)? That way the number of metrics reported stays constant when incoming requests increase.
What do you think, would it make sense to prepare a pull request for this?
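(To make the proposal concrete, a minimal sketch of such an interval-based aggregator; the metric names, interval, and the send function are placeholders, not gorouter's actual metrics API.)

```go
package metrics

import (
	"math"
	"sync"
	"time"
)

// sendMetric is a placeholder for whatever actually ships a ValueMetric
// to Loggregator inside gorouter.
type sendMetric func(name string, valueMs float64)

// IntervalAggregator records every request latency but emits only
// min/avg/max once per interval, so the emission rate stays constant
// regardless of the request rate.
type IntervalAggregator struct {
	mu            sync.Mutex
	count         int64
	sum, min, max float64
	send          sendMetric
}

func NewIntervalAggregator(send sendMetric, interval time.Duration) *IntervalAggregator {
	a := &IntervalAggregator{send: send, min: math.MaxFloat64}
	go func() {
		for range time.Tick(interval) {
			a.flush()
		}
	}()
	return a
}

// Record is called once per request but emits nothing by itself.
func (a *IntervalAggregator) Record(latencyMs float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.count++
	a.sum += latencyMs
	a.min = math.Min(a.min, latencyMs)
	a.max = math.Max(a.max, latencyMs)
}

// flush sends the window summary and resets the counters.
func (a *IntervalAggregator) flush() {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.count == 0 {
		return
	}
	a.send("latency.min", a.min)
	a.send("latency.avg", a.sum/float64(a.count))
	a.send("latency.max", a.max)
	a.count, a.sum, a.min, a.max = 0, 0, math.MaxFloat64, 0
}
```

With something like this, a few values per interval replace one value per request, which is what keeps the metric volume flat under load.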