Gorouter emits way too many metrics (metric frequency proportional to request frequency) #159

Closed
holgero opened this issue Jan 4, 2017 · 13 comments

Comments


holgero commented Jan 4, 2017

When we put some load on our Cloud Foundry installation, we noticed an increase in the number of metrics reported, which managed to overload our Riemann installation.
Digging into the issue, I found that the bulk of the reported metrics comes from the gorouter, under the names latency and route_lookup_time.
Looking into the code revealed that these metrics (and latency.<component>) are reported for each request the gorouter handles.
This means that as traffic increases, the number of metrics reported increases as well. And if you are unlucky, that leads to a situation where your monitoring fails at the moment you need it the most.

The proposal: Wouldn't it be better to instead report only an aggregate (perhaps min/max/avg) of these times at fixed intervals (say, once or twice a minute)? That way the number of metrics reported stays constant as incoming requests increase.
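For illustration, here is a minimal sketch (in Go, since gorouter is written in Go) of what such an interval-based aggregator could look like; the type, field, and callback names are hypothetical, not gorouter's actual code:

```go
package metrics

import (
	"sync"
	"time"
)

// latencyAggregator collects per-request latencies and emits only a
// min/max/avg summary at a fixed interval (hypothetical sketch).
type latencyAggregator struct {
	mu            sync.Mutex
	count         int64
	sum, min, max time.Duration
}

// Record is called once per request and only updates in-memory counters.
func (a *latencyAggregator) Record(d time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.count == 0 || d < a.min {
		a.min = d
	}
	if d > a.max {
		a.max = d
	}
	a.sum += d
	a.count++
}

// EmitEvery flushes one summary per interval instead of one metric per
// request. The emit callback stands in for whatever sender is in use.
func (a *latencyAggregator) EmitEvery(interval time.Duration, emit func(name string, ms float64)) {
	for range time.Tick(interval) {
		a.mu.Lock()
		count, sum, min, max := a.count, a.sum, a.min, a.max
		a.count, a.sum, a.min, a.max = 0, 0, 0, 0
		a.mu.Unlock()
		if count == 0 {
			continue
		}
		emit("latency.min", float64(min.Milliseconds()))
		emit("latency.max", float64(max.Milliseconds()))
		emit("latency.avg", float64(sum.Milliseconds())/float64(count))
	}
}
```

With, say, a 30-second interval, this emits three values per flush regardless of the request rate.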

What do you think, would it make sense to prepare a pull request for this?

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/137018375

The labels on this GitHub issue will be updated when the story is started.


shalako commented Jan 11, 2017

Hello @holgero

Sorry to hear about your metrics system falling over.

The currently recommended approach is to use a nozzle in front of the Loggregator firehose to filter/batch/count metrics before sending them to your metrics analysis/warehouse.
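As a rough illustration of that filtering approach, a nozzle boils down to dropping (or batching) envelopes by name before forwarding them; the envelope type and channels below are simplified stand-ins, not the actual firehose client API:

```go
package nozzle

// envelope is a simplified stand-in for a firehose envelope; a real
// nozzle would use the Loggregator client libraries instead.
type envelope struct {
	Name  string
	Value float64
}

// filterEnvelopes drops (or could batch/count) the noisy per-request
// metrics before forwarding the rest to the metrics warehouse.
func filterEnvelopes(in <-chan envelope, out chan<- envelope) {
	noisy := map[string]bool{
		"latency":           true,
		"route_lookup_time": true,
	}
	for e := range in {
		if noisy[e.Name] {
			continue // alternatively, fold into a local aggregate here
		}
		out <- e
	}
}
```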

The gorouter already emits many metrics, and we are concerned about adding additional counters for two reasons:

  1. The number of metrics operators want may grow unbounded. Instead of adding another metric every time an operator wants to slice the data a different way, we've taken the approach of providing raw data, so the operator can count it any way they want using nozzles or their existing metrics analysis tools.
  2. The resources required to maintain counters and do the math to emit calculated metrics, given that the router must be such a performant component.

You've got me thinking, though; it's a reasonable argument that emitting counters means the volume of metrics data doesn't increase with scale. I'd be open to the idea as long as it comes with no impact to router performance.

@shashwathi Thoughts?

@aaronshurley

@Nino-K and I discussed this idea and came up with the following:
We think it would be reasonable to change our metric-emitting behavior in Proxy.ServeHTTP.
The proposal would be to change the current behavior: instead of sending the raw value on every request, we could send the min, max, and average value over a configurable period of time. Of course, this would be subject to acceptable performance.
@shashwathi any thoughts?

@aaronshurley

  • How much of a performance impact would accumulating metrics, rather than emitting them per request, have on gorouter?
    • Gorouter currently emits metrics on every request and doesn't bother with aggregation, but with this change we would need an atomic value to track the request metrics properly (see the sketch after this list).
    • Concerns about allowing operators to configure the frequency:
      • What if they set the value too high and effectively make gorouter work as a nozzle? We need more documentation on how to tune this knob carefully.
  • It might be worth exploring this option. My concern with asking @holgero for a PR is that if performance does degrade, we might not pull it in.
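For context on the hot-path cost being discussed, per-request accumulation can be kept to a couple of atomic operations. This is a hypothetical sketch, not gorouter code (tracking min/max atomically would additionally need CAS loops or a mutex):

```go
package proxy

import (
	"sync/atomic"
	"time"
)

// requestStats is updated on every request with lock-free atomics, so the
// proxy hot path pays only a few instructions per request.
type requestStats struct {
	count   int64
	totalNs int64
}

// Record is the per-request path.
func (s *requestStats) Record(d time.Duration) {
	atomic.AddInt64(&s.count, 1)
	atomic.AddInt64(&s.totalNs, d.Nanoseconds())
}

// Snapshot is called by a periodic emitter; it swaps the counters back to
// zero and returns what was accumulated since the last flush.
func (s *requestStats) Snapshot() (count int64, total time.Duration) {
	count = atomic.SwapInt64(&s.count, 0)
	total = time.Duration(atomic.SwapInt64(&s.totalNs, 0))
	return count, total
}
```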

@shalako @abbyachau how should we proceed?

Regards
@shashwathi && Aaron


shalako commented Jan 23, 2017

I'll prioritize the story so we can explore answers to the questions we have.


adobley commented Nov 5, 2018

Hi @holgero

Given the age of this issue, we are considering closing it. Let us know if this is still an issue you are facing; feel free to reopen it and let us know.

Thanks!


metskem commented Mar 31, 2019

I stumbled upon this issue while analysing the amount of logging coming from our PCF installation.
I find it rather ridiculous to have a gorouter ValueMetric for each request.
Would it be possible to have this ValueMetric emitted at a fixed time interval, basically what @aaronshurley already suggested?

thanks,
Harry


jkbschmid commented Dec 2, 2020

Hi all, I’d like to restart the discussion on this topic, since these metrics are causing enormous load on our Loggregator system.

Previously, we were able to filter the aforementioned metrics in our nozzle. However, our landscape has grown, and the volume of metrics is now threatening Loggregator stability and causing significant costs.
Since these metrics are emitted per request, the gorouters emit a total of ~60k envelopes/s (3.7M envelopes/min, not counting access logs) during peak business hours. The dashboard below shows how this relates to the overall load on the Loggregator:
[Screenshot: Loggregator load dashboard, 2020-12-02 13:58]

As @metskem's comment shows, there are other installations that have issues with these metrics as well.

I was wondering whether it'd be possible to reduce the frequency at which these metrics are emitted, or to allow disabling them entirely.

Thanks!

@ameowlia ameowlia reopened this Dec 2, 2020
@jkbschmid

Adding to my previous comment to address any performance concerns:
Actually, I am convinced that reducing the frequency of metric egress would improve performance, even with metric aggregation. If you look at the CPU usage of a router VM and run a simple top, you’ll find that Loggregator agents consume more CPU time than the gorouter process itself (almost double!).

Since it's already possible to get detailed and aggregated latency info from the /varz endpoint, an option to disable these metrics would be perfectly sufficient for our use case. I'd really be interested to hear your thoughts, and whether this is something you'd be interested in as well, @ameowlia.
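A minimal sketch of what such an opt-out could look like, gating the per-request emission behind a config flag; the flag and function names here are assumptions for illustration, not the actual gorouter configuration:

```go
package handlers

import "time"

// Config sketches a hypothetical opt-out switch for per-request metrics.
type Config struct {
	DisablePerRequestMetrics bool
}

// reportRequestLatency is called once per request. With the flag set, the
// per-request ValueMetric is skipped entirely, while the /varz aggregation
// would continue to be updated elsewhere.
func reportRequestLatency(cfg Config, d time.Duration, send func(name string, ms float64)) {
	if cfg.DisablePerRequestMetrics {
		return
	}
	send("latency", float64(d.Milliseconds()))
}
```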

Thanks and cheers,
Jakob

@ameowlia

Hi @jkbschmid,

I am supportive of either (1) allowing users to configure the frequency with which these metrics are emitted, or (2) allowing users to turn them off entirely. My preference is option 1 (which could also include option 2 in it!).

However, our team does not have the bandwidth for this work at the moment. If you are willing and able to PR it in, it would be much appreciated.

Thanks,
Amelia


jkbschmid commented Dec 16, 2020

Hi @ameowlia,
of course, I'll see how much change (1) and (2) would require, and we'll provide a PR :)
Thank you!


stefanlay commented Jan 20, 2021

I am currently looking into this, and I saw that there are two more envelopes emitted per request, which are even larger (in bytes) than the simple metrics mentioned above:

https://github.com/cloudfoundry/gorouter/blob/main/handlers/httpstartstop.go#L63
https://github.com/cloudfoundry/dropsonde/blob/master/instrumented_round_tripper/instrumented_round_tripper.go#L78 (called by https://github.com/cloudfoundry/gorouter/blob/main/proxy/round_tripper/dropsonde_round_tripper.go#L23)

Would it make sense to also switch off sending these envelopes with the same config used to switch off the per-request metrics? @ameowlia: do you know what these envelopes might be used for?

stefanlay added a commit to stefanlay/gorouter that referenced this issue Jan 29, 2021
cloudfoundry#159

The metrics latency, latency.component and route_lookup_time are
sent for each request and cause significant load on gorouter and
Loggregator. In our measurements the CPU load was reduced by 40%
on both gorouter and Doppler VMs. This is a huge gain at large
scale.

Note that the latency metric is still available at the /varz
endpoint.
stefanlay added a commit to stefanlay/gorouter that referenced this issue Jan 29, 2021
cloudfoundry#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".
stefanlay added a commit to stefanlay/gorouter that referenced this issue Jan 29, 2021
cloudfoundry#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".

Note that they should not be switched off when the app-autoscaler
is used.
ameowlia pushed a commit that referenced this issue Mar 2, 2021
#159

The metrics latency, latency.component and route_lookup_time are
sent for each request and cause significant load on gorouter and
Loggregator. In our measurements the CPU load was reduced by 40%
on both gorouter and Doppler VMs. This is a huge gain at large
scale.

Note that the latency metric is still available at the /varz
endpoint.
ameowlia pushed a commit that referenced this issue Mar 2, 2021
#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".
ameowlia pushed a commit that referenced this issue Mar 2, 2021
#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".

Note that they should not be switched off when the app-autoscaler
is used.
@ameowlia

Four years later, we have released a fix based on a community PR: https://github.com/cloudfoundry/routing-release/releases/tag/0.213.0

Thanks for your patience 😅
