Gorouter emits way too many metrics (metric frequency proportional to request frequency) #159

Closed
holgero opened this issue Jan 4, 2017 · 13 comments

Comments


holgero commented Jan 4, 2017

When we put some load on our Cloud Foundry installation, we noticed an increase in the number of metrics reported, which managed to overload our Riemann installation.
Digging into the issue, I found that the bulk of the reported metrics comes from the gorouter, under the names latency and route_lookup_time.
Looking into the code revealed that these metrics (and latency.<component>) are reported for each request the gorouter handles.
This means that as traffic increases, the number of metrics reported increases as well. And if you are unlucky, that leads to a situation where your monitoring fails at the moment you need it the most.

The proposal: Wouldn't it be better to instead report only an aggregate (perhaps min/max/avg) of these times at fixed intervals (say, once or twice a minute)? That way the number of metrics reported stays constant as incoming requests increase.
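For illustration, here is a minimal sketch (in Go, since gorouter is written in Go) of what such an interval-based aggregator could look like; the type, field, and callback names are hypothetical, not gorouter's actual code:

```go
package metrics

import (
	"sync"
	"time"
)

// latencyAggregator collects per-request latencies and emits only a
// min/max/avg summary at a fixed interval (hypothetical sketch).
type latencyAggregator struct {
	mu            sync.Mutex
	count         int64
	sum, min, max time.Duration
}

// Record is called once per request and only updates in-memory counters.
func (a *latencyAggregator) Record(d time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.count == 0 || d < a.min {
		a.min = d
	}
	if d > a.max {
		a.max = d
	}
	a.sum += d
	a.count++
}

// EmitEvery flushes one summary per interval instead of one metric per
// request. The emit callback stands in for whatever sender is in use.
func (a *latencyAggregator) EmitEvery(interval time.Duration, emit func(name string, ms float64)) {
	for range time.Tick(interval) {
		a.mu.Lock()
		count, sum, min, max := a.count, a.sum, a.min, a.max
		a.count, a.sum, a.min, a.max = 0, 0, 0, 0
		a.mu.Unlock()
		if count == 0 {
			continue
		}
		emit("latency.min", float64(min.Milliseconds()))
		emit("latency.max", float64(max.Milliseconds()))
		emit("latency.avg", float64(sum.Milliseconds())/float64(count))
	}
}
```

With, say, a 30-second interval, this emits three values per flush regardless of the request rate.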

What do you think, would it make sense to prepare a pull request for this?

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/137018375

The labels on this GitHub issue will be updated when the story is started.


shalako commented Jan 11, 2017

Hello @holgero

Sorry to hear about your metrics system falling over.

The currently recommended approach is to use a nozzle in front of the Loggregator firehose to filter/batch/count metrics before sending them to your metrics analysis/warehouse.
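As a rough illustration of that filtering approach, a nozzle boils down to dropping (or batching) envelopes by name before forwarding them; the envelope type and channels below are simplified stand-ins, not the actual firehose client API:

```go
package nozzle

// envelope is a simplified stand-in for a firehose envelope; a real
// nozzle would use the Loggregator client libraries instead.
type envelope struct {
	Name  string
	Value float64
}

// filterEnvelopes drops (or could batch/count) the noisy per-request
// metrics before forwarding the rest to the metrics warehouse.
func filterEnvelopes(in <-chan envelope, out chan<- envelope) {
	noisy := map[string]bool{
		"latency":           true,
		"route_lookup_time": true,
	}
	for e := range in {
		if noisy[e.Name] {
			continue // alternatively, fold into a local aggregate here
		}
		out <- e
	}
}
```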

The gorouter already emits many metrics, and we are concerned about adding additional counters for two reasons:

  1. The number of metrics operators want may grow unbounded. Instead of adding another metric every time an operator wants to slice the data a different way, we've taken the approach of providing raw data, so the operator can count it any way they want using nozzles or their existing metrics analysis tools.
  2. The resources required to maintain counters and do the math to emit calculated metrics, given that the router must be such a performant component.

You've got me thinking, though; it's a reasonable argument that emitting counters means the volume of metrics data doesn't increase with scale. I'd be open to the idea as long as it comes with no impact to router performance.

@shashwathi Thoughts?

@aaronshurley

@Nino-K and I discussed this idea and came up with the following:
We think it would be reasonable to change our metric-emitting behavior in Proxy.ServeHTTP.
The proposal would be to change the current behavior: instead of sending the raw value on every request, we could send the min, max, and average value over a configurable period of time. Of course, this would be subject to acceptable performance.
@shashwathi any thoughts?

@aaronshurley

  • How much of a performance impact would accumulating metrics, rather than emitting them per request, have on gorouter?
    • Gorouter currently emits metrics on every request and doesn't bother with aggregation, but with this change we would need an atomic value to track the request metrics properly (see the sketch after this list).
    • Concerns about allowing operators to configure the frequency:
      • What if they set the value too high and effectively make gorouter work as a nozzle? We need more documentation on how to tune this knob carefully.
  • It might be worth exploring this option. My concern with asking @holgero for a PR is that if performance does degrade, we might not pull it in.
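For context on the hot-path cost being discussed, per-request accumulation can be kept to a couple of atomic operations. This is a hypothetical sketch, not gorouter code (tracking min/max atomically would additionally need CAS loops or a mutex):

```go
package proxy

import (
	"sync/atomic"
	"time"
)

// requestStats is updated on every request with lock-free atomics, so the
// proxy hot path pays only a few instructions per request.
type requestStats struct {
	count   int64
	totalNs int64
}

// Record is the per-request path.
func (s *requestStats) Record(d time.Duration) {
	atomic.AddInt64(&s.count, 1)
	atomic.AddInt64(&s.totalNs, d.Nanoseconds())
}

// Snapshot is called by a periodic emitter; it swaps the counters back to
// zero and returns what was accumulated since the last flush.
func (s *requestStats) Snapshot() (count int64, total time.Duration) {
	count = atomic.SwapInt64(&s.count, 0)
	total = time.Duration(atomic.SwapInt64(&s.totalNs, 0))
	return count, total
}
```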

@shalako @abbyachau how should we proceed?

Regards
@shashwathi && Aaron


shalako commented Jan 23, 2017

I'll prioritize the story so we can explore answers to the questions we have.


adobley commented Nov 5, 2018

Hi @holgero

Given the age of this issue, we are considering closing it. Let us know if this is still an issue you are facing; feel free to reopen it and let us know.

Thanks!


metskem commented Mar 31, 2019

I stumbled upon this issue while analysing the amount of logging coming from our PCF installation.
I find it rather ridiculous to have a gorouter ValueMetric for each request.
Would it be possible to have this ValueMetric emitted at a fixed time interval, basically what @aaronshurley already suggested?

thanks,
Harry


jkbschmid commented Dec 2, 2020

Hi all, I’d like to restart the discussion on this topic, since these metrics are causing enormous load on our Loggregator system.

Previously, we were able to filter the aforementioned metrics in our nozzle. However, our landscape has grown, and the volume of metrics is now threatening Loggregator stability and causing significant costs.
Since these metrics are emitted per request, the gorouters emit a total of ~60k envelopes/s (3.7M envelopes/min, not counting access logs) during peak business hours. The dashboard below shows how this relates to the overall load on the Loggregator:
[Screenshot: Loggregator load dashboard, 2020-12-02 13:58]

As @metskem's comment shows, there are other installations that have issues with these metrics as well.

I was wondering whether it'd be possible to reduce the frequency at which these metrics are emitted, or to allow disabling them entirely.

Thanks!

@ameowlia ameowlia reopened this Dec 2, 2020
@jkbschmid

Adding to my previous comment to address any performance concerns:
Actually, I am convinced that reducing the frequency of metric egress would improve performance, even with metric aggregation. If you look at the CPU usage of a router VM and run a simple top, you’ll find that Loggregator agents consume more CPU time than the gorouter process itself (almost double!).

Since it's already possible to get detailed and aggregated latency info from the /varz endpoint, an option to disable these metrics would be perfectly sufficient for our use case. I'd really be interested to hear your thoughts, and whether this is something you'd be interested in as well, @ameowlia.
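A minimal sketch of what such an opt-out could look like, gating the per-request emission behind a config flag; the flag and function names here are assumptions for illustration, not the actual gorouter configuration:

```go
package handlers

import "time"

// Config sketches a hypothetical opt-out switch for per-request metrics.
type Config struct {
	DisablePerRequestMetrics bool
}

// reportRequestLatency is called once per request. With the flag set, the
// per-request ValueMetric is skipped entirely, while the /varz aggregation
// would continue to be updated elsewhere.
func reportRequestLatency(cfg Config, d time.Duration, send func(name string, ms float64)) {
	if cfg.DisablePerRequestMetrics {
		return
	}
	send("latency", float64(d.Milliseconds()))
}
```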

Thanks and cheers,
Jakob

@ameowlia

Hi @jkbschmid,

I am supportive of either (1) allowing users to configure the frequency with which these metrics are emitted, or (2) allowing users to turn them off entirely. My preference is option 1 (which could also include option 2 in it!).

However, our team does not have the bandwidth for this work at the moment. If you are willing and able to PR it in, it would be much appreciated.

Thanks,
Amelia


jkbschmid commented Dec 16, 2020

Hi @ameowlia,
of course, I'll see how much change (1) and (2) would require, and we'll provide a PR :)
Thank you!


stefanlay commented Jan 20, 2021

I am currently looking into this, and I saw that there are two more envelopes emitted per request, which are even larger (in bytes) than the simple metrics mentioned above:

https://github.com/cloudfoundry/gorouter/blob/main/handlers/httpstartstop.go#L63
https://github.com/cloudfoundry/dropsonde/blob/master/instrumented_round_tripper/instrumented_round_tripper.go#L78 (called by https://github.com/cloudfoundry/gorouter/blob/main/proxy/round_tripper/dropsonde_round_tripper.go#L23)

Would it make sense to also switch off sending these envelopes with the same config used to switch off the per-request metrics? @ameowlia: do you know what these envelopes might be used for?

stefanlay added a commit to stefanlay/gorouter that referenced this issue Jan 29, 2021
cloudfoundry#159

The metrics latency, latency.component and route_lookup_time are
sent for each request and cause significant load on gorouter and
Loggregator. In our measurements the CPU load was reduced by 40%
on both gorouter and Doppler VMs. This is a huge gain at large
scale.

Note that the latency metric is still available at the /varz
endpoint.
stefanlay added a commit to stefanlay/gorouter that referenced this issue Jan 29, 2021
cloudfoundry#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".
stefanlay added a commit to stefanlay/gorouter that referenced this issue Jan 29, 2021
cloudfoundry#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".

Note that they should not be switched off when the app-autoscaler
is used.
ameowlia pushed a commit that referenced this issue Mar 2, 2021
#159

The metrics latency, latency.component and route_lookup_time are
sent for each request and cause significant load on gorouter and
Loggregator. In our measurements the CPU load was reduced by 40%
on both gorouter and Doppler VMs. This is a huge gain at large
scale.

Note that the latency metric is still available at the /varz
endpoint.
ameowlia pushed a commit that referenced this issue Mar 2, 2021
#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".
ameowlia pushed a commit that referenced this issue Mar 2, 2021
#159

Switching off these events reduces the CPU load on gorouter and
Doppler VMs significantly.

These events are of type "timer".

Note that they should not be switched off when the app-autoscaler
is used.
@ameowlia

Four years later, we have released a fix based on a community PR: https://github.com/cloudfoundry/routing-release/releases/tag/0.213.0

Thanks for your patience 😅
