
Killing cloud_controller_ng can cause counter metrics like requests.outstanding to get stuck #1312

Closed
tcdowney opened this issue Mar 19, 2019 · 28 comments

Comments

@tcdowney
Member

Issue

Logging this as a GitHub issue for discoverability.

The cc.requests.outstanding counter metric can get stuck if Cloud Controller is stopped/killed in the middle of handling requests.

Context

Cloud Controller emits the cc.requests.outstanding metric to give operators insight into the number of active requests currently being handled by the cloud_controller_ng Ruby process.

Number of requests that are currently being processed. Emitted for each Cloud Controller request.
More information on emitted metrics: https://docs.cloudfoundry.org/running/all_metrics.html

Cloud Controller manages this metric in its request handling code: the counter is incremented when a request starts and decremented when `@request_metrics.complete_request(status)` is called at the end of the request.

So if the Cloud Controller is stopped/killed before it has a chance to call complete_request, none of the in-flight requests decrement the counter in statsd. This is similar to #1294.
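
For illustration, the lifecycle looks roughly like the sketch below (hypothetical names, raw statsd datagrams over UDP rather than CC's actual statsd client): a counter increment is sent when a request starts, and the matching decrement is only sent if complete_request gets to run.

  require 'socket'

  # Hypothetical sketch of the increment/decrement pattern, not CC's actual code.
  class OutstandingRequestCounter
    def initialize(host = '127.0.0.1', port = 8125)
      @socket = UDPSocket.new
      @host = host
      @port = port
    end

    def start_request
      # statsd counter increment: "<name>:<value>|c"
      @socket.send('cc.requests.outstanding:1|c', 0, @host, @port)
    end

    def complete_request(_status)
      # The matching decrement. If the process is killed before this runs,
      # the earlier +1 is never balanced out and the aggregated counter
      # downstream stays inflated.
      @socket.send('cc.requests.outstanding:-1|c', 0, @host, @port)
    end
  end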

Steps to Reproduce

  1. Install the CF Loggregator Firehose Plugin

  2. Connect to the firehose and grep out the metric:

cf nozzle -f ValueMetric | grep --line-buffered cc.*outstanding
  3. Hammer the Cloud Controllers with requests:
hey -n 1000000 -c 50 -H "Authorization: $(cf oauth-token)" http://api.arya.capi.land/v2/apps
  4. Observe something similar to the following from cf nozzle:
origin:"cc" eventType:ValueMetric timestamp:1553034399010006107 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399011308663 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034398981292501 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399098849677 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
  5. On an api VM, stop the cloud_controller_ng process:
monit stop cloud_controller_ng

Feel free to substitute a kill -9 for the monit stop if monit is being too gentle in stopping the Cloud Controller.

  6. Start it back up again:
monit start cloud_controller_ng

Expected result

An accurate count for requests.outstanding continues to be emitted, capped at 20 (the default number of EventMachine threads).

Current result

Since the Cloud Controller was stopped in the middle of handling all of the requests from hey, it was not able to decrement the statsd counter. There is now a baseline of 20 for all of the cc.requests.outstanding metrics emitted by the cloud_controller_ng job on the VM that was stopped (in this test environment, b55807af-8ebb-4a54-8bd7-1edc7b8264f9).

origin:"cc" eventType:ValueMetric timestamp:1553034630853046264 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630865535939 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:2 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630917230329 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630849778556 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630888060296 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630889439373 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630986481710 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >

Possible Fix

Current mitigation: restarting the statsd_injector by running monit restart statsd_injector resets the counter.

Longer term: Cloud Controller could potentially clear these statsd counters on startup so it always starts with a clean slate.

Related

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164748070

The labels on this github issue will be updated when the story is started.

@jochenehret
Contributor

Can we expect a fix for this issue? As mentioned above, restarting the statsd_injector process whenever cloud_controller_ng is restarted would already help. We are trying to use this metric as a scaling indicator (as recommended on https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html), but currently it's not reliable.

@cwlbraa
Contributor

cwlbraa commented Apr 29, 2020

IIRC clearing the counters is easy to say but not easy to do. @tcdowney do you remember if your "current mitigation" was already implemented?

@jochenehret
Contributor

jochenehret commented Apr 30, 2020

Another option would be to emit an absolute value from cloud_controller_ng (instead of relative increment/decrement values). Can you estimate how difficult this would be to implement? Would you accept a PR? We can try to reserve some capacity in our next sprint to provide a fix.
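
For context, a minimal sketch of that idea (hypothetical, not a proposed patch): the process keeps the count in memory and emits the absolute value as a gauge, so a freshly restarted process naturally reports 0 again.

  # Hypothetical: emit the absolute in-memory count as a statsd gauge on every change.
  require 'socket'

  socket = UDPSocket.new
  outstanding = 0                 # lives in process memory; a restart resets it to 0
  outstanding += 1                # on request start (outstanding -= 1 on completion)
  socket.send("cc.requests.outstanding:#{outstanding}|g", 0, '127.0.0.1', 8125)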

@reneighbor
Member

Thank you @jochenehret! If you're willing and have capacity, a PR would be great. One thing we discussed in tech forum today would be to use an absolute value and have the state stored as a variable in memory on the ccng side. Is that what you meant?

@jochenehret
Contributor

Yes, that's what we meant. We are trying to create a PR in our next sprint.

@cwlbraa
Contributor

cwlbraa commented May 12, 2020

would an "absolute value" be backwards-compatible in this case? or would it need to go under a different metric name?

@jochenehret
Contributor

I think if the Cloud Controller emits an absolute "requests.outstanding" metric, that should be backwards-compatible. I just don't know how the "statsd_injector" process behaves. It currently gets the relative metric in the format "cc.requests.outstanding:1" or "cc.requests.outstanding:-1". We must make sure that it forwards the absolute numbers without doing any additional calculations.
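
For reference, full statsd datagrams carry a type suffix, which is what distinguishes a relative counter update from an absolute gauge value (illustrative lines, not actual statsd_injector output):

  cc.requests.outstanding:1|c      # counter increment (relative)
  cc.requests.outstanding:-1|c     # counter decrement (relative)
  cc.requests.outstanding:17|g     # gauge set to an absolute value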

@dawu415

dawu415 commented Oct 23, 2020

Hello, just curious if there's any movement on this, as I am currently seeing this bug as well. We're still investigating on our end, but it seems connectivity to the MySQL db dropped and the outstanding requests remained stuck at 20.

@sethboyles
Member

@dawu415 We have looked into this in the past week, and unfortunately there doesn't seem to be a way to reset the counter to 0 on restart. The statsd_injector doesn't expose the statsd admin interface, which would provide the option to delete a counter.

We may be able to resolve this by changing the metric from a counter metric to a gauge. However, we are still looking into this and considering the implications of making a change.

Perhaps, if gauges do provide a way to reset the metric without the admin interface, we can create a new metric with the gauge and deprecate this counter, but not remove it? @Gerg

@Gerg
Member

Gerg commented Nov 19, 2020

@sethboyles How would CC know what value to set the gauge to? All the instances of CC would need to have some shared state to track/manage outstanding requests.

@sethboyles
Member

sethboyles commented Nov 19, 2020

We should be able to increment or decrement gauges just like we do with counters, using the delta option:

https://statsd.readthedocs.io/en/v3.2.1/types.html#gauge-deltas
https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges
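
Per the linked statsd docs, a signed gauge value is treated as a delta rather than an absolute set, so the same +1/-1 bookkeeping carries over; the only caveat is that setting a gauge to a negative number requires setting it to 0 first. Illustrative datagrams:

  example.gauge:+1|g    # delta: increment by 1
  example.gauge:-1|g    # delta: decrement by 1
  example.gauge:0|g     # absolute set to 0 (e.g. on process startup)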

@Gerg
Member

Gerg commented Nov 19, 2020

So, we could add a new metric requests.outstanding.ng (real name TBD). This metric would be a gauge that deltas +1/-1 whenever the requests.outstanding counter increments or decrements, respectively. Additionally, the gauge is set to 0 whenever CC starts. Does that sound right?
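
A rough sketch of that shape (hypothetical code and raw statsd datagrams; the gauge name here matches the cc.requests.outstanding.gauge metric referenced in the commits below):

  require 'socket'

  # Hypothetical sketch: keep the existing counter, mirror it with a gauge
  # that uses +1/-1 deltas, and zero the gauge whenever the process boots.
  class RequestMetricsWithGauge
    def initialize(host = '127.0.0.1', port = 8125)
      @socket = UDPSocket.new
      @host = host
      @port = port
      send_metric('cc.requests.outstanding.gauge:0|g') # clean slate on startup
    end

    def start_request
      send_metric('cc.requests.outstanding:1|c')
      send_metric('cc.requests.outstanding.gauge:+1|g')
    end

    def complete_request(_status)
      send_metric('cc.requests.outstanding:-1|c')
      send_metric('cc.requests.outstanding.gauge:-1|g')
    end

    private

    def send_metric(datagram)
      @socket.send(datagram, 0, @host, @port)
    end
  end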

@sethboyles
Member

Yeah, I think that is something to look into at least.

@stephanme
Contributor

Any chance to get PR #2087 merged?

The outstanding requests metric is interesting as a scaling indicator but can't be used as long as it is broken.

@sethboyles
Member

Merged!

Not closing this issue until we complete the documentation updates.

@sethboyles
Member

We need to update these docs:
https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html
https://docs.cloudfoundry.org/running/all_metrics.html

MerricdeLauney added a commit to MerricdeLauney/docs-running-cf that referenced this issue Mar 1, 2021
….gauge

This is needed after introducing the new metric as part of this issue: cloudfoundry/cloud_controller_ng#1312

cc: @sethboyles
@sethboyles
Member

Closing as we have updated the relevant docs.

Samze added a commit to Samze/cloud_controller_ng_sam that referenced this issue May 31, 2024
cloudfoundry#1312 introduced cc.requests.outstanding.gauge, which holds the counter in memory. With the introduction of puma there may be multiple processes, each of which would have its own value for this metric. This would cause the gauge to flop between values.

This fix, specifically for puma, instead reads the value from prometheus and emits it.
sethboyles pushed a commit that referenced this issue Jun 14, 2024
#1312 introduced `cc.requests.outstanding.gauge`, which holds the counter in memory. With the introduction of puma there may be multiple processes, so each would emit its own value for this metric. This would cause the gauge to flop between values. This metric is listed as an important KPI for CAPI scaling: https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html#cloud_controller_ng.

This fix for puma instead uses Redis for the gauge.

* Move requests_metric to use statsd_updater
* statsd_updater now contains the statsd request logic
* Add a Redis/in-memory store to statsd_updater
* Add a missing mutex to the counter for thread safety in the in-memory implementation.
* Chose to prefix the entry in Redis with `metrics:` but open to feedback here.

Inspired by 4539e59

An alternative considered was to read the prometheus metric and re-emit it to StatsD; however, we observed performance degradation, presumably because of the number of reads from disk required by the [DirectFileStore](https://github.com/prometheus/client_ruby?tab=readme-ov-file#directfilestore-caveats-and-things-to-keep-in-mind) to aggregate the metric across processes.
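
A rough sketch of the Redis-backed shape described in this commit message (hypothetical code; the key uses the `metrics:` prefix mentioned above, everything else is an assumption):

  require 'redis'
  require 'socket'

  # Hypothetical sketch: with puma running multiple worker processes, keep the
  # outstanding-request count in Redis so every worker updates one shared value.
  class RedisBackedRequestMetrics
    KEY = 'metrics:cc.requests.outstanding.gauge'.freeze

    def initialize(redis: Redis.new, statsd_host: '127.0.0.1', statsd_port: 8125)
      @redis = redis
      @socket = UDPSocket.new
      @statsd_host = statsd_host
      @statsd_port = statsd_port
    end

    def start_request
      emit(@redis.incr(KEY))   # INCR/DECR are atomic across processes
    end

    def complete_request(_status)
      emit(@redis.decr(KEY))
    end

    private

    def emit(value)
      @socket.send("cc.requests.outstanding.gauge:#{value}|g", 0, @statsd_host, @statsd_port)
    end
  end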