
Killing cloud_controller_ng can cause counter metrics like requests.outstanding to get stuck #1312

Closed
tcdowney opened this issue Mar 19, 2019 · 28 comments

Comments

@tcdowney
Member

Issue

Logging this as a GitHub issue for discoverability.

The cc.requests.outstanding counter metric can get stuck if Cloud Controller is stopped/killed in the middle of handling requests.

Context

Cloud Controller emits the cc.requests.outstanding metric to give operators insight into the number of active requests currently being handled by the cloud_controller_ng Ruby process.

Number of requests that are currently being processed. Emitted for each Cloud Controller request.
More information on emitted metrics: https://docs.cloudfoundry.org/running/all_metrics.html

Cloud Controller manages this metric in its request handling code: the counter is incremented when a request starts and decremented when `@request_metrics.complete_request(status)` is called at the end of the request.

So if the Cloud Controller is stopped/killed before it has a chance to call complete_request, none of the in-flight requests decrement the counter in statsd. This is similar to #1294.
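
For illustration, the lifecycle looks roughly like the sketch below (hypothetical names, raw statsd datagrams over UDP rather than CC's actual statsd client): a counter increment is sent when a request starts, and the matching decrement is only sent if complete_request gets to run.

  require 'socket'

  # Hypothetical sketch of the increment/decrement pattern, not CC's actual code.
  class OutstandingRequestCounter
    def initialize(host = '127.0.0.1', port = 8125)
      @socket = UDPSocket.new
      @host = host
      @port = port
    end

    def start_request
      # statsd counter increment: "<name>:<value>|c"
      @socket.send('cc.requests.outstanding:1|c', 0, @host, @port)
    end

    def complete_request(_status)
      # The matching decrement. If the process is killed before this runs,
      # the earlier +1 is never balanced out and the aggregated counter
      # downstream stays inflated.
      @socket.send('cc.requests.outstanding:-1|c', 0, @host, @port)
    end
  end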

Steps to Reproduce

  1. Install the CF Loggregator Firehose Plugin

  2. Connect to the firehose and grep out the metric:

cf nozzle -f ValueMetric | grep --line-buffered cc.*outstanding
  3. Hammer the Cloud Controllers with requests:
hey -n 1000000 -c 50 -H "Authorization: $(cf oauth-token)" http://api.arya.capi.land/v2/apps
  4. Observe something similar to the following from cf nozzle:
origin:"cc" eventType:ValueMetric timestamp:1553034399010006107 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399011308663 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034398981292501 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:20 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034399098849677 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:19 unit:"counter" >
  5. On an api VM, stop the cloud_controller_ng process:
monit stop cloud_controller_ng

Feel free to substitute a kill -9 for the monit stop if monit is being too gentle in stopping the Cloud Controller.

  6. Start it back up again:
monit start cloud_controller_ng

Expected result

An accurate count for requests.outstanding continues to be emitted, capped at 20 (the default number of EventMachine threads).

Current result

Since the Cloud Controller was stopped in the middle of handling all of the requests from hey, it was not able to decrement the statsd counter. There is now a baseline of 20 for all of the cc.requests.outstanding metrics emitted by the cloud_controller_ng job on the VM that was stopped (in this test environment, b55807af-8ebb-4a54-8bd7-1edc7b8264f9).

origin:"cc" eventType:ValueMetric timestamp:1553034630853046264 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630865535939 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:2 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630917230329 deployment:"cf" job:"api" index:"c032ab53-66f7-45fc-a229-d25d509967dc" ip:"10.0.1.11" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:3 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630849778556 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630888060296 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630889439373 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:40 unit:"counter" >
origin:"cc" eventType:ValueMetric timestamp:1553034630986481710 deployment:"cf" job:"api" index:"b55807af-8ebb-4a54-8bd7-1edc7b8264f9" ip:"10.0.1.10" tags:<key:"source_id" value:"cc" > valueMetric:<name:"requests.outstanding" value:39 unit:"counter" >

Possible Fix

Current mitigation: restarting the statsd_injector by running monit restart statsd_injector resets the counter.

Longer term: Cloud Controller could potentially clear these statsd counters on startup so it always starts with a clean slate.

Related

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/164748070

The labels on this github issue will be updated when the story is started.

@jochenehret
Contributor

Can we expect a fix for this issue? As mentioned above, restarting the statsd_injector process whenever cloud_controller_ng is restarted would already help. We are trying to use this metric as a scaling indicator (as recommended on https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html), but currently it's not reliable.

@cwlbraa
Contributor

cwlbraa commented Apr 29, 2020

IIRC clearing the counters is easy to say but not easy to do. @tcdowney do you remember if your "current mitigation" was already implemented?

@jochenehret
Contributor

jochenehret commented Apr 30, 2020

Another option would be to emit an absolute value from cloud_controller_ng (instead of relative increment/decrement values). Can you estimate how difficult this would be to implement? Would you accept a PR? We can try to reserve some capacity in our next sprint to provide a fix.
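
For context, a minimal sketch of that idea (hypothetical, not a proposed patch): the process keeps the count in memory and emits the absolute value as a gauge, so a freshly restarted process naturally reports 0 again.

  # Hypothetical: emit the absolute in-memory count as a statsd gauge on every change.
  require 'socket'

  socket = UDPSocket.new
  outstanding = 0                 # lives in process memory; a restart resets it to 0
  outstanding += 1                # on request start (outstanding -= 1 on completion)
  socket.send("cc.requests.outstanding:#{outstanding}|g", 0, '127.0.0.1', 8125)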

@reneighbor
Member

Thank you @jochenehret! If you're willing and have capacity, a PR would be great. One thing we discussed in tech forum today would be to use an absolute value and have the state stored as a variable in memory on the ccng side. Is that what you meant?

@jochenehret
Contributor

Yes, that's what we meant. We are trying to create a PR in our next sprint.

@cwlbraa
Contributor

cwlbraa commented May 12, 2020

would an "absolute value" be backwards-compatible in this case? or would it need to go under a different metric name?

@jochenehret
Contributor

I think if the Cloud Controller emits an absolute "requests.outstanding" metric, that should be backwards-compatible. I just don't know how the "statsd_injector" process behaves. It currently gets the relative metric in the format "cc.requests.outstanding:1" or "cc.requests.outstanding:-1". We must make sure that it forwards the absolute numbers without doing any additional calculations.
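
For reference, full statsd datagrams carry a type suffix, which is what distinguishes a relative counter update from an absolute gauge value (illustrative lines, not actual statsd_injector output):

  cc.requests.outstanding:1|c      # counter increment (relative)
  cc.requests.outstanding:-1|c     # counter decrement (relative)
  cc.requests.outstanding:17|g     # gauge set to an absolute value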

@dawu415

dawu415 commented Oct 23, 2020

Hello, just curious if there's any movement on this, as I am currently seeing this bug as well. We're still investigating on our end, but it seems connectivity to the MySQL db dropped and the outstanding requests remained stuck at 20.

@sethboyles
Member

@dawu415 We have looked into this in the past week, and unfortunately there doesn't seem to be a way to reset the counter to 0 on restart. The statsd_injector doesn't expose the statsd admin interface, which would provide the option to delete a counter.

We may be able to resolve this by changing the metric from a counter metric to a gauge. However, we are still looking into this and considering the implications of making a change.

Perhaps, if gauges do provide a way to reset the metric without the admin interface, we can create a new metric with the gauge and deprecate this counter, but not remove it? @Gerg

@Gerg
Member

Gerg commented Nov 19, 2020

@sethboyles How would CC know what value to set the gauge to? All the instances of CC would need to have some shared state to track/manage outstanding requests.

@sethboyles
Member

sethboyles commented Nov 19, 2020

We should be able to increment or decrement gauges just like we do with counters, using the delta option:

https://statsd.readthedocs.io/en/v3.2.1/types.html#gauge-deltas
https://github.com/statsd/statsd/blob/master/docs/metric_types.md#gauges
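
Per the linked statsd docs, a signed gauge value is treated as a delta rather than an absolute set, so the same +1/-1 bookkeeping carries over; the only caveat is that setting a gauge to a negative number requires setting it to 0 first. Illustrative datagrams:

  example.gauge:+1|g    # delta: increment by 1
  example.gauge:-1|g    # delta: decrement by 1
  example.gauge:0|g     # absolute set to 0 (e.g. on process startup)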

@Gerg
Member

Gerg commented Nov 19, 2020

So, we could add a new metric requests.outstanding.ng (real name TBD). This metric would be a gauge that deltas +1/-1 whenever the requests.outstanding counter increments or decrements, respectively. Additionally, the gauge is set to 0 whenever CC starts. Does that sound right?
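
A rough sketch of that shape (hypothetical code and raw statsd datagrams; the gauge name here matches the cc.requests.outstanding.gauge metric referenced in the commits below):

  require 'socket'

  # Hypothetical sketch: keep the existing counter, mirror it with a gauge
  # that uses +1/-1 deltas, and zero the gauge whenever the process boots.
  class RequestMetricsWithGauge
    def initialize(host = '127.0.0.1', port = 8125)
      @socket = UDPSocket.new
      @host = host
      @port = port
      send_metric('cc.requests.outstanding.gauge:0|g') # clean slate on startup
    end

    def start_request
      send_metric('cc.requests.outstanding:1|c')
      send_metric('cc.requests.outstanding.gauge:+1|g')
    end

    def complete_request(_status)
      send_metric('cc.requests.outstanding:-1|c')
      send_metric('cc.requests.outstanding.gauge:-1|g')
    end

    private

    def send_metric(datagram)
      @socket.send(datagram, 0, @host, @port)
    end
  end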

@sethboyles
Member

Yeah, I think that is something to look into at least.

@stephanme
Contributor

Any chance to get PR #2087 merged?

The outstanding requests metric is interesting as a scaling indicator but can't be used as long as it is broken.

@sethboyles
Member

Merged!

Not closing this issue until we complete the documentation updates.

@sethboyles
Member

We need to update these docs:
https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html
https://docs.cloudfoundry.org/running/all_metrics.html

MerricdeLauney added a commit to MerricdeLauney/docs-running-cf that referenced this issue Mar 1, 2021
….gauge

This is needed after introducing the new metric as part of this issue: cloudfoundry/cloud_controller_ng#1312

cc: @sethboyles
@sethboyles
Member

Closing as we have updated the relevant docs.

Samze added a commit to Samze/cloud_controller_ng_sam that referenced this issue May 31, 2024
cloudfoundry#1312 introduced cc.requests.outstanding.gauge, which holds the counter in memory. With the introduction of puma there may be multiple processes, each of which would have its own value for this metric. This would cause the gauge to flop between values.

This fix, specifically for puma, instead reads the value from prometheus and emits it.
sethboyles pushed a commit that referenced this issue Jun 14, 2024
#1312 introduced `cc.requests.outstanding.gauge`, which holds the counter in memory. With the introduction of puma there may be multiple processes, so each would emit its own value for this metric. This would cause the gauge to flop between values. This metric is listed as an important KPI for CAPI scaling: https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html#cloud_controller_ng.

This fix for puma instead uses Redis for the gauge.

* Move requests_metric to use statsd_updater
* statsd_updater now contains the statsd request logic
* Add a Redis/in-memory store to statsd_updater
* Add a missing mutex to the counter for thread safety in the in-memory implementation.
* Chose to prefix the entry in Redis with `metrics:` but open to feedback here.

Inspired by 4539e59

An alternative considered was to read the prometheus metric and re-emit it to StatsD; however, we observed performance degradation, presumably because of the number of reads from disk required by the [DirectFileStore](https://github.com/prometheus/client_ruby?tab=readme-ov-file#directfilestore-caveats-and-things-to-keep-in-mind) to aggregate the metric across processes.
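
A rough sketch of the Redis-backed shape described in this commit message (hypothetical code; the key uses the `metrics:` prefix mentioned above, everything else is an assumption):

  require 'redis'
  require 'socket'

  # Hypothetical sketch: with puma running multiple worker processes, keep the
  # outstanding-request count in Redis so every worker updates one shared value.
  class RedisBackedRequestMetrics
    KEY = 'metrics:cc.requests.outstanding.gauge'.freeze

    def initialize(redis: Redis.new, statsd_host: '127.0.0.1', statsd_port: 8125)
      @redis = redis
      @socket = UDPSocket.new
      @statsd_host = statsd_host
      @statsd_port = statsd_port
    end

    def start_request
      emit(@redis.incr(KEY))   # INCR/DECR are atomic across processes
    end

    def complete_request(_status)
      emit(@redis.decr(KEY))
    end

    private

    def emit(value)
      @socket.send("cc.requests.outstanding.gauge:#{value}|g", 0, @statsd_host, @statsd_port)
    end
  end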