Killing cloud_controller_ng can cause counter metrics like requests.outstanding to get stuck #1312
Comments
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/164748070 The labels on this github issue will be updated when the story is started.
Can we expect a fix for this issue? As mentioned above, restarting the statsd_injector is only a temporary mitigation.
IIRC clearing the counters is easy to say but not easy to do. @tcdowney do you remember if your "current mitigation" was already implemented?
Another option would be to emit an absolute value from cloud_controller_ng (instead of relative increment/decrement values). Can you estimate how difficult this would be to implement? Would you accept a PR? We can try to reserve some capacity in our next sprint to provide a fix.
Thank you @jochenehret! If you're willing and have capacity, a PR would be great. One thing we discussed in tech forum today would be to use an absolute value and have the state stored as a variable in memory on the ccng side. Is that what you meant?
Yes, that's what we meant. We are trying to create a PR in our next sprint.
would an "absolute value" be backwards-compatible in this case? or would it need to go under a different metric name?
I think if the Cloud Controller emits an absolute "requests.outstanding" metric, that should be backwards-compatible. I just don't know how the "statsd_injector" process behaves. It currently gets the relative metric in the format "cc.requests.outstanding:1" or "cc.requests.outstanding:-1". We must make sure that it forwards the absolute numbers without doing any additional calculations.
Hello, just curious if there's any movement on this as I am currently seeing this bug as well. We're still investigating on our end, but it seems connectivity to the MySQL db dropped and the outstanding requests remain stuck at 20.
@dawu415 We have looked into this in the past week and unfortunately there doesn't seem to be a way to reset the counter to 0 on restart. The statsd_injector doesn't provide the statsd admin interface, which provides the option to delete a counter. We may be able to resolve this by changing the metric from a counter to a gauge. Perhaps, if gauges do provide a way to reset the metric without the admin interface, we can create a new metric with the gauge and deprecate this counter, but not remove it? @Gerg
@sethboyles How would CC know what value to set the gauge to? All the instances of CC would need to have some shared state to track/manage outstanding requests.
We should be able to increment or decrement gauges just like we do with counters, using the delta option: https://statsd.readthedocs.io/en/v3.2.1/types.html#gauge-deltas
So, we could add a new metric, `cc.requests.outstanding.gauge`, that uses gauge deltas?
Yeah, I think that is something to look into at least.
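For illustration, here is the difference between the two approaches at the statsd line-protocol level. This is a minimal Ruby sketch, assuming a statsd (or statsd_injector) listener on localhost:8125; the host, port, and metric names are examples, not the actual Cloud Controller wiring.

```ruby
require 'socket'

# A counter is adjusted with relative +1/-1 increments; a gauge can be
# adjusted with signed deltas or set to an absolute value.
statsd = UDPSocket.new
emit = ->(line) { statsd.send(line, 0, '127.0.0.1', 8125) }

# Counter semantics (what Cloud Controller emits today):
emit.call('cc.requests.outstanding:1|c')   # request started
emit.call('cc.requests.outstanding:-1|c')  # request finished; if this line is
                                           # never sent, the counter stays inflated

# Gauge-delta semantics (the proposal discussed above):
emit.call('cc.requests.outstanding.gauge:+1|g')  # adjust up by one
emit.call('cc.requests.outstanding.gauge:-1|g')  # adjust down by one
emit.call('cc.requests.outstanding.gauge:0|g')   # set an absolute value (reset)
```

The last line is what makes gauges attractive here: an absolute emission overwrites whatever stale value is currently held downstream.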
Any chance to get PR #2087 merged? The outstanding requests metric is interesting as a scaling indicator but can't be used as long as it is broken.
Merged! Not closing this issue yet until we complete the documentation.
….gauge This is needed after introducing the new metric as part of this issue: cloudfoundry/cloud_controller_ng#1312 cc: @sethboyles
Closing as we have updated the relevant docs.
cloudfoundry#1312 introduced cc.requests.outstanding.gauge which holds the counter in memory. With the introduction of puma there may be multiple processes, each of which would have its own value for this metric. This would cause the gauge to flop between values. This fix, specifically for puma, instead reads the value from prometheus and emits it.
cloudfoundry#1312 introduced cc.requests.outstanding.gauge which holds the counter in memory. With the introduction of puma there may be multiple processes, each of which would have its own value for this metric. This would cause the gauge to flop between values. This fix, specifically for puma, instead reads the value from prometheus and emits it. Additionally, add a missing mutex to the counter for thread safety.
cloudfoundry#1312 introduced cc.requests.outstanding.gauge which holds the counter in memory. With the introduction of puma there may be multiple processes, each of which would have its own value for this metric. This would cause the gauge to flop between values. This fix, specifically for puma, instead uses Redis. Additionally, add a missing mutex to the counter for thread safety in the in-memory implementation. Inspired by cloudfoundry@4539e59
cloudfoundry#1312 introduced cc.requests.outstanding.gauge which holds the counter in memory. With the introduction of puma there may be multiple processes, each of which would have its own value for this metric. This would cause the gauge to flop between values. This fix, specifically for puma, instead uses Redis.
* Move requests_metric to use statsd_updater
* statsd_updater now contains the statsd request logic
* Add a Redis/in-memory store to statsd_updater
* Additionally, add a missing mutex to the counter for thread safety in the in-memory implementation.
Inspired by cloudfoundry@4539e59
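For reference, a minimal sketch of what a mutex-guarded in-memory store might look like; the class and method names are illustrative, not the actual cloud_controller_ng code.

```ruby
# In-memory store guarded by a Mutex so that concurrent request threads
# inside one process cannot corrupt the count.
class InMemoryStore
  def initialize
    @mutex = Mutex.new
    @value = 0
  end

  def increment
    @mutex.synchronize { @value += 1 }
  end

  def decrement
    @mutex.synchronize { @value -= 1 }
  end

  def value
    @mutex.synchronize { @value }
  end
end
```

The limitation described above still applies: each Puma worker process gets its own copy of this object, so the emitted gauge only reflects that one worker.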
cloudfoundry#1312 introduced `cc.requests.outstanding.gauge` which holds the counter in memory. With the introduction of puma there may be multiple processes, so each would emit its own value for this metric. This would cause the gauge to flop between values. This metric is listed as an important KPI for CAPI scaling: https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller.html#cloud_controller_ng. This fix for puma will instead use Redis for the gauge.
* Move requests_metric to use statsd_updater
* statsd_updater now contains the statsd request logic
* Add a Redis/in-memory store to statsd_updater
* Add a missing mutex to the counter for thread safety in the in-memory implementation.
* Chose to prefix the entry in Redis with `metrics:` but open to feedback here.
Inspired by cloudfoundry@4539e59
An alternative considered was to read the prometheus metric and re-emit it to StatsD; however, we observed performance degradation, presumably because of the number of reads from disk for the [DirectFileStore](https://github.com/prometheus/client_ruby?tab=readme-ov-file#directfilestore-caveats-and-things-to-keep-in-mind) to aggregate the metric across processes.
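And a corresponding sketch of the Redis-backed variant described above, assuming the `redis` gem and a reachable Redis instance; the class name and key are illustrative, with the `metrics:` prefix mirroring the one mentioned in the commit message.

```ruby
require 'redis'

# Shared store for the outstanding-requests gauge. All Puma workers
# increment/decrement the same Redis key, so every worker emits the
# same value instead of its own per-process count.
class RedisStore
  KEY = 'metrics:cc.requests.outstanding.gauge'.freeze

  def initialize(redis: Redis.new)
    @redis = redis
  end

  def increment
    @redis.incr(KEY)
  end

  def decrement
    @redis.decr(KEY)
  end

  def value
    @redis.get(KEY).to_i
  end
end
```

Because the value lives outside the Ruby processes, it no longer flops between per-worker counts, which is the failure mode the commit message describes.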
Issue
Logging this as a GitHub issue for discoverability.
The `cc.requests.outstanding` counter metric can get stuck if Cloud Controller is stopped/killed in the middle of handling requests.
Context
Cloud Controller emits the `cc.requests.outstanding` metric to give operators insight into the number of active requests being handled by the `cloud_controller_ng` Ruby process. Cloud Controller manages this metric in `cloud_controller_ng/middleware/request_metrics.rb` (line 14 at 90bbdda).
So if the Cloud Controller is stopped/killed before it has a chance to call `complete_request` here, none of the active requests can decrement the counter in statsd. This is similar to #1294.
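To make the failure mode concrete, here is a rough sketch of the increment/decrement pattern in Rack-middleware terms; it is illustrative only (the real logic lives in `middleware/request_metrics.rb`), and it assumes a statsd client object that responds to `increment`/`decrement`.

```ruby
# Illustrative Rack middleware: +1 when a request starts, -1 when it ends.
class RequestMetricsMiddleware
  def initialize(app, statsd)
    @app = app
    @statsd = statsd # e.g. a statsd client responding to increment/decrement
  end

  def call(env)
    @statsd.increment('cc.requests.outstanding') # the +1 is sent immediately
    @app.call(env)
  ensure
    # If the process is killed before this runs, the matching -1 is never
    # emitted and the aggregated counter in statsd stays inflated.
    @statsd.decrement('cc.requests.outstanding')
  end
end
```

An `ensure` block covers exceptions, but nothing covers a `kill -9` or an abrupt stop: the `+1` has already been forwarded to statsd, and the corresponding `-1` never arrives.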
Steps to Reproduce
1. Install the CF Loggregator Firehose Plugin.
2. Connect to the firehose and grep out the metric with `cf nozzle`.
3. Generate load against the API (e.g. with `hey`).
4. While requests are in flight, stop the `cloud_controller_ng` process on the `api` vm with `monit stop`. Feel free to substitute a `kill -9` for the `monit stop` if monit is being too gentle in stopping the Cloud Controller.
Expected result
An accurate count for `requests.outstanding` continues to be emitted, capped at 20 (the default number of EventMachine threads).
Current result
Since the Cloud Controller was stopped in the middle of handling all of the requests from `hey`, it was not able to decrement the statsd counter, so now there is a baseline of 20 for all of the `cc.requests.outstanding` metrics from the `cloud_controller_ng` job on the VM that was stopped (in this test environment that was `b55807af-8ebb-4a54-8bd7-1edc7b8264f9`).
Possible Fix
Current mitigation: restarting the statsd_injector by running `monit restart statsd_injector` resets the counter.
Longer term: Cloud Controller could potentially clear these statsd counters on startup so it always starts with a clean slate.
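For completeness, the "clear counters on startup" idea maps to the statsd management interface, a plain TCP interface (by default on port 8126) that accepts commands such as `delcounters`. As noted earlier in the thread, statsd_injector does not expose this interface, so this is only a sketch of the idea against a stock statsd deployment; host and port are assumptions.

```ruby
require 'socket'

# Hypothetical startup hook: delete the stale counter via the statsd
# management interface so subsequent emissions start from a clean slate.
TCPSocket.open('127.0.0.1', 8126) do |admin|
  admin.puts 'delcounters cc.requests.outstanding'
  puts admin.gets # statsd acknowledges the deleted counter
end
```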
Related