☂️-issue for gardener component metrics/alerts/dashboards #2815

timebertt · 2020-09-04T13:16:40Z

How to categorize this issue?

/area monitoring ops-productivity
/kind enhancement
/priority normal

What would you like to be added:
Currently, the /metrics endpoints of our components (gcm, g-scheduler, gardenlet) expose only some very rudimentary metrics: {garden_cm,gardener_scheduler,gardenlet}_worker_amount, basic Go runtime metrics, and Prometheus HTTP handler metrics. AFAIK, we don't display any of those metrics in any dashboard.

We should augment the set of component metrics that are collected and exposed, to improve observability of the gardener core components. The client-go and controller-runtime libraries contain some helpers for collecting controller-related metrics that can/should be leveraged for that (see client-go and c-r).

Helpful client-side metrics would be:

- client-side API request count per resource/url
- client-side API request latency/duration (histogram)
- client-side rate limit/throttle latency/duration (histogram)
- API request response count per code
- client certificate expiry/rotation age
- metrics around lists/watches (e.g. short_watches_total, watch_duration, lists_total)
- controller workqueue stats (depth, dirty, inflight, duration, ...)
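To make the first few items a bit more concrete, here is a minimal sketch of how request counts per response code and request durations could be captured on the client side by wrapping an `http.RoundTripper`. This is only an illustration using the standard library: `requestStats`, `instrumentedTransport`, `fakeTransport` and `runClientDemo` are hypothetical names, the in-memory map stands in for real Prometheus counters/histograms, and the actual components would more likely plug into the client-go / controller-runtime helpers mentioned above.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// requestStats is a hypothetical in-memory stand-in for a counter per
// response code plus a list of observed request durations.
type requestStats struct {
	mu        sync.Mutex
	perCode   map[int]int
	durations []time.Duration
}

// instrumentedTransport wraps another RoundTripper and records the response
// code and duration of every request that passes through it.
type instrumentedTransport struct {
	next  http.RoundTripper
	stats *requestStats
}

func (t *instrumentedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)
	elapsed := time.Since(start)

	t.stats.mu.Lock()
	defer t.stats.mu.Unlock()
	t.stats.durations = append(t.stats.durations, elapsed)
	code := 0 // 0 marks requests that failed without any response
	if err == nil {
		code = resp.StatusCode
	}
	t.stats.perCode[code]++
	return resp, err
}

// fakeTransport answers requests in-process so the sketch needs no network.
type fakeTransport func(*http.Request) (*http.Response, error)

func (f fakeTransport) RoundTrip(req *http.Request) (*http.Response, error) { return f(req) }

// runClientDemo sends three requests through the instrumented client and
// returns the per-code counts and the number of recorded durations.
func runClientDemo() (map[int]int, int, error) {
	fake := fakeTransport(func(req *http.Request) (*http.Response, error) {
		code := http.StatusOK
		if req.URL.Path == "/missing" {
			code = http.StatusNotFound
		}
		return &http.Response{StatusCode: code, Header: make(http.Header), Body: http.NoBody, Request: req}, nil
	})
	stats := &requestStats{perCode: map[int]int{}}
	client := &http.Client{Transport: &instrumentedTransport{next: fake, stats: stats}}

	// The host name is a placeholder; the fake transport never dials it.
	for _, path := range []string{"/ok", "/ok", "/missing"} {
		resp, err := client.Get("http://apiserver.invalid" + path)
		if err != nil {
			return nil, 0, err
		}
		resp.Body.Close()
	}
	return stats.perCode, len(stats.durations), nil
}

func main() {
	perCode, observations, err := runClientDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("requests per code:", perCode, "latency observations:", observations)
}
```

The wrapper approach keeps the instrumentation out of the calling code: anything that uses the client gets counted, which is roughly the property we would want from the library helpers as well.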

Further server-side metrics, that would also be helpful:

- webhook inflight requests, response code distribution (e.g. in gcm, seed-admission-controller webhooks)
- Gardener API server metrics (same as exposed by kube-apiserver, check if that's already enabled)
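For the webhook item, a rough sketch of what "inflight requests" and "response code distribution" could mean in code is an `http.Handler` middleware like the one below. It is standard library only and purely illustrative: `webhookStats`, `statusRecorder`, `instrument` and `runWebhookDemo` are made-up names, the `deny` query parameter is just a way to trigger a non-200 response in the demo, and a real implementation would export Prometheus gauges/counters instead of keeping an in-memory map.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// webhookStats is a hypothetical stand-in for an inflight gauge and a
// counter per response code.
type webhookStats struct {
	inflight int64 // accessed atomically; kept first for 64-bit alignment
	mu       sync.Mutex
	perCode  map[int]int
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument tracks inflight requests and counts responses per status code.
func instrument(stats *webhookStats, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		atomic.AddInt64(&stats.inflight, 1)
		defer atomic.AddInt64(&stats.inflight, -1)

		// 200 is the implicit status if the handler never calls WriteHeader.
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)

		stats.mu.Lock()
		stats.perCode[rec.code]++
		stats.mu.Unlock()
	})
}

// runWebhookDemo sends one allowed and one denied request through the
// middleware and returns the per-code counts, the inflight value seen
// inside the handler, and the inflight value after both requests finished.
func runWebhookDemo() (map[int]int, int64, int64) {
	stats := &webhookStats{perCode: map[int]int{}}
	var seenInflight int64
	handler := instrument(stats, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seenInflight = atomic.LoadInt64(&stats.inflight)
		if r.URL.Query().Get("deny") == "true" {
			w.WriteHeader(http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "ok")
	}))

	for _, target := range []string{"/validate", "/validate?deny=true"} {
		handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodPost, target, nil))
	}
	return stats.perCode, seenInflight, atomic.LoadInt64(&stats.inflight)
}

func main() {
	perCode, seen, final := runWebhookDemo()
	fmt.Println("responses per code:", perCode, "inflight during request:", seen, "inflight afterwards:", final)
}
```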

More advanced metrics, that could be displayed in dashboards or used in alerts, may include (but not so important):

- number of goroutines (for detecting goroutine leaks), number of system threads
- go garbage collection stats
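The values behind these two items are all available from the Go `runtime` package, so as a small sketch (not a proposal for how to export them), a snapshot could be taken as shown below. `runtimeSnapshot`, `snapshot` and `blockedGoroutinesDemo` are hypothetical names; the demo simply parks a few goroutines on a channel to show how a leak would become visible in the goroutine count.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runtimeSnapshot holds the runtime values mentioned above.
type runtimeSnapshot struct {
	Goroutines     int    // current number of goroutines
	ThreadsCreated int    // number of OS threads created so far
	NumGC          uint32 // completed GC cycles
	PauseTotalNs   uint64 // cumulative GC stop-the-world pause time
	HeapAllocBytes uint64 // bytes of allocated heap objects
}

// snapshot reads the current values from the Go runtime.
func snapshot() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	threads, _ := runtime.ThreadCreateProfile(nil) // nil slice: only the record count is returned
	return runtimeSnapshot{
		Goroutines:     runtime.NumGoroutine(),
		ThreadsCreated: threads,
		NumGC:          m.NumGC,
		PauseTotalNs:   m.PauseTotalNs,
		HeapAllocBytes: m.HeapAlloc,
	}
}

// blockedGoroutinesDemo starts n goroutines that block on a channel (as a
// leaked goroutine would) and takes a snapshot while they are all parked.
func blockedGoroutinesDemo(n int) runtimeSnapshot {
	stop := make(chan struct{})
	var started, finished sync.WaitGroup
	for i := 0; i < n; i++ {
		started.Add(1)
		finished.Add(1)
		go func() {
			defer finished.Done()
			started.Done()
			<-stop // blocks until the demo releases the goroutines
		}()
	}
	started.Wait()
	s := snapshot()
	close(stop)
	finished.Wait()
	return s
}

func main() {
	fmt.Printf("baseline:  %+v\n", snapshot())
	fmt.Printf("5 blocked: %+v\n", blockedGoroutinesDemo(5))
}
```

A goroutine count that keeps growing over time without returning to its baseline is the typical signature of a leak, which is why this is worth having on a dashboard even though it is lower priority.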

We could then add dedicated dashboards for all gardener system components displaying:

- the mentioned component-aware metrics
- kubernetes metrics (e.g. number of container/pod restarts, number of replicas, leader election lease durations, ...) (similar to control plane component dashboards)
- resource metrics (e.g. cpu, memory, network usage) (similar to control plane component dashboards)

Once we have gained some experience with those metrics, we could build more alerts on top of them for our system components, for example:

- high amount of short watches
- high percentage of request failures
- high request latency
- high amount of throttled requests / high throttling latency
- only few weeks/days left until the client certificate expires
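To illustrate what two of these alert conditions boil down to, here is a small sketch of the underlying checks in Go. Everything in it is an assumption for illustration: the function names, the choice to treat only 5xx responses (and requests without any response, recorded as code 0) as failures, and the example numbers. Real alerts would be expressed as Prometheus alerting rules on the exported metrics, with thresholds chosen once we have some operational experience.

```go
package main

import (
	"fmt"
	"time"
)

// failureRatio returns the share of requests that failed. As an assumption
// of this sketch, a request counts as failed if it got a 5xx response or no
// response at all (recorded under code 0).
func failureRatio(perCode map[int]int) float64 {
	total, failed := 0, 0
	for code, n := range perCode {
		total += n
		if code == 0 || code >= 500 {
			failed += n
		}
	}
	if total == 0 {
		return 0
	}
	return float64(failed) / float64(total)
}

// certExpiresSoon reports whether less than warnBefore is left until notAfter.
func certExpiresSoon(notAfter, now time.Time, warnBefore time.Duration) bool {
	return notAfter.Sub(now) < warnBefore
}

// runAlertDemo evaluates both checks on made-up example data.
func runAlertDemo() (float64, bool) {
	perCode := map[int]int{200: 90, 404: 5, 500: 3, 503: 2}

	now := time.Now()
	notAfter := now.Add(10 * 24 * time.Hour) // certificate valid for 10 more days
	warnBefore := 14 * 24 * time.Hour        // hypothetical two-week warning window

	return failureRatio(perCode), certExpiresSoon(notAfter, now, warnBefore)
}

func main() {
	ratio, expiring := runAlertDemo()
	fmt.Printf("failure ratio: %.2f, certificate expiring soon: %v\n", ratio, expiring)
}
```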

Why is this needed:
While operating fast-growing gardener systems, it is important to have deeper insights into their system components.
In the past few months, we have observed some issues regarding scalability/stability/availability that are very hard to debug, reproduce, and fix (e.g. #2689, #2747).
Having component-aware metrics, alerts, and dashboards will help in running and operating large-scale gardener installations while maintaining quality of service and keeping operational effort low.

/cc @wyb1 @istvanballok

timebertt · 2020-09-11T10:48:04Z

We had an internal discussion with @wyb1 and @istvanballok. Thanks again for the productive session!

To summarize, we formalized the following action items / plan for this topic:

- add new prometheus instance without whitelisting/filtering (with low retention time) [@wyb1, @istvanballok]
  - this will help us to explore over time
    - which metrics are indeed helpful -> might be added to whitelists for prometheus with longer retention time
    - which metrics are not helpful/waste -> might be blacklisted accordingly
  - should be configurable, can be enabled/disabled in some way (probably gardenlet config)
- collect/expose the desired new metrics in gardener's components in the go code (starting with those listed above) [@wyb1, @istvanballok, @gardener/gardener-maintainers]
- collect experience during debugging sessions / with new augmented set of metrics [all, operators] and based on these experiences:
  - add new and relevant metrics to existing whitelists [@gardener/gardener-maintainers, operators]
  - define helpful alerts [@gardener/gardener-maintainers, operators]
  - design helpful dashboards [@wyb1, @istvanballok]

rfranzke · 2020-09-24T10:27:08Z

/assign @danielfoehrKn
as he showed some interest in these topics and might start looking into it

danielfoehrKn · 2020-09-24T12:32:20Z

After initially looking into it, I have to say that I would need a considerable amount of time and focus for this topic (this is one of the areas I am least familiar with), which I do not have at the moment.
Let's discuss this in a bigger round when @timebertt is back.

/unassign

timebertt · 2021-01-12T18:14:23Z

/touch

rfranzke · 2021-01-15T10:31:24Z

/assign @timuthy @rfranzke

rfranzke · 2021-02-15T09:51:16Z

/unassign @timuthy @rfranzke for now in favor of #1723 and #1724 and #1725

timebertt · 2022-03-15T08:10:44Z

/touch

rfranzke · 2022-12-19T08:32:58Z

cc @shafeeqes

rfranzke · 2023-01-18T06:43:18Z

@shafeeqes @timebertt Can you check whether this issue is still up-to-date and if not change the description accordingly? :) Thank you!

shafeeqes · 2023-02-03T04:20:52Z

I think

controller workqueue stats (depth, dirty, inflight, duration, ...)

is completed with #7180

timebertt · 2023-03-29T08:10:26Z

I checked off a few of the tasks that are already done with the controller runtime refactoring.
I expect that alerts won't be done, as alerts are kind of orphaned and unmaintained in this repository.
The remaining open tasks for additional metrics and dashboards could partially still be interesting to add.

timebertt added the kind/enhancement Enhancement, improvement, extension label Sep 4, 2020

gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related area/ops-productivity Operator productivity related (how to improve operations) priority/normal labels Sep 4, 2020

timebertt changed the title ~~Gardener component metrics/dashboards~~ ☂️-issue for gardener component metrics/alerts/dashboards Sep 4, 2020

vlerenc mentioned this issue Sep 22, 2020

Dashboard improvements gardener/monitoring#4

Closed

7 tasks

gardener-robot assigned danielfoehrKn Sep 24, 2020

gardener-robot unassigned danielfoehrKn Sep 24, 2020

stoyanr mentioned this issue Nov 5, 2020

Shoot reconciliation failed with an error when attempting to update the managed resource extension-worker-mcm-shoot #3118

Closed

gardener-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2020

wyb1 mentioned this issue Nov 27, 2020

Add and Expose Admission Controller Metrics #3234

Closed

gardener-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 12, 2021

gardener-robot assigned rfranzke and timuthy Jan 15, 2021

rfranzke mentioned this issue Jan 20, 2021

☂️-Issue for unit/integration tests in the extensions library #2751

Closed

2 tasks

rfranzke mentioned this issue Feb 5, 2021

Enhance Grafana dashboards for kube-apiserver #3502

Merged

gardener-robot unassigned rfranzke and timuthy Feb 15, 2021

gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/normal labels Mar 8, 2021

rfranzke added the roadmap/internal Roadmap for our team-internal goals, e.g. drive up seed utilization label Jun 11, 2021

rfranzke added this to the 2021-Q4 milestone Jun 11, 2021

gardener deleted a comment from gardener-robot Jun 22, 2021

timebertt mentioned this issue Aug 10, 2021

☂️ -Issue for graduating CachedRuntimeClients to beta #2822

Closed

12 tasks

rfranzke removed this from the 2021-Q4 milestone Sep 7, 2021

timebertt mentioned this issue Sep 21, 2021

Use cached controller-runtime client #2414

Closed

16 tasks

wyb1 mentioned this issue Sep 21, 2021

Seed monitoring enhancement #4706

Merged

timebertt mentioned this issue Oct 11, 2021

Monitoring for extension pods running on seed gardener/gardener-extension-provider-aws#427

Open

2 tasks

timebertt mentioned this issue Oct 26, 2021

Remove dummy controller metrics #4913

Merged

timebertt mentioned this issue Dec 17, 2021

Refactor Gardener components to native controller-runtime components #4251

Closed

76 tasks

gardener-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2022

gardener-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2022

rfranzke added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 30, 2022
