☂️-issue for gardener component metrics/alerts/dashboards #2815
Comments
We had an internal discussion with @wyb1 and @istvanballok. Thanks again for the productive session! To summarize, we formalized the following action items / plan for this topic:
/assign @danielfoehrKn
After initially looking into it, I have to say that I would need a considerable amount of time and focus for this topic (this is one of the areas I am least familiar with) that I do not have at the moment. /unassign
/touch
/touch
cc @shafeeqes
@shafeeqes @timebertt Can you check whether this issue is still up to date and, if not, change the description accordingly? :) Thank you!
I think
is completed with #7180
I checked off a few of the tasks that are already done with the controller runtime refactoring.
How to categorize this issue?
/area monitoring ops-productivity
/kind enhancement
/priority normal
What would you like to be added:
Currently, the `/metrics` endpoint of our components (gcm, g-scheduler, gardenlet) only exposes some very rudimentary metrics: `{garden_cm,gardener_scheduler,gardenlet}_worker_amount`, basic Go runtime metrics and Prometheus HTTP handler metrics. AFAIK, we don't display any of those metrics in any dashboard.
We should augment the set of component metrics that are collected and exposed to improve observability of the gardener core components. The client-go and controller-runtime libraries contain some helpers for collecting controller-related metrics that can/should be leveraged for that (see client-go and c-r).
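To make the idea of per-controller component metrics concrete, here is a minimal, dependency-free sketch in Go. It is not how the gardener components or the client-go/controller-runtime helpers are implemented: the `Registry` type is a hypothetical stand-in for a real Prometheus registry, and `example_reconcile_errors_total` is a made-up metric name. Only `gardenlet_worker_amount` comes from the description above. A real implementation would presumably build on the Prometheus Go client and the library helpers instead.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
	"sync"
)

// Registry is a hypothetical, minimal stand-in for a Prometheus registry.
// It stores one float value per (metric name, controller label) series.
type Registry struct {
	mu     sync.Mutex
	values map[string]float64
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{values: map[string]float64{}}
}

// Add adds delta to the series identified by the metric name and the
// "controller" label value, creating the series if it does not exist yet.
func (r *Registry) Add(name, controller string, delta float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.values[fmt.Sprintf("%s{controller=%q}", name, controller)] += delta
}

// Render returns all series as "name{labels} value" lines, similar to the
// Prometheus text exposition format, sorted by series for stable output.
func (r *Registry) Render() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	keys := make([]string, 0, len(r.values))
	for k := range r.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s %g\n", k, r.values[k])
	}
	return b.String()
}

// ServeHTTP lets the registry be mounted as a /metrics handler.
func (r *Registry) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, r.Render())
}

func main() {
	reg := NewRegistry()
	// Existing metric named in the issue, here labelled per controller.
	reg.Add("gardenlet_worker_amount", "shoot", 5)
	// Hypothetical additional metric: reconciliation errors per controller.
	reg.Add("example_reconcile_errors_total", "shoot", 1)
	reg.Add("example_reconcile_errors_total", "seed", 2)
	fmt.Print(reg.Render())
}
```

In a running component, the registry could be mounted with `http.Handle("/metrics", reg)` before starting the HTTP server. Labelling every series by controller is what would later allow dashboards and alerts to be broken down per component.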
Helpful client-side metrics would be:
(…, `short_watches_total`, `watch_duration`, `lists_total`)
Further server-side metrics that would also be helpful:
More advanced metrics that could be displayed in dashboards or used in alerts (though less important) may include:
We could then add dedicated dashboards for all gardener system components displaying:
Once we have gained some experience with those metrics, we could build more alerts for our system components based on them, for example:
Why is this needed:
When operating fast-growing gardener systems, it is important to have deeper insight into their system components.
In the past few months we have observed some issues regarding scalability/stability/availability which are very hard to debug, reproduce and fix (e.g. #2689, #2747).
Having component-aware metrics, alerts and dashboards will help in running and operating large-scale gardener installations, while maintaining quality of service and keeping operational effort low.
/cc @wyb1 @istvanballok