-
Notifications
You must be signed in to change notification settings - Fork 451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace vpa-exporter with kube-state-metrics #6771
Conversation
/assign |
/hold |
charts/seed-monitoring/charts/grafana/dashboards/operators/vpa-dashboard.json
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR! 👏🏻
…as well There are 2 kinds of kube-state-metrics deployments in a seed: - in each shoot control plane namespace - in the garden namespace The one running in each shoot control plane namespace is connected to the shoot API server and hence exposes metrics about the shoot cluster. The one running in the garden namespace is connected to the seed API server and hence exposes metrics about the seed cluster. Previously, only the shoot control plane kube-state-metrics deployment had access to and exported the verticalpodautoscalers metrics. With this change, the cluster role that is bound to the service account that is used by the kube-state-metrics instance running in the garden namespace is granted access to verticalpodautoscaler resources (get, list, watch). Furthermore, the command line flags are extended so that vpa metrics are exposed. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
The kube-state-metrics in the garden namespace is scraped by the "cache" prometheus. The newly exposed vpa metrics are allow listed with this change. The control plane prometheus is already scraping the kube-state-metrics in the control plane namespace and allow listing some vpa metrics. With this change, its allow list is adjusted so that now both prometheuses keep the same set of vpa metrics: - recommendation (target, upper bound, lower bound) - min allowed, max allowed - update mode By having the same set of vpa metrics for both the seed and shoot clusters, we can use the same promql queries in the dashboard for pods that are running in the shoot cluster and in the control plane. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Only federate the ksm metrics (including the new vpa metrics) for the garden and extension namespaces. The grafana deployment in the garden namespace uses the aggregate-prometheus as a datasource, hence we need to federate these ksm/vpa metrics to the aggregate-prometheus to be able to show them on a dashboard. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
To show the actual (memory/cpu) usage in the VPA dashboard in Grafana in the garden namespace. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Hi @wyb1 and @rfranzke, cc @voelzmo The revised VPA dashboard could provide some insights how the VPA works in our seed clusters. |
- The "Shoot VPA Recommendations" dashboard was exposed to shoot owners. That dashboard didn't use variables and it is replaced now with the "VPA Recommendations" dashboard that was previously exposed only to Gardener operators: now it is exposed to shoot owners as well. - There were 2 "VPA Recommendations" dashboards: one in the control plane, the other one in the garden namespace and they diverged some time ago. Now we use a single dashboard and a symbolic link for the garden grafana. - The "VPA Recommendations" dashboard has been improved to show how CPU and memory 1) usage, the 2) VPA recommandations, upper and lower bounds and limits and 3) the resources requests and limits influence each other. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
We can use the (remote) local setup and deploy the base commit of this PR, and export the relevant state of the cluster with this script. Then, we can upgrade to this PR and restart the gardenlet and let the changes be applied and export the relevant state again. The diff shows the changes that are triggered by the PR. The resulting diff might be helpful when reviewing the PR. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Note that this is done by resource manager, acting upon the managed resource which no longer contains the vpa-exporter artifacts. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Equivalent metrics are scraped from the ksm component Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Equivalent metrics are federated from the cache prometheus Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
…ometheus The cadvisor metrics (like CPU and memory usage) are needed for the VPA Recommendations dashboard. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Equivalent metrics are federated from the ksm component Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
It federates only the relevant metrics (i.e. metrics for its own namespace) from the cache prometheus. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
The cache prometheus allow lists all because it didn't allow list any of them before. kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target kube_verticalpodautoscaler_status_recommendation_containerrecommendations_upperbound kube_verticalpodautoscaler_status_recommendation_containerrecommendations_lowerbound kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_minallowed kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed kube_verticalpodautoscaler_spec_updatepolicy_updatemode The control plane prometheus allow lists only a few more that it didn't allow list before (so that the 2 allow lists are aligned, and hence we can use the same dashboard for seed and shoot related metrics). kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_minallowed kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
There are 2 instances of this alert, and previously both instances considered etcd-main. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Previously it checked that the scrape target is "up", now it checks that there are some federated metrics. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
The control plane prometheus is scraping the ksm for the shoot, which exposes metrics only for the kube-system namespace, so that it is not necessary to overwrite the namespace with kube-system, because the namespace is already kube-system. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
…view Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
@istvanballok: The following test failed, say
Full PR test history. Your PR dashboard. Command help for this repository. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/unhold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, well done!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rfranzke, rickardsjp, wyb1 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
Replace vpa-exporter with kube-state-metrics
Previously, the vpa-exporter component was used to expose VPA related metrics.
With this PR, the vpa-exporter component is no longer used and equivalent (but differently named) metrics are exposed by the existing kube-state-metrics deployments. The Grafana dashboards are adjusted and aligned appropriately.
This PR also contains a related network bandwidth usage improvement: previously the control plane prometheus-es used to scrape the kube-state-metrics independently, getting metrics for the whole seed cluster. Now they only federate the relevant metrics from the cache prometheus.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Please check the individual commits.
To facilitate code review and to verify the impact of the source code changes at runtime, the second half of the commits of this PR, starting with this commit, explains the effects that this PR introduces in the KinD based local setup of Gardener with a single shoot cluster.
Here is a screenshot of the revised dashboard:
/cc @rickardsjp @wyb1
Release note: