Replace vpa-exporter with kube-state-metrics #6771

istvanballok · 2022-10-04T14:53:37Z

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

Replace vpa-exporter with kube-state-metrics

Previously, the vpa-exporter component was used to expose VPA related metrics.
With this PR, the vpa-exporter component is no longer used and equivalent (but differently named) metrics are exposed by the existing kube-state-metrics deployments. The Grafana dashboards are adjusted and aligned appropriately.

This PR also contains a related network bandwidth usage improvement: previously the control plane prometheus-es used to scrape the kube-state-metrics independently, getting metrics for the whole seed cluster. Now they only federate the relevant metrics from the cache prometheus.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Please check the individual commits.

To facilitate code review and to verify the impact of the source code changes at runtime, the second half of the commits of this PR, starting with this commit, explains the effects that this PR introduces in the KinD based local setup of Gardener with a single shoot cluster.

Here is a screenshot of the revised dashboard:

/cc @rickardsjp @wyb1

Release note:

Replace vpa-exporter with kube-state-metrics.
The vpa-exporter is no longer used in Gardener.
The kube-state-metrics component is exposing the VPA related metrics.

rfranzke · 2022-10-04T15:17:09Z

/assign

wyb1 · 2022-10-04T15:22:41Z

/hold

charts/seed-monitoring/charts/grafana/dashboards/operators/vpa-dashboard.json

rfranzke

Nice PR! 👏🏻

pkg/operation/botanist/component/vpa/monitoring.go

pkg/utils/images/images.go

…as well There are 2 kinds of kube-state-metrics deployments in a seed: - in each shoot control plane namespace - in the garden namespace The one running in each shoot control plane namespace is connected to the shoot API server and hence exposes metrics about the shoot cluster. The one running in the garden namespace is connected to the seed API server and hence exposes metrics about the seed cluster. Previously, only the shoot control plane kube-state-metrics deployment had access to and exported the verticalpodautoscalers metrics. With this change, the cluster role that is bound to the service account that is used by the kube-state-metrics instance running in the garden namespace is granted access to verticalpodautoscaler resources (get, list, watch). Furthermore, the command line flags are extended so that vpa metrics are exposed. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

The kube-state-metrics in the garden namespace is scraped by the "cache" prometheus. The newly exposed vpa metrics are allow listed with this change. The control plane prometheus is already scraping the kube-state-metrics in the control plane namespace and allow listing some vpa metrics. With this change, its allow list is adjusted so that now both prometheuses keep the same set of vpa metrics: - recommendation (target, upper bound, lower bound) - min allowed, max allowed - update mode By having the same set of vpa metrics for both the seed and shoot clusters, we can use the same promql queries in the dashboard for pods that are running in the shoot cluster and in the control plane. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Only federate the ksm metrics (including the new vpa metrics) for the garden and extension namespaces. The grafana deployment in the garden namespace uses the aggregate-prometheus as a datasource, hence we need to federate these ksm/vpa metrics to the aggregate-prometheus to be able to show them on a dashboard. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

To show the actual (memory/cpu) usage in the VPA dashboard in Grafana in the garden namespace. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

istvanballok · 2022-10-05T13:59:35Z

Hi @wyb1 and @rfranzke,
@rickardsjp and I applied your review suggestions, updated the description and added a few explanatory commits, please take another look.

cc @voelzmo The revised VPA dashboard could provide some insights how the VPA works in our seed clusters.

- The "Shoot VPA Recommendations" dashboard was exposed to shoot owners. That dashboard didn't use variables and it is replaced now with the "VPA Recommendations" dashboard that was previously exposed only to Gardener operators: now it is exposed to shoot owners as well. - There were 2 "VPA Recommendations" dashboards: one in the control plane, the other one in the garden namespace and they diverged some time ago. Now we use a single dashboard and a symbolic link for the garden grafana. - The "VPA Recommendations" dashboard has been improved to show how CPU and memory 1) usage, the 2) VPA recommandations, upper and lower bounds and limits and 3) the resources requests and limits influence each other. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

We can use the (remote) local setup and deploy the base commit of this PR, and export the relevant state of the cluster with this script. Then, we can upgrade to this PR and restart the gardenlet and let the changes be applied and export the relevant state again. The diff shows the changes that are triggered by the PR. The resulting diff might be helpful when reviewing the PR. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Note that this is done by resource manager, acting upon the managed resource which no longer contains the vpa-exporter artifacts. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Equivalent metrics are scraped from the ksm component Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Equivalent metrics are federated from the cache prometheus Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

…ometheus The cadvisor metrics (like CPU and memory usage) are needed for the VPA Recommendations dashboard. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Equivalent metrics are federated from the ksm component Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

It federates only the relevant metrics (i.e. metrics for its own namespace) from the cache prometheus. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

The cache prometheus allow lists all because it didn't allow list any of them before. kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target kube_verticalpodautoscaler_status_recommendation_containerrecommendations_upperbound kube_verticalpodautoscaler_status_recommendation_containerrecommendations_lowerbound kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_minallowed kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed kube_verticalpodautoscaler_spec_updatepolicy_updatemode The control plane prometheus allow lists only a few more that it didn't allow list before (so that the 2 allow lists are aligned, and hence we can use the same dashboard for seed and shoot related metrics). kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_minallowed kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

There are 2 instances of this alert, and previously both instances considered etcd-main. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Previously it checked that the scrape target is "up", now it checks that there are some federated metrics. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

The control plane prometheus is scraping the ksm for the shoot, which exposes metrics only for the kube-system namespace, so that it is not necessary to overwrite the namespace with kube-system, because the namespace is already kube-system. Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

…view Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

gardener-prow · 2022-10-06T08:15:47Z

@istvanballok: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-gardener-check-vulnerabilities	`fb1d627`	link	false	`/test pull-gardener-check-vulnerabilities`

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

istvanballok · 2022-10-07T06:58:54Z

/unhold

rfranzke

Very nice, well done!

/lgtm
/approve

gardener-prow · 2022-10-07T11:34:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke, rickardsjp, wyb1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [rfranzke]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gardener-prow bot requested review from rickardsjp and wyb1 October 4, 2022 14:53

wyb1 approved these changes Oct 4, 2022

View reviewed changes

gardener-prow bot assigned wyb1 Oct 4, 2022

gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2022

rickardsjp approved these changes Oct 4, 2022

View reviewed changes

gardener-prow bot assigned rickardsjp Oct 4, 2022

gardener-prow bot assigned rfranzke Oct 4, 2022

gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2022

wyb1 reviewed Oct 4, 2022

View reviewed changes

charts/seed-monitoring/charts/grafana/dashboards/operators/vpa-dashboard.json Show resolved Hide resolved

rfranzke reviewed Oct 4, 2022

View reviewed changes

pkg/operation/botanist/component/vpa/monitoring.go Outdated Show resolved Hide resolved

pkg/utils/images/images.go Show resolved Hide resolved

istvanballok and others added 4 commits October 5, 2022 09:36

istvanballok force-pushed the ksm-vpa branch from bf4ff3a to 08ec6f6 Compare October 5, 2022 08:54

gardener-prow bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 5, 2022

istvanballok force-pushed the ksm-vpa branch from 08ec6f6 to 0c69bbb Compare October 5, 2022 13:29

gardener-prow bot added cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Oct 5, 2022

istvanballok force-pushed the ksm-vpa branch from 0c69bbb to 897d2eb Compare October 5, 2022 13:55

istvanballok and others added 18 commits October 6, 2022 10:10

Baseline

ab1aa58

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

The ksm cluster role now has access to verticalpodautoscalers

460c6f8

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

The ksm is configured to expose vpa metrics

bf74e82

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

The alert KubeEtcd3HighMemoryConsumption is adjusted

e40d1f2

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Unused (vpa related) recording rules are removed

1fc718d

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

Remove the temporary files that were only added to facilitate code re…

fb1d627

…view Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com> Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com> Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>

istvanballok force-pushed the ksm-vpa branch from 41f50c0 to fb1d627 Compare October 6, 2022 08:12

rickardsjp approved these changes Oct 6, 2022

View reviewed changes

gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 6, 2022

gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2022

rfranzke self-requested a review October 7, 2022 07:00

rfranzke reviewed Oct 7, 2022

View reviewed changes

gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2022

gardener-prow bot merged commit e57e63d into gardener:master Oct 7, 2022

acumino mentioned this pull request Apr 11, 2023

Fix broken grafana dashboard for state-metrics-seed #7771

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace vpa-exporter with kube-state-metrics #6771

Replace vpa-exporter with kube-state-metrics #6771

istvanballok commented Oct 4, 2022 •

edited

rfranzke commented Oct 4, 2022

wyb1 commented Oct 4, 2022

rfranzke left a comment

istvanballok commented Oct 5, 2022

gardener-prow bot commented Oct 6, 2022 •

edited

istvanballok commented Oct 7, 2022

rfranzke left a comment

gardener-prow bot commented Oct 7, 2022

Replace vpa-exporter with kube-state-metrics #6771

Replace vpa-exporter with kube-state-metrics #6771

Conversation

istvanballok commented Oct 4, 2022 • edited

rfranzke commented Oct 4, 2022

wyb1 commented Oct 4, 2022

rfranzke left a comment

Choose a reason for hiding this comment

istvanballok commented Oct 5, 2022

gardener-prow bot commented Oct 6, 2022 • edited

istvanballok commented Oct 7, 2022

rfranzke left a comment

Choose a reason for hiding this comment

gardener-prow bot commented Oct 7, 2022

istvanballok commented Oct 4, 2022 •

edited

gardener-prow bot commented Oct 6, 2022 •

edited