Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace vpa-exporter with kube-state-metrics #6771

Merged
merged 27 commits into from
Oct 7, 2022

Conversation

istvanballok
Copy link
Contributor

@istvanballok istvanballok commented Oct 4, 2022

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

Replace vpa-exporter with kube-state-metrics

Previously, the vpa-exporter component was used to expose VPA related metrics.
With this PR, the vpa-exporter component is no longer used and equivalent (but differently named) metrics are exposed by the existing kube-state-metrics deployments. The Grafana dashboards are adjusted and aligned appropriately.

This PR also contains a related network bandwidth usage improvement: previously the control plane prometheus-es used to scrape the kube-state-metrics independently, getting metrics for the whole seed cluster. Now they only federate the relevant metrics from the cache prometheus.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Please check the individual commits.

To facilitate code review and to verify the impact of the source code changes at runtime, the second half of the commits of this PR, starting with this commit, explains the effects that this PR introduces in the KinD based local setup of Gardener with a single shoot cluster.

Here is a screenshot of the revised dashboard:
image

/cc @rickardsjp @wyb1

Release note:

Replace vpa-exporter with kube-state-metrics.
The vpa-exporter is no longer used in Gardener.
The kube-state-metrics component is exposing the VPA related metrics.

@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 4, 2022
@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2022
@rfranzke
Copy link
Member

rfranzke commented Oct 4, 2022

/assign

@wyb1
Copy link
Contributor

wyb1 commented Oct 4, 2022

/hold

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2022
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! 👏🏻

pkg/operation/botanist/component/vpa/monitoring.go Outdated Show resolved Hide resolved
pkg/utils/images/images.go Show resolved Hide resolved
istvanballok and others added 4 commits October 5, 2022 09:36
…as well

There are 2 kinds of kube-state-metrics deployments in a seed:

- in each shoot control plane namespace
- in the garden namespace

The one running in each shoot control plane namespace is connected to the
shoot API server and hence exposes metrics about the shoot cluster.

The one running in the garden namespace is connected to the seed API server
and hence exposes metrics about the seed cluster.

Previously, only the shoot control plane kube-state-metrics deployment
had access to and exported the verticalpodautoscalers metrics.

With this change, the cluster role that is bound to the service account
that is used by the kube-state-metrics instance running in the garden
namespace is granted access to verticalpodautoscaler resources (get, list,
watch). Furthermore, the command line flags are extended so that vpa
metrics are exposed.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
The kube-state-metrics in the garden namespace is scraped by the "cache"
prometheus. The newly exposed vpa metrics are allow listed with this
change.

The control plane prometheus is already scraping the kube-state-metrics
in the control plane namespace and allow listing some vpa metrics.
With this change, its allow list is adjusted so that now both prometheuses
keep the same set of vpa metrics:

- recommendation (target, upper bound, lower bound)
- min allowed, max allowed
- update mode

By having the same set of vpa metrics for both the seed and shoot clusters,
we can use the same promql queries in the dashboard for pods that are
running in the shoot cluster and in the control plane.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Only federate the ksm metrics (including the new vpa metrics)
for the garden and extension namespaces.

The grafana deployment in the garden namespace uses the
aggregate-prometheus as a datasource, hence we need to federate these
ksm/vpa metrics to the aggregate-prometheus to be able to show them
on a dashboard.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
To show the actual (memory/cpu) usage in the VPA dashboard in Grafana
in the garden namespace.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
@gardener-prow gardener-prow bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 5, 2022
@gardener-prow gardener-prow bot added cla: no Indicates the PR's author has not signed the cla-assistant.io CLA. and removed cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Oct 5, 2022
@istvanballok
Copy link
Contributor Author

Hi @wyb1 and @rfranzke,
@rickardsjp and I applied your review suggestions, updated the description and added a few explanatory commits, please take another look.

cc @voelzmo The revised VPA dashboard could provide some insights how the VPA works in our seed clusters.

istvanballok and others added 18 commits October 6, 2022 10:10
- The "Shoot VPA Recommendations" dashboard was exposed to shoot owners.
  That dashboard didn't use variables and it is replaced now with the
  "VPA Recommendations" dashboard that was previously exposed only
  to Gardener operators: now it is exposed to shoot owners as well.

- There were 2 "VPA Recommendations" dashboards: one in the control plane,
  the other one in the garden namespace and they diverged some time ago.
  Now we use a single dashboard and a symbolic link for the garden grafana.

- The "VPA Recommendations" dashboard has been improved to show how
  CPU and memory 1) usage, the 2) VPA recommandations, upper and lower
  bounds and limits and 3) the resources requests and limits influence
  each other.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
We can use the (remote) local setup and deploy the base commit of this PR,
and export the relevant state of the cluster with this script.

Then, we can upgrade to this PR and restart the gardenlet and let the
changes be applied and export the relevant state again.

The diff shows the changes that are triggered by the PR. The resulting
diff might be helpful when reviewing the PR.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Note that this is done by resource manager, acting
upon the managed resource which no longer contains
the vpa-exporter artifacts.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Equivalent metrics are scraped from the ksm component

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Equivalent metrics are federated from the cache prometheus

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
…ometheus

The cadvisor metrics (like CPU and memory usage) are needed for the
VPA Recommendations dashboard.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Equivalent metrics are federated from the ksm component

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
It federates only the relevant metrics (i.e. metrics for its own namespace)
from the cache prometheus.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
The cache prometheus allow lists all because it didn't allow list any of
them before.

kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_upperbound
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_lowerbound
kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_minallowed
kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed
kube_verticalpodautoscaler_spec_updatepolicy_updatemode

The control plane prometheus allow lists only a few more that it didn't
allow list before (so that the 2 allow lists are aligned, and hence
we can use the same dashboard for seed and shoot related metrics).

kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_minallowed
kube_verticalpodautoscaler_spec_resourcepolicy_container_policies_maxallowed

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
There are 2 instances of this alert, and previously both instances
considered etcd-main.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Previously it checked that the scrape target is "up", now it checks
that there are some federated metrics.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
The control plane prometheus is scraping the ksm for the shoot, which
exposes metrics only for the kube-system namespace, so that it is not
necessary to overwrite the namespace with kube-system, because the
namespace is already kube-system.

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
…view

Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
Co-authored-by: Jeremy Rickards <jeremy.rickards@sap.com>
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Oct 6, 2022

@istvanballok: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gardener-check-vulnerabilities fb1d627 link false /test pull-gardener-check-vulnerabilities

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 6, 2022
@istvanballok
Copy link
Contributor Author

/unhold

@gardener-prow gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2022
@rfranzke rfranzke self-requested a review October 7, 2022 07:00
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, well done!

/lgtm
/approve

@gardener-prow
Copy link
Contributor

gardener-prow bot commented Oct 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke, rickardsjp, wyb1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2022
@gardener-prow gardener-prow bot merged commit e57e63d into gardener:master Oct 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants