[GEP-19] Monitoring Stack - Migrating to the prometheus-operator #6151

Merged: 26 commits into gardener:master from gep-monitoring on Jul 13, 2022

wyb1
Contributor

@wyb1 wyb1 commented Jun 21, 2022

How to categorize this PR?

/area monitoring
/area documentation
/kind discussion

What this PR does / why we need it:
A proposal on how Gardener can migrate to the prometheus operator.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

NONE

@gardener-prow
Contributor

gardener-prow bot commented Jun 21, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 21, 2022
@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/discussion Discussion (engaging others in deciding about multiple options) labels Jun 21, 2022
@gardener-prow
Contributor

gardener-prow bot commented Jun 21, 2022

@wyb1: The label(s) area/documention cannot be applied, because the repository doesn't have them.

In response to this:

How to categorize this PR?

/area monitoring
/area documention
/kind discussion

What this PR does / why we need it:
A proposal on how Gardener can migrate to the prometheus operator.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gardener-prow gardener-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Jun 21, 2022
@rfranzke rfranzke changed the title Monitoring GEP [GEP-19] Monitoring Stack - Migrating to the prometheus-operator Jun 21, 2022
docs/proposals/19-monitoring.md: 20 outdated review threads, all resolved
@timebertt
Member

/assign
I definitely want to take a look and provide feedback, but will probably only manage to do so next week.

Contributor

@istvanballok istvanballok left a comment

minor wording issues

docs/proposals/19-migrating-to-prometheus-operator.md.md: 5 outdated review threads, all resolved
Co-authored-by: Wesley Bermbach <wesley.bermbach@sap.com>
Co-authored-by: Istvan Zoltan Ballok <istvan.zoltan.ballok@sap.com>
@wyb1
Contributor Author

wyb1 commented Jun 27, 2022

Thanks for the reviews from everyone so far. Question: Should I create new commits instead of force pushing to make the changes easier to track?

Member

@timebertt timebertt left a comment

Thanks a lot for this proposal, really looking forward to it!
I'm not through with my review, but I already left some questions and thoughts :)

docs/proposals/19-migrating-to-prometheus-operator.md.md: 4 outdated review threads, all resolved
Member

@timebertt timebertt left a comment

Made it through the document and added some more comments.
Overall, it looks pretty good and I'm excited for this :)

In general, it would be good to precisely define all relevant contracts already in this GEP. This will make it easier to agree on something before jumping into the implementation.

Also, please open an umbrella issue with the concrete steps for implementing this proposal once this PR gets merged :)

docs/proposals/19-migrating-to-prometheus-operator.md.md: 6 outdated review threads, all resolved
@ialidzhikov
Member

Thanks for addressing my comments/questions. At a high level, I don't have any additional comments. lgtm.

@wyb1
Contributor Author

wyb1 commented Jun 30, 2022

Also, please open an umbrella issue with the concrete steps for implementing this proposal once this PR gets merged :)

We can use gardener/monitoring#14 as an umbrella issue. I will add items there.

@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2022
gardener-prow bot pushed a commit that referenced this pull request Jul 12, 2022
…6293)

* `garden` namespace deployment is only needed for second kind cluster

In the first kind cluster, the `garden` namespace already exists because it runs the Gardener control plane.
Without this, the second client-side apply removes the project labels from the `Namespace`.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Enable `ShootVPAEnabledByDefault` admission plugin in local setup

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Simplify gardenlet bootstrap kubeconfig

This makes it usable for locally created `ManagedSeed`s and follows the same pattern as in `example/gardener-local/gardenlet/values-kind2.yaml`

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Add documentation for `ManagedSeed`s

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Add example Kubernetes resources for local `ManagedSeed`s

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Register `node` webhook for shoot clusters

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Introduce seed-local container image registry

We will use this registry as a mirror for shoots such that the images
built by Skaffold are accessible from the shoot cluster in case it gets
registered as `ManagedSeed`.

This is needed because we don't want to push the images built by
Skaffold to any official, publicly available registry.

In the future, we might even be able to reuse this such that we can
speed up the image pull processing times.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Allow machine pods to talk to the registry

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Mutate `containerd` config to import additional configuration files

This only applies to newly created nodes.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Bump `machine-controller-manager-provider-local` image

This includes
gardener-attic/machine-controller-manager-provider-local@f2c9319
which allows machine pods to talk to the seed API server. In the local
setup, the seed API server is also the garden API server and the
gardenlet needs to talk to it to register the `Seed`.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Mount `backup-path` volume with `DirectoryOrCreate` mode in `provider-local`

This will create the directory if it does not exist which is the case for shoot clusters registered as `ManagedSeed`s.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Explicitly delete `Ingress`es in seed deletion

This effectively fixes #6062.

It's actually a workaround since not all seed system components (like
the monitoring stack) are deployed via `ManagedResource`s yet. Hence,
`gardener-resource-manager` does not clean them up for us and we have to
delete the resources manually.
We only do it for `Ingress`es now to fix the above-mentioned bug since the
deployment of the monitoring stack is planned to be refactored anyway
with [GEP-19](#6151).

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Only delete seed system resources after all other resources are deleted

Otherwise, we might delete important system resources like `PriorityClass`es which can cause extensions to not come up anymore (e.g. after being scaled by VPA). This can result in a deadlock during seed deletion.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Make `defaultShoot` function reusable in `e2e` test packages

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Add e2e test for `ManagedSeed`s

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Drop testmachinery-based `ManagedSeed` test

This is no longer valuable now that we have an e2e test which can run on each PR and periodically on the `master` branch.

Co-Authored-By: Tim Ebert <tim.ebert@sap.com>

* Address PR review feedback

* Address PR review feedback

Co-authored-by: Tim Ebert <tim.ebert@sap.com>
Member

@timebertt timebertt left a comment

/lgtm
/hold cancel

@gardener-prow gardener-prow bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jul 13, 2022
@gardener-prow
Contributor

gardener-prow bot commented Jul 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: timebertt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@timebertt timebertt removed this from the v1.51 milestone Jul 13, 2022
@istvanballok
Contributor

istvanballok commented Jul 13, 2022

/hold
(typo in the link)
otherwise, looks good to me 🎉

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 13, 2022
@gardener-prow gardener-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 13, 2022
@istvanballok
Contributor

/lgtm
/hold cancel
🎉

@gardener-prow gardener-prow bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jul 13, 2022
@timebertt
Member

/override pull-gardener-e2e-kind

@gardener-prow
Contributor

gardener-prow bot commented Jul 13, 2022

@timebertt: Overrode contexts on behalf of timebertt: pull-gardener-e2e-kind

In response to this:

/override pull-gardener-e2e-kind


@gardener-prow gardener-prow bot merged commit e3a1300 into gardener:master Jul 13, 2022
@wyb1 wyb1 deleted the gep-monitoring branch July 13, 2022 14:00
krgostev pushed a commit to krgostev/gardener that referenced this pull request Sep 8, 2022
* 'master' of github.com:gardener/gardener: (51 commits)
  Switch extension controller to `logr` and streamline/cleanup logs (gardener#6332)
  Switch `./test/...` packages to `logr` and drop `github.com/sirupsen/logrus` dependency (gardener#6316)
  Only check shoot conditions during hibernation integration test (gardener#6325)
  Add dashboard for monitoring conntrack race failures. (gardener#6329)
  Reconcile quota before rbac (gardener#6326)
  Update istio to v1.14.1 (gardener#6271)
  Update gardenlet's base image to alpine:3.16.0 (gardener#6321)
  Update envoy proxy to v1.21.4 (gardener#6320)
  Deploy the metrics server to the kind cluster (gardener#6301)
  Fix tools download for aarch64 (arm64) (gardener#6314)
  update with latest CA releases (gardener#6295)
  Add missing unit tests for the predicates provided by the extensions library (gardener#6249)
  [GEP-19] Monitoring Stack - Migrating to the `prometheus-operator` (gardener#6151)
  Revert "Recreate DWD deployment if needed" (gardener#6307)
  Update to golang 1.18.4 (gardener#6300)
  Cleaned up imports in vpn-seed-server (gardener#6315)
  Prepare next Dev Cycle v1.52.0-dev
  Release v1.51.0
  Add pre/post reconciliation/deletion hooks for the Worker resource (gardener#6290)
  Update the supported values in the usage text of the `--leader-election-resource-lock` flag (gardener#6304)
  ...
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/discussion Discussion (engaging others in deciding about multiple options) lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
7 participants