Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Druid Refactor to Address Multiple Controller Conflicts #777

Merged
merged 235 commits into from
Jun 24, 2024

Conversation

unmarshall
Copy link
Contributor

@unmarshall unmarshall commented Apr 2, 2024

How to categorize this PR?

/area control-plane dev-productivity quality
/kind api-change enhancement cleanup test

What this PR does / why we need it:

Which issue(s) this PR fixes:

Release note:

Custodian controller has now been removed in favour of etcd status reconciliation handled by etcd controller. CLI flags `--custodian-workers` and `--custodian-sync-period` have now been removed, and are no longer recognised by etcd-druid.
CLI flag `--workers` has now been renamed to `--etcd-workers`. Additionally, etcd controller also accepts new CLI flags `enable-etcd-spec-auto-reconcile` to control how and when the etcd spec is reconciled, and `etcd-status-sync-period` to specify the duration after which an event will be re-queued to ensure etcd status reconciliation. CLI flag `ignore-operation-annotation` has been deprecated, and will be removed in an upcoming release.
Creation of Etcd resource no longer requires annotation `gardener.cloud/operation: reconcile` to be set on it for etcd-druid to reconcile it. In other words, creation of Etcd resource is immediate, irrespective of whether etcd-spec-auto-reconciliation is enabled or not.
CLI flag `--metrics-addr` is now deprecated. Please use `--metrics-bind-address` and `--metrics-port` instead.
CLI flag `--leader-election-resource-lock` is now deprecated, and will be set to `leases` from a future release onwards.
A new validating webhook named `sentinel` has been introduced to safeguard resources created by etcd-druid. A new annotation `druid.gardener.cloud/disable-etcd-component-protection` has been introduced, which if set, tells sentinel webhook to allow manual changes by an operator on any resource managed by etcd-druid.

This webhook is disabled by default, and can be enabled as follows:
- If deploying druid via the binary, please pass CLI flag `--enable-sentinel-webhook` to it. Additionally, CLI flags `--webhook-server-bind-address`, `--webhook-server-port` and `--webhook-server-tls-server-cert-dir` need to be passed when enabling the webhook, which enforces TLS communication using the given certs.
- If deploying druid via the Helm charts, please set chart value `webhooks.sentinel.enabled: true`.
- If deploying druid via Skaffold, please set environment variable `DRUID_ENABLE_SENTINEL_WEBHOOK=true`. This is also applicable when running Make targets such as `deploy`, `deploy-dev`, `deploy-debug`, `test-e2e`, etc, except for `ci-e2e-kind`.
Annotation `druid.gardener.cloud/ignore-reconciliation` has been marked as deprecated. Please use `druid.gardener.cloud/suspend-etcd-spec-reconcile` instead, which provides the same behavior.
All packages under `/pkg` and `/controllers` directories have now been moved to new parent `/internal` directory.
The component model used for deploying resources has now been replaced with a simplified `ResourceOperator` model, found under `/internal/operator`.
Etcd resource status now includes field `LastOperation` to indicate the last operation performed on the etcd resource. This includes a unique `RunID` to help sift through logs containing the specific `RunID`, improving debuggability. Every reconciler run generates a unique `RunID`.
Etcd resource status now includes field `LastErrors` to indicate any errors encountered in the last reconciliation of the etcd resource. Custom error codes have been introduced to help capture contextual information from the reconciliation run.
Labels on druid-managed resources are now streamlined, and no longer include `name` and `instance`. Instead, these are now standard labels `app.kubernetes.io/managed-by` and `app.kubernetes.io/part-of`, as [recommended by Kubernetes](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels). Additionally, `app.kubernetes.io/component` label is also used to set the type of the component for an etcd cluster.
Scale-up logic for single-node etcd clusters with peerTLS disabled to multi-node etcd clusters with peerTLS enabled, has been improved by making it deterministic and eliminates an unnecessary restart of the first etcd member, thus making this process faster and error-free.
A new condition `DataVolumesReady` has been introduced in `etcd.Status` to capture and report PVC warnings.
Etcd pods now mount files with `DefaultMode` set to `0640`.
etcd-backup-restore container was started with SYS_PTRACE linux capability. This prevented creating etcd cluster with [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/). This linux capability has now been removed as it is no longer required.
Volume mounts for the etcd StatefulSet have now been fixed, to allow individually specifying TLS secrets for the etcd and backup-restore servers. CA and TLS certificates used for etcd client-server communication, relevant to the container that they are mounted on, can be found at `/var/etcd/ssl/`. CA and TLS certificates used for etcd peer communication, relevant to the container that they are mounted on, can be found at `/var/etcd/ssl/peer`. CA and TLS certificates used for etcd-backup-restore client-server communication, relevant to the container that they are mounted on, can be found at `/var/etcdbr/ssl`.
The backup secret mounted onto the backup-restore container in the etcd pod, used by etcd-backup-restore process, is now mounted onto `/var/backup-store` instead of `/var/etcd-backup`.
We are moving towards using golang native tests. This also allowed us to relook at the unit and integration tests that we have. In this PR we have only partially introduced comprehensive golang native tests for specific packages (`internal/operator`, `internal/webhook`, `internal/controller/etcd/` and `internal/utils/`). We have also added comprehensive integration tests for etcd controller and the new IT tests are present at `test/it/controller/etcd`. In future PRs we will replace the ginkgo based tests and replace it with native golang tests for rest of the packages as well.
Introduced new Makefile targets:
`deploy-dev` - starts skaffold in dev mode allowing reloading druid upon change.
`deploy-debug` - starts skaffold in debug mode allowing using breakpoints to interrupt the control-flow.
`undeploy` - uses skaffold delete to delete all resources that are installed via skaffold.
Added support for new backup store provider `stackit` which is an alias for `S3`.

@unmarshall unmarshall requested a review from a team as a code owner April 2, 2024 08:39
@gardener-robot gardener-robot added the needs/review Needs review label Apr 2, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 2, 2024
@gardener-robot gardener-robot added area/control-plane Control plane related area/dev-productivity Developer productivity related (how to improve development) labels Apr 2, 2024
@unmarshall unmarshall removed their assignment Apr 2, 2024
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/api-change API change with impact on API users kind/cleanup Something that is not needed anymore and can be cleaned up kind/enhancement Enhancement, improvement, extension kind/test Test size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Apr 2, 2024
Copy link

gitguardian bot commented Apr 2, 2024

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
- RSA Private Key e2bfea3 charts/druid/resources/server.key View secret
- RSA Private Key e2bfea3 charts/druid/resources/ca.key View secret
- RSA Private Key e2bfea3 charts/druid/templates/secret-server-tls-crt.yaml View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@gardener-robot-ci-2 gardener-robot-ci-2 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 2, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 2, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 2, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 2, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Apr 3, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 3, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Apr 3, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 17, 2024
Copy link
Contributor

@seshachalam-yv seshachalam-yv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/second-opinion Needs second review by someone else labels Jun 18, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 18, 2024
@gardener-robot gardener-robot added needs/second-opinion Needs second review by someone else and removed reviewed/lgtm Has approval for merging labels Jun 23, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 23, 2024
@shreyas-s-rao
Copy link
Contributor

List of manual tests executed:

TEST NAME Druid Auto-Reconcile Single/Multi Node Backups (provider) Etcd Client TLS Etcd Peer TLS EtcdBR TLS TEST RESULT
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Single NA FALSE FALSE FALSE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Single NA TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Single AWS TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi NA FALSE FALSE FALSE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi NA TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi AWS TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi GCP TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi Azure TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi Openstack TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd FALSE Multi Local TRUE TRUE TRUE TRUE
Perform etcd spec changes, check if reconciliation triggered FALSE Multi GCP TRUE TRUE TRUE TRUE
Scale-up etcd from single-node non-TLS to multi-node non-TLS, hibernate, unhibernate FALSE Single GCP FALSE FALSE FALSE TRUE
Scale-up etcd from single-node non-TLS to multi-node TLS, hibernate, unhibernate FALSE Single GCP FALSE FALSE FALSE TRUE
Scale-up etcd from single-node TLS to multi-node TLS, hibernate, unhibernate FALSE Single NA TRUE TRUE TRUE TRUE
Upgrade druid from master to #777, check status updates, add reconcile annotation, check reconciliation FALSE Multi GCP TRUE TRUE TRUE TRUE
Deploy etcdcopybackupstask, check success FALSE Multi Local TRUE TRUE TRUE TRUE
Configure compaction with low threshold, populate etcd, check if compaction jobs are triggered and run FALSE Single AWS TRUE TRUE TRUE TRUE
Deploy etcd, check reconciliation, hibernate, unhibernate, delete etcd TRUE Multi GCP TRUE TRUE TRUE TRUE
Perform etcd spec changes, check if reconciliation triggered TRUE Multi GCP TRUE TRUE TRUE TRUE

@shreyas-s-rao
Copy link
Contributor

New changes (minor fixes) as part of 4843c75:

  1. Fix scale subresource to use selectorpath as Etcd.spec.selector rather than the deprecated Etcd.status.labelSelector
  2. Add missing delete role to druid
  3. Handle case when auto-reconcile is disabled, and Etcd spec is updated from 1 to 3 replicas, but has not yet been reconciled. During such a case, the expected members in the cluster are only 1 (actual current replicas), and not 3 (spec.replicas). This can be checked by checking whether the etcd observedGeneration matches the generation. This check is now added to Etcd.status.ready check as well as computing Etcd.status.members and consequently the AllMembersReady condition.
  4. Minor fix in skaffold-based helm deployment, to set certain compaction-related flags only upon profile activation, and not always.

@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 24, 2024
@shreyas-s-rao
Copy link
Contributor

/retest

1 similar comment
@shreyas-s-rao
Copy link
Contributor

/retest

@unmarshall
Copy link
Contributor Author

I ran the e2e tests locally and these pass for me.

@unmarshall unmarshall merged commit 7b347f7 into gardener:master Jun 24, 2024
11 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 24, 2024
renormalize pushed a commit that referenced this pull request Jul 8, 2024
…which includes #777 (#804)

* Add new labels to sts pods, for backward compatibility with v0.23.0 which includes #777

* Fix unit tests for status.members checker

* Address review comment by @anveshreddy18; check exact match for sts label selector for sts recreation

* Statefulset PreDeploy now only checks pod labels, and not whether they are updated or ready, similar to #823

* Add comments for `PreDeploy` methods
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related area/dev-productivity Developer productivity related (how to improve development) area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/api-change API change with impact on API users kind/cleanup Something that is not needed anymore and can be cleaned up kind/enhancement Enhancement, improvement, extension kind/test Test needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] violates PodSecurity "baseline:latest" ☂ Druid Refactor to Address Multiple Controller Conflicts