From c0d104a7a3b8022fb3b3c5a8272707e439fb06c1 Mon Sep 17 00:00:00 2001 From: Fernando Ripoll Date: Mon, 26 Feb 2024 16:18:59 +0100 Subject: [PATCH] Kyverno stuck (#230) * Add ops recipe for Kyverno being stuck in upgrade pending * Add ops recipe for Kyverno being stuck in upgrade pending --- CHANGELOG.md | 4 ++++ .../ops-recipes/kyverno-stuck-upgrade-pending.md | 14 ++++++++++++++ .../ops-recipes/troubleshooting-gitops.md | 14 +++++++------- 3 files changed, 25 insertions(+), 7 deletions(-) create mode 100644 content/docs/support-and-ops/ops-recipes/kyverno-stuck-upgrade-pending.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 745abb7..40b0b0f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Added + +- New recipe for Kyverno stuck in upgrade pending. + ### Changed - Update docsy to v0.9.0 diff --git a/content/docs/support-and-ops/ops-recipes/kyverno-stuck-upgrade-pending.md b/content/docs/support-and-ops/ops-recipes/kyverno-stuck-upgrade-pending.md new file mode 100644 index 0000000..c1e9c32 --- /dev/null +++ b/content/docs/support-and-ops/ops-recipes/kyverno-stuck-upgrade-pending.md @@ -0,0 +1,14 @@ +--- +title: "Kyverno Stuck In Pending Upgrade" +owner: +- https://github.com/orgs/giantswarm/teams/team-shield +confidentiality: public +--- + +There have been cases where, during cluster upgrades (for example from AWS v18 to v19), the Kyverno migration logic takes longer than the default `app-operator` installation timeout. This can leave Kyverno stuck in the Helm `pending-upgrade` state, which requires manual intervention. + +To force a resolution, the best approach is to roll back to the previous version, which causes `app-operator` to re-reconcile the App and refresh the stuck Helm charts. 
+ +``` +CLUSTER_ID=XXXXX; helm rollback -n "$CLUSTER_ID" "$CLUSTER_ID"-security-bundle $(helm ls -n $CLUSTER_ID -f "$CLUSTER_ID"-security-bundle -o yaml | yq '.[].revision') --force +``` diff --git a/content/docs/support-and-ops/ops-recipes/troubleshooting-gitops.md b/content/docs/support-and-ops/ops-recipes/troubleshooting-gitops.md index 6876a64..a2e8d1c 100644 --- a/content/docs/support-and-ops/ops-recipes/troubleshooting-gitops.md +++ b/content/docs/support-and-ops/ops-recipes/troubleshooting-gitops.md @@ -5,7 +5,7 @@ owner: confidentiality: public --- -We are offering GitOps as interface for our customers, here we collect tips on how to troubleshoot problems which can occur. +We are offering GitOps as interface for our customers, here we collect tips on how to troubleshoot problems which can occur. # Table of Contents 1. [Identify which kustomization owns a resource](#identify-which-kustomization-owns-a-resource) @@ -23,14 +23,14 @@ We are offering GitOps as interface for our customers, here we collect tips on h kustomize.toolkit.fluxcd.io/name: gorilla-clusters-rfjh2 kustomize.toolkit.fluxcd.io/namespace: default ``` - + From the kustomization one can tell the source Git repository by looking at the spec field `sourceRef`. 2. Use the flux command line. It offers a subcommand `trace` which describes all details related to GitOps: ``` » flux trace app/alfred-app -n alfred-ns - + Object: App/alfred-app Namespace: rfjh2 Status: Managed by Flux @@ -43,7 +43,7 @@ We are offering GitOps as interface for our customers, here we collect tips on h Namespace: default ... ``` - + __Note__: If the resource has no labels (or `flux trace` returns `object not managed by Flux`) the object is not produced as result of helm or kustomize but could still be owned by a higher resource. An example would be a *pod* which may not have the labels, but the parent *deployment* does. ## Download the Git Repository source @@ -70,8 +70,8 @@ Remember to notify the customer of this change. 
## Customer Communication -After stopping reconcilation, please notify the customer of the change via slack support channel where the customer will be able to review and make the necessary changes the following business day. +After stopping reconciliation, please notify the customer of the change via Slack support channel where the customer will be able to review and make the necessary changes the following business day. -In the case of an issue that cannot be fixed by stopping reconcilation and manually doing, a silence may be required. In this case, please notify via slack support channel a) the situation that we are alerted for and that we cannot help due to customer ownership and no access b) we will silence the alert until the next buisness day. +In the case of an issue that cannot be fixed by stopping reconciliation and manually doing, a silence may be required. In this case, please notify via Slack support channel a) the situation that we are alerted for and that we cannot help due to customer ownership and no access b) we will silence the alert until the next business day. -In case of urgent situations or when pausing reconcilation does not fix the issue and the customer needs to be notified before the next business day, please reference the customer specific escalation matrix found in intranet. This will notify the customer of the situation and that Giant Swarm has no way to fix the problem and that Giant Swarm will silence the alert because of this. `urgent@giantswarm.io` remains available for additional help within the Giant Swarm scope but can only be useful after the customer takes care of their fix. +In case of urgent situations or when pausing reconciliation does not fix the issue and the customer needs to be notified before the next business day, please reference the customer specific escalation matrix found in intranet. 
This will notify the customer of the situation and that Giant Swarm has no way to fix the problem and that Giant Swarm will silence the alert because of this. `urgent@giantswarm.io` remains available for additional help within the Giant Swarm scope but can only be useful after the customer takes care of their fix.