Helm install race: cluster-autoscaler / kccm / kcsi block on admin-kubeconfig Secret that CAPI has not produced yet #2412

@lexfrei

Description

Environment

  • Cozystack: v1.2.2 (verified unchanged on v1.3.0-rc.1 and master)
  • Chart: packages/apps/kubernetes
  • Kubernetes: v1.35.0+k3s3

Symptom

helm install for any apps.cozystack.io/Kubernetes application fails with context deadline exceeded after the 5-minute chart wait. The underlying error is a mount-time failure on three Deployments inside the chart:

FailedMount: MountVolume.SetUp failed for volume "kubeconfig":
  secret "<release>-admin-kubeconfig" not found

The affected workloads are:

  • <release>-cluster-autoscaler
  • <release>-kccm (cloud-controller-manager)
  • <release>-kcsi-controller (CSI controller)

HelmRelease / HelmController output

creating 35 resource(s)
beginning wait for 35 resources with timeout of 5m0s
Deployment is not ready: <ns>/<release>-cluster-autoscaler. 0 out of 1 expected pods are ready
(148 duplicate lines omitted)
Error received when checking status of resource <release>-cluster-autoscaler.
Error: 'client rate limiter Wait returned an error: context deadline exceeded'
wait for resources failed after 5m0s: context deadline exceeded

Root cause

The chart renders 35 resources in a single Helm release and relies on the default Helm wait to serialize their readiness. Three workloads in that set mount <release>-admin-kubeconfig as a non-optional Secret volume:

  • packages/apps/kubernetes/templates/cluster-autoscaler/deployment.yaml — secretName: {{ .Release.Name }}-admin-kubeconfig
  • The same pattern appears in kccm/manager.yaml and csi/controller.yaml.
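
Rendered, the relevant volume in each of the three Deployments looks roughly like this (a sketch reconstructed from the pattern above, not a verbatim copy of the template — field placement may differ):

```yaml
# Sketch of the rendered secret volume on e.g. <release>-cluster-autoscaler.
# Because `optional` defaults to false on secret volumes, the kubelet refuses
# to start the pod until the Secret exists, producing the FailedMount event above.
volumes:
  - name: kubeconfig
    secret:
      secretName: kubernetes-demo-k8s-admin-kubeconfig  # {{ .Release.Name }}-admin-kubeconfig
```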

That Secret is not produced by the chart — it is produced asynchronously by the CAPI Kubeadm control-plane provider after the Cluster CR (created in this same release) has finished provisioning the workload-cluster control plane. On clusters where control-plane provisioning exceeds the 5-minute --wait budget (typical, especially on first boot or with slow node images), the Deployment pods stay in ContainerCreating for the whole window, Helm reports them not-ready, and the install times out.

CAPI state observed during the wait window:

Cluster kubernetes-demo-k8s   PHASE: Provisioning   AGE: 3m15s

The Secret appears only once the control plane reaches Ready, but by then the Helm operation has already failed and the HelmRelease is marked install-failed.

Reproduction

  1. Install Cozystack v1.2.2 (also reproduced with the chart from v1.3.0-rc.1).
  2. Create a Tenant (any name).
  3. Create an apps.cozystack.io/Kubernetes Application inside that tenant with default values.
  4. Wait ~5 minutes. The HelmRelease transitions to install-failed.

Suggested fixes (pick one)

Ordered from most surgical to most invasive:

  1. Mark the admin-kubeconfig secret volume optional on the three Deployments. The workloads would start, crash-loop briefly while the Secret is absent, and recover once CAPI produces it. Single-attribute change per Deployment (optional: true on the secret volume source). Avoids the helm-wait timeout entirely but trades it for a short CrashLoopBackOff window.

  2. Move cluster-autoscaler / kccm / kcsi-controller into a separate HelmRelease that dependsOn the workload-cluster control-plane Secret being present. This is the FluxCD-idiomatic shape and mirrors how the chart already splits kubernetes-demo-k8s-cilium, kubernetes-demo-k8s-coredns, etc. as child HelmReleases that dependsOn the parent (visible in the HelmRelease list during reproduction).

  3. Add an init-container per Deployment that polls for the Secret with a short sleep loop. Inelegant but fully contained.

  4. Raise the Helm wait timeout in the cozystack-operator past the 95th-percentile CAPI provisioning time (15m). This only papers over the race — a slow node can still miss the window — but at least tightens the loop.
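
For illustration, fix 1 would be a one-line change to each of the three secret volumes — a hedged sketch against the cluster-autoscaler Deployment template (exact indentation and surrounding context in the real template may differ):

```yaml
volumes:
  - name: kubeconfig
    secret:
      secretName: {{ .Release.Name }}-admin-kubeconfig
      optional: true  # pod schedules immediately; crash-loops until CAPI writes the Secret
```

With optional: true the kubelet mounts the volume empty if the Secret is absent and populates it once the Secret appears, so the containers restart into a working state without operator intervention.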

Why this matters

First-time-user experience: creating a Kubernetes cluster from the dashboard produces an install-failed HelmRelease, which the UI surfaces as Failed. The cluster eventually does come up (CAPI keeps reconciling even after the HelmRelease gives up), but the retry path is non-obvious — operators typically run flux suspend followed by flux resume on the HelmRelease, or delete and recreate the Application.

Related

No matching existing issues found in the repo for admin-kubeconfig, FailedMount, cluster-autoscaler timeout. The last commit touching templates/cluster-autoscaler/ is f6d4541 (March 2025, resource limits). The chart structure has been in this shape since the v0.0.1 prep in February 2024.
