Helm install race: cluster-autoscaler / kccm / kcsi block on admin-kubeconfig Secret that CAPI has not produced yet #2412

@lexfrei

Description

Environment

  • Cozystack: v1.2.2 (verified unchanged on v1.3.0-rc.1 and master)
  • Chart: packages/apps/kubernetes
  • Kubernetes: v1.35.0+k3s3

Symptom

helm install for any apps.cozystack.io/Kubernetes application fails with context deadline exceeded after the 5-minute chart wait. The underlying error is a mount-time failure on three Deployments inside the chart:

FailedMount: MountVolume.SetUp failed for volume "kubeconfig":
  secret "<release>-admin-kubeconfig" not found

The affected workloads are:

  • <release>-cluster-autoscaler
  • <release>-kccm (cloud-controller-manager)
  • <release>-kcsi-controller (CSI controller)

HelmRelease / HelmController output

creating 35 resource(s)
beginning wait for 35 resources with timeout of 5m0s
Deployment is not ready: <ns>/<release>-cluster-autoscaler. 0 out of 1 expected pods are ready
(148 duplicate lines omitted)
Error received when checking status of resource <release>-cluster-autoscaler.
Error: 'client rate limiter Wait returned an error: context deadline exceeded'
wait for resources failed after 5m0s: context deadline exceeded

Root cause

The chart renders 35 resources in a single Helm release and relies on the default Helm wait to serialize their readiness. Three workloads in that set mount <release>-admin-kubeconfig as a non-optional Secret volume:

  • packages/apps/kubernetes/templates/cluster-autoscaler/deployment.yaml — secretName: {{ .Release.Name }}-admin-kubeconfig
  • The same pattern appears in kccm/manager.yaml and csi/controller.yaml.
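
Rendered, the relevant volume in each of the three Deployments looks roughly like this (a sketch reconstructed from the pattern above, not a verbatim copy of the template — field placement may differ):

```yaml
# Sketch of the rendered secret volume on e.g. <release>-cluster-autoscaler.
# Because `optional` defaults to false on secret volumes, the kubelet refuses
# to start the pod until the Secret exists, producing the FailedMount event above.
volumes:
  - name: kubeconfig
    secret:
      secretName: kubernetes-demo-k8s-admin-kubeconfig  # {{ .Release.Name }}-admin-kubeconfig
```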

That Secret is not produced by the chart — it is produced asynchronously by the CAPI Kubeadm control-plane provider after the Cluster CR (created in this same release) has finished provisioning the workload-cluster control plane. On clusters where control-plane provisioning exceeds the 5-minute --wait budget (typical, especially on first boot or with slow node images), the Deployment pods stay in ContainerCreating for the whole window, Helm reports them not-ready, and the install times out.

CAPI state observed during the wait window:

Cluster kubernetes-demo-k8s   PHASE: Provisioning   AGE: 3m15s

The Secret appears only once the control plane reaches Ready, but by then the Helm operation has already failed and the HelmRelease is marked install-failed.

Reproduction

  1. Install Cozystack v1.2.2 (also reproduced with the chart from v1.3.0-rc.1).
  2. Create a Tenant (any name).
  3. Create an apps.cozystack.io/Kubernetes Application inside that tenant with default values.
  4. Wait ~5 minutes. The HelmRelease transitions to install-failed.

Suggested fixes (pick one)

Ordered from most surgical to most invasive:

  1. Mark the admin-kubeconfig secret volume optional on the three Deployments. The workloads would start, crash-loop briefly while the Secret is absent, and recover once CAPI produces it. Single-attribute change per Deployment (optional: true on the secret volume source). Avoids the helm-wait timeout entirely but trades it for a short CrashLoopBackOff window.

  2. Move cluster-autoscaler / kccm / kcsi-controller into a separate HelmRelease that dependsOn the workload-cluster control-plane Secret being present. This is the FluxCD-idiomatic shape and mirrors how the chart already splits kubernetes-demo-k8s-cilium, kubernetes-demo-k8s-coredns, etc. as child HelmReleases that dependsOn the parent (visible in the HelmRelease list during reproduction).

  3. Add an init-container per Deployment that polls for the Secret with a short sleep loop. Inelegant but fully contained.

  4. Raise the Helm wait timeout in the cozystack-operator past the 95th-percentile CAPI provisioning time (15m). This only papers over the race — a slow node can still miss the window — but at least tightens the loop.
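
For illustration, fix 1 would be a one-line change to each of the three secret volumes — a hedged sketch against the cluster-autoscaler Deployment template (exact indentation and surrounding context in the real template may differ):

```yaml
volumes:
  - name: kubeconfig
    secret:
      secretName: {{ .Release.Name }}-admin-kubeconfig
      optional: true  # pod schedules immediately; crash-loops until CAPI writes the Secret
```

With optional: true the kubelet mounts the volume empty if the Secret is absent and populates it once the Secret appears, so the containers restart into a working state without operator intervention.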

Why this matters

First-time-user experience: creating a Kubernetes cluster from the dashboard produces an install-failed HelmRelease, which the UI surfaces as Failed. The cluster eventually does come up (CAPI keeps reconciling even after the HelmRelease gives up), but the retry path is non-obvious — operators typically run flux suspend followed by flux resume on the HelmRelease, or delete and recreate the Application.

Related

No matching existing issues found in the repo for admin-kubeconfig, FailedMount, cluster-autoscaler timeout. The last commit touching templates/cluster-autoscaler/ is f6d4541 (March 2025, resource limits). The chart structure has been in this shape since the v0.0.1 prep in February 2024.
