Checks
Controller Version
gha-runner-scale-set-controller v0.10.0 (current latest at time of filing).
Deployment Method
Helm — official OCI chart oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller:0.10.0, deployed via ArgoCD into namespace arc-system on AKS 1.29.
To Reproduce
- Install the chart with defaults (so the chart-rendered Role/<release>-listener grants secrets: [create, delete, get, patch, update] to the controller's ServiceAccount; note: no list).
- Run any pipeline through an AutoscalingRunnerSet so an EphemeralRunner pod is created, completes a job, and the controller initiates finalizer cleanup.
- Observe the controller log; observe that the EphemeralRunner Custom Resource stays in Phase=Running with deletionTimestamp set and finalizer ephemeralrunner.actions.github.com/finalizer not removed.
Describe the bug
The EphemeralRunner finalizer cleanup path calls client.List on secrets with a label selector, but the chart-provisioned listener Role for the controller's ServiceAccount only grants get/create/delete/patch/update — not list. The controller therefore cannot complete the finalizer; the CR is stuck forever in Phase=Running with deletionTimestamp set, and EphemeralRunnerSets accumulate because their EphemeralRunner children never finalize.
This contradicts ADR 2023-04-11-limit-manager-role-permission.md, which explicitly states:
We will change the default cache-based client to bypass cache on reading Secrets and ConfigMaps (ConfigMap is used when you configure githubServerTLS), so we can eliminate the need for List and Watch Secrets permission in cluster scope.
…and goes on to define the listener Role permissions as exactly the five non-list verbs that are currently in the chart. The chart is consistent with the ADR; the controller code is not.
Evidence
Chart — charts/gha-runner-scale-set-controller/templates/manager_listener_role.yaml at current master:
- apiGroups: [""]
  resources:
  - secrets
  verbs:
  - create
  - delete
  - get
  - patch
  - update
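To make the gap concrete, here is a minimal Go sketch of the permission check. It uses a simplified stand-in for an RBAC rule: `PolicyRule`, `Allows`, and `ListenerSecretRules` are illustrative names, not the k8s.io/api types or the API server's authorizer, and wildcards and resourceNames are ignored. It only shows that the verbs quoted above cover get but not list.

```go
package main

import "fmt"

// PolicyRule is a simplified stand-in for an RBAC rule (assumption: this is
// not the rbac/v1 type from k8s.io/api).
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether want is present in list.
func contains(list []string, want string) bool {
	for _, item := range list {
		if item == want {
			return true
		}
	}
	return false
}

// Allows reports whether any rule grants verb on resource in group.
// Wildcards and resourceNames are deliberately ignored in this sketch.
func Allows(rules []PolicyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) &&
			contains(r.Resources, resource) &&
			contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

// ListenerSecretRules mirrors the secrets rule quoted from the chart.
func ListenerSecretRules() []PolicyRule {
	return []PolicyRule{{
		APIGroups: []string{""},
		Resources: []string{"secrets"},
		Verbs:     []string{"create", "delete", "get", "patch", "update"},
	}}
}

func main() {
	rules := ListenerSecretRules()
	for _, verb := range []string{"get", "delete", "list"} {
		fmt.Printf("secrets %-6s allowed=%v\n", verb, Allows(rules, "", "secrets", verb))
	}
}
```

On a live cluster, the same question can be asked with `kubectl auth can-i list secrets` while impersonating the controller's ServiceAccount.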
Code — controllers/actions.github.com/ephemeralrunner_controller.go lines 486–499 at current master:
func (r *EphemeralRunnerReconciler) cleanupContainerHooksResources(ctx context.Context, ephemeralRunner *v1alpha1.EphemeralRunner, log logr.Logger) error {
	log.Info("Cleaning up runner linked pods")
	var errs []error
	if err := r.cleanupRunnerLinkedPods(ctx, ephemeralRunner, log); err != nil {
		errs = append(errs, err)
	}
	log.Info("Cleaning up runner linked secrets")
	if err := r.cleanupRunnerLinkedSecrets(ctx, ephemeralRunner, log); err != nil {
		errs = append(errs, err)
	}
	return errors.Join(errs...)
}
cleanupRunnerLinkedSecrets does r.List(ctx, &secretList, client.InNamespace(...), runnerLinkedLabels) (the same pattern visible at lines 508–510 for cleanupRunnerLinkedPods).
Live error from a vanilla install:
ERROR EphemeralRunner Failed to clean up container hooks resources
version=0.10.0
ephemeralrunner={"name":"<scale-set>-runner-<hash>","namespace":"arc-system"}
error="failed to list runner-linked secrets: secrets is forbidden:
User \"system:serviceaccount:arc-system:arc-gha-rs-controller\"
cannot list resource \"secrets\" in API group \"\" in the namespace \"arc-system\""
Resulting K8s state (observed on our dev cluster after ~4 days of normal pipeline traffic):
- 13+ orphan EphemeralRunner CRs stuck Phase=Running, deletionTimestamp set, finalizer present.
- 7 EphemeralRunnerSets for 3 logical scale-sets (old ERSets do not GC because their ER children cannot finalize).
- Controller logs error every ~10 s as the workqueue retries.
Why this is a bug (not user-error)
- The chart Role is as the ADR specifies. The user is correct to expect it not to grant list.
- The code path that runs is hit by every successful EphemeralRunner cleanup, not an edge case.
- The error is non-fatal but causes unbounded CR accumulation, eventual workqueue churn, and (at scale) controller OOM.
Related but not duplicate
#3054 reports the same error message in v0.6.1. It was closed after the reporter said cleanup-protection-finalizer wiring "fixed it," but the underlying mismatch between code (List) and chart (no list verb) has not been addressed. The error is reproducible on a clean install of v0.10.0.
Suggested fix (choose one)
Option A — fix the code (matches ADR intent):
Change cleanupRunnerLinkedSecrets to enumerate secrets by name rather than label-selector list. The runner-linked secret name is deterministic from the runner name, so a single Get is sufficient. Keeps the chart Role minimal.
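A rough sketch of what Option A could look like, using only the standard library so it runs on its own. Everything here is hypothetical: `SecretClient` is a tiny stand-in for the controller-runtime client, `restrictedClient` simulates a Role without list, and `linkedSecretName` is a placeholder naming rule, not ARC's actual convention. It assumes, as stated above, that the secret name can be derived from the runner name.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("not found")

// SecretClient exposes only the verbs the chart Role already grants.
// Hypothetical interface; the real controller uses a controller-runtime client.
type SecretClient interface {
	Get(namespace, name string) error
	Delete(namespace, name string) error
}

// restrictedClient is an in-memory fake keyed by "namespace/name".
// It has no List method at all, mirroring a Role without the list verb.
type restrictedClient struct {
	secrets map[string]bool
}

func (c *restrictedClient) Get(namespace, name string) error {
	if !c.secrets[namespace+"/"+name] {
		return errNotFound
	}
	return nil
}

func (c *restrictedClient) Delete(namespace, name string) error {
	key := namespace + "/" + name
	if !c.secrets[key] {
		return errNotFound
	}
	delete(c.secrets, key)
	return nil
}

// linkedSecretName is a placeholder rule, NOT the real ARC naming scheme.
func linkedSecretName(runnerName string) string {
	return runnerName + "-linked"
}

// cleanupLinkedSecret removes the runner-linked secret with Get and Delete
// only. A missing secret counts as success so the cleanup stays idempotent.
func cleanupLinkedSecret(c SecretClient, namespace, runnerName string) error {
	name := linkedSecretName(runnerName)
	if err := c.Get(namespace, name); err != nil {
		if errors.Is(err, errNotFound) {
			return nil
		}
		return fmt.Errorf("failed to get runner-linked secret: %w", err)
	}
	if err := c.Delete(namespace, name); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("failed to delete runner-linked secret: %w", err)
	}
	return nil
}

func main() {
	c := &restrictedClient{secrets: map[string]bool{
		"arc-system/" + linkedSecretName("runner-abc"): true,
	}}
	err := cleanupLinkedSecret(c, "arc-system", "runner-abc")
	fmt.Println("error:", err, "remaining secrets:", len(c.secrets))
}
```

If container hooks can create more than one secret per runner, a single Get would not be enough, and the names would need to be tracked some other way (for example in the EphemeralRunner status).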
Option B — fix the chart (matches what the code does):
Add list (and watch, since informer-based List commonly attaches a watch) to the listener Role's secrets rule. Update the ADR to reflect that List-on-secrets is required in this namespace-scoped Role.
Either option closes the gap. Option A is preferable because it preserves the original ADR's least-privilege intent. Option B is faster but represents a permission expansion on a sensitive resource.
Workaround (local)
We're patching this downstream by shipping a supplemental Role in our wrapper chart that adds list and watch on secrets, scoped to the controller's release namespace only. Happy to share the manifest if helpful; it is six lines.
Additional context
Found while investigating CI runner cleanup behavior on AKS. The chart is otherwise stable; this is the only consistent reconcile error in our controller logs.