gha-runner-scale-set-controller v0.10.0: EphemeralRunner finalizer cleanup calls list on secrets but chart Role omits the verb (contradicts ADR-2023-04-11) #4510

@milosCvetkovicObsidian22

Description

Checks

  • I've already read the troubleshooting guide and my issue is not covered.
  • I am using charts that are officially provided.
  • I have read the changelog and this is not due to a recently-introduced backward-incompatible change.

Controller Version

gha-runner-scale-set-controller v0.10.0 (current latest at time of filing).

Deployment Method

Helm — official OCI chart oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller:0.10.0, deployed via ArgoCD into namespace arc-system on AKS 1.29.

To Reproduce

  1. Install the chart with defaults (so the chart-rendered Role/<release>-listener grants secrets: [create, delete, get, patch, update] to the controller's ServiceAccount — note: no list).
  2. Run any pipeline through an AutoscalingRunnerSet so an EphemeralRunner pod is created, completes a job, and the controller initiates finalizer cleanup.
  3. Observe the controller log; observe that the EphemeralRunner Custom Resource stays in Phase=Running with deletionTimestamp set and finalizer ephemeralrunner.actions.github.com/finalizer not removed.

Describe the bug

The EphemeralRunner finalizer cleanup path calls client.List on secrets with a label selector, but the chart-provisioned listener Role for the controller's ServiceAccount only grants get/create/delete/patch/update — not list. The controller therefore cannot complete the finalizer; the CR is stuck forever in Phase=Running with deletionTimestamp set, and EphemeralRunnerSets accumulate because their EphemeralRunner children never finalize.

This contradicts ADR 2023-04-11-limit-manager-role-permission.md, which explicitly states:

We will change the default cache-based client to bypass cache on reading Secrets and ConfigMaps (ConfigMap is used when you configure githubServerTLS), so we can eliminate the need for List and Watch Secrets permission in cluster scope.

…and goes on to define the listener Role permissions as exactly the five non-list verbs that are currently in the chart. The chart is consistent with the ADR; the controller code is not.

Evidence

Chart: charts/gha-runner-scale-set-controller/templates/manager_listener_role.yaml at current master:

- apiGroups: [""]
  resources:
  - secrets
  verbs:
  - create
  - delete
  - get
  - patch
  - update
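To make the gap concrete, here is a minimal sketch of how an RBAC rule set is evaluated, applied to the secrets rule above. `PolicyRule`, `matches`, and `Allowed` are simplified stand-ins written for this issue, not the real Kubernetes types or authorizer (resourceNames and subresources are ignored). The point it illustrates is that RBAC is additive, so `get` does not imply `list`.

```go
package main

import "fmt"

// PolicyRule mirrors the three fields of a Kubernetes RBAC rule that matter
// here. It is a simplified model, not the real rbacv1.PolicyRule type.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// matches reports whether want is in list, honoring the "*" wildcard.
func matches(list []string, want string) bool {
	for _, v := range list {
		if v == "*" || v == want {
			return true
		}
	}
	return false
}

// Allowed reports whether any rule grants verb on resource in group.
// RBAC is purely additive: a verb that is not listed is denied.
func Allowed(rules []PolicyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if matches(r.APIGroups, group) && matches(r.Resources, resource) && matches(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	// The secrets rule quoted above from manager_listener_role.yaml.
	chartRules := []PolicyRule{{
		APIGroups: []string{""},
		Resources: []string{"secrets"},
		Verbs:     []string{"create", "delete", "get", "patch", "update"},
	}}
	for _, verb := range []string{"get", "delete", "list"} {
		fmt.Printf("%s secrets: %v\n", verb, Allowed(chartRules, "", "secrets", verb))
	}
}
```

Under this model the chart's rule allows `get` and `delete` on secrets but denies `list`, which matches the forbidden error shown below.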

Code: controllers/actions.github.com/ephemeralrunner_controller.go lines 486–499 at current master:

func (r *EphemeralRunnerReconciler) cleanupContainerHooksResources(ctx context.Context, ephemeralRunner *v1alpha1.EphemeralRunner, log logr.Logger) error {
    log.Info("Cleaning up runner linked pods")
    var errs []error
    if err := r.cleanupRunnerLinkedPods(ctx, ephemeralRunner, log); err != nil {
        errs = append(errs, err)
    }

    log.Info("Cleaning up runner linked secrets")
    if err := r.cleanupRunnerLinkedSecrets(ctx, ephemeralRunner, log); err != nil {
        errs = append(errs, err)
    }

    return errors.Join(errs...)
}

cleanupRunnerLinkedSecrets calls r.List(ctx, &secretList, client.InNamespace(...), runnerLinkedLabels) (the same pattern visible at lines 508–510 for cleanupRunnerLinkedPods).

Live error from a vanilla install:

ERROR EphemeralRunner Failed to clean up container hooks resources
  version=0.10.0
  ephemeralrunner={"name":"<scale-set>-runner-<hash>","namespace":"arc-system"}
  error="failed to list runner-linked secrets: secrets is forbidden:
        User \"system:serviceaccount:arc-system:arc-gha-rs-controller\"
        cannot list resource \"secrets\" in API group \"\" in the namespace \"arc-system\""

Resulting K8s state (observed on our dev cluster after ~4 days of normal pipeline traffic):

  • 13+ orphan EphemeralRunner CRs stuck Phase=Running, deletionTimestamp set, finalizer present.
  • 7 EphemeralRunnerSets for 3 logical scale-sets (old ERSets do not GC because their ER children cannot finalize).
  • The controller logs the error roughly every 10 s as the workqueue retries.
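For anyone who wants to check their own cluster for the same state, the following is a rough sketch that filters the JSON output of something like `kubectl get ephemeralrunners -n arc-system -o json` for objects that have a deletionTimestamp and still carry the finalizer. The helper name `stuckRunners` and the sample payload are made up for illustration; only the standard `metadata` fields and the finalizer name quoted in this issue are relied on.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runnerList decodes only the fields needed from a Kubernetes list response.
type runnerList struct {
	Items []struct {
		Metadata struct {
			Name              string   `json:"name"`
			DeletionTimestamp *string  `json:"deletionTimestamp"`
			Finalizers        []string `json:"finalizers"`
		} `json:"metadata"`
	} `json:"items"`
}

// Finalizer name as reported in this issue.
const runnerFinalizer = "ephemeralrunner.actions.github.com/finalizer"

// stuckRunners (hypothetical helper) returns the names of objects that are
// marked for deletion but still hold the EphemeralRunner finalizer.
func stuckRunners(raw []byte) ([]string, error) {
	var list runnerList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("decode runner list: %w", err)
	}
	var stuck []string
	for _, item := range list.Items {
		if item.Metadata.DeletionTimestamp == nil {
			continue // not being deleted, so not stuck
		}
		for _, f := range item.Metadata.Finalizers {
			if f == runnerFinalizer {
				stuck = append(stuck, item.Metadata.Name)
				break
			}
		}
	}
	return stuck, nil
}

func main() {
	// Fabricated sample input; replace with real kubectl output.
	sample := []byte(`{"items":[
	  {"metadata":{"name":"runner-a","deletionTimestamp":"2024-01-01T00:00:00Z",
	    "finalizers":["ephemeralrunner.actions.github.com/finalizer"]}},
	  {"metadata":{"name":"runner-b","finalizers":["ephemeralrunner.actions.github.com/finalizer"]}}
	]}`)
	names, err := stuckRunners(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```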

Why this is a bug (not user-error)

  • The chart Role is as the ADR specifies. The user is correct to expect it not to grant list.
  • The failing code path runs on every successful EphemeralRunner cleanup; it is not an edge case.
  • The error is non-fatal but causes unbounded CR accumulation, eventual workqueue churn, and (at scale) controller OOM.

Related but not duplicate

#3054 reports the same error message in v0.6.1. It was closed after the reporter said cleanup-protection-finalizer wiring "fixed it," but the underlying mismatch between code (List) and chart (no list verb) has not been addressed. The error is reproducible on a clean install of v0.10.0.

Suggested fix (choose one)

Option A — fix the code (matches ADR intent):
Change cleanupRunnerLinkedSecrets to look up secrets by name rather than listing them with a label selector. The runner-linked secret name is deterministic from the runner name, so a single Get is sufficient. This keeps the chart Role minimal.

Option B — fix the chart (matches what the code does):
Add list (and watch, since informer-based List commonly attaches a watch) to the listener Role's secrets rule. Update the ADR to reflect that List-on-secrets is required in this namespace-scoped Role.

Either option closes the gap. Option A is preferable because it preserves the original ADR's least-privilege intent. Option B is faster but represents a permission expansion on a sensitive resource.

Workaround (local)

We're patching this downstream by shipping a supplemental Role in our wrapper chart that adds list and watch on secrets, only in the controller's release namespace. Happy to share the manifest if helpful; it is six lines.

Additional context

Found while investigating CI runner cleanup behavior on AKS. The chart is otherwise stable; this is the only consistent reconcile error in our controller logs.
