feat: RunnerSet backed by StatefulSet #629

Merged: mumoshu merged 19 commits into master from re-statefulset on Jun 22, 2021
Conversation

@mumoshu (Collaborator) commented on Jun 12, 2021

TL;DR: RunnerSet is a more feature-rich, more flexible, easier-to-configure, and more maintainable alternative to RunnerDeployment.

  • It is feature-rich: it supports volumeClaimTemplates (#612) for using persistent volumes for caching.
  • It is flexible: it supports all the pod template settings from the StatefulSet API, while also supporting all the runner-related settings from the Runner API.
  • It is easy to configure and maintain: the pod-related and container-related settings are now inherited from the StatefulSet/Pod Template API, so we no longer need to maintain our own variants of them in the Runner spec.

A RunnerSet manages a set of "stateful" runners by combining a StatefulSet and an admission webhook. A StatefulSet is a standard Kubernetes construct that manages a set of pods and a pool of persistent volumes. We use it to manage runner pods, while the admission webhook mutates each pod to have the required environment variables and registration tokens.

It is intended as a complete replacement for the former method of deploying a set of runners, RunnerDeployment, which also creates pods with the required environment variables and registration tokens.

Differences between RunnerSet and RunnerDeployment

The one big functional difference between RunnerSet and RunnerDeployment is that the former supports volumeClaimTemplates, which allows actions-runner-controller to manage a pool of dynamically provisioned persistent volumes. This should be useful for making certain types of Actions workflows faster by utilizing a per-pod-identity cache, like a docker layer cache in /var/lib/docker that persists across pod restarts, as sketched below.
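Since a RunnerSet embeds the StatefulSet spec, a per-pod docker layer cache might look roughly like the following. This is a minimal sketch, not a tested manifest: the claim name, the storage size, and mounting /var/lib/docker into the docker container are illustrative assumptions.

# runnerset-with-cache.yaml (sketch)
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-with-cache
spec:
  ephemeral: false
  replicas: 2
  repository: mumoshu/actions-runner-controller-ci
  selector:
    matchLabels:
      app: example-with-cache
  serviceName: example-with-cache
  template:
    metadata:
      labels:
        app: example-with-cache
    spec:
      containers:
      - name: docker
        volumeMounts:
        # Keep the docker layer cache on a per-pod persistent volume
        - name: var-lib-docker
          mountPath: /var/lib/docker
  # Standard StatefulSet field; one PVC is provisioned per pod
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi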

The basic usage of RunnerSet is very similar to that of RunnerDeployment.

This RunnerDeployment:

# runnerdeployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  replicas: 2
  template:
    spec:
      repository: mumoshu/actions-runner-controller-ci
      env: []

can be rewritten to:

# runnerset.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example
spec:
  # NOTE: RunnerSet supports non-ephemeral runners only today
  ephemeral: false
  replicas: 2
  repository: mumoshu/actions-runner-controller-ci
  # Other mandatory fields from StatefulSet
  selector:
    matchLabels:
      app: example
  serviceName: example
  template:
    metadata:
      labels:
        app: example

Also note that, unlike RunnerDeployment, you can write the full StatefulSet spec inside a RunnerSet. Configure the pod template however you like; the RunnerSet controller reads and tweaks it to create a complete runner pod spec. This makes it unnecessary to add every pod spec field to the Runner spec.

How to configure your RunnerSet

You might have written a RunnerDeployment like the one below, with various tweaks:

# runnerdeployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  replicas: 2
  template:
    # Pod template spec
    spec:
      repository: mumoshu/actions-runner-controller-ci
      dockerdWithinRunnerContainer: true
      env: []
      securityContext:
        # All level/role/type/user values will vary based on your SELinux policies.
        # See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/container_security_guide/docker_selinux_security_policy for information about SELinux with containers
        seLinuxOptions:
          level: "s0"
          role: "system_r"
          type: "super_t"
          user: "system_u"
      resources:
        limits:
          cpu: "4.0"
          memory: "8Gi"
        requests:
          cpu: "2.0"
          memory: "4Gi"
      dockerdContainerResources:
        limits:
          cpu: "4.0"
          memory: "8Gi"
        requests:
          cpu: "2.0"
          memory: "4Gi"

In the RunnerDeployment API, you declare 4 kinds of settings in 2 places: 1 under spec and 3 under spec.template.spec:

  1. Per-deployment settings like replicas under spec
  2. Per-deployment and runner-related settings like repository, organization, enterprise, dockerdWithinRunnerContainer, and so on under spec.template.spec
  3. Per-pod settings like securityContext, volumes under spec.template.spec
  4. Per-container settings like resources, dockerdContainerResources, image, dockerImage, and so on under spec.template.spec

In the RunnerSet API, you declare 3 kinds of settings in 3 places:

  1. Per-set settings like replicas, repository, organization, enterprise, and so on under spec
  2. Per-pod settings under spec.template.spec
  3. Per-container settings under spec.template.spec.containers[]
    • All the dockerdContainer* settings from RunnerDeployment go into the containers entry whose name is docker, for example.

Items 2 and 3 may be more familiar to many users and therefore easier to write, as they use the standard pod template syntax widely used in Kubernetes Deployment, ReplicaSet, and StatefulSet.

With that in mind, the above example can be rewritten as a RunnerSet like the following:

# runnerset.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example
spec:
  # NOTE: RunnerSet supports non-ephemeral runners only today
  ephemeral: false
  replicas: 2
  repository: mumoshu/actions-runner-controller-ci
  dockerdWithinRunnerContainer: true
  # Other mandatory fields from StatefulSet
  selector:
    matchLabels:
      app: example
  serviceName: example
  template:
    metadata:
      labels:
        app: example
    spec:
      securityContext:
        # All level/role/type/user values will vary based on your SELinux policies.
        # See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/container_security_guide/docker_selinux_security_policy for information about SELinux with containers
        seLinuxOptions:
          level: "s0"
          role: "system_r"
          type: "super_t"
          user: "system_u"
      containers:
      - name: runner
        env: []
        resources:
          limits:
            cpu: "4.0"
            memory: "8Gi"
          requests:
            cpu: "2.0"
            memory: "4Gi"
      - name: docker
        resources:
          limits:
            cpu: "4.0"
            memory: "8Gi"
          requests:
            cpu: "2.0"
            memory: "4Gi"

Planned but not yet implemented

The following features are planned but not yet implemented. Please use RunnerDeployment for now if you need any of them.

HRA support:
Support for HorizontalRunnerAutoscaler is planned but not done yet.

Scale-from/to-zero:
Scale-from/to-zero is planned but not implemented yet.

Auto-recovery of runner pods stuck while registering:
Planned but not implemented yet.

Call for help

I've already verified this to work manually, using the updated Helm chart and my own build of the actions-runner-controller container image. But since a lot of changes have been made to the code-base, I don't think it is tested enough.

If you want this feature to get merged at all, or to get merged earlier, please test it and report any problems you encounter!

Changelog

Related issues

Resolves #613
Ref #612
Revival of #4

mumoshu force-pushed the re-statefulset branch 5 times, most recently from 50019b1 to 96f0a08, on June 12, 2021
Unlike a RunnerDeployment, a RunnerSet can manage a set of stateful runners by combining a StatefulSet and an admission webhook that mutates the StatefulSet-managed pods with the required envvars and registration tokens.

Resolves #613
Ref #612
mumoshu changed the title from "WIP: feat: RunnerSet backed by StatefulSet" to "feat: RunnerSet backed by StatefulSet" on Jun 13, 2021
@@ -158,20 +164,20 @@ acceptance: release/clean acceptance/pull docker-build release
acceptance/run: acceptance/kind acceptance/load acceptance/setup acceptance/deploy acceptance/tests acceptance/teardown
@callum-tait-pbx (Contributor) commented on Jun 20, 2021:

If all the images used in acceptance/load aren't already on your local machine, this fails; acceptance/pull needs to run after acceptance/kind.
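A sketch of the suggested reordering (prerequisite order only; the actual fix may look different):

# Sketch: run acceptance/pull after the kind cluster exists
# (non-parallel make processes prerequisites left to right)
acceptance/run: acceptance/kind acceptance/pull acceptance/load acceptance/setup acceptance/deploy acceptance/tests acceptance/teardown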

On Makefile (outdated), commenting on lines 177 to 179:
kind load docker-image quay.io/jetstack/cert-manager-controller:v1.0.4 --name ${CLUSTER}
kind load docker-image quay.io/jetstack/cert-manager-cainjector:v1.0.4 --name ${CLUSTER}
kind load docker-image quay.io/jetstack/cert-manager-webhook:v1.0.4 --name ${CLUSTER}
@callum-tait-pbx (Contributor) commented on Jun 20, 2021:

Worth bumping to v1.1.1 in this PR?

v1.1.1 is the last of the v1.X.X series. I run v1.1.1 on EKS and have done so across multiple controller versions. v1.1.1 is fairly old at this point, but we should consider bumping to a newer major version outside of this PR, so that if there are issues (I don't see why there would be, tbh) they are dealt with separately from this work. I can vouch for v1.1.1, so it would be nice to bump to the latest of that series in this PR, seeing as we have done various bumps already.

Perhaps it's worth having CERT_MANAGER_VERSION = v1.1.1 at the top, with the version to be deployed pulled from that, making it easier to bump next time?
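As a sketch of that suggestion (the enclosing target name is assumed and its other recipe lines are elided; only the variable is new):

# Declared once at the top of the Makefile for easier bumps
CERT_MANAGER_VERSION ?= v1.1.1

acceptance/load: # existing target; other recipe lines elided
	kind load docker-image quay.io/jetstack/cert-manager-controller:${CERT_MANAGER_VERSION} --name ${CLUSTER}
	kind load docker-image quay.io/jetstack/cert-manager-cainjector:${CERT_MANAGER_VERSION} --name ${CLUSTER}
	kind load docker-image quay.io/jetstack/cert-manager-webhook:${CERT_MANAGER_VERSION} --name ${CLUSTER}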

mumoshu merged commit 9e4dbf4 into master on Jun 22, 2021
mumoshu deleted the re-statefulset branch on June 22, 2021 08:10
mumoshu added a commit that referenced this pull request Jun 22, 2021
@esvirskiy commented:

Hi @mumoshu. I am testing the controller using the canary tag and I am seeing the following error:

actions-runner-controller-67bc455dd6-css9q manager E0622 15:10:58.821753       1 leaderelection.go:325] error retrieving resource lock actions-runner-system/actions-runner-controller: leases.coordination.k8s.io "actions-runner-controller" is forbidden: User "system:serviceaccount:actions-runner-system:actions-runner-controller" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "actions-runner-system"

I see that https://github.com/actions-runner-controller/actions-runner-controller/blob/8b90b0f0e3a4a254c096f8d9ecd8aeed0ee3c00e/controllers/runnerset_controller.go#L68 is commented out. Is that needed?

@esvirskiy commented:

This was my fault. I used an outdated chart (changed controller tag to canary).
The chart that is currently in master works fine. Thanks! I'll continue testing this!
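For reference, the error above concerns leader-election RBAC: controller-runtime's leader election needs access to leases in the coordination.k8s.io API group. A Role of roughly this shape grants it (a sketch inferred from the error message; the name is hypothetical and the chart's actual manifest is authoritative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-election-role  # hypothetical name
  namespace: actions-runner-system
rules:
# The verbs controller-runtime typically needs for the leader-election lease
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "create", "update"]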

@mumoshu (Collaborator, Author) commented on Jun 22, 2021

@esvirskiy Wow! Thanks a lot for testing ☺️ Please feel free to leave any early feedback. That would be super helpful in shaping this feature.

Note that there are a few unimplemented things, as explained in the PR description:

(Screenshot: the "Planned but not yet implemented" list from the PR description.)

I'm working on the HRA support in #647. I'll tackle the auto-recovery feature next. Scale-from/to-zero is the lowest priority, and I may skip working on it entirely, because a potential enhancement on the GitHub side could make it unnecessary.

mumoshu mentioned this pull request on Jun 23, 2021
mumoshu added a commit that referenced this pull request Jun 23, 2021
mumoshu added a commit that referenced this pull request Jun 23, 2021
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.

It defaults to `RunnerDeployment` for backward compatibility.

```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myhra
spec:
  scaleTargetRef:
    kind: RunnerSet
    name: myrunnerset
```

Ref #629
Ref #613
Ref #612
mumoshu added a commit that referenced this pull request Jun 24, 2021
mumoshu added a commit that referenced this pull request Jun 24, 2021
mumoshu added a commit that referenced this pull request Jun 25, 2021
mumoshu mentioned this pull request on Jun 25, 2021
mumoshu added a commit that referenced this pull request Jun 25, 2021
mumoshu added a commit that referenced this pull request Aug 16, 2021
mumoshu added a commit that referenced this pull request Aug 17, 2021
mumoshu added a commit that referenced this pull request Aug 25, 2021