feat: RunnerSet backed by StatefulSet #629

Merged: mumoshu merged 19 commits into master from re-statefulset on Jun 22, 2021
Conversation

@mumoshu (Collaborator) commented on Jun 12, 2021

TL;DR: RunnerSet is a more feature-rich, more flexible, easier-to-configure, and more maintainable alternative to RunnerDeployment.

  • It is feature-rich: it supports volumeClaimTemplates (#612) for using persistent volumes for caching.
  • It is flexible: it supports all the pod template settings from the StatefulSet API, while also supporting all the runner-related settings from the Runner API.
  • It is easy to configure and maintain: the pod-related and container-related settings are now inherited from the StatefulSet/Pod Template API, so we no longer need to maintain our own variants of them in the Runner spec.

A RunnerSet manages a set of "stateful" runners by combining a StatefulSet and an admission webhook. A StatefulSet is a standard Kubernetes construct that manages a set of pods and a pool of persistent volumes. We use it to manage runner pods, while the admission webhook mutates each pod to have the required environment variables and registration tokens.

It is intended as a complete replacement for the former method of deploying a set of runners, RunnerDeployment, which also creates pods with the required environment variables and registration tokens.

Differences between RunnerSet and RunnerDeployment

The one big functional difference between RunnerSet and RunnerDeployment is that the former supports volumeClaimTemplates, which allows actions-runner-controller to manage a pool of dynamically provisioned persistent volumes. This should be useful for making certain types of Actions workflows faster by utilizing a per-pod-identity cache, like a docker layer cache in /var/lib/docker that persists across pod restarts, as sketched below.
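Since a RunnerSet embeds the StatefulSet spec, a per-pod docker layer cache might look roughly like the following. This is a minimal sketch, not a tested manifest: the claim name, the storage size, and mounting /var/lib/docker into the docker container are illustrative assumptions.

# runnerset-with-cache.yaml (sketch)
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-with-cache
spec:
  ephemeral: false
  replicas: 2
  repository: mumoshu/actions-runner-controller-ci
  selector:
    matchLabels:
      app: example-with-cache
  serviceName: example-with-cache
  template:
    metadata:
      labels:
        app: example-with-cache
    spec:
      containers:
      - name: docker
        volumeMounts:
        # Keep the docker layer cache on a per-pod persistent volume
        - name: var-lib-docker
          mountPath: /var/lib/docker
  # Standard StatefulSet field; one PVC is provisioned per pod
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi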

The basic usage of RunnerSet is very similar to that of RunnerDeployment.

This RunnerDeployment:

# runnerdeployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  replicas: 2
  template:
    spec:
      repository: mumoshu/actions-runner-controller-ci
      env: []

can be rewritten to:

# runnerset.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example
spec:
  # NOTE: RunnerSet supports non-ephemeral runners only today
  ephemeral: false
  replicas: 2
  repository: mumoshu/actions-runner-controller-ci
  # Other mandatory fields from StatefulSet
  selector:
    matchLabels:
      app: example
  serviceName: example
  template:
    metadata:
      labels:
        app: example

Also note that, unlike RunnerDeployment, you can write the full StatefulSet spec inside a RunnerSet. Configure the pod template however you like; the RunnerSet controller reads and tweaks it to create a complete runner pod spec. This makes it unnecessary to add every pod spec field to the Runner spec.

How to configure your RunnerSet

You might have written a RunnerDeployment like the one below, with various tweaks:

# runnerdeployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  replicas: 2
  template:
    # Pod template spec
    spec:
      repository: mumoshu/actions-runner-controller-ci
      dockerdWithinRunnerContainer: true
      env: []
      securityContext:
        # All level/role/type/user values will vary based on your SELinux policies.
        # See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/container_security_guide/docker_selinux_security_policy for information about SELinux with containers
        seLinuxOptions:
          level: "s0"
          role: "system_r"
          type: "super_t"
          user: "system_u"
      resources:
        limits:
          cpu: "4.0"
          memory: "8Gi"
        requests:
          cpu: "2.0"
          memory: "4Gi"
      dockerdContainerResources:
        limits:
          cpu: "4.0"
          memory: "8Gi"
        requests:
          cpu: "2.0"
          memory: "4Gi"

In the RunnerDeployment API, you declare 4 kinds of settings in 2 places: 1 under spec and 3 under spec.template.spec:

  1. Per-deployment settings like replicas under spec
  2. Per-deployment and runner-related settings like repository, organization, enterprise, dockerdWithinRunnerContainer, and so on under spec.template.spec
  3. Per-pod settings like securityContext, volumes under spec.template.spec
  4. Per-container settings like resources, dockerdContainerResources, image, dockerImage, and so on under spec.template.spec

In the RunnerSet API, you declare 3 kinds of settings in 3 places:

  1. Per-set settings like replicas, repository, organization, enterprise, and so on under spec
  2. Per-pod settings under spec.template.spec
  3. Per-container settings under spec.template.spec.containers[]
    • All the dockerdContainer* settings from RunnerDeployment go into the containers entry whose name is docker, for example.

Items 2 and 3 may be more familiar to many users and therefore easier to write, as they use the standard pod template syntax widely used in Kubernetes Deployment, ReplicaSet, and StatefulSet.

With that in mind, the above example can be rewritten as a RunnerSet like the following:

# runnerset.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example
spec:
  # NOTE: RunnerSet supports non-ephemeral runners only today
  ephemeral: false
  replicas: 2
  repository: mumoshu/actions-runner-controller-ci
  dockerdWithinRunnerContainer: true
  # Other mandatory fields from StatefulSet
  selector:
    matchLabels:
      app: example
  serviceName: example
  template:
    metadata:
      labels:
        app: example
    spec:
      securityContext:
        # All level/role/type/user values will vary based on your SELinux policies.
        # See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/container_security_guide/docker_selinux_security_policy for information about SELinux with containers
        seLinuxOptions:
          level: "s0"
          role: "system_r"
          type: "super_t"
          user: "system_u"
      containers:
      - name: runner
        env: []
        resources:
          limits:
            cpu: "4.0"
            memory: "8Gi"
          requests:
            cpu: "2.0"
            memory: "4Gi"
      - name: docker
        resources:
          limits:
            cpu: "4.0"
            memory: "8Gi"
          requests:
            cpu: "2.0"
            memory: "4Gi"

Planned but not yet implemented

The following features are planned but not yet implemented. Please use RunnerDeployment for now if you need any of them.

HRA support:
Support for HorizontalRunnerAutoscaler is planned but not done yet.

Scale-from/to-zero:
Scale-from/to-zero is planned but not implemented yet.

Auto-recovery of runner pods stuck while registering:
Planned but not implemented yet.

Call for help

I've already verified this to work manually, using the updated Helm chart and my own build of the actions-runner-controller container image. But since a lot of changes have been made to the code-base, I don't think it is tested enough.

If you want this feature to get merged at all, or to get merged earlier, please test it and report any problems you encounter!

Changelog

Related issues

Resolves #613
Ref #612
Revival of #4

mumoshu force-pushed the re-statefulset branch 5 times, most recently from 50019b1 to 96f0a08, on June 12, 2021
Unlike a RunnerDeployment, a RunnerSet can manage a set of stateful runners by combining a StatefulSet and an admission webhook that mutates the StatefulSet-managed pods with the required envvars and registration tokens.

Resolves #613
Ref #612
mumoshu changed the title from "WIP: feat: RunnerSet backed by StatefulSet" to "feat: RunnerSet backed by StatefulSet" on Jun 13, 2021
@@ -158,20 +164,20 @@ acceptance: release/clean acceptance/pull docker-build release
acceptance/run: acceptance/kind acceptance/load acceptance/setup acceptance/deploy acceptance/tests acceptance/teardown
@callum-tait-pbx (Contributor) commented on Jun 20, 2021:

If all the images used in acceptance/load aren't already on your local machine, this fails; acceptance/pull needs to run after acceptance/kind.
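A sketch of the suggested reordering (prerequisite order only; the actual fix may look different):

# Sketch: run acceptance/pull after the kind cluster exists
# (non-parallel make processes prerequisites left to right)
acceptance/run: acceptance/kind acceptance/pull acceptance/load acceptance/setup acceptance/deploy acceptance/tests acceptance/teardown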

On Makefile (outdated), commenting on lines 177 to 179:
kind load docker-image quay.io/jetstack/cert-manager-controller:v1.0.4 --name ${CLUSTER}
kind load docker-image quay.io/jetstack/cert-manager-cainjector:v1.0.4 --name ${CLUSTER}
kind load docker-image quay.io/jetstack/cert-manager-webhook:v1.0.4 --name ${CLUSTER}
@callum-tait-pbx (Contributor) commented on Jun 20, 2021:

Worth bumping to v1.1.1 in this PR?

v1.1.1 is the last of the v1.X.X series. I run v1.1.1 on EKS and have done so across multiple controller versions. v1.1.1 is fairly old at this point, but we should consider bumping to a newer major version outside of this PR, so that if there are issues (I don't see why there would be, tbh) they are dealt with separately from this work. I can vouch for v1.1.1, so it would be nice to bump to the latest of that series in this PR, seeing as we have done various bumps already.

Perhaps it's worth having CERT_MANAGER_VERSION = v1.1.1 at the top, with the version to be deployed pulled from that, making it easier to bump next time?
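As a sketch of that suggestion (the enclosing target name is assumed and its other recipe lines are elided; only the variable is new):

# Declared once at the top of the Makefile for easier bumps
CERT_MANAGER_VERSION ?= v1.1.1

acceptance/load: # existing target; other recipe lines elided
	kind load docker-image quay.io/jetstack/cert-manager-controller:${CERT_MANAGER_VERSION} --name ${CLUSTER}
	kind load docker-image quay.io/jetstack/cert-manager-cainjector:${CERT_MANAGER_VERSION} --name ${CLUSTER}
	kind load docker-image quay.io/jetstack/cert-manager-webhook:${CERT_MANAGER_VERSION} --name ${CLUSTER}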

mumoshu merged commit 9e4dbf4 into master on Jun 22, 2021
mumoshu deleted the re-statefulset branch on June 22, 2021 08:10
mumoshu added a commit that referenced this pull request Jun 22, 2021
@esvirskiy commented:

Hi @mumoshu. I am testing the controller using the canary tag and I am seeing the following error:

actions-runner-controller-67bc455dd6-css9q manager E0622 15:10:58.821753       1 leaderelection.go:325] error retrieving resource lock actions-runner-system/actions-runner-controller: leases.coordination.k8s.io "actions-runner-controller" is forbidden: User "system:serviceaccount:actions-runner-system:actions-runner-controller" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "actions-runner-system"

I see that https://github.com/actions-runner-controller/actions-runner-controller/blob/8b90b0f0e3a4a254c096f8d9ecd8aeed0ee3c00e/controllers/runnerset_controller.go#L68 is commented out. Is that needed?

@esvirskiy commented:

This was my fault. I used an outdated chart (changed controller tag to canary).
The chart that is currently in master works fine. Thanks! I'll continue testing this!
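For reference, the error above concerns leader-election RBAC: controller-runtime's leader election needs access to leases in the coordination.k8s.io API group. A Role of roughly this shape grants it (a sketch inferred from the error message; the name is hypothetical and the chart's actual manifest is authoritative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: leader-election-role  # hypothetical name
  namespace: actions-runner-system
rules:
# The verbs controller-runtime typically needs for the leader-election lease
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "create", "update"]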

@mumoshu (Collaborator, Author) commented on Jun 22, 2021

@esvirskiy Wow! Thanks a lot for testing ☺️ Please feel free to leave any early feedback. That would be super helpful in shaping this feature.

Note that there are a few unimplemented things, as explained in the PR description:

(Screenshot: the "Planned but not yet implemented" list from the PR description.)

I'm working on the HRA support in #647. I'll tackle the auto-recovery feature next. Scale-from/to-zero is the lowest priority, and I may skip working on it entirely, because a potential enhancement on the GitHub side could make it unnecessary.

mumoshu mentioned this pull request on Jun 23, 2021
mumoshu added a commit that referenced this pull request Jun 23, 2021
mumoshu added a commit that referenced this pull request Jun 23, 2021
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.

It defaults to `RunnerDeployment` for backward compatibility.

```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myhra
spec:
  scaleTargetRef:
    kind: RunnerSet
    name: myrunnerset
```

Ref #629
Ref #613
Ref #612
mumoshu added a commit that referenced this pull request Jun 24, 2021
mumoshu added a commit that referenced this pull request Jun 24, 2021
mumoshu added a commit that referenced this pull request Jun 25, 2021
mumoshu mentioned this pull request on Jun 25, 2021
mumoshu added a commit that referenced this pull request Jun 25, 2021
mumoshu added a commit that referenced this pull request Aug 16, 2021
mumoshu added a commit that referenced this pull request Aug 17, 2021
mumoshu added a commit that referenced this pull request Aug 25, 2021