
Not able to scale Organization RunnerDeployment with "workflow_job" #951

Closed
sigurdfalk opened this issue Nov 16, 2021 · 11 comments

Describe the bug

The organization-level RunnerDeployment is not able to scale with the workflow_job event. The manifests look like this:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: iac-runners
  namespace: actions-runner-system
spec:
  template:
    metadata:
      labels:
        app: iac-runners
        owner: platform
        environment: test
    spec:
      organization: gjensidige
      group: IaC
      labels:
        - iac
      ephemeral: true
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: iac-runners
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: iac-runners
  scaleUpTriggers:
    - githubEvent: {}
      amount: 1
      duration: "5m"
  minReplicas: 2
  maxReplicas: 5

We are seeing the following in the logs:

2021-11-16T20:04:01.274Z        DEBUG   controllers.Runner      Found 0 HRAs by key     {"key": "gjensidige/terraform-aks"}
2021-11-16T20:04:01.274Z        DEBUG   controllers.Runner      Found 1 HRAs by key     {"key": "gjensidige"}
2021-11-16T20:04:01.274Z        DEBUG   controllers.Runner      Found 0 HRAs by key     {"key": "enterprises/gjensidige"}
2021-11-16T20:04:01.274Z        DEBUG   controllers.Runner      no repository/organizational/enterprise runner found    {"event": "workflow_job", "hookID": "328719738", "delivery": "56d85c90-4718-11ec-8b7f-a4369b1789a2", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted", "iac"], "repository.name": "terraform-aks", "repository.owner.login": "gjensidige", "repository.owner.type": "Organization", "enterprise.slug": "gjensidige", "action": "queued", "repository": "gjensidige/terraform-aks", "organization": "gjensidige", "enterprises": "gjensidige"}
2021-11-16T20:04:01.274Z        INFO    controllers.Runner      Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event      {"event": "workflow_job", "hookID": "328719738", "delivery": "56d85c90-4718-11ec-8b7f-a4369b1789a2", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted", "iac"], "repository.name": "terraform-aks", "repository.owner.login": "gjensidige", "repository.owner.type": "Organization", "enterprise.slug": "gjensidige", "action": "queued"}

The number of replicas stays at 2 no matter how many queued jobs are requesting these runners.

Checks

  • My actions-runner-controller version (v0.x.y) does support the feature
  • I'm using an unreleased version of the controller I built from HEAD of the default branch

Expected behavior

The number of runners should scale up when we have queued workflows.

Environment (please complete the following information):

  • Controller Version: 0.20.3
  • Deployment Method: Helm
  • Helm Chart Version: 0.15.0
mumoshu (Collaborator) commented Nov 16, 2021

@sigurdfalk Hey! This part of your log caught my eye: "workflowJob.labels": ["self-hosted", "iac"].

Could you also share your workflow definition YAML file?
I'm mainly interested in where this self-hosted label in the workflow job event is coming from.

@sigurdfalk (Author)

@mumoshu I think "self-hosted" is added by default to all self-hosted runners by GitHub. I added "self-hosted" to labels in my RunnerDeployment and then it started working and scaling as expected 🎉

In our workflow, we use: runs-on: [self-hosted, iac]

Thank you so much for your help and all the great work you are doing with this project! We really love it ❤️ I think we can close this issue now

ghost commented Nov 17, 2021

Hello! I have had the exact same issue with enterprise runners.
Some more info:

  • I could make the webhook work with checkRun events (example #2) but not with workflowJob
  • the workflowJob.labels come from the workflow file, from the jobs.<JOB_NAME>.runs-on field

Following @sigurdfalk's comment, adding the self-hosted label to the RunnerDeployment fixed the workflowJob webhook trigger. Maybe the webhook server should ignore this label, or assign it to pods by default, since it is added to every runner automatically?

I think there is currently a difference between the labels the controller knows about (== the explicitly specified labels) and the labels GitHub receives (explicit + implicit (self-hosted, OS, infrastructure)). The same issue arises when specifying the OS or the infrastructure, even though the job does get assigned to the same node that should be autoscaled.

edit: the webhook controller checks that the labels requested by the workflow match the explicitly specified labels, which breaks when implicit labels are involved. The issue will not arise if only self-hosted is specified in the labels.

ghost commented Nov 17, 2021

I pushed a fix for the self-hosted label plus docs for the other labels; it seems to work fine in my environment.

mumoshu (Collaborator) commented Nov 17, 2021

implicit (self-hosted, OS, infrastructure)

@sigurdfalk @clement-loiselet-talend Hey! Thanks a lot for your reports. Ignoring self-hosted sounds good. But what about other implicit labels?

Do you, by any chance, know the full list of implicit labels other than self-hosted that we could use for the ignore list?

toast-gear (Collaborator) commented Nov 18, 2021

I think this is the list of implicit labels that GitHub Actions applies depending on the hardware: https://docs.github.com/en/actions/hosting-your-own-runners/using-self-hosted-runners-in-a-workflow#using-default-labels-to-route-jobs

self-hosted
linux
macOS
windows
x64
ARM
ARM64

mumoshu (Collaborator) commented Nov 18, 2021

@toast-gear Hey! Thanks. I think your list is for runner labels. I was more interested in workflow_job labels.

Perhaps no workflow_job will implicitly get the x64 label: if it isn't explicitly listed under runs-on, you're basically saying "I don't mind which CPU arch the runner uses", so GitHub won't add it by default.

ghost commented Nov 18, 2021

The label list on the workflow_job is always explicit/user-defined in the GitHub Actions workflow, but some of the labels it relies on may only be implicitly present on the runner rather than declared on the k8s Runner resource, meaning the controller will not know about them.

e.g. :

workflow.yaml
name: Docker image CI
jobs:
  lint:
    runs-on: [self-hosted, docker-in-docker, linux]

RunnerDeployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-action-runner
spec:
  template:
    spec:
      # explicit labels
      labels:
        - docker-in-docker

In this example, the created runner will have the labels [self-hosted, linux, x64, docker-in-docker] in GitHub, so the workflow will be able to run on it.
However, the webhook controller only knows about the docker-in-docker label, so when the webhook queries the controller with the [self-hosted, linux, docker-in-docker] labels, it will not be able to find the corresponding RunnerDeployment.
It's confusing for users, as they do not expect the workflow webhook to match on the explicit labels only.

I don't think we should ignore these labels in the webhook controller, as that risks scaling up a deployment the workflow cannot actually run on. From what I saw, we can't easily extract the implicit labels from the runner, so I guess we could add a warning in the webhook server to ease debugging if someone tries to use these implicit labels without declaring them on the Runner.

mumoshu (Collaborator) commented Dec 5, 2021

@clement-loiselet-talend Hey! Sorry for the delay. It took more time than I had thought to fully understand this, but now I've got it. So, here are my thoughts:

  • We'd better add documentation that generally recommends users copy all the labels used in their workflow definitions to RunnerDeployment.Spec.Template.Labels
  • Add a warning log message in our webhook server so that the user can see why a webhook didn't trigger a scale-up (due to insufficient labels in the RunnerDeployment spec)
  • In the future, we'd better remove the logic in our webhook server that silently ignores the self-hosted label in the webhook payload.

Also, thank you for your #953! Thanks to your detailed response, it turns out we'd better not merge it and should instead enhance logging.

mumoshu (Collaborator) commented Dec 19, 2021

Sorry for the back and forth, but I'm now convinced that we should merge #953. Thanks again for your contribution @clement-loiselet-talend 🙇 I've left a comment with some more context in the PR, FYI.

@itsvit-vlasov-y

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: my-runner
spec:
  replicas: 0
  template:
    spec:
      serviceAccountName: runner-sa
      organization: SuperOrg
      labels:
      - self-hosted
      - staging
      env:
      - name: GOOGLE_PROJECT_ID
        value: {{ quote .Values.googleProjectID }}
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 50m
          memory: 500Mi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: my-runner
spec:
  minReplicas: 0
  maxReplicas: 10
  scaleTargetRef:
    kind: RunnerDeployment
    name: my-runner
  scaleDownDelaySecondsAfterScaleOut: 1800
  scaleUpTriggers:
  - amount: 1
    duration: 30m
    githubEvent:
      workflowJob:
        action: queued
---
name: Frontend
run-name: Frontend

on:
- push

jobs:
  build-frontend-image:
    name: Build and push a frontend image
    runs-on: ["staging"]
    steps:
    - run: echo hello
