
Docker Container Action Jobs failing to schedule on autoscaled cluster #140

Closed
Labels: bug Something isn't working

rteeling-evernorth opened this issue Mar 1, 2024 · 4 comments

rteeling-evernorth commented Mar 1, 2024

Controller Version

0.7.0, 0.8.2

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Use the Kubernetes Cluster Autoscaler
2. Create a scale set using Kubernetes container mode
3. Run a Docker container-based action
4. Ensure the cluster does not have spare capacity to schedule the action's job pod
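As a sketch, any workflow whose job runs in a Docker container will exercise step 3; the runner label and image below are placeholders, not values from this report:

```yaml
# Hypothetical workflow illustrating the repro: a container job makes the
# Kubernetes-mode hook create a separate job pod for the container.
name: container-job-repro
on: workflow_dispatch
jobs:
  repro:
    runs-on: my-scale-set        # placeholder: your runnerScaleSetName
    container:
      image: ubuntu:22.04        # any job container triggers the job pod
    steps:
      - run: echo "this step runs in the job container pod"
```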

Describe the bug

When the job pod tries to run, Kubernetes cannot find a node to schedule it on and emits the following event:
Node didn't have enough resource: cpu, requested: 2000, used: 13920, capacity: 15890

The Kubernetes Job then fails with: Job has reached the specified backoff limit

This causes the Actions job to fail.

Describe the expected behavior

The job pod should wait for new nodes to come online (about 45 seconds on average) and then be scheduled.

Additional Context

gha-runner-scale-set:

  githubConfigUrl: changeme
  githubConfigSecret: github-arc-secret

  minRunners: 0
  runnerGroup: "changeme"
  runnerScaleSetName: "changeme"

  githubServerTLS:
    certificateFrom:
      configMapKeyRef:
        name: my-cacert
        key: ca.crt
    runnerMountPath: /usr/local/share/ca-certificates/

  containerMode:
    type: "kubernetes"  ## type can be set to dind or kubernetes
    kubernetesModeWorkVolumeClaim:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp2-encrypted"
      resources:
        requests:
          storage: 5Gi

  template:
    ### CUSTOM ###
    spec:
      nodeSelector:
        github: "true"
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: github
      priorityClassName: github
      ### END CUSTOM ###
      securityContext:
        fsGroup: 123
      containers:
      - name: runner
        # image: ghcr.io/actions/actions-runner:latest
        image: ACTIONS-RUNNER-IMAGE-MIRROR/actions-runner:2.314.0
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: "200m"
            memory: "512Mi"
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
      volumes:
      - name: pod-templates
        configMap:
          name: pod-templates

  controllerServiceAccount:
    namespace: arc-system
    name: github-actions-controller-gha-rs-controller

Controller Logs

My employer's open source contribution policy prohibits me from posting this information in public; however, I can post relevant redacted portions upon request.

Runner Pod Logs

My employer's open source contribution policy prohibits me from posting this information in public; however, I can post relevant redacted portions upon request.
@rteeling-evernorth rteeling-evernorth added the bug Something isn't working label Mar 1, 2024
@rteeling-evernorth rteeling-evernorth changed the title Docker Container Action Jobs failing to schedule Docker Container Action Jobs failing to schedule on autoscaled cluster Mar 1, 2024

github-actions bot commented Mar 1, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic nikola-jokic transferred this issue from actions/actions-runner-controller Mar 4, 2024
@nikola-jokic
Member

Hey @rteeling-evernorth,

This issue is related to the hook ☺️. Are you using the default hook implementation in your container mirror? If so, the hook schedules the job pod onto the same node where the runner is running, so the problem is node capacity rather than the scheduler. By default we skip the scheduler so the job pod can reuse the volume mount from the runner pod. This can be avoided if you use ReadWriteMany volumes, but that would require you to configure the environment variables appropriately.
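A sketch of what that workaround could look like in the values file, assuming a ReadWriteMany-capable storage class is available (the storage class name is a placeholder, and ACTIONS_RUNNER_USE_KUBE_SCHEDULER comes from the runner-container-hooks project, not from this thread; verify it against your hook version):

```yaml
# Hypothetical values changes: an RWX work volume plus the hook env that
# opts back into the Kubernetes scheduler instead of pinning the job pod
# to the runner's node.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]   # job pod can mount the work volume from any node
    storageClassName: "efs-sc"       # placeholder: any RWX-capable storage class
    resources:
      requests:
        storage: 5Gi
template:
  spec:
    containers:
    - name: runner
      env:
        - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
          value: "true"
```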

@rteeling-evernorth
Author

Ah! That would explain it. Everything in my mirror is off-the-shelf for 0.8.2, and I was using the default work volume claim from the values file, which is ReadWriteOnce. That accounts for the behavior I am seeing. Thank you so much for the info!

@nikola-jokic
Member

You are welcome!
