
Docker Container Action Jobs failing to schedule on autoscaled cluster #140

Closed
Labels: bug Something isn't working

rteeling-evernorth opened this issue Mar 1, 2024 · 4 comments

rteeling-evernorth commented Mar 1, 2024

Controller Version

0.7.0, 0.8.2

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Use the Kubernetes Cluster Autoscaler
2. Create a scale set using Kubernetes container mode
3. Run a Docker container-based action
4. Ensure the cluster does not have spare capacity to schedule the action's job pod
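As a sketch, any workflow whose job runs in a Docker container will exercise step 3; the runner label and image below are placeholders, not values from this report:

```yaml
# Hypothetical workflow illustrating the repro: a container job makes the
# Kubernetes-mode hook create a separate job pod for the container.
name: container-job-repro
on: workflow_dispatch
jobs:
  repro:
    runs-on: my-scale-set        # placeholder: your runnerScaleSetName
    container:
      image: ubuntu:22.04        # any job container triggers the job pod
    steps:
      - run: echo "this step runs in the job container pod"
```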

Describe the bug

When the job pod tries to run, Kubernetes cannot find a node to schedule it on and emits the following event:
Node didn't have enough resource: cpu, requested: 2000, used: 13920, capacity: 15890

The Kubernetes Job then fails with: Job has reached the specified backoff limit

This causes the Actions job to fail.

Describe the expected behavior

The job pod should wait for new nodes to come online (about 45 seconds on average) and then be scheduled.

Additional Context

gha-runner-scale-set:

  githubConfigUrl: changeme
  githubConfigSecret: github-arc-secret

  minRunners: 0
  runnerGroup: "changeme"
  runnerScaleSetName: "changeme"

  githubServerTLS:
    certificateFrom:
      configMapKeyRef:
        name: my-cacert
        key: ca.crt
    runnerMountPath: /usr/local/share/ca-certificates/

  containerMode:
    type: "kubernetes"  ## type can be set to dind or kubernetes
    kubernetesModeWorkVolumeClaim:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp2-encrypted"
      resources:
        requests:
          storage: 5Gi

  template:
    ### CUSTOM ###
    spec:
      nodeSelector:
        github: "true"
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: github
      priorityClassName: github
      ### END CUSTOM ###
      securityContext:
        fsGroup: 123
      containers:
      - name: runner
        # image: ghcr.io/actions/actions-runner:latest
        image: ACTIONS-RUNNER-IMAGE-MIRROR/actions-runner:2.314.0
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: "200m"
            memory: "512Mi"
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/pod-templates/default.yaml
        volumeMounts:
          - name: pod-templates
            mountPath: /home/runner/pod-templates
            readOnly: true
      volumes:
      - name: pod-templates
        configMap:
          name: pod-templates

  controllerServiceAccount:
    namespace: arc-system
    name: github-actions-controller-gha-rs-controller

Controller Logs

My employer's open source contribution policy prohibits me from posting this information in public; however, I can post relevant redacted portions upon request.

Runner Pod Logs

My employer's open source contribution policy prohibits me from posting this information in public; however, I can post relevant redacted portions upon request.
@rteeling-evernorth rteeling-evernorth added the bug Something isn't working label Mar 1, 2024
@rteeling-evernorth rteeling-evernorth changed the title Docker Container Action Jobs failing to schedule Docker Container Action Jobs failing to schedule on autoscaled cluster Mar 1, 2024

github-actions bot commented Mar 1, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic nikola-jokic transferred this issue from actions/actions-runner-controller Mar 4, 2024
@nikola-jokic
Member

Hey @rteeling-evernorth,

This issue is related to the hook ☺️. Are you using the default hook implementation in your container mirror? If so, the hook schedules the job pod onto the same node where the runner is running, so the problem is node capacity rather than the scheduler. By default we skip the scheduler so the job pod can reuse the volume mount from the runner pod. This can be avoided if you use ReadWriteMany volumes, but that would require you to configure the environment variables appropriately.
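A sketch of what that workaround could look like in the values file, assuming a ReadWriteMany-capable storage class is available (the storage class name is a placeholder, and ACTIONS_RUNNER_USE_KUBE_SCHEDULER comes from the runner-container-hooks project, not from this thread; verify it against your hook version):

```yaml
# Hypothetical values changes: an RWX work volume plus the hook env that
# opts back into the Kubernetes scheduler instead of pinning the job pod
# to the runner's node.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]   # job pod can mount the work volume from any node
    storageClassName: "efs-sc"       # placeholder: any RWX-capable storage class
    resources:
      requests:
        storage: 5Gi
template:
  spec:
    containers:
    - name: runner
      env:
        - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER
          value: "true"
```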

@rteeling-evernorth
Author

Ah! That would explain it. Everything in my mirror is off-the-shelf for 0.8.2, and I was using the default work volume claim from the values file, which is ReadWriteOnce. That accounts for the behavior I am seeing. Thank you so much for the info!

@nikola-jokic
Member

You are welcome!
