Too many GitHub's API requests reach the rate limit #206

eb-trigo · 2020-11-19T12:13:29Z

Hi,

After deploying ~20 runners/pods I noticed that after half an hour they are up and running I received an error that appeared inside the controller-manager container saying I've reached the maximum of GitHub Apps's rate limit of 5,000 requests per hour.

Why the actions-runner-controller API is making too many concurrent requests in such a short time to Github?
Is it possible to reduce it significantly?

These are the settings I added to avoid this issue but they did not help:

kind: HorizontalRunnerAutoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 60

And --sync-period=10m

The text was updated successfully, but these errors were encountered:

ghost · 2020-11-25T04:46:43Z

I think a main contributor to this problem are the number of times the Runner Reconciler updates the registration token and then re-queues the reconcile job. Would you mind confirming the number of times Updated registration token is logged by the manager per hour (lifetime of the token)

ghost · 2020-11-25T05:19:46Z

Since updates to the Runner resource are not evaluated immediately but are placed on the controller-runtime workqueue, the token expiration date is not immediately updated, the following function returns false https://git.blendlabs.com/blend/actions-runner-controller/blob/184538e3ee1f85477884d29bc6dc33454f072f5d/controllers/runner_controller.go#L265 and the reconcile function returns Requeue: true in the runner reconcile func below:

		if updated, err := r.updateRegistrationToken(ctx, runner); err != nil {
			return ctrl.Result{}, err
		} else if updated {
			return ctrl.Result{Requeue: true}, nil
		}

I wonder if there is a good solution to delay requeue - one option is to add time.Sleep(X * time.Second) before returning the requeue result.

ghost · 2020-11-30T05:50:53Z

Hi @eb-trigo - can you try running with the latest release? I above PR should reduce the total number of api requests made to GitHub substantially.

theobolo · 2020-12-18T16:40:50Z

@ZacharyBenamram I'm using latest images for the controller and for the runners (DinD runner image).

I'm still regularly banned from Github API.
I'm operating an EKS cluster for our self-hosted Actions Runner, witch can auto-scale to 100 Runners maximum (m5.x2large with 1 runner/k8s node), i've tried to set : scaleDownDelaySecondsAfterScaleOut: 120 but nothing work.

My controller is banned within 10 or 20min it depends...
I was hoping to be able to scale more than that, but i'm stuck with the rate limiting.

Do you think that there's a solution except the PR about GetRegistrationToken ?
Do i need to increase : --sync-period to something like 30m or 1h ?

ghost · 2020-12-18T16:50:13Z

Hey @theobolo - another step you can take is to use the alternate hpa scheme PercentageRunnersBusy. This reduces the number of api calls made to github considerably. Instead of iterating through each repository, and then checking each workflow, and subsequently each job - the new hpa scheme only makes one call to github to get the number of runners busy. This should reduce the total number of requests you are making. I would keep --sync-period to 1m to test to begin with. If you set sync-period too high, then the hpa will become useless.

theobolo · 2020-12-18T17:06:33Z

@ZacharyBenamram Thanks for your answer, i'm gonna give it a try right now :)

theobolo · 2020-12-18T17:49:05Z

@ZacharyBenamram I'm certainly missing something but controller is printing this log, and the AutoScaling is not working :

2020-12-18T17:47:40.323Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "horizontalrunnerautoscaler", "request": "actions-runner-system/actions-runner-aosforce-autoscaler", "error": "organization and repository are both empty"}

I updated my CRDs to use the last modifications btw

ghost · 2020-12-18T17:58:22Z

@theobolo - are you able to add your organization to the runnerdeployment spec? And can you also provide your runner deployment + horizontal runner autoscaler specs here?

theobolo · 2020-12-18T18:04:09Z

Ok that's working using Organization to my runnerdeployment spec, using repository scoped RunnerDeployment seems not working. I can't use both within the RunnerDeployment (as excepted).

There is my RunnerDeployment + Horizontal witch is working now :

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aos
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      # repository: go-aos/aos-app
      organization: go-aos
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/home/runner"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7"
          memory: "28Gi"
        requests:
          cpu: "7"
          memory: "28Gi"
      workDir: /home/runner/work

---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 50
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'

And this one, witch is not :

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aosforce
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      repository: go-aos/aosforce-app
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/home/runner"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7"
          memory: "28Gi"
        requests:
          cpu: "7"
          memory: "28Gi"
      workDir: /home/runner/work

---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aosforce-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: actions-runner-aosforce
  minReplicas: 1
  maxReplicas: 5
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'

I'll push testing a little bit tonight, i'm off for the moment. Thanks for your Quick answer @ZacharyBenamram

theobolo · 2020-12-19T01:09:21Z

@ZacharyBenamram I did try with this config & --sync-period=1m :

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aos
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      # repository: go-aos/aos-app
      organization: go-aos
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/home/runner"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7"
          memory: "28Gi"
        requests:
          cpu: "7"
          memory: "28Gi"
      workDir: /home/runner/work

---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 20
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'

Unfortunately i ended being rate limited after maybe 30mins more or less.
My strategie was to load up my 20 workers during 1 hour with jobs.
Thoses jobs are running during 5-6mins.

BTW, I did success to run 10 parrallel workers, during 1 hour with 5min jobs on each.... wich is quite disapointing xD

Need to setup prometheus stats and grafana to track number of API requests during scale up process and workers cycling with jobs.

Nobody here did try this before ? I'm really surprised, tought it was pretty common to run more than 10 parrallel actions runners, even more when it's self-hosted on a kube cluster ;D

ps: I also tried to use Github APP instead of a Github PAT for Authentication, but without any improvements :/

mumoshu · 2020-12-19T02:42:42Z

Nobody here did try this before

Honestly, I've never tried that with low sync-period. My strategy so far has been to use as long as possible sync-period to avoid rate limiting.

Also, I think the API has too strict rate limit for our use-case. But I'm open to suggestions or data to help us make the controller even more efficient.

mumoshu · 2020-12-19T03:00:28Z

Also- could you run curl -H "Accept: application/vnd.github.v3+json" -H "Authorization: bearer $GITHUB_TOKEN" https://api.github.com/rate_limit periodically and see the below section for the remaining number? That would give you some data to tune your sync-period.

...
  "rate": {
    "limit": 5000,
    "used": 74,
    "remaining": 4926,
    "reset": 1608349448
  }
}

At least, PercentageRunnersBusy would call List-runners API for every sync period per HRA, and the runner and runnerreplicaset controllres would call Create registration token API, 1~3 List-runners API, and Delete runner API for each job to run. If there is much more API calls than that, there might be a bug in the controller.

theobolo · 2020-12-20T12:31:59Z

Hello everyone,

I did a lot of tests yesterday, and i found that there is a big problem when you forget the replicas: 1 line on the RunnerDeployment specs !

The Autoscaler is working, but it's consistently trying to scale down to 1 worker, wich is the number of replicas expected by the ReplicaSet...

This is really misleading because it's kinda working, but it cause a lot of troubles and also a lot of unnecessary API Requests, because runners are scaling up and then instantly scaled down to 1 replica. It leads to have some Runners in "Offline" state, with some skipped jobs on Github Actions side, etc...

I suspect that some of thoses issues : #62 | #77 are related to this, because it's exactly the behaviour that i've seen on my cluster :

Some pods are "Completed", and Github Actions UI show them "Offline", i'm not 100% sure if that's related... Without "replicas: 1", i didn't experienced this case: Offline runners, etc ...

So don't use replicas: on your RunnerDeployment specs when you use an HorizontalRunnerAutoscaler !
It's a common thing (since the classic HPA on Kubernetes is behaving in the same way) but i feel that it should maybe be a little bit explained on your documentation.

On the API Request limitation side, fixing this, gave me a lot more room to use small --sync-period values with a lot of Max Replicas on the HorizontalRunnerAutoscaler.

I did hit the limit by using 50 workers in parallel with a --sync-period=1m but after 40 et 45min wich is a lot better.

I'll use your tip @mumoshu to track API Calls and find the sweet spot between max replicas and sync-period. I'm very attracted by this controller to minimize infrastructure cost by using a fast and responsive HorizontalRunnerAutoscaler, but for the moment we need to deal with Github API limitations ...

I'll give you more insights soon, my goal is to achieve 100 workers in parallel with a decent --sync-period (less than 10min...).

See ya

alexgreco86 · 2021-01-18T20:19:45Z

Not sure what's happening but all of a sudden I'm getting high API call usage even with no HPA's installed. I have the controller manager installed on a single cluster and it goes up about 10-20 API calls every few seconds or so.

This all started after a botched deployment of a second HPA which created about 2k runners in our Org but even after we removed that installation and revoked the token, it's still eating through our rate.

Edit: Looks like it was due to the 2k offline runners being created in our Org. The reconciliation process seemed to be checking all those runners page by page and eating up our limit. I've deleted them all using the rest API and its a lot less usage now.

andyfeller · 2021-04-28T13:18:37Z

Outside of changing the GitHub API behavior or playing around with syncing settings, any thoughts about having the controller register itself as a GitHub App so it can take advantage of higher server-to-server rate limiting?

Different server-to-server request rate limits apply to GitHub Apps if the app is installed on organizations or repositories owned by a GitHub Enterprise Cloud account.

Normal server-to-server rate limits
GitHub Apps making server-to-server requests use the installation's minimum rate limit of 5,000 requests per hour. Organization installations with more than 20 users receive another 50 requests per hour for each user. Installations that have more than 20 repositories receive another 50 requests per hour for each repository. The maximum rate limit for an installation is 12,500 requests per hour.

GitHub Enterprise Cloud server-to-server rate limits
GitHub Apps that are installed on an organization or repository owned by a GitHub Enterprise Cloud account and make server-to-server requests have a rate limit of 15,000 requests per hour.

Genuinely ignorant question; there might be a reason this is using PATs

stale · 2021-05-28T17:27:16Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ghost mentioned this issue Nov 26, 2020

GetRegistrationToken: Remove 10 minute buffer to token expiration #214

Merged

theobolo mentioned this issue Mar 8, 2021

[Github Webhook HRA] Not able to get it working... #377

Closed

callum-tait-pbx mentioned this issue Apr 21, 2021

docs: adding scaling configuration detail #469

Merged

stale bot added the stale label May 28, 2021

stale bot closed this as completed Jun 11, 2021

toast-gear mentioned this issue Nov 30, 2021

docs: various clarifications #976

Merged

toast-gear mentioned this issue Sep 22, 2022

Runners continuously get terminated immediately with my custom runner image and custom cert #1834

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too many GitHub's API requests reach the rate limit #206

Too many GitHub's API requests reach the rate limit #206

eb-trigo commented Nov 19, 2020 •

edited

Loading

ghost commented Nov 25, 2020

ghost commented Nov 25, 2020 •

edited by ghost

Loading

ghost commented Nov 30, 2020

theobolo commented Dec 18, 2020

ghost commented Dec 18, 2020

theobolo commented Dec 18, 2020

theobolo commented Dec 18, 2020

ghost commented Dec 18, 2020 •

edited by ghost

Loading

theobolo commented Dec 18, 2020 •

edited

Loading

theobolo commented Dec 19, 2020 •

edited

Loading

mumoshu commented Dec 19, 2020 •

edited

Loading

mumoshu commented Dec 19, 2020 •

edited

Loading

theobolo commented Dec 20, 2020 •

edited

Loading

alexgreco86 commented Jan 18, 2021 •

edited

Loading

andyfeller commented Apr 28, 2021 •

edited

Loading

stale bot commented May 28, 2021

Too many GitHub's API requests reach the rate limit #206

Too many GitHub's API requests reach the rate limit #206

Comments

eb-trigo commented Nov 19, 2020 • edited Loading

ghost commented Nov 25, 2020

ghost commented Nov 25, 2020 • edited by ghost Loading

ghost commented Nov 30, 2020

theobolo commented Dec 18, 2020

ghost commented Dec 18, 2020

theobolo commented Dec 18, 2020

theobolo commented Dec 18, 2020

ghost commented Dec 18, 2020 • edited by ghost Loading

theobolo commented Dec 18, 2020 • edited Loading

theobolo commented Dec 19, 2020 • edited Loading

mumoshu commented Dec 19, 2020 • edited Loading

mumoshu commented Dec 19, 2020 • edited Loading

theobolo commented Dec 20, 2020 • edited Loading

alexgreco86 commented Jan 18, 2021 • edited Loading

andyfeller commented Apr 28, 2021 • edited Loading

stale bot commented May 28, 2021

eb-trigo commented Nov 19, 2020 •

edited

Loading

ghost commented Nov 25, 2020 •

edited by ghost

Loading

ghost commented Dec 18, 2020 •

edited by ghost

Loading

theobolo commented Dec 18, 2020 •

edited

Loading

theobolo commented Dec 19, 2020 •

edited

Loading

mumoshu commented Dec 19, 2020 •

edited

Loading

mumoshu commented Dec 19, 2020 •

edited

Loading

theobolo commented Dec 20, 2020 •

edited

Loading

alexgreco86 commented Jan 18, 2021 •

edited

Loading

andyfeller commented Apr 28, 2021 •

edited

Loading