
Too many GitHub API requests reach the rate limit #206

Closed · eb-trigo opened this issue Nov 19, 2020 · 16 comments

@eb-trigo commented Nov 19, 2020

Hi,

After deploying ~20 runners/pods, I noticed that about half an hour after they came up, an error appeared inside the controller-manager container saying I had reached the GitHub App rate limit of 5,000 requests per hour.

Why is actions-runner-controller making so many requests to GitHub in such a short time?
Is it possible to reduce them significantly?

These are the settings I added to try to avoid this issue, but they did not help:

kind: HorizontalRunnerAutoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 60

And I set --sync-period=10m.
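
For context, this is roughly where that flag lives - a hypothetical excerpt of the controller Deployment; the deployment and container names vary by install, so adjust to however actions-runner-controller is deployed in your cluster:

spec:
  containers:
  - name: manager   # name is an assumption
    args:
    - "--sync-period=10m"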

@ZacharyBenamram (Contributor)

I think a main contributor to this problem is the number of times the Runner reconciler updates the registration token and then re-queues the reconcile job. Would you mind confirming how many times "Updated registration token" is logged by the manager per hour (the lifetime of the token)?
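
One way to count it (a sketch; the namespace, deployment, and container names are assumptions, adjust them to your install):

kubectl -n actions-runner-system logs deploy/controller-manager -c manager \
  --since=1h | grep -c 'Updated registration token'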

@ZacharyBenamram (Contributor) commented Nov 25, 2020

Since updates to the Runner resource are not evaluated immediately but are placed on the controller-runtime workqueue, the token expiration date is not updated right away. The check at https://git.blendlabs.com/blend/actions-runner-controller/blob/184538e3ee1f85477884d29bc6dc33454f072f5d/controllers/runner_controller.go#L265 therefore returns false, and the reconcile function returns Requeue: true in the runner reconcile func below:

		if updated, err := r.updateRegistrationToken(ctx, runner); err != nil {
			return ctrl.Result{}, err
		} else if updated {
			return ctrl.Result{Requeue: true}, nil
		}

I wonder if there is a good solution for delaying the requeue - one option is to add time.Sleep(X * time.Second) before returning the requeue result.
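
A non-blocking alternative (a sketch, not what the controller does today): controller-runtime's ctrl.Result has a RequeueAfter field, which delays the retry without tying up a worker goroutine the way time.Sleep would:

		// Sketch: requeue after a delay instead of sleeping in the reconciler.
		// The 30-second value is an arbitrary example, not a tuned recommendation;
		// the standard "time" package must be in the file's imports.
		if updated, err := r.updateRegistrationToken(ctx, runner); err != nil {
			return ctrl.Result{}, err
		} else if updated {
			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
		}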

@ZacharyBenamram (Contributor)

Hi @eb-trigo - can you try running with the latest release? The above PR should reduce the total number of API requests made to GitHub substantially.

@theobolo

@ZacharyBenamram I'm using the latest images for the controller and for the runners (the DinD runner image).

I'm still regularly banned from the GitHub API.
I'm operating an EKS cluster for our self-hosted Actions runners, which can auto-scale to a maximum of 100 runners (m5.2xlarge, 1 runner per k8s node). I've tried setting scaleDownDelaySecondsAfterScaleOut: 120, but nothing works.

My controller is banned within 10 or 20 min, it depends...
I was hoping to be able to scale beyond that, but I'm stuck on the rate limiting.

Do you think there's a solution other than the PR about GetRegistrationToken?
Do I need to increase --sync-period to something like 30m or 1h?

@ZacharyBenamram (Contributor)

Hey @theobolo - another step you can take is to use the alternate HPA scheme, PercentageRunnersBusy. This reduces the number of API calls made to GitHub considerably: instead of iterating through each repository, then checking each workflow and subsequently each job, the new scheme makes only one call to GitHub to get the number of busy runners. This should reduce the total number of requests you are making. I would keep --sync-period at 1m to begin with; if you set the sync period too high, the HPA becomes useless.

@theobolo

@ZacharyBenamram Thanks for your answer, I'm gonna give it a try right now :)

@theobolo

@ZacharyBenamram I'm certainly missing something, but the controller is printing this log and the autoscaling is not working:

2020-12-18T17:47:40.323Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "horizontalrunnerautoscaler", "request": "actions-runner-system/actions-runner-aosforce-autoscaler", "error": "organization and repository are both empty"}

I updated my CRDs to pick up the latest modifications, btw.

@ZacharyBenamram (Contributor) commented Dec 18, 2020

@theobolo - are you able to add your organization to the RunnerDeployment spec? And can you also provide your RunnerDeployment + HorizontalRunnerAutoscaler specs here?

@theobolo commented Dec 18, 2020

OK, that's working when I add organization to my RunnerDeployment spec; a repository-scoped RunnerDeployment seems not to work. I can't use both within the same RunnerDeployment (as expected).

Here is my RunnerDeployment + HorizontalRunnerAutoscaler, which is working now:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aos
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      # repository: go-aos/aos-app
      organization: go-aos
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/home/runner"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7"
          memory: "28Gi"
        requests:
          cpu: "7"
          memory: "28Gi"
      workDir: /home/runner/work

---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 50
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'

And this one, which is not:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aosforce
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      repository: go-aos/aosforce-app
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/home/runner"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7"
          memory: "28Gi"
        requests:
          cpu: "7"
          memory: "28Gi"
      workDir: /home/runner/work

---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aosforce-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: actions-runner-aosforce
  minReplicas: 1
  maxReplicas: 5
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'

I'll push the testing a bit further tonight; I'm off for the moment. Thanks for your quick answer @ZacharyBenamram

@theobolo commented Dec 19, 2020

@ZacharyBenamram I did try with this config and --sync-period=1m:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aos
  namespace: actions-runner-system
spec:
  replicas: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      # repository: go-aos/aos-app
      organization: go-aos
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/home/runner"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7"
          memory: "28Gi"
        requests:
          cpu: "7"
          memory: "28Gi"
      workDir: /home/runner/work

---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 20
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '1.4'
    scaleDownFactor: '0.7'

Unfortunately I ended up being rate limited after maybe 30 min, more or less.
My strategy was to load up my 20 workers with jobs for 1 hour.
Those jobs each run for 5-6 min.

BTW, I did manage to run 10 parallel workers for 1 hour with 5 min jobs on each... which is quite disappointing xD

I need to set up Prometheus stats and Grafana to track the number of API requests during the scale-up process and while workers cycle through jobs.

Has nobody here tried this before? I'm really surprised; I thought it was pretty common to run more than 10 parallel Actions runners, even more so when it's self-hosted on a kube cluster ;D

PS: I also tried using a GitHub App instead of a GitHub PAT for authentication, but without any improvement :/

@mumoshu (Collaborator) commented Dec 19, 2020

"Has nobody here tried this before?"

Honestly, I've never tried that with a low sync-period. My strategy so far has been to use as long a sync-period as possible to avoid rate limiting.

Also, I think the API rate limit is too strict for our use case. But I'm open to suggestions or data that would help us make the controller even more efficient.

@mumoshu (Collaborator) commented Dec 19, 2020

Also - could you run curl -H "Accept: application/vnd.github.v3+json" -H "Authorization: bearer $GITHUB_TOKEN" https://api.github.com/rate_limit periodically and watch the section below for the remaining number? That would give you some data for tuning your sync-period.

...
  "rate": {
    "limit": 5000,
    "used": 74,
    "remaining": 4926,
    "reset": 1608349448
  }
}
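
One way to poll it, assuming curl and jq are installed (the 60-second interval is arbitrary):

# Print the remaining request budget once a minute.
watch -n 60 'curl -s -H "Authorization: bearer $GITHUB_TOKEN" https://api.github.com/rate_limit | jq .rate.remaining'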

At a minimum, PercentageRunnersBusy will call the list-runners API once per sync period per HRA, and the runner and runnerreplicaset controllers will call the create-registration-token API once, the list-runners API 1~3 times, and the delete-runner API once for each job run. If there are many more API calls than that, there might be a bug in the controller.
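
As a rough budget under those numbers (one HRA, --sync-period=1m, ~5 calls per job; the 100 jobs/hour figure is only an illustration):

  60 syncs/hour × 1 list-runners call             =  60 calls/hour
+ 100 jobs/hour × ~5 calls (token, list, delete)  ≈ 500 calls/hour
                                            total ≈ 560 calls/hour

That is well under the 5,000/hour limit, so sustained rate limiting at that scale would suggest extra calls coming from somewhere.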

@theobolo commented Dec 20, 2020

Hello everyone,

I did a lot of tests yesterday, and I found that there is a big problem when you leave the replicas: 1 line in the RunnerDeployment spec!

The autoscaler works, but it consistently tries to scale back down to 1 worker, which is the number of replicas requested by the ReplicaSet...

This is really misleading because it kind of works, but it causes a lot of trouble and a lot of unnecessary API requests, because runners scale up and are then instantly scaled back down to 1 replica. It leads to runners in an "Offline" state, skipped jobs on the GitHub Actions side, etc.

I suspect that some of those issues (#62 | #77) are related to this, because it's exactly the behaviour I've seen on my cluster:

Some pods are "Completed" and the GitHub Actions UI shows them as "Offline". I'm not 100% sure that's related, but without replicas: 1 I didn't experience the offline runners, etc.

So don't set replicas: in your RunnerDeployment spec when you use a HorizontalRunnerAutoscaler!
It's a common gotcha (the classic Kubernetes HPA behaves the same way), but I feel it should be explained a little in the documentation.
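
A minimal sketch of the corrected spec, reusing the names from the manifests above (only the relevant fields are shown):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aos
  namespace: actions-runner-system
spec:
  # No replicas: field - the HorizontalRunnerAutoscaler that targets this
  # RunnerDeployment owns the replica count.
  template:
    spec:
      organization: go-aos
      image: summerwind/actions-runner-dind:latest
      dockerdWithinRunnerContainer: true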

On the API request limitation side, fixing this gave me a lot more room to use small --sync-period values with a high maxReplicas on the HorizontalRunnerAutoscaler.

I still hit the limit using 50 workers in parallel with --sync-period=1m, but only after 40 to 45 min, which is a lot better.

I'll use your tip @mumoshu to track API calls and find the sweet spot between max replicas and sync-period. I'm very attracted to this controller as a way to minimize infrastructure cost with a fast and responsive HorizontalRunnerAutoscaler, but for the moment we need to deal with the GitHub API limitations...

I'll give you more insights soon; my goal is to achieve 100 workers in parallel with a decent --sync-period (less than 10 min...).

See ya

@alexgreco86 commented Jan 18, 2021

Not sure what's happening, but all of a sudden I'm getting high API call usage even with no HPAs installed. I have the controller manager installed on a single cluster, and usage goes up by about 10-20 API calls every few seconds or so.

This all started after a botched deployment of a second HPA, which created about 2k runners in our org, but even after we removed that installation and revoked the token, it was still eating through our rate.

Edit: Looks like it was due to the 2k offline runners that had been created in our org. The reconciliation process seemed to be checking all those runners page by page and eating up our limit. I've deleted them all using the REST API and usage is a lot lower now.
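
For reference, a sketch of that cleanup using the documented list/delete self-hosted runner endpoints (the org name is a placeholder, and anything beyond the first 100 runners would need pagination):

ORG=my-org  # placeholder
# Delete every offline self-hosted runner registered to the org.
for id in $(curl -s -H "Authorization: bearer $GITHUB_TOKEN" \
    "https://api.github.com/orgs/$ORG/actions/runners?per_page=100" \
  | jq -r '.runners[] | select(.status == "offline") | .id'); do
  curl -s -X DELETE -H "Authorization: bearer $GITHUB_TOKEN" \
    "https://api.github.com/orgs/$ORG/actions/runners/$id"
done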

@andyfeller (Contributor) commented Apr 28, 2021

Outside of changing the GitHub API behavior or playing around with the sync settings, any thoughts about having the controller register itself as a GitHub App so it can take advantage of the higher server-to-server rate limits?

Different server-to-server request rate limits apply to GitHub Apps if the app is installed on organizations or repositories owned by a GitHub Enterprise Cloud account.

Normal server-to-server rate limits
GitHub Apps making server-to-server requests use the installation's minimum rate limit of 5,000 requests per hour. Organization installations with more than 20 users receive another 50 requests per hour for each user. Installations that have more than 20 repositories receive another 50 requests per hour for each repository. The maximum rate limit for an installation is 12,500 requests per hour.

GitHub Enterprise Cloud server-to-server rate limits
GitHub Apps that are installed on an organization or repository owned by a GitHub Enterprise Cloud account and make server-to-server requests have a rate limit of 15,000 requests per hour.
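
Reading the quoted limits literally, an installation on an org with 100 users and 40 repositories would get roughly:

5,000 base + 100 users × 50 + 40 repos × 50 = 12,000 requests/hour

still under the 12,500/hour cap, and well above the 5,000/hour this thread keeps hitting.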

Genuinely ignorant question; there might be a reason this is using PATs
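
For what it's worth, the controller does appear to support GitHub App authentication already (@theobolo mentioned trying it above). A sketch of the usual secret-based setup; the secret name and key names here are assumptions to verify against the project README:

kubectl create secret generic controller-manager \
  -n actions-runner-system \
  --from-literal=github_app_id=${APP_ID} \
  --from-literal=github_app_installation_id=${INSTALLATION_ID} \
  --from-file=github_app_private_key=${PRIVATE_KEY_FILE}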

@stale bot commented May 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
