
[Github Webhook HRA] Not able to get it working... #377

Closed
theobolo opened this issue Mar 8, 2021 · 39 comments

@theobolo

theobolo commented Mar 8, 2021

Hello everyone,

Since the scaling problems I described in issue #206, I've found an efficient workaround...

I'm using multiple Kubernetes clusters (5, actually) with one actions-runner-controller deployed on each one.

  • Each controller manages a pool of 20 runners, autoscaled using the type: PercentageRunnersBusy metric.
  • Each controller uses its own GitHub App for GitHub API auth, which gives me approx. 6700 API calls per hour on each cluster.
  • Each controller has a sync-period of 1m.

It's working well, and it was the only solution I found to run 100 runners concurrently with the actions-runner-controller.

@mumoshu

Btw, that's not why I'm here today. Since I've seen the new GitHub Webhook HRA feature, I absolutely need it so I can stop doing this kind of workaround and use the controller "at scale".

Unfortunately, I'm not able to get it working with the latest Helm chart version 0.7.0.
I tried the latest/v0.17.0/canary versions of the controller image, and I'm using the 'master' branch CRDs.

When I declare the HRA like this:

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
  namespace: default
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"

The actions-runner-controller crashes with this log:

2021-03-08T14:40:39.333Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2021-03-08T14:40:39.333Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "actions.summerwind.dev/v1alpha1, Kind=Runner", "path": "/mutate-actions-summerwind-dev-v1alpha1-runner"}
2021-03-08T14:40:39.333Z INFO controller-runtime.webhook registering webhook {"path": "/mutate-actions-summerwind-dev-v1alpha1-runner"}
2021-03-08T14:40:39.333Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "actions.summerwind.dev/v1alpha1, Kind=Runner", "path": "/validate-actions-summerwind-dev-v1alpha1-runner"}
2021-03-08T14:40:39.333Z INFO controller-runtime.webhook registering webhook {"path": "/validate-actions-summerwind-dev-v1alpha1-runner"}
2021-03-08T14:40:39.333Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "path": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-03-08T14:40:39.333Z INFO controller-runtime.webhook registering webhook {"path": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-03-08T14:40:39.333Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "path": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-03-08T14:40:39.333Z INFO controller-runtime.webhook registering webhook {"path": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment"}
2021-03-08T14:40:39.333Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "path": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-03-08T14:40:39.333Z INFO controller-runtime.webhook registering webhook {"path": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-03-08T14:40:39.333Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "path": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-03-08T14:40:39.334Z INFO controller-runtime.webhook registering webhook {"path": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset"}
2021-03-08T14:40:39.334Z INFO setup starting manager
2021-03-08T14:40:39.334Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2021-03-08T14:40:39.435Z INFO controller-runtime.webhook.webhooks starting webhook server
2021-03-08T14:40:39.435Z INFO controller-runtime.certwatcher Updated current TLS certificate
2021-03-08T14:40:39.435Z INFO controller-runtime.webhook serving webhook server {"host": "", "port": 9443}
2021-03-08T14:40:39.436Z INFO controller-runtime.certwatcher Starting certificate watcher
2021-03-08T14:40:56.134Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"default","name":"controller-leader-election-helper","uid":"900760ed-cad7-435b-964f-e3694c664fbe","apiVersion":"v1","resourceVersion":"5323021"}, "reason": "LeaderElection", "message": "actions-controller-actions-runner-controller-554966bb8b-lbwvt_6caf86f4-a576-4e77-b0c5-51d19c018b26 became leader"}
2021-03-08T14:40:56.134Z INFO controller-runtime.controller Starting EventSource {"controller": "horizontalrunnerautoscaler", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.134Z INFO controller-runtime.controller Starting EventSource {"controller": "runner", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.134Z INFO controller-runtime.controller Starting EventSource {"controller": "runnerreplicaset", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.134Z INFO controller-runtime.controller Starting EventSource {"controller": "runnerreplicaset", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.135Z INFO controller-runtime.controller Starting EventSource {"controller": "runnerdeployment", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.234Z INFO controller-runtime.controller Starting Controller {"controller": "horizontalrunnerautoscaler"}
2021-03-08T14:40:56.234Z INFO controller-runtime.controller Starting EventSource {"controller": "runner", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.235Z INFO controller-runtime.controller Starting Controller {"controller": "runnerreplicaset"}
2021-03-08T14:40:56.235Z INFO controller-runtime.controller Starting EventSource {"controller": "runnerdeployment", "source": "kind source: /, Kind="}
2021-03-08T14:40:56.235Z INFO controller-runtime.controller Starting Controller {"controller": "runnerdeployment"}
2021-03-08T14:40:56.335Z INFO controller-runtime.controller Starting workers {"controller": "runnerreplicaset", "worker count": 1}
2021-03-08T14:40:56.335Z INFO controllers.RunnerReplicaSet debug {"runnerreplicaset": "default/actions-runner-aos-h9ppg", "desired": 1, "available": 1}
2021-03-08T14:40:56.335Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerreplicaset", "request": "default/actions-runner-aos-h9ppg"}
2021-03-08T14:40:56.336Z INFO controller-runtime.controller Starting Controller {"controller": "runner"}
2021-03-08T14:40:56.335Z INFO controller-runtime.controller Starting workers {"controller": "horizontalrunnerautoscaler", "worker count": 1}
E0308 14:40:56.336609 1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 343 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x15aabe0, 0xc00027ed80)
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:48 +0x89
panic(0x15aabe0, 0xc00027ed80)
/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).calculateReplicasByQueuedAndInProgressWorkflowRuns(0xc0002e2ac0, 0x13d38ba, 0x10, 0xc00027ed60, 0x1f, 0xc000704ca0, 0x12, 0x0, 0x0, 0xc00071cb40, ...)
/workspace/controllers/autoscaling.go:50 +0xe7e
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).determineDesiredReplicas(0xc0002e2ac0, 0x13d38ba, 0x10, 0xc00027ed60, 0x1f, 0xc000704ca0, 0x12, 0x0, 0x0, 0xc00071cb40, ...)
/workspace/controllers/autoscaling.go:31 +0xb8
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).computeReplicas(0xc0002e2ac0, 0x13d38ba, 0x10, 0xc00027ed60, 0x1f, 0xc000704ca0, 0x12, 0x0, 0x0, 0xc00071cb40, ...)
/workspace/controllers/horizontalrunnerautoscaler_controller.go:142 +0x7b
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).Reconcile(0xc0002e2ac0, 0xc00017a7e0, 0x7, 0xc00027f9e0, 0x1d, 0x428f095d4, 0xc000558cf0, 0xc0002d27e8, 0xc0002d27e0)

I tried to delete:

  minReplicas: 1
  maxReplicas: 10

to follow the README.md example, but the controller isn't happy either and keeps saying to add minReplicas and maxReplicas.

I know this feature is at an early stage, so I wouldn't be surprised if it doesn't work yet; I just wanted to be sure you're aware of it :D

👍

@robwhitby
Contributor

I had the same issue; the HorizontalRunnerAutoscaler still requires one of the "normal" scaling types under metrics in addition to scaleUpTriggers.

This works for me:

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-autoscaler
  namespace: actions
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    name: runner
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '1'       
    scaleDownThreshold: '0.5'
    scaleUpAdjustment: '1'
    scaleDownAdjustment: '1'
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"

@avdhoot

avdhoot commented Mar 8, 2021

Facing the same issue: after the error below, if I add minReplicas & maxReplicas, I then get a traceback similar to the one above.

2021-03-08T15:45:29.468Z        ERROR   controllers.HorizontalRunnerAutoscaler  Could not compute replicas      {"horizontalrunnerautoscaler": "actions-runner-system/****-runner-autoscaler", "error": "horizontalrunnerautoscaler actions-runner-system/****-runner-autoscaler is missing minReplicas"}

@theobolo
Author

theobolo commented Mar 8, 2021

@robwhitby Your solution works, thanks a lot! :D

Maybe it should be added to the README.md, or made more explicit?

Like this:

...
kind: HorizontalRunnerAutoscaler
spec:
  ...
  scaleTargetRef:
    name: myrunners
  ...
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"

PS: has anyone here found a different way to scale without the GitHub webhook?

I was wondering if the controller could be namespaced, which would allow running multiple controllers with multiple app auths inside the same cluster. That would be a way around GitHub API rate limiting without using the GitHub Webhook HRA.

@callum-tait-pbx
Contributor

callum-tait-pbx commented Mar 8, 2021

@theobolo I've not tried it but the PercentageRunnersBusy metric in theory allows for multiple controllers scaling runners in specific namespaces. See #223 for the original work on the feature with some comments from the author.

EDIT: actually, I think I misread it; I don't think you can run more than 1 controller. The only options you really have are:

  • Set up your controller auth as a GitHub App for the increased API count
  • Use the PercentageRunnersBusy schema as that uses fewer API calls compared to TotalNumberOfQueuedAndInProgressWorkflowRuns
  • Increase the sync period config

@theobolo
Author

theobolo commented Mar 8, 2021

@callum-tait-pbx Hello, thanks for your answer.

I'm already aware of my options, but I was wondering if I was missing something.

I'm already using PercentageRunnersBusy on my HRA.
As I mentioned at the very beginning of my message here, I'm already using a GitHub App for authentication; actually... I'm running 5 different controllers, using 5 different GitHub Apps for auth, hosted on 5 different Kubernetes clusters to avoid per-controller API limitations :D

You can't run multiple controllers on the same cluster since they watch "cluster-wide" resources like RunnerDeployments, etc.

I was asking @mumoshu whether he thinks adding namespacing capabilities to the controller sounds doable / realistic / useful? x)

I was thinking about running the controller with a --namespace argument to specify in which namespace the controller should watch resources (HRA/RunnerDeployments/RunnerReplicaSets/Runners/etc.).

Unfortunately, for the moment it's not possible.

My scaling targets are:

  • concurrency: more than 100 runners at the same time (submitting 1 PR on our repo = 90 tests running...)
  • autoscaling based on PercentageRunnersBusy with aggressive scaling
  • --sync-period: not more than 1 minute

The only way to achieve that is by doing what I've described.
I also want to keep the sync-period as low as possible, to be able to scale efficiently and reduce costs.

The sweet spot for my controllers to achieve this is to not go over 20 pods per HRA.

Running 1 hour straight of tests (each test run takes 5 or 6 minutes), I only hit GitHub API rate limiting at the 56th or 57th minute, which is great. The drawback of that method is that you need multiples of everything... clusters / controllers / GitHub Apps / etc.

This afternoon I tested the GitHub Webhook HRA, but it still doesn't fit my case perfectly, because it reacts to check_run events.

The problem with that: when you have matrix jobs with dynamic provisioning, using a pre-job to populate the matrix, the GitHub webhook HRA is a little flaky, because since I'm running only 1 workflow, it only scales by 1 runner.

Yes, I know... I can modify the HRA to say, "hey, scale 2 or 3 or even more runners on a check_run event". But since my jobs are dynamic matrices triggering a dynamic number of jobs, it's pretty hard to define a "fixed" scale-up value for a check_run that fits multiple repository testing scenarios :/

BTW, I'm very thankful for all the work done on this controller, it's working better and better every day. <3

@mumoshu
Collaborator

mumoshu commented Mar 8, 2021

@theobolo Thanks for your detailed report! I'm still reading all the comments carefully, so let me reply to only one of them right now.

I was wondering if the controller could be namespaced, which would allow running multiple controllers with multiple app auths,

I think it's still undocumented but the webhook-based autoscaler has a flag for configuring the namespace to be watched by the controller.

https://github.com/summerwind/actions-runner-controller/blob/ab1c39de5732d449fe129b64b95b317aab11a6bf/cmd/githubwebhookserver/main.go#L72

We can probably add a similar flag to the controller-manager easily. I'll give it a shot soon.

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

Yes, I know... I can modify the HRA to say, "hey, scale 2 or 3 or even more runners on a check_run event". But since my jobs are dynamic matrices triggering a dynamic number of jobs, it's pretty hard to define a "fixed" scale-up value for a check_run that fits multiple repository testing scenarios :/

@theobolo Does a pullRequest or push trigger help?

I was wondering if your workflow is a matrix build, as occasionally seen in CI. If so, you could add a scaleUpTrigger of pullRequest whose type is synchronize, or a push on a specific branch, to trigger a larger scale-up.
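For illustration, such an HRA could look roughly like the following; this is only a sketch, the amounts and branch names are illustrative, and it assumes the pullRequest trigger accepts types and branches fields:

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
  # One PR can fan out into many matrix jobs, so add several runners per PR update.
  - githubEvent:
      pullRequest:
        types: ["synchronize"]
        branches: ["main"]
    amount: 5
    duration: "5m"
  # Keep a per-check_run trigger as a finer-grained fallback.
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"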

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

See #380 for the namespace restriction feature.

@avdhoot

avdhoot commented Mar 9, 2021

@mumoshu Is metrics scaling required when we're using GitHub webhook scaling?

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@avdhoot No. But omitting metrics results in the use of the TotalNumberOfQueuedAndInProgressWorkflowRuns metric:

https://github.com/summerwind/actions-runner-controller/blob/4fa53153111489691c57cee9cd11fdafb9e3d5bd/controllers/autoscaling.go#L75

Also, minReplicas and maxReplicas are required regardless of whether you configure metrics or not (#377 (comment)).

@avdhoot

avdhoot commented Mar 9, 2021

@mumoshu If we set minReplicas and maxReplicas, I get the error below. The controller goes into CrashLoopBackOff because of the error.

https://github.com/summerwind/actions-runner-controller/blob/434823bcb362700165b49715b22cbf75eb112ca6/controllers/autoscaling.go#L71

E0304 15:10:51.166281       1 runtime.go:78] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 329 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x15bb680, 0xc00084e1e0)
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:48 +0x89
panic(0x15bb680, 0xc00084e1e0)
        /usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).calculateReplicasByQueuedAndInProgressWorkflowRuns(0xc00007bb00, 0x13e2ae4, 0x10, 0xc00084e1c0, 0x1f, 0xc0008ab920, 0xf, 0x0, 0x0, 0xc0008899a0, ...)
        /workspace/controllers/autoscaling.go:92 +0x105e
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).determineDesiredReplicas(0xc00007bb00, 0x13e2ae4, 0x10, 0xc00084e1c0, 0x1f, 0xc0008ab920, 0xf, 0x0, 0x0, 0xc0008899a0, ...)
        /workspace/controllers/autoscaling.go:73 +0xd8
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).computeReplicas(0xc00007bb00, 0x13e2ae4, 0x10, 0xc00084e1c0, 0x1f, 0xc0008ab920, 0xf, 0x0, 0x0, 0xc0008899a0, ...)
        /workspace/controllers/horizontalrunnerautoscaler_controller.go:206 +0x98
github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerReconciler).Reconcile(0xc00007bb00, 0xc0000cb3c0, 0x15, 0xc0000cb380, 0x1a, 0x4847d7902, 0xc0004081b0, 0xc0001ee3f8, 0xc0001ee3f0)
        /workspace/controllers/horizontalrunnerautoscaler_controller.go:95 +0xf78
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0004ead80, 0x15367a0, 0xc00071cbe0, 0x0)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004ead80, 0xc00030ae00)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0004ead80)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0008ec380)
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008ec380, 0x3b9aca00, 0x0, 0x1, 0xc00037e300)
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153 +0x105
k8s.io/apimachinery/pkg/util/wait.Until(0xc0008ec380, 0x3b9aca00, 0xc00037e300)
        /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@avdhoot Thanks! Seems like a bug:

https://github.com/summerwind/actions-runner-controller/blob/4fa53153111489691c57cee9cd11fdafb9e3d5bd/controllers/autoscaling.go#L94

I'll fix it ASAP. Also, the line numbers in your logs say that you're not using the canary image. Please be sure to use the canary image if you're trying the webhook-based autoscaler.
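For reference, with the Helm chart that usually just means pointing the controller image at the canary tag in your values; a minimal sketch:

image:
  repository: summerwind/actions-runner-controller
  tag: canary
  pullPolicy: Always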

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@avdhoot Could you share your manifest YAML for the HRA and the runner deployment? You seem to be missing .RunnerDeployment.Spec.Template.Spec.Repository. Are you perhaps trying to omit metrics for organizational runners?

Not saying your configuration is wrong. Just trying to fully understand your use-case.

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@avdhoot The gotcha here is that you need to define either rd.Spec.Template.Spec.Repository or hra.Spec.Metrics[].RepositoryNames to make TotalNumberOfQueuedAndInProgressWorkflowRuns work.

If you're trying to omit Metrics for organizational runners, that might cause a panic today, and that might be what you're seeing. I could fix it by disabling the default autoscaling completely when you omit Metrics. That's feasible. But I'm not yet sure if that's what you want, as I don't quite understand your use-case (yet).
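To make the gotcha concrete, an organizational setup that keeps TotalNumberOfQueuedAndInProgressWorkflowRuns working would list the repositories explicitly, along these lines (a sketch; the repository names are placeholders):

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: my-org-runner-autoscaler
spec:
  scaleTargetRef:
    name: my-org-runner
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    # Needed for organizational runners: there is no
    # spec.template.spec.repository on the RunnerDeployment to infer repos from.
    repositoryNames:
    - my-org/repo-a
    - my-org/repo-b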

@avdhoot

avdhoot commented Mar 9, 2021

Right, I am omitting Metrics for organizational runners. We want to avoid multiple RunnerDeployments per repo/group of repos.
Hence we thought scaling only through the GitHub webhook would be sufficient (meaning we can omit metrics). Let me know if I've misunderstood the whole thing.

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: my-org-runner
  namespace: actions-runner-system
spec:
  template:
    spec:
      organization: my-org
      image: summerwind/actions-runner:v2.276.1
      env: []
      resources:
        limits:
          memory: "4G"
        requests:
          cpu: 10m
          memory: "2G"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: my-org-runner-autoscaler
  namespace: actions-runner-system
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    name: my-org-runner
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@avdhoot Thanks! Got it. I'm working on the fix at #381.

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@avdhoot The fix should be available in the current canary image. Would you mind giving it a shot?

@avdhoot

avdhoot commented Mar 9, 2021

@mumoshu Yes, it is working... Thanks 🙏

@Puneeth-n
Contributor

Puneeth-n commented Mar 9, 2021

@mumoshu I just switched to webhook-based scaling and I am using the canary image. I am constantly getting "no horizontalrunnerautoscaler to scale for this github event" from the github-webhook-server. I used #377 (comment) for reference.

2021/03/09 14:08:53 http: superfluous response.WriteHeader call from github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerGitHubWebhook).Handle.func1 (horizontal_runner_autoscaler_webhook.go:81)
2021/03/09 14:08:54 http: superfluous response.WriteHeader call from github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerGitHubWebhook).Handle.func1 (horizontal_runner_autoscaler_webhook.go:81)
2021-03-09T14:08:55.876Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fbd4d0b0-80e0-11eb-881e-e093a24dde6b"}
2021-03-09T14:08:55.876Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fbd4d0b0-80e0-11eb-881e-e093a24dde6b", "eventType": "check_run"}
2021-03-09T14:08:55.982Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fbd37120-80e0-11eb-8b12-20a20f79d69d"}
2021-03-09T14:08:55.982Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fbd37120-80e0-11eb-8b12-20a20f79d69d", "eventType": "check_run"}
2021-03-09T14:08:57.475Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fcbdaa10-80e0-11eb-9f64-a7b55f99cc13"}
2021-03-09T14:08:57.475Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fcbdaa10-80e0-11eb-9f64-a7b55f99cc13", "eventType": "check_run"}
2021-03-09T14:08:57.515Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fcc28c10-80e0-11eb-80fd-628a3ca0e5d6"}
2021-03-09T14:08:57.515Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fcc28c10-80e0-11eb-80fd-628a3ca0e5d6", "eventType": "check_run"}
2021-03-09T14:08:59.115Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fda15350-80e0-11eb-8ad6-f6305783dc16"}
2021-03-09T14:08:59.115Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fda15350-80e0-11eb-8ad6-f6305783dc16", "eventType": "check_run"}
2021-03-09T14:08:59.192Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fda54af0-80e0-11eb-975d-bbb6e3d84892"}
2021-03-09T14:08:59.193Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fda54af0-80e0-11eb-975d-bbb6e3d84892", "eventType": "check_run"}
2021-03-09T14:09:01.059Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fee285e0-80e0-11eb-8ed8-7f735b7eb9bf"}
2021-03-09T14:09:01.059Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fee285e0-80e0-11eb-8ed8-7f735b7eb9bf", "eventType": "check_run"}
2021-03-09T14:09:01.151Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "fee06300-80e0-11eb-9469-2476be42c9c4"}
2021-03-09T14:09:01.151Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "fee06300-80e0-11eb-9469-2476be42c9c4", "eventType": "check_run"}
2021-03-09T14:09:03.727Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "ffe772c0-80e0-11eb-9387-e23f62b36c79"}
2021-03-09T14:09:03.727Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "ffe772c0-80e0-11eb-9387-e23f62b36c79", "eventType": "check_run"}
2021-03-09T14:09:03.742Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "ffe5ec20-80e0-11eb-9bef-b460aad578f7"}
2021-03-09T14:09:03.742Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "ffe5ec20-80e0-11eb-9bef-b460aad578f7", "eventType": "check_run"}
2021-03-09T14:09:06.634Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "01a687e0-80e1-11eb-8f15-9ff2a8e10ebb"}
2021-03-09T14:09:06.634Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "01a687e0-80e1-11eb-8f15-9ff2a8e10ebb", "eventType": "check_run"}
2021-03-09T14:09:06.777Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "01a21b10-80e1-11eb-975a-5781484f1bc1"}
2021-03-09T14:09:06.777Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "01a21b10-80e1-11eb-975a-5781484f1bc1", "eventType": "check_run"}
2021/03/09 14:09:07 http: superfluous response.WriteHeader call from github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerGitHubWebhook).Handle.func1 (horizontal_runner_autoscaler_webhook.go:81)
2021/03/09 14:09:08 http: superfluous response.WriteHeader call from github.com/summerwind/actions-runner-controller/controllers.(*HorizontalRunnerAutoscalerGitHubWebhook).Handle.func1 (horizontal_runner_autoscaler_webhook.go:81)
2021-03-09T14:09:09.212Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "0374df40-80e1-11eb-9ef3-c08c1f19db25"}
2021-03-09T14:09:09.212Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "0374df40-80e1-11eb-9ef3-c08c1f19db25", "eventType": "check_run"}
2021-03-09T14:09:09.374Z	INFO	controllers.Runner	Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational runner deployment that matches this webhook event	{"event": "check_run", "hookID": "284640692", "delivery": "03770220-80e1-11eb-9f55-e1b7c3f43573"}
2021-03-09T14:09:09.374Z	INFO	controllers.Runner	no horizontalrunnerautoscaler to scale for this github event	{"event": "check_run", "hookID": "284640692", "delivery": "03770220-80e1-11eb-9f55-e1b7c3f43573", "eventType": "check_run"}

config:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: comtravo-github-actions-deployment
  namespace: ${kubernetes_namespace.ci.metadata[0].name}
spec:
  template:
    spec:
      nodeSelector:
        ${var.eks_node_labels.spot.key}: ${var.eks_node_labels.spot.value}
      image: harbor.infra.comtravo.com/cache/comtravo/actions-runner:v2.276.1
      imagePullPolicy: Always
      repository: ${local.actions.git_repository}
      serviceAccountName: ${local.actions.service_account_name}
      securityContext:
        fsGroup: 1447
      resources:
        limits:
          cpu: "1"
          memory: "4Gi"
        requests:
          cpu: "1m"
          memory: "256Mi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: comtravo-github-actions-deployment-autoscaler
  namespace: ${kubernetes_namespace.ci.metadata[0].name}
spec:
  scaleTargetRef:
    name: comtravo-github-actions-deployment
  minReplicas: 4
  maxReplicas: 100
  metrics:
  - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
    repositoryNames:
      - ${local.actions.git_repository}
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "2m"

@Puneeth-n
Contributor

@mumoshu I think it is nothing to worry about. I checked the delivery id and it was a check_run with status completed. Could we perhaps improve the logging by adding the status?

@theobolo
Author

theobolo commented Mar 9, 2021

@mumoshu Thanks a lot for your quick updates! That's fantastic x)

I'm going to test the namespace flag this week and try to consolidate my multi-cluster setup into a single cluster. I'll let you know if I have any issues with it!

BTW you're absolutely right, my Actions workflows are dynamic matrices. Since we have a big mono-repo using lerna, we use a preliminary job on each pull request's tests to detect which packages were modified, with lerna list --since origin/main --json.

With the result of this first job, we "populate" the matrix of the next job, using the output of the first one.

For example, sometimes unit tests run on only 2 or 3 packages; sometimes there could be more than 20 or 30 tests for each pull request. Actually, each pull request has 5 different matrix Actions pipelines to run: Unit Tests / Lint Tests / Build Tests / QA Tests / Integration Tests.
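To give an idea of the pattern, here is a simplified sketch of such a workflow (job names, the jq filter and the test command are illustrative, not our exact setup):

jobs:
  detect-packages:
    runs-on: [self-hosted]
    outputs:
      packages: ${{ steps.lerna.outputs.packages }}
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      # Emit the list of changed packages as a JSON array for the matrix below.
      - id: lerna
        run: echo "::set-output name=packages::$(npx lerna list --since origin/main --json | jq -c '[.[].name]')"

  unit-tests:
    needs: detect-packages
    runs-on: [self-hosted]
    strategy:
      matrix:
        package: ${{ fromJson(needs.detect-packages.outputs.packages) }}
    steps:
      - uses: actions/checkout@v2
      - run: npx lerna run test --scope ${{ matrix.package }}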

Anyway, I'll first try to find a good setup using mixed pullRequest and check_run triggers.

But isn't there a way to just have something like:

"Whenever a new "job" pops up... scale X runners for X minutes"?

So I'd be able to scale 1 runner each time a new "job" is triggered, and not only when a whole workflow run is triggered.

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

@mumoshu I think it is nothing to worry about. I checked the delivery id and it was a check_run with status completed. Could we perhaps improve the logging by adding the status?

Great feedback! I'll do it soon.

@mumoshu
Collaborator

mumoshu commented Mar 9, 2021

For example, sometimes unit tests run on only 2 or 3 packages; sometimes there could be more than 20 or 30 tests for each pull request.

But isn't there a way to just have something like:

"Whenever a new "job" pops up... scale X runners for X minutes"?

So I'd be able to scale 1 runner each time a new "job" is triggered, and not only when a whole workflow run is triggered.

@theobolo I was under the impression that PercentageRunnersBusy with the lowest sync period allowed within your GitHub API quota would be the best solution for you. Apparently it isn't, from what you said at the beginning of this thread?

Each controller has a sync-period of 1m.

In other words, can we say that there would be no problem if it worked with a sync-period of 10 sec (or 1 sec)?

Btw, that's not why I'm here today. Since I've seen the new GitHub Webhook HRA feature, I absolutely need it so I can stop doing this kind of workaround and use the controller "at scale".

I wasn't sure what you meant by "workaround" here. To me, tweaking min/max replicas, scale-up triggers, and metrics to keep the controller bringing up a sufficient number of runners for your use-case seemed like the right path.

Probably your goal is to free yourself from tweaking those settings where possible, perhaps by utilizing some domain knowledge (that can't be obtained from the GitHub API)?

What if we added a CLI application that could be executed in your Actions workflow, like add-temporary-runners $NS/$HRA --replicas $SIZE_OF_THE_MATRIX --duration 10m, to add a number of runners equal to your matrix size for e.g. 10 minutes? That way, you could add an arbitrary number of runners on demand, without fearing the GitHub API limit.

@theobolo
Author

@mumoshu I'm going to answer your comment soon; by the way, I'm not able to scale anymore using the canary image since your 728829b merge.

It seems that the controller does not see any registered runners when the HRA check is done:

2021-03-10T10:13:19.164Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "actions-1/actions-runner-aos-kjw6k-9mgc8"}
2021-03-10T10:13:19.167Z DEBUG controllers.HorizontalRunnerAutoscaler Calculated desired replicas {"replicas_min": 1, "replicas_max": 20, "replicas_desired_before": 1, "replicas_desired": 1, "num_runners": 0, "num_runners_registered": 0, "num_runners_busy": 0, "namespace": "actions-1", "runner_deployment": "actions-runner-aos", "horizontal_runner_autoscaler": "actions-runner-aos-autoscaler", "enterprise": "", "organization": "go-aos", "repository": ""}
2021-03-10T10:13:19.179Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "actions-1/actions-runner-aos-autoscaler"}
2021-03-10T10:13:19.179Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "actions-1/actions-runner-aos-autoscaler"}
2021-03-10T10:13:19.358Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "actions-1/actions-runner-aos-kjw6k-9mgc8"}

You can see that the HRA doesn't report any registered runners.

My setup right now:

Master Helm chart (v0.8.0)
Canary images

And these HRA / RunnerDeployment manifests:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: actions-runner-aos
  namespace: actions-1
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - actions-runner
            topologyKey: "kubernetes.io/hostname"
      organization: go-aos
      image: summerwind/actions-runner-dind:latest
      imagePullPolicy: IfNotPresent
      dockerdWithinRunnerContainer: true
      volumes:
      - emptyDir:
          medium: Memory
        name: runner-work    
      volumeMounts:
      - name: runner-work 
        mountPath: "/runner/work"
      env:
      - name: TZ
        value: Europe/Paris
      resources:
        limits:
          cpu: "7.5"
          memory: "30100Mi"
        requests:
          cpu: "7.5"
          memory: "30100Mi"
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 10
      workDir: /runner/work
  
---

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: actions-runner-aos-autoscaler
  namespace: actions-1
spec:
  scaleTargetRef:
    name: actions-runner-aos
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpFactor: '2'
    scaleDownFactor: '0.5'

@mumoshu
Collaborator

mumoshu commented Mar 10, 2021

@theobolo Thanks for the feedback! That's interesting... 728829b should not affect PercentageRunnersBusy at all. Do you have any other insight into what you're seeing, like other suspicious logs or k8s events? Also, would you mind sharing the result of kubectl get horizontalrunnerautoscaler -o yaml? I'm particularly interested in the status of the problematic HRA, to see whether there's anything wrong with the API caching.

@theobolo
Author

theobolo commented Mar 10, 2021

@mumoshu Yes, I'm really not sure what's going on, but since yesterday's commits on the canary version something seems broken.

Here is the result of kubectl get horizontalrunnerautoscaler -o yaml:

apiVersion: v1
items:
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: HorizontalRunnerAutoscaler
  metadata:
    annotations:
      fluxcd.io/sync-checksum: 623ba77cdf072cd6e9ebba14e7a9552b2fb537e9
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"actions.summerwind.dev/v1alpha1","kind":"HorizontalRunnerAutoscaler","metadata":{"annotations":{"fluxcd.io/sync-checksum":"623ba77cdf072cd6e9ebba14e7a9552b2fb537e9"},"labels":{"fluxcd.io/sync-gc-mark":"sha2
56.wK5ze4MM-3zAFbDgaEE4Z38H3S897uqlu0Yl0MUPEwA"},"name":"actions-runner-aos-autoscaler","namespace":"actions-1"},"spec":{"maxReplicas":20,"metrics":[{"scaleDownFactor":"0.5","scaleDownThreshold":"0.3","scaleUpFactor":"2","scaleUp
Threshold":"0.75","type":"PercentageRunnersBusy"}],"minReplicas":1,"scaleTargetRef":{"name":"actions-runner-aos"}}}
    creationTimestamp: "2021-03-10T09:44:02Z"
    generation: 1
    labels:
      fluxcd.io/sync-gc-mark: sha256.wK5ze4MM-3zAFbDgaEE4Z38H3S897uqlu0Yl0MUPEwA
    managedFields:
    - apiVersion: actions.summerwind.dev/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:fluxcd.io/sync-checksum: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:labels:
            .: {}
            f:fluxcd.io/sync-gc-mark: {}
        f:spec:
          .: {}
          f:maxReplicas: {}
          f:metrics: {}
          f:minReplicas: {}
          f:scaleTargetRef:
            .: {}
            f:name: {}
      manager: kubectl
      operation: Update
      time: "2021-03-10T09:44:02Z"
    - apiVersion: actions.summerwind.dev/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:cacheEntries: {}
          f:desiredReplicas: {}
      manager: manager
      operation: Update
      time: "2021-03-10T10:53:19Z"
    name: actions-runner-aos-autoscaler
    namespace: actions-1
    resourceVersion: "6037408"
    selfLink: /apis/actions.summerwind.dev/v1alpha1/namespaces/actions-1/horizontalrunnerautoscalers/actions-runner-aos-autoscaler
    uid: 23aa818e-3b4e-48d4-874f-39ca296c1c26
  spec:
    maxReplicas: 20
    metrics:
    - scaleDownFactor: "0.5"
      scaleDownThreshold: "0.3"
      scaleUpFactor: "2"
      scaleUpThreshold: "0.75"
      type: PercentageRunnersBusy
    minReplicas: 1
    scaleTargetRef:
      name: actions-runner-aos
  status:
    cacheEntries:
    - expirationTime: "2021-03-10T10:54:09Z"
      key: desiredReplicas
      value: 1
    desiredReplicas: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Nothing looks suspect; I have just 1 runner running and it never scales. No errors from the controller.


And maybe this will help: here is my values.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: actions-controller-helm-values
  namespace: actions-1
data:
  values.yaml: |
    # Default values for actions-runner-controller.
    # This is a YAML-formatted file.
    # Declare variables to be passed into your templates.

    labels: {}

    replicaCount: 1

    syncPeriod: 1m

    # Only 1 authentication method can be deployed at a time
    # Uncomment the configuration you are applying and fill in the details
    authSecret:
      create: true
      name: "controller-manager"
      ### GitHub Apps Configuration
      github_app_id:
      github_app_installation_id:
      github_app_private_key:
      ### GitHub PAT Configuration
      #github_token: ""

    image:
      repository: summerwind/actions-runner-controller
      tag: "canary"
      dindSidecarRepositoryAndTag: "docker:dind"
      pullPolicy: Always

    kube_rbac_proxy:
      image:
        repository: quay.io/brancz/kube-rbac-proxy
        tag: v0.8.0

    imagePullSecrets: []
    nameOverride: ""
    fullnameOverride: ""

    serviceAccount:
      # Specifies whether a service account should be created
      create: true
      # Annotations to add to the service account
      annotations: {}
      # The name of the service account to use.
      # If not set and create is true, a name is generated using the fullname template
      name: ""

    podAnnotations: {}

    podSecurityContext:
      {}
      # fsGroup: 2000

    securityContext:
      {}
      # capabilities:
      #   drop:
      #   - ALL
      # readOnlyRootFilesystem: true
      # runAsNonRoot: true
      # runAsUser: 1000

    service:
      type: ClusterIP
      port: 443

    resources:
      # We usually recommend not to specify default resources and to leave this as a conscious
      # choice for the user. This also increases chances charts run on environments with little
      # resources, such as Minikube. If you do want to specify resources, uncomment the following
      # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
      # limits:
      #   cpu: 1
      #   memory: 512Mi
      # requests:
      #   cpu: 100m
      #   memory: 128Mi

    autoscaling:
      enabled: false
      # minReplicas: 1
      # maxReplicas: 100

    nodeSelector:
      role: controller-runner

    tolerations: []

    affinity: {}

    # Leverage a PriorityClass to ensure your pods survive resource shortages
    # ref: https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
    # PriorityClass: system-cluster-critical
    priorityClassName: ""

    env:
      {}
      # http_proxy: "proxy.com:8080"
      # https_proxy: "proxy.com:8080"
      # no_proxy: ""

    scope:
      singleNamespace: true
      watchNamespace: "actions-1"

    githubWebhookServer:
      enabled: false
      labels: {}
      replicaCount: 1
      syncPeriod: 10m
      secret:
        create: true
        name: "github-webhook-server"
        ### GitHub Webhook Configuration
        #github_webhook_secret_token: ""
      imagePullSecrets: []
      nameOverride: ""
      fullnameOverride: ""
      serviceAccount:
        # Specifies whether a service account should be created
        create: true
        # Annotations to add to the service account
        annotations: {}
        # The name of the service account to use.
        # If not set and create is true, a name is generated using the fullname template
        name: ""
      podAnnotations: {}
      podSecurityContext: {}
      # fsGroup: 2000
      securityContext: {}
      resources: {}
      nodeSelector: {}
      tolerations: []
      affinity: {}
      priorityClassName: ""
      service:
        type: LoadBalancer
        ports:
          - port: 80
            targetPort: http
            protocol: TCP
            name: http
            # nodePort: 30080
      ingress:
        enabled: false
        annotations:
          {}
          # kubernetes.io/ingress.class: nginx
          # kubernetes.io/tls-acme: "true"
        hosts:
          - host: chart-example.local
            paths: []
        tls: []
        #  - secretName: chart-example-tls
        #    hosts:
        #      - chart-example.local

I'm currently testing the "namespace scope" feature.

PS: I've noticed that I have an old Runner stuck in the default namespace that I'm not able to delete, since the controller from the default namespace has been moved to the 'actions-1' namespace with the scope feature enabled. Maybe that's causing some underlying problems, but it shouldn't, since I'm using a controller scoped to the actions-1 namespace. I checked on GitHub and this runner is not present in the org's Actions runners list, so probably nothing to worry about here. I'm just adding information for your debugging :)

I'm going to recreate my cluster to start from scratch with the latest charts / images / features. I'll update my comment in 30 min.

Edit: After recreating the cluster from scratch, with the master ref for the Helm chart, canary images, and the RunnerDeployment & HRA linked in this comment, I have the same problem. So it's not related to stale/old resources. 👍

@mumoshu
Collaborator

mumoshu commented Mar 11, 2021

@theobolo I reviewed your values.yaml and it looks mostly good. One thing that I'm suspicious about is this part:

    authSecret:
      create: true
      name: "controller-manager"
      ### GitHub Apps Configuration
      github_app_id:
      github_app_installation_id:
      github_app_private_key:
      ### GitHub PAT Configuration
      #github_token: ""

Seems like the chart is configured to create a secret with empty GitHub creds. If the chart worked as told, it should create an invalid secret and the controller should log authentication failures against the GitHub API. But it has not?

Would you mind verifying the content of the secret?

@theobolo
Author

theobolo commented Mar 11, 2021

@mumoshu Oh! I'm really sorry, I just stripped my tokens out of the values.yaml to paste it here securely :) It's correctly configured on my running controller deployment. Sorry, I should have said that earlier...

To be clear, that's not the problem. I'm still trying to figure out why it stopped working.

Edit: I changed the actions-runner-controller image to v0.17.0, and the HRA is now working properly, so there is something broken in the latest canary versions.


@mumoshu
Collaborator

mumoshu commented Mar 11, 2021

@theobolo I now believe it was due to a regression introduced in #355.
#386 should fix it.

@theobolo
Author

@mumoshu Wonderful, I'll give it a try as soon as it's merged ;) Thanks a lot!

mumoshu added a commit that referenced this issue Mar 11, 2021
PercentageRunnersBusy seems to have regressed since #355 because RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with an empty selector.

Ref #377 (comment)
mumoshu added a commit that referenced this issue Mar 11, 2021
PercentageRunnersBusy seems to have regressed since #355 because RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with an empty selector.

Ref #377 (comment)
@mumoshu
Collaborator

mumoshu commented Mar 11, 2021

@theobolo Thanks! FYI, I've just merged #386 and the canary tag will be updated soon.

@theobolo
Author

@mumoshu It's working again :D ! Thanks a lot, let's test now 🚀

mumoshu added a commit that referenced this issue Mar 14, 2021
… enabled

Relates to #379 (comment)
Relates to #377 (comment)

When you define HRA.Spec.ScaleUpTriggers[] but not HRA.Spec.Metrics[], the HRA controller will now enable ScaleUpTriggers alone, instead of automatically enabling TotalNumberOfQueuedAndInProgressWorkflowRuns. This allows you to use ScaleUpTriggers alone, so that the autoscaling is done without calling the GitHub API at all, which should greatly decrease the chance of GitHub API calls getting rate-limited.
mumoshu added a commit that referenced this issue Mar 14, 2021
… enabled (#391)

Relates to #379 (comment)
Relates to #377 (comment)

When you define HRA.Spec.ScaleUpTriggers[] but not HRA.Spec.Metrics[], the HRA controller will now enable ScaleUpTriggers alone, instead of automatically enabling TotalNumberOfQueuedAndInProgressWorkflowRuns. This allows you to use ScaleUpTriggers alone, so that the autoscaling is done without calling the GitHub API at all, which should greatly decrease the chance of GitHub API calls getting rate-limited.
@mumoshu
Collaborator

mumoshu commented Mar 14, 2021

omitting metrics results in the use of the TotalNumberOfQueuedAndInProgressWorkflowRuns metric

I've come to think this is confusing and that there's no actual benefit in making it the default behavior. I've changed the controller code, and since #391, omitting Metrics[] just results in ScaleUpTriggers[] being used alone. Doing so, the controller completely skips GitHub API calls for autoscaling, which alleviates the rate-limit issue!
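In other words, a purely webhook-driven HRA can now be as small as this (a sketch based on the manifests shared earlier in this thread; names are placeholders):

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: my-org-runner-autoscaler
spec:
  scaleTargetRef:
    name: my-org-runner
  minReplicas: 1
  maxReplicas: 5
  # No metrics[]: since #391 only the webhook-driven scaleUpTriggers are used,
  # so no GitHub API calls are made for autoscaling.
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"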

@callum-tait-pbx
Contributor

callum-tait-pbx commented Mar 14, 2021

@mumoshu Once this is wrapped up, could we get a formal release of the controller? I'd quite like to upgrade our setup at work and would rather version-pin it than run the canary image.

@mumoshu
Collaborator

mumoshu commented Mar 14, 2021

@callum-tait-pbx I'd definitely like to do so! Currently I have only two remaining potential issues to investigate, so we'll probably have the next release after that.

@theobolo
Author

@mumoshu Hello Yusuke, I've tested the new watch-namespace feature you implemented quite heavily.

I can say it's working very well ;) I've launched 1 cluster with 5 namespaced controllers, each one in charge of 20 runners, with a sync-period of 1m. And... it's amazing, that's it :D Nothing more to say except thank you a lot, again.

By the way, I'm going to test the GitHub Webhook HRA a little more on my side to find the best autoscaling mechanism for my case. I think you're right, PercentageRunnersBusy fits me well, with multiple watch-namespace controllers.


Now I'm able to scale up to 100 runners constantly without any GitHub API limitation, with 1 cluster! 🥇
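For anyone who wants to reproduce this, the relevant part of my per-namespace values.yaml is roughly the following (one Helm release per namespace, only the namespace name changes between releases; this is a trimmed sketch of the full values shown above):

syncPeriod: 1m
scope:
  singleNamespace: true
  watchNamespace: "actions-1"   # actions-2, actions-3, ... for the other releases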

@callum-tait-pbx
Contributor

callum-tait-pbx commented Mar 16, 2021

This looks awesome. @mumoshu the watch-namespace feature probably needs adding to the README.md with some detail once the canary image is ready for publishing; otherwise no one is going to know how to use it, which would be a shame as it is such a great feature!

@stale

stale bot commented May 15, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 15, 2021
@stale stale bot closed this as completed May 29, 2021