I want to disable "Pod has failed to start more than 5 times" #2721
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
Hey @kahirokunn, thanks for submitting this issue! This makes sense to me.
@nikola-jokic How about allowing the limit to be set to an arbitrary value, with -1 meaning infinite?
Hey @kahirokunn, we decided not to implement infinite retries. The reason is that having min runners should ensure that some of them stay "warm", while spot instances are kind of the opposite. This mechanism exists so that a faulty configuration can be spotted, and to avoid issuing many requests to the Actions service in cases where the error is not recoverable without manual intervention. I will close this issue now, but feel free to comment on it.
@nikola-jokic |
@nikola-jokic |
That is a good suggestion @kahirokunn, we should probably include it in the docs or FAQs and in the values.yaml file.

@apc-jnytnai0613 If we implemented a user-specified number of retries, it would effectively be the same as having infinite retries: if you specify a sufficiently large number, it acts as an almost-infinite retry until you figure out that something is not working as expected. Having a limited number of retries also signals that something is wrong and should probably be fixed. The 5 is not a magic number; it is simply large enough that if something still does not work after 5 attempts, it is likely a genuine error. This number is not set in stone and is subject to change, but so far, the only issue with it is running on spot instances, which ARC is not designed for.
Hi everyone! We are experiencing the same problem. We have configured minRunners to 1 so we don't have to wait for node provisioning when a new workflow is triggered, which adds about 4 more minutes to our job time. When Google stops our instance and starts a new one, the EphemeralRunner gets stuck in a failed state, and we must delete it manually to get our workflows working again. Why not use spot instances for this? I think CI workflows are a perfect use case for them: spot instances are 60% of the cost of a standard one, which is ideal for workloads that are not always running. IMHO, I don't see a problem with making the failure limit configurable. Setting it to a higher (non-infinite) value would solve this problem.
Hey @romancin, I completely agree that using spot instances is a good choice. But the idea of spot instances is somewhat contrary to specifying min runners. One of the benefits of scale sets is also that they can avoid rate limiting, and having an infinite or large number of restarts in these scenarios can unnecessarily increase the load on the server without any performance benefit.
@nikola-jokic This affects not only spot instances but any kind of interruption or eviction, such as node upgrades or consolidation.

Another issue is that there is no defined time period for the five failures. For example, over a weekend with no jobs in a git repository, if an ephemeral runner is unlucky enough to be rescheduled five times, GitHub Actions practically stops working for that EphemeralRunnerSet and requires manual intervention.

In general, most other things in Kubernetes are self-healing, so removing this limit or adding time/rate limiting would be more in line with the behavior elsewhere in Kubernetes. For example, we have many Deployment/ReplicaSet instances affected by the same node scale-downs, and we don't need any manual intervention there.

Idea for improvement:
@chlunde I think your idea is a good one.
/reopen |
+1 |
What would you like added?
Currently, an EphemeralRunner is marked as failed once its pod has failed five times. How about making this threshold configurable via a setting?
Setting it to -1 would mean infinite retries.
Why is this needed?
The runner was started on a spot instance with minRunners set to 1.
Each time the spot instance terminates, the runner naturally terminates as well.
When the runner terminates, a failure is recorded on the EphemeralRunner.
Because of this, I cannot keep a runner resident on a spot instance.
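The requested behavior, a configurable limit where -1 means "retry forever", could look something like this minimal sketch. The `shouldFail` function and its parameters are hypothetical names for illustration; this is not actions-runner-controller's actual implementation.

```go
package main

import "fmt"

// shouldFail reports whether an EphemeralRunner should be marked as
// failed, given its current failure count and a configurable limit.
// A negative limit (e.g. -1) requests infinite retries. Hypothetical
// sketch of the requested feature, not ARC's real code.
func shouldFail(failureCount, limit int) bool {
	if limit < 0 {
		return false // -1: never give up, retry forever
	}
	return failureCount > limit
}

func main() {
	fmt.Println(shouldFail(6, 5))  // over the current default of 5
	fmt.Println(shouldFail(6, -1)) // infinite retries requested
}
```

With the default limit of 5 the existing behavior is preserved, while spot-instance users could raise the limit or set -1.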
Additional context
actions-runner-controller/controllers/actions.github.com/ephemeralrunner_controller.go
Line 426 in 2bf83d0
actions-runner-controller/controllers/actions.github.com/ephemeralrunner_controller.go
Line 192 in 2bf83d0
#2722