I want to disable "Pod has failed to start more than 5 times" #2721

Open
kahirokunn opened this issue Jul 6, 2023 · 13 comments
Assignees: nikola-jokic
Labels: enhancement (New feature or request), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

Comments

@kahirokunn
Contributor

kahirokunn commented Jul 6, 2023

What would you like added?

Currently, a runner that fails to start five times is marked as failed. How about making this threshold configurable in the settings?
Setting it to -1 would mean infinite retries.

Why is this needed?

The runner was started on a spot instance with minRunners set to 1.
Each time the spot instance terminates, the runner naturally terminates as well.
When the runner terminates, a failure is added to the EphemeralRunner.
Because of this, I cannot keep a runner resident on the spot instance.

Additional context

```go
func (r *EphemeralRunnerReconciler) markAsFailed(ctx context.Context, ephemeralRunner *v1alpha1.EphemeralRunner, log logr.Logger) error {
```

```go
case len(ephemeralRunner.Status.Failures) > 5:
```

#2722
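
For illustration, a minimal sketch of what a configurable limit could look like, assuming an environment variable were used to override the default. The variable name `EPHEMERAL_RUNNER_MAX_FAILURES` and both helper functions are hypothetical, not something ARC exposes today:

```go
package controller

import (
	"os"
	"strconv"
)

// defaultFailureLimit mirrors the current hard-coded value of 5.
const defaultFailureLimit = 5

// failureLimit reads the hypothetical EPHEMERAL_RUNNER_MAX_FAILURES
// environment variable, falling back to the default when the variable
// is unset or cannot be parsed.
func failureLimit() int {
	raw := os.Getenv("EPHEMERAL_RUNNER_MAX_FAILURES")
	if raw == "" {
		return defaultFailureLimit
	}
	limit, err := strconv.Atoi(raw)
	if err != nil {
		return defaultFailureLimit
	}
	return limit
}

// shouldMarkAsFailed would replace the hard-coded
// `len(ephemeralRunner.Status.Failures) > 5` check.
// A negative limit (e.g. -1) means the runner is never marked as failed.
func shouldMarkAsFailed(failureCount int) bool {
	limit := failureLimit()
	if limit < 0 {
		return false
	}
	return failureCount > limit
}
```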

kahirokunn added the enhancement (New feature or request) and needs triage (Requires review from the maintainers) labels on Jul 6, 2023
@github-actions
Contributor

github-actions bot commented Jul 6, 2023

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Contributor

Hey @kahirokunn,

Thanks for submitting this issue! This makes sense to me ☺️ The limit of 5 was chosen arbitrarily so that if the image is faulty, the controller does not re-create it indefinitely. But in my opinion, we should make this limit configurable.

nikola-jokic self-assigned this on Jul 7, 2023
@kahirokunn
Contributor Author

@nikola-jokic How about allowing the limit to be set to an arbitrary value, with -1 meaning infinite?

@nikola-jokic
Contributor

Hey @kahirokunn,

We decided not to implement infinite retries. The reason is that minRunners is meant to ensure you have some runners "warm", while spot instances are more or less the opposite. This mechanism exists so that a faulty configuration can be spotted, and so that we do not keep issuing requests to the Actions service when the error is not recoverable without manual intervention.

I will close this issue now but feel free to comment on it ☺️

@kahirokunn
Contributor Author

@nikola-jokic
Now, regarding the documentation:
running minRunners' worth of pods on spot instances is not recommended, as it causes the "has failed to start more than 5 times" error.
How about adding a document covering the "has failed to start more than 5 times" error?
I would be happy to write it myself if you like.

@apc-jnytnai0613

@nikola-jokic
Instead of an infinite number of retries, why not allow the user to specify the number of retries? I can implement this.

@nikola-jokic
Contributor

That is a good suggestion @kahirokunn; we should probably include it in the docs or FAQ and in the values.yaml file.

@apc-jnytnai0613 if we implemented a user-specified number of retries, it would effectively be the same as having infinite retries. If you specify a sufficiently large number, it acts as an almost infinite retry until you figure out that something is not working as expected.

Having a limited number of retries also signals that something is wrong and should probably be fixed. The 5 is not a magic number here; it was chosen to be large enough that if something still does not work after 5 attempts, it is likely an error. This number is not set in stone and is subject to change, but so far the only issue with it is running on spot instances, which ARC is not designed for.

@romancin

Hi everyone! We are experiencing the same problem. We have set minRunners to 1 so we don't have to wait for node provisioning when a new workflow is triggered, which adds about 4 minutes to our job time. When Google stops our instance and starts a new one, the EphemeralRunner gets stuck in the failed state. We must delete it manually to get our workflows working again...

Why not use spot instances for this? I think using them for CI workflows is a perfect use case... Spot instances cost about 60% of a standard one, which is ideal for workflows that are not always running. IMHO, I don't see a problem with making the failure limit configurable. Setting it to a higher (non-infinite) value would solve this problem.

@nikola-jokic
Contributor

Hey @romancin,

I completely agree that using spot instances is a good choice. But the idea of spot instances is somewhat contrary to specifying minRunners. Spot instances are chosen to save money when there is no activity on your scale set, while minRunners is used to speed up workflow startup when the scale set is inactive (because you have that number of runners ready to receive a job and don't have to wait for a runner to come up).
Having said that, it is not obvious what the correct value should be. It would depend heavily on when workflows are scheduled. In the worst case, it would effectively be the same as having minRunners: 0.

One of the benefits of scale sets is also that they can avoid rate limiting. Having an infinite or large number of restarts in these scenarios can unnecessarily increase the load on the server without any performance benefit.

@chlunde

chlunde commented Dec 18, 2023

@nikola-jokic This not only affects spot instances but any kind of interruption or eviction, such as node upgrades or consolidation.

Another issue here is that there's no defined time period for the five failures.

For example, over a weekend with no jobs in a git repository, if an ephemeral runner is unlucky enough to be rescheduled five times, GitHub Actions practically stops working for that EphemeralRunnerSet and requires manual intervention.

In general, most other things in Kubernetes are self-healing, so removing this limit or adding time- or rate-based limiting would be more in line with behavior elsewhere in Kubernetes. For example, we have many Deployment/ReplicaSet instances affected by the same node scale-downs, and we don't have to do any manual intervention there.

Idea for improvement:
Keep timestamps (instead of "true") when tracking failures, and garbage-collect failures older than 15 minutes when appending to the list. A runner would then only get stuck after 5 failures within 15 minutes.
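
For illustration, a minimal sketch of the timestamp-plus-garbage-collection idea, assuming Status.Failures stored a timestamp per pod instead of the current bare `true`; the `failureWindow` constant and `pruneOldFailures` helper are hypothetical:

```go
package controller

import "time"

// failureWindow is the hypothetical sliding window from the suggestion above.
const failureWindow = 15 * time.Minute

// pruneOldFailures drops failure timestamps older than the window and
// returns the pruned map together with the number of recent failures.
// Only the recent count would then be compared against the limit when
// deciding whether to mark the EphemeralRunner as failed.
func pruneOldFailures(failures map[string]time.Time, now time.Time) (map[string]time.Time, int) {
	recent := make(map[string]time.Time, len(failures))
	for id, at := range failures {
		if now.Sub(at) <= failureWindow {
			recent[id] = at
		}
	}
	return recent, len(recent)
}
```

With this approach, occasional evictions spread out over days would age out of the window, and only a genuinely crash-looping runner would accumulate five failures within 15 minutes.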

@kahirokunn
Contributor Author

@chlunde I consider your idea good.

@kahirokunn
Contributor Author

/reopen

@gavriellan-ma

> @chlunde I consider your idea good.

+1
