I want to disable "Pod has failed to start more than 5 times" #2721

Open
kahirokunn opened this issue Jul 6, 2023 · 13 comments
Assignees: nikola-jokic
Labels: enhancement (New feature or request), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

Comments

@kahirokunn
Contributor

kahirokunn commented Jul 6, 2023

What would you like added?

Currently, a runner that fails to start five times is marked as failed. How about making this threshold configurable in the settings?
Setting it to -1 would mean infinite retries.

Why is this needed?

The runner was started on a spot instance with minRunners set to 1.
Each time the spot instance terminates, the runner naturally terminates as well.
When the runner terminates, a failure is added to the EphemeralRunner.
Because of this, I cannot keep a runner resident on the spot instance.

Additional context

```go
func (r *EphemeralRunnerReconciler) markAsFailed(ctx context.Context, ephemeralRunner *v1alpha1.EphemeralRunner, log logr.Logger) error {
```

```go
case len(ephemeralRunner.Status.Failures) > 5:
```

#2722
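
For illustration, a minimal sketch of what a configurable limit could look like, assuming an environment variable were used to override the default. The variable name `EPHEMERAL_RUNNER_MAX_FAILURES` and both helper functions are hypothetical, not something ARC exposes today:

```go
package controller

import (
	"os"
	"strconv"
)

// defaultFailureLimit mirrors the current hard-coded value of 5.
const defaultFailureLimit = 5

// failureLimit reads the hypothetical EPHEMERAL_RUNNER_MAX_FAILURES
// environment variable, falling back to the default when the variable
// is unset or cannot be parsed.
func failureLimit() int {
	raw := os.Getenv("EPHEMERAL_RUNNER_MAX_FAILURES")
	if raw == "" {
		return defaultFailureLimit
	}
	limit, err := strconv.Atoi(raw)
	if err != nil {
		return defaultFailureLimit
	}
	return limit
}

// shouldMarkAsFailed would replace the hard-coded
// `len(ephemeralRunner.Status.Failures) > 5` check.
// A negative limit (e.g. -1) means the runner is never marked as failed.
func shouldMarkAsFailed(failureCount int) bool {
	limit := failureLimit()
	if limit < 0 {
		return false
	}
	return failureCount > limit
}
```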

kahirokunn added the enhancement (New feature or request) and needs triage (Requires review from the maintainers) labels on Jul 6, 2023
@github-actions
Contributor

github-actions bot commented Jul 6, 2023

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Contributor

Hey @kahirokunn,

Thanks for submitting this issue! This makes sense to me ☺️ The limit of 5 was chosen arbitrarily so that if the image is faulty, the controller does not re-create it indefinitely. But in my opinion, we should make this limit configurable.

nikola-jokic self-assigned this on Jul 7, 2023
@kahirokunn
Contributor Author

@nikola-jokic How about allowing the limit to be set to an arbitrary value, with -1 meaning infinite?

@nikola-jokic
Contributor

Hey @kahirokunn,

We decided not to implement infinite retries. The reason is that minRunners is meant to ensure you have some runners "warm", while spot instances are more or less the opposite. This mechanism exists so that a faulty configuration can be spotted, and so that we do not keep issuing requests to the Actions service when the error is not recoverable without manual intervention.

I will close this issue now but feel free to comment on it ☺️

@kahirokunn
Contributor Author

@nikola-jokic
Now, regarding the documentation:
running minRunners' worth of pods on spot instances is not recommended, as it causes the "has failed to start more than 5 times" error.
How about adding a document covering the "has failed to start more than 5 times" error?
I would be happy to write it myself if you like.

@apc-jnytnai0613

@nikola-jokic
Instead of an infinite number of retries, why not allow the user to specify the number of retries? I can implement this.

@nikola-jokic
Contributor

That is a good suggestion @kahirokunn; we should probably include it in the docs or FAQ and in the values.yaml file.

@apc-jnytnai0613 if we implemented a user-specified number of retries, it would effectively be the same as having infinite retries. If you specify a sufficiently large number, it acts as an almost infinite retry until you figure out that something is not working as expected.

Having a limited number of retries also signals that something is wrong and should probably be fixed. The 5 is not a magic number here; it was chosen to be large enough that if something still does not work after 5 attempts, it is likely an error. This number is not set in stone and is subject to change, but so far the only issue with it is running on spot instances, which ARC is not designed for.

@romancin

Hi everyone! We are experiencing the same problem. We have set minRunners to 1 so we don't have to wait for node provisioning when a new workflow is triggered, which adds about 4 minutes to our job time. When Google stops our instance and starts a new one, the EphemeralRunner gets stuck in the failed state. We must delete it manually to get our workflows working again...

Why not use spot instances for this? I think using them for CI workflows is a perfect use case... Spot instances cost about 60% of a standard one, which is ideal for workflows that are not always running. IMHO, I don't see a problem with making the failure limit configurable. Setting it to a higher (non-infinite) value would solve this problem.

@nikola-jokic
Contributor

Hey @romancin,

I completely agree that using spot instances is a good choice. But the idea of spot instances is somewhat contrary to specifying minRunners. Spot instances are chosen to save money when there is no activity on your scale set, while minRunners is used to speed up workflow startup when the scale set is inactive (because you have that number of runners ready to receive a job and don't have to wait for a runner to come up).
Having said that, it is not obvious what the correct value should be. It would depend heavily on when workflows are scheduled. In the worst case, it would effectively be the same as having minRunners: 0.

One of the benefits of scale sets is also that they can avoid rate limiting. Having an infinite or large number of restarts in these scenarios can unnecessarily increase the load on the server without any performance benefit.

@chlunde

chlunde commented Dec 18, 2023

@nikola-jokic This not only affects spot instances but any kind of interruption or eviction, such as node upgrades or consolidation.

Another issue here is that there's no defined time period for the five failures.

For example, over a weekend with no jobs in a git repository, if an ephemeral runner is unlucky enough to be rescheduled five times, GitHub Actions practically stops working for that EphemeralRunnerSet and requires manual intervention.

In general, most other things in Kubernetes are self-healing, so removing this limit or adding time- or rate-based limiting would be more in line with behavior elsewhere in Kubernetes. For example, we have many Deployment/ReplicaSet instances affected by the same node scale-downs, and we don't have to do any manual intervention there.

Idea for improvement:
Keep timestamps (instead of "true") when tracking failures, and garbage-collect failures older than 15 minutes when appending to the list. A runner would then only get stuck after 5 failures within 15 minutes.
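
For illustration, a minimal sketch of the timestamp-plus-garbage-collection idea, assuming Status.Failures stored a timestamp per pod instead of the current bare `true`; the `failureWindow` constant and `pruneOldFailures` helper are hypothetical:

```go
package controller

import "time"

// failureWindow is the hypothetical sliding window from the suggestion above.
const failureWindow = 15 * time.Minute

// pruneOldFailures drops failure timestamps older than the window and
// returns the pruned map together with the number of recent failures.
// Only the recent count would then be compared against the limit when
// deciding whether to mark the EphemeralRunner as failed.
func pruneOldFailures(failures map[string]time.Time, now time.Time) (map[string]time.Time, int) {
	recent := make(map[string]time.Time, len(failures))
	for id, at := range failures {
		if now.Sub(at) <= failureWindow {
			recent[id] = at
		}
	}
	return recent, len(recent)
}
```

With this approach, occasional evictions spread out over days would age out of the window, and only a genuinely crash-looping runner would accumulate five failures within 15 minutes.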

@kahirokunn
Contributor Author

@chlunde I consider your idea good.

@kahirokunn
Contributor Author

/reopen

@gavriellan-ma

> @chlunde I consider your idea good.

+1
