Jobs expire while on queue #2466
Comments
@Nuru Hey! As you might have already figured out, your scale trigger duration is too short, and that's the issue. Just make it longer, like 60m or 2h or whatever you like. ARC is able to scale down when the workflow job completes, so you don't waste compute resources.

This is unavoidable because all we have are the webhook events emitted by GitHub Actions, and hence ARC has no consistent view of runners and workflow jobs. The GitHub Actions team is currently developing a new autoscaling solution that makes the scheduling decision on the GitHub Actions side, which will enable you to omit the scale trigger duration entirely.
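(To make the failure mode concrete, here is a minimal, runnable toy model of the mechanism being described. It is not ARC's actual implementation; the `reservation` type and `desiredReplicas` function are invented for this example. It models each queued-job webhook as a reservation whose TTL is the trigger duration, so a backlog larger than `maxReplicas` silently expires whenever the duration is shorter than the time jobs wait for a free runner.)

```go
package main

import (
	"fmt"
	"time"
)

type reservation struct {
	expiresAt time.Time
}

// desiredReplicas counts reservations that have not yet expired,
// capped at maxReplicas.
func desiredReplicas(rs []reservation, maxReplicas int, now time.Time) int {
	live := 0
	for _, r := range rs {
		if now.Before(r.expiresAt) {
			live++
		}
	}
	if live > maxReplicas {
		return maxReplicas
	}
	return live
}

func main() {
	now := time.Now()
	duration := 5 * time.Minute // the scale trigger duration

	// 50 jobs are queued at once, but maxReplicas is 10 and each job runs
	// for 10 minutes: jobs 11..50 must wait in the backlog longer than the
	// 5-minute duration allows.
	var rs []reservation
	for i := 0; i < 50; i++ {
		rs = append(rs, reservation{expiresAt: now.Add(duration)})
	}

	fmt.Println(desiredReplicas(rs, 10, now))                    // 10
	fmt.Println(desiredReplicas(rs, 10, now.Add(6*time.Minute))) // 0: the whole backlog expired
}
```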
@Nuru Can I close this as answered? Once again, […]
@mumoshu wrote:

I understand webhook delivery isn't guaranteed, but I do not think that is relevant to my suggestion. The situation I am complaining about is where the webhook delivers a "queued" event and the webhook server adds a capacity reservation, but the HRA does not add capacity because it is already at `maxReplicas`.

My suggestion is along these lines: when the HRA gets a new capacity reservation and it cannot scale up due to already being at max capacity, it extends the capacity reservation indefinitely (or maybe, say, a year, or 90 days, or some separately configurable duration). When a job completes and the ephemeral runner is deleted, ideally the HRA would launch a new pod and reset the expiration on the oldest capacity reservation that has the longer expiration to the then-current time plus the trigger `duration`.

I acknowledge that I am not 100% sure of the details, but it seems to me there is enough knowledge about the runners locally to know whether they are being fed jobs or not, and to drain the queue of backlogged reservation requests based on that. The end result would be that if runners remain idle, the HRA will scale the runner pool size down to `minReplicas`.
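(A minimal sketch of the behavior @Nuru proposes, using invented types and names (`pool`, `reserve`, `complete`) rather than ARC code: reservations that arrive while the pool is at `maxReplicas` are parked with a far-future expiry, and each completed job promotes the oldest backlogged reservation with a TTL that starts only at that moment.)

```go
package main

import (
	"fmt"
	"time"
)

type pool struct {
	maxReplicas int
	duration    time.Duration // the HRA scale trigger duration
	active      []time.Time   // expiry times of reservations backed by a runner
	backlog     []time.Time   // expiry times of reservations waiting for capacity
}

// reserve handles a "queued" webhook event.
func (p *pool) reserve(now time.Time) {
	if len(p.active) < p.maxReplicas {
		p.active = append(p.active, now.Add(p.duration))
		return
	}
	// Already at maxReplicas: don't start the TTL yet; park the
	// reservation with a far-future expiry instead.
	p.backlog = append(p.backlog, now.Add(90*24*time.Hour))
}

// complete handles a "completed" webhook event: free one runner, then
// promote the oldest backlogged reservation with a fresh TTL starting now.
func (p *pool) complete(now time.Time) {
	if len(p.active) > 0 {
		p.active = p.active[1:]
	}
	if len(p.backlog) > 0 {
		p.backlog = p.backlog[1:]
		p.active = append(p.active, now.Add(p.duration))
	}
}

func main() {
	p := &pool{maxReplicas: 2, duration: 5 * time.Minute}
	now := time.Now()
	for i := 0; i < 5; i++ {
		p.reserve(now) // ends with 2 active, 3 backlogged
	}
	// A job completes 10 minutes later, past the 5m duration, yet the
	// backlog survives and one waiting reservation gets a fresh TTL.
	p.complete(now.Add(10 * time.Minute))
	fmt.Println(len(p.active), len(p.backlog)) // 2 2
}
```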
That's almost how it's supposed to work since v0.27.1 (#2258): ARC resets the duration on every "completed" or "cancelled" workflow job event. This works because ARC is very likely to receive one of those webhook events before the duration expires.
There can be a race between GitHub Actions scheduling a job to an idle runner and ARC terminating that idle runner, if that's what you mean. I need to repeat my first two replies for that: ARC doesn't have a consistent view of queued jobs. Reservations are there to do our best to provision the desired number of runners at approximately the desired times.
If you have a more concrete proposal for how we can improve the situation, please feel free to submit a pull request. I'll eagerly review it and try to learn from it! But also be aware that the GitHub Actions team is currently developing a new autoscaling solution that does scheduling on their end, so that we can provide a nicer experience without hacks based on wrong assumptions.
Well, and this isn't actually a bug, to be honest... I've explained this behavior multiple times in the issues, and #2258 was the first attempt to make it better. If there's something that can be improved beyond that, again, I'd appreciate a more concrete proposal, or even better a pull request formatted similarly to #2258, so that I can see what can really be improved. Otherwise, you can continue this in Discussions!
@mumoshu wrote:
I see. #2258 says it implements what I listed as my expected behavior, but that does not seem to be what I experienced. I hesitate to offer a PR because I do not fully understand the code or have a good way to test it, but I can offer these observations. If you can clarify some things for me, then maybe I can offer solutions.

This code (lines 224 to 226 in 5675409) implies that running replicas are expected to have corresponding capacity reservations that are subject to expiration. So what happens if those reservations expire before the job finishes? This line (line 147 in 5675409) means that any expired capacity reservations will not be included in the list, which calls into question the implied expectation (see below) that the first reservations in the list correspond to the running replicas.

This code also does not seem right to me (lines 179 to 186 in 5675409): I don't see anything ever setting a reservation's […].

So like I said, I don't fully understand the code. If you can help me understand this, then I can maybe offer better suggestions.
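(To illustrate the concern in this comment, here is a toy model, with invented names and not the actual controller code, of what can go wrong if expired reservations are silently filtered out while the controller also assumes the first entries in the list correspond to running replicas:)

```go
package main

import (
	"fmt"
	"time"
)

type reservation struct {
	name      string
	expiresAt time.Time
}

// unexpired drops reservations whose TTL has passed -- including, crucially,
// ones that are still backed by a running job.
func unexpired(rs []reservation, now time.Time) []reservation {
	var out []reservation
	for _, r := range rs {
		if now.Before(r.expiresAt) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	now := time.Now()
	rs := []reservation{
		{"running-job", now.Add(5 * time.Minute)},     // backed by a runner
		{"backlogged-job", now.Add(60 * time.Minute)}, // waiting for capacity
	}

	// 10 minutes later the job is still running, but its reservation is
	// gone, and "backlogged-job" is now first in line as if it were the
	// one running -- the accounting no longer matches reality.
	later := now.Add(10 * time.Minute)
	for _, r := range unexpired(rs, later) {
		fmt.Println(r.name) // prints only "backlogged-job"
	}
}
```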
Hey!
Maybe you linked the wrong lines...?
Yeah, I think your observation is correct here. I vaguely remember that we introduced "batching" to add capacity reservations buffered over a short period all at once, so that the write rate to k8s-apiserver and etcd is reduced. Perhaps that matching logic has not worked well since batching was introduced, if you are really seeing issues due to it? 🤔

To be clear, so far I have seen no real problems myself coming from the code you referred to. So it can either be that other code has handled scale down correctly since the introduction of batching, or you have found a real problem. In the latter case, I'm definitely eager to review your pull request!
I reread the relevant code, and it turns out our batching doesn't merge multiple scale triggers into one. It just batches scale triggers into one K8s API update call. And our github-webhook-server creates scale triggers of amount 1 or -1, not other numbers. All in all, for the lines you mentioned, I can say the amount is guaranteed to be 1! For your reference, this is the only place we instantiate scale operations of amount 1 or -1: actions-runner-controller/controllers/actions.summerwind.net/horizontal_runner_autoscaler_webhook.go, lines 210 to 220 in 1a491cb.
This is how we batch updates to capacity reservations: lines 162 to 166 in 1a491cb.
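(For readers following along, a rough sketch of that batching shape, using invented names (`trigger`, `batcher`) rather than the actual implementation: many +1/-1 triggers are buffered and then applied in one API update, without being merged into larger amounts.)

```go
package main

import "fmt"

type trigger struct {
	amount int // always +1 (job queued) or -1 (job completed/cancelled)
}

type batcher struct {
	pending []trigger
}

func (b *batcher) enqueue(t trigger) {
	b.pending = append(b.pending, t)
}

// flush applies every buffered trigger in a single update call, reducing
// write pressure on the API server. Note that it does not merge triggers
// into larger amounts: each one still adds or removes one reservation.
func (b *batcher) flush(apply func([]trigger)) {
	if len(b.pending) == 0 {
		return
	}
	apply(b.pending)
	b.pending = nil
}

func main() {
	var b batcher
	b.enqueue(trigger{amount: +1})
	b.enqueue(trigger{amount: +1})
	b.enqueue(trigger{amount: -1})

	b.flush(func(ts []trigger) {
		net := 0
		for _, t := range ts {
			net += t.amount
		}
		fmt.Printf("one API update applying %d triggers (net %+d)\n", len(ts), net)
	})
}
```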
@mumoshu Thank you for the clarification. The first lines of code I referenced, particularly […], imply that all the […]. The fact that the webhook only scales up by a hardcoded 1 […].

There are still a bunch of issues around scaling with other (non-webhook) triggers, and around accounting for the expiration of reservations that are paired with running replicas when updating backlogged reservations. And documentation. I addressed what I felt comfortable with in #2502.
@mumoshu Even with my PR #2502, it appears that cancelled jobs are not being removed from HRA capacity reservations. I am seeing matrix jobs being cancelled because of fail-fast, but the HRA capacity reservations are not being removed. I'm having a hard time tracing through the code to see where that is supposed to happen. Can you please point it out to me?
Hey @Nuru! Thanks for your efforts. To better understand it, could you briefly clarify what "real" issue #2502 is supposed to fix? #2258 should have fixed it by readjusting the effective times of existing capacity reservations on scale down. Is it that #2502 is supposed to fix it by further readjusting the effective times of existing capacity reservations on scale up too, but according to your test that didn't actually work, right?
@Nuru Then our issue can potentially be: ARC doesn't properly handle scale down when workflow jobs are cancelled. I believe ARC already handles scale down correctly when workflow jobs complete.
Re #2502, I'd love it if you could add some previously-failing tests that pass after your fix, so that I can better understand the exact error cases you think you are affected by! Thanks in advance for your support.
@mumoshu I tried to describe the errors I fixed in the description of #2502. However, I do not have the resources to devote to creating tests that demonstrate the failures and fixes. Note that the […].

Among the fixes: if reservations "in use" (associated with an active runner) expired, the replacement reservation did not get its expiration time extended. Similarly, when the runner pool was at `maxReplicas`, […]. It may also be that I was affected by the fact that expiration times were only being extended on scale down, which I also fixed. When a large set of new jobs is created, it may be a long time before a scale down event occurs, but expired reservations were being removed every batch cycle (every 3 seconds). Now the expiration time extensions match the expiration time pruning cycle.

Also, the replacement times were getting extended by the duration associated with the scaleDown trigger, which is only correct if there is only ever a single duration for both scale up and scale down, and I do not think that is a good assumption for the code to make. Plus, if the batch contained multiple scaleDown events, the expirations were extended once for each event; now they are only extended once per batch.

Also, I could not find the code that handles the webhook receiving a "cancelled" workflow job event.
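(A sketch of the per-cycle maintenance @Nuru describes, with invented names and structure rather than the actual #2502 code: expiration extensions for backlogged reservations happen once per batch cycle, on the same cadence as pruning, so a waiting reservation is never pruned merely because no scale-down event arrived in time.)

```go
package main

import (
	"fmt"
	"time"
)

type reservation struct {
	expiresAt time.Time
}

// maintain runs once per batch cycle: it first extends every reservation
// that is still waiting behind maxReplicas, then prunes expired ones, so
// extension and pruning always happen on the same cadence.
func maintain(rs []reservation, maxReplicas int, ttl time.Duration, now time.Time) []reservation {
	for i := range rs {
		if i >= maxReplicas { // backlogged: not yet backed by a runner
			rs[i].expiresAt = now.Add(ttl)
		}
	}
	var kept []reservation
	for _, r := range rs {
		if now.Before(r.expiresAt) {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	now := time.Now()
	ttl := 5 * time.Minute
	rs := []reservation{
		{now.Add(ttl)}, // backed by a runner
		{now.Add(ttl)}, // backlogged behind maxReplicas=1
	}
	// 10 minutes pass with no scale-down event; the backlogged reservation
	// would otherwise have been pruned, but the per-cycle extension keeps
	// it alive.
	rs = maintain(rs, 1, ttl, now.Add(10*time.Minute))
	fmt.Println(len(rs)) // prints 1
}
```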
Checks
Controller Version
0.27.1
Helm Chart Version
0.23.0
CertManager Version
1.10.2
Deployment Method
Helm
cert-manager installation
Yes, installed cert manager from official sources.
Resource Definitions
To Reproduce
Set the webhook trigger `duration` to a reasonable number like "5m" to cover the time for the HRA to scale up a runner. Set the HRA `minReplicas` to 0. Start 5 times the HRA `maxReplicas` number of jobs, with each job taking longer than the webhook trigger `duration` to complete. In other words, create a huge backlog of jobs.

Watch the autoscaler scale the runner pool to zero runners while there remains a huge backlog of jobs in the queue.
Describe the bug
The capacity reservations expire before the jobs are even queued because the HRA cannot scale up past its `maxReplicas`. The webhook-based autoscaler expires most of the jobs in the queue before they have had a chance to be started, and scales the runner pool down even though there are still a ton of jobs waiting.

Describe the expected behavior
The timer on the HRA `duration` should start either when the job is assigned to a runner or when the HRA tries to scale up. If the HRA is already at `maxReplicas`, the reservation should live indefinitely, until it has a chance to be assigned to an idle runner, at which point the duration timer can start.

Alternately, there could be a separate timeout for how long a job can live in the backlog before the autoscaler forgets about it. It doesn't make sense to me that I have to include in the `duration` period the amount of time it might take the maxed-out cluster to work through a huge backlog, which may be several hours, because, as I understand it, this is also the potential amount of time an idle runner will be left running before being scaled down if, for some reason, the job cancelled/completed event is missed.

Whole Controller Logs
Whole Runner Pod Logs
Additional Context
You can see entries like this in the webhook server logs:
I don't believe this is a coding error; I believe it is a flaw in the design of the capacity reservation system. Capacity that has been reserved but cannot be filled because the runner pool is already at max capacity should not expire while waiting for real capacity to become available.