Jobs expire while on queue #2466

Open

Nuru opened this issue Apr 2, 2023 · 16 comments
Labels
bug (Something isn't working), needs triage (Requires review from the maintainers)

Comments

@Nuru
Contributor

Nuru commented Apr 2, 2023

Controller Version

0.27.1

Helm Chart Version

0.23.0

CertManager Version

1.10.2

Deployment Method

Helm

cert-manager installation

Yes, installed cert manager from official sources.

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support.)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  annotations:
    meta.helm.sh/release-name: infra-runner-arm64
    meta.helm.sh/release-namespace: actions-runner-system

  labels:
    app.kubernetes.io/managed-by: Helm
  name: infra-runner-arm64
  namespace: actions-runner-system

spec:
  maxReplicas: 64
  minReplicas: 0
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    name: infra-runner-arm64
  scaleUpTriggers:
    - amount: 1
      duration: 4m
      githubEvent:
        workflowJob: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: infra-runner-arm64
  namespace: actions-runner-system

spec:
  template:
    spec:
      dockerdWithinRunnerContainer: true
      env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: '90'
      group: armEnabled
      image: summerwind/actions-runner-dind
      imagePullPolicy: IfNotPresent
      labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - arm64
        - ARM64
        - aarch64
        - core-auto
      nodeSelector:
        kubernetes.io/arch: arm64
        kubernetes.io/os: linux
      organization: <redacted>
      resources:
        limits:
          cpu: 2000m
          memory: 2048Mi
        requests:
          cpu: 1000m
          memory: 1024Mi
      serviceAccountName: actions-runner
      terminationGracePeriodSeconds: 100
      tolerations:
        - effect: NoSchedule
          key: kubernetes.io/arch
          operator: Equal
          value: arm64
      volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
      volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-arm64

To Reproduce

Set the webhook trigger duration to a reasonable value like "5m" to cover the time it takes the HRA to scale up a runner. Set the HRA minReplicas to 0. Start 5 times as many jobs as the HRA's maxReplicas, with each job taking longer than the webhook trigger duration to complete. In other words, create a huge backlog of jobs.

Watch the autoscaler scale the runner pool to zero runners while there remains a huge backlog of jobs in the queue.

Describe the bug

The capacity reservations expire before the jobs even start running because the HRA cannot scale up past its maxReplicas. The webhook-based autoscaler expires most of the jobs on the queue before they have had a chance to start, and scales the runner pool down even though there are still a ton of jobs waiting.

Describe the expected behavior

The timer on the HRA duration should start either when the job is assigned to a runner or when the HRA tries to scale up. If the HRA is already at maxReplicas, the reservation should live indefinitely, until it has a chance to be assigned to an idle runner, at which point the duration timer can start.

Alternately, there could be a separate timeout for how long a job can live in the backlog before the autoscaler forgets about it. It doesn't make sense to me that the duration has to cover the time it might take the maxed-out cluster to work through a huge backlog, which may be several hours. As I understand it, that is also the potential amount of time an idle runner will be left running before being scaled down if, for some reason, the job canceled/completed event is missed.

Whole Controller Logs

Logs have already rolled over.

Whole Runner Pod Logs

Logs have already rolled over.

Additional Context

You can see in the webhook server logs entries like this:

2023-04-02T21:01:17Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra infra-runner-arm64 for capacityReservations update	{"before": 220, "expired": 15, "added": 0, "completed": -2, "after": 205}

I don't believe this is a coding error. I believe this is a flaw in the design of the capacity reservation system. Capacity that has been reserved but cannot be filled because the runner pool is already at max capacity should not expire while waiting for real capacity to become available.

Nuru added the bug and needs triage labels on Apr 2, 2023
@mumoshu
Collaborator

mumoshu commented Apr 2, 2023

@Nuru Hey! As you might have already figured out, your scale trigger duration is too short, and that's the issue. Just make it longer, like 60m or 2h or whatever you like. ARC is able to scale the pool down when the workflow job is completed, so you don't waste your compute resources.

  scaleUpTriggers:
    - amount: 1
      duration: 4m

This is unavoidable because all we have are webhook events emitted by GitHub Actions, and hence ARC has no consistent view of runners and workflow jobs.

The GitHub Actions team is currently developing a new autoscaling solution that makes the scheduling decision on the GitHub Actions side, which will let you omit the scale trigger duration entirely.

@mumoshu
Collaborator

mumoshu commented Apr 2, 2023

@Nuru Can I close this as answered? Once again, tracking how long a job can live in the backlog is impossible with the current model of ARC. Also, "the reservation should live indefinitely, until it has a chance to be assigned to an idle runner" isn't a good idea because webhook delivery isn't guaranteed. If GitHub Actions, something in between, or ARC somehow missed sending/receiving a webhook event, ARC needs to eventually converge to the desired state, not hang in a diverged state. Never design or implement something that assumes a non-guaranteed thing is guaranteed!

@Nuru
Contributor Author

Nuru commented Apr 3, 2023

@mumoshu wrote

isn't a good idea because webhook delivery isn't guaranteed

I understand webhook delivery isn't guaranteed, but I do not think that is relevant to my suggestion.

The situation I am complaining about is where the webhook delivers a "queued" event and the webhook adds a capacity reservation, but the HRA does not add capacity because it is already at maxReplicas.

My suggestion is along the lines of when the HRA gets a new capacity reservation and it cannot scale up due to already being at max capacity, it extends the capacity reservation indefinitely (or maybe, say, a year, or 90 days, or some separately configurable duration). When a job completes and the ephemeral runner is deleted, ideally the HRA would launch a new pod and reset the expiration on the oldest capacity reservation with the extended expiration to the then-current time plus duration. If the newly launched pod remains idle for the duration, then the HRA can reset the expirations on all the reservations with extended expirations.

I acknowledge that I am not 100% sure of the details, but it seems to me there is enough knowledge about the runners locally to know if they are being fed jobs or not, and to drain the queue of backlogged reservation requests based on that. The end result would be that if runners remain idle, then the HRA will scale the runner pool size down to minReplicas based just on the fact that the runners are idle for too long, having nothing to do with receiving webhook events.
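
To illustrate, here is a minimal sketch of the proposed behavior. This is hypothetical code, not ARC's actual implementation: the type only loosely mirrors ARC's CapacityReservation, and the 90-day backlog duration is just an example of a "long" expiration.

package main

import (
    "fmt"
    "time"
)

// Hypothetical, simplified stand-in for ARC's CapacityReservation.
type CapacityReservation struct {
    EffectiveTime  time.Time
    ExpirationTime time.Time
    Replicas       int
}

// addReservation appends a reservation for one runner. If the pool is already
// at maxReplicas, the reservation is "parked" with a much longer expiration so
// it cannot expire while it waits for capacity to free up.
func addReservation(rs []CapacityReservation, maxReplicas int, triggerDuration, backlogDuration time.Duration, now time.Time) []CapacityReservation {
    expiry := now.Add(triggerDuration)
    if len(rs) >= maxReplicas {
        expiry = now.Add(backlogDuration)
    }
    return append(rs, CapacityReservation{
        EffectiveTime:  now,
        ExpirationTime: expiry,
        Replicas:       1,
    })
}

func main() {
    now := time.Now()
    var rs []CapacityReservation
    // Queue five jobs against a pool capped at three replicas: the first three
    // get the normal trigger duration, the backlogged two get the long one.
    for i := 0; i < 5; i++ {
        rs = addReservation(rs, 3, 4*time.Minute, 90*24*time.Hour, now)
    }
    for i, r := range rs {
        fmt.Printf("reservation %d expires in %v\n", i, r.ExpirationTime.Sub(now))
    }
}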

@mumoshu
Collaborator

mumoshu commented Apr 4, 2023

My suggestion is along the lines of when the HRA gets a new capacity reservation and it cannot scale up due to already being at max capacity, it extends the capacity reservation indefinitely

That's almost how it's supposed to work since v0.27.1 (#2258), by resetting the duration on every "completed" or "cancelled" workflow job event. This works because ARC is very likely to receive either webhook event before the duration expires.
It doesn't track which reservations are active or not, or continuously check and reset durations, because that's costly in terms of computation and custom resource size. However, I think that's unnecessary: with a proper scale trigger duration configured, #2258 should be enough.
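
Roughly, the #2258 behavior described above can be pictured like this (a simplified sketch, not the actual ARC code; it reuses the CapacityReservation type and the time import from the earlier sketch):

// On each scale-down trigger (a "completed"/"cancelled" workflow job event),
// refresh the reservations still waiting beyond the maxReplicas window, so a
// backlog keeps being extended as long as such events arrive.
func refreshBacklogOnScaleDown(rs []CapacityReservation, maxReplicas int, d time.Duration, now time.Time) {
    for i := maxReplicas; i < len(rs); i++ {
        rs[i].EffectiveTime = now
        rs[i].ExpirationTime = now.Add(d)
    }
}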

The end result would be that if runners remain idle, then the HRA will scale the runner pool size down to minReplicas based just on the fact that the runners are idle for too long, having nothing to do with receiving webhook events.

There can be a race between GitHub Actions scheduling a job to the idle runner and ARC terminating that idle runner, if that's what you mean. I need to repeat my first two replies here: ARC doesn't have a consistent view of queued jobs. Reservations are there to do our best to provision the desired number of runners at approximately the desired times.

@mumoshu
Collaborator

mumoshu commented Apr 4, 2023

If you have a more concrete proposal about what we can do to enhance the situation, please feel free to submit a pull request. I'll eagerly review it and try to learn from it!

But also be aware that the GitHub Actions team is currently developing a new autoscaling solution that does scheduling on their end, so that we can provide a nicer experience without hacks based on wrong assumptions.

@mumoshu
Collaborator

mumoshu commented Apr 4, 2023

Well, this isn't actually a bug, to be honest... I've explained this behavior multiple times in past issues, and #2258 was the first attempt to make it better. If there's something that can be improved beyond that, again, I'd appreciate it if you could provide a more concrete proposal, or, even better, a pull request formatted similarly to #2258, so that I can see what can really be improved.

Otherwise, you can continue this in Discussions!

@Nuru
Contributor Author

Nuru commented Apr 4, 2023

@mumoshu wrote:

That's almost how it's supposed to work since v0.27.1 (#2258), by resetting the duration on every "completed" or "cancelled" workflow job event. This works because ARC is very likely to receive either webhook event before the duration expires.

I see. #2258 says it implements what I listed as my expected behavior, but that does not seem to be what I experienced. I hesitate to offer a PR because I do not fully understand the code or have a good way to test it, but I can offer these observations. If you can clarify some things for me, then maybe I can offer solutions.

This code implies that r.Replicas is always exactly 1:

before := len(hra.Spec.CapacityReservations)
expired := before - len(copy.Spec.CapacityReservations)
after := len(copy.Spec.CapacityReservations)

Which means that it is expected that running replicas have corresponding capacity reservations that are subject to expiration. So what happens if they expire before the job finishes? This line

copy.Spec.CapacityReservations = getValidCapacityReservations(copy)

means that any expired capacity reservations will not be included in the list, which calls into question the implied expectation (see below) that the first copy.Spec.MaxReplicas reservations on the list will be the ones corresponding to running replicas.

This code also does not seem right to me:

if !found && r.Replicas+amount == 0 {
    found = true
    foundIdx = i
} else {
    // Note that we nil-check max replicas because this "fix" is needed only when there is the upper limit of runners.
    // In other words, you don't need to reset effective time and expiration time when there is no max replicas.
    // That's because the desired replicas would already contain the reservation since it's creation.
    if found && copy.Spec.MaxReplicas != nil && i > foundIdx+*copy.Spec.MaxReplicas {

I don't see anything ever setting a reservation's Replicas = 0, plus I don't see any guarantee that r.Replicas is 1 (what if the HRA's scaleUpTriggers.amount == 2?) or that amount == -1 when amount < 0. So I don't understand the r.Replicas+amount == 0 test, or why the code only updates the expiration times of the reservations past foundIdx+*copy.Spec.MaxReplicas and not all of the reservations.

So like I said, I don't fully understand the code. If you can help me understand this, then I can maybe offer better suggestions.

@mumoshu
Collaborator

mumoshu commented Apr 9, 2023

Hey!

This code implies that r.Replicas is always exactly 1:

Maybe you linked the wrong lines...?

I don't see any guarantee that r.Replicas is 1

Yeah, I think your observation is correct here. I vaguely remember that we introduced "batching" to add capacity reservations buffered over a short period all at once, so that the write rate to the k8s-apiserver and etcd is reduced. Perhaps that matching logic has not been working well since batching was introduced, if you are really seeing issues due to that? 🤔
If that's true, you should be able to reproduce the issue in unit tests.

To be clear, I have not myself seen any real problems coming from the code you referred to. So it could either be just that other code handles the scale down after the introduction of batching, or you found a real problem. In the latter case, I'm definitely eager to review your pull request!

@mumoshu
Collaborator

mumoshu commented Apr 10, 2023

I reread the relevant code and it turns out our batching doesn't merge multiple scale triggers into one. It just batches scale triggers into one K8s API update call. And our github-webhook-server creates scale triggers of amount 1 or -1, not other numbers. All in all, for the lines you mentioned, I can say the amount is guaranteed to be 1!

For your reference, this is the only place we instantiate scale operations of amount 1 or -1:

if e.GetAction() == "queued" {
    target.Amount = 1
    break
} else if e.GetAction() == "completed" && e.GetWorkflowJob().GetConclusion() != "skipped" && e.GetWorkflowJob().GetRunnerID() > 0 {
    // A negative amount is processed in the tryScale func as a scale-down request,
    // that erases the oldest CapacityReservation with the same amount.
    // If the first CapacityReservation was with Replicas=1, this negative scale target erases that,
    // so that the resulting desired replicas decreases by 1.
    target.Amount = -1
    break
}

This is how we batch updates to capacity reservations:

copy.Spec.CapacityReservations = append(copy.Spec.CapacityReservations, v1alpha1.CapacityReservation{
    EffectiveTime:  metav1.Time{Time: now},
    ExpirationTime: metav1.Time{Time: now.Add(scale.trigger.Duration.Duration)},
    Replicas:       amount,
})

@Nuru
Contributor Author

Nuru commented Apr 12, 2023

@mumoshu Thank you for the clarification.

The first lines of code I referenced, particularly

after := len(copy.Spec.CapacityReservations) 

imply that all the CapacityReservations have Replicas = 1, because they treat the number of reservations as the number of replicas reserved. If Replicas could be other than 1, you would need something like this (pseudocode):

after := sum(copy.Spec.CapacityReservations[*].Replicas)
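
In Go terms, a sketch of that (assuming the v1alpha1.CapacityReservation type quoted earlier, with an integer Replicas field) would be:

// Sketch only: count reserved replicas by summing the Replicas fields
// instead of counting reservation entries.
func totalReservedReplicas(reservations []v1alpha1.CapacityReservation) int {
    total := 0
    for _, r := range reservations {
        total += r.Replicas
    }
    return total
}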

The fact that the webhook only scales up by a hardcoded 1 means that, in fact, the HRA scaleUpTriggers.amount is ignored, right? So that should probably be documented somewhere.

There are still a bunch of issues around scaling with other (non-webhook) triggers, and around accounting for the expiration of reservations that are paired with running replicas when updating backlogged reservations. And documentation.

I addressed what I felt comfortable with in #2502

@Nuru
Contributor Author

Nuru commented Apr 12, 2023

@mumoshu Even with my PR #2502, it appears that cancelled jobs are not being removed from HRA capacity reservations. I am seeing matrix jobs being cancelled because of fail-fast, but the HRA capacity reservations are not being removed. I'm having a hard time tracing through the code to see where that is supposed to happen. Can you please point it out to me?

@mumoshu
Collaborator

mumoshu commented Apr 13, 2023

Hey @Nuru! Thanks for your efforts. To better understand it, could you clarify, in short, what "real" issue #2502 is supposed to fix? #2258 should have fixed it by readjusting existing capacity reservations' effective times on scale down. Is it that #2502 is supposed to fix it by further readjusting existing capacity reservations' effective times on scale up too, but that didn't actually work according to your tests?

@mumoshu
Collaborator

mumoshu commented Apr 13, 2023

@Nuru Then our issue can potentially be that ARC doesn't properly handle scale down when workflow jobs of status=cancelled are received(?)

I believe ARC already handles scale down correctly on workflow jobs of status=completed. status=cancelled ones should also be handled the same way, as long as those cancellations happen after workflow jobs of status=queued.

@mumoshu
Collaborator

mumoshu commented Apr 13, 2023

Re #2502, I'd love it if you could add some previously failing tests that pass after your fix, so that I can better understand the exact error cases you think you are affected by! Thanks in advance for your support.

@Nuru
Contributor Author

Nuru commented Apr 14, 2023

@mumoshu I tried to describe the errors I fixed in the description of #2502. However, I do not have the resources to devote to creating tests that demonstrate the failures and fixes. Note that the horizontal_runner_autoscaler_batch_scale code does not have a test suite of any kind yet. I have already spent three times the time I had allocated for this PR and cannot afford to generate a new test suite and mocks.

Among the fixes: if reservations "in use" (associated with an active runner) expired, the replacement reservation did not get its expiration time extended. Similarly, when the runner pool was at maxReplicas capacity and a job finished, the replacement reservation was not getting its expiration time extended. Additionally, the value logged as expired was completely wrong, and the value logged as completed was negative when I believe it should be positive.

It may also have been that I was affected by the fact that the expiration times were only being extended on scale down, which I also fixed. When a large set of new jobs is created, it may be a long time before a scale down event occurs, but the expired reservations were being removed every batch cycle (every 3 seconds). Now the expiration time extensions match the expiration time pruning cycle.
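
In outline, that scheduling looks something like the following rough sketch (hypothetical helper name, not the actual #2502 diff; it reuses the CapacityReservation type from the earlier sketch). Which reservations get extended, and by which duration, is exactly the kind of detail the PR adjusts.

// Rough sketch: prune expired reservations and extend the backlogged ones in
// the same batch pass, so extensions happen as often as pruning does.
func reconcileReservations(rs []CapacityReservation, maxReplicas int, d time.Duration, now time.Time) []CapacityReservation {
    // Drop reservations whose expiration has already passed.
    valid := rs[:0]
    for _, r := range rs {
        if r.ExpirationTime.After(now) {
            valid = append(valid, r)
        }
    }
    // Reservations beyond the maxReplicas window cannot be running yet, so
    // push their expirations forward until capacity frees up.
    for i := maxReplicas; i < len(valid); i++ {
        valid[i].EffectiveTime = now
        valid[i].ExpirationTime = now.Add(d)
    }
    return valid
}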

Also, the replacement times were getting extended by the duration associated with the scaleDown trigger, which is only correct if there is only ever a single duration for scale up or scale down, which I do not think is a good assumption for the code to make. Plus, if the batch contained multiple scaleDown events, the expirations were extended for each event; now they are only extended once per batch.

Also, I could not find the code that handles the webhook receiving a status=cancelled and generating a corresponding scaleDown event. If you could point it out to me, I'd appreciate it.

@Nuru
Contributor Author

Nuru commented Apr 19, 2023

@mumoshu I figured out why cancelled jobs were not triggering a scale down. Fixed in #2520
