
self hosted runner is not accepting jobs from queue. #592

Closed
npalm opened this issue Jul 15, 2020 · 24 comments
Labels
bug Something isn't working

Comments

@npalm

npalm commented Jul 15, 2020

Describe the bug
Self-hosted idle runner is not consuming queued jobs (runner version 263, release version).

To Reproduce
Steps to reproduce the behavior:

  1. Assume you have a self-hosted runner in the offline state
  2. A new action workflow is triggered
  3. Create a new runner (config and run; see the sketch after this list). We scale them automatically
  4. The new runner does not consume the queued jobs.
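
For reference, "config and run" in step 3 typically looks roughly like this (a minimal sketch with a placeholder URL, token and labels, not the exact automation we use):

# download and unpack a runner release (version shown as an example)
curl -o actions-runner.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.263.0/actions-runner-linux-x64-2.263.0.tar.gz
tar xzf actions-runner.tar.gz

# register the runner against the repo/org with a registration token, then start listening for jobs
./config.sh --url https://github.com/<org>/<repo> --token <REGISTRATION_TOKEN> --labels self-hosted,linux --unattended
./run.sh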

This was working perfectly for a few months, but recently broke.

  1. It gets even stranger: trigger a new workflow and the runner will see the job, start an upgrade to the pre-release (see Auto updater wants to update to a pre-release version #581), and the job starts. The queued jobs remained stuck today (2020-07-15, between 12:00 and 18:00 CET); around 20:30 CET the queued jobs were consumed as well. It is still strange that the second trigger's event caused an update to a non-released version. Should 267 not be released?

Expected behavior
A runner in the idle state should consume queued jobs (the labels match).

Runner Version and Platform

GitHub cloud + runner version 263

OS of the machine running the runner? OSX/Windows/Linux/...

What's not working?

No error message, see behavior above.

Job Log Output

If applicable, include the relevant part of the job / step log output here. All sensitive information should already be masked out, but please double-check before pasting here.

Runner and Worker's Diagnostic Logs

n/a

reference: our setup is available here: https://github.com/philips-labs/terraform-aws-github-runner

@npalm npalm added the bug Something isn't working label Jul 15, 2020
@N2D4

N2D4 commented Jul 15, 2020

EDIT: In our case, it turned out to be a misconfiguration on our side. As soon as the runner exited, we rebooted the machine, which caused the update to fail. I'll leave this comment here in case it helps someone else regardless.

Original:

Same issue here: the runner seems to try updating from 2.263.0 to 2.267.1 every time we queue a job, and does not actually run the job. Logs:

2020-07-15 19:37:23Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.267.1 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.

√ Connected to GitHub

2020-07-15 19:38:55Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.267.1 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.

√ Connected to GitHub

2020-07-15 19:39:34Z: Listening for Jobs

This seems to have started a few hours ago.

@mforutan

mforutan commented Aug 10, 2020

We are seeing the same issue. Our setup has two jobs: the first scales up the runner if needed and waits for it, and the second runs the actual pipeline. Sometimes the second job doesn't start when the runner's state is offline, even after the runner is ready to accept jobs. This does not happen if the runner is in the idle state instead of the offline state.

Edit: Adding a 60s sleep at the end of the first job, combined with a needs: scale-job attribute on the second job, works as a workaround for us.

@patrickmscott

We are having the same issue. If jobs are queued when all runners are offline, those jobs are never run when runners come back online.

@PickledChris

PickledChris commented Aug 24, 2020

+1, we have this issue as well. We keep an offline runner for each repo, then spin up a runner with the most recent runner version when we detect a workflow.
Manually restarting the job works, since there is then a runner scheduled for it to be deployed to, but I'd have to build some automation to automatically restart jobs that had been queued...
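
Something along these lines could be the basis for that automation (only a sketch; OWNER, REPO and the token are placeholders, and it assumes the standard GitHub REST API endpoints for listing, cancelling and re-running workflow runs):

# find workflow runs that are stuck in the queue
curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/runs?status=queued" \
  | jq -r '.workflow_runs[].id' \
  | while read run_id; do
      # cancel the stuck run (mirrors the manual workaround), give it time to settle, then re-run it
      curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
        "https://api.github.com/repos/OWNER/REPO/actions/runs/$run_id/cancel"
      sleep 30
      curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
        "https://api.github.com/repos/OWNER/REPO/actions/runs/$run_id/rerun"
    done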

@MartinNowak

It behaves as if workflow runs are scheduled onto specific runners when they are created/triggered, so they don't get rescheduled when new runners are added.
Not sure what the problem is (in particular since @npalm mentioned it did work a while ago).
Is the source code available somewhere?

@vitobotta

Having this issue right now. I have idle workers "listening for jobs" and nothing happens. Workflows are not started. I tried cancelling and restarting but it doesn't seem to help. I started to see this today. Anyone else experiencing this at the moment?

@deeno35

deeno35 commented Sep 10, 2020

Same for us. Restarting didn't do anything, but I manually downloaded actions-runner-linux-x64-2.273.1.tar.gz, untarred it right on top of the existing install, and then restarted the daemon. That got jobs flowing again.
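
In case it helps anyone else, the manual fix was roughly the following (a sketch; the install path is from my setup and the svc.sh commands assume the runner was installed as a service, so adjust to however your daemon is managed):

cd /home/ec2-user/actions-runner
sudo ./svc.sh stop

# fetch the release the auto-updater failed to install and unpack it over the existing install
curl -o actions-runner-linux-x64-2.273.1.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.273.1/actions-runner-linux-x64-2.273.1.tar.gz
tar xzf actions-runner-linux-x64-2.273.1.tar.gz

sudo ./svc.sh start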

I have a feeling the updater was having issues updating from 2.273.0 -> 2.273.1 and the server side was not placing jobs on runners running a previous version.

This has happened to me quite a number of times (5 or 6?) in various ways since I started using GitHub Actions: the runner tried to pick up a new release, jobs stopped running because of it, and manual intervention was needed.

@deeno35

deeno35 commented Sep 15, 2020

It's back! The 2.273.1 -> 2.273.2 auto-update killed our runner. After a manual restart, the instance is sitting around idle, yet jobs are waiting to be picked up. A manual untar of 2.273.2 right on top of my current install plus a service restart fixed it again.

This is the cause of the runner dying in the first place (I've seen this 3 or 4 times):

/home/ec2-user/actions-runner/_work/_update.sh: line 31: nul: Permission denied

Is there a reason /dev/null isn't being used in _update.sh? Is this to be platform agnostic?
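
For context on that error: on Linux, nul is not a special device, so a redirection like > nul just tries to create a regular file named nul in the current directory (which fails with Permission denied if the directory isn't writable), whereas /dev/null is the actual discard device. A quick illustration:

# on Linux this creates (or fails to create) an ordinary file called "nul"
echo "discard me" > nul 2>&1

# this discards the output properly
echo "discard me" > /dev/null 2>&1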

@lokesh755
Contributor

lokesh755 commented Sep 16, 2020

We’re aware of the current issue where runners added after queuing a run will not pick up the jobs if the added runner is not the latest version. We’re rolling out the fix and will update the issue again once it’s deployed everywhere. In the meantime, you can unblock yourselves by adding the latest runner every time.
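
For anyone scripting that workaround, the latest runner version can be looked up before registering the runner (a sketch; it assumes jq is available and uses the public releases API):

# resolve the most recent runner release tag, e.g. "v2.273.2", and strip the leading "v"
LATEST=$(curl -s https://api.github.com/repos/actions/runner/releases/latest | jq -r '.tag_name' | sed 's/^v//')

# download and unpack exactly that version before running config.sh / run.sh
curl -o actions-runner.tar.gz -L \
  "https://github.com/actions/runner/releases/download/v${LATEST}/actions-runner-linux-x64-${LATEST}.tar.gz"
tar xzf actions-runner.tar.gz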

@lanen

lanen commented Sep 18, 2020

Is it all right now?

My workflow also hangs in the queue. My self-hosted runner version is: actions-runner-linux-arm64-2.272.3.tar.gz

[2020-09-18 03:01:20Z WARN GitHubActionsService] Authentication failed with status code 401.
Transfer-Encoding: chunked
WWW-Authenticate: Bearer
Strict-Transport-Security: max-age=2592000
X-TFS-ProcessId: c244f0e2-fc5b-4e43-9aad-94f7924d2494
ActivityId: 6333337f-d4cd-464f-8b37-d79565ebf02c
X-TFS-Session: b10f7d39-de6b-4a31-90a0-7a218a3041b8
X-VSS-E2EID: 2da95130-1a17-4e57-86a2-e70746624609
X-VSS-SenderDeploymentId: 13a19993-c6bc-326c-afb4-32c5519f46f0
X-TFS-ServiceError: The+user+%27System%3aPublicAccess%3baaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa%27+is+not+authorized+to+access+this+resource.
X-VSS-S2STargetService: 0000005A-0000-8888-8000-000000000000/visualstudio.com
X-MSEdge-Ref: Ref A: 9DC762193D88407F83C306B199AF8276 Ref B: HKBEDGE0306 Ref C: 2020-09-18T03:01:20Z
Date: Fri, 18 Sep 2020 03:01:20 GMT

@lokesh755
Contributor

lokesh755 commented Sep 30, 2020

The fix has been deployed, and new runners (even with older versions) should now be able to pick up jobs that were queued before they were added.

@lokesh755
Contributor

@lanen It seems like a different issue. Did you check in the UI whether the runner still exists or has been deleted?

@pamidu-A

@lokesh755 I am still facing the same issue (runner version: v2.273.5): runners didn't pick up the jobs that were queued before they were added.

@joseproura

The same happens to me with the latest version, 2.276.1. I have been able to work around it by adding a delay to the creation of the Fargate task that runs the runner, but when 2 jobs fire too close together, one of them is still missed.

@PavelSusloparov

Having a similar issue, specifically with an organization setup.
A personal GitHub account queue works fine for the same scenario and jobs are getting picked up.

@lokesh755
Contributor

@PavelSusloparov Could you provide your org and repo details?

@RobinDaugherty

Same issue here with actions-runner-osx-x64-2.278.0

@darwinz

darwinz commented Jun 16, 2021

We also had the same issue running version 2.277.1 on Ubuntu (actions-runner-linux-x64-2.277.1.tar.gz). We ended up upgrading manually to version 2.278.0 to get past the issue

@igagis

igagis commented Jun 22, 2021

I have a self-hosted runner on an ARM machine, and in about 50% of cases it does not pick up the job, regardless of whether the job is queued. Cancelling the run and starting it again makes it pick up the job. It is very annoying that I have to cancel and re-run jobs for almost every commit because of this issue.

Runner version 2.278.0

@ViacheslavKudinov

ViacheslavKudinov commented Jul 9, 2021

Hi @lokesh755, this is still a valid issue on 2.278.0.
We are hitting it now.
In the run log:

Found online and idle self-hosted runner(s) in the current repository's organization/enterprise account that matches the required labels: 'self-hosted, ****'
Waiting for a self-hosted runner to pick up this job...

but at the same time 2 new runners were spun up.

@project-administrator

I've just re-created the runner. The first problem is that it does not pick up the job:

Found online and idle self-hosted runner(s) in the current repository that matches the required labels: ...
Waiting for a self-hosted runner to pick up this job...

The second problem is that I am not able to delete the old offline runner from the web UI:

Sorry, there was a problem deleting your runner.

@chrisdone

Second problem is that I am not able to delete the old offline runner from the web-ui:

Sorry, there was a problem deleting your runner.

Me neither.

[Screenshot from 2021-12-06 14-45-04]

Not happy with this at all.

I'll just have to create a fresh runner on GitHub and make my CI scripts stop using the old one.

@igagis

igagis commented Dec 6, 2021

Because of this issue, I nowadays don't use the official GitHub runner on self-hosted machines. I switched to using this alternative runner and it works perfectly for me. It can even be installed via a Debian package.

@btmurrell

btmurrell commented Nov 16, 2023

We had the same problem, and the solution was very obscure. A support person asked me, "Did you recently change this repo from private to public?" Yes, I did... why would that matter?

There is a security setting in your runner group (something I never configured) which, by default, prevents a self-hosted runner in that group from picking up jobs from public repos. Change it to suit your needs, heeding the warning.
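
If you'd rather manage that setting outside the web UI, something like this should work (a sketch; ORG and GROUP_ID are placeholders, and it assumes the organization runner-group endpoints of the REST API with the allows_public_repositories field):

# list the org's runner groups to find the group id
curl -s -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/orgs/ORG/actions/runner-groups

# allow runners in that group to pick up jobs from public repositories
curl -s -X PATCH -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/orgs/ORG/actions/runner-groups/GROUP_ID \
  -d '{"allows_public_repositories": true}'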
