Ubuntu 24.04 runner jobs being canceled unprompted after about 4 minutes #9959

Closed
2 of 14 tasks
RytoEX opened this issue May 29, 2024 · 33 comments

Comments

@RytoEX

RytoEX commented May 29, 2024

Description

Ubuntu 24.04 runners are dying after about 4 minutes with:

Error: The operation was canceled.

The run summary shows the following warning:

Received request to deprovision: The request was cancelled by the remote provider.

or the following error:

The hosted runner encountered an error while running your job. (Error Type: Failure).

Ubuntu 24.04 runners worked fine up through approximately Wed, 29 May 2024 07:05:42 GMT. Failures started occurring at approximately Wed, 29 May 2024 07:52:23 GMT.

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 11
  • macOS 12
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Version: 20240516.4.0
https://github.com/obsproject/obs-studio/actions/runs/9289765863/job/25568048125

Is it regression?

Yes. Ubuntu 22.04 still works. https://github.com/obsproject/obs-studio/actions/runs/9289765863/job/25568047499

Expected behavior

I expect Ubuntu 24.04 runners to run successfully without being cancelled by the provisioner.

Actual behavior

Runs on Ubuntu 24.04 runners are being cancelled by the provisioner after about 4 minutes of runtime.

Repro steps

  1. Set up a job to run on Ubuntu 24.04 that will run for more than 4 minutes.
  2. Wait.
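
The repro steps above amount to keeping a job alive past the four-minute mark. A minimal sketch of such a step, assuming it runs in a Bash `run:` step; the `keep_busy` name and its arguments are illustrative and not part of the original report. Later comments in this thread suggest the trigger is a service restart during apt rather than elapsed time, so a plain wait may not reproduce the cancel on its own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the job alive for a given number of seconds,
# printing a heartbeat so the log shows how far the run got before any cancel.
keep_busy() {
  local total="${1:-300}"   # total seconds to run (default: 5 minutes)
  local step="${2:-30}"     # seconds between heartbeat lines
  local elapsed=0
  while [ "$elapsed" -lt "$total" ]; do
    sleep "$step"
    elapsed=$((elapsed + step))
    echo "still running after ${elapsed}s"
  done
}

# In a workflow step this would be called as:
#   keep_busy 300 30
```
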
@njzjz

njzjz commented May 29, 2024

Got the same issue with 22.04: https://github.com/deepmodeling/deepmd-kit/actions/runs/9293232817

@igsilya

igsilya commented May 29, 2024

We're experiencing the same as well. It's all over the place with 22.04. One example: https://github.com/ovsrobot/ovs/actions/runs/9287012507/attempts/2

@Piedone

Piedone commented May 29, 2024

We see the same with Orchard Core. E.g.: https://github.com/OrchardCMS/OrchardCore/actions/runs/9293755080

@daveisfera

> Got the same issue with 22.04: https://github.com/deepmodeling/deepmd-kit/actions/runs/9293232817

This doesn't appear to be the same issue

@quark17

quark17 commented May 30, 2024

We are also experiencing this: the workflow runs fine on other Ubuntu and macOS runners, but ubuntu-24.04 gets cancelled due to a deprovision request at about 4 minutes.

When I tried to re-run the failed jobs, it didn't even try to run them and just showed "this job was skipped".

@TedHartMS

We are seeing this on Garnet using 22.04.4: https://github.com/microsoft/garnet/actions/runs/9294668900/usage

@eenagy

eenagy commented May 30, 2024

We are encountering the same issue: https://github.com/eth-pkg/eth-nodes/actions/runs/9297384932/job/25587569214.

Ubuntu 22.04 works as a viable alternative, as shown in https://github.com/eth-pkg/eth-nodes/actions/runs/9287865369. However, it fails due to other non-image-related bugs.

@rursprung

i also encounter the "Error: The operation was canceled." problem with Ubuntu 24.04 and it happens every time (two out of two, anyway) at the exact same position in the build.

note that the 2nd one was faster as it was able to re-use some cached build artefacts of the first, i.e. this isn't a time-based kill. there's nothing visible in the logs which would indicate why it'd fail at exactly this position.
building on a local Ubuntu 24.04 works just fine.

@andreasabel

Me too: https://github.com/agda/agda/actions/runs/9288539542
https://githubstatus.com does not indicate any problems with actions.

@florinb-cyberhaven

I have the same issue and it happens always, starting with this week

@mhajas

mhajas commented May 30, 2024

We have the same issue with the following image that was first used yesterday

Image: ubuntu-22.04
Version: 20240526.1.0

it seems this was not present with

Image: ubuntu-22.04
Version: 20240516.1.0

GHA run: https://github.com/keycloak/keycloak-benchmark/actions/runs/9280514320/attempts/1
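
To help correlate failures with image versions as done above, a job can log the image version at the start of each run. This is only a sketch: it assumes the hosted runner exposes the version in an `ImageVersion` environment variable and that GNU `sort -V` is available; the `version_at_least` helper is illustrative.

```shell
#!/usr/bin/env bash
# Illustrative helper: succeed when version $1 is newer than or equal to $2,
# using GNU sort's version ordering (assumed available, as on Ubuntu).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# ImageVersion is assumed to be set by the hosted runner; fall back to "unknown".
current="${ImageVersion:-unknown}"
echo "runner image version: ${current}"

# 20240526.1.0 is the ubuntu-22.04 version reported as failing in the comment above.
if [ "$current" != "unknown" ] && version_at_least "$current" "20240526.1.0"; then
  echo "image is at or after the first version reported as failing"
fi
```
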

There are also some discussions created here:
https://github.com/orgs/community/discussions/126469
https://github.com/orgs/community/discussions/126539

@victorjulien

Affects us at https://github.com/OISF/suricata pretty badly too. Using 22.04 image mostly.

@rgacogne

https://github.com/PowerDNS/pdns is affected as well, GH actions workflows are basically useless for us right now, despite https://www.githubstatus.com's claims that all is fine.

@ArneTR

ArneTR commented May 30, 2024

@m3dwards

Others with same issue: #9848 (comment)

@mchades

mchades commented May 30, 2024

+1, Same problem

@lucasssvaz

+1

@lehins

lehins commented May 30, 2024

We are having the same problem for the last day or so on both ubuntu-latest and ubuntu-22.04
https://github.com/IntersectMBO/cardano-ledger/actions/runs/9296498410

@Joshua-Riek

I'm also having this problem when using Ubuntu 24.04 or 22.04 runners.

@curiositycasualty

curiositycasualty commented May 30, 2024

For me, installing valgrind, or even just apt upgrade, was causing a number of associated systemd services to restart, resulting in apt exiting with status 143.
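
Exit status 143 follows the shell convention for a process terminated by a signal: 128 plus the signal number, here 15 (SIGTERM). A small sketch for decoding such a status in a step's shell; the function name `describe_status` is illustrative.

```shell
#!/usr/bin/env bash
# Illustrative helper: translate an exit status into a readable description.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
describe_status() {
  local status="$1"
  if [ "$status" -gt 128 ]; then
    local signum=$((status - 128))
    echo "killed by signal ${signum} (SIG$(kill -l "$signum"))"
  else
    echo "exited with status ${status}"
  fi
}

describe_status 143   # prints "killed by signal 15 (SIGTERM)"
```
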

@mhajas

mhajas commented May 31, 2024

Based on the following comments, it seems the issue is no longer present:
https://github.com/orgs/community/discussions/126469#discussioncomment-9611720
https://github.com/orgs/community/discussions/126539#discussioncomment-9610045
We haven't seen this error so far in our daily runs.

@ShadowJonathan

ShadowJonathan commented May 31, 2024

Same here, https://github.com/ShadowJonathan/Soar/actions/runs/9302373522 runs now, whereas a few hours ago it did not.

@Piedone

Piedone commented May 31, 2024

Yep, fixed for all recent https://github.com/OrchardCMS/OrchardCore/actions/workflows/functional_all_db.yml workflow runs.

@lehins

lehins commented May 31, 2024

I wonder whether the issues with Cloudflare had anything to do with this, or whether it was just a coincidence that Cloudflare was degraded for the last couple of days.

Our Hydra CI was affected by that during the same period.

@eenagy

eenagy commented May 31, 2024

Still having the same issues: https://github.com/eth-pkg/eth-nodes/actions/runs/9297384932/job/25655431965
operation is still canceled

@Alexey-Ayupov
Collaborator

Hello @ArneTR. It should be fixed now.

If you have any other questions, feel free to reach out to us.

@ctapmex

ctapmex commented Jun 3, 2024

@eenagy

eenagy commented Jun 3, 2024

@cpu

cpu commented Jun 3, 2024

The fix was also insufficient for my repo's 24.04 actions (exemplar). They are failing with the same unexpected cancel early in the run.

@randombit

For those still hitting this - the problem is that if you install the wrong package, apt decides to restart services, and when it does so it apparently kills something important which causes the build to self cancel.

If you set export NEEDRESTART_MODE=l before running apt, the service restart is disabled and the build can continue.
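
A minimal sketch of that workaround as it might appear in a job's shell step. `NEEDRESTART_MODE=l` puts needrestart in list-only mode. The wrapper name `apt_install_no_restart` and the example package are illustrative, and passing the variable through `sudo env` is an assumption about how the step invokes apt (sudo may otherwise drop exported variables).

```shell
#!/usr/bin/env bash
# List-only mode: needrestart reports services it would restart but restarts none.
export NEEDRESTART_MODE=l

# Illustrative wrapper: install packages with the variable passed explicitly
# through sudo, since sudo may reset the environment.
apt_install_no_restart() {
  sudo env NEEDRESTART_MODE="$NEEDRESTART_MODE" DEBIAN_FRONTEND=noninteractive \
    apt-get install -y "$@"
}

# Example call (package name taken from an earlier comment in this thread):
#   apt_install_no_restart valgrind
```
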

@eenagy

eenagy commented Jun 5, 2024

> For those still hitting this - the problem is that if you install the wrong package, apt decides to restart services, and when it does so it apparently kills something important which causes the build to self cancel.
>
> If you set export NEEDRESTART_MODE=l before running apt, the service restart is disabled and the build can continue.

Thank you for the help. However, I would expect the GitHub runner to function consistently across all images. This issue seems like something that should be addressed on the image runner's side, rather than requiring a workaround on our end.

@simondeziel

simondeziel commented Jun 5, 2024

> For those still hitting this - the problem is that if you install the wrong package, apt decides to restart services, and when it does so it apparently kills something important which causes the build to self cancel.

I suspect that's needrestart restarting runner-provisioner.service, which was reported as https://bugs.launchpad.net/ubuntu/+source/needrestart/+bug/2067800

> If you set export NEEDRESTART_MODE=l before running apt, the service restart is disabled and the build can continue.

I didn't try but sudo apt-get autopurge -y needrestart also worked around the issue.

@cpu

cpu commented Jun 5, 2024

> I didn't try but sudo apt-get autopurge -y needrestart also worked around the issue.

I gave this a try and can confirm it also seems to resolve the issue of the 24.04 runner being cancelled unexpectedly.
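
A sketch of the purge workaround as an early step in a job, guarded so it does nothing on images that do not ship needrestart. The `purge_needrestart` name is illustrative; the `apt-get autopurge` command is the one quoted above.

```shell
#!/usr/bin/env bash
# Illustrative helper: remove needrestart only when dpkg knows about it,
# so the step is a no-op on images without the package.
purge_needrestart() {
  if dpkg -s needrestart >/dev/null 2>&1; then
    sudo apt-get autopurge -y needrestart
  else
    echo "needrestart is not installed; nothing to do"
  fi
}

# Run before any other apt-get install step:
#   purge_needrestart
```
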
