
GitHub Actions step is hanging until timeout #1326

Closed
Lastique opened this issue Sep 13, 2021 · 24 comments
Labels: bug Something isn't working

@Lastique

Describe the bug

At some point, a GitHub Actions step stops producing output and hangs until the workflow is terminated by timeout.

You can see the output stop being produced if you actively watch the job output in the browser. If you try opening an already hanging job in the browser, the step output is not displayed at all: the ">" sign next to the step is not visible, and the icon next to it is a spinning yellow circle. The output becomes available once the workflow is terminated.

Here is an example of one such run. Here is the log with timestamps:

boost_log_gha_hang.txt.gz

I'm sure the processes running in the GHA step are not hanging (this is just a C++ compilation process, which does succeed normally).

This problem does not seem to be unique to my setup; there are other reports as well. In my case it happens most often with the macos-10.15 runner, but I've also seen it happen with the ubuntu-20.04 runner.

To Reproduce

At the moment, this reproduces quite reliably by running the workflow on my Boost.Log repository (git revision boostorg/log@2f058c6). I intend to try working around the problem (e.g. by disabling the offending runner), so this may not stay reproducible in later revisions.

Expected behavior

The runner should not hang.

Runner Version and Platform

I'm using the free runners hosted by GitHub.

@Lastique Lastique added the bug Something isn't working label Sep 13, 2021
Lastique added a commit to boostorg/log that referenced this issue Sep 13, 2021
This is another attempt to work around the CI job hanging, as reported
in actions/runner#1326.
@Lastique
Author

After deeper investigation it turned out that a process indeed could hang while executing a job step. After this was fixed, the job no longer hangs.

However, I still find it incorrect that I could not access step logs while the job was hanging. This makes it impossible to know what's going on until the job times out or is cancelled. I'm leaving this bug open for this issue to be fixed.

@phadej

phadej commented Nov 10, 2021

I have seen this consistently lately, e.g.

This is very frustrating, especially knowing that a runner spins there for 6 hours not doing anything meaningful.

@phadej

phadej commented Nov 11, 2021

Another job: https://github.com/haskell/aeson/runs/4178377747?check_suite_focus=true These seem to always fail at the same spot, which is good, as the failure is somewhat deterministic. But I have no idea what it may be. I added a (quite conservative) limit on the memory the compiler could use, but that doesn't seem to help. Sometimes jobs succeed, sometimes they don't. I still suspect that it is some limit the jobs hit, and then they are lost.

@phadej

phadej commented Nov 11, 2021

I tried out my out-of-memory hypothesis: when memory runs low, the whole job becomes a lot slower and unresponsive.

When I removed the runtime system memory limit, the job failed.

In these cases the job didn't time out, however; it was unresponsive and logs weren't streamed.

Can high memory usage cause a slowdown of the job? Is there any way to get analytics on how many resources the job runners have used?
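
(One rough way to get such a view, sketched here rather than an official feature, is to run a background loop that periodically logs memory and disk usage; the job name and build command below are placeholders.)

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Start resource monitor (sketch)
        run: |
          # Print a memory/disk snapshot every 30 seconds while later steps run.
          while true; do
            echo "== $(date -u) =="
            free -m
            df -h /
            sleep 30
          done > "$RUNNER_TEMP/resource-usage.log" 2>&1 &
      - name: Build
        run: make   # placeholder for the real build command
      - name: Show resource log
        if: always()
        run: cat "$RUNNER_TEMP/resource-usage.log"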

@chris-griffin

One observation is that while the timed-out step will not show any log detail in the Actions GUI (it will continue to render as in-progress/yellow), detailed logs of that step are available via the gear icon => "View raw logs".

[Screenshots: the hung step in the Actions UI, and the gear-icon menu with "View raw logs"]

@nikola-jokic nikola-jokic self-assigned this Mar 10, 2022
@nikola-jokic
Member

Hi everyone,

I will close this issue now since @chris-griffin explained how you can view raw logs. Thank you @chris-griffin!

@phadej, if you are still seeing this issue, and if you are running your workflows on your self-hosted runner, could you double-check that you have enough computing power? It might be due to that. If you cannot find what is causing this issue, please submit a new issue with all the relevant information so we can reproduce it and see what is causing you this trouble.

@phadej

phadej commented Mar 16, 2022

@nikola-jokic all jobs above were run on GitHub provided machines.

I don't agree that this issue is resolved. Jobs time out for non-obvious reasons, and I cannot verify that the raw logs would have told the reason, as they are no longer available for those jobs. Bummer.

@nikola-jokic
Member

Hi @phadej, I can take a look, no problem. But when I went to look at your workflows, the logs had already expired. That is why I asked you to file a new issue so we can take a look. Maybe you can also suggest how we could improve logging to help with that issue?

@jeacott1

I'm having a lot of grief with this issue recently.
Mostly it's hung at "Job is about to start running on the hosted runner: Hosted Agent (hosted)" for hours!

@nikola-jokic
Member

nikola-jokic commented May 12, 2022

Hi @jeacott1,

If the job is about to start but never actually starts and just hangs there, the job has not landed on the runner. It might be due to the resolved incident.

If the issue persists, please post your feedback on the GitHub Community Support Forum, which is actively monitored. Using the forum ensures that we route your problem to the correct team. ☺️

@adarkbandwidth

We are seeing this exact issue as well. I know this issue is closed, but no attempt was made to fix it here. Frustrating.

@nikola-jokic
Member

Hi @adarkbandwidth,

Did you post this issue to the GitHub Community Support Forum? Since the workflow did not land on the runner, the issue does not belong to this repository. That is the reason why it is closed.

Please, post your feedback to the Community Support Forum ☺️

@bryanmacfarlane
Member

bryanmacfarlane commented Jul 1, 2022

There is likely not a bug/issue to fix here.

As mentioned above, a good thing to do is set a timeout on the job. If it typically takes 5 minutes under normal conditions, for example, setting a timeout of 20 minutes will help free the concurrency slot (the self-hosted machine, or cost if hosted) in the event of a hang.
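
For reference, a minimal sketch of that advice (the job name, runner image, and build command are placeholders):

jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 20        # free the concurrency slot if the job hangs
    steps:
      - uses: actions/checkout@v3
      - name: Build
        timeout-minutes: 15    # a per-step limit is also available
        run: make              # placeholder build command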

Outside of getting the raw logs (which may or may not have more details, depending on the output of the process the step is running), you can also try running on a self-hosted VM (just for debugging) that you have access to, so you can log in and watch resources etc. It's a good debugging technique. You can even install the runner on your dev desktop, though note that it might not match the resources of the hosted VM. If you want a VM that roughly matches its characteristics, you can get an Azure DS2v2 VM with 2 cores.
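
A sketch of routing the same job to such a machine, assuming a Linux self-hosted runner registered with the default labels:

jobs:
  build:
    runs-on: [self-hosted, linux, x64]   # a machine you can log in to and watch with top, df, etc.
    steps:
      - uses: actions/checkout@v3
      - name: Build
        run: make   # placeholder build command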

There is a potential feature to get better visibility into what's happening on the hosted machine: debugging into the hosted machine/runner, exposing perf counters, getting a dump of a process, etc. Out-of-disk-space detection is another potential feature. That has been discussed and is on the feature backlog, but it is not bug-level work. We could even have the runner detect no output after some amount of time, but even that doesn't fix anything; it's just a more efficient timeout option.

A clarification of typical issues:

  • Test hang / deadlock: the runner runs a test harness, which runs your tests. If the tests deadlock, the test process just hangs and no output is created. The tests might not hang on your desktop, but on another machine/VM with different I/O characteristics the fragile code can be exposed and deadlock. This could also happen in any process you supply for the runner to run.
    There is nothing the runner can do here. It's waiting for output and a return code.
  • Resource issues (OOM): the runner runs a step which runs the processes/command lines you provide. If those processes use too much memory, they can starve other processes on the machine. On Linux we run your steps in a cgroup to ensure you can only use 80% of the memory and not starve the runner of resources; the Linux OOM killer will kill your process and it will fail. On other OSes there isn't that capability, and a runaway process using too many resources can actually starve the runner of memory, CPU, I/O, etc.
  • Out of disk space: if the processes you supply in steps run the machine out of disk space, it can be challenging to figure out what's going on, as the runner cannot even create logs to send up to the service. As mentioned above, there is a potential feature to monitor resources and get better visibility; in this case, we could monitor when x% is reached and add a warning to the build. That falls under the feature work mentioned above. Note that this won't lead to the kind of hang mentioned here: the runner maintains a heartbeat (renews a lock) to the service, and if it "goes away" the server will abandon the job. So that's likely not your issue.

Hope that helps clarify many of the common issues, challenges, and potential features.

@adarkbandwidth

Thank you for this detailed response. We're focusing on working through these issues. We're not using a self-hosted runner, so we can't observe behavior outside of the GitHub logs (which are missing in this case), but we can observe the outputs for some steps.

@adarkbandwidth

FWIW, we realized that when we cancel the action, we do then get logs from the hosted runner, and we're able to troubleshoot from there. Our code is hanging—not GHA.

@fringedgentian

We had this same issue, and it was caused by including Apache commons-io 2.9 or later. We reverted to 2.7 and the tests ran without issue.

implementation 'commons-io:commons-io:2.9.0'

@prein

prein commented Jun 22, 2023

Would it be a good idea (if possible) to implement something like an inactivity / idle timeout in addition to the already implemented "general" timeout? Imagine a job that is expected to run for 20 minutes on average. I set the timeout to 25 minutes. Even if the job hangs in the first minute of execution, it will be killed only after 25 minutes. If there were an idle timeout that I could set to 5 minutes, then a job that hangs in the first minute of the run would be killed in the 6th minute, saving 19 minutes in my example.
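
In the meantime, a crude approximation can be scripted inside the step itself. A sketch, assuming a Linux runner and a hypothetical ./build.sh standing in for the real command:

      - name: Build with an inactivity watchdog (sketch)
        timeout-minutes: 25            # the overall cap still applies
        run: |
          ./build.sh > build.log 2>&1 &     # hypothetical build command
          pid=$!
          while kill -0 "$pid" 2>/dev/null; do
            sleep 30
            idle=$(( $(date +%s) - $(stat -c %Y build.log) ))   # seconds since last output (GNU stat)
            if [ "$idle" -gt 300 ]; then
              echo "No output for ${idle}s; assuming the step is hung"
              kill "$pid"
              exit 1
            fi
          done
          cat build.log
          wait "$pid"                   # propagate the build's exit status

The trade-off is that the step's output is only shown once it finishes or is killed; streaming it live would need a bit more plumbing.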

@prein

prein commented Jun 22, 2023

@MhdSadd

MhdSadd commented Nov 24, 2023

After deeper investigation it turned out that a process indeed could hang while executing a job step. After this was fixed, the job no longer hangs.

However, I still find it incorrect that I could not access step logs while the job was hanging. This makes it impossible to know what's going on until the job times out or is cancelled. I'm leaving this bug open for this issue to be fixed.

@Lastique how did you fix the hanging job without getting a log about why it might be hanging?
I have a NestJS/DigitalOcean CI which hangs on the build command until it times out, same thing as everyone else; no log except for this in the raw log:

2023-11-23T23:12:57.3718126Z Waiting for a runner to pick up this job...
2023-11-23T23:12:57.8616058Z Job is waiting for a hosted runner to come online.
2023-11-23T23:13:00.7732105Z Job is about to start running on the hosted runner: GitHub Actions 2 (hosted)

Locally it builds fine on my Mac, macOS Ventura 13.6.1 (22G313).

@Lastique
Author

@Lastique how did you fix the hanging job without getting a log about why it might be hanging?

I don't remember the details at this point, but I was able to debug the problem. I think the log was visible if I was monitoring the job from its start, or after it was cancelled.

That doesn't seem like your case though. In your case the job doesn't start for whatever reason.

macOS runners are a scarce resource on GHA; you do sometimes have to wait a long time for a runner to become available. Again, I'm not sure if that's what happens in your case.

@MhdSadd

MhdSadd commented Nov 24, 2023

@Lastique so my action is actually configured to run on ubuntu-latest, and it was running fine before. Recently, though, it starts the job and goes through the steps until it gets to the build step (this is the step it hangs on, until it times out).
The CI is meant to clone and deploy a NestJS microservice via SSH to DigitalOcean.

@MhdSadd

MhdSadd commented Nov 27, 2023

Turns out my own issue is memory-related. We use DigitalOcean droplets to deploy several microservices, and each time we try to deploy a service while low on memory, the OOM killer terminates the process, but the action keeps trying until it times out. Increasing our resources and adding swap memory fixed it.
My question, though, is: isn't there a better way for the action to notify us after these attempts, or throw some sort of out-of-memory error in its log?
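
For anyone hitting the same situation on a small droplet: the swap part of such a fix is a one-time change on the server itself, not in the workflow. A minimal sketch, run as root on the droplet (the 2G size is only an example):

fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab   # keep the swap file across reboots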

@talregev

talregev commented Jan 8, 2024

This still happens.
