
GitHub Actions step is hanging until timeout #1326

Closed
Lastique opened this issue Sep 13, 2021 · 24 comments
Labels: bug Something isn't working

@Lastique

Describe the bug

At some point, a GitHub Actions step stops producing output and hangs until the workflow is terminated by timeout.

You can see the output stop being produced if you actively watch the job output in the browser. If you try opening an already hanging job in the browser, the step output is not displayed at all: the ">" sign next to the step is not visible, and the icon next to it is a spinning yellow circle. The output becomes available once the workflow is terminated.

Here is an example of one such run. Here is the log with timestamps:

boost_log_gha_hang.txt.gz

I'm sure the processes running in the GHA step are not hanging (this is just a C++ compilation process, which does succeed normally).

This problem does not seem to be unique to my setup; there are other reports as well. In my case it happens most often with the macos-10.15 runner, but I've also seen it happen with the ubuntu-20.04 runner.

To Reproduce

At the moment, this reproduces quite reliably by running the workflow on my Boost.Log repository (git revision boostorg/log@2f058c6). I intend to try working around the problem (e.g. by disabling the offending runner), so this may not stay reproducible in later revisions.

Expected behavior

The runner should not hang.

Runner Version and Platform

I'm using the free runners hosted by GitHub.

@Lastique Lastique added the bug Something isn't working label Sep 13, 2021
Lastique added a commit to boostorg/log that referenced this issue Sep 13, 2021
This is another attempt to work around the CI job hanging, as reported
in actions/runner#1326.
@Lastique
Author

After deeper investigation it turned out that a process indeed could hang while executing a job step. After this was fixed, the job no longer hangs.

However, I still find it incorrect that I could not access step logs while the job was hanging. This makes it impossible to know what's going on until the job times out or is cancelled. I'm leaving this bug open for this issue to be fixed.

@phadej

phadej commented Nov 10, 2021

I have seen this consistently lately, e.g.

This is very frustrating, especially knowing that a runner spins there for 6 hours not doing anything meaningful.

@phadej

phadej commented Nov 11, 2021

Another job: https://github.com/haskell/aeson/runs/4178377747?check_suite_focus=true These seem to always fail at the same spot, which is good, as the failure is somewhat deterministic. But I have no idea what it may be. I added a (quite conservative) limit on the memory the compiler could use, but that doesn't seem to help. Sometimes jobs succeed, sometimes they don't. I still suspect that it is some limit the jobs hit, and then they are lost.

@phadej

phadej commented Nov 11, 2021

I tried out my out-of-memory hypothesis: when memory runs low, the whole job becomes a lot slower and unresponsive.

When I removed the runtime system memory limit, the job failed.

In these cases the job didn't time out, however; it was unresponsive and logs weren't streamed.

Can high memory usage cause a slowdown of the job? Is there any way to get analytics on how many resources the job runners have used?
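
(One rough way to get such a view, sketched here rather than an official feature, is to run a background loop that periodically logs memory and disk usage; the job name and build command below are placeholders.)

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Start resource monitor (sketch)
        run: |
          # Print a memory/disk snapshot every 30 seconds while later steps run.
          while true; do
            echo "== $(date -u) =="
            free -m
            df -h /
            sleep 30
          done > "$RUNNER_TEMP/resource-usage.log" 2>&1 &
      - name: Build
        run: make   # placeholder for the real build command
      - name: Show resource log
        if: always()
        run: cat "$RUNNER_TEMP/resource-usage.log"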

@chris-griffin

One observation is that while the timed-out step will not show any log detail in the Actions GUI (it will continue to render as in-progress/yellow), detailed logs of that step are available via the gear icon => "View raw logs".

[Screenshots: the hung step in the Actions UI, and the gear-icon menu with "View raw logs"]

@nikola-jokic nikola-jokic self-assigned this Mar 10, 2022
@nikola-jokic
Member

Hi everyone,

I will close this issue now since @chris-griffin explained how you can view raw logs. Thank you @chris-griffin!

@phadej, if you are still seeing this issue, and if you are running your workflows on your self-hosted runner, could you double-check that you have enough computing power? It might be due to that. If you cannot find what is causing this issue, please submit a new issue with all the relevant information so we can reproduce it and see what is causing you this trouble.

@phadej

phadej commented Mar 16, 2022

@nikola-jokic all jobs above were run on GitHub provided machines.

I don't agree that this issue is resolved. Jobs time out for non-obvious reasons, and I cannot verify that the raw logs would have told the reason, as they are no longer available for those jobs. Bummer.

@nikola-jokic
Member

Hi @phadej, I can take a look, no problem. But when I went to look at your workflows, the logs had already expired. That is why I asked you to file a new issue so we can take a look. Maybe you can also suggest how we could improve logging to help with that issue?

@jeacott1

I'm having a lot of grief with this issue recently.
Mostly it's hung at "Job is about to start running on the hosted runner: Hosted Agent (hosted)" for hours!

@nikola-jokic
Member

nikola-jokic commented May 12, 2022

Hi @jeacott1,

If the job is about to start but never actually starts and just hangs there, the job has not landed on the runner. It might be due to the resolved incident.

If the issue persists, please post your feedback on the GitHub Community Support Forum, which is actively monitored. Using the forum ensures that we route your problem to the correct team. ☺️

@adarkbandwidth

We are seeing this exact issue as well. I know this issue is closed, but no attempt was made to fix it here. Frustrating.

@nikola-jokic
Member

Hi @adarkbandwidth,

Did you post this issue to the GitHub Community Support Forum? Since the workflow did not land on the runner, the issue does not belong to this repository. That is the reason why it is closed.

Please, post your feedback to the Community Support Forum ☺️

@bryanmacfarlane
Member

bryanmacfarlane commented Jul 1, 2022

There is likely not a bug/issue to fix here.

As mentioned above, a good thing to do is set a timeout on the job. If it typically takes 5 minutes under normal conditions, for example, setting a timeout of 20 minutes will help free the concurrency slot (the self-hosted machine, or cost if hosted) in the event of a hang.
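
For reference, a minimal sketch of that advice (the job name, runner image, and build command are placeholders):

jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 20        # free the concurrency slot if the job hangs
    steps:
      - uses: actions/checkout@v3
      - name: Build
        timeout-minutes: 15    # a per-step limit is also available
        run: make              # placeholder build command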

Outside of getting the raw logs (which may or may not have more details, depending on the output of the process the step is running), you can also try running on a self-hosted VM (just for debugging) that you have access to, so you can log in and watch resources etc. It's a good debugging technique. You can even install the runner on your dev desktop, though note that it might not match the resources of the hosted VM. If you want a VM that roughly matches its characteristics, you can get an Azure DS2v2 VM with 2 cores.
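
A sketch of routing the same job to such a machine, assuming a Linux self-hosted runner registered with the default labels:

jobs:
  build:
    runs-on: [self-hosted, linux, x64]   # a machine you can log in to and watch with top, df, etc.
    steps:
      - uses: actions/checkout@v3
      - name: Build
        run: make   # placeholder build command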

There is a potential feature to get better visibility into what's happening on the hosted machine: debugging into the hosted machine/runner, exposing perf counters, getting a dump of a process, etc. Out-of-disk-space detection is another potential feature. That has been discussed and is on the feature backlog, but it is not bug-level work. We could even have the runner detect no output after some amount of time, but even that doesn't fix anything; it's just a more efficient timeout option.

A clarification of typical issues:

  • Test hang / deadlock: the runner runs a test harness, which runs your tests. If the tests deadlock, the test process just hangs and no output is created. The tests might not hang on your desktop, but on another machine/VM with different I/O characteristics the fragile code can be exposed and deadlock. This could also happen in any process you supply for the runner to run.
    There is nothing the runner can do here. It's waiting for output and a return code.
  • Resource issues (OOM): the runner runs a step which runs the processes/command lines you provide. If those processes use too much memory, they can starve other processes on the machine. On Linux we run your steps in a cgroup to ensure you can only use 80% of the memory and not starve the runner of resources; the Linux OOM killer will kill your process and it will fail. On other OSes there isn't that capability, and a runaway process using too many resources can actually starve the runner of memory, CPU, I/O, etc.
  • Out of disk space: if the processes you supply in steps run the machine out of disk space, it can be challenging to figure out what's going on, as the runner cannot even create logs to send up to the service. As mentioned above, there is a potential feature to monitor resources and get better visibility; in this case, we could monitor when x% is reached and add a warning to the build. That falls under the feature work mentioned above. Note that this won't lead to the kind of hang mentioned here: the runner maintains a heartbeat (renews a lock) to the service, and if it "goes away" the server will abandon the job. So that's likely not your issue.

Hope that helps clarify many of the common issues, challenges, and potential features.

@adarkbandwidth

Thank you for this detailed response. We're focusing on working through these issues. We're not using a self-hosted runner, so we can't observe behavior outside of the GitHub logs (which are missing in this case), but we can observe the outputs for some steps.

@adarkbandwidth

FWIW, we realized that when we cancel the action, we do then get logs from the hosted runner, and we're able to troubleshoot from there. Our code is hanging—not GHA.

@fringedgentian

We had this same issue, and it was caused by including Apache commons-io 2.9 or later. We reverted to 2.7 and the tests ran without issue.

implementation 'commons-io:commons-io:2.9.0'

@prein

prein commented Jun 22, 2023

Would it be a good idea (if possible) to implement something like an inactivity / idle timeout in addition to the already implemented "general" timeout? Imagine a job that is expected to run for 20 minutes on average. I set the timeout to 25 minutes. Even if the job hangs in the first minute of execution, it will be killed only after 25 minutes. If there were an idle timeout that I could set to 5 minutes, then a job that hangs in the first minute of the run would be killed in the 6th minute, saving 19 minutes in my example.
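
In the meantime, a crude approximation can be scripted inside the step itself. A sketch, assuming a Linux runner and a hypothetical ./build.sh standing in for the real command:

      - name: Build with an inactivity watchdog (sketch)
        timeout-minutes: 25            # the overall cap still applies
        run: |
          ./build.sh > build.log 2>&1 &     # hypothetical build command
          pid=$!
          while kill -0 "$pid" 2>/dev/null; do
            sleep 30
            idle=$(( $(date +%s) - $(stat -c %Y build.log) ))   # seconds since last output (GNU stat)
            if [ "$idle" -gt 300 ]; then
              echo "No output for ${idle}s; assuming the step is hung"
              kill "$pid"
              exit 1
            fi
          done
          cat build.log
          wait "$pid"                   # propagate the build's exit status

The trade-off is that the step's output is only shown once it finishes or is killed; streaming it live would need a bit more plumbing.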

@prein

prein commented Jun 22, 2023

@MhdSadd

MhdSadd commented Nov 24, 2023

After deeper investigation it turned out that a process indeed could hang while executing a job step. After this was fixed, the job no longer hangs.

However, I still find it incorrect that I could not access step logs while the job was hanging. This makes it impossible to know what's going on until the job times out or is cancelled. I'm leaving this bug open for this issue to be fixed.

@Lastique how did you fix the hanging job without getting a log about why it might be hanging?
I have a NestJS/DigitalOcean CI which hangs on the build command until it times out, same thing as everyone else; no log except for this in the raw log:

2023-11-23T23:12:57.3718126Z Waiting for a runner to pick up this job...
2023-11-23T23:12:57.8616058Z Job is waiting for a hosted runner to come online.
2023-11-23T23:13:00.7732105Z Job is about to start running on the hosted runner: GitHub Actions 2 (hosted)

Locally it builds fine on my Mac, macOS Ventura 13.6.1 (22G313).

@Lastique
Author

@Lastique how did you fix the hanging job without getting a log about why it might be hanging?

I don't remember the details at this point, but I was able to debug the problem. I think the log was visible if I was monitoring the job from its start, or after it was cancelled.

That doesn't seem like your case though. In your case the job doesn't start for whatever reason.

macOS runners are a scarce resource on GHA; you do sometimes have to wait a long time for a runner to become available. Again, I'm not sure if that's what happens in your case.

@MhdSadd

MhdSadd commented Nov 24, 2023

@Lastique so my action is actually configured to run on ubuntu-latest, and it was running fine before. Recently, though, it starts the job and goes through the steps until it gets to the build step (this is the step it hangs on, until it times out).
The CI is meant to clone and deploy a NestJS microservice via SSH to DigitalOcean.

@MhdSadd

MhdSadd commented Nov 27, 2023

Turns out my own issue is memory-related. We use DigitalOcean droplets to deploy several microservices, and each time we try to deploy a service while low on memory, the OOM killer terminates the process, but the action keeps trying until it times out. Increasing our resources and adding swap memory fixed it.
My question, though, is: isn't there a better way for the action to notify us after these attempts, or throw some sort of out-of-memory error in its log?
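
For anyone hitting the same situation on a small droplet: the swap part of such a fix is a one-time change on the server itself, not in the workflow. A minimal sketch, run as root on the droplet (the 2G size is only an example):

fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab   # keep the swap file across reboots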

@talregev

talregev commented Jan 8, 2024

This still happens.
