Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safe_sleep.sh inadvertently top 100% CPU #2380

Open
leleobhz opened this issue Jan 18, 2023 · 6 comments
Open

safe_sleep.sh inadvertently top 100% CPU #2380

leleobhz opened this issue Jan 18, 2023 · 6 comments
Labels
bug Something isn't working good-first-issue keep Label can be added as soon as we are sure the work on the issue is necessary

Comments

@leleobhz
Copy link

Describe the bug

safe_sleep.sh is built as while loop aiming to be safer than count on sleep or ping binaries. While loop the way function was implemented leads to 100% CPU usage without try another alternatives (Eg. Testing if sleep is available).

To Reproduce
Steps to reproduce the behavior:

  1. Start runner
  2. Wait

Expected behavior

A lesser CPU usage from runner while idle.

Actual behavior

image

Runner Version and Platform

Linux 2.300.2 running on Docker.

@leleobhz leleobhz added the bug Something isn't working label Jan 18, 2023
@AvaStancu AvaStancu added good-first-issue keep Label can be added as soon as we are sure the work on the issue is necessary labels Feb 1, 2023
@fhammerl
Copy link
Contributor

fhammerl commented Feb 1, 2023

Thanks @leleobhz, labeled the issue.

good-first-issue keep

This barebones implementation for safe_sleep.sh was chosen for its broad availability on most platforms.

A better 'safe sleep' must

  • NOT use 100% CPU
  • Be compatible with all platforms the runner supports (some installations might not have sleep or ping)

A good start would be to check for the existence of such utils before resorting to the above CPU intensive counting

@LiranBarton
Copy link

Hey @fhammerl, was this bug fixed?

We have runners in ASG that scales to MAX because of this script.
Any suggestions ? (did not fully understand why we need this script - can we remove it on boot?)

@leleobhz
Copy link
Author

leleobhz commented May 8, 2023

We have runners in ASG that scales to MAX because of this script. Any suggestions ? (did not fully understand why we need this script - can we remove it on boot?)

@LiranBarton hello!

This script as far I understood the reasons aims to have a minimal sleep() implementation for environments that does'nt have one using pure shell.

Main questions here is this implementation lead's to CPU ocupation and may be uses only if sleep are not present on system, as last fallback.

I suggest this way to handle sleep() issues if it's really needed (But personally I can't see why sleep does need a pure-shell implementation here since we have both bash or busybox that someway implement this function. Also, maybe embed busybox a better solution instead custom implementation).

@tlhakhan
Copy link

tlhakhan commented Jun 25, 2023

I also noticed high CPU utilization on my nodes as well, and it ended up being the safe_sleep.sh script. See below of a snippet of my top output.

top - 00:34:09 up 17 days, 16:44,  2 users,  load average: 4.13, 4.24, 4.16
Tasks: 380 total,   5 running, 375 sleeping,   0 stopped,   0 zombie
%Cpu(s): 50.5 us,  0.3 sy,  0.0 ni, 49.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  24013.8 total,   4765.2 free,   2526.9 used,  16721.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  20531.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 441516 1001      20   0    4360   1424   1268 R 100.0   0.0  64:04.64 safe_sleep.sh
 441518 1001      20   0    4360   1480   1324 R 100.0   0.0  64:02.87 safe_sleep.sh
3655252 1001      20   0    4360   1528   1372 R 100.0   0.0   1660:54 safe_sleep.sh
3655285 1001      20   0    4360   1480   1320 R 100.0   0.0   1660:53 safe_sleep.sh
   1237 root      20   0   10.7g  95480  34908 S   2.3   0.4 326:20.76 etcd
   1225 root      20   0 1335604 496236  74396 S   1.7   2.0 525:00.35 kube-apiserver
    776 root      20   0 2381692 131208  65896 S   0.7   0.5 470:19.95 kubelet
   4125 root      20   0 2199256  75772  45664 S   0.7   0.3 110:59.80 calico-node

I modified safe_sleep.sh script to use sleep instead. After the small change, I see lower CPU utilization on my nodes.

I overwrote the safe_sleep.sh in my actions-runner image:

Edit 2023-06-25:
Below is the difference on my hypervisor CPU utilization after the pushing the safe_sleep.sh override on all my worker nodes.
Screenshot 2023-06-25 at 11 36 23 AM

I haven't noticed any issues with my runners after these changes (🤞).
Screenshot 2023-06-25 at 11 43 33 AM

@tlhakhan
Copy link

tlhakhan commented Jul 3, 2023

Just a note, the above fix wasn't permanent because the actions runner seems to automatically update over time and pulls the original safe_sleep.sh implementation.

A snippet of the runner logs, where it performs a update from 2.299.1 to 2.305.0.

Current runner version: '2.299.1'
2023-07-02 14:39:21Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.305.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner after successful update

My top output:

top - 11:53:49 up 25 days,  7:46,  1 user,  load average: 2.16, 2.12, 2.17
Tasks: 345 total,   3 running, 342 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.6 us,  0.0 sy,  0.0 ni, 74.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15973.8 total,   6997.4 free,   2751.9 used,   6224.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  12887.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 291974 1001      20   0    4360    792    636 R 100.0   0.0   1513:58 safe_sleep.sh
 291976 1001      20   0    4360    844    688 R 100.0   0.0   1514:00 safe_sleep.sh
      1 root      20   0  164864   7432   3956 S   0.0   0.0  21:50.24 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.94 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq

Edit:
I found the following blog post that stops the self-update on the runners.
https://github.blog/changelog/2022-02-01-github-actions-self-hosted-runners-can-now-disable-automatic-updates/.

@taliastocks
Copy link

There are already a bunch of binaries bundled with the runner, e.g. bin/Runner.Listener. If this is really a huge cross-platform concern, can't it be solved by distributing something like bin/Sleep to implement sleep? Rather than a bash busy loop?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good-first-issue keep Label can be added as soon as we are sure the work on the issue is necessary
Projects
None yet
Development

No branches or pull requests

6 participants