
Rootless podman won't start when exposing ports #2942

Closed
duritong opened this issue Apr 15, 2019 · 13 comments · Fixed by #3162
Labels
kind/bug · locked - please file new issue/PR · rootless

Comments

duritong commented Apr 15, 2019

/kind bug

Description

I am trying to run a rootless podman container that exposes ports on CentOS 7. With the same package versions this works fine on Fedora 29, but on CentOS 7 it does not.

When a rootless container is trying to start, it even blocks any other podman invocation by the same user.

I am not entirely sure whether this is actually a bug in podman, runc, or slirp4netns, but I assumed opening it here is the best place to get started.

For what I am trying to achieve (rootless containers through podman on CentOS 7, ultimately managed by systemd), I backported the current podman, runc, and slirp4netns versions I have on Fedora 29 to CentOS 7 (package rebuilds); on Fedora 29 this setup works fine.
Additionally, I have the new shadow-utils from @vbatts, from https://copr.fedorainfracloud.org/coprs/vbatts/shadow-utils-newxidmap/

These are the packages:

podman.x86_64 0:1.2.0-1555257720.git3bd528e5.el7                              
runc.x86_64 2:1.0.0-89.dev.git029124d.el7
shadow-utils46-newxidmap.x86_64 2:4.6-4.el7                                   
slirp4netns.x86_64 0:0.3.0-1.alpha.2.git30883b5.el7 

When I try the following, things just hang:

$ whoami
testuser
$ podman run -p 8080:80 --name test docker.io/nginx:latest

The container is never started and podman just hangs there.

When trying to investigate with podman, it just hangs as well:

$ su - testuser -c 'podman ps'
# hangs

Investigating further shows that it hangs waiting on a futex:

[snip]
[pid 15007] write(2, "\33[37mDEBU\33[0m[0000] Setting maxi"..., 66DEBU[0000] Setting maximum workers to 8                 
) = 66
[pid 15007] openat(AT_FDCWD, "/home/testuser/.local/share/containers/storage/libpod/bolt_state.db", O_RDWR|O_CREAT|O_CLOEXEC, 0600) = 5
[pid 15007] epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3527261952, u64=140096770522880}}) = -1 EPERM (Operation not permitted)
[pid 15007] epoll_ctl(4, EPOLL_CTL_DEL, 5, 0xc000559194) = -1 EPERM (Operation not permitted)
[pid 15007] flock(5, LOCK_EX|LOCK_NB)   = 0
[pid 15007] fstat(5, {st_mode=S_IFREG|0600, st_size=131072, ...}) = 0
[pid 15007] pread64(5, "\0\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\355\332\f\355\2\0\0\0\0\20\0\0\0\0\0\0"..., 4096, 0) = 4096
[pid 15007] fstat(5, {st_mode=S_IFREG|0600, st_size=131072, ...}) = 0
[pid 15007] mmap(NULL, 131072, PROT_READ, MAP_SHARED, 5, 0) = 0x7f6ad2382000
[pid 15007] madvise(0x7f6ad2382000, 131072, MADV_RANDOM) = 0
[pid 15007] epoll_pwait(4, [], 128, 0, NULL, 64) = 0
[pid 15007] munmap(0x7f6ad2382000, 131072) = 0
[pid 15007] flock(5, LOCK_UN)           = 0
[pid 15007] close(5)                    = 0
[pid 15007] futex(0x7f6ad23a2040, FUTEX_WAIT, 2147498528, NULL <unfinished ...>
[pid 15003] <... sched_yield resumed> ) = 0
[pid 15003] futex(0x2a51190, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 15003] epoll_pwait(4, [], 128, 0, NULL, 1479) = 0
[pid 15003] nanosleep({0, 20000}, NULL) = 0
[pid 15003] futex(0x2a59980, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15002] <... futex resumed> )       = 0
[pid 15003] nanosleep({0, 20000},  <unfinished ...>
[pid 15002] epoll_pwait(4, [], 128, 0, NULL, 1479) = 0
[pid 15002] futex(0x2a59980, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 15003] <... nanosleep resumed> NULL) = 0
[pid 15003] futex(0x2a51270, FUTEX_WAIT_PRIVATE, 0, {60, 0}

You can also observe this through ps (https://access.redhat.com/solutions/237383):

Sl+  14874 14874 14874 14812 podman          do_wait
Sl+  14874 14875 14874 14812 podman          futex_wait_queue_me
Sl+  14874 14876 14874 14812 podman          futex_wait_queue_me
Sl+  14874 14877 14874 14812 podman          futex_wait_queue_me
Sl+  14874 14878 14874 14812 podman          futex_wait_queue_me
Sl+  14874 14879 14874 14812 podman          futex_wait_queue_me
Sl+  14874 14885 14874 14812 podman          futex_wait_queue_me
Sl+  14880 14880 14874 14874 podman          futex_wait_queue_me
Sl+  14880 14883 14874 14874 podman          futex_wait_queue_me
Sl+  14880 14884 14874 14874 podman          futex_wait_queue_me
Sl+  14880 14886 14874 14874 podman          futex_wait_queue_me
Sl+  14880 14887 14874 14874 podman          ep_poll
Sl+  14880 14888 14874 14874 podman          futex_wait_queue_me
Sl+  14880 14889 14874 14874 podman          ep_poll
Sl+  14880 14930 14874 14874 podman          futex_wait_queue_me
Ssl  14891 14891 14891     1 conmon          poll_schedule_timeout
Ssl  14891 14893 14891     1 gmain           poll_schedule_timeout
Ssl  14899 14899 14899 14891 runc:[2:INIT]   pipe_wait
Ssl  14899 14901 14899 14891 runc:[2:INIT]   futex_wait_queue_me
Ssl  14899 14902 14899 14891 runc:[2:INIT]   futex_wait_queue_me
Ssl  14899 14903 14899 14891 runc:[2:INIT]   futex_wait_queue_me
Ssl  14899 14904 14899 14891 runc:[2:INIT]   futex_wait_queue_me
S    14907 14907 14907 14880 slirp4netns     poll_schedule_timeout
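A listing in this shape can be produced with something along these lines (the exact column selection is an assumption based on the linked article):

$ ps -eL -o stat,pid,lwp,tgid,ppid,comm,wchan | grep -E 'podman|conmon|runc|slirp'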

Or on the podman process itself:

$ strace -ff -p 14874
strace: Process 14874 attached with 7 threads
[pid 14885] futex(0xc00070d640, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14879] futex(0xc00079ebc0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14878] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 14877] futex(0x2a77978, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14876] futex(0x2a77aa0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14875] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 14874] wait4(14880,  <unfinished ...>
[pid 14878] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 14878] futex(0x2a51270, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 14875] <... restart_syscall resumed> ) = 0
[pid 14878] futex(0xc00070d640, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 14875] sched_yield( <unfinished ...>
[pid 14885] <... futex resumed> )       = 0
[pid 14878] <... futex resumed> )       = 1
[pid 14885] futex(0xc00070d640, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14878] futex(0x2a5d580, FUTEX_WAIT_PRIVATE, 0, {29, 999243748} <unfinished ...>
[pid 14875] <... sched_yield resumed> ) = 0
[pid 14875] futex(0x2a51190, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 14875] epoll_pwait(4, [], 128, 0, NULL, 1622) = 0
[pid 14875] nanosleep({0, 20000}, NULL) = 0
[pid 14875] futex(0x2a51270, FUTEX_WAIT_PRIVATE, 0, {60, 0}

runc seems to wait on slirp4netns:

$ strace -ff -p 14899
strace: Process 14899 attached with 5 threads
[pid 14904] futex(0xc0000f3d40, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14903] futex(0x563746d01738, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14902] futex(0x563746d01820, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 14901] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 14899] openat(AT_FDCWD, "/proc/self/fd/4", O_WRONLY|O_CLOEXEC <unfinished ...>
[pid 14901] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 14901] epoll_pwait(7, [], 128, 0, NULL, 1682) = 0
[pid 14901] nanosleep({0, 20000}, NULL) = 0
[pid 14901] futex(0x563746ce4730, FUTEX_WAIT_PRIVATE, 0, {60, 0}
$ ls -l /proc/14899/fd/4
l---------. 1 testuser testuser 64 Apr 15 13:11 /proc/14899/fd/4 -> /tmp/run-1001/runc/b8645995e6ba51ec84c2f64fb50a94b541926410ca34b7b42a3657045b7dc1ef/exec.fifo
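That open is expected to block: opening a FIFO write-only blocks until another process opens it for reading, which is how runc's create/start handshake works. The semantics are easy to reproduce on their own (illustrative):

$ mkfifo /tmp/demo.fifo
$ exec 3>/tmp/demo.fifo    # blocks here until some other process opens the FIFO for reading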

While slirp4netns seems to poll the interface:

$ strace -ff -p 14907
strace: Process 14907 attached
restart_syscall(<... resuming interrupted poll ...>) = 0
poll([{fd=7, events=POLLIN|POLLHUP}, {fd=3, events=POLLHUP}, {fd=5, events=POLLIN|POLLHUP}], 3, 1000) = 0 (Timeout)
poll([{fd=7, events=POLLIN|POLLHUP}, {fd=3, events=POLLHUP}, {fd=5, events=POLLIN|POLLHUP}], 3, 1000) = 0 (Timeout)
poll([{fd=7, events=POLLIN|POLLHUP}, {fd=3, events=POLLHUP}, {fd=5, events=POLLIN|POLLHUP}], 3, 1000) = 0 (Timeout)
poll([{fd=7, events=POLLIN|POLLHUP}, {fd=3, events=POLLHUP}, {fd=5, events=POLLIN|POLLHUP}], 3, 1000) = 0 (Timeout)
poll([{fd=7, events=POLLIN|POLLHUP}, {fd=3, events=POLLHUP}, {fd=5, events=POLLIN|POLLHUP}], 3, 1000) = 0 (Timeout)
poll([{fd=7, events=POLLIN|POLLHUP}, {fd=3, events=POLLHUP}, {fd=5, events=POLLIN|POLLHUP}], 3, 1000^Cstrace: Process 14907 detached
 <detached ...>

$ ls -l /proc/14907/fd/7
lrwx------. 1 testuser testuser 64 Apr 15 13:12 /proc/14907/fd/7 -> /dev/net/tun

Starting the container without exposing the port works fine:

$ whoami
testuser
$ podman run --name test docker.io/nginx:latest
$ ps aux | grep nginx
testuser 15117  1.2  4.6 532244 23192 pts/1    Sl+  13:15   0:00 podman run --name test docker.io/nginx:latest
testuser 15122  8.8  6.9 540696 34804 pts/1    Sl+  13:15   0:00 podman run --name test docker.io/nginx:latest
testuser 15142  0.7  0.6  32644  3260 ?        Ss   13:15   0:00 nginx: master process nginx -g daemon off;
100100   15156  0.0  0.3  33100  1592 ?        S    13:15   0:00 nginx: worker process

What is missing to have port binding in rootless mode on EL7?

To more easily reproduce the environment one can use the following Vagrantfile:

Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"

  config.vm.provision "shell", inline: <<-SHELL
    set -e
    yum update -y
    yum install tmux -y
    cat <<-EOF > /etc/yum.repos.d/glei.repo
[glei]
name=CentOS-\$releasever - glei
baseurl=http://yum.glei.ch/el7/x86_64/
enabled=1
gpgcheck=1
priority=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-glei
timeout=10
metadata_expire=300
EOF
    cat <<-EOF > /etc/pki/rpm-gpg/RPM-GPG-KEY-glei
-----BEGIN PGP PUBLIC KEY BLOCK-----

mQINBFsqViwBEADcu6AP6M1Z3zlcv5+zS/qoMQQWpC+BuChY67Um7VMwdlK1MLE9
fYH4ExlUYc8xC52g/W56tNAngzDNsI7BvtcTf0tw1j4kk7HT5qFmFVW4w1cgaJDD
k12SXNYnNYDYYCL8OgLsT1XXVeUaOO3imRC3Ta/zk3yvdr2TIitApUQQ5kfwyxEs
NVs3VFbmSmSvAdQ4+BulSU8lOTgiN5K2vds8fMOhJ7vXWOBoGGn2xMMDCCvt1XTW
JqSJ0XeaPB1XfyZ37dcBIIwqsUogJXmXwp4kCZhEwV04zwcz6WIJJWAZUz2/GnWL
8Xg8vtORL37DYHV/Rr9sPjfdmR8939tWgydKm9B46+0YbamjgYJr/sT81CpzPK57
58xehkHpBIPqkwsEz8kHGIVltQlXjrVg7MBWvpso1Ub348A2h9Y7ZJ2GSMh0SdTz
jl0GqZ7s2J6sjYCNuCdNUjLp0aWqHBtz6AyvtSaIMxDsKILILBSDGqYBdUbACVhv
hsPr3JFM337bdyygGHryLVhgAbLrRLGfErhJCHYg7anEO4gFVwuU3N0owI3gwKyF
9GZgHv7yzA96vjmUVhDBSX39jMNzNst1JEEmGZ1lekKh/GmWKF9Ie7nJxe4s3Plg
f2bp82UNVD2+c1mJinGVgJEIVTO0E0MwcCkO+9jgUUKX0SNgL7yLAIQzpwARAQAB
tFlHbGVpLmNoIFBhY2thZ2VzIChVc2VkIHRvIHNpZ24gcGFja2FnZXMgYW5kIHNv
ZnR3YXJlIGZvdW5kIG9uIGdsZWkuY2gpIDxwYWNrYWdlc0BnbGVpLmNoPokCVAQT
AQoAPhYhBDI6YGQYCWNykksGY/g0SdQ37BWGBQJbKlYsAhsDBQkDwmcABQsJCAcD
BRUKCQgLBRYCAwEAAh4BAheAAAoJEPg0SdQ37BWGyKkQAImOgfWJTPo/y0Xg5Ppo
+YfpQp1umlLv3H++hU4pjnmXZCEtGrsK5JNaxREeIkqz3RCWSN6qbqW6b37Q07J7
bmozgnKqKV2t3VcNSyLGKCLiTdqYbQzg8kyEibsVz3mMjjKtcqqZkSXNm1qNweHo
NhWDdY7EDz6FH2kfukou7rEMRDoh73j389p1U4nL07kC1kGf7+dsXpI33SjftWDp
2mMhApfDh4s6wKG8S4UrtGw/8QDMD4Q4MsgxMK5L2nEUwG+quA9GGYditxM1ZzPU
paDyHKT7mpQ8sZCIHLnLthAEXMQkMgleZv/PNjfOgU9A1YgztbEMRKIrmjib2DWu
bMGu+aSaHqggzWLHgl1ROugE7fsv/hSDq2jZIHhH9hE/hXlrrlE9vcfy5xaaFKRH
pH33Y1yacvLUHozVl0OS2Z8OgENNUdPyAA/medoENJc6doM45CLqzBUUQvUBM7uS
MU5BY33ocM9/kf0Q9cwRzFaUJ31rZRaw37VYVB6gHhSs6pXGcqe9KMYjayQHIb73
hfaw7b/E7VkGJjgQP4zKSJQ2IbXN+tGaxXNbmqiBoyeBoBaZiO3G06aRV9OX4n1S
nhwnuA4BQ8OiUGNjoY5fXF3vZSmHwBIYkJ9gA6YnzjcBiXapIogjv8HPoGovIESZ
isbql4d9fRHaigs/GFkfg2gQuQINBFsqViwBEADfmOwHpZKNJpjASVsybVvy42xE
ienAiYsKm+7JygQodXel14kV2tvDGQl6mFMmeI/z/R3B0swGXV1ta9WC26AOf0tn
a785qp2mfSaGARlcZrriCM7Psk/Ut4Fwx0grcwprjV+JXNA0Aw3Bj1vBO3FRGupi
Z+Mw2hT3vdOU7fWAUyUz0ptJRBTlEcnHE2o7lyYqv6y/LW5cZMey73PrqqQj7BSh
gCGQGZjqpplrkNruPQv2ABAYMuoASK2aGUs+VjyO2VgBaA1hHbqnMeOMPZxOWZvT
fkhGkwOb57CLZ2eelM0WkHHyO+TqZ7cF5QrDm0RfdCM0h0vMC5h9mHMq+xzArsM6
ukXu/9gpzvkDO3V6JQMSIqtdX9zBVhzS7fKMsvaj0seZn9AIFp7Dr5YRcZ/aEYpu
BOxI0kbbhJ/2pNmH8MQGOVQIHXAznIaoc3al5Xv5EXYUGweRcdriLGXfjpaU2n3R
5vkch1aAT9Ych9MWUlpDdsY1xjS7IFUBpZTr8LuMGN9vezKV5ZpP9pgfgC5k4vSg
ZZHucZxsM9PAItOsPFj/GBPSpEjOMYjfbRbxWBfunJJkVyShQL4ykKLvCIgKT/8w
PmqWcO9u37L+useHh97rPZM6djU88Unr36H7fsPph1qlDhHabBQq+Z32LipqfJ+o
Ok71SY8GOOzOmGfGGQARAQABiQI8BBgBCgAmFiEEMjpgZBgJY3KSSwZj+DRJ1Dfs
FYYFAlsqViwCGwwFCQPCZwAACgkQ+DRJ1DfsFYb1wxAAzFuiRLWFvj29cjXlN7g0
NvnkzrQGIn+tgnLRa4SwvE8mqXLZoIPwdza3Gf7/CaFsZkfRf/d6yXPSbPNVFBT5
rEMKKu6eUV/SjQzK7yfhTfX2kllHf0hNxdZqXnN9pWOFAezJWpOJ+JqM28KLzrCj
53kx/11MpGDZqSB4JecnN2txs2ZkfzOi8L35BSxM8dgX1pIPgx+ZKu5IsfYH7YHb
TX0g+wfSyKLUoKHpu15BwktAFZUjv1JnMEocKqcy7lzQugp39uyI+iq6nx33V+2S
nTN3vWnv5Y2BiI9rVaoxHWXrbTWtj3GbQ3Unu14eO02GcoyImDNyD478RAn/LEo3
fGj3pmJwi1Wlm/oTErn3A16yL/JuOKFS7cxrihTac7caPI3K6xw7QkiQn2AYCFQH
5VZ0PR4bVgQNphgAK1R+xdaVyPKrg0Amuig2Szt9WCVy+tgBCGgyIj4VNiQsKNIa
R//u8nH+YUtxku52kRAIoH9aExxbYS732Mjcu1N6+1qiT6DXfdtzkP7JiyhzewMm
nBS4xlpwt7FKT7vppzJrMj1IyepFpS0w6ESaCHkFLuRe/FsdxZt7QGVpl5owUGKr
FkN0Krzhq3krECj7rFnWWrpIwKIcu8JL0/cm5ZLXFJjZOjn4UKMkhrkhI2k16SQe
Soh7BsWhhM/tKfuHGZB+4P4=
=VLNm
-----END PGP PUBLIC KEY BLOCK-----
EOF
    yum install -y podman shadow-utils46-newxidmap slirp4netns runc
    echo "user.max_user_namespaces=28633" > /etc/sysctl.d/userns.conf
    sysctl -p /etc/sysctl.d/userns.conf
    adduser testuser
    echo "testuser:100000:65536" > /etc/subuid
    echo "testuser:100000:65536" > /etc/subgid

    cat > /etc/systemd/system/podman-test.service <<EOF
[Unit]
Description=Pod test

[Service]
KillMode=none

User=testuser
Group=testuser

SyslogIdentifier=podman-test

ExecStartPre=-/usr/bin/podman rm test
ExecStart=/usr/bin/podman run -p 8080:80 --name test docker.io/nginx:latest
ExecStop=/usr/bin/podman stop test

[Install]
WantedBy=multi-user.target
EOF

    systemctl daemon-reload
  SHELL
end
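To bring the box up and exercise the failing path, for example:

$ vagrant up
$ vagrant ssh -c 'sudo systemctl start podman-test.service; sleep 10; curl -s 127.0.0.1:8080'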
openshift-ci-robot added the kind/bug label on Apr 15, 2019
mheon (Member) commented Apr 15, 2019

@giuseppe PTAL - looks like slirp might be the culprit here?

mheon added the rootless label on Apr 15, 2019
AkihiroSuda (Collaborator) commented:
could you try slirp4netns 0.3.0 final?

duritong (Author) commented:
I just tried with a rebuilt package for 0.3.0 final and it does not make a difference.

giuseppe (Member) commented:
I've tried on a fresh CentOS 7 DigitalOcean droplet and it works fine for me.

I have manually installed slirp4netns, runc and podman at their latest git versions.

[testuser@centos-s-4vcpu-8gb-fra1-01 ~]$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[testuser@centos-s-4vcpu-8gb-fra1-01 ~]$ /usr/local/bin/podman run -d -p 8080:80 --name test docker.io/nginx:latest
Trying to pull docker.io/nginx:latest...Getting image source signatures
Copying blob 994d4a01fbe9 done
Copying blob eb51733b5bc0 done
Copying blob 27833a3ba0a5 done
Copying config bb776ce485 done
Writing manifest to image destination
Storing signatures
cf5687e0dabee20cd5fba609426eac431882f3770ab6f54cde7af33061ed1ad0
[testuser@centos-s-4vcpu-8gb-fra1-01 ~]$ curl localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

duritong (Author) commented:
Thanks for testing, @giuseppe!

Part of your first line (centos-s-4vcpu-8gb-fra1-01) made me aware of another assumption I had forgotten to investigate: while developing on various systems I was able to run with an exposed port a few times, but on my dev system (a Vagrant box) it consistently failed, and it didn't matter whether the box was VirtualBox or KVM based. The other box where it worked was a KVM machine with 4 vCPUs, and your hostname indicating 4 vCPUs reminded me of that.

So if you add the following to the Vagrantfile:

  config.vm.provider :virtualbox do |v,override|
    v.customize ["modifyvm", :id, "--cpus", 4]
    #v.customize ["modifyvm", :id, "--cpus", 1]
  end

and switch between 4 and 1 vCPUs, it works (multiple CPUs) or hangs (only 1 vCPU).

So there still seems to be an issue, though it only appears when everything is executed on a single CPU. I would say this still matters, as on a busy system things might stall the same way when the processes are not scheduled in the right order.

duritong (Author) commented:
OK, I wrote a test script:

$ cat testing_script.sh 
#!/bin/bash

echo "Iteration / cpu count / reported cpu count / result"
for cpu_count in 1 2 3 4; do
  for i in `seq 1 10`; do
    CPU_COUNT=$cpu_count vagrant up > /dev/null
    echo -n "${i} / ${cpu_count} / "
    vagrant ssh -c 'echo -n "$(grep processor /proc/cpuinfo  | wc -l) / "; sudo systemctl start podman-test.service; sleep 10; curl -s 127.0.0.1:8080 -o /dev/null && echo success || echo failed' 2>&1 | grep -v Shared
    vagrant halt > /dev/null
  done
done
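(The script assumes the Vagrantfile's provider block takes the CPU count from the environment rather than hardcoding it, e.g. v.customize ["modifyvm", :id, "--cpus", ENV.fetch("CPU_COUNT", "1")] in the snippet above.)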

This got me the following output:

Iteration / cpu count / reported cpu count / result
1 / 1 / 1 / failed
2 / 1 / 1 / failed
3 / 1 / 1 / failed
4 / 1 / 1 / failed
5 / 1 / 1 / failed
6 / 1 / 1 / failed
7 / 1 / 1 / failed
8 / 1 / 1 / failed
9 / 1 / 1 / failed
10 / 1 / 1 / failed
1 / 2 / 2 / failed
2 / 2 / 2 / success
3 / 2 / 2 / success
4 / 2 / 2 / failed
5 / 2 / 2 / failed
6 / 2 / 2 / failed
7 / 2 / 2 / success
8 / 2 / 2 / failed
9 / 2 / 2 / failed
10 / 2 / 2 / failed
1 / 3 / 3 / success
2 / 3 / 3 / success
3 / 3 / 3 / success
4 / 3 / 3 / success
5 / 3 / 3 / success
6 / 3 / 3 / failed
7 / 3 / 3 / success
8 / 3 / 3 / success
9 / 3 / 3 / success
10 / 3 / 3 / success
1 / 4 / 4 / success
2 / 4 / 4 / success
3 / 4 / 4 / success
4 / 4 / 4 / success
5 / 4 / 4 / success
6 / 4 / 4 / success
7 / 4 / 4 / success
8 / 4 / 4 / success
9 / 4 / 4 / success
10 / 4 / 4 / success

With my very small test sample I would conclude that it only works reliably with 4 vCPUs. And notably, there are exactly 4 processes communicating with each other: podman, conmon, runc & slirp4netns.

AkihiroSuda (Collaborator) commented:
Plain slirp4netns without Podman works?
(I assume it should work, as slirp4netns is single-threaded)

duritong (Author) commented Apr 17, 2019

How would I test that?

But I would assume so: slirp4netns is started but never configured to listen on port 8080. That configuration, IMHO, happens over the control socket, which is what podman is waiting for and never receives with only one CPU.
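If that is indeed the mechanism, the port setup would look roughly like the following, per the slirp4netns API documentation (the socket path and guest address are illustrative, and $child stands for the pid of a process in the target network namespace):

$ slirp4netns --configure --api-socket /tmp/slirp.sock $child tap0 &
$ echo '{"execute": "add_hostfwd", "arguments": {"proto": "tcp", "host_addr": "0.0.0.0", "host_port": 8080, "guest_addr": "10.0.2.100", "guest_port": 80}}' | nc -U /tmp/slirp.sock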

AkihiroSuda (Collaborator) commented:
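A standalone test along the lines of the slirp4netns README would be (flags and addresses are illustrative):

$ unshare --user --map-root-user --net sleep infinity &
$ child=$!
$ slirp4netns --configure --mtu=65520 $child tap0 &
$ nsenter --preserve-credentials -U -n -t $child ping -c 1 10.0.2.2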

duritong (Author) commented Apr 17, 2019

So this works on a machine with one vCPU.

Also note that rootless containers without an exposed port also work on hosts with only one vCPU. The problem begins as soon as I have fewer than 4 vCPUs AND try to expose a port.

giuseppe (Member) commented:
I wonder if the version of Go you are using makes any difference. We use some goroutines, but nothing that should block if there are not enough cores.

duritong (Author) commented:
EPEL comes with Go 1.11.5, which is what is used when rebuilding the podman package.

giuseppe (Member) commented:
There is a case that I could finally reproduce; I've opened a PR: #3162

giuseppe added a commit to giuseppe/libpod that referenced this issue on May 20, 2019:
enable polling also when using inotify.  It is generally useful to
have it as under high load inotify can lose notifications.  It also
solves a race condition where the file is created while the watcher
is configured and it'd wait until the timeout and fail.

Closes: containers#2942

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
github-actions bot added the locked - please file new issue/PR label on Sep 24, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Sep 24, 2023