
signal flake: Timed out waiting for BYE #16091

Closed
edsantiago opened this issue Oct 8, 2022 · 22 comments · Fixed by #16287, #17141 or #18681
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

@edsantiago
Collaborator

Haven't seen this one before (note: "sigkill" in the test name is a misnomer; it's actually SIGINT):

[+0226s] not ok 81 podman sigkill
...
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Basically, it looks like the container never receives the INT signal.

ubuntu root

@edsantiago edsantiago added the flakes (Flakes from Continuous Integration) label Oct 8, 2022
@vrothberg
Member

Could this be a timeout issue? We're waiting at most 5 seconds for the container to echo BYE. Maybe the machine was under heavy load? There are many syscalls involved, and many chances for the scheduler to kick in.
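For context, the wait being discussed is a bounded polling loop over the container's log output. A minimal sketch of that pattern (the container name foo and the 5-second bound come from the discussion; the loop body is an approximation, not the actual test code):

    # Poll the container log until BYE appears or the short timeout expires.
    timeout=5
    while (( timeout > 0 )); do
        if podman logs foo 2>/dev/null | grep -q BYE; then
            echo "got BYE"
            exit 0
        fi
        sleep 1
        timeout=$(( timeout - 1 ))
    done
    echo "FAIL: Timed out waiting for BYE from container" >&2
    exit 1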

@vrothberg
Member

@edsantiago, I wonder if the test could just do a run_podman wait --condition=running foo instead of https://github.com/containers/podman/blob/main/test/system/032-sig-proxy.bats#L9-L21.

@edsantiago
Collaborator Author

Timeout: are you talking about load on the host? On the bare metal? Could be, but is there any way to determine that? (If you're talking about the VM itself: the only thing running is system tests, so load should not be high.)

wait --condition: seems racy, because the podman run & may not have created the container yet.
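For illustration, here is roughly what the suggested approach could look like, including one way to sidestep the race described above (the image, container name, and trap command are taken from the test output later in this thread; the pre-poll on podman container exists is an assumption, not what the test actually does):

    # Start the container in the background with sig-proxy active (no --detach).
    podman run --name foo quay.io/libpod/testimage:20221018 \
        sh -c 'trap "echo BYE; exit 0" INT; echo READY; while :; do sleep 0.1; done' &
    kidpid=$!

    # The backgrounded "podman run" may not have created the container yet, so
    # "podman wait" could fail with "no such container". Poll for existence first,
    # then let "podman wait" block until the container is actually running.
    until podman container exists foo; do
        sleep 0.2
    done
    podman wait --condition=running foo

    # Signal the podman client; sig-proxy should forward the SIGINT into the container.
    kill -INT "$kidpid"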

@vrothberg
Member

@edsantiago can we start the container with --detach?

@vrothberg
Member

> Timeout: are you talking about load on the host? On the bare metal? Could be, but is there any way to determine that? (If you're talking about the VM itself: the only thing running is system tests, so load should not be high.)

That's the only guess I have. 5 seconds seems like a small timeout compared to other ones.

@edsantiago
Collaborator Author

--detach: no, because we then can't actually test sig-proxy :-)

@vrothberg
Member

> --detach: no, because we then can't actually test sig-proxy :-)

Too late for a coffee for me today :-) Thanks, Ed!

@edsantiago
Collaborator Author

Failed in Fedora gating tests:

# Arch:amd64 OS:fedora38 Runtime:crun Rootless:false Events:journald Logdriver:journald Cgroups:v2+systemd Net:netavark
...
not ok 81 podman sigkill
# (from function `die' in file ./helpers.bash, line 594,
#  in test file ./032-sig-proxy.bats, line 36)
#   `die "Timed out waiting for BYE from container"' failed
# # podman rm -t 0 --all --force --ignore
# # podman ps --all --external --format {{.ID}} {{.Names}}
# # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20220615 f26aa69bb3f3
# # podman container exists foo
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@vrothberg
Member

@edsantiago do the flakes in the gating tests support the theory that the machine is under load (or just sufficiently slow), and justify bumping the timeouts?

@edsantiago
Collaborator Author

I don't know enough about the gating-test environment to answer that. I actually don't know ANYTHING about that environment. But yeah, I guess there's not much choice other than to bump timeouts.

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 25, 2022
Bump the timeout waiting for the container to process the signal.
The comparatively short timeout is most likely responsible for
flakes in gating tests.

Fixes: containers#16091
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago
Collaborator Author

edsantiago commented Nov 7, 2022

New symptom in remote f37-aarch64 root:

# podman-remote --url unix:/tmp/podman_tmp_pDLe container exists foo
# podman-remote --url unix:/tmp/podman_tmp_pDLe logs foo
# podman-remote --url unix:/tmp/podman_tmp_pDLe inspect --format {{.State.Running}} foo
false
# podman-remote --url unix:/tmp/podman_tmp_pDLe inspect --format {{.State.ExitCode}} foo
READY
0
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#| FAIL: Container exited (status: 0) before we saw 'READY':
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hmmmm, then again, this is podman logs, so the problem could be #16132.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Nov 15, 2022
Bump the timeout waiting for the container to process the signal.
The comparatively short timeout is most likely responsible for
flakes in gating tests.

Fixes: containers#16091
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
lsm5 pushed a commit to lsm5/podman that referenced this issue Nov 15, 2022
Bump the timeout waiting for the container to process the signal.
The comparatively short timeout is most likely responsible for
flakes in gating tests.

Fixes: containers#16091
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
(cherry picked from commit f0ba2d8)
Signed-off-by: Lokesh Mandvekar <lsm5@fedoraproject.org>
@edsantiago
Collaborator Author

Reopening, because we have all-new sigproxy tests, and ALL of them are flaking in gating:

not ok 88 podman sigproxy test: run
# (from function `die' in file ./helpers.bash, line 541,
#  from function `_test_sigproxy' in file ./032-sig-proxy.bats, line 42,
#  in test file ./032-sig-proxy.bats, line 56)
#   `_test_sigproxy c_run $kidpid' failed
# $ podman rm -t 0 --all --force --ignore
# $ podman ps --all --external --format {{.ID}} {{.Names}}
# $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20221018 f5a99120db64
# READY
# $ podman container exists c_run
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # [teardown]
# $ podman pod rm -t 0 --all --force --ignore
# $ podman rm -t 0 --all --force --ignore
# 9489d22f8e2c3a8e35be5fe184c0261b246fcbb504983739d22275386bb9261c
# $ podman network prune --force
# $ podman volume rm -a -f
not ok 89 podman sigproxy test: start
# (from function `die' in file ./helpers.bash, line 541,
#  from function `_test_sigproxy' in file ./032-sig-proxy.bats, line 42,
#  in test file ./032-sig-proxy.bats, line 66)
#   `_test_sigproxy c_start $kidpid' failed
# $ podman rm -t 0 --all --force --ignore
# $ podman ps --all --external --format {{.ID}} {{.Names}}
# $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20221018 f5a99120db64
# $ podman create --name c_start quay.io/libpod/testimage:20221018 sh -c trap "echo BYE;exit 0" INT;echo READY;while :;do sleep 0.1;done
# 3f73349bb84d7825dde8173810afd4dd57f13341db6edf0fcbf0872fd7406dc6
# READY
# $ podman container exists c_start
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # [teardown]
# $ podman pod rm -t 0 --all --force --ignore
# $ podman rm -t 0 --all --force --ignore
# 3f73349bb84d7825dde8173810afd4dd57f13341db6edf0fcbf0872fd7406dc6
# $ podman network prune --force
# $ podman volume rm -a -f
not ok 90 podman sigproxy test: attach
# (from function `die' in file ./helpers.bash, line 541,
#  from function `_test_sigproxy' in file ./032-sig-proxy.bats, line 42,
#  in test file ./032-sig-proxy.bats, line 76)
#   `_test_sigproxy c_attach $kidpid' failed
# $ podman rm -t 0 --all --force --ignore
# $ podman ps --all --external --format {{.ID}} {{.Names}}
# $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20221018 f5a99120db64
# $ podman run -d --name c_attach quay.io/libpod/testimage:20221018 sh -c trap "echo BYE;exit 0" INT;echo READY;while :;do sleep 0.1;done
# 724591f9cdf76f9b4e8d336794bd2921db9c5e2d9a6b14cc6b21106b9571c1a2
# $ podman container exists c_attach
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@edsantiago edsantiago reopened this Jan 17, 2023
@vrothberg
Member

Should we bump the timeout to 20 seconds? This seems to have worked for other tests, presumably when the nodes are under stress.

@edsantiago
Collaborator Author

I guess there's not much else we can do right now. I'll open a PR.

@edsantiago
Collaborator Author

Reopening: this failed again in the 4.4 gating tests, and since this is 4.4 it presumably has the timeout bump.

@edsantiago edsantiago reopened this Feb 6, 2023
edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 3, 2023
One of our oldest, most frustrating flakes is containers#16091, "Timed
out waiting for BYE".

In containers#17489 we added some debug output to see if the problem
was a container hang of some sort. It does not seem to be
(see containers#17675), and the debug output makes it hard to read
failure logs, so let's remove it.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago edsantiago changed the title signal flake signal flake: Timed out waiting for BYE Apr 5, 2023
@edsantiago
Collaborator Author

Continuing to fail in Fedora gating.

@vrothberg
Member

@edsantiago, does it also happen in CI? I'd love to see (journal) logs.

@edsantiago
Collaborator Author

It has happened in CI, but so rarely (most recently Nov 2022) that the logs are unavailable.

@edsantiago
Collaborator Author

This is causing RHEL gating tests to fail, and I can reproduce it 100% of the time on a 1mt f38 VM. I think sigproxy is broken:

Terminal 1:

$ podman run -i quay.io/libpod/testimage:20221018 sh -c 'trap "echo BYE; exit 0" INT;while :;do sleep 1;done'
[blocks, as expected]

Terminal 2:

# ps auxww --forest |grep -2 podman
...
fedora      2336  0.0  0.2   7528  4224 pts/0    S    10:25   0:00              \_ -bash
fedora      4613  0.0  2.1 1131356 42880 pts/0   Sl+  10:31   0:00                  \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora      4618  0.2  2.4 1279076 49792 pts/0   Sl+  10:31   0:00                      \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora      4629  0.0  0.1   5012  2944 pts/0    S    10:31   0:00                          \_ /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/netns-7b481dfb-9731-fb9e-34dd-0331d9806ec8 tap0
# kill -INT 4613     <--- this brings terminal 1 back to the prompt BUT DOES NOT EMIT BYE

# ps auxww --forest | grep -2 podman
[still shows the 4618 process]
# kill -INT 4618   <---- ***NOW*** I get the BYE in terminal 1

@containers/podman-maintainers PTAL ASAP. This has been a nasty flake for many months. Now that it's reproducible, let's squash it.

Side note: how can this pass in CI????
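For convenience, the two-terminal reproduction above can also be driven from a single interactive shell. A rough sketch (the image and trap command are the ones above; backgrounding the client and checking its exit status are my additions):

    # Start the container in the background and note the podman client's PID.
    podman run -i quay.io/libpod/testimage:20221018 \
        sh -c 'trap "echo BYE; exit 0" INT; while :; do sleep 1; done' &
    runpid=$!
    sleep 3    # crude: give the container time to start

    # Bad case: two nested podman processes; good case: only one.
    ps auxww --forest | grep -2 podman

    # Signal the outer podman process. Good case: BYE is printed and the job
    # exits 0. Bad case: no BYE (and, per the comment further down, exit 1).
    kill -INT "$runpid"
    wait "$runpid"; echo "exit status: $?"

Note this relies on an interactive shell; a non-interactive script may start the background job with SIGINT ignored.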

@edsantiago
Collaborator Author

VM available on request

@edsantiago
Collaborator Author

I give up. The only thing I've learned is that sometimes ps auxww --forest | grep -2 podman shows two identical nested podman processes, and sometimes it shows only one:

fedora      6137  0.0  0.2   7416  4224 pts/2    Ss   16:32   0:00          \_ -bash
fedora      6287 13.0  3.3 1280596 67636 pts/2   Sl+  16:33   0:00              \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora      6308  0.0  0.1   5012  2944 pts/2    S    16:33   0:00                  \_ /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/netns-99648371-19a3-60bc-88ee-a134fcf858d3 tap0

^^^ the good case

In the good case, like this one, killing the child of bash works as expected and spits out the BYE. In the bad case, with two podman processes, we never get BYE.

All tests were run with podman system reset in between; sometimes under su - fedora, in which case runRoot is /tmp/something, and sometimes under ssh, in which case runRoot is /run/user/something. I think the latter (ssh) works more often than su, but I've seen cases where su works.
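For anyone trying to reproduce this, the runRoot in effect for a given session can be checked directly; a one-liner, assuming a reasonably recent podman info output layout:

    podman info --format '{{.Store.RunRoot}}'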

This is way beyond me. @containers/podman-maintainers, once again, please look into this.

@Luap99
Member

Luap99 commented May 25, 2023

The extra reexec happens when we cannot join the rootless pause process namespace.

While I cannot reproduce the missing BYE, I can still reproduce this behaviour locally by checking the exit code: with the extra reexec we exit 1, while without it we exit 0 as expected.
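A rough way to exercise that code path locally, assuming the rootless pause-process PID is recorded in $XDG_RUNTIME_DIR/libpod/tmp/pause.pid (the keeper container and the sleeps are my own scaffolding):

    # Keep one container running so podman has existing namespaces to join later.
    podman run -d --name keeper quay.io/libpod/testimage:20221018 sleep 1000

    # Kill the rootless pause process.
    kill "$(cat "$XDG_RUNTIME_DIR/libpod/tmp/pause.pid")"

    # Re-run the signal test in the background and SIGINT the podman client.
    podman run -i quay.io/libpod/testimage:20221018 \
        sh -c 'trap "echo BYE; exit 0" INT; while :; do sleep 1; done' &
    sleep 3
    kill -INT $!
    wait $!
    echo "exit status: $?"    # reportedly 1 with the extra reexec, 0 without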

@Luap99 Luap99 self-assigned this May 25, 2023
Luap99 added a commit to Luap99/libpod that referenced this issue May 25, 2023
There are quite a lot of places in podman where we have some signal
handlers, most notably libpod/shutdown/handler.go.

However, when we reexec we do not want any of that and just send all
signals we get down to the child, obviously. So before we install our
signal handler we must first reset all others with signal.Reset().

Also, while at it, fix a problem where the joinUserAndMountNS() code path
would not forward signals at all. This code path is used when you have
running containers but the pause process was killed.

Fixes containers#16091
Given that signal handlers run in parallel in different goroutines, this would
explain why it flakes sometimes in CI. However, to my understanding this
flake can only happen when the pause process is dead before we run the
podman command. So the question still is: what kills the pause process?

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR label Aug 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 26, 2023