
signal flake: Timed out waiting for BYE #16091

Closed
edsantiago opened this issue Oct 8, 2022 · 22 comments · Fixed by #16287, #17141 or #18681
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

@edsantiago
Collaborator

Haven't seen this one before (note: "sigkill" in the test name is a misnomer; it's actually SIGINT):

[+0226s] not ok 81 podman sigkill
...
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Basically, it looks like the container never receives the INT signal.

ubuntu root

@edsantiago edsantiago added the flakes (Flakes from Continuous Integration) label Oct 8, 2022
@vrothberg
Member

Could this be a timeout issue? We're waiting at most 5 seconds for the container to echo BYE. Maybe the machine was under heavy load? There are many syscalls involved, and many chances for the scheduler to kick in.
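For context, the wait being discussed is a bounded polling loop over the container's log output. A minimal sketch of that pattern (the container name foo and the 5-second bound come from the discussion; the loop body is an approximation, not the actual test code):

    # Poll the container log until BYE appears or the short timeout expires.
    timeout=5
    while (( timeout > 0 )); do
        if podman logs foo 2>/dev/null | grep -q BYE; then
            echo "got BYE"
            exit 0
        fi
        sleep 1
        timeout=$(( timeout - 1 ))
    done
    echo "FAIL: Timed out waiting for BYE from container" >&2
    exit 1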

@vrothberg
Member

@edsantiago, I wonder if the test could just do a run_podman wait --condition=running foo instead of https://github.com/containers/podman/blob/main/test/system/032-sig-proxy.bats#L9-L21.

@edsantiago
Collaborator Author

Timeout: are you talking about load on the host? On the bare metal? Could be, but is there any way to determine that? (If you're talking about the VM itself: the only thing running is system tests, so load should not be high.)

wait --condition: seems racy, because the podman run & may not have created the container yet.
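For illustration, here is roughly what the suggested approach could look like, including one way to sidestep the race described above (the image, container name, and trap command are taken from the test output later in this thread; the pre-poll on podman container exists is an assumption, not what the test actually does):

    # Start the container in the background with sig-proxy active (no --detach).
    podman run --name foo quay.io/libpod/testimage:20221018 \
        sh -c 'trap "echo BYE; exit 0" INT; echo READY; while :; do sleep 0.1; done' &
    kidpid=$!

    # The backgrounded "podman run" may not have created the container yet, so
    # "podman wait" could fail with "no such container". Poll for existence first,
    # then let "podman wait" block until the container is actually running.
    until podman container exists foo; do
        sleep 0.2
    done
    podman wait --condition=running foo

    # Signal the podman client; sig-proxy should forward the SIGINT into the container.
    kill -INT "$kidpid"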

@vrothberg
Member

@edsantiago can we start the container with --detach?

@vrothberg
Member

> Timeout: are you talking about load on the host? On the bare metal? Could be, but is there any way to determine that? (If you're talking about the VM itself: the only thing running is system tests, so load should not be high.)

That's the only guess I have. 5 seconds seems like a small timeout compared to other ones.

@edsantiago
Collaborator Author

--detach: no, because we then can't actually test sig-proxy :-)

@vrothberg
Member

> --detach: no, because we then can't actually test sig-proxy :-)

Too late for a coffee for me today :-) Thanks, Ed!

@edsantiago
Collaborator Author

Failed in Fedora gating tests:

# Arch:amd64 OS:fedora38 Runtime:crun Rootless:false Events:journald Logdriver:journald Cgroups:v2+systemd Net:netavark
...
not ok 81 podman sigkill
# (from function `die' in file ./helpers.bash, line 594,
#  in test file ./032-sig-proxy.bats, line 36)
#   `die "Timed out waiting for BYE from container"' failed
# # podman rm -t 0 --all --force --ignore
# # podman ps --all --external --format {{.ID}} {{.Names}}
# # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20220615 f26aa69bb3f3
# # podman container exists foo
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# # podman logs foo
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@vrothberg
Member

@edsantiago do the flakes in the gating tests support the theory that the machine is under load (or just sufficiently slow), and justify bumping the timeouts?

@edsantiago
Collaborator Author

I don't know enough about the gating-test environment to answer that. I actually don't know ANYTHING about that environment. But yeah, I guess there's not much choice other than to bump timeouts.

vrothberg added a commit to vrothberg/libpod that referenced this issue Oct 25, 2022
Bump the timeout waiting for the container to process the signal.
The comparatively short timeout is most likely responsible for
flakes in gating tests.

Fixes: containers#16091
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
@edsantiago
Collaborator Author

edsantiago commented Nov 7, 2022

New symptom in remote f37-aarch64 root:

# podman-remote --url unix:/tmp/podman_tmp_pDLe container exists foo
# podman-remote --url unix:/tmp/podman_tmp_pDLe logs foo
# podman-remote --url unix:/tmp/podman_tmp_pDLe inspect --format {{.State.Running}} foo
false
# podman-remote --url unix:/tmp/podman_tmp_pDLe inspect --format {{.State.ExitCode}} foo
READY
0
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#| FAIL: Container exited (status: 0) before we saw 'READY':
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hmmmm, then again, this is podman logs, so the problem could be #16132.

openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Nov 15, 2022
Bump the timeout waiting for the container to process the signal.
The comparatively short timeout is most likely responsible for
flakes in gating tests.

Fixes: containers#16091
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
lsm5 pushed a commit to lsm5/podman that referenced this issue Nov 15, 2022
Bump the timeout waiting for the container to process the signal.
The comparatively short timeout is most likely responsible for
flakes in gating tests.

Fixes: containers#16091
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
(cherry picked from commit f0ba2d8)
Signed-off-by: Lokesh Mandvekar <lsm5@fedoraproject.org>
@edsantiago
Collaborator Author

Reopening, because we have all-new sigproxy tests, and ALL of them are flaking in gating:

not ok 88 podman sigproxy test: run
# (from function `die' in file ./helpers.bash, line 541,
#  from function `_test_sigproxy' in file ./032-sig-proxy.bats, line 42,
#  in test file ./032-sig-proxy.bats, line 56)
#   `_test_sigproxy c_run $kidpid' failed
# $ podman rm -t 0 --all --force --ignore
# $ podman ps --all --external --format {{.ID}} {{.Names}}
# $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20221018 f5a99120db64
# READY
# $ podman container exists c_run
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# $ podman logs c_run
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # [teardown]
# $ podman pod rm -t 0 --all --force --ignore
# $ podman rm -t 0 --all --force --ignore
# 9489d22f8e2c3a8e35be5fe184c0261b246fcbb504983739d22275386bb9261c
# $ podman network prune --force
# $ podman volume rm -a -f
not ok 89 podman sigproxy test: start
# (from function `die' in file ./helpers.bash, line 541,
#  from function `_test_sigproxy' in file ./032-sig-proxy.bats, line 42,
#  in test file ./032-sig-proxy.bats, line 66)
#   `_test_sigproxy c_start $kidpid' failed
# $ podman rm -t 0 --all --force --ignore
# $ podman ps --all --external --format {{.ID}} {{.Names}}
# $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20221018 f5a99120db64
# $ podman create --name c_start quay.io/libpod/testimage:20221018 sh -c trap "echo BYE;exit 0" INT;echo READY;while :;do sleep 0.1;done
# 3f73349bb84d7825dde8173810afd4dd57f13341db6edf0fcbf0872fd7406dc6
# READY
# $ podman container exists c_start
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# $ podman logs c_start
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # [teardown]
# $ podman pod rm -t 0 --all --force --ignore
# $ podman rm -t 0 --all --force --ignore
# 3f73349bb84d7825dde8173810afd4dd57f13341db6edf0fcbf0872fd7406dc6
# $ podman network prune --force
# $ podman volume rm -a -f
not ok 90 podman sigproxy test: attach
# (from function `die' in file ./helpers.bash, line 541,
#  from function `_test_sigproxy' in file ./032-sig-proxy.bats, line 42,
#  in test file ./032-sig-proxy.bats, line 76)
#   `_test_sigproxy c_attach $kidpid' failed
# $ podman rm -t 0 --all --force --ignore
# $ podman ps --all --external --format {{.ID}} {{.Names}}
# $ podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
# quay.io/libpod/testimage:20221018 f5a99120db64
# $ podman run -d --name c_attach quay.io/libpod/testimage:20221018 sh -c trap "echo BYE;exit 0" INT;echo READY;while :;do sleep 0.1;done
# 724591f9cdf76f9b4e8d336794bd2921db9c5e2d9a6b14cc6b21106b9571c1a2
# $ podman container exists c_attach
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# $ podman logs c_attach
# READY
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: Timed out waiting for BYE from container
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@edsantiago edsantiago reopened this Jan 17, 2023
@vrothberg
Member

Should we bump the timeout to 20 seconds? This seems to have worked for other tests, presumably when the nodes are under stress.

@edsantiago
Collaborator Author

I guess there's not much else we can do right now. I'll open a PR.

@edsantiago
Collaborator Author

Reopening: this failed again in the 4.4 gating tests, and since this is 4.4 it presumably has the timeout bump.

@edsantiago edsantiago reopened this Feb 6, 2023
edsantiago added a commit to edsantiago/libpod that referenced this issue Apr 3, 2023
One of our oldest, most frustrating flakes is containers#16091, "Timed
out waiting for BYE".

In containers#17489 we added some debug output to see if the problem
was a container hang of some sort. It does not seem to be
(see containers#17675), and the debug output makes it hard to read
failure logs, so let's remove it.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago edsantiago changed the title signal flake signal flake: Timed out waiting for BYE Apr 5, 2023
@edsantiago
Collaborator Author

Continuing to fail in Fedora gating.

@vrothberg
Member

@edsantiago, does it also happen in CI? I'd love to see (journal) logs.

@edsantiago
Collaborator Author

It has happened in CI, but so rarely (most recently Nov 2022) that the logs are unavailable.

@edsantiago
Collaborator Author

This is causing RHEL gating tests to fail, and I can reproduce it 100% of the time on a 1mt f38 VM. I think sigproxy is broken:

Terminal 1:

$ podman run -i quay.io/libpod/testimage:20221018 sh -c 'trap "echo BYE; exit 0" INT;while :;do sleep 1;done'
[blocks, as expected]

Terminal 2:

# ps auxww --forest |grep -2 podman
...
fedora      2336  0.0  0.2   7528  4224 pts/0    S    10:25   0:00              \_ -bash
fedora      4613  0.0  2.1 1131356 42880 pts/0   Sl+  10:31   0:00                  \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora      4618  0.2  2.4 1279076 49792 pts/0   Sl+  10:31   0:00                      \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora      4629  0.0  0.1   5012  2944 pts/0    S    10:31   0:00                          \_ /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/netns-7b481dfb-9731-fb9e-34dd-0331d9806ec8 tap0
# kill -INT 4613     <--- this brings terminal 1 back to the prompt BUT DOES NOT EMIT BYE

# ps auxww --forest | grep -2 podman
[still shows the 4618 process]
# kill -INT 4618   <---- ***NOW*** I get the BYE in terminal 1

@containers/podman-maintainers PTAL ASAP. This has been a nasty flake for many months. Now that it's reproducible, let's squash it.

Side note: how can this pass in CI????
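For convenience, the two-terminal reproduction above can also be driven from a single interactive shell. A rough sketch (the image and trap command are the ones above; backgrounding the client and checking its exit status are my additions):

    # Start the container in the background and note the podman client's PID.
    podman run -i quay.io/libpod/testimage:20221018 \
        sh -c 'trap "echo BYE; exit 0" INT; while :; do sleep 1; done' &
    runpid=$!
    sleep 3    # crude: give the container time to start

    # Bad case: two nested podman processes; good case: only one.
    ps auxww --forest | grep -2 podman

    # Signal the outer podman process. Good case: BYE is printed and the job
    # exits 0. Bad case: no BYE (and, per the comment further down, exit 1).
    kill -INT "$runpid"
    wait "$runpid"; echo "exit status: $?"

Note this relies on an interactive shell; a non-interactive script may start the background job with SIGINT ignored.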

@edsantiago
Collaborator Author

VM available on request

@edsantiago
Collaborator Author

I give up. The only thing I've learned is that sometimes ps auxww --forest | grep -2 podman shows two identical nested podman processes, and sometimes it shows only one:

fedora      6137  0.0  0.2   7416  4224 pts/2    Ss   16:32   0:00          \_ -bash
fedora      6287 13.0  3.3 1280596 67636 pts/2   Sl+  16:33   0:00              \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora      6308  0.0  0.1   5012  2944 pts/2    S    16:33   0:00                  \_ /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/netns-99648371-19a3-60bc-88ee-a134fcf858d3 tap0

^^^ the good case

In the good case, like this one, killing the child of bash works as expected and spits out the BYE. In the bad case, with two podman processes, we never get BYE.

All tests were run with podman system reset in between; sometimes under su - fedora, in which case runRoot is /tmp/something, and sometimes under ssh, in which case runRoot is /run/user/something. I think the latter (ssh) works more often than su, but I've seen cases where su works.
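For anyone trying to reproduce this, the runRoot in effect for a given session can be checked directly; a one-liner, assuming a reasonably recent podman info output layout:

    podman info --format '{{.Store.RunRoot}}'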

This is way beyond me. @containers/podman-maintainers, once again, please look into this.

@Luap99
Member

Luap99 commented May 25, 2023

The extra reexec happens when we cannot join the rootless pause process namespace.

While I cannot reproduce the missing BYE, I can still reproduce this behaviour locally by checking the exit code: with the extra reexec we exit 1, while without it we exit 0 as expected.
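A rough way to exercise that code path locally, assuming the rootless pause-process PID is recorded in $XDG_RUNTIME_DIR/libpod/tmp/pause.pid (the keeper container and the sleeps are my own scaffolding):

    # Keep one container running so podman has existing namespaces to join later.
    podman run -d --name keeper quay.io/libpod/testimage:20221018 sleep 1000

    # Kill the rootless pause process.
    kill "$(cat "$XDG_RUNTIME_DIR/libpod/tmp/pause.pid")"

    # Re-run the signal test in the background and SIGINT the podman client.
    podman run -i quay.io/libpod/testimage:20221018 \
        sh -c 'trap "echo BYE; exit 0" INT; while :; do sleep 1; done' &
    sleep 3
    kill -INT $!
    wait $!
    echo "exit status: $?"    # reportedly 1 with the extra reexec, 0 without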

@Luap99 Luap99 self-assigned this May 25, 2023
Luap99 added a commit to Luap99/libpod that referenced this issue May 25, 2023
There are quite a lot of places in podman where we have some signal
handlers, most notably libpod/shutdown/handler.go.

However, when we reexec we do not want any of that and just send all
signals we get down to the child, obviously. So before we install our
signal handler we must first reset all others with signal.Reset().

Also, while at it, fix a problem where the joinUserAndMountNS() code path
would not forward signals at all. This code path is used when you have
running containers but the pause process was killed.

Fixes containers#16091
Given that signal handlers run in parallel in different goroutines, this would
explain why it flakes sometimes in CI. However, to my understanding this
flake can only happen when the pause process is dead before we run the
podman command. So the question still is: what kills the pause process?

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR label Aug 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 26, 2023