rawhide: podman hangs, unstoppable #21504

Closed

edsantiago opened this issue Feb 4, 2024 · 26 comments

Labels: flakes (Flakes from Continuous Integration), locked (please file new issue/PR)

Comments

@edsantiago
Collaborator

Weird new flake, manifesting in many different tests but the common factors are:

  • 'Expected process to exit. It did not.'
  • 'could not remove container, ... could not be stopped'

I'm pretty sure it has to do with the new CI VMs. The table below shows the diffs between old-VMs-without-the-hang and new-VMs. I include all OSes, not just rawhide, in order to help isolate what-changed-in-rawhide-only vs what-changed-everywhere. The most suspicious changes IMO are conmon and containers-common, but it could also be systemd or kernel. (Sorry, I don't have an easy way to show kernel versions).

I think I've seen more of these flakes but they're really hard to categorize so I'm filing as-is right now and will follow up over time.

  • 20240201t143038z-f39f38d13 (new VMs)
  • 20240102t212217z-f39f38d13 ⇑ (previous VMs; ⇑ marks the old version throughout the table below)
package debian prior-fedora fedora fedora-aws rawhide
base 13 38-1.6 39-1.5 ? 40-0
aardvark-dns 1.4.0-5 1.9.0-1 1.10.0-1 1.9.0-1 1.10.0-1
1.9.0-1 ⇑ 1.9.0-1 ⇑
buildah 1.33.3+ds1-2 1.33.2-1 1.34.0-1 1.34.0-1 1.34.0-3
1.32.2+ds1-1 ⇑ 1.33.2-1 ⇑ 1.33.2-1 ⇑ 1.33.2-1 ⇑
conmon 2.1.6+ds1-1 2:2.1.8-2 2:2.1.8-2 2:2.1.8-2 2:2.1.8-4
2:2.1.8-2 ⇑
container-selinux ? 2:2.227.0-1 2:2.229.0-1 2:2.229.0-1 2:2.229.0-2
2:2.226.0-1 ⇑ 2:2.226.0-1 ⇑ 2:2.226.0-1 ⇑ 2:2.226.0-1 ⇑
containers-common ? 4:1-89 4:1-95 4:1-95 5:0.57.1-7
4:1-101 ⇑
criu 3.17.1-3 3.18-1 3.19-2 3.19-2 3.19-4
3.19-2 ⇑
crun 1.14-1 1.12-1 1.14-1 1.14-1 1.14-1
1.12-1 ⇑ 1.12-1 ⇑ 1.12-1 ⇑ 1.12-1 ⇑
golang 2:1.21~2 1.20.13-1 1.21.6-1 1.21.6-1 1.22~rc2-1
1.20.12-1 ⇑ 1.21.5-1 ⇑ 1.21.5-1 ⇑ 1.21.5-1 ⇑
netavark 1.4.0-4 1.9.0-1 1.10.1-5 1.10.1-5 1.10.1-5
1.9.0-1 ⇑ 1.9.0-1 ⇑ 1.9.0-1 ⇑
passt 2023-12-30 2023-12-30 ? 2023-12-30 2023-12-30
2023-12-04 ⇑ 2023-12-04 ⇑ 2023-12-04 ⇑ 2023-12-04 ⇑
podman 4.9.0+ds1-2 5:4.8.3-1 5:4.9.0-1 5:4.9.0-1 5:4.9.0-1
4.7.2+ds1-2 ⇑ 5:4.7.2-1 ⇑ 5:4.8.2-1 ⇑ 5:4.8.1-1 ⇑ 5:4.8.1-1 ⇑
runc 1.1.10+ds1-1 2:1.1.8-1 2:1.1.8-1 2:1.1.8-1 2:1.1.9-2
2:1.1.9-1 ⇑
skopeo 1.13.3+ds1-2 1:1.14.1-1 1:1.14.1-1 1:1.14.0-1 1:1.14.1-2
1:1.14.0-1 ⇑ 1:1.14.0-1 ⇑ 1:1.14.0-1 ⇑
slirp4netns 1.2.1-1 1.2.2-1 1.2.2-1 1.2.2-1 1.2.2-2
1.2.2-1 ⇑
systemd-container 255.3-2 253.15-2 254.8-2 254.8-2 255.3-1
255.2-3 ⇑ 253.14-1 ⇑ 254.7-1 ⇑ 254.7-1 ⇑ 255.1-1 ⇑
tar ? ? ? ? 2:1.35-3
2:1.35-2 ⇑
Failures seen so far, by category: int(8) podman(7) rawhide(9) root(6) host(9) sqlite(9) sys(1) remote(2) rootless(3)
edsantiago added the flakes (Flakes from Continuous Integration) label Feb 4, 2024
@Luap99
Member

Luap99 commented Feb 5, 2024

I also just saw it on one of my PRs https://cirrus-ci.com/task/5624583085096960

This looks really strange. Reading the code, I would most likely suspect the kernel here: we SIGKILL the process and then wait up to 5s for it to be dead (checked with kill -0). We could try bumping the 5s timeout, but the fact that all of the subsequent rm attempts fail in the same way leads me to believe something is wrong with sending SIGKILL and then waiting for the PID to disappear. Looking at this, there is an obvious PID-reuse bug here: we kill the old PID, and by the time we check whether that PID is still alive it could already have been reassigned, although that sounds super unlikely to me.
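
To make the logic concrete, here is a minimal shell approximation of the kill-and-wait sequence described above (a sketch only, not podman's actual Go code; the 5s timeout and poll interval just follow the description):

    # Sketch of the stop logic: SIGKILL the container PID, then poll with
    # `kill -0` for up to 5 seconds until the process is gone.
    pid=$1

    # Guard against pid==0: `kill -0 0` signals the caller's own process
    # group and always succeeds, so the wait would appear to time out.
    if ! [[ "${pid:-}" =~ ^[1-9][0-9]*$ ]]; then
        echo "invalid pid '$pid'" >&2
        exit 1
    fi

    kill -KILL "$pid" 2>/dev/null

    for _ in $(seq 50); do              # 50 * 0.1s = 5s timeout
        if ! kill -0 "$pid" 2>/dev/null; then
            exit 0                      # the PID is gone
        fi
        sleep 0.1
    done

    # Caveat: `kill -0` only proves that *some* process owns this PID; if
    # the PID were reused by an unrelated process, this loop would report
    # a bogus timeout even though the container process is long dead.
    echo "process $pid still alive after 5s" >&2
    exit 1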

It would be a good start if we could capture ps auxww and print the container PID (podman inspect --format {{.State.Pid}} <name>) when this happens, to see whether the container process is still there.

@edsantiago
Collaborator Author

FINALLY! I instrumented #17831 and have been hammering at it, and finally got a failure in the right place. It's just the ps output, not podman inspect. Hope it helps.

@Luap99
Member

Luap99 commented Feb 6, 2024

So there is no top process in sight, which is good: it means the container process is in fact gone. However, we can clearly see that conmon is still around, so the question is why conmon is not exiting even though the container process is gone.

I really need to see the inspect output for the container; the only thing I can imagine here is that the PID is set to 0. I find it unlikely that kill would work otherwise: something like kill(0, 0) always succeeds, and thus we might conclude that waiting for the PID timed out.
Looking at the code, it is most likely that our container is stuck in the wrong state.

You could also try adding podman events --since 0 --stream=false as a debug command, but I think the inspect command is more important.

Either way, I think the root cause is something wrong in conmon, and to debug that I fear we need to attach gdb to see where it hangs, which requires a reproducer. However, given that per your table conmon hasn't really changed, and that so far all cases linked here were seen on rawhide, I think the underlying cause may be the kernel change. The minimum you can try is to build new images to get a newer kernel; maybe this is already fixed there.
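
For the gdb idea, something along these lines could work once there is a reproducer (a hypothetical sketch; the pgrep selection is illustrative and would need narrowing to the right conmon):

    # Attach gdb to the stuck conmon, dump all thread backtraces, then exit
    # (gdb detaches on exit, leaving the process running).
    conmon_pid=$(pgrep -o -x conmon)    # oldest conmon process on the host
    gdb -p "$conmon_pid" -batch \
        -ex 'set pagination off' \
        -ex 'thread apply all bt'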

@edsantiago
Collaborator Author

I don't think I can run podman inspect. My instrumentation (ps auxww) is in matchers.go, which does not have access to any test context.

To build new VMs, I'm waiting for netavark to go stable on f38/f39. Maybe tomorrow.

And, I haven't tried to reproduce it in a VM because it's so unreliable. But I'll try to make time tomorrow.

@edsantiago
Collaborator Author

Another one. ps aux only, sorry. I have wasted a LOT of time trying to find some way to run podman commands from the context that sees the failure, including a lot of rereading of gomega docs. No clue. If anyone has ideas, please offer!

@Luap99
Member

Luap99 commented Feb 8, 2024

Ah, the remote logs are much more cluttered due to the stupid extra conmon exit delay; anyhow, this is the relevant container part:

           root      101580  0.0  0.0   9868  2212 ?        Ss   11:10   0:00 /usr/bin/conmon --api-version 1 -c 5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b -u 5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b -r /usr/bin/crun -b /tmp/podman_test3705908020/root/overlay-containers/5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b/userdata -p /tmp/podman_test3705908020/runroot/overlay-containers/5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b/userdata/pidfile -n modest_hopper --exit-dir /tmp/podman_test3705908020/exits --full-attach -s -l journald --log-level warning --syslog --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/tmp/podman_test3705908020/runroot/overlay-containers/5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b/userdata/oci-log --conmon-pidfile /tmp/podman_test3705908020/runroot/overlay-containers/5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b/userdata/conmon.pid --exit-command /var/tmp/go/src/github.com/containers/podman/bin/podman --exit-command-arg --root --exit-command-arg /tmp/podman_test3705908020/root --exit-command-arg --runroot --exit-command-arg /tmp/podman_test3705908020/runroot --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /tmp/podman_test3705908020 --exit-command-arg --network-config-dir --exit-command-arg /etc/containers/networks --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /tmp/podman_test3705908020/root/volumes --exit-command-arg --db-backend --exit-command-arg sqlite --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.imagestore=/tmp/ginkgo692055407/imagecachedir --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 5a0c6d10c1a71ecb47d3f0e0417e0a7678f6cc860ca995fcfa76525dca0e5e9b
           root      101582 92.2  0.0      0     0 ?        Rs   11:10   1:32  \_ [touch]

The container process (touch) is alive and ps reports 92% CPU usage, WTF? At this point the command has been running for at least 90 seconds per the log. And ps says touch is actually running (R state), not sleeping.

@edsantiago
Collaborator Author

Quick observation: a large fraction of the failures are in podman build tests, and (I think) in RUN steps:

[+0353s] not ok 83 [030] podman run no /etc/mtab 
...
<+187ms> # # podman build -t nomtab /tmp/podman_bats.ptI9oE/build-test
<+0120s> # STEP 1/2: FROM quay.io/libpod/testimage:20221018
         # STEP 2/2: RUN rm /etc/mtab
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
<+025ms> # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***

Since this is MUCH easier to instrument than the e2e tests, I've been running it in a loop. No failures in many hours, but I'll keep it going. It's not a kernel diff: the kernel is 6.8.0-0.rc0.20240112git70d201a40823.5.fc40.x86_6 both in CI and in my 1mt VM.

Flake catalog so far:

int(25) podman(25) rawhide(38) root(26) host(38) sqlite(38) sys(13) remote(13) rootless(12)

@Luap99
Member

Luap99 commented Feb 8, 2024

Yeah, it is hard to tell whether they all happen due to the same cause; it could also be two different problems. On the plus side, build doesn't use conmon at all, which is further evidence that it must be the kernel misbehaving.

I also have a 1mt VM looping over the system tests to try to reproduce, but so far no luck either.

@edsantiago
Collaborator Author

New symptom: complete test hang, no useful output whatsoever.

The interesting thing is that this is the system tests, which I instrumented as follows:

    # FIXME FIXME FIXME 2024-02-07 for hang
    run_podman '?' rm -t 0 --all --force --ignore
    if [[ $status -ne 0 ]]; then
        # Extract the CID of the container that could not be stopped
        foo=$(expr "$output" : "container \(.*\) as it could not be stopped")
        if [[ -n "$foo" ]]; then
            echo "# BARF BARF BARF"
            run_podman '?' container inspect $foo
            echo "$_LOG_PROMPT ps auxww --forest"
            ps auxww --forest
        fi
    fi

Hypothesis: the flake triggered, and the podman container inspect call is hanging. Admittedly not a great hypothesis (I don't see how this could happen), but it's trivial to test: run ps auxww first, before the inspect. I've just done so and repushed.

@Luap99
Member

Luap99 commented Feb 9, 2024

But the podman commands shouldn't hang forever, because we run them through the run_podman function, which uses a 90s timeout. So maybe ps, bash, bats, or the timeout command itself hangs? Looking at these weird symptoms, I really start to wonder whether any process could just start hanging forever with this kernel.

@edsantiago
Collaborator Author

Latest list:

int(39) podman(42) rawhide(64) root(38) host(64) sqlite(64) sys(25) remote(22) rootless(26)

@edsantiago
Collaborator Author

Here is the current package-version table, possibly useful to see what is different between rawhide and all the others:

package debian prior-fedora fedora fedora-aws rawhide
base 13 38-1.6 39-1.5 ? 40-0
kernel 6.6.13-1 6.7.4-100 6.7.4-200 6.7.4-200 6.8.0-0.rc3.20240209git1f719a2f3fa6.31
aardvark-dns 1.4.0-5 1.10.0-1 1.10.0-1 1.10.0-1 1.10.0-1
netavark 1.4.0-4 1.10.3-1 1.10.3-1 1.10.3-1 1.10.3-1
buildah 1.33.5+ds1-4 1.34.0-1 1.34.0-1 1.34.0-1 1.34.0-3
conmon 2.1.6+ds1-1 2.1.8-2 2.1.8-2 2.1.8-2 2.1.8-4
container-selinux ? 2.228.1-1 2.229.0-1 2.229.0-1 2.229.0-2
containers-common ? 1-89 1-99 1-99 0.57.3-3
criu 3.17.1-3 3.18-1 3.19-2 3.19-2 3.19-4
crun 1.14-1 1.14-1 1.14.1-1 1.14-1 1.14.1-1
golang 2:1.21~2 1.20.13-1 1.21.7-1 1.21.6-1 1.22.0-1
passt 2023-12-30 2023-12-30 2023-12-30 2023-12-30 2023-12-30
podman 4.9.2+ds1-2 4.8.3-1 4.9.2-1 4.9.0-1 5.0.0~rc1-2
runc 1.1.12+ds1-1 1.1.12-1 1.1.12-1 1.1.12-1 1.1.12-1
skopeo 1.13.3+ds1-2 1.14.1-1 1.14.2-1 1.14.0-1 1.14.2-1
slirp4netns 1.2.1-1 1.2.2-1 1.2.2-1 1.2.2-1 1.2.2-2
systemd-container 255.3-2 253.15-2 254.9-1 254.9-1 255.3-1
tar 1.34+dfsg-1.2 1.34-8 1.35-2 1.35-2 1.35-3

@mheon
Member

mheon commented Feb 14, 2024

I concur with @Luap99: this smells like a kernel issue. I'll look into it.

@edsantiago
Collaborator Author

Rawhide kernel 6.8.0-0.rc4.20240214git7e90b5c295ec.37, built in containers/automation_images#328 (comment), does not fix the problem.

log 1

log 2

@edsantiago
Collaborator Author

Seen in OpenQA

@Luap99
Member

Luap99 commented Feb 20, 2024

OK, that is a good pointer. I downloaded the logs, ran journalctl on them, and found:
kernel: WARNING: stack going in the wrong direction? at do_syscall_64+0x89/0x170

That certainly doesn't sound good at all. Going back through our journal logs, I do indeed see the same warning in some of them.
That doesn't prove it is related to this issue, but it can't mean anything good either.
Any chance you can script this to grep through all our journald logs and see whether we only hit this on rawhide when there was a test failure?
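
Something like the following would do as a first pass (a rough sketch; the log directory and one-file-per-task layout are assumptions, since the real logs live in the CI artifacts):

    # Scan downloaded journal logs for the kernel warning and report which
    # files (i.e. which CI tasks) contain it.
    logdir=${1:-./ci-journals}      # assumed: one journal log file per CI task
    grep -rl 'stack going in the wrong direction' "$logdir" |
    while read -r hit; do
        echo "kernel stack warning found in: $hit"
    done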

@edsantiago
Collaborator Author

Oh, nice catch. I don't fetch journal logs in my flake db but it's easy to fetch them. I will start working on it and report back.

@edsantiago
Collaborator Author

Well, mixed news. Out of 78 flakes in my DB, the string "stack going" only appears in journal logs for 7 of them:

Next: that string appears in all instances of #21749, the new oom flake. This does not surprise me: I had a hunch that these were connected.

Next: that string DOES NOT appear in any rawhide journals that I looked at for flakes other than the hang. (This is admittedly a poor and non-thorough data point, because I just hand-picked logs).

Next: no other relevant matches (in these logs) for 'kernel.*warn' (I am ignoring Keylock active)

HTH.

@edsantiago
Collaborator Author

Another one in OpenQA

@edsantiago
Collaborator Author

Sigh. I am sorry to report that the flake lives on in today's new VMs, kernel 6.8.0-0.rc5.41.

@edsantiago
Collaborator Author

@Luap99 I realized that I do have access to the CID and test environment, and can run podman inspect. It doesn't seem very useful any more given your reproducer, but is there something else I can do to instrument and debug? Like get {{.State.Pid}} and examine anything in /proc?
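
For example, something along these lines (a hypothetical instrumentation sketch; variable names are illustrative, and /proc/<pid>/stack needs root):

    # Grab the container PID and dump its kernel-side state from /proc.
    cid=$1
    pid=$(podman inspect --format '{{.State.Pid}}' "$cid")

    if [[ "$pid" =~ ^[1-9][0-9]*$ ]]; then
        cat "/proc/$pid/status"                     # State, signal masks, etc.
        cat "/proc/$pid/wchan"; echo                # what the task is blocked on
        cat "/proc/$pid/stack" 2>/dev/null || true  # kernel stack (root only)
    fi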

@Luap99
Member

Luap99 commented Feb 21, 2024

Nothing needed for now.

@edsantiago
Collaborator Author

Build c20240227t125812z-f39f38d13 has the rc6 kernel. Testing now in #17831, which I will keep rerunning all day. Will report back if I see the hang. I'm tempted to open a podman PR to update VMs; I will probably do so after I see how 17831 does on its first run.

@edsantiago
Collaborator Author

I am a rational human being. I don't believe in superstitions or jinxing or any such nonsense. Even so, I'm unreasonably scared to mention that #17831 has run six times today, using the new kernel, and not seen this flake. Nor did we see it in #21841, the PR that brought new VMs into podman CI.

@mheon
Member

mheon commented Feb 28, 2024

Are we comfortable closing this given that the new kernel has arrived? We can always re-open if the demons reappear in the shadows.

@edsantiago
Collaborator Author

It seems pretty likely that rc6 fixes the hang.

stale-locking-app bot added the locked (please file new issue/PR) label May 29, 2024
stale-locking-app bot locked as resolved and limited conversation to collaborators May 29, 2024