open pidfd: no such process #18452

Closed
edsantiago opened this issue May 4, 2023 · 24 comments · Fixed by #19760
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago (Collaborator)

Not a new flake -- this one started cropping up in 2020, but never often enough to even merit an issue. Now, it's happening constantly in #18442, a PR that adds error checking to e2e-test cleanup.

$ podman [options] stop --all -t 0
open pidfd: No such process

This may not be a complete list--it has been overwhelming:

  • fedora-37 : int podman fedora-37 root container boltdb
    • 05-03 11:21 in podman pod container can override pod not sharing pid
  • fedora-38 : int podman fedora-38 root container boltdb
  • fedora-38 : int podman fedora-38 root host boltdb
    • 05-03 11:19 in podman generate kube sharing pid namespace
  • fedora-38 : int podman fedora-38 root host sqlite
    • 05-03 11:19 in podman play kube should share ipc,net,uts when shareProcessNamespace is set
    • 05-03 11:19 in podman play kube test with init containers and annotation set
    • 05-03 11:19 in podman play kube multi doc yaml with multiple services, pods and deployments
    • 05-03 11:19 in podman play kube with replicas limits the count to 1 and emits a warning
    • 05-03 11:19 in podman play kube override with udp should keep tcp from YAML file
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 05-03 11:18 in podman play kube test with init containers and annotation set
    • 05-03 11:18 in podman play kube deployment more than 1 replica test correct command
    • 05-03 11:18 in podman play kube multi doc yaml with multiple services, pods and deployments
  • rawhide : int podman rawhide root host sqlite
    • 05-03 11:18 in podman generate kube with volume
    • 05-03 11:18 in podman generate kube sharing pid namespace
    • 05-03 11:18 in podman play kube should share ipc,net,uts when shareProcessNamespace is set
    • 05-03 11:18 in podman play kube expose character device inside container
    • 05-03 11:18 in podman play kube override with udp should keep tcp from YAML file
    • 05-03 11:18 in podman pod correctly sets up PIDNS
  • rawhide : int podman rawhide rootless host sqlite
    • 05-03 11:17 in podman play kube support container liveness probe
    • 05-03 11:17 in podman play kube test with init containers and annotation set
    • 05-03 11:17 in podman play kube test with DirectoryOrCreate HostPath type volume and non-existent directory path
    • 05-03 11:17 in podman generate and reimport kube on pod
@edsantiago added the flakes label May 4, 2023
@vrothberg (Member) commented May 4, 2023

I am under the impression that's a crun error (see https://github.com/containers/crun/blob/645b75b04613debb0a3bdf122fdb0b22d835db68/src/libcrun/linux.c#L5320).

@giuseppe, @flouthoc, does crun need some error handling as done in Podman?

// Track lifetime of conmon precisely using pidfd_open + poll.           
// There are many cases for this to fail, for instance conmon is dead    
// or pidfd_open is not supported (pre linux 5.3), so fall back to the   
// traditional loop with poll + sleep                                    
if fd, err := unix.PidfdOpen(c.state.ConmonPID, 0); err == nil {         
        return fd                                                        
} else if err != unix.ENOSYS && err != unix.ESRCH {                      
        logrus.Debugf("PidfdOpen(%d) failed: %v", c.state.ConmonPID, err)
}                                                                        
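
For reference, the fallback path the comment above mentions ("the traditional loop with poll + sleep") could look roughly like this. A minimal Go sketch with a hypothetical helper name, not Podman's actual fallback code:

package sketch

import (
	"time"

	"golang.org/x/sys/unix"
)

// waitForConmonPoll probes the PID with signal 0, which delivers no
// signal but reports ESRCH once the process is gone, then sleeps and
// retries. (Note: kill(pid, 0) can also return EPERM for a live
// process owned by another user; ignored here for brevity.)
func waitForConmonPoll(pid int) {
	for unix.Kill(pid, 0) != unix.ESRCH {
		time.Sleep(100 * time.Millisecond)
	}
}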

@giuseppe (Member) commented May 4, 2023

crun must report when the kill command fails, as it does in this case because the process doesn't exist. Should Podman ignore it?

@vrothberg (Member)

> crun must report when the kill command fails, as it does in this case because the process doesn't exist. Should Podman ignore it?

That sounds good to me. Could crun return with a specific exit code such that Podman can discriminate a no-such-pid error from others? I'd like to avoid parsing error strings.

@vrothberg (Member)

Meh ... but runc and other runtimes wouldn't behave that way.

@Luap99 (Member) commented May 4, 2023

What I have observed in #18442 is that it looks like podman sometimes tries to run cleanup() twice, so it could very well be that podman calls crun twice, resulting in the error.

But generally speaking, there is always a race between podman asking the runtime to kill the container and the runtime actually doing it: the process can exit in the meantime. I am not sure how we can make podman ignore it; string matching is way too fragile considering that we support more than one OCI runtime.
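
To make that race concrete, here is a minimal Go sketch (purely hypothetical, not Podman code) of how a kill can legitimately fail even right after a successful liveness check:

package sketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// killRace shows the unavoidable window: the process can be alive at
// the moment of the check and gone by the time the signal is sent, so
// the kill fails with ESRCH through no fault of the caller.
func killRace(pid int) {
	if unix.Kill(pid, 0) == nil { // the process looked alive here...
		// ...but it may exit right now, before the signal lands.
		if err := unix.Kill(pid, unix.SIGKILL); err == unix.ESRCH {
			fmt.Println("lost the race: process exited between check and kill")
		}
	}
}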

@vrothberg (Member)

I have an idea. Let me wrap something up to discuss the details in the PR.

vrothberg added a commit to vrothberg/libpod that referenced this issue May 4, 2023
There is an inherent race when stopping/killing a container with other
processes attempting to do the same and also with the container having
exited in the meantime.  In those cases, the OCI runtime may fail to
kill the container as it has already exited.

Handle those races by first checking if the container state has changed
before returning the error.

[NO NEW TESTS NEEDED] - as it's a hard-to-test race.

Fixes: containers#18452
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
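
A minimal Go sketch of the idea in this commit message, using stand-in types and names rather than libpod's real API:

package sketch

// Minimal stand-in types; libpod's real Container and state handling
// are richer. This only sketches the commit's approach.
type state int

const (
	stateRunning state = iota
	stateStopped
	stateExited
)

type container struct {
	kill func() error // asks the OCI runtime to kill the container
	sync func() state // re-reads the container state
}

// stopIgnoringExitRace returns the runtime's error only if, after a
// fresh state sync, the container has not in fact exited. A container
// that died concurrently is treated as a successful stop.
func (c *container) stopIgnoringExitRace() error {
	err := c.kill()
	if err == nil {
		return nil
	}
	if s := c.sync(); s == stateStopped || s == stateExited {
		return nil // the container exited in the meantime; not an error
	}
	return err
}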
@vrothberg (Member)

-> #18457

vrothberg added a commit to vrothberg/libpod, and Luap99 pushed a commit to Luap99/libpod, each referencing this issue May 4, 2023, with the same commit message as above.
@Luap99 (Member) commented May 4, 2023

Maybe a different angle: all of the above tests use kube or pods directly, so it could very well be a bug in our pod handling.

@vrothberg (Member)

> Maybe a different angle: all of the above tests use kube or pods directly, so it could very well be a bug in our pod handling.

Nice idea. Let's see if the logs pop up and then we can continue digging.

github-actions bot posted its 30-day inactivity reminder on Jun 4, 2023, Jul 7, 2023, and again later: "A friendly reminder that this issue had no activity for 30 days."

@edsantiago (Collaborator, Author)

Seen in my logs on July 25, but I need to reiterate my usual warning about this sort of flake: it does not cause a CI failure, even in my no-retries PR, so the only way for me to see it is if it happens in the same logfile as a flake that I do detect. IOW, this is likely happening often with no indication.

@Luap99 (Member) commented Aug 15, 2023

If you want to make this fatal, pick #18442 into your no-retries PR.

@edsantiago (Collaborator, Author)

I considered it but didn't see the point. As I recall, this one was (is?) so frequent that it would be exhausting to sift through, like the "no logs from conmon" one.

But I'll give it a try, gather some data, and report back.

@edsantiago (Collaborator, Author)

Oh.... now I remember why I stopped doing this.

@Luap99 (Member) commented Aug 15, 2023

Which is the reason why we need this so badly: it is clear that the teardown logic is not as robust as it should be. The amount of logrus errors/warnings is very concerning.

@edsantiago (Collaborator, Author)

Agreed! But, like my no-flake-retries work, it can only succeed if we make a concerted effort to squash all bugs and then enable the checks. And until that happens, new failures are going to creep in.

FWIW the kube failure is a constant, not a flake, so it might be easy to fix. At first glance maybe kube down could fix it. I'll try to take a look today or tomorrow.

@edsantiago (Collaborator, Author)

Seen just now on my laptop, running simple interactive commands, not in int tests or anything:

$ time bin/podman kube play --service-container /tmp/foo.yaml
...
$ bin/podman kube down /tmp/foo.yaml
2023-08-23T13:39:24.557276Z: open pidfd: No such process
Pods stopped:
...

@rhatdan (Member) commented Aug 23, 2023

@giuseppe This is an error returned from crun meaning the pid no longer exists. It happens in the step where we are killing via the pidfd. Should we ignore this error?

@edsantiago (Collaborator, Author)

I think we do ignore the error. At least, podman does not pass it on in the exit code.

The problem is that the message is shown to the user. This (1) causes tests to fail if they're expecting empty output, and (2) causes concern for end users, and probably also wasted time (what happened? Are my containers OK? I need to look; maybe I should check the web).

@rhatdan (Member) commented Aug 23, 2023

Well, if the message is being shown, isn't it coming from crun then? Podman is probably ignoring the error, but someone is writing it to stdout/stderr.

@giuseppe (Member)

> @giuseppe This is an error returned from crun meaning the pid no longer exists. It happens in the step where we are killing via the pidfd. Should we ignore this error?

Yes, I think we should ignore it if we try to kill the process; it means the container already exited.
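
In Go terms, the requested behavior amounts to treating ESRCH from the kill path as success. A hedged sketch of that semantic only; crun's actual change is in C:

package sketch

import "golang.org/x/sys/unix"

// killIgnoringGone treats ESRCH ("no such process") from kill as
// success: if the process is already gone, the goal of the kill, a
// dead process, is already achieved.
func killIgnoringGone(pid int, sig unix.Signal) error {
	if err := unix.Kill(pid, sig); err != nil && err != unix.ESRCH {
		return err
	}
	return nil
}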

rhatdan added a commit to rhatdan/crun that referenced this issue Aug 25, 2023 (and again Aug 26, 2023)
A race exists where a container engine tells crun to kill a container
process, but the process has already died. In this case do not return
an error.

Fixes: containers/podman#18452

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
giuseppe added a commit to giuseppe/libpod that referenced this issue Aug 26, 2023
when the "kill" command fails, print the stderr from the OCI runtime
only after we check the container state.

It also simplifies the code since we don't have to hard code the error
messages we want to ignore.

Closes: containers#18452

[NO NEW TESTS NEEDED] it fixes a flake.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe (Member)

Tentative fix: #19760

Not 100% sure it solves the problem, as I was not yet able to reproduce it locally to confirm.
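
The shape of that fix, as a hedged Go sketch with illustrative names (not libpod's exact code): buffer the runtime's stderr instead of passing it through, and only surface it once a fresh state check says the container did not exit on its own.

package sketch

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runtimeKill runs the OCI runtime's kill command with stderr captured
// rather than inherited, so nothing reaches the user yet.
func runtimeKill(runtimePath, ctrID, signal string) (string, error) {
	var stderr bytes.Buffer
	cmd := exec.Command(runtimePath, "kill", ctrID, signal)
	cmd.Stderr = &stderr // capture instead of leaking to the terminal
	err := cmd.Run()
	return stderr.String(), err
}

// stopContainer surfaces the captured stderr only when the state check
// (passed in here as a callback) says the container is still alive.
func stopContainer(exitedAfterSync func() bool) {
	out, err := runtimeKill("crun", "mycontainer", "SIGKILL")
	if err != nil && !exitedAfterSync() {
		// Only now is the runtime's stderr shown to the user.
		fmt.Printf("runtime kill failed: %v: %s\n", err, out)
	}
}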

edsantiago pushed this commit to edsantiago/libpod six times between Aug 26 and Aug 28, 2023, and giuseppe added it to giuseppe/libpod Aug 28, 2023, each time with the same commit message as above.
github-actions bot added the locked - please file new issue/PR label Nov 27, 2023
github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2023