Fix podman rm -fa with dependencies #18507

Merged

merged 14 commits into containers:main from fix_rm_depends Jun 12, 2023

Conversation

mheon
Member

@mheon mheon commented May 8, 2023

This is a major consolidation of the logic for removing dependency containers.

Previously, the logic was contained in a separate function that ran before actually removing the container. This didn't work properly, and its flow was honestly confusing - we had two completely separate paths for removing containers depending on whether or not they had dependencies. This consolidates all removal logic into the existing container removal functions. As part of this, we needed to make some signature changes: removeContainer can now remove 1 or more containers and 0 or more pods, and we should probably let the user know which containers and pods were removed as part of this. Pod removal also needed to be changed to report which containers were removed through it (this was also plumbed through the remote API, but is not yet exposed via the podman pod rm CLI because I feel it would be a breaking change to do so).
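
For illustration only, a minimal, self-contained sketch of the "report everything that was removed" idea described above (hypothetical type and field names, not the actual libpod API in this PR):

package main

import "fmt"

// Hypothetical report type: removing one container may cascade into removing
// dependent containers and whole pods, so the removal path returns everything
// it touched rather than a single result.
type removeReport struct {
	RemovedCtrs map[string]error // container ID -> error (nil means removed cleanly)
	RemovedPods map[string]error // pod ID -> error (nil means removed cleanly)
}

func main() {
	// Example of how a caller might surface the report to the user.
	report := removeReport{
		RemovedCtrs: map[string]error{"23cf575c1dcf": nil, "6b7a43da9f04": nil},
		RemovedPods: map[string]error{},
	}
	for id, err := range report.RemovedCtrs {
		if err != nil {
			fmt.Printf("error removing container %s: %v\n", id, err)
			continue
		}
		fmt.Println(id)
	}
}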

Fixes #18180

Does this PR introduce a user-facing change?

Fixed a bug where `podman rm -fa` could fail to remove dependency containers (such as pod infra containers).

@openshift-ci
Contributor

openshift-ci bot commented May 8, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2023
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label May 8, 2023
@mheon mheon force-pushed the fix_rm_depends branch 2 times, most recently from 165f62e to 0e5c3d4 on May 9, 2023 00:14
Member

@Luap99 Luap99 left a comment

Can you please squash the lint fixes into the correct commits?

libpod/runtime_ctr.go (review thread resolved)
test/e2e/rm_test.go (review thread resolved)
@mheon mheon force-pushed the fix_rm_depends branch 2 times, most recently from e7d1a55 to 9a5b581 on May 9, 2023 14:33
@Luap99
Member

Luap99 commented May 9, 2023

Wow, I didn't even expect the stderr check to fail, but this is exactly why it is needed:

           # podman [options] rm -fa
           time="2023-05-09T15:12:09Z" level=error msg="container not running"
           697f78b2fd3388420ac0839c6f3f071ebf87bb1e62a1f8ab0ec3d86862aab638
           7f5e8d5b95b63e2b41741d069db79b1038e198f734facafd6fa690b1cc446f5b

@Luap99
Member

Luap99 commented May 9, 2023

Because you have to repush anyway: during the demo I also saw 9bd833b, feel free to revert that as well.

@Luap99
Member

Luap99 commented May 9, 2023

> Wow, I didn't even expect the stderr check to fail, but this is exactly why it is needed:
>
>            # podman [options] rm -fa
>            time="2023-05-09T15:12:09Z" level=error msg="container not running"
>            697f78b2fd3388420ac0839c6f3f071ebf87bb1e62a1f8ab0ec3d86862aab638
>            7f5e8d5b95b63e2b41741d069db79b1038e198f734facafd6fa690b1cc446f5b

In case this isn't clear: `time="2023-05-09T15:12:09Z" level=error msg="container not running"` is logged by runc, not podman. In the crun tests it is `open pidfd: No such process`.

Which makes me think of #18452 - except it happens consistently in your PR, so with your change it is not a flake. To me this sounds like we are trying to stop a container that is already stopped.

@mheon
Member Author

mheon commented May 9, 2023

Since both runtimes are doing it I can't complain too hard, but this definitely feels like unnecessary error logging that is just going to cause problems... There is no way I can see to fully eliminate the TOCTOU race around stopping containers, so the best thing for Podman to do is just go ahead and kill the container, and then handle errors from the runtime as necessary (usually by just ignoring them - Podman exits cleanly here because it knows that nothing went wrong: the container is stopped, just not by us).

As to why it's 100% consistent with my change... I have some ideas (parallel removal of both the dependency and the dependent container in two threads, with the container that has dependencies trying to recurse to the dependent, introduces a lot of potentially interesting race conditions), but again, it shouldn't matter - Podman did the right thing and ate the error; the only issue is the OCI runtimes insisting on printing unnecessary warnings.

@Luap99
Member

Luap99 commented May 9, 2023

I agree there is always a race and we have little control over that. But as a user I really do not want to see an error reported - it just makes no sense. Problem 1 is that we pass the plain stdout/stderr streams down to the OCI runtime, which means we have zero control over the output it prints.
Second, the part I do not fully understand is: where is the race in your test? You spawn two top containers; the processes should not exit on their own. Unless there is a bug in both runtimes, podman is actually calling the OCI runtime twice to kill the process - otherwise the runtime would not complain.

@mheon
Member Author

mheon commented May 9, 2023

My current thinking is as follows: to remove a container, we have to stop it, which unlocks the container while we wait for it to successfully terminate. Given that there are two threads trying to remove the first container (the original thread, which presumably gets in first and successfully stops and removes it, and the second thread, which started removing the other container, tried to recurse to its dependency container, found it locked, but was able to grab the lock when stop unlocked it), we probably have two stopContainer operations overlapping.
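
A toy, self-contained sketch of the interleaving being described (not libpod code; stop() releases the lock while "waiting for exit", so a second remover can slip in and issue a redundant kill):

package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy model: the container is not marked stopped until after the wait, and the
// lock is released during the wait, so both goroutines can reach the signal.
type container struct {
	mu      sync.Mutex
	stopped bool
}

func (c *container) stop(who string) {
	c.mu.Lock()
	if c.stopped {
		c.mu.Unlock()
		return
	}
	fmt.Println(who, "sending stop signal") // both goroutines can get here
	c.mu.Unlock()                           // deliberately unlocked while "waiting for exit"
	time.Sleep(10 * time.Millisecond)       // stand-in for waiting on the container
	c.mu.Lock()
	c.stopped = true
	c.mu.Unlock()
}

func main() {
	c := &container{}
	var wg sync.WaitGroup
	for _, who := range []string{"thread 1", "thread 2"} {
		wg.Add(1)
		go func(who string) {
			defer wg.Done()
			c.stop(who)
		}(who)
	}
	wg.Wait()
}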

@mheon
Member Author

mheon commented May 9, 2023

Re-pushed with the extra commit reverted and changed removeContainer to accept a struct. I'll keep thinking about the OCI runtime thing. I'm leaning towards not printing runtime STDOUT/STDERR unless --log-level=debug or below is set, but that's a potentially significant change and I'd want signoff from the team on that.

@edsantiago
Collaborator

I think we might find that the runtime is responsible for some fatal errors where it's good to have users see the messages; like this, for a very quick example:

case "$runtime" in
# crun 0.20.1 changes the error message
# from /proc/thread-self/attr/exec`: .* unable to assign
# to /proc/self/attr/keycreate`: .* unable to process
crun) expect="\`/proc/.*\`: OCI runtime error: unable to \(assign\|process\) security attribute" ;;
# runc 1.1 changed the error message because of new selinux pkg that uses standard os.PathError, see
# https://github.com/opencontainers/selinux/pull/148/commits/a5dc47f74c56922d58ead05d1fdcc5f7f52d5f4e
# from failed to set /proc/self/attr/keycreate on procfs
# to write /proc/self/attr/keycreate: invalid argument
runc) expect=".*: \(failed to set\|write\) /proc/self/attr/keycreate.*" ;;
*) skip "Unknown runtime '$runtime'";;
esac

Also, @mheon, tests are failing with a panic. Looks like an argument mismatch or a dereference of an uninitialized pointer.

@vrothberg
Member

> I agree there is always a race and we have little control over that. But as a user I really do not want to see an error reported - it just makes no sense. Problem 1 is that we pass the plain stdout/stderr streams down to the OCI runtime, which means we have zero control over the output it prints.

Yes, we use os.Stdout and os.Stderr when exec'ing the runtimes. We should probably use streams and debug-log them by default.

@Luap99
Member

Luap99 commented May 10, 2023

I don't know what kinds of errors the runtimes report, but there could be a lot of them we actually want to see, as @edsantiago shows. And hiding them at debug level will make flake catching/debugging much harder in CI.

@vrothberg
Member

> I don't know what kinds of errors the runtimes report, but there could be a lot of them we actually want to see, as @edsantiago shows. And hiding them at debug level will make flake catching/debugging much harder in CI.

Absolutely. In case of an error, we need to think of a way to add the stdout and stderr logs to the error message.

In case there is no error but some logs on stdout/stderr, then I think they should go into the debug log.

@Luap99
Member

Luap99 commented May 10, 2023

> > I don't know what kinds of errors the runtimes report, but there could be a lot of them we actually want to see, as @edsantiago shows. And hiding them at debug level will make flake catching/debugging much harder in CI.
>
> Absolutely. In case of an error, we need to think of a way to add the stdout and stderr logs to the error message.
>
> In case there is no error but some logs on stdout/stderr, then I think they should go into the debug log.

Agreed, but this is not something this PR should do, because it would hide the issue: this patch is currently broken. It tries to kill the same container twice, which is not something we should do.

It is trivial to reproduce locally. The flake happens in more than 50% of tries:

$ bin/podman run --name t1 -d alpine top && bin/podman run --name t2 --network container:t1 -d alpine top && bin/podman rm -fa
23cf575c1dcfcc865ac8332957eaf1e111acc475ff9f6a225636968aae8d1de1
6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6
ERRO[0000] Sending signal 15 to container 6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6 
ERRO[0000] Sending signal 15 to container 6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6 
2023-05-10T08:40:02.420152Z: open pidfd: No such process
ERRO[0000] Sending signal 15 to container 23cf575c1dcfcc865ac8332957eaf1e111acc475ff9f6a225636968aae8d1de1 
6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6
23cf575c1dcfcc865ac8332957eaf1e111acc475ff9f6a225636968aae8d1de1

Note: I patched logrus.Errorf in (diff below) to prove that podman is actually trying to kill it twice:

diff --git a/libpod/oci_conmon_common.go b/libpod/oci_conmon_common.go
index d803276b1..99a6dccc5 100644
--- a/libpod/oci_conmon_common.go
+++ b/libpod/oci_conmon_common.go
@@ -356,7 +356,7 @@ func generateResourceFile(res *spec.LinuxResources) (string, []string, error) {
 // If all is set, send to all PIDs in the container.
 // All is only supported if the container created cgroups.
 func (r *ConmonOCIRuntime) KillContainer(ctr *Container, signal uint, all bool) error {
-       logrus.Debugf("Sending signal %d to container %s", signal, ctr.ID())
+       logrus.Errorf("Sending signal %d to container %s", signal, ctr.ID())
        runtimeDir, err := util.GetRuntimeDir()
        if err != nil {
                return err

@mheon
Member Author

mheon commented May 10, 2023 via email

@Luap99
Member

Luap99 commented May 10, 2023

> I fully expected this, and honestly I don’t see it as much of a problem
> modulo the OCI runtime error - there are no real consequences for the
> double kill beyond a few wasted syscalls. If you want it fixed I will look
> into it but it will be a substantial change.

Maybe I am missing something, but to me this looks like our locking is just not working correctly. The problem here is clearly two threads trying to stop the container in parallel. So why do we allow two threads to change the container state at the same time? If locking worked as I expect, the second thread should block, and once it gets the lock it must check for stopped/removed and not try to do the same work again. A sketch of that pattern follows below.
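
For reference, a generic sketch of the re-check-after-lock pattern being asked for here (toy code, not libpod's implementation):

package main

import (
	"fmt"
	"sync"
)

// The second caller blocks on the lock; once it has the lock it re-checks the
// container state and does nothing if the work is already done. Crucially, the
// state is updated before the lock is released.
type container struct {
	mu      sync.Mutex
	stopped bool
	removed bool
}

func (c *container) stopOnce(who string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stopped || c.removed {
		fmt.Println(who, "found container already stopped/removed; nothing to do")
		return
	}
	fmt.Println(who, "sending stop signal")
	c.stopped = true // state updated while the lock is still held
}

func main() {
	c := &container{}
	c.stopOnce("thread 1") // signals
	c.stopOnce("thread 2") // no-op
}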

@vrothberg
Member

Not sure whether that applies to this exact symptom, but we decided to allow stopping a container even if it's in the stopping state, since the stopping/killing process may itself have been killed.

@mheon
Member Author

mheon commented May 11, 2023

My current tracing supports what @vrothberg is saying. Stopping the container deliberately unlocks it (to ensure podman ps and friends don't hang). This allows the second thread to get in and recursively start removing the container that is in the process of being stopped, which rapidly leads to the second thread also trying to stop the container. The race is whether the container has time to exit before the second thread sends a signal. It's actually quite a narrow race, because we do ping the container's PID 1 with signal 0 beforehand, so I'm surprised it manages to be this consistent.
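
As an aside, the "signal 0" probe mentioned here is the classic kill(pid, 0) existence check; a minimal sketch (standard library only, not libpod's code):

package main

import (
	"fmt"
	"syscall"
)

// pidAlive reports whether a process with the given PID currently exists by
// "sending" signal 0: no signal is actually delivered, the kernel only runs
// the existence/permission checks (Unix-only). Note the inherent TOCTOU: the
// process can exit right after this check and before any real signal is sent.
func pidAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	return err == nil || err == syscall.EPERM // EPERM still means the PID exists
}

func main() {
	fmt.Println("own process alive:", pidAlive(syscall.Getpid()))
}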

@mheon
Member Author

mheon commented May 11, 2023

Added the second, in-memory container lock we were talking about; let's see if the tests start deadlocking.

@TomSweeneyRedHat
Member

No joy on the tests @mheon

libpod/container.go (outdated review thread, resolved)
mheon added 2 commits June 1, 2023 16:27
This reverts commit 9bd833b.

With the fix for `podman rm -fa` merged, we no longer require
this patch.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We had something like 6 different boolean options (removing a
container turns out to be rather complicated, because there are a
million-odd things that want to do it), and the function
signature was getting unreasonably large. Change to a struct to
clean things up.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2023
Comment on lines 773 to 774
c.lock.Lock()
defer c.lock.Unlock()
c.perProcessLock.Lock()
Member

Thinking about this more, could that cause a deadlock?

| thread 1                           | thread 2                                |
|-------------------------------------|-----------------------------------------|
| c.lock()                            |                                         |
| perProcess.lock()                   |                                         |
| # now trying to stop                |                                         |
| c.unlock()                          |                                         |
|                                     | c.lock() (lock taken)                   |
|                                     | perProcess.lock()  # hangs on this lock |
| c.lock() after stop - now hanging   |                                         |
---> possible ABBA deadlock

Or maybe I am missing something here; the CI logs do not show any deadlock but instead still show the "no such process" runtime logs, so I assume the lock isn't working right now.

Member Author

I spent a few days digging into this, and the results are deeply concerning. Findings are summarized in the commit message for 7f43cf5 in this PR. In short, I think it's fixed, but locks have been very broken within the same process for a very long time, and it seems that sync.Mutex also does not behave as we expect.

Member Author

Given this, I am considering removing the per-process lock, but I'd like a signoff from @vrothberg since I recall he felt it was needed.

Member Author

Things are still failing, but it looks like it's back to the original issue (unlocking during stop -> the second thread stopping the container grabs the lock and continues) and no longer due to defective Libpod locks. So it has turned from a 100% reproducible thing into a race, albeit a very common one.

The good news is that the last commit does appear to have resolved our in-memory locks not actually providing mutual exclusion within the same thread. The bad news is that the sync.Mutex is, indeed, not doing its job, and it seems that it's still possible for multiple threads to stop the container simultaneously. I am at a loss as to why that might be. Will investigate further.

Member Author

I understand now. We have multiple copies of the Container struct, so multiple copies of the sync.Mutex, so it's not protecting anything. Looking into making it a pointer and having the DB share it between all containers.
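
A tiny self-contained demonstration of the pitfall described here: a sync.Mutex embedded by value protects nothing once the struct is copied, while sharing a pointer restores mutual exclusion (toy code, not libpod's):

package main

import (
	"fmt"
	"sync"
)

type byValue struct {
	mu sync.Mutex
}

type shared struct {
	mu *sync.Mutex // one mutex shared by every copy
}

func main() {
	a := byValue{}
	b := a // copies the mutex (go vet's copylocks check flags exactly this)
	a.mu.Lock()
	b.mu.Lock() // does NOT block: a.mu and b.mu are independent locks now
	fmt.Println("both value copies locked 'successfully'")

	m := &sync.Mutex{}
	x := shared{mu: m}
	y := x // copies the pointer: x and y still share one mutex
	x.mu.Lock()
	locked := y.mu.TryLock() // false, because it is the same mutex
	fmt.Println("second lock on shared mutex succeeded:", locked)
}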

Member

Yes, but I don't think it is worth it, as I really believe it will cause deadlocks - see my table above.

Member Author

Hm. Can we solve that by taking the per-process lock after the main container lock, but retaining the behavior where it is released second? This might also merit discussion during the watercooler meeting; I think we need some sort of fix here because the race is pretty nasty in CI.

Member Author

Noting the decision for posterity: we will remove the extra lock and just ignore the error messages from runc/crun that are specific to killing a container that is already dead (though we can still print them as debug logs).

@@ -136,6 +136,14 @@ shm_struct_t *setup_lock_shm(char *path, uint32_t num_locks, int *error_code) {
goto CLEANUP_UNMAP;
}

// Ensure that recursive locking of a mutex by the same OS thread (which may
// refer to numerous goroutines) blocks.
ret_code = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
Member

PTHREAD_MUTEX_DEFAULT
Attempting to recursively lock a mutex of this type results in undefined behavior. Attempting to unlock a mutex of this type which was not locked by the calling thread results in undefined behavior. Attempting to unlock a mutex of this type which is not locked results in undefined behavior. An implementation may map this mutex to one of the other mutex types.

What a lovely default.

Note that the shm lock code already does the required runtime.LockOSThread() on lock, so the OS thread will not be reused until we call runtime.UnlockOSThread() in the unlock call. Therefore it should be impossible for Go to schedule two goroutines on this thread, and we should never run into the situation where we actually need this. Also see TestLockAndUnlockTwoSemaphore.

Also, on my Fedora laptop PTHREAD_MUTEX_DEFAULT already maps to PTHREAD_MUTEX_NORMAL, so this shouldn't make a difference, at least on current Fedora. Of course it is still required to set it explicitly, given that the behavior is not specified anywhere and other C libraries may behave differently.
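
For context, a simplified model of the lock/unlock pattern referenced above (the real code lives in libpod's shm lock package and calls into C; here a sync.Mutex stands in for the shared-memory pthread mutex):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// The goroutine is pinned to its OS thread for as long as the (stand-in)
// pthread mutex is held, so lock and unlock happen on the same thread.
type shmLock struct {
	mu sync.Mutex
}

func (l *shmLock) Lock() {
	runtime.LockOSThread() // pin this goroutine to the current OS thread
	l.mu.Lock()
}

func (l *shmLock) Unlock() {
	l.mu.Unlock()
	runtime.UnlockOSThread() // only now may the thread be released
}

func main() {
	var l shmLock
	l.Lock()
	fmt.Println("locked while pinned to one OS thread")
	l.Unlock()
}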

Member Author

You'd think that lock/unlock OS thread would do that, but I'm actually seeing that, when we start a parallel container removal (5 containers, 48 jobs), every single one of them starts with the same PID. I looked to see if we were holding a lock that might force every part of Podman onto the same OS thread (and basically eliminate threading within Podman), but that didn't seem to be the case?

Member Author

Regardless, it seems that Golang is still capable of swapping between goroutines at will on a given thread, even after LockOSThread() is called. It just guarantees that the goroutine in question will only start on the thread in question, not that it won't schedule others there as well.

Member

That goes directly against the Go docs for this function:

LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. If the calling goroutine exits without unlocking the thread, the thread will be terminated.

If it actually still executes other goroutines on that thread, we have a major issue. We also use this to switch namespaces, set SELinux labels, etc. If this doesn't work correctly right now, it needs to be fixed in Go.

@edsantiago
Collaborator

@mheon FWIW your "handle recursive locking properly" commit continues to work just fine on #17831. I can't remember if it's four runs or five, but either way, no unusual flakes.

mheon added 5 commits June 7, 2023 14:09
This fixes a lint issue, but I'm keeping it in its own commit so
it can be reverted independently if necessary; I don't know what
side effects this may have. I don't *think* there are any
issues, but I'm not sure why it wasn't a pointer in the first
place, so there may have been a reason.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
This was causing some CI flakes. I'm pretty sure that the pods
being removed already isn't a bug, but just the result of another
container in the pod removing it first - so no reason not to
ignore the errors.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
If the first container to get the pod lock is the infra container
it's going to want to remove the entire pod, which will also
remove every other container in the pod. Subsequent containers
will get the pod lock and try to access the pod, only to realize
it no longer exists - and that, actually, the container being
removed also no longer exists.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
We use shared-memory pthread mutexes to handle mutual exclusion
in Libpod. It turns out that these have configurable options for
how to handle a recursive lock (IE, a thread trying to lock a
lock that the same thread had previously locked). The mutex can
either deadlock, or allow the duplicate lock without deadlocking.
Default behavior is, helpfully, unspecified, so if not explicitly
set there is no clear indication of which of these behaviors will
be seen. Unfortunately, today is the first I learned of this, so
our initial implementation did *not* explicitly set our preferred
behavior.

This turns out to be a major problem with a language like Golang,
where multiple goroutines can (and often do) use the same OS
thread. So we can have two goroutines trying to stop the same
container, and if the no-deadlock mutex behavior is in use, both
threads will successfully acquire the lock because the C library,
not knowing about Go's lightweight threads, sees the same PID
trying to lock a mutex twice, and allows it without question.

It appears that, at least on Fedora/RHEL/Debian libc, the default
(unspecified) behavior of the locks is the non-deadlocking
version - so, effectively, our locks have been of questionable
utility within the same Podman process for the last four years.
This is somewhat concerning.

What's even more concerning is that the Golang-native sync.Mutex
that was also in use did nothing to prevent the duplicate locking
(I don't know if I like the implications of this).

Anyways, this resolves the major issue of our locks not working
correctly by explicitly setting the correct pthread mutex
behavior.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon mheon force-pushed the fix_rm_depends branch 4 times, most recently from 40620ff to ec7c6c1 on June 8, 2023 05:07
There are certain messages logged by OCI runtimes when killing a
container that has already stopped that we really do not care
about when stopping a container. Due to our architecture, there
are inherent races around stopping containers, and so we cannot
guarantee that *we* are the people to kill it - but that doesn't
matter because Podman only cares that the container has stopped,
not who delivered the fatal signal.

Unfortunately, the OCI runtimes don't understand this, and log
various warning messages when the `kill` command is invoked on a
container that was already dead. These cause our tests to fail,
as we now check for clean STDERR when running Podman. To work
around this, capture STDERR for the OCI runtime in a buffer only
for stopping containers, and go through and discard any of the
warnings we identified as spurious.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon
Member Author

mheon commented Jun 8, 2023

Tests passing, this is ready now. @Luap99 PTAL

Comment on lines +423 to +436

// Before handling error from KillContainer, convert STDERR to a []string
// (one string per line of output) and print it, ignoring known OCI runtime
// errors that we don't care about
stderrLines := strings.Split(stderr.String(), "\n")
for _, line := range stderrLines {
if line == "" {
continue
}
if strings.Contains(line, "container not running") || strings.Contains(line, "open pidfd: No such process") {
logrus.Debugf("Failure to kill container (already stopped?): logged %s", line)
continue
}
fmt.Fprintf(os.Stderr, "%s\n", line)
Member

Should we do this only in the err != nil case?

I am not happy with string matching either, but I cannot offer anything better, so I am OK with it. I guess CI will tell us quickly if these strings ever change.

Member Author

I was thinking we'd still want to log warnings even in the err == nil case, but I don't feel strongly about it and would be happy to downgrade them to debug logs when there is no error. A sketch of that alternative follows below.
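
For clarity, a small self-contained sketch of the alternative being discussed here (gate the pass-through on err != nil; hypothetical helper names, not what was merged):

package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// isSpuriousKillMessage mirrors the string matches in the merged code above.
func isSpuriousKillMessage(line string) bool {
	return strings.Contains(line, "container not running") ||
		strings.Contains(line, "open pidfd: No such process")
}

// forwardRuntimeStderr is the hypothetical err != nil variant: unrecognized
// runtime stderr is only passed through to the user when the kill actually
// failed; spurious "already dead" noise is always dropped. In real code the
// dropped lines would go to a debug log instead of being discarded.
func forwardRuntimeStderr(stderr string, killErr error) {
	for _, line := range strings.Split(stderr, "\n") {
		if line == "" || isSpuriousKillMessage(line) {
			continue
		}
		if killErr != nil {
			fmt.Fprintf(os.Stderr, "%s\n", line)
		}
	}
}

func main() {
	forwardRuntimeStderr("container not running\n", nil)                          // silent
	forwardRuntimeStderr("some real runtime failure\n", errors.New("kill failed")) // forwarded
}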

Member

Yeah, honestly I'm not sure either; I am good with this as well.
The PR is green in this state, which is a very good sign.

@Luap99
Copy link
Member

Luap99 commented Jun 8, 2023

LGTM

@mheon
Member Author

mheon commented Jun 8, 2023

@containers/podman-maintainers PTAL

@TomSweeneyRedHat
Member

TomSweeneyRedHat commented Jun 8, 2023

LGTM
But I definitely want another head nod given its size.

}

logrus.Infof("Removing pod %s (dependency of container %s)", pod.ID(), c.ID())
if err := r.removePod(ctx, pod, true, force, timeout); err != nil {
Member

nit: This should just have returned r.removePod...

No need to hold up merge.

@rhatdan
Member

rhatdan commented Jun 12, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 12, 2023
@openshift-merge-robot openshift-merge-robot merged commit 3cae574 into containers:main Jun 12, 2023
87 checks passed
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 11, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 11, 2023
Closes: podman rm --all not working with containers that have dependencies (#18180)