Improve robustness of pod removal #3082
Conversation
Removing a pod must first remove all containers in the pod. Libpod requires the state to remain consistent at all times, so references to a deleted pod must all be cleansed first.

Pods can have many containers in them. We presently iterate through all of them, and if an error occurs trying to clean up and remove any single container, we abort the entire operation (but cannot recover anything already removed - pod removal is not an atomic operation).

Because of this, if a removal error occurs partway through, we can end up with a pod in an inconsistent state that is no longer usable. What's worse, if the error is in the infra container, and it's persistent, we get zombie pods - completely unable to be removed.

When we saw some of these same issues with containers not in pods, we modified the removal code there to aggressively purge containers from the database, then try to clean up afterwards. Take the same approach here, and make cleanup errors nonfatal. Once we've gone ahead and removed containers, we need to see pod deletion through to the end - we'll log errors but keep going.

Also, fix some other small things (most notably, we didn't generate events for the containers removed).

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
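A minimal sketch of the removal pattern described above (the container and state helpers here are hypothetical stand-ins, not libpod's actual API): cleanup errors are logged and the first one recorded, rather than aborting, so removal always runs to completion and no dangling pod references survive.

package main

import (
	"fmt"
	"log"
)

// Hypothetical stand-ins for libpod's container and state types.
type container struct{ id string }

func (c *container) cleanup() error      { return nil } // tear down netns, cgroups, mounts
func (c *container) removeFromDB() error { return nil } // purge all state references

// removePod evicts every container, treating cleanup errors as nonfatal:
// they are logged and the first one is preserved, but removal runs to the
// end so the pod can never be left half-deleted and unusable.
func removePod(ctrs []*container) error {
	var removalErr error
	for _, ctr := range ctrs {
		if err := ctr.cleanup(); err != nil {
			log.Printf("error cleaning up container %s: %v", ctr.id, err)
			if removalErr == nil {
				removalErr = err
			}
		}
		// Aggressively purge the container from the database regardless,
		// so no references to the deleted pod remain.
		if err := ctr.removeFromDB(); err != nil && removalErr == nil {
			removalErr = err
		}
	}
	return removalErr
}

func main() {
	fmt.Println(removePod([]*container{{id: "abc123"}}))
}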
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mheon.
I'm thinking about modifying the API to also return a map[string]error for errors removing individual containers, so we can accurately return errors that occurred as we evicted individual containers (instead of only logging like we do now).
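Roughly the shape that API change would take (a hypothetical sketch, not the code that was ultimately merged; all types here stand in for libpod's Runtime, Pod, and container):

package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for libpod's Runtime, Pod, and container types.
type container struct{ id string }
type Pod struct{ ctrs []*container }
type Runtime struct{}

func (r *Runtime) evict(c *container) error {
	if c.id == "bad" {
		return errors.New("cleanup failed")
	}
	return nil
}

// RemovePod returns per-container errors so callers can report exactly
// which containers failed to remove, instead of only seeing log lines.
func (r *Runtime) RemovePod(pod *Pod) (map[string]error, error) {
	ctrErrs := make(map[string]error)
	for _, ctr := range pod.ctrs {
		if err := r.evict(ctr); err != nil {
			ctrErrs[ctr.id] = err
		}
	}
	if len(ctrErrs) > 0 {
		return ctrErrs, errors.New("error removing some pod containers")
	}
	return nil, nil
}

func main() {
	r := &Runtime{}
	ctrErrs, err := r.RemovePod(&Pod{ctrs: []*container{{id: "ok"}, {id: "bad"}}})
	fmt.Println(ctrErrs, err)
}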
Ensure that, if an error occurs somewhere along the way when we remove a pod, it's preserved until the end and returned, even as we continue to remove the pod.

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
Pushed a patch that fixes this without API changes. It's less elegant, but it gets the job done.
Test issues: this PR bumped CGroup removal from a warning that was never reported, to a reported error. It looks like removing cgroupfs CGroups on CI is very unreliable?
@cevich I'm seeing the […]
@mheon thx, useful to know it goes with passing tests also. (edit) Mmmm, I see it's coming from vendor code as well. I was going to send a PR and even add a unit test, but...sigh it wouldn't be accepted upstream (most likely).
LGTM
	// Clean up network namespace, cgroups, mounts
	if err := ctr.cleanup(ctx); err != nil {
-		return err
+		if removalErr == nil {
+			removalErr = err
+		}
I think it would be good to add a logrus.debug here. Might make it easier later on to track down exactly what went south. Ditto the other "removalErr = err" lines below.
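Concretely, the suggestion amounts to something like this sketch (assuming github.com/sirupsen/logrus, which the project already uses; the container type is a stand-in):

package main

import (
	"context"
	"errors"

	"github.com/sirupsen/logrus"
)

// Stand-in for libpod's Container type.
type ctr struct{ id string }

func (c *ctr) cleanup(ctx context.Context) error { return errors.New("umount failed") }

func main() {
	logrus.SetLevel(logrus.DebugLevel)
	c := &ctr{id: "abc123"}
	var removalErr error
	if err := c.cleanup(context.Background()); err != nil {
		// Debug log at the point the nonfatal error is recorded, to make
		// it easier to track down later exactly what went south.
		logrus.Debugf("error cleaning up container %s: %v", c.id, err)
		if removalErr == nil {
			removalErr = err
		}
	}
	_ = removalErr
}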
I think most of these should already be good errors including the container ID, but I'll slap some Wrapfs on them to be sure
Alright, I checked, and everything not already wrapped is returning a well-formatted, wrapped error
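For illustration, the wrapping pattern in question (a sketch assuming github.com/pkg/errors, which libpod vendored at the time): each call site wraps with a distinct message, and that string is what you would grep for to identify the failing function.

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func cleanup() error         { return fmt.Errorf("umount failed") }
func teardownStorage() error { return nil }

func main() {
	ctrID := "abc123"
	// Distinct wrap strings per call site mean the resulting message
	// pinpoints which step failed, e.g.
	// "error cleaning up container abc123: umount failed"
	if err := cleanup(); err != nil {
		fmt.Println(errors.Wrapf(err, "error cleaning up container %s", ctrID))
	}
	if err := teardownStorage(); err != nil {
		fmt.Println(errors.Wrapf(err, "error removing container %s storage", ctrID))
	}
}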
@mheon Thanks!
And the errors are clear enough to know that it came from ctr.cleanup failing at line 231 vs ctr.teardownStorage failing at line 240?
If you grep for the specific strings we wrap with, yes - you'll be able to get the specific function that failed
OK then, I'll trust you on this one @mheon, thanks for the discussion.
LGTM
/lgtm |