Correct clean-up of Windows Layers in testsuite #5133

TBBle · 2021-03-07T01:29:15Z

A fix for sys.ForceRemoveAll on Windows: Clean up individual layers from the windows snapshotter using wclayer APIs

- Walk the provided directory to find WCOW layers.
- Walk the layer directories in numerically-falling order.
  - This is fine for the test suite, but in wider usage, we would actually need to walk child->parent using the snapshotter's state.
- For each layer: Unprepare (ignoring ERROR_DEV_NOT_EXIST), Deactivate, Destroy.
- Then os.RemoveAll the root dir like we do on non-Windows.

Tested with a now-reverted hack to reproduce the issue reliably on CI. It reproduced on the fourth of eight repeats of the Windows Integration test. Note that due to the 30-minute timeout, it would be cancelled during run 5. I'm not sure why the suite takes 4 minutes now; it used to take 1 minute when running in parallel, based on some of the logs in #4924 (comment).

Fixes: #4924

I have cherry-picked the fix into #4419, since that PR enables the full Snapshot test suite on Windows, and hence reproduces #4924 almost every time, including cases where the tests themselves have failed and may have left the system in a bad state, i.e. files locked open preventing clean-up. This change should remove the panic we see in this case, but may still fail to clean up with nice, readable errors. That said, even when #4419 fails, with this change it's cleaning up without error, so I suspect its "file still open" problems are transitory, or the clean-up routine here works correctly to unstick them.

k8s-ci-robot · 2021-03-07T01:29:23Z

Hi @TBBle. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

theopenlab-ci · 2021-03-07T01:37:35Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 6m 35s (non-voting)

theopenlab-ci · 2021-03-07T03:06:21Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 6m 21s (non-voting)

theopenlab-ci · 2021-03-07T03:40:45Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 6m 17s (non-voting)

theopenlab-ci · 2021-03-07T04:07:18Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 5m 38s (non-voting)

theopenlab-ci · 2021-03-07T04:30:17Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 5m 43s (non-voting)

theopenlab-ci · 2021-03-07T05:10:51Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 5m 38s (non-voting)

theopenlab-ci · 2021-03-13T00:28:28Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 5m 45s (non-voting)

theopenlab-ci · 2021-03-14T04:12:57Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 6m 15s (non-voting)

sys/filesys_windows.go

cpuguy83

I'm not familiar with the windows snapshotter layout and couldn't really determine it just by looking at the code, so I'm assuming it's a base dir containing directories with integer names.

Other than that the Unprepare/Deactivate calls look similar to some cleanup code in the moby windows graphdriver, so 👍

sys/filesys_windows.go

cpuguy83 · 2021-03-19T18:13:18Z

sys/filesys_windows.go

+			if layerNum, err := strconv.Atoi(filepath.Base(path)); err == nil {
+				layerNums = append(layerNums, layerNum)
+			} else {
+				return err


Returning an error doesn't seem right here because this stops cleanup.

I... kind-of want to stop cleanup and fail, because this means (unless Atoi has other failure cases) that the Windows snapshotter behaviour has changed, and this code wasn't updated. I'd want that to fail CI.

To be clear, the only circumstances I could hit this with the current code-base is an rm-<digits> directory, created during func (s *snapshotter) Remove, and having fallen through either "Failed to rename after failed commit" or "Failed to remove root filesystem". In that case, it seems quite likely we will fail to call cleanupWCOWLayer on it anyway.

I could refactor this to collect all the directories, best-effort sort them (I'd have to strip the rm- prefix so that I remove child layers before parent layers), and then record any cleanupWCOWLayer failure but still try it on all of them. The outcome would be that if a child layer is stuck, it and all its parent layers would report errors during clean-up, and we'd probably stumble over the panic that prompted this PR in the first place, as we'd be trying to remove a parent layer while its child is activated.

Huh, and rebasing to master, I managed to trigger exactly this code-path on CI:

2021-03-20T00:25:28.7332957Z failed to remove test root dir failed to cleanup WCOW layers in C:\Program Files\containerd\root-test\io.containerd.snapshotter.v1.windows\snapshots: strconv.Atoi: parsing "rm-26": invalid syntax
2021-03-20T00:25:28.8327339Z exit status 1
2021-03-20T00:25:28.8328816Z FAIL github.com/containerd/containerd/integration/client 358.389s
2021-03-20T00:25:28.9091215Z mingw32-make: *** [Makefile:181: integration] Error 1

The layer itself doesn't seem to have caused any failures in the tests, so my guess is that this hit the "Failed to remove root filesystem" Warnf, and that doesn't turn into a test failure (or get logged anywhere...). Annoyingly, there is a hcsshim::DeactivateLayer span, but not a hcsshim::DestroyLayer span, so I can't definitively tie this to a particular test.

I suspect this represents a gap in the containerd Windows snapshotter, as it's never cleaning up left-over rm- layer directories... Particularly because if I'm reading the test logs correctly, the directory in question came from "Integration 1", but the cleanup failure was the end of "Integration 2".

Thinking about it, it might be reasonable to, during this walk, just os.RemoveAll or cleanupWCOWLayer any rm-* layers we encounter during the Walk. My main concern is that if a stack of layers fails removal, and ends up all being renamed to rm-*, we'll walk them in the wrong order and hit the panic again.

Upshot for me is that if this routine fails as-is, we should have failed earlier because the tests have left the Snapshotter in an inconsistent state, and I'm not sure if trying to force the clean-up harder is productive at that point, as it seems like clean-up will fail for the same reasons that left the Snapshotter state inconsistent.

TBBle · 2021-03-20T01:08:49Z

Looks like the Linux CI pipeline is borked on master... Windows is passing, except when we somehow trigger a case which suggests (to me) a test-suite failure that went uncaught.

sys/filesys_windows.go

kevpar

LGTM. Thank you for working on this. :)

estesp · 2021-03-22T22:17:45Z

Sorry for the timing of your last rebase--you caught the partial day where we had an issue in main with the test image for Linux; if you rebase now the Linux failures should disappear.

TBBle · 2021-03-23T08:50:00Z

Three clean runs on Windows (rebase to master, comment improvement, and go mod tidy), although two of them hit unrelated failures on Linux, one in CGroups 2, and the latest one in CRI:

[Fail] [k8s.io] Image Manager [It] listImage should get exactly 3 repoTags in the result image [Conformance]
/home/runner/work/containerd/containerd/src/github.com/kubernetes-sigs/cri-tools/pkg/framework/util.go:352

As noted in microsoft/hcsshim#961, an OS-level fix should change the triggering situation (mis-ordered unmounts) from a panic to a failure, but still better to not cause the situation in the first place. ^_^

This ensures that we do not trigger assertions inside HCS by trying to call hcsshim.DestroyLayer on the parent of a currently-activated layer. It also deactivates the layers before deletion, to ensure we trigger or avert file-in-use failures due to leftover state from the tests with more detail than 'destroy failed'.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>

cpuguy83

LGTM

estesp

LGTM

k8s-ci-robot added the needs-ok-to-test label Mar 7, 2021

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch 2 times, most recently from 10e1f78 to 4d2d65c Compare March 7, 2021 02:58

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from 4d2d65c to f887038 Compare March 7, 2021 03:33

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from f887038 to 0255ffe Compare March 7, 2021 04:00

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from 0255ffe to a6c2aca Compare March 7, 2021 04:23

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from a6c2aca to e4abc14 Compare March 7, 2021 05:03

TBBle changed the title ~~Correct cleanup of Windows Layers in testsuite~~ Correct clean-up of Windows Layers in testsuite Mar 7, 2021

TBBle mentioned this pull request Mar 7, 2021

Support Unwrap for hcserror microsoft/hcsshim#960

Open

TBBle marked this pull request as ready for review March 7, 2021 05:15

This was referenced Mar 7, 2021

Mount snapshots on Windows #4419

Closed

Flaky windows integration test #4924

Closed

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from e4abc14 to 8d7e735 Compare March 13, 2021 00:21

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from 8d7e735 to 554e6be Compare March 14, 2021 04:05

ambarve reviewed Mar 17, 2021

crosbymichael requested review from cpuguy83 and kevpar March 19, 2021 17:24

cpuguy83 reviewed Mar 19, 2021

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch 2 times, most recently from b0a4d01 to fc1131d Compare March 20, 2021 00:43

kevpar mentioned this pull request Mar 22, 2021

hcsshim panics over an exception when destroying the base layer of a mounted layer microsoft/hcsshim#961

Closed

kevpar reviewed Mar 22, 2021

kevpar approved these changes Mar 22, 2021

pacoxu mentioned this pull request Mar 23, 2021

Some failure on Windows (could be flaky) timeout during destroy layers #5247

Closed

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch 3 times, most recently from e5968e2 to aa9ffb2 Compare March 23, 2021 08:12

TBBle added 2 commits March 25, 2021 05:26

go mod tidy the client integration test module

1fd3d12

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>

TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from aa9ffb2 to 1fd3d12 Compare March 24, 2021 18:26

cpuguy83 approved these changes Mar 24, 2021

estesp approved these changes Mar 24, 2021

estesp merged commit bcda849 into containerd:master Mar 24, 2021

thaJeztah mentioned this pull request Mar 25, 2021

integration/client doesn't update go.mod with latest version of containerd #5259

Closed

TBBle deleted the correct-cleanup-of-windows-layers-in-test branch March 25, 2021 11:32

thaJeztah mentioned this pull request Apr 9, 2021

go.mod: github.com/Microsoft/hcsshim v0.8.16 #5326

Merged

TBBle mentioned this pull request Aug 20, 2021

cleanupWCOWLayers assumes all layer basenames are numbers, but seems they aren't. #5736

Closed
