Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct clean-up of Windows Layers in testsuite #5133

Merged

Conversation

TBBle
Copy link
Contributor

@TBBle TBBle commented Mar 7, 2021

A fix for sys.ForceRemoveAll on Windows: Clean up individual layers from the windows snapshotter using wclayer APIs

  • Walk the provided directory to find WCOW layers
  • Walk the layer directories in numerically-falling order
    • This is fine for the test suite, but in wider usage, we would actually need to walk child->parent using the snapshotter's state.
  • For each layer: Unprepare (ignoring ERROR_DEV_NOT_EXIST), Deactivate, Destroy.
  • Then os.RemoveAll the root dir like we do on non-Windows.

Tested with a now-reverted hack to reproduce the issue reliably on CI. It reproduced on the fourth of eight repeats of the Windows Integration test. Note that due to 30 minute timeout, it would be cancelled during run 5. I'm not sure why the suite takes 4 minutes now, it used to take 1 minute when running in parallel, based on some of the logs in #4924 (comment).

Fixes: #4924

I have cherry-picked the fix into #4419, since that PR enables the full Snapshot test suite on Windows, and hence reproduces #4924 almost every time, including cases where the tests themselves have failed and may have left the system in a bad state, i.e. files locked open preventing clean-up. This change should remove the panic we see in this case, but may still fail to clean up with nice, readable errors. That said, even when #4419 fails, with this change it's cleaning up without error, so I suspect its "file still open" problems are transitory, or the clean-up routine here works correctly to unstick them.

@k8s-ci-robot
Copy link

Hi @TBBle. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 7, 2021

Build succeeded.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch 2 times, most recently from 10e1f78 to 4d2d65c Compare March 7, 2021 02:58
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 7, 2021

Build succeeded.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from 4d2d65c to f887038 Compare March 7, 2021 03:33
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 7, 2021

Build succeeded.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from f887038 to 0255ffe Compare March 7, 2021 04:00
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 7, 2021

Build succeeded.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from 0255ffe to a6c2aca Compare March 7, 2021 04:23
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 7, 2021

Build succeeded.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from a6c2aca to e4abc14 Compare March 7, 2021 05:03
@TBBle TBBle changed the title Correct cleanup of Windows Layers in testsuite Correct clean-up of Windows Layers in testsuite Mar 7, 2021
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 7, 2021

Build succeeded.

@TBBle TBBle marked this pull request as ready for review March 7, 2021 05:15
@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from e4abc14 to 8d7e735 Compare March 13, 2021 00:21
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 13, 2021

Build succeeded.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from 8d7e735 to 554e6be Compare March 14, 2021 04:05
@theopenlab-ci
Copy link

theopenlab-ci bot commented Mar 14, 2021

Build succeeded.

Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the windows snapshotter layout and couldn't really determine it just looking at the code, so assuming its a base dir with directories with integer names.

Other than that the Unprepare/Deactivate calls look similar to some cleanup code in the moby windows graphdriver, so 👍

sys/filesys_windows.go Outdated Show resolved Hide resolved
if layerNum, err := strconv.Atoi(filepath.Base(path)); err == nil {
layerNums = append(layerNums, layerNum)
} else {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an error doesn't seem right here because this stops cleanup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I... kind-of want to stop cleanup and fail, because this means (unless Atoi has other failure cases) that the Windows snapshotter behaviour has changed, and this code wasn't updated. I'd want that to fail CI.

Copy link
Contributor Author

@TBBle TBBle Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the only circumstances I could hit this with the current code-base is an rm-<digits> directory, created during func (s *snapshotter) Remove, and having fallen through either "Failed to rename after failed commit" or "Failed to remove root filesystem". In that case, it seems quite likely we will fail to call cleanupWCOWLayer on it anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could refactor this to collect all the directories, best-effort sort them (I'd have to strip the rm- prefix so that I remove child layers before parent layers), and then record any cleanupWCOWLayer failure but still try it on all of them. The outcome would be that if a child layer is stuck, it and all its parent layers would report errors during clean-up, and we'd probably stumble over panic that prompted this PR in the first place, as we'd be trying to remove a parent layer while its child is activated.

Copy link
Contributor Author

@TBBle TBBle Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, and rebasing to master, I managed to trigger exactly this code-path on CI:

2021-03-20T00:25:28.7332957Z failed to remove test root dir failed to cleanup WCOW layers in C:\Program Files\containerd\root-test\io.containerd.snapshotter.v1.windows\snapshots: strconv.Atoi: parsing "rm-26": invalid syntax
2021-03-20T00:25:28.8327339Z exit status 1
2021-03-20T00:25:28.8328816Z FAIL	github.com/containerd/containerd/integration/client	358.389s
2021-03-20T00:25:28.9091215Z mingw32-make: *** [Makefile:181: integration] Error 1

The layer itself doesn't seem to have caused any failures in the tests, so my guess this hit the "Failed to remove root filesystem" Warnf and that doesn't turn into a test failure (or get logged anywhere...). Annoyingly, there is a hcsshim::DeactivateLayer span, but not a hcsshim::DestroyLayer span, so I can't definitively tie this to a particular test.

I suspect this represents a gap in the containerd Windows snapshotter, as it's never cleaning up left-over rm- layer directories... Particularly because if I'm reading the test logs correctly, the directory in question came from "Integration 1", but the cleanup failure was the end of "Integration 2".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, it might be reasonable to, during this walk, just os.RemoveAll or cleanupWCOWLayer any rm-* layers we encounter during the Walk. My main concern is that if a stack of layers fails removal, and ends up all being renamed to rm-*, we'll walk them in the wrong order and hit the panic again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upshot for me is that if this routine fails as-is, we should have failed earlier because the tests have left the Snapshotter in an inconsistent state, and I'm not sure if trying to force the clean-up harder is productive at that point, as it seems like clean-up will fail for the same reasons that left the Snapshotter state inconsistent.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch 2 times, most recently from b0a4d01 to fc1131d Compare March 20, 2021 00:43
@TBBle
Copy link
Contributor Author

TBBle commented Mar 20, 2021

Looks like the Linux CI pipeline is borked on master... Windows is passing, except when we somehow trigger a case which suggests (to me) a test-suite failure that went uncaught.

Copy link
Member

@kevpar kevpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for working on this. :)

@estesp
Copy link
Member

estesp commented Mar 22, 2021

Sorry for the timing of your last rebase--you caught the partial day where we had an issue in main with the test image for Linux; if you rebase now the Linux failures should disappear.

@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch 3 times, most recently from e5968e2 to aa9ffb2 Compare March 23, 2021 08:12
@TBBle
Copy link
Contributor Author

TBBle commented Mar 23, 2021

Three clean runs on Windows (rebase to master, comment improvement, and go mod tidy), although two of them hit unrelated failures on Linux, one in CGroups 2, and the latest one in CRI:

[Fail] [k8s.io] Image Manager [It] listImage should get exactly 3 repoTags in the result image [Conformance]
/home/runner/work/containerd/containerd/src/github.com/kubernetes-sigs/cri-tools/pkg/framework/util.go:352

As noted in microsoft/hcsshim#961, an OS-level fix should change the triggering situation (mis-ordered unmounts) from a panic to a failure, but still better to not cause the situation in the first place. ^_^

This ensures that we do not trigger assertions inside HCS by tring to
call hcsshim.DestroyLayer on the parent of a currently-activated layer.

It also deactivates the layers before deletion, to ensure we trigger or
avert file-in-use failures due to leftover state from the tests with
more detail than 'destroy failed'.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
@TBBle TBBle force-pushed the correct-cleanup-of-windows-layers-in-test branch from aa9ffb2 to 1fd3d12 Compare March 24, 2021 18:26
Copy link
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky windows integration test
6 participants