-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct clean-up of Windows Layers in testsuite #5133
Correct clean-up of Windows Layers in testsuite #5133
Conversation
Hi @TBBle. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Build succeeded.
|
10e1f78
to
4d2d65c
Compare
Build succeeded.
|
4d2d65c
to
f887038
Compare
Build succeeded.
|
f887038
to
0255ffe
Compare
Build succeeded.
|
0255ffe
to
a6c2aca
Compare
Build succeeded.
|
a6c2aca
to
e4abc14
Compare
Build succeeded.
|
e4abc14
to
8d7e735
Compare
Build succeeded.
|
8d7e735
to
554e6be
Compare
Build succeeded.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the windows snapshotter layout and couldn't really determine it just looking at the code, so assuming its a base dir with directories with integer names.
Other than that the Unprepare/Deactivate calls look similar to some cleanup code in the moby windows graphdriver, so 👍
if layerNum, err := strconv.Atoi(filepath.Base(path)); err == nil { | ||
layerNums = append(layerNums, layerNum) | ||
} else { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning an error doesn't seem right here because this stops cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I... kind-of want to stop cleanup and fail, because this means (unless Atoi has other failure cases) that the Windows snapshotter behaviour has changed, and this code wasn't updated. I'd want that to fail CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, the only circumstances I could hit this with the current code-base is an rm-<digits>
directory, created during func (s *snapshotter) Remove
, and having fallen through either "Failed to rename after failed commit" or "Failed to remove root filesystem". In that case, it seems quite likely we will fail to call cleanupWCOWLayer
on it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could refactor this to collect all the directories, best-effort sort them (I'd have to strip the rm-
prefix so that I remove child layers before parent layers), and then record any cleanupWCOWLayer
failure but still try it on all of them. The outcome would be that if a child layer is stuck, it and all its parent layers would report errors during clean-up, and we'd probably stumble over panic that prompted this PR in the first place, as we'd be trying to remove a parent layer while its child is activated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, and rebasing to master
, I managed to trigger exactly this code-path on CI:
2021-03-20T00:25:28.7332957Z failed to remove test root dir failed to cleanup WCOW layers in C:\Program Files\containerd\root-test\io.containerd.snapshotter.v1.windows\snapshots: strconv.Atoi: parsing "rm-26": invalid syntax
2021-03-20T00:25:28.8327339Z exit status 1
2021-03-20T00:25:28.8328816Z FAIL github.com/containerd/containerd/integration/client 358.389s
2021-03-20T00:25:28.9091215Z mingw32-make: *** [Makefile:181: integration] Error 1
The layer itself doesn't seem to have caused any failures in the tests, so my guess this hit the "Failed to remove root filesystem" Warnf
and that doesn't turn into a test failure (or get logged anywhere...). Annoyingly, there is a hcsshim::DeactivateLayer
span, but not a hcsshim::DestroyLayer
span, so I can't definitively tie this to a particular test.
I suspect this represents a gap in the containerd Windows snapshotter, as it's never cleaning up left-over rm-
layer directories... Particularly because if I'm reading the test logs correctly, the directory in question came from "Integration 1", but the cleanup failure was the end of "Integration 2".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, it might be reasonable to, during this walk, just os.RemoveAll
or cleanupWCOWLayer
any rm-*
layers we encounter during the Walk. My main concern is that if a stack of layers fails removal, and ends up all being renamed to rm-*
, we'll walk them in the wrong order and hit the panic again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upshot for me is that if this routine fails as-is, we should have failed earlier because the tests have left the Snapshotter in an inconsistent state, and I'm not sure if trying to force the clean-up harder is productive at that point, as it seems like clean-up will fail for the same reasons that left the Snapshotter state inconsistent.
b0a4d01
to
fc1131d
Compare
Looks like the Linux CI pipeline is borked on master... Windows is passing, except when we somehow trigger a case which suggests (to me) a test-suite failure that went uncaught. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for working on this. :)
Sorry for the timing of your last rebase--you caught the partial day where we had an issue in main with the test image for Linux; if you rebase now the Linux failures should disappear. |
e5968e2
to
aa9ffb2
Compare
Three clean runs on Windows (rebase to master, comment improvement, and
As noted in microsoft/hcsshim#961, an OS-level fix should change the triggering situation (mis-ordered unmounts) from a panic to a failure, but still better to not cause the situation in the first place. ^_^ |
This ensures that we do not trigger assertions inside HCS by tring to call hcsshim.DestroyLayer on the parent of a currently-activated layer. It also deactivates the layers before deletion, to ensure we trigger or avert file-in-use failures due to leftover state from the tests with more detail than 'destroy failed'. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
aa9ffb2
to
1fd3d12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
A fix for
sys.ForceRemoveAll
on Windows: Clean up individual layers from the windows snapshotter using wclayer APIsERROR_DEV_NOT_EXIST
), Deactivate, Destroy.os.RemoveAll
the root dir like we do on non-Windows.Tested with a now-reverted hack to reproduce the issue reliably on CI. It reproduced on the fourth of eight repeats of the Windows Integration test. Note that due to 30 minute timeout, it would be cancelled during run 5. I'm not sure why the suite takes 4 minutes now, it used to take 1 minute when running in parallel, based on some of the logs in #4924 (comment).
Fixes: #4924
I have cherry-picked the fix into #4419, since that PR enables the full Snapshot test suite on Windows, and hence reproduces #4924 almost every time, including cases where the tests themselves have failed and may have left the system in a bad state, i.e. files locked open preventing clean-up. This change should remove the
panic
we see in this case, but may still fail to clean up with nice, readable errors. That said, even when #4419 fails, with this change it's cleaning up without error, so I suspect its "file still open" problems are transitory, or the clean-up routine here works correctly to unstick them.