New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use DeactivateLayer to unlock layers that we cannot rename #5422
Use DeactivateLayer to unlock layers that we cannot rename #5422
Conversation
Hi @TBBle. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Build succeeded.
|
recheck |
Build succeeded.
|
It seems that something has shifted in an API, and vhd.DetachVhd is returning "failed to open virtual disk: invalid argument" on Windows Server LTSC 2019. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
a42d8ad
to
402acd7
Compare
Build succeeded.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
/ok-to-test
LGTM |
@kevpar - Might want to investigate this part of the go-winio code mentioned here. But I agree using this semantic is fine so we dont need to do that at the same time for this fix. |
@containerd/committers Can someone take a look? |
It seems that something has shifted in an API, and
vhd.DetachVhd
is returning "failed to open virtual disk: invalid argument" on Windows Server LTSC 2019.Since the snapshotter test suite is not being run for Windows, nothing is likely to have been exercising this code on CI. #4419 is trying to get the snapshot suite working, and was seeing errors like the below a bit randomly on both CI, and an LTSC 2019 eval instance, but not on my Windows 10 20H2 desktop:
Note that this is
failed to open virtual disk
, i.e. it hasn't even tried to detach the disk yet, so its current 'attached' state should not matter.Based on microsoft/go-winio#157 (comment), this should be a semantically-equivalent replacement, and lets HCS handle the best way to detach the VHD, since HCS and the VHD infrastructure will be from the same system version, avoiding risk of mismatch (as I think we're seeing here).
The code being changed was added in #2705 by @jterry75 (who may still recall any context for using
DetachVhd
instead ofDeactivateLayer
) based on moby/moby#37712.In this repro case, there is no server crash or similar, it's just under heavy load from the containerd snapshotter test suite. I haven't tried to recreate the failure from the original Docker ticket above.
As I mention elsewhere, there's no current CI that is likely to exercise this code path as until #4419, containerd does not support locally mounting a read-write container layer on Windows, let alone writing data to it. The text of the original issue in Docker suggests that a container that was running when the daemon crashed might be left in that state.
I've observed this issue on LTSC 2019 in #4419 (comment) (and confirmed it stayed stuck until I used
wclayer unmount
) and again in #4419 (comment), and with this fix applied, haven't seen the problem since with the same setup as the latter.The
failed to open virtual disk: invalid argument
error hasn't ever appeared on my other test platform, Windows 10 20H2, but I can't prove (due to lack of logs in this code-path) whether the initial rename never failed, or if the existingDetachVhd
call was successful in unlocking the layer when it did fail. Either is possible, since 20H2 isn't showing the same parallelisation issues I'm seeing on LTSC 2019 in the first place, which I suspect are part of the cause here.I'm breaking it out of #4419 for separate review because:
I am however relying on #4419 for CI and manual testing of this change.
At the time of #2705, this code was written and tested against go-winio 0.4.11.
The
OpenVirtualDisk
flags used in v0.4.11are rather different to the
OpenVirtualDisk
flags used in v0.4.17I suspect that microsoft/go-winio#111 introduced a breakage in v0.4.13, and microsoft/go-winio#157 fixed it for v0.4.14 but didn't notice that RS5 was broken. It's also possible the former worked on RS5, and the latter was a post-RS5 fix.
Unlike microsoft/go-winio#189, as far as I can tell,
OpenVirtualDiskParameters{Version: 2}
is valid on RS5, so my best guess is that the disk access mask requirements have changed since RS5.Either way, I think calling
DeactivateLayer
is the more semantically-clear action anyway. We're already using hcsshim here, and I think it's better to keep reasoning about layers rather than their details, excluding open-coded HCS workalikes likecreateScratchLayer
.TODO but not a blocker for merging this: