Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DeactivateLayer to unlock layers that we cannot rename #5422

Conversation

TBBle
Copy link
Contributor

@TBBle TBBle commented Apr 24, 2021

It seems that something has shifted in an API, and vhd.DetachVhd is returning "failed to open virtual disk: invalid argument" on Windows Server LTSC 2019.

Since the snapshotter test suite is not being run for Windows, nothing is likely to have been exercising this code on CI. #4419 is trying to get the snapshot suite working, and was seeing errors like the below a bit randomly on both CI, and an LTSC 2019 eval instance, but not on my Windows 10 20H2 desktop:

    issues.go:89: Check snapshots failed: rename C:\Users\ADMINI~1\AppData\Local\Temp\2\snapshot-suite-Windows-014324559\root\snapshots\6 C:\Users\ADMINI~1\AppData\Local\Temp\2\snapshot-suite-Windows-014324559\root\snapshots\rm-6: Access is denied.
        failed to detach VHD: failed to open virtual disk: invalid argument
        github.com/containerd/containerd/snapshots/windows.(*snapshotter).Remove
        	C:/Users/Administrator/Documents/go/containerd/snapshots/windows/windows.go:266
...

Note that this is failed to open virtual disk, i.e. it hasn't even tried to detach the disk yet, so its current 'attached' state should not matter.

Based on microsoft/go-winio#157 (comment), this should be a semantically-equivalent replacement, and lets HCS handle the best way to detach the VHD, since HCS and the VHD infrastructure will be from the same system version, avoiding risk of mismatch (as I think we're seeing here).

The code being changed was added in #2705 by @jterry75 (who may still recall any context for using DetachVhd instead of DeactivateLayer) based on moby/moby#37712.

In this repro case, there is no server crash or similar, it's just under heavy load from the containerd snapshotter test suite. I haven't tried to recreate the failure from the original Docker ticket above.


As I mention elsewhere, there's no current CI that is likely to exercise this code path as until #4419, containerd does not support locally mounting a read-write container layer on Windows, let alone writing data to it. The text of the original issue in Docker suggests that a container that was running when the daemon crashed might be left in that state.

I've observed this issue on LTSC 2019 in #4419 (comment) (and confirmed it stayed stuck until I used wclayer unmount) and again in #4419 (comment), and with this fix applied, haven't seen the problem since with the same setup as the latter.

The failed to open virtual disk: invalid argument error hasn't ever appeared on my other test platform, Windows 10 20H2, but I can't prove (due to lack of logs in this code-path) whether the initial rename never failed, or if the existing DetachVhd call was successful in unlocking the layer when it did fail. Either is possible, since 20H2 isn't showing the same parallelisation issues I'm seeing on LTSC 2019 in the first place, which I suspect are part of the cause here.

I'm breaking it out of #4419 for separate review because:

I am however relying on #4419 for CI and manual testing of this change.


At the time of #2705, this code was written and tested against go-winio 0.4.11.

The OpenVirtualDisk flags used in v0.4.11

	if err := openVirtualDisk(
		&defaultType,
		path,
		virtualDiskAccessDETACH,
		0,
		nil,
		&handle); err != nil {
		return err
	}

are rather different to the OpenVirtualDisk flags used in v0.4.17

	handle, err := OpenVirtualDisk(
		path,
		VirtualDiskAccessNone,
		OpenVirtualDiskFlagCachedIO|OpenVirtualDiskFlagIgnoreRelativeParentLocator,
	)
	parameters := OpenVirtualDiskParameters{Version: 2}
	handle, err := OpenVirtualDiskWithParameters(
		vhdPath,
		virtualDiskAccessMask,
		openVirtualDiskFlags,
		&parameters,
	)
	if err := openVirtualDisk(
		&defaultType,
		vhdPath,
		uint32(virtualDiskAccessMask),
		uint32(openVirtualDiskFlags),
		parameters,
		&handle,
	); err != nil {

I suspect that microsoft/go-winio#111 introduced a breakage in v0.4.13, and microsoft/go-winio#157 fixed it for v0.4.14 but didn't notice that RS5 was broken. It's also possible the former worked on RS5, and the latter was a post-RS5 fix.

Unlike microsoft/go-winio#189, as far as I can tell, OpenVirtualDiskParameters{Version: 2} is valid on RS5, so my best guess is that the disk access mask requirements have changed since RS5.

Either way, I think calling DeactivateLayer is the more semantically-clear action anyway. We're already using hcsshim here, and I think it's better to keep reasoning about layers rather than their details, excluding open-coded HCS workalikes like createScratchLayer.


TODO but not a blocker for merging this:

@k8s-ci-robot
Copy link

Hi @TBBle. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 24, 2021

Build succeeded.

@TBBle TBBle changed the title Use DeactivateLayer to recover layers that we cannot rename Use DeactivateLayer to unlock layers that we cannot rename Apr 24, 2021
@wangxiyuan
Copy link

recheck

@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 26, 2021

Build succeeded.

It seems that something has shifted in an API, and vhd.DetachVhd is
returning "failed to open virtual disk: invalid argument" on Windows
Server LTSC 2019.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
@TBBle TBBle force-pushed the use-layer-deactivate-instead-of-vhd-detach branch from a42d8ad to 402acd7 Compare April 26, 2021 15:31
@theopenlab-ci
Copy link

theopenlab-ci bot commented Apr 26, 2021

Build succeeded.

Copy link
Member

@kzys kzys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

/ok-to-test

@jterry75
Copy link
Contributor

LGTM

@jterry75
Copy link
Contributor

@kevpar - Might want to investigate this part of the go-winio code mentioned here. But I agree using this semantic is fine so we dont need to do that at the same time for this fix.

@kzys
Copy link
Member

kzys commented May 27, 2021

@containerd/committers Can someone take a look?

@crosbymichael crosbymichael merged commit bd4fcf6 into containerd:master Jun 3, 2021
@TBBle TBBle deleted the use-layer-deactivate-instead-of-vhd-detach branch June 4, 2021 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants