Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mount snapshots on Windows #4419

Closed
wants to merge 15 commits into from
Closed

Conversation

TBBle
Copy link
Contributor

@TBBle TBBle commented Jul 23, 2020

This builds on ✔ #4415, on ✔#4399, and on ✔#4395. I think it's much easier to review the PRs in order.

Full credit to @darstahl and @crosbymichael, on whose shoulders I stood to implement this, as well as any other contributors to #2366 and #2287.

This reimplements the bare-minimum of #2366 to support Mount and Unmount under Windows, with the following approach:

  • An Active snapshot without a parent is a (read-write) bind mount to a 'Files' subdirectory, implemented as an NTFS symlink.
  • An Active snapshot with a parent is a windows-layer mount, exposed by HCS as a volume and mounted as an NTFS volume mount.
  • A View snapshot is a temporary read-write child of its Committed snapshot parent, mounted as a windows-layer per the previous point, then thrown away after use. So you can write to the resulting mount point, but it won't affect the target layer's data.
    • This isn't ideal, both because it means tests that check read-only state of mounts fail, and also because it's messy. But AFAIK, NTFS does not offer any kind of read-only Reparse Point that can be used here, so we would need this to be supported by a new filesystem filter, or implemented in WCIFS (the file system filter used to expose existing read-write layers as volumes)
    • This is compliant with the documentation for Snapshotter.View though, which allows this implementation approach.

In order to unmount a windows-layer without access to the original []mount.Mount, we store the original layer path in an Alternative Data Stream on the mount-point. Happily, we do not require the parents list to tear down the stack, so we don't need to store structured data here.

It's a shame, really. If HCS exposed an inverse for hcsshim.GetLayerMountPath (microsoft/hcsshim#909), then we wouldn't need the ADS storage, as Windows exposes an API to get the Volume back from a Volume mount, and that would let me find the Layer in-question.

With a bit of further work to remove some Linux-isms and work around various Go bugs/NTFS behaviours, this lets us enable the Snapshot test suite on Windows.

WIP work remaining:

This PR delivers a working filesystem layer for moby/buildkit#616, and once it lands, the rest of BuildKit's WCOW support should be internal to BuildKit.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 23, 2020

Build succeeded.

@jterry75
Copy link
Contributor

@kevpar - PTAL

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 23, 2020

Build succeeded.

@jterry75
Copy link
Contributor

@ambarve - PTAL

@TBBle
Copy link
Contributor Author

TBBle commented Jul 23, 2020

The Windows Integration test failures look relevant (possibly something is not unmounting properly?) so I'll look into those when I come back to this.

Edit: Isolated PR #4425 shows the same behaviour, so it definitely needs investigation.

Edit again: Found the problem; unmounting from a non-mount point (or "") is not an error, and containerd relies on this to clean up bundles. Fix coming shortly (pending confirmation on #4425 CI run)

@TBBle
Copy link
Contributor Author

TBBle commented Jul 27, 2020

In case it helps get reviewed faster, I reckon this change could be rebased without #4415, #4399, and #4395, although I haven't tested that.

It would be handy to land this early, because this is the only PR that touches mount/mount_windows.go, which is vendored into BuildKit, and otherwise blocks the ability to implement COPY, ADD, or WORKDIR for WCOW there.

Edit: Opened a draft PR as #4425, so I can get a recheck of the CI failure above, to rule it in or out as a problem with this PR.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Jul 28, 2020

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Sep 9, 2020

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Sep 9, 2020

Build succeeded.

Comment on lines 122 to 138
// Helpfully, both reparse points and symlinks look like links to Go
// Less-helpfully, ReadLink cannot return \\?\Volume{GUID} for a volume mount,
// and ends up returning the directory we gave it for some reason.
if mountTarget, err := os.Readlink(mount); err != nil {
// Not a mount point.
// This isn't an error, per the EINVAL handling in the Linux version
return nil
} else if mount != filepath.Clean(mountTarget) {
// Directory symlink
return errors.Wrapf(bindUnmount(mount), "failed to bind-unmount from %s", mount)
}
Copy link
Contributor Author

@TBBle TBBle Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be made safer using winio.DecodeReparsePoint, so we don't treat a symlink to itself as a mount point. See https://github.com/microsoft/go-winio/blob/3fe6c526287328ec859e2c1258458b685198f204/backuptar/tar.go#L160-L171 for an example usage.

That would allow us to confirm that the mount point's mounted volume does point back to the same scratch layer we read from the alternate data stream.

(See #4419 (comment))

snapshots/windows/windows.go Outdated Show resolved Hide resolved
slashedTarget := filepath.Clean(targetPath) + string(filepath.Separator)
slashedVolume := volumePath + string(filepath.Separator)

targetP, err := syscall.UTF16PtrFromString(slashedTarget)
Copy link
Contributor Author

@TBBle TBBle Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per microsoft/hcsshim#852 (comment), use windows.UTF16PtrFromString.

In fact, update this whole file to be the same as the one I added to hcsshim.

Or put together a PR for these and related functions to address microsoft/go-winio#176, and then consume them from there. get microsoft/go-winio#187 landed and consume it from there,

@quangkieu
Copy link

Hey @kevpar @ambarve, I wonder if you have time to look into this pull request?

@TBBle
Copy link
Contributor Author

TBBle commented Nov 29, 2020

I had a look at the snapshot test suite, and all the suite's tests assume you can create a snapshot stack from scratch by writing to files on-disk. Looking at the changes to the tests in #2366 to make them work, it seems there's a function hcsshim.ProcesBaseLayer which (if I'm reading this correctly) can turn a directory with a subdir named Files/ into a base-layer.

That might make this feasible by using a bind-mount to Files for parent-less Active snapshots, and then Commit would be using hcsshim.ProcessBaseLayer to make it a valid parent.

Yes, that's it, combined with the files in func Base() Applier from vendor/github.com/containerd/continuityfs/fsttest/file_windows.go. Or at least, hcsshim.ProcessBaseLayer completes successfully.

Helpfully, someone has just dropped a batch of knowledge on me in microsoft/hcsshim#853 (comment) which matches this, and also demonstrates a minimal base-layer export format.

@TBBle
Copy link
Contributor Author

TBBle commented Nov 29, 2020

Converted to draft, see #4415 (comment)

@TBBle TBBle force-pushed the wcow_mount_layers branch 2 times, most recently from 4521299 to c1d0bdc Compare November 29, 2020 15:52
This is to pull in the fix for ignoring lack of xattrs support.

containerd/continuity#207
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
The WCOW layer support does not support creating sandboxes with no
parent.  Instead, parentless scratch layers must be layed out as a
directory containing only a directory named 'Files', and all data stored
inside 'Files'. At commit-time, this will be converted in-place into a
read-only layer suitable for use as a parent layer.

The WCOW layer support also does not deal with making read-only layers,
i.e. layers that are prepared to be parent layers, visible in a
read-only manner. A bind-mount or junction point cannot be made
read-only, so a view must instead be a small sandbox layer that we can
mount via WCOW, and discard later, to protect the layer against
accidental or deliberate modification.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Using symlinks for bind mounts means we are not protecting an RO-mounted
layer against modification. Windows doesn't currently appear to offer a
better approach though, as we cannot create arbitrary empty WCOW scratch
layers at this time.

For windows-layer mounts, Unmount does not have access to the mounts
used to create it. So we store the relevant data in an Alternate Data
Stream on the mountpoint in order to be able to Unmount later.

Based on approach in containerd#2366,
with sign-offs recorded as 'Based-on-work-by' trailers below.

This also partially-reverts some changes made in containerd#6034 as they are not
needed with this mounting implmentation, which no longer needs to be
handled specially by the caller compared to non-Windows mounts.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Based-on-work-by: Michael Crosby <crosbymichael@gmail.com>
Based-on-work-by: Darren Stahl <darst@microsoft.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
This is necessary on Windows, as it's not possible to delete a snapshot
while it is still mounted, even if the mount-point has been deleted.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
In Windows, a mounted scratch layer is a junction point. However, Go
sees this as a symlink to itself, and hence does not walk through it.

This causes testutil.DumpDir and fstest.CheckDirectoryEqualWithApplier
to see an empty directory if given the mount-point as the root path.

So instead, all the test work is done in a sub-directory of the
snapshot.

It's possible that this could be fixed in containerd/continuity for
fstest.CheckDirectoryEqualWithApplier, but I have not looked into that.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Filesystem permissions and ownership are not modifiable via an image
mount.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
NTFS, when presented with an all-caps filename, assumes you are just
being loud for no reason, and instead stores an all-lower-case filename.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
I'm getting weird failures from CI when `func (s *snapshotter)
createScratchLayer` calls `computestorage.SetupContainerBaseLayer` to
create new disks.

> 2021-03-07T06:07:50.7594090Z         testsuite.go:782: failed to
create scratch layer: failed to create scratch vhdx at
"C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\snapshot-suite-Windows-351506444\\root\\snapshots\\1":
failed to format writable layer vhd: The request is not supported.

So let's just avoid that code-path, since this wasn't happening before
this code-path was introduced in containerd#4912.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Running the test suite in parallel (the default for the unit tests and
the first integration suite pass) sees "file in use" issues, but those
don't appear to affect non-parallel runs.

This hack is to confirm that hypothesis, as we should get a consistent
test run, as we roll forward with rebases etc.

Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
@TBBle TBBle force-pushed the wcow_mount_layers branch 3 times, most recently from f69b196 to bd51841 Compare August 11, 2022 18:49
} else {
// A view is just a read-write snapshot with a _really_ small sandbox, since we cannot actually
// make a read-only mount or junction point. https://superuser.com/q/881544/112473
sizeInBytes = 20 * 1024 * 1024 * 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wounder if it wouldn't be worth implementing something using ProjFS (or some other mechanism), and get proper control over how the source is mounted instead of trying to work around the limitations of junctions.

(awesome stuff by the way, and thanks for working on all of this!)

Copy link
Contributor Author

@TBBle TBBle Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about ProjFS the other day, but since AFAIK it expects you to copy the relevant file from the backing store into the directory it's mounted at to be used by the system, it seems likely to be wasteful if I used it here when I already have the files available on the filesystem via the WCIFS driver.

I think a more-likely improvement would be being able to mount a WCIFS filesystem stack read-only, i.e. with no sandbox. I expect that would require OS-level support, unless there's something hidden in the HCS or computestorage APIs already that would allow this, and we just don't expose it in hcsshim.

Using ProjFS instead of WCIFS to actually mount a layer stack on the other hand would be a fascinating and terrifying exercise. 100% that would be an external snapshotter implementation that would need to be bug-for-bug compatible with WCIFS and anything else used to expose layers into the HCS computer systems (VM, container, or host-based). More like a research project than likely to become the way, I think. And it would still have the problem that the current WCIFS on-disk format has all the files we need already extracted. It already has a triple-the-disk-space issue when exporting a layer (although I reckon once I have the Snapshotter test suite passing here, I can get that down to double-the-disk-space), and using ProjFS would introduce an up-to-double-the-disk-space issue in the mount-time behaviour.

Final thought, the liberal use of "legacy" in the related hcsshim code and discussions suggests that someone might have plans for a different layer-storage format, and that might be more-conducive to something like ProjFS for its host-mount system. But given that this stuff happens with Windows Server support lifecycle iterations, I don't expect any such work to materialise before I land this. (I also didn't expect to not have landed this last year, though, so take my expectations with a grain of salt. ^_^)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it wouldn't be the best option. I was thinking it might be of use for RO bind mounts, but that's not the only thing that is missing and would require RO access. WCIFS sounds like the path forward, but I am not sure if that filter driver has any flag that would allow setting a RO flag. I don't think it would be a huge technical challenge to add, but it might take a non-trivial amount of time until it propagates to all supported versions of Windows.

Copy link
Contributor Author

@TBBle TBBle Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something else to look into is whether we can replace the symlink/junction-point appear with the Bind Filter.

I had seen that before, but only came across this API (and noticed it supports readonly binds) when just now preparing a hcsshim/WCIFS feature request (microsoft/hcsshim#1520).

We'd still like the WCIFS/hcsshim feature request, of course, since Bind Filter doesn't solve the problem of needing a scratch directory for HcsAttachLayerStorageFilter to initialise WCIFS. But it could be used to optimise the "no parents" cases, as is done in the containerd overlay driver, at least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is already a huge plus. It would eliminate the need for a bunch of workarounds. Great find!

What are the blocking issues that remain to be solved to land both this and the hcsshim PR this depends on? If I can, I would like to help out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that continuity PR (reverted the wcow_workaround commit) helps with comparing files, but now fails due to not being able to checksum hidden system files. See test output:

https://paste.samfira.com/public/p/dznNWyKUHCKT9s2rEJPKRhRK

Only SYSTEM has access to those files. Explicitly adding permissions for the user running the tests to access those files, should work. I'm wondering if we should create a list of files/folders that should be ecluded from comparison. $Recicle.Bin comes to mind, as well as things like pagefile.sys. What are your thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting proper dacl allows us to create a manifest of the entire mount point, and now the fun starts. We have some extra paths inside that mount point that makes the test fail:

PS C:\containerd> sh.exe -c "EXTRA_TESTFLAGS=-test.run=TestSnapshotterClient make integration"
+ integration
Windows test image: mcr.microsoft.com/windows/nanoserver:ltsc2022 , Windows build version: 20348
time="2022-09-27T07:45:07-07:00" level=info msg="running tests against containerd" revision=03536e7d953416c460ab076d497eadade3084ff1.m runtime= snapshotter= version=v1.6.0-715-g03536e7d9.m
time="2022-09-27T07:45:07-07:00" level=info msg="start to pull seed image" image="mcr.microsoft.com/windows/nanoserver:ltsc2022"
time="2022-09-27T07:45:08-07:00" level=info msg=request digest="sha256:09629875cd6ee57fa551626fa2963d8b54718ba9acbf28e7cb5d684a9cb754d7" mediatype=application/vnd.docker.image.rootfs.foreign.diff.tar.gzip size=118131331 url="https://mcr.microsoft.com/v2/windows/nanoserver/blobs/sha256:09629875cd6ee57fa551626fa2963d8b54718ba9acbf28e7cb5d684a9cb754d7"
=== RUN   TestSnapshotterClient
=== RUN   TestSnapshotterClient/Basic
    helpers.go:88: unmount C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\preparing
    helpers.go:88: unmount C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\nextlayer
    testsuite.go:216: failure reason: directory diff between C:\Users\ADMINI~1\AppData\Local\Temp\fstest2813155162 and C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\nextlayer
        + \WcSandboxState
        + \WcSandboxState\Hives
        + \WcSandboxState\Hives\DefaultUser_Delta
        + \WcSandboxState\Hives\Sam_Delta
        + \WcSandboxState\Hives\Security_Delta
        + \WcSandboxState\Hives\Software_Delta
        + \WcSandboxState\Hives\System_Delta
        + \WcSandboxState\initialized
        + \Windows
        + \Windows\System32
        + \Windows\System32\config
        + \Windows\System32\config\DEFAULT
        + \Windows\System32\config\SAM
        + \Windows\System32\config\SECURITY
        + \Windows\System32\config\SOFTWARE
        + \Windows\System32\config\SYSTEM
    helpers.go:69: drwxrwxrwx          0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799
    helpers.go:69: drwxrwxrwx          0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\root
    helpers.go:69: drwxrwxrwx          0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work
    helpers.go:69: drwxrwxrwx          0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\nextlayer

Your current workaround may be needed after all, but this time for different reasons 😄 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the \WcSandboxState\Hives files are layered registry keys. Those are essentially deltas that get applied on top of the base hives similarly to how overlayfs works.

I think it's safe to ignore those when comparing directory diffs.

Copy link
Contributor Author

@TBBle TBBle Oct 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised WcSandboxState exists at all, I'd have assumed that only existed when the layers are used by a running container. I wonder, are those files all empty? Since no OS has been booted in that image, I can't see how there'd be any "delta" to capture, but maybe a zero delta has non-zero size.

It would make sense to me that for purposes of the snapshotter diff test, \WcSandboxState and \Windows be ignored; I'd say specifically \Windows\System32\config\XXX for the five registry hives (which I expect are trivially bind-mounted from the directory next to Files in the underlying layer image) but then having to ignore the containing directories sounds messy.

An alternative would be to do something clever with the compared-against version, using an applier or something to generate empty Deltas and initialized in WcSandboxState, and copy in the (empty in this case) hives from the underlying layer into Windows\System32\config\. That would mean we know exactly how those directories behave (since we can reproduce the behaviour and get a clean diff) but might tie the tests to a behaviour that varies across OS versions, and also means effectively reverse-engineering that behaviour. I don't think it's worth the effort to take this path (invoking Hyrum's Law in the process), compared to excluding a fairly-fixed list of directories.

A third alternative might be to adapt the tests so that on Windows, the in-image version is working on a subdirectory of the layer mount-point; this would be better than my existing hack as the appliers wouldn't see the different path, so it'd be less intrusive and a more-contained difference.

And another approach, perhaps: What if we ignore all SYSTEM files in the comparison, or even ignore "all files we can't open", as mentioned in #4419 (comment), since I would guess all the WCIFS-injected files are marked this way. This might not pan out as it'll still see the containing directories as new though.

Flowing on from this (and again getting low on the value/effort ratio) if the comparison treated any SYSTEM-marked (or otherwise unreadable) file in the layer as equal to a file of the same path in the fstest directory, irrespective of contents, then a fairly simple applier could be used for the fstest directory to create zero-length dummies for the WCIFS-injected files, without needing to care about their contents. That should be reasonably simple to implement, but I suspect would require another continuity change. (Although I haven't checked the code, sorry...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best way to move forward is to either exclude the folders in the above output, or use a sub-folder. Excluding those paths seems like the better choice.

From what I can tell, HCS creates those diffs when it prepares the layer. I have not checked the contents of the files in WcSandboxState but I suspect those will get created for every layer regardless of whether or not something changes.

We could just chalk it off to magic being done by hcs and just skip those.

As a side note, I opened a PR against your hcsshim branch that replaces the go-hivex dummy files with ones generated by the offline registry library.

Comment on lines +809 to +812
if runtime.GOOS == "windows" {
t.Skip("Read-only protection on views is not supported on WCOW")
}

Copy link
Contributor Author

@TBBle TBBle Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test disagrees with the API docs, as the latter states that the mount from a View may be read-only, and that writes to the mount must be ignored, not that they must fail.

I think what it should do is if the View result is not read-only (i.e. if the write doesn't fail), request the View again and ensure that the file written does not exist in the second View.

@gabriel-samfira
Copy link
Contributor

The snapshotter tests that use fstest.CheckDirectoryEqualWithApplier() will never work on Windows, due to the post processing that happens on commit, and because of the layered registry keys that get set up during prepare. There is some magic happening on Windows that does not happen on Linux. So we'll need to adapt the tests to exclude the hives that get added during hcsshim.ConvertToBaseLayer() as well as the files from \WcSandboxState.

@kzys
Copy link
Member

kzys commented Mar 14, 2023

@TBBle Would you mind if I close this PR? @gabriel-samfira rebased this as #8043.

@TBBle
Copy link
Contributor Author

TBBle commented Mar 14, 2023

Closing in favour of #8043, thanks to @gabriel-samfira for completing this work while I'm (still) on sabbatical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

None yet