New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mount snapshots on Windows #4419
Conversation
Build succeeded.
|
@kevpar - PTAL |
Build succeeded.
|
@ambarve - PTAL |
The Windows Integration test failures look relevant (possibly something is not unmounting properly?) so I'll look into those when I come back to this. Edit: Isolated PR #4425 shows the same behaviour, so it definitely needs investigation. Edit again: Found the problem; unmounting from a non-mount point (or "") is not an error, and containerd relies on this to clean up bundles. Fix coming shortly (pending confirmation on #4425 CI run) |
In case it helps get reviewed faster, I reckon this change could be rebased without #4415, #4399, and #4395, although I haven't tested that. It would be handy to land this early, because this is the only PR that touches mount/mount_windows.go, which is vendored into BuildKit, and otherwise blocks the ability to implement Edit: Opened a draft PR as #4425, so I can get a recheck of the CI failure above, to rule it in or out as a problem with this PR. |
Build succeeded.
|
12085fb
to
4b559a4
Compare
Build succeeded.
|
4b559a4
to
f52398e
Compare
Build succeeded.
|
mount/mount_windows.go
Outdated
// Helpfully, both reparse points and symlinks look like links to Go | ||
// Less-helpfully, ReadLink cannot return \\?\Volume{GUID} for a volume mount, | ||
// and ends up returning the directory we gave it for some reason. | ||
if mountTarget, err := os.Readlink(mount); err != nil { | ||
// Not a mount point. | ||
// This isn't an error, per the EINVAL handling in the Linux version | ||
return nil | ||
} else if mount != filepath.Clean(mountTarget) { | ||
// Directory symlink | ||
return errors.Wrapf(bindUnmount(mount), "failed to bind-unmount from %s", mount) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could probably be made safer using winio.DecodeReparsePoint
, so we don't treat a symlink to itself as a mount point. See https://github.com/microsoft/go-winio/blob/3fe6c526287328ec859e2c1258458b685198f204/backuptar/tar.go#L160-L171 for an example usage.
That would allow us to confirm that the mount point's mounted volume does point back to the same scratch layer we read from the alternate data stream.
(See #4419 (comment))
slashedTarget := filepath.Clean(targetPath) + string(filepath.Separator) | ||
slashedVolume := volumePath + string(filepath.Separator) | ||
|
||
targetP, err := syscall.UTF16PtrFromString(slashedTarget) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per microsoft/hcsshim#852 (comment), use windows.UTF16PtrFromString
.
In fact, update this whole file to be the same as the one I added to hcsshim.
Or put together a PR for these and related functions to address microsoft/go-winio#176, and then consume them from there. get microsoft/go-winio#187 landed and consume it from there,
I had a look at the snapshot test suite, and all the suite's tests assume you can create a snapshot stack from scratch by writing to files on-disk. Looking at the changes to the tests in #2366 to make them work, it seems there's a function That might make this feasible by using a bind-mount to Files for parent-less Active snapshots, and then Commit would be using Yes, that's it, combined with the files in Helpfully, someone has just dropped a batch of knowledge on me in microsoft/hcsshim#853 (comment) which matches this, and also demonstrates a minimal base-layer export format. |
Converted to draft, see #4415 (comment) |
4521299
to
c1d0bdc
Compare
This is to pull in the fix for ignoring lack of xattrs support. containerd/continuity#207 Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
The WCOW layer support does not support creating sandboxes with no parent. Instead, parentless scratch layers must be layed out as a directory containing only a directory named 'Files', and all data stored inside 'Files'. At commit-time, this will be converted in-place into a read-only layer suitable for use as a parent layer. The WCOW layer support also does not deal with making read-only layers, i.e. layers that are prepared to be parent layers, visible in a read-only manner. A bind-mount or junction point cannot be made read-only, so a view must instead be a small sandbox layer that we can mount via WCOW, and discard later, to protect the layer against accidental or deliberate modification. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Using symlinks for bind mounts means we are not protecting an RO-mounted layer against modification. Windows doesn't currently appear to offer a better approach though, as we cannot create arbitrary empty WCOW scratch layers at this time. For windows-layer mounts, Unmount does not have access to the mounts used to create it. So we store the relevant data in an Alternate Data Stream on the mountpoint in order to be able to Unmount later. Based on approach in containerd#2366, with sign-offs recorded as 'Based-on-work-by' trailers below. This also partially-reverts some changes made in containerd#6034 as they are not needed with this mounting implmentation, which no longer needs to be handled specially by the caller compared to non-Windows mounts. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com> Based-on-work-by: Michael Crosby <crosbymichael@gmail.com> Based-on-work-by: Darren Stahl <darst@microsoft.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
This is necessary on Windows, as it's not possible to delete a snapshot while it is still mounted, even if the mount-point has been deleted. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
In Windows, a mounted scratch layer is a junction point. However, Go sees this as a symlink to itself, and hence does not walk through it. This causes testutil.DumpDir and fstest.CheckDirectoryEqualWithApplier to see an empty directory if given the mount-point as the root path. So instead, all the test work is done in a sub-directory of the snapshot. It's possible that this could be fixed in containerd/continuity for fstest.CheckDirectoryEqualWithApplier, but I have not looked into that. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Filesystem permissions and ownership are not modifiable via an image mount. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
NTFS, when presented with an all-caps filename, assumes you are just being loud for no reason, and instead stores an all-lower-case filename. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
I'm getting weird failures from CI when `func (s *snapshotter) createScratchLayer` calls `computestorage.SetupContainerBaseLayer` to create new disks. > 2021-03-07T06:07:50.7594090Z testsuite.go:782: failed to create scratch layer: failed to create scratch vhdx at "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\snapshot-suite-Windows-351506444\\root\\snapshots\\1": failed to format writable layer vhd: The request is not supported. So let's just avoid that code-path, since this wasn't happening before this code-path was introduced in containerd#4912. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Running the test suite in parallel (the default for the unit tests and the first integration suite pass) sees "file in use" issues, but those don't appear to affect non-parallel runs. This hack is to confirm that hypothesis, as we should get a consistent test run, as we roll forward with rebases etc. Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
f69b196
to
bd51841
Compare
} else { | ||
// A view is just a read-write snapshot with a _really_ small sandbox, since we cannot actually | ||
// make a read-only mount or junction point. https://superuser.com/q/881544/112473 | ||
sizeInBytes = 20 * 1024 * 1024 * 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wounder if it wouldn't be worth implementing something using ProjFS (or some other mechanism), and get proper control over how the source is mounted instead of trying to work around the limitations of junctions.
(awesome stuff by the way, and thanks for working on all of this!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about ProjFS the other day, but since AFAIK it expects you to copy the relevant file from the backing store into the directory it's mounted at to be used by the system, it seems likely to be wasteful if I used it here when I already have the files available on the filesystem via the WCIFS driver.
I think a more-likely improvement would be being able to mount a WCIFS filesystem stack read-only, i.e. with no sandbox. I expect that would require OS-level support, unless there's something hidden in the HCS or computestorage APIs already that would allow this, and we just don't expose it in hcsshim.
Using ProjFS instead of WCIFS to actually mount a layer stack on the other hand would be a fascinating and terrifying exercise. 100% that would be an external snapshotter implementation that would need to be bug-for-bug compatible with WCIFS and anything else used to expose layers into the HCS computer systems (VM, container, or host-based). More like a research project than likely to become the way, I think. And it would still have the problem that the current WCIFS on-disk format has all the files we need already extracted. It already has a triple-the-disk-space issue when exporting a layer (although I reckon once I have the Snapshotter test suite passing here, I can get that down to double-the-disk-space), and using ProjFS would introduce an up-to-double-the-disk-space issue in the mount-time behaviour.
Final thought, the liberal use of "legacy" in the related hcsshim code and discussions suggests that someone might have plans for a different layer-storage format, and that might be more-conducive to something like ProjFS for its host-mount system. But given that this stuff happens with Windows Server support lifecycle iterations, I don't expect any such work to materialise before I land this. (I also didn't expect to not have landed this last year, though, so take my expectations with a grain of salt. ^_^)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it wouldn't be the best option. I was thinking it might be of use for RO bind mounts, but that's not the only thing that is missing and would require RO access. WCIFS sounds like the path forward, but I am not sure if that filter driver has any flag that would allow setting a RO flag. I don't think it would be a huge technical challenge to add, but it might take a non-trivial amount of time until it propagates to all supported versions of Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else to look into is whether we can replace the symlink/junction-point appear with the Bind Filter.
I had seen that before, but only came across this API (and noticed it supports readonly binds) when just now preparing a hcsshim/WCIFS feature request (microsoft/hcsshim#1520).
We'd still like the WCIFS/hcsshim feature request, of course, since Bind Filter doesn't solve the problem of needing a scratch directory for HcsAttachLayerStorageFilter
to initialise WCIFS. But it could be used to optimise the "no parents" cases, as is done in the containerd overlay driver, at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is already a huge plus. It would eliminate the need for a bunch of workarounds. Great find!
What are the blocking issues that remain to be solved to land both this and the hcsshim PR this depends on? If I can, I would like to help out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that continuity PR (reverted the wcow_workaround
commit) helps with comparing files, but now fails due to not being able to checksum hidden system files. See test output:
https://paste.samfira.com/public/p/dznNWyKUHCKT9s2rEJPKRhRK
Only SYSTEM
has access to those files. Explicitly adding permissions for the user running the tests to access those files, should work. I'm wondering if we should create a list of files/folders that should be ecluded from comparison. $Recicle.Bin
comes to mind, as well as things like pagefile.sys
. What are your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting proper dacl allows us to create a manifest of the entire mount point, and now the fun starts. We have some extra paths inside that mount point that makes the test fail:
PS C:\containerd> sh.exe -c "EXTRA_TESTFLAGS=-test.run=TestSnapshotterClient make integration"
+ integration
Windows test image: mcr.microsoft.com/windows/nanoserver:ltsc2022 , Windows build version: 20348
time="2022-09-27T07:45:07-07:00" level=info msg="running tests against containerd" revision=03536e7d953416c460ab076d497eadade3084ff1.m runtime= snapshotter= version=v1.6.0-715-g03536e7d9.m
time="2022-09-27T07:45:07-07:00" level=info msg="start to pull seed image" image="mcr.microsoft.com/windows/nanoserver:ltsc2022"
time="2022-09-27T07:45:08-07:00" level=info msg=request digest="sha256:09629875cd6ee57fa551626fa2963d8b54718ba9acbf28e7cb5d684a9cb754d7" mediatype=application/vnd.docker.image.rootfs.foreign.diff.tar.gzip size=118131331 url="https://mcr.microsoft.com/v2/windows/nanoserver/blobs/sha256:09629875cd6ee57fa551626fa2963d8b54718ba9acbf28e7cb5d684a9cb754d7"
=== RUN TestSnapshotterClient
=== RUN TestSnapshotterClient/Basic
helpers.go:88: unmount C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\preparing
helpers.go:88: unmount C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\nextlayer
testsuite.go:216: failure reason: directory diff between C:\Users\ADMINI~1\AppData\Local\Temp\fstest2813155162 and C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\nextlayer
+ \WcSandboxState
+ \WcSandboxState\Hives
+ \WcSandboxState\Hives\DefaultUser_Delta
+ \WcSandboxState\Hives\Sam_Delta
+ \WcSandboxState\Hives\Security_Delta
+ \WcSandboxState\Hives\Software_Delta
+ \WcSandboxState\Hives\System_Delta
+ \WcSandboxState\initialized
+ \Windows
+ \Windows\System32
+ \Windows\System32\config
+ \Windows\System32\config\DEFAULT
+ \Windows\System32\config\SAM
+ \Windows\System32\config\SECURITY
+ \Windows\System32\config\SOFTWARE
+ \Windows\System32\config\SYSTEM
helpers.go:69: drwxrwxrwx 0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799
helpers.go:69: drwxrwxrwx 0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\root
helpers.go:69: drwxrwxrwx 0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work
helpers.go:69: drwxrwxrwx 0 C:\Users\ADMINI~1\AppData\Local\Temp\snapshot-suite-SnapshotterClient-2124266799\work\nextlayer
Your current workaround may be needed after all, but this time for different reasons 😄 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the \WcSandboxState\Hives
files are layered registry keys. Those are essentially deltas that get applied on top of the base hives similarly to how overlayfs
works.
I think it's safe to ignore those when comparing directory diffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised WcSandboxState
exists at all, I'd have assumed that only existed when the layers are used by a running container. I wonder, are those files all empty? Since no OS has been booted in that image, I can't see how there'd be any "delta" to capture, but maybe a zero delta has non-zero size.
It would make sense to me that for purposes of the snapshotter diff test, \WcSandboxState
and \Windows
be ignored; I'd say specifically \Windows\System32\config\XXX
for the five registry hives (which I expect are trivially bind-mounted from the directory next to Files in the underlying layer image) but then having to ignore the containing directories sounds messy.
An alternative would be to do something clever with the compared-against version, using an applier or something to generate empty Deltas and initialized
in WcSandboxState
, and copy in the (empty in this case) hives from the underlying layer into Windows\System32\config\
. That would mean we know exactly how those directories behave (since we can reproduce the behaviour and get a clean diff) but might tie the tests to a behaviour that varies across OS versions, and also means effectively reverse-engineering that behaviour. I don't think it's worth the effort to take this path (invoking Hyrum's Law in the process), compared to excluding a fairly-fixed list of directories.
A third alternative might be to adapt the tests so that on Windows, the in-image version is working on a subdirectory of the layer mount-point; this would be better than my existing hack as the appliers wouldn't see the different path, so it'd be less intrusive and a more-contained difference.
And another approach, perhaps: What if we ignore all SYSTEM
files in the comparison, or even ignore "all files we can't open", as mentioned in #4419 (comment), since I would guess all the WCIFS-injected files are marked this way. This might not pan out as it'll still see the containing directories as new though.
Flowing on from this (and again getting low on the value/effort ratio) if the comparison treated any SYSTEM
-marked (or otherwise unreadable) file in the layer as equal to a file of the same path in the fstest directory, irrespective of contents, then a fairly simple applier could be used for the fstest directory to create zero-length dummies for the WCIFS-injected files, without needing to care about their contents. That should be reasonably simple to implement, but I suspect would require another continuity change. (Although I haven't checked the code, sorry...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best way to move forward is to either exclude the folders in the above output, or use a sub-folder. Excluding those paths seems like the better choice.
From what I can tell, HCS creates those diffs when it prepares the layer. I have not checked the contents of the files in WcSandboxState
but I suspect those will get created for every layer regardless of whether or not something changes.
We could just chalk it off to magic being done by hcs and just skip those.
As a side note, I opened a PR against your hcsshim
branch that replaces the go-hivex
dummy files with ones generated by the offline registry library.
if runtime.GOOS == "windows" { | ||
t.Skip("Read-only protection on views is not supported on WCOW") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test disagrees with the API docs, as the latter states that the mount from a View may be read-only, and that writes to the mount must be ignored, not that they must fail.
I think what it should do is if the View result is not read-only (i.e. if the write doesn't fail), request the View again and ensure that the file written does not exist in the second View.
The snapshotter tests that use |
@TBBle Would you mind if I close this PR? @gabriel-samfira rebased this as #8043. |
Closing in favour of #8043, thanks to @gabriel-samfira for completing this work while I'm (still) on sabbatical. |
This builds on ✔ #4415, on ✔#4399, and on ✔#4395. I think it's much easier to review the PRs in order.
Full credit to @darstahl and @crosbymichael, on whose shoulders I stood to implement this, as well as any other contributors to #2366 and #2287.
This reimplements the bare-minimum of #2366 to support Mount and Unmount under Windows, with the following approach:
bind
mount to a 'Files' subdirectory, implemented as an NTFS symlink.Commit
on such parentless layers.windows-layer
mount, exposed by HCS as a volume and mounted as an NTFS volume mount.Snapshotter.View
though, which allows this implementation approach.In order to unmount a windows-layer without access to the original
[]mount.Mount
, we store the original layer path in an Alternative Data Stream on the mount-point. Happily, we do not require the parents list to tear down the stack, so we don't need to store structured data here.It's a shame, really. If HCS exposed an inverse for
hcsshim.GetLayerMountPath
(microsoft/hcsshim#909), then we wouldn't need the ADS storage, as Windows exposes an API to get the Volume back from a Volume mount, and that would let me find the Layer in-question.With a bit of further work to remove some Linux-isms and work around various Go bugs/NTFS behaviours, this lets us enable the Snapshot test suite on Windows.
WIP work remaining:
ErrNotSupported
continuity#207. (WIP in this branch using a direct hash-reference until there is an upstream release)wcow_workaround
directory either here, or in Should the default pathDriver walk though volume mount points on Windows? continuity#174sys.ForceRemoveAll
to also apply tosnapshots.Windows.TestWindows
. Perhaps exposesys.cleanupWCOWLayers
and call it from the test cleanup function after closing the snapshotter.This PR delivers a working filesystem layer for moby/buildkit#616, and once it lands, the rest of BuildKit's WCOW support should be internal to BuildKit.