
Allow containerd on Windows 11 to use Windows Server 2022 images #8137

Closed
wants to merge 3 commits

Conversation

hach-que
Contributor

Windows 11 clients support running process-isolated containers of 10.0.20348 and above. Or to put it another way, Windows 11 clients can run Windows Server 2022 containers even when their build numbers don't match. Docker Engine already supports this scenario; this change updates containerd to make it also support this scenario.

The [container/OS version compatibility page](https://learn.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility?tabs=windows-server-2022%2Cwindows-11#windows-client-host-os-compatibility) documents this level of compatibility between the client OS and server OS versions.

I didn't explicitly add support for Windows 10 clients, as the current version of Windows 10 is no longer compatible with the LTSC2019 container images, and there's no way to downgrade a Windows 10 install to the required version. Windows 11 compatibility with LTSC2022 is much more stable in practice than Windows 10, and Microsoft is committed to permitting any combination of Windows Server 2022 images and Windows 11 clients, as outlined in the [Windows Server 2022 and beyond for containers blog post](https://techcommunity.microsoft.com/t5/containers/windows-server-2022-and-beyond-for-containers/ba-p/2712487).
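For illustration, the compatibility rule this PR targets can be distilled into a tiny Go predicate. This is a sketch with hypothetical names, not the PR's actual matcher code; it assumes build numbers have already been parsed out of `10.0.<build>` version strings:

```go
package main

import "fmt"

// canRunWS2022Image reports whether a Windows 11 client host (builds
// 22000 and up) should accept a Windows Server 2022 image (build
// 20348) for process isolation, per the compatibility described above.
func canRunWS2022Image(hostBuild, imageBuild int) bool {
	const windows11MinBuild = 22000
	const ws2022Build = 20348
	return hostBuild >= windows11MinBuild && imageBuild == ws2022Build
}

func main() {
	// Windows 11 22H2 (10.0.22621) pulling an LTSC2022 image (10.0.20348).
	fmt.Println(canRunWS2022Image(22621, 20348))
}
```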

@k8s-ci-robot

Hi @hach-que. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hach-que
Contributor Author

I don't know what FAIL - does not have a valid DCO means and there's no documentation on what DCO stands for so I can't fix that check.

@fuweid
Member

fuweid commented Feb 19, 2023

I don't know what FAIL - does not have a valid DCO means and there's no documentation on what DCO stands for so I can't fix that check.

Hi @hach-que You need to sign it off with git commit -s --amend. Check out this link: https://github.com/containerd/project/blob/main/CONTRIBUTING.md#sign-your-work. Thanks

Signed-off-by: June Rhodes <504826+hach-que@users.noreply.github.com>
@hach-que
Contributor Author

Should be good now.

Signed-off-by: June Rhodes <504826+hach-que@users.noreply.github.com>
@hach-que
Contributor Author

Looks like the failing test doesn't have anything to do with the code changes and might just be a broken environment:

FAIL
mv: cannot remove 'C:/Windows/Temp/test-integration/containerd.log': Device or resource busy
[SC] DeleteService SUCCESS

SERVICE_NAME: containerd-test 
        TYPE               : 10  WIN32_OWN_PROCESS  
        STATE              : 4  RUNNING 
                                (STOPPABLE, NOT_PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
mingw32-make: *** [makefile:224: cri-integration] Error 1
Error: Process completed with exit code 2.

@dcantah
Member

dcantah commented Feb 21, 2023

cc @kevpar @kiashok, the windows platform code strikes again! 😆 Thanks for this change @hach-que, I think it makes sense (and probably should be supported). I could've sworn I remembered some other change with the same motivation but can't find it so 🤷‍♂️. I'll try and look at this soon if no one else gets to it

@dcantah dcantah self-requested a review February 21, 2023 02:00
@thaJeztah
Member

Wondering if this is also something that could be useful in hcsshim, so that there's a more canonical location to check whether a version is "compatible".

Signed-off-by: June Rhodes <504826+hach-que@users.noreply.github.com>
@hach-que
Contributor Author

After going through the effort of writing my own Kubernetes manager (😅) so I could deploy out a custom build of containerd to the Windows nodes, I can confirm that these changes work. I did backport them to 1.6.18 for this test.

(Note the different kernel number presented by cmd.exe)

[screenshot: cmd.exe in the container showing the kernel build number]

Comment on lines +24 to +37
func isClientOS() bool {
k, err := registry.OpenKey(registry.LOCAL_MACHINE, `SOFTWARE\Microsoft\Windows NT\CurrentVersion`, registry.QUERY_VALUE)
if err != nil {
return false
}
defer k.Close()

installationType, _, err := k.GetStringValue("InstallationType")
if err != nil {
return false
}

return installationType == "Client"
}
Member


@thaJeztah Thinking on your comment, this kind of feels like it could be a helper in hcsshim/osversion also..

Member

@dcantah dcantah Feb 23, 2023


@kevpar Curious on your thoughts. This regkey approach was recommended (6 years ago however) at least as an alternative to GetProductInfo: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/legacy/ee391629(v=vs.85)

@@ -53,6 +55,31 @@ func (m windowsmatcher) Match(p specs.Platform) bool {
if strings.HasPrefix(p.OSVersion, m.osVersionPrefix) {
return true
}
if m.isClientOS && p.OSVersion != "" {
split := strings.Split(p.OSVersion, ".")
Member


We already indirectly import a package for version parsing (go.mod, vendor). Should we just reuse it instead of maintaining a pretty fragile logic?

Member


Unfortunately, what this is parsing isn't semver (it's Windows 4-part OS versions like 10.0.20348.1850), so from what I can tell, that library won't actually parse them correctly. 😅
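To illustrate the point (hypothetical parsing code, not from this PR): a Windows version has four numeric components, one more than semver's major.minor.patch, so a strict semver parser would reject it, while plain integer splitting handles it fine:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseWindowsVersion splits a Windows version string such as
// "10.0.20348.1850" into its numeric components. Windows versions have
// up to four parts (major.minor.build.revision); semver allows only three.
func parseWindowsVersion(v string) ([]int, error) {
	parts := strings.Split(v, ".")
	out := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid component %q in %q", p, v)
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	v, err := parseWindowsVersion("10.0.20348.1850")
	fmt.Println(v, err)
}
```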

@aznashwan
Contributor

FWIW this is a "subset" of #7856, which aimed to allow pulling/running any image on RS5+ under Hyper-V isolation.

While the discussion on that (and the first instance this came up in #7431) was left at adding dedicated platform matchers for each container isolation type (host, process, and hyper-v) instead of using the single default matcher, IMHO this particular corner-case is reasonable to put there.

@thaJeztah
Member

We already indirectly import a package for version parsing (go.mod, vendor). Should we just reuse it instead of maintaining a pretty fragile logic?

@mxpv not entirely sure if these versions should be parsed with SemVer semantics (don't think they are strictly SemVer, but Microsoft's own format and semantics) (also see https://github.com/microsoft/hcsshim/blob/5871d0c4436f131c377655a3eb09fc9b5065f11d/osversion/osversion_windows.go#L10-L17).

That said, there may be uses outside of containerd itself for handling these versions: parsing them, comparing them, and determining whether versions are compatible (which turned out to be impossible in some cases based on version alone, as it sometimes depends on specific Windows kernel versions).

Taking all of that into account, I'd love to see this become either part of hcsshim, or even a separate module (if dependency trees become complicated), so that there's a canonical source of truth for these kinds of comparisons (some of which may require deep insight into kernel features / compatibility).

See my earlier comments above, and #8137 (comment)

@kevpar Curious on your thoughts. This regkey approach was recommended (6 years ago however) at least as an alternative to GetProductInfo: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/legacy/ee391629(v=vs.85)

@thaJeztah
Member

@kevpar any thoughts / opinions on the above? 🤗

@TBBle
Contributor

TBBle commented Mar 20, 2023

I'm not sure this change is semantically correct, as the stable ABI guarantee isn't "Windows 11 clients", it's all about the build numbers. It just happens that Windows 11 was the first thing in the wild that meets the criteria, and the only one at the time of writing for that blog post.

The actual feature is that Windows Server 2022 containers (i.e. OS version 10.0.20348) can run in process isolation on any newer Windows build, e.g. Windows 11 10.0.22000.*. This also includes current Windows Server Insiders Preview builds "VNext" (currently 10.0.25314) as the host, which this change will fail to match.

The explicit commitment (per microsoft/Windows-Containers#117 (comment)) is that Windows Server 2022-based containers will be functional on at least the next formal Windows Server release, so I think the safest approach is to allow any host 10.0.20348 or greater to pull any image between 10.0.20348 and its own version (although until the next Windows Server release is much closer, there won't be any images greater than 10.0.20348 anyway, as MS don't produce container image bases for VNext anymore).

So I'd suggest getting rid of the m.isClientOS stuff, and hard-coding two different behaviours for m.osVersionPrefix (host version) of either pre-stable-ABI (10.0.20347 or less) and stable-ABI (10.0.20348 and greater).

The latter will come out somewhat like what the Hyper-V isolation matcher will need to look like too, although from memory the challenge with Hyper-V matching is that the knowledge of "We're using HyperV isolation" is not currently used at platform-matcher time. I believe the intent there was to use a different matcher selected by the CRI layer when running in HyperV isolation.

On that note, would it make sense to split windowsmatcher and have two matcher implementations (pre-stable-ABI and stable-ABI) and choose one as default based on host OS version? That would probably make things slightly simpler, and push a bunch of "What version are we using?" questions out of the Less and Match methods. Then Hyper-V will be a third matcher implementation. (Or a differently-configured range-matcher). Possibly a separate one again for host-process (which needs to match "No version" as well) but I believe that was already solved a different way in #8101. (Although that would affect non-host-process isolation too, and so is possibly a lingering bug...)

@hach-que
Contributor Author

hach-que commented Mar 20, 2023

If the guarantee around stability is subject to change (i.e. as you said, LTSC2022 will run on 2025, but perhaps not 2028), then does it make sense to have, in addition to sensible defaults, some way of overriding this in the configuration? That way, when new builds of Windows 11 or Windows Server are pushed out, users of containerd can just override the compatibility in the configuration file without having to patch the containerd binary.

i.e. that way if I wanted to override this to get the effect I currently get through patching, I could do:

min_container_osversion=10.0.20348
max_container_osversion=10.0.99999

@TBBle
Contributor

TBBle commented Mar 20, 2023

It's an interesting idea. To some extent, I'm not too fussed about this problem as I expect the Windows team to signal an ABI break in the LTSC series with plenty of time to update containerd (since it'll appear at some point in the Insiders builds and/or Azure Stack HCI too; the containerd test suite can be run on such hosts fairly easily, and we did that late in the LTSC 2022 release cycle, for example, to shake out issues).

I recognise that I may be wrong about this, and I don't know if containerd's support-lifecycle policy would advise against running years-old versions of containerd on newer Windows releases. It's very possible they'd be broken for other reasons, but not certain.

That said, in the success case, the "select the highest OS build-number that passes the matcher" logic being added here will mean that as long as an LTSC"2028"-based image has been pushed alongside the LTSC2022-based image in the image list, the "LTSC2028" image will be chosen on LTSC2028 or later. Of course, the LTSC2022-based image will be chosen on LTSC2022 and "LTSC2025", because an "LTSC2028"-based image will be newer, and hence rejected. This requires no config hooks.

So the failure case where that configuration option makes a difference is one where only an LTSC2022-based image is available: the mooted config change is turning an "incompatible version" container-start failure into a "no matching version found" pull failure on an "LTSC2028" host. It wouldn't be able to turn a failure-case into a success-case, just move the failure slightly earlier. Which seems reasonably-low value for introducing a new configuration knob for a currently-configless system.

Note: The max_container_osversion config option in particular introduces the risk of someone doing exactly what you wrote there (max_container_osversion=10.0.99999) and then when an image list they're already referencing adds an "LTSC2028"-based image in "2027", suddenly their existing, stable, reliable LTSC2022 hosts are pulling an image that's too new and fail, even though an LTSC2022-based image is also present in the list. So if we do want such a config option, it should be setting a minimum only, as the maximum must always be "current build" (or lower... and that's very unlikely to ever be valuable as a matcher setting).

All the above is on the assumption that an "LTSC2028" host can't run an LTSC2022-based image. We don't know when "LTSC2028" here will happen, if ever; all we know is we aren't currently promised that will never happen; we are promised that the next LTSC release won't be "LTSC2028". Which is why I don't think it's worth putting a lot of effort into trying to predict this situation now, since we're talking about guessing more than a full Windows Server LTSC cycle into the future, at least.

@hach-que
Contributor Author

@TBBle That might be true for Windows Server builds, but I expect the breakage for Windows 11 to be a lot less signalled, and I can totally see there being a Windows 11 update at some point after 2025 that unexpectedly switches compatibility from "2022" to "2022+2025" and then "2025 only", outside the normal Windows Server release cycle.

I don't want us to punt on making this configurable (i.e. hard-coding it so that LTSC2022 is only accepted between 10.0.20348 and 10.0.22621) and then we're doing this dance again when 23H1 or Server 2025 comes out. If the guarantee is not "2022 images will work forever for 10.0.20348 onwards", then I think having a configurable override is beneficial for those folks who are not running containerd on a Windows Server LTSC release because their blocker on getting things working will be "update the configuration" and not "patch and build new binaries".

@TBBle
Contributor

TBBle commented Mar 21, 2023

then I think having a configurable override is beneficial for those folks who are not running containerd on a Windows Server LTSC release because their blocker on getting things working will be "update the configuration" and not "patch and build new binaries".

The same thing applies for Windows 11, in that you can't use the config options to turn a failure case into a success case, only a different failure. So there's no one who can "update the configuration to change the matched versions" to unblock themselves.

I don't want us to punt on making this configurable (i.e. hard-coding it so that LTSC2022 is only accepted between 10.0.20348 and 10.0.22621)

That's not what we should be doing. We should be encoding that an LTSC2022-based image is accepted on any host 10.0.20348 or greater, because the only time that produces a failure is when the host cannot run the LTSC2022-based image and there is no entry in the manifest list for a newer base. And in that case, there is no success option, because that second clause is also a runtime requirement even if you pull an image without a manifest list (and hence don't invoke the platform matcher in the first place).

Based on the replies, I suspect my suggested implementation here hasn't been clear. The simple logic for the Windows (process-isolation) platform matcher should be:

If host < 20348: Require exact release version (i.e. 10.0.XXXX).
If host >= 20348: Accept any os.version from 20348 up to the host version, inclusive. Prefer the highest accepted image version.

That way, if a usable image is in the list, it will always be chosen, no configuration necessary, "forever"*. A base-image ABI break may allow an unusable image to be chosen but only if there is no usable image available. And we can add that once we know about it, to improve the failure description.

*: Assuming the current "Stable API" approach remains in-use. If MS revert to "exact match only" semantics for some reason, then this logic will still pick a usable image if one exists, as that's a degenerate case. However, other matching schemas may not magically work.
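A minimal sketch of that two-branch rule (hypothetical helper names, not the actual windowsmatcher code; image selection among multiple matches would happen elsewhere):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const stableABIBuild = 20348 // Windows Server 2022

// build extracts the build number (third component) from a version
// string like "10.0.20348" or "10.0.20348.1850"; 0 means malformed.
func build(osVersion string) int {
	parts := strings.Split(osVersion, ".")
	if len(parts) < 3 {
		return 0
	}
	n, _ := strconv.Atoi(parts[2])
	return n
}

// matches implements the proposed process-isolation rule:
//   - hosts before build 20348 require an exact build match;
//   - hosts at build 20348 or later accept any image from 20348 up to
//     the host's own build, inclusive.
func matches(hostVersion, imageVersion string) bool {
	h, i := build(hostVersion), build(imageVersion)
	if h == 0 || i == 0 {
		return false
	}
	if h < stableABIBuild {
		return h == i
	}
	return i >= stableABIBuild && i <= h
}

func main() {
	fmt.Println(matches("10.0.22621", "10.0.20348")) // Windows 11 host, LTSC2022 image
	fmt.Println(matches("10.0.17763", "10.0.20348")) // pre-stable-ABI host
}
```

Under this scheme no configuration knob is needed: a newer host simply accepts every stable-ABI image up to its own build.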

@hach-que
Contributor Author

hach-que commented Mar 21, 2023

I don't want us to punt on making this configurable (i.e. hard-coding it so that LTSC2022 is only accepted between 10.0.20348 and 10.0.22621)

That's not what we should be doing. We should be encoding that an LTSC2022-based image is accepted on any host 10.0.20348 or greater, because the only time that produces a failure is when the host cannot run the LTSC2022-based image and there is no entry in the manifest list for a newer base. And in that case, there is no success option, because that second clause is also a runtime requirement even if you pull an image without a manifest list (and hence don't invoke the platform matcher in the first place).

Sans the client-only part of the check, is that not what the change currently does?

Unless I misunderstood, your concern was that 2028 would be pulling 2022 images that it's not compatible with and thus we should constrain the check to only version ranges that we know work at the time containerd is built.

I saw your update while I was writing that response:

If host < 20348: Require exact release version (i.e. 10.0.XXXX).
If host >= 20348: Accept any os.version from 20348 up to the host version, inclusive. Prefer the highest accepted image version.

Seems reasonable to me.

@TBBle
Contributor

TBBle commented Mar 21, 2023

Yes, that's correct. If we drop the isClient part of the check, then I think (without re-reviewing) this change is on the right path. That was what I meant to say in my original comment, but looking back I realise I was very unclear about that, and accidentally produced the opposite concern.

@dims
Member

dims commented Mar 29, 2023

/ok-to-test

@k8s-ci-robot

@dims: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@samuelkarp
Member

/ok-to-test

@Ccccclong

This feature would be really helpful. Can I know if this PR is still active?

@hach-que
Contributor Author

hach-que commented Jun 3, 2023

containerd 1.7 broke HostProcess support for filesystem drivers (due to the new silo stuff), so at the moment I'm just running containerd 1.6.18 with this patched change. I haven't had a chance to come back to this PR since I wouldn't be able to migrate to the upstream/merged version anyway.

@TBBle
Contributor

TBBle commented Jun 4, 2023

Is there an open ticket for that new issue somewhere? I had a quick poke around and couldn't see one, but maybe I'm not looking in the right place or using the right keywords.

Edit: I see it: microsoft/hcsshim#1699

@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the Stale label Mar 14, 2024

This PR was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this Mar 22, 2024