Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kernel-6.1 sources and use it for -dev variants #3121

Merged
merged 8 commits into from
Jun 15, 2023

Conversation

foersleo
Copy link
Contributor

@foersleo foersleo commented May 16, 2023

Issue number: #2855

Description of changes:

Add the sources for Linux kernel-6.1 and enable it for all the dev variants.

The following todo list does not block review of what is already there (config analysis) and will be added soon.
TODO:

  • Adjust SELinux policies for new classes user_namespace and io_uring (See Testing section below)
  • Update lookaside cache
  • potentially disable candidate options (See Testing section below)
  • Test build of an external kmod using kmod-kit

Testing done:

  • Launch a aws-dev image and see it booting successfully with the correct kernel:
bash-5.1# uname -a
Linux ip-10-1-171-244.eu-central-1.compute.internal 6.1.19 #1 SMP PREEMPT_DYNAMIC Mon May 15 16:07:55 UTC 2023 x86_64 GNU/Linux
  • Check for missing SELinux classes (See todo above):
May 16 13:22:07 localhost kernel: SELinux:  Class io_uring not defined in policy.
May 16 13:22:07 localhost kernel: SELinux:  Class user_namespace not defined in policy.
May 16 13:22:07 localhost kernel: SELinux: the above unknown classes and permissions will be denied
May 16 13:22:07 localhost kernel: SELinux:  policy capability network_peer_controls=1
May 16 13:22:07 localhost kernel: SELinux:  policy capability open_perms=1
May 16 13:22:07 localhost kernel: SELinux:  policy capability extended_socket_class=1
May 16 13:22:07 localhost kernel: SELinux:  policy capability always_check_network=0
May 16 13:22:07 localhost kernel: SELinux:  policy capability cgroup_seclabel=1
May 16 13:22:07 localhost kernel: SELinux:  policy capability nnp_nosuid_transition=1
May 16 13:22:07 localhost kernel: SELinux:  policy capability genfs_seclabel_symlinks=1
May 16 13:22:07 localhost kernel: SELinux:  policy capability ioctl_skip_cloexec=0
May 16 13:22:07 localhost systemd[1]: Successfully loaded SELinux policy in 324.808ms.
  • check for config changes against existing 5.15 kernels on all dev variants:
config-aarch64-aws-dev-diff:	 89 removed, 221 added,   4 changed
config-aarch64-metal-dev-diff:	 89 removed, 222 added,   4 changed
config-x86_64-aws-dev-diff:	 82 removed, 232 added,   6 changed
config-x86_64-metal-dev-diff:	 82 removed, 234 added,   6 changed
config-x86_64-vmware-dev-diff:	 98 removed, 230 added,   7 changed

As expected with a major kernel update there is lots of changes to the kernel config. I have tried to organize them into buckets for easier consumption and review. All changes that can be mapped to a specific commit and upstream series have been annotated with the corresponding data and a short summary has been added to all options.

The change buckets are (ordered by decreasing interest):

  • Candidates for disabling (6/468 config changes). These are candidates for explicit disabling. I currently do not really see a good use case for damon in Bottlerocket, so saving the build cycles might be a good idea. Similarly we should probably disable the advanced driver interfaces for ICE as we have done for other NIC drivers. Details in diff_disable.csv
  • Downstream config changes Amazon Linux did (32/468 config changes). These are mainly due to dropping downstream backports of some drivers and a downstream port of lustreFS as well as some smaller config changes. Non of these should be problematic. Details are in diff_al.csv
  • Changes that are not needed right away but might be interesting (9/468 config changes). This includes some options around FIPS related proc fs files to advertise certain fips properties, some net vendors that provide modern 10G+ NICs that might be relevant for metal and the audit driver to add DM_INTEGRITY logging to the audit subsystem. Details in diff_interesting.csv.
  • Unexplainable changes (5/468 config changes). These are config changes we are seeing between 5.15 and 6.1 for architectures where these options should not have been available or set to the new value all along through automatic selection or architecture dependent invisibility. I have no good explanation for these. Details in diff_wat.csv.
  • Architecture and toolchain specific changes (72/468 config changes). These are config changes where either a differentiation is made based on the architecture or where the used build toolchain will have an impact on how certain features can be used or have to be configured. Nothing we can or would do about these. Details can be found in diff_architecture.csv and diff_toolchain.csv.
  • New drivers that are not needed (85/468 config changes). These are drivers for various hardware that we do not need. A lot of these is for certain embedded applications. Nothing really interesting in this bucket. Details can be found in diff_new_drivers_not_needed.csv.
  • New functionality I am confident we do not need (47/468 config changes). This is a mix of various new functionalities, from debug features to certain subsystems to new counters that can be exposed through sysfs. I think that non of these would be relevant for Bottlerocket use cases. Details can be found in diff_new_not_needed.csv.
  • General config cleanup (197/468). This is a mix of various janitorial changes to the kernels config system. This includes removals due to deprecation of features, renaming of config options, changing of visibility through changes in dependencies, refactoring with splitting/merging certain options. Details can be found in diff_config_cleanup.csv.
  • The uninteresting remainder (15/468). A bunch of options I could not really match to any of the other buckets, but deemed not really interesting. Details in diff_uninteresting.csv.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@foersleo
Copy link
Contributor Author

foersleo commented May 17, 2023

The three failing checks are expected to fail with me not having uploaded the srpm to the lookaside cache. I am hesitant to do that until we have some more clarity on pushing this out.

SELinux

I will have to do some extra dive there. I do not have a lot of experience on getting these together and both new classes seem to be not similar to any other classes to model them after that.

The new classes have been introduced as follows:

While poking around I have also found that the lockdown selinux implementation has been removed through f5d0e5e9d72d3a - selinux: remove the SELinux lockdown implementation in 5.16. As far as I can tell we have not done any particular policies for the lockdown selinux implementation, so just removing them should be fine. (Edit: However, given that we still have two kernels that have the lockdown selinux support we probably need to keep it in until we retire 5.10 and 5.15 at some point in the future).

@foersleo
Copy link
Contributor Author

Added the new SELinux classes. I am not quite sure if we would want to further restrict the access/use of these generally. Particularly the user_namespace one might be interesting here.

@arnaldo2792
Copy link
Contributor

For orchestrated containers, the SELinux policies permissions are "relaxed", meaning that they are granted almost all permissions since we don't know what users are doing in their workloads. For io_uring and user_namespaces, I expect we want the same, so you may have to add something like this:

4e09a94

@markusboehme
Copy link
Member

I've got to review this in batches, but you made it easy to do so. :-)

  • Candidates for disabling (6/468 config changes): agree with those. Some of the DAMON features (LRU list sorting, reclaim) would be interesting if we could benchmark those. No harm done in leaving them out for now, they'd need to be opted into at boot time anyway.
  • Downstream config changes Amazon Linux did (32/468 config changes): A few thing that stand out here:
    • NET_VENDOR_MELLANOX: The Mellanox drivers are needed. We enabled the MLX5 driver on user request (commit 2e5458b) in addition to the MLX4 driver.
    • LUSTREFSX_FS: Same goes for Lustre support. I am surprised Amazon Linux dropped this. We were initially missing it for the 5.15 kernel, same as Amazon Linux, and it was noticed (Bottlerocket v1.24 is not able to mount fsx for lustre filesystem #2685). Amazon FSx for Lustre needs this.
    • FS_ENCRYPTION: Somewhat on the fence about including this, leaning towards not doing so. Not all filesystems support this, of the mainstream ones only ext4 does. Will offering this tie us to ext4 for the data partition? If used on additional disks (ephemeral disks, EBS volumes), EBS volume encryption or block-level encryption with dm-crypt might be more universal choices.
  • Changes that are not needed right away but might be interesting (9/468 config changes): Looks good to me.

@foersleo
Copy link
Contributor Author

I've got to review this in batches, but you made it easy to do so. :-)

[...]

  • Downstream config changes Amazon Linux did (32/468 config changes): A few thing that stand out here:
    • NET_VENDOR_MELLANOX: The Mellanox drivers are needed. We enabled the MLX5 driver on user request (commit 2e5458b) in addition to the MLX4 driver.

This one only has an impact on the VMware variant. For metal and AWS variants we explicitly set the ones we need there. The table shows an x only in the vmware-variant's column. If we need this one for vmware, we can add it back. Given that we did not explicitly enable it there before I would have thought we are good leaving it out.

Good point, will have a discussion with Al team on why they dropped this.

  • FS_ENCRYPTION: Somewhat on the fence about including this, leaning towards not doing so. Not all filesystems support this, of the mainstream ones only ext4 does. Will offering this tie us to ext4 for the data partition? If used on additional disks (ephemeral disks, EBS volumes), EBS volume encryption or block-level encryption with dm-crypt might be more universal choices.

Working through all these i was under the impression that it would allow to encrypt and not force encryption by default. Re-reading this now I come to the same conclusions as you.

@markusboehme
Copy link
Member

markusboehme commented May 19, 2023

  • Downstream config changes Amazon Linux did (32/468 config changes): A few thing that stand out here:

    • NET_VENDOR_MELLANOX: The Mellanox drivers are needed. We enabled the MLX5 driver on user request (commit 2e5458b) in addition to the MLX4 driver.

This one only has an impact on the VMware variant. For metal and AWS variants we explicitly set the ones we need there. The table shows an x only in the vmware-variant's column. If we need this one for vmware, we can add it back. Given that we did not explicitly enable it there before I would have thought we are good leaving it out.

Agree, everything is as it should be. I totally missed this only affects the VMware build.

  • FS_ENCRYPTION: Somewhat on the fence about including this, leaning towards not doing so. Not all filesystems support this, of the mainstream ones only ext4 does. Will offering this tie us to ext4 for the data partition? If used on additional disks (ephemeral disks, EBS volumes), EBS volume encryption or block-level encryption with dm-crypt might be more universal choices.

Working through all these i was under the impression that it would allow to encrypt and not force encryption by default. Re-reading this now I come to the same conclusions as you.

Yes, that's the case: Users can opt into encrypting individual directories if the filesystem supports this. Since ext4 (the current file system for the data partition) supports per-file encryption but e.g. XFS does not, I was worried that exposing this feature might get users stuck with ext4. I since looked into this some more and per-file encryption does not work through overlayfs (yet?). That lessens the concern somewhat, though it may at become relevant again in the future. For situations where the filesystem is under user control, like ephemeral disks or EBS volumes, dm-crypt seemed like a better choice.

Unexplainable changes (5/468 config changes):

  • HAVE_LIVEPATCH and related changes: Those were downstream Amazon Linux patches. Kernel live patching on Arm is not supported with the upstream kernel. While there have been patches, how to approach it is still being discussed. This will be a topic at this year's LPC as well. I'm surprised to see these missing for the 6.1 kernel, though. The documentation for Amazon Linux 2023 states it is supported.
  • MLX_PLATFORM: 🤷
  • ARCH_NR_GPIO on Arm: For 64-bit Arm this defaulted to 512 using an arch-generic definition on v5.15; only 32-bit Arm knew ARCH_NR_GPIO. This changed in v5.19. A value of 0 means it's still defaulting back to the arch-generic definition.

Architecture and toolchain specific changes (72/468 config changes):

  • ARCH_WANTS_THP_SWAP: THP_SWAP looks interesting, and apparently is an improvement on x86_64 as well. We're not currently using swap (IIRC there's an issue wishing for it), but might be something to revisit when that changes.
  • HAVE_STACK_VALIDATION: Used to be set on Arm due to support for kernel live patching.

New drivers that are not needed (85/468 config changes)Architecture and toolchain specific changes (72/468 config changes): Looks good to me!

@markusboehme
Copy link
Member

New functionality I am confident we do not need (47/468 config changes): Looks good to me.

General config cleanup (197/468):

  • VIVALDI_FMAP: Nothing to be done for us here, but that upstream change is questionable. To save 300 bytes in an all-yes build, we now have a new module for a single small function due to us building KEYBOARD_ATKBD as a module.
  • XEN_PCI_STUB: This just made me notice that we can XEN_BACKEND=n to cut out a bunch of code to act as a Xen dom0 or driver domain. Let's do this after the merge.

The uninteresting remainder (15/468): Looks good to me.

Copy link
Member

@markusboehme markusboehme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional notes:

  • I tested kdump still works.
  • Would be good to test building an out-of-tree module using the still works.
  • I'll have to defer judgement on the semantics of the SELinux changes, alas. Is my interpretation correct that everything is being allowed to use the mentioned io_uring features and create user namespaces (still blocked via sysctl)?

packages/kernel-6.1/Cargo.toml Outdated Show resolved Hide resolved
packages/kernel-6.1/Cargo.toml Outdated Show resolved Hide resolved
packages/selinux-policy/class.cil Outdated Show resolved Hide resolved
packages/kernel-6.1/config-bottlerocket-metal Show resolved Hide resolved
packages/selinux-policy/systems.cil Outdated Show resolved Hide resolved
packages/kernel-6.1/Cargo.toml Outdated Show resolved Hide resolved
packages/kernel-6.1/kernel-6.1.spec Show resolved Hide resolved
packages/selinux-policy/systems.cil Outdated Show resolved Hide resolved
packages/selinux-policy/systems.cil Outdated Show resolved Hide resolved
@foersleo
Copy link
Contributor Author

Above force push tended many of the comments. Still need to clean up the io_uring selinux policies according to ben's comment.

@foersleo
Copy link
Contributor Author

Latest force-push was to rebase on top of current develop branch. No changes to the feature branch.

@foersleo
Copy link
Contributor Author

The new change disabling FS_ENCRYPTION had the expected results disabling FS_ENCRYPTION and removing the dependent config options:

$ cat diff-report 
==> 6.1_fixups/config-aarch64-aws-dev-diff <==
-FS_ENCRYPTION_ALGS y
-FS_ENCRYPTION_INLINE_CRYPT n
 FS_ENCRYPTION y -> n

==> 6.1_fixups/config-aarch64-metal-dev-diff <==
-FS_ENCRYPTION_ALGS y
-FS_ENCRYPTION_INLINE_CRYPT n
 FS_ENCRYPTION y -> n

==> 6.1_fixups/config-x86_64-aws-dev-diff <==
-FS_ENCRYPTION_ALGS y
-FS_ENCRYPTION_INLINE_CRYPT n
 FS_ENCRYPTION y -> n

==> 6.1_fixups/config-x86_64-metal-dev-diff <==
-FS_ENCRYPTION_ALGS y
-FS_ENCRYPTION_INLINE_CRYPT n
 FS_ENCRYPTION y -> n

@bcressey
Copy link
Contributor

bcressey commented Jun 1, 2023

This is probably on your radar, but we should also add the qlen patch to 6.1.

@bcressey
Copy link
Contributor

bcressey commented Jun 1, 2023

Candidates for disabling

I'm fine with disabling these.

Downstream config changes Amazon Linux did

No concerns as long as Lustre comes back.

Changes that are not needed right away but might be interesting

The FIPS selftest sounds like something we might want or need. It might be OK if it has a negligible impact on boot time.

Unexplainable changes

+1

Architecture and toolchain specific changes

Should we be setting PREEMPT_DYNAMIC on arm64 now that it's supported?

New drivers that are not needed

+1

New functionality I am confident we do not need

EFI_ZBOOT sounds interesting if it lets us have a compressed arm64 image.

MODULE_DECOMPRESS seems good if it eventually lets us have more flexibility on the compression format we're able to use for our modules. I'm sad we had to roll back the zstd change because older versions of kmod in pods didn't support it, and it'd be nice for that limitation to go away.

General config cleanup

It seems like we had STACK_VALIDATION before for arm64, and now we don't have its replacement CONFIG_OBJTOOL. Is this related to the lack of live-patching support - we're just missing the patches that would be needed to enable it for 6.1?

Do we need the i915 driver? It seems like there was a lot of config churn from that - ACPI_WMI, DRM_DISPLAY_HELPER, etc.

The uninteresting remainder

+1

If it's creating delta from the Amazon Linux kernel config, we could potentially drop these items:

CONFIG_SERIO_I8042=m
CONFIG_KEYBOARD_ATKBD=m
CONFIG_MOUSE_PS2=m

I.e. we don't need or want the modules if they're disabled altogether in recent AL kernels.

@markusboehme
Copy link
Member

MODULE_DECOMPRESS seems good if it eventually lets us have more flexibility on the compression format we're able to use for our modules. I'm sad we had to roll back the zstd change because older versions of kmod in pods didn't support it, and it'd be nice for that limitation to go away.

We had to roll this back due to old user space not supporting zstd. Unfortunately, even with MODULE_DECOMPRESS we need a (even more) recent user space, since it has to explicitly request kernel-side decompression. On top of that the kernel currently only supports xz and gzip here, but unlike widespread user space support this would at least be easily fixable ourselves.

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
Signed-off-by: Leonard Foerster <foersleo@amazon.com>
In linux 5.16 new io_uring related access controls were added to
SELinux. Add the new class and the new permission into the blanket
"systems" permission set, to not add new SELinux restrictions.

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
In linux 6.1 new user_namespace related access controls were added to
SELinux. Add the new class and the new permission into the blanked
"systems" permissions set, to not add new SELinux restrictions.

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
Amazon Linux enabled FS_ENCRYPTION with the introduction of kernel-6.1.
With Bottlerocket, that might lock us into certain file system choices
inadvertantly. Disable it for now and rely on block level encryption
through EBS volume encryption or dm-crypt instead.

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
[port from kernel-5.15: ddc0757]

Increase the kernel's default value for the net.unix.max_dgram_qlen sysctl
to 512. This is a change to the kernel rather than a plain sysctl setting
since systemd-sysctl only applies settings to the host, while the changed
default value in the kernel also applies to every new network namespace.

Signed-off-by: Markus Boehme <markubo@amazon.com>

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
For most NICs we have been going with the bare driver without any
additionally configurable features. With the 6.1 kernel we pick up
additional features for the already included ICE driver. Disable these
features as we have done for other NICs before.

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
We currently do not have a use case for the DAMON subsystem. Disable it
for now to trim some code we build. If we come across a use case we can
add it into our config.

Signed-off-by: Leonard Foerster <foersleo@amazon.com>
@foersleo
Copy link
Contributor Author

foersleo commented Jun 8, 2023

Latest push includes following adjustments:

  • remove build.rs from packages/kernel-6.1
  • add max_dgram_qlen fix
  • adjustments to io_uring selinux rules
  • remove candidate kconfig options as discussed (see diff-report below as expected)
==> 6.1_fixups_2/config-aarch64-aws-dev-diff <==
-DAMON_DBGFS y
-DAMON_LRU_SORT y
-DAMON_PADDR y
-DAMON_RECLAIM y
-DAMON_SYSFS y
-DAMON_VADDR y
-PAGE_IDLE_FLAG y
 DAMON y -> n

==> 6.1_fixups_2/config-aarch64-metal-dev-diff <==
-DAMON_DBGFS y
-DAMON_LRU_SORT y
-DAMON_PADDR y
-DAMON_RECLAIM y
-DAMON_SYSFS y
-DAMON_VADDR y
-PAGE_IDLE_FLAG y
 DAMON y -> n
 ICE_SWITCHDEV y -> n

==> 6.1_fixups_2/config-x86_64-aws-dev-diff <==
-DAMON_DBGFS y
-DAMON_LRU_SORT y
-DAMON_PADDR y
-DAMON_RECLAIM y
-DAMON_SYSFS y
-DAMON_VADDR y
-PAGE_IDLE_FLAG y
 DAMON y -> n

==> 6.1_fixups_2/config-x86_64-metal-dev-diff <==
-DAMON_DBGFS y
-DAMON_LRU_SORT y
-DAMON_PADDR y
-DAMON_RECLAIM y
-DAMON_SYSFS y
-DAMON_VADDR y
-PAGE_IDLE_FLAG y
 DAMON y -> n
 ICE_HWTS y -> n
 ICE_SWITCHDEV y -> n

@armenr
Copy link

armenr commented Jun 9, 2023

Happy to stay completely out of the way and not pollute your PRs, but we're very excited for this. We've been waiting for a while to be able to properly use BBR congestion control for our workloads in EKS. 👏

If you need real-life customer testers, I volunteer.

Copy link
Contributor

@stmcginnis stmcginnis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not my area of expertise, but everything looks correct to me and at least no structural issues that I could see.

Copy link
Member

@markusboehme markusboehme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✨ 🐧

@foersleo foersleo merged commit 5c033df into bottlerocket-os:develop Jun 15, 2023
38 checks passed
@foersleo foersleo deleted the kernel-6.1 branch June 15, 2023 07:40
@armenr
Copy link

armenr commented Jun 15, 2023

Sorry for the potentially inappropriate post on this PR, but what's a good way to test the dev/6.x builds? 🫠

@foersleo
Copy link
Contributor Author

Sorry for the potentially inappropriate post on this PR, but what's a good way to test the dev/6.x builds? 🫠

It is absolutely not inappropriate at all. We are extremely happy when people want to test and extend their helping hands in the project overall. Thank you for offering your time.

Unfortunately, we currently do not have a great way to supply you with AMIs for testing. Additionally the *-dev variants do not come with an orchestrator, so these are not super helpful if you want to use them in a kubernetes cluster.

I currently have a testing commit for nvidia-variants that aids me in testing the new kernel on those platforms for a subsequent PR enabling the new nvidia kmods. You should be able to do something similar to that to get a variant of your choice working with the 6.1 kernel for testing. You can find that commit on my fork. Instructions for building are in BUILDING.md.

Feel free to reach out if you have any further questions or need any assistance.

@armenr
Copy link

armenr commented Jun 15, 2023

Thank you @foersleo !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants