Skip to content

dkms: fix building nvidia open kernel modules against clang and thin/full lto compiled kernel#417

Merged
scaronni merged 1 commit intodkms-project:masterfrom
ltsdw:fix-lto-building
Oct 23, 2024
Merged

dkms: fix building nvidia open kernel modules against clang and thin/full lto compiled kernel#417
scaronni merged 1 commit intodkms-project:masterfrom
ltsdw:fix-lto-building

Conversation

@ltsdw
Copy link
Contributor

@ltsdw ltsdw commented May 21, 2024

Right now there's some issues when trying to compile open-gpu-kernel-modules against a kernel compiled using Clang and Thin/Full LTO.

Also the script will pick the wrong version of the strip command otherwise the module will fail to compile successfully:

strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
strip: BFD (GNU Binutils) 2.42.0 assertion fail /usr/src/debug/binutils/binutils-gdb/bfd/elf.c:4131
...

Is also necessary to set the OBJCOPY environment variable, otherwise even though the modules will compile the system will refuse to boot on it.

Fixes #416.

@evelikov
Copy link
Collaborator

evelikov commented Jun 4, 2024

Massive thanks for the fix. Can I trouble you for a simple test?

Say, for any/all of the Archlinux targets:

  • fetch and install the custom clang/lto kernel
  • run the tests

@ltsdw
Copy link
Contributor Author

ltsdw commented Jun 4, 2024

Thank you @evelikov .

Besides all the external modules I have compiling and installing normally on my system.
I ran the available dkms tests, this were the results:

Using kernel 6.9.1-2/x86_64
Checking module compression ...
config: CONFIG_MODULE_COMPRESS_ZSTD=y
files: /lib/modules/6.9.1-2/kernel/arch/x86/crypto/aesni-intel.ko.zst
Expected extension: .zst
Preparing a clean test environment
Test framework file hijacking
Adding the test module by version (expected error)
Adding the test module by directory
Adding the test module again (expected error)
Adding the test module by version (expected error)
Building the test module
--- test_cmd_expected_output.log	2024-06-04 16:42:18.060006565 -0300
+++ test_cmd_output.log	2024-06-04 16:42:19.721350689 -0300
@@ -1,6 +1,6 @@

 Building module:
 Cleaning build area...
-Building module(s)...
+make -j1 KERNELRELEASE=6.9.1-2 -C /usr/lib/modules/6.9.1-2/build M=/var/lib/dkms/dkms_test/1.0/build CC=clang OBJCOPY=llvm-objcopy LD=ld.lld...
 Signing module /var/lib/dkms/dkms_test/1.0/build/dkms_test.ko
 Cleaning build area...
Error: unexpected output from: dkms build -k 6.9.1-2 -m dkms_test -v 1.0

It builds and install the dkms_test normally but differs only by outputting the command used to build the module.

All seems fine.

@evelikov
Copy link
Collaborator

evelikov commented Jun 4, 2024

Something is causing the exact build command to be printed, instead of the pretty message. This should only happen when the verbose flag is set. Either by:

  • calling "dkms --verbose ...", or
  • having "verbose" set in the /etc/dkms/framework* files (check also subfolders)
  • ... we might also be inheriting the existing shell env. bits - perhaps "unset verbose" beforehand, just in case?

I'm leaning that there's a config file enabling this.

Alternatively... does the test suit work on your end, with normal (non clang/lto) kernel?

@ltsdw

This comment was marked as outdated.

@ltsdw

This comment was marked as outdated.

@ltsdw
Copy link
Contributor Author

ltsdw commented Jun 4, 2024

First, sorry about that, right now the arch linux package is sitting on version 3.0.12, I had to build 3.0.13 myself. Now the tests passes but with some caveats:

Tests output: dkms_tests_output.txt

Here are the changes I made to the run_test.sh (it isn't a suggestion to change to exact this, as some things I had to hardcode because I couldn't get around it):

@@ -496,9 +496,9 @@
 echo 'Checking modinfo'
 run_with_expected_output sh -c "modinfo /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext} | head -n 4" << EOF
 filename:       /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext}
-version:        1.0
-description:    A Simple dkms test module
 license:        GPL
+description:    A Simple dkms test module
+version:        1.0
 EOF
 
 if (( NO_SIGNING_TOOL == 0 )); then
@@ -621,9 +621,9 @@
 echo 'Checking modinfo'
 run_with_expected_output sh -c "modinfo /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext} | head -n 4" << EOF
 filename:       /lib/modules/${KERNEL_VER}/${expected_dest_loc}/dkms_test.ko${mod_compression_ext}
-version:        1.0
-description:    A Simple dkms test module
 license:        GPL
+description:    A Simple dkms test module
+version:        1.0
 EOF
 
 if (( NO_SIGNING_TOOL == 0 )); then
@@ -1419,7 +1419,7 @@
 Cleaning build area...
 Building module(s)...(bad exit status: 2)
 Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
 Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
 Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
 dkms autoinstall on ${KERNEL_VER}/${KERNEL_ARCH} failed for dkms_failing_test(10)
@@ -1438,7 +1438,7 @@
 Cleaning build area...
 Building module(s)...(bad exit status: 2)
 Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
 Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
 Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.
 dkms autoinstall on ${KERNEL_VER}/${KERNEL_ARCH} failed for dkms_failing_test(10)
@@ -1587,7 +1587,7 @@
 Cleaning build area...
 Building module(s)...(bad exit status: 2)
 Failed command:
-make -j1 KERNELRELEASE=${KERNEL_VER} all
+make -j1 KERNELRELEASE=${KERNEL_VER} all CC=clang OBJCOPY=llvm-objcopy LD=ld.lld
 Error! Bad return status for module build on kernel: ${KERNEL_VER} (${KERNEL_ARCH})
 Consult /var/lib/dkms/dkms_failing_test/1.0/build/make.log for more information.

The explanation, my modinfo shows first the license, then the description and only then the version:

$ modinfo dkms_test.ko.zst
filename:       /lib/modules/6.9.1-2/updates/dkms/dkms_test.ko.zst
license:        GPL
description:    A Simple dkms test module
version:        1.0
vermagic:       6.9.1-2 SMP preempt mod_unload 
name:           dkms_test
...

And about the variables CC, LD and OBJCOPY, right now the tests doesn't expect to encounter them, this isn't inherently from my PR, as we already been exposing the CC, and LD before (so the tests will fail on a clang compiled kernel even without my patch, so we may have to fix that. I even tried, but the said variables aren't exposed so we can't do something like CC=${CC} and so on, because CC isn't defined).

One suggestion would be to use in the run_test.sh a mechanism similar to what is done in the dkms.in right now to tell if the kernel was compiled using clang. Or we could change the dkms.in make_commands to include the CC, LD, etc. even when compiling using gcc, that would make the changes needed in the run_test.sh minimal.

Other than that the tests passes normally.

@evelikov
Copy link
Collaborator

evelikov commented Jun 6, 2024

Don't recall personally seeing varying order of the modinfo output... even tough it was reported before. Let me apply a quick hack for dkms and will set a fix for kmod upstream in a bit.

Looking at the make messages, I think we can strip the extra CC/LD/OBJDUMP entries in genericize_expected_output().

Let me prep a PR with the above two fixes, then you can rebase (+ hopefully add some tests) your work on top.

@evelikov evelikov mentioned this pull request Jun 6, 2024
@evelikov
Copy link
Collaborator

evelikov commented Jun 6, 2024

PR #422 should address the issues - please give it a test. Thanks in advance.

Looking at the kernel documentation - I wonder if we shouldn't just use the LLVM=... variable all together. I'm not sure if tracking each and every variable, is going to be portable across multiple kernel releases.

[1] https://www.kernel.org/doc/html/latest/kbuild/llvm.html

@ltsdw
Copy link
Contributor Author

ltsdw commented Jun 6, 2024

Yes, #422 fixes the failing tests for clang.

Looking at the kernel documentation - I wonder if we shouldn't just use the LLVM=... variable all together. I'm not sure if tracking each and every variable, is going to be portable across multiple kernel releases.

[1] https://www.kernel.org/doc/html/latest/kbuild/llvm.html

Yes that would be better indeed (also no change in the tests would be needed, unless you want to remove the CC, LD, bits). Would you like me to make this changes reflect this PR? We would still have to check for clang to use the correct strip though (even though LLVM=1 sets the STRIP envar).

@evelikov
Copy link
Collaborator

evelikov commented Jun 6, 2024

Would you like me to make this changes reflect this PR?

Whichever you're comfortable with, really.

We would still have to check for clang to use the correct strip though (even though LLVM=1 sets the STRIP envar).

We do check the STRIP variable (array really) from the module's dkms.conf and make it an in-dkms variable. Even so we can technically honour the STRIP set by LLVM=1. If you choose to do this (again up-to you), please update the man page.

@ltsdw ltsdw force-pushed the fix-lto-building branch 4 times, most recently from da36723 to 69fb5e4 Compare June 6, 2024 21:59
@ltsdw
Copy link
Contributor Author

ltsdw commented Jun 6, 2024

Sorry about that, I was trying not to create a bunch of commits while rebasing my local branch lol 😅

We do check the STRIP variable (array really) from the module's dkms.conf and make it an in-dkms variable. Even so we can technically honour the STRIP set by LLVM=1.

I didn't understand this part, for example, without these changes:

-        [[ ${strip[$count]} != no ]] && strip -g "$built_module"
+        if [[ ${strip[$count]} != no ]] && [[ ${CC} == "clang" ]]; then
+            llvm-strip -g "$built_module"
+        elif [[ ${strip[$count]} != no ]]; then
+            strip -g "$built_module"
+        fi

Even when using the LLVM=1 the wrong strip command will be picked. That's what I meant by still have to check for clang to use the right strip command.

If you choose to do this (again up-to you), please update the man page.

If I understood the above incorrectly or you have any suggestion I'll be more than happy to implement and update this PR 😄

@ltsdw
Copy link
Contributor Author

ltsdw commented Jun 6, 2024

OK, nvm, I understand now:

        case ${STRIP[$index]} in
            [nN]*)
                strip[$index]="no"
                ;;
            [yY]*)
                strip[$index]="yes"
                ;;
            '')
                strip[$index]=${strip[0]:-yes}
                ;;
        esac

The STRIP envar and the dkms.conf's STRIP array are two different things (got a little confused there). I get it now.

Well, if you don't like the approach of using the same logic as before (checking if the module should be stripped):

        if [[ ${strip[$count]} != no ]] && [[ ${CC} == "clang" ]]; then
            llvm-strip -g "$built_module"
        elif [[ ${strip[$count]} != no ]]; then
            strip -g "$built_module"
        fi

I can think of something else, but really I don't find it deemed necessary. The result would be the same, if the STRIP of dkms.conf is set "yes" the modules will be stripped, if set to "no", no stripping, and if empty the modules will be stripped, all paths leads to the right strip command being used and the LLVM=1 STRIP being honored no matter what.

Now, if I understood all wrong and by "honoring the STRIP set by LLVM=1" you meant that if clang is being used, automatically we will use LLVM=1and inherently STRIP=llvm-strip will be set, meaning that the modules should be stripped by llvm-strip no matter what, I don't think it's a good idea, having control of when to strip modules in dkms.conf is better. (basically the way it's right now).

Right now there's some issues when trying to compile open-gpu-kernel-modules against a kernel compiled using clang+Thin/Full LTO.
Also the script will pick the wrong version of the strip command, is also necessary to set the OBJCOPY environment variable.

Fixes dkms-project#416.
@ltsdw ltsdw force-pushed the fix-lto-building branch from 69fb5e4 to 51b7cd3 Compare June 7, 2024 03:34
@evelikov
Copy link
Collaborator

evelikov commented Jul 5, 2024

Sorry for the radio silence - I've been offline for the last few weeks. Will have a look later today/tomorrow and if there's only minor nits, will amend and push.

@ptr1337
Copy link

ptr1337 commented Aug 4, 2024

@evelikov Would be cool, if you could review this in the near future.
560 Driver will use the open modules as default and this fixed an outstanding issue with them.

@ltsdw
Copy link
Contributor Author

ltsdw commented Sep 11, 2024

@evelikov friendly ping 😅

@ptr1337
Copy link

ptr1337 commented Sep 11, 2024

Were using this now too on CachyOS as default, mainly due the nvidia-open-dkms issue. Has been used for 1 months and no issues reported.

@evelikov
Copy link
Collaborator

Thanks for the poke, I was taking a short holiday from dkms. Checking through now, will merge later tonight/early tomorrow....

I'm thinking how we can easily test this. If anyone has patches that would be welcome.

@xuanruiqi
Copy link

This patch certainly works for me.
I think it should be fixed asap as lto-ed kernels are quite unusable with dkms without it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failing to compile nvidia-open-dkms while using a ThinLTO/Clang kernel

5 participants