Failed dkms status or autoinstall returns code 0 instead of an error one #352

Open
C0rn3j opened this issue Oct 18, 2023 · 9 comments

@C0rn3j

C0rn3j commented Oct 18, 2023

Both of these commands return exit code 0; they should return a non-zero exit code, as they have errored.

# dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.

# dkms autoinstall
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.

This is on Arch Linux with dkms 3.0.11.

I skimmed the changelog for the latest release, 3.0.12, which I did not test, but it does not look like this issue was fixed there.
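
For reference, the exit status of the two commands above can be inspected through `$?` right after they run. A minimal sketch: `show_rc` is a hypothetical helper, not part of dkms, and the `sh -c` lines are stand-ins for the dkms commands so that the snippet runs on any machine.

```shell
# show_rc: run a command, let its output through, then report its exit status.
# Hypothetical helper for illustration; not part of dkms.
show_rc() {
    "$@"
    rc=$?
    echo "exit status: $rc"
    return "$rc"
}

# Stand-in for a command that prints an error but still exits 0
# (the behaviour reported in this issue).
show_rc sh -c 'echo "Error! Could not locate dkms.conf file." >&2; exit 0'

# Stand-in for the expected behaviour: same message, non-zero exit status.
show_rc sh -c 'echo "Error! Could not locate dkms.conf file." >&2; exit 4' || true
```

On an affected system, `show_rc dkms status` would be the real-world equivalent of the first call.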

@evelikov
Collaborator

Hello fellow Arch user. Can you share an idiot-proof, step-by-step reproducer?

Yes, I don't think we fixed anything like that with 3.0.12.

@C0rn3j
Author

C0rn3j commented Oct 19, 2023

I can't reproduce it with a fake module (same error, but return code 4), so I presume one condition is that the module already has to be installed, or some other weird stuff is going on.

I can reproduce it by breaking an existing nvidia module: pointing its source symlink at /dev/null.

[0] % cd /var/lib/dkms/nvidia/535.113.01

[0] % sudo rm -f source; sudo ln -sf /usr/src/nvidia-535.113.01 source

[0] % dkms status     
nvidia/535.113.01, 6.1.58-1-lts, x86_64: installed
nvidia/535.113.01, 6.5.7-arch1-1, x86_64: installed

[0] % sudo rm -f source; sudo ln -sf /dev/null source                 

[0] % dkms status                                    
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.113.01/source/dkms.conf does not exist.

[0] % 

@evelikov
Collaborator

evelikov commented Oct 19, 2023

The reproducer works for me. The error seems to be coming from the read_conf in module_status_built_extra().

All the other instances across the codebase are read_conf_or_die, and a lot of that code is over 10 years old. Off the top of my head, I cannot see a reason why we couldn't flip the final user to the "_or_die" variant.
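
The difference between the two call styles can be sketched as follows. This is a simplified illustration, not the dkms source: the function bodies, the `_sketch` names, and the callers are assumptions, and exit code 4 is borrowed from the fake-module run reported earlier in this thread.

```shell
# Simplified stand-in for a config reader: complains and returns non-zero
# when the file is missing. NOT the real dkms read_conf.
read_conf_sketch() {
    if [ ! -r "$1" ]; then
        echo "Error! Could not locate dkms.conf file." >&2
        echo "File: $1 does not exist." >&2
        return 1
    fi
}

# "_or_die" flavour: the same check, but a failure ends the (sub)shell.
read_conf_or_die_sketch() {
    read_conf_sketch "$1" || exit 4
}

# A caller that ignores the return value: the error is printed, yet the
# caller's own status is that of its last command, which is 0.
status_lenient() {
    read_conf_sketch "$1"
    echo "status done"
}

# A caller using the fatal flavour. The body runs in a subshell so that
# `exit` does not terminate the calling shell.
status_strict() (
    read_conf_or_die_sketch "$1"
    echo "status done"
)
```

With a missing file, `status_lenient` prints the error and still returns 0, which matches the symptom reported here, while `status_strict` stops at the failed read and returns 4.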

@anbe42 IIRC you recently silenced dkms status so it doesn't show deprecation warnings - aka the read_conf that I'm thinking of, plus you did quite a lot of work around autoinstall (thanks again).

Do you foresee any issues if we promote the error to being fatal?

@evelikov
Collaborator

@scaronni if you have any input, that would be highly appreciated as well. Thanks o/

@evelikov
Collaborator

Thinking about this a little more: autoinstall explicitly aims to soldier on, even when building/installing a specific module fails. So promoting the error to fatal goes in the opposite direction.

On the other hand, if dkms.conf is missing, then the module is catastrophically broken.
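
One middle ground between these two positions is to keep going but remember that something failed. A minimal sketch of that pattern, assuming a hypothetical `build_one` step; it is not how dkms autoinstall is implemented.

```shell
# Hypothetical per-module step: fails when the module's dkms.conf is missing.
build_one() {
    [ -f "$1/dkms.conf" ]
}

# Process every module directory given, continue past failures,
# and return non-zero at the end if any of them failed.
install_all() {
    failed=0
    for module_dir in "$@"; do
        if build_one "$module_dir"; then
            echo "ok: $module_dir"
        else
            echo "Error! $module_dir: dkms.conf is missing" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

The loop still visits every module, so one broken entry does not block the others, but the caller sees a non-zero status whenever at least one entry failed.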

@C0rn3j what did you do/what triggered the error on your end - was it manual tinkering, or something in the OS/packaging that caused it?

@C0rn3j
Author

C0rn3j commented Oct 20, 2023

I am not sure yet what triggered it. I just had a bunch of broken dkms builds on two machines for non-existent kernel and driver versions; I suspect some weird race condition prodded on by the kernel-modules-hook package.

@anbe42
Contributor

anbe42 commented Oct 20, 2023

Looks like we have two things to fix here:

  • recovery from an (externally) broken /var/lib/dkms, aka dkms fsck
  • error propagation in such a case (the bug reported here)

One way this broken state could have happened: some packaging removed /usr/src/$driver-$oldversion upon some upgrade without calling the corresponding dkms remove hook first ... This should not happen with Debian-packaged *-dkms modules, but I don't know what else is out there in the wild ...
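
A read-only check for this kind of breakage can be sketched in plain shell, assuming the `<tree>/<module>/<version>/source` layout visible in the error messages in this thread. `find_broken_sources` is a hypothetical helper, not an existing dkms command; it only reports and does not repair anything.

```shell
# List module/version entries whose `source` does not lead to a dkms.conf.
# Hypothetical helper; the tree layout is inferred from the paths in this issue.
find_broken_sources() {
    tree=${1:-/var/lib/dkms}
    found=0
    for version_dir in "$tree"/*/*/; do
        # Skip the unexpanded pattern (empty tree) and entries that have no
        # `source` at all; those are not checked in this sketch.
        [ -e "${version_dir}source" ] || [ -L "${version_dir}source" ] || continue
        if [ ! -f "${version_dir}source/dkms.conf" ]; then
            echo "broken: ${version_dir%/}"
            found=1
        fi
    done
    return "$found"
}
```

The function returns non-zero when it finds at least one broken entry, which covers both a dangling `source` link (the packaging scenario described above) and a link pointing at something that is not a module source directory.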

@evelikov
Collaborator

Indeed, splitting this in two makes sense. Recovery would be great, although since the base information (aka dkms.conf) is missing, I don't know what we can do here.

On the latter point, we already exit in all the other instances of a missing dkms.conf. So it's a choice between making those non-fatal and then fixing the almost-impossible-to-test error paths, or flipping the final one to fatal.

Browsing across the Arch packages:

  • kernel-modules-hook - touches only /usr/lib/modules, making and restoring backups
  • nvidia-dkms - the one that was likely removed
  • dkms itself has a separate hook/script, which does manual parsing/handling (akin to autoinstall), ensuring depmod is called only once per kernel, even if XXXs dkms modules are added/removed.

AFAICT autoinstall does not exist as far as Arch is concerned, although the extra script does call dkms status.

The pacman hook triggering the script is post-transaction for install, and pre-transaction for update/remove, so it cannot be the one causing the issue.

Considering there is no obvious way this can happen (in Arch and Debian) outside of user error (it's fine, I'm not trying to blame anyone here), I'm inclined to make it a fatal error. If it turns out there's some valid use-case, we can quickly revert it.

That said, let's leave this issue open for a while and see how things go.

This was referenced Oct 21, 2023
@C0rn3j
Author

C0rn3j commented Apr 1, 2024

# 3.0.12
[0] % sudo dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/550.54.14/source/dkms.conf does not exist.
# 3.0.13
[0] % sudo dkms status
nvidia/550.54.14: broken
Error! nvidia/550.54.14: Missing the module source directory or the symbolic link pointing to it.
Manual intervention is required!
nvidia/550.67, 6.6.23-1-lts, x86_64: installed
nvidia/550.67, 6.7.9-arch1-1, x86_64: installed (original_module exists) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!) (WARNING! Missing some built modules!)
nvidia/550.67, 6.8.2-arch2-1, x86_64: installed

Now with the new release, status goes through everything instead of instantly crashing, which will hopefully make this a bit nicer to debug...
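
With per-module reporting, broken entries can be picked out of the output mechanically. A minimal sketch, assuming the `<module>/<version>: broken` line format shown in the 3.0.13 output above stays stable; `list_broken` is a hypothetical helper that reads status output on stdin.

```shell
# Print the module/version of every entry reported as broken.
# Reads `dkms status`-style output on stdin; hypothetical helper.
list_broken() {
    sed -n 's/^\([^:,]*\): broken$/\1/p'
}

# Sample input copied from the 3.0.13 output in this issue.
list_broken <<'EOF'
nvidia/550.54.14: broken
nvidia/550.67, 6.6.23-1-lts, x86_64: installed
nvidia/550.67, 6.8.2-arch2-1, x86_64: installed
EOF
```

For the sample input this prints `nvidia/550.54.14`. On a live system the input would come from `dkms status`, assuming the status lines are written to standard output.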

Still haven't found out why this happens, but it does keep happening.
I have only just installed .13, so all of this was created with .12:

image
