Integrate the NVIDIA container toolkit #52

Open
ehfd opened this issue Nov 18, 2023 · 28 comments · May be fixed by #84
Labels
help wanted (Extra attention is needed), wontfix (This will not be worked on)

Comments

@ehfd commented Nov 18, 2023

This is an issue that has been spun off from the Discord channel.

@Murazaki : It might be good to find a better workflow for providing drivers to Wolf.
On Debian, drivers are pretty old in the main stable repo; updated ones can be found in the CUDA repos, but they do not exactly match the manual-installer versions.

@ABeltramo : I guess I should go back to look into the Nvidia Docker Toolkit for people that would like to use that
I agree though, it's a bit of a pain point at the moment

@Murazaki : Cuda drivers repo :
https://developer.download.nvidia.com/compute/cuda/repos/

Linux manual installer :
https://download.nvidia.com/XFree86/Linux-x86_64/

Right now, the latest version in the CUDA packaged installs is 545.23.08.
It doesn't exist as a manual installer.
That breaks the Dockerfile and renders Wolf unusable.
I wanted to make a Docker image that installs from Debian packages, but that relies on apt-add-repository, which pulls in a bunch of supplementary packages.
Here it is for Debian Bookworm:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=12&target_type=deb_network

More thorough installation steps here :
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

@juliosueiras : There is one problem though: the NVIDIA container toolkit doesn't inject the driver into the container and still requires a driver installed in the container image itself.


And here, I'll start with the interventions I have made with NVIDIA over the last three years so that running Wayland inside the NVIDIA container toolkit does not require installing the NVIDIA drivers in the container.

What the NVIDIA container toolkit does is pretty simple: it injects (1) kernel device nodes and (2) userspace libraries into a container. Together, (1) and (2) make up a subset of the driver.

(1) kernel devices: /dev/nvidiaN, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools. In addition, /dev/dri/cardX and /dev/dri/renderDY, where N, X, and Y depend on the GPU the container toolkit provisions. The /dev/dri devices were added with NVIDIA/libnvidia-container#118.
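
A quick way to verify the device side is to list the injected nodes from a throwaway container (a rough sketch; the ubuntu image is just an example, and --gpus all assumes the container toolkit is already configured as a Docker runtime):

# sh expands the globs inside the container, so this shows exactly which nodes were injected
docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu sh -c 'ls -l /dev/nvidia* /dev/dri'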

(2) userspace libraries:
OpenGL libraries including EGL: '/usr/lib/libGL.so.1', '/usr/lib/libEGL.so.1', '/usr/lib/libGLESv1_CM.so.525.78.01', '/usr/lib/libGLESv2.so.525.78.01', '/usr/lib/libEGL_nvidia.so.0', '/usr/lib/libOpenGL.so.0', '/usr/lib/libGLX.so.0', and '/usr/lib/libGLdispatch.so.0', '/usr/lib/libnvidia-tls.so.525.78.01'

Vulkan libraries: '/usr/lib/libGLX_nvidia.so.0' and the configuration '/etc/vulkan/icd.d/nvidia_icd.json'

EGLStreams-Wayland and GBM-Wayland libraries: '/usr/lib/libnvidia-egl-wayland.so.1' and the config '/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json' '/usr/lib/libnvidia-egl-gbm.so.1' and the config '/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json'

NVENC libraries: /usr/lib/libnvidia-encode.so.525.78.01, which depends on /usr/lib/libnvcuvid.so.525.78.01, which depends on /usr/lib/x86_64-linux-gnu/libcuda.so.1

VDPAU libraries: /usr/lib/vdpau/libvdpau_nvidia.so.525.78.01
NVFBC libraries: /usr/lib/libnvidia-fbc.so.525.78.01
OPTIX libraries: /usr/lib/libnvoptix.so.1

Not very relevant but of note, perhaps for XWayland: NVIDIA X.Org driver: /usr/lib/xorg/modules/drivers/nvidia_drv.so, NVIDIA X.org GLX driver: /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01

In many cases, things don't work because the configuration files below are absent inside the container. Without them, applications inside the container don't know which vendor library to load (what each file does is self-explanatory):

The contents of /usr/share/glvnd/egl_vendor.d/10_nvidia.json:

{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}

The contents of /etc/vulkan/icd.d/nvidia_icd.json (note that api_version is variable based on the Driver version):

{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.205"
    }
}

The contents of /etc/OpenCL/vendors/nvidia.icd:

libnvidia-opencl.so.1

The contents of /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json:

{
        "file_format_version" : "1.0.0",
        "ICD" : {
                "library_path" : "libnvidia-egl-gbm.so.1"
        }
}

The contents of /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json:

{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libnvidia-egl-wayland.so.1"
    }
}

I'm pretty sure that by now (this was different a few months ago) the newest NVIDIA container toolkit provisions all of the required libraries plus the JSON configurations for Wayland (not for X11, but that doesn't matter here).
If only the JSON configurations are absent, it's trivial to add the above templates manually, as in the sketch below.
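
For example, a minimal sketch to recreate the glvnd EGL vendor file when it is missing (same template as above; the target path is the common default and may differ per distribution):

# Recreate /usr/share/glvnd/egl_vendor.d/10_nvidia.json from the template above
mkdir -p /usr/share/glvnd/egl_vendor.d
cat > /usr/share/glvnd/egl_vendor.d/10_nvidia.json <<'EOF'
{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
EOF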

About GStreamer:
https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/3108

Now, it is correct that NVENC does require CUDA. But that doesn't mean it requires the whole CUDA Toolkit (which is separate from the CUDA drivers). The CUDA drivers are the following four libraries installed with the display drivers, independent of the CUDA Toolkit: libcuda.so, libnvidia-ptxjitcompiler.so, libnvidia-nvvm.so, and libcudadebugger.so

These versions go with the display drivers, and are all injected into the container by the NVIDIA container toolkit.

GStreamer 1.22 and earlier in nvcodec requires just two files from the CUDA Toolkit: libnvrtc.so and libnvrtc-builtins.so. These can be installed from the network repository, as in the current approach, or extracted from a PyPI package:

# Extract NVRTC dependency, https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/LICENSE.txt
cd /tmp && \
    curl -fsSL -o nvidia_cuda_nvrtc_linux_x86_64.whl "https://developer.download.nvidia.com/compute/redist/nvidia-cuda-nvrtc/nvidia_cuda_nvrtc-11.0.221-cp36-cp36m-linux_x86_64.whl" && \
    unzip -joq -d ./nvrtc nvidia_cuda_nvrtc_linux_x86_64.whl && cd nvrtc && \
    chmod 755 libnvrtc* && \
    find . -maxdepth 1 -type f -name "*libnvrtc.so.*" -exec sh -c 'ln -snf $(basename {}) libnvrtc.so' \; && \
    mv -f libnvrtc* /opt/gstreamer/lib/x86_64-linux-gnu/ && \
    cd /tmp && rm -rf /tmp/*

One thing to note here is that libnvrtc.so is not minor-version compatible with CUDA. Thus, it will error on any display driver older than the driver corresponding to its CUDA version. However, backwards compatibility always works, so it is a good idea to use the oldest libnvrtc.so version that suffices.

Display driver - CUDA version
545 - 12.3
535 - 12.2
530 - 12.1
525 - 12.0
520 - 11.8
515 - 11.7
(and so on...)

https://docs.nvidia.com/deploy/cuda-compatibility/

So, I have moderate to high confidence that if you guys try the newest NVIDIA container toolkit again, you won't need to install the drivers, assuming that you ensure the json files are present or written.

Environment variables that currently work for me:

RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
# Expose NVIDIA libraries and paths
ENV PATH /usr/local/nvidia/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu:/usr/lib/i386-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
# Make all NVIDIA GPUs visible by default
ENV NVIDIA_VISIBLE_DEVICES all
# All NVIDIA driver capabilities should preferably be used, check `NVIDIA_DRIVER_CAPABILITIES` inside the container if things do not work
ENV NVIDIA_DRIVER_CAPABILITIES all
# Disable VSYNC for NVIDIA GPUs
ENV __GL_SYNC_TO_VBLANK 0
@ABeltramo (Member) commented:

TL;DR: as of the latest NVIDIA Container Toolkit (1.14.3-1), this is unfortunately still not possible.

What's the issue?

With the latest versions I can run both Wolf and the GStreamer pipeline just by running the container with --gpus=all; unfortunately, for some apps this still misses some important libraries.
Gamescope seems to require the following additional libraries that aren't provided by the toolkit:

libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1

libnvidia-vulkan-producer.so

gbm/nvidia-drm_gbm.so

The last one seems to just be a symlink to libnvidia-allocator.so.1, which is already present, so that might be fine.

Now, this is running from an X11 host, and I can see that those additional libraries aren't present on my host system:

ls -la /usr/lib/x86_64-linux-gnu/libnv
libnvcuvid.so@                         libnvidia-container.so.1@              libnvidia-glvkspirv.so.530.30.02       libnvidia-opticalflow.so.1@
libnvcuvid.so.1@                       libnvidia-container.so.1.14.3*         libnvidia-ml.so.1@                     libnvidia-ptxjitcompiler.so.1@
libnvidia-allocator.so.1@              libnvidia-eglcore.so.530.30.02         libnvidia-ngx.so.1@                    libnvidia-rtcore.so.530.30.02
libnvidia-cfg.so.1@                    libnvidia-encode.so.1@                 libnvidia-ngx.so.530.30.02             libnvidia-tls.so.530.30.02
libnvidia-compiler.so.530.30.02        libnvidia-fbc.so.1@                    libnvidia-nvvm.so@                     libnvidia-wayland-client.so.530.30.02
libnvidia-container-go.so.1@           libnvidia-glcore.so.530.30.02          libnvidia-nvvm.so.4@                   libnvoptix.so.1@
libnvidia-container-go.so.1.14.3       libnvidia-glsi.so.530.30.02            libnvidia-opencl.so.1@

Can anyone confirm the output of

docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all  -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu ls /usr/lib/x86_64-linux-gnu/libnvidia*
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1	   /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.530.30.02     /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1		   /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.530.30.02       /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.530.30.02  /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.530.30.02  /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.530.30.02   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1		       /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1		   /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1		       /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1		   /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.530.30.02        /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.530.30.02

on an NVIDIA Wayland host?

What can we do better?

I think we should keep manually downloading and linking the drivers like we are doing at the moment. We should probably add a proper check for a mismatch between the downloaded drivers and the host-installed drivers, either on startup of the containers (somewhere in the base-app) or on startup of Wolf.
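
A rough sketch of what such a mismatch check could look like (the bundled-driver path /usr/local/nvidia/lib64 is a hypothetical example and would need to match wherever the downloaded drivers are actually mounted):

# Compare the host driver version with the version of the bundled userspace libraries
HOST_VER=$(cat /sys/module/nvidia/version 2>/dev/null)
BUNDLED_VER=$(ls /usr/local/nvidia/lib64/libnvidia-eglcore.so.* 2>/dev/null | head -n1 | sed 's/.*\.so\.//')
if [ -n "$HOST_VER" ] && [ -n "$BUNDLED_VER" ] && [ "$HOST_VER" != "$BUNDLED_VER" ]; then
    echo "NVIDIA driver mismatch: host=$HOST_VER, bundled=$BUNDLED_VER" >&2
fi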

@ehfd (Author) commented Jan 9, 2024

@ABeltramo One comment I have here is that there isn't really a concept of an "Xorg host" and a "Wayland host". It depends on what the desktop environment and login manager use, and by default, all drivers bundle both sets of libraries.

We will discuss more in NVIDIA/libnvidia-container#118.

ABeltramo added the "help wanted" and "wontfix" labels on Jan 24, 2024
@ohayak commented Feb 12, 2024

Hi,
I've been working on a project using Unreal Engine Pixel Streaming to stream games at scale with Kubernetes. Packaging Docker images with the right NVIDIA drivers is a pain. I invite you to take a look at the work done by Adam. He used the nvidia/cuda image as a base to create a set of images with various configurations:
22.04-vulkan: Ubuntu 22.04 + OpenGL + Vulkan + PulseAudio Client + PulseAudio Server
22.04-cudagl11: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 + PulseAudio Client + PulseAudio Server
22.04-cudagl12: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 + PulseAudio Client + PulseAudio Server
22.04-vulkan-noaudio: Ubuntu 22.04 + OpenGL + Vulkan (no audio support)
22.04-cudagl11-noaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 (no audio support)
22.04-cudagl12-noaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 (no audio support)
22.04-vulkan-hostaudio: Ubuntu 22.04 + OpenGL + Vulkan + PulseAudio Client (uses host PulseAudio Server)
22.04-cudagl11-hostaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 + PulseAudio Client (uses host PulseAudio Server)
22.04-cudagl12-hostaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 + PulseAudio Client (uses host PulseAudio Server)
22.04-vulkan-x11: Ubuntu 22.04 + OpenGL + Vulkan + PulseAudio Client (uses host PulseAudio Server) + X11
22.04-cudagl11-x11: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 + PulseAudio Client (uses host PulseAudio Server) + X11
22.04-cudagl12-x11: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 + PulseAudio Client (uses host PulseAudio Server) + X11
I think you should check the Dockerfiles.

@Murazaki (Contributor) commented:

Hi, I've been working on a project using Unreal Engine Pixel Streaming to stream games at scale with kubernetes. Packaging docker images with the right nvidia docker drivers is a pain. I invite you to take a look at the work done by Adam He used nvidia/cuda image as base to create a set of images that with various configurations
[...]
I think you should check the docker files

Could you provide a link to that/those Dockerfiles, maybe?

@Murazaki (Contributor) commented Feb 15, 2024

Oh sorry, I actually know what they're talking about: it's adamrehn/ue4-runtime. I think I wrote an answer and forgot to send it ^^

https://github.com/adamrehn/ue4-runtime

ABeltramo reopened this on Feb 15, 2024
@ABeltramo (Member) commented:

If those are the images that @ohayak was talking about, unfortunately there's nothing there that can help us.
As I explained in a comment above, the NVIDIA toolkit does not mount all the libraries that are needed in order to spawn our custom Wayland compositor. There are only a few possible solutions that I can think of:

  • Ask users to install the dependencies in a volume that will be mounted and shared with the apps (the current solution in Wolf)
  • Ask users to manually install the libraries if missing from the host and then fill out all the correct mounts
  • Try to file a patch upstream with the NVIDIA container toolkit project and hope that it gets merged

I'm very open to suggestions or alternative solutions!

@ehfd (Author) commented Feb 25, 2024

modprobe -r nvidia_drm ; modprobe nvidia_drm modeset=1

Our experience was related to the above kernel module.
Also, we have reinstalled the OS and installed everything from scratch, then things started to work again.
One more possibility is the lack of DKMS (required to build relevant NVIDIA kernel modules) or a kernel version upgrade without rebuilding the NVIDIA modules.
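
If modeset=1 turns out to be the fix, it can be made persistent on the host instead of re-running modprobe after every boot (a sketch for Debian/Ubuntu-style hosts):

echo "options nvidia-drm modeset=1" | sudo tee /etc/modprobe.d/nvidia-drm-modeset.conf
sudo update-initramfs -u   # rebuild the initramfs so the option is applied at boot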

@ehfd (Author) commented Mar 28, 2024

The core issue is whether the EGL Wayland library is installed or not, likely not the container toolkit.

This is available if you use the .run file.
https://download.nvidia.com/XFree86/Linux-x86_64/550.67/README/installedcomponents.html

I don't think the Debian/Ubuntu PPA repositories install the Wayland components automatically.
If that's the case, the following packages need to be installed. This is sufficient and covers all the missing Wayland files:

libnvidia-egl-gbm1
libnvidia-egl-wayland1
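
On a Debian/Ubuntu host that uses the packaged drivers, installing those two packages would look roughly like this (a sketch; package names as listed above):

sudo apt-get update
sudo apt-get install -y libnvidia-egl-wayland1 libnvidia-egl-gbm1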

But the simplest solution is to install the driver via the .run file.

https://community.kde.org/Plasma/Wayland/Nvidia

@tux-rampage commented:

Hi,

I am currently trying to get this flying with CRI-O and nvidia-ctk by using the runtime and CDI config. AFAIR this can be used in Docker as well.
So far all configs and libs are injected/mounted into the container as far as I can see.
I can double check for the reported libs here later.

Currently I'm facing the vblank resource unavailable issue. (Driver version 550.something; I can't look it up at the moment.)

@tux-rampage commented Apr 5, 2024

Here are some short steps for what I've done so far (Nvidia Driver version 550.54.14):

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk runtime configure --runtime=crio

For Docker, I guess it's enough to use --runtime=docker

I used the nvidia runtime when starting the container(s) and set the following environment variables:

NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_VISIBLE_DEVICES="nvidia.com/gpu=all"

As documented, this enables the CDI integration which will mount the host libs and binaries.
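
For Docker, the rough equivalent (a sketch, assuming nvidia-ctk has already generated the CDI spec and configured the Docker runtime as above) would be:

docker run --rm -it --runtime=nvidia \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES="nvidia.com/gpu=all" \
  ubuntu sh -c 'ls /usr/lib/x86_64-linux-gnu/libnvidia*'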

What is working so far:

  • vkgears (directly)
  • vkgears (gamescope)

What is not working:

  • glxgears (gamescope): Failed to load driver zink, blank surface, successful render output on STDOUT
  • steam (gamescope): Failed to load driver zink, vblank sync failures, shm failures (seem to be WebKit sandbox related), possibly NVIDIA 545.29.06 and Gamescope issue #60

@tux-rampage commented:

[...]

Can anyone confirm the output of

docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all  -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu ls /usr/lib/x86_64-linux-gnu/libnvidia*

on an NVIDIA Wayland host?

ls -1 /usr/lib/x86_64-linux-gnu/libnvidia*
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.550.54.14

It seems that libnvidia-egl-wayland.so.1 and libnvidia-vulkan-producer.so are missing. Driver version is 550.54.14.

@tux-rampage commented Apr 9, 2024

libnvidia-egl-wayland seems to be provided by the package libnvidia-egl-wayland1.
libnvidia-vulkan-producer.so seems to have been dropped from the driver. I need to confirm by building the driver volume for the current release.

Edit: Is it possible that libnvidia-egl-wayland1 is not part of NVIDIA's driver but an OSS component?

Edit 2: yes, libnvidia-vulkan-producer was removed recently: https://www.nvidia.com/Download/driverResults.aspx/214102/en-us/

Removed libnvidia-vulkan-producer.so from the driver package. This helper library is no longer needed by the Wayland WSI.

@ehfd (Author) commented Apr 10, 2024

Those libraries are for the EGLStreams backend. I believe compositors have now stopped supporting them.

https://github.com/NVIDIA/egl-wayland

@tux-rampage commented:

Maybe. The latest NVIDIA driver comes with a GBM backend, so maybe that's something useful. I'll give GBM_BACKEND=nvidia-gbm a try. Maybe this will help: https://download.nvidia.com/XFree86/Linux-x86_64/510.39.01/README/gbm.html

From my attempt yesterday evening, glxgears is running without any error messages so far, but the output in Moonlight stays black.

@ehfd (Author) commented May 10, 2024

I think NVIDIA Container Toolkit 1.15.0 (released not long ago) fixes most of the problems.

Please check it out.

I am trying to fix the remaining issues with NVIDIA/nvidia-container-toolkit#490. Please give feedback.

I've written about the situation in more detail in NVIDIA/nvidia-container-toolkit#490 (comment).

Within the scope of Wolf, the libnvidia-egl-wayland1 APT package will install the EGLStreams interface (if it can be used instead of GBM), and libnvidia-egl-gbm is installed with the Graphics Drivers PPA. Both are installed by the .run installer, and the above PR will also inject libnvidia-egl-wayland.

@tux-rampage commented:

I've finally managed to run Steam and Cyberpunk with Wolf using the NVIDIA container toolkit instead of the drivers image.
As @ehfd mentioned, the symlink nvidia-drm_gbm.so is missing, so I had to create it manually before running gamescope/steam:

mkdir -p /usr/lib/x86_64-linux-gnu/gbm;
ln -sv ../libnvidia-allocator.so.1 /usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so;

After that, launching Steam and the game was successful. This all works without using libnvidia-egl-wayland.

@ABeltramo (Member) commented:

That sounds really good, thanks for reporting back!
We could easily add that to the images; my only concern is that this only works with a specific version of the toolkit. We should probably add a check on startup for the required libraries and print a proper error message if they are missing.
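
A rough sketch of what such a startup check could look like (library names taken from the list earlier in this thread; the path is the usual Debian/Ubuntu multiarch location):

for lib in libnvidia-egl-gbm.so.1 libnvidia-egl-wayland.so.1 gbm/nvidia-drm_gbm.so; do
    if [ ! -e "/usr/lib/x86_64-linux-gnu/$lib" ]; then
        echo "WARNING: $lib is missing; the NVIDIA container toolkit may be too old or misconfigured" >&2
    fi
done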

@ehfd (Author) commented Jun 2, 2024

If someone has knowledge of Go, could they contribute fixes for the unsolved aspects of the PR, NVIDIA/nvidia-container-toolkit#490?

I will give write access to https://github.com/ehfd/nvidia-container-toolkit/tree/main to anyone who asks, in order to keep it in one PR.

@tux-rampage commented:

If someone has knowledge of Go, could they contribute fixes for the unsolved aspects of the PR for NVIDIA/nvidia-container-toolkit#490?

I will give write access to https://github.com/ehfd/nvidia-container-toolkit/tree/main if they ask in order to keep it into one PR.

I can take a look at it later

@tux-rampage commented:

@ehfd Thanks for trusting me with access to your fork. I have addressed the pending issues on the code side and requested some feedback.

@ehfd (Author) commented Jun 5, 2024

Thanks for trusting me with access to your fork. I have addressed the pending issues on the code side and requested some feedback.

No sweat. You know Go better than I do, and it seems like you did a great job at it.

@kayakyakr commented:

Congrats on getting this merged! This is going to substantially simplify getting the drivers up and running.

@ABeltramo (Member) commented:

I've tried upgrading GStreamer to 1.24.5; unfortunately it's now failing to use CUDA with:

0:00:00.102290171   172 0x560d70ab1190 ERROR              cudanvrtc gstcudanvrtc.cpp:165:gst_cuda_nvrtc_load_library_once: Failed to load 'nvrtcGetCUBINSize', 'nvrtcGetCUBINSize': /usr/local/nvidia/lib/libnvrtc.so: undefined symbol: nvrtcGetCUBINSize

It looks like nvrtcGetCUBINSize was recently added, so I guess there's a mismatch between libnvrtc.so and what GStreamer expects. It seems this was added in this commit, which references @ehfd's issue.
I could successfully run this GStreamer version on my host, which has CUDA 12 installed. I guess this will break compatibility with older cards, so I'm going to revert GStreamer in Wolf for now.

ABeltramo added a commit that referenced this issue Jun 29, 2024
@ehfd (Author) commented Jun 29, 2024

@ABeltramo This should not be an issue as long as the NVRTC CUDA version is kept at around 11.3; yes, 11.0 will not work.

@ABeltramo (Member) commented:

Thanks for the very quick reply! What's the compatibility matrix for NVRTC? Would upgrading to 11.3 still work for older CUDA installations?

@ehfd (Author) commented Jun 29, 2024

Mostly the internal ABI of NVIDIA.
CUBIN was designed so that there can be forward compatibility for the NVRTC files, but since CUBIN itself didn't exist in older versions, there's a problem here.

You can probably fix the issue yourself in GStreamer and then backport it to GStreamer 1.24.6 with simple error handling in the C code. Probably an #ifdef can work.

@ABeltramo (Member) commented:

Thanks, I've got enough on my plate already. This is definitely lower priority compared to the rest; looks like we are going to stay on 1.22.7 for a bit longer.

ABeltramo linked a pull request on Jul 2, 2024 that will close this issue
@ehfd (Author) commented Jul 9, 2024

The pull request was technically merged today because it was reverted.

NVIDIA/nvidia-container-toolkit#548
