Merge pull request #17 from games-on-whales/zb140/nvidia-xorg
install both nvidia libraries needed by xorg
zb140 committed Jun 27, 2021
2 parents c49d277 + eb33fce commit 00793be
Showing 2 changed files with 52 additions and 79 deletions.
32 changes: 19 additions & 13 deletions README.md
@@ -73,7 +73,7 @@ environment:
To get the correct UUID for your GPU, use the `nvidia-container-cli` command:
```console
$ sudo nvidia-container-cli --load-kmods info
-NVRM version: 465.27
+NVRM version: [version]
CUDA version: 11.3

Device Index: 0
@@ -87,25 +87,31 @@ Architecture: 7.5

##### Xorg drivers

-Because Nvidia does not officially support running Xorg inside a container with their Container Toolkit, it does not automatically provide you with the `nvidia_drv.so` driver module that Xorg requires. The preferred method for making it available inside the container is to map it in from the host as a bind volume. This ensures it is always the correct version. Find the module on your host, then add a volume mapping like this to your `docker run` command:
-```console
---volume /path/to/nvidia_drv.so:/nvidia/xorg/nvidia_drv.so:ro
+Although the NVIDIA Container Toolkit automatically provides most of the drivers needed to use the GPU inside a container, Xorg is _not_ officially supported, so the runtime will not automatically map in the specific drivers Xorg needs.
+
+Xorg needs two libraries: `nvidia_drv.so` and `libglxserver_nvidia.so.[version]`. The preferred approach is to bind-mount them from the host, since that guarantees the versions inside the container exactly match those on the host. Locate the two modules, then add a section like this to the `xorg` service in your `docker-compose.yml`:
+```yaml
+volumes:
+  - /path/to/nvidia_drv.so:/nvidia/xorg/nvidia_drv.so:ro
+  - /path/to/libglxserver_nvidia.so.[version]:/nvidia/xorg/libglxserver_nvidia.so:ro
```

+Be sure to replace `[version]` with the driver version reported by the `nvidia-container-cli` command above (the `NVRM version` line).
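
For example, with host driver version 465.27 (the `NVRM version` shown in the sample output above) and the Ubuntu 20.04 location listed below, the second mapping would look like this (a sketch; substitute your own version and path):
```yaml
volumes:
  - /usr/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.465.27:/nvidia/xorg/libglxserver_nvidia.so:ro
```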

Some common locations for `nvidia_drv.so` include:
-* /usr/lib64/xorg/modules/drivers/nvidia_drv.so (Unraid)
-* /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (Ubuntu 20.04)
+* `/usr/lib64/xorg/modules/drivers/` (Unraid)
+* `/usr/lib/x86_64-linux-gnu/nvidia/xorg/` (Ubuntu 20.04)

-If you don't want to do this, or if you can't find the driver on your host for some reason, the container will attempt to install the correct version for you automatically. However, there are some drawbacks: first, it can take a long time, and second, there is no guarantee that it will be able to find a version that exactly matches the driver version on your host.
+Some common locations for `libglxserver_nvidia.so.[version]` include:
+* `/usr/lib64/xorg/modules/extensions/` (Unraid)
+* `/usr/lib/x86_64-linux-gnu/nvidia/xorg/` (Ubuntu 20.04)
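
If neither location matches your system, a filesystem search on the host (a sketch; adjust the search root to taste) can track down both modules:
```console
$ sudo find / -name 'nvidia_drv.so' -o -name 'libglxserver_nvidia.so.*' 2>/dev/null
```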

-If the automatic option is working for you and you want to speed up future launches of the container, you can provide a persistent volume for it to cache some of the setup work, using a mapping like this:
-```console
---volume ~/dr-cache:/var/cache/dummy
-```
+If you don't want to do this, or if you can't find the drivers on your host for some reason, the container will attempt to install the correct versions for you automatically. However, there is no guarantee that it will find a version that exactly matches the driver version on your host.

If for some reason you want to skip the entire process and just assume the driver is already installed, you can do that too:
-```console
---env SKIP_NVIDIA_DRIVER_CHECK=1
+```yaml
+environment:
+  SKIP_NVIDIA_DRIVER_CHECK: 1
```

## Troubleshooting
99 changes: 33 additions & 66 deletions images/xorg/scripts/ensure-nvidia-xorg-driver.sh
@@ -1,7 +1,5 @@
#!/bin/bash

-DUMMY_PACKAGE_CACHE=/var/cache/dummy

NVIDIA_DRIVER_MOUNT_LOCATION=/nvidia/xorg
NVIDIA_PACKAGE_LOCATION=/usr/lib/x86_64-linux-gnu/nvidia/xorg

@@ -35,10 +33,15 @@ done
HOST_DRIVER_VERSION=$(cat /proc/driver/nvidia/version | sed -nE 's/.*Module[ \t]+([0-9]+(\.[0-9]+)?).*/\1/p')
HOST_DRIVER_MAJOR_VERSION=$(echo "$HOST_DRIVER_VERSION" | sed -E 's/\..+//')
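# (illustrative: /proc/driver/nvidia/version contains a line like
#  "NVRM version: NVIDIA UNIX x86_64 Kernel Module  465.27  ...", from which
#  the first sed pulls out "465.27" and the second reduces it to "465")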

PACKAGE_NAME="xserver-xorg-video-nvidia-$HOST_DRIVER_MAJOR_VERSION"
XORG_PACKAGE_NAME="xserver-xorg-video-nvidia-$HOST_DRIVER_MAJOR_VERSION"
GL_PACKAGE_NAME="libnvidia-gl-$HOST_DRIVER_MAJOR_VERSION"

+# ensure the package info is up to date so we have the best chance of finding a
+# matching driver
+apt-get update &>/dev/null

MAJOR_PACKAGE_APT_VERSIONS=$( \
apt-cache madison "$PACKAGE_NAME" \
apt-cache madison "$XORG_PACKAGE_NAME" \
| awk '{ print $3 }' \
| sort -rV
)
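# (for reference: apt-cache madison prints rows like "pkg | version | source";
#  awk's $3 picks the version column and sort -rV orders candidates newest-first)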
@@ -53,74 +56,38 @@ if [ -z "$PACKAGE_APT_VERSION" ]; then
fail "Failed to locate a package with the same driver version ($HOST_DRIVER_VERSION)"
fi

-mkdir -p $DUMMY_PACKAGE_CACHE
-cd $DUMMY_PACKAGE_CACHE
+# tell dpkg to install the given file somewhere else so it doesn't try to
+# overwrite a bind-mounted file.
+function create_a_diversion() {
+    local mounted=$1

-DUMMY_NAME=nvidia-dummy
-DUMMY_VERSION=${HOST_DRIVER_VERSION}
-DUMMY_FILE=${DUMMY_NAME}_${DUMMY_VERSION}_all.deb
+    dir=$(dirname "$mounted")
+    file=$(basename "$mounted")

-__ticks=0
-function tick() {
-    __ticks=$((__ticks+1))
-    echo -ne "\rWorking: " >&3
-    printf '.%.0s' $(seq 1 $__ticks) >&3
-    if [ "${1:-}" = "last" ]; then
-        echo -ne "\n" >&3
-    fi
-}
+    diverted_dir="$dir/distro"

-function build_dummy() {
-    echo "Telling APT about the host driver (this may take a while)"
-    (
-        # exit the subshell early if any of the commands fail.
-        set -e; tick
-
-        # Create a `control` file for use by equivs to build the dummy package.
-        # We do this manually instead of using equivs-build because it's easier
-        # than editing in the custom values later.
-        cat << CONTROL >${DUMMY_NAME}.control
-Section: misc
-Priority: optional
-Standards-Version: 3.9.2
-Package: ${DUMMY_NAME}
-Version: ${DUMMY_VERSION}
-Provides: libnvidia-cfg1-${HOST_DRIVER_MAJOR_VERSION} (= ${PACKAGE_APT_VERSION})
-Description: Placeholder for nvidia-docker provided libs
- Since nvidia-docker provides most of the required drivers, this package tells APT about the current version for dependency purposes.
-CONTROL
-        tick
-
-        # Install equivs
-        apt-get update; tick
-        apt-get -qqy --no-install-recommends install equivs; tick
-
-        # Build the dummy package
-        equivs-build ${DUMMY_NAME}.control; tick
-        rm ${DUMMY_NAME}.control; tick
-
-        # Clean up all the extra junk we don't need anymore.
-        apt-get -qqy remove equivs; tick
-        apt-get -qqy remove --autoremove; tick last
-    ) 3>&1 &>/dev/null
+    # make sure the diverted location exists, or dpkg will fail when trying to
+    # write to it.
+    mkdir -p "$diverted_dir"
+
+    diverted="$diverted_dir/$file"
+
+    # echo "Diverting $mounted => $diverted"
+    dpkg-divert --no-rename --divert "$diverted" "$mounted" &>/dev/null
}
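# (for illustration: diverting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.465.27
#  tells dpkg to unpack any packaged copy of that file into the distro/
#  subdirectory next to it rather than on top of the bind-mounted original;
#  --no-rename leaves the existing file where it is)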

-# If there's already a dummy package with the appropriate version, just use it
-# instead of rebuilding.
-if [ -f "$DUMMY_PACKAGE_CACHE/${DUMMY_FILE}" ]; then
-    echo "Telling APT about the host driver (cached)"
-else
-    if ! build_dummy; then
-        fail "Could not generate dependencies"
-    fi
-fi
+# for each of the driver files nvidia-docker mounts in for us, tell dpkg not to
+# overwrite them when installing packages.
+for a in $(mount | grep "\.so\.$HOST_DRIVER_VERSION" | cut -f 3 -d ' '); do
+    create_a_diversion "$a"
+done
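# (mount output lines look like "<device> on <mountpoint> type <fs> (<options>)",
#  so cutting field 3 yields the mount point of every file nvidia-docker
#  bind-mounted with a name ending in .so.$HOST_DRIVER_VERSION)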

-if dpkg -i ${DUMMY_FILE} &>/dev/null; then
-    echo -n "Installing Nvidia X driver ($PACKAGE_APT_VERSION)..."
-    apt-get install -qqy --no-install-recommends "$PACKAGE_NAME=$PACKAGE_APT_VERSION" &>/dev/null
-    echo "done."
-else
+echo -n "Installing Nvidia X driver ($PACKAGE_APT_VERSION)..."
+apt-get install -qqy --no-install-recommends "$XORG_PACKAGE_NAME=$PACKAGE_APT_VERSION" "$GL_PACKAGE_NAME=$PACKAGE_APT_VERSION" &>/dev/null
+if [ $? -ne 0 ]; then
    echo "error!"
    fail "The Nvidia X driver could not be automatically installed."
+else
+    echo "done."
fi

