Merge pull request #17 from games-on-whales/zb140/nvidia-xorg
install both nvidia libraries needed by xorg
zb140 committed Jun 27, 2021
2 parents c49d277 + eb33fce commit 00793be
Showing 2 changed files with 52 additions and 79 deletions.
32 changes: 19 additions & 13 deletions README.md
@@ -73,7 +73,7 @@ environment:
To get the correct UUID for your GPU, use the `nvidia-container-cli` command:
```console
$ sudo nvidia-container-cli --load-kmods info
-NVRM version: 465.27
+NVRM version: [version]
CUDA version: 11.3

Device Index: 0
@@ -87,25 +87,31 @@ Architecture: 7.5

##### Xorg drivers

-Because Nvidia does not officially support running Xorg inside a container with their Container Toolkit, it does not automatically provide you with the `nvidia_drv.so` driver module that Xorg requires. The preferred method for making it available inside the container is to map it in from the host as a bind volume. This ensures it is always the correct version. Find the module on your host, then add a volume mapping like this to your `docker run` command:
-```console
---volume /path/to/nvidia_drv.so:/nvidia/xorg/nvidia_drv.so:ro
+Although the NVIDIA Container Toolkit automatically provides most of the drivers needed to use the GPU inside a container, Xorg is _not_ officially supported, so the runtime will not automatically map in the specific drivers Xorg needs.
+
+Xorg needs two libraries: `nvidia_drv.so` and `libglxserver_nvidia.so.[version]`. The preferred approach is to bind-mount them from the host, since that guarantees the versions inside the container exactly match those on the host. Locate the two modules, then add a section like this to the `xorg` service in your `docker-compose.yml`:
+```yaml
+volumes:
+  - /path/to/nvidia_drv.so:/nvidia/xorg/nvidia_drv.so:ro
+  - /path/to/libglxserver_nvidia.so.[version]:/nvidia/xorg/libglxserver_nvidia.so:ro
```

+Be sure to replace `[version]` with the driver version reported by the `nvidia-container-cli` command above (the `NVRM version` line).
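
For example, with host driver version 465.27 (the `NVRM version` shown in the sample output above) and the Ubuntu 20.04 location listed below, the second mapping would look like this (a sketch; substitute your own version and path):
```yaml
volumes:
  - /usr/lib/x86_64-linux-gnu/nvidia/xorg/libglxserver_nvidia.so.465.27:/nvidia/xorg/libglxserver_nvidia.so:ro
```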

Some common locations for `nvidia_drv.so` include:
-* /usr/lib64/xorg/modules/drivers/nvidia_drv.so (Unraid)
-* /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so (Ubuntu 20.04)
+* `/usr/lib64/xorg/modules/drivers/` (Unraid)
+* `/usr/lib/x86_64-linux-gnu/nvidia/xorg/` (Ubuntu 20.04)

-If you don't want to do this, or if you can't find the driver on your host for some reason, the container will attempt to install the correct version for you automatically. However, there are some drawbacks: first, it can take a long time, and second, there is no guarantee that it will be able to find a version that exactly matches the driver version on your host.
+Some common locations for `libglxserver_nvidia.so.[version]` include:
+* `/usr/lib64/xorg/modules/extensions/` (Unraid)
+* `/usr/lib/x86_64-linux-gnu/nvidia/xorg/` (Ubuntu 20.04)
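
If neither location matches your system, a filesystem search on the host (a sketch; adjust the search root to taste) can track down both modules:
```console
$ sudo find / -name 'nvidia_drv.so' -o -name 'libglxserver_nvidia.so.*' 2>/dev/null
```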

-If the automatic option is working for you and you want to speed up future launches of the container, you can provide a persistent volume for it to cache some of the setup work, using a mapping like this:
-```console
---volume ~/dr-cache:/var/cache/dummy
-```
+If you don't want to do this, or if you can't find the drivers on your host for some reason, the container will attempt to install the correct versions for you automatically. However, there is no guarantee that it will find a version that exactly matches the driver version on your host.

If for some reason you want to skip the entire process and just assume the driver is already installed, you can do that too:
-```console
---env SKIP_NVIDIA_DRIVER_CHECK=1
+```yaml
+environment:
+  SKIP_NVIDIA_DRIVER_CHECK: 1
```

## Troubleshooting
99 changes: 33 additions & 66 deletions images/xorg/scripts/ensure-nvidia-xorg-driver.sh
@@ -1,7 +1,5 @@
#!/bin/bash

-DUMMY_PACKAGE_CACHE=/var/cache/dummy

NVIDIA_DRIVER_MOUNT_LOCATION=/nvidia/xorg
NVIDIA_PACKAGE_LOCATION=/usr/lib/x86_64-linux-gnu/nvidia/xorg

@@ -35,10 +33,15 @@ done
HOST_DRIVER_VERSION=$(cat /proc/driver/nvidia/version | sed -nE 's/.*Module[ \t]+([0-9]+(\.[0-9]+)?).*/\1/p')
HOST_DRIVER_MAJOR_VERSION=$(echo "$HOST_DRIVER_VERSION" | sed -E 's/\..+//')
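# (illustrative: /proc/driver/nvidia/version contains a line like
#  "NVRM version: NVIDIA UNIX x86_64 Kernel Module  465.27  ...", from which
#  the first sed pulls out "465.27" and the second reduces it to "465")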

PACKAGE_NAME="xserver-xorg-video-nvidia-$HOST_DRIVER_MAJOR_VERSION"
XORG_PACKAGE_NAME="xserver-xorg-video-nvidia-$HOST_DRIVER_MAJOR_VERSION"
GL_PACKAGE_NAME="libnvidia-gl-$HOST_DRIVER_MAJOR_VERSION"

+# ensure the package info is up to date so we have the best chance of finding a
+# matching driver
+apt-get update &>/dev/null

MAJOR_PACKAGE_APT_VERSIONS=$( \
apt-cache madison "$PACKAGE_NAME" \
apt-cache madison "$XORG_PACKAGE_NAME" \
| awk '{ print $3 }' \
| sort -rV
)
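# (for reference: apt-cache madison prints rows like "pkg | version | source";
#  awk's $3 picks the version column and sort -rV orders candidates newest-first)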
@@ -53,74 +56,38 @@ if [ -z "$PACKAGE_APT_VERSION" ]; then
fail "Failed to locate a package with the same driver version ($HOST_DRIVER_VERSION)"
fi

-mkdir -p $DUMMY_PACKAGE_CACHE
-cd $DUMMY_PACKAGE_CACHE
+# tell dpkg to install the given file somewhere else so it doesn't try to
+# overwrite a bind-mounted file.
+function create_a_diversion() {
+    local mounted=$1

-DUMMY_NAME=nvidia-dummy
-DUMMY_VERSION=${HOST_DRIVER_VERSION}
-DUMMY_FILE=${DUMMY_NAME}_${DUMMY_VERSION}_all.deb
+    dir=$(dirname "$mounted")
+    file=$(basename "$mounted")

-__ticks=0
-function tick() {
-    __ticks=$((__ticks+1))
-    echo -ne "\rWorking: " >&3
-    printf '.%.0s' $(seq 1 $__ticks) >&3
-    if [ "${1:-}" = "last" ]; then
-        echo -ne "\n" >&3
-    fi
-}
+    diverted_dir="$dir/distro"

-function build_dummy() {
-    echo "Telling APT about the host driver (this may take a while)"
-    (
-        # exit the subshell early if any of the commands fail.
-        set -e; tick
-
-        # Create a `control` file for use by equivs to build the dummy package.
-        # We do this manually instead of using equivs-build because it's easier
-        # than editing in the custom values later.
-        cat << CONTROL >${DUMMY_NAME}.control
-Section: misc
-Priority: optional
-Standards-Version: 3.9.2
-Package: ${DUMMY_NAME}
-Version: ${DUMMY_VERSION}
-Provides: libnvidia-cfg1-${HOST_DRIVER_MAJOR_VERSION} (= ${PACKAGE_APT_VERSION})
-Description: Placeholder for nvidia-docker provided libs
- Since nvidia-docker provides most of the required drivers, this package tells APT about the current version for dependency purposes.
-CONTROL
-        tick
-
-        # Install equivs
-        apt-get update; tick
-        apt-get -qqy --no-install-recommends install equivs; tick
-
-        # Build the dummy package
-        equivs-build ${DUMMY_NAME}.control; tick
-        rm ${DUMMY_NAME}.control; tick
-
-        # Clean up all the extra junk we don't need anymore.
-        apt-get -qqy remove equivs; tick
-        apt-get -qqy remove --autoremove; tick last
-    ) 3>&1 &>/dev/null
+    # make sure the diverted location exists, or dpkg will fail when trying to
+    # write to it.
+    mkdir -p "$diverted_dir"
+
+    diverted="$diverted_dir/$file"
+
+    # echo "Diverting $mounted => $diverted"
+    dpkg-divert --no-rename --divert "$diverted" "$mounted" &>/dev/null
}
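# (for illustration: diverting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.465.27
#  tells dpkg to unpack any packaged copy of that file into the distro/
#  subdirectory next to it rather than on top of the bind-mounted original;
#  --no-rename leaves the existing file where it is)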

-# If there's already a dummy package with the appropriate version, just use it
-# instead of rebuilding.
-if [ -f "$DUMMY_PACKAGE_CACHE/${DUMMY_FILE}" ]; then
-    echo "Telling APT about the host driver (cached)"
-else
-    if ! build_dummy; then
-        fail "Could not generate dependencies"
-    fi
-fi
+# for each of the driver files nvidia-docker mounts in for us, tell dpkg not to
+# overwrite them when installing packages.
+for a in $(mount | grep "\.so\.$HOST_DRIVER_VERSION" | cut -f 3 -d ' '); do
+    create_a_diversion "$a"
+done
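# (mount output lines look like "<device> on <mountpoint> type <fs> (<options>)",
#  so cutting field 3 yields the mount point of every file nvidia-docker
#  bind-mounted with a name ending in .so.$HOST_DRIVER_VERSION)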

-if dpkg -i ${DUMMY_FILE} &>/dev/null; then
-    echo -n "Installing Nvidia X driver ($PACKAGE_APT_VERSION)..."
-    apt-get install -qqy --no-install-recommends "$PACKAGE_NAME=$PACKAGE_APT_VERSION" &>/dev/null
-    echo "done."
-else
+echo -n "Installing Nvidia X driver ($PACKAGE_APT_VERSION)..."
+apt-get install -qqy --no-install-recommends "$XORG_PACKAGE_NAME=$PACKAGE_APT_VERSION" "$GL_PACKAGE_NAME=$PACKAGE_APT_VERSION" &>/dev/null
+if [ $? -ne 0 ]; then
    echo "error!"
    fail "The Nvidia X driver could not be automatically installed."
+else
+    echo "done."
fi

