Fix GPU Discovery. #4522

AndrewJGaut · 2023-08-26T23:29:17Z

Recently, the image we used to run nvidia-smi for GPU discovery was deprecated. To address this, a PR was made to update the image; however, that change broke GPU discovery because the nvidia-smi output changed (it now included a header).

This PR does the following:
(1) Fix GPU discovery so that GPU workers are able to run again.
(2) Use a smaller NVIDIA/CUDA image.
(3) Downgrade the CUDA version.

Regarding (2), the official NVIDIA Dockerhub page includes the following blurb at the time this PR was created:

Overview of Images
Three flavors of images are provided:

base: Includes the CUDA runtime (cudart)
runtime: Builds on the base and includes the [CUDA math libraries](https://developer.nvidia.com/gpu-accelerated-libraries), and [NCCL](https://developer.nvidia.com/nccl). A runtime image that also includes [cuDNN](https://developer.nvidia.com/cudnn) is available.
devel: Builds on the runtime and includes headers, development tools for building CUDA images. These images are particularly useful for multi-stage builds.

Thus, the base image is the smallest. Since we need only run nvidia-smi, which is supported by all three images, it is best to use the smallest possible image to minimize download and container startup times.

Regarding (3), at the time of this PR, the Stanford NLP machines use CUDA version 11.5. Using an image with CUDA 12.2 yields an error since the machines do not support that version of CUDA. Thus, I downgraded the version.

AndrewJGaut · 2023-08-26T23:33:12Z

I should note: we end up not needing any extra regex parsing (e.g. the kind used here) because the headers are no longer added. I believe this is because we aren't using a devel image, and as the Dockerhub page states, only those images include "headers" (which I think may be the copyright header we were getting when running nvidia-smi with the devel image).
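
For readers wondering what header-tolerant parsing could look like if it were ever needed again, here is a minimal sketch. It is not the code in codalab/worker/docker_utils.py. It assumes the listing comes from `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`, and the function name `parse_gpu_listing` and the sample banner text are hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "0, GPU-aaaaaaaa-1111-2222-3333-444444444444".
# Anything else (for example a banner printed by the image) is ignored.
_GPU_LINE = re.compile(r"^\s*(\d+)\s*,\s*(GPU-[0-9A-Fa-f-]+)\s*$")


def parse_gpu_listing(output: str) -> Dict[str, str]:
    """Map GPU index to GPU UUID, skipping lines that are not GPU rows.

    Hypothetical helper: assumes `output` is the text produced by
    `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`, possibly
    preceded by extra header lines.
    """
    gpus: Dict[str, str] = {}
    for line in output.splitlines():
        match = _GPU_LINE.match(line)
        if match:
            index, uuid = match.groups()
            gpus[index] = uuid
    return gpus
```

Because non-matching lines are skipped rather than treated as errors, the same function would handle output both with and without a leading banner.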

codalab/worker/docker_utils.py

Fix gpu output. Already tested on slurm

7ab01e7

Add more logging

4b5f0be

epicfaace reviewed Aug 28, 2023

codalab/worker/docker_utils.py

epicfaace approved these changes Aug 28, 2023

Merge branch 'master' into fix-gpu-discovery

48ea608

wwwjn merged commit 528db0f into master Aug 29, 2023
63 checks passed

wwwjn deleted the fix-gpu-discovery branch August 29, 2023 01:38

yifanmai mentioned this pull request Aug 29, 2023

Bump to version 1.7.1 #4525

Merged
