Skip to content

Latest commit

 

History

History
50 lines (30 loc) · 3.46 KB

cuda.md

File metadata and controls

50 lines (30 loc) · 3.46 KB

Building CUDA-enabled Images for DataHub/DSMLP

In this branch we will cover the starting steps of creating a GPU accelerated Docker Image for DSMLP. It's recommended to follow the steps in the "master" branch before continuing.

Note: our standard scipy-ml-notebook has CUDA installed with a compatible PyTorch and Tensorflow for use on DSMLP. Please use that image if that's all you require since installing a working CUDA environment is time consuming.

DSMLP: Available GPUs

As of Fall 2020, there are 15 GPU nodes on the DSMLP cluster available for classroom use, and each node has 8 NVIDIA GPUs installed. These GPUs are dynamically assigned to a container on start-up when requested and will stay attached until that container is deleted, meaning a GPU will remain occupied even if it's actually not running anything.

The graphics driver will be installed automatically to the container on start-up. The current driver version is 418.88. Because of this, the latest CUDA Tookit that is supported on DSMLP is version 10.1, according to NVIDIA.

GPU Model VRAM Amount Node
NVIDIA 1080 Ti 11GB 80 n01 through n12 except n09, n10
NVIDIA 2080 Ti 11GB 32 n18, n21, n22, n24
NVIDIA 1070 Ti 8GB 7 n10

Dockerfile: CUDA Tookit

Choosing the right version of CUDA is important because some legacy codebase relies on specific old versions of CUDA and their compatible software in order to run. We will use CUDA 10.1 for this example.

The following command let conda install CUDA Toolkit 11.2 along with a deep learning accelerator (cuDNN) and a device communication library (NCCL).

RUN conda install -y cudatoolkit=11.2 cudnn nccl

In the example Dockerfile, the above command is followed by conda clean --all -f -y, which cleans up the unnecessary cache. The two commands are executed sequentially with && in between in order to reduce overall size in that layer.

Dockerfile: Deep Learning Libraries

TensorFlow

There are two major versions of tensorflow APIs and they cannot coexist in the same environment. Look into the Dockerfile for the commands. Using tensorflow will get the latest 2.* version.

PyTorch

Installing PyTorch will require you to go on their website, select the appropriate specifications for the system and paste in the command. Remember to add --no-cache-dir after pip install to reduce image size.

Dockerfile: Additional Kernels

To install a new kernel that can be selected within a jupyter notebook, you can look into creating a second conda environment and use nb_conda_kernels to add it in.

Resources/Further Reading