
Build NVIDIA-Docker image with proper PyTorch version to run DARTS code #10

Closed
JiaweiZhuang opened this issue Oct 7, 2019 · 6 comments

@JiaweiZhuang (Contributor)

The best way to run the DARTS scripts in production is probably via nvidia-docker. It is important to freeze the environment, since the DARTS code requires PyTorch == 0.3.1 and torchvision == 0.2.0; newer PyTorch versions crash for various reasons.

The same container image can run on (see the sketch after this list):

  • GPU instances on AWS or GCP
  • @memanuel's in-house GPU server, if Docker can be configured properly
  • or even a k8s GPU cluster
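Since this is a plain Docker image, portability across those targets is just a matter of pushing it to a registry and pulling it on each host. A minimal sketch, assuming the image is eventually built and tagged darts-pytorch (as it is later in this thread) and a hypothetical registry account myuser:

docker tag darts-pytorch myuser/darts-pytorch:0.3.1  # "myuser" is a placeholder account
docker push myuser/darts-pytorch:0.3.1
# then, on any GPU host with NVIDIA-Docker set up:
docker pull myuser/darts-pytorch:0.3.1
docker run --gpus all -it --rm myuser/darts-pytorch:0.3.1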
@JiaweiZhuang JiaweiZhuang self-assigned this Oct 7, 2019
@dylanrandle dylanrandle added this to Implementation in DARTS Oct 7, 2019
@JiaweiZhuang (Contributor, Author)

Here are the complete steps to install NVIDIA-Docker on an Ubuntu 18.04 AWS p2.xlarge instance. The commands are a bit dense, but they can all be wrapped into a single shell script.

1. Install CUDA driver

Get the relatively new nvidia-430 driver from https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa:

sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install -y nvidia-driver-430 nvidia-modprobe
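If nvidia-smi fails right after the install (e.g. the new kernel module is not loaded yet), rebooting usually fixes it:

sudo reboot  # may be needed before the new driver module is picked up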

Test installation:

$ nvidia-smi
Mon Oct  7 18:43:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    59W / 149W |      0MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. Install the standard Docker

Follow https://docs.docker.com/install/linux/docker-ce/ubuntu/

sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

sudo groupadd docker  # may report the group already exists; that's fine
sudo usermod -aG docker $USER  # allow running docker without sudo; takes effect after re-login (or see the note below)
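To pick up the new group membership without logging out, starting a subshell with the group should also work:

newgrp docker  # opens a shell in which the docker group is already active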

Test installation:

$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

3. Install NVIDIA-Docker

Follow https://github.com/NVIDIA/nvidia-docker#quickstart

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Test installation (https://github.com/NVIDIA/nvidia-docker#usage):

$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
Mon Oct  7 18:46:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   52C    P0    69W / 149W |      0MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@JiaweiZhuang (Contributor, Author) commented Oct 7, 2019

The easiest way to get a PyTorch GPU image is from NVIDIA NGC. The image contains a lot of extras, including JupyterLab and TensorBoard (see the release notes). A sketch for exposing JupyterLab follows the pull/run commands below.

docker pull nvcr.io/nvidia/pytorch:19.09-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:19.09-py3 
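To actually reach the bundled JupyterLab from the host, the port has to be published; a sketch, assuming Jupyter's default port 8888:

docker run --gpus all -it --rm -p 8888:8888 nvcr.io/nvidia/pytorch:19.09-py3
# inside the container:
jupyter lab --ip=0.0.0.0 --no-browser --allow-root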

Inside the container, try:

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
1.2.0a0+afb7a16 True

It is the latest 1.2.0 build; we need to roll back to 0.3.1.
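A sketch of the rollback inside the NGC image (the exact recipe is what ends up in the Dockerfile; the CUDA 9.0 wheel URL is the one settled on later in this thread):

# replace the preinstalled PyTorch with the pinned 0.3.1 build
pip uninstall -y torch torchvision
pip install http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl
pip install torchvision==0.2.0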

@JiaweiZhuang (Contributor, Author) commented Oct 7, 2019

Done via ce4ed92 and f3b3938

Follow the README in docker/install-nvidia-docker and docker/darts-pytorch-image (a build-and-run sketch follows the log below). Everyone should then be able to run the default script and get the expected result:

10/07 08:02:25 PM test 000 1.233736e-01 96.875000 100.000000
10/07 08:02:48 PM test 050 1.105459e-01 97.120095 99.959150
10/07 08:03:11 PM test 100 1.074739e-01 97.359733 99.948432
10/07 08:03:12 PM test_acc 97.369997
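For reference, the build-and-run flow is roughly as follows (a sketch; the authoritative steps are in the READMEs):

docker build -t darts-pytorch docker/darts-pytorch-image
docker run --gpus all -it --rm darts-pytorch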

@JiaweiZhuang JiaweiZhuang moved this from Implementation to Done in DARTS Oct 7, 2019
@JiaweiZhuang (Contributor, Author) commented Oct 7, 2019

@dylanrandle Here's how to run DARTS on the graphene data within the container (a sanity check follows the commands):

# on the host: get data and source code
mkdir data
wget https://capstone2019-google.s3.amazonaws.com/graphene_processed.nc -P ./data/
git clone https://github.com/capstone2019-neuralsearch/darts.git

# on the host: start the container with the current directory mounted
docker run --rm -it --gpus all -v $(pwd):/workdir/host_files darts-pytorch

# inside the container: run training
cd host_files
python3 darts/cnn/train_search.py --data ./data/ --dataset graphene
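A quick sanity check inside the container before training, to confirm the bind mount worked:

ls /workdir/host_files/data/graphene_processed.nc  # should exist if the -v mount succeeded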

@dylanrandle (Contributor)

This is absolutely awesome. Brilliant!

@JiaweiZhuang (Contributor, Author) commented Oct 8, 2019

I changed pytorch==0.3.1 (built with CUDA 8.0) to http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl (built with CUDA 9.0); otherwise the DARTS script crashes on newer GPU types such as p3.2xlarge (V100):

/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:95: UserWarning:
    Found GPU0 Tesla V100-SXM2-16GB which requires CUDA_VERSION >= 9000 for
    optimal performance and fast startup time, but your PyTorch was compiled
    with CUDA_VERSION 8000. Please install the correct PyTorch binary
    using instructions from http://pytorch.org

  warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))

See 2172d1c
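After rebuilding the image, the quick check from earlier in this thread should report the pinned version (expected output is an assumption, on a p3.2xlarge):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# expected: 0.3.1 True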
