
Build NVIDIA-Docker image with proper PyTorch version to run DARTS code #10

Closed
JiaweiZhuang opened this issue Oct 7, 2019 · 6 comments

@JiaweiZhuang (Contributor)

The best way to run the DARTS scripts in production is probably via nvidia-docker. It is important to freeze the environment, since the DARTS code requires PyTorch == 0.3.1 and torchvision == 0.2.0; newer PyTorch versions crash for various reasons.

The same container image can run on (see the sketch after this list):

  • GPU instances on AWS or GCP
  • @memanuel's in-house GPU server, if Docker can be configured properly
  • or even a k8s GPU cluster
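Since this is a plain Docker image, portability across those targets is just a matter of pushing it to a registry and pulling it on each host. A minimal sketch, assuming the image is eventually built and tagged darts-pytorch (as it is later in this thread) and a hypothetical registry account myuser:

docker tag darts-pytorch myuser/darts-pytorch:0.3.1  # "myuser" is a placeholder account
docker push myuser/darts-pytorch:0.3.1
# then, on any GPU host with NVIDIA-Docker set up:
docker pull myuser/darts-pytorch:0.3.1
docker run --gpus all -it --rm myuser/darts-pytorch:0.3.1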
@JiaweiZhuang JiaweiZhuang self-assigned this Oct 7, 2019
@dylanrandle dylanrandle added this to Implementation in DARTS Oct 7, 2019
@JiaweiZhuang (Contributor, Author)

Here are the complete steps to install NVIDIA-Docker on an Ubuntu 18.04 AWS p2.xlarge instance. The commands are a bit dense, but they can all be wrapped into a single shell script.

1. Install CUDA driver

Get the relatively new nvidia-430 driver from https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa:

sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt-get update
sudo apt-get install -y nvidia-driver-430 nvidia-modprobe
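If nvidia-smi fails right after the install (e.g. the new kernel module is not loaded yet), rebooting usually fixes it:

sudo reboot  # may be needed before the new driver module is picked up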

Test installation:

$ nvidia-smi
Mon Oct  7 18:43:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    59W / 149W |      0MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

2. Install the standard Docker

Follow https://docs.docker.com/install/linux/docker-ce/ubuntu/

sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

sudo groupadd docker  # may report the group already exists; that's fine
sudo usermod -aG docker $USER  # allow running docker without sudo; takes effect after re-login (or see the note below)
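To pick up the new group membership without logging out, starting a subshell with the group should also work:

newgrp docker  # opens a shell in which the docker group is already active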

Test installation:

$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

3. Install NVIDIA-Docker

Follow https://github.com/NVIDIA/nvidia-docker#quickstart

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Test installation (https://github.com/NVIDIA/nvidia-docker#usage):

$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
Mon Oct  7 18:46:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   52C    P0    69W / 149W |      0MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@JiaweiZhuang (Contributor, Author) commented Oct 7, 2019

The easiest way to get a PyTorch GPU image is from NVIDIA NGC. The image contains a lot of extras, including JupyterLab and TensorBoard (see the release notes). A sketch for exposing JupyterLab follows the pull/run commands below.

docker pull nvcr.io/nvidia/pytorch:19.09-py3
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:19.09-py3 
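To actually reach the bundled JupyterLab from the host, the port has to be published; a sketch, assuming Jupyter's default port 8888:

docker run --gpus all -it --rm -p 8888:8888 nvcr.io/nvidia/pytorch:19.09-py3
# inside the container:
jupyter lab --ip=0.0.0.0 --no-browser --allow-root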

Inside the container, try:

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
1.2.0a0+afb7a16 True

It is the latest 1.2.0 build; we need to roll back to 0.3.1.
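A sketch of the rollback inside the NGC image (the exact recipe is what ends up in the Dockerfile; the CUDA 9.0 wheel URL is the one settled on later in this thread):

# replace the preinstalled PyTorch with the pinned 0.3.1 build
pip uninstall -y torch torchvision
pip install http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl
pip install torchvision==0.2.0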

@JiaweiZhuang (Contributor, Author) commented Oct 7, 2019

Done via ce4ed92 and f3b3938

Follow the README in docker/install-nvidia-docker and docker/darts-pytorch-image (a build-and-run sketch follows the log below). Everyone should then be able to run the default script and get the expected result:

10/07 08:02:25 PM test 000 1.233736e-01 96.875000 100.000000
10/07 08:02:48 PM test 050 1.105459e-01 97.120095 99.959150
10/07 08:03:11 PM test 100 1.074739e-01 97.359733 99.948432
10/07 08:03:12 PM test_acc 97.369997
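For reference, the build-and-run flow is roughly as follows (a sketch; the authoritative steps are in the READMEs):

docker build -t darts-pytorch docker/darts-pytorch-image
docker run --gpus all -it --rm darts-pytorch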

@JiaweiZhuang JiaweiZhuang moved this from Implementation to Done in DARTS Oct 7, 2019
@JiaweiZhuang (Contributor, Author) commented Oct 7, 2019

@dylanrandle Here's how to run DARTS on the graphene data within the container (a sanity check follows the commands):

# on the host: get data and source code
mkdir data
wget https://capstone2019-google.s3.amazonaws.com/graphene_processed.nc -P ./data/
git clone https://github.com/capstone2019-neuralsearch/darts.git

# on the host: start the container with the current directory mounted
docker run --rm -it --gpus all -v $(pwd):/workdir/host_files darts-pytorch

# inside the container: run training
cd host_files
python3 darts/cnn/train_search.py --data ./data/ --dataset graphene
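A quick sanity check inside the container before training, to confirm the bind mount worked:

ls /workdir/host_files/data/graphene_processed.nc  # should exist if the -v mount succeeded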

@dylanrandle (Contributor)

This is absolutely awesome. Brilliant!

@JiaweiZhuang (Contributor, Author) commented Oct 8, 2019

I changed pytorch==0.3.1 (built with CUDA 8.0) to http://download.pytorch.org/whl/cu90/torch-0.3.1-cp36-cp36m-linux_x86_64.whl (built with CUDA 9.0); otherwise the DARTS script crashes on newer GPU types such as p3.2xlarge (V100):

/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:95: UserWarning:
    Found GPU0 Tesla V100-SXM2-16GB which requires CUDA_VERSION >= 9000 for
    optimal performance and fast startup time, but your PyTorch was compiled
    with CUDA_VERSION 8000. Please install the correct PyTorch binary
    using instructions from http://pytorch.org

  warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))

See 2172d1c
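After rebuilding the image, the quick check from earlier in this thread should report the pinned version (expected output is an assumption, on a p3.2xlarge):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# expected: 0.3.1 True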
