# Horovod Environment Setup

*Last edited: 2023-12-23*

- Horovod and TensorFlow installation on SDumont.
- This notebook assumes that Miniconda3 is already installed.

In [30]:
SCRA = ! SCRA=/scratch${HOME#/prj} && echo $SCRA
SCRA = SCRA[0]
%env SCRA {SCRA}

env: SCRA=/scratch/yyyy/xxxx
env: SPWD=/scratch/yyyy/xxxx/horov-mnist
env: DATA=/scratch/yyyy/xxxx/data/MNIST/raw
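
- The shell expansion `${HOME#/prj}` removes a leading `/prj` from `$HOME`, so the scratch directory mirrors the home directory under `/scratch`. A minimal Python sketch of the same mapping follows; the function name `scratch_from_home` is hypothetical and only illustrates the string operation.

```python
def scratch_from_home(home: str, prefix: str = "/prj", root: str = "/scratch") -> str:
    """Mimic the shell expression /scratch${HOME#/prj}."""
    # ${HOME#/prj} strips the prefix only when HOME starts with it;
    # otherwise HOME is used unchanged.
    rest = home[len(prefix):] if home.startswith(prefix) else home
    return root + rest


print(scratch_from_home("/prj/yyyy/xxxx"))  # /scratch/yyyy/xxxx
```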


In [3]:
%%writefile {SCRA}/tfh01.yml
name: tfh01
channels:
  - nvidia
  - bokeh
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - ipykernel
  - cudatoolkit=10.0
  - cupti=10.0
  - nccl=2.8
  - nvcc_linux-64=10.0
  - protobuf=3.8
  - libprotobuf=3.8
  - tensorboard=1.15
  - bokeh
  - ccache
  - mpi4py
  - nodejs
  - pip
  - pip:
    - tensorflow==1.15
    - tensorflow-gpu==1.15
    - tensorrt==8.5.3.1
prefix: /scratch/yyyy/xxxx/miniconda3/envs/tfh01
variables:
  LD_LIBRARY_PATH: "'$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/'"
  XLA_FLAGS: "'--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/'"

Overwriting /scratch/yyyy/xxxx/tfh01.yml


In [4]:
%%bash
source /scratch/yyyy/xxxx/miniconda3/bin/activate
conda env create -f /scratch/yyyy/xxxx/tfh01.yml --force

Channels:
 - nvidia
 - bokeh
 - pytorch
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Installing pip dependencies: ...working... Ran pip subprocess with arguments:
['/scratch/yyyy/xxxx/miniconda3/envs/tfh01/bin/python', '-m', 'pip', 'install', '-U', '-r', '/scratch/yyyy/xxxx/condaenv.7fz58p6p.requirements.txt', '--exists-action=b']
Pip subprocess output:
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting tensorflow==1.15 (from -r /scratch/yyyy/xxxx/condaenv.7fz58p6p.requirements.txt (line 1))
  Downloading tensorflow-1.15.0-cp37-cp37m-manylinux2010_x86_64.whl (412.3 MB)

- The file *.../miniconda3/envs/tfh01/lib/python3.7/shutil.py* must be edited: comment out the line *copystat(src, dst)* and put *pass* in its place, e.g. "pass # copystat(src, dst)".
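
- The edit can also be scripted. The sketch below is one possible way to do it, not part of the original procedure: the helper `disable_copystat` is hypothetical, it only touches lines that consist of exactly `copystat(src, dst)`, and it is shown on a sample string rather than on the real file. Keep a backup of `shutil.py` before applying anything like this.

```python
import re


def disable_copystat(source: str) -> str:
    """Replace bare `copystat(src, dst)` lines with `pass`, keeping indentation."""
    # Match only lines whose whole content is the bare call; calls with
    # extra arguments (e.g. follow_symlinks=...) are left untouched.
    pattern = re.compile(r"^([ \t]*)copystat\(src, dst\)[ \t]*$", re.MULTILINE)
    return pattern.sub(r"\1pass # copystat(src, dst)", source)


sample = (
    "    try:\n"
    "        copystat(src, dst)\n"
    "    except OSError:\n"
    "        copystat(src, dst, follow_symlinks=False)\n"
)
print(disable_copystat(sample))
```

To patch a real file, the same function could be applied to `Path(...).read_text()` and the result written back with `write_text()`.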

In [3]:
%%bash
source /scratch/yyyy/xxxx/miniconda3/bin/activate
conda activate tfh01
ipython kernel install --user --name tfh01

overwriting variable ['LD_LIBRARY_PATH']


In [2]:
%%bash
source /scratch/yyyy/xxxx/miniconda3/bin/activate
jupyter kernelspec list

0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.


Available kernels:
  tfh01      /prj/yyyy/xxxx/.local/share/jupyter/kernels/tfh01
  python3    /scratch/yyyy/xxxx/miniconda3/share/jupyter/kernels/python3


- The kernel directory is listed under "/prj", but the interpreter path inside its kernel.json points to "/scratch":

In [3]:
! cat /prj/yyyy/xxxx/.local/share/jupyter/kernels/tfh01/kernel.json

{
 "argv": [
  "/scratch/yyyy/xxxx/miniconda3/envs/tfh01/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "tfh01",
 "language": "python",
 "metadata": {
  "debugger": true
 }
}
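
- The same check can be done programmatically. The sketch below assumes only the kernel.json layout shown above; the helper name `kernel_interpreter` is hypothetical, and the JSON is inlined here instead of being read from disk.

```python
import json


def kernel_interpreter(kernel_json_text: str) -> str:
    """Return the interpreter path (argv[0]) from a kernel.json document."""
    spec = json.loads(kernel_json_text)
    return spec["argv"][0]


example = """
{
 "argv": [
  "/scratch/yyyy/xxxx/miniconda3/envs/tfh01/bin/python",
  "-m", "ipykernel_launcher", "-f", "{connection_file}"
 ],
 "display_name": "tfh01",
 "language": "python"
}
"""
path = kernel_interpreter(example)
# The kernel runs from the environment under /scratch, whatever directory
# the kernelspec itself was installed in.
print(path, path.startswith("/scratch/"))
```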

In [5]:
%%bash
BASE=/scratch/yyyy/xxxx/miniconda3
PYT=3.7
ENV=tfh01
source /scratch/yyyy/xxxx/miniconda3/bin/activate
conda activate $ENV
rm -f $BASE/envs/$ENV/lib/libnvinfer.so.7 $BASE/envs/$ENV/lib/libnvinfer_plugin.so.7
ln -s $BASE/envs/$ENV/lib/python$PYT/site-packages/tensorrt/libnvinfer.so.8 \
    $BASE/envs/$ENV/lib/libnvinfer.so.7
ln -s $BASE/envs/$ENV/lib/python$PYT/site-packages/tensorrt/libnvinfer_plugin.so.8 \
    $BASE/envs/$ENV/lib/libnvinfer_plugin.so.7

overwriting variable ['LD_LIBRARY_PATH']
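
- The cell above makes the TensorRT 8 libraries from the pip package available under the `.so.7` file names, presumably so that code looking for the version 7 names finds them. A rough Python equivalent of the `rm -f` plus `ln -s` pair is sketched below; `force_symlink` is a hypothetical helper, and the demonstration uses empty stand-in files in a temporary directory rather than the real libraries.

```python
import os
import tempfile
from pathlib import Path


def force_symlink(target: Path, alias: Path) -> None:
    """Equivalent of `rm -f alias && ln -s target alias`."""
    # is_symlink() also catches dangling links, which exists() would miss.
    if alias.is_symlink() or alias.exists():
        alias.unlink()
    alias.symlink_to(target)


# Demonstration with stand-in files only.
with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp)
    real = lib / "libnvinfer.so.8"
    real.write_bytes(b"")
    alias = lib / "libnvinfer.so.7"
    force_symlink(real, alias)
    force_symlink(real, alias)  # repeating the call is safe
    print(alias.is_symlink(), os.readlink(alias) == str(real))
```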


In [6]:
%%bash
source /scratch/yyyy/xxxx/miniconda3/bin/activate
conda activate tfh01
export HOROVOD_CUDA_HOME=/usr/local/cuda-10.0
export HOROVOD_NCCL_INCLUDE=/scratch/yyyy/xxxx/miniconda3/envs/tfh01/include
export HOROVOD_NCCL_LIB=/scratch/yyyy/xxxx/miniconda3/envs/tfh01/lib
export NCCL_INCLUDE_DIR=/scratch/yyyy/xxxx/miniconda3/envs/tfh01/include
export NCCL_LIBRARY=/scratch/yyyy/xxxx/miniconda3/envs/tfh01/lib
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_WITH_MPI=1
export HOROVOD_WITH_TENSORFLOW=1
#export HOROVOD_WITH_PYTORCH=1
#export HOROVOD_WITHOUT_TENSORFLOW=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_WITHOUT_MXNET=1
export HOROVOD_WITHOUT_GLOO=1
pip install --no-binary=horovod --no-cache-dir "horovod[tensorflow]"

overwriting variable ['LD_LIBRARY_PATH']


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting horovod[tensorflow]
  Downloading horovod-0.28.1.tar.gz (3.5 MB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting cloudpickle (from horovod[tensorflow])
  Downloading cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Building wheels for collected packages: horovod
  Building wheel for horovod (setup.py): started
  Building wheel for horovod (setup.py): still running...
  Building wheel for horovod (setup.py): finished with status 'done'
  Created wheel for horovod: filename=horovod-0.28.1-cp37-cp37m-linux_x86_64.whl size=10457374 sha256=3d6403c208ea02fefee3e20de1c0fea348395fcb9e1208daca0d516414d47e2c
  Stored in directory: /tmp/pip-ephem-wheel-cache-x9rxxwqa/wheels/75/bf/bf/1131c00d74352837272d3a176b5c32ed6

In [4]:
%%bash
source /scratch/yyyy/xxxx/miniconda3/bin/activate
conda activate tfh01
horovodrun --check-build

overwriting variable ['LD_LIBRARY_PATH']
2023-12-23 10:11:52.886324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-12-23 10:11:55.075080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-12-23 10:11:56.812352: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-12-23 10:11:58.570298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-12-23 10:12:00.310445: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2023-12-23 10:12:02.128233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0


Horovod v0.28.1:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [ ] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [ ] Gloo    
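
- The report can be checked automatically instead of by eye. The sketch below parses the `[X]` markers from text in the format shown above; `enabled_features` is a hypothetical helper, the report is inlined as a string, and the required set (TensorFlow, MPI, NCCL) is an assumption based on the build flags used in this notebook.

```python
import re


def enabled_features(report: str) -> set:
    """Collect the names marked [X] in `horovodrun --check-build` output."""
    # Sections are merged: a name such as MPI that appears under both
    # controllers and tensor operations is reported once.
    pattern = re.compile(r"^[ \t]*\[X\][ \t]+(\S+)", re.MULTILINE)
    return {m.group(1) for m in pattern.finditer(report)}


report = """
Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [ ] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [ ] Gloo
"""
missing = {"TensorFlow", "MPI", "NCCL"} - enabled_features(report)
print("missing:", sorted(missing))
```

In a live session the text could come from `subprocess.run(["horovodrun", "--check-build"], capture_output=True, text=True).stdout`.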


In [5]:
! conda env list

# conda environments:
#
                         /prj/yyyy/xxxx/miniconda3
base                  *  /scratch/yyyy/xxxx/miniconda3
tfh01                    /scratch/yyyy/xxxx/miniconda3/envs/tfh01



In [6]:
%%bash
export PYDEVD_DISABLE_FILE_VALIDATION=1
jupyter kernelspec list

Available kernels:
  tfh01      /prj/yyyy/xxxx/.local/share/jupyter/kernels/tfh01
  python3    /scratch/yyyy/xxxx/miniconda3/share/jupyter/kernels/python3
