# Example: running Keras in parallel with Horovod

Known issues:
- All nodes must have the same user as the Jupyter notebook, because Horovod does not use a hostfile and prefixing the host name with a user does not work. An ATMOSPHERE user should be added to all the nodes with the corresponding .ssh configuration. This has been done manually.
- Jupyter does not set the LD_LIBRARY_PATH variable properly. This should be checked for the specific user. The jovyan user must be replaced by ATMOSPHERE, with the same ssh configuration.
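
The second issue can be checked from inside the notebook before launching anything. The following is a minimal sketch, assuming the two library directories used in the commands of this notebook (`/usr/lib` and `/usr/local/lib`) are the ones that matter; the helper name `missing_library_dirs` is hypothetical.

```python
import os


def missing_library_dirs(required, ld_library_path):
    """Return the required directories that are absent from an LD_LIBRARY_PATH value."""
    # Normalise entries so that "/usr/lib/" and "/usr/lib" compare equal.
    present = {
        os.path.normpath(entry)
        for entry in ld_library_path.split(":")
        if entry  # skip empty entries such as a trailing ":"
    }
    return [d for d in required if os.path.normpath(d) not in present]


# Directories taken from the LD_LIBRARY_PATH prefix used throughout this notebook.
required_dirs = ["/usr/lib", "/usr/local/lib"]

current_value = os.environ.get("LD_LIBRARY_PATH", "")
print("user:", os.environ.get("USER", "<unknown>"))
print("missing from LD_LIBRARY_PATH:", missing_library_dirs(required_dirs, current_value))
```

An empty list means both directories are already visible to the notebook kernel; otherwise the variable has to be set explicitly, as the commands below do.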

In [181]:
!cat /mnt/share/vol001/hostfile

root@clusterworker-0.mpicluster.default.svc.cluster.local
root@clusterworker-1.mpicluster.default.svc.cluster.local
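
Since `horovodrun` takes its hosts through `-H` rather than a hostfile, the entries above can be turned into an `-H` argument. This is only a sketch: it assumes one `user@host` entry per line, as in the file shown, and the helper name `hosts_argument` and the choice of one slot per host are illustrative.

```python
def hosts_argument(hostfile_text, slots=1, short_names=True):
    """Build a 'host:slots,host:slots' string from 'user@host' hostfile lines."""
    entries = []
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        # Drop an optional "user@" prefix; only the host part is kept.
        _user, _sep, host = line.rpartition("@")
        if short_names:
            host = host.split(".")[0]  # keep only the first DNS label
        entries.append("%s:%d" % (host, slots))
    return ",".join(entries)


hostfile_text = """\
root@clusterworker-0.mpicluster.default.svc.cluster.local
root@clusterworker-1.mpicluster.default.svc.cluster.local
"""

print(hosts_argument(hostfile_text))
```

With the hostfile above this yields the same `clusterworker-0:1,clusterworker-1:1` string that is passed to `-H` later in the notebook.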


First, let's test the system with an existing mpitest.c file.

In [182]:
!cat /mnt/share/vol001/mpitest.c

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int rank;
  char hostname[256];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  gethostname(hostname, 255);
  printf("Hello world, I am process number: %d on host %s\n", rank, hostname);
  MPI_Finalize();
  return 0;
}


In [183]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ mpicc -o /mnt/share/vol001/mpi41test /mnt/share/vol001/mpitest.c

In [184]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ mpiexec -np 4 --allow-run-as-root --hostfile /mnt/share/vol001/hostfile  /mnt/share/vol001/mpi41test

^C


Now, let's run a Keras model.

In [139]:
!cat /mnt/share/vol001/keras_mnist.py

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import math
import tensorflow as tf
import horovod.keras as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

batch_size = 128
num_classes = 10

# Horovod: adjust number of epochs based on number of GPUs.
epochs = int(math.ceil(12.0 / hvd.size()))

# Input image dimensions
img_rows, img_cols = 28, 28

# The data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train =
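
The script above divides a budget of 12 epochs by the number of Horovod processes. The arithmetic can be checked without Horovod; in this sketch the `size` argument stands in for `hvd.size()`, and the helper name `epochs_per_process` is hypothetical.

```python
import math


def epochs_per_process(total_epochs, size):
    """Same formula as in keras_mnist.py, with `size` standing in for hvd.size()."""
    return int(math.ceil(float(total_epochs) / size))


for size in (1, 2, 3, 4, 5):
    epochs = epochs_per_process(12, size)
    # epochs * size is the total number of epochs summed over all processes.
    print("size=%d -> epochs=%d (epochs x size = %d)" % (size, epochs, epochs * size))
```

When 12 is not divisible by the number of processes, rounding up means the combined epoch count slightly exceeds 12.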

### Running with 2 processes through Horovod

In [143]:
import time
start = time.time()

In [144]:
# To record a trace, add HOROVOD_TIMELINE=/mnt/share/vol001/timeline.json to the variables below
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ \
HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
HOROVOD_FUSION_THRESHOLD=134217728 \
 /opt/conda/bin/horovodrun -np 2 \
                           -H clusterworker-0:1,clusterworker-1:1 \
                           python3 /mnt/share/vol001/keras_mnist.py
            

[1,1]<stderr>:2019-05-28 10:24:46.204268: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[1,0]<stderr>:2019-05-28 10:24:46.205437: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[1,1]<stderr>:2019-05-28 10:24:46.424450: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1,1]<stderr>:2019-05-28 10:24:46.424905: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4c95c70 executing computations on platform CUDA. Devices:
[1,1]<stderr>:2019-05-28 10:24:46.424921: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0
[1,1]<stderr>:2019-05-28 10:24:46.426877: I 

In [145]:
end = time.time()
print(end - start)  # elapsed wall-clock time in seconds

139.70471811294556
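
The start and end timestamps above live in separate cells, so the measurement depends on running the three cells back to back. One alternative is to launch the command and time it in a single call. This is a sketch only: `timed_run` is a hypothetical helper, and the demonstration runs a trivial Python command rather than `horovodrun`.

```python
import os
import subprocess
import sys
import time


def timed_run(command, extra_env=None, timeout=None):
    """Run `command` (a list of arguments) and return (return code, elapsed seconds)."""
    env = dict(os.environ)
    env.update(extra_env or {})  # e.g. LD_LIBRARY_PATH, HOROVOD_* variables
    start = time.perf_counter()
    completed = subprocess.run(command, env=env, timeout=timeout)
    return completed.returncode, time.perf_counter() - start


# Demonstration with a command that is always available.
returncode, elapsed = timed_run([sys.executable, "-c", "pass"], timeout=60)
print("return code:", returncode)
print("elapsed seconds:", elapsed)

# The horovodrun call from the cell above would be passed the same way, for example:
# timed_run(
#     ["/opt/conda/bin/horovodrun", "-np", "2",
#      "-H", "clusterworker-0:1,clusterworker-1:1",
#      "python3", "/mnt/share/vol001/keras_mnist.py"],
#     extra_env={"LD_LIBRARY_PATH": "/usr/lib/:/usr/local/lib/"},
# )
```

The `timeout` argument also gives a way to stop a launch that hangs instead of interrupting the cell by hand.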


In [82]:
### Running with 1 process through Horovod

In [83]:
import time
start = time.time()

In [84]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ \
    /opt/conda/bin/horovodrun --verbose -np 1 \
    -H clusterworker-0:1 \
    python3 /mnt/share/vol001/keras_mnist.py

Filtering local host names.
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Launched horovodrun server.
Attempted to launch horovod task servers.
Waiting for the hosts to acknowledge.
Notified all the hosts that the registration is complete.
Waiting for hosts to perform host-to-host interface checking.
Host-to-host interface checking successful.
Interfaces on all the hosts were successfully checked.
mpirun --allow-run-as-root --tag-output -np 1 -H clusterworker-0:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib  -mca btl_tcp_if_include eth0,lo -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0,lo -x SSHSERVER_PORT -x MINICONDA_VERSION -x SSHSERVER_SERVICE_PORT -x SUDO_GID -x KUBERNETES_PORT -x KUBERNETES_SERVICE_PORT -x MOUNTPOINT -x SSHSERVER_PORT_22_TCP_ADDR -x SSHPUBKEY -x TF_NOTEBOOK_SERVICE_HOST -x LANGUAGE -x USER -x MPICLUSTER_SERVICE_HOST -x MPLBACKEND -x HOSTNAME -x SSHKEY -x SERVER -x SSHSERVER_PORT_22_

In [85]:
end = time.time()
print(end - start)  # elapsed wall-clock time in seconds

64.84776735305786


In [86]:
### Running with 2 processes through mpirun


In [174]:
import time
start = time.time()

In [175]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ HOROVOD_CYCLE_TIME=5 HOROVOD_FUSION_THRESHOLD=134217728 mpiexec -np 2 \
-H clusterworker-0:1,clusterworker-1:1 \
--mca btl_tcp_rcvbuf 0 --mca btl_tcp_sndbuf 0 \
--mca btl ^openib \
-x HOROVOD_FUSION_THRESHOLD -x HOROVOD_CYCLE_TIME \
--map-by socket \
-x LD_LIBRARY_PATH -x PATH \
python3 /mnt/share/vol001/keras_mnist.py
        
#   -mca pml ob1 \
#   -mca btl ^openib \
#   -bind-to none \
#   -map-by slot \
#   -x NCCL_SOCKET_IFNAME=^docker0 \
#   -x NCCL_DEBUG=INFO \        

2019-05-28 10:48:28.510767: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-05-28 10:48:28.510626: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-05-28 10:48:28.653787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-28 10:48:28.654236: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x44a69a0 executing computations on platform CUDA. Devices:
2019-05-28 10:48:28.654254: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla V100-PCIE-32GB, Compute Capability 7.0
2019-05-28 10:48:28.656149: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2099975000 Hz

In [176]:
end = time.time()
print(end - start)  # elapsed wall-clock time in seconds

100.8767318725586
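
The three wall-clock measurements can be put side by side. The sketch below only divides the numbers printed in this notebook by the single-process time. Note that each measurement includes launch and startup time, and that keras_mnist.py computes its epoch count from the number of processes, so the two-process runs are not configured with the same number of epochs per process as the single-process run.

```python
# Elapsed times in seconds, copied from the cell outputs in this notebook.
timings = {
    "horovodrun, 2 processes": 139.70471811294556,
    "horovodrun, 1 process": 64.84776735305786,
    "mpiexec, 2 processes": 100.8767318725586,
}

baseline = timings["horovodrun, 1 process"]

# Ratio > 1 means the run took longer than the single-process run.
ratios = {label: seconds / baseline for label, seconds in timings.items()}

for label, ratio in ratios.items():
    print("%-26s %7.1f s   %.2fx the 1-process time" % (label, timings[label], ratio))
```

In these measurements both two-process runs took longer than the single-process run, and the `mpiexec` launch was faster than the `horovodrun` launch.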


In [103]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ mpiexec -np 3 \
--allow-run-as-root \
--hostfile /mnt/share/vol001/hostfile /mnt/share/vol001/mpi41test

Hello world, I am process number: 0 on host clusterworker-0
Hello world, I am process number: 2 on host clusterworker-1
Hello world, I am process number: 1 on host clusterworker-0


In [113]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ mpiexec -np 3 \
--allow-run-as-root \
--hostfile /mnt/share/vol001/hostfile /mnt/share/vol001/pingpong
    

[0] running in clusterworker-0
[2] running in clusterworker-1
[1] running in clusterworker-0
The mean time between 0 and 1 is 0.000001 msize=0 iterations=100
The mean time between 0 and 2 is 0.000104 msize=0 iterations=100


In [130]:
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ mpiexec -np 2 \
--allow-run-as-root \
-H clusterworker-0,clusterworker-1 \
--hostfile /mnt/share/vol001/hostfile /mnt/share/vol001/pingpong 0    
    
!LD_LIBRARY_PATH=/usr/lib/:/usr/local/lib/ mpiexec -np 2 \
--allow-run-as-root \
-H clusterworker-0,clusterworker-1 \
--hostfile /mnt/share/vol001/hostfile /mnt/share/vol001/pingpong 10240   

[1] running in clusterworker-1
[0] running in clusterworker-0
The mean time between 0 and 1 is 0.000102 msize=0 iterations=100
[1] running in clusterworker-1
[0] running in clusterworker-0
The mean time between 0 and 1 is 0.000138 msize=10240 iterations=100
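
The two inter-node measurements allow a rough latency and bandwidth estimate with a simple linear model, time = latency + size / bandwidth. This is a sketch under assumptions the output does not confirm: that the reported times are in seconds, that `msize` is in bytes, and that two data points are enough for a meaningful fit. The source of `pingpong` is not shown here.

```python
# Measurements copied from the two-process pingpong runs above.
# Assumption: times are in seconds and message sizes are in bytes.
t_empty = 0.000102      # mean time for msize=0
t_payload = 0.000138    # mean time for msize=10240
payload_bytes = 10240

# With an empty message, the whole time is attributed to latency.
latency = t_empty

# The extra time for the larger message is attributed to data transfer.
bandwidth = payload_bytes / (t_payload - t_empty)

print("latency estimate: %.0f microseconds" % (latency * 1e6))
print("bandwidth estimate: %.0f MB/s" % (bandwidth / 1e6))
```

With only 100 iterations per point and a small payload, the bandwidth figure is an order-of-magnitude indication at best; larger message sizes would give a steadier estimate.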
