# Limitations of traditional distributed training

The standard distributed TensorFlow package runs with a parameter server approach to averaging gradients. In this approach, each process has one of two potential roles: a worker or a parameter server. Workers process the training data, compute gradients, and send them to parameter servers to be averaged.

Challenges of TensorFlow distributed training:

- __Identifying the right ratio of workers to parameter servers__. If one parameter server is used, it will likely become a networking or computational bottleneck. If multiple parameter servers are used, the communication pattern becomes “all-to-all”, which may saturate network interconnects.

- __Bandwidth is not optimal, so the network can become the bottleneck__. Network traffic grows with the number of model parameters, so this approach does not scale well to large neural networks.

- __Handling increased TensorFlow program complexity__. Users have to explicitly start each worker and parameter server, pass around service discovery information such as the hosts and ports of all the workers and parameter servers, and modify the training program to construct `tf.train.Server()` with an appropriate `tf.train.ClusterSpec()`. Additionally, users have to ensure that all operations are placed appropriately using `tf.train.replica_device_setter()` and modify the code to use towers to leverage multiple GPUs within a server. This often leads to a steep learning curve and a significant amount of code restructuring, taking time away from the actual modeling.

> Note: The complexity problem has been addressed by TF-operator, so end users don't need to worry about it.
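
To make the bandwidth concern above concrete, here is a back-of-the-envelope sketch of the traffic that flows through each parameter server per training step. It assumes every worker pushes a full gradient and pulls a full copy of the parameters each step, and that the model is sharded evenly across servers. The function name and the example sizes are illustrative and are not part of TensorFlow.

```python
def ps_traffic_per_server(num_workers, num_servers, model_bytes):
    """Bytes received and sent by ONE parameter server per training step.

    Assumptions (illustrative only): each worker pushes a full gradient and
    pulls the full parameters every step, and the model is sharded evenly
    across the parameter servers.
    """
    shard_bytes = model_bytes / num_servers
    received = num_workers * shard_bytes  # gradient shards pushed by all workers
    sent = num_workers * shard_bytes      # updated parameter shards pulled back
    connections = num_workers             # every worker talks to every server
    return received, sent, connections


if __name__ == "__main__":
    # Hypothetical model: 300 million float32 parameters (4 bytes each).
    model_bytes = 300e6 * 4
    for servers in (1, 4):
        rx, tx, conns = ps_traffic_per_server(8, servers, model_bytes)
        print(f"{servers} server(s): {rx / 1e9:.1f} GB in, "
              f"{tx / 1e9:.1f} GB out, {conns} connections per server")
```

Under these assumptions, adding servers reduces the load on each one, but the total number of worker-to-server connections grows as workers × servers, which is the "all-to-all" pattern described above.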


# Ring Allreduce
In early 2017, Baidu published an article, “Bringing HPC Techniques to Deep Learning,” evangelizing a different algorithm for averaging gradients and communicating those gradients to all nodes. 

In the `ring-allreduce` algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.

![allreduce](./images/allreducering.png)
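
The two phases described above can be sketched as a small single-process simulation. This is only an illustration of the communication pattern using plain Python lists, not Baidu's or Horovod's implementation; the function name `ring_allreduce` and the chunking scheme are assumptions made for the example.

```python
def ring_allreduce(buffers):
    """Simulate a sum ring-allreduce over equal-length lists, one per node."""
    n = len(buffers)
    length = len(buffers[0])
    # Chunk c covers indices bounds[c]:bounds[c + 1] of every buffer.
    bounds = [c * length // n for c in range(n + 1)]
    data = [list(b) for b in buffers]  # work on copies, leave inputs untouched

    def chunk(node, c):
        return data[node][bounds[c]:bounds[c + 1]]

    # First N-1 iterations: received values are ADDED to the local buffer.
    for step in range(n - 1):
        # Every node sends one chunk to its right-hand neighbour at once,
        # so snapshot all outgoing chunks before applying any of them.
        sends = [((node - step) % n, chunk(node, (node - step) % n))
                 for node in range(n)]
        for node in range(n):
            c, values = sends[(node - 1) % n]  # receive from the left neighbour
            for i, v in enumerate(values):
                data[node][bounds[c] + i] += v

    # Second N-1 iterations: received values REPLACE the local values.
    for step in range(n - 1):
        sends = [((node + 1 - step) % n, chunk(node, (node + 1 - step) % n))
                 for node in range(n)]
        for node in range(n):
            c, values = sends[(node - 1) % n]
            data[node][bounds[c]:bounds[c + 1]] = values

    return data


if __name__ == "__main__":
    grads = [[node + 1] * 6 for node in range(3)]  # node k holds [k + 1] * 6
    print(ring_allreduce(grads)[0])
```

Dividing each result by the number of nodes turns the sum into the average gradient. Each node sends 2*(N-1) chunks of roughly 1/N of the buffer, so the data sent per node stays close to twice the buffer size no matter how many nodes join the ring.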


# MPI
Users utilize a Message Passing Interface (MPI) implementation such as Open MPI to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an `allreduce()` operation.


# Horovod

The realization that a `ring-allreduce` approach can improve both __usability__ and __performance__ motivated Uber to work on its own implementation to address its TensorFlow needs. Uber adopted Baidu’s draft implementation of the TensorFlow ring-allreduce algorithm and built upon it. The result is Horovod.

Horovod now supports most of the popular deep learning frameworks, such as TensorFlow, PyTorch, and MXNet.
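
As a rough illustration of how few changes a training script needs, here is a minimal sketch using Horovod's Keras API for TensorFlow 2. The toy data, the model, and the `train` function name are hypothetical stand-ins, and this is not the script used by the job later in this notebook. It assumes `tensorflow`, `horovod`, and `numpy` are installed in the container image.

```python
def train(epochs=1):
    """Train a toy Keras model with Horovod; run one copy per GPU."""
    # Imported lazily so the file can be read or imported without the
    # GPU stack installed.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # set up communication between all launched processes

    # Pin each process to a single GPU, chosen by its rank on the local host.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Hypothetical toy data and model, stand-ins for a real training setup.
    x = np.random.rand(1024, 32).astype("float32")
    y = np.random.randint(0, 10, size=(1024,))
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Wrapping the optimizer is the key change: gradients are averaged
    # across all workers with allreduce before they are applied.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    # Start every worker from the same weights by broadcasting from rank 0.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=epochs, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)
```

A script that calls `train()` would typically be started once per GPU by an MPI launcher such as `mpirun` or Horovod's `horovodrun`.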


![allreduce-scale](./images/allreduce-scale.png)
The number of samples processed per second with a 300-million parameter language model scales linearly with the number of GPUs concurrently doing synchronous training.


## Examples

We will use MPI + Horovod + TensorFlow to train an image classification model in this notebook.

This job requires GPU nodes in the EKS cluster. You have three options to scale up the GPU nodes:

1. Open the EC2 console, find your GPU Auto Scaling group (ASG), and change its desired capacity.
2. Use the eksctl command `eksctl scale nodegroup --cluster=<your_eks_cluster_name> --nodes=3 <GPU_node_group_name>`.
3. Install a node autoscaler, which scales the EKS cluster up and down based on the resources requested (covered in a separate tutorial).

In [None]:
!cat distributed-training-jobs/distributed-mpi-job.yaml

### GPU Resources

`resources.limits.nvidia.com/gpu: 4` means every container will use 4 GPUs.
> Note: Make sure you use an instance type with at least 4 GPUs, such as p2.8xlarge or p3.8xlarge.

`replicas: 2` means we want 2 containers in total.

In total, 8 GPUs form a ring and perform distributed training.

### Prerequisite
1. Update default-editor roles

    There's an upstream issue where `default-editor` doesn't have permission to create mpijobs.

    Edit the ClusterRole as shown below to avoid this error: `User "system:serviceaccount:kubeflow:default-editor" cannot create resource "mpijobs" in API group "kubeflow.org" in the namespace "kubeflow"`

    ```shell
    kubectl edit clusterrole kubeflow-edit -n kubeflow
    ```

    Add the following rule to the ClusterRole's `rules` list.
    ```yaml
    - apiGroups:
      - kubeflow.org
      resources:
      - '*'
      verbs:
      - '*'
    ```

In [None]:
!kubectl create -f distributed-training-jobs/distributed-mpi-job.yaml

In [None]:
!kubectl get mpijob

In [None]:
!kubectl describe mpijob distributed-mpi-job

In [None]:
!kubectl get pod | grep distributed-mpi-job

### Check logs of Launcher Job

Every MPI job creates a launcher job that syncs with all the workers and persists their logs.

You will see logs like the following:


```
90	images/sec: 813.1 +/- 4.6 (jitter = 29.8)	7.593
90	images/sec: 813.1 +/- 4.6 (jitter = 29.4)	7.562
....
90	images/sec: 813.1 +/- 4.9 (jitter = 25.7)	7.572
100	images/sec: 814.2 +/- 4.5 (jitter = 25.0)	7.579
----------------------------------------------------------------
total images/sec: 6511.22
----------------------------------------------------------------
100	images/sec: 814.1 +/- 4.3 (jitter = 29.0)	7.604
----------------------------------------------------------------
total images/sec: 6511.26
----------------------------------------------------------------
100	images/sec: 814.1 +/- 4.3 (jitter = 29.2)	7.549
----------------------------------------------------------------
total images/sec: 6511.20
----------------------------------------------------------------

```

In [None]:
!kubectl logs distributed-mpi-job-launcher-q6h2d 
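
If you want to pull the throughput figures out of the launcher log programmatically, a small parser is enough. The sketch below assumes the log lines look exactly like the sample above. The function name `total_images_per_sec` is made up for this example, and how you capture the log text is up to you.

```python
import re

# Matches lines such as "total images/sec: 6511.22".
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9.]+)\s*$", re.MULTILINE)


def total_images_per_sec(log_text):
    """Return every 'total images/sec' value found in the log, as floats."""
    return [float(value) for value in TOTAL_RE.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "100\timages/sec: 814.2 +/- 4.5 (jitter = 25.0)\t7.579\n"
        "----------------------------------------------------------------\n"
        "total images/sec: 6511.22\n"
        "----------------------------------------------------------------\n"
    )
    print(total_images_per_sec(sample))
```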

### Clean up

GPU instances are very expensive, so remember to scale down the nodes when you finish the job.