<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center></p> 
<div>
    <span style="float: left; width: 20%; text-align: left;"><a href="07-Message_Passing.ipynb">Previous Notebook</a></span>
    <span style="float: left; width:75%; text-align: right;"><a href="06-DDP_Mixed_Precision.ipynb">Next Notebook</a></span></div>

# Horovod

---

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Its goal is to make distributed deep learning fast and easy to use ([Horovod on GitHub](https://github.com/horovod/horovod)).

## Horovod with PyTorch

To use Horovod with PyTorch, make the following modifications to your training script:

1. Run `hvd.init()`.
1. Pin each GPU to a single process.
    With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

    ```python
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
    ```
3. Scale the learning rate by the number of workers.

    The effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.

4. Wrap the optimizer in `hvd.DistributedOptimizer`.

    The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

5. Broadcast the initial variable states from rank 0 to all other processes:

    ```python
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    ```

    This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.

    Accomplish this by running the checkpointing code only when `hvd.rank() == 0`.

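The learning-rate scaling step is plain arithmetic, so a small sketch may help. It assumes linear scaling (multiply the single-worker rate by the worker count); the helper names and the values `32` and `0.01` are illustrative, and in a real Horovod script `world_size` would come from `hvd.size()`.

```python
def effective_batch_size(per_worker_batch: int, world_size: int) -> int:
    """Samples that contribute to one synchronous update across all workers."""
    return per_worker_batch * world_size


def scale_learning_rate(base_lr: float, world_size: int) -> float:
    """Linear scaling: grow the learning rate with the number of workers."""
    return base_lr * world_size


# Illustrative values; in a Horovod script world_size would be hvd.size().
world_size = 4
print(effective_batch_size(32, world_size))
print(scale_learning_rate(0.01, world_size))
```

Linear scaling is a common starting point rather than a guarantee; the best rate for a given model may still need tuning.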

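The checkpointing rule can be sketched without Horovod or PyTorch installed. Here `save_checkpoint` is a hypothetical helper, the rank is passed in as an argument rather than read from `hvd.rank()`, and JSON stands in for `torch.save` purely so the sketch runs anywhere.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str, rank: int) -> bool:
    """Write the checkpoint only on rank 0; return True if a file was written."""
    if rank != 0:
        # Non-root workers skip the write so they cannot clobber the file.
        return False
    with open(path, "w") as f:
        json.dump(state, f)
    return True


# Simulate three workers running the same code path.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
written_by = [rank for rank in range(3)
              if save_checkpoint({"epoch": 1}, checkpoint_path, rank)]
print(written_by)
```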
### Example

```python
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())

# Define dataset...
train_dataset = ...

# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=..., sampler=train_sampler)

# Build model...
model = ...
model.cuda()

# Scale the learning rate by the number of workers
base_lr = 0.01  # illustrative single-worker learning rate
optimizer = optim.SGD(model.parameters(), lr=base_lr * hvd.size())

# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Broadcast parameters and optimizer state from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

log_interval = 10  # illustrative: log every 10 batches

for epoch in range(100):
    # Give the sampler the epoch so each epoch uses a different shuffle
    train_sampler.set_epoch(epoch)
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch to the GPU that holds the model
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
```
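To make the partitioning step in the example more concrete, the sketch below shows one way a dataset's indices can be split so that each rank sees a different, equally sized shard. It is a pure-Python approximation of the idea behind `DistributedSampler` with shuffling left out, not the library's implementation; `partition_indices` is a hypothetical helper.

```python
import math


def partition_indices(dataset_len: int, num_replicas: int, rank: int) -> list:
    """Return the sample indices assigned to one rank (no shuffling)."""
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    # Pad by wrapping around so every rank gets the same number of samples.
    padded = [i % dataset_len for i in range(total)]
    # Stride through the padded list: rank r takes positions r, r + N, r + 2N, ...
    return padded[rank:total:num_replicas]


# Ten samples split across four workers; two indices are reused as padding.
shards = [partition_indices(10, num_replicas=4, rank=r) for r in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Equal shard sizes matter because every worker must take the same number of steps for the synchronous gradient averaging to line up.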

## Launch command

Running `horovodrun` with no arguments prints its usage summary:

! horovodrun

usage: horovodrun [-h] [-v] -np NP [-cb] [--disable-cache]
                  [--start-timeout START_TIMEOUT] [--network-interface NICS]
                  [--output-filename OUTPUT_FILENAME] [--verbose]
                  [--config-file CONFIG_FILE] [-p SSH_PORT]
                  [-i SSH_IDENTITY_FILE]
                  [--fusion-threshold-mb FUSION_THRESHOLD_MB]
                  [--cycle-time-ms CYCLE_TIME_MS]
                  [--cache-capacity CACHE_CAPACITY]
                  [--hierarchical-allreduce | --no-hierarchical-allreduce]
                  [--hierarchical-allgather | --no-hierarchical-allgather]
                  [--autotune] [--autotune-log-file AUTOTUNE_LOG_FILE]
                  [--autotune-warmup-samples AUTOTUNE_WARMUP_SAMPLES]
                  [--autotune-steps-per-sample AUTOTUNE_STEPS_PER_SAMPLE]
                  [--autotune-bayes-opt-max-samples AUTOTUNE_BAYES_OPT_MAX_SAMPLES]
                  [--autotune-gaussian-process-noise AUTOTUNE_GAUSSIAN_PROCESS_NOISE

The following command launches the training script on two processes (`-np 2`):

! horovodrun -np 2 python ../source_code/ddp_horovod.py

[1,1]<stdout>:worldsize: 2
[1,0]<stdout>:worldsize: 2
[1,0]<stdout>:Files already downloaded and verified
[1,1]<stdout>:Files already downloaded and verified
[1,0]<stdout>:Files already downloaded and verified
[1,1]<stdout>:Files already downloaded and verified
[1,0]<stdout>:---------------------------------------------------------------------------
[1,0]<stdout>:Epoch: 0, Accuracy: 0.0
[1,0]<stdout>:---------------------------------------------------------------------------
[1,1]<stdout>:Local Rank: 1, Epoch: 0, Training ...
[1,1]<stdout>:Time 8.18 seconds
[1,0]<stdout>:Local Rank: 0, Epoch: 0, Training ...
[1,0]<stdout>:Time 8.18 seconds
[1,0]<stdout>:---------------------------------------------------------------------------
[1,0]<stdout>:Epoch: 1, Accuracy: 0.1
[1,0]<stdout>:---------------------------------------------------------------------------
[1,0]<stdout>:Local Rank: 0, Epoch: 1, Training ...
[1,0]<stdout>:Time 5.93 seconds
[1,1]<stdout>:Local Rank: 1, Epoch: 1, Training ..

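Because the two processes write to the same stream, their lines interleave. A small sketch for regrouping such output per process follows. It assumes each line has the shape `[<job>,<rank>]<stdout>:<message>`, with the second number being the rank, which matches the `Local Rank` values printed in the output above; `group_by_rank` is a hypothetical helper, not part of Horovod.

```python
import re
from collections import defaultdict

# Assumed line shape: "[<job>,<rank>]<stdout>:<message>"
LINE_RE = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")


def group_by_rank(lines):
    """Map each rank to the list of messages it printed, in order."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            grouped[int(match.group(2))].append(match.group(4))
    return dict(grouped)


# A few lines copied from the output above.
sample = [
    "[1,1]<stdout>:worldsize: 2",
    "[1,0]<stdout>:worldsize: 2",
    "[1,0]<stdout>:Epoch: 0, Accuracy: 0.0",
    "[1,1]<stdout>:Time 8.18 seconds",
]
by_rank = group_by_rank(sample)
print(by_rank[0])
print(by_rank[1])
```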
---

