
<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="1.Introduction-to-Distributed-Deep-Learning.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="1.Introduction-to-Distributed-Deep-Learning.ipynb">1</a>
        <a >2</a>
        <a href="3.Hands-on-Multi-GPU.ipynb">3</a>
        <a href="4.Convergence.ipynb">4</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="3.Hands-on-Multi-GPU.ipynb">Next Notebook</a></span>
</div>

# Introduction to Distributed Deep Learning - Part 2

**Table of Contents**

- [Understanding System Topology](#Understanding-System-Topology)
    - [Communication concepts](#Communication-concepts)
- [Intra-Node Communication Topology](#Intra-Node-communication-Topology)
    - [Performance variation due to system topology](#Performance-variation-due-to-system-topology)
- [NCCL](#NCCL)
    - [NCCL_P2P_LEVEL=0 or P2P Disabled](#NCCL_P2P_LEVEL=0-or-P2P-Disabled)
    - [NCCL_P2P_LEVEL=1 or P2P via PCIe](#NCCL_P2P_LEVEL=1-or-P2P-via-PCIe)
- [Benchmarking the system topology](#Benchmarking-the-system-topology)

**The objectives of this notebook are to help you understand:**

- How system topology plays a role in distributed training.
- Intra-node topology and underlying technologies such as P2P, and their implications for program performance.

# Understanding System Topology

In our previous notebook, when we calculated the `throughput` of deep learning training with different parameters, we saw a slight dip when we scaled from 4 to 8 GPUs. Let us try to explain this dip by understanding the underlying system.

Before we begin, let us define two important terms:

* **Latency:** The amount of time it takes to move a unit of data from point A to point B. For example, if 4 bytes of data can be transferred from point A to B in 4 $\mu$s, that is the latency of the transfer.
* **Bandwidth:** The amount of data that can be transferred from point A to point B in a unit of time. For example, if the width of the bus is 64 KiB and the latency of transfer between point A and B is 4 $\mu$s, the bandwidth is 64 KiB * (1/4 $\mu$s) $\approx$ 15.26 GiB/s (about 16.4 GB/s).
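
The bandwidth arithmetic in the bullet above can be checked with a few lines of Python. This is a minimal sketch under the simplified definition used here (one bus-width of data delivered per latency period); the function name `bandwidth_bytes_per_s` is illustrative only.

```python
KIB = 1024          # bytes in one KiB
GIB = 1024 ** 3     # bytes in one GiB

def bandwidth_bytes_per_s(bus_width_bytes: float, latency_s: float) -> float:
    """Bytes moved per second when one bus-width of data arrives every `latency_s`."""
    return bus_width_bytes / latency_s

# 64 KiB bus width, 4 microseconds per transfer (the example from the text).
bw = bandwidth_bytes_per_s(64 * KIB, 4e-6)
print(f"{bw / 1e9:.2f} GB/s = {bw / GIB:.2f} GiB/s")
```

Note the difference between decimal (GB) and binary (GiB) units: the same rate is reported as two different numbers depending on which unit is used.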

### Setting up the GPU

To verify that our system has multiple GPUs in each node, run the command below:

In [None]:
!nvidia-smi

If the output is unclear, you can launch a Terminal session by clicking on `File` $\rightarrow$ `New` $\rightarrow$ `Terminal` or by following the steps as shown:

![open_terminal_session](images/open_terminal.png)

## Communication concepts

There are many ways in which GPUs can transfer data between one another. Let us look at two of the most commonly used copy operations. Understanding them will help in the later sections of the notebook, when we benchmark and toggle the different options available to us.

#### Host Staging of Copy Operations

The path taken by the data is denoted by the red arrow as follows:

<center><img src="images/memcpy_host_staging.png"/></center>

That is, in the above GPU-to-GPU memory copy, the data travels from GPU 0 over the PCIe bus to the CPU, where it is staged in a buffer before being copied to GPU 1. This is called "host staging"; it decreases the bandwidth and increases the latency of the operation. If we eliminate host staging, we can usually improve the performance of our application.
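
To see why both quantities matter, a first-order cost model can help: transfer time is roughly a fixed latency plus the payload size divided by the bandwidth. The sketch below is only an illustration. The function name `transfer_time_s` and the two parameter sets are hypothetical values chosen for the example, not measurements from this system.

```python
def transfer_time_s(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """First-order cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical link parameters, chosen only to illustrate the trend.
HOST_STAGED = {"latency_s": 17e-6, "bandwidth_bytes_per_s": 14e9}
DIRECT = {"latency_s": 2e-6, "bandwidth_bytes_per_s": 48e9}

for size in (4 * 1024, 4 * 1024**2, 256 * 1024**2):
    staged = transfer_time_s(size, **HOST_STAGED)
    direct = transfer_time_s(size, **DIRECT)
    print(f"{size:>10d} B  staged={staged * 1e6:10.1f} us  direct={direct * 1e6:10.1f} us")
```

In this model, small messages are dominated by latency and large messages by bandwidth, so a path that is worse on both counts loses at every message size.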

#### Peer-to-Peer Memory Access

P2P allows devices to address each other's memory from within device kernels and eliminates host staging by transferring data either through the PCIe switch or through NVLink as denoted by the red arrow below. 

<center><img src="images/memcpy_p2p_overview.png"/></center>

Peer-to-Peer (P2P) memory access requires GPUs to share a Unified Virtual Address Space (UVA). UVA means that a single address space is used for the host and all modern NVIDIA GPU devices (specifically, those with compute capability of 2.0 or higher).

Let us now try to understand the Intra-node topology.

## Intra-Node communication Topology

Run the command below to display your node's GPU and NIC communication topology:

### DGX-1 (V100)

In [None]:
!nvidia-smi topo -m

Output of running the command: 

![nvidia_smi_topo_output](images/nvidia_smi_topo_output.png)

Focus on a particular row, say GPU 0. The output states that GPUs 1 through 4 are connected to it via NVLink (in addition to PCIe) and GPUs 5 through 7 are connected to it via PCIe as well as an "SMP" interconnect. We have a dual-socket system and the CPUs in these sockets are connected by an interconnect known as SMP interconnect.

Thus, GPU 0 to GPU 5 communication happens via not just PCIe, but also over the inter-socket interconnect within the same node. Clearly, this is a longer path than say the one between GPU 0 and GPU 1, which are connected via NVLink directly.

Even within the GPUs connected via NVLink, we see different annotations such as `NV1` and `NV2` that affect the communication bandwidth and hence the performance. In this section, we will explore the nuances associated with a diverse intra-node GPU communication topology like in the output above. Specifically, in our system, the communication topology is as follows:

![dgx1_8x_tesla_v100_topo](images/dgx1_8x_tesla_v100_topo.png)


### DGX A100

In [None]:
#run this cell to see the topology of A100

!nvidia-smi topo -m

In contrast to the V100 system, every pair of GPUs from GPU 0 to GPU 7 is connected via NVLink (`NV12`). In other words, each A100 GPU is connected to all other GPUs via third-generation NVLink. This doubles the GPU-to-GPU direct bandwidth to 600 gigabytes per second (GB/s), almost 10X higher than PCIe Gen 4, and relies on a new NVIDIA NVSwitch that is 2X faster than the previous generation used with V100.

<br/>
<center><img src="images/dgx_a100_architecture.png" width="70%" height="70%"/></center>
    
  

### Overview 

Qualitatively, the bandwidth and latency vary with the topology as follows:

<br/>
<center><img src="images/intra_node_topology_map.png"/></center>

Host staging implies traversing through the CPU and the travel path taken is one of PHB, NODE, and SYS. In contrast, if the path taken is either NV1, NV2, or PIX, then P2P is available. PXB implies that the GPUs belong to different PCIe hubs and P2P is usually not supported in this case.
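
The rule in the paragraph above can be written as a small lookup. This is a sketch of the text's classification only; the helper name `describe_path` and the returned labels are hypothetical, and the actual P2P capability of a system should be confirmed with the tools used later in this notebook.

```python
HOST_STAGED_PATHS = {"PHB", "NODE", "SYS"}   # traffic traverses the CPU
P2P_PCIE_PATHS = {"PIX"}                     # same PCIe switch

def describe_path(conn: str) -> str:
    """Map a connection type from `nvidia-smi topo -m` to the behaviour described above."""
    conn = conn.strip().upper()
    if conn.startswith("NV") and conn[2:].isdigit():
        return "p2p"                         # NV1, NV2, ... (NVLink)
    if conn in P2P_PCIE_PATHS:
        return "p2p"
    if conn in HOST_STAGED_PATHS:
        return "host-staged"
    if conn == "PXB":
        return "p2p-usually-unsupported"     # different PCIe hubs
    raise ValueError(f"unrecognised connection type: {conn!r}")

for entry in ("NV2", "PIX", "PXB", "PHB", "SYS"):
    print(f"{entry:>4} -> {describe_path(entry)}")
```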

A double NVLink connection provides twice the bandwidth compared to a single NVLink. 

For a pair of GPUs, the peak bandwidths are as follows:
* PCIe: Using PIX topology, 15.75 GB/s for PCIe Gen 3.0 and 31.5 GB/s for PCIe Gen 4.0 (x16 link, per direction).
* NVLink: Using NV# topology, 50 GB/s bidirectional per connection. So a double NVLink connection has 100 GB/s peak bidirectional bandwidth.
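
Since the number after `NV` is the count of NVLink connections, the peak figure for any `NV#` entry follows from the 50 GB/s per-connection value quoted above. A minimal sketch, with `nvlink_peak_gb_s` as a hypothetical helper name:

```python
import re

NVLINK_GB_S_PER_CONNECTION = 50.0  # per-connection bidirectional figure quoted in the text

def nvlink_peak_gb_s(entry: str) -> float:
    """Peak bidirectional bandwidth (GB/s) for a topology entry such as 'NV1' or 'NV2'."""
    match = re.fullmatch(r"NV(\d+)", entry.strip().upper())
    if match is None:
        raise ValueError(f"not an NVLink entry: {entry!r}")
    return int(match.group(1)) * NVLINK_GB_S_PER_CONNECTION

for entry in ("NV1", "NV2", "NV12"):
    print(f"{entry}: {nvlink_peak_gb_s(entry):.0f} GB/s")
```

The `NV12` case reproduces the 600 GB/s figure mentioned for A100.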

Let us understand what difference the underlying communication topology can make to the application performance in the following sub-section.

**Note:** If your command output doesn't show any NVLink connection, or if there's no difference in connection type (PIX, PXB, PHB, NODE, SYS, NV#) between any two pairs of GPUs, then the communication bandwidth and latency will likely be the same between any pair and the following sub-sections will not display any performance difference.

### Performance variation due to system topology

So far, we have run the application by specifying only the number of GPUs to use. To choose which GPUs are used, we can set the `CUDA_VISIBLE_DEVICES` environment variable when launching the executable. For example, to run on only 2 GPUs, namely GPU 0 and GPU 3, we set `CUDA_VISIBLE_DEVICES="0,3"` while executing the command. Let us also include the `NCCL_DEBUG=INFO` variable to understand how the GPUs are connected. We will take a closer look at the NCCL library in the upcoming section.
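
If you prefer to launch runs from Python rather than from a shell cell, the same environment variables can be passed through a dictionary. The sketch below only builds the command and environment and does not start training; `build_launch` is a hypothetical helper, and the command-line arguments are the ones used in the cells of this section.

```python
import os
import shlex

def build_launch(gpu_ids, extra_env=None):
    """Return (command, environment) for a Horovod run restricted to `gpu_ids`."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)  # which GPUs are visible
    env["NCCL_DEBUG"] = "INFO"                                       # print NCCL connection info
    env.update(extra_env or {})
    cmd = [
        "horovodrun", "-np", str(len(gpu_ids)),
        "--mpi-args=--oversubscribe",
        "python3", "../source_code/N2/cnn_fmnist.py", "--batch-size=512",
    ]
    return cmd, env

cmd, env = build_launch([0, 3])
print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
print(shlex.join(cmd))
```

Passing the result to `subprocess.run(cmd, env=env)` would start the run; changing the list to `[1, 7]` selects a different pair without editing the command string.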

Let us now run the command with two GPUs and compare the throughput achieved in both cases. **Note:** This use case applies only to DGX-1 V100; it has no effect on DGX A100, where all GPUs are connected via NVLink.


**Experiment 1** : Try to find the GPU pair with **highest bandwidth and lowest latency** available as per the table above and replace `0,3` with those GPUs, and then run the command below:

In [None]:
!TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES="0,3" horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null

**Experiment 2**: Let us run the command below on the pair with the **highest latency and lowest bandwidth** (in our case, GPUs `1,7`).

In [None]:
!TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES="1,7" horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null

With both results in hand, we can compare them.

The throughput, and hence the scaling efficiency, will likely be higher for the pair of GPUs with **low latency and high bandwidth**.

Output of running the command on DGX-1 (V100) : 


```bash
NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
NCCL INFO Channel 00 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 00 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Channel 01 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 01 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Channel 02 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 02 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Channel 03 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 03 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Connected all rings
NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer


Epoch 4/8
Images/sec: 100702.49
Epoch 5/8
Images/sec: 101486.84
Epoch 6/8
Images/sec: 101490.28
Epoch 7/8
Images/sec: 99128.98
Epoch 8/8
Images/sec: 101215.77
```

Now, run the command on a pair of GPUs that has the lowest available bandwidth. In our case, we use GPU 1 and GPU 7.


Output of running the command on DGX-1 (V100): 

```bash
NCCL INFO Setting affinity for GPU 7 to ffff,f00000ff,fff00000
NCCL INFO Channel 00/02 :    0   1
NCCL INFO Channel 01/02 :    0   1
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
NCCL INFO Channel 00 : 1[8a000] -> 0[7000] via direct shared memory
NCCL INFO Channel 00 : 0[7000] -> 1[8a000] via direct shared memory
NCCL INFO Channel 01 : 1[8a000] -> 0[7000] via direct shared memory
NCCL INFO Channel 01 : 0[7000] -> 1[8a000] via direct shared memory
NCCL INFO Connected all rings
NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer

Epoch 4/8
Images/sec: 98996.51
Epoch 5/8
Images/sec: 98135.64
Epoch 6/8
Images/sec: 97798.09
Epoch 7/8
Images/sec: 96672.95
Epoch 8/8
Images/sec: 95782.78
```
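
To compare the two runs quantitatively, the `Images/sec` readings can be extracted from each log and averaged. The sketch below uses the epoch lines copied from the two DGX-1 (V100) outputs above; the helper name `images_per_sec` is hypothetical.

```python
import re
from statistics import mean

def images_per_sec(log_text: str) -> list:
    """Extract every `Images/sec: <value>` reading from a training log."""
    return [float(v) for v in re.findall(r"Images/sec:\s*([0-9.]+)", log_text)]

# Readings copied from the outputs above (epochs 4 to 8).
LOG_GPUS_0_3 = """
Images/sec: 100702.49
Images/sec: 101486.84
Images/sec: 101490.28
Images/sec: 99128.98
Images/sec: 101215.77
"""
LOG_GPUS_1_7 = """
Images/sec: 98996.51
Images/sec: 98135.64
Images/sec: 97798.09
Images/sec: 96672.95
Images/sec: 95782.78
"""

fast = mean(images_per_sec(LOG_GPUS_0_3))
slow = mean(images_per_sec(LOG_GPUS_1_7))
gain_percent = 100.0 * (fast / slow - 1.0)
print(f"GPUs 0,3: {fast:.0f} images/sec")
print(f"GPUs 1,7: {slow:.0f} images/sec")
print(f"Difference: {gain_percent:.1f}%")
```

For this small model the gap is only a few percent, because communication is a small part of each training step.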

# NCCL

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as `all-gather`, `all-reduce`, `broadcast`, `reduce`, `reduce-scatter` as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.

The Horovod framework also uses NCCL collective communication to keep all the GPUs in sync. We can toggle P2P levels using environment variables to manually switch between the different communication paths available. The complete list of environment variables can be found [here](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-disable).

Let us now toggle Peer-to-peer levels using the `NCCL_P2P_LEVEL` environment variable. 

```text
NCCL_P2P_LEVEL
(since 2.3.4)

The NCCL_P2P_LEVEL variable allows the user to finely control when to use the peer to peer (P2P) transport between GPUs. The level defines the maximum distance between GPUs where NCCL will use the P2P transport.

Values accepted
LOC or 0 : Never use P2P (always disabled)

NVL : Use P2P when GPUs are connected through NVLink

PIX or 1 : Use P2P when GPUs are on the same PCI switch.

PXB or 2 : Use P2P when GPUs are connected through PCI switches (potentially multiple hops).

PHB or 3, or 4 : Use P2P when GPUs are on the same NUMA node. Traffic will go through the CPU.

SYS or 5 : Use P2P between NUMA nodes, potentially crossing the SMP interconnect (e.g. QPI/UPI).
```
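
One way to read the excerpt above is as an ordered list of distances: P2P is used when the path between two GPUs is no farther than the configured level. The sketch below encodes that reading. It is a simplified model of the documentation text, not NCCL's implementation, and the names `normalize` and `p2p_allowed` are hypothetical.

```python
# Ordered from closest to farthest, following the list in the excerpt.
DISTANCES = ["LOC", "NVL", "PIX", "PXB", "PHB", "SYS"]
RANK = {name: i for i, name in enumerate(DISTANCES)}

# Integer values accepted alongside the names, as listed in the excerpt.
INTEGER_ALIASES = {"0": "LOC", "1": "PIX", "2": "PXB", "3": "PHB", "4": "PHB", "5": "SYS"}

def normalize(value) -> str:
    """Turn 'pix', 'PIX' or '1' into the canonical name 'PIX'."""
    text = str(value).strip().upper()
    name = INTEGER_ALIASES.get(text, text)
    if name not in RANK:
        raise ValueError(f"unknown P2P level: {value!r}")
    return name

def p2p_allowed(level, path) -> bool:
    """True if a GPU pair at distance `path` would use P2P under NCCL_P2P_LEVEL=`level`."""
    level_name = normalize(level)
    if level_name == "LOC":
        return False  # LOC or 0: never use P2P
    return RANK[normalize(path)] <= RANK[level_name]

for level in ("0", "1", "SYS"):
    row = ", ".join(f"{p}={p2p_allowed(level, p)}" for p in DISTANCES[1:])
    print(f"NCCL_P2P_LEVEL={level}: {row}")
```

Under this reading, a level of `1` (PIX) still permits P2P for an NVLink-connected pair, because NVLink is a shorter distance than a shared PCIe switch.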

We have already benchmarked the case where NVLink is used and verified it through the `NCCL_DEBUG` environment variable. Let us now try two different settings and compare their throughput.

### NCCL_P2P_LEVEL=0 or P2P Disabled

In [None]:
!NCCL_P2P_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES="0,3" horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null

**Output of running the command on DGX-1 V100** : 

```bash
NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
NCCL INFO Channel 00/02 :    0   1
NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
NCCL INFO Channel 01/02 :    0   1
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
NCCL INFO Channel 00 : 1[b000] -> 0[6000] via direct shared memory
NCCL INFO Channel 00 : 0[6000] -> 1[b000] via direct shared memory
NCCL INFO Channel 01 : 1[b000] -> 0[6000] via direct shared memory
NCCL INFO Channel 01 : 0[6000] -> 1[b000] via direct shared memory
NCCL INFO Connected all rings
NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer

Epoch 4/8
Images/sec: 95033.4
Epoch 5/8
Images/sec: 94848.44
Epoch 6/8
Images/sec: 94289.97
```

**Output of running the command on DGX A100** :

```bash
NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
NCCL INFO PXN Disabled as plugin is v4
NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
NCCL INFO PXN Disabled as plugin is v4
NCCL INFO Channel 00/04 :    0   1
NCCL INFO Channel 01/04 :    0   1
NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
NCCL INFO Channel 02/04 :    0   1
NCCL INFO Channel 03/04 :    0   1
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
NCCL INFO Channel 00 : 1[4e000] -> 0[7000] via direct shared memory
NCCL INFO Channel 01 : 1[4e000] -> 0[7000] via direct shared memory
NCCL INFO Channel 00 : 0[7000] -> 1[4e000] via direct shared memory
NCCL INFO Channel 02 : 1[4e000] -> 0[7000] via direct shared memory
NCCL INFO Channel 01 : 0[7000] -> 1[4e000] via direct shared memory
NCCL INFO Channel 03 : 1[4e000] -> 0[7000] via direct shared memory
NCCL INFO Channel 02 : 0[7000] -> 1[4e000] via direct shared memory
NCCL INFO Channel 03 : 0[7000] -> 1[4e000] via direct shared memory
NCCL INFO Connected all rings
NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer


Epoch 4/6
Images/sec: 98121.52
Epoch 5/6
Images/sec: 97400.23
Epoch 6/6
Images/sec: 97135.72
```

### NCCL_P2P_LEVEL=1 or P2P via PCIe

In [None]:
!NCCL_P2P_LEVEL=1 TF_CPP_MIN_LOG_LEVEL=3 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES="0,3" horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/N2/cnn_fmnist.py --batch-size=512 2> /dev/null

**Output of running the command on DGX-1 V100**: 

```bash
NCCL INFO NCCL_P2P_LEVEL set by environment to PIX
NCCL INFO NCCL_P2P_LEVEL set by environment to PIX
NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
NCCL INFO Channel 00/04 :    0   1
NCCL INFO Channel 01/04 :    0   1
NCCL INFO Channel 02/04 :    0   1
NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
NCCL INFO Channel 03/04 :    0   1
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
NCCL INFO Channel 00 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 00 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Channel 01 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 01 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Channel 02 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 02 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Channel 03 : 1[b000] -> 0[6000] via P2P/IPC
NCCL INFO Channel 03 : 0[6000] -> 1[b000] via P2P/IPC
NCCL INFO Connected all rings
NCCL INFO Connected all trees
Epoch 4/8
Images/sec: 96529.63
Epoch 5/8
Images/sec: 97288.7
Epoch 6/8
Images/sec: 97230.33
Epoch 7/8
Images/sec: 97701.72
Epoch 8/8
Images/sec: 97075.39
```

**Output of running the command on DGX A100**:

```bash
NCCL INFO NCCL_P2P_LEVEL set by environment to PIX
NCCL INFO PXN Disabled as plugin is v4
NCCL INFO NCCL_P2P_LEVEL set by environment to PIX
NCCL INFO PXN Disabled as plugin is v4
NCCL INFO Channel 00/24 :    0   1
NCCL INFO Channel 01/24 :    0   1
NCCL INFO Channel 02/24 :    0   1
NCCL INFO Channel 03/24 :    0   1
NCCL INFO Channel 04/24 :    0   1
NCCL INFO Channel 05/24 :    0   1
NCCL INFO Channel 06/24 :    0   1
NCCL INFO Channel 07/24 :    0   1
NCCL INFO Channel 08/24 :    0   1
NCCL INFO Channel 09/24 :    0   1
NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] -1/-1/-1->1->0 [19] -1/-1/-1->1->0 [20] -1/-1/-1->1->0 [21] -1/-1/-1->1->0 [22] -1/-1/-1->1->0 [23] -1/-1/-1->1->0
[0] NCCL INFO Channel 10/24 :    0   1
NCCL INFO Channel 11/24 :    0   1
-------------------------
NCCL INFO Channel 23/24 :    0   1
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
NCCL INFO Channel 00 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 00 : 0[7000] -> 1[4e000] via P2P/IPC/read
NCCL INFO Channel 01 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 02 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 03 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 04 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 05 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 06 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 07 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 08 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 09 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 10 : 1[4e000] -> 0[7000] via P2P/IPC/read
------------------------
NCCL INFO Channel 21 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 22 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 23 : 1[4e000] -> 0[7000] via P2P/IPC/read
NCCL INFO Channel 04 : 0[7000] -> 1[4e000] via P2P/IPC/read
NCCL INFO Channel 05 : 0[7000] -> 1[4e000] via P2P/IPC/read
------------------------
NCCL INFO Channel 22 : 0[7000] -> 1[4e000] via P2P/IPC/read
NCCL INFO Channel 23 : 0[7000] -> 1[4e000] via P2P/IPC/read
NCCL INFO Connected all rings
NCCL INFO Connected all trees

Epoch 4/6
Images/sec: 105595.09
Epoch 5/6
Images/sec: 102635.29
Epoch 6/6
Images/sec: 104972.76
```

We can summarise the results using the following table. 

|GPUs|Condition|Throughput V100 (images/sec)|Throughput A100 (images/sec)|
|-|-|-|-|
|0,3|P2P via NVLink|~100000|~166697 (batch-size=8192)|
|0,3|P2P via PCIe|~97000|~104000|
|0,3|P2P Disabled|~95000|~97000|

We now understand the role of communication and hardware configuration in training. In our case, we used a small model for quicker runtimes. The decrease in throughput due to communication becomes more pronounced as the data transfer size increases, as it does for larger models that typically require multi-node training. In such cases, NVLink helps reduce the scaling efficiency gap as we scale further.
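
The V100 column of the table can be turned into a relative drop against the NVLink baseline. The sketch below uses the approximate values from the table; `relative_drop_percent` is a hypothetical helper. The A100 column is left out because its NVLink entry was measured with a different batch size and is therefore not directly comparable to the other two rows.

```python
# Approximate V100 throughput (images/sec) from the summary table.
V100_THROUGHPUT = {
    "P2P via NVLink": 100_000,
    "P2P via PCIe": 97_000,
    "P2P Disabled": 95_000,
}

def relative_drop_percent(baseline: float, value: float) -> float:
    """Percentage by which `value` falls short of `baseline`."""
    return 100.0 * (baseline - value) / baseline

baseline = V100_THROUGHPUT["P2P via NVLink"]
for condition, throughput in V100_THROUGHPUT.items():
    drop = relative_drop_percent(baseline, throughput)
    print(f"{condition:<15} {throughput:>7d} images/sec  ({drop:.1f}% below NVLink)")
```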

### Benchmarking the system topology

The above application is not highly memory intensive as we mentioned earlier. Therefore, to get a quantitative measure of latency and bandwidth impact due to topology, we run a micro-benchmark.

**The p2pBandwidthLatencyTest micro-benchmark**

p2pBandwidthLatencyTest is part of the [CUDA Samples GitHub repository](https://github.com/NVIDIA/cuda-samples), which is available to help CUDA developers.

As the name suggests, this test measures the bandwidth and latency impact of P2P and underlying communication topology. Let's compile the benchmark:

In [None]:
!cd ../source_code/N2/Samples/p2pBandwidthLatencyTest/ && make clean && make

Now, let's run the benchmark:

In [None]:
!cd ../source_code/N2/Samples/p2pBandwidthLatencyTest/ && ./p2pBandwidthLatencyTest

The first part of the benchmark gives device information and P2P access available from each GPU (similar to `nvidia-smi topo -m` command). Next, the benchmark measures the unidirectional and bidirectional bandwidth and latency with P2P disabled and enabled.

We share partial results obtained by running the command on DGX-1:

```bash
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 783.95   9.56  14.43  14.46  14.47  14.24  14.51  14.43 

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 784.87  48.49  48.49  96.85  96.90  14.25  14.54  14.49 
     
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   1.78  17.52  16.41  16.43  17.35  16.88  17.34  16.85 
     
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   1.76   1.62   1.61   2.01   2.02  18.44  19.15  19.34
```

Our system is based on PCIe Gen 3.0 with a peak GPU-GPU PCIe bandwidth of 15.75 GB/s. Let us analyze and understand these results:

* GPU 0 and GPU 1/2: Connected by a single NVLink connection. With P2P enabled:
  - Bandwidth reaches close to the peak of 50 GB/s.
  - Latency decreases by an order of magnitude.
* GPU 0 and GPU 3/4: Connected by a double NVLink connection. With P2P enabled:
  - Bandwidth reaches close to the peak of 100 GB/s.
  - Latency decreases by an order of magnitude.
* GPU 0 and GPU 5/6/7: Connected by PCIe and SMP interconnect. With P2P enabled:
  - Bandwidth is unchanged.
  - Latency increases marginally.
  
Correlate these results with the communication topology displayed by the `nvidia-smi topo -m` command and the qualitative table in the previous section. They should be consistent with one another.
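
The effect of enabling P2P can also be expressed as ratios computed from the GPU 0 rows of the partial results above. The sketch below copies those rows into lists; the helper name `ratios` is hypothetical, and index 0 (GPU 0 to itself) is skipped.

```python
# GPU 0 rows copied from the partial benchmark output above.
BW_DISABLED = [783.95, 9.56, 14.43, 14.46, 14.47, 14.24, 14.51, 14.43]   # GB/s, bidirectional
BW_ENABLED = [784.87, 48.49, 48.49, 96.85, 96.90, 14.25, 14.54, 14.49]   # GB/s, bidirectional
LAT_DISABLED = [1.78, 17.52, 16.41, 16.43, 17.35, 16.88, 17.34, 16.85]   # microseconds
LAT_ENABLED = [1.76, 1.62, 1.61, 2.01, 2.02, 18.44, 19.15, 19.34]        # microseconds

def ratios(numerators, denominators):
    """Element-wise ratio for peers 1..7 (index 0 is GPU 0 talking to itself)."""
    return {peer: numerators[peer] / denominators[peer]
            for peer in range(1, len(numerators))}

bw_gain = ratios(BW_ENABLED, BW_DISABLED)     # > 1 means more bandwidth with P2P
lat_gain = ratios(LAT_DISABLED, LAT_ENABLED)  # > 1 means lower latency with P2P

for peer in sorted(bw_gain):
    print(f"GPU 0 <-> GPU {peer}: bandwidth x{bw_gain[peer]:.2f}, latency x{lat_gain[peer]:.2f}")
```

Ratios well above 1 appear for the NVLink-connected peers (GPUs 1 to 4), while the peers reached over PCIe and the SMP interconnect (GPUs 5 to 7) stay close to 1 for bandwidth and fall slightly below 1 for latency.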

In general, we should try to set the GPUs in an application such that a GPU can share data with its neighbours using a high-bandwidth, low-latency communication topology. Enabling P2P, when possible, usually improves the performance by eliminating host staging.

**Now that we understand the role of system topology in distributed deep learning, let us get hands-on with refactoring and scaling deep learning models in the upcoming notebook.**

***

## Licensing

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).

<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="1.Introduction-to-Distributed-Deep-Learning.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="1.Introduction-to-Distributed-Deep-Learning.ipynb">1</a>
        <a >2</a>
        <a href="3.Hands-on-Multi-GPU.ipynb">3</a>
        <a href="4.Convergence.ipynb">4</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="3.Hands-on-Multi-GPU.ipynb">Next Notebook</a></span>
</div>

<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

