<img src="./images/DLI_Header.png" style="width: 400px;">

# 1. Overview of the Class Environment

Before we look at how to deploy large models, let us revisit the setup of the lab environment. In this section, we will experiment with tools for resource monitoring. The hardware used in this class may vary between sessions, so the number of GPUs, their memory capacity, and their interconnect might differ from class to class. The results currently listed are based on 4 V100 GPUs with 16 GB of memory each.

The goals of this notebook are to: 
* Revisit the hardware configuration at our disposal. 
* Use key nvidia-smi commands to monitor NVIDIA GPUs. 
* Run test scripts to measure the peer-to-peer communication performance of the NVLink interconnect, which will be essential for model-parallel communication. 

# The Hardware Overview 

Let us have a look at the key components of the hardware system at our disposal. As discussed earlier, the configuration of the system varies between deliveries, so you might see different results from some of your classmates. 

## The CPU

Let us start by inspecting the type of CPU used as well as the number of cores at our disposal: 

In [None]:
!lscpu

In [None]:
# Check the number of CPU cores
!grep 'cpu cores' /proc/cpuinfo | uniq
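Note that the `cpu cores` field reports the number of cores per physical CPU package, not the total for the machine. The following is a minimal sketch, assuming a Linux system that exposes `/proc/cpuinfo` with `physical id` and `core id` fields, of how the total number of physical cores could be counted from Python. The helper name `count_physical_cores` is hypothetical and only used for illustration.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text.

    Hypothetical helper: assumes one blank-line-separated entry per
    logical processor, as on typical x86 Linux systems.
    """
    cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" in fields and "core id" in fields:
            cores.add((fields["physical id"], fields["core id"]))
    return len(cores)


# Logical CPUs visible to Python (includes hyper-threads)
print("Logical CPUs:", os.cpu_count())

# Physical cores, only if the file is available on this system
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print("Physical cores:", count_physical_cores(f.read()))
```

Comparing the two numbers gives a quick indication of whether simultaneous multithreading is enabled on the host.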

We will host the NLP model alone and make a limited number of concurrent requests to it. This will mainly use the GPU and place little load on the CPU. Most production systems, however, do not deploy the model in isolation. An end-to-end production pipeline would include data pre- and post-processing steps. Production systems also experience much higher traffic, creating higher demand on the CPU, which needs to handle the processing of incoming requests (e.g. Triton execution overheads). Therefore, maintaining the correct ratio between CPU and GPU resources is critical. Please reach out to your local NVIDIA team for a more detailed conversation about the design of inference systems. 

## The GPU

As before, let us list the number and type of available GPUs. As the class environments will vary, there may be anywhere from four to eight Volta V100 GPUs with either 16 GB or 32 GB of onboard high-bandwidth memory.

In [None]:
# Check available GPUs
!nvidia-smi
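For monitoring from code rather than by eye, `nvidia-smi` can also emit machine-readable CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch of how that output could be collected in Python. The helper names `parse_gpu_csv` and `query_gpus` and the choice of fields are assumptions made for illustration, not part of the class material.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi (illustrative selection)
QUERY_FIELDS = ["index", "name", "memory.total", "memory.used", "utilization.gpu"]


def parse_gpu_csv(text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi and return parsed GPU info, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


for gpu in query_gpus():
    # With 'nounits', memory values are in MiB and utilization in percent
    print(gpu)
```

Calling `query_gpus()` repeatedly (for example, once per second) while a model is running is one simple way to observe memory use and utilization over time.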

### Interconnect Topology

As discussed in lab 1, the GPUs we are using today are interconnected using [NVIDIA NVLink technology](https://www.nvidia.com/en-us/data-center/nvlink/). It allows workloads that require high-bandwidth, low-latency communication to overcome the limitations of PCIe. Inference of deep neural networks is in principle an "embarrassingly parallel" workload, which is enhanced by the connectivity between the GPUs. Large models that do not fit into a single GPU (such as recommender systems) place high demands on both bandwidth and latency. For such models, parallel deployment over NVLink is a key enabler of real-time execution.

Let us inspect the interconnect between the GPUs. Please use the `nvidia-smi topo --matrix` command below to check the topology of our NVLink interconnect. Depending on the setup of the class, we should see either all 4 GPUs connected to each other (like in the example output listed below) or 8 GPUs interconnected (in which case not all GPUs have a direct NVLink connection to each other). 

```
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     0-23            N/A
GPU1    NV12     X      SYS     SYS     24-47           N/A
GPU2    SYS     SYS      X      NV12    48-71           N/A
GPU3    SYS     SYS     NV12     X      72-95           N/A

Where X= Self and NV# = Connection traversing a bonded set of # NVLinks
```

On NVLink-enabled systems based on the Ampere and Hopper architectures, one can also find NVSwitch, which overcomes the limitation mentioned above. 

In [None]:
# Check Interconnect Topology 
!nvidia-smi topo --matrix
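To read the matrix programmatically, the text can be parsed into a lookup table. The following is a minimal sketch that uses the example output shown earlier as its input. The helper names `parse_topology` and `nvlink_pairs` are hypothetical, and the parser assumes the GPU columns come first in the header row, as they do in that example.

```python
import re

# Example matrix copied from the sample output in this notebook
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     0-23            N/A
GPU1    NV12     X      SYS     SYS     24-47           N/A
GPU2    SYS     SYS      X      NV12    48-71           N/A
GPU3    SYS     SYS     NV12     X      72-95           N/A
"""


def parse_topology(text):
    """Return {row_gpu: {col_gpu: link_type}} from topo-matrix text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # GPU column names come from the header row
    gpus = [tok for tok in lines[0].split() if re.fullmatch(r"GPU\d+", tok)]
    matrix = {}
    for ln in lines[1:]:
        tokens = ln.split()
        # Keep only rows that start with a known GPU name
        if tokens[0] in gpus and len(tokens) > len(gpus):
            matrix[tokens[0]] = dict(zip(gpus, tokens[1:1 + len(gpus)]))
    return matrix


def nvlink_pairs(matrix):
    """List each GPU pair joined by NVLink once, with its link type."""
    pairs = []
    for src, row in matrix.items():
        for dst, link in row.items():
            if int(src[3:]) < int(dst[3:]) and link.startswith("NV"):
                pairs.append((src, dst, link))
    return pairs


topology = parse_topology(SAMPLE)
for src, dst, link in nvlink_pairs(topology):
    print(f"{src} <-> {dst}: {link}")
```

For the example matrix, this reports the GPU0/GPU1 and GPU2/GPU3 pairs as NVLink-connected, while the remaining pairs are marked `SYS`. The same functions could be applied to the live output of the cell above, for instance by capturing it with `subprocess.run`.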

We can also check NVLink status and capabilities using the `nvidia-smi nvlink --status` command. On a 4-GPU system, the output should list the NVLink capabilities of each GPU, similar to the following:
```
GPU 0: Graphics Device
	 Link 0: 25 GB/s
	 Link 1: 25 GB/s
	 Link 2: 25 GB/s
	 Link 3: 25 GB/s
```

In [None]:
# Check nvlink status
!nvidia-smi nvlink --status
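The per-link figures can be summed to estimate the aggregate NVLink rate reported for each GPU. The following is a minimal sketch that uses the example output shown earlier. The helper name `total_link_bandwidth` is hypothetical, and the regular expressions assume the `GPU <n>:` and `Link <n>: <rate> GB/s` line layout of that example.

```python
import re
from collections import defaultdict

# Example status output copied from the sample in this notebook
SAMPLE = """\
GPU 0: Graphics Device
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
"""


def total_link_bandwidth(text):
    """Sum the reported per-link rates (GB/s) for each GPU index."""
    totals = defaultdict(float)
    current_gpu = None
    for line in text.splitlines():
        gpu = re.match(r"\s*GPU (\d+):", line)
        link = re.match(r"\s*Link \d+: ([\d.]+) GB/s", line)
        if gpu:
            current_gpu = int(gpu.group(1))
        elif link and current_gpu is not None:
            # Lines without a numeric rate (e.g. inactive links) are skipped
            totals[current_gpu] += float(link.group(1))
    return dict(totals)


for gpu, rate in total_link_bandwidth(SAMPLE).items():
    print(f"GPU {gpu}: {rate:.0f} GB/s across all reported links")
```

For the example above, four links at 25 GB/s add up to 100 GB/s for GPU 0. This is a sum of the nominal values reported by the tool, not a measured throughput.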

### Testing the Connectivity

Let's make an empirical measurement of the bandwidth and latency that we are achieving in our environment. NVIDIA provides an example application called **p2pBandwidthLatencyTest** that demonstrates CUDA Peer-To-Peer (P2P) data transfers between pairs of GPUs by measuring bandwidth and latency with P2P access enabled and disabled. This tool is part of [cuda-samples](https://github.com/NVIDIA/cuda-samples.git), the code samples for CUDA developers. It was pre-downloaded for the purpose of this class, but it can also be obtained using the following command: 

`git clone --depth 1 --branch v11.2 https://github.com/NVIDIA/cuda-samples.git` 

To test the bandwidth and latency, please execute the commands below. Pay particular attention to how the results for "P2P=Disabled" compare with those for "P2P=Enabled".

In [None]:
!chmod 770 ./cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest

In [None]:
!./cuda-samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest

<h2 style="color:green;">Congratulations!</h2>

Now that we have reviewed the information about the lab environment, let's begin the model deployment. <br> 

Please proceed to the following notebook to start the next section of the lab: [Inference of the GPT-J 6b model with HuggingFace.](02_HFRunInferenceOfTheGPT-J.ipynb) 

