# CIP203 - Maximizing GPU usage with MIGs, MPS, and Time-Slicing
## Why bother maximizing GPU usage


**Questions**
* Why is improving GPU performance important (i.e., what is the motivation)?

**Objectives**
* Be aware of the GPU usage on the clusters
* Understand the problem of GPU under-utilization
* Become motivated to make your code more GPU-efficient

### Is maximizing GPU usage even a problem ?

- It <font color='red'>**WOULD NOT BE**</font> a problem if you ran your jobs on <font color='red'>**YOUR OWN**</font> desktop/laptop/personal HPC environment.
- The problem comes when you start using shared compute resources (like the Alliance's clusters)
- Your not-that-efficient GPU code makes you request more GPU resources
- The more GPUs you request, the fewer GPUs are available for other users
- The fewer GPUs are available, the less happy we are

### How do you know whether the GPU efficiency of your code is good or bad ?

- One way to check that is to profile and/or debug your code. There are several profilers available. One of them is NVIDIA Nsight Systems (good for GPU jobs)
- Another way is to use our <font color='red'>**PORTALS**</font>, available for many of our clusters at the Alliance. Here are a few of them:
https://portail.narval.alliancecan.ca<br>
https://metrix.rorqual.alliancecan.ca  
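
For a quick check from inside a running job, `nvidia-smi` can also report utilization directly. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` of the compute node. The helper names `parse_gpu_csv` and `sample_gpus` are hypothetical and are not part of any Alliance tool. The `utilization.gpu` field reports the percentage of time during which at least one kernel was running, so it may not match the metric plotted by the portals exactly.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def _to_float(value):
    """Convert a field to float; return None for non-numeric placeholders."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse CSV text produced with --format=csv,noheader,nounits."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(index),
            "util_percent": _to_float(util),
            "mem_used_mib": _to_float(mem_used),
            "mem_total_mib": _to_float(mem_total),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return one dict per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Only query the GPUs if nvidia-smi is actually installed on this machine.
if __name__ == "__main__" and shutil.which("nvidia-smi"):
    for gpu in sample_gpus():
        print(gpu)
```

A single sample says little on its own; calling `sample_gpus()` periodically during a job gives a rough picture of how busy the GPU is over time. Some fields may come back as non-numeric placeholders on certain GPU configurations, which is why the parser returns `None` instead of failing.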

### What do you get from using the portals ?

- Current utilization of the cluster, including filesystem performance, status of login and data transfer nodes, etc.
- User summary
- <font color='red'>**Job stats**</font>
- Account stats

From the Job stats you can view all your jobs (running, pending, cancelled, etc.). For each job you can see CPU utilization, GPU utilization, I/O traffic, network bandwidth, filesystem usage, and more. As far as GPU utilization is concerned, you can see GPU activity, GPU occupancy, GPU memory consumption, and GPU power consumption.

### How to use the portals ?

1. **Log in to the portal with your Alliance credentials and MFA**

<table><tr>
<td> <img src="./images/portal_job_stat.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="./images/portal-your-jobs.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

2. **Click** <font color='red'>**Job stats**</font>
3. You will see the list of your jobs. Choose the job you are interested in.

<table><tr>
<td> <img src="./images/portal-resources.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="./images/portal-gpu-utilization.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

### Do all jobs really need GPUs ?

1. **Jobs that truly need GPU resources**

- Below we plot the GPU utilization as a function of time over a period of several days, both allocated and actually used by one user (left figure). GPU utilization is measured as the percentage of GPU cycles during which a Streaming Multiprocessor (SM) has at least one warp (the minimal scheduling unit, consisting of 32 threads) assigned. The plot shows that the utilization actually achieved is almost as high as what was allocated.

- The figure on the right shows GPU utilization as a function of time for a random job of the same user. It shows that multiple GPUs were used (plotted with different colours) and that the actual utilization peaks at 80% (which is pretty good).

<table><tr>
<td> <img src="./images/utilization1.1.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="./images/utilization1.2.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

Decision: <font color='red'>**NO GPU sharing is required !**</font> <br>
The user here is welcome to use full GPUs.
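
To make the utilization metric used in these plots concrete, here is a minimal sketch of how such a percentage could be computed for one SM. The function name `sm_active_percent` and the sample trace are hypothetical and serve only as an illustration; real values come from hardware counters, not from a Python list.

```python
def sm_active_percent(warps_per_cycle):
    """Percentage of sampled cycles in which the SM had at least one warp assigned."""
    if not warps_per_cycle:
        return 0.0
    active = sum(1 for warps in warps_per_cycle if warps >= 1)
    return 100.0 * active / len(warps_per_cycle)


# Hypothetical trace: number of warps assigned to one SM in 10 sampled cycles.
trace = [0, 0, 4, 8, 8, 2, 0, 1, 0, 0]
print(sm_active_percent(trace))  # 5 of the 10 cycles are active -> 50.0
```

Note that a cycle with a single warp counts exactly as much as a cycle with eight warps, so this metric tells you how often the SM is busy, not how fully it is loaded.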

2. **Jobs that need only a fraction of a GPU**

- The same GPU utilization metric is plotted for another user. The figure on the left clearly shows that only about 15-20% of GPU cycles were used.
- The figure on the right shows the GPU utilization for one of the jobs of that user. As expected, the GPU is used at a fraction of its capability.

<table><tr>
<td> <img src="./images/utilization2.1.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="./images/utilization2.2.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

Decision: <font color='red'>**Some GPU sharing may be needed to save resources !**</font> <br>
The user here needs to either improve GPU utilization, use fractional GPUs, or share GPUs.

3. **Jobs that don't need GPUs at all**

- The figure on the left shows that the GPU utilization collected over a period of several days was almost negligible. The number of used GPU cycles is very low.
- The figure on the right again shows the GPU utilization of a random job of the user. It clearly demonstrates that the GPU is almost idle. 

<table><tr>
<td> <img src="./images/utilization3.1.png" alt="Drawing" style="width: 350px;"/> </td>
<td> <img src="./images/utilization3.2.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

Decision: <font color='red'>**GPU sharing is a must !**</font> <br>
The user here should either improve GPU efficiency, use GPU sharing, or switch all calculations to CPU cores.

4. **Jobs with scattered GPU utilization**

<table><tr>
<td> <img src="./images/gpu-usage-scattered.png" alt="Drawing" style="width: 350px;"/> </td>
</tr></table>

Decision: <font color='red'>**GPU sharing is a must !**</font> <br>
The user here should probably restructure the code so that the GPUs are not idling.
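
The four cases above can be summarized as a simple decision rule. The sketch below is only an illustration: the function name `suggest_gpu_strategy` is hypothetical and the thresholds (`high`, `low`, `idle_below`, `idle_fraction`) are assumed values chosen to mirror the examples in this section, not an Alliance policy.

```python
from statistics import mean


def suggest_gpu_strategy(samples, high=60.0, low=5.0,
                         idle_below=1.0, idle_fraction=0.5):
    """Classify a job from its GPU utilization samples (percent over time).

    All thresholds are illustrative assumptions, not official limits.
    """
    if not samples:
        raise ValueError("need at least one utilization sample")
    average = mean(samples)
    # Share of samples in which the GPU was essentially doing nothing.
    idle_share = sum(1 for s in samples if s < idle_below) / len(samples)

    if average < low:
        return "idle"        # case 3: GPU almost unused, consider CPU cores
    if idle_share >= idle_fraction:
        return "scattered"   # case 4: short bursts separated by idle periods
    if average >= high:
        return "full"        # case 1: a full GPU is justified
    return "fraction"        # case 2: a fractional or shared GPU is enough


print(suggest_gpu_strategy([80, 75, 78, 82]))                      # full
print(suggest_gpu_strategy([15, 20, 18, 17]))                      # fraction
print(suggest_gpu_strategy([0, 1, 0, 2, 0, 1]))                    # idle
print(suggest_gpu_strategy([0, 0, 0, 90, 0, 0, 85, 0, 0, 0]))      # scattered
```

The "scattered" check comes before the "full" check on purpose: a job can reach high utilization in bursts and still leave the GPU idle most of the time.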

### What are the consequences of poor GPU usage

1. Your code is slow, so you have to run it longer to get the result
2. It takes an enormous number of GPUs for users to accomplish their tasks
3. The number of GPUs is very limited
4. The waiting time in the queue becomes long
5. Your account gets over-charged (for resources you requested but never used)
6. Your job priority goes down (making you unable to submit any jobs)
7. Your PI's compute allocations are burned very fast

### Why maximize GPU utilization ? 

- Either your code will run much faster or you will stop wasting GPU cycles
- Your job priority will increase (as you use fewer resources)
- Your waiting time in the queue will decrease
- Your PI's happy as the allocation is safe
- Your jobs won't get terminated due to low GPU utilization

## Key Points

* **What's happening at our clusters**
* **Lots of jobs with idle GPUs**
* **GPUs are often severely under-utilized**
* **Not all jobs even qualify to run on GPUs**