# CIP203 - Maximizing GPU usage with MIGs, MPS, and Time-Slicing: 
## NVIDIA MPS

Questions
* How can I speed up my code using MPS?

Objectives
* Become familiar with the concept of MPS
* Learn what concurrency is
* Learn how to launch the MPS daemon and adapt your submission script
* Practice running tests with MPS and benchmarking them

### What is NVIDIA MPS?

NVIDIA MPS (Multi-Process Service) is a runtime service that allows multiple independent processes to share a single NVIDIA GPU. Because their work can run concurrently, MPS can improve GPU utilization and overall throughput.

### Why is MPS needed?

MPS is useful when each application process does not generate enough work to saturate the GPU. Multiple processes can be run per node using MPS to enable more concurrency. 

### How NVIDIA MPS works

1.  Shared GPU Contexts
2.  Concurrent Kernel Execution
3.  Reduced Overhead

Conceptually, MPS places an additional "adapter" in front of the GPU's work-delivery front end. Work from all client processes is delivered to the GPU as if it came from a single process. The individual tenants therefore do not get exclusive access to the GPU; it is shared among them. This also allows activity to overlap in ways that do not occur in the default multi-tenant/multi-process case, where the GPU time-slices between processes. In particular, kernels from different processes can execute at the same time.

### What is the difference between MIG and MPS

- <font color='red'>**MIG**</font> provides full isolation, thus potentially sacrificing performance
- <font color='blue'>**MPS**</font> focuses on maximizing performance, but does not isolate
- <font color='red'>**MIG**</font> is suitable for sharing a GPU between different users
- <font color='blue'>**MPS**</font> is good for running multiple tasks by the same user

### GPU farming

One of the best-case scenarios for using MPS:
- when you need to run multiple instances of a CUDA application
- BUT each instance is too small to saturate a modern GPU

MPS allows you to run multiple instances of the application on a single GPU, as long as there is enough GPU memory for all of them. In many cases this should significantly increase the total throughput of your GPU processes.

![alt text](./images/nvidia-mps.png "Title")
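As a rough planning aid, the memory condition above can be turned into a small calculation. The sketch below is only illustrative: the function name `max_instances`, the per-process overhead parameter, and the example numbers (a 40 GiB GPU, 1.5 GiB per instance, 0.5 GiB overhead) are assumptions rather than measurements. The cap of 48 clients is the MPS limit quoted in this lesson.

```python
import math

MPS_CLIENT_LIMIT = 48  # maximum number of MPS clients per GPU quoted in this lesson


def max_instances(gpu_mem_gib, per_instance_gib, overhead_gib=0.0,
                  client_limit=MPS_CLIENT_LIMIT):
    """Estimate how many application instances fit on one GPU.

    gpu_mem_gib      -- total GPU memory (assumed value, check nvidia-smi)
    per_instance_gib -- memory used by one instance of the application
    overhead_gib     -- extra per-process memory overhead (hypothetical value)
    """
    if per_instance_gib + overhead_gib <= 0:
        raise ValueError("per-instance memory must be positive")
    by_memory = math.floor(gpu_mem_gib / (per_instance_gib + overhead_gib))
    return min(by_memory, client_limit)


# Hypothetical example: 40 GiB GPU, 1.5 GiB per instance, 0.5 GiB overhead each
print(max_instances(40, 1.5, overhead_gib=0.5))  # prints 20
```

Fitting in memory is only a necessary condition. Whether the instances actually overlap depends on how much of the GPU each one keeps busy, which is what the exercises measure.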

### How to enable MPS

MPS is not enabled by default, but enabling it is straightforward. Run the following commands in your submission script before launching your CUDA application:

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps <br>
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log <br>
nvidia-cuda-mps-control -d

- Then you can use the MPS feature if you have more than one CPU thread accessing the GPU. This will happen if you run a hybrid MPI/CUDA application, a hybrid OpenMP/CUDA application, or multiple instances of a serial CUDA application (GPU farming).

- With MPS the GPU can be shared between up to 48 processes
- There is also a memory overhead associated with using MPS
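If the job is driven from Python rather than from a shell script, the same setup could be sketched as follows. The environment variables and the `nvidia-cuda-mps-control -d` command are the ones shown above, and sending `quit` to `nvidia-cuda-mps-control` is the usual way to shut the daemon down. The helper names `mps_environment`, `start_mps` and `stop_mps` are hypothetical, and the sketch assumes the daemon should use the same `/tmp` directories.

```python
import os
import shutil
import subprocess


def mps_environment(pipe_dir="/tmp/nvidia-mps", log_dir="/tmp/nvidia-log"):
    """Return a copy of the current environment with the MPS directories set."""
    env = dict(os.environ)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir
    return env


def start_mps(env):
    """Start the MPS control daemon; return False if the tool is not installed."""
    if shutil.which("nvidia-cuda-mps-control") is None:
        return False
    # Creating the directories up front is harmless if they already exist
    os.makedirs(env["CUDA_MPS_PIPE_DIRECTORY"], exist_ok=True)
    os.makedirs(env["CUDA_MPS_LOG_DIRECTORY"], exist_ok=True)
    subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)
    return True


def stop_mps(env):
    """Ask the control daemon to shut down by sending it the 'quit' command."""
    if shutil.which("nvidia-cuda-mps-control") is None:
        return False
    subprocess.run(["nvidia-cuda-mps-control"], input="quit\n", text=True,
                   env=env, check=True)
    return True


# Building the environment has no side effects; the daemon is not started here
env = mps_environment()
print(env["CUDA_MPS_PIPE_DIRECTORY"], env["CUDA_MPS_LOG_DIRECTORY"])
```

Call `start_mps(env)` once before launching the GPU processes and `stop_mps(env)` when they have finished. The child processes must be started with the same `env` so that they find the daemon's pipe directory.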

### Can MPS be used with MIGs?

Absolutely. You can run multiple processes on a single MIG instance, potentially increasing GPU utilization even further. This only pays off when each process on its own can neither saturate the compute resources of the MIG instance nor use a substantial share of its memory.

### Exercise 4: Matrix multiplication using MPS (GPU kernels only)

We first run the matrix multiplication code on the GPU with a single process, then with multiple processes, and compare the execution times. We also vary the matrix size (and therefore the GPU load) to see how it affects concurrency.

In [None]:
import time

import torch

use_cuda = torch.cuda.is_available()
if use_cuda:
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("Using CPU")

# Vary matrix_size to change the load each process puts on the GPU
matrix_size = 10000
A1 = torch.randn(matrix_size, matrix_size, device=device)
B1 = torch.randn(matrix_size, matrix_size, device=device)

if use_cuda:
    print(f"Allocated GPU memory: {torch.cuda.memory_allocated() / (1024**3):.2f} GiB")
    # GPU work is asynchronous: wait for the random fills before starting the timer
    torch.cuda.synchronize()

# Time the matrix multiplication
start_time = time.perf_counter()
C1 = torch.matmul(A1, B1)
if use_cuda:
    # Wait for the kernel to finish before stopping the timer
    torch.cuda.synchronize()
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Execution time: {elapsed_time:.6f} seconds")
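To compare a single process with several concurrent ones, the script above has to be started multiple times at once. One possible way to do that from Python is sketched below. The helper `run_concurrently` and the file name `matmul.py` are hypothetical, and the stand-in command only exists so that the sketch runs on any machine.

```python
import subprocess
import sys
import time


def run_concurrently(command, n_procs, env=None):
    """Launch n_procs copies of command at once and wait for all of them.

    Returns the total wall-clock time and the list of exit codes.
    """
    start = time.perf_counter()
    procs = [subprocess.Popen(command, env=env) for _ in range(n_procs)]
    exit_codes = [p.wait() for p in procs]
    return time.perf_counter() - start, exit_codes


# Stand-in command so the sketch runs anywhere; replace it with
# [sys.executable, "matmul.py"] (a hypothetical file name for the code above).
command = [sys.executable, "-c", "pass"]
for n in (1, 2, 4):
    wall, codes = run_concurrently(command, n)
    print(f"{n} process(es): {wall:.3f} s wall time, exit codes {codes}")
```

Run the loop once with the MPS daemon stopped and once with it running, and compare the wall times for the same number of processes. The wall time includes Python and CUDA start-up, so the per-process `Execution time` printed by each copy is the better number for judging kernel overlap.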

### Exercise 5: Matrix multiplication using MPS (GPU kernels + CPU operations)

Now we add a matrix multiplication on the CPU as well. We first allocate the matrices A_cpu and B_cpu in CPU memory and populate them with random numbers, then run the CPU matrix multiplication right after the GPU one. Finally, we use MPS to run multiple processes and compare the execution times.
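A minimal sketch of what such a script could look like is given below. It is not the original exercise code: the function name `time_gpu_then_cpu` and the default matrix size are assumptions, and only the names `A_cpu` and `B_cpu` come from the description above. PyTorch is imported inside the function so that the file can be loaded on a machine without it.

```python
import time


def time_gpu_then_cpu(matrix_size=2000):
    """Time a matrix multiplication on the GPU followed by one on the CPU."""
    import torch  # imported here so the file loads even where PyTorch is absent

    if not torch.cuda.is_available():
        raise RuntimeError("this exercise needs a CUDA-capable GPU")
    device = torch.device("cuda")

    # GPU operands
    A1 = torch.randn(matrix_size, matrix_size, device=device)
    B1 = torch.randn(matrix_size, matrix_size, device=device)
    # CPU operands live in host memory
    A_cpu = torch.randn(matrix_size, matrix_size)
    B_cpu = torch.randn(matrix_size, matrix_size)
    torch.cuda.synchronize()  # make sure the GPU fills are done before timing

    start = time.perf_counter()
    C1 = torch.matmul(A1, B1)
    torch.cuda.synchronize()  # wait for the GPU kernel to finish
    gpu_time = time.perf_counter() - start

    # The GPU is idle during this part, which leaves room for other processes
    start = time.perf_counter()
    C_cpu = torch.matmul(A_cpu, B_cpu)
    cpu_time = time.perf_counter() - start

    return gpu_time, cpu_time
```

On a GPU node, `gpu_time, cpu_time = time_gpu_then_cpu(10000)` would reproduce the matrix size used in Exercise 4. While one process is busy with its CPU phase, the GPU is free for the others, so this workload gives MPS more opportunity to overlap work than the GPU-only version.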

### Conclusion

- MPS requires a daemon running in the background
- Multiple processes can run concurrently, which increases GPU utilization
- Concurrency depends on how well the processes saturate the GPU
    - If one of the processes fully uses a GPU resource (be it GPU cores, memory, or registers), the execution of multiple processes will be serialized
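One simple way to quantify the last point from benchmark timings is to compare the time N processes would need back to back with the time they actually took when run together. The sketch below assumes this metric; the name `throughput_gain` and the timings are hypothetical, not measured results.

```python
def throughput_gain(t_single, t_concurrent, n_procs):
    """Ratio of back-to-back time (n_procs * t_single) to measured concurrent time.

    About 1.0 means the processes were effectively serialized;
    about n_procs means they overlapped almost perfectly.
    """
    if t_single <= 0 or t_concurrent <= 0 or n_procs < 1:
        raise ValueError("times must be positive and n_procs at least 1")
    return n_procs * t_single / t_concurrent


# Hypothetical timings in seconds, not measured results
print(f"{throughput_gain(2.0, 8.0, 4):.2f}")  # serialized case, prints 1.00
print(f"{throughput_gain(2.0, 2.5, 4):.2f}")  # well-overlapped case, prints 3.20
```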

## Key Points

* **What is MPS**
* **Why use MPS**
* **Who profits from using MPS**
* **How to start MPS daemon**
* **Running multiple processes using MPS**