# CIP203 - Maximizing GPU usage with MIGs, MPS, and Time-Slicing: 
## CUDA streams

**Questions**
* How can I speed up my code with CUDA streams?

**Objectives**
* Become familiar with the concept of CUDA streams
* Learn how to create CUDA streams and place operations in them in PyTorch
* Get your feet wet with hands-on exercises

### The problem:

- all the CPU threads in the same process (same CUDA context) share the same GPU
- by default, kernels (GPU functions) go to the same default stream and run one after another
- when a single function cannot fully use the GPU cores, part of the GPU sits idle

### CUDA Streams is the oldest NVIDIA solution for sharing a GPU between unrelated tasks:

- CUDA streams provide an application-layer method for sharing GPU resources by enabling <font color='red'>**CONCURRENT**</font>, asynchronous execution of tasks, such as data transfers and kernel launches. 

- Each stream is a sequence of commands executed in a specific order, but independent streams can run in parallel, potentially overlapping operations to improve performance.

- This allows developers to overlap memory copies with kernel execution, run multiple kernels simultaneously, and manage data dependencies more effectively to achieve higher GPU utilization. 

- The biggest drawback: streams are limited to a single process. If your process does not have enough parallelism to saturate a modern GPU, streams will not help.

- Another drawback: using streams requires rewriting the code, which is time-consuming.

### CUDA streams is like squeezing a lemon in order to get even more juice

After you are done parallelizing your code and think you did the best you could, think again: you may get an additional performance boost by using CUDA streams.

![alt text](./images/sqeezing_lemon.jpeg "Title")

### Here is how it works:

1. Different streams are created
2. Various operations (memory copies, kernel executions, etc.) are placed into different streams
3. Streams are run <font color='red'>**CONCURRENTLY**</font>
4. Streams are synchronized so that all enqueued work has finished before the results are used
5. The speed-up depends on the availability of GPU resources

### What is concurrency?

The ability to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism):
- CUDA kernels (functions)
- memory transfers from CPU to GPU
- memory transfers from GPU to CPU
- operations on the CPU

![alt text](./images/NVIDIA-CUDA-Streams.png "Title")
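As a minimal sketch of overlapping a CPU-to-GPU memory transfer with a kernel, the PyTorch snippet below places the copy on a side stream while a matrix multiplication runs on the current stream. The function name `copy_and_compute` and the matrix size are illustrative assumptions, not part of the course scripts. Whether the two operations actually overlap depends on the GPU and is best checked in a profiler.

```python
import torch

def copy_and_compute(n=1024):
    """Overlap a host-to-device copy (side stream) with a matmul (current stream)."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    # Pinned (page-locked) host memory is needed for a truly asynchronous copy.
    host_batch = torch.randn(n, n).pin_memory()
    a = torch.randn(n, n, device=device)

    # Enqueue the copy on the side stream; non_blocking=True returns immediately.
    with torch.cuda.stream(copy_stream):
        device_batch = host_batch.to(device, non_blocking=True)

    # Meanwhile the current stream is free to run a kernel.
    product = torch.matmul(a, a)

    # Make the current stream wait for the copy before it touches device_batch.
    current = torch.cuda.current_stream()
    current.wait_stream(copy_stream)
    # Tell the caching allocator that device_batch is also used on this stream.
    device_batch.record_stream(current)
    total = product + device_batch

    torch.cuda.synchronize()  # host waits for all GPU work to finish
    return total, device_batch, host_batch

result = copy_and_compute()
if result is None:
    print("CUDA is not available; skipping the demonstration.")
else:
    print("Result shape:", tuple(result[0].shape))
```

The `wait_stream` call is a GPU-side dependency: the CPU keeps running, and only the current stream waits for the copy.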

### Will my code always be faster with the use of CUDA streams?

It depends:
1. on the available GPU resources (global memory, registers, shared memory, etc.)
2. on whether your code can completely saturate the GPU or not

GPU resources are shared between streams. If one stream uses all the available CUDA cores and/or memory, the other streams have to wait, i.e. the execution is serialized.
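One way to check this on your own GPU is to time the same work with and without streams. The sketch below uses CUDA events for timing. The helper `time_matmuls` and the two matrix sizes are assumptions chosen for illustration. Small matrices leave room on the GPU and may benefit from streams, while large ones may already saturate it. No particular speed-up is implied, since the numbers depend on the hardware.

```python
import torch

def time_matmuls(size, use_streams, repeats=5):
    """Average time in ms for two independent matmuls, with or without streams."""
    device = torch.device("cuda")
    a1 = torch.randn(size, size, device=device)
    a2 = torch.randn(size, size, device=device)
    streams = [torch.cuda.Stream(), torch.cuda.Stream()]
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.matmul(a1, a1)      # warm-up, so library initialization is not timed
    torch.cuda.synchronize()  # inputs and warm-up are complete before timing

    current = torch.cuda.current_stream()
    start.record()
    for _ in range(repeats):
        if use_streams:
            for s, a in zip(streams, (a1, a2)):
                s.wait_stream(current)       # side stream starts after prior work
                with torch.cuda.stream(s):
                    torch.matmul(a, a)
            for s in streams:
                current.wait_stream(s)       # rejoin before the next iteration
        else:
            torch.matmul(a1, a1)
            torch.matmul(a2, a2)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats

if torch.cuda.is_available():
    for size in (256, 4096):
        serial = time_matmuls(size, use_streams=False)
        streamed = time_matmuls(size, use_streams=True)
        print(f"size={size}: serial {serial:.3f} ms, two streams {streamed:.3f} ms")
else:
    print("CUDA is not available; skipping the timing comparison.")
```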

### Requirements for concurrency

- The GPU must support concurrent operations
- The kernels launched concurrently must not exceed the GPU's available resources, such as registers, shared memory, and compute units.
- CUDA operations must be in different, non-default (non-0) streams
- Kernels or memory operations in different streams should not have dependencies that force sequential execution
- Use asynchronous CUDA API calls: cudaMemcpyAsync for memory transfers, and kernel launches, which are asynchronous with respect to the host by default

### CUDA streams synchronization

**Why synchronize streams?**\
CUDA streams are synchronized to ensure the correct ordering and completion of operations, particularly when dependencies exist between tasks or when the host (CPU) needs to interact with the results of GPU computations.\
\
**What if you overuse synchronization?**\
Excessive synchronization introduces performance bottlenecks because it forces the CPU to wait for the GPU. Therefore, it is important to minimize unnecessary synchronization and to rely on asynchronous operations and concurrent streams where possible to maximize GPU utilization.
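The following sketch illustrates the difference between GPU-side and host-side synchronization in PyTorch. The function name `produce_then_consume` and the stream names are hypothetical. An event marks the point where one stream's result is ready, a second stream waits for that event without blocking the CPU, and the host waits only once, right before it reads the result.

```python
import torch

def produce_then_consume(n=512):
    """Chain two streams with an event; block the host only at the very end."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU

    device = torch.device("cuda")
    producer = torch.cuda.Stream()
    consumer = torch.cuda.Stream()
    ready = torch.cuda.Event()

    x = torch.ones(n, n, device=device)
    # x was created on the current stream, so the producer must wait for it.
    producer.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(producer):
        y = x * 2
        ready.record(producer)      # marks the point where y is complete

    with torch.cuda.stream(consumer):
        consumer.wait_event(ready)  # GPU-side wait; the CPU is not blocked
        z = y + 1

    consumer.synchronize()          # host-side wait, needed before reading z
    return z.sum().item()

total = produce_then_consume()
if total is None:
    print("CUDA is not available; skipping the demonstration.")
else:
    print("Sum of all elements:", total)
```

Replacing `wait_event` with a host-side `torch.cuda.synchronize()` would also give the correct result, but it would stall the CPU in the middle of the pipeline.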


### How to invoke CUDA streams in PyTorch

In [None]:
import torch
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

Here we created 2 CUDA streams in addition to the default stream (stream 0).<br>
Next we need to execute certain operations in those streams:

In [None]:
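# YOUR_FUNCTION and ANOTHER_FUNCTION are placeholders for your own GPU work
with torch.cuda.stream(s1):
    YOUR_FUNCTION        # enqueued on stream s1
with torch.cuda.stream(s2):
    ANOTHER_FUNCTION     # enqueued on stream s2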
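For reference, here is one way the pattern could look as a complete script. The functions `your_function` and `another_function` are hypothetical stand-ins for the placeholders, and the tensor size is arbitrary. The sketch also adds two steps that are easy to forget: the side streams wait for the stream that created the input, and the host synchronizes before using the results.

```python
import torch

def your_function(x):
    # Hypothetical stand-in for YOUR_FUNCTION: any GPU work will do.
    return torch.matmul(x, x)

def another_function(x):
    # Hypothetical stand-in for ANOTHER_FUNCTION.
    return torch.relu(x).sum()

def run_in_two_streams(size=1024):
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU

    device = torch.device("cuda")
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    x = torch.randn(size, size, device=device)  # created on the current stream

    # Side streams do not wait for the current stream unless told to.
    current = torch.cuda.current_stream()
    s1.wait_stream(current)
    s2.wait_stream(current)

    with torch.cuda.stream(s1):
        out1 = your_function(x)
    with torch.cuda.stream(s2):
        out2 = another_function(x)

    torch.cuda.synchronize()  # both streams have finished after this call
    return out1, out2

outputs = run_in_two_streams()
if outputs is None:
    print("CUDA is not available; skipping the demonstration.")
else:
    print("out1 shape:", tuple(outputs[0].shape), "out2:", outputs[1].item())
```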

### Exercise 1: Matrix multiplications using 2 streams

In what follows we use 2 streams and distribute matrix multiplication operations between them. We also need the NVIDIA Nsight Systems profiler to see the overlap between streams.<br>

However, we will not execute the code from within the notebook. Instead, we use the Terminal to edit the exercises and the submission scripts, as well as to submit sbatch jobs. <br>

Please launch the Terminal and go to the scripts directory. There you will find several Python files along with a few submission scripts. Please go ahead and open **matmul-cuda.py** first.

In [4]:
import torch
device = torch.device("cuda")
matrix_size = 10000  # Adjust this value to increase/decrease GPU intensity

In [5]:
# Create large random matrices on the chosen device
A1 = torch.randn(matrix_size, matrix_size, device=device)
B1 = torch.randn(matrix_size, matrix_size, device=device)
A2 = torch.randn(matrix_size, matrix_size, device=device)
B2 = torch.randn(matrix_size, matrix_size, device=device)

In [None]:
print(f"Performing matrix multiplication of two {matrix_size}x{matrix_size} matrices...")
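# Both multiplications are enqueued on the default stream and run one after another
C1 = torch.matmul(A1, B1)
C2 = torch.matmul(A2, B2)
torch.cuda.synchronize()  # wait until all GPU work has finished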

Use NVIDIA Nsight Systems to run and profile the code so that you can observe the concurrency.

<font color='blue'>**Here is how to use the NVIDIA Nsight Systems profiler:**</font>

1. NVIDIA Nsight Systems can be launched from the Launcher tab
2. Once it is open, click "Select target for profiling" and choose nodegpupool.
3. In the "Collect CPU IP/backtrace samples" block find the "Target application" sub-block. Enter the following into the "Command line with arguments" field: python ./cq-formation-cip203-main/scripts/matmul-streams.py
4. Click "Start" in the right column
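To make the timeline easier to read, you can optionally label regions of your script with NVTX ranges, which Nsight Systems displays as named blocks. The sketch below is an assumption about how one might annotate a two-stream script. It is not the content of matmul-streams.py, and the function name and range labels are illustrative.

```python
import torch

def annotated_matmuls(size=2048):
    """Two matmuls on two streams, each launch wrapped in a named NVTX range."""
    if not torch.cuda.is_available():
        return None  # NVTX ranges are only useful when profiling GPU code

    device = torch.device("cuda")
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    current = torch.cuda.current_stream()
    s1.wait_stream(current)
    s2.wait_stream(current)

    # Each range covers the CPU-side launch of the work it wraps.
    torch.cuda.nvtx.range_push("launch matmul on stream 1")
    with torch.cuda.stream(s1):
        c1 = torch.matmul(a, a)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("launch matmul on stream 2")
    with torch.cuda.stream(s2):
        c2 = torch.matmul(b, b)
    torch.cuda.nvtx.range_pop()

    torch.cuda.synchronize()
    return c1, c2

results = annotated_matmuls()
if results is None:
    print("CUDA is not available; skipping the demonstration.")
else:
    print("Finished; result shapes:", tuple(results[0].shape), tuple(results[1].shape))
```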

### Exercise 2: Create streams 3 and 4, then use them to add more matrix multiplication operations

### Which programming languages/libraries/packages support CUDA streams?

- C/C++, C++ with Thrust
- Fortran
- Python (including the packages Numba, PyTorch, PyCUDA, CuPy)
- Julia
- Rust
- .NET
- Java

## Key Points

* **CUDA streams definition**
* **Concurrency**
* **Synchronization**
* **Are CUDA streams for everyone?**