<div align="center"><img src="./images/DLI_Header.png"></div>

# Scaling CUDA C++ Applications on Multiple Nodes

Welcome to _Scaling CUDA C++ Applications on Multiple Nodes_. In this course you will learn several techniques for scaling single-GPU CUDA applications to multiple GPUs and multiple nodes. The emphasis is on [NVSHMEM](https://developer.nvidia.com/nvshmem), which allows for elegant multi-GPU application code and has been shown to scale well on systems with many GPUs.

## The Coding Environment

For your work today, you have access to several GPUs in the cloud. Run the following cell to see which GPUs are available to you.

In [1]:
!nvidia-smi

Mon Sep 19 16:21:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   37C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000002:00:00.0 Off |                  Off |
| N/A   40C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|       

While your work today will be on a single node, all the techniques you learn, in particular CUDA-aware MPI and NVSHMEM, can be used to run your applications across clusters of multi-GPU nodes.

## Table of Contents

During the workshop today you will work through each of the following notebooks with your instructor:

- [_Monte Carlo Approximation of $\pi$ - Single GPU_](02_MCπ-SGPU.ipynb): You will begin by familiarizing yourself with a single-GPU implementation of the Monte Carlo approximation of π, which we will use to introduce many multi-GPU programming paradigms.
- [_Monte Carlo Approximation of $\pi$ - Multiple GPUs_](03_MCπ-MGPU.ipynb): In this notebook you will extend the Monte Carlo π program to run on multiple GPUs by looping over the available GPU devices.
- [_Monte Carlo Approximation of $\pi$ - Multiple GPUs with Peer Access_](04_MCπ-P2P.ipynb): In this notebook you will improve on your multi-GPU code by using direct peer-to-peer GPU communication.
- [_Monte Carlo Approximation of $\pi$ - MPI_](05_MCπ-MPI.ipynb): In this notebook you will be introduced to the single-program multiple-data (SPMD) paradigm and will simplify your Monte Carlo π application with MPI.
- [_Monte Carlo Approximation of $\pi$ - CUDA-Aware MPI_](06_MCπ-CUDA-MPI.ipynb): In this notebook you will learn about CUDA-Aware MPI, which facilitates direct peer-to-peer communication between GPUs in the SPMD paradigm.
- [_Monte Carlo Approximation of $\pi$ - NVSHMEM_](07_MCπ-NVSHMEM-Dup.ipynb): In this notebook you will be introduced to NVSHMEM and will take your first pass with it using the Monte Carlo π program.
- [_Monte Carlo Approximation of $\pi$ - NVSHMEM with Distributed Work_](08_MCπ-NVSHMEM-Dist.ipynb): In this notebook you will expand your NVSHMEM skills by using it to distribute different work to multiple GPUs.
- [_The NVSHMEM Memory Model_](09_MCπ-NVSHMEM-Sym.ipynb): In this notebook you will learn about NVSHMEM's symmetric memory, an elegant mechanism for inter-GPU communication initiated on the GPU, and will apply it to the Monte Carlo π program.
- [_NVSHMEM Histogram: Duplicated Approach_](10_Histogram-Dup.ipynb): In this notebook you will learn how to use NVSHMEM to perform collective operations across GPUs using a histogram application.
- [_NVSHMEM Histogram: Distributed Approach_](11_Histogram-Dist.ipynb): In this notebook you will take a different approach to the NVSHMEM histogram application and will learn how to reason about performance trade-offs in your multi-GPU applications.
- [_Jacobi Iteration_](12_Jacobi.ipynb): In this notebook you will be introduced to a Laplace equation solver using Jacobi iteration and will learn how to use NVSHMEM to handle boundary communications between multiple GPUs.
- [_Improving the Reduction Performance with `cub`_](13_Jacobi-cub.ipynb): In this notebook you will learn how to use the `cub` library to improve the performance of your NVSHMEM Jacobi application.
- [_Final Exercise_](14_Wave.ipynb): In this exercise you will apply what you have learned during the day by refactoring a single-GPU 1D wave equation solver to run on multiple GPUs with NVSHMEM.
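
For orientation, the Monte Carlo approximation of π that the first several notebooks build on can be sketched in plain host-side C++. This is only an illustrative sketch and not the course's implementation: the function names `count_hits` and `estimate_pi`, the seeds, and the `num_workers` parameter are assumptions made here. Each "worker" stands in for one GPU that samples its own share of points, and the per-worker hit counts are then summed.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Count how many of `samples` random points in the unit square fall inside
// the quarter circle of radius 1. (Hypothetical helper, CPU only.)
std::uint64_t count_hits(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = dist(rng);
        const double y = dist(rng);
        if (x * x + y * y <= 1.0) {
            ++hits;
        }
    }
    return hits;
}

// Split the samples evenly over `num_workers` (each standing in for one GPU),
// give every worker its own seed, and combine the partial hit counts.
// The ratio hits / samples approximates pi / 4.
double estimate_pi(std::uint64_t total_samples, int num_workers) {
    assert(num_workers > 0);
    assert(total_samples >= static_cast<std::uint64_t>(num_workers));
    const std::uint64_t per_worker = total_samples / num_workers;
    std::uint64_t hits = 0;
    for (int w = 0; w < num_workers; ++w) {
        hits += count_hits(per_worker, 1234u + static_cast<std::uint32_t>(w));
    }
    const double used = static_cast<double>(per_worker) * num_workers;
    return 4.0 * static_cast<double>(hits) / used;
}
```

Calling `estimate_pi(4000000, 4)` should give a value close to 3.14. The multi-GPU notebooks replace the loop over workers with real devices or processes, and the final summation with the communication mechanism under study.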

## Next

Please continue to the next notebook: [_Monte Carlo Approximation of $\pi$ - Single GPU_](02_MCπ-SGPU.ipynb).