



## Automatic Discovery of Implementation Rules for Fast GPU + MPI Operations

Carl Pearson, Karen Devine, Aurya Javeed Sandia National Labs

SIAM Parallel Processing 2022

MS61 Experiences in Developing GPU Support for DOE Math Libraries

This work is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program through the FASTMath Institute. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231 using NERSC award ERCAP0019623.



Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

SAND2022-2037 C

#### Automatic Discovery of Implementation Rules for Fast GPU + MPI Operations



- Fast libraries for heterogeneous architectures
  - Mapping computation onto processors
  - Choosing communication strategy
  - Unpredictable performance interaction

- Prototype automatic tooling for discovering important design decisions
  - Reduced developer effort for performance on new systems
  - Maintain human provenance of library design
  - e.g. Modernize Tpetra MPI+GPU distributed linear algebra operations

| Key Challenge               | How it's Done                                                                                                                                                             |
|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Large Design Space          | <ul> <li>Express operation as a directed acyclic graph (DAG) of operations</li> <li>Monte-Carlo Tree Search (MCTS) to identify and explore regions of interest</li> </ul> |
| Extract performance insight | <ul> <li>Empirical benchmarking</li> <li>Feature vector for each implementation</li> <li>Decision tree training for design rules</li> </ul>                               |

Initial results pass "sniff test," working on broader experiments and quantitative evaluation

### Libraries are built on existing lower-level primitives



- Our libraries (and applications) are combinations of existing library and vendor operations
  - and code to coordinate them.
  - and code to implement custom behavior
- Performance changes at many layers for new platforms
  - new hardware,
  - new CUDA version,
  - new OS version,
  - etc.







DAG of operations describes design space



C++ / CUDA / MPI

Python / scikit-learn



C++ / CUDA / MPI

Python / scikit-learn





C++ / CUDA / MPI

space

assignment

Python / scikit-learn

### **Example: Distributed SpMV**



#### DAG represents primitive operations and their dependences





**CPU** 

Stream 1

Stream 2

pack

#### Design Space: Order of Operations, Resource Assignment, and **Synchronization**



- Different resource assignments require different synchronization
- May improve GPU utilization or communication/computation overlap, but increases required operations

WSs

 $y_R$ 

PSs WRs









#### **Need to Discover Important Design Decisions**

- Some choices matter a lot
- Many choices do not matter at all
- input- and system-dependent
- Large design space: lots of expert time to evaluate and implement for each target platform
- Monte-Carlo Tree Search to focus on valuable decisions



{order of operations}

X

{stream assignments}

X

{synchronizations}

2036 implementations

State space search is stored in a tree





State space search is stored in a tree



From DAG, or synchronization operation

State space search is stored in a tree



From DAG, or synchronization operation



State space search is stored in a tree



### MCTS Iteratively Grows Tree to Focus on Valuable Regions



*Selection:* Choose a path through the tree



*Selection:* Choose a path through the tree

Expansion: Create a new child



Selection: Choose a path through the tree

Expansion: Create a new child

Rollout: Random ordering / assignment to complete the implementation



Selection: Choose a path through the tree

Expansion: Create a new child

Rollout: Random ordering / assignment to complete the implementation

*Evaluation:* Empirical benchmark



Selection: Choose a path through the tree

Expansion: Create a new child

Rollout: Random ordering / assignment to complete the implementation

*Evaluation:* Empirical benchmark

tart PostSends pack stream 0 stream 1 stream 1 cudaEventRecord  $y_{\mathsf{L}}$  $y_L$ stream 0 tream 0, event 1 stream 0 cudaEventSync event 1 end

*Backpropagation:* Store result along path

#### Tree is Deeper and Larger in Valuable Regions

As iterations proceed, tree preferentially explores high-reward regions of the design space

Store all complete implementations and performance results in a table as we go



# Transform Empirical Results into Performance Classes and Feature Vectors





automatic class labeling to identify performance classes (convolution & peak detection)

**Implementation** 

feature vectors encode which rules an implementation follows (sequence-to-vector transformation)

# Decision Tree Training to Determine which Rules Discriminate between Classes





- sync pack before y<sub>L</sub>
- WaitRecv before y<sub>L</sub>
- $y_L, y_R$  in same stream

Each path through the tree is a set of design rules that define a performance class

#### **Train an Accurate Decision Tree**

- Training process is for isolating discriminating features
  - not for classifying unseen inputs
- Incrementally increase tree size until 100% accuracy achieved
- Accuracy-complexity tradeoff in generated rules



#### Does MCTS Find Relevant Design Space Regions?



- Each MCTS iteration is a costly empirical benchmark
- Rule quality with reduced iterations?
  - For a given # of iterations, how accurate are the rules?
  - For a given # of iterations, qualitative look at the rules?



| MCTS<br>Iterations                                           | 2036                                                                       | 50                                                                         | 100                                                                                                   | 200                                                                                              | 400                                                                                                                                                                                 |
|--------------------------------------------------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Discovered<br>Ruleset for<br>Fastest<br>Performance<br>Class | $y_L \rightarrow CES-b4-PostSend$ $y_L \times Pack$ $Pack \rightarrow y_L$ | $y_L \rightarrow CES-b4-PostSend$ $y_L \times Pack$ $Pack \rightarrow y_L$ | $y_L \rightarrow CES-b4-PostSend$ $y_L \times Pack$ $Pack \rightarrow y_L$ $y_L \rightarrow WaitSend$ | $y_L \rightarrow CES-b4-PostSend$ $y_L \times Pack$ Pack before $y_L$ $y_L \rightarrow WaitSend$ | $y_L \rightarrow WaitRecv$ $PostSend \rightarrow y_L$ $Pack \rightarrow y_L$ $CER$ -after-Pack $\rightarrow y_L$ $y_L \rightarrow WaitSend$ $PostRecv \rightarrow CES$ -b4-PostSend |

 $A \times B$ : A different stream than B

 $A \rightarrow B$ : A, then B

Most populous ruleset shown

#### Vision for this work

#### Current

- C++ MCTS implementation for MPI/CUDA codes with multiple streams
- Prototype feature-vector and decision tree training using SciKit in Python
- In progress open sourcing

#### Upcoming

- Apply to key Tpetra distributed linear algebra operations
- MCTS search strategies
- Better rollout techniques

#### Future Explorations

- Identify unexpected performance effects on target platforms ("performance bugs")
- What to do as communication / computation are more tightly integrated

#### Summary

- Represent CUDA+MPI operation as DAG
- Automatically generate human-interpretable rules for library design
- Maintain human provenance of implementation (no "black boxes")