# **Graph Traversal**

Graph Search



#### **Graph and Matrix**

- A graph is a data structure that represents the relationships between entities
  - A graph is a data structure that represents the relationships between entities
- Graphs are intrinsically related to sparse matrices
  - An intuitive representation of a graph is an adjacency matrix
  - Thus, graph computation can also be formulated in terms of sparse matrix operations
- Store them can be done using a different format
  - CSR/CSC
  - o COO





### **Graph Usage Examples**

- They are used to model different problems from different fields
  - Social media connection graphs
  - Driving directions
  - Telecommunication networks
  - Manufacturing process dependencies
  - Computation graph
  - o 3D Meshes
  - Graphical models
- These graphs tend to be massive and sparse



# **Breadth-First Search** (BFS)

- Given a source node S, find the number of steps required to reach each node N in the graph
  - BFS is often used to discover the shortest number of edges that one needs to traverse to go from one vertex to another vertex of the graph



# **BSF Graph Computation Example**

- While designing an integrated circuit chip, many electronic components need to be connected to complete the design
  - The routing software represents the chip as a grid of wiring blocks in which each block can potentially serve as a piece of a wire
- The maze routing application represents the chip as a graph
  - The routing blocks are vertices
  - $\circ$  An edge from vertex i to vertex j indicates that one can extend a wire from block i to block j





#### **Vertix-Center BFS Computation**

- Iterate or assign each thread to a vertex
  - For each iteration, check all incoming edges to see if the source vertex was just visited in the last iteration; if so, mark as visited in this iteration
  - $\circ$  Not very work efficient, complexity of O(VL)
    - V number of vertices
    - *L* length of the longest path
  - Difficult to detect stopping criterion



#### Frontier-Center Parallel BFS

- Assign threads to frontier vertices from the previous iteration
  - Add all the non-visited neighbors to the next frontier
  - The source will be the first element in the frontier
- Parallel execution between threads
  - A variable number of unnecessary threads are launched



# **Problems and Optimizations**

- Privatization
- Texture Memory
- Kernel launch overhead
- Sub Block-Level Queue
- Load Imbalance

| Optimization                                          | Benefit to compute cores                                               | Benefit to memory                                                                 | Strategies                                                                                                                                                                                                                      |
|-------------------------------------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Maximizing occupancy                                  | More work to hide<br>pipeline latency                                  | More parallel<br>memory accesses<br>to hide DRAM<br>latency                       | Tuning usage of SM resources such as threads per block, shared memory per block, and registers per thread                                                                                                                       |
| Enabling<br>coalesced<br>global<br>memory<br>accesses | Fewer pipeline<br>stalls waiting for<br>global memory<br>accesses      | Less global memory<br>traffic and better<br>utilization of bursts/<br>cache lines | Transfer between global memory and shared memory in a coalesced manner and performing uncoalesced accesses in shared memory (e.g., comer turning) Rearranging the mapping of threads to data Rearranging the layout of the data |
| Minimizing<br>control<br>divergence                   | High SIMD<br>efficiency (fewer<br>idle cores during<br>SIMD execution) | -                                                                                 | Rearranging the mapping<br>of threads to work and/or<br>data<br>Rearranging the layout of<br>the data                                                                                                                           |
| Tiling of reused data                                 | Fewer pipeline<br>stalls waiting for<br>global memory<br>accesses      | Less global memory traffic                                                        | Placing data that is reused within a block in shared memory or registers so that it is transferred between global memory and the SM only once                                                                                   |
| Privatization<br>(covered<br>later)                   | Fewer pipeline<br>stalls waiting for<br>atomic updates                 | Less contention and<br>serialization of<br>atomic updates                         | Applying partial updates to<br>a private copy of the data<br>and then updating the<br>universal copy when done                                                                                                                  |
| Thread<br>coarsening                                  | Less redundant<br>work, divergence,<br>or synchronization              | Less redundant<br>global memory<br>traffic                                        | Assigning multiple units of<br>parallelism to each thread<br>to reduce the price of<br>parallelism when it is<br>incurred unnecessarily                                                                                         |



# Frontier Output Interference



#### **Privatization**

- Privatization reduces contention of atomics by applying partial updates to a private copy of the data, then updating the public copy when done
  - Threads will contend on the same data only with other threads in the same block
  - The local frontier and its counter can be stored in shared memory
    - Which enables lower-latency atomic operations





# Irregular global memory access

Access patterns depend on graph structure and are unpredictable





# **Memory Optimization Flowchart**



# **Texture Memory**

- Texture memory is another form of global memory
  - Like constant memory, it is aggressively cached for read-only access
- Originally developed and optimized for storing and reading textures for graphics applications
  - Has hardware-level support for 1-,2-, or3-D layouts and interpolated reads
  - The texture cache is also spatial layout-aware
- It can be useful for irregular access patterns with un-coalesced reads



#### Kernel Launch Overhead

- For some iterations of BFS (especially near the beginning), the frontier can be quite small
  - The benefits of parallelism only outweigh the kernel launch overhead when the frontier becomes large enough
- Use the CPU if the frontier size dips below some threshold



# Block-level queue length counter contention

- While the block-level queues reduced contention for global memory, the block-level counter is now the bottleneck
- We can extend the hierarchy by further splitting the block-level queue

| Global queue:             |                           |
|---------------------------|---------------------------|
|                           |                           |
|                           |                           |
| Block queue:              | Block queue:              |
| Sub-queue 0: Sub-queue 1: | Sub-queue 0: Sub-queue 1: |
| Sub-queue 2: Sub-queue 3: | Sub-queue 2: Sub-queue 3: |
| Block 0                   | Block 1                   |

#### Sub-Queue





#### **Credits**

• <u>CSE 599 I</u> Accelerated Computing - Programming GPUS - Parallel Pattern: Graph Traversal, Author *Tanner Schmidt* 



# Thanks for your attention!

Gianmarco Accordi gianmarco.accordi@polimi.it

