October 21st, 2020
Evaluate Unified Virtual Memory (UVM) technology's performance for graph algorithms
| Algorithm Covered | Computational Complexity | Previous work | To be finished in the project |
|---|---|---|---|
| BFS | Polynomial O(|V|) | 1.Open MP version based on traditional algorithm 2. GPU version based on traditional algorithm |
CUDA GPU UVM version based on HALO-I algorithm2 CUDA GPU version based on HALO-I algorithm2 |
| k-truss decomposition |
Polynomial | 1.OpenMP version based on peeling algorithm 2.OpenMP version based on SND algorithm 3.CUDA GPU version based on traditional algorithm5 |
CUDA GPU UVM version based on SND algorithm |
- Why to choose these algorithms:
I want to explore the performance of UVM technology in multiple computational complexities for large scale graph data. Hence I get BFS (linear complexity) and K-truss (generally quadratic complexity).
| Time Period | TODOs |
|---|---|
| 2020.10.05 - 2020.10.15 | literature reading, CUDA UVM program tutorial, reevaluate and improve the project topic. |
| 2020.10.16 - 2020.10.26 | GPU HALO-I algorithm2 |
| 2020.10.27 - 2020.11.10 | GPU UVM version based on HALO-I algorithm2 |
| 2020.11.11 - 2020.11.20 | GPU UVM version based on SND algorithm3 |
| 2020.11.20 - | Summarizing and doc writing. |
| Algorithm | Metrics | Trade-off |
|---|---|---|
| BFS | speedup, efficiency | / |
| k-truss | speedup, accuracy, iteration numbers | speedup / iteration numbers v.s. accuracy |
How to deal with: Generally the accuracy should be guaranteed first, in the reference paper, when the optimized algorithm goes >to convergence, at least 98% accuracy should be reached by the algorithm.
Basic
- Implement planned algorithms under GPU UVM technique, pass the test;
- Compare the GPU UVM version program with OpenMP (CPU) version, pure GPU version programs to evaluate GPU UVM's performance in graph algorithms;
- Reach speedup and accuracy at least faster than OpenMP version, and seem to be reasonable as shown in https://dl.acm.org/doi/abs/10.14778/3384345.3384358 for BFS and https://arxiv.org/abs/1704.00386 for k-truss.
Advanced
- Give feasible improvement suggestion for algorithms implemented
- Show by experiment that the suggestion works.
Project Proposal
October 5th, 2020
Topic: Will Unified Memory Be the Final Piece of the GPU Programming Puzzle?
Subtitle: From the view of k-core algorithms
Motivations:
When adopting graph mining algorithms with GPU, most prior work on graph applications has been restricted to the size of the graph due to the limited capacity of GPU memory1,2. Recently a novel technology known as unified virtual memory, which supports unified virtual memory between CPUs and GPUs, makes it possible to address the very puzzle. However, such an approach to accessing host memory while computing is probably much slower, and the problem has been seldom explored1. Hence, I want to focus on implementing and testing some specific graph mining algorithms to catch a glimpse of it.
Prasun Gera at al. (2020)1, has done exciting work about evaluating large graphs traversal on GPUs with unified memory and designing a lightweight offline graph reordering algorithm (HALO) as a preprocessing step to boost the algorithm. Their work yields speedups of 1.5x-1.9x over the baseline. Their work will be a great guideline for my project.
While I looked through the papers in the Github repository, I noticed that k-cores and BFS are so popular to be targets of optimization in the field of HPC on GPUs. Hang Liu (2019)6, Laxman Dhulipala (2017)5, and Ahmet Erdem (2018)7 have all done great jobs to design fast parallel computing frameworks to accelerate graph algorithms, and they all select the k-core algorithm as part of their experiments. However, none of them did it on united virtual memory and imported large scale of graphs as input. I think I can take advantage of their optimization ideas to help me implement the k-core algorithm on united memory and their results can be the baseline results for me to reach the conclusion.
Goals:
- Understand how unified virtual memory works under CUDA architecture, enable to write k-core and other graph algorithms (e.g. BFS, DFS) using unified virtual memory works;
- Import large graphs in Prasun Gera at al. (2020)1's paper, try to define the problem for k-core algorithm and manage to prove UVM is working while running the program;
- Compare the UVM program with GPU-only approaches on their performance, and summarize what causes the difference.
- Take Prasun Gera at al. (2020)1, Hang Liu (2019)6, Laxman Dhulipala (2017)5, and Ahmet Erdem (2018)7 's work into consideration, as well as optimization tutorials on Nvidia's official website3,4. Test whether HALO also works for k-core algorithm, and also try to come up with some other feasible optimizing methods to boost the program above.
- Review the experiment and make a judgement on whether Unified Memory Will Be the Final Piece of the GPU Programming Puzzle, point out its advantages and disadvantages.
- * Try different algorithms to make comparison. (like gSpan8,9 etc.)
Expected Results:
- CUDA source codes of baseline and improved k-core algorithm on GPU with unified memory. If time permits, there will be another version of gSpan algorithm.
- A detailed report on all the experiment data, together with how the improvement works and the analysis on the unified memory.
Schedule:
- 2020.10.05 - 2020.10.15: literature reading, CUDA UVM program tutorial, reevaluate and improve the project topic.
- 2020.10.15 - 2020.10.26: Implement baseline algorithms (UVM version, GPU-only version, CPU-only version, etc. ).
- 2020.10.26 - 2020.11.10: Work on the improvement ideas.
- 2020.11.10 - : Summarizing and report writing.
-
Unified Memory for CUDA Beginners, referred on November 15, 2020