A cuBLAS- and CUDA-based implementation of multi-GPU large matrix multiplication. It is a standalone C/C++ command-line application to launch large matrix-matrix multiplications and get profiled outputs on a GPU cluster. It can be easily transformed into a C/C++ library given its lightweight codebase.
- CUDA - A parallel computing platform and programming model developed by NVIDIA for GPU-accelerated computing.
- cuBLAS - The NVIDIA CUDA Basic Linear Algebra Subprograms (cuBLAS) library for efficient GPU-accelerated linear algebra operations.
- Dependencies
- Environment
- Important Files
- Installation
- Documentation
- Running the Application
- Available Options
The LargeMM application relies on the following dependencies:
Dependency | Version |
---|---|
CUDA | 11.6.1+ |
GCC | 10.3.0+ |
CMake | 3.24.2+ |
CUDA modules should be loaded prior to compilation or execution.
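
One way to confirm that the loaded toolchain meets the minimums in the table above is a small version check. The sketch below is not part of LargeMM: `version_ge`, `first_version`, and `check` are hypothetical helper names, and the commands used to read each version are an assumption about your system. Module names are site-specific and are not given here, so load them with your cluster's usual mechanism first.

```shell
#!/bin/sh
# Hypothetical pre-build check (not part of LargeMM): compare tool versions
# with the minimums listed in the dependency table.

# version_ge HAVE WANT: succeed when dotted numeric version HAVE >= WANT.
version_ge() {
    have=$1
    want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        h=${have%%.*}
        w=${want%%.*}
        # Missing fields count as 0, so "11.6" is treated as "11.6.0".
        [ "${h:-0}" -gt "${w:-0}" ] && return 0
        [ "${h:-0}" -lt "${w:-0}" ] && return 1
        # Drop the field just compared; stop when no dot remains.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    return 0
}

# first_version: print the first dotted number found on the first input line.
first_version() {
    sed -n '1s/[^0-9]*\([0-9][0-9.]*\).*/\1/p'
}

# check NAME HAVE WANT: print one status line for a dependency.
check() {
    if [ -z "$2" ]; then
        echo "$1: version not detected (minimum $3)"
    elif version_ge "$2" "$3"; then
        echo "$1: $2 meets minimum $3"
    else
        echo "$1: $2 is older than minimum $3"
    fi
}

# Assumption: each tool reports its version this way on your system.
gcc_ver=$(gcc -dumpfullversion 2>/dev/null | first_version)
cmake_ver=$(cmake --version 2>/dev/null | first_version)
# nvcc prints "release X.Y"; the update level (the ".1" in 11.6.1) does not
# appear in that field, so only major.minor is compared here.
cuda_ver=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')

check CUDA "$cuda_ver" 11.6
check GCC "$gcc_ver" 10.3.0
check CMake "$cmake_ver" 3.24.2
```

Run it after loading the CUDA, GCC, and CMake modules; each line reports whether the detected version meets the minimum.
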
This application is designed to run on 1-4 Tesla V100 SXM2. The default environment is a GPU node in Gadi.
- `data` folder stores performance data of LargeMM.
- `profile` folder stores profiler timeline files for the performance of `v2_ngpus_reduction`, `v1_1_n_streams`, and `base_cublasDgemm`.
- `test` folder stores tests for `v2_ngpus_reduction`, `v1_1_n_streams`, and `base_cublasDgemm`.
- Clone the repository into your workspace and navigate to the project directory:

      git clone https://github.com/Zlisch/LargeMM.git
      cd LargeMM

- Run the installation script:

      chmod +x ./INSTALL.sh
      ./INSTALL.sh

Or you can directly download the latest executable from the link.
You can either view the documentation in the header files of the cloned repository or, if you are using Visual Studio Code, open the HTML documentation as follows:
- Install the Live Server extension in your Visual Studio Code. To enable Live Server, press `cmd`+`shift`+`p` in your Visual Studio Code and type `live server` in the prompt. Select `open with live server`.
- With the Live Server extension enabled, enter `http://127.0.0.1:5500/docs/html/globals.html` in your browser to view the documentation.
After running `./INSTALL.sh`, use the following to run `v2_ngpus_reduction` with the lookup table on 4 GPUs and print the output.

    ./bin/largemm -s "-1" -m 28377 -a 2 -g 4

To run LargeMM with the NVIDIA Nsight Systems profiler (`nsys`), use:

    nsys profile --stats=true ./bin/largemm -s "-1" -m 28377 -a 2 -g 4

Or you can build your own run script. A run script template is provided in `./run.sh`.
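
A custom run script could look like the sketch below. It is hypothetical and does not reproduce `./run.sh`: the `run_one` helper and the `LARGEMM_BIN`, `DRY_RUN`, and `PROFILE` variables are names invented for illustration. It sweeps the stream stride for `v2_ngpus_reduction` on 4 GPUs; only `-s "-1"` and `-s 3` appear in this README, so the other stride values are an assumption. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical run script (not the repository's ./run.sh).
# Prints each command; set DRY_RUN=0 to execute, PROFILE=1 to wrap runs in nsys.

LARGEMM_BIN=${LARGEMM_BIN:-./bin/largemm}
DRY_RUN=${DRY_RUN:-1}
PROFILE=${PROFILE:-0}

ROWS=28377   # -m: row dimension
ALGO=2       # -a: 2 selects v2_ngpus_reduction
GPUS=4       # -g: number of GPUs

# run_one STRIDE: print (and optionally execute) one largemm invocation.
run_one() {
    set -- "$LARGEMM_BIN" -s "$1" -m "$ROWS" -a "$ALGO" -g "$GPUS"
    [ "$PROFILE" = 1 ] && set -- nsys profile --stats=true "$@"
    echo "$*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@"
    fi
}

# -1 uses the lookup table; 1, 2, 3 correspond to 1, 4, 9 streams per GPU.
for stride in -1 1 2 3; do
    run_one "$stride"
done
```

Keeping the dry-run default makes it easy to inspect the sweep before submitting it to a GPU node.
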
-s
- Description: Specify the stream stride (square root of the number of streams to be used) for each GPU. If `-1` is given, the lookup table will be used instead to decide the number of streams for each GPU.
- Example: Run `v2_ngpus_reduction` with 9 streams for each GPU on 4 GPUs and print the output.

      ./bin/largemm -s 3 -m 28377 -a 2 -g 4

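
Because the stride is the square root of the stream count, the number of streams per GPU is the stride squared. The helper below is a hypothetical illustration of that arithmetic only (`streams_per_gpu` is not a LargeMM function), and it does not model what the lookup table chooses for `-1`.

```shell
#!/bin/sh
# Illustration only: streams per GPU implied by a given -s value.

# streams_per_gpu STRIDE: print STRIDE squared, or a note for the -1 case.
streams_per_gpu() {
    if [ "$1" = "-1" ]; then
        echo "lookup table"
    else
        echo $(( $1 * $1 ))
    fi
}

for stride in -1 1 2 3; do
    printf 'stride %s -> %s\n' "$stride" "$(streams_per_gpu "$stride")"
done
```

With `-s 3`, as in the example above, each GPU gets 9 streams.
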
-a
- Description: Specify the algorithm to run.
Value | Algorithm Version |
---|---|
0 | base_cublasDgemm |
1 | v1_1_n_streams |
2 | v2_ngpus_reduction |
3 | v2_ngpus_parallel_a |
4 | v2_ngpus_parallel_a_n_streams_breadth |
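
In a wrapper script it can be handy to turn the `-a` value into the algorithm name from the table above, for example to label log files. A minimal sketch, where `algo_name` and the log file naming scheme are hypothetical and not part of LargeMM:

```shell
#!/bin/sh
# Hypothetical helper: map an -a value to the algorithm name in the table.
algo_name() {
    case $1 in
        0) echo base_cublasDgemm ;;
        1) echo v1_1_n_streams ;;
        2) echo v2_ngpus_reduction ;;
        3) echo v2_ngpus_parallel_a ;;
        4) echo v2_ngpus_parallel_a_n_streams_breadth ;;
        *) echo "unknown -a value: $1" >&2; return 1 ;;
    esac
}

# Example: build a log file name for a run with -a 2 on 4 GPUs.
algo=2
gpus=4
echo "largemm_$(algo_name "$algo")_g${gpus}.log"
```

Values outside 0 to 4 make the helper return a non-zero status, so a script can stop before launching.
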
-m
- Description: Row dimension of the matrix.
- Example: Run `v2_ngpus_reduction` on a square matrix of size 6GB (row width 28377 if double precision is used).

      ./bin/largemm -s 3 -m 28377 -a 2 -g 4

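
The link between 6GB and 28377 can be checked with a little arithmetic: a square matrix with m rows of 8-byte doubles occupies 8 * m^2 bytes, and 28377 is the largest m for which that stays within 6 * 2^30 bytes. Reading 6GB as binary gigabytes is an interpretation that the numbers support, not something stated above. The sketch uses hypothetical helper names (`isqrt`, `matrix_bytes`, `max_rows`) and covers a single matrix only, not the total footprint of a multiplication.

```shell
#!/bin/sh
# Illustration only: size of one square double-precision matrix.

ELEM_BYTES=8   # bytes per double-precision element

# isqrt N: integer square root of N (floor), by Newton's method on integers.
isqrt() {
    n=$1
    x=$n
    y=$(( (x + 1) / 2 ))
    while [ "$y" -lt "$x" ]; do
        x=$y
        y=$(( (x + n / x) / 2 ))
    done
    echo "$x"
}

# matrix_bytes ROWS: bytes needed for one ROWS x ROWS matrix.
matrix_bytes() {
    echo $(( $1 * $1 * ELEM_BYTES ))
}

# max_rows BYTES: largest row dimension whose square matrix fits in BYTES.
max_rows() {
    isqrt $(( $1 / ELEM_BYTES ))
}

six_gib=$(( 6 * 1024 * 1024 * 1024 ))
echo "bytes for m=28377: $(matrix_bytes 28377)"     # 6442033032
echo "largest m in 6 GiB: $(max_rows "$six_gib")"   # 28377
```

The first value, 6442033032 bytes, sits just under 6 * 2^30 = 6442450944 bytes.
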
-g
- Description: Specify the number of GPU(s) to use. Cannot be zero.
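
Before launching, a wrapper can compare the requested `-g` value with the GPUs the driver reports. This is a hedged sketch rather than LargeMM's own validation: `visible_gpus` and `gpus_ok` are hypothetical helpers, and counting devices with `nvidia-smi -L` assumes the NVIDIA driver utilities are on the `PATH`.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the -g option (not part of LargeMM).

# visible_gpus: number of GPUs listed by nvidia-smi, or 0 if it is unavailable.
visible_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L 2>/dev/null | grep -c '^GPU '
    else
        echo 0
    fi
}

# gpus_ok REQUESTED AVAILABLE: succeed when 1 <= REQUESTED <= AVAILABLE.
gpus_ok() {
    [ "$1" -ge 1 ] && [ "$1" -le "$2" ]
}

requested=4
available=$(visible_gpus)
if gpus_ok "$requested" "$available"; then
    echo "ok: -g $requested with $available GPU(s) visible"
else
    echo "cannot use -g $requested: $available GPU(s) visible"
fi
```

The check rejects zero, matching the option description, and anything above the visible device count.
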