Comparison of matrix multiplication (C = A × B) implementations: sequential, distributed-parallel (MPI), and GPU-accelerated (CUDA, cuBLAS).
Parallel Programming course, Fasilkom UI.
| File | Description | Platform |
|---|---|---|
| matrix_mul_seq.c | Sequential algorithm in ikj loop order; three nested loops | CPU (1 core) |
| matrix_mul_mpi.c | MPI parallelization; rows distributed via MPI_Scatter/MPI_Gather; matrix B broadcast to all ranks | CPU cluster |
| matrix_mul_cuda.cu | Three CUDA kernel modes: coalesced (2D grid), row-wise (1D grid), uncoalesced (x/y swapped) | GPU (CUDA) |
| matrix_mul_cuda_shared.cu | Tiled kernel using shared memory (__shared__) | GPU (CUDA) |
| matrix_mul_cublas.cu | cuBLAS library (cublasDgemm) | GPU (cuBLAS) |
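
The ikj loop order and the row-wise distribution listed in the table can be sketched in plain C. This is a minimal illustration under stated assumptions, not the code in matrix_mul_seq.c or matrix_mul_mpi.c: the function name `matmul_ikj_rows`, the row-major `double` layout, and the idea that one rank would call it on its own block of rows are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Multiply `rows` rows of A (rows x n) by B (n x n) into C (rows x n).
 * All matrices are stored row-major. The loops run in i, k, j order, so
 * the innermost loop walks one row of B and one row of C contiguously.
 * (Hypothetical helper, not taken from the repository.)
 */
void matmul_ikj_rows(const double *A, const double *B, double *C,
                     int rows, int n)
{
    assert(rows >= 0 && n > 0);

    for (int i = 0; i < rows; i++) {
        double *c_row = C + (size_t)i * n;

        /* Clear the output row before accumulating into it. */
        for (int j = 0; j < n; j++)
            c_row[j] = 0.0;

        for (int k = 0; k < n; k++) {
            const double a_ik = A[(size_t)i * n + k];
            const double *b_row = B + (size_t)k * n;

            for (int j = 0; j < n; j++)
                c_row[j] += a_ik * b_row[j];
        }
    }
}
```

Calling it with `rows = n` covers the sequential case. Calling it with `rows = n / ranks` on a contiguous block of A, with the full B available, would correspond to the share of work a single rank handles after a scatter of rows and a broadcast of B.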
| Target | Compiler / Runtime |
|---|---|
| seq | gcc with OpenMP |
| mpi | mpicc (OpenMPI / MPICH) |
| cuda | nvcc (CUDA Toolkit) + GPU NVIDIA |
| cuda_shared | nvcc + GPU NVIDIA |
| cublas | nvcc + cuBLAS + GPU NVIDIA |
All targets (CPU + GPU):
make

CPU only (seq + mpi):
make -f Makefile.cpu

GPU only (cuda + cuda_shared + cublas):
make -f Makefile.gpu

Individual targets:
make seq # matrix_mul_seq
make cuda # matrix_mul_cuda
make cuda_shared # matrix_mul_cuda_shared
make cublas # matrix_mul_cublas
make mpi # matrix_mul_mpi
make clean # remove all binaries

General argument format: <N> [options]. Every program accepts N (the size of the N×N matrices, default: 512).
./matrix_mul_seq <N>

./matrix_mul_cuda <N> <blockSize> <mode> <verify>
- blockSize: 8, 16, or 32 (default: 16)
- mode: 0 = coalesced (2D), 1 = row-wise, 2 = uncoalesced (default)

Examples:
./matrix_mul_cuda 1024 16 0 0 # coalesced, N=1024, no verification
./matrix_mul_cuda 1024 # uncoalesced (default), N=1024
./matrix_mul_cuda 512 32 # uncoalesced (default), N=512, block=32

./matrix_mul_cuda_shared <N> <blockSize> <verify>
- blockSize: 8, 16, or 32
- verify: 0 or 1
./matrix_mul_cublas <N> <verify>

mpirun -np <ranks> ./matrix_mul_mpi <N>
N must be evenly divisible by the number of ranks.

Example:
mpirun -np 4 ./matrix_mul_mpi 1024

make run-seq
make run-cuda
make run-cuda-shared
make run-cublas
make run-mpi # default: 8 ranks, N=512

Each program prints:
- Computation time: execution time of the core computation (kernel or matrix loop)
- Communication time: data transfer time (host ↔ device, or MPI)
- Checksum: the sum of all elements of C, used as a numerical check (expected: N³ × 2)
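
The expected checksum can be reproduced with a few lines of C. The sketch below assumes, purely for illustration, that A is filled with 1.0 and B with 2.0: in that case every element of C equals 2N, and summing N² of them gives 2N³. The actual initialization used by the programs is not stated here, and the helper names `expected_checksum` and `checksum` are hypothetical.

```c
#include <stddef.h>

/* Expected sum of all elements of C, i.e. 2 * N^3.
 * Assumes A = 1.0 and B = 2.0 everywhere (an assumption for this sketch). */
double expected_checksum(int n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Sum of all n*n elements of a row-major result matrix C. */
double checksum(const double *C, int n)
{
    double sum = 0.0;
    const size_t count = (size_t)n * (size_t)n;

    for (size_t i = 0; i < count; i++)
        sum += C[i];

    return sum;
}
```

For N = 512 the formula gives 2 × 512³ = 268435456, which matches the value in the sample output.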
Example output:
CUDA Matrix Multiplication (N=512, blockSize=16, mode=Coalesced)
============================================================
Grid: 32x32, Block: 16x16
Computation time: 0.002341 seconds
Communication time: 0.009876 seconds
Checksum: 268435456.000000 (expected: 268435456.0)
Verification PASSED
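
The grid and block sizes in the sample output are consistent with the usual ceiling-division launch configuration for a 2D grid. The sketch below shows that arithmetic in C. It is an assumption that the program computes its grid this way, and the helper names `grid_dim` and `threads_per_block` are hypothetical.

```c
/* Number of blocks needed along one dimension so that
 * grid_dim * block_size covers at least n elements (ceiling division). */
int grid_dim(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Threads in one square 2D block of block_size x block_size. */
int threads_per_block(int block_size)
{
    return block_size * block_size;
}
```

With N = 512 and blockSize = 16 this gives a 32x32 grid of 16x16 blocks, as in the output above. A blockSize of 32 yields 1024 threads per block, which is the per-block thread limit on current NVIDIA GPUs.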
Create a .tar.gz archive containing only the relevant files:
./package_cpu.sh # → pr2_cpu_matrix_mul_YYYYMMDD.tar.gz
./package_gpu.sh # → pr2_gpu_matrix_mul_YYYYMMDD.tar.gz

The YAML files are used to deploy an NVHPC container on a GPU cluster with an NFS volume. Change the NFS server and path to match your own environment.
- pods-working-user03-gpu02.yaml (node gputype: gpu-02)
- pods-working-user03-gpu03.yaml (node gputype: gpu-03)

kubectl apply -f pods-working-user03-gpu02.yaml
kubectl apply -f pods-working-user03-gpu03.yaml