adipppp/pr2-gpu

Matrix Multiplication — CPU vs GPU

A comparison of matrix multiplication (C = A × B) implementations: sequential, distributed-parallel (MPI), and GPU-accelerated (CUDA, cuBLAS).

Coursework for the Parallel Programming course (Pemrograman Paralel), Fasilkom UI.

Implementations

| File | Description | Platform |
| --- | --- | --- |
| `matrix_mul_seq.c` | Sequential ikj algorithm; 3 nested loops | CPU (1 core) |
| `matrix_mul_mpi.c` | MPI parallelization; rows distributed via `MPI_Scatter`/`MPI_Gather`; matrix B is broadcast | CPU cluster |
| `matrix_mul_cuda.cu` | Three CUDA kernel modes: coalesced (2D grid), row-wise (1D grid), uncoalesced (swap x/y) | GPU (CUDA) |
| `matrix_mul_cuda_shared.cu` | Tiled kernel using shared memory (`__shared__`) | GPU (CUDA) |
| `matrix_mul_cublas.cu` | cuBLAS library (`cublasDgemm`) | GPU (cuBLAS) |
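
As a rough illustration of the ikj ordering mentioned for matrix_mul_seq.c, the sketch below multiplies two row-major N×N matrices with j as the innermost loop. The function name `matmul_ikj` and the flat `double` array layout are assumptions made for this example, not taken from the repository.

```c
#include <stddef.h>

/* Hypothetical helper: C = A x B for row-major n x n matrices, i-k-j order.
 * With j innermost, both B and C are walked along a row (stride 1),
 * which is friendlier to the CPU cache than the textbook i-j-k order. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    /* Clear the output, since the loop below accumulates into it. */
    for (size_t idx = 0; idx < n * n; idx++)
        C[idx] = 0.0;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double a_ik = A[i * n + k];      /* reused across the whole j loop */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a_ik * B[k * n + j];
        }
    }
}
```

Calling `matmul_ikj(2, A, B, C)` with A = {1, 2, 3, 4} and B = {5, 6, 7, 8} fills C with the usual 2×2 product.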

Prerequisites

| Target | Compiler / Runtime |
| --- | --- |
| seq | gcc with OpenMP |
| mpi | mpicc (OpenMPI / MPICH) |
| cuda | nvcc (CUDA Toolkit) + NVIDIA GPU |
| cuda_shared | nvcc + NVIDIA GPU |
| cublas | nvcc + cuBLAS + NVIDIA GPU |

Build

All targets (CPU + GPU):

make

CPU only (seq + mpi):

make -f Makefile.cpu

GPU only (cuda + cuda_shared + cublas):

make -f Makefile.gpu

Individual targets:

make seq                   # matrix_mul_seq
make cuda                  # matrix_mul_cuda
make cuda_shared           # matrix_mul_cuda_shared
make cublas                # matrix_mul_cublas
make mpi                   # matrix_mul_mpi
make clean                 # remove all binaries

How to Run

General argument format: <N> [options]. Every program accepts N (the size of the N×N matrices, default: 512).

matrix_mul_seq

./matrix_mul_seq <N>

matrix_mul_cuda

./matrix_mul_cuda <N> <blockSize> <mode> <verify>
  • blockSize: 8, 16, or 32 (default: 16)
  • mode: 0 = coalesced (2D), 1 = row-wise, 2 = uncoalesced (default)
  • verify: 0 or 1 (0 = skip verification)

Examples:

./matrix_mul_cuda 1024 16 0 0   # coalesced, N=1024, no verification
./matrix_mul_cuda 1024          # uncoalesced (default), N=1024
./matrix_mul_cuda 512 32        # uncoalesced (default), N=512, block=32
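
The coalesced and uncoalesced modes differ in how thread indices map to matrix elements. The sketch below models that on the host in plain C. It assumes that in the coalesced mode the x index selects the column and that "swap x/y" makes it select the row instead. The actual kernels may differ in detail, and the names `element_offset` and `stride_between_x_neighbours` are hypothetical. Block indices are ignored for simplicity.

```c
#include <stddef.h>

/* Row-major offset of the C element handled by thread (tx, ty).
 * swap_xy = 0: x -> column, y -> row    (assumed "coalesced" mapping)
 * swap_xy = 1: x -> row,    y -> column (assumed "uncoalesced" mapping) */
size_t element_offset(size_t n, size_t tx, size_t ty, int swap_xy)
{
    size_t row = swap_xy ? tx : ty;
    size_t col = swap_xy ? ty : tx;
    return row * n + col;
}

/* Distance, in elements, between the accesses of two threads that are
 * adjacent in x. A stride of 1 lets a warp touch one contiguous segment;
 * a stride of n scatters the warp's accesses across n-element gaps. */
size_t stride_between_x_neighbours(size_t n, int swap_xy)
{
    return element_offset(n, 1, 0, swap_xy) - element_offset(n, 0, 0, swap_xy);
}
```

Under these assumptions, N=1024 gives a stride of 1 element for the coalesced mapping and 1024 elements for the swapped one, which is the usual reason the latter is slower.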

matrix_mul_cuda_shared

./matrix_mul_cuda_shared <N> <blockSize> <verify>
  • blockSize: 8, 16, or 32
  • verify: 0 or 1

matrix_mul_cublas

./matrix_mul_cublas <N> <verify>

matrix_mul_mpi

mpirun -np <ranks> ./matrix_mul_mpi <N>

N must be evenly divisible by the number of ranks.

Example:

mpirun -np 4 ./matrix_mul_mpi 1024
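
The divisibility requirement follows from how MPI_Scatter works: it sends the same number of elements to every rank, so the rows of A have to split evenly. The sketch below shows that arithmetic in plain C without calling MPI. The helper names `rows_per_rank` and `scatter_count` are hypothetical and not taken from matrix_mul_mpi.c.

```c
#include <stddef.h>

/* Rows of A handled by each rank, or 0 if n does not split evenly
 * (or ranks is 0), in which case an equal-count scatter cannot be used. */
size_t rows_per_rank(size_t n, size_t ranks)
{
    if (ranks == 0 || n % ranks != 0)
        return 0;
    return n / ranks;
}

/* Number of doubles each rank would receive when whole rows of an
 * n x n matrix are scattered: (rows per rank) * (row length). */
size_t scatter_count(size_t n, size_t ranks)
{
    return rows_per_rank(n, ranks) * n;
}
```

For the example above (N=1024, 4 ranks), each rank would get 256 rows, or 262144 doubles. With 3 ranks the helper returns 0 because 1024 is not divisible by 3.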

make run targets

make run-seq
make run-cuda
make run-cuda-shared
make run-cublas
make run-mpi      # defaults: 8 ranks, N=512

Output Metrics

Every program prints:

  • Computation time: execution time of the core work (kernel / matrix loop)
  • Communication time: data transfer time (host ↔ device, or MPI)
  • Checksum: the sum of all elements of C, used as a numerical sanity check (expected: N³ × 2)

Sample output:
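
The expected checksum can be computed directly from N. The formula N³ × 2 is what you would get if, for example, every element of A were 1.0 and every element of B were 2.0. That initialization is an assumption, since the README does not say how the matrices are filled. The helper names and the relative-tolerance comparison below are likewise only a sketch.

```c
/* Expected sum of all elements of C according to the README: 2 * N^3. */
double expected_checksum(unsigned int n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Returns 1 if checksum is within a relative tolerance of the expected
 * value, 0 otherwise. The tolerance is a hypothetical parameter. */
int checksum_matches(double checksum, unsigned int n, double rel_tol)
{
    double expected = expected_checksum(n);
    double diff = checksum - expected;
    if (diff < 0.0)
        diff = -diff;
    return diff <= rel_tol * expected;
}
```

For N=512, `expected_checksum` gives 268435456.0.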

CUDA Matrix Multiplication (N=512, blockSize=16, mode=Coalesced)
============================================================
Grid: 32x32, Block: 16x16
Computation time: 0.002341 seconds
Communication time: 0.009876 seconds
Checksum: 268435456.000000 (expected: 268435456.0)
Verification PASSED
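
The grid size in the sample output follows from N and blockSize: 512 / 16 = 32 blocks per dimension. A common way to compute this is ceiling division, sketched below. Whether the programs round up for sizes that are not a multiple of blockSize is an assumption, and `blocks_per_dim` is a hypothetical name.

```c
/* Number of thread blocks needed along one dimension so that
 * blocks * block_size >= n (ceiling division). block_size must be > 0. */
unsigned int blocks_per_dim(unsigned int n, unsigned int block_size)
{
    return (n + block_size - 1) / block_size;
}
```

With N=512 and blockSize=16 this gives 32, matching the 32x32 grid shown above.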

Packaging

Create a .tar.gz archive containing only the relevant files:

./package_cpu.sh      # → pr2_cpu_matrix_mul_YYYYMMDD.tar.gz
./package_gpu.sh      # → pr2_gpu_matrix_mul_YYYYMMDD.tar.gz

Kubernetes / GPU Cluster

The YAML files deploy an NVHPC container on a GPU cluster with an NFS volume. Adjust the NFS server and path to match your environment.

  • pods-working-user03-gpu02.yaml: node gputype: gpu-02
  • pods-working-user03-gpu03.yaml: node gputype: gpu-03

kubectl apply -f pods-working-user03-gpu02.yaml
kubectl apply -f pods-working-user03-gpu03.yaml
