# **General Matrix Multiplication (GEMM) Optimization**

In [1]:
import os
os.environ["PATH"] += ":/usr/local/cuda-12.6/bin"
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0


In [2]:
import re
import subprocess
import statistics
import time

def run_gemm(executable_path, choice, m, n, k, runs=10, sleep_time=2):
    """Run the GEMM executable `runs` times and return the mean kernel time in ms."""
    times = []

    for i in range(runs):
        result = subprocess.run([executable_path, str(choice), str(m), str(n), str(k)],
                                capture_output=True, text=True)
        print(f"Output {i+1} - {result.stdout}")
        # The executable reports its timing as "CUDA kernel time: <float> ms"
        match = re.search(r'CUDA kernel time: (\d+\.\d+)', result.stdout)
        if match:
            times.append(float(match.group(1)))
        else:
            print(f"Warning: No time found in output: {result.stdout}")

        # Pause between runs, except after the last one
        if i < runs - 1:
            print(f"Sleeping for {sleep_time} seconds...")
            time.sleep(sleep_time)

    # statistics.mean raises StatisticsError on an empty list, so fail with a clear message
    if not times:
        raise RuntimeError(f"No kernel times parsed from {executable_path}")

    # Calculate and print statistics
    mean_time = statistics.mean(times)
    std_dev = statistics.stdev(times) if len(times) > 1 else 0
    min_time = min(times)
    max_time = max(times)

    print("\nStatistics:")
    print(f"Mean: {mean_time:.2f} ms")
    print(f"Std Dev: {std_dev:.2f} ms")
    print(f"Min: {min_time:.2f} ms")
    print(f"Max: {max_time:.2f} ms")

    return mean_time
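
The mean time returned above is easier to compare across matrix sizes when converted to throughput. The helper below is a minimal sketch, not part of the original notebook: `gemm_gflops` is a hypothetical name, and it assumes the conventional count of 2·M·N·K floating-point operations for a GEMM (one multiply and one add per inner-product term).

```python
def gemm_gflops(m, n, k, time_ms):
    """Throughput in GFLOP/s for an M x N x K GEMM that took time_ms milliseconds."""
    flops = 2 * m * n * k           # assumed operation count: one multiply + one add per term
    seconds = time_ms / 1000.0
    return flops / seconds / 1e9

# Example with an illustrative timing value (not a measured result)
print(f"{gemm_gflops(4096, 4096, 4096, 100.0):.1f} GFLOP/s")
```

Passing the value returned by `run_gemm` as `time_ms` gives the average throughput of a kernel.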

## **1 - Naive GEMM Kernel**

<div style="text-align: center;">
  <img src="./images/naive_kernel_mul.png" alt="Naive GEMM Multiplication" width="800">
</div>

This diagram shows the naive GEMM (General Matrix Multiplication) kernel, in which each thread computes one element of C. A thread derives its coordinates from its IDs: x = blockDim.x * blockIdx.x + threadIdx.x and y = blockDim.y * blockIdx.y + threadIdx.y. On each step, the threads of a warp read the same element of B (a broadcast), but their reads from A fall at non-consecutive addresses, so the accesses to A are not coalesced, which is inefficient. The C matrix shows which output element each thread, (0,0), (0,1), (0,2), and so on, computes through these access patterns.

<div style="text-align: center;">
  <img src="./images/naive_kernel_memory_access.png" alt="Naive Kernel Memory Access" width="800">
</div>

This diagram illustrates the memory access problem in the naive kernel. Two warps (groups of threads that execute together) access memory in a non-coalesced pattern: their threads read scattered locations rather than consecutive ones. Each warp requires 4x32B loads (8 loads total), which is inefficient. The crossing lines between thread indices and memory locations visualize the scattered accesses. The performance penalty comes from the number of separate load operations each warp must issue to fetch its data.
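
The access pattern described above can be checked with a small host-side model. The sketch below is an illustration, not the kernel itself: it assumes 32-thread warps whose lanes have consecutive threadIdx.x, 4-byte floats, 32-byte memory sectors, row-major matrices, and that x selects the row of A while y selects the column of B. The names `sectors_touched` and `naive_warp_sectors` are hypothetical. Because it models a full 32-lane warp, its counts need not match the simplified numbers drawn in the figure.

```python
WARP_SIZE = 32      # threads per warp
FLOAT_BYTES = 4     # size of one float
SECTOR_BYTES = 32   # assumed granularity of a global-memory load

def sectors_touched(element_indices):
    """Number of distinct 32-byte sectors covered by a list of float element indices."""
    return len({(idx * FLOAT_BYTES) // SECTOR_BYTES for idx in element_indices})

def naive_warp_sectors(K, N, i, col=0):
    """Sectors one warp touches in A and B at inner-loop step i of the naive mapping."""
    rows = range(WARP_SIZE)                     # lanes map to consecutive rows x
    a_indices = [x * K + i for x in rows]       # A[x][i]: lanes are K floats apart
    b_indices = [i * N + col for _ in rows]     # B[i][y]: every lane reads the same element
    return sectors_touched(a_indices), sectors_touched(b_indices)

a_sectors, b_sectors = naive_warp_sectors(K=4096, N=4096, i=0)
print(f"A sectors per warp: {a_sectors}, B sectors per warp: {b_sectors}")
```

Under these assumptions every lane's read from A lands in its own sector, while the read from B is served by a single sector.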

In [3]:
!nvcc -o ./src/01_naive_gemm ./src/run.cu -lcublas -lnvToolsExt

In [4]:
naive_gemm_time = run_gemm("./src/01_naive_gemm", 1, 4096, 4096, 4096)
print(f"Average Naive GEMM time: {naive_gemm_time}")

Output 1 - Naive GEMM Kernel:
CUDA kernel time: 1289.4343 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 2 - Naive GEMM Kernel:
CUDA kernel time: 1286.6808 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 3 - Naive GEMM Kernel:
CUDA kernel time: 1286.7893 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 4 - Naive GEMM Kernel:
CUDA kernel time: 1296.2511 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 5 - Naive GEMM Kernel:
CUDA kernel time: 1268.7273 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 6 - Naive GEMM Kernel:
CUDA kernel time: 1254.4408 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 7 - Naive GEMM Kernel:
CUDA kernel time: 1259.7162 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 8 - Naive GEMM Kernel:
CUDA kernel time: 1257.1493 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 9 - Naive GEMM Kernel:
CUDA kernel time: 1297.1204 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 10 - Naive G

In [5]:
!ncu --set full -o ./profiles/01_naive_gemm ./src/01_naive_gemm 1 4096 4096 4096

==PROF== Connected to process 103654 (/home/darshith/code/cuda-gemm-optimization/src/01_naive_gemm)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%
....50%....100% - 38 passes
Naive GEMM Kernel:
==PROF== Profiling "gemmNaive" - 1: 0%
....50%....100% - 38 passes
CUDA kernel time: 190976.5625 ms
Results match : Yes 
==PROF== Disconnected from process 103654
==PROF== Report: /home/darshith/code/cuda-gemm-optimization/./profiles/01_naive_gemm.ncu-rep


## **2 - Coalesced Memory GEMM Kernel**

<div style="text-align: center;">
  <img src="./images/coalesced_memory_mul.png" alt="Coalesced Memory Multiplication" width="800">
</div>

This diagram shows the memory-coalesced GEMM (General Matrix Multiplication) kernel. Unlike the naive version, threads in a warp access consecutive memory locations in matrix B, which enables memory coalescing and better performance. For matrix A, all threads within a warp access the same value (a broadcast). Both coordinates are derived from the linear thread index: x = blockIdx.x * BLOCK_SIZE + (threadIdx.x / BLOCK_SIZE) for matrix A's row access, and y = blockIdx.y * BLOCK_SIZE + (threadIdx.x % BLOCK_SIZE) for matrix B's column access. As a result, threads (0,0), (0,1), and (0,2) fall in the same warp, which is what makes their accesses to B contiguous.

<div style="text-align: center;">
  <img src="./images/coalesced_memory_access.png" alt="Coalesced Memory Access" width="800">
</div>

This diagram shows the coalesced access pattern, where the threads within each warp (Warp-0 and Warp-1) access consecutive memory locations. Each warp now requires only 2x32B loads (4 loads total), half of what the non-coalesced version needed. The straight vertical lines from thread indices to memory locations indicate coalesced access, which improves performance by reducing the number of load operations.
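
To see why this index mapping coalesces, the sketch below applies it to the lanes of one warp and counts the 32-byte sectors they touch. It is a host-side model under stated assumptions: a one-dimensional thread block, BLOCK_SIZE = 32, 32-thread warps, 4-byte floats, row-major matrices, and both coordinates taken from the linear thread index (row from `tid // BLOCK_SIZE`, column from `tid % BLOCK_SIZE`). The function names are hypothetical, and the counts are for a full 32-lane warp rather than the simplified warps drawn in the figure.

```python
BLOCK_SIZE = 32     # assumed tile width
WARP_SIZE = 32      # threads per warp
FLOAT_BYTES = 4     # size of one float
SECTOR_BYTES = 32   # assumed granularity of a global-memory load

def count_sectors(element_indices):
    """Number of distinct 32-byte sectors covered by a list of float element indices."""
    return len({(idx * FLOAT_BYTES) // SECTOR_BYTES for idx in element_indices})

def coalesced_warp_sectors(K, N, i, warp=0):
    """Sectors one warp touches in A and B at inner-loop step i of the coalesced mapping."""
    tids = range(warp * WARP_SIZE, (warp + 1) * WARP_SIZE)
    a_indices = [(t // BLOCK_SIZE) * K + i for t in tids]   # same row for the whole warp
    b_indices = [i * N + (t % BLOCK_SIZE) for t in tids]    # consecutive columns
    return count_sectors(a_indices), count_sectors(b_indices)

for warp in (0, 1):
    a_sectors, b_sectors = coalesced_warp_sectors(K=4096, N=4096, i=0, warp=warp)
    print(f"Warp-{warp}: A sectors = {a_sectors}, B sectors = {b_sectors}")
```

With this mapping the 32 reads from B cover 128 contiguous bytes, so they fit in four sectors, and the read from A collapses to one.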

In [6]:
!nvcc -o ./src/02_memory_coalesced_gemm ./src/run.cu -lcublas -lnvToolsExt

In [7]:
memory_coalesced_gemm_time = run_gemm("./src/02_memory_coalesced_gemm", 2, 4096, 4096, 4096)
print(f"Average Global Memory Coalesced GEMM time: {memory_coalesced_gemm_time}")

Output 1 - Global Memory Coalescing:
CUDA kernel time: 254.7391 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 2 - Global Memory Coalescing:
CUDA kernel time: 252.1203 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 3 - Global Memory Coalescing:
CUDA kernel time: 255.5368 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 4 - Global Memory Coalescing:
CUDA kernel time: 251.8124 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 5 - Global Memory Coalescing:
CUDA kernel time: 251.9612 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 6 - Global Memory Coalescing:
CUDA kernel time: 251.5591 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 7 - Global Memory Coalescing:
CUDA kernel time: 252.3045 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 8 - Global Memory Coalescing:
CUDA kernel time: 252.6084 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 9 - Global Memory Coalescing:
CUDA kernel time: 254.9641 ms
Results match

In [8]:
!ncu --set full -o ./profiles/02_memory_coalesced_gemm ./src/02_memory_coalesced_gemm 2 4096 4096 4096

==PROF== Connected to process 104516 (/home/darshith/code/cuda-gemm-optimization/src/02_memory_coalesced_gemm)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%
....50%....100% - 37 passes
Global Memory Coalescing:
==PROF== Profiling "gemmMemCoalesced" - 1: 0%
....50%....100% - 37 passes
CUDA kernel time: 105258.9141 ms
Results match : Yes 
==PROF== Disconnected from process 104516
==PROF== Report: /home/darshith/code/cuda-gemm-optimization/./profiles/02_memory_coalesced_gemm.ncu-rep


## **3 - Shared Memory Cache-Blocking**

<div style="text-align: center;">
  <img src="./images/shared_memory_cache_blocking.png" alt="Shared Memory Cache Blocking" width="800">
</div>

This diagram illustrates a block-based matrix multiplication algorithm with a block size of 32. When multiplying matrices A and B, each matrix is divided into blocks of size BLOCK_SIZE (32). The starting address of each block is calculated as follows: &A = row * BLOCK_SIZE * K for matrix A's rows, &B = col * BLOCK_SIZE for B's columns, and &C = (row * BLOCK_SIZE * N) + (col * BLOCK_SIZE) for the output block in C (C is M x N, so its row stride is N, which equals K only for square matrices). As the algorithm steps through the blocks along the K dimension, it advances A by BLOCK_SIZE within the same row of blocks and B by BLOCK_SIZE * N to move down to the next block, while the position in C stays fixed until the output block is complete.
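
The block addressing described above can be traced on the host with flat row-major lists. The sketch below is a plain Python model of the technique, not the CUDA kernel: the per-block lists stand in for shared-memory tiles, the names `tiled_gemm` and `naive_gemm` are hypothetical, and it assumes M, N, and K are multiples of BLOCK_SIZE. It uses N as the row stride of C, which coincides with K for the square matrices benchmarked here.

```python
def naive_gemm(A, B, M, N, K):
    """Reference product of row-major flat lists A (M x K) and B (K x N)."""
    return [sum(A[r * K + d] * B[d * N + c] for d in range(K))
            for r in range(M) for c in range(N)]

def tiled_gemm(A, B, M, N, K, BLOCK_SIZE):
    """Block-by-block product using the start offsets and strides described above."""
    BS = BLOCK_SIZE
    C = [0.0] * (M * N)
    for block_row in range(M // BS):
        for block_col in range(N // BS):
            a_off = block_row * BS * K                  # start of A's row of blocks
            b_off = block_col * BS                      # start of B's column of blocks
            c_off = block_row * BS * N + block_col * BS # fixed output block in C
            acc = [[0.0] * BS for _ in range(BS)]
            for _ in range(K // BS):
                # Stand-ins for the shared-memory tiles loaded by a thread block
                a_tile = [[A[a_off + r * K + c] for c in range(BS)] for r in range(BS)]
                b_tile = [[B[b_off + r * N + c] for c in range(BS)] for r in range(BS)]
                for r in range(BS):
                    for c in range(BS):
                        for d in range(BS):
                            acc[r][c] += a_tile[r][d] * b_tile[d][c]
                a_off += BS          # next block to the right in A
                b_off += BS * N      # next block down in B
            for r in range(BS):
                for c in range(BS):
                    C[c_off + r * N + c] = acc[r][c]
    return C

M, N, K = 4, 6, 8
A = [float(v) for v in range(M * K)]
B = [float(v) for v in range(K * N)]
print(tiled_gemm(A, B, M, N, K, BLOCK_SIZE=2) == naive_gemm(A, B, M, N, K))
```

The non-square sizes in the example are chosen so that a wrong stride (K in place of N, for instance) would produce a mismatch against the reference.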

In [9]:
!nvcc -o ./src/03_shared_memory_gemm ./src/run.cu -lcublas -lnvToolsExt

In [10]:
shared_memory_gemm_time = run_gemm("./src/03_shared_memory_gemm", 3, 4096, 4096, 4096)
print(f"Average Shared Memory GEMM time: {shared_memory_gemm_time}")

Output 1 - Shared Memory Cache-Blocking:
CUDA kernel time: 161.1391 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 2 - Shared Memory Cache-Blocking:
CUDA kernel time: 182.9159 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 3 - Shared Memory Cache-Blocking:
CUDA kernel time: 167.4789 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 4 - Shared Memory Cache-Blocking:
CUDA kernel time: 161.1305 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 5 - Shared Memory Cache-Blocking:
CUDA kernel time: 182.4557 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 6 - Shared Memory Cache-Blocking:
CUDA kernel time: 175.5508 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 7 - Shared Memory Cache-Blocking:
CUDA kernel time: 171.3870 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 8 - Shared Memory Cache-Blocking:
CUDA kernel time: 179.1420 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 9 - Shared Memory Cache-Blocking:
CUDA ke

In [11]:
!ncu --set full -o ./profiles/03_shared_memory_gemm ./src/03_shared_memory_gemm 3 4096 4096 4096

==PROF== Connected to process 105151 (/home/darshith/code/cuda-gemm-optimization/src/03_shared_memory_gemm)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%
....50%....100% - 37 passes
Shared Memory Cache-Blocking:
==PROF== Profiling "gemmSharedMem" - 1: 0%
....50%....100% - 38 passes
CUDA kernel time: 48232.0820 ms
Results match : Yes 
==PROF== Disconnected from process 105151
==PROF== Report: /home/darshith/code/cuda-gemm-optimization/./profiles/03_shared_memory_gemm.ncu-rep


## **4 - 1D Block-Tiling**

<div style="text-align: center;">
  <img src="./images/1d_block_tiling.png" alt="1D Block tiling" width="800">
</div>

In [12]:
!nvcc -o ./src/04_1d_block_tiling ./src/run.cu -lcublas -lnvToolsExt

In [13]:
block_tile1d_gemm_time = run_gemm("./src/04_1d_block_tiling", 4, 4096, 4096, 4096)
print(f"Average 1D Block-tiled GEMM time: {block_tile1d_gemm_time}")

Output 1 - 1D Block tiling:
CUDA kernel time: 64.9989 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 2 - 1D Block tiling:
CUDA kernel time: 65.0620 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 3 - 1D Block tiling:
CUDA kernel time: 65.2465 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 4 - 1D Block tiling:
CUDA kernel time: 65.4750 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 5 - 1D Block tiling:
CUDA kernel time: 65.1683 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 6 - 1D Block tiling:
CUDA kernel time: 65.1523 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 7 - 1D Block tiling:
CUDA kernel time: 65.2660 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 8 - 1D Block tiling:
CUDA kernel time: 65.1588 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 9 - 1D Block tiling:
CUDA kernel time: 65.1698 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 10 - 1D Block tiling:
CUDA kernel time: 64.9749 

In [14]:
!ncu --set full -o ./profiles/04_1d_block_tiling ./src/04_1d_block_tiling 4 4096 4096 4096

==PROF== Connected to process 105627 (/home/darshith/code/cuda-gemm-optimization/src/04_1d_block_tiling)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%
....50%....100% - 38 passes
1D Block tiling:
==PROF== Profiling "gemm1dBlockTiling" - 1: 0%....50%....100% - 37 passes
CUDA kernel time: 14103.1299 ms
Results match : Yes 
==PROF== Disconnected from process 105627
==PROF== Report: /home/darshith/code/cuda-gemm-optimization/./profiles/04_1d_block_tiling.ncu-rep


## **5 - 2D Block-Tiling**

In [15]:
!nvcc -o ./src/05_2d_block_tiling ./src/run.cu -lcublas -lnvToolsExt

In [16]:
block_tile2d_gemm_time = run_gemm("./src/05_2d_block_tiling", 5, 4096, 4096, 4096)
print(f"Average 2D Block-tiled GEMM time: {block_tile2d_gemm_time}")

Output 1 - 2D Block tiling:
CUDA kernel time: 379.6307 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 2 - 2D Block tiling:
CUDA kernel time: 383.0912 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 3 - 2D Block tiling:
CUDA kernel time: 370.4710 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 4 - 2D Block tiling:
CUDA kernel time: 366.1082 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 5 - 2D Block tiling:
CUDA kernel time: 374.2991 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 6 - 2D Block tiling:
CUDA kernel time: 371.2763 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 7 - 2D Block tiling:
CUDA kernel time: 383.6719 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 8 - 2D Block tiling:
CUDA kernel time: 358.5301 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 9 - 2D Block tiling:
CUDA kernel time: 372.7195 ms
Results match : Yes 

Sleeping for 2 seconds...
Output 10 - 2D Block tiling:
CUDA kernel time:

In [17]:
!ncu --set full -o ./profiles/05_2d_block_tiling ./src/05_2d_block_tiling 5 4096 4096 4096

==PROF== Connected to process 106033 (/home/darshith/code/cuda-gemm-optimization/src/05_2d_block_tiling)
==PROF== Profiling "ampere_sgemm_128x64_nn" - 0: 0%
....50%....100% - 37 passes
2D Block tiling:
==PROF== Profiling "gemm2dBlockTiling" - 1: 0%
....50%....100% - 38 passes
CUDA kernel time: 49358.2031 ms
Results match : Yes 
==PROF== Disconnected from process 106033
==PROF== Report: /home/darshith/code/cuda-gemm-optimization/./profiles/05_2d_block_tiling.ncu-rep


## **6 - Vectorized 2D Block-Tiling**

In [None]:
!nvcc -o ./src/06_vectorize_gemm ./src/run.cu -lcublas -lnvToolsExt

In [None]:
vector_block_tile2d_gemm_time = run_gemm("./src/06_vectorize_gemm", 6, 4096, 4096, 4096)
print(f"Average Vectorized 2D Block-tiled GEMM time: {vector_block_tile2d_gemm_time}")

In [None]:
!ncu --set full -o ./profiles/06_vectorize_gemm ./src/06_vectorize_gemm 6 4096 4096 4096