<a href="https://colab.research.google.com/github/UchihaIthachi/cuda-gpu-programming-lab/blob/main/cuda-lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CUDA Lab Setup

This notebook will guide you through setting up and running the `cuda-lab` examples in Google Colab.

## 1. Check GPU Availability

First, let's ensure that a GPU is available. Go to **Runtime -> Change runtime type** and select **GPU** as the hardware accelerator. Then, run the following cell to verify that Colab has assigned a GPU.

In [35]:
!nvidia-smi

Sun Aug 24 14:35:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
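The same check can be done from Python, which is handy when later cells should branch on whether a GPU is present. The following is a minimal sketch, not part of the repository: the helper name `list_gpus` is hypothetical, and it relies only on `nvidia-smi` being on the `PATH` when a GPU runtime is attached.

```python
import shutil
import subprocess

def list_gpus():
    """Return the GPU names reported by nvidia-smi, or [] if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no driver tools on PATH, so no GPU runtime is attached
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # nvidia-smi exists but could not talk to a device
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = list_gpus()
print(gpus if gpus else "No GPU detected; check the runtime type.")
```

On the runtime shown above, the list would be expected to contain a single `Tesla T4` entry.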
                                                

## 2. Clone the Repository

Next, clone the `cuda-gpu-programming-lab` repository from GitHub and change into its directory.

In [36]:
!git clone https://github.com/UchihaIthachi/cuda-gpu-programming-lab.git
%cd cuda-gpu-programming-lab

Cloning into 'cuda-gpu-programming-lab'...
remote: Enumerating objects: 137, done.
remote: Counting objects: 100% (137/137), done.
remote: Compressing objects: 100% (105/105), done.
remote: Total 137 (delta 65), reused 84 (delta 30), pack-reused 0 (from 0)
Receiving objects: 100% (137/137), 50.11 KiB | 789.00 KiB/s, done.
Resolving deltas: 100% (65/65), done.
/content/cuda-gpu-programming-lab/cuda-gpu-programming-lab/cuda-gpu-programming-lab/cuda-gpu-programming-lab
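The nested path in the output above suggests the cell was run several times, each run cloning inside the previous clone. One way to make the step safe to re-run is to decide what to do before calling `git clone`. This is a sketch under that assumption; `clone_target` and its return values are hypothetical names, not part of the repository.

```python
import os

REPO_URL = "https://github.com/UchihaIthachi/cuda-gpu-programming-lab.git"
REPO_DIR = "cuda-gpu-programming-lab"

def clone_target(cwd, repo_dir=REPO_DIR):
    """Decide how to reach the repository root without nesting clones.

    Returns (action, path), where action is:
      "skip"  if cwd already is the repository root,
      "enter" if the repository already exists directly under cwd,
      "clone" if it still has to be cloned into cwd.
    """
    cwd = os.path.abspath(cwd)
    if os.path.basename(cwd) == repo_dir and os.path.isdir(os.path.join(cwd, ".git")):
        return "skip", cwd
    candidate = os.path.join(cwd, repo_dir)
    if os.path.isdir(os.path.join(candidate, ".git")):
        return "enter", candidate
    return "clone", candidate

action, path = clone_target(os.getcwd())
print(action, path)
```

In a Colab cell, you would run `!git clone` only when the action is `"clone"`, and `%cd` into the returned path for `"clone"` and `"enter"`.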


## 3. Set GPU_ARCH Environment Variable

The `Makefile` is designed to detect the GPU architecture automatically, and you can override the detection by setting the `GPU_ARCH` environment variable explicitly. The cell below queries the GPU's compute capability with `nvidia-smi`, formats it as an `sm_XY` string (falling back to `sm_50` if detection fails), and exports it as `ARCH` for the `make` commands that follow.

In [37]:
import os
import re
from subprocess import check_output

def get_gpu_arch():
    try:
        output = check_output(['nvidia-smi', '--query-gpu=compute_cap', '--format=csv,noheader']).decode('utf-8').strip()
        if output:
            major, minor = re.search(r'(\d+)\.(\d+)', output).groups()
            return f"sm_{major}{minor}"
    except Exception as e:
        print(f"Could not determine GPU architecture: {e}")
    return "sm_50"  # Fallback to a default architecture

ARCH = get_gpu_arch()
os.environ['ARCH'] = ARCH
print(f"Detected GPU architecture: {ARCH}")

Detected GPU architecture: sm_75
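To see how a compute capability relates to the compiler flags that appear in the build output of the next section, here is a small sketch that turns a string such as `7.5` into a `-gencode` flag. The helper name `gencode_flag` is hypothetical; how the repository's `Makefile` actually builds the flag is not shown in this notebook.

```python
import re

def gencode_flag(compute_cap):
    """Turn a compute capability such as "7.5" into an nvcc -gencode flag."""
    match = re.fullmatch(r"(\d+)\.(\d+)", compute_cap.strip())
    if match is None:
        raise ValueError(f"unexpected compute capability: {compute_cap!r}")
    digits = "".join(match.groups())  # "7.5" -> "75"
    return f"-gencode arch=compute_{digits},code=sm_{digits}"

print(gencode_flag("7.5"))  # -gencode arch=compute_75,code=sm_75
```

The `compute_XY` part names the virtual architecture and `sm_XY` the real one, which is why both appear in each `nvcc` invocation.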


## 4. Build the CUDA Programs

Now, let's compile all the CUDA and serial programs using `make`. The compiled binaries will be placed in the `bin/` directory.

In [38]:
!make all

mkdir -p ./bin
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/00_device_query src/00_device_query.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/01_hello_kernel src/01_hello_kernel.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/02_vector_add src/02_vector_add.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/03_saxpy src/03_saxpy.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/04_matmul_naive src/04_matmul_naive.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/05_matmul_tiled_shared src/05_matmul_tiled_shared.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/06_reduction_sum src/06_reduction_sum.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/07_histogram_atomics src/07_histogram_atomics.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./inclu

## 5. Run Examples and Performance Comparison

You can now run any of the compiled programs. Here are a few examples, along with a performance comparison between the serial and CUDA versions.

### Device Query

This program lists the available CUDA devices.

In [39]:
!make run PROG=00_device_query

Running 00_device_query...
././bin/00_device_query 
Found 1 CUDA device(s)
-- Device 0: Tesla T4 | CC 7.5 | SMs=40 | Mem=15.83 GB


### Hello Kernel

A simple "Hello, World!" from the GPU.

In [40]:
!make run PROG=01_hello_kernel

Running 01_hello_kernel...
././bin/01_hello_kernel 
Hello from block 0, thread 0
Hello from block 0, thread 1
Hello from block 0, thread 2
Hello from block 0, thread 3
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3
Host: done.


### Vector Addition

This example adds two vectors. We will compare the performance of the serial and CUDA implementations.

In [41]:
!python scripts/compare.py \
    serial_vector_add \
    02_vector_add \
    "serial_vector_add: n=\d+ -> (\d+\.\d+) ms" \
    "vec_add: n=\d+ t=\d+ -> (\d+\.\d+) ms" \
    --args "-n 10000000"

--- Running Serial: serial_vector_add ---
Running serial_vector_add...
././bin/serial_vector_add -n 10000000
serial_vector_add: n=10000000 -> 11.672 ms

--- Running CUDA: 02_vector_add ---
Running 02_vector_add...
././bin/02_vector_add -n 10000000
vec_add: n=10000000 t=256 -> 0.503 ms, OK=true, BW=238.35 GB/s

--- Performance Comparison ---
Serial execution time: 11.672 ms
CUDA execution time:   0.503 ms
Speedup: 23.20x
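The speedup and bandwidth figures above can be reproduced by hand. The sketch below assumes `float` (4-byte) elements and that the kernel touches three arrays once each (two reads and one write); both are assumptions about `02_vector_add`, not facts taken from its source.

```python
def speedup(serial_ms, cuda_ms):
    """Ratio of serial to CUDA execution time."""
    return serial_ms / cuda_ms

def effective_bandwidth_gbs(n, elapsed_ms, arrays=3, bytes_per_element=4):
    """Bytes moved per second in GB/s, assuming each array is touched once."""
    total_bytes = n * arrays * bytes_per_element
    return total_bytes / (elapsed_ms * 1e-3) / 1e9

print(f"Speedup: {speedup(11.672, 0.503):.2f}x")
print(f"Bandwidth: {effective_bandwidth_gbs(10_000_000, 0.503):.1f} GB/s")
```

The bandwidth comes out within a fraction of a GB/s of the reported 238.35 GB/s. The small gap is consistent with the program computing it from an unrounded time.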


### SAXPY

In [42]:
!python scripts/compare.py \
    serial_saxpy \
    03_saxpy \
    "serial_saxpy: n=\d+ a=[\d.]+ -> (\d+\.\d+) ms" \
    "saxpy: n=\d+ t=\d+ -> (\d+\.\d+) ms" \
    --args "-n 10000000"

--- Running Serial: serial_saxpy ---
Running serial_saxpy...
././bin/serial_saxpy -n 10000000
serial_saxpy: n=10000000 a=2.50 -> 11.179 ms, OK=true

--- Running CUDA: 03_saxpy ---
Error running command: make run PROG=03_saxpy ARGS="-n 10000000"
make: *** [Makefile:77: run] Error 1
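A timing pattern for this program has to account for the `a=2.50` field that the serial output prints between `n=...` and `->`. The sketch below checks two candidate patterns against the serial line shown above. It assumes the patterns are applied with `re.search`, as the matrix multiplication cell does; how `scripts/compare.py` applies them internally is not shown in this notebook.

```python
import re

serial_line = "serial_saxpy: n=10000000 a=2.50 -> 11.179 ms, OK=true"

# Expects "->" immediately after the n= field.
strict = r"serial_saxpy: n=\d+ -> (\d+\.\d+) ms"
# Allows any extra fields between n= and "->".
tolerant = r"serial_saxpy: n=\d+ .*?-> (\d+\.\d+) ms"

def extract_ms(pattern, line):
    """Return the captured time in milliseconds, or None if there is no match."""
    match = re.search(pattern, line)
    return float(match.group(1)) if match else None

print(extract_ms(strict, serial_line))    # None
print(extract_ms(tolerant, serial_line))  # 11.179
```

The tolerant form mirrors the `.*?` style used by the matrix multiplication patterns and keeps working if the program adds more fields later.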



### Matrix Multiplication

In [43]:
import re
import subprocess

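def run_and_parse_time(command, pattern):
    """Run a shell command and return (stdout, elapsed_ms) parsed via `pattern`.

    Returns (None, None) if the command fails; elapsed_ms is None if the
    pattern does not match the output.
    """
    print(f"Running command: {command}")
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error running command: {command}\n{result.stderr}")
        return None, None
    output = result.stdout
    print(output)
    # The first capture group of `pattern` must be the time in milliseconds.
    match = re.search(pattern, output)
    elapsed_ms = float(match.group(1)) if match else None
    return output, elapsed_ms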

m, n, k = 1024, 1024, 1024
args = f"-m {m} -n {n} -k {k}"

# Run Serial
_, serial_time = run_and_parse_time(
    f"make run PROG=serial_matmul ARGS='{args}'",
    r"serial_matmul: .*? -> (\d+\.\d+) ms"
)

# Run Naive CUDA
_, cuda_naive_time = run_and_parse_time(
    f"make run PROG=04_matmul_naive ARGS='{args}'",
    r"mm_naive: .*? -> (\d+\.\d+) ms"
)

# Run Tiled CUDA
_, cuda_tiled_time = run_and_parse_time(
    f"make run PROG=05_matmul_tiled_shared ARGS='{args}'",
    r"mm_tiled: .*? -> (\d+\.\d+) ms"
)

# Comparison
print("--- Performance Comparison ---")
if all(t is not None for t in (serial_time, cuda_naive_time, cuda_tiled_time)):
    speedup_naive = serial_time / cuda_naive_time if cuda_naive_time > 0 else float('inf')
    speedup_tiled = serial_time / cuda_tiled_time if cuda_tiled_time > 0 else float('inf')
    print(f"Serial execution time:       {serial_time:.3f} ms")
    print(f"CUDA Naive execution time:   {cuda_naive_time:.3f} ms (Speedup: {speedup_naive:.2f}x)")
    print(f"CUDA Tiled execution time:   {cuda_tiled_time:.3f} ms (Speedup: {speedup_tiled:.2f}x)")
else:
    print("Could not parse execution time from one or more runs.")

Running command: make run PROG=serial_matmul ARGS='-m 1024 -n 1024 -k 1024'
Running serial_matmul...
././bin/serial_matmul -m 1024 -n 1024 -k 1024
serial_matmul: 1024x1024x1024 -> 3241.066 ms, 0.66 GF/s

Running command: make run PROG=04_matmul_naive ARGS='-m 1024 -n 1024 -k 1024'
Running 04_matmul_naive...
././bin/04_matmul_naive -m 1024 -n 1024 -k 1024
mm_naive: 1024x1024x1024 TB=16 -> 9.238 ms, 232.47 GF/s, maxdiff=9.54e-07

Running command: make run PROG=05_matmul_tiled_shared ARGS='-m 1024 -n 1024 -k 1024'
Running 05_matmul_tiled_shared...
././bin/05_matmul_tiled_shared -m 1024 -n 1024 -k 1024
mm_tiled: 1024x1024x1024 TILE=16 -> 5.844 ms, 367.48 GF/s, maxdiff=9.54e-07

--- Performance Comparison ---
Serial execution time:       3241.066 ms
CUDA Naive execution time:   9.238 ms (Speedup: 350.84x)
CUDA Tiled execution time:   5.844 ms (Speedup: 554.60x)
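The GF/s figures in the program output can be cross-checked from the matrix sizes and the measured times. The sketch below assumes the usual convention of counting `2 * m * n * k` floating-point operations (one multiply and one add per inner-product term); the reported numbers are consistent with that convention, but the programs' own formula is not shown here.

```python
def matmul_gflops(m, n, k, elapsed_ms):
    """Throughput in GFLOP/s, counting 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    return flops / (elapsed_ms * 1e-3) / 1e9

# Times in milliseconds, copied from the run above.
timings_ms = {"serial_matmul": 3241.066, "mm_naive": 9.238, "mm_tiled": 5.844}
serial_ms = timings_ms["serial_matmul"]

for name, ms in timings_ms.items():
    gflops = matmul_gflops(1024, 1024, 1024, ms)
    print(f"{name}: {gflops:.2f} GF/s, speedup {serial_ms / ms:.2f}x")
```

The results agree with the reported throughput to within rounding of the printed times. They also show that the tiled kernel's advantage over the naive one (about 1.6x) is much smaller than either kernel's advantage over the serial baseline.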


### Reduction Sum

In [44]:
!python scripts/compare.py \
    serial_reduction_sum \
    06_reduction_sum \
    "serial_reduction_sum: n=\d+ -> (\d+\.\d+) ms" \
    "reduction_sum: n=\d+ t=\d+ -> (\d+\.\d+) ms" \
    --args "-n 10000000"

--- Running Serial: serial_reduction_sum ---
Running serial_reduction_sum...
././bin/serial_reduction_sum -n 10000000
serial_reduction_sum: n=10000000 -> 12.903 ms, sum=7.5e+06

--- Running CUDA: 06_reduction_sum ---
Running 06_reduction_sum...
././bin/06_reduction_sum -n 10000000
reduction_sum: n=10000000 t=256 -> 0.606 ms

--- Performance Comparison ---
Serial execution time: 12.903 ms
CUDA execution time:   0.606 ms
Speedup: 21.29x


### Histogram

In [45]:
!python scripts/compare.py \
    serial_histogram \
    07_histogram_atomics \
    "serial_histogram: n=\d+ -> (\d+\.\d+) ms" \
    "histogram: n=\d+ t=\d+ -> (\d+\.\d+) ms" \
    --args "-n 10000000"

--- Running Serial: serial_histogram ---
Running serial_histogram...
././bin/serial_histogram -n 10000000
serial_histogram: n=10000000 -> 42.349 ms

--- Running CUDA: 07_histogram_atomics ---
Running 07_histogram_atomics...
././bin/07_histogram_atomics -n 10000000
histogram: n=10000000 t=256 -> 5.980 ms

--- Performance Comparison ---
Serial execution time: 42.349 ms
CUDA execution time:   5.980 ms
Speedup: 7.08x


### Pi Monte Carlo

In [46]:
!python scripts/compare.py \
    serial_pi_monte_carlo \
    08_pi_monte_carlo \
    "serial_pi_monte_carlo: n=\d+ pi=.+ -> (\d+\.\d+) ms" \
    "pi_monte_carlo: n=\d+ t=\d+ pi=.+ -> (\d+\.\d+) ms" \
    --args "-n 100000000"

--- Running Serial: serial_pi_monte_carlo ---
Running serial_pi_monte_carlo...
././bin/serial_pi_monte_carlo -n 100000000
serial_pi_monte_carlo: n=100000000 pi=3.14156264 -> 283.823 ms

--- Running CUDA: 08_pi_monte_carlo ---
Running 08_pi_monte_carlo...
././bin/08_pi_monte_carlo -n 100000000
pi_monte_carlo: n=100000000 t=256 pi=3.14158480 -> 0.768 ms

--- Performance Comparison ---
Serial execution time: 283.823 ms
CUDA execution time:   0.768 ms
Speedup: 369.56x


## 6. Run Tests

To ensure everything is working correctly, you can run the provided tests.

In [None]:
!make test

nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/test_test_matmul tests/test_matmul.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/test_test_reduce tests/test_reduce.cu
nvcc -O2 -std=c++14 -gencode arch=compute_75,code=sm_75 -I./include -o bin/test_test_vec tests/test_vec.cu
Running tests...
use 04/05 matmul
use 06_reduction_sum
use 02_vector_add


## 7. Clean Up

You can remove the compiled binaries and other build artifacts by running `make clean`.

In [None]:
!make clean

## 8. Run All Tests

In [None]:
!make test