---
# **LAB 3 - CUDA Execution Model**
---

# ▶️ CUDA tools...

In [1]:
!nvidia-smi

Wed Jan 21 10:49:08 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import numpy as np
import numba
from numba import cuda
import warnings
warnings.filterwarnings("ignore")

print(np.__version__)
print(numba.__version__)

cuda.detect()



2.0.2
0.60.0
Found 1 CUDA devices
id 0             b'Tesla T4'                              [SUPPORTED]
                      Compute Capability: 7.5
                           PCI Device ID: 4
                              PCI Bus ID: 0
                                    UUID: GPU-b65bf531-3727-0e5c-eb91-e9d7d93167bd
                                Watchdog: Disabled
             FP32/FP64 Performance Ratio: 32
Summary:
	1/1 devices are supported


True

In [3]:
# Suppress Numba deprecation and performance warnings
from numba.core.errors import NumbaDeprecationWarning, NumbaPerformanceWarning
import warnings

warnings.simplefilter('ignore', category=NumbaDeprecationWarning)
warnings.simplefilter('ignore', category=NumbaPerformanceWarning)

# ✅ Parallel Reduction

In [4]:
import numpy as np
from numba import cuda
import time

@cuda.jit
def blockParReduce(array_in, array_out, n):
    tid = cuda.threadIdx.x
    idx = cuda.blockIdx.x * cuda.blockDim.x + tid

    # First element of this block's segment (array_off)
    base = cuda.blockIdx.x * cuda.blockDim.x

    # In-place reduction in global memory (stride doubles each step).
    # The bounds check lives inside the loop, not in an early return, so that
    # every thread of the block still reaches each cuda.syncthreads() barrier.
    stride = 1
    while stride < cuda.blockDim.x:
        if (tid % (2 * stride)) == 0 and idx + stride < n:
            array_in[base + tid] += array_in[base + tid + stride]
        cuda.syncthreads()
        stride *= 2

    # Thread 0 writes this block's partial sum
    if tid == 0 and idx < n:
        array_out[cuda.blockIdx.x] = array_in[base]

# ----------------------------
# host-side usage
# ----------------------------
blockSize = 1024                # block dim 1D
numBlock = 1024 * 1024          # grid dim 1D
n = blockSize * numBlock        # array dim

# prepare data
a = np.ones(n, dtype=np.int32)
a_d = cuda.to_device(a)
b_d = cuda.device_array(numBlock, dtype=np.int32)

# verify numpy sum time
tic = time.time()
a.sum()
toc = time.time()
print("Numpy sum result:", a.sum())
print(f"Numpy sum time: {toc - tic:.4f} seconds")

# launch kernel (this first launch also includes Numba's JIT compilation time)
t0 = time.perf_counter()
blockParReduce[numBlock, blockSize](a_d, b_d, n)
cuda.synchronize()
t1 = time.perf_counter()
print(f"Kernel execution time: {t1 - t0:.4f} seconds")
print("speedup over numpy:", (toc - tic) / (t1 - t0))

# copy result back to host
b = b_d.copy_to_host()
print("Final sum:", b.sum())


Numpy sum result: 1073741824
Numpy sum time: 0.6655 seconds
Kernel execution time: 1.7099 seconds
speedup over numpy: 0.38917476209131036
Final sum: 1073741824
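
The kernel time above comes from the first launch, which in Numba also pays for JIT compilation, so it likely overstates the steady-state cost. A minimal sketch of a warm-up-then-measure pattern follows. `time_call` and its parameters are hypothetical names, not part of the lab, and the callable is assumed to block until its work is done (for a kernel, that means wrapping the launch together with `cuda.synchronize()`).

```python
import time

def time_call(fn, *args, warmup=1, repeats=3):
    """Return the best wall-clock time of fn(*args) after warm-up calls.

    Hypothetical helper: fn must block until the work has finished.
    """
    for _ in range(warmup):
        fn(*args)              # warm-up calls absorb one-time costs (e.g. JIT)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# CPU-only usage example
elapsed = time_call(sum, range(100_000))
print(f"best of 3: {elapsed:.6f} s")
```

Because `blockParReduce` reduces in place and overwrites its input, each warm-up or timed launch would need a fresh device copy of the data.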


## ↘️ TODO...

**Background: Divergence in Reduction**

-   Problem:
    -   threads in the same warp take different paths
    -   warps execute both paths (masked execution)
    -   performance drops
-   Goal:
    -   restructure indexing so the condition is based on a contiguous range of thread IDs

-   Divergence-avoiding idea:

    -   Instead of checking `tid % (2 * stride) == 0`, compute a new local index:
    $$
    index = 2 \cdot stride \cdot tid
    $$

    -   Then only the threads that satisfy the condition below perform the addition; these are the threads `tid = 0, 1, ..., blockDim.x / (2 * stride) - 1`, a contiguous range:
    $$
    index < blockDim.x
    $$


-   Task: implement the no-divergence reduction loop.

-   Use:

    -   `stride = 1, 2, 4, ...`
    -   `index = 2 * stride * tid`
    -   update: `in_arr[base + index] += in_arr[base + index + stride]`
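
To see why the re-indexing helps, the CPU-only sketch below counts, for each stride, how many warps contain both active and idle threads under the two predicates. It assumes a warp size of 32 and a block of 1024 threads; `divergent_warps` is a hypothetical helper written for this illustration.

```python
import numpy as np

WARP_SIZE = 32          # assumption: warp size on current NVIDIA GPUs
BLOCK_DIM = 1024        # same block size as the lab's launch configuration

def divergent_warps(active):
    """Count warps that contain both active and idle threads."""
    per_warp = active.reshape(-1, WARP_SIZE).sum(axis=1)
    return int(np.count_nonzero((per_warp > 0) & (per_warp < WARP_SIZE)))

tid = np.arange(BLOCK_DIM)
modulo_counts, index_counts = [], []
stride = 1
while stride < BLOCK_DIM:
    # original predicate: active threads are scattered across every warp
    modulo_counts.append(divergent_warps(tid % (2 * stride) == 0))
    # re-indexed predicate: active threads form a contiguous range of tids
    index_counts.append(divergent_warps(2 * stride * tid < BLOCK_DIM))
    stride *= 2

print("modulo test:", modulo_counts)
print("index test: ", index_counts)
```

Under these assumptions the modulo test leaves all 32 warps divergent for the first five steps, while the index test has no divergent warp until fewer than 32 threads remain active, and then at most one per step.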

Template:

```{python}
@cuda.jit
def blockParReduce_no_div(in_arr, out_arr, n):
    pass
```
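
Before writing the kernel, the indexing can be checked on the CPU. The sketch below emulates a single block sequentially with NumPy. It is not the CUDA solution: `emulate_block_no_div` is a hypothetical helper, each pass of the outer loop stands in for one barrier step, and the segment length is assumed to be a power of two.

```python
import numpy as np

def emulate_block_no_div(segment):
    """Sequentially emulate the no-divergence reduction on one block's segment.

    Assumes len(segment) is a power of two (hypothetical CPU stand-in
    for a thread block).
    """
    data = segment.copy()
    block_dim = data.size
    stride = 1
    while stride < block_dim:
        for tid in range(block_dim):          # every "thread" in the block
            index = 2 * stride * tid
            if index < block_dim:             # contiguous range of active tids
                data[index] += data[index + stride]
        stride *= 2                           # end of one barrier step
    return data[0]

rng = np.random.default_rng(0)
segment = rng.integers(0, 100, size=256, dtype=np.int64)
result = emulate_block_no_div(segment)
print(result == segment.sum())
```

Within one step, every written position `index` is a multiple of `2 * stride` and every read position `index + stride` is not, so no element is read and written in the same step. That is why a single barrier per step is enough on the GPU.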

# ✅ Image histogram

In [12]:
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
        
img = Image.open("../images/dog.png").convert("RGB")  # load image, force 3 channels (drops any alpha)
img_mat = np.array(img, dtype=np.uint8)  # convert to a (H, W, 3) numpy array
H, W, C = img_mat.shape
print(f"Image size: {W} x {H} x {C}")
img

FileNotFoundError: [Errno 2] No such file or directory: '../images/dog.png'

## ↘️ TODO...

**Problem Description**

-   Given an RGB image `image` of shape: $(H, W, 3)$

-   compute a histogram such that:

    -   `histogram[0, i]` = num pixels with **red value** `i`
    -   `histogram[1, i]` = num pixels with **green value** `i`
    -   `histogram[2, i]` = num pixels with **blue value** `i`

-   Each color channel has **256 bins**

🔹 **CPU Reference Function**

-   CPU helper function that computes a frequency histogram for a 1D array:

``` python
def array_freq(arr):
    h = np.zeros(256, dtype=np.int32)
    for e in arr:
        h[e] += 1
    return h
```
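
One possible way to build the full `(3, 256)` reference is to apply the helper to each flattened channel. The sketch below uses a small random array as a stand-in for the image (an assumption, since any `(H, W, 3)` `uint8` array behaves the same way) and cross-checks the result against `np.bincount`; `hist_cpu` is a hypothetical name.

```python
import numpy as np

def array_freq(arr):
    h = np.zeros(256, dtype=np.int32)
    for e in arr:
        h[e] += 1
    return h

def hist_cpu(image):
    """Per-channel histogram of an (H, W, 3) uint8 image, shape (3, 256)."""
    return np.stack([array_freq(image[:, :, c].ravel()) for c in range(3)])

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)  # stand-in image
hist = hist_cpu(image)

# Vectorized cross-check: bincount over each flattened channel
expected = np.stack(
    [np.bincount(image[:, :, c].ravel(), minlength=256) for c in range(3)]
)
print(hist.shape, np.array_equal(hist, expected))
```

Each row of the histogram sums to `H * W`, which gives a quick sanity check for the GPU result as well.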

🔹 **CUDA Kernel Requirements**

1.  Uses a 2D grid and 2D block
2.  Maps each thread to one pixel $(y, x)$
3.  Reads the pixel's RGB values
4.  Updates the histogram using atomic additions
5.  Avoids out-of-bounds accesses
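
Requirements 1, 2, and 5 interact: the grid has to cover the image using ceiling division, and the threads that land outside the image are then turned away by a bounds check. The CPU-only sketch below works through that arithmetic, assuming a 16 x 16 block (a common choice, not mandated by the lab) and a hypothetical image size; `grid_for` is an illustrative helper.

```python
import math

def grid_for(height, width, block=(16, 16)):
    """Blocks per grid as (x, y) so that the grid covers every pixel."""
    threads_x, threads_y = block
    return (math.ceil(width / threads_x), math.ceil(height / threads_y))

H, W = 480, 650                    # hypothetical image size
block = (16, 16)                   # assumed block shape (x, y)
grid = grid_for(H, W, block)

launched = grid[0] * block[0] * grid[1] * block[1]
idle = launched - H * W            # threads the bounds check must turn away
print(grid, launched, idle)
```

In Numba, such a configuration would be passed as `histGPU[grid, block](image_d, hist_d)`, where `image_d` and `hist_d` are hypothetical device arrays. Inside the kernel, `x, y = cuda.grid(2)` returns the thread's global coordinates, which can be compared against `W` and `H`.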

``` python
from numba import cuda

@cuda.jit
def histGPU(image, histogram):
    """
    image: uint8 array of shape (H, W, 3)
    histogram: int32 array of shape (3, 256)
    """
    # TODO
    
```