# **CUDA Examples**

In [1]:
import os
os.environ["PATH"] += ":/usr/local/cuda/bin"

# Verify nvcc is now accessible
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0


## **1. Matrix**

### **a. Matrix Transpose**

Matrix transposition swaps the rows and columns of a matrix. For a given matrix A, its transpose A<sup>T</sup> is formed by moving the element at position (i, j) in A to position (j, i) in A<sup>T</sup>.
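The index swap described above can be sketched as a kernel with one thread per element. This is a minimal sketch, assuming row-major `float` storage; the names `transposeKernel` and `transposeOnGpu` are hypothetical and are not taken from `matrix_transpose.cu`. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive transpose: `in` is rows x cols (row-major), `out` is cols x rows.
__global__ void transposeKernel(const float* in, float* out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column index in `in`
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row index in `in`
    if (row < rows && col < cols) {
        out[col * rows + row] = in[row * cols + col];  // A^T[j][i] = A[i][j]
    }
}

// Hypothetical host wrapper: copy to the device, launch, copy back.
std::vector<float> transposeOnGpu(const std::vector<float>& a, int rows, int cols) {
    size_t bytes = a.size() * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, a.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block shape, not tuned
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    transposeKernel<<<grid, block>>>(dIn, dOut, rows, cols);

    std::vector<float> result(a.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

In this naive form the writes to `out` are strided, so a tiled shared-memory variant would likely perform better on large matrices.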

In [2]:
!make SRC=./matrix/matrix_transpose.cu run

nvcc -o ./matrix/matrix_transpose ./matrix/matrix_transpose.cu
././matrix/matrix_transpose
Matrix transposition verified successfully!
GPU execution time: 4.8152 ms
CPU execution time: 137.621 ms
Speedup (CPU vs GPU): 28.5806x


### **b. Matrix Addition**

Matrix addition is an element-wise operation: corresponding elements of two matrices with the same dimensions are added together, and the result has those same dimensions. Because every element sum is independent of the others, CUDA can assign each addition to a separate thread, which makes the operation straightforward to parallelize compared with a sequential CPU loop.
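As a rough illustration of the one-thread-per-element mapping, the sketch below treats both matrices as flat `float` arrays of equal length. The names `addKernel` and `addOnGpu` and the block size of 256 are assumptions made for this example, not details from `matrix_addition.cu`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds exactly one pair of corresponding elements.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host wrapper; assumes a.size() == b.size() and skips error checks.
std::vector<float> addOnGpu(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;  // round up so every element is covered
    addKernel<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```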

In [3]:
!make SRC=./matrix/matrix_addition.cu run

nvcc -o ./matrix/matrix_addition ./matrix/matrix_addition.cu
././matrix/matrix_addition
Matrix Addition Results:
CPU Time: 64.58 milliseconds
GPU Time: 7.52 milliseconds
Speedup: 8.59x
Results match: Yes


### **c. Matrix Multiplication**

Matrix multiplication combines two matrices to produce a third matrix. Given matrices A and B, the element C[i,j] of the result matrix C is the dot product of the i-th row of A and the j-th column of B, so the number of columns of A must equal the number of rows of B. This operation is widely used in fields like machine learning, computer graphics, and scientific computing.
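The row-by-column dot product can be sketched with one thread per output element. This is a naive version under stated assumptions: row-major `float` matrices, A of size m x k and B of size k x n, and hypothetical names `matmulKernel` and `matmulOnGpu`. It is not necessarily how `matrix_multiplication.cu` is written.

```cuda
#include <cuda_runtime.h>
#include <vector>

// C[row][col] = dot(row of A, column of B); A is m x k, B is k x n, C is m x n.
__global__ void matmulKernel(const float* a, const float* b, float* c,
                             int m, int k, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float acc = 0.0f;
        for (int p = 0; p < k; ++p) {
            acc += a[row * k + p] * b[p * n + col];
        }
        c[row * n + col] = acc;
    }
}

// Hypothetical host wrapper; error checking omitted.
std::vector<float> matmulOnGpu(const std::vector<float>& a, const std::vector<float>& b,
                               int m, int k, int n) {
    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    cudaMalloc(&dA, a.size() * sizeof(float));
    cudaMalloc(&dB, b.size() * sizeof(float));
    cudaMalloc(&dC, static_cast<size_t>(m) * n * sizeof(float));
    cudaMemcpy(dA, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block shape, not tuned
    dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    matmulKernel<<<grid, block>>>(dA, dB, dC, m, k, n);

    std::vector<float> c(static_cast<size_t>(m) * n);
    cudaMemcpy(c.data(), dC, c.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

A tiled version that stages sub-blocks of A and B in shared memory is the usual next step, since this form re-reads global memory for every product.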

In [4]:
!make SRC=./matrix/matrix_multiplication.cu run

nvcc -o ./matrix/matrix_multiplication ./matrix/matrix_multiplication.cu
././matrix/matrix_multiplication
Matrix Multiplication Results (1024x1024):
CPU Time: 7396.46 milliseconds
GPU Time: 5.75 milliseconds
Speedup: 1287.31x
Max Error: 9.155273e-05


In [5]:
!make SRC=./matrix/matrix_transpose.cu clean
!make SRC=./matrix/matrix_addition.cu clean
!make SRC=./matrix/matrix_multiplication.cu clean

rm -f ./matrix/matrix_transpose
rm -f ./matrix/matrix_addition
rm -f ./matrix/matrix_multiplication


## **2. Reduction**

### **a. Maximum/Minimum**

Find the maximum value in an array.
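One common way to structure an integer maximum reduction is a shared-memory tree reduction per block, followed by a single `atomicMax` per block. The sketch below assumes that structure, a power-of-two block size, and a non-empty input; `maxKernel` and `maxOnGpu` are hypothetical names, and `max.cu` may be organized differently.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

__global__ void maxKernel(const int* data, int n, int* result) {
    extern __shared__ int tile[];  // one int per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[tid] = (i < n) ? data[i] : INT_MIN;  // pad out-of-range slots
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            tile[tid] = max(tile[tid], tile[tid + stride]);
        }
        __syncthreads();
    }
    // One atomic per block merges the block maximum into the global result.
    if (tid == 0) {
        atomicMax(result, tile[0]);
    }
}

// Hypothetical host wrapper; assumes v is non-empty and skips error checks.
int maxOnGpu(const std::vector<int>& v) {
    int n = static_cast<int>(v.size());
    int init = INT_MIN;
    int* dData = nullptr;
    int* dResult = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dResult, sizeof(int));
    cudaMemcpy(dData, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, &init, sizeof(int), cudaMemcpyHostToDevice);

    int block = 256;  // must be a power of two for the loop above
    int grid = (n + block - 1) / block;
    maxKernel<<<grid, block, block * sizeof(int)>>>(dData, n, dResult);

    int out = INT_MIN;
    cudaMemcpy(&out, dResult, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return out;
}
```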

In [6]:
!make SRC=./reduction/max.cu run

././reduction/max
Maximum value (CPU): 999
CPU Time: 0.00373999 seconds
Maximum value (CUDA): 999
GPU Time: 0.000668777 seconds
Speedup (CPU vs GPU): 5.59228x


*This has been implemented only for integers because atomicMax is defined only for integer types. For floating-point numbers, a different technique based on atomicCAS can be used to find the maximum value.*
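The `atomicCAS` technique mentioned in the note can be sketched as a compare-and-swap loop over the bit pattern of the float. This is an illustrative sketch, not code from this repository: `atomicMaxFloat`, `maxFloatKernel`, and `maxFloatOnGpu` are hypothetical names, NaN inputs are not handled, and for simplicity every thread issues its own atomic rather than reducing within the block first.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

// Atomically stores max(*addr, value) by retrying atomicCAS on the raw bits.
__device__ float atomicMaxFloat(float* addr, float value) {
    int* addrAsInt = reinterpret_cast<int*>(addr);
    int old = *addrAsInt;
    int assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= value) {
            break;  // stored value is already at least as large
        }
        // Swap in `value` only if nobody changed the slot since we read it.
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
    } while (assumed != old);
    return __int_as_float(old);
}

__global__ void maxFloatKernel(const float* data, int n, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicMaxFloat(result, data[i]);
    }
}

// Hypothetical host wrapper; assumes v is non-empty and skips error checks.
float maxFloatOnGpu(const std::vector<float>& v) {
    int n = static_cast<int>(v.size());
    float init = -FLT_MAX;
    float* dData = nullptr;
    float* dResult = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dData, v.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, &init, sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (n + block - 1) / block;
    maxFloatKernel<<<grid, block>>>(dData, n, dResult);

    float out = -FLT_MAX;
    cudaMemcpy(&out, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return out;
}
```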

### **b. Sum**

Find the sum of all elements in an array.
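A sum reduction can be sketched as a per-block tree reduction in shared memory, with one `atomicAdd` per block to combine the partial sums. The sketch assumes `float` data and a power-of-two block size; `sumKernel` and `sumOnGpu` are hypothetical names, and `sum.cu` may use a different scheme.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void sumKernel(const float* data, int n, float* result) {
    extern __shared__ float partial[];  // one float per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[tid] = (i < n) ? data[i] : 0.0f;  // zero-pad out-of-range slots
    __syncthreads();

    // Pairwise tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }
    // Combine block totals into the single global accumulator.
    if (tid == 0) {
        atomicAdd(result, partial[0]);
    }
}

// Hypothetical host wrapper; error checking omitted.
float sumOnGpu(const std::vector<float>& v) {
    int n = static_cast<int>(v.size());
    if (n == 0) {
        return 0.0f;
    }
    float* dData = nullptr;
    float* dResult = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dData, v.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(float));  // all-zero bits represent 0.0f

    int block = 256;  // must be a power of two for the loop above
    int grid = (n + block - 1) / block;
    sumKernel<<<grid, block, block * sizeof(float)>>>(dData, n, dResult);

    float out = 0.0f;
    cudaMemcpy(&out, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return out;
}
```

Floating-point addition is not associative, so a tree-ordered GPU sum and a sequential CPU loop can differ in the low-order digits for large inputs.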

In [7]:
!make SRC=./reduction/sum.cu run

././reduction/sum
Sum (CPU): 5.2379e+08
CPU Time: 0.00401969 seconds
Sum (CUDA): 5.23806e+08
GPU Time: 0.000582532 seconds
Speedup (CPU vs GPU): 6.90037x


In [8]:
!make SRC=./reduction/max.cu clean
!make SRC=./reduction/sum.cu clean

rm -f ./reduction/max
rm -f ./reduction/sum


## **3. Parallel Scan**

### **a. Parallel Prefix Sum (Hillis-Steele Inclusive Scan)**

<div style="text-align: center;">
  <img src="./parallel_scan/hillis_steele.png" alt="Hillis Steele" width="400">
</div>

In [9]:
!make SRC=./parallel_scan/hillis_steele_prefix_sum.cu run

nvcc -o ./parallel_scan/hillis_steele_prefix_sum ./parallel_scan/hillis_steele_prefix_sum.cu
././parallel_scan/hillis_steele_prefix_sum
CPU Time: 8.267e-06 seconds
Results match!
GPU Time: 0.000246784 seconds
Speedup (CPU / GPU): 0.0334989x


*For a small array of 1024 elements, the CPU is faster. For arrays larger than 1024 elements, the prefix sum does not work, because this code performs the scan only within a block, not across blocks.*
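The single-block restriction noted above follows from how a Hillis-Steele scan synchronizes: `__syncthreads()` is a barrier for the threads of one block only. The sketch below is a minimal double-buffered inclusive scan under the assumption of `int` data and at most 1024 elements; `hillisSteeleKernel` and `hillisSteeleOnGpu` are hypothetical names rather than the contents of `hillis_steele_prefix_sum.cu`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Inclusive scan of up to blockDim.x elements inside a single block.
__global__ void hillisSteeleKernel(const int* in, int* out, int n) {
    extern __shared__ int buf[];  // two buffers of blockDim.x ints each
    int tid = threadIdx.x;
    int* src = buf;
    int* dst = buf + blockDim.x;
    src[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    // At each step, element i adds the element `offset` positions to its left.
    for (int offset = 1; offset < static_cast<int>(blockDim.x); offset <<= 1) {
        dst[tid] = (tid >= offset) ? src[tid] + src[tid - offset] : src[tid];
        __syncthreads();  // block-level barrier only
        int* tmp = src;
        src = dst;
        dst = tmp;
    }
    if (tid < n) {
        out[tid] = src[tid];
    }
}

// Hypothetical host wrapper; returns an empty vector for unsupported sizes.
std::vector<int> hillisSteeleOnGpu(const std::vector<int>& v) {
    int n = static_cast<int>(v.size());
    if (n == 0 || n > 1024) {
        return {};  // one block holds at most 1024 threads
    }
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    hillisSteeleKernel<<<1, n, 2 * n * sizeof(int)>>>(dIn, dOut, n);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Supporting larger arrays would typically mean scanning each block, scanning the per-block totals, and adding those offsets back in a second pass.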

### **b. Blelloch Scan Prefix Sum**

<div style="text-align: center;">
  <img src="./parallel_scan/blelloch_scan_reduce.png" alt="Blelloch Scan Reduce" width="400">
</div>

<div style="text-align: center;">
  <img src="./parallel_scan/blelloch_scan_down_sweep.png" alt="Blelloch Scan Down Sweep" width="400">
</div>

In [10]:
!make SRC=./parallel_scan/blelloch_prefix_sum.cu run

nvcc -o ./parallel_scan/blelloch_prefix_sum ./parallel_scan/blelloch_prefix_sum.cu
././parallel_scan/blelloch_prefix_sum
CPU time taken: 1.0785e-05 seconds
0 1 2 3 4 5 6 7 8 9 ...
GPU time taken: 0.000541696 seconds
0 1 2 3 4 5 6 7 8 9 ...
Speedup: 0.0199097x


*For a small array of 1024 elements, the CPU is faster. For arrays larger than 1024 elements, the prefix sum does not work, because this code performs the scan only within a block, not across blocks.*
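The up-sweep (reduce) and down-sweep phases pictured above can be sketched for a single block as follows. This is a textbook-style exclusive scan under stated assumptions: `int` data, the input zero-padded to a power of two, and each thread handling two elements. The names `blellochKernel` and `blellochOnGpu` are hypothetical and `blelloch_prefix_sum.cu` may differ.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Exclusive scan of n elements (n a power of two) using n / 2 threads in one block.
__global__ void blellochKernel(const int* in, int* out, int n) {
    extern __shared__ int temp[];  // n ints
    int tid = threadIdx.x;
    int offset = 1;
    temp[2 * tid] = in[2 * tid];
    temp[2 * tid + 1] = in[2 * tid + 1];

    // Up-sweep: build partial sums up a balanced tree.
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (tid == 0) {
        temp[n - 1] = 0;  // clear the root before the down-sweep
    }
    // Down-sweep: push prefix sums back down the tree.
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    out[2 * tid] = temp[2 * tid];
    out[2 * tid + 1] = temp[2 * tid + 1];
}

// Hypothetical host wrapper; pads to a power of two and skips error checks.
std::vector<int> blellochOnGpu(const std::vector<int>& v) {
    int n = static_cast<int>(v.size());
    int padded = 2;
    while (padded < n) padded *= 2;
    if (n == 0 || padded > 2048) {
        return {};  // one block of 1024 threads covers at most 2048 elements
    }
    std::vector<int> host(padded, 0);
    std::copy(v.begin(), v.end(), host.begin());
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, padded * sizeof(int));
    cudaMalloc(&dOut, padded * sizeof(int));
    cudaMemcpy(dIn, host.data(), padded * sizeof(int), cudaMemcpyHostToDevice);
    blellochKernel<<<1, padded / 2, padded * sizeof(int)>>>(dIn, dOut, padded);
    cudaMemcpy(host.data(), dOut, padded * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    host.resize(n);  // drop the padding
    return host;
}
```

Because the scan is exclusive, an input of all ones yields 0, 1, 2, 3, and so on.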

In [11]:
!make SRC=./parallel_scan/hillis_steele_prefix_sum.cu clean
!make SRC=./parallel_scan/blelloch_prefix_sum.cu clean

rm -f ./parallel_scan/hillis_steele_prefix_sum
rm -f ./parallel_scan/blelloch_prefix_sum


## **4. Searching**

### **a. Parallel Binary Search**

### **b. K-Nearest Neighbors (KNN) Search**

## **5. Sorting**

### **a. Bitonic Sort**

### **b. Radix Sort**

## **6. Graph Algorithms**

### **a. Breadth First Search (BFS)**

### **b. Depth First Search (DFS)**

### **c. Dijkstra's Algorithm**

### **d. A\* Algorithm**

## **7. Image/Signal Processing**

### **a. Image Convolution (Gaussian Blur)**

### **b. Fast Fourier Transform (FFT) on a signal**

## **8. Statistical Simulation**

### **a. Monte Carlo Simulation**

## **9. Physics Simulation**

### **a. N-Body Simulation**

### **b. Navier-Stokes Fluid Simulation**

### **c. Heat Diffusion**