# **CUDA Examples**

In [None]:
import os
os.environ["PATH"] += ":/usr/local/cuda/bin"

# Verify nvcc is now accessible
!nvcc --version

## **1. Matrix**

### **a. Matrix Transpose**

Matrix transposition is the process of swapping the rows and columns of a matrix. For a given matrix A, its transpose A<sup>T</sup> is formed by moving the element at position (i, j) in A to position (j, i) in A<sup>T</sup>.
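As a CPU-side reference for checking a kernel's output, the index mapping can be sketched in Python. This is only an illustration, not the contents of `matrix_transpose.cu`; the function name `transpose_flat` and the flat row-major layout are assumptions made for the example.

```python
def transpose_flat(a, rows, cols):
    """Transpose a rows x cols matrix stored as a flat, row-major list.

    Hypothetical helper for illustration: each (i, j) pair is independent,
    so a GPU kernel could assign one pair to each thread.
    """
    if len(a) != rows * cols:
        raise ValueError("length of a must equal rows * cols")
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            # Element (i, j) of the input becomes element (j, i) of the
            # output, which has `rows` columns.
            out[j * rows + i] = a[i * cols + j]
    return out


# 2 x 3 input: [[1, 2, 3], [4, 5, 6]]
a = [1, 2, 3, 4, 5, 6]
print(transpose_flat(a, 2, 3))  # [1, 4, 2, 5, 3, 6], i.e. [[1, 4], [2, 5], [3, 6]]
```

Transposing twice should return the original matrix, which makes a convenient sanity check.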

In [None]:
!make SRC=./matrix/matrix_transpose.cu run

### **b. Matrix Addition**

Matrix addition is an element-wise operation in which corresponding elements of two matrices with the same dimensions are added together. The result has the same dimensions as the inputs, and each of its elements is the sum of the corresponding input elements. In CUDA, this operation can be parallelized by assigning each element addition to a separate thread. Because the additions are independent of one another, the GPU can process many elements concurrently instead of one at a time as in a sequential CPU loop.
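The one-thread-per-element idea can be mimicked on the CPU with two nested loops standing in for blocks and threads. This is a rough sketch under assumed names (`add_flat`, `block_size`), not the code in `matrix_addition.cu`.

```python
def add_flat(a, b, block_size=4):
    """Element-wise sum of two equally sized flat matrices.

    The nested loops imitate a 1D grid: `block` and `thread` play the
    roles of blockIdx.x and threadIdx.x. Names and block size are
    illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("matrices must have the same number of elements")
    n = len(a)
    c = [0] * n
    num_blocks = (n + block_size - 1) // block_size  # round up
    for block in range(num_blocks):
        for thread in range(block_size):
            idx = block * block_size + thread  # global element index
            if idx < n:  # the last block may have idle threads
                c[idx] = a[idx] + b[idx]
    return c


print(add_flat([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
# [11, 22, 33, 44, 55, 66]
```

The bounds check matters whenever the element count is not a multiple of the block size, as with 6 elements and a block size of 4 here.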

In [None]:
!make SRC=./matrix/matrix_addition.cu run

### **c. Matrix Multiplication**

Matrix multiplication combines two matrices to produce a third. Given matrices A and B, where the number of columns of A equals the number of rows of B, the element C[i,j] of the result matrix C is the dot product of the i-th row of A and the j-th column of B. This operation is widely used in fields such as machine learning, computer graphics, and scientific computing.
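The row-times-column definition can be written directly as a small CPU reference. This is a sketch for checking results, with a hypothetical function name `matmul`; it does not reflect how `matrix_multiplication.cu` organizes its threads.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of rows)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("columns of a must equal rows of b")
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # C[i][j] is the dot product of row i of A and column j of B.
            # Each (i, j) is independent, so it could map to one GPU thread.
            c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```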

In [None]:
!make SRC=./matrix/matrix_multiplication.cu run

In [None]:
!make SRC=./matrix/matrix_transpose.cu clean
!make SRC=./matrix/matrix_addition.cu clean
!make SRC=./matrix/matrix_multiplication.cu clean

## **2. Reduction**

### **a. Maximum/Minimum**

Finds the maximum value in an array.
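One common way to parallelize a reduction is a pairwise (tree) reduction, where each step combines elements a fixed stride apart and the stride doubles until one value remains. The sketch below simulates those steps sequentially. It illustrates the general technique only and is not necessarily how `max.cu` works; `tree_reduce` is a hypothetical name.

```python
def tree_reduce(values, op):
    """Reduce `values` with the binary function `op` using a doubling stride.

    All combinations inside one stride step are independent, which is
    what would let GPU threads perform them in parallel.
    """
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        # Combine element i with its partner at i + stride, when it exists.
        for i in range(0, n - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]


print(tree_reduce([3, 9, 2, 7, 5], max))  # 9
print(tree_reduce([3, 9, 2, 7, 5], min))  # 2
```

Passing `min` instead of `max` gives the minimum with the same structure, since both operations are associative.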

In [None]:
!make SRC=./reduction/max.cu run

*Implemented only for integers because `atomicMax` supports only integer types. For floating-point values, the maximum can instead be found with a compare-and-swap loop built on `atomicCAS`.*

### **b. Sum**

Computes the sum of all the elements in an array.
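A sum over a large array is often split into two stages: each block reduces its own chunk, then the per-block partial sums are combined. The following sequential sketch shows that structure under assumed names (`block_sum`, `reduce_sum`, `block_size`); whether `sum.cu` follows the same scheme is not stated here.

```python
def block_sum(chunk):
    """Sum one block's chunk with a halving stride."""
    data = list(chunk)
    size = 1
    while size < len(data):
        size *= 2
    data += [0] * (size - len(data))  # pad with the identity for addition
    stride = size // 2
    while stride > 0:
        # "Thread" t adds the element one stride away into its own slot.
        for t in range(stride):
            data[t] += data[t + stride]
        stride //= 2
    return data[0]


def reduce_sum(values, block_size=4):
    """Two-stage sum: per-block partial sums, then a final combine."""
    partials = [
        block_sum(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]
    # On a GPU this final step might be a second kernel launch or an
    # atomic add; here it is a plain sum over the partial results.
    return sum(partials)


print(reduce_sum(list(range(1, 11))))  # 55
```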

In [None]:
!make SRC=./reduction/sum.cu run

In [None]:
!make SRC=./reduction/max.cu clean
!make SRC=./reduction/sum.cu clean

## **3. Parallel Scan**

### **a. Parallel Prefix Sum (Hillis-Steele Inclusive Scan)**

<div style="text-align: center;">
  <img src="./parallel_scan/hillis_steele.png" alt="Hillis Steele" width="400">
</div>
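The Hillis-Steele scheme named in the heading can be simulated step by step on the CPU. In each step, every element at index `i >= offset` adds the element `offset` positions to its left, and the offset doubles. This is a sketch of the general algorithm with a hypothetical name `hillis_steele_scan`, not a copy of `hillis_steele_prefix_sum.cu`.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum using the Hillis-Steele doubling-offset steps."""
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        # Read from a snapshot so every update in this step sees the
        # previous step's values (a GPU version would typically use
        # double buffering or a barrier for the same reason).
        prev = list(data)
        for i in range(offset, n):
            data[i] = prev[i] + prev[i - offset]
        offset *= 2
    return data


print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

The loop runs about log2(n) steps, but each step touches nearly all n elements, so the total work is on the order of n log n additions.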

In [None]:
!make SRC=./parallel_scan/hillis_steele_prefix_sum.cu run

*For small arrays (up to 1024 elements), the CPU is faster. For arrays larger than 1024 elements, this code does not produce a correct prefix sum, because the scan runs only within each block and partial sums are not carried across blocks.*
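The effect of scanning only within blocks, and one usual way around it, can be shown with a small simulation. Each block scans its own chunk, then every block adds the total of all blocks before it. This is a sketch of a possible extension, not something the current code does; `blockwise_scan`, `full_scan`, and `block_size` are assumed names.

```python
from itertools import accumulate


def blockwise_scan(values, block_size):
    """Scan each block's chunk independently (no cross-block carry)."""
    return [
        list(accumulate(values[start:start + block_size]))
        for start in range(0, len(values), block_size)
    ]


def full_scan(values, block_size):
    """Fix up block-local scans by adding the sum of all earlier blocks."""
    out = []
    offset = 0  # running total of the blocks already processed
    for chunk in blockwise_scan(values, block_size):
        out.extend(x + offset for x in chunk)
        offset += chunk[-1]  # last element of a chunk is that block's total
    return out


values = [1] * 6
print(blockwise_scan(values, 4))  # [[1, 2, 3, 4], [1, 2]]  restarts at the block boundary
print(full_scan(values, 4))       # [1, 2, 3, 4, 5, 6]
```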

### **b. Blelloch Scan Prefix Sum**

<div style="text-align: center;">
  <img src="./parallel_scan/blelloch_scan_reduce.png" alt="Blelloch Scan Reduce" width="400">
</div>

<div style="text-align: center;">
  <img src="./parallel_scan/blelloch_scan_down_sweep.png" alt="Blelloch Scan Down Sweep" width="400">
</div>
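The two phases pictured above, a reduce (up-sweep) followed by a down-sweep, can be simulated sequentially as follows. In its classic form the Blelloch algorithm yields an exclusive scan, so element `i` of the result is the sum of all inputs before index `i`. Whether `blelloch_prefix_sum.cu` converts this to an inclusive result is not stated, and the name `blelloch_scan` is hypothetical.

```python
def blelloch_scan(values):
    """Exclusive prefix sum using Blelloch's up-sweep and down-sweep."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)  # pad to a power of two

    # Up-sweep (reduce): build partial sums in a balanced tree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]  # left child receives the parent's prefix
            data[i] += left             # right child adds the left subtree's sum
        stride //= 2
    return data[:n]


print(blelloch_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```

Compared with a Hillis-Steele scan, this performs on the order of n additions in total, at the cost of roughly twice as many steps.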

In [None]:
!make SRC=./parallel_scan/blelloch_prefix_sum.cu run

*For small arrays (up to 1024 elements), the CPU is faster. For arrays larger than 1024 elements, this code does not produce a correct prefix sum, because the scan runs only within each block and partial sums are not carried across blocks.*

In [None]:
!make SRC=./parallel_scan/hillis_steele_prefix_sum.cu clean
!make SRC=./parallel_scan/blelloch_prefix_sum.cu clean

## **4. Searching**

### **a. Parallel Binary Search**

### **b. K-Nearest Neighbors (KNN) Search**

## **5. Sorting**

### **a. Bitonic Sort**

### **b. Radix Sort**

## **6. Graph Algorithms**

### **a. Breadth First Search (BFS)**

### **b. Depth First Search (DFS)**

### **c. Dijkstra's Algorithm**

### **d. A\* Algorithm**

## **7. Image/Signal Processing**

### **a. Image Convolution (Gaussian Blur)**

### **b. Fast Fourier Transform (FFT) on a signal**

## **8. Statistical Simulation**

### **a. Monte Carlo Simulation**

## **9. Physics Simulation**

### **a. N-Body Simulation**

### **b. Navier-Stokes Fluid Simulation**

### **c. Heat Diffusion**