# GPU Acceleration Basics 02
This notebook follows the video "Python CUDA Installation & CUPY | GPU Acceleration Basics 02" by Rounak Paul on YouTube.

## Using CUDA via the Numba Library
We compare the performance of a simple operation (incrementing every element by one) on a large array of 65,536 values, first on the CPU and then on the GPU.

Host = CPU
Device = GPU

In [1]:
from numba import cuda
import numpy as np
import math

In [2]:
x_host = np.ones(shape=(65536))

def host_increment_by_one(arr):
    for i in range(len(arr)):
        arr[i] += 1


In [3]:
%%timeit
host_increment_by_one(x_host)

5.07 ms ± 74.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
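For context, the loop above is pure Python, so much of that time is likely interpreter overhead rather than arithmetic. A minimal sketch of a vectorized NumPy baseline follows. It is not part of the video, and the names `host_increment_vectorized` and `x_check` are my own:

```python
import numpy as np

def host_increment_vectorized(arr):
    # In-place add: NumPy runs the element loop in compiled code on the CPU
    arr += 1

# Separate array so the notebook's x_host is left untouched
x_check = np.ones(shape=(65536))
host_increment_vectorized(x_check)
```

Timing this with '%%timeit' would give a fairer CPU number to hold the GPU results against.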


## GPU Acceleration (1D Data)
Here we accelerate the previous function by running it on the GPU, referred to as the "device". I'm still figuring out how this all works, but I'm beginning to see it.

Notice how the setup of this GPU function is different than the previous CPU function. 
- Before the function, we have to use the Numba CUDA decorator: @cuda.jit. This tells Numba to compile the function into GPU machine code using its JIT compiler. The decorator marks the function as a "CUDA kernel", which means it can be launched on the GPU from the CPU.
- Inside the function, there are the notions of Thread ID, Block ID, and Block Width. A kernel launch creates a grid of blocks, and each block contains a fixed number of threads. Threads are not the same as cores: the GPU schedules the threads onto its cores, so there can be far more threads than physical cores. The three values are combined into a unique index per thread, which maps each thread to one element of the array.
- The main part of the function is no longer a sequential for loop. The body describes what a single thread does, and the GPU runs many copies of it in parallel. I picture a big grid of threads, each corresponding to one element of the array through the index calculation below. Each thread performs the computation on its own element; here, it simply increments that element by one, and the increments happen in parallel rather than one after another.
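To make the index calculation concrete, here is a small CPU-only sketch that enumerates every (block, thread) pair of a tiny 1D grid and computes the same flattened index as the kernel. The helper `global_indices` and the `sim_` names are hypothetical and exist only for illustration. The Python loop visits threads one at a time, whereas the GPU runs them in parallel.

```python
def global_indices(n_blocks, block_width):
    """Yield (bid, tid, i) for every thread of a 1D grid, using i = tid + bid * bw."""
    for bid in range(n_blocks):
        for tid in range(block_width):
            yield bid, tid, tid + bid * block_width

sim_n = 10            # array length
sim_block_width = 4   # threads per block
sim_blocks = 3        # 3 blocks * 4 threads = 12 threads for 10 elements
sim_arr = [1] * sim_n

for bid, tid, i in global_indices(sim_blocks, sim_block_width):
    if i < sim_n:     # boundary check: threads 10 and 11 have no element
        sim_arr[i] += 1
```

Twelve threads are launched for ten elements, so the boundary check is what stops the last two threads from writing past the end of the array.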

In [4]:
@cuda.jit
def device_increment_by_one(arr):
    # Thread ID in a 1D Block
    tid = cuda.threadIdx.x

    # Block ID in a 1D Grid
    bid = cuda.blockIdx.x

    # Block Width (Number of Threads per Block)
    bw = cuda.blockDim.x

    # Compute the flattened index inside the array
    i = tid + bid * bw
    if i < arr.size:    # Boundary Condition
        arr[i] += 1


In [5]:
x_host = np.ones(shape=(65536))
x_device = cuda.to_device(x_host)
threads_per_block = 256
blocks_per_grid = (x_device.size + (threads_per_block - 1)) // threads_per_block
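The `blocks_per_grid` expression above is an integer ceiling division: it rounds up so that a partially filled last block is still launched. A quick sketch of the arithmetic, using a hypothetical helper `blocks_needed`:

```python
import math

def blocks_needed(n_elements, block_width):
    # Same formula as the notebook: integer ceiling of n_elements / block_width
    return (n_elements + (block_width - 1)) // block_width

# Block counts for a few array sizes with 256 threads per block
examples = {n: blocks_needed(n, 256) for n in (1, 256, 257, 65536, 65537)}
print(examples)
```

For the 65,536-element array this gives exactly 256 blocks, and a single extra element would require a 257th block.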

## Compare Single Trial Run vs Multi Trial Run
The magic command '%%time' runs the cell once, whereas '%%timeit' runs it many times and prints the mean and standard deviation.

In [6]:
%%time
device_increment_by_one[blocks_per_grid, threads_per_block](x_device)

CPU times: user 80.6 ms, sys: 28.5 ms, total: 109 ms
Wall time: 316 ms


In [7]:
%%timeit
device_increment_by_one[blocks_per_grid, threads_per_block](x_device)

11.1 μs ± 1.25 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


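Notice how the single-trial cell took **milliseconds**, whereas the multi-trial cell took **microseconds**.

In the single-trial cell, the kernel had to be JIT-compiled and loaded onto the device on its first call, which is why that first run took so long. Later calls reuse the compiled kernel. Note also that kernel launches are asynchronous: the call returns as soon as the work is queued, so without a `cuda.synchronize()` the '%%timeit' figure mostly reflects launch overhead rather than the time the GPU spends computing.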
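One way to separate the first-call cost from the steady-state cost is to warm up once and synchronize before stopping the clock. The helper below is a sketch with a hypothetical name (`time_launches`); it takes the launch and synchronize steps as plain callables, so the function itself does not need a GPU. The commented usage assumes the names defined in this notebook.

```python
import time

def time_launches(launch, synchronize, repeats=1000):
    """Return mean seconds per launch, excluding the first (compiling) call."""
    launch()        # warm-up: the first call pays for JIT compilation
    synchronize()   # wait for the warm-up work to finish
    start = time.perf_counter()
    for _ in range(repeats):
        launch()    # on a GPU this only queues the work
    synchronize()   # wait until all queued work has actually completed
    return (time.perf_counter() - start) / repeats

# Assumed usage with this notebook's names (requires a CUDA GPU):
# mean_s = time_launches(
#     lambda: device_increment_by_one[blocks_per_grid, threads_per_block](x_device),
#     cuda.synchronize,
# )
```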

## Alternative GPU Index Technique
Another way to get the thread's global index is `cuda.grid(1)`, which returns the same value as the manual `tid + bid * bw` calculation.

In [8]:
@cuda.jit
def device_increment_by_one(arr):
    # Shorthand for the thread's global index in a 1D grid
    i = cuda.grid(1)

    if i < arr.size:
        arr[i] += 1

In [None]:
%%time
device_increment_by_one[blocks_per_grid, threads_per_block](x_device)

CPU times: user 80.6 ms, sys: 28.5 ms, total: 109 ms
Wall time: 316 ms


In [10]:
%%timeit
device_increment_by_one[blocks_per_grid, threads_per_block](x_device)

11.3 μs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [None]:
# The previous %%time/%%timeit runs already modified the data many times, so re-initialize it
x_host = np.ones(shape=(65536))
x_device = cuda.to_device(x_host)
threads_per_block = 256
blocks_per_grid = (x_device.size + (threads_per_block - 1)) // threads_per_block

# Run the kernel on the re-initialized data
device_increment_by_one[blocks_per_grid, threads_per_block](x_device)
x_device.copy_to_host(x_host)

array([2., 2., 2., ..., 2., 2., 2.], shape=(65536,))

## GPU Acceleration (2D Data)
Now do the same with a 2D array. Most of it is similar, except that the threads per block and blocks per grid become 2D tuples.

In [16]:
# Initialize a 2D array
# Note: The shape of the array is (256, 256) but the size is 65536
x_device = cuda.to_device(np.ones(shape=(256, 256)))

@cuda.jit
def device_increment_2D_arr(arr):
    x, y = cuda.grid(2)
    if x < arr.shape[0] and y < arr.shape[1]:
        arr[x, y] += 1

threads_per_block = (16, 16)
blocks_per_grid = (math.ceil(x_device.shape[0] / threads_per_block[0]),
                   math.ceil(x_device.shape[1] / threads_per_block[1]))
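As a sanity check on the 2D launch configuration, the CPU-only sketch below enumerates every thread of a 2D grid and records which array cells pass the boundary test. The function `covered_cells` is hypothetical, and it assumes that `cuda.grid(2)` computes each coordinate as thread index plus block index times block width along that axis.

```python
import math

def covered_cells(shape, block_shape):
    """Return the 2D block counts and the set of (x, y) cells that threads map to."""
    n_blocks = (math.ceil(shape[0] / block_shape[0]),
                math.ceil(shape[1] / block_shape[1]))
    cells = set()
    for bx in range(n_blocks[0]):
        for by in range(n_blocks[1]):
            for tx in range(block_shape[0]):
                for ty in range(block_shape[1]):
                    x = tx + bx * block_shape[0]
                    y = ty + by * block_shape[1]
                    if x < shape[0] and y < shape[1]:  # same check as the kernel
                        cells.add((x, y))
    return n_blocks, cells

n_blocks_2d, cells_2d = covered_cells((256, 256), (16, 16))
print(n_blocks_2d, len(cells_2d))
```

With a (256, 256) array and (16, 16) blocks, the grid is 16 by 16 blocks and every one of the 65,536 cells is covered exactly once.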


In [14]:
%%timeit
device_increment_2D_arr[blocks_per_grid, threads_per_block](x_device)

11.9 μs ± 534 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


## CuPy and Numba are friends
CuPy arrays implement the CUDA Array Interface, so they can be passed directly to Numba kernels. Both libraries are built on top of the CUDA Toolkit, which is provided by NVIDIA.

In [17]:
import cupy as cp

x_device = cp.ones(shape=(256, 256))
device_increment_2D_arr[blocks_per_grid, threads_per_block](x_device)

print(type(x_device))
x_device

<class 'cupy.ndarray'>


array([[2., 2., 2., ..., 2., 2., 2.],
       [2., 2., 2., ..., 2., 2., 2.],
       [2., 2., 2., ..., 2., 2., 2.],
       ...,
       [2., 2., 2., ..., 2., 2., 2.],
       [2., 2., 2., ..., 2., 2., 2.],
       [2., 2., 2., ..., 2., 2., 2.]], shape=(256, 256))