# GPU Acceleration Basics 01
This notebook follows the YouTube video "Python CUDA Installation & CUPY | GPU Acceleration Basics 01" by Rounak Paul.

## GPU Acceleration of Simple / Small Computations
NumPy runs on the CPU; CuPy runs on the GPU. For simple computations, moving to the GPU is not worth it: in the timings below, `norm` on a three-element array takes about 42 μs on the GPU against about 1 μs on the CPU, roughly 40 times slower. The video calls this the "Acceleration Tax". Data transfer is not the main concern here, because `x_dev` is already stored in device memory before the timing starts; what remains is the fixed per-call overhead of launching work on the GPU.

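The video uses the metaphor of a race car (CPU) versus a bus (GPU): the race car is quicker for a single passenger, while the bus moves many passengers at once. Typical clock speeds are around 5 GHz for a CPU against roughly 1.5 GHz for a GPU. When the task is small and simple enough, the CPU generally wins.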


In [6]:
import numpy as np
import cupy as cp

In [7]:
# Array stored on host memory aka the CPU
x_host = np.array([1, 2, 3])
type(x_host)

numpy.ndarray

In [8]:
# Array on the device memory aka the GPU
x_dev = cp.array([1, 2, 3])
type(x_dev)

cupy.ndarray

In [9]:
%%timeit
np.linalg.norm(x_host)

1.06 μs ± 7.13 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [10]:
%%timeit
cp.linalg.norm(x_dev)

42 μs ± 432 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
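
To get a feel for where the "Acceleration Tax" stops dominating, one option is to repeat the same `norm` timing over a range of array sizes. The sketch below is not from the video: the sizes, the `time_norm` helper, and the repeat count are arbitrary choices for illustration, and it falls back to CPU-only timings when no usable GPU is found. It synchronizes the GPU before stopping the clock, because CuPy queues GPU work asynchronously.

```python
import timeit

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    cp = None  # fall back to CPU-only timings


def time_norm(xp, x, repeats=20):
    """Return the best per-call time in seconds for xp.linalg.norm(x)."""
    def run():
        xp.linalg.norm(x)
        if xp is not np:
            # GPU work is asynchronous: wait for it before stopping the clock
            cp.cuda.Stream.null.synchronize()

    run()  # warm-up call (the first GPU call includes one-time setup)
    return min(timeit.repeat(run, number=1, repeat=repeats))


results = {}
for n in (10, 1_000, 100_000, 1_000_000):
    x_cpu = np.arange(n, dtype=np.float64)
    cpu_t = time_norm(np, x_cpu)
    gpu_t = time_norm(cp, cp.asarray(x_cpu)) if cp is not None else None
    results[n] = (cpu_t, gpu_t)
    gpu_text = f"{gpu_t * 1e6:9.1f} μs" if gpu_t is not None else "(no GPU)"
    print(f"n = {n:>9,}: CPU {cpu_t * 1e6:9.1f} μs | GPU {gpu_text}")
```

The size at which the GPU catches up, if it does at all for this operation, depends on the hardware and the operation, so the printed numbers are best read as a trend on one machine.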


In [11]:
# List all of the GPUs I have access to:
num_gpus = cp.cuda.runtime.getDeviceCount()

for i in range(num_gpus):
    props = cp.cuda.runtime.getDeviceProperties(i)
    print(f"GPU {i}: {props['name'].decode()}")

# Select which GPU the CuPy array is allocated on
with cp.cuda.Device(0):
    x_on_device_0 = cp.array([1, 2, 3, 4, 5])


GPU 0: NVIDIA GeForce RTX 4060 Ti


In [12]:
# Create NumPy array on host
x_host = np.random.randint(0, 255, (20000, 20000))

In [13]:
# Transfer NumPy array from host to device
x_dev = cp.asarray(x_host)

In [14]:
# Transfer CuPy array from device to host
x_host_1 = x_dev.get()
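
Before moving a large array to the GPU it can help to check how much memory it needs. A 20000 x 20000 array of 64-bit integers takes 20000 * 20000 * 8 bytes = 3.2 GB, a large share of the VRAM on a consumer card. The sketch below is an illustration, not part of the video: it uses a scaled-down shape so it runs quickly, and it assumes that storing the values as `uint8` is acceptable, which holds here because `randint(0, 255, ...)` only produces values from 0 to 254. The default integer dtype depends on the platform and NumPy version.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    cp = None

shape = (2_000, 2_000)  # scaled down from (20000, 20000) for a quick run

x_default = np.random.randint(0, 255, shape)                  # platform default integer
x_small = np.random.randint(0, 255, shape, dtype=np.uint8)    # one byte per value

print(f"{x_default.dtype}: {x_default.nbytes / 1e6:.1f} MB")
print(f"{x_small.dtype}: {x_small.nbytes / 1e6:.1f} MB")

if cp is not None:
    x_small_dev = cp.asarray(x_small)     # host -> device
    x_back = cp.asnumpy(x_small_dev)      # device -> host, equivalent to x_small_dev.get()
else:
    x_back = x_small.copy()               # no GPU: stand-in so the check still runs

# The round trip should return exactly the same values
print("round trip identical:", np.array_equal(x_back, x_small))
```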

## GPU Acceleration of the Fast Fourier Transform
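SciPy runs on the CPU. The GPU counterparts of SciPy routines live in `cupyx.scipy`, part of CuPy's `cupyx` extension package.

It is also good practice to free objects in VRAM once they are no longer needed. Note that `free_all_blocks()` only releases memory that no array references any more.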
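
As a minimal sketch of freeing VRAM, the example below allocates a temporary array, deletes it, and then asks CuPy's default memory pool to hand the memory back. The `pool_usage` helper and the array size are made up for illustration. Deleting the array returns its block to the pool, while `free_all_blocks()` releases blocks the pool holds but no array is using; an array that is still referenced stays allocated.

```python
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    cp = None


def pool_usage():
    """Return (bytes used by live arrays, bytes held by the pool) on the current device."""
    pool = cp.get_default_memory_pool()
    return pool.used_bytes(), pool.total_bytes()


if cp is not None:
    tmp = cp.zeros((1_000, 1_000), dtype=cp.float64)  # about 8 MB on the device
    used_before, held_before = pool_usage()

    del tmp                                          # block goes back to the pool
    cp.get_default_memory_pool().free_all_blocks()   # pool returns unused blocks to the GPU
    used_after, held_after = pool_usage()

    print(f"in use: {used_before / 1e6:.1f} MB -> {used_after / 1e6:.1f} MB")
    print(f"held:   {held_before / 1e6:.1f} MB -> {held_after / 1e6:.1f} MB")
else:
    used_before = used_after = held_before = held_after = 0
    print("No GPU available; nothing to free.")
```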

In [15]:
# CPU FFT
from scipy.fft import fftn

# GPU (CUDA) implementations of SciPy routines; import the submodule explicitly
import cupyx.scipy.fft

In [18]:
# Benchmark the acceleration
t_host = %timeit -o fftn(x_host)
t_dev = %timeit -o cupyx.scipy.fft.fftn(x_dev)

t_ratio_best = t_host.best / t_dev.best
t_ratio_avg = t_host.average / t_dev.average
t_ratio_stdev = t_host.stdev / t_dev.stdev

print()
print(f"The acceleration is on average {t_ratio_avg:.2f} times faster")

# Free VRAM objects
cp.get_default_memory_pool().free_all_blocks()

3.64 s ± 1.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The slowest run took 7.92 times longer than the fastest. This could mean that an intermediate result is being cached.
66.2 μs ± 74.3 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The acceleration is on average 54953.44 times faster
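
A speedup of roughly 55,000 times is likely overstated. CuPy queues GPU work asynchronously, so `%timeit` can stop the clock once the FFT has been launched rather than when it has finished; the 66.2 μs figure is close to the 42 μs launch overhead measured for `norm` earlier. One way to time GPU code is `cupyx.profiler.benchmark`, which records GPU execution time separately from CPU launch time. The sketch below assumes CuPy v10 or later (where `cupyx.profiler` is available), uses a smaller array than the notebook so it runs quickly, and falls back to a CPU-only timing, using `numpy.fft.fftn` if SciPy is missing.

```python
import time

import numpy as np

try:
    from scipy.fft import fftn
except ImportError:
    from numpy.fft import fftn  # stand-in if SciPy is not installed

try:
    import cupy as cp
    import cupyx.scipy.fft
    from cupyx.profiler import benchmark
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    cp = None

# Smaller than the notebook's 20000 x 20000 array so the sketch runs quickly
x_fft_host = np.random.rand(1_024, 1_024)

# CPU timing: a plain wall-clock measurement is fine for synchronous code
start = time.perf_counter()
fftn(x_fft_host)
cpu_seconds = time.perf_counter() - start
print(f"CPU FFT: {cpu_seconds * 1e3:.2f} ms")

gpu_seconds = None
if cp is not None:
    x_fft_dev = cp.asarray(x_fft_host)
    result = benchmark(cupyx.scipy.fft.fftn, (x_fft_dev,), n_repeat=20, n_warmup=3)
    launch_seconds = float(result.cpu_times.mean())  # time spent on the CPU side
    gpu_seconds = float(result.gpu_times.mean())     # time the GPU actually worked
    print(f"GPU FFT: {gpu_seconds * 1e3:.2f} ms (CPU-side time {launch_seconds * 1e3:.2f} ms)")
    print(f"Speedup: {cpu_seconds / gpu_seconds:.1f}x")
```

An alternative with the same effect is to call `cp.cuda.Stream.null.synchronize()` inside the timed statement, so that the timer waits for the GPU to finish.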
