# GPU Acceleration Basics 01
This notebook follows the YouTube video "Python CUDA Installation & CUPY | GPU Acceleration Basics 01" by Rounak Paul.

In [2]:
import numpy as np
import cupy as cp

In [8]:
# array on host memory
x_host = np.array([1, 2, 3])
type(x_host)

numpy.ndarray

In [9]:
# array on device memory
x_dev = cp.array([1, 2, 3])
type(x_dev)

cupy.ndarray
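
The two cells above create the host and device arrays independently. As a minimal sketch of moving the same data between the two memories, `cp.asarray` copies a NumPy array to the device and `cp.asnumpy` copies it back. The `HAVE_GPU` flag and the fallback branch are my own additions (not from the video) so the cell still runs on a machine without CuPy or a CUDA device.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA driver/device
    cp = None
    HAVE_GPU = False

x_host = np.array([1, 2, 3])

if HAVE_GPU:
    x_dev = cp.asarray(x_host)   # host -> device copy
    x_back = cp.asnumpy(x_dev)   # device -> host copy (x_dev.get() does the same)
else:
    x_back = x_host.copy()       # stand-in so the cell runs without a GPU

print(type(x_back), x_back)
```

Each of these copies crosses the PCIe bus, so in real workloads it is usually worth keeping data on the device for as long as possible.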

In [10]:
%%timeit
np.linalg.norm(x_host)

1.14 μs ± 7.69 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [11]:
%%timeit
cp.linalg.norm(x_dev)

The slowest run took 6.95 times longer than the fastest. This could mean that an intermediate result is being cached.
53.2 μs ± 57 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
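
One caveat about the measurement above: CuPy launches GPU work asynchronously, so `%%timeit` mostly captures the host-side cost of issuing the call and may not include the time the device spends finishing it. A hedged alternative is `cupyx.profiler.benchmark`, which reports host and device times separately. The sketch below assumes a CuPy version that ships `cupyx.profiler`; the `HAVE_GPU` guard and the choice of `n_repeat` are mine.

```python
import timeit
import numpy as np

try:
    import cupy as cp
    from cupyx.profiler import benchmark
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA driver/device
    HAVE_GPU = False

x_host = np.array([1, 2, 3])
n_repeat = 1000

# CPU baseline: mean seconds per call over n_repeat calls
cpu_s = timeit.timeit(lambda: np.linalg.norm(x_host), number=n_repeat) / n_repeat
print(f"NumPy: {cpu_s * 1e6:.2f} us per call")

if HAVE_GPU:
    x_dev = cp.asarray(x_host)
    # benchmark() warms up first, then records per-call host and device times
    result = benchmark(cp.linalg.norm, (x_dev,), n_repeat=n_repeat)
    print(f"CuPy host-side time:   {result.cpu_times.mean() * 1e6:.2f} us per call")
    print(f"CuPy device-side time: {result.gpu_times.mean() * 1e6:.2f} us per call")
else:
    print("No CUDA device found; skipping the CuPy measurement.")
```

The warm-up runs also help with the "slowest run took 6.95 times longer" warning, which is typical when the first call has to compile or load a kernel.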


# Why is the GPU so much slower than the CPU?
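Rounak Paul calls this the "Acceleration Tax". Data transfer is not the main cost here, because our data x_dev is already stored in device memory. What remains is the fixed per-call overhead of dispatching work to the GPU, which dominates for a three-element array.

He uses the metaphor of a race car (CPU) versus a bus (GPU). A CPU core clocks at roughly 5 GHz and a GPU at roughly 1.5 GHz, so the GPU only comes out ahead when there is enough parallel work to fill it. When the task is small and simple enough, the CPU will usually beat the GPU.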
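
To put a number on the "tax", the sketch below first computes the slowdown from the two `%%timeit` means measured earlier in this notebook. It then plugs them into a toy cost model in which the GPU pays a fixed overhead per call but has a lower per-element cost. The model and its three constants (`overhead_us`, `cpu_per_elem_us`, `gpu_per_elem_us`) are illustrative assumptions of mine, not measurements and not figures from the video.

```python
# Means taken from the %%timeit outputs earlier in this notebook
cpu_us = 1.14
gpu_us = 53.2

slowdown = gpu_us / cpu_us
print(f"GPU call is about {slowdown:.0f}x slower for a 3-element array")

# Toy model (hypothetical constants, for illustration only):
#   t_cpu(n) = n * cpu_per_elem_us
#   t_gpu(n) = overhead_us + n * gpu_per_elem_us
overhead_us = 50.0        # assumed fixed cost per GPU call
cpu_per_elem_us = 1e-3    # assumed CPU cost per element
gpu_per_elem_us = 1e-5    # assumed GPU cost per element

# The GPU wins once n exceeds the point where the two lines cross
crossover_n = overhead_us / (cpu_per_elem_us - gpu_per_elem_us)
print(f"Under these assumptions the GPU wins above ~{crossover_n:,.0f} elements")
```

The real crossover depends on the operation, the dtype, and the hardware, so it is best found by timing a range of array sizes rather than trusting the constants above.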

In [8]:
# List all of the GPUs I have access to:
num_gpus = cp.cuda.runtime.getDeviceCount()

for i in range(num_gpus):
    props = cp.cuda.runtime.getDeviceProperties(i)
    print(f"GPU {i}: {props['name'].decode()}")

# Select which GPU the CuPy array is allocated on
with cp.cuda.Device(0):
    x_on_device_0 = cp.array([1, 2, 3, 4, 5])


GPU 0: NVIDIA GeForce RTX 4060 Ti


In [None]:
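# Create a large NumPy array on the host: 20000 x 20000 integers in [0, 255)
# (the upper bound of np.random.randint is exclusive)
x_host = np.random.randint(0, 255, (20000, 20000))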
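
Before allocating an array this large, it can help to estimate its memory footprint. The sketch below does the arithmetic without creating the array. It assumes the default integer dtype is 64-bit, which is typical on 64-bit Linux but varies with platform and NumPy version; the comparison dtypes are my own choice.

```python
import numpy as np

shape = (20000, 20000)
n_elements = shape[0] * shape[1]   # 400 million elements

# Size in GB for a few candidate dtypes (no array is allocated here)
sizes_gb = {
    np.dtype(dtype).name: n_elements * np.dtype(dtype).itemsize / 1e9
    for dtype in (np.int64, np.int32, np.uint8)
}

for name, gb in sizes_gb.items():
    print(f"{name:>6}: {gb:.1f} GB")
```

Since the values here fit in a byte, passing `dtype=np.uint8` to `np.random.randint` would cut the footprint by a factor of eight, which matters once the array has to fit in GPU memory as well.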