<a href="https://colab.research.google.com/github/ggruszczynski/gpu_colab/blob/main/60_python_cuda_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python + CUDA

Let us repeat the previous exercises in Python.

In [7]:
!nvidia-smi

Mon Nov  7 10:09:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |    106MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
from numba import cuda
from numba import jit
import numpy as np
from numba import vectorize, int32, int64, float32, float64
import matplotlib.pyplot as plt

%matplotlib inline

N = 2**26
x = np.arange(N, dtype=np.float64) # [0, 1, ..., N-1] on the host
y = np.copy(x)

print(f"Number of elements: {N} \nMemory size of array element in [MB]: {x.nbytes/1E6}")

d_x = cuda.to_device(x) # Copy of x on the device
d_y = cuda.to_device(y) # Copy of y on the device
d_out = cuda.device_array_like(d_x) # Like np.empty_like, but for device arrays


Number of elements: 67108864 
Memory size of array element in [MB]: 536.870912


## Reduction

In [3]:
# reference: https://numba.pydata.org/numba-doc/dev/cuda/reduction.html

@cuda.reduce
def sum_reduce(a, b):
    return a + b


expect = x.sum()      # numpy sum reduction
got = sum_reduce(x)   # cuda sum reduction
assert expect == got
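
To build some intuition for what a reduction with a binary operation does, here is a minimal CPU-only sketch of a pairwise (tree) reduction. It is not Numba's actual kernel; the name `tree_reduce` is hypothetical, and the sketch assumes the operation is associative and commutative, so the order in which pairs are combined does not matter.

```python
import numpy as np

def tree_reduce(values, op):
    """Pairwise (tree) reduction on the CPU; illustrative only."""
    buf = np.array(values, dtype=np.float64)  # private working copy
    n = buf.size
    if n == 0:
        raise ValueError("cannot reduce an empty array")
    while n > 1:
        half = n // 2
        # Combine element i with element i + half for every i in [0, half)
        buf[:half] = op(buf[:half], buf[half:2 * half])
        if n % 2:
            # Odd length: fold the leftover last element into position 0
            buf[0] = op(buf[0], buf[n - 1])
        n = half
    return buf[0]

small = np.arange(1000, dtype=np.float64)
total = tree_reduce(small, lambda a, b: a + b)
largest = tree_reduce([3, 1, 4, 1, 5], np.maximum)
assert total == small.sum()
```

Each pass halves the number of active elements, so the reduction finishes in about log2(N) passes. On a GPU the combinations within one pass are independent of each other and can run in parallel.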




In [4]:
# Lambda functions can also be used here:
sum_reduce_lam = cuda.reduce(lambda a, b: a + b)

expect = x.sum()      # numpy sum reduction
got = sum_reduce_lam(x)   # cuda sum reduction
assert expect == got



In [5]:
%timeit x.sum()    # NumPy on CPU

44.3 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [6]:
%timeit sum_reduce_lam(x) # Numba on GPU - data from host

242 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%timeit sum_reduce_lam(d_x) # Numba on GPU - data already on the device

2.56 ms ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
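
The three timings above can be put side by side with a little arithmetic. The sketch below only reuses the mean values printed by `%timeit` and the array size printed earlier; the variable names are made up for illustration.

```python
# Mean timings reported by %timeit above, in milliseconds
t_numpy_cpu = 44.3    # x.sum() on the CPU
t_gpu_host = 242.0    # sum_reduce_lam(x), input array on the host
t_gpu_device = 2.56   # sum_reduce_lam(d_x), input array on the device

array_mb = 536.870912  # size of x in MB (2**26 float64 values)

speedup_device = t_numpy_cpu / t_gpu_device
slowdown_host = t_gpu_host / t_numpy_cpu
# MB per ms is numerically the same as GB per s
throughput_gbs = array_mb / t_gpu_device

print(f"GPU (device data) vs NumPy: {speedup_device:.1f}x faster")
print(f"GPU (host data) vs NumPy:   {slowdown_host:.1f}x slower")
print(f"Effective read throughput on the device: {throughput_gbs:.1f} GB/s")
```

With the data already on the device, the GPU reduction is about 17 times faster than NumPy. When the input is a host array, the call is about 5 times slower than NumPy, most likely because each call has to copy roughly 537 MB to the device first.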


## SAXPY

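**SAXPY** stands for “Single-Precision A·X Plus Y”: it computes `a*x + y` element-wise for a scalar `a` and vectors `x` and `y`. It is a function in the standard Basic Linear Algebra Subprograms (BLAS) library. Note that the kernel below is declared with `float64` inputs, so strictly speaking it works in double precision.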
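
As a point of comparison, here is a minimal NumPy sketch of the same operation on the CPU, in single precision. The function name `saxpy_reference` and the small test vectors are hypothetical; this is not the BLAS routine itself.

```python
import numpy as np

def saxpy_reference(a, x, y):
    """CPU reference: a * x + y, element-wise, in single precision."""
    x = np.asarray(x, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    return np.float32(a) * x + y

# Small inputs with their own names, so the notebook's x and y are not overwritten
x_small = np.array([1.0, 2.0, 3.0], dtype=np.float32)
y_small = np.array([10.0, 20.0, 30.0], dtype=np.float32)

result = saxpy_reference(2.0, x_small, y_small)
print(result, result.dtype)
```

A reference like this is handy for checking the output of a GPU version on a small input before timing it on the full array.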

In [None]:

@vectorize(['float64(int64, float64, float64)'], target='cuda') # Type signature and target are required for the GPU
def add_ufunc(a, x, y):
    return a*x + y