# Introduction to GPU Programming with Python
## Other options: PyCUDA, CuPy, Cython

**PyCUDA**
- Gives access to the full CUDA C/C++ API
- One of the most powerful options available in Python
- Requires writing C code within Python, along with several modifications

**CuPy**
- CuPy is an open-source array library accelerated with CUDA
- Highly compatible with NumPy

### PyCUDA
PyCUDA lets you access Nvidia's CUDA parallel computation API from Python.
Key features:

- Maps all of CUDA into Python.
- Enables run-time code generation (RTCG) for flexible, fast, automatically tuned codes.
- Added robustness: automatic management of object lifetimes, automatic error checking
- Added convenience: comes with ready-made on-GPU linear algebra, reduction, scan. Add-on packages for FFT and LAPACK available.
- Fast. Near-zero wrapping overhead.
    
Disadvantage: you need to know a little CUDA C/C++ to write a kernel.

Here is an example that takes a NumPy array and sends it to the GPU, where each element is doubled:

In [None]:
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
import numpy

In [None]:
a = numpy.arange(0,16, dtype=numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

In [None]:
mod = SourceModule("""
    __global__ void doublify(float *a)
    {
        // Flatten the 2D thread index of the 4x4 block into 0..15.
        int idx = threadIdx.x * blockDim.y + threadIdx.y;
        a[idx] *= 2;
        __syncthreads();
    }
""")

In [None]:
doublify = mod.get_function("doublify")
doublify(a_gpu, grid=(1,1), block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
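
The index arithmetic in the kernel can be checked on the CPU. The sketch below is plain NumPy, not PyCUDA: it loops over the same 4x4 thread indices that `block=(4,4,1)` launches and applies the same `idx` formula, which suggests that each of the 16 elements is touched exactly once. The names `block`, `visited` and `out` are illustrative only.

```python
import numpy as np

block = (4, 4, 1)                      # same block shape as the kernel launch
a = np.arange(16, dtype=np.float32)
out = a.copy()
visited = []

# Emulate every (threadIdx.x, threadIdx.y) pair of the single block.
for tx in range(block[0]):
    for ty in range(block[1]):
        idx = tx * block[1] + ty       # threadIdx.x * blockDim.y + threadIdx.y
        out[idx] *= 2
        visited.append(idx)

# Every index 0..15 should appear exactly once.
print(sorted(visited) == list(range(16)))
print(out)
```

With a different block shape or more than one block, the formula would need `blockIdx` terms and a bounds check as well.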

#### A few important commands
**Data transfer**
- `cuda.mem_alloc(size)`: allocate `size` bytes on the GPU
- `cuda.memcpy_htod(destination, source)`: copy an array from the CPU (host) to the GPU (device)
- `cuda.memcpy_dtoh(destination, source)`: copy an array from the GPU back to the CPU

**Creating CUDA kernel**
mod = SourceModule("""   CUDA KERNEL CODE """)  

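**Shortcuts**

These argument handlers wrap a NumPy array passed to a kernel and take care of the allocation and transfers:
- `cuda.In(A)`: copy `A` to the GPU before the kernel runs
- `cuda.Out(A)`: copy the result from the GPU back into `A` after the kernel finishes
- `cuda.InOut(A)`: do both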
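
As a sketch of how the shortcuts remove the explicit `mem_alloc` and `memcpy` calls, the function below doubles an array in place with `cuda.InOut`. It assumes PyCUDA and a CUDA-capable GPU are available. The function name `double_in_place`, the kernel's extra `n` argument and the block size of 256 are choices made for this illustration.

```python
import numpy as np


def double_in_place(a):
    """Double a contiguous float32 NumPy array on the GPU.

    Requires PyCUDA and a CUDA device (assumption of this sketch).
    """
    # Imported lazily so that defining the function does not need a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void doublify(float *a, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)          // guard threads past the end of the array
            a[idx] *= 2;
    }
    """)
    doublify = mod.get_function("doublify")

    threads = 256                                  # illustrative block size
    blocks = (a.size + threads - 1) // threads     # enough blocks to cover a
    # cuda.InOut copies `a` to the GPU before the launch and back afterwards.
    doublify(cuda.InOut(a), np.int32(a.size),
             block=(threads, 1, 1), grid=(blocks, 1))
    return a
```

Calling `double_in_place(numpy.arange(16, dtype=numpy.float32))` on a machine with a GPU should return the array with every element doubled.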


#### GPUArray: PyCUDA library
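
`pycuda.gpuarray` provides a NumPy-like array class that lives on the GPU, so simple elementwise operations need no hand-written kernel. A minimal sketch, assuming PyCUDA and a CUDA device are available (the function name `double_with_gpuarray` is hypothetical):

```python
import numpy as np


def double_with_gpuarray(a):
    """Return 2*a computed on the GPU via pycuda.gpuarray.

    Requires PyCUDA and a CUDA device (assumption of this sketch).
    """
    # Imported lazily so that defining the function does not need a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.gpuarray as gpuarray

    a_gpu = gpuarray.to_gpu(a)     # host -> device copy
    doubled_gpu = 2 * a_gpu        # elementwise multiply runs on the GPU
    return doubled_gpu.get()       # device -> host copy, as a numpy.ndarray
```

Compared with the `SourceModule` version, there is no kernel string and no explicit memory management, at the cost of less control over how the work is mapped to threads.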


### CuPy
Just like NumPy, CuPy offers:

- `ndarray` multi-dimensional arrays, but for GPUs
- ufuncs, for GPUs
- a large set of functions implemented with CUDA
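
Because CuPy mirrors the NumPy interface, the same array code can often run on either library. The sketch below binds the name `xp` (a common convention, not something this tutorial defines) to CuPy when it and a GPU are usable, and otherwise falls back to NumPy so the example still runs on a CPU-only machine.

```python
import numpy as np

try:
    import cupy as xp      # GPU path, if CuPy is installed
    xp.zeros(1)            # touch the device; raises if no usable GPU
except Exception:
    xp = np                # CPU fallback so the sketch still runs

# ndarray creation: same call signature in NumPy and CuPy.
x = xp.arange(6, dtype=xp.float32).reshape(2, 3)

# ufunc: elementwise square root, evaluated wherever `x` lives.
y = xp.sqrt(x) + 1

# Reduction; .item() returns a plain Python float on the host.
total = y.sum().item()

print(type(y))
print(total)
```

The fallback is only a convenience for this illustration. Code that must run on the GPU should import `cupy` directly and let the import error surface.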

In [None]:
import numpy as np
import cupy as cp

#### ndarray
We can create an `ndarray` which will be allocated on the current GPU. Using a previous example:

In [None]:
a = cp.zeros(shape=(2,4), dtype=np.int8)

print(type(a))
print(repr(a))
print(a.dtype)
print(a.shape)

It is also possible to move data from the host (system memory) to the GPU. For example, we can move a `numpy.ndarray`:

In [None]:
a_cpu = np.array([1,2,3])
a_gpu = cp.asarray(a_cpu)

print('cpu :', a_cpu)
print('gpu :', a_gpu, a_gpu.device)

Note that in order to display the GPU array, its data is first copied to the host.
It is also possible to move the data back to the host explicitly:

In [None]:
a_cpu2 = cp.asnumpy(a_gpu)
print(repr(a_cpu2))
print(type(a_cpu2))

# Or, equivalently:
# a_cpu2 = a_gpu.get()