# Introduction to GPU Programming with Python
## Numba on GPU
Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays. NumPy arrays are transferred between the CPU and the GPU automatically.


### ufuncs
A universal function (or ufunc for short) is a function that operates on NumPy arrays (ndarrays) in an element-by-element fashion.
A ufunc is a “vectorized” wrapper for a function that takes a fixed number of scalar inputs and produces a fixed number of scalar outputs.
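
To make the "scalar function wrapped into an element-wise array function" idea concrete, here is a minimal sketch that uses only NumPy. The function name `clipped_add` is hypothetical and chosen for illustration. `np.frompyfunc` is used here only to show the concept: it is slow and returns object arrays, so the result is cast back to integers.

```python
import numpy as np

def clipped_add(a, b):
    # Scalar logic: two scalar inputs, one scalar output
    return min(a + b, 10)

# Wrap the scalar function as a ufunc with 2 inputs and 1 output
clipped_add_ufunc = np.frompyfunc(clipped_add, 2, 1)

x = np.arange(6)
# The scalar 7 is broadcast against every element of x
result = clipped_add_ufunc(x, 7).astype(np.int64)

print(isinstance(clipped_add_ufunc, np.ufunc))
print(result)
```

The wrapped function is applied once per element, and scalar arguments are broadcast like in any built-in ufunc such as `np.add`.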

Creating a traditional NumPy ufunc involves writing some C code, which is not the most straightforward process. Numba makes this easy: using the vectorize decorator, Numba can compile a Python function into a ufunc that operates over NumPy arrays as fast as traditional ufuncs written in C.

To create a compiled ufunc, just decorate a function with @vectorize.
First, let's create a ufunc for the CPU:

### Vectorize decorator and signatures

In [None]:
from numba import vectorize

@vectorize
def add_n(x, n):
    # Done on all elements of ndarray
    return x + n 

In [None]:
import numpy as np
n = 10

x = np.arange(n).astype(np.int64)
y = np.ones_like(x)

Here, using @vectorize, you write your function as operating over input scalars, rather than arrays.
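
As a minimal sketch of what calling such a CPU ufunc looks like, the block below applies `add_n` to two arrays and then to an array and a scalar. The fallback to `np.vectorize` is an assumption added only so that the sketch still runs where Numba is not installed. It is much slower and is not what Numba does internally.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Assumption for portability: fall back to NumPy's (slow) vectorize
    vectorize = np.vectorize

@vectorize
def add_n(x, n):
    # Written for scalars, applied to every element
    return x + n

x = np.arange(10).astype(np.int64)
y = np.ones_like(x)

array_result = add_n(x, y)   # element-wise over two arrays
scalar_result = add_n(x, 5)  # the scalar 5 is broadcast

print(array_result)
print(scalar_result)
```

With Numba, a ufunc created without a signature is compiled lazily for the argument types seen at the first call.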

To generate a ufunc for the GPU, you must add an explicit function signature and specify the target.
The function signature describes the input and output types, in the form:
```python
'return_value_type(argument1_value_type, argument2_value_type, ...)'
```
Below, an addition of two integers that returns an integer:

### CUDA ufuncs
With the vectorize decorator you can write a kernel in Python and then have it execute on the GPU.

Generating a ufunc that uses CUDA requires giving an explicit type signature and setting the target argument to 'cuda':

In [None]:
@vectorize(['int64(int64, int64)'], target='cuda')
def add(x, y):
    return x + y

In [None]:
# Run and measure execution time:
add(x, y)

In [None]:
# Now run the NumPy built-in function and compare:
np.add(x, y)

#### Several things happened with this function call:
- A CUDA kernel was compiled to perform the additions in parallel on all elements
- GPU memory was allocated
- The input data was moved to the GPU
- The kernel was run
- The result was moved back to the host
- The result was converted to an ndarray

How much faster is our GPU function?
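
One way to explore how the GPU ufunc compares with NumPy is to time both with the standard `timeit` module. The sketch below is an assumption about how to set this up, not a prescribed benchmark: it times `np.add` everywhere and times the CUDA ufunc only when Numba reports a usable GPU. The first GPU call is excluded from the timing because it triggers compilation.

```python
import timeit
import numpy as np

x = np.arange(10).astype(np.int64)
y = np.ones_like(x)

# Average time of one NumPy call
cpu_time = timeit.timeit(lambda: np.add(x, y), number=1000) / 1000

gpu_time = None  # stays None when no CUDA GPU (or no Numba) is available
try:
    from numba import cuda, vectorize
    if cuda.is_available():
        @vectorize(['int64(int64, int64)'], target='cuda')
        def add(a, b):
            return a + b

        add(x, y)  # warm-up call: compiles the kernel
        gpu_time = timeit.timeit(lambda: add(x, y), number=100) / 100
except ImportError:
    pass

print("NumPy:", cpu_time, "s per call")
print("CUDA ufunc:", gpu_time, "s per call")
```

For arrays as small as this one, the allocation, transfer, and kernel launch steps listed above usually dominate, so the GPU version is likely to be slower than NumPy. Larger arrays and more arithmetic per element are needed before the GPU pays off.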

### Explicit data management
Numba also allows us to manage the movement of our data. Let's take our previous add example:

In [None]:
@vectorize(['int32(int32, int32)'], target='cuda')
def add(x, y):
    return x + y

In [None]:
from numba import cuda

# Cast the inputs to int32 first so they match the ufunc signature,
# then copy them to the GPU
x = np.arange(n).astype(np.int32)
y = np.ones_like(x)
d_x = cuda.to_device(x)
d_y = cuda.to_device(y)

In [None]:
# Run and measure execution time:
add(d_x, d_y)

Here the result is returned back to the CPU. Sometimes you need to leave it on the GPU (e.g. for further computation on the GPU). This can be done by creating an array directly on the GPU:

In [None]:
d_res = cuda.device_array(shape=(n,), dtype=np.int32)

In [None]:
# Run again and measure execution time:
add(d_x, d_y, out=d_res)
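
The sketch below puts the explicit data-management steps together and shows one way to reuse a result that stays on the GPU before copying it back once with `copy_to_host()`. It assumes a CUDA-capable GPU. The `have_gpu` guard and the second in-place call are additions for illustration, so that the block does nothing (and `res` stays `None`) on machines without CUDA.

```python
import numpy as np

try:
    from numba import cuda, vectorize
    have_gpu = cuda.is_available()
except ImportError:
    have_gpu = False

x = np.arange(10).astype(np.int32)
y = np.ones_like(x)
res = None  # filled in only when a GPU is available

if have_gpu:
    @vectorize(['int32(int32, int32)'], target='cuda')
    def add(a, b):
        return a + b

    # Copy the inputs to the GPU once
    d_x = cuda.to_device(x)
    d_y = cuda.to_device(y)
    # Allocate the output on the GPU (uninitialized)
    d_res = cuda.device_array(shape=x.shape, dtype=np.int32)

    add(d_x, d_y, out=d_res)    # result stays on the GPU
    add(d_res, d_y, out=d_res)  # further GPU work, no host transfer
    res = d_res.copy_to_host()  # single copy back to the CPU

print(res)
```

Keeping intermediate results on the device avoids a host round trip between the two calls.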

### Multiple signatures
It is possible to provide several signatures to @vectorize:

In [None]:
import math
from numba import vectorize, cuda
import numpy as np

@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='cuda')
def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)
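
To see how the signature list is used, the following sketch calls a ufunc of the same shape with float32 and then float64 inputs and records the output dtype for each. The name `discriminant_root` and the sample coefficients are illustrative assumptions. The target falls back to 'cpu' when no CUDA GPU is available, and the block is skipped entirely if Numba is not installed.

```python
import math
import numpy as np

try:
    from numba import cuda, vectorize
except ImportError:
    vectorize = None  # Numba not installed: nothing to demonstrate

results = {}
if vectorize is not None:
    # Assumption: use the GPU when present, otherwise the CPU target
    target = 'cuda' if cuda.is_available() else 'cpu'

    @vectorize(['float32(float32, float32, float32)',
                'float64(float64, float64, float64)'],
               target=target)
    def discriminant_root(a, b, c):
        return math.sqrt(b ** 2 - 4 * a * c)

    for dtype in (np.float32, np.float64):
        a = np.array([1, 1], dtype=dtype)
        b = np.array([5, 4], dtype=dtype)
        c = np.array([4, 3], dtype=dtype)
        # The signature matching the input dtype is selected
        results[np.dtype(dtype).name] = discriminant_root(a, b, c)

for name, value in results.items():
    print(name, value.dtype, value)
```

Each call is matched against the listed signatures, so float32 inputs produce a float32 result and float64 inputs produce a float64 result.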