Video Link: https://www.youtube.com/watch?v=06VErVj9MaQ&t=1108s

The name Numba comes from "NumPy" and "Mamba". Numba JIT-compiles a subset of Python to machine code and can also target the GPU.

[You cannot use Python lists and dictionaries](https://numba.pydata.org/numba-doc/dev/cuda/cudapysupported.html) in CUDA code. Code written that way might be slower. Use NumPy arrays instead.
```
The following Python constructs are not supported:
- Exception handling
- Context management (the with statement)
- Comprehensions (either list, dict, set or generator comprehensions)
- Generator (any yield statements)
```
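
In practice these restrictions push you toward explicit indexed loops that write into a preallocated, fixed-type buffer. The sketch below only illustrates that coding style: it is plain Python and uses the standard-library `array` type as a stand-in for a NumPy array, so it runs without Numba or a GPU. The function name `sincos_into` is made up for this example.

```python
import math
from array import array


def sincos_into(xs, ys, out):
    """Write sin(x) * cos(y) for each pair into the preallocated buffer `out`."""
    # Explicit indexed loop: no comprehension, no generator, no try/except, no `with`.
    for i in range(len(out)):
        out[i] = math.sin(xs[i]) * math.cos(ys[i])


xs = array("d", [0.0, math.pi / 2])
ys = array("d", [0.0, 0.0])
out = array("d", [0.0] * len(xs))  # preallocated, fixed-type output buffer
sincos_into(xs, ys, out)
print(list(out))
```

The same loop body written as `[math.sin(x) * math.cos(y) for x, y in zip(xs, ys)]` would use a list comprehension, which is one of the unsupported constructs listed above.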

# Numba

- Open source, BSD license
- Basic CUDA GPU JIT compilation
- OpenCL support coming

In [1]:
import numba
print("numba", numba.__version__)

numba 0.34.0


# The CUDA GPU

- A massively parallel processor (many cores)
    - 100 threads, 1000 threads, and more
- optimized for data throughput
    - simple (shallow) cache hierarchy
    - best with manual caching!
    - This manually managed cache is called shared memory, and it is addressable
- CPU is latency optimized
    - Deep cache hierarchy
    - L1, L2, L3 caches
- GPU execution model is different
- GPU forces you to think and program in parallel
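
To make "think in parallel" concrete: in the usual CUDA pattern, each thread works out which element it owns from its block and thread indices, and threads past the end of the data do nothing. The sketch below emulates that pattern sequentially in plain Python so it runs without a GPU. The names `launch` and `scale_kernel` are hypothetical and not part of Numba. On a real GPU the per-thread calls would run concurrently.

```python
import math


def launch(kernel, blocks, threads_per_block, *args):
    """Sequentially emulate a 1-D launch: call `kernel` once per (block, thread) pair."""
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, factor):
    # Each "thread" computes its own global index and handles at most one element.
    i = block_idx * block_dim + thread_idx
    if i < len(data):  # threads past the end of the array do nothing
        data[i] *= factor


data = [1.0] * 10
threads_per_block = 4
# Round up so every element is covered: 3 blocks of 4 threads for 10 elements.
blocks = math.ceil(len(data) / threads_per_block)
launch(scale_kernel, blocks, threads_per_block, data, 2.0)
print(data)
```

There is no loop over the data inside the kernel itself. The loop is replaced by the launch configuration, which is the main shift in thinking compared with CPU code.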

In [3]:
import numba.cuda
import numpy as np
import math

In [4]:
my_gpu = numba.cuda.get_current_device()
print("Running on GPU:", my_gpu.name)

Running on GPU: b'GeForce GTX 1080 Ti'


In [6]:
cc = my_gpu.compute_capability
print("Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)")

Compute capability:  6.1 (Numba requires >= 2.0)


In [8]:
print("Number of streaming multiprocessors:", my_gpu.MULTIPROCESSOR_COUNT)

Number of streaming multiprocessors: 28


# High-level Array-Oriented Style
- Use NumPy arrays as the unit of computation
- Use NumPy universal functions (ufuncs) as an abstraction of computation and scheduling
- ufuncs are elementwise functions
- If you use NumPy, you are using ufuncs

In [9]:
print(np.sin, "is of type", type(np.sin))
print(np.add, "is of type", type(np.add))

<ufunc 'sin'> is of type <class 'numpy.ufunc'>
<ufunc 'add'> is of type <class 'numpy.ufunc'>


# Vectorize
- Generates a ufunc from a Python function
- Converts a scalar function into an elementwise array function
- Numba provides CPU support
-  <s>NumbaPro provides GPU support</s>
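
As a rough mental model of the scalar-to-elementwise conversion, the sketch below lifts a scalar function so that it is applied at every position of its input sequences. The `elementwise` decorator is a hypothetical stand-in written in plain Python. It is not how `numba.vectorize` is implemented: the real decorator compiles the scalar function for the declared type signatures and returns a ufunc that operates on NumPy arrays.

```python
import math


def elementwise(scalar_fn):
    """Hypothetical stand-in for a vectorize-style decorator: lift a scalar function to sequences."""
    def wrapper(*sequences):
        # Apply the scalar function once per element position.
        return [scalar_fn(*items) for items in zip(*sequences)]
    return wrapper


@elementwise
def sincos(x, y):
    # Written for scalars only; the decorator supplies the loop over elements.
    return math.sin(x) * math.cos(y)


result = sincos([0.0, math.pi / 2], [0.0, 0.0])
print(result)
```

The point to take away is that the function body never mentions arrays or loops. That separation is what lets the same scalar definition be compiled for different targets.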

In [10]:
# CPU version
@numba.vectorize(['float32(float32, float32)', 
                  'float64(float64, float64)'], target = 'cpu')
def cpu_sincos(x, y):
    return math.sin(x) * math.cos(y)

Reference: 
- https://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
- https://numba.pydata.org/numba-doc/dev/user/vectorize.html

```
The vectorize() decorator supports multiple ufunc targets:
Target      Description
cpu         Single-threaded CPU
parallel    Multi-core CPU
cuda        CUDA GPU
```

In [14]:
# GPU version
@numba.vectorize(['float32(float32, float32)', 
                  'float64(float64, float64)'], target = 'cuda')
def gpu_sincos(x, y):
    return math.sin(x) * math.cos(y)

# Test it out
- 2 input arrays
- 1 output array
- 1 million doubles (8 MB) per array
- Total 24 MB of data
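
The sizes above follow from simple arithmetic, assuming 8 bytes per double (float64) and decimal megabytes (1 MB = 10^6 bytes):

```python
n = 1_000_000
bytes_per_double = 8  # a float64 ("double") occupies 8 bytes
arrays = 3            # two input arrays plus one output array

bytes_per_array = n * bytes_per_double
total_bytes = arrays * bytes_per_array

print(bytes_per_array / 1e6, "MB per array")
print(total_bytes / 1e6, "MB in total")
```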

Note: [numpy.allclose](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.allclose.html)  
Returns True if two arrays are element-wise equal within a tolerance.

In [15]:
# generate data
n = 1000000
x = np.linspace(0, np.pi, n)
y = np.linspace(0, np.pi, n)

# check result
np_ans = np.sin(x) * np.cos(y)
np_cpu_ans = cpu_sincos(x, y)
np_gpu_ans = gpu_sincos(x, y)

print("CPU vectorize correct: ", np.allclose(np_cpu_ans, np_ans))
print("GPU vectorize correct: ", np.allclose(np_gpu_ans, np_ans))

CPU vectorize correct:  True
GPU vectorize correct:  True


In [16]:
print("Numpy")
%timeit np.sin(x) * np.cos(y)

print("CPU vectorize")
%timeit cpu_sincos(x, y)

print("GPU vectorize")
%timeit gpu_sincos(x, y)

Numpy
28.3 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CPU vectorize
32.4 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
GPU vectorize
6.34 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
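
To read these timings as speedups, divide each CPU time by the GPU time. The snippet below uses only the mean values reported above. Because `x` and `y` are host NumPy arrays, the GPU figure presumably also includes copying the inputs to the device and the result back, so it is not a pure compute time.

```python
# Mean timings in milliseconds, copied from the %timeit output above.
timings_ms = {"numpy": 28.3, "cpu_vectorize": 32.4, "gpu_vectorize": 6.34}

gpu = timings_ms["gpu_vectorize"]
speedups = {
    name: t / gpu
    for name, t in timings_ms.items()
    if name != "gpu_vectorize"
}

for name, s in speedups.items():
    print(f"GPU vectorize is {s:.1f}x faster than {name}")
```

These ratios apply only to this machine and this array size, since the balance between transfer cost and compute changes with `n`.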
