## Numba tutorial
Numba is a just-in-time (JIT) compiler that uses LLVM as its backend to generate CPU or GPU machine code from Python functions. It requires fewer changes to your code than Cython, but the data types of the functions you want to accelerate must be fixed (inferred from the arguments at the first call, or declared explicitly). Using it is simple: you just add a decorator (similar to a pragma) to the Python function.
In other words, Numba turns Python into a compiled language with CPU and GPU targets.

### Numba modes
* Object mode: The compiled code operates on Python objects. This mode is not fast; it mostly improves the performance of loops.
* nopython mode: Fully compiled code that operates on native machine data types, without going through Python objects.
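As a minimal sketch of nopython mode, the snippet below uses `njit`, Numba's shorthand for `jit(nopython=True)`. The function name `sum_of_squares` and the test data are made up for illustration, and the `ImportError` fallback is only an assumption so the snippet still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption: without Numba, fall back to the plain Python function.
    def njit(func):
        return func

@njit
def sum_of_squares(values):
    # Only native floats and array elements are touched here,
    # so the loop can be compiled in nopython mode.
    total = 0.0
    for v in values:
        total += v * v
    return total

data = np.arange(5, dtype=np.float64)
result = sum_of_squares(data)
print('Sum of squares:', result)
```

If a function decorated this way uses something Numba cannot type, compilation fails with an error instead of silently dropping to object mode, which makes slow paths easier to spot.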

### References
* https://eng.climate.com/2015/04/09/numba-vs-cython-how-to-choose/
* https://www.youtube.com/watch?v=eYIPEDnp5C4
* https://www.youtube.com/watch?v=06VErVj9MaQ&t=1509s
* https://ipython.org/ipython-doc/3/interactive/magics.html
* http://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/
* https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/
* http://numba.pydata.org/numba-doc/dev/user/examples.html
* https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara
* http://rlhick.people.wm.edu/posts/comparing-the-speed-of-matlab-versus-pythonnumpy.html
* http://stackoverflow.com/questions/5217167/how-many-cuda-cores-does-each-multiprocessor-of-a-gpu-have
* https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/

In [6]:
import numpy as np
from numba import jit

# Add jit decorator to use LLVM to compile to native code.
@jit
def calculate_mean_numba(x_vec):
    total = 0
    # Iterate on x_vec
    for xVal in x_vec:
        total = total + xVal
    
    return total / len(x_vec)

# Normal python
def calculate_mean(x_vec):
    total = 0
    # Iterate on x_vec
    for xVal in x_vec:
        total = total + xVal
    
    return total / len(x_vec)

# Using numpy
def calculate_mean_numpy(x_vec):
    return np.mean(x_vec)

In [7]:
# Create random vector with 10000 elements
big_vec = np.random.rand(10000)
print('Big vector shape:', big_vec.shape)

Big vector shape: (10000,)


### Using Python version

In [8]:
%timeit calculate_mean(big_vec)
mean = calculate_mean(big_vec)
print('Mean value(Pure python) is %f' % (mean))

1000 loops, best of 3: 1.3 ms per loop
Mean value(Pure python) is 0.499933


### Using NumPy version

In [9]:
%timeit calculate_mean_numpy(big_vec)
mean_numpy = calculate_mean_numpy(big_vec)
print('Mean value(Numpy) is %f' % (mean_numpy))

The slowest run took 6.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 18.7 µs per loop
Mean value(Numpy) is 0.499933


### Using Numba version

In [10]:
%timeit calculate_mean_numba(big_vec)
mean2 = calculate_mean_numba(big_vec)
print('Mean value(Numba CPU) is %f' % (mean2))

The slowest run took 6116.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.22 µs per loop
Mean value(Numba CPU) is 0.499933
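The large gap between the slowest and fastest run reported above is most likely the JIT compilation, which happens on the first call for a given argument type. A rough sketch of how to see this outside of `%timeit` follows; `mean_loop` and `vec` are illustrative names, and the fallback decorator is an assumption for environments without Numba (where both calls will simply take similar time).

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, run the plain Python function instead.
    def njit(func):
        return func

@njit
def mean_loop(x_vec):
    total = 0.0
    for x_val in x_vec:
        total += x_val
    return total / len(x_vec)

vec = np.random.rand(10000)

# First call: Numba compiles the function for float64 arrays, then runs it.
start = time.perf_counter()
first_result = mean_loop(vec)
first_call = time.perf_counter() - start

# Second call: the already compiled machine code is reused.
start = time.perf_counter()
second_result = mean_loop(vec)
second_call = time.perf_counter() - start

print('First call:  %.6f s' % first_call)
print('Second call: %.6f s' % second_call)
```

When benchmarking, it is therefore common to call the function once as a warm-up before measuring.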


## Using GPU
Numba also supports computation on CUDA GPUs. One of its coolest features is that it accepts NumPy arrays directly.
 * ufunc: Universal function, a function that operates element-wise on arrays.

By using numba.vectorize you transform your scalar function into an element-wise function.
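Before moving to the GPU, the same idea can be tried on the CPU. The sketch below is only an illustration: `sin_cos_cpu`, `a` and `b` are made-up names, and the `np.vectorize` fallback is an assumption so the code still runs (slowly, in the interpreter) when Numba is missing.

```python
import math
import numpy as np

try:
    from numba import vectorize
    # With no target argument, Numba builds a CPU ufunc.
    as_ufunc = vectorize(['float64(float64, float64)'])
except ImportError:
    # Assumption: fall back to NumPy's interpreted vectorize.
    as_ufunc = lambda func: np.vectorize(func, otypes=[np.float64])

@as_ufunc
def sin_cos_cpu(x, y):
    # Written for two scalars; the decorator applies it element by element.
    return math.sin(x) * math.cos(y)

a = np.linspace(0.0, 1.0, 5)
b = np.linspace(1.0, 2.0, 5)
out = sin_cos_cpu(a, b)  # array inputs, array output
print(out)
```

The function body only ever sees scalars; handling arrays and looping over their elements is done by the generated ufunc.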

In [24]:
import numba.cuda

In [25]:
myGpu = numba.cuda.get_current_device()
nMultiProcessors = myGpu.MULTIPROCESSOR_COUNT
# CUDA cores per multiprocessor, keyed by the major compute capability.
# This is a simplification: the count can also depend on the minor version
# (for example 6.0 has 64 cores per multiprocessor, 6.1 has 128).
nCoresPerCapability = {
    1: 8,
    2: 32,
    3: 192,
    5: 128,
    6: 128
}
major = myGpu.compute_capability[0]
print("Running on GPU", myGpu.name, "compute capability(major):", major)
print("Number of streaming multiprocessors:", nMultiProcessors)
print("Number cores per multiprocessor:", nCoresPerCapability[major])
print("Total cores on GPU:", nMultiProcessors * nCoresPerCapability[major])

Running on GPU b'TITAN X (Pascal)' compute capability(major): 6
Number of streaming multiprocessors: 28
Number cores per multiprocessor: 128
Total cores on GPU: 3584


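In [23]:
import math

# The target must be passed as a string; the bare name gpu raises
# "NameError: name 'gpu' is not defined". Current Numba releases name this target 'cuda'.
@numba.vectorize(['float32(float32,float32)', 'float64(float64,float64)'], target='cuda')
def sin_cos(x, y):
    return math.sin(x) * math.cos(y)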
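One way to keep a cell like this runnable on machines without a GPU is to pick the target string at run time. This is a hedged sketch, not the notebook's original code: `sin_cos_any`, `make_ufunc` and the input arrays are illustrative names, and the `np.vectorize` branch is an assumption for environments where Numba itself is not installed.

```python
import math
import numpy as np

try:
    from numba import cuda, vectorize
    # Use the GPU only when a CUDA device and driver are actually present.
    target = 'cuda' if cuda.is_available() else 'cpu'
    make_ufunc = vectorize(['float32(float32, float32)'], target=target)
except ImportError:
    # Assumption: without Numba, fall back to NumPy's interpreted vectorize.
    target = 'numpy-fallback'
    make_ufunc = lambda func: np.vectorize(func, otypes=[np.float32])

@make_ufunc
def sin_cos_any(x, y):
    return math.sin(x) * math.cos(y)

x = np.random.rand(1000).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)
result = sin_cos_any(x, y)  # NumPy arrays in, NumPy array out
print('Target used:', target)
```

With the `'cuda'` target, the arrays are copied to the device and back on every call, so for small inputs the transfer cost can outweigh the speedup.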