<a href="https://colab.research.google.com/github/freha-mezzoudj/Fast-Computation/blob/main/numba_decoratorsV1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Numba** : Make the Python Code Run Faster like C/C++

* Python is an interpreter-based language hence it's slow compared to other compiler-based languages like C/C++. Due to this python was not used in any performance-intensive application. To solve this problem, a python library named Numba was developed. Numba is generally referred to as Just-In-Time (JIT) compiler of python code which can speed some parts of or all of the python code by converting it to low-level machine instructions. It uses LLVM library for converting python code to machine instructions. Many times, numba can translate whole function code as well to lower level machine instructions. 

* Numba is a python library that translates a subset of the python code into low-level machine code using LLVM compiler to speed up our existing python code. In order to speed up our code, it generally does not require many changes to our code, using one of the decorators (@jit, @vectorize, etc) provided by numba.

* The process of using Numba to speed up code is quite simple. Numba provides us with a list of decorators that we can use to decorate our functions and it'll compile them when we call the function the first time. Each subsequent call will be using that compiled version hence will be faster. When a function decorated with Numba decorators is called, it'll be compiled first to generate faster machine code hence it'll take a little more time. Once the code is compiled then recalling such function will be way faster because the compiled version will be called subsequently.

* Numba works well on functions that involve python loops or numpy arrays. When we decorate an existing function with a numba decorator, it compiles the part of the function code which it can translate to lower-level machine code which will be able to run faster and speeds up the function. 

* see: https://llvm.org/

**Numba can only translate a certain subset of python code which involves loops and code involving numpy to faster machine code. Not everything will be running faster using Numba.**

* A Sunny Solanki's tutorial at: https://coderzcolumn.com/tutorials/python/numba

* Thanks to Jack of Some: https://www.youtube.com/watch?v=x58W9A2lnQc


In [13]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  warn("pylab import has clobbered these variables: %s"  % clobbered +


**Numba @jit decoder**

@jit decorator can be used to decorate any python function and it should speed up the python function.

*Example 1:*

In [19]:
from numba import jit
import random
#@jit(nopython=True)               not yet!
def monte_carlo_pi(nsamples):
  acc=0
  for i in range(nsamples):
    x=random.random()
    y=random.random()
    if (x**2+y**2)<1.0:
      acc+=1
  return 4.0 * acc / nsamples    

We call the function:

In [20]:
%time monte_carlo_pi(10000)

CPU times: user 6.72 ms, sys: 1.09 ms, total: 7.81 ms
Wall time: 10.8 ms


3.1056

We use the jit of numba:

In [23]:
monte_carlo_pi_jitted = jit(nopython=True)(monte_carlo_pi)

In [24]:
%time monte_carlo_pi_jit(10000)

CPU times: user 219 µs, sys: 22 µs, total: 241 µs
Wall time: 245 µs


3.0984

We can use 
* **@jit(nopython=True)** 
* or **@njit()** : (Strict nopython Mode)

@jit(nopython=True)-->
This will force Numba to run in strict nopython mode and convert all the code of the function to low-level machine code.

In [27]:
from numba import njit
monte_carlo_pi_jitted2 = njit()(monte_carlo_pi)

In [28]:
%time monte_carlo_pi_jit(10000)

CPU times: user 298 µs, sys: 0 ns, total: 298 µs
Wall time: 307 µs


3.124

*Example 2*

Without Numba @jit decoder:

In [40]:
def cube_formula(x):
    return x**3 + 3*x**2 + 3

In [41]:
def perform_operation(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

In [42]:
%time out = perform_operation(np.arange(1e6))

CPU times: user 1.41 s, sys: 8.52 ms, total: 1.41 s
Wall time: 1.44 s


In [43]:
%time out = perform_operation(np.arange(1e7))

CPU times: user 13.6 s, sys: 49.1 ms, total: 13.6 s
Wall time: 13.7 s


Only @jit  --> (nopython = False)

In [57]:
@jit           #(nopython=False) by default 
def cube_formula1(x):
    return x**3 + 5*x**2 + 5

In [58]:
@jit 
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula1(elem)
        out[i] = res
    return out

In [59]:
%time out = perform_operation_jitted(np.arange(1e6))

CPU times: user 193 ms, sys: 8.57 ms, total: 201 ms
Wall time: 200 ms


In [60]:
%time out = perform_operation_jitted(np.arange(1e7))

CPU times: user 33.4 ms, sys: 0 ns, total: 33.4 ms
Wall time: 33.6 ms


We can notice from the results of time taken by both functions that it takes literally a lot less compared to what it used to take without @jit. The @jit decorator has improved the performance by quite a big margin.

With Numba @jit decoder:

In [45]:
@jit(nopython=True)  
def cube_formula(x):
    return x**3 + 3*x**2 + 3

In [46]:
@jit(nopython=True) 
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

In [47]:
%time out = perform_operation_jitted(np.arange(1e6))

CPU times: user 282 ms, sys: 1.98 ms, total: 284 ms
Wall time: 286 ms


In [48]:
%time out = perform_operation_jitted(np.arange(1e7))

CPU times: user 41.3 ms, sys: 632 µs, total: 42 ms
Wall time: 48.7 ms


In [None]:
from numba import jit, int32, int64, float32, float64

@jit([int32(int32), int64(int64), float64(float64)], nopython=True, cache=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i, elem in enumerate(x):
        res = cube_formula(elem)
        out[i] = res
    return out

By using the **type** of the array and **nopython** options, we can notice from the time taken by executions that they are the lowest of all our tries till now.

Parallelize Code for Multi-Core CPU (Uses Multi-Threading to Parallelize) 

Numba can also parallelize our code on multi-core CPUs. It uses multi-threading to speed up code by running threads on different cores of the computer in parallel. In order to parallelize code, we need to set parallel parameter of @jit decorator to True. There are two types of parallelization available in Numba

* Automatic Parallelization - When we decorate a function with @jit(parallel=True) decorator, Numba will try to run function in parallel if possible else it'll run it normally.

* Explicit Parallel Loops - We can explicitly force Numba to run code in parallel by using prange() function available from Numba for the loops. This will force Numba to parallelize code.

We'll use explicit parallelization by using prange() function.



Also, the Python Global Interpreter Lock (GIL) can prevent the speed up of multi-threading. We'll explain in our upcoming examples how we can release GIL and get around this problem.

Below we have re-defined our functions and set parallel parameter to True inside of @jit decorator. We have also modified the logic of our perform_operation_jitted() function to use prange() function. We are using index retrieved from prange() function to index array and retrieve individual element.

In [None]:
from numba import jit, int64, float32, float64, prange

In [None]:
@jit([int64(int64), float64(float64)], nopython=True, cache=True)
def cube_formula(x):
    return x**3 + 3*x**2 + 3

Options: type, nopython=True, cache=True, parallel=True

In [None]:
@jit([int64[:](int64[:]), float64[:](float64[:])], nopython=True, cache=True, parallel=True)
def perform_operation_jitted(x):
    out = np.empty_like(x)
    for i in prange(len(x)):
        res = cube_formula(x[i])
        out[i] = res
    return out

Numba does not Improve Pandas Code 

Numba works well with python loops and numpy. Though Pandas is built on top of numpy but still Numba can not improve code involving pandas data structures using pandas operations. The reason behind this can be that Numba does not have access to lower-level code behind pandas API which it can optimize.
Sol: https://coderzcolumn.com/tutorials/python/guide-to-speed-up-code-involving-pandas-dataframe-using-numba


**Numba vectorization**

*Example 1 using Vectorize()*

example from: https://www.learnpythonwithrune.org/performance-comparison-of-numba-vs-vectorization-vs-lambda-function-with-numpy/

In [36]:
size = 100
x = np.random.rand(size, size)
y = np.random.rand(size, size)
iterations = 100000

In [37]:
@jit(nopython=True)
def add_numba(a, b):
    c = np.zeros(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] + b[i, j]
    return c

In [38]:
def add_vectorized(a, b):
    return a + b

We call the function once, to precompile the code

In [None]:
z = add_numba(x, y)
start = time.time()
for _ in range(iterations):
    z = add_numba(x, y)
end = time.time()
print("Elapsed (numba, precompiled) = %s" % (end - start))


In [None]:
start = time.time()
for _ in range(iterations):
    z = add_vectorized(x, y)
end = time.time()
print("Elapsed (vectorized) = %s" % (end - start))

* @vectorize is used to write an expression that can be applied one element at a time (scalars) to an array. 

* The @jit decorator is more general and can work on any type of calculation.


* More examples at: https://coderzcolumn.com/tutorials/python/numba-vectorize-decorator

*Example 2*

In [61]:
def cube_formula(x):
    return x**3 + 3*x**2 + 3

cube_formula(5)

203

We have vectorized our cube_formula() function using np.vectorize() function. 
The np.vectorize() function takes as input any function and make it run on numpy array. 

The function wrapped inside of np.vectorize() will run faster compared to same function run as python loop through array.

In [62]:
vectorized_cube_formula = np.vectorize(cube_formula)

vectorized_cube_formula

<numpy.vectorize at 0x7f0139c243a0>

In [63]:
arr = np.arange(1, 1000000, dtype=np.int64)

In [64]:
%%time

res = vectorized_cube_formula(arr)

CPU times: user 846 ms, sys: 57.4 ms, total: 904 ms
Wall time: 905 ms


In [65]:
res[:5]

array([  7,  23,  57, 115, 203])

The @vectorize decorator requires us to specify possible data types of input and output of the function. It'll then create a compiled version for each data type. The data type should be in order from less memory data type to more memory data type. Below we have highlighted the signature of @vectorize decorator.

@vectorize([ret_datatype1(input1_datatype1,input2_datatype1,...), ret_datatype2(input1_datatype2,input2_datatype2,...), ...], target='cpu', cache=False)
def func(x):
    return x*x
Apart from datatypes, it accepts two other arguments.

target - This argument accepts one of the below-mentioned three strings as input specifying how to further speed up code based on available resources.
'cpu' - This is default argument. It's used for a single-core (single-threaded) CPU.
'parallel' - This argument runs code in parallel on multi-core (multi-threaded) CPU.
'cuda' - This argument is set for GPU.
cache - This parameter accepts boolean values specifying whether to use caching to speed up reruns of the same function again and again with the same inputs.

In [73]:
from numba import vectorize, int64, float32, float64

@vectorize([int64(int64), float32(float32), float64(float64)])
def cube_formula_numba_vec(x):
    return x**3 + 3*x**2 + 3

In [74]:
arr = arr.astype(np.int64)

In [75]:
%%time
res = cube_formula_numba_vec(arr)

CPU times: user 1.96 ms, sys: 0 ns, total: 1.96 ms
Wall time: 1.97 ms


In [76]:
res[:5]

array([  7,  23,  57, 115, 203])

Execution using Flaot type:

In [77]:
arr = arr.astype(np.float64)

In [78]:
%%time
res = cube_formula_numba_vec(arr)

CPU times: user 1.96 ms, sys: 0 ns, total: 1.96 ms
Wall time: 1.99 ms


**Numba Vectorize Decorated and Parallelized Function**
we can decorated the function again with @vectorize decorator with 'parallel' option.

In [79]:
from numba import vectorize, int64, float32, float64

@vectorize([int64(int64,int64), float32(float32,float32), float64(float64,float64)], target="parallel")
def cube_formula_numba_vec_paralleled(x, y):
    return x**3 + 3*x**2 + y

We can notice  that the results are almost the same as previous runs without parallelizing.