# The need for speed without bothering too much: An introduction to `numba`


<img src="figures/numba_logo.png" style="display:block;margin:auto;width:70%;"/>

# Do you really need the speed?

* Write your Python program.
* Ensure it executes correctly and does what it is supposed to.
* Is it fast enough?
* If yes: Ignore the rest of the presentation.
* If no:
    1. Get it right.
    2. Test it's right.
    3. Profile if slow.
    4. Optimise: rewrite the hot spots in C, C++ or Cython, or use `numba` and save yourself the pain.
    5. Repeat from 2.
    
> We *should forget* about small efficiencies, say about 97% of the time: **premature optimization is the root of all evil**.

> Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only **after** that code has been identified.

<p style="text-align:right">**Donald Knuth**</p>
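Step 3 of the workflow above (profile before optimising) can be sketched with the standard library's `cProfile` and `pstats` modules. The functions `slow_sum` and `main` are hypothetical stand-ins for a real program, not part of the original material.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive pure-Python loop (hypothetical hot spot)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    """Hypothetical program entry point that calls the hot spot."""
    return slow_sum(200_000)


# Collect profiling data only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Sort by cumulative time and report the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

Only the functions that dominate such a report are worth optimising; the rest belongs to Knuth's 97%.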

# The need for speed

* For many programs, the most important resource is **developer time**.
* The best code is:
    * Easy to understand.
    * Easy to modify.
* But sometimes execution speed matters. Then what do you do?
* Go find a compiler!

# A Python compiler?

* Takes advantage of a simple fact:
    * Most functions in your program only use a small number of types.
* &rightarrow; Generate machine code to manipulate only the types you use!
* The LLVM (originally "Low Level Virtual Machine") library already implements a compiler backend.
   * It is used to construct, optimize and produce intermediate and/or binary machine code.
   * A compiler framework: you provide the "front end" (parser and lexer), and LLVM provides the optimizer and the "back end" (code that converts LLVM's intermediate representation to actual machine code).
   * Multi-platform.
   * LLVM optimizations (inlining, loop unrolling, SIMD vectorization, etc.).

# How can `numba` help?

* If your data lives in big `numpy` arrays (remember that `pandas` uses `numpy` under the covers), `numba` makes it easy to write simple, fast functions that work on that data.
* `numba` is an open source JIT (Just-In-Time) compiler for Python functions.
* From the types of the function arguments, `numba` can often generate a specialized, fast, machine code implementation at runtime.
* Designed to work best with numerical code and `numpy` arrays.
* Uses the LLVM library as the compiler backend.


# `numba` features

* Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a few simple annotations, array-oriented and math-heavy Python code can be JIT-compiled to reach performance similar to that of C, C++ and Fortran, without having to switch languages or Python interpreters.
* `numba` supports:
    * Windows, OSX and Linux.
    * 32- and 64-bit CPUs and NVIDIA GPUs (CUDA).
    * `numpy`
* Does *not* require any C/C++ compiler.
* Does *not* replace the standard Python interpreter (all Python libraries are still available).
* Easy to install (`conda` is no longer required): `pip install numba` (wheels for Windows/Linux/OSX are available, so there is no need to compile anything).

# How `numba` works

<img src="figures/how_numba_works.png" style="display:block;margin:auto;width:80%;"/>

# `numba` modes of compilation

* **object mode**: Compiled code operates on Python objects. Supports nearly all of Python, but generally cannot speed up code by a large factor. The only significant improvement comes from loops that can be compiled in *nopython* mode.
    * In object mode, Numba will attempt to perform *loop lifting*, i.e. extract loops and compile them in *nopython* mode.
    * Works great for functions that are bookended by uncompilable code, but have a compilable core loop.
    * All happens automatically.
* **nopython mode**: Compiled code operates on native machine data. Supports a subset of Python, but runs close to C/C++/FORTRAN speed.

# **nopython** mode features
* Standard control and looping structures: *if*, *else*, *while*, *for*, *range*.
* `numpy` arrays, int, float, complex, booleans, and tuples.
* Almost all arithmetic, logical, and bitwise operators as well as functions from the `math` and `numpy` modules.
* Nearly all `numpy` dtypes: int, float, complex, datetime64, timedelta64.
* Array element access (read and write).
* Array reduction functions: sum, prod, max, min, etc.
* Calling other nopython mode compiled functions.
* Calling `ctypes` or `cffi` wrapped external functions.

# Example 1: Summation

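In [1]:

    from numba import jit

    def psum(x):
        # Pure-Python reference: sum of i**2 + i + 1 for i in range(x).
        res = 0
        for i in range(x):
            res += i**2 + 1 + i
        return res

    # Compile the same function with numba; compilation happens on the first call.
    nsum = jit(psum)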

In [2]:

    %timeit -c psum(1000)
    %timeit -c psum(100000)
    %timeit -c nsum(1000)
    %timeit -c nsum(100000000)

    458 µs ± 5.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    46 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    169 ns ± 3.21 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    167 ns ± 1.81 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


# References

1. [Stanley Seibert - Accelerating Python with the Numba JIT Compiler (SciPy 2015)](https://www.youtube.com/watch?v=eYIPEDnp5C4)
2. Travis E. Oliphant - Performance Python: Introduction to Numba (PyData 2015)