# Chapter 5. Exploring Compilers

This chapter looks at tools that compile functions and methods directly to machine code rather than executing their instructions in the interpreter.

Numba is a library designed to compile small functions on the fly. Instead of transforming Python code to C, Numba analyzes and compiles Python functions directly to machine code

PyPy is a replacement interpreter that works by analyzing the code at runtime and optimizing the slow loops automatically

These tools are called **Just-In-Time (JIT)** compilers because the compilation is performed at runtime rather than before running the code (in other cases, the compiler is called ahead-of-time or AOT).

## 5.1 Numba

Numba is a library for compiling individual Python functions at runtime using the Low-Level Virtual Machine (LLVM) toolchain.

LLVM is a set of tools designed for writing compilers. LLVM is language agnostic and is used to write compilers for a wide range of languages (an important example is the Clang compiler). One of the core aspects of LLVM is the intermediate representation (the LLVM IR), a very low-level, platform-agnostic language similar to assembly that can be compiled to machine code for the specific target platform.

Numba works by inspecting Python functions and by compiling them, using LLVM, to the IR. As we have already seen in the last chapter, the speed gains can be obtained when we introduce types for variables and functions. Numba implements clever algorithms to guess the types (this is called type inference) and compiles type-aware versions of the functions for fast execution.

Note that Numba was developed to improve the performance of numerical code. The development efforts often prioritize the optimization of applications that intensively use NumPy arrays

### 5.1.1 First steps with numba

In [None]:
conda install numba

In [2]:
def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i] ** 2
    return result

In [4]:
import numba as nb

@nb.jit  # decorator
def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i] ** 2
    return result

In [5]:
import numpy as np
x = np.random.rand(10000)

# Original
%timeit sum_sq.py_func(x)
# 100 loops, best of 3: 6.11 ms per loop

# Numba
%timeit sum_sq(x)
# 100000 loops, best of 3: 11.7 μs per loop

1.66 ms ± 6.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
10.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
%timeit (x**2).sum()

7.91 µs ± 38.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [8]:
x_list = x.tolist()
%timeit sum_sq(x_list)

9.55 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [9]:
%timeit sum([x**2 for x in x_list])

721 µs ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### 5.1.2 Type specializations

The *nb.jit* decorator works by compiling a specialized version of the function once it encounters a new argument type

Numba exposes the specialized types through the signatures attribute. At any point after the sum_sq definition, we can inspect the specializations compiled so far by accessing sum_sq.signatures.

In [10]:
sum_sq.signatures

[(array(float64, 1d, C),), (reflected list(float64),)]

In [11]:
x = np.random.rand(1000).astype('float64')
sum_sq(x)
sum_sq.signatures

[(array(float64, 1d, C),), (reflected list(float64),)]

In [12]:
x = np.random.rand(1000).astype('float32')
sum_sq(x)
sum_sq.signatures

[(array(float64, 1d, C),),
 (reflected list(float64),),
 (array(float32, 1d, C),)]
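The way the list of signatures grows with each new argument type can be mimicked in plain Python. The following is only a toy sketch of the idea, not how Numba is implemented: the `ToyDispatcher` class and the `double` function are hypothetical names, and no machine code is generated.

```python
class ToyDispatcher:
    """Toy model of per-type specialization; not Numba's real machinery."""

    def __init__(self, func):
        self.py_func = func      # original function, similar in spirit to py_func
        self._specialized = {}   # tuple of argument types -> "compiled" callable

    @property
    def signatures(self):
        # Specializations created so far, in order of first use
        return list(self._specialized)

    def __call__(self, *args):
        key = tuple(type(arg) for arg in args)
        if key not in self._specialized:
            # A real JIT would emit machine code for these types here;
            # this sketch simply reuses the Python function.
            self._specialized[key] = self.py_func
        return self._specialized[key](*args)


@ToyDispatcher
def double(x):
    return x * 2


double(1)      # first int call: creates the (int,) specialization
double(2)      # same type: the existing specialization is reused
double(1.5)    # new type: creates the (float,) specialization
print(double.signatures)
```

As with sum_sq.signatures, a new entry appears only when a previously unseen argument type is encountered.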

It is possible to explicitly compile the function for certain types by passing a signature to the nb.jit function

An individual signature can be passed as a tuple that contains the type we would like to accept. Numba provides a great variety of types that can be found in the nb.types module, and they are also available in the top-level nb namespace. If we want to specify an array of a specific type, we can use the slicing operator, [:], on the type itself.

In [13]:
@nb.jit((nb.float64[:],))
def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i] ** 2
    return result

Note that when we explicitly declare a signature, we are prevented from using other types

In [14]:
sum_sq(x.astype('float32'))

TypeError: No matching definition for argument type(s) array(float32, 1d, C)

Another way to declare signatures is through type strings. For example, a function that takes a float64 as input and returns a float64 as output can be declared with the float64(float64) string. Array types can be declared using a [:] suffix

In [15]:
@nb.jit('float64(float64[:])')
def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i] ** 2
    return result

In [16]:
@nb.jit(['float64(float64[:])',
         'float64(float32[:])'])
def sum_sq(a):
    result = 0
    N = len(a)
    for i in range(N):
        result += a[i] ** 2
    return result

### 5.1.3 Object mode versus native mode

The degree of optimization obtainable from Numba depends on how well Numba is able to infer the variable types and how well it can translate those standard Python operations to fast type-specific versions. If this happens, the interpreter is side-stepped and we can get performance gains similar to those of Cython

When Numba cannot infer variable types, it will still try and compile the code, reverting to the interpreter when the types can't be determined or when certain operations are unsupported. In Numba, this is called **object mode** and is in contrast to the interpreter-free scenario, called **native mode**.

Numba provides a function, called **inspect_types**, that helps understand how effective the type inference was and which operations were optimized

In [17]:
sum_sq.inspect_types()

sum_sq (array(float64, 1d, A),)
--------------------------------------------------------------------------------
# File: <ipython-input-16-7a2c74aa39d8>
# --- LINE 1 --- 

@nb.jit(['float64(float64[:])',

         # --- LINE 2 --- 

         'float64(float32[:])'])

# --- LINE 3 --- 

def sum_sq(a):

    # --- LINE 4 --- 
    # label 0
    #   a = arg(0, name=a)  :: array(float64, 1d, A)
    #   $const2.0 = const(int, 0)  :: Literal[int](0)
    #   result = $const2.0  :: float64
    #   del $const2.0

    result = 0

    # --- LINE 5 --- 
    #   $6load_global.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $10call_function.3 = call $6load_global.1(a, func=$6load_global.1, args=[Var(a, <ipython-input-16-7a2c74aa39d8>:4)], kws=(), vararg=None)  :: (array(float64, 1d, A),) -> int64
    #   del $6load_global.1
    #   N = $10call_function.3  :: int64
    #   del $10call_function.3

    N = len(a)

    # --- LINE 6 --- 
    #   jump 14
    # label 14


For each line, Numba prints a thorough description of variables, functions, and intermediate results

All the variables have a well-defined type. Therefore, we can be certain that Numba is able to compile the code quite efficiently. This form of compilation is called **native mode**.

In [18]:
@nb.jit
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result

In [20]:
concatenate(['hello', 'world'])
concatenate.signatures

[(reflected list(unicode_type),)]

In [21]:
concatenate.inspect_types()

concatenate (reflected list(unicode_type),)
--------------------------------------------------------------------------------
# File: <ipython-input-18-52e1d864c8e7>
# --- LINE 1 --- 

@nb.jit

# --- LINE 2 --- 

def concatenate(strings):

    # --- LINE 3 --- 
    # label 0
    #   strings = arg(0, name=strings)  :: reflected list(unicode_type)
    #   $const2.0 = const(str, )  :: Literal[str]()
    #   result = $const2.0  :: unicode_type
    #   del $const2.0

    result = ''

    # --- LINE 4 --- 
    #   jump 6
    # label 6
    #   $10get_iter.1 = getiter(value=strings)  :: iter(reflected list(unicode_type))
    #   del strings
    #   $phi12.0 = $10get_iter.1  :: iter(reflected list(unicode_type))
    #   del $10get_iter.1
    #   jump 12
    # label 12
    #   $12for_iter.1 = iternext(value=$phi12.0)  :: pair<unicode_type, bool>
    #   $12for_iter.2 = pair_first(value=$12for_iter.1)  :: unicode_type
    #   $12for_iter.3 = pair_second(value=$12for_iter.1)  :: bool
    #   del $1

This means that, in this case, Numba is unable to compile this operation without the help of the Python interpreter

In [22]:
x = ['hello'] * 1000
%timeit concatenate.py_func(x)

74.8 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [23]:
%timeit concatenate(x)

1.36 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


This is because the Numba compiler is not able to optimize the code and adds some extra overhead to the function call

Numba compiled the code without complaints even if it is inefficient. The main reason for this is that Numba can still compile other sections of the code in an efficient manner while falling back to the Python interpreter for other parts of the code. This compilation strategy is called **object mode**

It is possible to force the use of native mode by passing the nopython=True option to the nb.jit decorator.

In [26]:
@nb.jit(nopython = True)
def concatenate(strings):
    result = ''
    for s in strings:
        result += s
    return result 

concatenate(x)

Encountered the use of a type that is scheduled for deprecation: type 'reflected list' found for argument 'strings' of function 'concatenate'.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "<ipython-input-26-29090bfedc97>", line 2:
@nb.jit(nopython = True)
def concatenate(strings):
^


'hellohellohellohello...'  # output shortened; the result is 'hello' repeated 1000 times

### 5.1.4 Numba and numpy

Numba was originally developed to easily increase performance of code that uses NumPy arrays. Currently, many NumPy features are implemented efficiently by the compiler

#### Universal functions with numba

Universal functions are special functions defined in NumPy that are able to operate on arrays of different sizes and shapes according to the broadcasting rules. One of the best features of Numba is the implementation of fast *ufuncs*

For instance, the np.log function is a ufunc because it can accept scalars and arrays of different sizes and shapes. Also, universal functions that take multiple arguments still work according to the broadcasting rules. Examples of universal functions that take multiple arguments are np.add and np.subtract.
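A small NumPy-only sketch of this behaviour is shown here. The array names `col` and `row` are chosen purely for illustration.

```python
import numpy as np

# np.log and np.add are both instances of np.ufunc
print(isinstance(np.log, np.ufunc), isinstance(np.add, np.ufunc))

col = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([10, 20])          # shape (2,)

# The two-argument ufunc broadcasts its inputs to the common shape (3, 2)
total = np.add(col, row)
print(total.shape)
print(total)
```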

Universal functions can be defined in standard NumPy by implementing the scalar version and using the np.vectorize function to enhance the function with the broadcasting feature.

In [31]:
import numpy as np

def cantor_py(a,b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

In [32]:
@np.vectorize
def cantor(a,b):
    return int(0.5 * (a + b)*(a + b + 1) + b)

In [33]:
cantor(np.array([1,2]),2)

array([ 8, 12])

Apart from convenience, defining universal functions in pure Python is not very useful, as it requires many function calls that are affected by interpreter overhead. For this reason, ufuncs are usually implemented in C or Cython, but Numba beats all these methods in convenience.

All we need to do to perform the conversion is use the equivalent decorator, nb.vectorize.
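As a minimal sketch, the Cantor pairing function can be decorated with nb.vectorize in the same way as with np.vectorize. The name `cantor_fast` is hypothetical, and the fallback to np.vectorize when Numba is not installed is an assumption made only so that the snippet runs anywhere.

```python
import numpy as np

try:
    import numba as nb
    vectorize = nb.vectorize
except ImportError:
    # Assumption for this sketch only: without Numba, use NumPy's slower equivalent
    vectorize = np.vectorize


@vectorize
def cantor_fast(a, b):
    return int(0.5 * (a + b) * (a + b + 1) + b)


a = np.array([1, 2])
b = np.array([2, 2])
print(cantor_fast(a, b))
```

The values returned are the same as those of the np.vectorize version; only the way the scalar function is turned into a ufunc changes.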

In [37]:
x1 = 1
x2 = 2

# Pure python 
%timeit cantor_py(x1, x2)

# Numba
%timeit cantor(x1, x2)

# Numpy 
%timeit (0.5 * (x1 + x2)*(x1 + x2 + 1) + x2)

256 ns ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
15.3 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
148 ns ± 1.13 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


An additional advantage of universal functions is that, since they depend on individual values, their evaluation can also be executed in parallel. Numba provides an easy way to parallelize such functions by passing the target="parallel" (multithreaded CPU) or target="cuda" (GPU) keyword argument to the nb.vectorize decorator.

#### Generalized universal functions

One of the main limitations of universal functions is that they must be defined on scalar values. A generalized universal function, abbreviated *gufunc*, is an extension of universal functions to procedures that take arrays

A classic example is the matrix multiplication. In NumPy, matrix multiplication can be applied using the np.matmul function, which takes two 2D arrays and returns another 2D array

In [38]:
a = np.random.rand(3,3)
b = np.random.rand(3,3)

c = np.matmul(a,b)
c.shape

(3, 3)

Since a ufunc broadcasts an operation over arrays of scalars, its natural generalization is to broadcast over an array of arrays.

If, for instance, we take two arrays of 3 by 3 matrices, we expect np.matmul to match the matrices and take their product. In the following example, we take two arrays containing 10 matrices of shape (3, 3). If we apply np.matmul, the product will be applied matrix-wise to obtain a new array containing the 10 results (which are, again, (3, 3) matrices):

In [39]:
a = np.random.rand(10, 3, 3)
b = np.random.rand(10, 3, 3)

c = np.matmul(a, b)
c.shape

(10, 3, 3)

The usual rules for broadcasting will work in a similar way. For example, if we have an array of (3, 3) matrices, which will have a shape of (10, 3, 3), we can use np.matmul to calculate the matrix multiplication of each element with a single (3, 3) matrix. According to the broadcasting rules, the single matrix is repeated to obtain a shape of (10, 3, 3).

In [40]:
a = np.random.rand(10, 3, 3)
b = np.random.rand(3, 3) # Broadcasted to shape (10, 3, 3)

c = np.matmul(a, b)
c.shape

(10, 3, 3)

Numba supports the implementation of efficient generalized universal functions through the nb.guvectorize decorator. As an example, we will implement a function that computes the squared Euclidean distance between two arrays as a gufunc. To create a gufunc, we have to define a function that takes the input arrays, plus an output array where we will store the result of our calculation.

The nb.guvectorize decorator requires two arguments:
1. The types of the input and output: two 1D arrays as input and a scalar as output
2. The so-called layout string, which is a representation of the input and output sizes; in our case, we take two arrays of the same size (denoted arbitrarily by n), and we output a scalar

In [41]:
@nb.guvectorize(['float64[:], float64[:], float64[:]'], '(n), (n)->()')
def euclidean(a,b,out):
    N = a.shape[0]
    out[0] = 0.0
    for i in range(N):
        out[0] += (a[i] - b[i]) ** 2

Note that Numba treats scalar arguments as arrays of size 1, which is why the result is written to out[0].

In [45]:
a = np.random.rand(2)
b = np.random.rand(2)
c = euclidean(a, b) # Shape: ()
c.shape

()

In [46]:
a = np.random.rand(10, 2)
b = np.random.rand(10, 2)
c = euclidean(a, b) # Shape: (10,)
c.shape

(10,)

In [47]:
a = np.random.rand(10, 2)
b = np.random.rand(2)
c = euclidean(a, b) # Shape: (10,)
c.shape

(10,)
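The layout string uses the same notation as NumPy's generalized ufunc signatures, so the shapes above can be reproduced without Numba by passing a signature to np.vectorize. This is a slow, pure-Python sketch meant only to illustrate the meaning of '(n),(n)->()'. The names `sq_dist` and `sq_dist_gu` are hypothetical.

```python
import numpy as np


def sq_dist(a, b):
    # Core operation: two 1D arrays of equal length n in, one scalar out
    return ((a - b) ** 2).sum()


# '(n),(n)->()': the last axis of each input is consumed, all others broadcast
sq_dist_gu = np.vectorize(sq_dist, signature='(n),(n)->()')

a = np.arange(6.0).reshape(3, 2)   # three vectors of length 2
b = np.zeros(2)                    # one vector, broadcast against all three

out = sq_dist_gu(a, b)
print(out.shape)
print(out)
```

Only the leading dimensions survive in the output shape, which is why a (10, 2) input paired with a (2,) input yields a result of shape (10,).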

In [48]:
a = np.random.rand(10000, 2)
b = np.random.rand(10000, 2)

In [49]:
%timeit ((a - b)**2).sum(axis=1)
# 1000 loops, best of 3: 288 μs per loop

119 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [50]:
%timeit euclidean(a, b)
# 10000 loops, best of 3: 35.6 μs per loop

28.7 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### 5.1.5 JIT classes

In [51]:
class Node:
    def __init__(self, value):
        self.next = None
        self.value = value

In [53]:
class LinkedList:
    def __init__(self):
        self.head = None
    def push_front(self, value):
        if self.head is None:
            self.head = Node(value)
        else:
            # We replace the head
            new_head = Node(value)
            new_head.next = self.head
            self.head = new_head

In [54]:
def show(self):
    node = self.head
    while node is not None:
        print(node.value)
        node = node.next

In [59]:
lst = LinkedList()
lst.push_front(1)
lst.push_front(2)
lst.push_front(3)

In [60]:
@nb.jit
def sum_list(lst):
    result = 0
    node = lst.head
    while node is not None:
        result += node.value
        node = node.next
    return result

In [61]:
lst = LinkedList()
for i in range(10000):
    lst.push_front(i)

%timeit sum_list.py_func(lst)
# 1000 loops, best of 3: 2.36 ms per loop
%timeit sum_list(lst)
# 100 loops, best of 3: 1.75 ms per loop

840 µs ± 24.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Compilation is falling back to object mode WITH looplifting enabled because Function "sum_list" failed type inference due to: non-precise type pyobject
[1] During: typing of argument at <ipython-input-60-40193199fddc> (3)

File "<ipython-input-60-40193199fddc>", line 3:
def sum_list(lst):
    result = 0
    ^

  @nb.jit
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "sum_list" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>

File "<ipython-input-60-40193199fddc>", line 5:
def sum_list(lst):
    <source elided>
    node = lst.head
    while node is not None:
    ^

  @nb.jit

File "<ipython-input-60-40193199fddc>", line 3:
def sum_list(lst):
    result = 0
    ^

  state.func_ir.loc))
Fall-back from the nopython compilation path to the object mode com

866 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [None]:
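node_type = nb.deferred_type()
node_spec = [
    ('next', nb.optional(node_type)),
    ('value', nb.int64)
]

# Newer Numba releases expose this decorator as numba.experimental.jitclass
@nb.jitclass(node_spec)
class Node:
    # Body of Node is unchanged
    def __init__(self, value):
        self.next = None
        self.value = value

node_type.define(Node.class_type.instance_type)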

In [None]:
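ll_spec = [('head', nb.optional(Node.class_type.instance_type))]

@nb.jitclass(ll_spec)
class LinkedList:
    # Body of LinkedList is unchanged
    def __init__(self):
        self.head = None
    def push_front(self, value):
        if self.head is None:
            self.head = Node(value)
        else:
            new_head = Node(value)
            new_head.next = self.head
            self.head = new_head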

In [None]:
lst = LinkedList()
for i in range(10000):
    lst.push_front(i)

%timeit sum_list(lst)
# 1000 loops, best of 3: 345 μs per loop
%timeit sum_list.py_func(lst)
# 100 loops, best of 3: 3.36 ms per loop

### 5.1.6 Limitations in numba

There are some instances where Numba cannot properly infer the variable types and will refuse to compile

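In the following example, we define a function that takes a nested list of integers and returns the sum of the elements in every sublist. In this case, type inference fails: depending on the Numba version and compilation mode, Numba either raises an error and refuses to compile or, as in the output shown here, emits warnings and falls back to object mode.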

In [68]:
a = [[0, 1, 2],
     [3, 4],
     [5, 6, 7, 8]]

@nb.jit
def sum_sublists(a):
    result = []
    for sublist in a:
        result.append(sum(sublist))
    return result

sum_sublists(a)

Compilation is falling back to object mode WITH looplifting enabled because Function "sum_sublists" failed type inference due to: Untyped global name 'sum': cannot determine Numba type of <class 'builtin_function_or_method'>

File "<ipython-input-68-7812101a7ef5>", line 9:
def sum_sublists(a):
    <source elided>
    for sublist in a:
        result.append(sum(sublist))
        ^

  @nb.jit
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "sum_sublists" failed type inference due to: cannot determine Numba type of <class 'numba.dispatcher.LiftedLoop'>

File "<ipython-input-68-7812101a7ef5>", line 8:
def sum_sublists(a):
    <source elided>
    result = []
    for sublist in a:
    ^

  @nb.jit

File "<ipython-input-68-7812101a7ef5>", line 7:
def sum_sublists(a):
    result = []
    ^

  state.func_ir.lo

[3, 7, 26]

The problem with this code is that Numba is not able to determine the type of the list and fails. A way to fix this problem is to help the compiler determine the right type by initializing the list with a sample element and removing it at the end:

In [69]:
@nb.jit
def sum_sublists(a):
    result = [0]
    for sublist in a:
        result.append(sum(sublist))
    return result[1:]

## 5.2 The pypy project

PyPy is a very ambitious project aimed at improving the performance of the Python interpreter. The way PyPy improves performance is by automatically compiling slow sections of the code at runtime.

PyPy is written in a special language called RPython (rather than C) that allows developers to quickly and reliably implement advanced features and improvements. RPython means Restricted Python because it implements a restricted subset of the Python language targeted at compiler development.

PyPy compiles code using a very clever strategy, called tracing JIT compilation. At first, the code is executed normally using interpreter calls. PyPy then starts to profile the code and identifies the most intensive loops. After the identification takes place, the compiler observes (traces) the operations and is able to compile an optimized, interpreter-free version.
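The hot-loop detection step can be illustrated with a toy model in plain Python. This is only a sketch of the idea and says nothing about PyPy's actual implementation: the `ToyTracingJIT` class, the threshold of 3, and the hand-written `fast_sum` standing in for compiled code are all assumptions made for illustration.

```python
class ToyTracingJIT:
    """Toy model of hot-loop detection; PyPy's real JIT is far more involved."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical "hotness" threshold
        self.counts = {}            # loop label -> number of interpreted runs
        self.compiled = {}          # loop label -> optimized replacement

    def run_loop(self, label, interpreted, optimized, *args):
        if label in self.compiled:
            return self.compiled[label](*args)  # fast path
        self.counts[label] = self.counts.get(label, 0) + 1
        if self.counts[label] >= self.threshold:
            # A real tracing JIT would record the operations executed by the
            # loop and emit machine code; here we install a hand-written version.
            self.compiled[label] = optimized
        return interpreted(*args)


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


def fast_sum(n):
    return n * (n - 1) // 2   # closed form standing in for compiled code


jit = ToyTracingJIT()
results = [jit.run_loop('sum', slow_sum, fast_sum, 100) for _ in range(5)]
print(results)
print('sum' in jit.compiled)
```

The first three runs go through the slow path. Once the loop is marked as hot, later runs use the optimized version and return the same result.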

Once an optimized version of the code is present, PyPy is able to run the slow loop much faster than the interpreted version

This strategy can be contrasted with what Numba does. In Numba, the units of compilation are methods and functions, while the PyPy focus is just slow loops. Overall, the focus of the projects is also very different as Numba has a limited scope for numerical code and requires a lot of instrumentation while PyPy aims at replacing the CPython interpreter

### 5.2.1 Setting up pypy

### 5.2.2 Running a particle simulator in pypy

## 5.3 Other interesting projects

**Numba** and **PyPy** are mature projects that have been steadily improving over the years. Features are continuously being added and they hold great promise for the future of Python.

**Nuitka** is a program developed by Kay Hayen that compiles Python code to C. As of right now (version 0.5.x), it provides extreme compatibility with the Python language and produces efficient code that results in moderate performance improvements over CPython

**Pyston** is a new interpreter developed by Dropbox that is powered by a JIT compiler. It differs substantially from PyPy as it doesn't employ a tracing JIT, but rather a method-at-a-time JIT (similar to what Numba does). Pyston, like Numba, is also built on top of the LLVM compiler infrastructure.

## 5.4 Summary