## Make your numerical Python code fly at transonic 🚀 speed! 

# Transonic https://transonic.readthedocs.io

- A new package

- A unified modern Python API for Python/Numpy accelerators

- A thin layer between developers and Pythran / Cython / Numba

<div align="middle"> EuroScipy 2019 (4 September 2019, Bilbao) </div>

<div align="middle">
    <a href="http://tiny.cc/euroscipy2019-transonic">http://tiny.cc/euroscipy2019-transonic</a>    
</div>

# A few Transonic code examples

This code can be accelerated with Pythran, Cython and Numba.

### Ahead-of-time compilation

In [None]:
import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"

@boost
def row_sum(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

@boost(boundscheck=False, wraparound=False)
def row_sum_loops(arr: T0, columns: T1):
    # type annotations of local variables are used only by Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res
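
As a quick sanity check, the sketch below runs both implementations as plain NumPy / Python (without the `@boost` decorator, so Transonic is not needed) and confirms that they compute the same thing. The small sample array is made up for illustration.

```python
import numpy as np


def row_sum(arr, columns):
    # vectorized: select the requested columns, then sum them for each row
    return arr.T[columns].sum(0)


def row_sum_loops(arr, columns):
    # explicit loops: the same computation written in a C-like style
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res


# hypothetical sample data: 3 rows, 4 columns
arr = np.arange(12).reshape(3, 4)
columns = np.array([0, 2])

result_vectorized = row_sum(arr, columns)
result_loops = row_sum_loops(arr, columns)
print(result_vectorized, result_loops)
```

Both calls sum columns 0 and 2 of each row; only the coding style differs, which is what the backends are compared on later.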

### Just-in-time compilation

In [None]:
import numpy as np

from transonic import jit

def add(a, b):
    return a + b

@jit
def func(a, b):
    return np.exp(a) * b * add(a, b)

### Type annotations

In [None]:
import numpy as np
from transonic import Type, NDim, Array, boost

T = Type(int, float, np.complex128)
N = NDim(1, 2, 3)

A = Array[T, N]
A1 = Array[np.float32, N + 1]

@boost
def compute(a: A, b: A, c: T, d: A1):
    ...

### `inline` functions

In [None]:
from transonic import boost

T = int

@boost(inline=True)
def add(a: T, b: T) -> T:
    return a + b

@boost
def use_add(n: int = 10000):
    # tmp has to be initialized before the loop
    tmp: T = 0
    _: int
    for _ in range(n):
        tmp = add(tmp, 1)
    return tmp


### Accelerate methods

In [None]:
from transonic import boost

@boost
class MyClass:
    attr: int
    
    @boost
    def numerical_kernel(self, arg: int):
        return self.attr + arg

# Proper compilation needed for high efficiency!

Unlike what CPython does: `compile(...)` only produces (high-level) virtual machine instructions, with almost no optimization
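
A small illustration with the standard library only: the built-in `compile` produces a code object holding bytecode, which `dis` can display. The expression and variable values are arbitrary, and the exact instructions printed depend on the CPython version.

```python
import dis
import types

source = "a + b * 2"

# CPython compiles the source to a code object holding bytecode
code_obj = compile(source, "<example>", "eval")

# display the virtual machine instructions (they vary between CPython versions)
dis.dis(code_obj)

# the interpreter then executes this bytecode instruction by instruction
result = eval(code_obj, {"a": 1, "b": 3})
print(result)
```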

## Compilation to machine instructions

*One needs to write code that can be well optimized by a compiler!*

- Just-in-time (`@jit`)

  Compilation has to be fast (warm-up cost); can be hardware specific

- Ahead-of-time (`@boost`)

  Compilation can be slow; can be hardware specific, or more generic so that binaries can be distributed
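
The warm-up cost of a JIT can be pictured with a toy decorator that "compiles" once per combination of argument types and then reuses the cached result. This is only a sketch: `ToyJit` and `slow_compile` are hypothetical names, nothing is really compiled, and it is not how Transonic, Numba or Pythran are implemented.

```python
import time


def slow_compile(func, arg_types):
    """Stand-in for a real compiler: pretend that compilation takes time."""
    time.sleep(0.05)
    # a real JIT would return machine code specialized for arg_types
    return func


class ToyJit:
    """Toy decorator caching one 'compiled' version per argument types."""

    def __init__(self, func):
        self.func = func
        self.cache = {}
        self.n_compilations = 0

    def __call__(self, *args):
        key = tuple(type(arg) for arg in args)
        if key not in self.cache:
            # first call with these types: pay the warm-up cost
            self.cache[key] = slow_compile(self.func, key)
            self.n_compilations += 1
        return self.cache[key](*args)


@ToyJit
def add(a, b):
    return a + b


add(1, 2)  # (int, int): compiles
add(3, 4)  # same types: reuses the cached version
add(1.0, 2.0)  # (float, float): compiles again
print(add.n_compilations)
```

With ahead-of-time compilation, the equivalent of `slow_compile` runs before the program starts, which is why it can afford to be slow but needs the types in advance.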

## First step for Python: Transpilation

From one language to another language (for example Python to C++, or Cython to C)
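
To make the idea concrete, here is a toy transpiler that turns a tiny subset of Python expressions into a C expression string using the standard `ast` module (Python 3.8+). It is a sketch for illustration only; real tools such as Pythran or Cython also handle typing, statements and many optimizations. The names `to_c` and `transpile` are hypothetical.

```python
import ast

# operators whose meaning is the same in C for floating-point operands
C_OPERATORS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def to_c(node):
    """Translate a small subset of Python expressions to a C expression."""
    if isinstance(node, ast.Expression):
        return to_c(node.body)
    if isinstance(node, ast.BinOp):
        left, right = to_c(node.left), to_c(node.right)
        if isinstance(node.op, ast.Pow):
            # C has no power operator: use pow() from <math.h>
            return f"pow({left}, {right})"
        operator = C_OPERATORS.get(type(node.op))
        if operator is None:
            raise NotImplementedError(type(node.op).__name__)
        return f"({left} {operator} {right})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(type(node).__name__)


def transpile(py_expr):
    return to_c(ast.parse(py_expr, mode="eval"))


c_code = transpile("a * x ** 2 + b")
print(c_code)
```

The output is a string of C source code; a C compiler would then perform the second step, down to machine instructions.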


# Many tools to compile Python

## Compiled at which levels?

- programs (Nuitka)

- slowest loops (PyPy)

- modules (Cython, Pythran)

- user-defined functions / methods (Numba, Transonic)

- blocks of code (Transonic)

- expressions (Numexpr)

- calls to compiled functions (Numpy / Python)

## Cython (module-level) AOT compiler

- Language: superset of Python

- A great mix of Python / C / CPython C API! 

  Very powerful, but a tool for experts!

- Easy to study where the interpreter is used (`cython --annotate`).

- Very mature

- Very efficient for C-like code (explicit loops, "low level")

- Now able to use Pythran internally...

My experience: large Cython extensions are difficult to maintain

## Numba: (per-method) JIT for Python-Numpy code

- Very simple to use (just add a few decorators) 🙂

In [None]:
from numba import jit

@jit
def myfunc(x):
    return x**2

- "nopython" mode (fast and no GIL) 🙂

- Also a "python" mode 🙂

- GPU and Cupy 😀

- Methods (of classes) 🙂

- Only JIT 🙁
    - Sometimes not as efficient as it could be 🙁
      (sometimes slower than Pythran / Julia / C++)

- Not good to optimize high-level NumPy code 🙁

## Pythran: AOT compiler for modules using Python-Numpy

Transpiles Python to efficient C++

- Good to optimize *high-level NumPy code* 😎

- Extensions never use the Python interpreter (pure C++ ⇒ no GIL) 🙂

- Can produce C++ that can be used without Python

- Usually **very efficient** (sometimes faster than Julia)

    - High and low level optimizations
    
      (Python optimizations and C++ compilation)

    - SIMD 🤩 (with [xsimd](https://github.com/QuantStack/xsimd)) 

    - Understands OpenMP directives 🤗!

- Can [use and make PyCapsules](https://serge-sans-paille.github.io/pythran-stories/the-capsule-corporation.html) (functions operating in the native world) 🙂
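
As an illustration of the OpenMP support, the sketch below annotates a reduction loop with a directive written as a comment, which is the form described in the Pythran manual as far as this example assumes. The function name `omp_sum` is hypothetical. Run with plain Python, the directive is just a comment and the loop is sequential.

```python
#pythran export omp_sum(float list)
def omp_sum(values):
    total = 0.0
    # the directive below asks for a parallel loop with a sum reduction
    #omp parallel for reduction(+:total)
    for i in range(len(values)):
        total += values[i]
    return total


# plain-Python usage (in practice this would live in the calling script)
total = omp_sum([1.0, 2.0, 3.5])
print(total)
```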

### High level transformations

In [None]:
from black import format_str, FileMode
from pythran.toolchain import generate_py
import gast as ast
import astunparse

def print_optimized(src):    
    optimized_py = generate_py("bar", src)
    tree = ast.parse(optimized_py)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            fdef = node
            fdef.body = [node for node in fdef.body[:-1] if not isinstance(node, ast.Pass)] + [fdef.body[-1]]
    optimized_code = astunparse.unparse(tree)
    print(format_str(optimized_code, mode=FileMode(line_length=82)))


In [None]:
# range analysis
print_optimized("""
def f(x):
    y = 1 if x else 2
    return y == 3
""")

In [None]:
# inlining
print_optimized("""
def foo(a):
    return  a + 1
def bar(b, c):
    return foo(b), foo(2 * c)
""")

In [None]:
# unroll loops
print_optimized("""
def foo():
    ret = 0
    for i in range(1, 3):
        for j in range(1, 4):
            ret += i * j
    return ret
""")

In [None]:
# constant propagation
print_optimized("""
def fib(n):
    return n if n< 2 else fib(n-1) + fib(n-2)
    
def bar(): 
    return [fib(i) for i in [1, 2, 8, 20]]
""")

In [None]:
# advanced transformations
print_optimized("""
import numpy as np
def wsum(v, w, x, y, z):
    return sum(np.array([v, w, x, y, z]) * (.1, .2, .3, .2, .1))
""")

## Pythran: AOT compiler for modules using Python-Numpy

- Compile only full modules (⇒ refactoring needed 🙁)

- Only "nopython" mode

    * limited to a subset of Python
    
        - only homogeneous list / dict 🤷‍♀️
        - no methods (of classes) 😢 and no user-defined classes
    
    * limited to few extension packages (Numpy + bits of Scipy)
    
    * pythranized functions can't call Python functions

- No JIT: need types (written manually in comments)

- Lengthy ⌛️ and memory intensive compilations

- Debugging 🐜 Pythran requires C++ skills

- No GPU (maybe with [OpenMP 4](https://www.openmp.org/updates/openmp-accelerator-support-gpus/)?)

- Intel compilers unable to compile Pythran C++11 👎

- Small community, only 1 core-dev
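
For reference, "types written manually in comments" looks like the following in a pure Pythran module: an export comment gives the argument types for each function to compile. This is the classic dot-product example in the style of the Pythran README; the file remains valid Python because the interpreter ignores the comment.

```python
# the export comment tells Pythran which argument types to compile for
#pythran export dprod(int list, int list)
def dprod(l0, l1):
    """Naive dot product of two lists of integers."""
    return sum(x * y for x, y in zip(l0, l1))


# plain-Python usage (in practice this would live in the calling script)
result = dprod([1, 2, 3], [4, 5, 6])
print(result)
```

With Transonic, the same information comes from the type annotations of the decorated function instead of a hand-written comment.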

# First conclusions

- Python: a great language & ecosystem for science & data

## Transonic example: a single code accelerated with 3 backends

### Benchmark ahead-of-time compilation

In [None]:
import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"

@boost
def row_sum(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

@boost
def row_sum_loops(arr: T0, columns: T1):
    # type annotations of local variables are used only by Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res

```
TRANSONIC_BACKEND="python" python row_sum_boost.py
Python
row_sum              1.38 s
row_sum_loops        108.57 s

TRANSONIC_BACKEND="cython" python row_sum_boost.py
Cython
row_sum              1.32 s
row_sum_loops        0.38 s

```

```
TRANSONIC_BACKEND="numba" python row_sum_boost.py
Numba
row_sum              1.16 s
row_sum_loops        0.27 s

TRANSONIC_BACKEND="pythran" python row_sum_boost.py
Pythran
row_sum              0.76 s
row_sum_loops        0.27 s

```
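
To read these numbers as speedups, the `row_sum_loops` timings can be divided by the pure-Python time. The values are copied from the output above; nothing else is measured here.

```python
# timings (in seconds) for row_sum_loops, copied from the benchmark output
timings = {"python": 108.57, "cython": 0.38, "numba": 0.27, "pythran": 0.27}

speedups = {
    backend: timings["python"] / duration
    for backend, duration in timings.items()
    if backend != "python"
}

for backend, speedup in speedups.items():
    print(f"{backend:8s} {speedup:6.0f}x faster than pure Python")
```

For the explicit-loop version this gives a factor of roughly 290 with Cython and roughly 400 with Numba and Pythran, while the vectorized `row_sum` gains less than a factor of 2 with any backend.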

## Transonic example: a single code accelerated with 3 backends

### Benchmark Just-in-time compilation

In [None]:
import numpy as np

from transonic import jit

@jit(native=True, xsimd=True)
def row_sum(arr, columns):
    return arr.T[columns].sum(0)

@jit(native=True, xsimd=True)
def row_sum_loops(arr, columns):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res

```
TRANSONIC_BACKEND="cython" python row_sum_jit.py
Cython
row_sum              1.28 s
row_sum_loops        11.94 s

TRANSONIC_BACKEND="numba" python row_sum_jit.py
Numba
row_sum              1.14 s
row_sum_loops        0.28 s

```

```
TRANSONIC_BACKEND="pythran" python row_sum_jit.py
Pythran
row_sum              0.76 s
row_sum_loops        0.28 s
```

## NotImplemented

### dataclass (`numba.jitclass` and Cython `cdef class`) 

In [None]:
from transonic import dataclass, boost

@dataclass
class MyStruct:
    attr: int

    def compute(self, arg: int):
        return self.attr + arg

    def modify(self, arg: int):
        self.attr = arg

@boost
def func(o: MyStruct, a: int):
    o.modify(a)
    return o

@boost
def func1(o: MyStruct, a: int):
    return o.compute(a)