# Python and performance

In this post we discuss various aspects of performance when coding in Python.

## Compiled vs interpreted languages

Compiled languages require the use of a *compiler*: a computer program which converts the human-readable *source file* into machine-readable code. The output is an *executable file*, which is a sequence of instructions that the central processing unit (CPU) of the computer is able to process and execute directly. Modern CPUs can execute billions of these elementary instructions every second. Because the instructions are processed directly by the CPU, the executable is *architecture-dependent*. In other words, an executable compiled to run on your laptop will probably not run on your phone. The most common architectures today are x86-64 for computers and ARM for mobile and embedded devices.

C++ is a ubiquitous compiled programming language; it was created to add object-oriented programming functionalities to C, which is even more widely used. Let's look at a simple sample C++ file, and its translation into x86-64 code.

In [1]:
!cat example0.cpp

#include <iostream>
#include <string>

int main() {
	int a = 5;
	int b = 8;
	int c = 0;
	c = a + b;
	std::cout
	  << c << std::endl;
}


This program defines three variables `a`, `b` and `c`, and stores the sum of `a` and `b` into `c`. It then displays the result. To compile it, we'll use `g++`, the C++ compiler from GCC, the GNU Compiler Collection.

In [2]:
!g++ example0.cpp -O0 -o example0

Let's run the compiled executable:

In [3]:
!./example0

13


A *disassembler* decomposes a binary executable into individual instructions for humans to read (this may look like a simple task, but is not an exact science!). Let's disassemble the executable we just created.

In [4]:
!objdump -d example0


example0:     file format elf64-x86-64


Disassembly of section .init:

0000000000000708 <_init>:
 708:	48 83 ec 08          	sub    $0x8,%rsp
 70c:	48 8b 05 d5 08 20 00 	mov    0x2008d5(%rip),%rax        # 200fe8 <__gmon_start__>
 713:	48 85 c0             	test   %rax,%rax
 716:	74 02                	je     71a <_init+0x12>
 718:	ff d0                	callq  *%rax
 71a:	48 83 c4 08          	add    $0x8,%rsp
 71e:	c3                   	retq   

Disassembly of section .plt:

0000000000000720 <.plt>:
 720:	ff 35 72 08 20 00    	pushq  0x200872(%rip)        # 200f98 <_GLOBAL_OFFSET_TABLE_+0x8>
 726:	ff 25 74 08 20 00    	jmpq   *0x200874(%rip)        # 200fa0 <_GLOBAL_OFFSET_TABLE_+0x10>
 72c:	0f 1f 40 00          	nopl   0x0(%rax)

0000000000000730 <__cxa_atexit@plt>:
 730:	ff 25 72 08 20 00    	jmpq   *0x200872(%rip)        # 200fa8 <__cxa_atexit@GLIBC_2.2.5>
 736:	68 00 00 00 00       	pushq  $0x0
 73b:	e9 e0 ff ff ff       	jmpq   720 <.plt>

000000000000

There is a lot there, but we are only interested in the section starting with `<main>:` since it directly corresponds to the C++ code we wrote. Let's isolate that part.

In [12]:
!objdump -d example0 | sed -n '/<main>:/,/retq/p'

000000000000088a <main>:
 88a:	55                   	push   %rbp
 88b:	48 89 e5             	mov    %rsp,%rbp
 88e:	48 83 ec 10          	sub    $0x10,%rsp
 892:	c7 45 f4 05 00 00 00 	movl   $0x5,-0xc(%rbp)
 899:	c7 45 f8 08 00 00 00 	movl   $0x8,-0x8(%rbp)
 8a0:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%rbp)
 8a7:	8b 55 f4             	mov    -0xc(%rbp),%edx
 8aa:	8b 45 f8             	mov    -0x8(%rbp),%eax
 8ad:	01 d0                	add    %edx,%eax
 8af:	89 45 fc             	mov    %eax,-0x4(%rbp)
 8b2:	8b 45 fc             	mov    -0x4(%rbp),%eax
 8b5:	89 c6                	mov    %eax,%esi
 8b7:	48 8d 3d 62 07 20 00 	lea    0x200762(%rip),%rdi        # 201020 <_ZSt4cout@@GLIBCXX_3.4>
 8be:	e8 9d fe ff ff       	callq  760 <_ZNSolsEi@plt>
 8c3:	48 89 c2             	mov    %rax,%rdx
 8c6:	48 8b 05 03 07 20 00 	mov    0x200703(%rip),%rax        # 200fd0 <_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@GLIBCXX_3.4>
 8cd:	48 89 c6             	mov    %rax,

It's possible to recognise what we wrote in C++ after a while... for example, the instructions at addresses 892, 899 and 8a0 define the three variables, those at 8a7 and 8aa load the contents of `a` and `b` into CPU registers, and finally the one at 8ad adds the values of the two registers! For more complicated programmes, this becomes tricky to follow, to say the least... but this is also the point: this code is only supposed to be read by a machine. However, we do see that the critical part of our programme was converted to a handful of CPU instructions, which is great for performance. If you'd like to get more hands-on with assembly code, here are two useful resources:
* https://www.agner.org/optimize: contains, amongst other things, a guide on optimization for x86 processors (AMD, VIA and Intel). If you ever wondered what `mov`, `lea` or `callq` do, or what their throughput and latencies are on a given CPU, this is the place to go.
* Intel publishes "intrinsics" i.e. C functions that map directly to individual CPU instructions. More here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

Interpreted languages function differently from compiled ones. Instead of being converted to CPU instructions directly, the source code is parsed, then an intermediate, machine-independent *opcode* (or *bytecode*) is generated. This intermediate code is then *interpreted* by an interpreter (for the Python language, the reference interpreter is CPython, so called because it is written in C!). In a way, the interpreter is a "machine within the machine": it takes the place of the CPU (though of course it itself is being run by the CPU!). This explains some of the performance loss incurred when running Python scripts, since the presence of an interpreter induces overhead. We've written a Python program equivalent to the C++ one above:

In [None]:
!cat example1.py

In [None]:
!python ./example1.py
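To make the "machine within the machine" idea a little more concrete, here is a toy stack-based interpreter written in Python. It is only a sketch: the opcode names are borrowed from CPython for familiarity, but the `run` function, the `PRINT` opcode and the program encoding are made up for illustration and have nothing to do with how CPython is actually implemented.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on a tiny stack machine."""
    stack, names, output = [], {}, []
    for opcode, arg in program:
        # One branch per opcode: this is the "interpreter loop"
        if opcode == "LOAD_CONST":
            stack.append(arg)
        elif opcode == "STORE_NAME":
            names[arg] = stack.pop()
        elif opcode == "LOAD_NAME":
            stack.append(names[arg])
        elif opcode == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "PRINT":  # hypothetical opcode, for illustration only
            output.append(stack.pop())
        else:
            raise ValueError("unknown opcode: {}".format(opcode))
    return output


# The same computation as the C++ example: a = 5, b = 8, c = a + b, display c
program = [
    ("LOAD_CONST", 5), ("STORE_NAME", "a"),
    ("LOAD_CONST", 8), ("STORE_NAME", "b"),
    ("LOAD_NAME", "a"), ("LOAD_NAME", "b"), ("BINARY_ADD", None),
    ("STORE_NAME", "c"),
    ("LOAD_NAME", "c"), ("PRINT", None),
]
result = run(program)
print(result)
```

Every toy opcode costs several Python-level operations (a comparison chain, list pushes and pops, a dictionary lookup), which hints at where interpreter overhead comes from.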

One can explicitly ask the Python executable to generate the intermediate opcode. Ever wondered what the `.pyc` files present in `__pycache__` were?

In [None]:
!python -m compileall example1.py

In [None]:
!python view_bytecode.py __pycache__/example1.cpython-36.pyc
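If you do not have a bytecode viewer script at hand, the standard library's `dis` module gives a similar listing. The function `add_example` below is a hypothetical stand-in for `example1.py`, whose contents are not reproduced here; it simply repeats the arithmetic of the C++ programme.

```python
import dis


def add_example():
    # Hypothetical stand-in for example1.py: same arithmetic as the C++ program
    a = 5
    b = 8
    c = a + b
    return c


# Human-readable listing, grouped by source line
dis.dis(add_example)

# The same information as data: the opcode names, in order
opnames = [instruction.opname for instruction in dis.get_instructions(add_example)]
print(opnames)
```

The exact opcodes depend on the interpreter version. In particular, the addition appears as `BINARY_ADD` on older CPython releases and as the more general `BINARY_OP` from CPython 3.11 onwards.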

Each group of opcodes directly corresponds to each line in `example1.py` (and this is much more readable than assembly code...). To execute each opcode, many CPU instructions are required - but how many? To get a feel for this, let's inspect the source code of the CPython interpreter! In fact, let's just consider the crucial opcode, which is `BINARY_ADD` (adds two numbers together). We'll have to clone the cpython repository... exciting!

In [None]:
!git clone https://github.com/python/cpython.git

Feel free to have a closer look at ceval.c. We'll only show the relevant parts, since it is a lot of C code to take in in one go. Effectively, you'll find a big `switch` statement, followed by a lot of `case` statements, each corresponding to one opcode:

In [None]:
!cat cpython/Python/ceval.c | grep 'case TARGET' | tail -20

Although this is C code, it's not hard to see what this is doing: the programme is looping over each opcode, and based on the value of the opcode does something different. So what happens when `BINARY_ADD` is encountered?...

In [None]:
!cat cpython/Python/ceval.c | sed -n '/BINARY_ADD/,/DISPATCH()/p'

Looks fairly complicated for a simple sum... but really all this does is call PyNumber_Add provided the arguments are not two Unicode strings (in which case they will be concatenated). So let's have a look at `PyNumber_Add`...

In [None]:
!cat cpython/Objects/abstract.c | sed -n '/PyNumber_Add/,/return result/p'

So now we call `binary_op1`. If the result is `Py_NotImplemented` we do... something, else we return the result. Fine, let's have a look at `binary_op1`.

In [None]:
!cat cpython/Objects/abstract.c | sed -n '/binary_op1/,/Py_RETURN_NOTIMPLEMENTED/p'

This seems to be doing some type checking: is the object in question a "number type" (if so, define the `slotv` function...)? Then comes more checking, and finally, if possible, `slotv` is run. And do make sure to check that the returned value is not `Py_NotImplemented`, of course...

What's going on here is that Python is (very) dynamically typed: types are associated with *values* rather than *variables*. So whenever an operation is executed on an object, the interpreter has to check that the relevant object supports it, and take care of potential failures. Everything is an object in Python! What was one CPU instruction in our C++ example is quickly turning into many more when the Python interpreter is running the show. Because of the flexibility that Python offers (dynamic typing, etc.), a lot of optimisations that are available to other bytecode-based languages (e.g. Java) are simply not available to it. That is the main reason why Python is so slow! It's common for a C++ version of a Python function to be 100x faster.
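The dispatch logic described above can be observed from Python itself. The sketch below uses a hypothetical `Tracer` class (not part of any library) that records which special method the interpreter ends up calling when it evaluates `+`.

```python
class Tracer:
    """Hypothetical class recording which special methods get called."""

    def __init__(self):
        self.calls = []

    def __add__(self, other):
        self.calls.append("__add__")
        return NotImplemented  # "I do not know how to add this"

    def __radd__(self, other):
        self.calls.append("__radd__")
        return 42


t = Tracer()

# int does not know how to add a Tracer, so the interpreter
# falls back to the right-hand operand's __radd__
result = 1 + t
print(result, t.calls)

# Here both operands decline, and the interpreter raises TypeError
try:
    t + 1
except TypeError as exc:
    error = str(exc)
    print(error)
```

None of these checks exist in the compiled C++ programme, where the types of `a` and `b` are known at compile time.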

## Improving performance

There are a number of ways in which you can improve the performance of your Python code. Normally, your programmes will spend most of their time in the same portion of the code: we call these *hotspots*. It is these hotspots that really need to be sped up, and not the rest of your code. For example, if your programme consists of reading a file from disk, parsing it into a custom data structure, and then running some bespoke ML algorithm on that structure, then probably only the latter step needs optimisation. We will discuss three different ways of speeding up your code: Cython, Numba and writing custom extensions; we won't discuss some others like using a different Python interpreter such as PyPy or Jython.
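Before optimising anything, it helps to measure where the time actually goes. One way to do so is the standard library's `cProfile` module. The sketch below profiles a made-up pipeline: the functions `parse`, `crunch` and `pipeline` are hypothetical placeholders for your own code.

```python
import cProfile
import io
import pstats


def parse(n):
    # Cheap step: build a small list of strings
    return [str(i) for i in range(n)]


def crunch(n):
    # Expensive step: a long pure-Python loop
    total = 0
    for i in range(n):
        total += i * i
    return total


def pipeline():
    parse(1000)
    return crunch(200000)


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Collect the five most expensive entries, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, `crunch` should account for most of the cumulative time, which makes it the hotspot worth optimising.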

### Cython

Cython is an *optimising static compiler* for the Python language and the *extended Cython language, which is a superset of the Python programming language*. What does that mean? First, Cython compiles Python code instead of interpreting it. Second, it defines a set of annotations that you may use on your Python code. These annotations are not part of the Python language. Rather, they help Cython understand the types of your variables and the signatures of your functions. Really, they let you write C code with Python syntax. The corollary here is that to use Cython well, you need to understand C. In particular, Cython is most successful at optimising your code when you do not use Python's convenient features like dynamically typed variables, introspection, and so on. Let's give Cython a go. To compare its performance with pure Python, we'll test a function that adds the first $10^8$ integers and returns the result.

In [None]:
!cat add_integers_python.py
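The contents of `add_integers_python.py` are not reproduced here. As a rough idea of what such a function could look like, here is a minimal sketch; the name `f_sketch` is made up, and whether the original `f` starts counting at 0 or at 1 is an assumption.

```python
def f_sketch(n):
    """Sum the integers 0, 1, ..., n - 1 with an explicit loop (assumed shape of `f`)."""
    total = 0
    for i in range(n):
        total += i
    return total


print(f_sketch(10))  # 0 + 1 + ... + 9
```

The explicit loop is deliberate: it is the kind of code on which the interpreter overhead discussed earlier is paid at every iteration.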

Cython code should be written in files ending in `.pyx`.

In [None]:
!cat add_integers_cython.pyx

This is fairly similar to the pure Python function. The main difference is that we annotated the type of the variables using the `cdef` keyword, and also annotated the function arguments. Note that `uint64_t` stands for unsigned 64-bit integer. It can hold any non-negative integer between 0 and $2^{64} - 1$, inclusive. To compile `add_integers_cython.pyx` into a Python extension, we have written a custom `setup.py` file:

In [None]:
!cat setup.py 
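As an aside on the `uint64_t` type used in the Cython file: it is worth checking that the sum we compute actually fits. The sketch below uses plain Python integers (which never overflow) to compute an upper bound on the sum, and `ctypes.c_uint64` to show what happens to a fixed-width integer that goes past its maximum. Taking $n(n+1)/2$ as the bound assumes the sum runs at most from 1 to $n$.

```python
import ctypes

UINT64_MAX = 2**64 - 1

# Upper bound on the sum: 1 + 2 + ... + n for n = 10**8 (closed form n(n+1)/2)
n = 10**8
largest_sum = n * (n + 1) // 2
print(largest_sum, largest_sum <= UINT64_MAX)

# Unlike a Python int, a fixed-width unsigned integer wraps around on overflow
wrapped = ctypes.c_uint64(UINT64_MAX + 1).value
print(wrapped)
```

The sum is of the order of $5 \times 10^{15}$, comfortably below $2^{64} - 1 \approx 1.8 \times 10^{19}$, so no wrap-around occurs for this value of `n`.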

We build the extension as follows:

In [None]:
!python setup.py build_ext --inplace

Let's compare the pure Python implementation to the Cython one.

In [None]:
import time

In [None]:
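def timefunc(func, *args, **kwargs):
    """Return the wall-clock time, in seconds, taken by one call to func."""
    # perf_counter() is a monotonic, high-resolution clock intended for timing
    t0 = time.perf_counter()
    func(*args, **kwargs)
    t1 = time.perf_counter()
    return t1 - t0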

In [None]:
from add_integers_python import f

In [None]:
n = 100000000

In [None]:
timefunc(f, n)

In [None]:
from add_integers_cython import f as f_cython

In [None]:
timefunc(f_cython, n)

That's a roughly 200x improvement! You may want to check out `add_integers_cython.c`. The `add_integers_cython.pyx` file was translated into this C source file, and the latter was then compiled using a C compiler. For more information on Cython, see https://cython.org/

### Numba

Numba is easier to use than Cython. Numba's premise is that rather than having you specify types manually, it will try to infer them at runtime for you. As a consequence, most of the time all you have to do is decorate your functions to tell Numba to try to optimise them. No need for a different language here.

In [None]:
from numba import jit

In [None]:
f_numba = jit(nopython=True)(f)

In [None]:
f_numba(10)

In [None]:
timefunc(f_numba, n)

Woah! That was fast. How is this possible? You may want to use the `inspect_asm()` or `inspect_llvm()` methods of `f_numba` to check your theory.

It is also possible, and sometimes necessary, to specify the signatures of your functions directly, rather than to let `numba` find out what they are on its own. Here is how you do it:

In [None]:
from numba import uint64

In [None]:
f_numba_with_signature = jit(
    uint64(uint64),
    nopython=True
)(f)

In [None]:
f_numba_with_signature(100)

Numba has (very basic) support for classes, too.

In [None]:
from numba.experimental import jitclass  # on Numba < 0.49, use: from numba import jitclass
from collections import OrderedDict

In [None]:
@jitclass(
    OrderedDict([("x", uint64), ("y", uint64)]),
)
class Point2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y

Some trickery is required if you want to specify the signature of functions using your jitclassed classes...

In [None]:
NumbaPoint2DType = Point2D.class_type.instance_type

@jit(
    NumbaPoint2DType(NumbaPoint2DType, NumbaPoint2DType),
    nopython=True
)
def add_points(a, b):
    return Point2D(a.x + b.x, a.y + b.y)

In [None]:
result = add_points(Point2D(1, 2), Point2D(3, 4))

In [None]:
(result.x, result.y)

Class support is limited, however. For example, it is not possible for a class to reference itself - something you would do using pointers in C++. So defining recursive structures is very difficult (I have seen one example online, and could not adapt it to my needs). Again, knowing some C is useful here to understand what can and cannot be done with `numba`. For more information, follow the documentation here: http://numba.pydata.org/

### Using C/C++ directly

If C knowledge is effectively required to properly use Cython or Numba, might we not want to code our functions directly in C? Indeed, this is possible, and is probably preferable. One of the risks in using Cython or Numba is that they may not always suit your needs (see limitations above). You would not want to realise this in the middle of a project...

#### The C Foreign Function Interface (`cffi`) package

The `cffi` package allows you to use compiled C libraries directly. Here is our favourite function coded up in C:

In [None]:
!cat add_integers_c.h

In [None]:
!cat add_integers_c.c

Let us compile it into a shared library:

In [None]:
!gcc -c -Wall -Werror -fPIC add_integers_c.c

In [None]:
!gcc -shared -Wl,-soname,libadd_integers_c.so -o libadd_integers_c.so add_integers_c.o

Now we can use `cffi` to make functions from this shared library available from a Python module.

In [None]:
from cffi import FFI
ffibuilder = FFI()

# cdef() expects a single string declaring the C types, functions and
# globals needed to use the shared object. It must be in valid C syntax.
ffibuilder.cdef("""
    uint64_t f(uint64_t);
""")

# set_source() gives the name of the python extension module to
# produce, and some C source code as a string.  This C code needs
# to make the declared functions, types and globals available,
# so it is often just the "#include".
ffibuilder.set_source(
    "_add_integers_cffi",
    """
    #include "add_integers_c.h"   // the C header of the library
    """,
    libraries=['add_integers_c'],   # library name, for the linker
    # Tell the linker where to find our shared library (adjust to your own path)
    extra_link_args=['-L/project/performance/examples'],
)

ffibuilder.compile(verbose=True)

In [None]:
from _add_integers_cffi import lib

In [None]:
lib.f(100)

#### Writing a C++ extension with Boost

Boost (https://www.boost.org) has a library, Boost Python, which wraps all the boilerplate code required to write Python extensions manually (it will take care of reference counts for you, for example...). The result is that it is quite simple to expose your C++ classes as Python classes. Installation is a bit tricky, though. For reference, a list of instructions follows. However, we have included a script to automate this process. To run this script, close this notebook, open a new terminal and run
```
cd install-boost && ./install_boost.sh
```
This will restart Jupyter.

Manual installation instructions:

1. Make sure that `numpy` is installed, for example using `pip`:

   ```
   pip install numpy
   ```

2. Download the Boost C++ libraries version 1.69 available here:
   https://dl.bintray.com/boostorg/release/1.69.0/source/boost_1_69_0.tar.gz for
   example by running
    
   ```
   cd /tmp && wget https://dl.bintray.com/boostorg/release/1.69.0/source/boost_1_69_0.tar.gz
   ```

3. Extract the Boost C++ libraries:

   ```
   cd /tmp && tar -xvf boost_1_69_0.tar.gz
   ``` 
   
   This creates a directory `boost_1_69_0`.
   Define an environment variable named `BOOST_ROOT`
   pointing to the root of your newly installed Boost distribution:
   
   ```
   export BOOST_ROOT=/tmp/boost_1_69_0
   ```

4. Compile the Boost Python and Numpy shared libraries. Run 

   ```
   cd $BOOST_ROOT && ./bootstrap.sh
   ```

   This will create a `bjam` configuration file named `project-config.jam`.
   Bjam (the Boost build tool) is not always capable of detecting the correct
   Python include paths, so we'll need to fix `project-config.jam` manually.
   Look for a line resembling 
   ```
   using python : 3.6 : /opt/anaconda/envs/Python3 ;
   ```
   in this file, and replace it with
    
   ```
   using python : 3.6 : /opt/anaconda/envs/Python3 : /opt/anaconda/envs/Python3/include/python3.6m : /opt/anaconda/envs/Python3/lib ;
   ``` 
   
   If you are not using Faculty, are targeting another version of Python,
   or have a Python installation located elsewhere, you will need to
   modify this step accordingly.  The
   essential point is to make sure that the fourth field points to the directory
   containing the Python C header files (in particular, it should contain the
   file `Python.h`).  Now compile the Boost Python library with

   ```
   ./b2 install --with-python stage
   ```
   
   The Boost Python and Numpy shared libraries will
   be installed in `$BOOST_ROOT/stage/lib`.

Let's also set a number of environment variables to make things tidier.

* `BOOST_ROOT`: we have already defined this variable.
  It should point to the directory containing the Boost C++ headers.

* `BOOST_LIB`: this variable should point to the directory containing the shared
  Boost Python and Boost Numpy libraries. To define it, run 
  
  ```
  export BOOST_LIB=$BOOST_ROOT/stage/lib
  ```

* `PYTHON_INCLUDE`: this variable should point to the directory containing the
  Python header files. On Faculty, this would currently be set as follows:
  
  ```
  export PYTHON_INCLUDE=/opt/anaconda/envs/Python3/include/python3.6m
  ```

* `NUMPY_INCLUDE`: this variable should point to the directory containing the
  numpy header files.  On Faculty, this would currently be set as follows:

  ```
  export NUMPY_INCLUDE=/opt/anaconda/envs/Python3/lib/python3.6/site-packages/numpy/core/include
  ```

Additionally, `$BOOST_LIB` should be part of your `$LD_LIBRARY_PATH` environment
variable. This is to make sure that the Python interpreter is able to find the
shared Boost Python library we just compiled. Run the following
command to make sure that `LD_LIBRARY_PATH` is correctly set:

``` 
export LD_LIBRARY_PATH=$BOOST_LIB:$LD_LIBRARY_PATH
```

Note that this is the only environment variable that is required to be set
after installation. On Faculty, you may want to create a file `envs.sh` in 
`/etc/sherlockml_environment.d` defining the variables above. Then restart Jupyter with 
```
sudo sv stop jupyter && sudo sv start jupyter
```
The IPython kernels run by your Jupyter server will now have the variables correctly set.

Let's give Boost Python a go! Here is a simple example defining our function `f` as part of a module `add_integers_boost`.

In [None]:
!cat add_integers_boost.cpp

In [None]:
!g++ add_integers_boost.cpp \
     -fPIC \
     -shared \
     -I $BOOST_ROOT \
     -I $PYTHON_INCLUDE \
     -L $BOOST_LIB \
     -lboost_python36 \
     -Wno-deprecated-declarations \
     -o add_integers_boost.so

A word on the options:
* `-fPIC` tells the compiler to generate position independent code, meaning that the code does not rely on where it is located in memory to be run. For example, jumps will be relative rather than absolute. This is required for shared libraries.
* `-shared` tells the compiler to create a shared library.
* `-I $BOOST_ROOT` and `-I $PYTHON_INCLUDE` give the compiler additional directories in which to look for header files.
* `-L $BOOST_LIB` gives the linker an additional directory in which to look for libraries.
* `-lboost_python36` tells the linker to dynamically link the compiled library against the Boost Python library (for version 3.6).
* `-Wno-deprecated-declarations` removes some warnings coming from the Boost headers (this can be omitted)
* `-o add_integers_boost.so` specifies the output file name of the library.

In [None]:
from add_integers_boost import f

In [None]:
f(100)

Here is an example of a C++ class exposed as a Python class:

In [None]:
!cat accumulator_boost.cpp

In [None]:
!g++ accumulator_boost.cpp \
     -fPIC \
     -shared \
     -I $BOOST_ROOT \
     -I $PYTHON_INCLUDE \
     -L $BOOST_LIB \
     -lboost_python36 \
     -Wno-deprecated-declarations \
     -o accumulator_boost.so

In [None]:
from accumulator_boost import Accumulator

In [None]:
acc = Accumulator()

In [None]:
acc.total

In [None]:
acc.add(5)

In [None]:
acc.total

You will typically only want to implement the critical code paths in C++, and hence only part of your class in C++. Therefore you might want to wrap the Boost-generated class in another class. For example:

In [None]:
from accumulator_boost import Accumulator as CoreAccumulator

class Accumulator:  # wraps (rather than subclasses) the Boost-generated class
    def __init__(self):
        self._core = CoreAccumulator()
        
    def add(self, increment):
        self._core.add(increment)

    @property
    def total(self):
        return self._core.total

    def help(self):
        print("Use this class to accumulate numbers.")
    
    def __repr__(self):
        return "Accumulator: total = {}".format(self.total)

In [None]:
acc = Accumulator()

In [None]:
acc.add(5)

In [None]:
print(acc)

Although quite bare, the Boost Python documentation does have a helpful tutorial which you might want to check out - see the Boost documentation. In particular, it will show you how to interact with Numpy.

### Can I code a C extension from scratch?

Yes. Good luck :)

## Parallelisation

Let's talk about parallelisation... 

### System processes and threads

In computing, a *process* is an instance of a program being executed. At any one time, your computer is running tens of processes, if not more. Here's how to get the number of running processes on Linux:

In [None]:
!ps -A --no-headers | wc -l

Note that you are very likely to be running (many) more processes than you have CPU cores! How is this possible? In effect, only one process is ever being run at a time on a machine with a single core. Processes *share CPU time*: the CPU constantly switches between them, and this gives the illusion that processes are being run concurrently. The part of your operating system responsible for orchestrating this is called the *scheduler*. It divides time into slices, and decides which process should be run during each slice. Of course, when you have more than one core, the scheduler will use them all: processes will be scheduled to run on different cores.
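To compare the process count above with the number of cores, you can query the latter from Python. This is a small sketch using the standard `os` module; `os.sched_getaffinity` only exists on some platforms (Linux in particular), hence the guard.

```python
import os

# Number of logical cores on the machine (may be None if it cannot be determined)
cores = os.cpu_count()
print("logical CPU cores:", cores)

# On Linux, the cores this particular process is allowed to run on
usable = None
if hasattr(os, "sched_getaffinity"):
    usable = len(os.sched_getaffinity(0))
    print("cores usable by this process:", usable)
```

On most machines the number printed here will be far smaller than the process count reported by `ps`.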

A process can run one or more *threads*: these are sequences of execution sharing the same address space (so that communication between threads is much faster than communication between processes). So your system runs many processes, and each process can run many threads. Again, there is no relationship between number of cores and number of threads, but having more cores means that threads can run concurrently. To speed up your programmes, you might therefore want to make them *multithreaded*. This typically will make your programmes run faster on multi-core machines, but not always! It all depends on whether the bottleneck in the execution of your programme is the CPU, or something else. To give an example, if your programme is transcoding a video, then the bottleneck is the CPU: we say that it is *CPU-bound*. Video transcoders are typically multithreaded, and this significantly improves performance on multi-core machines. If your programme is downloading data from the web, then its performance is probably limited by the network bandwidth available to you: it is *I/O-bound*. In this context, the CPU is probably spending most of its time idling and waiting for packets to come from the source. Dividing the downloading task into two threads would probably not speed things up.

### The `threading` package

You can create threads in Python with the `threading` package. Run the following cells on a machine with at least two cores.

In [None]:
import threading

In [None]:
from add_integers_python import f

In [None]:
def single_threaded():
    return f(100000000) + f(100000000)

In [None]:
timefunc(single_threaded)

In [None]:
def f_adapted(index, results, *args, **kwargs):
    results[index] = f(*args, **kwargs)

def multi_threaded():
    results = {}
    t0 = threading.Thread(target=f_adapted, args=(0, results, 100000000,))
    t1 = threading.Thread(target=f_adapted, args=(1, results, 100000000,))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return results[0] + results[1] 

In [None]:
timefunc(multi_threaded)

That was not much faster than the single-threaded version... in fact, it may even be a bit slower! Here is the problem: Python threads *cannot be run concurrently*. This is a feature of the Python interpreter: the same programme written in C would indeed be about twice as fast as the single-threaded version. The Python interpreter has a so-called *Global Interpreter Lock (GIL)*: when a thread is being run by the interpreter, the GIL is set, which prevents other threads from running. In effect, the Python interpreter works like a single-core virtual machine. More on this here: https://www.dabeaz.com/python/UnderstandingGIL.pdf

This does not mean that using threads in Python is not a good idea: they can be very useful if your programme has to run independent streams of code concurrently: for example to handle multiple connections to a server at the same time. But you shouldn't expect performance improvements on CPU-bound tasks from using Python threads.
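To illustrate the kind of workload where Python threads do help, here is a sketch in which each task spends its time waiting rather than computing. The function `fake_download` is a hypothetical stand-in for a network request: it merely sleeps, and sleeping (like waiting on a socket) releases the GIL so that other threads can proceed.

```python
import threading
import time


def fake_download(delay):
    # Hypothetical stand-in for a network request: the thread just waits
    time.sleep(delay)


def run_sequentially(delay, count):
    t0 = time.perf_counter()
    for _ in range(count):
        fake_download(delay)
    return time.perf_counter() - t0


def run_threaded(delay, count):
    threads = [
        threading.Thread(target=fake_download, args=(delay,))
        for _ in range(count)
    ]
    t0 = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - t0


sequential_time = run_sequentially(0.2, 4)
threaded_time = run_threaded(0.2, 4)
print(sequential_time, threaded_time)
```

The sequential version waits for each task in turn, whereas the threaded version overlaps the waits, so it should finish in roughly the time of a single task. No such gain appears when the tasks are busy executing Python bytecode, as in `multi_threaded` above.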

### The `multiprocessing` package

Python does have a way of making use of multicore machines: the `multiprocessing` package.

In [None]:
import multiprocessing

In [None]:
def multi_processed():
    manager = multiprocessing.Manager()
    results = manager.dict()
    t0 = multiprocessing.Process(target=f_adapted, args=(0, results, 100000000,))
    t1 = multiprocessing.Process(target=f_adapted, args=(1, results, 100000000,))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return results[0] + results[1] 

In [None]:
timefunc(multi_processed)

This *does* improve performance on multicore machines. As the name indicates, the way that this module works is by creating independent *processes* rather than *threads*: a distinct Python interpreter (and hence a distinct system process) will be run for each "process" that you create in this way. This means, in particular, that these processes do not share the same address space. This works well only when a small amount of inter-process communication is required. Besides, running a new Python interpreter for each process has its own drawbacks.

Naturally, if you choose to write a C/C++ extension (see above) then you can easily bypass the GIL mechanism, and you won't be limited by the `threading` or `multiprocessing` modules. And how does the (in?)famous `joblib` module come into the picture? Joblib will either use `threading` or `multiprocessing` depending on which backend you specify - and hence suffers from the same limitations as these modules.