## The builtin `multiprocessing` module

Jakob will introduce the builtin `multiprocessing` module, which allows you to simply parallelize work in Python.

Here, we will only consider embarrassingly parallel problems (more on that in a second).

# Serial Python

Before we get into parallel processing, let's first consider how the type of problem we'll look at is typically solved in a serial context. We'll focus on problems which always have the following form:

In [2]:
def do_science(x):
    """For example:
    - training a neural network (hyperparameter tuning!)
    - getting results from a database
    - scraping some websites
    - reading files
    - sampling monte-carlo style
    """
    return x ** 2  # we don't really do anything ;)

results = []
input_data = range(10)
for x in input_data:
    results.append(do_science(x))
print(results)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


This typical structure (or "smell") is pretty common and most likely all of you have something similar somewhere in your code. It's an excellent opportunity for leveraging parallelism to speed things up. However, first we'll rewrite this code using the builtin `map` function: it makes the code more compact and easier to run in parallel later. `map` takes a function and an iterable and applies the function to each item.

In [3]:
# when applying a function to a bunch of data, maybe you would use a list comprehension
results = [do_science(x) for x in input_data]
print(results)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [4]:
# however, here we use map to later use its parallel implementations
results = map(do_science, input_data)
results = list(results)
print(results)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


(watch out: `map` returns a lazy iterator, not a list; you need to "exhaust it" explicitly, e.g., by converting it to a list).
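To make the laziness of `map` concrete, here is a minimal sketch. It redefines `do_science` so the snippet runs on its own; the names `lazy_results`, `first`, `rest` and `leftover` are made up for this illustration.

```python
def do_science(x):
    """Stand-in for real work, as in the cells above."""
    return x ** 2


lazy_results = map(do_science, range(10))  # nothing has been computed yet
first = next(lazy_results)                 # computes only the first item
rest = list(lazy_results)                  # consumes all remaining items
leftover = list(lazy_results)              # the iterator is exhausted by now

print(first)     # 0
print(rest)      # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(leftover)  # []
```

The empty `leftover` list is the practical pitfall: once the result of `map` has been consumed, iterating over it a second time yields nothing.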

### Questions?

## Exercise 0

Questions:
- Can we also compute the *sum* of the numbers `0..9` using `map`? -> no(t trivially). The intermediate results depend on each other, which is a contra-indicator for "embarrassingly parallel" problems
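As a small illustration of that dependency, the sketch below contrasts `map` with `functools.reduce` from the standard library. The names `squares` and `total` are chosen only for this example.

```python
from functools import reduce

input_data = range(10)

# map: every output depends on exactly one input, so the items are independent
squares = list(map(lambda x: x ** 2, input_data))

# reduce: every step needs the running total produced by the previous step
total = reduce(lambda acc, x: acc + x, input_data, 0)

print(squares)
print(total)  # 45
```

The independent items in the `map` call could be handed to different workers, whereas the chain of running totals in `reduce` has to be evaluated one step after the other (unless the sum is restructured, e.g., into partial sums).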

# Embarrassingly parallel Python
Problems of this type are referred to as "embarrassingly parallel". This indicates that they can be easily parallelized across threads or processes, as the individual tasks do not need to interact while running (they can also be run in serial!). For these types of problems, we can use the builtin `multiprocessing` module. It provides versions of `map` which distribute the work across parallel threads or processes.

## Parallel threads
We first work with the `ThreadPool` available from the `multiprocessing.pool` module. We assume CPython, in which the GIL prevents several threads from executing Python code in parallel. However, for some use cases, in particular those which are **I/O bound**, threading can be very useful. Consider for example obtaining data from some database: you would like to query a couple of measurements, and completing each of these queries may take some processing time on the server. Here we mimic this server-side processing time by merely sleeping (which releases the GIL).

Now, we use `ThreadPool` to perform these queries using multiple threads (here the `processes` argument actually refers to the number of threads).

First, let's figure out how many CPUs we have available:

In [None]:
import multiprocessing
multiprocessing.cpu_count()
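A minimal sketch of the I/O-bound scenario described above could look as follows. The function `query_database`, the list `measurement_ids` and the sleep duration of 0.2 s are assumptions made up for this illustration; only `ThreadPool` and its `map` method come from `multiprocessing.pool`.

```python
import time
from multiprocessing.pool import ThreadPool


def query_database(measurement_id):
    """Hypothetical stand-in for a slow database query."""
    time.sleep(0.2)  # mimic server-side processing; sleeping releases the GIL
    return measurement_id ** 2


measurement_ids = [0, 1, 2, 3]

# serial reference: the waiting times add up
t0 = time.perf_counter()
serial_results = list(map(query_database, measurement_ids))
serial_time = time.perf_counter() - t0

# threaded version: the threads wait concurrently
t0 = time.perf_counter()
with ThreadPool(processes=4) as pool:  # `processes` is the number of threads here
    threaded_results = pool.map(query_database, measurement_ids)
threaded_time = time.perf_counter() - t0

print(serial_results, f'{serial_time:.2f}s')
print(threaded_results, f'{threaded_time:.2f}s')
```

Since the threads spend their time waiting rather than computing, the threaded version should finish in roughly the time of a single query, while the results stay in input order.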

# Thread-parallel(?) number crunching
Now let's consider a compute-intense number-crunching task, for example tuning the hyperparameters of our fancy neural network model to squeeze out an additional 0.0002% increase in accuracy. Here we mimic the training by merely counting down from a large number (let's avoid cognitive overhead).

In [None]:
def train_neural_network(x):
    """'Train' your favourite neural network model."""
    print(f'training with {x} start')
    n = x * 2e7  # mimic compute-intense training
    while n > 0:
        n -= 1
    y = x ** 2
    print(f'training with {x} end')
    return y

In [None]:
input_data = [1, 8, 1.5, 2]  # some dummy simulations

Again, first, we use the builtin `map` function to perform the number crunching serially for each item in `input_data`:

In [None]:
%%time
result = list(map(train_neural_network, input_data))
print(result)

Observations:
- number crunching causes high CPU load (surprise! ;) )
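For comparison, here is a hedged sketch of the same kind of number crunching run through a `ThreadPool`. The function `count_down` is a hypothetical, shortened variant of `train_neural_network` (far fewer iterations, no printing) so that the snippet runs quickly.

```python
import time
from multiprocessing.pool import ThreadPool


def count_down(x):
    """Hypothetical CPU-bound stand-in, a shortened `train_neural_network`."""
    n = int(x * 2e5)  # far fewer iterations than above, to keep the sketch quick
    while n > 0:
        n -= 1
    return x ** 2


input_data = [1, 8, 1.5, 2]

# serial reference
t0 = time.perf_counter()
serial_result = list(map(count_down, input_data))
serial_time = time.perf_counter() - t0

# same work, distributed across two threads
t0 = time.perf_counter()
with ThreadPool(processes=2) as pool:
    threaded_result = pool.map(count_down, input_data)
threaded_time = time.perf_counter() - t0

print(serial_result, f'{serial_time:.2f}s')
print(threaded_result, f'{threaded_time:.2f}s')
```

On a standard CPython build with the GIL, one would expect the two timings to be roughly the same, since only one thread can execute Python bytecode at any moment. The exact numbers depend on the machine and the interpreter build.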

# Parallel processes
For such tasks, we use a process pool: the `Pool` class from `multiprocessing.pool`, which we import as `ProcessPool` to set it apart from the `ThreadPool`. In contrast to the `ThreadPool`, it distributes work across multiple processes running separate instances of the Python interpreter. This allows you to circumvent the limitations of the GIL and achieve truly parallel code execution. For use cases which are **compute bound**, it is an excellent, simple-to-use option. As already introduced above, these use cases may include numerical simulations, sampling methods etc. Unfortunately, using multiple processes also has downsides, such as overhead (time & memory) for launching processes and increased memory consumption (e.g., duplication of data; warning: depends on implementation and use case).

In [None]:
from multiprocessing.pool import Pool as ProcessPool

In [None]:
%%time
with ProcessPool(processes=2) as pool:  # context manager providing a `Pool` instance
    result = pool.map(train_neural_network, input_data)
print(result)

## Exercise 1

Observations?
- results are identical to serial and threaded execution; good!
- runtime is reduced compared to both serial and threaded execution
- increased CPU load on multiple cores
- caveat: as before, no automatic load balancing, tasks are executed in order

Questions:
- What is the fundamental difference between threads and processes? -> memory shared across threads, by default not across processes
- How is data communicated between threads? -> shared memory (direct access)
- How is data communicated between processes? -> sending serialized data from one process to the other (in Python: `pickle` -> problems with stuff that's not picklable), or set up shared memory
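The pickling limitation mentioned in the last answer can be checked directly with the standard `pickle` module. This is a minimal sketch; the variable names are made up for this illustration.

```python
import pickle

# plain data survives a pickle round trip unchanged
restored_data = pickle.loads(pickle.dumps([1, 8, 1.5, 2]))

# a lambda has no importable name, so the standard pickler rejects it
try:
    pickle.dumps(lambda x: x ** 2)
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_is_picklable = False

print(restored_data)
print(lambda_is_picklable)
```

This is why the functions passed to a process pool in this notebook are defined with `def` at the top level: a function that cannot be pickled cannot be sent to a worker process.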

### Parallel speedup

So how much faster does my code become when I increase the number of processes? Here we investigate the relative speedup ($T_\textrm{serial} / T_\textrm{parallel}$, where $T_\textrm{serial}$ is the runtime with a single process) for an increasing number of processes. We use the same compute-bound function as before, but remove some of the annoying output and make it a bit shorter.

In [None]:
import time
import multiprocessing
import numpy as np


def train_neural_network(x):
    """Train your favourite neural network model."""
    n = x * 2e6  # mimic compute-intense training
    while n > 0:
        n -= 1
    y = x ** 2
    return y


input_data = [2] * 16  # some dummy simulations of equal duration
times = []
n_processes = np.arange(1, multiprocessing.cpu_count() + 4)
for n in n_processes:
    t0 = time.time()
    with ProcessPool(processes=n) as pool:
        result = pool.map(train_neural_network, input_data)
    times.append(time.time() - t0)

times

In [None]:
import matplotlib.pyplot as plt
import numpy as np

times = np.array(times)
fig, axes = plt.subplots()
axes.plot(n_processes, 1.0 * n_processes, color='k', linestyle='--', label='ideal')
axes.plot(n_processes, times[0] / times, marker='o', label='measured')
axes.set_xlabel(r'$n$ processes')
axes.set_ylabel('relative speedup')
fig.legend()

Observations:
- perfect speedup up to X processes, (on some machines) good speedup until ~Y processes with decreasing benefits
- no (significant) benefits for more processes
- rule of thumb: benefits up to number of cores (OS also needs some compute: context switching; also hyperthreading does not seem to work well in my experience)

Questions:
- Can we combine `ProcessPools` with `ThreadPools` in the worker processes? -> yes, but benefit depends on use case

## Resources
https://www.youtube.com/watch?v=AG1soUh4-nU