Multi-Core Parallelism
====

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os

In [None]:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import multiprocessing as mp
from multiprocessing import Pool, Value, Array
import time

## Vanilla Python

Toy problem: Estimate $\pi$ by sampling points at random within a box circumscribing the unit circle and counting the fraction that fall within the circle. This is a simple example of a Monte Carlo algorithm. We will parallelize this embarrassingly parallel problem.
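
The estimator rests on an area ratio: the unit circle has area $\pi$ and the enclosing square $[-1, 1]^2$ has area 4, so the fraction of uniform points that land inside the circle approaches $\pi/4$. A minimal vectorized sketch of the same idea follows; the name `mc_pi_vectorized` and the fixed seed are illustrative choices, not part of the notebook.

```python
import numpy as np

def mc_pi_vectorized(n, seed=0):
    """Estimate pi from n uniform points in the square [-1, 1] x [-1, 1]."""
    rng = np.random.default_rng(seed)  # seed fixed only for reproducibility
    x = rng.uniform(-1, 1, size=n)
    y = rng.uniform(-1, 1, size=n)
    inside = (x**2 + y**2) < 1         # True for points inside the unit circle
    return 4 * inside.mean()           # fraction inside, scaled by the square's area

estimate = mc_pi_vectorized(100_000)
print(estimate)
```

The loop-based `mc_pi` defined next is the workload used in the timings that follow.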

In [None]:
def mc_pi(n):
    s = 0
    for i in range(n):
        x = np.random.uniform(-1, 1)
        y = np.random.uniform(-1, 1)
        if (x**2 + y**2) < 1:
            s += 1
    return 4*s/n

In [None]:
%%time

mc_pi(int(1e5))

In [None]:
%%time

res = [mc_pi(int(1e5)) for i in range(10)]

The `concurrent.futures` module
----

Concurrent tasks are tasks that return the same results regardless of the order in which they are executed. A "future" is an object that stands for a result that will become available at some later time. The `concurrent.futures` module provides executors, which can be given functions to schedule for future execution. This gives us a simple model for parallel execution on a multi-core machine.
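
A minimal sketch of this future-based model. It uses `ThreadPoolExecutor` only so that it runs in any environment; `ProcessPoolExecutor` exposes the same `submit` interface. The function `square` is a placeholder workload.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() schedules the call and returns a Future immediately
    futures = {executor.submit(square, x): x for x in range(5)}
    results = {}
    # as_completed() yields each Future as soon as its result is ready
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(sorted(results.items()))
```

Because the results are keyed by input, the order in which the futures finish does not matter.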

Using processes in parallel with `ProcessPoolExecutor`
----

We get close to a linear speedup, as expected for an embarrassingly parallel problem.
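
How close the speedup gets to the number of workers depends on how evenly the jobs divide among them. A back-of-the-envelope sketch, assuming equal-cost jobs and ignoring process start-up and communication overhead (`ideal_speedup` is a hypothetical helper):

```python
import math

def ideal_speedup(n_jobs, n_workers):
    """Upper bound on speedup for equal-cost jobs with no overhead."""
    rounds = math.ceil(n_jobs / n_workers)  # the busiest worker runs this many jobs
    return n_jobs / rounds

print(ideal_speedup(10, 4))  # 10 jobs need 3 rounds on 4 workers
print(ideal_speedup(12, 4))  # 12 jobs divide evenly into 3 rounds
```

With 10 jobs on 4 workers the bound is 10/3, a little over 3x rather than 4x.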

In [None]:
%%time

with ProcessPoolExecutor(max_workers=4) as pool:
    res = pool.map(mc_pi, [int(1e5) for i in range(10)])

In [None]:
np.array(list(res))

### When you have many jobs

Future objects give fine control over the process, such as adding callbacks and canceling a submitted job, but creating one per job is computationally expensive. We can use the `chunksize` argument to reduce this cost when submitting many jobs: it specifies the number of tasks given to a worker at a time. A detailed explanation of `chunksize` is provided [here](https://stackoverflow.com/questions/53751050/python-multiprocessing-understanding-logic-behind-chunksize).
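
Conceptually, `chunksize` groups the task list into batches, so each round trip to a worker carries several tasks instead of one. The sketch below only illustrates the grouping; `split_into_chunks` is a hypothetical helper, not the library's internal code.

```python
from itertools import islice

def split_into_chunks(tasks, chunksize):
    """Yield successive tuples of at most `chunksize` tasks."""
    it = iter(tasks)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

chunks = list(split_into_chunks(range(10), 4))
print(chunks)  # three chunks of sizes 4, 4 and 2
```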

#### Using default `chunksize`

The total amount of computation is essentially the same whether you have 10 jobs of size 100,000 or 10,000 jobs of size 100, so we might expect both to take about the same amount of time. They do not, because of the per-job overhead described above.

In [None]:
%%time

with ProcessPoolExecutor(max_workers=4) as pool:
    res = pool.map(mc_pi, [int(1e2) for i in range(int(1e4))])

#### Using `chunksize` of 100

In [None]:
%%time

with ProcessPoolExecutor(max_workers=4) as pool:
    res = pool.map(mc_pi, [int(1e2) for i in range(int(1e4))], chunksize=100)

### Fine control of processes

#### Status of processes

In [None]:
def f1(x):
    return x**2

def f2(x, y):
    return x*y

In [None]:
with ProcessPoolExecutor(max_workers=4) as pool:
    a = pool.submit(f2, 1, 1)
    b = pool.submit(f2, 1, 2)
    c = pool.submit(f1, 10)

    print('a running:', a.running())
    print('a done:', a.done())

    print('b running:', b.running())
    print('b done:', b.done())

    print('c running:', c.running())
    print('c done:', c.done())

    print('a result', a.result())
    print('b result', b.result())
    print('c result', c.result())

#### Canceling jobs and adding callbacks

For example, if you launch multiple versions of the same task for safety, you might want to cancel the duplicate tasks once one of them has completed.

Callbacks let a future notify you when some event occurs, for example when the job finishes or is cancelled.
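
A small sketch of when `cancel` succeeds: a job that is still waiting in the queue can be cancelled, while one that is already running cannot. A single-worker thread pool is used here only to make the job states deterministic; the `Future` interface is the same for process pools. The names `blocker`, `record` and `log` are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

started = threading.Event()
release = threading.Event()
log = []

def blocker():
    started.set()   # signal that this job is now running
    release.wait()  # block until the main thread lets it finish
    return "first"

def record(future):
    # Done-callbacks fire for both finished and cancelled futures
    log.append("cancelled" if future.cancelled() else future.result())

with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(blocker)
    started.wait()                  # make sure the first job is running
    second = pool.submit(blocker)   # queued behind the first job
    first.add_done_callback(record)
    second.add_done_callback(record)
    could_cancel_first = first.cancel()    # running jobs cannot be cancelled
    could_cancel_second = second.cancel()  # pending jobs can
    release.set()

print(could_cancel_first, could_cancel_second, log)
```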

In [None]:
njobs = 24

res = []

with ProcessPoolExecutor(max_workers=4) as pool:

    for i in range(njobs):
        res.append(pool.submit(f2, *np.random.rand(2)))
        if i % 2 == 0:
            res[i].add_done_callback(lambda future: print("Process done!"))
    res[4].cancel()
    if res[4].cancelled():
        print("Process 4 cancelled")

    for i, x in enumerate(res):
        while x.running():
            print("Running")
            time.sleep(1)
        if not x.cancelled():
            print(x.result())

### Functions with multiple arguments

Using a pool and the `map` method with functions requiring multiple arguments can be done in two ways.

In [None]:
def f(a, b):
    return a + b

#### Using a function adapter

In [None]:
def f_(args):
    return f(*args)

In [None]:
xs = np.arange(24)
chunks = np.array_split(xs, xs.shape[0]//2)

In [None]:
chunks

In [None]:
with ProcessPoolExecutor(max_workers=4) as pool:
    res = pool.map(f_, chunks)
list(res)

#### Using multiple argument iterables

In [None]:
with ProcessPoolExecutor(max_workers=4) as pool:
    res = pool.map(f, range(0,24,2), range(1,24,2))
list(res)

Using threads in parallel with `ThreadPoolExecutor`
----

We do not get any speedup because the GIL allows only one thread to execute Python bytecode at a time.
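
Threads still help when the work spends its time waiting rather than computing, because a blocked thread releases the GIL. A minimal sketch, using `time.sleep` as a stand-in for I/O (`fake_io` is a hypothetical workload):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(seconds):
    time.sleep(seconds)  # sleeping releases the GIL, much like waiting on a socket
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, [0.2] * 4))
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f} s elapsed for {sum(results):.1f} s of waiting")
```

The four waits overlap, so the elapsed time is close to one wait rather than the sum of all four.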

In [None]:
%%time

with ThreadPoolExecutor(max_workers=4) as pool:
    res = pool.map(mc_pi, [int(1e5) for i in range(10)])

In [None]:
np.array(list(res))

## Using `multiprocessing`

`concurrent.futures.ProcessPoolExecutor` is built on top of the `multiprocessing` module and exists to unify the threading and process interfaces. I typically just work directly with `multiprocessing` since I don't have much use for threads. One nice thing about using `multiprocessing`, apart from more fine-grained control if you need it, is that it typically works equally well for small numbers of large jobs or large numbers of small jobs out of the box, using a heuristic to guess a good `chunksize`.
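
The heuristic in CPython's `Pool.map` is roughly "aim for about four chunks per worker". The sketch below reproduces that arithmetic for illustration; it is an implementation detail that may change between versions, and `default_chunksize` is a hypothetical helper rather than a public function.

```python
def default_chunksize(n_tasks, n_workers):
    """Approximate the chunk size Pool.map picks when none is given."""
    chunksize, extra = divmod(n_tasks, n_workers * 4)
    if extra:
        chunksize += 1  # round up so every task is covered
    return chunksize

print(default_chunksize(10, 4))      # few large jobs: one task per chunk
print(default_chunksize(10_000, 4))  # many small jobs: large chunks
```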

In [None]:
%%time

with mp.Pool(processes=4) as pool:
    res = pool.map(mc_pi, [int(1e5) for i in range(10)])

In [None]:
np.array(res)

In [None]:
%%time

with mp.Pool(processes=4) as pool:
    res = pool.map(mc_pi, [int(1e2) for i in range(int(1e4))])

In [None]:
np.array(res)

### Functions with multiple arguments

Multiprocessing `Pool` has a `starmap` method that removes the need to write a wrapper function.

In [None]:
def f(a, b):
    return a + b

In [None]:
xs = np.arange(24)
with Pool(processes=4) as pool:
    res = pool.starmap(f, np.array_split(xs, xs.shape[0]//2))
list(res)

#### Partial application

Sometimes, `functools.partial` can be used to reduce the number of arguments needed to just one.
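
One reason to prefer `partial` over a `lambda` here is that a process pool has to pickle the callable to send it to a worker. A short sketch of the difference, with `scale` and `double` as illustrative names:

```python
import pickle
from functools import partial

def scale(a, b):
    return a * b

double = partial(scale, b=2)

# A partial of a module-level function survives a pickle round trip,
# which is what a process pool needs in order to ship it to a worker.
restored = pickle.loads(pickle.dumps(double))
print(restored(21))

# A lambda doing the same job cannot be pickled by the standard pickler.
try:
    pickle.dumps(lambda a: scale(a, 2))
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False
print(lambda_pickles)
```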

In [None]:
def f(a, b):
    return a * b

In [None]:
from functools import partial

fp = partial(f, b=2)

In [None]:
xs = np.arange(24)
with Pool(processes=4) as pool:
    res = pool.map(fp, xs)
np.array(list(res))

### Blocking and non-blocking calls

In [None]:
def func(n):
    time.sleep(n)
    return n

In [None]:
with Pool(processes=4) as pool:
    res = pool.map(func, [3,3,3,3,3])
    print("Control back!")
res

In [None]:
with Pool(processes=4) as pool:
    res = pool.map_async(func, [3,3,3,3,3])
    print("Control back!")
    print(res.ready())
    res.wait()
    print(res.ready())
    print(res.get())

#### Different jobs to different processes

In [None]:
def f1(n):
    time.sleep(1)
    return n

def f2(n):
    time.sleep(1)
    return n**2

def f3(n):
    time.sleep(1)
    return n**3

def f4(n):
    time.sleep(1)
    return n**4

In [None]:
%%time

with Pool(processes=4) as pool:
    res = []
    for i, f in enumerate([f4, f2, f3, f1]):
        res.append((i, pool.apply(f, [2])))
    print(res)

In [None]:
%%time

with Pool(processes=4) as pool:
    res = []
    for i, f in enumerate([f4, f2, f3, f1]):
        res.append((i, pool.apply_async(f, [2])))
    print([(i, r.get()) for i, r in res])

### Creating individual processes

If you need more control over individual processes than `Pool` provides, for example to share information across processes, you can work with individual workers and process-safe memory structures. This is just for completeness, as most data processing tasks do not require this level of control.

In [None]:
def f(i):
    time.sleep(np.random.random())
    print(os.getpid(), i)

In [None]:
for i in range(10):
    p = mp.Process(target=f, args=(i,))
    p.start()
    p.join()
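
The loop above calls `join` right after `start`, so each process finishes before the next one begins and the work is effectively serial. A sketch of the usual alternative, which starts every process before waiting on any of them. The workload `nap` is a placeholder, and the `"fork"` start method is requested explicitly as an assumption: it is available on POSIX systems but not on Windows.

```python
import multiprocessing as mp
import time

def nap(seconds):
    time.sleep(seconds)

ctx = mp.get_context("fork")  # assumption: POSIX system

start = time.perf_counter()
procs = [ctx.Process(target=nap, args=(0.5,)) for _ in range(4)]
for p in procs:
    p.start()  # launch every process first
for p in procs:
    p.join()   # then wait for all of them
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f} s", [p.exitcode for p in procs])
```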

#### Using Queues to share information between processes

In [None]:
def f1(q, i):
    time.sleep(np.random.random())
    q.put((os.getpid(), i))

In [None]:
q = mp.Queue()

res = []
for i in range(10):
    p = mp.Process(target=f1, args=(q,i,))
    p.start()
    res.append(q.get())
    p.join()

res

#### Using Value and Array for sharing data

#### Counting number of jobs (1)

This does not work.

In [None]:
def f2(i):
    global counter
    counter = counter + 1
    print(os.getpid(), i)

#### Checking

In [None]:
counter = 0
f2(10)
print(counter)

In [None]:
counter = 0

for i in range(10):
    p = mp.Process(target=f2, args=(i,))
    p.start()
    p.join()

#### Note that separate processes have their own memory and DO NOT share global memory

In [None]:
counter

#### Counting number of jobs (2)

We can use shared memory to do this, but it is slow because `multiprocessing` has to ensure that only one process gets to use `counter` at any one time. `multiprocessing` provides `Value` and `Array` shared memory variables, but you can also convert arbitrary Python variables into shared memory objects (less efficient).

In [None]:
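def f3(i, counter, store):
    # `+=` is a read followed by a write, so hold the lock to keep each update atomic
    with counter.get_lock():
        counter.value += 1
    with store.get_lock():
        store[os.getpid() % 10] += 1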

In [None]:
%%time

counter = mp.Value('i', 0)
store = mp.Array('i', [0]*10)

for i in range(int(1e2)):
    p = mp.Process(target=f3, args=(i, counter, store))
    p.start()
    p.join()

print(counter.value)
print(store[:])

#### Avoiding use of shared memory

#### Counting number of jobs (3)

We should try to avoid using shared memory as much as possible in parallel jobs, as it drastically reduces efficiency. One useful approach is the `map-reduce` pattern. We should also use `Pool` to reuse processes rather than spawn too many of them.
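
The same pattern applies to the $\pi$ estimate from the start of the notebook: each task returns only a partial count (map) and the parent process adds the counts up (reduce), so nothing is shared while the workers run. This is a sketch under assumptions: `count_hits` is a hypothetical helper, each task gets its own seeded generator so that workers do not draw the same samples, and the `"fork"` start method is requested explicitly (POSIX only).

```python
import multiprocessing as mp
import numpy as np

def count_hits(args):
    """Map step: count points inside the unit circle for one independent task."""
    n, seed = args
    rng = np.random.default_rng(seed)  # per-task generator, so tasks do not repeat samples
    x = rng.uniform(-1, 1, size=n)
    y = rng.uniform(-1, 1, size=n)
    return int(np.sum(x**2 + y**2 < 1))

n_per_task, n_tasks = 50_000, 8
tasks = [(n_per_task, seed) for seed in range(n_tasks)]

ctx = mp.get_context("fork")  # assumption: POSIX system; not available on Windows
with ctx.Pool(processes=4) as pool:
    hits = pool.map(count_hits, tasks)  # map: each task returns one small integer

# reduce: combine the partial counts in the parent process
pi_estimate = 4 * sum(hits) / (n_per_task * n_tasks)
print(pi_estimate)
```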

In [None]:
def f4(i):
    return (os.getpid(), 1, i)

In [None]:
%%time

# map step
with mp.Pool(processes=4) as pool:
    res = pool.map(f4, range(int(1e2)))

##### Reduce steps

In [None]:
res = np.array(res)

In [None]:
res[np.random.choice(len(res), 10)]

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(res, columns=['pid', 'one', 'i'])

In [None]:
df.groupby('pid').sum()