# Multiprocessing in Python

What happens if we need to do a lot of computation, and vectorization with NumPy is not enough? 

In other words: how can we assign different tasks to different cores or processors?

## Multiprocessing systems/interfaces

There are a few multiprocessing interfaces we can use, listed roughly from easiest to hardest:

| Interface | When to use |
|-----------|-------------|
| `concurrent.futures.ProcessPoolExecutor` | The modern way to launch parallel tasks. Suitable for any long-running work that only needs to run on one computer. |
| `mpi4py.futures.MPIPoolExecutor` | The modern way to run scalable parallel tasks on computer clusters. Use it for long tasks that need to run on more than one computer. |
| `multiprocess` | A third-party fork of the built-in `multiprocessing` library that can serialize more kinds of objects. A bit more manual than the pools. |
| `multiprocessing` | The classic built-in library. You will probably use parts of it (like the `Manager`), but other components are trickier than their `concurrent.futures` counterparts. |
| `mpi4py` classic | A bit beyond the scope of this class. |


In [1]:
import concurrent.futures
import time
import long_functions
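
The `long_functions` module is not listed in this notebook. To run the cells yourself, a hypothetical stand-in could look like the sketch below, saved as `long_functions.py` next to the notebook. The `seconds` parameter, the roughly two-second duration, and the random return value are all assumptions chosen to resemble the printed output; the real `really_hard_task` may do something quite different.

```python
# long_functions.py: hypothetical stand-in, NOT the module used in this notebook
import math
import random
import time


def really_hard_task(n, seconds=2.0):
    """Burn CPU for roughly `seconds`, then return a random float in [0, 1).

    The duration and the random result are assumptions made for
    illustration only.
    """
    deadline = time.perf_counter() + seconds
    total = 0.0
    i = 0
    while time.perf_counter() < deadline:
        total += math.sqrt(i % (n + 1))  # busy work that keeps one core occupied
        i += 1
    result = random.random()
    print(f"finished task crunching {n} with result {result:.2f}")
    return result
```

Keeping the task in its own importable module also means worker processes can find it by name, which matters once the work is handed to a process pool.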

In [2]:
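def serial_task():
    """Run every task one after another and report the total wall time."""
    possible_params = [3, 5, 10, 15]
    starting_time = time.perf_counter()  # monotonic clock, suited to timing
    for n in possible_params:
        long_functions.really_hard_task(n)
    duration_time = time.perf_counter() - starting_time
    print(f"Took {duration_time:.2f} s")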

In [3]:
serial_task()

finished task crunching 3 with result 0.11
finished task crunching 5 with result 0.05
finished task crunching 10 with result 0.28
finished task crunching 15 with result 0.94
Took 8.04 s


In [15]:
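possible_params = [3, 5, 10, 15, 17, 20]
# Up to four worker processes run tasks at the same time.
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as ppe:
    starting_time = time.perf_counter()
    # map() returns a lazy iterator of results (not Future objects), in input order.
    results = ppe.map(long_functions.really_hard_task, possible_params)
    # list() blocks until every task has finished.
    print(list(results))
    duration_time = time.perf_counter() - starting_time
    print(f"Took {duration_time:.2f} s")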

[0.20724996676662955, 0.1254745967454749, 0.8664232287156962, 0.8340359255525761, 0.0017337226089754187, 0.8690424981520508]
Took 4.58 s
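
Outside a notebook, the same pattern usually needs one extra piece. On platforms that start workers with the spawn method (Windows and macOS by default), each worker re-imports the main script, so the pool should be created under an `if __name__ == "__main__":` guard. Here is a minimal sketch of a standalone script; `square_sum` and `run_parallel` are hypothetical names, with `square_sum` standing in for `long_functions.really_hard_task`.

```python
import concurrent.futures
import time


def square_sum(n):
    """Hypothetical CPU-bound task: sum of squares below n * 100_000."""
    return sum(i * i for i in range(n * 100_000))


def run_parallel(params, max_workers=4):
    """Map square_sum over params in worker processes, keeping input order."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as ppe:
        return list(ppe.map(square_sum, params))


if __name__ == "__main__":
    # Without this guard, spawned workers would try to start pools of their own.
    start = time.perf_counter()
    results = run_parallel([3, 5, 10, 15, 17, 20])
    print(results)
    print(f"Took {time.perf_counter() - start:.2f} s")
```

The task function is defined at module level so that it can be pickled by reference and looked up again inside each worker.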


What happens if we request more workers than the machine has cores?

In [21]:
possible_params = range(1, 45, 2)  # 22 tasks
# This machine has 16 cores, so 32 workers is twice the core count.
with concurrent.futures.ProcessPoolExecutor(max_workers=32) as ppe:
    starting_time = time.perf_counter()
    results = ppe.map(long_functions.really_hard_task, possible_params)
    print(list(results))
    duration_time = time.perf_counter() - starting_time
    print(f"Took {duration_time:.2f} s")

[(2.0818426000000003, 0.46002758938601807), (2.079698, 0.38821834384052334), (2.0494889, 0.25627659113577117), (2.0574234000000002, 0.6112536963756398), (2.0002123, 0.015806798895283247), (2.0286079, 0.5601639456884568), (2.0063168, 0.8920934469282835), (2.0250826, 0.1395229536742053), (2.0010885, 0.9766091664377412), (2.0291446, 0.8343231677551194), (2.0006616, 0.4377725500481673), (2.0006008, 0.5825156998280913), (2.0004179, 0.4129245877479223), (2.0010774000000002, 0.4648283465467955), (2.0150069999999998, 0.48304946333931253), (2.0012633, 0.5643367886213527), (2.0573891, 0.42180645058951927), (2.0011395999999997, 0.06682621589679816), (2.0105072, 0.6316946418454051), (2.0045592, 0.6472864818983082), (2.0009552999999998, 0.26689059620006894), (2.0004841, 0.9281988421156807)]
Took 3.28 s
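
Rather than hard-coding a worker count, one option is to derive it from the machine. The sketch below uses `os.cpu_count()` and caps the result by the number of tasks; `pick_worker_count` and its `reserve` parameter are hypothetical names introduced here for illustration, not part of any library.

```python
import os


def pick_worker_count(n_tasks, reserve=0):
    """Return a worker count capped by both the task count and the core count.

    `reserve` is the number of cores to leave free for other work.
    """
    cores = os.cpu_count() or 1  # cpu_count() can return None
    usable = max(1, cores - reserve)
    return max(1, min(n_tasks, usable))


print(pick_worker_count(22))
```

The result can be passed as `max_workers=pick_worker_count(len(possible_params))`. Leaving `max_workers` unset also works: `ProcessPoolExecutor` then picks a default based on the number of processors on the machine.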


If you want to run a lot of workers, you can try it on DevCloud!

## Basic Linear Algebra Problems in Multiprocessing

It turns out that the Math Kernel Library (MKL), or whichever Basic Linear Algebra Subprograms (BLAS) implementation NumPy is linked against, will sometimes start multiple threads on its own. Combined with your own worker processes, this can oversubscribe the cores and slow everything down, so set the following environment variables before importing NumPy or any package that depends on it:

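```
import os
os.environ['OMP_NUM_THREADS'] = '1'       # OpenMP-based libraries
os.environ['MKL_NUM_THREADS'] = '1'       # Intel MKL
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # OpenBLAS builds of NumPy

# now you can import numpy and other packages
import numpy as np
import scipy.stats
# etc.
```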
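
Because these variables are typically read when the numerical library is first loaded, the order of operations matters. A small helper can make the ordering mistake visible. This is a sketch under that assumption; `limit_blas_threads` and `THREAD_VARS` are hypothetical names, and the check only looks at whether `numpy` is already in `sys.modules`.

```python
import os
import sys

# Variables read by OpenMP-based builds, MKL, and OpenBLAS respectively.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_blas_threads(n_threads=1):
    """Set the thread-count variables; return False if NumPy was imported first."""
    too_late = "numpy" in sys.modules
    for name in THREAD_VARS:
        os.environ[name] = str(n_threads)
    if too_late:
        print("warning: numpy is already imported; restart and call this earlier")
    return not too_late


limit_blas_threads(1)
```

Calling this at the very top of a script or notebook, before any numerical imports, keeps each worker process to a single BLAS thread.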