## Overview

CPython, the common implementation we all use, does not use multiple CPUs by default.

Expect at best a roughly *n-times* speed-up of execution time with *n cores*; in practice the gain is usually smaller.

Each additional process increases communication overhead and reduces available RAM, which can eventually cause a significant overall slowdown.

Amdahl's law: *If only a small part of your code can be parallelized, it doesn't matter how many CPUs you use - it won't run much faster after all*
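
Amdahl's law can be made concrete with a few lines of arithmetic. The sketch below assumes the usual formulation, speed-up = 1 / ((1 - p) + p / n), where p is the fraction of runtime that can be parallelized and n is the number of workers. The names `amdahl_speedup`, `parallel_fraction` and `n_workers` are illustrative only.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Theoretical speed-up when `parallel_fraction` of the runtime
    is spread evenly over `n_workers` and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


if __name__ == "__main__":
    # Print the theoretical speed-up for a few parallel fractions and core counts
    for p in (0.5, 0.9, 0.99):
        speedups = [round(amdahl_speedup(p, n), 2) for n in (2, 4, 8, 64)]
        print(f"parallel fraction {p}: {speedups}")
```

With only half of the code parallelizable the speed-up stays below 2 no matter how many cores are added, because the serial half always has to run in full.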

If each process needs to communicate with every other Python process, the communication overhead will slowly overwhelm the processing and slow things down.

As more and more processes are added, overall performance can therefore degrade.

Typical jobs for the multiprocessing module:
* Parallelize a CPU-bound task with Process or Pool objects
* Parallelize an IO-bound task in a Pool with threads (dummy module)
* Share pickled work via a Queue
* Share state between parallelized workers (bytes, primitive datatypes, dicts, lists)
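
Sharing work via a Queue could look roughly like the sketch below. It is a minimal illustration, not the only way to do it: the helper names `worker` and `run`, the use of `None` as a stop signal and the choice of two workers are all assumptions made for this example.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # assumed stop signal: one per worker tells it to exit


def worker(tasks, results):
    """Take numbers from `tasks` until the sentinel arrives; put (x, x**3) on `results`."""
    while True:
        x = tasks.get()
        if x is SENTINEL:
            break
        results.put((x, x ** 3))


def run(inputs, n_workers=2):
    inputs = list(inputs)
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for x in inputs:
        tasks.put(x)  # items are pickled when they are sent to the workers
    for _ in workers:
        tasks.put(SENTINEL)
    # Collect the results before joining, so no worker is stuck flushing its queue
    collected = [results.get() for _ in inputs]
    for w in workers:
        w.join()
    return sorted(collected)  # arrival order is not guaranteed, so sort


if __name__ == "__main__":
    print(run(range(1, 7)))
```

Results arrive in whatever order the workers finish, which is why each result carries its input along with it.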

**Threads** in Python are bound by the GIL (Global Interpreter Lock), so only one thread may interact with Python objects at a time.

By using **processes**, we run a number of Python interpreters in parallel, each with its own private memory space and its own GIL. Each interpreter still runs its code serially, but there is no competition for a shared GIL.

## Multiprocessing module

Main components:
* **Process**:
a forked copy of the current process (an independent child process with a new PID in the operating system; you provide it with a target function to run)
* **Pool**:
wraps the Process or threading.Thread API into a convenient pool of workers that share chunks of work and return an aggregated result
* **Queue**:
FIFO queue allowing multiple producers and consumers
* **Pipe**:
Uni- or bidirectional communication channel between two processes
* **Manager**:
High-level managed interface to share Python objects between processes
* **ctypes**:
Allows sharing of primitive datatypes (ints, floats, bytes) between processes after they have forked
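
As an illustration of sharing a primitive datatype, the sketch below uses `multiprocessing.Value` (a ctypes-backed shared integer) as a counter that several processes increment. The helper name `add_many` and the numbers are made up for this example. The lock is needed because `counter.value += 1` is a read followed by a write, not an atomic operation.

```python
from multiprocessing import Process, Value


def add_many(counter, n):
    """Increment the shared counter n times, holding its lock for each update."""
    for _ in range(n):
        with counter.get_lock():
            counter.value += 1


if __name__ == "__main__":
    counter = Value("i", 0)  # "i" is the typecode for a C signed int
    workers = [Process(target=add_many, args=(counter, 1000)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)  # 4000: every increment was made under the lock
```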


### Using processes

The Pool.apply and Pool.map methods are the pool counterparts of Python's built-in map function and of the apply function from Python 2 (apply was removed in Python 3).

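Pool.map and Pool.apply block the main program until their work is finished: Pool.map until all results are ready, Pool.apply until its single call has returned (so calls to Pool.apply in a loop run one after another). This is quite useful if we want to obtain results in a particular order for certain applications.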
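
The ordering guarantee can be checked with a small experiment. The sketch below makes tasks finish in a different order than they were submitted and compares `Pool.map` with `Pool.imap_unordered`, which yields results as they complete. The helper names `slow_cube` and `compare_orderings` and the sleep durations are assumptions for illustration.

```python
import time
from multiprocessing import Pool


def slow_cube(x):
    # Smaller inputs sleep longer, so tasks tend to finish in reverse order
    time.sleep(0.05 * (5 - x))
    return x ** 3


def compare_orderings(pool):
    """Run the same tasks with map (input order) and imap_unordered (completion order)."""
    inputs = range(1, 5)
    ordered = pool.map(slow_cube, inputs)  # blocks until all results are ready
    as_completed = list(pool.imap_unordered(slow_cube, inputs))
    return ordered, as_completed


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        ordered, as_completed = compare_orderings(pool)
    print(ordered)       # [1, 8, 27, 64]
    print(as_completed)  # same values; the order depends on timing
```

`compare_orderings` only relies on the Pool API, so it should work unchanged with the thread-based pool from multiprocessing.dummy.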

#### pool.map()

In [None]:
def cube(x):
    return x**3


if __name__ == "__main__":
    from multiprocessing import Pool  # a pool of worker processes, each with its own PID

    # The context manager shuts the pool down once the results are in
    with Pool(processes=4) as pool:
        results = pool.map(cube, range(1, 7))

    print(results)
    

#### pool.apply()

In [None]:
def cube(x):
    return x**3
    
if __name__ == "__main__":
    from multiprocessing import Pool  # a pool of worker processes, each with its own PID

    with Pool(processes=4) as pool:
        # pool.apply blocks until each call returns, so these calls run one after another
        results = [pool.apply(cube, args=(x,)) for x in range(1, 7)]

    print(results)

### Using threads

Instead of importing multiprocessing, import **multiprocessing.dummy**, which offers the same API backed by threads.

#### dummy.pool.apply()

In [None]:

def cube(x):
    return x**3
    
if __name__ == "__main__":
    from multiprocessing.dummy import Pool  # a pool of threads, all sharing one PID

    with Pool(processes=4) as pool:
        # pool.apply blocks until each call returns, so these calls run one after another
        results = [pool.apply(cube, args=(x,)) for x in range(1, 7)]

    print(results)


#### dummy.pool.map()

In [None]:
def cube(x):
    return x**3

if __name__ == "__main__":
    from multiprocessing.dummy import Pool  # a pool of threads, all sharing one PID

    # The context manager shuts the pool down once the results are in
    with Pool(processes=4) as pool:
        results = pool.map(cube, range(1, 7))

    print(results)
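
The "multiple pid" versus "same pid" remarks in the comments above can be verified directly. The sketch below asks every worker for its process ID, once with a thread pool and once with a process pool. The helper names `worker_pid` and `distinct_pids` are made up for this example.

```python
import os
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool


def worker_pid(_):
    # Report the ID of the process that is executing this task
    return os.getpid()


def distinct_pids(pool, n_tasks=8):
    """Return the set of process IDs that handled the tasks."""
    return set(pool.map(worker_pid, range(n_tasks)))


if __name__ == "__main__":
    with ThreadPool(4) as pool:
        # Threads live inside the main process, so only its PID shows up
        print("threads share the main PID:", distinct_pids(pool) == {os.getpid()})
    with Pool(4) as pool:
        # Worker processes are separate children with PIDs of their own
        print("processes use other PIDs:", os.getpid() not in distinct_pids(pool))
```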

## Joblib module
https://joblib.readthedocs.io/en/latest/parallel.html

Joblib offers a higher-level alternative to the multiprocessing module:

- easy parallel computing

- transparent disk-based caching of results

- focus on numpy arrays

For parallel computing we need the Parallel class and the delayed function:

- Parallel sets up a pool of workers (processes by default)
- delayed wraps the target function so that each call is captured as a task; the Parallel object then runs the tasks it receives from an iterator


In [None]:
from joblib import Parallel, delayed

def cube(x):
    return x**3

results = Parallel(n_jobs=4, verbose=1)(delayed(cube)(x) for x in range(7))

print(results)

### Joblib memory cache

Memory cache: a decorator that stores function results on disk, keyed by the input arguments, so the cache persists between Python sessions.

The cache is refreshed only when the cache location is cleared or the decorated function itself is changed; changes to functions it calls are not detected!

In [None]:
from joblib import Parallel, delayed, Memory

memory = Memory("./joblib_cache", verbose=0)  # set the location of the cache

@memory.cache  # cache the results of this function on disk
def cube(x):
    return x**3

results = Parallel(n_jobs=4)(delayed(cube)(x) for x in range(7))

print(results)
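
To see a cache hit in action, the sketch below records every real execution of the cached function in a list. It assumes joblib is installed and uses a temporary directory instead of ./joblib_cache so that it always starts from an empty cache. The `executions` list and the directory prefix exist only for this demonstration.

```python
import shutil
import tempfile

from joblib import Memory

cache_dir = tempfile.mkdtemp(prefix="joblib_demo_")  # throwaway cache location
memory = Memory(cache_dir, verbose=0)

executions = []  # filled only when the function body really runs


@memory.cache
def cube(x):
    executions.append(x)
    return x ** 3


first = cube(3)   # computed and written to the cache
second = cube(3)  # same argument: read back from disk, body not executed
third = cube(4)   # new argument: computed again

print(first, second, third)
print(executions)

shutil.rmtree(cache_dir, ignore_errors=True)  # remove the demo cache
```

The second call returns the stored result without running the function body, so it leaves no trace in `executions`.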
