In [None]:
%run ../00_AdvancedPythonConcepts/talktools.py

<img src="https://www.evernote.com/l/AUV1r1xvhBdPF6lX-2SJLkO-vkkmCXEDrMwB/image.png">
http://www.slideshare.net/ManojitNandi/parallel-programming-in-python-speeding-up-your-analysis

Remember, you can create (fork) many processes. Each one is a copy of the original parent process (memory, data, state) and runs independently of the others. Because processes do not share memory, any data they exchange has to be passed between them explicitly. The Pythonic way to do multiprocessing (creating new processes and communicating between them) is the `multiprocessing` module.

*"effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows."*

- https://docs.python.org/3/library/multiprocessing.html

The analog to `threading.Thread` is `multiprocessing.Process`. In most cases it works as a drop-in replacement, since the constructor takes the same `target` and `args`. To identify the worker, use `os.getpid()` instead of `threading.current_thread()`.
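
As a rough sketch of both points above (the drop-in replacement and the fact that each process works on its own copy of memory), the same target function can be run with either class. The names `shared`, `worker` and `run` are made up for this illustration.

```python
import multiprocessing
import os
import threading

shared = []  # module-level list that lives in the parent's memory


def worker(num):
    """Append a record to the module-level list of whichever process runs this."""
    shared.append((num, os.getpid()))


def run(worker_cls, n=2):
    """Start n workers of the given class, wait for them, return what the parent sees."""
    shared.clear()
    workers = [worker_cls(target=worker, args=(i,)) for i in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return list(shared)


if __name__ == "__main__":
    # Threads share the parent's memory: two records, both with the parent's PID.
    print("threads:  ", run(threading.Thread))
    # Processes append to their own copy, so the parent's list stays empty: []
    print("processes:", run(multiprocessing.Process))
```

The only change between the two calls is the class passed in; getting data back from the processes needs an explicit channel such as a `Queue` or `Pipe`.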

In [1]:
!pip install snakeviz

Collecting snakeviz
  Downloading snakeviz-0.4.1-py2.py3-none-any.whl (166kB)
    100% |████████████████████████████████| 174kB 1.7MB/s
Installing collected packages: snakeviz
Successfully installed snakeviz-0.4.1


In [2]:
%load_ext snakeviz

In [3]:
%%snakeviz

import logging
import random
import time
import os

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-9s) %(message)s',)

import multiprocessing

def worker(num):
    """process worker function: sleep for a random time and log the PID"""
    
    sleep_time = random.randint(1,5)
    logging.debug('worker: {0} sleeping for {1} s, name: {2}'
                   .format(num,sleep_time,os.getpid()))
    time.sleep(sleep_time)
    logging.debug('done')
    return

procs = []
for i in range(2):
    p = multiprocessing.Process(target=worker, args=(i,))
    procs.append(p)
    p.start()

(MainThread) worker: 0 sleeping for 4 s, name: 7354
(MainThread) worker: 1 sleeping for 2 s, name: 7355


 
*** Profile stats marshalled to file '/var/folders/s1/gv4yr2493jq4d4jtzpvhp0z00000gn/T/tmpnjvgop7m'. 


(MainThread) done
(MainThread) done


If your machine has multiple cores, the operating system can schedule these two processes on separate cores, where they run independently and in parallel.

You may need to share information between processes. As with threads, you can do this with a `Queue` (here `multiprocessing.Queue`). You can also use a (UNIX-like) `Pipe` to let two processes communicate with each other:

In [4]:
# https://docs.python.org/3/library/multiprocessing.html#exchanging-objects-between-processes
from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print(parent_conn.recv())   # prints "[42, None, 'hello']"
    p.join()

[42, None, 'hello']
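
The `Queue` alternative mentioned above could look roughly like this. It is a minimal sketch: the helper names `producer` and `drain` and the use of `None` as an end-of-data sentinel are choices made for this example, not part of the `multiprocessing` API.

```python
from multiprocessing import Process, Queue


def producer(queue, n):
    """Put n squares on the queue, then a None sentinel to signal the end."""
    for i in range(n):
        queue.put(i * i)
    queue.put(None)


def drain(queue):
    """Read items from the queue until the sentinel arrives."""
    items = []
    while True:
        item = queue.get(timeout=10)  # raises queue.Empty if nothing arrives in time
        if item is None:
            return items
        items.append(item)


if __name__ == "__main__":
    q = Queue()
    p = Process(target=producer, args=(q, 5))
    p.start()
    print(drain(q))  # [0, 1, 4, 9, 16]
    p.join()
```

Unlike a `Pipe`, which connects exactly two endpoints, a `Queue` can have several producers and consumers.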


`multiprocessing.Pool` manages a pool of workers (in separate processes) and distributes tasks among them.

https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers

In [5]:
from multiprocessing import Pool 
pool = Pool(processes=4)     # start 4 worker processes 

In [6]:
pool

<multiprocessing.pool.Pool at 0x104ac6320>

In [7]:
import time
def g(x): 
    time.sleep(0.2)
    return x*x

In [8]:
%time list(map(g,range(10)))

CPU times: user 2.44 ms, sys: 1.5 ms, total: 3.94 ms
Wall time: 2.03 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [10]:
%time pool.map(g,range(10))

CPU times: user 2.1 ms, sys: 1.55 ms, total: 3.65 ms
Wall time: 603 ms


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [11]:
# print same numbers in arbitrary order
for i in pool.imap_unordered(g, range(10)):
    print(i,sep=" ",end=" ")

4 1 9 0 49 36 25 16 81 64 

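Python packages up (pickles) your function and its arguments and sends them to the worker processes. Functions are pickled by name, so they must be defined at module level where the workers can look them up; a `lambda` has no such name and fails: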

In [12]:
pool.map(lambda x: x**3, range(10))

PicklingError: Can't pickle <function <lambda> at 0x104bce840>: attribute lookup <lambda> on __main__ failed
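
The failure can be reproduced without a pool by calling `pickle` directly. The sketch below is an illustration under assumptions: `can_pickle`, `cube` and `cube_partial` are hypothetical names, and `partial(pow, exp=3)` relies on `pow` accepting keyword arguments, which needs Python 3.8 or later.

```python
import pickle
from functools import partial


def cube(x):
    """Module-level function: pickled by reference to its name."""
    return x ** 3


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A partial of a built-in is another way to avoid a lambda (Python 3.8+ assumed).
cube_partial = partial(pow, exp=3)

print(can_pickle(lambda x: x ** 3))  # False
print(can_pickle(cube))
print(can_pickle(cube_partial))      # True
```

Either `cube` or `cube_partial` can then be passed to `pool.map` in place of the lambda.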

In [13]:
# evaluate g(10) asynchronously in a single worker process
result = pool.apply_async(g, [10])

# prints "100", or raises multiprocessing.TimeoutError if it takes longer than 0.4 s
print(result.get(timeout=0.4)) 

100
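
The three calling styles used so far (`map`, `imap_unordered`, `apply_async`) can be compared side by side. To keep the sketch runnable as a plain script it swaps in `multiprocessing.pool.ThreadPool`, which exposes the same methods as `Pool` but uses threads; that substitution is an assumption of this example, not what the cells above do.

```python
import time
from multiprocessing.pool import ThreadPool  # same methods as multiprocessing.Pool


def g(x):
    time.sleep(0.01)
    return x * x


with ThreadPool(processes=4) as pool:
    ordered = pool.map(g, range(10))                     # results in input order
    unordered = list(pool.imap_unordered(g, range(10)))  # results in completion order
    single = pool.apply_async(g, [10]).get(timeout=5)    # one call, fetched later

print(ordered)
print(sorted(unordered) == ordered)  # True
print(single)                        # 100
```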


In [14]:
del pool

In [15]:
from multiprocessing import Pool 
import time

def f(x): 
    return x*x

for i in [1,2,3,4,8]:
    print(i,"*"*5,flush=True)
    pool = Pool(processes=i)               # start i worker processes 
    start = time.time()
    pool.map(f, range(100000))
    print("{0:0.4f} sec".format(time.time() - start))
    pool.terminate()
    del pool

1 *****
0.0474 sec
2 *****
0.0408 sec
3 *****
0.0494 sec
4 *****
0.0374 sec
8 *****
0.0523 sec

Adding workers barely changes the wall time here: `f` is so cheap that the cost of pickling the 100000 inputs and results and passing them between processes dominates the computation itself.


In [16]:
!ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4864
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited


# `concurrent.futures` - Launching parallel tasks

`concurrent.futures` is built in (part of the standard library). It creates pools of workers for executing **maps** (a single loop over the data), using local resources only.

*"The concurrent.futures module provides a high-level interface for asynchronously executing callables.*

*The asynchronous execution can be performed with threads, using ThreadPoolExecutor, or separate processes, using ProcessPoolExecutor. Both implement the same interface, which is defined by the abstract Executor class."*

https://docs.python.org/3/library/concurrent.futures.html
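
Because both executors implement the same interface, code can be written once and handed either class. A minimal sketch, where `square` and `run_map` are hypothetical helper names:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(x):
    return x * x


def run_map(executor_cls, n, max_workers=4):
    """Map square over range(n) with whichever executor class is passed in."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(square, range(n)))


if __name__ == "__main__":
    print(run_map(ThreadPoolExecutor, 5))   # [0, 1, 4, 9, 16]
    print(run_map(ProcessPoolExecutor, 5))  # [0, 1, 4, 9, 16]
```

Threads tend to suit I/O-bound work such as network requests, while processes suit CPU-bound work that would otherwise be limited by the Global Interpreter Lock.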

In [40]:
from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor()  # can also use a threadpool

In [18]:
%%time 

from time import sleep

results = []
for i in range(8):
    sleep(1)
    results.append(i + 1)

CPU times: user 8.64 ms, sys: 3.74 ms, total: 12.4 ms
Wall time: 8.03 s


In [19]:
results

[1, 2, 3, 4, 5, 6, 7, 8]

In [20]:
%%time 
from time import sleep

from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor() 

def slowfunc(x):
    sleep(1)
    return(x+1)

results = list(e.map(slowfunc,range(8)))

CPU times: user 11.2 ms, sys: 14.7 ms, total: 25.9 ms
Wall time: 2.02 s


In [21]:
results

[1, 2, 3, 4, 5, 6, 7, 8]

In [59]:
e.shutdown()

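`ProcessPoolExecutor()` figured out that this machine has 4 cores and ran the tasks in 4 separate processes, so the 8 one-second tasks took about 2 s instead of 8 s.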
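
The timings can be checked with a back-of-the-envelope model. This is an idealized sketch that ignores start-up and communication overhead and assumes 4 workers, as stated above; `expected_wall_time` is a made-up helper.

```python
import math
import os


def expected_wall_time(n_tasks, n_workers, seconds_per_task=1.0):
    """Idealized wall time: tasks run in ceil(n_tasks / n_workers) rounds."""
    return math.ceil(n_tasks / n_workers) * seconds_per_task


print(os.cpu_count())             # number of CPUs the OS reports on this machine
print(expected_wall_time(8, 4))   # 2.0
print(expected_wall_time(10, 4))  # 3.0
```

The first figure matches the 2.02 s measured above, and the second matches the roughly 3 s measured for the 10 submitted tasks in the `Executor.submit` section.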

## Breakout

Convert the sequential code to parallel using `concurrent.futures`

In [51]:
#%%time

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

url = "https://en.wikipedia.org/wiki/Special:Random"

def titlebot(x):
    """Fetch a random Wikipedia page; x is only the task index and is unused."""
    resp = requests.get(url).text
    title = BeautifulSoup(resp, 'html.parser').title.string
    print("title=", title.split("- Wikipedia")[0], "len=", len(resp))
    return len(resp)

e = ThreadPoolExecutor()

In [52]:
results = list(e.map(titlebot,range(11)))

(Thread-36) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-26) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-30) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-29) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-31) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-27) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-32) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-33) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-34) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-35) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-28) Starting new HTTPS connection (1): en.wikipedia.org
(Thread-33) "GET /wiki/Special:Random HTTP/1.1" 302 20
(Thread-35) "GET /wiki/Special:Random HTTP/1.1" 302 20
(Thread-36) "GET /wiki/Special:Random HTTP/1.1" 302 20
(Thread-34) "GET /wiki/Special:Random HTTP/1.1" 302 20
(Thread-32) "GET /wiki/Special:Random HTTP/1.1" 302 20
(Thread-27) "GET /wik

title= Dasgupta  len= 34263
title= Navia immersa  len= 27197
title= Telephone numbers in Andorra  len= 39549
title= Meinl (surname)  len= 23353
title= One Small Step (album)  len= 34918
title= Ungheni, Argeș  len= 46284
title= Iridomyrmex mattiroloi  len= 29442
title= XHACB-FM  len= 44710
title= Mianzulan  len= 63028
title= Septaria (gastropod)  len= 34478
title= List of Royal Artillery Batteries  len= 69138


In [53]:
results

[69138, 34478, 63028, 46284, 34918, 29442, 27197, 23353, 39549, 34263, 44710]

### Executor.submit

`submit` starts an execution in a separate thread or process and immediately returns a `Future` object that points back to the result. Until the function completes, the future is pending. We get the result of a task with `.result()`, which blocks until the computation is complete.

In [54]:
%%time 
from time import sleep

from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor() 

def slowfunc(x,y,delay=1):
    sleep(delay)
    return(x+y)

future = e.submit(slowfunc,1, 2)

CPU times: user 5.98 ms, sys: 13.4 ms, total: 19.3 ms
Wall time: 17.5 ms


In [56]:
future

<Future at 0x104bfb080 state=finished returned int>

In [55]:
future.result()

3
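
The pending and finished states of a `Future` can be made visible with a small sketch. It uses a `ThreadPoolExecutor` and a `threading.Event` as a gate so the task cannot finish until released; both are choices made for this illustration, and `add_when_released` is a hypothetical function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

gate = threading.Event()


def add_when_released(x, y):
    """Wait until the gate is opened, then return the sum."""
    gate.wait(timeout=10)
    return x + y


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(add_when_released, 1, 2)  # returns immediately
    pending = not future.done()  # True: the task is still waiting on the gate
    gate.set()                   # let the task proceed
    value = future.result()      # blocks until the task has finished
    finished = future.done()     # True once the result is available

print(pending, value, finished)  # True 3 True
```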

In [57]:
%%time 
futures = [e.submit(slowfunc,1,2, delay=1) for _ in range(10)]
results = [f.result() for f in futures]

CPU times: user 7.67 ms, sys: 3.31 ms, total: 11 ms
Wall time: 3.01 s


In [58]:
results

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

## Joblib

http://pythonhosted.org/joblib/

Joblib runs Python functions as pipeline jobs. The *vision is to provide tools to easily achieve better performance and reproducibility when working with long running jobs.* It is specifically meant to work well with large data (e.g. NumPy arrays).

  - **Avoid computing twice the same thing**: code is rerun over and over, for instance when prototyping computational-heavy jobs (as in scientific development), but hand-crafted solution to alleviate this issue is error-prone and often leads to unreproducible results
  - **Persist to disk transparently**: persisting in an efficient way arbitrary objects containing large data is hard. Using joblib’s caching mechanism avoids hand-written persistence and implicitly links the file on disk to the execution context of the original Python object. As a result, joblib’s persistence is good for resuming an application status or computational job, e.g. after a crash.

Joblib strives to address these problems while leaving your code and your flow control as unmodified as possible (no framework, no new paradigms).

In [None]:
!conda install joblib -y

In [None]:
from math import sqrt
[sqrt(i ** 2) for i in range(10)]

### Parallel Helpers

Joblib provides a simple helper class to write parallel for loops using multiprocessing. The core idea is to write the code to be executed as a generator expression, and convert it to parallel computing.

In [None]:
from math import sqrt
from joblib import Parallel, delayed

By default, `Parallel` runs tasks concurrently in separate Python worker processes on separate CPUs (older joblib releases fork them with the `multiprocessing` module; newer releases default to the `loky` backend). This is a reasonable default for generic Python programs, but it induces some overhead because the input and output data need to be serialized for communication with the worker processes. Passing `backend="threading"` avoids that overhead:

In [None]:
Parallel(n_jobs=2,backend="threading") \
  (delayed(sqrt)(i ** 2) for i in range(10))

In [None]:
import time
start = time.time()
Parallel(n_jobs=10,verbose=5) \
  (delayed(time.sleep)(1) for _ in range(10))
print(time.time()-start)

### On demand recomputing: the `Memory` class

`Memory` caches the results of long-running functions so they can be reused instead of recomputed. Let's try to cache to disk:

In [None]:
from joblib import Memory
memory = Memory(location="/tmp/", verbose=0)  # older joblib releases called this argument cachedir

In [None]:
@memory.cache
def f(x):
    print('Running f(%s)' % x)
    return x

In [None]:
print(f(1))

In [None]:
print(f(1))

In [None]:
print(f(10))

In [None]:
!ls -lat /tmp/joblib/__main__--Users-jbloom-Classes-python-seminar-DataFiles_and_Notebooks-08_Parallelism-__ipython-input__/f
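
To get a feel for what caching to disk involves, here is a toy decorator built only from the standard library. It is not joblib's implementation, just a simplified illustration; `disk_cache`, `slow_identity` and the hashing scheme are all invented for this sketch.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_cache(cache_dir):
    """Toy decorator: store each result in a pickle file keyed by the call arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cache_dir, hashlib.sha256(key).hexdigest() + ".pkl")
            if os.path.exists(path):      # cache hit: load instead of recomputing
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:  # cache miss: compute, then persist
                pickle.dump(result, fh)
            return result
        return wrapper
    return decorator


calls = []  # records which inputs actually ran the function body
cache_dir = tempfile.mkdtemp()


@disk_cache(cache_dir)
def slow_identity(x):
    calls.append(x)
    return x


slow_identity(1)
slow_identity(1)   # served from disk, body not executed
slow_identity(10)
print(calls)       # [1, 10]
```

joblib's `Memory` additionally tracks the function's source code and handles large NumPy arrays efficiently, which this toy version does not attempt.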

In [None]:
memory = Memory(location="/tmp/", verbose=1, mmap_mode="r+")  # older joblib releases called this argument cachedir

In [None]:
@memory.cache
def g(x,blah=True):
    print('Running g(%s)' % x)
    return x

In [None]:
print(g(1))

In [None]:
print(g(1))

In [None]:
print(g(1,blah=False))

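Ignoring arguments: anything listed in `ignore` is left out of the cache key, so changing it does not trigger a recomputation.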

In [None]:
@memory.cache(ignore=['blah'])
def h(x,blah=True):
    print('Running h(%s)' % x)
    return x

In [None]:
print(h(1))

In [None]:
print(h(1,blah=False))

Note: for persistence, joblib also provides `joblib.dump()` and `joblib.load()`, a replacement for pickle that works efficiently on Python objects containing large data, in particular large NumPy arrays.