Based on **Francesco Pierfederici: Distributed Computing with Python, Chapter 3**

### Parallelism in Python

Here we will look at **parallel programming** in more detail and see what facilities Python offers us to **make our code use more than one CPU core** at the time 

In this demo we will stay in the boundaries of a **single machine**. We will discuss **distributed systems later**. 

The main goal here will be **speed for CPU-intensive problems, and responsiveness for I/O-intensive code.**

Let's start by writing a simple program that makes use of **multiple threads to 
download data** from the Web. 

In [1]:
#We start by importing the modules we need from the Standard Library (that is, threading, queue, and urllib.request).

from time import time
from threading import Thread
from queue import Queue
import urllib.request

Here is a list of links that we want to download parallel.

In [2]:
links=[]
links.append('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv')
links.append('https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv')
links.append('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv')

In [3]:
print(links)

['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv', 'https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv', 'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv']


Define a function "get_content" that downloads the content of "act_url" link, and outputs is to the "outq" Queue.

In [4]:
def get_content(act_url, outq):
    with urllib.request.urlopen(act_url) as res:
        body = res.read()
    outq.put((act_url, body)) #put the (act_url, body) tupple to the outq queue

Define a **thread-safe** queue (that is, an instance of Queue from the Python queue module)
We call this queue "outputq". 

It will **hold the data produced by the various threads** that downloaded the contents of the the websites. 


In [5]:
outputq = Queue()

Once we have the output queue, we then **spawn a new worker thread for each website link**. 

**Each worker thread simply runs the get_content function**, with the actual link and the output queue as arguments.


In [6]:
for link in links:
        # spawn a new thread "t" that will run the "get_content" function with "kwargs" arguments
        t = Thread(target=get_content,
                   kwargs={'act_url': link,
                           'outq': outputq})
        t.daemon = True
        t.start() # start the thread

A thread can be flagged as a daemon flag: If it is a daemon, then it will allow the main program to exit, 
i.e. the main program doesn't need to wait till all child threads are finished.

The main difficulty in using threads to perform actions in parallel is that
**we cannot tell when a given thread will read or write any data shared with other threads.**

One way out of synchronization problems like these is the use of **locks**: so that only one thread can write at a given time.

The Queue class helps in threaded programming when **information must be exchanged safely between multiple threads**. 
The Queue class in this module **implements all the required locking semantics**.

Thread-safe queues are a very convenient example of lock-based data structures that we can use 
to organize data access.

Since **each thread writes to the same output queue**, we might just as well **monitor 
that queue to know when results are ready** and it is time to quit. 

Here, we do that by simply fetching one result from the queue per link (the loop over links)

and by waiting for the queue to join (outputq.join()), 

which will happen when all the results have been fetched (more precisely, when each get() method is followed by a call to task_done()). 

This way, **we are sure that our program does not quit prematurely**.

**Queue.task_done()**: Indicate that a formerly enqueued task is complete. 

For each **get()** used to fetch a task, a subsequent call to **task_done()** tells the queue that the processing on the task is complete.

**Queue.join()**: Blocks until all items in the queue have been gotten and processed.

The **count of unfinished tasks** **goes up** whenever an item is added to the queue. The count **goes down** whenever a consumer thread calls **task_done()** to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, **join() unblocks**.

In [7]:
for _ in links:
    link, body = outputq.get()
    print(link, body[:100])
    outputq.task_done() #signals when a thread is done
outputq.join() #unblocks when all tasks are complete in the queue.

https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv b'hlpi_name,series_ref,quarter,hlpi,nzhec,nzhec_name,nzhec_short,level,index,change.q,change.a\r\nAll ho'
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv b'year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit\r\n2011,A,"Agriculture'
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv b'Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Varia'


### Example without using threads:

In [8]:
# no threads
q = Queue()
t0 = time(); [get_content(p, q) for p in links]; dt = time() - t0; print(dt)

0.7048904895782471


### Example with threads:

In [9]:
# threads can help!

t0 = time();

for link in links:
        t = Thread(target=get_content,
                   kwargs={'act_url': link,
                           'outq': outputq})
        t.daemon = True
        t.start()
        
for _ in links:
    link, body = outputq.get()
    #print(link, body[:100])
    outputq.task_done() #signals when a thread is done
outputq.join()

dt = time() - t0; print(dt)

0.34225893020629883


### Somtimes threads can hurt performance

In [10]:
def fib(n):
    if n <= 2:
        return 1
    elif n == 0:
        return 0
    elif n < 0:
        raise Exception('fib(n) is undefined for n < 0')
    return fib(n - 1) + fib(n - 2)


In [11]:
fib(5)

5

In [12]:
# calculate the "fibnum" Fibonacci number threadnum times independently on different threads!

def runthreads(threadnum,fibnum):
    t0 = time();
    for i in range(threadnum):
            t = Thread(target=fib, args=(fibnum, )) #spawn a new thread
            t.start()
    dt = time() - t0; 
    print(dt) # time neaded to claculate threadnum Fibonacci numbers

In [13]:

runthreads(1,34)
runthreads(2,34)
runthreads(3,34)
runthreads(4,34)
runthreads(8,34)

0.005433559417724609
0.062355756759643555
0.10884261131286621
0.28424906730651855
0.33549046516418457


Interesting! Increasing the number of parallel computations just increases the execution time.

**Clearly, something is not quite right**, as we would have expected the threads 
to run in parallel (again, on a quad-core machine).

It turns out that there is something not obvious going on deep inside the Python 
interpreter that is affecting our CPU-bound threads. 

That thing is called **Global Interpreter Lock (GIL)**. 

As the name implies, the **GIL is a global lock** that is used, 
mostly, to **keep reference counting sane** (remember when we talked about that a little 
while ago?). The consequence of the GIL is that even though Python threads are real 
OS-native threads, **only one of them can be active at any given point in time**.

This has led some to say that the **Python interpreter is a single-threaded interpreter**, 
which is not quite true. However, this statement is also, conceptually at least, not 
completely wrong either. 

The situation we just witnessed is very **similar to the 
behavior we observed when writing coroutines**. In that case, in fact, only one piece 
of code could run at any given point in time. 

Things just work, meaning **we get the 
parallelism that we expect, when one coroutine or thread waits for I/O and another 
one takes over the CPU**. Things do not work as well in terms of performance speedups, 
when one task needs the CPU for a long time, as is the case with CPU-bound tasks as 
in the Fibonacci example.

**Not all Python interpreters have the GIL; Jython, for instance, does not.**

# Multiple processes
Traditionally, the way Python programmers have worked around the GIL and its 
effect on CPU-bound threads has been to **use multiple processes instead of multiple 
threads**. 

This approach (multiprocessing) has some **disadvantages**: 
having to launch multiple instances of the Python interpreter with all the 
**startup time and memory usage penalties** that this implies

In [14]:
import psutil
psutil.cpu_count(logical=False)

4

In [15]:
psutil.cpu_count(logical=True)

4

Using **multiple processes to execute tasks in parallel has some nice properties.** 

Multiple processes have their **own memory space** and they also allow us to **(more) easily transition from a single-machine 
architecture to a distributed application**, where one would have to use multiple 
processes (on different machines) anyway.

There are two main modules in the Python Standard Library that we can use to 
implement process-based parallelism, and both of them are truly excellent. One is 
called **multiprocessing** and the other is **concurrent.futures**. 

The concurrent.futures module is built on top of multiprocessing and the threading module and 
provides a powerful high-level interface to them.

### Example for concurrent.futures multiprocessing

In [16]:
import concurrent.futures as cf

In [17]:
def fib(n):
    if n <= 2:
        return 1
    elif n == 0:
        return 0
    elif n < 0:
        raise Exception('fib(n) is undefined for n < 0')
        return fib(n - 1) + fib(n - 2)

Exception in thread Thread-14:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-10-85033b0a92e0>", line 8, in fib
    return fib(n - 1) + fib(n - 2)
  File "<ipython-input-10-85033b0a92e0>", line 8, in fib
    return fib(n - 1) + fib(n - 2)
  File "<ipython-input-10-85033b0a92e0>", line 8, in fib
    return fib(n - 1) + fib(n - 2)
  [Previous line repeated 16 more times]
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Exception in thread Thread-27:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/python3/lib/python3.6/threading.py", line 864, in run
    self._target(*sel

In [18]:
fibnum=34
workernum=4
[fibnum] * workernum

[34, 34, 34, 34]

In [19]:
def runprocesses(workernum,fibnum):
    t0 = time();
    
    with cf.ProcessPoolExecutor(max_workers=workernum) as pool:
            results = pool.map(fib, [fibnum] * workernum) #run the fib function on each element of [34,34,34,34] in a parallel way
    
    dt = time() - t0; 
    print(dt)

We used the **ProcessPoolExecutor** class exported by concurrent.futures. 

This is one of the two main classes exported by 
the module, the other being **ThreadPoolExecutor**, which is used to create a **pool of 
threads**, instead of a **pool of processes**.

In [20]:
runprocesses(1,34)
runprocesses(2,34)
runprocesses(3,34)
runprocesses(4,34)
print('***************')
runprocesses(8,34)
runprocesses(16,34)
runprocesses(32,34)


0.010954141616821289
0.011400461196899414
0.015639305114746094
0.020844459533691406
***************
0.038100481033325195
0.08728265762329102
0.16221284866333008


Both, **ProcessPoolExecutor and ThreadPoolExecutor have the same API**: they have three main 
methods, which are as follows:

• **submit(f, *args, **kwargs)**: This is used to schedule an **asynchronous 
call** to f(*args, **kwargs) and return a **Future instance as a result** 
placeholder.

• **map(f, *arglist, timeout=None)**: This is the equivalent 
to the built-in map(f, *arglist) method. It returns a **list of Future objects** 
rather than a list of actual results, as map would do.

The third method, **shutdown(wait=True)** is used to **free the resources** used by the Executor object as soon as all currently scheduled functions are done. 

It waits (if wait=True) until that happens. 

A **Future instance** is a **placeholder** for the **result of an asynchronous call**. We can check 
whether the call is still running, whether or not it raised an exception, and so on.  

We call a Future instance result() method to access (with an optional timeout) its value once it is ready.

In [21]:
from concurrent.futures import ProcessPoolExecutor

In [22]:
pool = ProcessPoolExecutor(max_workers=1)
fut = pool.submit(fib, 38)
fut.running()

True

In [23]:
fut.done()

True

We saw how to use the concurrent.futures package to create a worker pool (using the ProcessPoolExecutor class) and submit 
work to it (pool.submit(fib, 38)). As we expect, submit returns a Future object 
(fut in the preceding code), which is a placeholder for a result that is not yet available.

### Context managers

**Executor objects can also be used as context managers** 

In those cases, there is an implicit blocking call made to the Executor shutdown method on the context manager's exit. 

This means that if we were to access the results list, we would **get integers rather than Future instances once the context 
manager exits.**

In [24]:
workernum=1
fibnum=38
with cf.ProcessPoolExecutor(max_workers=workernum) as pool:
            results = pool.map(fib, [fibnum] * workernum)
            print(results)

<generator object _chain_from_iterable_of_lists at 0x7f617404eba0>


We can make a **one-line modification** to our process-based parallel code and 
**switch to using threads instead**; simply replace ProcessPoolExecutor with 
ThreadPoolExecutor. 

For a quick example, change the

with cf. ProcessPoolExecutor (max_workers=args.n) as pool:

line with this one:

with cf.ThreadPoolExecutor(max_workers=args.n) as pool:

In [25]:
threadnum=1
fibnum=38
with cf.ThreadPoolExecutor(max_workers=threadnum) as pool:
    results = pool.map(fib, [fibnum] * workernum)
    print(results)

<generator object Executor.map.<locals>.result_iterator at 0x7f617404ee60>


### Multiprocess queues
When using multiple processes, the issue that comes up is **how to exchange data 
between the workers**. 

The multiprocessing module offers a mechanism to do that 
in the form of **queues and pipes**. 

The **multiprocessing.Queue** class is modeled after the queue.Queue class with the 
additional twist that **items stored in the multiprocessing queue need to be pickable**.


In [28]:
def fib(n):
    if n <= 2:
        return 1
    elif n == 0:
        return 0
    elif n < 0:
        raise Exception('fib(n) is undefined for n < 0')
    return fib(n - 1) + fib(n - 2)

In [29]:
def worker(inq, outq): #input queue for the task (fib function and its argument), output queue for the results (integer)
    while True:
        data = inq.get()
        if data is None:
            return
        fn, arg = data
        outq.put(fn(arg))

We use a **two-queue architecture**, whereby **one queue 
holds the tasks to be performed** (in this case, the function to be called and its only 
argument), while the **other queue holds the results** (simple integers in this example).

In [30]:
import multiprocessing as mp

workernum=4
fibnum=34

# We open two multiprocessing queues:
tasks = mp.Queue()
results = mp.Queue()

for i in range(workernum):
    tasks.put((fib, fibnum))

In [31]:
for i in range(workernum):
        mp.Process(target=worker, args=(tasks, results)).start()

In [32]:
for i in range(workernum):
        print(results.get())

5702887
5702887
5702887
5702887


In [33]:
for i in range(workernum):
        tasks.put(None)

As we did previously, we use a sentinel value (None) in the task queue to signal that 
the worker processes should quit. The worker process is a simple multiprocessing.
Process instance whose target is the worker function and whose behavior is the one 
that we just described.

Another piece of technology that might be worth investigating is **Cython**, a Python-like language to create C modules that is extremely popular and actively developed. Cython has excellent support for OpenMP, a directive-based API for C, 
C++, and Fortran, that allows programmers to easily multithread their code.