### Parallelism in Python

We will look at parallel programming in more detail and see which 
facilities Python offers to make our code use more than one CPU or CPU core at a 
time (but always within the boundaries of a single machine). 

The main goal here will be speed for CPU-intensive problems, and responsiveness for I/O-intensive code.

Let's start by writing a simple program that makes use of multiple threads to 
download data from the Web. 

In [57]:
# We start by importing what we need from the Standard Library
# (that is, threading, queue, urllib.request, and time for the timings further down).

from threading import Thread
from queue import Queue
from time import time
import urllib.request

In [36]:
links=[]
links.append('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv')
links.append('https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv')
links.append('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv')

In [37]:
links

['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv',
 'https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv',
 'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv']

In [43]:
def get_content(act_url, outq):
    with urllib.request.urlopen(act_url) as res:
        body = res.read()
    outq.put((act_url, body))

In [58]:
# Define a thread-safe queue (that is, an instance of Queue from the Python queue module).
# We call this queue outputq. It will hold the data produced by the threads that download the contents of the websites.

outputq = Queue()

In [60]:

# Once we have the output queue, we spawn a new worker thread for each website link.
# Each worker thread simply runs the get_content function, with the actual link and the output queue as arguments.

for link in links:
    t = Thread(target=get_content,
               kwargs={'act_url': link,
                       'outq': outputq})
    t.daemon = True
    t.start()

Since these are just fire-and-forget threads, we can make them daemons, meaning that the main Python program 
will not wait for them to finish (join, in thread parlance) before exiting.

It is quite important to get this last observation about daemon threads and queues 
right. The main difficulty in using threads to perform actions in parallel is that we 
cannot tell when a given thread will read or write any data shared with other threads.

This can give rise to what is usually called a race condition. This is the situation where 
on one hand, the correct execution of the system depends on some actions being 
performed in a given order, and on the other hand, these actions are not guaranteed to 
happen in the right order, that is, the order envisioned by the programmer.

One way out of synchronization problems like these is the use of locks. Thread-safe 
queues are a very convenient example of lock-based data structures that we can use 
to organize data access.
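
As a minimal sketch of the lock idea (not part of the download example), the code below has several threads increment one shared counter. The names `increment_many` and `run_counter` are hypothetical, chosen only for this illustration. The read-modify-write on the counter is wrapped in a `threading.Lock`, so updates cannot interleave and get lost.

```python
import threading


def increment_many(counter, lock, n):
    """Add 1 to counter['value'] n times, holding the lock for each update."""
    for _ in range(n):
        # Only one thread at a time can execute this read-modify-write.
        with lock:
            counter["value"] += 1


def run_counter(num_threads=4, increments=10_000):
    """Run num_threads workers on one shared counter and return the final value."""
    counter = {"value": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=increment_many, args=(counter, lock, increments))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return counter["value"]


if __name__ == "__main__":
    print(run_counter())
```

With the lock in place, the final value always equals `num_threads * increments`. A `Queue` does the same kind of locking internally, which is why the download threads can all call `put()` on it safely.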

Since each thread writes to the same output queue, we might just as well monitor 
that queue to know when results are ready and it is time to quit. 

Here, we do that by simply fetching one result from the queue per link 
(the loop over links) and by waiting for the queue to join (outputq.join()), 
which returns once all the results have been processed (more precisely, once every 
item put on the queue has been fetched with get() and acknowledged with a call to 
task_done()). This way, we are sure that our program does not quit prematurely.

In [62]:
for _ in links:
    link, body = outputq.get()
    print(link, body[:100])
    outputq.task_done()
outputq.join()

https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv b'year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit\r\n2011,A,"Agriculture'
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv b'Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Varia'
https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv b'hlpi_name,series_ref,quarter,hlpi,nzhec,nzhec_name,nzhec_short,level,index,change.q,change.a\r\nAll ho'


In [63]:
# no threads
q = Queue()
t0 = time()
for p in links:
    get_content(p, q)
dt = time() - t0
print(dt)

3.4219446182250977


In [65]:
# threads can help!

t0 = time()

for link in links:
    t = Thread(target=get_content,
               kwargs={'act_url': link,
                       'outq': outputq})
    t.daemon = True
    t.start()

for _ in links:
    link, body = outputq.get()
    # print(link, body[:100])
    outputq.task_done()
outputq.join()

dt = time() - t0
print(dt)

1.395106315612793
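
The same comparison can be tried without a network connection. The sketch below is an assumption-laden stand-in: `time.sleep` plays the role of waiting on a download, and `fake_download`, `timed_sequential`, and `timed_threaded` are hypothetical names used only here. While a thread sleeps it is not using the CPU, so the waits overlap just as the real downloads do.

```python
import time
from threading import Thread


def fake_download(delay):
    """Stand-in for an I/O wait; sleeping does not keep the CPU busy."""
    time.sleep(delay)


def timed_sequential(delays):
    """Run the fake downloads one after another and return the elapsed seconds."""
    t0 = time.perf_counter()
    for d in delays:
        fake_download(d)
    return time.perf_counter() - t0


def timed_threaded(delays):
    """Run each fake download in its own thread and return the elapsed seconds."""
    threads = [Thread(target=fake_download, args=(d,)) for d in delays]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return time.perf_counter() - t0


if __name__ == "__main__":
    delays = [0.2, 0.2, 0.2, 0.2]
    print("sequential:", timed_sequential(delays))
    print("threaded:  ", timed_threaded(delays))
```

The sequential run should take roughly the sum of the delays, while the threaded run should take roughly the longest single delay.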


### Sometimes threads can hurt performance

In [66]:
def fib(n):
    # Check the special cases first; otherwise n <= 2 would also catch n == 0 and n < 0.
    if n < 0:
        raise ValueError('fib(n) is undefined for n < 0')
    elif n == 0:
        return 0
    elif n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)


In [67]:
fib(5)

5

In [87]:
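def runthreads(threadnum, fibnum):
    t0 = time()
    for i in range(threadnum):
        t = Thread(target=fib, args=(fibnum,))
        t.start()
    # The threads are not joined, so dt covers only the time it takes to start them all
    # while the ones already running compete for the interpreter.
    dt = time() - t0
    print(dt / threadnum)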

In [88]:
# if the threads really ran in parallel, these per-thread numbers would all be about the same

runthreads(1,34)
runthreads(2,34)
runthreads(3,34)
runthreads(4,34)
runthreads(8,34)

0.005471467971801758
0.010702729225158691
0.033026933670043945
0.0513838529586792
0.02851840853691101


Interesting! Using two threads to compute the 34th Fibonacci number in parallel 
takes about twice as much time per thread as using a single thread to do the same 
computation once. Adding more threads mostly makes the figure worse rather than 
better. Clearly, something is not quite right, as we would have expected the threads 
to run in parallel (again, on a quad-core machine).
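
The timings above stop the clock right after the last `start()` call. A variation that also waits for every thread to finish can be sketched as follows. The name `run_joined` is hypothetical, and `fib` is repeated so that the block runs on its own. On a standard CPython build with the GIL, one would expect the total time to grow roughly in proportion to the number of threads, but the exact figures depend on the machine and interpreter, so none are shown here.

```python
from threading import Thread
from time import perf_counter


def fib(n):
    """Naive recursive Fibonacci, used only as a CPU-bound workload."""
    if n < 0:
        raise ValueError('fib(n) is undefined for n < 0')
    if n == 0:
        return 0
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)


def run_joined(threadnum, fibnum):
    """Start threadnum threads computing fib(fibnum); return the total wall time."""
    threads = [Thread(target=fib, args=(fibnum,)) for _ in range(threadnum)]
    t0 = perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # include the time spent computing, not just starting
    return perf_counter() - t0


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "thread(s):", run_joined(n, 25))
```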

It turns out that there is something not obvious going on deep inside the Python 
interpreter that is affecting our CPU-bound threads. That thing is called the Global 
Interpreter Lock (GIL). As the name implies, the GIL is a global lock that is used, 
mostly, to keep reference counting sane (remember when we talked about that a little 
while ago?). The consequence of the GIL is that even though Python threads are real 
OS-native threads, only one of them can execute Python code at any given point in time.

This has led some to say that the Python interpreter is a single-threaded interpreter, 
which is not quite true. Conceptually at least, however, the statement is not 
completely wrong either. 

The situation we just witnessed is very similar to the 
behavior we observed when writing coroutines. In that case, in fact, only one piece 
of code could run at any given point in time. Things just work, meaning we get the 
parallelism that we expect, when one coroutine or thread waits for I/O and another 
one takes over the CPU. Things do not work as well in terms of performance speedups, 
when one task needs the CPU for a long time, as is the case with CPU-bound tasks as 
in the Fibonacci example.

Just like with coroutines, using threads in Python is far from being a lost cause. 
Parallel I/O can give a significant performance boost to our application, whether 
the code uses multiple threads or coroutines.

As a side note, not all Python interpreters have the GIL; Jython, for instance, does not.

### Multiple processes
Traditionally, the way Python programmers have worked around the GIL and its 
effect on CPU-bound threads has been to use multiple processes instead of multiple 
threads. This approach (multiprocessing) has some disadvantages, which mostly boil 
down to having to launch multiple instances of the Python interpreter with all the 
startup time and memory usage penalties that this implies.
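
As a minimal sketch of the multiple-process approach, the code below uses `concurrent.futures.ProcessPoolExecutor` from the Standard Library to compute several Fibonacci numbers in separate worker processes. The function name `parallel_fib` and the `max_workers` default are assumptions made for this illustration. Each worker is its own Python interpreter with its own GIL, so CPU-bound calls can run on different cores.

```python
from concurrent.futures import ProcessPoolExecutor


def fib(n):
    """Naive recursive Fibonacci, used only as a CPU-bound workload."""
    if n < 0:
        raise ValueError('fib(n) is undefined for n < 0')
    if n == 0:
        return 0
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)


def parallel_fib(numbers, max_workers=2):
    """Compute fib for each number in a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() sends each argument to a worker and yields results in input order.
        return list(pool.map(fib, numbers))


if __name__ == "__main__":
    # The guard matters: on platforms that start workers by re-importing
    # this module, unguarded code would try to spawn processes recursively.
    print(parallel_fib([20, 21, 22]))
```

Arguments and results travel between processes by pickling, which is part of the overhead mentioned above, so this pays off mainly when each task does a substantial amount of computation.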