Based on **Francesco Pierfederici: Distributed Computing with Python, Chapter 3**

### Parallelism in Python

Here we will look at **parallel programming** in more detail and see what facilities Python offers us to **make our code use more than one CPU core** at a time.

In this demo we will stay within the boundaries of a **single machine**. We will discuss **distributed systems later**.

The main goal here will be **speed for CPU-intensive problems, and responsiveness for I/O-intensive code.**

Let's start by writing a simple program that makes use of **multiple threads to 
download data** from the Web. 

In [1]:
# We start by importing the modules we need from the Standard Library (that is, time, threading, queue, and urllib.request).

from time import time
from threading import Thread
from queue import Queue
import urllib.request

Here is a list of links that we want to download in parallel.

In [2]:
links=[]
links.append('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv')
links.append('https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv')
links.append('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv')

In [3]:
for link in links: print(link)

https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv
https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv


Define a function "get_content" that downloads the content of the "act_url" link and puts it on the "outq" queue.

In [4]:
def get_content(act_url, outq):
    with urllib.request.urlopen(act_url) as res:
        body = res.read()
    outq.put((act_url, body))  # put the (act_url, body) tuple on the outq queue

Define a **thread-safe** queue (that is, an instance of Queue from the Python queue module).

We call this queue "outputq". 

It will **hold the data produced by the various threads** that download the contents of the websites.


In [5]:
outputq = Queue()

Once we have the output queue, we then **spawn a new worker thread for each website link**. 

**Each worker thread simply runs the get_content function**, with the actual link and the output queue as arguments.


In [6]:
for link in links:
        # spawn a new thread "t" that will run the "get_content" function with "kwargs" arguments
        t = Thread(target=get_content,
                   kwargs={'act_url': link,
                           'outq': outputq})
        t.daemon = True
        t.start() # start the thread

A thread can be flagged as a daemon thread. Daemon threads do not keep the main program alive: the program can exit without waiting for them to finish, and any daemon threads still running at that point are stopped abruptly.
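
To make the daemon flag concrete, here is a minimal sketch. The `slow_task` helper and the `results` list are hypothetical names used only for illustration, and `time.sleep` stands in for a slow download. Because a daemon thread does not hold the program open, we have to wait for it explicitly whenever we need its result.

```python
import threading
import time

def slow_task(results):
    # Hypothetical stand-in for a slow download.
    time.sleep(0.1)
    results.append('done')

results = []
worker = threading.Thread(target=slow_task, args=(results,), daemon=True)
worker.start()

# The interpreter would not wait for this daemon thread at exit,
# so we join it explicitly before reading its result.
worker.join()
print(worker.daemon, results)
```

In the download example the same role is played by the output queue: the main thread blocks on `outputq.get()` until each worker has delivered its result.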

The main difficulty in using threads to perform actions in parallel is that
**we cannot tell when a given thread will read or write any data shared with other threads.**

One way out of synchronization problems like these is the use of **locks**: so that only one thread can write at a given time.
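
As a rough illustration of a lock, the sketch below lets four threads increment a shared counter. The names `totals`, `totals_lock` and `add_many` are made up for this example. The `with totals_lock:` block ensures that only one thread at a time performs the read-modify-write step.

```python
import threading

totals = {'count': 0}            # shared state (hypothetical example data)
totals_lock = threading.Lock()   # protects every update of totals

def add_many(n):
    for _ in range(n):
        # Only one thread at a time may execute this block.
        with totals_lock:
            totals['count'] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals['count'])  # 4 threads * 10000 increments = 40000
```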

The Queue class helps in threaded programming when **information must be exchanged safely between multiple threads**. 
The Queue class in this module **implements all the required locking semantics**.

Thread-safe queues are a very convenient example of lock-based data structures that we can use 
to organize data access.
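
A small sketch of what this buys us: several threads can call `put()` on the same `Queue` without any explicit lock in our own code, and no item is lost or duplicated. The `producer` function and the item layout are assumptions made for this example.

```python
from queue import Queue
from threading import Thread

def producer(worker_id, outq, n_items):
    # Each producer writes to the shared queue; Queue does the locking internally.
    for i in range(n_items):
        outq.put((worker_id, i))

q = Queue()
producers = [Thread(target=producer, args=(w, q, 100)) for w in range(5)]
for t in producers:
    t.start()
for t in producers:
    t.join()

# All producers have finished, so the queue now holds every item exactly once.
items = [q.get() for _ in range(q.qsize())]
print(len(items), len(set(items)))
```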

Since **each thread writes to the same output queue**, we might just as well **monitor 
that queue to know when results are ready** and it is time to quit. 

Here, we do that by simply fetching one result from the queue per link (the loop over links) and by waiting for the queue to join (outputq.join()), which will happen when all the results have been fetched (more precisely, when every item fetched with get() has been followed by a call to task_done()).

This way, **we are sure that our program does not quit prematurely**.

**Queue.task_done()**: Indicate that a formerly enqueued task is complete. 

For each **get()** used to fetch a task, a subsequent call to **task_done()** tells the queue that the processing on the task is complete.

**Queue.join()**: Blocks until all items in the queue have been gotten and processed.

The **count of unfinished tasks** **goes up** whenever an item is added to the queue. The count **goes down** whenever a consumer thread calls **task_done()** to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, **join() unblocks**.
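
The counting behaviour can be seen in a common worker pattern, sketched here with a hypothetical `consumer` function that doubles each number it receives. Every `put()` raises the count of unfinished tasks, every `task_done()` lowers it, and `join()` returns once the count is back at zero.

```python
from queue import Queue
from threading import Thread

def consumer(inq, processed):
    while True:
        item = inq.get()             # fetch one task
        processed.append(item * 2)   # do the work
        inq.task_done()              # mark this task as finished

work = Queue()
processed = []
Thread(target=consumer, args=(work, processed), daemon=True).start()

for n in range(5):
    work.put(n)   # each put() adds one unfinished task

work.join()       # blocks until task_done() has been called five times
print(processed)  # [0, 2, 4, 6, 8]
```

With a single consumer the items are handled in FIFO order, which is why the output order is predictable here. With several consumers the order of `processed` would depend on thread scheduling.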

In [7]:
for _ in links:
    link, body = outputq.get()
    print(link, body[:100])
    print('***')
    outputq.task_done()  # tell the queue that this fetched item has been processed
outputq.join()  # returns once every queued item has been marked done

https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-csv.csv b'Year,Industry_aggregation_NZSIOC,Industry_code_NZSIOC,Industry_name_NZSIOC,Units,Variable_code,Varia'
***
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2017-financial-year-provisional/Download-data/annual-enterprise-survey-2017-financial-year-provisional-size-bands-csv.csv b'year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit\r\n2011,A,"Agriculture'
***
https://www.stats.govt.nz/assets/Uploads/Household-living-costs-price-indexes/Household-living-costs-price-indexes-September-2018-quarter/Download-data/household-living-costs-price-indexes-sep18qtr-time-series-indexes.csv b'hlpi_name,series_ref,quarter,hlpi,nzhec,nzhec_name,nzhec_short,level,index,change.q,change.a\r\nAll ho'
***


### Example without using threads:

In [10]:
# no threads
q = Queue()
t0 = time()
for p in links:
    get_content(p, q)
print(time() - t0)

0.75335693359375


### Example with threads:

In [11]:
# threads can help!

t0 = time();

for link in links:
        t = Thread(target=get_content,
                   kwargs={'act_url': link,
                           'outq': outputq})
        t.daemon = True
        t.start()
        
for _ in links:
    link, body = outputq.get()
    #print(link, body[:100])
    outputq.task_done()  # tell the queue that this fetched item has been processed
outputq.join()

dt = time() - t0; print(dt)

0.3000946044921875
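
The threaded version is faster because the threads wait on the network at the same time instead of one after another. The sketch below reproduces the effect without network access. It assumes a hypothetical `fake_download` function in which `time.sleep` stands in for the time spent waiting on a server, so the exact timings are illustrative only.

```python
import time
from queue import Queue
from threading import Thread

def fake_download(name, outq, delay=0.2):
    # time.sleep stands in for waiting on the network (assumption for this sketch).
    time.sleep(delay)
    outq.put(name)

names = ['a', 'b', 'c']  # hypothetical resource names

# Sequential: the waits add up.
seq_q = Queue()
t0 = time.perf_counter()
for name in names:
    fake_download(name, seq_q)
sequential_time = time.perf_counter() - t0

# Threaded: the waits overlap.
thr_q = Queue()
t0 = time.perf_counter()
threads = [Thread(target=fake_download, args=(name, thr_q)) for name in names]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - t0

print(f'sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s')
```

Real downloads vary from run to run with network conditions, so a single measurement like the ones above should be read as an indication rather than a benchmark.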
