# Multi-threading and Multi-processing in Python

 * Process: A running instance of a program. Each process has its own memory space, which it does not share with other processes.
 
 * Thread: A sequence of instructions executed within the context of a process.
   - One process can spawn multiple threads, and all of them share the memory of that process.

![](threads.jpg)
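
As a minimal sketch of the shared memory described above, the threads below all append to the same list object, and the main thread sees every change. The names shared_items and add_item are illustrative, not part of any library.

```python
import threading

# One list object living in the memory of the process (illustrative name)
shared_items = []

def add_item(value):
    # Every thread sees the same list; list.append is atomic in CPython
    shared_items.append(value)

threads = [threading.Thread(target=add_item, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The append order depends on thread scheduling, so sort before printing
print(sorted(shared_items))  # [0, 1, 2]
```

With separate processes, each one would work on its own copy of the list, and the parent process would not see the appended values.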



# Concurrency vs Parallelism 
 * Concurrency: A condition that exists when at least two threads are making progress. A more generalized form of parallelism that can include time-slicing as a form of virtual parallelism.
 
![](concurrent.png)

 * Parallelism: A condition that arises when at least two threads are executing simultaneously.
 
![](parallel.png) 

 

# Multithreading is not parallel in Python

Normally, we would expect multi-threaded code on a multi-core machine to take advantage of the available cores and thus improve overall performance.

But in Python the real picture is very different. In the standard interpreter (CPython), a process cannot run Python threads in parallel.
This limitation is enforced by the **GIL** (Global Interpreter Lock), which prevents threads within the same process from executing Python bytecode at the same time.

The GIL exists because CPython's memory management (reference counting) is not thread-safe. It is a single lock for the whole interpreter, not one lock per object: at any given time, only the thread that holds it can execute Python code. Therefore, CPU-bound code gains no performance from Python multi-threading.

Python can run threads **concurrently**, because a thread releases the GIL while it waits on a blocking I/O operation and another thread can run in the meantime, but **not in parallel**.

# Multithreading for I/O bound operations
In Python we can take advantage of multiple threads if we have many I/O operations that would otherwise leave the CPU idle while waiting:
 * Reading / Writing to disk
 * Downloading / uploading files
 * Printing data to screen / log files
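
A quick way to see this effect without touching the network is to simulate the waiting with time.sleep, which releases the GIL much like a blocking I/O call does. This is a sketch under that assumption; fake_io and the 0.2 second delay are made up for illustration.

```python
import time
from threading import Thread

def fake_io(delay=0.2):
    # Stand-in for a blocking I/O call: the thread sleeps and the CPU stays idle
    time.sleep(delay)

# Sequential: the waits add up
start = time.perf_counter()
for _ in range(4):
    fake_io()
sequential = time.perf_counter() - start

# Threaded: the waits overlap
start = time.perf_counter()
threads = [Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The sequential loop needs roughly the sum of the four delays, while the threaded version needs little more than a single delay.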
 
Let's consider the following example. We want to download 6 images 5 times each and store them on disk.
 * That's an I/O bound process, as it has 60 I/O operations (30 downloads + 30 writes to disk) and almost no CPU processing at all
![](io-seq.png)

In [17]:
%%time
import requests


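def download_img(img_url: str):
    """
    Download the image at img_url into the current directory
    """
    # stream=True fetches the body in chunks instead of loading it all into memory
    res = requests.get(img_url, stream=True)
    res.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    filename = f"{img_url.split('/')[-1]}.jpg"

    with open(filename, 'wb') as f:
        for block in res.iter_content(1024):
            f.write(block)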


if __name__ == '__main__':
    images = [
        # Photo credits: https://unsplash.com/photos/IKUYGCFmfw4 
        'https://images.unsplash.com/photo-1509718443690-d8e2fb3474b7',
        # Photo credits: https://unsplash.com/photos/vpOeXr5wmR4
        'https://images.unsplash.com/photo-1587620962725-abab7fe55159',
        # Photo credits: https://unsplash.com/photos/iacpoKgpBAM
        'https://images.unsplash.com/photo-1493119508027-2b584f234d6c',
        # Photo credits: https://unsplash.com/photos/b18TRXc8UPQ
        'https://images.unsplash.com/photo-1482062364825-616fd23b8fc1',
        # Photo credits: https://unsplash.com/photos/XMFZqrGyV-Q
        'https://images.unsplash.com/photo-1521185496955-15097b20c5fe',
        # Photo credits: https://unsplash.com/photos/9SoCnyQmkzI
        'https://images.unsplash.com/photo-1510915228340-29c85a43dcfe',
    ]

    for img in images * 5:
        download_img(img)

**Can we gain anything by using threads?**

In Python, threads can be created with the **threading** module.

Our download_img function is only lightly modified: instead of receiving a URL as an argument, each worker thread takes URLs from a global queue shared by all threads of the process:

In [None]:
%%time
import requests
from queue import Queue
from threading import Thread

NUM_THREADS = 5
q = Queue()

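def download_img():
    """
    Worker: take image URLs from the queue and download them into the current directory
    """
    while True:
        img_url = q.get()  # blocks until a URL is available
        try:
            res = requests.get(img_url, stream=True)
            res.raise_for_status()
            filename = f"{img_url.split('/')[-1]}.jpg"
            with open(filename, 'wb') as f:
                for block in res.iter_content(1024):
                    f.write(block)
        except (requests.RequestException, OSError) as err:
            print(f"Failed to download {img_url}: {err}")
        finally:
            # Always mark the item as done, otherwise q.join() would wait forever after an error
            q.task_done()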


if __name__ == '__main__':
    images = [
        # Photo credits: https://unsplash.com/photos/IKUYGCFmfw4 
        'https://images.unsplash.com/photo-1509718443690-d8e2fb3474b7',
        # Photo credits: https://unsplash.com/photos/vpOeXr5wmR4
        'https://images.unsplash.com/photo-1587620962725-abab7fe55159',
        # Photo credits: https://unsplash.com/photos/iacpoKgpBAM
        'https://images.unsplash.com/photo-1493119508027-2b584f234d6c',
        # Photo credits: https://unsplash.com/photos/b18TRXc8UPQ
        'https://images.unsplash.com/photo-1482062364825-616fd23b8fc1',
        # Photo credits: https://unsplash.com/photos/XMFZqrGyV-Q
        'https://images.unsplash.com/photo-1521185496955-15097b20c5fe',
        # Photo credits: https://unsplash.com/photos/9SoCnyQmkzI
        'https://images.unsplash.com/photo-1510915228340-29c85a43dcfe',
    ]
    
    for img_url in images * 5:
        q.put(img_url)

    for t in range(NUM_THREADS):
        worker = Thread(target=download_img)
        worker.daemon = True
        worker.start()

    q.join()

Wall time has been halved because the threads overlap their waiting: while one thread waits for a download or for the disk, another can run. Still, only 1 core is in use at any given time.

![](io-threads.png)
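
The queue-and-worker pattern can also be expressed with the standard library's concurrent.futures.ThreadPoolExecutor, which manages the threads and the work queue internally. The sketch below is an alternative, not the approach used in this notebook; run_in_threads and filename_for are hypothetical helpers, and filename_for only reproduces the file-naming step so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def filename_for(img_url: str) -> str:
    # Same naming rule as in download_img: last URL segment plus ".jpg"
    return f"{img_url.split('/')[-1]}.jpg"

def run_in_threads(task, items, max_workers=5):
    """Hypothetical helper: apply task to every item using a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the inputs
        return list(pool.map(task, items))

urls = [
    'https://images.unsplash.com/photo-1509718443690-d8e2fb3474b7',
    'https://images.unsplash.com/photo-1587620962725-abab7fe55159',
]
print(run_in_threads(filename_for, urls))
```

Passing the single-argument download_img from the sequential example, as in run_in_threads(download_img, images * 5), would download the images with 5 threads.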

# CPU intensive operations and Multithreading

But what if our process were more CPU intensive?

Would multithreading bring any advantage?

For example, let's consider the function sum_powers below. It is CPU intensive, it never waits for I/O, and we want to run it 4 times:

In [None]:
%%time
def sum_powers(number):
    total = 0
    for i in range(number):
        total = total + i**i

if __name__ == "__main__":
    for i in range(4):
        sum_powers(10000)

Let's try to multithread it...

**Q:** Better or worse?

In [None]:
%%time
from queue import Queue
from threading import Thread

NUM_THREADS = 4
q = Queue()

def sum_powers():
    global q
    
    while True:
        number = q.get()
        total = 0
        for i in range(number):
            total = total + i**i
        q.task_done()


if __name__ == "__main__":
    for i in range(4):
        q.put(10000)

    for t in range(NUM_THREADS):
        worker = Thread(target=sum_powers)
        worker.daemon = True
        worker.start()

    q.join()

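Since CPython does not run threads in parallel, the multithreaded version gains nothing on CPU-bound work. It is even slower than the version without threads, because the threads add the overhead of switching and of contending for the GIL.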
![](cpu-threads.png)
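
To check this on your own machine outside a notebook, the two variants can be timed with time.perf_counter. This is a rough sketch: the helper names run_sequential and run_threaded are made up, the workload is smaller than in the cells above to keep it quick, and the exact timings depend on the machine and the Python build.

```python
import time
from threading import Thread

def sum_powers(number):
    total = 0
    for i in range(number):
        total = total + i**i
    return total

def run_sequential(number, repeats):
    return [sum_powers(number) for _ in range(repeats)]

def run_threaded(number, repeats):
    results = [None] * repeats

    def worker(index):
        # Each thread writes to its own slot, so no lock is needed
        results[index] = sum_powers(number)

    threads = [Thread(target=worker, args=(i,)) for i in range(repeats)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

for label, func in [("sequential", run_sequential), ("threaded", run_threaded)]:
    start = time.perf_counter()
    func(2000, 4)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
```

Both variants compute the same results; only the elapsed time differs.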

# Multiprocessing in Python

To take full advantage of our CPU capacity and truly parallelize our program, we need multi-processing instead of multi-threading.

For that we will use the **multiprocessing** module, which spawns multiple OS processes. Multi-processing side-steps the GIL and its limitations, since every process has its own interpreter and thus its own GIL.

When we use multiprocessing we launch several processes instead of several threads of the same process. That way each process can run on its own core. The trade-off is that processes do not share memory, so arguments and results must be serialized (pickled) and passed between them.


In [None]:
%%time
import multiprocessing


NUM_PROC = 4


def sum_powers(number):
    total = 0
    for i in range(number):
        total = total + i**i    


if __name__ == "__main__":
    jobs = []

    for i in range(NUM_PROC):
        process = multiprocessing.Process(
            target=sum_powers, 
            args=(10000,)
        )
        jobs.append(process)

    for j in jobs:
        j.start()

    for j in jobs:
        j.join()

Creating the processes by hand gets verbose. A Pool manages a fixed number of worker processes and distributes the inputs among them:

In [15]:
%%time
from multiprocessing import Pool

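def sum_powers(number):
    total = 0
    for i in range(number):
        total = total + i**i
    return total  # without this, p.map would collect a list of None

if __name__ == '__main__':
    with Pool(4) as p:
        # 8 tasks are distributed among the 4 worker processes
        results = p.map(sum_powers, [10000] * 8)
    # Each total has tens of thousands of digits, so print its size in bits instead of the number itself
    print([r.bit_length() for r in results])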
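
The same idea is available through concurrent.futures.ProcessPoolExecutor, which mirrors the thread-pool interface. The sketch below is an alternative to Pool, not what this notebook uses; power_size is a hypothetical helper that returns only the bit length of each total, so that a small integer rather than a huge number is sent back to the parent process. It is meant to be run as a script: in some environments (for example, notebooks on platforms that start processes with spawn), functions defined interactively cannot be imported by the worker processes.

```python
from concurrent.futures import ProcessPoolExecutor

def sum_powers(number):
    total = 0
    for i in range(number):
        total = total + i**i
    return total

def power_size(number):
    # Hypothetical helper: return only the size of the result, which is cheap to pickle
    return sum_powers(number).bit_length()

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block when they import the module
    with ProcessPoolExecutor(max_workers=4) as pool:
        sizes = list(pool.map(power_size, [2000] * 8))
    print(sizes)
```

The workload here is smaller than in the cells above (2000 instead of 10000) only to keep the sketch quick to run.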

# Warning! Concurrency is sometimes dangerous:

 * Running the same code simultaneously is not a problem.

 * Modifying the same data simultaneously is a big problem.
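
A minimal sketch of the second point: several threads increment one shared counter. The statement counter += 1 is a read-modify-write sequence, so without protection two threads can read the same old value and one update can get lost. Guarding the update with a threading.Lock makes it safe. The names counter and add_many are illustrative.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(times):
    global counter
    for _ in range(times):
        # Only one thread at a time may run the read-modify-write below
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

The lock serializes the updates, which is the price of correctness: the more often threads must wait for each other, the less they gain from running concurrently.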

Before implementing hard-to-debug parallel code, have a look at libraries like numpy and scipy.

When those libraries are installed via conda, they are typically linked against optimized BLAS/LAPACK implementations such as MKL or OpenBLAS, so many of their operations are parallelized for you automatically (whenever possible).