# Threads & processes

### Lessons learned 
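- GIL (global interpreter lock)
    - [link](https://realpython.com/python-gil/)
- short summary
    - Python keeps track of how many references point to a given object x (reference count)
    - when the reference count reaches 0, the memory occupied by object x is released
    - the GIL lets only one thread execute Python bytecode at a time, which protects the reference counts from race conditions
- alternatives:
    - garbage collection without reference counting (as is done in R)
    - numba has an option to release the GIL (`nogil=True`)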
- threading
    - some discussion [here](https://stackoverflow.com/questions/15085348/what-is-the-use-of-join-in-threading)

#### Questions
- when is threading a good option?
- guidance on when multiprocessing is worth it?
- see questions on model solutions from exercises at the end of the notebook
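
On the first question, a common rule of thumb is that threads pay off when the work mostly waits (network, disk, sleeping), because waiting releases the GIL. The sketch below uses `time.sleep` as a stand-in for blocking I/O; `fake_io` is a made-up name for illustration, not something from the episode.

```python
import time
from threading import Thread


def fake_io(seconds):
    """Hypothetical stand-in for a blocking I/O call; sleeping releases the GIL."""
    time.sleep(seconds)


start = time.perf_counter()
threads = [Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four waits of 0.2 s overlap, so the total should be close to 0.2 s, not 0.8 s
print(f"4 x 0.2 s of waiting took {elapsed:.2f} s")
```

For CPU-bound pure Python code such as `calc_pi`, the same pattern is not expected to help, since only one thread can run bytecode at a time.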

In [1]:
# calc_pi copied from the previous episode
import random

def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N



In [2]:
from threading import Thread

In [15]:
%%time
n = 10**7
t1 = Thread(target=calc_pi, args=(n,))
t2 = Thread(target=calc_pi, args=(n,))

t1.start()
t2.start()
t1.join()
t2.join()

CPU times: user 8.15 s, sys: 24.1 ms, total: 8.17 s
Wall time: 8.15 s
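
Note that the two `Thread` objects above discard the value returned by `calc_pi`. A minimal sketch of one way to collect the results, assuming the same `calc_pi` as above (repeated here so the block runs on its own) and using `concurrent.futures.ThreadPoolExecutor` from the standard library:

```python
import random
from concurrent.futures import ThreadPoolExecutor


def calc_pi(N):
    M = 0
    for i in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N


n = 10**5  # smaller than in the cell above, to keep the sketch quick
with ThreadPoolExecutor(max_workers=2) as pool:
    # map() returns the results in the order of the inputs
    estimates = list(pool.map(calc_pi, [n, n]))

print(estimates)
```

This only changes how results are gathered. With the GIL in place, the two calls still take turns, so no speed-up should be expected.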


### Numba without GIL

In [16]:
import numba 
@numba.jit(nopython=True, nogil=True)
def calc_pi_nogil(N):
    M = 0
    for i in range(N):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 < 1:
            M += 1
    return 4 * M / N

In [18]:
calc_pi_nogil(100)  # first call triggers JIT compilation, so it stays out of the timing
%timeit calc_pi_nogil(10**7) # same as with GIL

63.1 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Threading a numpy function

#### My solution

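What is the point here?
- sequential vs. parallel does not make a difference? (neither for `np.sort` nor for the built-in `sorted`?)
- numpy + threading takes longer than numpy alone? (about 110 µs for two threaded sorts vs. about 5 µs for a single `np.sort`)
- note: `sequential_func` also creates two threads, so both timings include the cost of starting and joining threads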

In [55]:
import numpy as np 
rng = np.random.default_rng(3)

N = 1000
a = rng.integers(0, 100, (N,))
b = rng.integers(0, 100, (N,))
a.shape


(1000,)

In [45]:
def threaded_func(f, a, b):
    "apply f to arrays a and b in two threads that run at the same time"
    t1 = Thread(target=f, args=(a,))
    t2 = Thread(target=f, args=(b,))

    t1.start()
    t2.start()
    t1.join()
    t2.join()

def sequential_func(f, a, b):
    "apply f to arrays a and b in two threads, starting the second only after the first has finished"
    t1 = Thread(target=f, args=(a,))
    t2 = Thread(target=f, args=(b,))

    t1.start()
    t1.join()
    t2.start()
    t2.join()


numpy

In [54]:
%timeit np.sort(a)

4.88 µs ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [47]:
%timeit -r 10 -n 10 threaded_func(np.sort, a, b)

113 µs ± 33.9 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)


In [48]:
%timeit -r 10 -n 10 sequential_func(np.sort, a, b)

107 µs ± 29.5 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)


sorted 

In [57]:
%timeit sorted(a)

132 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [49]:
%timeit -r 10 -n 10 threaded_func(sorted, a, b)

382 µs ± 44.9 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)


In [50]:
%timeit -r 10 -n 10 sequential_func(sorted, a, b)

380 µs ± 52.7 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)


#### Solution from episode

In [40]:
high = 1000
rnd1 = np.random.random(high)
rnd2 = np.random.random(high)
%timeit -n 10 -r 10 np.sort(rnd1)

11.3 µs ± 6.45 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)


In [41]:
%%timeit -n 10 -r 10
t1 = Thread(target=np.sort, args=(rnd1, ))
t2 = Thread(target=np.sort, args=(rnd2, ))

t1.start()
t2.start()

t1.join()
t2.join()

85.1 µs ± 21.3 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
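
The timings above suggest that, for arrays of 1000 elements, the fixed cost of creating and joining threads is much larger than the sort itself. A rough sketch to estimate that fixed cost, using two threads that do nothing (`noop` and `two_threads` are names made up for this illustration):

```python
import timeit
from threading import Thread


def noop():
    pass


def two_threads():
    # Same start/join pattern as above, but with no work inside the threads
    t1 = Thread(target=noop)
    t2 = Thread(target=noop)
    t1.start()
    t2.start()
    t1.join()
    t2.join()


reps = 1000
per_call = timeit.timeit(two_threads, number=reps) / reps
print(f"starting and joining two empty threads: {per_call * 1e6:.1f} µs")
```

If this number is of the same order as the threaded timings above, the sort is too small a task for threading to pay off; larger arrays would be needed to see any benefit.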


### Calculate $\pi$ with a queue


#### My solution

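- this does not work: `calc_pi_mp(100)` returns 0. Likely reasons:
    - in `reader_process`, the lines that compute and put `result` sit outside the `while` loop, so each reader reports at most one point (the last one it read)
    - `calc_pi_mp` returns the raw count `M` instead of `4 * M / N`
    - `Queue.empty()` is not reliable, so the collection loop may stop too early
    - if the start method is `spawn` (the default on macOS and Windows), the child processes may not find functions that are defined only in the notebook
- how efficient is it? each message carries only one point, so the queue overhead probably dominates the actual computation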

I followed these two examples:
- https://stackoverflow.com/questions/21609595/python-multiprocessing-with-an-updating-queue-and-an-output-queue
- https://stackoverflow.com/questions/11515944/how-to-use-multiprocessing-queue-in-python

What I did differently from the model solution (below):
- also use an input queue that holds the points (I started with only an input queue, and added the output queue later on)
- did not use a multiprocessing context

In [69]:
from multiprocessing import Process, Queue

def reader_process(input_queue, output_queue):
    "Read from queue and return whether point is in circle or not."  
    while True:
        msg = input_queue.get()
        if msg == "DONE":
            break
        x, y = msg
    # note: not inside the loop, so this runs only once, for the last point read
    result = x**2 + y**2 < 1.0
    output_queue.put(result)


def writer(N, num_of_reader_procs, queue):
    "Write points into the queue. A reader_process will read them and return the result."
    for _ in range(0, N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        msg = (x, y)
        queue.put(msg)

    for _ in range(0, num_of_reader_procs): # one "DONE" per reader process, since each reader stops at the first one it receives
        queue.put("DONE")
    
def start_reader_procs(qq_in, qq_out, num_of_reader_procs):
    """Start the reader processes and return all in a list to the caller"""
    all_reader_procs = []
    for _ in range(0, num_of_reader_procs):
        reader_p = Process(target=reader_process, args=(qq_in, qq_out))
        reader_p.daemon = True # daemonic process: terminated automatically when the parent process exits
        reader_p.start()
        all_reader_procs.append(reader_p)

    return all_reader_procs


def calc_pi_mp(N):
    "Calculate pi with multiprocessing"
    input_q = Queue()
    output_q = Queue()
    n_reader_procs = 2
    all_reader_procs = start_reader_procs(input_q, output_q, n_reader_procs)

    writer(N, len(all_reader_procs), input_q)

    for idx, a_reader_proc in enumerate(all_reader_procs):
        print(f"Waiting for reader_p.join() index {idx}")
        a_reader_proc.join()

    M = 0
    while True:
        if output_q.empty():
            break 
        M += output_q.get_nowait()
    
    return M
    # M = sum(results)
    # return 4 * M / N



In [70]:
calc_pi_mp(100)

Waiting for reader_p.join() index 0
Waiting for reader_p.join() index 1


0
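
A possible repair of the attempt above, as a sketch rather than the model solution. Assumptions: each reader keeps a running count and reports it once, instead of sending one result per point; the results are read from the output queue before `join()`, as the multiprocessing documentation recommends when joining processes that use queues; and the `fork` start method is requested explicitly so that notebook-defined functions are visible to the children (POSIX only). `calc_pi_queue` is a made-up name.

```python
import multiprocessing as mp
import random


def reader_process(input_queue, output_queue):
    """Count how many of the received points fall inside the unit circle."""
    hits = 0
    while True:
        msg = input_queue.get()
        if msg == "DONE":
            break
        x, y = msg
        hits += x**2 + y**2 < 1.0   # the test now runs for every point
    output_queue.put(hits)          # one partial count per reader


def calc_pi_queue(N, n_readers=2):
    ctx = mp.get_context("fork")    # assumption: POSIX system
    input_q, output_q = ctx.Queue(), ctx.Queue()
    readers = [ctx.Process(target=reader_process, args=(input_q, output_q))
               for _ in range(n_readers)]
    for p in readers:
        p.start()

    for _ in range(N):
        input_q.put((random.uniform(-1, 1), random.uniform(-1, 1)))
    for _ in readers:
        input_q.put("DONE")         # one sentinel per reader

    # Blocking get(): wait for exactly one result per reader, then join
    M = sum(output_q.get() for _ in readers)
    for p in readers:
        p.join()
    return 4 * M / N


print(calc_pi_queue(10_000))
```

Sending every point through a queue is still expected to be slow, because each message is pickled and passed between processes; the sketch fixes correctness, not efficiency.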

#### Model solution
- see `calc_pi_mp.py`

In [91]:
!python -m notebooks.calc_pi_mp --method "queue"

(3.1409236, 10000000)
(3.1406512, 10000000)


#### Questions
- in a larger problem, what is the best way to choose the number of workers and to split the number of repetitions (N) among them?
- more generally, I do not clearly see the benefit of parallelization in the model solutions; should the work not be split across all the workers? where does this happen in the code?
    - in both cases, it seems to me that we process the total number of iterations instead of splitting them?
    - [example 1](https://esciencecenter-digital-skills.github.io/parallel-python-workbench/threads-and-processes.html#challenge3), [example 2](https://esciencecenter-digital-skills.github.io/parallel-python-workbench/threads-and-processes.html)

### Calculate $\pi$ with a process pool 

In [95]:
!python -m notebooks.calc_pi_mp --method "pool"

1000 0.14762725299806334
100000 0.2009583780018147
10000000 6.208641796001757
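
Regarding the question about splitting the work: one way to divide N among the workers is sketched below. This is not taken from `calc_pi_mp.py`; `count_hits` and `calc_pi_split` are made-up names, and the `fork` start method is an assumption (POSIX only).

```python
import multiprocessing as mp
import random


def count_hits(n):
    """Return how many of n random points fall inside the unit circle."""
    rng = random.Random()           # independent generator, seeded by the OS
    hits = 0
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y < 1.0:
            hits += 1
    return hits


def calc_pi_split(N, n_workers=4):
    # Split N into n_workers chunks whose sizes differ by at most one
    base, extra = divmod(N, n_workers)
    chunks = [base + (i < extra) for i in range(n_workers)]

    ctx = mp.get_context("fork")    # assumption: POSIX system
    with ctx.Pool(n_workers) as pool:
        hits = pool.map(count_hits, chunks)
    return 4 * sum(hits) / N


print(calc_pi_split(10**6))
```

Each worker handles roughly `N / n_workers` points, so the total amount of work stays at N. For small N, the cost of starting the pool is likely to outweigh any gain.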
