# Multiprocessing

The ```multiprocessing``` package has a similar interface to the ```threading``` module but, instead of spawning threads, it spawns processes. These are separate processes in the operating system that have their own memory space. This means that sharing information between processes is more complicated than sharing information between threads. However, it also means that race conditions are less likely and, as each process is independent, the GIL is not a problem. This allows for code to be executed in parallel.

## Spawning Processes

The main class is the ```Process``` class. We can create a new instance of this class using ```Process(target=func, args=(arg1, arg2))```. We can then start the process using ```p.start()``` and wait for it to finish using ```p.join()```. For example:

In [16]:
import multiprocessing

def greeting(processes_number):
    print(f'Hello from process number {processes_number} {__name__}')

processes = []

print(__name__)

# Only the original (parent) process should spawn new processes, so the
# spawning code sits inside the `if __name__ == '__main__':` block below.
# Each process has its own separate copy of the function 'greeting'.

# Removing the `if __name__ == '__main__':` guard works on Linux but breaks on Windows.

if __name__ == '__main__':
    for i in range(2):
        p = multiprocessing.Process(target=greeting, args=(i,))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

    print('Main process is done')

__main__
Hello from process number 0 __main__
Hello from process number 1 __main__
Main process is done


Much of this is similar to what you've already done with threads, but there are a few important differences. 

The first thing to note is the line ```if __name__ == '__main__':```. When a new process is spawned, it runs the script again from the beginning. This is necessary because the new process has a separate memory space, so the function (in this case ```greeting```) must be defined again in the new process.

To explain this, we need to consider the built-in variable ```__name__```. This variable is created automatically when Python runs and takes different values in different circumstances. In code which is being run directly, it has the value ```__main__```. In code which is being run as the main script of a new process, it has the value ```__mp_main__```. You can check the values of ```__name__``` by running [```03_multiprocessing_scripts/print_example.py```](03_multiprocessing_scripts/print_example.py).

This means that the condition ```if __name__ == '__main__':``` is only true in the main script and not in any new processes that are spawned, so the code inside this block only runs in the main script. This prevents each new process from spawning more processes and so creating an infinite loop. We also place the code that waits for the processes to finish, and the final call to the ```print``` function, inside the if-block so that they only run in the main script.
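
As a rough illustration of the point above, the sketch below reports ```__name__``` and the process name from three places: module level, the guarded block and a child process. The helper names ```describe_context``` and ```report``` are made up for this example. What you see depends on how your platform starts processes, so no expected output is shown.

```python
import multiprocessing

def describe_context():
    # __name__ is looked up in the module where this function is defined
    return __name__, multiprocessing.current_process().name

def report(label):
    module_name, process_name = describe_context()
    print(f'{label}: __name__ = {module_name!r}, process = {process_name!r}')

# This line runs in every process that executes the script from the top
report('module level')

if __name__ == '__main__':
    # Only the process that was started directly reaches this block
    p = multiprocessing.Process(target=report, args=('child target',))
    p.start()
    p.join()
```

Where the child re-runs the script (for example on Windows), the ```module level``` line should appear a second time with ```__mp_main__```. Where the child is created with the fork start method, the script is not re-run and the child reports ```__main__```.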

If you are running this code on a Codespace, you may not see all of the behaviour described in the paragraph above. The reason is that the Codespace handles its processes slightly differently from the Windows system I wrote the code on. In a Codespace, when the process is spawned the ```def``` sections are performed by each process but the ```print``` statements are ignored. In fact, in a Codespace, we can remove the `if __name__ == '__main__'` line entirely and the processes will only be spawned from the main process. However, leaving this out would cause the code to break when run outside the Codespace, so we'll keep it throughout this course as it's important for writing portable code.

When running this code outside of a Codespace, the result of the print statement in ```greeting``` is not displayed under the code cell. This is because it is being run in a separate process and so its output is not captured and displayed by the Jupyter notebook. Within a Codespace, however, it is captured and displayed by the Jupyter notebook.

A copy of this code is found in the file [```03_multiprocessing_scripts/print_example.py```](03_multiprocessing_scripts/print_example.py). If you are running this code outside of a Codespace, you can run this code and see that the output of all processes is captured by the terminal and displayed there.

## Getting Values from Processes

Just like a thread created with the ```threading``` module, any value returned from a function called in a process will be lost. However, there are a few ways we can communicate between processes. We'll look at ```Pipe```, ```Queue```, ```Value``` and ```Array```.

### Pipes

A pipe is a two-way communication channel between two processes. We can create a pipe using ```Pipe()```, which returns the two ends of the pipe as a pair of connection objects. These can be passed to two processes to allow communication between them. By default, communication is allowed in both directions. We can then use the ```send()``` and ```recv()``` methods of the connection objects to send and receive data through the pipe. When the ```recv()``` method is called, the process will wait until data is available to be received.

Note that there is a maximum size of data which can be sent through the pipe (this may be around 32MB depending on operating system).
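
Before involving a second process, it can help to see the two ends of a pipe working within a single process. The sketch below is only an illustration: it sends a small dictionary through a default two-way pipe, then uses the ```duplex=False``` argument to create a one-way pipe. The variable names are chosen for this example.

```python
import multiprocessing

# Two-way (duplex) pipe: either end can send and receive
end_a, end_b = multiprocessing.Pipe()
end_a.send({'values': [1, 2, 3]})
received = end_b.recv()
end_b.send(sum(received['values']))
total = end_a.recv()
print(received, total)

# One-way pipe: the first connection can only receive, the second can only send
receiver, sender = multiprocessing.Pipe(duplex=False)
sender.send('one-way message')
message = receiver.recv()
print(message)
```

Only small objects are sent here, so each ```send()``` completes before the matching ```recv()``` is called. In real use the two ends would normally live in different processes.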

The example below shows a simple example using a pipe to communicate between the main process and a child process:

Note that the code below will not work in a Jupyter notebook in all contexts due to incompatibilities between ```multiprocessing``` and Jupyter. However, you can run this code in a Python script and see that the result is printed to the terminal. A copy of this code is found in the file [```03_multiprocessing_scripts/pipe_example.py```](03_multiprocessing_scripts/pipe_example.py) which you can run.

In the code below, the parent process creates a pipe and passes one end of the pipe to the child process. The parent process then sends an array to the child process. The child process receives the array, calculates the sum of the array and sends the result back to the parent process. The parent process then receives the result and prints it.

This method of communicating requires careful thought regarding the order in which processes will need to communicate with each other to make sure data is sent and received in the correct order. This can be difficult to manage in more complex programs. It can also lead to processes waiting for data from another process, reducing the benefits of parallel execution. Once we have more than one child process, we will need to create a pipe for each pair of processes that need to communicate, further increasing complexity.

In [None]:
import multiprocessing
import numpy

def calculate_sum(conn):
    # Wait to receive an array from the parent process
    array = conn.recv()
    # Calculate the sum of the array
    result = numpy.sum(array)
    # Send the result back to the parent process
    conn.send(result)

if __name__ == '__main__':
    # Create a Pipe() object
    # This function returns a pair of connection objects connected by a pipe
    # By default the pipe is symmetric: either end can send and receive

    # parent_conn is kept by the main process and child_conn is handed to the spawned process
    parent_conn, child_conn = multiprocessing.Pipe()

    # Create a process and pass the child connection object to it
    # The process will run the calculate_sum function

    # args passes the child end of the pipe to calculate_sum
    p = multiprocessing.Process(target=calculate_sum, args=(child_conn,))

    # Start the process
    p.start()

    # Send an array to the child process
    parent_conn.send(numpy.arange(1, 6))
    # Receive the result from the child process
    print(parent_conn.recv())
    # Wait until the process is finished
    p.join()
    print('Main process is done')

15
Main process is done


### Deadlocks

A deadlock is a situation where two or more processes are waiting for each other before progressing. This can happen in a number of situations in concurrent programming. One possible cause of deadlocks is when two processes are waiting for each other to send data through a pipe. The following code is an adapted version of the code above but without the call to ```parent_conn.send``` in the main process:

```python
import multiprocessing
import numpy

def calculate_sum(conn):
    # Wait to receive an array from the parent process
    array = conn.recv()
    # Calculate the sum of the array
    result = numpy.sum(array)
    # Send the result back to the parent process
    conn.send(result)

if __name__ == '__main__':
    # Create a Pipe() object
    # This function returns a pair of connection objects connected by a pipe
    parent_conn, child_conn = multiprocessing.Pipe()
    # Create a process and pass the child connection object to it
    # The process will implement the calculate_sum function
    p = multiprocessing.Process(target=calculate_sum, args=(child_conn,))
    # Start the process
    p.start()
    # Receive the result from the child process
    print(parent_conn.recv())
    # Wait until the process is finished
    p.join()
    print('Main process is done')
```

This code can be run in the file [`03_multiprocessing_scripts/deadlock_example.py`](03_multiprocessing_scripts/deadlock_example.py). You will see that the code hangs and does not finish. This is because the parent process is waiting for the child process to send data through the pipe and the child process is waiting for the parent process to send data through the pipe. This is a deadlock. Care should be taken to avoid situations like this in concurrent programming.
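
One possible way to guard against this kind of hang, sketched here as an illustration rather than a recommendation, is to call the connection's ```poll()``` method with a timeout before calling ```recv()```. The helper ```recv_with_timeout``` is a name invented for this example, and it is demonstrated within a single process to keep it short.

```python
import multiprocessing

def recv_with_timeout(conn, timeout):
    # Return (True, data) if data arrives within `timeout` seconds, else (False, None)
    if conn.poll(timeout):
        return True, conn.recv()
    return False, None

parent_conn, child_conn = multiprocessing.Pipe()

# Nothing has been sent yet, so a plain recv() here would wait forever
empty_result = recv_with_timeout(parent_conn, 0.1)
print(empty_result)

# Once something has been sent, the same call returns it
child_conn.send(15)
ready_result = recv_with_timeout(parent_conn, 0.1)
print(ready_result)
```

With a check like this, the parent can report a problem or stop the child instead of waiting indefinitely.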

### Queues

A queue is a datatype which allows for communication between many processes. We can create a queue using ```multiprocessing.Queue()```. We can then use the ```put()``` and ```get()``` methods to add and remove items from the queue. The data will be stored in a First In First Out (FIFO) order. The example below shows a simple example of how data is added to and removed from a queue using only the main process.

In [1]:
import multiprocessing

queue = multiprocessing.Queue()

queue.put(1)
queue.put(2)
queue.put(3)

print(queue.get())
print(queue.get())
print(queue.get())

1
2
3


A queue may be passed to multiple different processes and each process with access to the ```Queue``` can add data to the queue or retrieve data from it. If many processes add data to a ```Queue``` at the same time, the order in which their data arrives is not guaranteed, as the order of execution across different processes is not guaranteed. This limits the way in which a ```Queue``` can be used as it may not be clear which process a piece of data is from.
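
If it matters which process produced a result, one option is to have each process label what it puts in the queue. The sketch below assumes a made-up task (squaring a number) and invented names (```square_worker```, ```collect_by_worker```). Each item is a ```(worker_id, result)``` tuple, so the main process can put the results back in order.

```python
import multiprocessing

def square_worker(worker_id, value, results_queue):
    # Put a (worker_id, result) tuple so the receiver knows who sent it
    results_queue.put((worker_id, value ** 2))

def collect_by_worker(results_queue, n_results):
    # Gather n_results items and index them by the id of the sending worker
    results = {}
    for _ in range(n_results):
        worker_id, result = results_queue.get()
        results[worker_id] = result
    return results

if __name__ == '__main__':
    values = [3, 4, 5]
    results_queue = multiprocessing.Queue()
    processes = []
    for worker_id, value in enumerate(values):
        p = multiprocessing.Process(target=square_worker, args=(worker_id, value, results_queue))
        p.start()
        processes.append(p)

    results = collect_by_worker(results_queue, len(values))
    for p in processes:
        p.join()

    # Results are reported in worker order, whatever order they arrived in
    print([results[i] for i in range(len(values))])
```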

When the ```get``` method is called, the execution of the code will block (meaning "wait") until data is available in the queue. This means we don't need to worry about if the computations required to put data in the queue have been completed when we call the ```get``` method. However, we do need to make sure the same amount of data is added to the queue as is removed from it. If we try to remove more data from the queue than will be added to it, the code will block indefinitely.
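
To avoid waiting indefinitely, ```get``` accepts a ```timeout``` argument and raises the ```Empty``` exception from the standard ```queue``` module if nothing arrives in time. The sketch below shows this within a single process. The helper ```get_or_none``` is a name made up for this example, and it assumes that ```None``` is never a genuine item in the queue.

```python
import multiprocessing
import queue  # provides the Empty exception raised when get() times out

def get_or_none(q, timeout):
    # Return the next item, or None if nothing arrives within `timeout` seconds
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        return None

q = multiprocessing.Queue()
q.put('only item')

first = get_or_none(q, 1.0)   # the item that was put in the queue
second = get_or_none(q, 0.1)  # the queue is now empty, so this times out
print(first, second)
```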

The queue is thread and process safe, meaning that it can be used to communicate between many processes without the need for locks. The example below shows how we can use a ```Queue``` to collect the results from an arbitrary number of processes:

```python
import numpy as np
import multiprocessing
import time

# Note the start time
start_time = time.time()

def find_smallest_multiple(n_data, factor, queue):
    # This function generates n_data random integers and finds the smallest multiple of factor

    # Initially we have found no multiples of factor
    result = None

    # Create the random data
    data = np.random.randint(1, 1000, n_data)

    for d in data:
        # Loop over the data and check if it's a multiple of factor
        if d % factor == 0:
            # If it is, check if it's the smallest we've found so far
            if result is None or d < result:
                # Update the result
                result = d

    # After considering each value, put the result in the queue
    queue.put(result)

if __name__ == '__main__':
    # Set up the problem data
    n_processes = 2
    n_data = int(1e6)
    factor = 7
    n_data_per_process = n_data // n_processes

    # Set up the queue
    queue = multiprocessing.Queue()

    for i in range(n_processes):
        # Spawn and start the processes
        p = multiprocessing.Process(target=find_smallest_multiple, args=(n_data_per_process, factor, queue))
        p.start()

    # We haven't found any multiples of factor yet
    result = None

    for i in range(n_processes):
        # Get each result from the queue
        # The code will pause here while the main process waits for each child process to finish
        r = queue.get()

        if r is not None and (result is None or r < result):
            # If this process found a multiple and it's smaller than the current result, update it
            result = r

    # Note the end time and print the elapsed time
    end_time = time.time()
    print(f'Time taken: {end_time - start_time}')

    print(f'The smallest multiple of {factor} in the data is {result}')
```

This code can be run in the file [```03_multiprocessing_scripts/queue_example.py```](03_multiprocessing_scripts/queue_example.py). In the main process we create a ```Queue``` object and pass it to each of the child processes. Each child process calculates the smallest multiple of a given factor in a subset of the data and adds the result to the ```Queue```. 

The main process collects the same number of items from the ```Queue``` as there are child processes. Initially, the processes won't have completed their calculations and added their results to the ```Queue```, so the main process will block until data is available. As the result from each process is added to the ```Queue```, the main process collects and processes it. This sort of task is particularly well suited to a ```Queue``` as the order in which the data is added to the ```Queue``` is not important. We don't need to call ```join()``` to wait for the processes to finish, as the ```queue.get()``` method will automatically block until data is available. As a result, we also don't need to create a list of the processes.

We can observe the performance of the code by changing the number of processes and the size of the data:

<p align="center">
<img src="resources/queue_smallest_factor.png" alt="A figure showing the runtime for different numbers of processes as a function of n_data" class="center">
</p>

When we completely remove the multiprocessing and run the code in a single process, we can see that the runtime is much lower for low values of ```n_data```. This is because spawning processes takes some time, slowing down the code. However, as ```n_data``` increases, this overhead becomes less significant and at around 100,000,0000 data points, the performance of the multiprocessing code equals that of the serial implementation. For 10,000,000,000 pieces of data, the multiprocessing implementation with both 4 and 8 cores is around 4 times faster than the serial implementation.

In [15]:
import numpy as np
import multiprocessing
import time

# Note the start time
start_time = time.time()

def find_smallest_multiple(n_data, factor, queue):
    # This function generates n_data random integers and finds the smallest multiple of factor

    # Initially we have found no multiples of factor
    result = None

    # Create the random data
    data = np.random.randint(1, 1000, n_data)

    for d in data:
        # Loop over the data and check if it's a multiple of factor
        if d % factor == 0:
            # If it is, check if it's the smallest we've found so far
            if result is None or d < result:
                # Update the result
                result = d

    # After considering each value, put the result in the queue
    queue.put(result)

if __name__ == '__main__':
    # Set up the problem data
    n_processes = 2
    n_data = int(1e8)
    factor = 40
    n_data_per_process = n_data // n_processes

    # Set up the queue
    queue = multiprocessing.Queue()

    for i in range(n_processes):
        # Spawn and start the processes
        p = multiprocessing.Process(target=find_smallest_multiple, args=(n_data_per_process, factor, queue))
        p.start()

    # We haven't found any multiples of factor yet
    result = None

    for i in range(n_processes):
        # Get each result from the queue
        # The code will pause here while the main process waits for each child process to finish
        r = queue.get()

        if r is not None and (result is None or r < result):
            # If this process found a multiple and it's smaller than the current result, update it
            result = r

    # Note the end time and print the elapsed time
    end_time = time.time()
    print(f'Time taken: {end_time - start_time}')

    print(f'The smallest multiple of {factor} in the data is {result}')

Time taken: 6.8752830028533936
The smallest multiple of 40 in the data is 40


### Exercise: Monte Carlo Pi

The value of $\pi$ can be estimated using a Monte Carlo method. This involves generating a large number of random points in a square and counting the number of points that fall within a circle inscribed in the square. The ratio of the number of points in the circle to the total number of points is an estimate of the ratio of the area of the circle to the area of the square. This ratio is equal to $\frac{\pi}{4}$. By multiplying this ratio by 4, we can estimate the value of $\pi$. 

<p align="center">
<img src="resources/monte_carlo_circle.png" alt="A circle with a radius inside a square with side length 2r" class="center">
</p>

In the diagram above, the circle has an area of $\pi r^2$ and the square has an area of $(2r)^2 = 4r^2$. The ratio of the area of the circle to the area of the square is $\frac{\pi r^2}{4r^2} = \frac{\pi}{4}$. This means that the ratio of the number of points in the circle to the total number of points is also $\frac{\pi}{4}$. When performing this calculation, it is convenient to use a circle with a radius of 1, so that the area of the circle is $\pi$ and the area of the square is 4. A non-concurrent implementation of this method is shown below:

In [16]:
import random
import time

# Note the starting time
start_time = time.time()

# Number of points to generate
n_points = int(1e7)

# Counter for the number of points inside the circle
n_inside = 0

for i in range(n_points):
    # Generate a random point in the unit square
    # random.random generates a random number between 0 and 1
    x = random.random()
    y = random.random()

    if x ** 2 + y ** 2 <= 1:
        # If the point is inside the circle, increment the counter
        n_inside += 1

# Calculate the approximate value of pi
pi_approximation = 4 * n_inside / n_points

print(f'The value of pi is approximately {pi_approximation}')

print(f'Time taken: {time.time() - start_time}')

The value of pi is approximately 3.141492
Time taken: 3.559295415878296


To do this, we need to generate a large number of random points and count how many fall within the circle. We can split this task between multiple processes to speed up the calculation. Create a new `.py` file and adapt the code to use multiple processes. Reminder: if you want to use more than 2 processes, you will need to [increase the number of cores the Codespace is using](https://docs.github.com/en/codespaces/customizing-your-codespace/changing-the-machine-type-for-your-codespace).

A sample solution can be found in the file [`sample_solutions/monte_carlo_pi.py`](sample_solutions/monte_carlo_pi.py).

In [None]:
## Monte Carlo estimation of pi example: with Queue

import random
import multiprocessing
import time

# Note the start time
start_time = time.time()

def random_points_in_circle(n_data, queue):
    # This function generates n_data random points and counts how many fall inside the circle

    # Initially no points have been counted
    result = 0

    for d in range(n_data):
        x = random.random()
        y = random.random()

        if x**2 + y**2 <= 1:
            result += 1

    # After considering each point, put the count in the queue
    queue.put(result)

if __name__ == '__main__':
    # Set up the problem data
    n_processes = 4
    n_data = int(1e7)
    n_data_per_process = n_data // n_processes

    # Set up the queue
    queue = multiprocessing.Queue()

    for i in range(n_processes):
        # Spawn and start the processes
        p = multiprocessing.Process(target=random_points_in_circle, args=(n_data_per_process, queue))
        p.start()

    num_point_in_circle = 0

    for i in range(n_processes):
        # Each item in the queue is the number of points inside the circle for one process, so we just add them up
        # queue.get() blocks until a result is available, so we don't need p.join() to wait for the processes
        r = queue.get()
        num_point_in_circle += r

    # Note the end time and print the elapsed time
    end_time = time.time()
    print(f'Time taken: {end_time - start_time}')

    print(f'{num_point_in_circle*4/n_data}')


Time taken: 0.8708505630493164
3.1405728


In [33]:
## monte carlo estimation of Pi example: with Pipe (not as neat/efficient as with Queues)
## sketch of solution : generate list of pipes with one end in the main process and others in sub-processes
# pipe the number of points in circle to main process and sum
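
The pipe-based idea sketched in the comments above could look something like the following. This is one possible arrangement, not the course's sample solution. The names ```count_points_in_circle``` and ```pipe_worker``` and the use of the process index as a random seed are choices made for this illustration.

```python
import multiprocessing
import random

def count_points_in_circle(n_points, seed):
    # Use a separate generator per call so each process has its own random stream
    rng = random.Random(seed)
    n_inside = 0
    for _ in range(n_points):
        x = rng.random()
        y = rng.random()
        if x ** 2 + y ** 2 <= 1:
            n_inside += 1
    return n_inside

def pipe_worker(conn, n_points, seed):
    # Send this process's count back through its own pipe
    conn.send(count_points_in_circle(n_points, seed))
    conn.close()

if __name__ == '__main__':
    n_processes = 4
    n_points_per_process = int(1e6) // n_processes

    receivers = []
    processes = []
    for i in range(n_processes):
        # One one-way pipe per child: the main process keeps the receiving end
        receiver, sender = multiprocessing.Pipe(duplex=False)
        p = multiprocessing.Process(target=pipe_worker, args=(sender, n_points_per_process, i))
        p.start()
        receivers.append(receiver)
        processes.append(p)

    # recv() blocks until the corresponding child has sent its count
    n_inside = sum(receiver.recv() for receiver in receivers)
    for p in processes:
        p.join()

    n_total = n_points_per_process * n_processes
    print(f'The value of pi is approximately {4 * n_inside / n_total}')
```

Compared with the ```Queue``` version, the main process has to keep track of one connection per child, which is the extra bookkeeping described in the Pipes section.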

### Values

The ```multiprocessing``` module also provides a way to share data between processes using the ```Value``` class. This class creates a variable which references the same location in our computer's memory for each process. This means that changes to the variable in one process will be reflected in all other processes.

The data stored in a ```Value``` will be in the form of a ```ctypes``` object. This is a C-style data type which is used to store data in memory. The C family of languages underpins much of Python and other languages and this is why it is used here. When we create a ```Value``` object, we need to specify the type of data we want to store. We can import the different types from the ```ctypes``` module (which is part of the Python Standard Library). The most common types are:

- ```ctypes.c_int```: An integer (32 bits on most platforms)
- ```ctypes.c_double```: A double precision floating point number
- ```ctypes.c_bool```: A boolean value
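
As a small illustration of the types listed above, the sketch below creates one ```Value``` of each type within a single process and reads it back. It also shows that ```Value``` accepts a one-character typecode (of the kind used by the standard ```array``` module) in place of a ```ctypes``` type. The variable names are made up for this example.

```python
import ctypes
import multiprocessing

counter = multiprocessing.Value(ctypes.c_int, 3)
ratio = multiprocessing.Value(ctypes.c_double, 0.5)
flag = multiprocessing.Value(ctypes.c_bool, True)

# The same integer type requested through the typecode 'i'
counter_from_code = multiprocessing.Value('i', 3)

print(counter.value, ratio.value, flag.value, counter_from_code.value)
```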

We can retrieve and set the value of a ```Value``` object using its ```value``` attribute. The example below shows how we can create a shared ```Value``` object and increment it in a child process:

```python
import multiprocessing
import ctypes

# Create a shared memory value
# It is an integer with an initial value of 0
v = multiprocessing.Value(ctypes.c_int, 0)

def increment(v):
    v.value += 1

if __name__ == '__main__':
    # Create a process that increments the value
    p = multiprocessing.Process(target=increment, args=(v,))
    p.start()
    p.join()

    # Print the value
    print(v.value)

```

This code can be run in the file [```03_multiprocessing_scripts/value_example.py```](03_multiprocessing_scripts/value_example.py). 

As the data in a ```Value``` is shared between processes, it would now be possible to encounter race conditions as we did with threads. However, the ```Value``` class has a built in lock which we can access with the ```get_lock``` method and use to prevent this. We can use the ```acquire()``` and ```release()``` methods of the lock to acquire and release the lock, as below:

In [35]:
import multiprocessing
import ctypes

# Create a shared memory value
# It is an integer with an initial value of 0
v = multiprocessing.Value(ctypes.c_int, 0)

def increment(v):
    v.value += 1

if __name__ == '__main__':
    # Create a process that increments the value
    p = multiprocessing.Process(target=increment, args=(v,))
    p.start()
    p.join()

    # Print the value
    print(v.value)

1


In [None]:
# Shared value race condition example: the increments are not protected by a lock

import multiprocessing
import ctypes

# Create a shared memory value
# It is an integer with an initial value of 0
v = multiprocessing.Value(ctypes.c_int, 0)

def increment(v):
    # No lock is held here, so updates from different processes can overwrite each other
    v.value += 1

if __name__ == '__main__':
    # Create several processes that each increment the value
    num_proc = 3
    processes = []
    for i in range(num_proc):
        p = multiprocessing.Process(target=increment, args=(v,))
        p.start()
        processes.append(p)

    # Wait for every process to finish
    for p in processes:
        p.join()

    # Print the value
    print(v.value)

In [38]:
import multiprocessing
import ctypes

# Create a shared memory value
# It is an integer with an initial value of 0
v = multiprocessing.Value(ctypes.c_int, 0)

# Get the lock
v.get_lock().acquire()
# Do our calculations altering the value
v.value += 1
# Release the lock
v.get_lock().release()

print(v.value)

1


We can also use a context manager to acquire and release the lock. The example below shows how we can use the lock:

In [39]:
import multiprocessing
import ctypes

# Create a shared memory value
# It is an integer with an initial value of 0
v = multiprocessing.Value(ctypes.c_int, 0)

with v.get_lock():
    # Perform our calculations altering the value in the indented code
    v.value += 1

print(v.value)

1


This is a more Pythonic way of managing locks and is less error-prone, as we cannot forget to release the lock.
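
The same protection applies when something goes wrong inside the block. The sketch below is a single-process illustration with invented names. It passes an explicit, non-reentrant ```multiprocessing.Lock``` to ```Value``` through its ```lock``` argument, purely so that the final check is meaningful, and raises an exception while the lock is held.

```python
import ctypes
import multiprocessing

# Assumption for this demonstration: use a plain Lock so a held lock can be detected
lock = multiprocessing.Lock()
v = multiprocessing.Value(ctypes.c_int, 0, lock=lock)

def failing_update(shared):
    with shared.get_lock():
        shared.value += 1
        raise ValueError('something went wrong while holding the lock')

try:
    failing_update(v)
except ValueError:
    pass

# If the lock had been left held, this non-blocking acquire would return False
lock_was_free = v.get_lock().acquire(block=False)
if lock_was_free:
    v.get_lock().release()

print(v.value, lock_was_free)
```

With manual ```acquire()``` and ```release()``` calls, the same exception would skip the ```release()``` unless it was placed in a ```finally``` block.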

The example below shows how we can use both ways of handling the lock to increment a ```Value``` object safely across multiple processes:

```python
import multiprocessing
import ctypes

def increment(v):
    # Manually acquire and release the lock
    v.get_lock().acquire()
    v.value += 1
    v.get_lock().release()

    # Use the context manager to acquire and release the lock
    with v.get_lock():
        v.value += 100

if __name__ == '__main__':
    # Create a shared memory value
    # It is an integer with an initial value of 0
    v = multiprocessing.Value(ctypes.c_int, 0)

    # Create n_process processes which increment the value
    n_process = 8
    processes = []
    for i in range(n_process):
        p = multiprocessing.Process(target=increment, args=(v,))
        p.start()
        processes.append(p)

    # Wait for every process to finish, not just the last one created
    for p in processes:
        p.join()

    # Print the value
    print(v.value)
```

This code can be run in the file [```03_multiprocessing_scripts/value_lock_example.py```](03_multiprocessing_scripts/value_lock_example.py).

In [None]:
import multiprocessing
import ctypes

def increment(v):
    
    # # Manually acquire and release the lock
    v.get_lock().acquire()
    v.value += 1
    v.get_lock().release()

    # # Use the context manager to acquire and release the lock
    with v.get_lock():
        v.value += 100
    
    # race condition:
    # v.value += 1

if __name__ == '__main__':
    # Create a shared memory value
    # It is an integer with an initial value of 0
    v = multiprocessing.Value(ctypes.c_int, 0)

    # Create n_process processes which increment the value
    n_process = 5
    processes = []
    for i in range(n_process):
        p = multiprocessing.Process(target=increment, args=(v,))
        p.start()
        processes.append(p)

    # Wait for every process to finish, not just the last one created
    for p in processes:
        p.join()

    # Print the value
    print(v.value)


5


### Arrays

The ```Array``` class of the ```multiprocessing``` module is similar to the ```Value``` class but allows us to store more than one value in a shared memory location. We can create an ```Array``` object using ```multiprocessing.Array()```. We need to specify the type of data we want to store and the size of the array. We can use the same ```ctypes``` types as we did with the ```Value``` class, and we set the initial values using a tuple of values, as below:

In [131]:
import multiprocessing
import ctypes

# Create a shared memory array
# It is an array of 5 floats with an initial value of 0
a = multiprocessing.Array(ctypes.c_double, (0, 0, 0, 0, 0))

# We can access a single value from the array using an index
print(a[1])

# We can modify a single value in the array using an index
a[1] = 7

# We access every value in the array using ':' as an index
# usual python indexing rules apply
print(a[:])

# We can iterate over the array
for x in a:
    print(x)

0.0
[0.0, 7.0, 0.0, 0.0, 0.0]
0.0
7.0
0.0
0.0
0.0


In the example above, we also saw how we can access and modify data in the array. Note that an array cannot be extended or shortened once it has been created.
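
A short single-process sketch of what the fixed size means in practice is given below, with names chosen for this illustration. Entries can be overwritten individually or through a slice of matching length, and the length stays the same throughout.

```python
import ctypes
import multiprocessing

a = multiprocessing.Array(ctypes.c_double, (1, 2, 3))
length_before = len(a)

# Overwrite a single entry
a[0] = 10

# Overwrite several entries at once using a slice of the same length
a[1:] = [20, 30]

length_after = len(a)
print(a[:], length_before, length_after)
```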

It's also possible to create an array by specifying its type and the number of entries as an integer. If we do this, each value will be initially set to zero. For example:

In [132]:
import multiprocessing
import ctypes

# Create an array with 10 floating point vars - 0 by default
array = multiprocessing.Array(ctypes.c_double, 10)

print(array[:])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


An ```Array``` has a lock in a similar way to a ```Value```.

The code below shows how we can create an array which keeps track of the number of times each number has been rolled on a six-sided die:

```python
import multiprocessing
import ctypes
import random

def roll_n_dice(n, array):
    local_results = [0, 0, 0, 0, 0, 0]
    for i in range(n):
        current_roll = random.randint(1, 6)
        local_results[current_roll - 1] += 1

    with array.get_lock():
        for i in range(6):
            array[i] += local_results[i]

if __name__ == '__main__':
    # Create a shared memory array
    # It is an array of 6 integers with an initial value of 0
    a = multiprocessing.Array(ctypes.c_int, 6)

    n_rolls_total = int(1e6)

    # Create n_process processes
    n_process = 4
    processes = []
    for i in range(n_process):
        p = multiprocessing.Process(target=roll_n_dice, args=(n_rolls_total // n_process, a))
        processes.append(p)
        p.start()

    # Wait for all processes to finish
    for p in processes:
        p.join()

    # Print the results
    print(f'Results after {n_rolls_total} rolls: {list(a[:])}')
```

This code can be run in the file [```03_multiprocessing_scripts/array_example.py```](03_multiprocessing_scripts/array_example.py).

In the example above, the results are summed up by each process independently and then added to the shared array. This means each process only needs access to the shared array for a short amount of time, so most of the code in each process can run in parallel. This is a good way to structure code when using shared memory as it means each process will not spend much time waiting for other processes to release the lock on the shared array.

<p align="center">
<img src="resources/array_dice_rolls.png" alt="A figure showing the runtime for different numbers of processes as a function of n_data" class="center">
</p>

As we can see above, the use of `multiprocessing` adds an overhead to the code, causing the calculation to take longer for small numbers of dice rolls compared to the serial implementation. This overhead is the creation and management of the processes. However, as the number of dice rolls increases, the performance of the multiprocessing implementation improves and above about 10,000,000 rolls the `multiprocessing` implementation using 8 processes is around 3.5 times faster than the serial implementation.


In [141]:
import multiprocessing
import ctypes
import random

def roll_n_dice(n, array):

    ## number of times you roll each number
    local_results = [0, 0, 0, 0, 0, 0]

    for i in range(n):
        current_roll = random.randint(1, 6)
        local_results[current_roll - 1] += 1

    # add to the main array object - access using lock
    with array.get_lock():
        for i in range(6):
            array[i] += local_results[i]

if __name__ == '__main__':
    # Create a shared memory array
    # It is an array of 6 integers with an initial value of 0
    a = multiprocessing.Array(ctypes.c_int, 6)

    n_rolls_total = int(1e6)

    # Create n_process processes
    n_process = 4
    processes = []
    for i in range(n_process):
        p = multiprocessing.Process(target=roll_n_dice, args=(n_rolls_total // n_process, a))
        processes.append(p)
        p.start()

    # Wait for all processes to finish
    for p in processes:
        p.join()

    # Print the results
    print(f'Results after {n_rolls_total} rolls: {list(a[:])}')

    print(f'normalized {[v/n_rolls_total for v in list(a[:])]}')

Results after 1000000 rolls: [166812, 167025, 166190, 166911, 166205, 166857]
normalized [0.166812, 0.167025, 0.16619, 0.166911, 0.166205, 0.166857]


### Exercise: Projectile Ranges

In this exercise we'll consider the launch of a projectile. However, there is some imprecision in both the angle and the speed at which it is launched. Your job is to quantify the resulting uncertainty in the range of the projectile. The physics is very simplified: there is no air resistance and the projectile flies over a flat surface. The angle and the speed will each be uniformly distributed over a range of values.
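
For reference, under these simplifications (no air resistance, launch and landing at the same height), the textbook result for the range $R$ of a projectile launched at speed $v$ and angle $\theta$ above the horizontal is:

$$
R = \frac{v^2 \sin(2\theta)}{g}
$$

where $g$ is the acceleration due to gravity. Whether `calculate_range` uses exactly this form is an assumption, so check the script itself.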

The distribution of ranges is calculated by randomly sampling from the distributions of angles and speeds and calculating the range for each sampled combination of angle and speed. The ranges are then sorted into bins and the number of samples falling in each bin is counted.

A non-concurrent version of the code is written in the file [`03_multiprocessing_scripts/projectile_exercise.py`](03_multiprocessing_scripts/projectile_exercise.py). Your task is to parallelise this code using the `multiprocessing` module. You should create a new copy of the file and adapt the code to use multiple processes. You should not need to modify the function `calculate_range`, but you will need to modify `calculate_range_distribution` and the code outside of the functions. You may create other functions if you wish. Do not feel obliged to keep the same interface to the functions as the original code. So long as the number of samples in each bin is printed at the end of the code, you can structure the code as you wish. As the angle and speed are generated randomly for each sample, the results will not be exactly the same each time the code is run, but you should check they are similar to the original code.

Once you have rewritten the code to use multiple processes, you can compare the performance of the code for different numbers of processes and samples.

Reminder: if you want to use more than 2 processes, you will need to [increase the number of cores the Codespace is using](https://docs.github.com/en/codespaces/customizing-your-codespace/changing-the-machine-type-for-your-codespace). There is a sample solution in the file [`sample_solutions/projectile.py`](sample_solutions/projectile.py).

## Pools

A `Pool` is a class from the `multiprocessing` module that makes it easy to distribute work across a number of processes. We can create a `Pool` object using `multiprocessing.Pool()`, providing an argument specifying the number of processes to create. We can then use the `map` method of the `Pool` object to apply a function to a list of arguments. The `map` method will distribute the arguments across the processes in the `Pool` and return the function's results in a list, in the same order as the arguments. The `map` method will block until every call to the function has finished.

```python
from multiprocessing import Pool, current_process
import time
import random

# This is the function we want to apply to each entry in the list
def f(x):
    # multiprocessing.current_process().name is a way to get the name of the process
    print(f'Process {current_process().name} is working on the value {x}')
    # Wait for a random amount of time between 0 and 1 seconds
    time.sleep(random.uniform(0, 1))
    # Perform a simple calculation and return the result
    return x * x

if __name__ == '__main__':
    # Create a list of values to apply the function to
    data = list(range(10))
    print(data)

    # Create a Pool object with 4 processes
    # The Pool will be discarded when the block is exited
    with Pool(4) as p:
        # Apply the function f to each entry in the list
        # Each time a process in this Pool finishes the value it is working on,
        # it is given the next value, which provides a simple form of load balancing
        output = p.map(f, data)

    # Print the output
    print(output)
```

The code above can be found and run in [```03_multiprocessing_scripts/map_example.py```](03_multiprocessing_scripts/map_example.py). The code will create a `Pool` object with 4 processes and apply the function `f` to each entry in the list.

The random wait time represents how calculations in different processes may take different amounts of time to execute, which is common in parallel programming. As each process finishes working on one value, it will receive a new one to work on, so it's not possible to predict in advance which process will work on which value, and each process may work on a different number of values. For longer lists, `map` hands out the values in chunks rather than one at a time, but the same principle applies. Regardless of which process worked on which input value, the results will be returned in a list in the same order as the input list.
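
The ordering guarantee can be checked directly by having each call report which process ran it. The sketch below is a minimal illustration; the function name `square_with_pid` is an assumption made for this example and does not appear in the course scripts.

```python
import os
import random
import time
from multiprocessing import Pool


def square_with_pid(x):
    """Return the input, its square, and the ID of the process that computed it."""
    # A short random wait so that calls finish in an unpredictable order
    time.sleep(random.uniform(0, 0.2))
    return x, x * x, os.getpid()


if __name__ == '__main__':
    data = list(range(10))
    with Pool(4) as pool:
        results = pool.map(square_with_pid, data)

    # The inputs come back in their original order ...
    assert [x for x, _, _ in results] == data
    # ... even though the work was shared between several processes
    for x, squared, pid in results:
        print(f'{x} squared is {squared} (computed by process {pid})')
```

The process IDs printed will differ from run to run, but the first column should always count up from 0 to 9.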

The `starmap` method is similar to the `map` method but allows us to pass multiple arguments to a function. It receives a collection of collections, with each inner collection containing the arguments for one call to the function.

```python
from multiprocessing import Pool, current_process
import time
import random

# This is the function we want to apply to pairs of values in a list
def g(x, y):
    print(f'Process {current_process().name} is working on the values {x} and {y}')
    # Wait a random time between 0 and 1 seconds
    time.sleep(random.uniform(0, 1))
    # Perform a simple calculation and return the result
    return x * y

if __name__ == '__main__':
    # Create a list of values to apply the function to
    data = [(i, i+1) for i in range(10)]
    print(data)

    # Create a Pool object with 4 processes
    # The Pool will be discarded when the block is exited
    with Pool(4) as p:
        # Apply the function g to each set of arguments in a list
        output = p.starmap(g, data)

    # Print the output
    print(output)
```

The code above can be found and run in [```03_multiprocessing_scripts/starmap_example.py```](03_multiprocessing_scripts/starmap_example.py). The code will create a `Pool` object with 4 processes and apply the function `g` to each pair of values in the list.
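
One way to read `starmap` is that each inner collection is unpacked into the function's arguments, so the result should match a plain list comprehension that calls `g(*args)`. The sketch below checks this and uses `zip` to build the argument tuples from two separate lists. The helper `serial_starmap` and the lists `xs` and `ys` are illustrative names, not part of the course scripts.

```python
from multiprocessing import Pool


def g(x, y):
    """Multiply two values."""
    return x * y


def serial_starmap(func, arg_tuples):
    """Serial equivalent: unpack each tuple into the function's arguments."""
    return [func(*args) for args in arg_tuples]


if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    ys = [10, 20, 30, 40]
    # zip pairs the lists up into [(1, 10), (2, 20), (3, 30), (4, 40)]
    data = list(zip(xs, ys))

    with Pool(2) as pool:
        parallel = pool.starmap(g, data)

    print(parallel)                             # [10, 40, 90, 160]
    print(parallel == serial_starmap(g, data))  # True
```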

A `Pool` can be a simple and convenient way to distribute independent calculations across multiple processes.

For a discussion of the performance of `map` compared to `starmap`, see https://stackoverflow.com/questions/46172018/performance-of-map-vs-starmap

### Exercise: Approximating The Sine Function

In maths, the sine function can be approximated using the Taylor series:

$$
\sin(x) = \sum_{k=0}^{\infty}\frac{(-1)^{k}x^{2k + 1}}{(2k + 1)!} =x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \ldots
$$

where $n!$ is the factorial of $n$ (you can use `math.factorial` to calculate this). The approximation will be completely accurate if we include infinitely many terms, but we can get a good approximation for a small value of $x$ by including only a few terms.

Your task is to write a function which receives two arguments and calculates the approximation of the sine function. The first argument is the value of $x$ and the second argument is the number of terms to include in the approximation. For instance, if the first argument is 0.5 and the second is 3, the function should calculate the approximation:

$$
\sin(0.5) \approx 0.5 - \frac{0.5^3}{3!} + \frac{0.5^5}{5!}
$$

In this case, to seven decimal places, the value of $\sin(0.5)$ is 0.4794255 and the approximation with 3 terms is 0.4794271.
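
The arithmetic above can be checked with a short serial sketch. This is only a reference for testing your answer, not the `Pool`-based solution the exercise asks for, and the function names `sine_term` and `sine_serial` are assumptions made for this example.

```python
import math


def sine_term(x, k):
    """Return the k-th term (counting from k = 0) of the Taylor series for sin(x)."""
    return (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)


def sine_serial(x, n_terms):
    """Sum the first n_terms terms in a single process."""
    return sum(sine_term(x, k) for k in range(n_terms))


if __name__ == '__main__':
    approximation = sine_serial(0.5, 3)
    print(f'approximation with 3 terms: {approximation:.9f}')
    print(f'math.sin(0.5):              {math.sin(0.5):.9f}')
```

Because each term depends only on $x$ and $k$, the terms are independent of one another, which is what makes them suitable for distributing across processes.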

Your function should split up the calculation of each term in the approximation across 2 processes using a `Pool`. You should create a new file and write the code to do this. You may structure the code as you wish, but you should print the result of the approximation at the end of the code. Once you have written your code, test it with a few examples.

There is a sample solution in the file [`sample_solutions/sine_approximation.py`](sample_solutions/sine_approximation.py).