# Parallel Optimization in Python

Parallelization allows you to speed up your code by running multiple operations at the same time, taking advantage of multiple CPU cores. In Python, this is especially useful for CPU-bound tasks, where the main bottleneck is computation rather than I/O.

We will cover two popular approaches:
1. `concurrent.futures` (standard library)
2. `joblib` (external library, often used in scientific computing)

---

## 1. Why Parallelize?

In CPython, the Global Interpreter Lock (GIL) means that only one thread executes Python bytecode at a time. However, you can use **process-based parallelism** to bypass the GIL for CPU-bound tasks: multiple Python interpreter processes are created, each with its own GIL, so they can run on separate cores at the same time.

Of course, the first step is to profile your code and consider how parallelisable your algorithm is. Are there dependencies between steps in your code (i.e. step 2 depends on the result of step 1, etc.)? How much of your code can be parallelised, and how long do you think it would take?
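
One rough way to reason about "how much can be parallelised" is Amdahl's law, which bounds the speedup when only a fraction of the runtime can be spread over several workers. The sketch below is illustrative only: the function name `amdahl_speedup` and the 90% figure are assumptions, and the formula ignores the real cost of starting processes and moving data between them.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Upper bound on speedup when `parallel_fraction` of the runtime
    (between 0 and 1) is split perfectly across `n_workers`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Hypothetical program where 90% of the runtime is parallelisable.
for n in (1, 2, 4, 8):
    print(f"{n} workers: at most {amdahl_speedup(0.9, n):.2f}x faster")
```

With these assumed numbers, 8 workers give at most about a 4.7x speedup, because the serial 10% still has to run on one core.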

Below, we provide code for two different tasks. The first is the familiar Monte Carlo algorithm for estimating $\pi$. The second is Euler's method for solving an ODE ([if you're not familiar](https://en.wikipedia.org/wiki/Euler_method)).


In [1]:
import numpy as np

In [None]:
import random

def monte_carlo_pi(num_samples):
    inside_circle = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            inside_circle += 1
    return (inside_circle / num_samples) * 4

In [None]:
def euler_ode(f, y0, t0, t1, dt):
    """Solve the ODE dy/dt = f(t, y) using Euler's method.

    f is the function, y0 the initial value, t0 the start time,
    t1 the end time, and dt the time step.
    """
    n_steps = int((t1 - t0) / dt)
    t = np.linspace(t0, t1, n_steps + 1)
    y = np.zeros(n_steps + 1)
    y[0] = y0
    for i in range(n_steps):
        y[i + 1] = y[i] + f(t[i], y[i]) * dt
    return t, y


**Q1.1 How easy is it to parallelise `monte_carlo_pi()`? How easy do you think it is to parallelise `euler_ode()`?**

<details>
<summary>I need a hint</summary>

Consider the dependencies between steps in the two functions.
<details>
<summary>I need a bigger hint</summary>

`monte_carlo_pi` generates lots of *independent* random numbers then tests whether they're within the boundary. `euler_ode` calculates one derivative, then updates *based on the derivative*, then calculates the next.
</details>
</details>

### How many CPU cores do you have?
If you're not sure, you can use `import os; print(os.cpu_count())` in a REPL or Python script to find out.

**Important!** Almost everyone has a laptop with more than one CPU core nowadays. If you only have 1 CPU, you won't see any speedup in this section. Please talk to me.
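
On a shared machine or cluster, the number of CPUs your process is allowed to use can be smaller than what `os.cpu_count()` reports. The snippet below is a small sketch of checking both; `os.sched_getaffinity` is only available on some Unix platforms (Linux, for example), so the code falls back to `os.cpu_count()` elsewhere. The variable names are just for illustration.

```python
import os

# cpu_count() can return None if the count cannot be determined.
logical_cpus = os.cpu_count() or 1

# sched_getaffinity(0) returns the set of CPUs the current process may run on,
# but it does not exist on every platform, so check for it first.
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus

print(f"Logical CPUs: {logical_cpus}, usable by this process: {usable_cpus}")
```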


## 2. Parallelization with `concurrent.futures`

The `concurrent.futures` module provides a high-level interface for asynchronous execution using threads *or* processes. The nice part is that it handles process/thread cleanup for you, and of course it is in the standard library so you don't need to download anything (sometimes a concern). If you need lower-level control or advanced features, check out the `multiprocessing` module in the Python standard library.

To use `concurrent.futures` (after importing), we use a `ProcessPoolExecutor`, which will handle all the work behind the scenes. There are a few things to note:
1. By default this uses as many worker processes as there are CPUs reported by `os.cpu_count()`, i.e. all of the available CPUs. Set the `max_workers` kwarg to use fewer.
2. Only objects that can be *pickled* (i.e. serialised) can be sent to and returned from worker processes. Top-level functions are usually picklable, but lambdas, local functions, open file handles etc. are not. If you're unsure about an object, you can try to `pickle.dumps()` it: if it doesn't raise an exception it should be fine.
3. `ProcessPoolExecutor` does not work in interactive interpreters (like notebooks), because the `__main__` module needs to be importable by the workers. In notebooks, `__main__` is not a real, importable module like it is in scripts.

As such, the rest of this section will describe code, but the main work should be done in `.py` scripts and called from the CLI.
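
The pickling check from point 2 can be sketched as follows. The helper `is_picklable` and the function `top_level_square` are hypothetical names introduced here for illustration; only `pickle.dumps()` comes from the standard library.

```python
import pickle


def top_level_square(x):
    return x * x


def is_picklable(obj):
    """Return True if pickle.dumps() accepts the object, False otherwise."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


# A module-level function is pickled by reference to its name,
# so this is expected to succeed when run as a script.
print("top-level function:", is_picklable(top_level_square))

# A lambda has no importable name, so pickling it fails.
print("lambda:", is_picklable(lambda x: x * x))
```

If an object fails this check, one common workaround is to move the logic into a named, module-level function and pass plain data (numbers, strings, lists) as arguments instead.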


### Example: Parallel Square Calculation

```python
import time
from concurrent.futures import ProcessPoolExecutor

def slow_square(x):
    time.sleep(1)
    return x * x

if __name__ == "__main__":
    t1 = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(slow_square, range(5)))
    t2 = time.perf_counter()
    print(f"Parallel execution took {t2 - t1:.4f} seconds")
    print("Results:", results)
```


**Q1.2 How fast does this code run serially, and in parallel?**

<details>
<summary>I need a hint</summary>

Copy the code to a `.py` file and run it with no concurrency, and with concurrency. Time it.
</details>


### Using `ThreadPoolExecutor` vs `ProcessPoolExecutor`

- `ThreadPoolExecutor`: Best for I/O-bound tasks (e.g., network requests, file I/O).
- `ProcessPoolExecutor`: Best for CPU-bound tasks (e.g., numerical computation).

**Q1.3: Try swapping `ProcessPoolExecutor` for `ThreadPoolExecutor`. What happens?**

### Submitting Tasks Individually

You can also submit tasks one at a time and collect results as they finish:

```python
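import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def slow_cube(x):
    time.sleep(1)
    return x ** 3

if __name__ == "__main__":  # guard so worker processes can import this script safely
    with ProcessPoolExecutor() as executor:
        # submit() returns a Future immediately; the work runs in a worker process
        futures = [executor.submit(slow_cube, i) for i in range(5)]
        # as_completed() yields futures in the order they finish, not the order submitted
        for future in as_completed(futures):
            print(future.result())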
```
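
Because `as_completed` yields futures in completion order, you lose track of which input produced which result unless you record it. A common pattern, sketched below, is to keep a dictionary from each future to its input. The names `cube` and `cubes_by_input` are hypothetical, and the `executor_class` parameter is an assumption added so the same function can be tried with either executor type.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def cube(x):
    return x ** 3


def cubes_by_input(inputs, executor_class=ProcessPoolExecutor):
    """Return {input: cube(input)}, filled in as each task completes."""
    results = {}
    with executor_class() as executor:
        # Remember which input each future belongs to.
        future_to_input = {executor.submit(cube, x): x for x in inputs}
        for future in as_completed(future_to_input):
            x = future_to_input[future]
            # result() re-raises any exception raised inside the worker.
            results[x] = future.result()
    return results
```

In a script, call `cubes_by_input(range(5))` from under an `if __name__ == "__main__":` guard, as in the examples above. Passing `executor_class=ThreadPoolExecutor` runs the same code on threads.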


### Exercise
We're going to parallelise the `monte_carlo_pi` function, which uses the Monte Carlo algorithm to estimate $\pi$ by estimating the proportion of random points generated in the square $[-1, 1]^2$ which fall within the unit circle.

You'll want to copy the previously provided code to a separate script, if you haven't already.


**Q1.4 Modify the provided Monte Carlo code to be parallel. How long does it take when you use 2 cores, vs 1? How does the speed scale as the number of cores increases?**

<details>
<summary>I need a hint</summary>

Copy the code to a script, and use the `concurrent.futures` library to run in parallel. 
<details>
<summary>I need a bigger hint</summary>

You'll want to edit the code so it doesn't return the estimate of $\pi$ directly, and use `ProcessPoolExecutor` as shown above. Collate the number of points within the unit circle, and calculate $\pi$.
<details>
<summary>Give me the code</summary>

```python
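import random
import time
from concurrent.futures import ProcessPoolExecutor

def monte_carlo_pi(num_samples):
    # Count how many random points in [-1, 1] x [-1, 1] land inside the unit circle.
    inside_circle = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            inside_circle += 1
    return inside_circle

def parallel_monte_carlo_pi(num_samples, num_processes):
    samples_per_process = num_samples // num_processes
    with ProcessPoolExecutor(max_workers=num_processes) as executor:
        results = list(executor.map(monte_carlo_pi, [samples_per_process] * num_processes))
    print(f"{len(results)} processes completed.")
    # Divide by the samples actually drawn, in case num_samples is not a multiple of num_processes.
    total_samples = samples_per_process * num_processes
    return (sum(results) / total_samples) * 4

if __name__ == "__main__":
    # Example values only: change the sample count and number of processes to compare timings.
    t1 = time.perf_counter()
    estimate = parallel_monte_carlo_pi(10_000_000, 2)
    t2 = time.perf_counter()
    print(f"pi ~ {estimate:.5f} in {t2 - t1:.2f} seconds")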
```

</details>
</details>
</details>



## 3. Parallelization with `joblib`

`joblib` is a third-party library that offers a simpler interface for parallelization.

Returning to our example:

```python
from joblib import Parallel, delayed
import time

def slow_square(x):
    time.sleep(1)
    return x * x

results = Parallel(n_jobs=2)(delayed(slow_square)(i) for i in range(5))
print(results)
```

- `delayed` is a function provided by `joblib` that wraps our function so it can be scheduled, rather than executing immediately.
    - `for i in range(5)` creates 5 delayed tasks.
- `Parallel(n_jobs=2)` creates a parallel executor, which in this case uses 2 worker processes.
    - `n_jobs` specifies the number of processes to use (`-1` uses all available CPUs).
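
To make the idea of "wrapping a call so it can be scheduled later" concrete, here is a toy stand-in written with the standard library only. It is not `joblib`'s implementation, which may differ in its details. `toy_delayed` and `square` are hypothetical names, and the tasks are run serially at the end just to show that nothing executes until the scheduler decides to.

```python
import functools


def toy_delayed(func):
    """Toy stand-in for the idea behind `delayed`: capture a call instead of running it."""
    @functools.wraps(func)
    def capture(*args, **kwargs):
        # Return a description of the call; func itself is NOT executed here.
        return func, args, kwargs
    return capture


def square(x):
    return x * x


# Build five task descriptions. No squaring has happened yet.
tasks = [toy_delayed(square)(i) for i in range(5)]

# A scheduler can now run the captured calls whenever (and wherever) it likes.
# Here we simply run them one after another.
results = [func(*args, **kwargs) for func, args, kwargs in tasks]
print(results)  # [0, 1, 4, 9, 16]
```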

**Q2.1: I hope you agree that the `slow_square(x)` function should take ~1 second per call. How long do you think the parallel example will take? Try it. Were you correct?**

<details>
<summary>I need a hint</summary>

Copy the code to a `.py` file and run it with timing. 
</details>




## 4. Exercise

**Q2.2 Use `joblib` to parallelise the `monte_carlo_pi` function. Which version do you prefer?**

<details>
<summary>I need a hint</summary>

Copy the code to a script, and use the `joblib` library to run in parallel.
<details>
<summary>I need a bigger hint</summary>

You'll want to edit the code so it doesn't return the estimate of $\pi$ directly, and use `delayed` and `Parallel` as shown above. Collate the number of points within the unit circle, and calculate $\pi$.
<details>
<summary>Give me the code</summary>

```python
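from joblib import Parallel, delayed

# Assumes the counting version of monte_carlo_pi() from the concurrent.futures
# exercise (the one returning inside_circle) is defined or imported in this script.

def parallel_monte_carlo_pi(num_samples, num_processes):
    samples_per_process = num_samples // num_processes
    results = Parallel(n_jobs=num_processes)(
        delayed(monte_carlo_pi)(samples_per_process) for _ in range(num_processes)
    )
    # Divide by the samples actually drawn, in case num_samples is not a multiple of num_processes.
    total_samples = samples_per_process * num_processes
    return (sum(results) / total_samples) * 4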
```

</details>
</details>
</details>

# Final Exercise

We will end here, as we have covered a fair few different techniques for speeding up code. Your final exercise, assuming there is enough time, is to *either*:

1. Work on your own code. Profile it, see if and where you can apply the things you have learned in this workshop, and measure any speedup that results.
2. Work on the `unique_paths.py` code. This code computes the number of unique paths from the top-left to the bottom-right of an `m * n` grid, when you can only move right or down. Time and profile this, and see how fast you can make it.