# Parallel loops through numpy arrays
We look at speeding up loops through numpy arrays. In this example we have to call a third-party library in each iteration, and this library only accepts a subset of our total array. Because the work happens inside a third-party library, we can't apply tricks like JIT compilation.

The scenario here is that we have a 3-dimensional array with dimensions (x,y,time). We will imagine that this is a time series of 2-dimensional maps of ocean salinity. Our third-party library is the seawater library. This seawater library only accepts 2-dimensional inputs so we need to loop through the time dimension and call this library on each iteration. 

# Libraries
In this example we will use the built-in [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) module together with the [Joblib](https://joblib.readthedocs.io/en/latest/) and [Dask](https://docs.dask.org/en/stable/) libraries. In the case of Dask we use the dask delayed API to parallelise the loop.

In [1]:
import numpy as np
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor,ThreadPoolExecutor

from joblib import Parallel,delayed
import dask
import dask.array as da

# Generate data
We generate the numpy array we're going to loop through. 

We use a three-dimensional array where we'll think of the dimensions as being `(x,y,time)`. The function that we're calling, however, can only accept two-dimensional inputs in `(x,y)` so we will loop through the `time` dimension.

In [2]:
def generateData(xyLength:int,timesteps:int):
    arr = np.random.standard_normal(size=(xyLength,xyLength,timesteps))
    return arr

arraySmall = generateData(xyLength=3,timesteps=3)    
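
As a quick check of the looping pattern, the sketch below builds a small array with made-up dimensions and shows that slicing along the last axis yields one two-dimensional `(x,y)` map per timestep. The names `demoArr`, `demoSlices` and `demoRebuilt` are illustrative and not part of the notebook.

```python
import numpy as np

# Illustrative array: 4 x 4 spatial grid, 5 timesteps (sizes chosen arbitrarily)
demoArr = np.random.standard_normal(size=(4, 4, 5))

# Slicing on the last axis gives one 2-D (x,y) map per timestep
demoSlices = [demoArr[:, :, timestep] for timestep in range(demoArr.shape[2])]
print(len(demoSlices), demoSlices[0].shape)

# Stacking the slices along axis 2 recovers the original array
demoRebuilt = np.stack(demoSlices, axis=2)
np.testing.assert_array_equal(demoRebuilt, demoArr)
```

This slice-then-stack round trip is the same shape of computation that every processing function in this notebook follows.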

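We define `timestepFunc`, the function that we are going to call in each iteration. It returns the time index alongside the result so that the outputs can be put back in order later.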

In [3]:
def timestepFunc(arrTimestep:np.ndarray,timeIndex:int):
    return np.exp(arrTimestep),timeIndex

#### First we create a baseline non-parallelised function to do sequential processing

In [4]:
def serialProcessing(arr:np.ndarray):
    return np.stack(
        [timestepFunc(arrTimestep=arr[:,:,timestep],timeIndex=timestep)[0] for timestep in range(arr.shape[2])],
        axis=2)

# Call the function
outputSerial = serialProcessing(arr=arraySmall)
# Check the outputs are what we expect
np.testing.assert_array_equal(outputSerial,np.exp(arraySmall))

#### The outputs of parallel functions are not always in the same order as the inputs

Before we create parallel functions we define a helper function that sorts the list of outputs using the time index variable we added to `timestepFunc`.

In [5]:
def sortResults(resultList:list):
    resultList = sorted(resultList,key=lambda x:x[1])
    resultList = [el[0] for el in resultList]
    return resultList
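
To see what the helper does, here is a minimal sketch with hand-made `(value, timeIndex)` tuples standing in for the outputs of `timestepFunc`. The list `shuffled` is hypothetical data, and the helper is repeated so that the snippet runs on its own.

```python
def sortResults(resultList: list):
    # Same logic as the helper above: sort on the time index, then drop it
    resultList = sorted(resultList, key=lambda x: x[1])
    return [el[0] for el in resultList]

# Hypothetical outputs as (value, timeIndex) tuples, arriving out of order
shuffled = [("c", 2), ("a", 0), ("b", 1)]
ordered = sortResults(resultList=shuffled)
print(ordered)  # ['a', 'b', 'c']
```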

# Parallel processing

## Concurrent.futures

The built-in `concurrent.futures` module is a great place to start with parallel processing. It comes with both a threading and a multiprocessing backend. The APIs for these backends are similar, so it's also easy to swap them out and compare them.

In [6]:
def concurrentProcessing(arr:np.ndarray,executor):
    """
    Iterate through the array `arr` in parallel with either the threading executor or the multiprocessing executor
    """
    # Set up the executor. We use a with statement here to ensure the pool of
    # threads/processes gets closed whether the jobs run successfully or not
    with executor() as pool:
        # Loop through the array and store the `futures` that we get from each iteration
        futuresList = [
            pool.submit(timestepFunc,arr[:,:,timestep],timestep)
            for timestep in range(arr.shape[2])
        ]
        # Gather the results as the tasks complete (completion order, not input order)
        resultList = [future.result() for future in concurrent.futures.as_completed(futuresList)]
    # Sort the results back into their original order
    resultList = sortResults(resultList=resultList)
    # Convert the list of results back into a three-dimensional numpy array
    return np.stack(resultList,axis=2)

# Run the function with the multiprocessing `ProcessPoolExecutor` and check that the outputs are the same as for the serial processing
outputConcurrent = concurrentProcessing(arr=arraySmall,executor=ProcessPoolExecutor)
np.testing.assert_array_equal(outputSerial,outputConcurrent)
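
If tracking the order by hand feels error-prone, `Executor.map` may be worth considering: it returns results in the order of its inputs, so no sorting step is needed. The sketch below is one way the loop could be rewritten, not the approach used above. `mapProcessing` and `demoArr` are hypothetical names, and a thread pool is the default here so that the snippet runs without a `__main__` guard.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def timestepFunc(arrTimestep: np.ndarray, timeIndex: int):
    # Same stand-in function as in the notebook
    return np.exp(arrTimestep), timeIndex

def mapProcessing(arr: np.ndarray, executor=ThreadPoolExecutor):
    nTimesteps = arr.shape[2]
    with executor() as pool:
        # map pairs each 2-D slice with its time index and yields
        # the results in input order, not completion order
        results = list(pool.map(
            timestepFunc,
            (arr[:, :, t] for t in range(nTimesteps)),
            range(nTimesteps),
        ))
    # Keep only the arrays; the time index is not needed for ordering here
    return np.stack([el[0] for el in results], axis=2)

demoArr = np.random.standard_normal(size=(3, 3, 4))
outputMap = mapProcessing(arr=demoArr)
np.testing.assert_array_equal(outputMap, np.exp(demoArr))
```

The trade-off is that `map` hands results back in input order, so one slow task delays the ones queued behind it, whereas `as_completed` lets you handle each result as soon as it finishes.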

## Joblib
Depending on the backend, the joblib library either wraps the built-in `multiprocessing` and `threading` libraries or uses its own process-pool implementation (`loky`). There are some other differences, such as:
- a different way of writing the code that you might find more readable
- you can press ctrl-c (= hitting the stop button in a notebook) to interrupt execution of the parallel jobs
- the ability to use shared memory for large numpy arrays

We test each time to make sure that the outputs are the same as for the serial processing.

In [7]:
def joblibProcessing(arr:np.ndarray,backend = "threading",maxNbytes=1,nJobs:int=-1):
    resultList = Parallel(backend=backend,max_nbytes=maxNbytes,n_jobs=nJobs)(
        delayed(timestepFunc)(arr[:,:,timestep],timestep) for timestep in range(arr.shape[2])
    )
    resultList = sortResults(resultList=resultList)
    return np.stack(resultList,axis=2)

outputJoblib = joblibProcessing(arr=arraySmall)
np.testing.assert_array_equal(outputSerial,outputJoblib)

## Dask
Finally we create a function for processing using dask delayed.

In [8]:
def daskDelayedProcessing(arr:np.ndarray):
    resultList = []
    for timestep in range(arr.shape[2]):
        resultList.append(
            dask.delayed(timestepFunc,pure=False)(arr[:,:,timestep],timestep)
        )
    resultList = dask.compute(*resultList)
    resultList = sortResults(resultList=resultList)
    return np.stack(resultList,axis=2)

outputDask = daskDelayedProcessing(arr=arraySmall)
np.testing.assert_array_equal(outputSerial,outputDask)

# Timings
We generate a larger array and time a single run of each approach.

In [None]:
xyLength = 200
timesteps = 2000
arrayLarge = generateData(xyLength=xyLength,timesteps=timesteps)    

In [None]:
%timeit -n 1 -r 1 serialProcessing(arr=arrayLarge)

In [None]:
%timeit -n 1 -r 1 concurrentProcessing(arr=arrayLarge,executor=ProcessPoolExecutor)

In [None]:
%timeit -n 1 -r 1 concurrentProcessing(arr=arrayLarge,executor=ThreadPoolExecutor)

In [None]:
%timeit -n 1 -r 1 daskDelayedProcessing(arr=arrayLarge)

In [None]:
%timeit -n 1 -r 1 joblibProcessing(arr=arrayLarge,backend="threading")
%timeit -n 1 -r 1 joblibProcessing(arr=arrayLarge,backend="loky")
%timeit -n 1 -r 1 joblibProcessing(arr=arrayLarge,backend="multiprocessing")
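
The `%timeit` magic is only available in IPython and Jupyter. In a plain Python script, a rough equivalent of `-n 1 -r 1` is a single wall-clock measurement with `time.perf_counter`. The sketch below assumes a hypothetical helper `timeOnce` and uses a simplified stand-in for `serialProcessing` on a small array so that it runs quickly.

```python
import time
import numpy as np

def timeOnce(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def serialProcessing(arr: np.ndarray):
    # Simplified stand-in for the notebook's serial baseline
    return np.stack([np.exp(arr[:, :, t]) for t in range(arr.shape[2])], axis=2)

# Small illustrative array; the notebook itself times a much larger one
demoArr = np.random.standard_normal(size=(50, 50, 20))
output, seconds = timeOnce(serialProcessing, arr=demoArr)
print(f"serialProcessing took {seconds:.4f} s")
```

A single run is noisy, so for small workloads it is usually better to repeat the measurement several times and compare the fastest runs.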

In [None]:
%load_ext line_profiler

In [None]:
%lprun -f daskDelayedProcessing daskDelayedProcessing(arr=arrayLarge)

In [None]:
%timeit daskDelayedProcessing(arr=arrayLarge)

In [None]:
import ray
array_id = ray.put(arrayLarge)  # place the array in Ray's object store

In [None]:
%timeit -n 1 -r 1 ray.get(array_id)[:,:,0]