<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg" 
     width="30%" 
     align=right
     alt="Dask logo">

Embarrassingly Parallel Workloads
--------------------------------------------------

This notebook shows how to use [dask.delayed](http://dask.pydata.org/en/latest/delayed.html) or the Futures interface to parallelize generic Python code.

This example focuses on using Dask to build large embarrassingly parallel computations, as often seen in scientific communities and on High Performance Computing facilities, for example with Monte Carlo methods. This kind of simulation assumes the following:
 - we have a function that runs a heavy computation given some parameters,
 - we need to compute this function on many different input parameters, each function call being independent,
 - we want to gather all the results in one place for further analysis.
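
As a rough illustration of this pattern in plain Python (no Dask yet), the sketch below estimates pi by Monte Carlo. The function name `estimate_pi`, the seeds, and the sample count are made up for this example; each call depends only on its own seed, which is what makes the workload embarrassingly parallel.

```python
import random

def estimate_pi(seed, n_samples=10_000):
    """Hypothetical 'heavy' function: one independent Monte Carlo run."""
    rng = random.Random(seed)  # private generator, so runs do not interfere
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_samples

# Many different inputs, each call independent of the others.
seeds = range(8)
estimates = [estimate_pi(seed) for seed in seeds]

# Gather everything in one place for further analysis.
pi_hat = sum(estimates) / len(estimates)
print(pi_hat)
```

Because no call reads another call's result, the list comprehension could be swapped for any parallel map without changing the answer.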

## Start Dask Client for Dashboard

Starting the Dask Client provides a dashboard, which
is useful for gaining insight into the computation.  We will also need it for the
Future API part of this example. Moreover, as this kind of computation
is often launched on a supercomputer or in the cloud, you will probably end
up having to start a cluster and connect a client to it. See
[dask-jobqueue](https://github.com/dask/dask-jobqueue),
[dask-kubernetes](https://github.com/dask/dask-kubernetes) or
[dask-yarn](https://github.com/dask/dask-yarn) for easy ways to achieve this.

The link to the dashboard will become visible when you create the client below.  We recommend having it open on one side of your screen while using your notebook on the other side.  It can take some effort to arrange your windows, but seeing them both at the same time is very useful when learning.

In [None]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=4, n_workers=1)
client

## Define your computation calling function

This function does a simple operation: it adds all the numbers of a list/array together, but it sleeps for a random amount of time to simulate real work. In real use cases, it could call another Python module, or even run an executable using the subprocess module.

In [None]:
import time
import random

def costly_simulation(list_param):
    time.sleep(random.random())
    return sum(list_param)

We can try it:

In [None]:
%%time
import numpy as np
result = costly_simulation([1,2,3,4])
print("Result = %s" % result)
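
The text above mentions that a real function might run an executable through the `subprocess` module. Here is a minimal sketch of what that could look like. The name `run_external_simulation` is hypothetical, and the "executable" is just the current Python interpreter running a one-line script, standing in for a real simulation binary.

```python
import subprocess
import sys

def run_external_simulation(list_param):
    """Hypothetical wrapper: pass parameters to an external program and parse its output."""
    # Stand-in for a real executable: a one-line script that sums its arguments.
    script = "import sys; print(sum(float(a) for a in sys.argv[1:]))"
    args = [str(value) for value in list_param]
    completed = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise if the program exits with an error
    )
    return float(completed.stdout)

print(run_external_simulation([1, 2, 3, 4]))
```

A wrapper like this keeps the same call signature as `costly_simulation`, so it could be used in its place in the rest of the notebook.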

## Define the set of input parameters to call the function

We will generate a set of inputs on which we want our simulation evaluated. Here we use a pandas DataFrame, but we could have used a simple list too. Let's say that our simulation is run with four parameters called param_[a-d].

In [None]:
import pandas as pd
import numpy as np

input_params = pd.DataFrame(np.random.random(size=(1000, 4)),
               columns=['param_a', 'param_b', 'param_c', 'param_d'])
input_params.head()

We can now call our simulation on all of these parameters with basic Python code.
Note that this is not very efficient, as the calls are independent and could easily run in parallel. Using a module
like multiprocessing is a first step, and Dask is a second one for distributing work across nodes.

Let's only do this on a sample of our parameters, as it would take quite long otherwise.

In [None]:
%%time
result = []
for args in input_params.values[:10]:
    result.append(costly_simulation(args))
print(result)
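
As a sketch of that "first step" on a single machine, the standard library can already run the loop concurrently. The example below uses a thread pool from `concurrent.futures` rather than `multiprocessing` (an assumption made here for simplicity, since sleeping releases the GIL and functions defined in a notebook can be awkward to send to other processes). The function is redefined with a shorter sleep and toy inputs so that the block is self-contained.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def costly_simulation(list_param):
    time.sleep(random.random() * 0.01)  # shortened sleep for this sketch
    return sum(list_param)

# Toy inputs standing in for rows of input_params.
params = [[i, i + 1, i + 2, i + 3] for i in range(10)]

# pool.map returns results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(costly_simulation, params))

print(results)
```

This stays within one machine; Dask follows the same idea but can spread the calls over many nodes.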

## Use Dask Delayed to make our function lazy

We can call `dask.delayed` on our function to make it lazy.  Rather than computing its result immediately, it records what we want to compute as a task in a graph that we'll run later on parallel hardware.

Calling these lazy functions is now almost free.  We're just constructing a (simple) graph.

In [None]:
%%time
import dask
delayed = []
for args in input_params.values[:10]:
    delayed.append(dask.delayed(costly_simulation)(args))
print(delayed[0])
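
`dask.delayed` can also be applied as a decorator, and delayed results can be passed to other delayed calls. The sketch below uses a toy function (`add_all`, a hypothetical name) and the single-threaded `synchronous` scheduler, so it runs without a cluster; nothing is computed until `.compute()` is called.

```python
import dask

@dask.delayed
def add_all(values):
    return sum(values)

# Building the list only records tasks; no sum has been evaluated yet.
lazy = [add_all([i, i + 1]) for i in range(3)]

# A final task that depends on the three lazy results.
total = dask.delayed(sum)(lazy)

# Only now is the graph executed.
result = total.compute(scheduler="synchronous")
print(result)
```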

## Run in parallel

Call `dask.compute()` (or `.compute()` on a single delayed object) when you want your result as a normal Python object.

If you started `Client()` above, you may want to watch the status page during the computation.

In [None]:
%%time
result = dask.compute(*delayed)
print(result)

We can now run this on all of our input parameters:

In [None]:
%%time
delayed = []
for args in input_params.values:
    delayed.append(dask.delayed(costly_simulation)(args))
    
futures = dask.persist(*delayed)  # trigger computation in the background

To make this go faster, add additional workers.

(Although we're still only working on our local machine here, this is more practical when using an actual cluster.)

In [None]:
client.cluster.scale(10)  # ask the local cluster for ten workers in total

By looking at the Dask dashboard, we can see that Dask spreads this work around our cluster, managing load balancing, dependencies, etc.

Then get the result:

In [None]:
result = dask.compute(*futures)
result[:5]

## Using Future API

The same example can be written with the Future API, using the client object itself. It is a more explicit way of submitting computations to a cluster.

In [None]:
%%time
futures = []
for args in input_params.values[:10]:
    futures.append(client.submit(costly_simulation, args))
    
result = client.gather(futures)
print(result)

The code above can be written in fewer lines with the `client.map()` function, which calls a given function on each item of a list of parameters.

As with delayed, we can start the computation without waiting for the results by not calling `client.gather()` right away.

In [None]:
%%time
futures = client.map(costly_simulation, input_params.values)

Then just get the results later:

In [None]:
results = client.gather(futures)
print(len(results))
print(results[0])

We encourage you to keep the [dashboard's status page](../proxy/8787/status) open to follow the ongoing computation.

## Doing some analysis on the results

One benefit of Dask here, besides the simplicity of its API, is that you can gather the results of all your simulations in one call. There is no need to implement a complex mechanism or to write individual results to a shared file system or object store.

Just get your results and do some computation.

Here, we will just get the results and add them to our initial dataframe to get a nice view of parameters versus results for our computation.

In [None]:
output = input_params.copy()
output['result'] = pd.Series(results, index=output.index)
output.sample(5)

Then we can make some statistical plots or save the results locally using the pandas interface:

In [None]:
%matplotlib inline
output['result'].plot()

In [None]:
output['result'].mean()

In [None]:
filtered_output = output[output['result'] > 2]
print(len(filtered_output))
filtered_output.to_csv('/tmp/simulation_result.csv')

## Handling very large simulations

The methods above work really well for up to about 100,000 input parameters. Above that, the Dask scheduler has trouble handling the number of tasks it must schedule on the workers.

In this case, one solution is described in the [Avoid too many tasks](http://dask.pydata.org/en/latest/delayed-best-practices.html#avoid-too-many-tasks) section of the delayed documentation: using the Bag API.

We just need to convert our input_params sequence into a dask.bag collection, asking for fewer partitions than inputs (at most 100,000, which is already a lot), and apply our function to every item of the bag.

In [None]:
%%time
import dask.bag as db
b = db.from_sequence(list(input_params.values), npartitions=100)
b = b.map(costly_simulation)
results_bag = b.compute()

Looking at the dashboard, you should see only 100 tasks instead of 1000, each taking about 10 times longer on average, because each one calls our function 10 times.

In [None]:
# Compare the two result sets element by element.
np.allclose(results, results_bag)
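
To make the partition arithmetic concrete, here is a small sketch with toy inputs and a toy function (`simulate`, a hypothetical name, without the sleep). It assumes the single-threaded `synchronous` scheduler so that it runs without a cluster, and checks that 1000 items split into 100 partitions still give one result per item, in input order.

```python
import dask.bag as db

def simulate(params):
    return sum(params)  # toy stand-in for costly_simulation

# 1000 toy parameter sets.
inputs = [[i, i + 1] for i in range(1000)]

# 100 partitions, so each partition holds 10 inputs.
bag = db.from_sequence(inputs, npartitions=100)
print(bag.npartitions)

# One result per input, computed partition by partition.
results_toy = bag.map(simulate).compute(scheduler="synchronous")
print(len(results_toy))
```

Choosing `npartitions` is a trade-off: fewer partitions mean less scheduler overhead, but also coarser load balancing across workers.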