<img src="images/dask_horizontal.svg" align="right" width="30%">

# Table of Contents
* [Distributed](#Distributed)
	* [Making a cluster](#Making-a-cluster)
		* [Detailed method](#Detailed-method)
		* [Simple method](#Simple-method)
	* [Executing with the distributed client](#Executing-with-the-distributed-client)
	* [Distributed futures](#Distributed-futures)


# Distributed

As we saw in Foundations, dask allows you to construct graphs of tasks with dependencies. In fact, if you skip forward, you will find that graphs can also be created automatically for you using functional, NumPy or Pandas syntax on data collections. None of this would be very useful without a way to execute these graphs in a parallel and memory-aware way. Dask comes with four available schedulers:
- dask.threaded.get: a scheduler backed by a thread pool
- dask.multiprocessing.get: a scheduler backed by a process pool
- dask.async.get_sync: a synchronous scheduler, good for debugging
- distributed.Client.get: a distributed scheduler for executing graphs on multiple machines.

To select one of these for a computation, you can specify it at the time of asking for a result
```python
myvalue.compute(get=dask.async.get_sync)  # for debugging
```

or set the current default, either temporarily or globally
```python
with dask.set_options(get=dask.multiprocessing.get):
    # set temporarily for this block only
    myvalue.compute()

dask.set_options(get=dask.multiprocessing.get)
# set until further notice
```

For single-machine use, the threaded and multiprocessing schedulers are fine choices. However, for scaling out work across a cluster, the distributed scheduler is required. Indeed, this is now generally preferred for all work, because it gives you additional monitoring information not available in the other schedulers. (Some of this monitoring is also available with an explicit progress bar and profiler, see [here](http://dask.pydata.org/en/latest/diagnostics.html).)
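To make concrete what a scheduler's `get` function receives, here is a minimal sketch of a task graph written as a plain dictionary, together with a toy synchronous executor. The helper `toy_get` and the key names are hypothetical and written only for illustration. This is not how any of the dask schedulers is implemented, and it ignores parallelism and memory management entirely.

```python
from operator import add


def inc(x):
    return x + 1


# A task graph as a plain dict: each key maps either to a literal value
# or to a tuple of (function, *arguments). String arguments that are keys
# of the graph refer to the results of other tasks.
graph = {
    "x": 1,
    "y": (inc, "x"),
    "z": (add, "y", 10),
}


def toy_get(dsk, key, cache=None):
    """Toy synchronous executor (hypothetical helper, not dask's code)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, args = task[0], task[1:]
        # Resolve arguments that name other tasks before calling func.
        resolved = [
            toy_get(dsk, a, cache) if isinstance(a, str) and a in dsk else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task
    cache[key] = result
    return result


result = toy_get(graph, "z")
print(result)
```

A real scheduler does the same dependency resolution, but additionally decides which independent tasks can run at the same time and when intermediate results can be released from memory.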

## Making a cluster

### Detailed method

The following process explains what happens under the hood when setting up a computation environment with the dask distributed scheduler *by hand*. It is not necessary to do this for the rest of the tutorial, but understanding what is going on will help a great deal when scaling up computations across a cluster. Users may wish to skip this section for now, and continue with the Simple method, below.

**The scheduler**
In a terminal, type the following:
```
dask-scheduler
```

You will get text something like the following:
```
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO -   Scheduler at:         192.168.0.11:8786
distributed.scheduler - INFO -       bokeh at:         192.168.0.11:8788
distributed.scheduler - INFO -        http at:         192.168.0.11:9786
distributed.bokeh.application - INFO - Web UI: http://192.168.0.11:8787/status/
distributed.scheduler - INFO - -----------------------------------------------
```

The top line gives the address at which the scheduler is waiting for connections - it is this address that workers and clients need to be given (your IP and/or port numbers may be different). The further addresses are for an in-process bokeh graph server for scheduler debugging, a JSON http endpoint for information about the server, and, finally, the URL of the main monitoring dashboard; you can type this into a web-browser, but it will not show much information yet.

The scheduler cannot do much without workers. We can create a worker process with:
```
dask-worker 192.168.0.11:8786
```

where the address should be the same as given by the scheduler process, above. By default, the worker will start a monitoring process (the *nanny*), and a worker process with the number of threads equal to the number of cores (all the values can be changed). The worker has its own http and bokeh server. From the text displayed in the console, we see that the worker connects to the scheduler - information is also printed by the scheduler indicating that it has received a connection from a worker. Notice that this worker process could have been on a different machine from the scheduler.

Next, in a new python session (perhaps in the notebook, or another console), we can do
```python
import distributed
c = distributed.Client('192.168.0.11:8786')
```

to connect to the scheduler. Again, the address must match that of the scheduler, above, and, again, the scheduler logs the connection from the client. This client is now ready to accept work and to coordinate with the scheduler such that tasks get executed by the threads of the worker process.

The three python-running consoles might look something like the following:
![distributed session](images/distributed_session.png)

A similar method can be used to set up the scheduler and workers across a number of cluster nodes, and connect to them from a client to do work. There are some automated options for achieving this, including for resource management and dynamic clustering scenarios, see [here](http://distributed.readthedocs.io/en/latest/setup.html).

### Simple method

For the remainder of this tutorial, we will be using the default dask distributed cluster. This gets created automatically when creating a client with no arguments, if no client has yet been defined. Creating any distributed client also sets it to be the default executor of dask `compute` calls, unless otherwise specified.

In [None]:
import distributed
c = distributed.Client()
c

The scheduler is listening on the local loopback network, and has a number of worker processes connected. Furthermore, the web UI will be available on `127.0.0.1:8787/status` - you can open this in a new tab of your browser. Other monitoring output is also available, e.g., `/tasks`.
![ui](images/ui.png)

No tasks are yet being processed, and no data is held in the memory of the workers, so the lower part of the display is empty for the moment.

## Executing with the distributed client

Consider some trivial calculation, as in previous sections, where we have added sleep statements in order to simulate real work being done.

In [None]:
from dask import delayed
import time

def inc(x):
    time.sleep(5)
    return x + 1

def dec(x):
    time.sleep(3)
    return x - 1

def add(x, y):
    time.sleep(7)
    return x + y

total = delayed(add)(delayed(dec)(7), delayed(inc)(6))
total.compute()

The tasks will appear in the web UI as they are processed by the cluster and, eventually, a result will be printed as output of the cell above. Note that the kernel is blocked while waiting for the result. The resulting task block graph might look something like the one below. Hovering over each block shows which function it relates to, and how long it took to execute. ![this](images/tasks.png)
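As a rough check on what the task plot should show, the wall-clock time of this small graph can be estimated from the sleep durations alone. The sketch below assumes that at least two worker threads are free, so that `inc` and `dec` can run at the same time, and it ignores scheduling and communication overhead. The names `durations`, `dependencies` and `finish_time` are hypothetical and used only for illustration.

```python
# Durations in seconds, taken from the sleep calls in inc, dec and add.
durations = {"dec": 3, "inc": 5, "add": 7}
# add needs the results of both dec and inc; those two are independent.
dependencies = {"dec": [], "inc": [], "add": ["dec", "inc"]}


def finish_time(task):
    """Earliest finish time of a task if independent tasks run in parallel."""
    start = max((finish_time(d) for d in dependencies[task]), default=0)
    return start + durations[task]


parallel_estimate = finish_time("add")      # max(3, 5) + 7
serial_estimate = sum(durations.values())   # 3 + 5 + 7
print(parallel_estimate, serial_estimate)
```

Under these assumptions the graph takes about 12 seconds when `inc` and `dec` overlap, compared with 15 seconds if the three functions ran one after another.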

If all you want to do is execute computations created using delayed, or run calculations based on the higher-level data collections (see the coming sections), then that is about all you need to know to scale your work up to cluster scale. However, there is more detail to know about the distributed scheduler that will help with efficient usage.

## Distributed futures

The above example showed that executing a calculation (created using delayed) with the distributed executor is identical to any other executor. However, we now have access to additional functionality, and control over what data is held in memory.

To begin, the `futures` interface (derived from the built-in `concurrent.futures`) allows map-reduce-like functionality. We can submit an individual function for evaluation with one set of inputs using `submit()`, or evaluate it over a sequence of inputs using `map()`. Notice that the call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". There is no blocking of the local python session.

In [None]:
fut = c.submit(inc, 1)
fut

In [None]:
# the function runs on the cluster for a while; we have a local handle to
# that work
fut

In [None]:
# grab the information back
c.gather(fut)

In [None]:
# pass a sequence for evaluation - gives a list of futures
futs = c.map(inc, [1, 2, 3, 4, 5])
# futures can be passed to other calls - data is not downloaded, but
# used in-place
tot = c.submit(sum, futs)
tot

In [None]:
c.gather(tot)

Here we see an alternative way to execute work on the cluster: when you submit or map with the inputs as futures, the *computation moves to the data* rather than the other way around, and the client, in the local python session, need never see the intermediate values. This is similar to building the graph using delayed, and indeed, delayed can be used in conjunction with futures. Here we use the delayed object `total` from above.

In [None]:
# notice the difference from total.compute()
fut = c.compute(total)
c.gather(fut)
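Since the futures interface is derived from the standard library's `concurrent.futures`, the non-blocking `submit` pattern can also be tried on a single machine without any cluster. The following is a sketch using a local thread pool as a stand-in, not the distributed client. The function `slow_inc` and the `release` event are hypothetical names introduced only to make the "pending, then finished" behaviour deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()


def slow_inc(x):
    # Wait until the main thread allows the task to finish,
    # standing in for a long-running computation.
    release.wait()
    return x + 1


with ThreadPoolExecutor(max_workers=2) as pool:
    fut = pool.submit(slow_inc, 1)   # returns immediately with a future
    done_before = fut.done()         # task is still waiting, so not done
    release.set()                    # let the task complete
    value = fut.result()             # blocks until the result is ready
    done_after = fut.done()

    # Executor.map yields results lazily rather than returning futures.
    mapped = list(pool.map(slow_inc, [1, 2, 3, 4, 5]))

print(done_before, value, done_after, sum(mapped))
```

Two differences from the distributed client are worth noting: the standard-library `map` yields results rather than a list of futures, and standard-library futures are not resolved automatically when passed as arguments to another `submit` call, so intermediate values always come back to the local session.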

Critically, each future represents a result held, or being evaluated, by the cluster. Thus we can control caching of intermediate values - when a future is no longer referenced, its value is forgotten. For example, although we can explicitly pass data into the cluster using `scatter()`, we normally want to cause the workers to load as much of their own data as possible to avoid excessive communication overhead.

The [full API](http://distributed.readthedocs.io/en/latest/api.html) of the distributed scheduler gives details of interacting with the cluster, which, remember, can be on your local machine or possibly on a massive computational resource. Considering which data should be loaded in the workers, as opposed to passed, and which intermediate values to persist in worker memory, will in many cases determine the computational efficiency of a process.

In [None]:
# send data to the cluster - each item is stored on one of the workers,
# and three futures are returned pointing to them
c.scatter([1, 2, 3])