<img src='images/dask_horizontal.svg' width=500 align=left>

# Dask: Scaling Python Simply

Dask is a distributed compute system for Python which scales efficiently from a single laptop up to thousands of servers. Dask is developed in ongoing collaboration with the PyData community so that it is easy to learn, integrate, and operate. Dask leverages regular Python code to scale the work you already do using the skills you already have.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

* Built in Python
* Scales *properly* from single laptops to 1000-node clusters
* Leverages and interops with existing Python APIs as much as possible   

### Why scale?

In a nutshell,
* In the 20th century, computer processors kept getting faster: they ran more and more instructions per second
* But, in the 2000s, due to a variety of engineering limitations, we can no longer acquire strictly faster processors. Instead we use more processors in collaboration
* Many computations are faster if we can hold a dataset in local memory: while individual computers with huge memory do exist, it's usually easier and cheaper to use a collection of computers and "pool" their memory to store our data
    
#### Dask Ecosystem

The Dask core library includes...
* Dataframe
* Array
* Bag (multiset)
* Delayed/Future
* Scheduler
* a few other cool tools

Dask integrates with tons of libraries -- https://dask.org/#powered-by -- and can be used for tabular data, image data, geodata, simulation ... pretty much any analytics or scientific computing task that makes sense in Python.
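
To give a rough feel for two of the core collections listed above, here is a minimal sketch using tiny made-up inputs. The values and variable names are illustrative assumptions, not part of the workshop data.

```python
import dask.array as da
import dask.bag as db

# Array: a NumPy-like array split into chunks (here 10 elements in 2 chunks)
x = da.arange(10, chunks=5)
array_total = x.sum().compute()

# Bag: a collection of plain Python objects split into partitions.
# scheduler="threads" keeps this tiny example inside a single process.
words = db.from_sequence(["dask", "scales", "python"], npartitions=2)
lengths = words.map(len).compute(scheduler="threads")

print(array_total)
print(lengths)
```

In both cases nothing is calculated until `.compute()` is called; before that, Dask only records what needs to be done.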

In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...
* Dask ML - parallel machine learning, with a scikit-learn-style API
* Dask-kubernetes and other deployment libraries
* Dask-XGBoost
* Dask-image
* Dask-cuDF
* ... and some others

#### What's Not Part of Dask?

There are lots of functions that integrate with Dask but are not part of the core Dask ecosystem, including...

* ~~a SQL engine~~ -- check out https://github.com/nils-braun/dask-sql
* data storage
* data catalog
* visualization
* coarse-grained scheduling / orchestration
* streaming

... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).

### Our Agenda

We'll focus on one particular, common use case for Dask: scaling Pandas dataframes to 
* larger datasets (which don't fit in memory) and 
* multiple processes (which could be on multiple nodes)

## What is it like to use Dask?

There are 3 main ways that people use Dask, and you can use any combination or all of them.

We'll take a quick test drive and see each of the 3 approaches:
* One-liners (or sometimes "zero-liners") where a tool already has Dask integration built in
* Dask large-scale data structures like Dask Dataframe: a scalable Pandas dataframe
* Parallelizing custom computation: use the Dask engine to power your own code

Let's spin up a small cluster to work with:

In [None]:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit='700MiB')

client = Client(cluster)

client

Congratulations, you've started your first Dask cluster, and you're using the state-of-the-art distributed scheduler. It's a great choice even though this cluster is running on our local (front-end) machine right now.

But how can we be sure it's alive and has these specs?

__Workers Dashboard__

Let's take a look at the Workers dashboard panel
* Click the Dask logo in the JupyterLab side toolbar
* Click the Workers button

You should get a tab with a live, animated chart that looks something like this:

<img src='images/workers.png' width=701>

You can drag/position/snap that in JupyterLab, so that it's visible while you're coding.

What are these "workers"? Just regular Python processes!
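
The worker specs can also be checked from code rather than from the dashboard. The sketch below is one way to do that, under a few assumptions: it starts its own throwaway cluster (named `demo_cluster` here, a hypothetical name) so that it is self-contained, and it uses `processes=False` so the workers run as threads inside the current process, unlike the process-based cluster started earlier.

```python
from dask.distributed import Client, LocalCluster

# A throwaway cluster just for this check; processes=False runs the workers
# as threads in this process, and dashboard_address=None skips the dashboard.
with LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as demo_cluster, Client(demo_cluster) as demo_client:
    # scheduler_info() returns a dict; its "workers" entry maps each
    # worker address to a dict of details about that worker
    workers = demo_client.scheduler_info()["workers"]
    for address, info in workers.items():
        print(address, "threads:", info["nthreads"], "memory limit:", info["memory_limit"])
```

Against the cluster created earlier, the same `scheduler_info()` call on `client` should list two workers with one thread each.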

Ok, let's try a "one-liner" ML application using Dask. We'll run an example from TPOT, an AutoML tool, to classify the "digits" dataset from Scikit-Learn (a set of very low-resolution handwritten digits)

In [None]:
import tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5)

In [None]:
tp = TPOTClassifier(
    generations=2,
    population_size=10,
    cv=2,
    n_jobs=-1,
    random_state=0,
    verbosity=0,
    config_dict=tpot.config.classifier_config_dict_light,
    use_dask=True,
)
tp.fit(X_train, y_train)

# quick look at test-set accuracy

sum(tp.predict(X_test) == y_test)/len(y_test)

Notice that the only Dask code we explicitly wrote there was the keyword argument `use_dask=True`.

Next, let's take a very quick look at one of the Dask data structures -- a parallel dataframe.

We'll look at some records from the Seattle library system

In [None]:
import dask.dataframe as ddf

loans = ddf.read_parquet('data/checkouts-micro')
loans.head()

In [None]:
loans.groupby('CheckoutYear')['Checkouts'].sum().compute()

For the last stop on our quick preview of Dask, let's parallelize some custom code.

In [None]:
import random

def roll_die(sides):
    return random.randint(1,sides)

Local (regular) Python to roll 4d6

In [None]:
results = map(roll_die, [6] * 4)

print(list(results))

Using our Dask cluster to roll 4d6 in parallel:

In [None]:
import dask

roll_die = dask.delayed(roll_die)

In [None]:
cluster_work = map(roll_die, [6] * 4)

dask.compute(*cluster_work)
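
Delayed calls can also feed into each other, which is where this approach starts to pay off. The sketch below extends the dice example to add up the rolls as part of the same task graph. The names `lazy_roll`, `rolls`, and `total` are illustrative, and the original function is left unwrapped here rather than rebound.

```python
import random
import dask

def roll_die(sides):
    return random.randint(1, sides)

# Wrap the function without replacing the original
lazy_roll = dask.delayed(roll_die)

# Each call creates a separate lazy task; nothing has been rolled yet
rolls = [lazy_roll(6) for _ in range(4)]

# A delayed function can take other delayed results as input
total = dask.delayed(sum)(rolls)

# Computing both together runs each roll once and shares the results.
# With no Client active this uses the local threaded scheduler.
roll_values, total_value = dask.compute(rolls, total)
print(roll_values, total_value)
```

Because both outputs come from one `dask.compute` call, the total always matches the rolls that were printed.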

Shut down our cluster for now

In [None]:
client.close()
cluster.close()

### Remote (or cloud) clusters vs. local cluster

In that example, we ran a local cluster. But creating a Dask cluster on a cluster manager (like Kubernetes) or a public cloud (like AWS) isn't very different.

One way to start a cluster on Kubernetes looks like

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster
cluster = KubeCluster.from_yaml('worker-spec.yml')
client = Client(cluster)
```

and one way to start a cluster on AWS Fargate (container service) looks like

```python
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster
cluster = FargateCluster(image="<hub-user>/<repo-name>[:<tag>]")
client = Client(cluster)
```

## Where are the docs?

We'll provide more links to documentation as we go, but here's a quick list you can refer to:
* Main project page https://dask.org/
* Core documentation https://docs.dask.org/en/latest/
* Distributed (scheduler) https://distributed.dask.org/en/latest/
* Machine learning https://ml.dask.org/
* Deployment tools
    * Kubernetes https://kubernetes.dask.org/en/latest/
    * AWS or Azure https://cloudprovider.dask.org/en/latest/
    * YARN https://yarn.dask.org/en/latest/
    
## Some more Dask Community resources

* Dask issues and source code https://github.com/dask
* StackOverflow https://stackoverflow.com/questions/tagged/dask
* Gitter https://gitter.im/dask/dask

## How Do We Set Up and/or Deploy Dask?

The easiest way to install Dask is with Anaconda: `conda install dask`

__Schedulers and Clustering__

Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that's used if you `import dask` and start running code without explicitly using a `Client` object. It can be handy for quick-and-dirty testing, but I would (*warning! opinion!*) suggest that a best practice is to __use the distributed scheduler even for single-machine workloads__

The distributed scheduler can work with 
* threads (although that is often not a great idea due to the GIL) in one process
* multiple processes on one machine
* multiple processes on multiple machines

The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards.
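
For quick tests without a `Client`, the single machine scheduler can be picked per call or through configuration. The following is a small sketch of both options; the `add` function is a made-up example.

```python
import dask

@dask.delayed
def add(a, b):
    return a + b

# A small task graph: two independent additions feeding a third
total = add(add(1, 2), add(3, 4))

# Choose a scheduler for a single call
sync_result = total.compute(scheduler="synchronous")  # one thread, easy to debug
threaded_result = total.compute(scheduler="threads")  # local thread pool

# Or set it for a block of code
with dask.config.set(scheduler="synchronous"):
    configured_result = total.compute()

print(sync_result, threaded_result, configured_result)
```

Once a `Client` has been created, it registers itself as the default, so plain `.compute()` calls go to the distributed scheduler unless a `scheduler=` argument says otherwise.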

__Large-Scale and Cloud Deployment__

Check out links under https://docs.dask.org/en/latest/setup.html for info on deploying to...
* Clouds
* HPC environments
* Hadoop/YARN
* Kubernetes

... and more