# Use Cases
https://dask.pydata.org/en/latest/use-cases.html  

In [1]:
import pandas as pd
from pandas import DataFrame, Series

In [2]:
df = DataFrame({'date': pd.date_range(start = '2015-1-1', end = '2015-3-31')})
df['x'] = df.date.dt.day
df.tail()

Unnamed: 0,date,x
85,2015-03-27,27
86,2015-03-28,28
87,2015-03-29,29
88,2015-03-30,30
89,2015-03-31,31


In [3]:
# csv_file = '2015.csv'
# df.to_csv(csv_file)

### Overview
Dask use cases can be roughly divided into the following two categories:

1. Large NumPy/Pandas/Lists with dask.array, dask.dataframe, dask.bag to analyze large datasets with familiar techniques. This is similar to Databases, Spark, or big array libraries.
2. Custom task scheduling. You submit a graph of functions that depend on each other for custom workloads. This is similar to Luigi, Airflow, Celery, or Makefiles.  

Most people today approach Dask assuming it is a framework like Spark, designed for the first use case around large collections of uniformly shaped data. However, many of the more productive and novel use cases fall into the second category, using Dask to parallelize custom workflows.  
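
To make the two categories concrete, here is a minimal sketch of each. The functions `inc` and `add` are hypothetical stand-ins for real work, and the array sizes are arbitrary.

```python
import dask.array as da
from dask import delayed

# Category 1: a large collection with a NumPy-like interface.
# The array is split into 10 chunks of 100,000 elements each.
x = da.ones(1_000_000, chunks=100_000)
total = x.sum().compute()

# Category 2: a custom task graph built from plain Python functions.
def inc(i):
    return i + 1

def add(a, b):
    return a + b

a = delayed(inc)(1)      # nothing runs yet; this only records a task
b = delayed(inc)(2)
c = delayed(add)(a, b)   # depends on both a and b
graph_result = c.compute()
```

In both cases the work is described lazily and only runs when `compute()` is called.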




Dask compute environments can be divided into the following two categories:

1. Single machine parallelism with threads or processes: The Dask single-machine scheduler leverages the full CPU power of a laptop or a large workstation and changes the space limitation from “fits in memory” to “fits on disk”. This scheduler is simple to use and doesn’t have the computational or conceptual overhead of most “big data” systems.
2. Distributed cluster parallelism on multiple nodes: The Dask distributed scheduler coordinates the actions of multiple machines on a cluster. It scales anywhere from a single machine to a thousand machines, but not significantly beyond.


The single-machine scheduler is useful to more individuals (more people have personal laptops than have access to clusters) and probably accounts for 80+% of the use of Dask today. The distributed scheduler is useful to larger organizations like universities, research labs, or private companies.
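
As a rough illustration of choosing a compute environment, the sketch below runs one small graph under two single-machine schedulers. It assumes a Dask release in which `compute` accepts the `scheduler=` keyword; `square` is a hypothetical workload.

```python
import dask
from dask import delayed

def square(n):
    return n * n

tasks = [delayed(square)(n) for n in range(5)]
lazy_total = delayed(sum)(tasks)

# Run everything in the calling thread, which is handy for debugging.
serial = lazy_total.compute(scheduler='synchronous')

# Run on a local thread pool.
threaded = lazy_total.compute(scheduler='threads')

# Set a default scheduler for a block of code instead of per call.
with dask.config.set(scheduler='threads'):
    configured = lazy_total.compute()
```

The graph itself does not change between schedulers; only the `scheduler` choice does. Creating a `dask.distributed.Client` would route the same `compute()` call to a cluster.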

Below we give specific examples of how people use Dask. We start with large NumPy/Pandas/List examples because they’re somewhat more familiar to people looking at “big data” frameworks. We then follow with custom scheduling examples, which tend to be applicable more often, and are arguably a bit more interesting.

## Collection Examples
Dask contains large parallel collections for n-dimensional arrays (similar to NumPy), dataframes (similar to Pandas), and lists (similar to PyToolz or PySpark).

### On disk arrays
An analyst with a stack of HDF5 or NetCDF files on disk can use dask.array to treat the whole stack as a single NumPy array (or as a collection of NumPy arrays with the XArray project).

In [4]:
import h5py, os
import numpy as np

# path = 'data'
path = '../dask-tutorial/data'
filename = 'myfile.hdf5'
file = os.path.join(path, filename)

total_size = 1000000
chunks = 1000

if not os.path.exists(file):
    with h5py.File(file, 'w') as f:  # open explicitly for writing
        dset = f.create_dataset('/x', 
                                shape = (total_size,), 
                                dtype = np.float32,
                                chunks=(chunks,),
                                compression='gzip',
                                compression_opts=9) 
        
        for i in range(0, total_size, chunks):
            dset[i: i + chunks] = np.random.exponential(size=chunks)



In [5]:
import dask.array as da
from distributed import Client

client = Client()
client

Client: Scheduler tcp://127.0.0.1:54608, Dashboard http://127.0.0.1:8787/status
Cluster: Workers 4, Cores 4, Memory 8.48 GB


In [6]:
h5file = h5py.File(file, mode = 'r')
dataset = h5file['/x']

# x = dataset[...]
x = da.from_array(dataset, chunks = dataset.chunks)
y = x[::10000] - x.mean(axis=0)

%time print('result:', y.compute())

h5file.close()

result: [-0.64674985 -0.22668874 -0.27774346  2.559743    0.2979015  -0.5956613
  0.79906476 -0.25978506 -0.6400137  -0.50122947 -0.5501237  -0.45786148
 -0.81786805 -0.6803148   1.8183072   0.0706774  -0.34766126 -0.2607717
 -0.9230949  -0.40851784 -0.2787838  -0.74477565 -0.8358296  -0.7863423
  0.9568     -0.5427391   1.3723187  -0.08791035  2.8399603  -0.7163671
  1.5204368  -0.7691104  -0.52979404 -0.50032794 -0.5012726  -0.6958561
  0.10943246 -0.29757774  2.0415967  -0.6752656  -0.5851799  -0.6956012
 -0.7376877   0.04875898  0.4183973   0.53458846 -0.29217488 -0.68081343
  0.38699257 -0.42110473  0.73694146 -0.56270605 -0.60957325  0.28478825
  0.08592284  0.05933285 -0.88665557 -0.88949794 -0.957826   -0.7015122
  1.2569599  -0.7414051   0.30147684  1.1287274   2.2218735  -0.30781156
 -0.50330496  0.51573336  1.5444198   2.7049658   0.9826366   0.74960566
 -0.4028632   0.36163366 -0.8244845  -0.1751982  -0.10236657 -0.3164003
 -0.76260257  0.37822425 -0.03461301  0.96427107  1

In [7]:
h5file.close()

### Directory of CSV or tabular HDF files
An analyst with a directory of CSV or tabular HDF files can use dask.dataframe to wrap all of these files into one logical dataframe that is built on demand to save space.

In [8]:
path = '../dask-tutorial/data/csv/'
dates = pd.date_range(start = '2015-1-1', end = '2015-2-28') 

for date in dates:
    filename = str(date.date())
    file = os.path.join(path, filename)
    
    timedeltas = pd.timedelta_range(start = 0, periods = 10, freq = '1H')
    df = DataFrame({'timestamp': date + timedeltas})
    df['value'] = np.random.randn(timedeltas.shape[0])
    
    df.to_csv(file + '.csv') 
    
df.tail()

Unnamed: 0,timestamp,value
5,2015-02-28 05:00:00,1.756584
6,2015-02-28 06:00:00,0.581455
7,2015-02-28 07:00:00,-0.172263
8,2015-02-28 08:00:00,0.501906
9,2015-02-28 09:00:00,-0.016529


In [9]:
import dask.dataframe as dd

df = dd.read_csv(path + '2015-*-*.csv',
                 parse_dates=['timestamp'])

value_mean = df.groupby(df.timestamp.dt.hour).value.mean()
value_mean.compute() 

timestamp
0   -0.191419
1    0.142274
2   -0.024639
3   -0.147495
4    0.141604
5    0.138348
6   -0.148399
7    0.087410
8   -0.094926
9    0.009555
Name: value, dtype: float64

In [10]:
value_mean

Dask Series Structure:
npartitions=1
    float64
        ...
Name: value, dtype: float64
Dask Name: truediv, 432 tasks

In [11]:
type(value_mean)

dask.dataframe.core.Series

In [None]:
value_mean.visualize()

### Directory of CSV files on HDFS
The same analyst as above uses dask.dataframe with the dask.distributed scheduler to analyze terabytes of data on their institution’s Hadoop cluster straight from Python. This uses either the hdfs3 or pyarrow Python libraries for HDFS management.

```python
from dask.distributed import Client
client = Client('cluster-address:8786')

import dask.dataframe as dd
df = dd.read_csv('hdfs://data/2016-*.*.csv', parse_dates=['timestamp'])
df.groupby(df.timestamp.dt.hour).value.mean().compute()
```


### Directories of custom format files
https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca

### JSON data
Data engineers with click stream data from a website, or mechanical engineers with telemetry data from mechanical instruments, have large volumes of data in JSON or some other semi-structured format. They use dask.bag to manipulate many Python objects in parallel, either on their personal machine, where they stream the data through memory, or across a cluster.

In [13]:
import dask.bag as db
import json

In [None]:
path = '../dask-tutorial/data/json/'
dates = pd.date_range(start = '2015-1-1', end = '2015-2-28') 

for date in dates:
    filename = str(date.date())
    file = os.path.join(path, filename)
    # to_textfiles expects strings, so serialize each record to JSON first
    b = db.from_sequence([{'name': 'Alice', 'id': 123}] * 20).map(json.dumps)
    
    b.to_textfiles(file + '-*.json.gz')

In [None]:
# read back the files written above (compression is inferred from the extension)
records = db.read_text(path + '2015-*-*.json.gz').map(json.loads)
records.filter(lambda d: d['name'] == 'Alice').pluck('id').frequencies()
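
The same filter, pluck, and frequencies pipeline can be tried without any files. The sketch below uses a handful of made-up records held in memory and passes `scheduler='threads'` explicitly, since bags otherwise default to a process-based scheduler.

```python
import json
import dask.bag as db

# Hypothetical click-stream records, serialized as they might sit on disk.
lines = [json.dumps({'name': name, 'id': i})
         for name, i in [('Alice', 123), ('Bob', 456),
                         ('Alice', 123), ('Alice', 789)]]

records = db.from_sequence(lines, npartitions=2).map(json.loads)
alice_ids = (records.filter(lambda d: d['name'] == 'Alice')
                    .pluck('id')
                    .frequencies())

# frequencies() yields (value, count) pairs; collect them into a dict.
counts = dict(alice_ids.compute(scheduler='threads'))
```

Nothing is evaluated until `compute()` is called, so the chain of `map`, `filter`, and `pluck` streams through each partition once.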

## Custom Examples
The large collections (array, dataframe, bag) are wonderful when they fit the application, for example if you want to perform a groupby on a directory of CSV data. However, several parallel computing applications don’t fit neatly into one of these higher-level abstractions. Fortunately, Dask provides a wide variety of ways to parallelize more custom applications. These use the same machinery as the arrays and dataframes, but allow the user to develop custom algorithms specific to their problem.

### Embarrassingly parallel computation
A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.

They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.
```python
def process(data):
   ...
   return ...
```
#### Normal Sequential Processing:
```python
results = [process(x) for x in inputs]
```
#### Build Dask Computation:
```python
from dask import compute, delayed
values = [delayed(process)(x) for x in inputs]
```
#### Multiple Threads:
```python
# run on a local thread pool
results = compute(*values, scheduler='threads')
```
#### Multiple Processes:
```python
# run on a local process pool
results = compute(*values, scheduler='processes')
```
#### Distributed Cluster:
```python
from dask.distributed import Client
client = Client("cluster-address:8786")  # registers itself as the default scheduler
results = compute(*values)
```
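
Putting the pieces together, a self-contained version of the pattern might look like the following. Here `process` is a hypothetical stand-in for an expensive function, and the sketch assumes a Dask release in which `compute` accepts `scheduler=`.

```python
from dask import compute, delayed

def process(x):
    # Hypothetical stand-in for an expensive function.
    return x ** 2

inputs = list(range(8))

# Build one lazy task per input; nothing runs yet.
values = [delayed(process)(x) for x in inputs]

# Execute all tasks on a local thread pool; compute returns a tuple.
results = compute(*values, scheduler='threads')
```

Swapping `'threads'` for another scheduler changes where the tasks run without touching `process` or the list comprehension.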

### Complex dependencies
A financial analyst has many models that depend on each other in a complex web of computations.
```python
data = [load(fn) for fn in filenames]
reference = load_from_database(query)

A = [model_a(x, reference) for x in data]
B = [model_b(x, reference) for x in data]

roll_A = [roll(A[i], A[i + 1]) for i in range(len(A) - 1)]
roll_B = [roll(B[i], B[i + 1]) for i in range(len(B) - 1)]
compare = [compare_ab(a, b) for a, b in zip(A, B)]

results = summarize(compare, roll_A, roll_B)
```
These models are time consuming and need to be run on a variety of inputs and situations. The analyst currently has their code as a collection of Python functions and is trying to figure out how to parallelize such a codebase. They use dask.delayed to wrap their function calls and capture the implicit parallelism.
```python
from dask import compute, delayed

data = [delayed(load)(fn) for fn in filenames]
reference = delayed(load_from_database)(query)

A = [delayed(model_a)(x, reference) for x in data]
B = [delayed(model_b)(x, reference) for x in data]

roll_A = [delayed(roll)(A[i], A[i + 1]) for i in range(len(A) - 1)]
roll_B = [delayed(roll)(B[i], B[i + 1]) for i in range(len(B) - 1)]
compare = [delayed(compare_ab)(a, b) for a, b in zip(A, B)]

lazy_results = delayed(summarize)(compare, roll_A, roll_B)
```
They then depend on the dask schedulers to run this complex web of computations in parallel.
```python
results = compute(lazy_results)
```
They appreciate how easy it was to transition from the experimental code to a scalable parallel version. The code is also simple enough for their teammates to understand and extend in the future.
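
A scaled-down, runnable variant of this graph is sketched below. All of the functions are toy placeholders, not the analyst's models. It also shows that a shared dependency such as the reference data is loaded once, even though every model task uses it.

```python
from dask import delayed

calls = []  # records every call to load_reference

def load_reference():
    calls.append('reference')
    return 10

def model_a(x, ref):
    return x + ref

def model_b(x, ref):
    return x * ref

def compare_ab(a, b):
    return b - a

def summarize(compared):
    return sum(compared)

data = [1, 2, 3]
reference = delayed(load_reference)()  # a single task shared by all models

A = [delayed(model_a)(x, reference) for x in data]
B = [delayed(model_b)(x, reference) for x in data]
compare = [delayed(compare_ab)(a, b) for a, b in zip(A, B)]

lazy_summary = delayed(summarize)(compare)
summary = lazy_summary.compute(scheduler='threads')
```

Because `reference` is one node in the graph, the scheduler runs it once and hands the result to each dependent task.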

### Algorithm developer
A graduate student in machine learning is prototyping novel parallel algorithms. They don’t have access to an institutional cluster, so instead they use dask-ec2 to easily provision clusters of varying sizes.

**Their algorithm is written the same in all cases**, drastically reducing the cognitive load, and letting the readers of their work experiment with their system on their own machines, aiding reproducibility.

### Scikit-Learn or Joblib User
A data scientist wants to scale their machine learning pipeline to run on their cluster to accelerate parameter searches. They already use the sklearn **n_jobs=** parameter to accelerate their computation on their local computer with Joblib. Now they wrap their sklearn code with a context manager to parallelize the exact same code across a cluster (also available with IPyParallel).
```python
import joblib
from dask.distributed import Client

client = Client('192.168.1.100:8786')  # connect to the cluster's scheduler

with joblib.parallel_backend('dask'):
    result = GridSearchCV( ... )  # normal sklearn code
```

### Academic Cluster Administrator
A system administrator for a university compute cluster wants to enable many researchers to use the available cluster resources, which are currently lying idle. The research faculty and graduate students lack experience with job schedulers and MPI, but are comfortable interacting with Python code through a Jupyter notebook.

Teaching the faculty and graduate students to parallelize software has proven time consuming. Instead the administrator sets up dask.distributed on a sandbox allocation of the cluster and broadly publishes the address of the scheduler, pointing researchers to the dask.distributed quickstart. Utilization of the cluster climbs steadily over the next week as researchers are more easily able to parallelize their computations without having to learn foreign interfaces. The administrator is happy because resources are being used without significant hand-holding.

As utilization increases, the administrator has a new problem: the shared dask.distributed cluster is being overused. The administrator tracks use through Dask diagnostics to identify which users are taking most of the resources. They contact these users and teach them how to launch their own dask.distributed clusters using the traditional job scheduler on their cluster, making space for more new users in the sandbox allocation.

### Financial Modeling Team
Similar to the case above, a team of modelers working at a financial institution run a complex network of computational models on top of each other. They started using dask.delayed individually, as suggested above, but realized that they often perform highly overlapping computations, such as always reading the same data.

Now they decide to use the same Dask cluster collaboratively to save on these costs. Because Dask intelligently hashes computations in a way similar to how Git works, they find that when two people submit similar computations the overlapping part of the computation runs only once.
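
The idea of identifying a computation by a hash of its function and inputs can be seen on a small scale with `dask.delayed`. In this sketch `load` is a hypothetical loader, and `pure=True` asks Dask to derive the task key deterministically from the function and its arguments.

```python
from dask import delayed

def load(path):
    # Hypothetical loader; a real one would read the file at `path`.
    return path.upper()

# Two people describing the same computation get the same task key...
first = delayed(load, pure=True)('2015.csv')
second = delayed(load, pure=True)('2015.csv')

# ...while a different input produces a different key.
other = delayed(load, pure=True)('2016.csv')

same_key = first.key == second.key
different_key = first.key != other.key
```

Tasks that share a key are treated as the same piece of work, which is what lets a shared scheduler avoid repeating it.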

Ever since working collaboratively on the same cluster they find that their frequently running jobs run much faster, because most of the work is already done by previous users. When they share scripts with colleagues they find that those repeated scripts complete immediately rather than taking several hours.

They are now able to iterate and share data as a team more effectively, decreasing their time to result and increasing their competitive edge.

As this becomes more heavily used on the company cluster, they decide to set up an auto-scaling system. They use their dynamic job scheduler (perhaps SGE, LSF, Mesos, or Marathon) to run a single dask-scheduler 24/7 and then scale the number of dask-workers running on the cluster up and down based on computational load. This solution ends up being more responsive (and thus more heavily used) than their previous attempts to provide institution-wide access to parallel computing, but because it responds to load, it still acts as a good citizen in the cluster.

### Streaming data engineering
A data engineer responsible for watching a data feed needs to scale out a continuous process. They combine dask.distributed with normal Python Queues to produce a rudimentary but effective stream processing system.

Because dask.distributed is elastic, they can scale up or scale down their cluster resources in response to demand.
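
A rudimentary version of such a pipeline might look like the sketch below. It is an illustration under assumptions rather than a reference design: `handle` and `feed` are hypothetical names, the feed is a finite range, and `Client(processes=False)` starts an in-process cluster where a real deployment would pass the scheduler address.

```python
import queue
import threading
from dask.distributed import Client

def handle(record):
    # Hypothetical per-record work.
    return record * 2

def feed(source, inbox):
    # Producer: push each incoming record onto the queue, then a sentinel.
    for record in source:
        inbox.put(record)
    inbox.put(None)

def run_stream(source):
    inbox = queue.Queue()
    producer = threading.Thread(target=feed, args=(source, inbox))
    producer.start()

    futures = []
    with Client(processes=False) as client:
        while True:
            record = inbox.get()
            if record is None:      # sentinel: the feed has ended
                break
            # Hand each record to the cluster as soon as it arrives.
            futures.append(client.submit(handle, record))
        results = client.gather(futures)

    producer.join()
    return results

stream_results = run_stream(range(5))
```

The queue decouples the arrival of records from their processing, and `client.submit` returns immediately, so the consumer loop keeps draining the feed while workers process earlier records.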