<img src='images/dask-horizontal.svg' width=400>

# Dask natively scales Python
## Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love

### Integrates with existing projects
#### BUILT WITH THE BROADER COMMUNITY

Dask is open source and freely available. It is developed in coordination with other community projects like Numpy, Pandas, and Scikit-Learn.

*(from the Dask project homepage at dask.org)*

* * *

__What Does This Mean?__
* Built in Python
* Scales *properly* from single laptops to 1000-node clusters
* Leverages and interops with existing Python APIs as much as possible
* Adheres to (Tim Peters') "Zen of Python" (https://www.python.org/dev/peps/pep-0020/) ... especially these elements:
    * Explicit is better than implicit.
    * Simple is better than complex.
    * Complex is better than complicated.
    * Readability counts. <i>[ed: that goes for docs, too!]</i>
    * Special cases aren't special enough to break the rules.
    * Although practicality beats purity.
    * In the face of ambiguity, refuse the temptation to guess.
    * If the implementation is hard to explain, it's a bad idea.
    * If the implementation is easy to explain, it may be a good idea.
* While we're borrowing inspiration, Dask embodies one of Perl's slogans: making easy things easy and hard things possible
    * Specifically, it supports common data-parallel abstractions like Pandas and NumPy
    * But it also allows scheduling arbitrary custom computation that doesn't fit a preset mold
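
The last point, scheduling custom computation, is typically handled with `dask.delayed`. The sketch below is illustrative only: the functions `load`, `clean`, and `summarize` are hypothetical stand-ins invented for this example, not part of Dask or of this notebook.

```python
from dask import delayed


@delayed
def load(n):
    # Hypothetical "load" step: produce the integers 0..n-1
    return list(range(n))


@delayed
def clean(values):
    # Hypothetical "clean" step: keep only the even values
    return [v for v in values if v % 2 == 0]


@delayed
def summarize(chunks):
    # Hypothetical "summarize" step: add up everything that survived cleaning
    return sum(sum(chunk) for chunk in chunks)


# Calling the decorated functions only builds a task graph; nothing runs yet
cleaned = [clean(load(n)) for n in (4, 6, 8)]
total = summarize(cleaned)

# compute() executes the graph; the synchronous scheduler keeps this demo single-threaded
result = total.compute(scheduler="synchronous")
```

The three `load`/`clean` chains are independent of one another, so a parallel scheduler would be free to run them concurrently before `summarize` combines them.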

### Let's See Some Code

Before we go any further, let's take a look at one particular, common use case for Dask: scaling Pandas dataframes to 
* larger datasets (which don't fit in memory) and 
* multiple processes (which could be on multiple nodes)

In [1]:
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')

client

Client
* Scheduler: tcp://127.0.0.1:54281
* Dashboard: http://127.0.0.1:8787/status

Cluster
* Workers: 2
* Cores: 2
* Memory: 2.00 GB


In [2]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/beer_small.csv', blocksize=12e6)

In [3]:
ddf

npartitions=2,Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
,int64,int64,object,int64,float64,float64,float64,object,object,float64,float64,object,float64,int64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


### What is this Dask Dataframe?

A large, virtual dataframe divided along the index into multiple Pandas dataframes:

<img src="images/dask-dataframe.svg" width="400px">

In [4]:
ddf.map_partitions(type).compute()

0    <class 'pandas.core.frame.DataFrame'>
1    <class 'pandas.core.frame.DataFrame'>
dtype: object

In [5]:
ddf.head()

,Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,784200,952,Great Dane Pub & Brewing Company (Downtown),1136269921,4.5,4.0,4.0,dirtylou,American IPA,4.0,4.0,Texas Speedbump IPA,,11846
1,1305265,29,Anheuser-Busch,1234830966,4.5,4.0,3.0,talkinghatrack,Light Lager,3.0,4.0,Bud Light Lime,4.2,41821
2,1526298,45,Brooklyn Brewery,1078599557,4.5,4.0,4.0,PopeJonPaul,Scotch Ale / Wee Heavy,4.0,4.5,Brooklyn Heavy Scotch Ale,7.5,16355
3,450647,590,New Glarus Brewing Company,1288790879,4.5,4.5,4.5,sweemzander,American Wild Ale,4.5,4.0,R&D Bourbon Barrel Kriek,5.5,60588
4,1223094,4,Allagash Brewing Company,1295320417,4.5,4.5,4.0,Jmoore50,American Wild Ale,4.0,4.0,Allagash Victor Francenstein,9.7,56665


In [6]:
ddf[ddf.beer_style.str.contains('IPA')].head()

,Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,784200,952,Great Dane Pub & Brewing Company (Downtown),1136269921,4.5,4.0,4.0,dirtylou,American IPA,4.0,4.0,Texas Speedbump IPA,,11846
9,426580,666,Emerson's Brewery,1192461083,5.0,4.0,4.5,Lukie,English India Pale Ale (IPA),4.0,5.0,1812 India Pale Ale,4.7,4594
24,728901,17963,Nectar Ales,1312873910,3.5,4.0,3.5,Sensaray,American IPA,3.5,3.5,IPA Nectar,6.8,9024
26,745463,12877,NINE G Brewing Company,1189556274,4.0,4.5,4.0,Phatz,American Double / Imperial IPA,4.0,4.5,Infidel Imperial IPA,8.4,31041
28,94239,140,Sierra Nevada Brewing Co.,1269655771,4.0,4.5,4.5,CaptainIPA,American IPA,4.5,4.5,Sierra Nevada Torpedo Extra IPA,7.2,30420


In [7]:
ipa = ddf[ddf.beer_style.str.contains('IPA')]

In [8]:
mean_ipa_review = ipa.groupby('brewery_name').review_overall.agg(['mean','count'])

In [9]:
mean_ipa_review.compute()

brewery_name,mean,count
(512) Brewing Company,3.785714,7
1516 Brewing Company,4.000000,1
1702 / The Address Brewing Co.,4.000000,1
21st Amendment Brewery,3.923469,98
7 Seas Brewery and Taproom,4.000000,1
...,...,...
Yak and Yeti,4.500000,1
York Brewery Company Limited,4.000000,1
Yukon Brewing Company,4.250000,2
barVolo,4.000000,1


In [10]:
mean_ipa_review.nlargest(20, 'mean').compute()

brewery_name,mean,count
Burnside Brewing Co.,5.0,1
Elk Mountain Brewing,5.0,1
Pioneer Brewing Co.,5.0,2
Crouch Vale Brewery Limited,5.0,1
Feral Brewing Co.,5.0,1
Flour City Brewing Co.,5.0,1
La Jolla Brew House,5.0,1
Uncle Buck's Brewery & Steakhouse,5.0,1
Glacier Brewhouse,4.875,4
The Kernel Brewery,4.75,2


`compute` doesn't just run the work; it also collects the result into a single, regular Pandas dataframe right here in our initial Python process.

Having a local result is convenient, but if we are generating large results, we may want (or need) to produce output in parallel to the filesystem, instead. 

Each read method has a writing counterpart that we can use:

- `read_csv` / `to_csv`
- `read_hdf` / `to_hdf`
- `read_json` / `to_json`
- `read_parquet` / `to_parquet`

In [11]:
mean_ipa_review.to_csv('ipa-*.csv')  # the * is replaced by the partition number

['/Users/hugobowne-anderson/Downloads/dask-mini-2019-master/ipa-0.csv']

In [12]:
client.close()

### About Dask

Dask was created in 2014 as part of the Blaze project, a DARPA-funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, Pandas, Jupyter and Scikit-Learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud and Coiled.

Fundamentally, Dask allows a variety of parallel workflows using existing Python constructs, patterns, or libraries, including dataframes, arrays (scaling out NumPy), bags (an unordered collection construct a bit like `Counter`), and `concurrent.futures`.
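
As a brief sketch of two of these collections, the example below builds a chunked Dask array and a small Dask bag. The values are hypothetical toy data chosen only to make the results easy to check.

```python
import dask.array as da
import dask.bag as db

# A 1000 x 1000 array of ones, stored as sixteen 250 x 250 NumPy chunks
arr = da.ones((1000, 1000), chunks=(250, 250))
n_chunks = arr.numblocks[0] * arr.numblocks[1]
array_total = float(arr.sum().compute(scheduler="synchronous"))

# A bag is an unordered collection; frequencies() counts occurrences, much like Counter
styles = db.from_sequence(["ipa", "stout", "ipa", "porter", "ipa"], npartitions=2)
style_counts = dict(styles.frequencies().compute(scheduler="synchronous"))
```

Both collections follow the same pattern as the dataframe: operations build a lazy task graph, and `compute` returns an ordinary Python or NumPy result.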

In addition to working in conjunction with Python ecosystem tools, Dask's low scheduling overhead (roughly a millisecond or less per task) allows it to work well even on single machines and to scale up smoothly.

Dask supports a variety of use cases for industry and research: https://stories.dask.org/en/latest/

With its recent 2.x releases, and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.

__Dask Ecosystem__

In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...
* Dask-ML - parallel machine learning, with a scikit-learn-style API
* Dask-kubernetes
* Dask-XGBoost
* Dask-YARN
* Dask-image
* Dask-cuDF
* ... and some others

__What's Not Part of Dask?__

Many capabilities integrate with Dask but are not part of the core Dask ecosystem, including...

* a SQL engine
* data storage
* data catalog
* visualization
* coarse-grained scheduling / orchestration
* streaming

... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).


### How Do We Set Up and/or Deploy Dask?

The easiest way to install Dask is with Anaconda: `conda install dask`

__Schedulers and Clustering__

Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that's used if you `import dask` and start running code without explicitly using a `Client` object. It can be handy for quick-and-dirty testing, but I would (*warning! opinion!*) suggest that a best practice is to __use the newer "distributed scheduler" even for single-machine workloads__.
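
For reference, the single machine scheduler comes in several flavors that can be chosen per call or through configuration. The sketch below assumes a toy array and shows two of them, `"threads"` and `"synchronous"`.

```python
import dask
import dask.array as da

# Toy data: the integers 0..999_999 in ten chunks
arr = da.arange(1_000_000, chunks=100_000, dtype="int64")
expected = sum(range(1_000_000))

# Option 1: choose a single-machine scheduler for one call
threaded_total = int(arr.sum().compute(scheduler="threads"))

# Option 2: choose it for every compute() inside a block via configuration
with dask.config.set(scheduler="synchronous"):
    synchronous_total = int(arr.sum().compute())
```

The synchronous scheduler runs everything in the calling thread, which makes it convenient for debugging with ordinary tools such as `pdb`.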

The distributed scheduler can work with 
* threads in one process (although that is often not a great idea for pure-Python code due to the GIL)
* multiple processes on one machine
* multiple processes on multiple machines

The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards.
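
As a rough sketch of the first of these modes, the example below starts the distributed scheduler with threads only, inside the current process, and uses its futures interface. The helper `square` is hypothetical, and the dashboard is switched off here purely to keep the demo lightweight.

```python
from dask.distributed import Client


def square(x):
    # Hypothetical task used only for this demo
    return x * x


# processes=False keeps the scheduler and worker in this process (threads only);
# dashboard_address=None skips starting the dashboard for this small demo
with Client(processes=False, n_workers=1, threads_per_worker=2,
            dashboard_address=None) as client:
    # map/submit return futures immediately, in the style of concurrent.futures
    futures = client.map(square, range(5))
    squares = client.gather(futures)
    total = client.submit(sum, squares).result()
```

For multiple processes on one machine, a call like `Client(n_workers=2, threads_per_worker=1)` from the start of this notebook applies. For multiple machines, `Client` instead takes the address of an already-running scheduler.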