<img src="https://snipboard.io/Kx6OAi.jpg">

# Session 3. Dask + Pandas
<div style="margin-top: -10px; padding: 5px; line-height: 20px;"><img src="https://snipboard.io/v5q47G.jpg" style="width: 35px; float: left; margin-right: 10px;"> Author:  <a href="http://www.linkedin.com/in/davidyerrington">David Yerrington</a>, Data Scientist<br>San Francisco, CA</div>

## Learning Objectives

- Be able to explain the difference between Pandas and Dask DataFrames
- Become familiar with storage options
- Understand nuances with schema

### Prerequisite Knowledge
- Basic Pandas 

## Environment Setup

We will first review some basic points for setting up Python and the environment in [the setup guide](../environment.md).


# 1. Pandas and Dask

![](https://snipboard.io/QkGlLK.jpg)
> Photo by [VGC](https://www.facebook.com/vgcphotography/)

Many of us learn Pandas inside and out, staying completely in the safety of our Jupyter environments.  But once the analysis is done, the features are engineered, and the recommendation prototypes or supervised models are built, most of what comes next pushes us out of the safety of Pandas and toward "real" big data frameworks like Spark.  This doesn't have to be the case anymore, since Dask has reached a level of maturity that can reliably scale our work to production systems and beyond.

With a bit of knowledge about how DataFrames work in Dask, we can adapt work we've already written in Pandas.  Even on a single machine, a local cluster can still deliver real performance gains.

### In this session we will dive into:
- How to adapt common Pandas workflows to Dask
- Common pitfalls to watch for

## The Dask DataFrame

![](https://snipboard.io/CIHP8D.jpg)

So far we've looked at `Bag` and `Array`. Many of the assumptions that hold for those collection types also apply to the Dask DataFrame:

- A Dask DataFrame is comprised of smaller DataFrames (partitions).
- Like `Array`, not every method is implemented 1:1.
- Column typing can be a bit difficult.
- Not every method behaves exactly the same.
- Joins are expensive when you don't use an index.
- Operations like `.set_index`, `.melt`, `.stack`/`.unstack`, `groupby().apply()` can be a bit on the slow side.

To master Dask DataFrames takes time.  There are tons of rules of thumb that are good to know but come with experience.  This is an intro-level session but we will explore the areas of Pandas-to-Dask that will give you the most impact and equip you with enough foundational knowledge to continue learning.

### Imports

In [1]:
import dask
from dask.distributed import Client

client = Client()
client



Client: Scheduler: tcp://127.0.0.1:11368, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 8, Memory: 17.11 GB




### Make Some "Fake" People Data

The `dask.datasets.make_people` function hands us our dataset as a `Bag` object.  The records are only semi-structured, even once we convert them to a Dask `DataFrame`.

In [2]:
people = dask.datasets.make_people().to_dataframe()



### Examining our DataFrame

Out of the box, the first problem we will find when examining data is the fact that we can't see any of it.

In [3]:
people

| npartitions=10 | age | name | occupation | telephone | address | credit-card |
|---|---|---|---|---|---|---|
|  | int64 | object | object | object | object | object |
|  | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |
|  | ... | ... | ... | ... | ... | ... |


### `.info` isn't as informative at first glance

One of the first things we should be doing is looking for missing values and expected types.

In [4]:
people.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 6 entries, age to credit-card
dtypes: object(5), int64(1)

### Question:  Any ideas on how to know what's missing and if the dtypes are expected?
Follow up: Can we trust the `dtypes` reported?  Why?

>  Quick refresher:  `.info()` in regular Pandas tells us the name of each column, how many non-null records it holds, the size in memory, and how many records there are in total.

In [9]:
people.head(10).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          10 non-null     int64 
 1   name         10 non-null     object
 2   occupation   10 non-null     object
 3   telephone    10 non-null     object
 4   address      10 non-null     object
 5   credit-card  10 non-null     object
dtypes: int64(1), object(5)
memory usage: 608.0+ bytes


### Everything is lazy, including `.describe`

In our very specific case, we have a DataFrame built from a `Bag`, so it will load fast.  However, if you are loading with a method like `dask.dataframe.read_csv("people-*.csv")`, where each file has the same columns, calling `.compute()` means Dask has to read all of the files and **then** perform the computations across every column for many of the built-in methods such as `.describe`.  Keep in mind that any prerequisite steps in the DAG, such as reading the data, must run before the result can be computed.

In [11]:
people.compute().describe()

Unnamed: 0,age
count,10000.0
mean,40.9719
std,14.668545
min,16.0
25%,28.0
50%,41.0
75%,54.0
max,66.0


### `.compute()` gives you everything

Called on a Dask `DataFrame` with no other operations before it, `.compute()` returns every observation.  The return value is a Pandas `DataFrame`.

>  Be careful when returning everything with `.compute()`, because the result may not fit in memory.

In [12]:
people.compute()

Unnamed: 0,age,name,occupation,telephone,address,credit-card
0,39,"(Merrill, Wagner)",Agent,+1-(703)-357-7115,"{'address': '753 Hillside Plantation', 'city':...","{'number': '4251 5038 0465 4953', 'expiration-..."
1,17,"(Ignacio, Salas)",Land Agent,(367) 755-7299,"{'address': '462 Third Annex', 'city': 'Goodle...","{'number': '5181 0643 7154 8682', 'expiration-..."
2,17,"(Earnest, Roach)",Bank Manager,729.976.1387,"{'address': '810 Nauman Mews', 'city': 'Park F...","{'number': '2508 4209 3314 7913', 'expiration-..."
3,53,"(Damian, Brown)",Purchasing Assistant,824-391-7395,"{'address': '668 Vale Extension', 'city': 'Nam...","{'number': '3494 116142 31651', 'expiration-da..."
4,20,"(Eugene, Prince)",Publisher,(253) 401-3378,"{'address': '328 Ward Road', 'city': 'Anderson'}","{'number': '3435 828153 99699', 'expiration-da..."
...,...,...,...,...,...,...
995,25,"(Jay, Arnold)",Progress Chaser,1-022-762-6396,"{'address': '334 Lawton Trail', 'city': 'Platt...","{'number': '2223 7916 9728 4199', 'expiration-..."
996,39,"(Marquis, Buchanan)",Cable Contractor,(348) 879-0051,"{'address': '142 Blythdale High Street', 'city...","{'number': '4796 3609 2481 8204', 'expiration-..."
997,17,"(Jasper, Sweet)",Accounts Manager,470.506.9562,"{'address': '925 Mclea Turnpike', 'city': 'Elm...","{'number': '5243 2441 3222 2399', 'expiration-..."
998,50,"(Winford, Grimes)",Metal Dealer,712-377-7785,"{'address': '672 Normandie Grove', 'city': 'Ha...","{'number': '4704 4094 6493 6453', 'expiration-..."


### `.head()` is still your friend
`.head()` fortunately isn't too different, but it is roughly the equivalent of `.compute()` since it actually runs the graph and brings data back.  The data returned, however, is no longer a Dask DataFrame but a regular Pandas `DataFrame`.

- `.head()` will `.compute()` but only return limited results.
- Only the first partition is read unless you specify otherwise with `npartitions`.

> **If you had 10 records each in 10 partitions**
> - The maximum number of records `.head()` will return is 10 with `.head(npartitions=1)` (the default).
> - `.head(npartitions=2)` would return at most the number of records held in 2 partitions (i.e., 20).
>

In [20]:
people.head(5000, npartitions = 2)

Unnamed: 0,age,name,occupation,telephone,address,credit-card
0,39,"(Merrill, Wagner)",Agent,+1-(703)-357-7115,"{'address': '753 Hillside Plantation', 'city':...","{'number': '4251 5038 0465 4953', 'expiration-..."
1,17,"(Ignacio, Salas)",Land Agent,(367) 755-7299,"{'address': '462 Third Annex', 'city': 'Goodle...","{'number': '5181 0643 7154 8682', 'expiration-..."
2,17,"(Earnest, Roach)",Bank Manager,729.976.1387,"{'address': '810 Nauman Mews', 'city': 'Park F...","{'number': '2508 4209 3314 7913', 'expiration-..."
3,53,"(Damian, Brown)",Purchasing Assistant,824-391-7395,"{'address': '668 Vale Extension', 'city': 'Nam...","{'number': '3494 116142 31651', 'expiration-da..."
4,20,"(Eugene, Prince)",Publisher,(253) 401-3378,"{'address': '328 Ward Road', 'city': 'Anderson'}","{'number': '3435 828153 99699', 'expiration-da..."
...,...,...,...,...,...,...
995,59,"(Marshall, Delacruz)",Training Assistant,1-907-337-2234,"{'address': '270 Diamond Heights Bypass', 'cit...","{'number': '2241 3893 5858 7212', 'expiration-..."
996,54,"(Un, Collins)",Underwriter,293-703-5398,"{'address': '636 Peninsula Spur', 'city': 'New...","{'number': '4183 5339 1508 3217', 'expiration-..."
997,57,"(Letisha, Douglas)",Gallery Owner,(379) 252-5926,"{'address': '649 Bercut Access Grove', 'city':...","{'number': '4521 6968 7967 2035', 'expiration-..."
998,35,"(Dewey, Ball)",Yard Manager,(845) 505-1116,"{'address': '445 Crescent Side road', 'city': ...","{'number': '5146 9718 1690 5018', 'expiration-..."


### Get a specific partition with `.get_partition(n)`

If you need a specific chunk, `.get_partition(n)` is handy to know and, like any other Dask DataFrame, works well with `.head()` or `.compute()`.

> It's also "lazy".  Note, too, that the index is left as it was; it isn't reset for the partition you pull.

In [28]:
people.get_partition(4).describe().compute()

Unnamed: 0,age
count,1000.0
mean,41.14
std,15.002382
min,16.0
25%,28.0
50%,41.0
75%,54.0
max,66.0


### Aggregates work great as well.

In [33]:
people['age'].value_counts().compute()

58    225
30    218
18    211
61    211
56    210
37    210
39    209
41    207
52    206
17    205
26    204
27    204
25    204
36    204
48    203
22    203
32    201
33    201
42    201
62    201
46    199
53    198
44    198
57    197
34    197
38    197
45    197
35    196
16    195
43    195
31    194
29    194
24    194
50    193
64    192
49    191
65    190
55    190
20    189
54    189
60    189
21    188
59    186
63    186
40    184
19    181
66    181
47    178
51    170
23    167
28    167
Name: age, dtype: int64

In [34]:
# groupby
people.groupby(["age"])['name'].count().compute()

age
16    195
17    205
18    211
19    181
20    189
21    188
22    203
23    167
24    194
25    204
26    204
27    204
28    167
29    194
30    218
31    194
32    201
33    201
34    197
35    196
36    204
37    210
38    197
39    209
40    184
41    207
42    201
43    195
44    198
45    197
46    199
47    178
48    203
49    191
50    193
51    170
52    206
53    198
54    189
55    190
56    210
57    197
58    225
59    186
60    189
61    211
62    201
63    186
64    192
65    190
66    181
Name: name, dtype: int64

In [35]:
# mean age
people['age'].mean().compute()

40.9719

### `.map()` and `.apply()` are also still the same.

A common strategy is to test your map with a smaller subset `.head()`.
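One way to run that check is sketched below in plain Pandas. The `sample` frame is a stand-in for what `people.head()` returns, and the helper names `first_name` and `last_name` are hypothetical.

```python
import pandas as pd

# Stand-in for `people.head()`: a small Pandas frame whose `name`
# column holds (first, last) tuples, as in the people dataset.
sample = pd.DataFrame({"name": [("Merrill", "Wagner"), ("Ignacio", "Salas")]})


def first_name(name_parts):
    """Return the first element of a (first, last) tuple."""
    return name_parts[0]


def last_name(name_parts):
    """Return the second element of a (first, last) tuple."""
    return name_parts[1]


# Try the functions on the small sample before touching the full dataset.
preview = sample.assign(
    first_name=sample["name"].map(first_name),
    last_name=sample["name"].map(last_name),
)
print(preview)
```

Once the preview looks right, the same functions can be handed to the Dask column, for example `people['name'].map(first_name, meta=('first_name', 'object'))`. Passing `meta` is optional, but it spares Dask from guessing the output type.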

In [37]:
# Basic map function
people['first_name'] = people['name'].map(lambda part: part[0])
people['last_name'] = people['name'].map(lambda part: part[1])

Then once you've got it right, assign an entirely new column.

In [38]:

# Create new features from nested tuple features
people.compute()

Unnamed: 0,age,name,occupation,telephone,address,credit-card,first_name,last_name
0,39,"(Merrill, Wagner)",Agent,+1-(703)-357-7115,"{'address': '753 Hillside Plantation', 'city':...","{'number': '4251 5038 0465 4953', 'expiration-...",Merrill,Wagner
1,17,"(Ignacio, Salas)",Land Agent,(367) 755-7299,"{'address': '462 Third Annex', 'city': 'Goodle...","{'number': '5181 0643 7154 8682', 'expiration-...",Ignacio,Salas
2,17,"(Earnest, Roach)",Bank Manager,729.976.1387,"{'address': '810 Nauman Mews', 'city': 'Park F...","{'number': '2508 4209 3314 7913', 'expiration-...",Earnest,Roach
3,53,"(Damian, Brown)",Purchasing Assistant,824-391-7395,"{'address': '668 Vale Extension', 'city': 'Nam...","{'number': '3494 116142 31651', 'expiration-da...",Damian,Brown
4,20,"(Eugene, Prince)",Publisher,(253) 401-3378,"{'address': '328 Ward Road', 'city': 'Anderson'}","{'number': '3435 828153 99699', 'expiration-da...",Eugene,Prince
...,...,...,...,...,...,...,...,...
995,25,"(Jay, Arnold)",Progress Chaser,1-022-762-6396,"{'address': '334 Lawton Trail', 'city': 'Platt...","{'number': '2223 7916 9728 4199', 'expiration-...",Jay,Arnold
996,39,"(Marquis, Buchanan)",Cable Contractor,(348) 879-0051,"{'address': '142 Blythdale High Street', 'city...","{'number': '4796 3609 2481 8204', 'expiration-...",Marquis,Buchanan
997,17,"(Jasper, Sweet)",Accounts Manager,470.506.9562,"{'address': '925 Mclea Turnpike', 'city': 'Elm...","{'number': '5243 2441 3222 2399', 'expiration-...",Jasper,Sweet
998,50,"(Winford, Grimes)",Metal Dealer,712-377-7785,"{'address': '672 Normandie Grove', 'city': 'Ha...","{'number': '4704 4094 6493 6453', 'expiration-...",Winford,Grimes


### Our map and apply functions don't run until `.compute()` is run.

If you queue up a bunch of transformations, they will all be added to the DAG.

In [40]:
%%time
people.compute()

CPU times: user 62.5 ms, sys: 15.6 ms, total: 78.1 ms
Wall time: 603 ms


Unnamed: 0,age,name,occupation,telephone,address,credit-card,first_name,last_name
0,39,"(Merrill, Wagner)",Agent,+1-(703)-357-7115,"{'address': '753 Hillside Plantation', 'city':...","{'number': '4251 5038 0465 4953', 'expiration-...",Merrill,Wagner
1,17,"(Ignacio, Salas)",Land Agent,(367) 755-7299,"{'address': '462 Third Annex', 'city': 'Goodle...","{'number': '5181 0643 7154 8682', 'expiration-...",Ignacio,Salas
2,17,"(Earnest, Roach)",Bank Manager,729.976.1387,"{'address': '810 Nauman Mews', 'city': 'Park F...","{'number': '2508 4209 3314 7913', 'expiration-...",Earnest,Roach
3,53,"(Damian, Brown)",Purchasing Assistant,824-391-7395,"{'address': '668 Vale Extension', 'city': 'Nam...","{'number': '3494 116142 31651', 'expiration-da...",Damian,Brown
4,20,"(Eugene, Prince)",Publisher,(253) 401-3378,"{'address': '328 Ward Road', 'city': 'Anderson'}","{'number': '3435 828153 99699', 'expiration-da...",Eugene,Prince
...,...,...,...,...,...,...,...,...
995,25,"(Jay, Arnold)",Progress Chaser,1-022-762-6396,"{'address': '334 Lawton Trail', 'city': 'Platt...","{'number': '2223 7916 9728 4199', 'expiration-...",Jay,Arnold
996,39,"(Marquis, Buchanan)",Cable Contractor,(348) 879-0051,"{'address': '142 Blythdale High Street', 'city...","{'number': '4796 3609 2481 8204', 'expiration-...",Marquis,Buchanan
997,17,"(Jasper, Sweet)",Accounts Manager,470.506.9562,"{'address': '925 Mclea Turnpike', 'city': 'Elm...","{'number': '5243 2441 3222 2399', 'expiration-...",Jasper,Sweet
998,50,"(Winford, Grimes)",Metal Dealer,712-377-7785,"{'address': '672 Normandie Grove', 'city': 'Ha...","{'number': '4704 4094 6493 6453', 'expiration-...",Winford,Grimes


### Persisting

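If our dataset can fit into memory, we may want to use `.persist()`.  It runs the graph built so far and keeps the results in memory, so later computations start from the already-transformed data instead of repeating every step.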

In [41]:
people = people.persist()

Persisting a portion in memory is also pretty handy for developing on a smaller scale to test your code before scaling up.  There's always a question to be asked: "Do I need to keep working in Dask?"  Sometimes you just need a sample, and then you can break off a piece and move back to Pandas.  That's totally fine too if you're only examining a portion of the data for a while, for instance looking at a time series for a specific period rather than the entire history.

In [42]:
%%time

people.compute()

CPU times: user 15.6 ms, sys: 46.9 ms, total: 62.5 ms
Wall time: 84.6 ms


Unnamed: 0,age,name,occupation,telephone,address,credit-card,first_name,last_name
0,39,"(Merrill, Wagner)",Agent,+1-(703)-357-7115,"{'address': '753 Hillside Plantation', 'city':...","{'number': '4251 5038 0465 4953', 'expiration-...",Merrill,Wagner
1,17,"(Ignacio, Salas)",Land Agent,(367) 755-7299,"{'address': '462 Third Annex', 'city': 'Goodle...","{'number': '5181 0643 7154 8682', 'expiration-...",Ignacio,Salas
2,17,"(Earnest, Roach)",Bank Manager,729.976.1387,"{'address': '810 Nauman Mews', 'city': 'Park F...","{'number': '2508 4209 3314 7913', 'expiration-...",Earnest,Roach
3,53,"(Damian, Brown)",Purchasing Assistant,824-391-7395,"{'address': '668 Vale Extension', 'city': 'Nam...","{'number': '3494 116142 31651', 'expiration-da...",Damian,Brown
4,20,"(Eugene, Prince)",Publisher,(253) 401-3378,"{'address': '328 Ward Road', 'city': 'Anderson'}","{'number': '3435 828153 99699', 'expiration-da...",Eugene,Prince
...,...,...,...,...,...,...,...,...
995,25,"(Jay, Arnold)",Progress Chaser,1-022-762-6396,"{'address': '334 Lawton Trail', 'city': 'Platt...","{'number': '2223 7916 9728 4199', 'expiration-...",Jay,Arnold
996,39,"(Marquis, Buchanan)",Cable Contractor,(348) 879-0051,"{'address': '142 Blythdale High Street', 'city...","{'number': '4796 3609 2481 8204', 'expiration-...",Marquis,Buchanan
997,17,"(Jasper, Sweet)",Accounts Manager,470.506.9562,"{'address': '925 Mclea Turnpike', 'city': 'Elm...","{'number': '5243 2441 3222 2399', 'expiration-...",Jasper,Sweet
998,50,"(Winford, Grimes)",Metal Dealer,712-377-7785,"{'address': '672 Normandie Grove', 'city': 'Ha...","{'number': '4704 4094 6493 6453', 'expiration-...",Winford,Grimes


### Avoid calling `.compute()` too often.

Whenever you run a compute, remember that it carries every preceding step in the graph with it, which can be expensive.

In [44]:
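%time
# Note: `%time` alone on its own line times an empty statement, hence the
# microsecond wall time below; `%%time` as the first line would time the whole cell.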

# This will run the graph 2x
min_age = people['age'].min().compute()
max_age = people['age'].max().compute()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 7.87 µs


In [45]:
# This will run it 1x -- combine tasks when possible
min_age, max_age = dask.compute(people['age'].min(), people['age'].max())

In [47]:
min_age, max_age

(16, 66)

### Using Parquet Files

You may get to a point where lots of transformations have been performed, such as when you have cleaned your data or engineered a bunch of features.  Rather than carry those tasks along in the graph as you continue working, you may want to save your data.

Dask supports most of the standard formats that Pandas does.  One format that is good to know how to work with is Apache Parquet.

<center><img src="https://snipboard.io/q1Lx25.jpg"></center>

Parquet is:
- Cross-platform
- Fast
- Strongly typed
- [Columnar-based](https://www.techwell.com/techwell-insights/2020/05/benefits-using-columnar-storage-relational-database-management-systems)  
- Compatible with many big data systems
- Supports compression encodings

Also, Dask's implementation of Parquet supports `storage_options` for cloud storage such as `s3://` or `gcs://`:
```python
people.to_parquet(
    "s3://my_bucket/my_directory/people.parquet",
    # Option names depend on the storage backend; `key`/`secret` are the
    # s3fs names (`account_name`/`account_key` belong to Azure storage).
    storage_options = {
        'key': 'ACCESS_KEY_ID',
        'secret': 'SECRET_ACCESS_KEY'
    }, 
    ...
)
```

>  Another good one is **Avro**.  Avro files also have a lot of the same benefits.  If you are on Google Cloud and load a lot of data to BigQuery, the last time I checked, you don't get charged for using the Avro format.  Choosing the right format for larger datasets depends on the use case.  Using Avro with `DataFrame` is a bit trickier unless you convert back to `Bag` format then define a schema with a JSON file.  You can read about it more in [the docs](https://docs.dask.org/en/latest/bag-creation.html).

## Saving to formats like Parquet or Avro requires you to be a bit more specific about your data.

One of the big problems you'll run into with these types of formats is that `object` isn't specific enough to allow saving to Parquet.  Parquet tries to save the column as a UTF8 string, but at the lowest level in the DataFrame, the `name` field is actually a `tuple`.

In [48]:
# inspect name feature
people['name'].head()

0    (Merrill, Wagner)
1     (Ignacio, Salas)
2     (Earnest, Roach)
3      (Damian, Brown)
4     (Eugene, Prince)
Name: name, dtype: object

In [49]:
# attempt to save file and see error
people.to_parquet("../data/people.parquet", object_encoding='UTF8')

ValueError: Object encoding (UTF8) not one of infer|utf8|bytes|json|bson|bool|int|int32|float|decimal

### Question:  What should we do?

In [51]:
# Attempt something here and report in thread.
features = ['first_name', 'last_name', 'age', 'occupation', 'telephone']
people[features].to_parquet('../data/people.parquet')
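The cell above sidesteps the problem by keeping only columns that already hold plain values. If the nested columns need to be stored as well, another option is to turn them into strings first. The sketch below shows the idea in plain Pandas on a one-row stand-in for the people data; joining the name with a space and serializing the address with `json.dumps` are illustrative choices, not the only ones.

```python
import json

import pandas as pd

# One-row stand-in for the people data: a tuple column and a dict column.
sample = pd.DataFrame({
    "name": [("Eugene", "Prince")],
    "address": [{"address": "328 Ward Road", "city": "Anderson"}],
})

# Convert each nested Python object to a plain string.
flat = sample.assign(
    name=sample["name"].map(" ".join),        # ("Eugene", "Prince") -> "Eugene Prince"
    address=sample["address"].map(json.dumps),  # dict -> JSON text
)

print(flat.iloc[0].tolist())

# The JSON text can be turned back into a dict after loading.
restored = json.loads(flat["address"].iloc[0])
print(restored["city"])
```

The same `.map` calls can be applied to the Dask columns before calling `.to_parquet`, so every `object` column holds only strings by the time it is written.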

## One more thing:  Working with Pandas

So far we've looked into working with Dask DataFrames while staying within the Dask ecosystem.  In the real world you may start a project in Pandas and then find the scope of your project creeping up on you.  Or perhaps you have a [gridsearch problem](https://ml.dask.org/hyper-parameter-search.html) that has grown out of control with sklearn.  It's fairly straightforward to convert your Pandas DataFrames over.

In [52]:
import pandas as pd

pokemon = pd.read_csv("../data/pokemon.csv")

Using `dask.dataframe.from_pandas()` we have to define either `npartitions=n` or `chunksize=n`.

- `npartitions`: Dask will attempt to divide your rows equally among `n` partitions.  If we have 800 rows and we say `npartitions=8`, it will create a Dask `DataFrame` with 8 partitions of 100 rows each.
- `chunksize`: Dask creates as many partitions as it needs so that each one holds roughly the number of rows you specify.


> #### General rules of thumb with Pandas to -> Dask
>
> - Stay in Pandas unless you have a real need to scale.  Pandas is faster to program in, but there are benefits to parallelizing your code for CPU-intensive transformations.
> - For partition size, try to spread your data evenly across your workers.
> - Don't be afraid to convert back from Dask to Pandas for portions of your project when it makes sense.

In [57]:
pokedask =  dask.dataframe.from_pandas(pokemon, npartitions = 8)
pokedask

| npartitions=8 | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | int64 | object | object | object | int64 | int64 | int64 | int64 | int64 | int64 | int64 | bool |
| 100 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 700 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 799 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


## Scaling to Cloud

I'm sure someone will have asked about this by now.  For the most part we've explored the Dask framework for parallel computing on a single machine.  There will be a time when you need to set up a real cluster and move beyond your laptop.  My recommended options include:

### Kubernetes

Kubernetes is a framework for managing containerized resources.  There's a bit of a learning curve with Kubernetes, but it's very robust in terms of booting up containers and even autoscaling based on usage.  There are plenty of cloud providers that offer hosted Kubernetes solutions, and the best one, not surprisingly, is Google's (Kubernetes originated there).

[Kubernetes + Helm](https://docs.dask.org/en/latest/setup/kubernetes-helm.html)

### YARN on AWS EMR or Google Cloud Dataproc

YARN is the cluster resource manager from the Hadoop ecosystem, and both AWS (EMR) and Google Cloud (Dataproc) offer hosted versions of this well-supported service.

[yarn.dask.org](https://yarn.dask.org/en/latest/)

### Dask Cloud Provider

The Dask Cloud Provider library includes a handful of APIs for deploying directly to cloud-specific VMs.

[Dask Cloud Provider](https://cloudprovider.dask.org/en/latest/azure.html) 


### Saturncloud.io - Vendor specific

Saturncloud.io exclusively provides Dask services for your choice of cloud provider and has their own API for integration.  Their strength is in having a complete end-to-end data science set of services that include Jupyter, GPU-enabled hardware, and some 3rd-party integrations.  What I like about this service is that if you want to get started with a Jupyter instance and have it outfitted to a Dask cluster with minimal effort, this is a great option.

[Saturncloud.io](https://www.saturncloud.io/s/)

### Resources

- A handy [Dask cheatsheet](https://docs.dask.org/en/latest/cheatsheet.html) from Anaconda.
- A great but intense look at [optimizing graphs](https://docs.dask.org/en/latest/optimize.html) from the Dask docs.
- Newer Gridsearch alternative "Hyperband" for [hyperparameter tuning](https://examples.dask.org/machine-learning/hyperparam-opt.html) with Dask.
- Using Dask Array to [scale Sklearn predictions](https://examples.dask.org/machine-learning/parallel-prediction.html).
- [Swifter](https://github.com/jmcarpenter2/swifter) is an interesting wrapper for Pandas that automatically loads Dask as an extension for quick and easy jobs.

# Summary

- A Dask DataFrame is comprised of smaller DataFrames (partitions).
- Like Array, not every method from Pandas is implemented in Dask DataFrames.

### Dask DataFrames are still lazy!

Nothing is computed until you ask.  

### `.compute()` and `.head()` return Pandas DataFrames

- `.compute()` will process all records through the graph.
- `.head()` will only process the first partition by default.

### Functions like `.info` and `.describe()` don't tell the whole story

With large datasets, these methods might actually be impractical to run in full, and they are also "lazy."  It's still a good idea to use samples or smaller subsets.  It's not a bad idea to use `get_partition` with these to inspect the data a bit more granularly.

### Whenever you add an operation without `.compute`, it adds it to the graph

This is desirable because you want to avoid calling `.compute()` more than necessary once you're working with a large dataset.  Keep in mind what it implies, though: if you load a bunch of CSV files and clean them before running an aggregation such as `df['age'].max()`, Dask executes the entire graph, including all of the loading and cleaning, before it gets to the aggregation.

### Use `.persist()` to incrementally store data in memory.

As you iteratively update code and write your transformations, the persist function is handy for storing a portion or all of your data in memory when possible.

### Formats for storing files like Parquet are a bit more specific.

For literal Python objects and JSON, you will have to convert them explicitly to strings before storage, even if they are typed as `object` in the DataFrame.  This type of format enforces stricter control over the data it expects.  However, you do gain the ability to save your data to different cloud bucket services like S3.