<p float="center">
  <img src="images/horizontal.png" alt="Coiled logo" width="415" hspace="10"/>
  <img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" width="415" hspace="10" />
</p>

# Scalable DataFrames

Now that we've seen Dask in action, let's see how you can use it much as you would use Pandas.

In this notebook, we'll 

* set up a local cluster on our machine by instantiating a Dask Client
* load a CSV file into a Dask DataFrame
* inspect that DataFrame's internals to see how it's built
* perform calculations on that Dask DataFrame just as we would with a Pandas DataFrame

Though we'll use a small dataset here to showcase the Dask API for analytics, keep in mind that the most common use cases for Dask are scaling Pandas DataFrames to

* larger datasets (which don't fit in memory) and
* multiple processes (which could be on multiple nodes)

*A bit about me:* I'm Hugo Bowne-Anderson, Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/). We build products that bring the power of scalable data science and machine learning to you, such as single-click hosted clusters on the cloud. We want to take the DevOps out of data science so you can get back to your real job. If you're interested in taking Coiled for a test drive, you can sign up for our [free Beta here](https://beta.coiled.io/).

## 1. Setting up a local cluster with Dask

<img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" style="width: 500px;">

Let's get coding. It's trivial to instantiate a Dask Client, which allows us to set up a local cluster on our computer. Here, we'll specify that we want four Dask workers with `n_workers=4`. Reminder:

* The *client* is the user-facing entry point for cluster users. The client passes your instructions (the code you write!) to the *scheduler*.
* The *scheduler* listens to these instructions and sends tasks to the *workers* accordingly.
* The *workers* compute the tasks.

In [1]:
from dask.distributed import Client

client = Client(n_workers=4)

client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 55001 instead


| Client | Cluster |
| --- | --- |
| Scheduler: tcp://127.0.0.1:55002 | Workers: 4 |
| Dashboard: http://127.0.0.1:55001/status | Cores: 8 |
| | Memory: 8.59 GB |


We have now set up a local cluster with 4 workers (and 8 cores in total), and we can spread work across them to speed up the execution of our code.

It may seem unintuitive at first, but starting the Dask Client is *optional*: you don't need one to use Dask DataFrames. However, a client provides a dashboard that gives useful insight into the computation. The dashboard link appears in the client summary above. Dask recommends keeping it open on one side of your screen while you work in your notebook on the other. Arranging your windows can take some effort, but seeing both at the same time is very useful when learning.

## 2. Reading data into a Dask DataFrame

The plan:

* read data into a Dask DataFrame
* look at our data's structure via the structure of our Dask DataFrame

We can read a CSV file that contains beer data into a Dask DataFrame as follows. You'll notice that the Dask code is very similar to the equivalent Pandas code for reading in CSV files.

In [6]:
import dask.dataframe

# blocksize caps each partition at roughly 6 MB of the file
ddf = dask.dataframe.read_csv('data/beer_small.csv', blocksize='6MB')

Let's take a look at the structure of the DataFrame.

In [7]:
ddf

| npartitions=4 | Unnamed: 0 | brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | int64 | int64 | object | int64 | float64 | float64 | float64 | object | object | float64 | float64 | object | float64 | int64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


A few things to notice:

* At the top, we see column names with data types. We can infer that each row in this dataset contains a review for a specific beer.
* On the left, we have `npartitions=4` (more on partitions shortly), which corresponds to the number of Pandas DataFrames under the hood here.
* We also see ellipses (`...`) under each of the column names. These are here because Dask DataFrames are lazy: they do not load values until we call `.compute()`, which we'll do shortly.

### What is this Dask DataFrame?

A large, virtual DataFrame divided (or *partitioned*) along the index into multiple Pandas DataFrames:

<img src="images/dask-dataframe.svg" width="400px">
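
As a rough mental model of the picture above, the following Pandas-only sketch cuts a small, made-up table into consecutive row chunks and stitches them back together. It only illustrates the idea of partitioning along the index; it is not how Dask stores or manages its partitions internally, and `chunk_size` is an arbitrary value chosen for the example.

```python
import pandas as pd

# Hypothetical toy table; the real dataset has many more rows and columns
pdf = pd.DataFrame(
    {
        "beer_name": ["A", "B", "C", "D", "E", "F"],
        "review_overall": [4.0, 3.5, 5.0, 2.5, 4.5, 3.0],
    }
)

# Cut the rows into consecutive chunks along the index, like partitions
chunk_size = 2
partitions = [
    pdf.iloc[start:start + chunk_size] for start in range(0, len(pdf), chunk_size)
]

for i, part in enumerate(partitions):
    print(f"partition {i}: index {part.index.min()}..{part.index.max()}, {len(part)} rows")

# Stitching the pieces back together recovers the original DataFrame
reassembled = pd.concat(partitions)
```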

**Recap:** Up to this point, we have

* imported `Client` from the `dask.distributed` module
* set up a local cluster on our computer by instantiating a Dask Client
* imported the `dask.dataframe` module (whose API mirrors much of Pandas)
* read a CSV file containing beer reviews into a (lazy) Dask DataFrame
* looked at the external structure of the Dask DataFrame

## 3. What's under the hood?

The `.map_partitions()` method of a Dask DataFrame applies a Python function to each DataFrame partition. Here, we'll apply the `type` function to display what's under the hood of a Dask DataFrame. Hint: 🐼.

In [8]:
# See that we actually have a collection of Pandas DataFrames
ddf.map_partitions(type).compute()

0    <class 'pandas.core.frame.DataFrame'>
1    <class 'pandas.core.frame.DataFrame'>
2    <class 'pandas.core.frame.DataFrame'>
3    <class 'pandas.core.frame.DataFrame'>
dtype: object

We actually have a collection of four Pandas DataFrames! Note that we had to call `.compute()` to get this result because Dask is lazy.

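We can call the `.head()` method to display the first five rows of beer data. Unlike most Dask DataFrame operations, `.head()` runs immediately and returns a regular Pandas DataFrame, so no `.compute()` is needed.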

In [9]:
# View head of Dask DataFrame
ddf.head()

| | Unnamed: 0 | brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 784200 | 952 | Great Dane Pub & Brewing Company (Downtown) | 1136269921 | 4.5 | 4.0 | 4.0 | dirtylou | American IPA | 4.0 | 4.0 | Texas Speedbump IPA | | 11846 |
| 1 | 1305265 | 29 | Anheuser-Busch | 1234830966 | 4.5 | 4.0 | 3.0 | talkinghatrack | Light Lager | 3.0 | 4.0 | Bud Light Lime | 4.2 | 41821 |
| 2 | 1526298 | 45 | Brooklyn Brewery | 1078599557 | 4.5 | 4.0 | 4.0 | PopeJonPaul | Scotch Ale / Wee Heavy | 4.0 | 4.5 | Brooklyn Heavy Scotch Ale | 7.5 | 16355 |
| 3 | 450647 | 590 | New Glarus Brewing Company | 1288790879 | 4.5 | 4.5 | 4.5 | sweemzander | American Wild Ale | 4.5 | 4.0 | R&D Bourbon Barrel Kriek | 5.5 | 60588 |
| 4 | 1223094 | 4 | Allagash Brewing Company | 1295320417 | 4.5 | 4.5 | 4.0 | Jmoore50 | American Wild Ale | 4.0 | 4.0 | Allagash Victor Francenstein | 9.7 | 56665 |


## 4. Doing #analytics with the pandas-like Dask API

So loading data with Dask was very pandas-like. Sweet! It gets better -- performing calculations with Dask is *also* very Pandas-like.

The plan:

* write Dask code to *set up* computations to group Dask DataFrames by certain columns and perform calculations on those groups
* write Dask code to *actually* compute and display your results
* write Pandas code to further interact with these results

### Ratings as a function of beer type

Let's calculate the average rating for each type of beer (e.g., lager, wheat beer, etc.). We'll group by the `beer_style` column then take the mean of the `review_overall` column.

In [10]:
ratings = ddf.groupby('beer_style').review_overall.mean()
ratings.compute()

beer_style
Altbier                       3.825748
American Adjunct Lager        3.011778
American Amber / Red Ale      3.779610
American Amber / Red Lager    3.598146
American Barleywine           3.889695
                                ...   
Vienna Lager                  3.725216
Weizenbock                    4.014408
Wheatwine                     3.810026
Winter Warmer                 3.711612
Witbier                       3.769504
Name: review_overall, Length: 104, dtype: float64

Note that `.compute()` doesn't just run the work: it also collects the result into a single, regular Pandas object (a Series here) in our local Python process.

In [11]:
ratings.compute().sort_values()

beer_style
Low Alcohol Beer                    2.551282
American Malt Liquor                2.676039
Light Lager                         2.736338
Euro Strong Lager                   2.865979
Happoshu                            2.950000
                                      ...   
American Double / Imperial Stout    4.022270
Lambic - Unblended                  4.077982
Gueuze                              4.081597
Quadrupel (Quad)                    4.091667
American Wild Ale                   4.105042
Name: review_overall, Length: 104, dtype: float64

**Recap:** Using our Dask DataFrame that contains our beer review data, we have

* written code to compute the average rating for each type of beer by chaining the Dask `.groupby()` and `.mean()` methods, and stored the [lazy computation's Dask graph](https://distributed.dask.org/en/latest/manage-computation.html) in `ratings`
* called `.compute()` on `ratings` to convert the lazy Dask collection to a concrete value in local memory (a Pandas Series here)
* called `.compute()` again, then used the Pandas `.sort_values()` method to sort the resulting Pandas Series by ascending beer rating
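
One way to picture why a grouped mean works well across partitions is as a two-stage computation: each partition reports a partial sum and count per group, and the partials are then combined. The sketch below shows that idea in plain Pandas on two made-up partitions (`part_a` and `part_b` are hypothetical); it illustrates the principle and is not a description of Dask's internal implementation.

```python
import pandas as pd

# Two hypothetical partitions of a reviews table
part_a = pd.DataFrame(
    {"beer_style": ["IPA", "Lager", "IPA"], "review_overall": [4.0, 3.0, 5.0]}
)
part_b = pd.DataFrame(
    {"beer_style": ["Lager", "IPA"], "review_overall": [2.0, 3.0]}
)

# Stage 1: each partition reports a partial sum and count per group
partials = [
    part.groupby("beer_style").review_overall.agg(["sum", "count"])
    for part in (part_a, part_b)
]

# Stage 2: add up the partials per group, then divide to get the mean
combined = pd.concat(partials).groupby(level=0).sum()
two_stage_mean = combined["sum"] / combined["count"]

# Same answer as grouping the full table in one go
direct_mean = pd.concat([part_a, part_b]).groupby("beer_style").review_overall.mean()
print(two_stage_mean)
```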

### Subsetting to dive deep into IPAs

Let's do another calculation! Since [IPAs](https://en.wikipedia.org/wiki/IPA) (India Pale Ales) are my favorite type of beer, let's repeat the above calculation, this time on an "IPA" subset of the dataset. The way we subset in Dask, again, is very Pandas-like:

In [12]:
# Check out IPAs
ddf[ddf.beer_style.str.contains('IPA')].head()

| | Unnamed: 0 | brewery_id | brewery_name | review_time | review_overall | review_aroma | review_appearance | review_profilename | beer_style | review_palate | review_taste | beer_name | beer_abv | beer_beerid |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 784200 | 952 | Great Dane Pub & Brewing Company (Downtown) | 1136269921 | 4.5 | 4.0 | 4.0 | dirtylou | American IPA | 4.0 | 4.0 | Texas Speedbump IPA | | 11846 |
| 9 | 426580 | 666 | Emerson's Brewery | 1192461083 | 5.0 | 4.0 | 4.5 | Lukie | English India Pale Ale (IPA) | 4.0 | 5.0 | 1812 India Pale Ale | 4.7 | 4594 |
| 24 | 728901 | 17963 | Nectar Ales | 1312873910 | 3.5 | 4.0 | 3.5 | Sensaray | American IPA | 3.5 | 3.5 | IPA Nectar | 6.8 | 9024 |
| 26 | 745463 | 12877 | NINE G Brewing Company | 1189556274 | 4.0 | 4.5 | 4.0 | Phatz | American Double / Imperial IPA | 4.0 | 4.5 | Infidel Imperial IPA | 8.4 | 31041 |
| 28 | 94239 | 140 | Sierra Nevada Brewing Co. | 1269655771 | 4.0 | 4.5 | 4.5 | CaptainIPA | American IPA | 4.5 | 4.5 | Sierra Nevada Torpedo Extra IPA | 7.2 | 30420 |
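
A practical note on the string filter, shown here as a Pandas-only sketch on made-up rows: if a style value were missing, `str.contains` would have no true/false answer for that row, so passing `na=False` treats it as "not an IPA". Whether the tutorial dataset actually has missing styles is an assumption of this example, and the `reviews` table is hypothetical; the Dask `.str` accessor is modeled on the Pandas one, so the same keyword is expected to apply there.

```python
import pandas as pd

# Hypothetical reviews, one of which has no recorded style
reviews = pd.DataFrame(
    {
        "beer_style": ["American IPA", None, "Light Lager", "English India Pale Ale (IPA)"],
        "review_overall": [4.5, 3.0, 2.5, 5.0],
    }
)

# na=False treats a missing style as "does not contain IPA"
is_ipa = reviews.beer_style.str.contains("IPA", na=False)
ipa_reviews = reviews[is_ipa]
print(ipa_reviews)
```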


On this subset of the data, let's compute the average rating again, this time grouped by brewery, and add a second aggregation that counts the reviews.

In [13]:
# Store the IPA subset in a variable
ipa = ddf[ddf.beer_style.str.contains('IPA')]

# Calculate mean rating and count of ratings for IPA subset
mean_ipa_review = ipa.groupby('brewery_name').review_overall.agg(['mean','count'])
mean_ipa_review.compute()

| brewery_name | mean | count |
| --- | --- | --- |
| (512) Brewing Company | 3.785714 | 7 |
| 1516 Brewing Company | 4.000000 | 1 |
| 21st Amendment Brewery | 3.923469 | 98 |
| 7 Seas Brewery and Taproom | 4.000000 | 1 |
| 8 Wired Brewing Co. | 4.250000 | 2 |
| ... | ... | ... |
| Three Needs Brewery & Taproom | 4.000000 | 1 |
| Thunderhead Brewing Company | 4.500000 | 1 |
| Tofino Brewing Company | 4.500000 | 1 |
| barVolo | 4.000000 | 1 |


Now we can pick out the 20 breweries with the highest mean IPA rating using `.nlargest()`.

In [14]:
mean_ipa_review.nlargest(20, 'mean').compute()

| brewery_name | mean | count |
| --- | --- | --- |
| Elk Mountain Brewing | 5.0 | 1 |
| Pioneer Brewing Co. | 5.0 | 2 |
| Burnside Brewing Co. | 5.0 | 1 |
| Feral Brewing Co. | 5.0 | 1 |
| Flour City Brewing Co. | 5.0 | 1 |
| La Jolla Brew House | 5.0 | 1 |
| Uncle Buck's Brewery & Steakhouse | 5.0 | 1 |
| Crouch Vale Brewery Limited | 5.0 | 1 |
| Glacier Brewhouse | 4.875 | 4 |
| The Kernel Brewery | 4.75 | 2 |


As noted above, `.compute()` doesn't just run the work: it also collects the result into a single, regular Pandas DataFrame in our local Python process.

**Recap:** Using our Dask DataFrame that contains our beer review data, we have

* subset the Dask DataFrame to contain only reviews whose `beer_style` contains "IPA"
* calculated the mean rating and the number of ratings per brewery for this IPA subset
* displayed the 20 highest-rated breweries and their review counts using our Dask DataFrame's `.nlargest()` method
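
Most of the top-ranked breweries in the output have only one or two reviews, so a mean of 5.0 says little. One possible refinement, sketched here in plain Pandas on a made-up summary table, is to require a minimum number of reviews before ranking. The brewery names and the `min_reviews` threshold are hypothetical choices for this example, not values from the tutorial.

```python
import pandas as pd

# Hypothetical per-brewery summary, shaped like the mean/count table above
summary = pd.DataFrame(
    {"mean": [5.0, 4.875, 4.2, 3.9], "count": [1, 4, 120, 98]},
    index=pd.Index(
        ["Brewery A", "Brewery B", "Brewery C", "Brewery D"], name="brewery_name"
    ),
)

# Keep breweries with at least `min_reviews` reviews, then rank by mean
min_reviews = 4  # arbitrary threshold chosen for this example
top_rated = summary[summary["count"] >= min_reviews].nlargest(2, "mean")
print(top_rated)
```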

## 5. Storing the results

Having a local result is convenient, but if we are generating large results, we may want (or need) to write the output to the filesystem in parallel instead.

There are write counterparts to the read functions that we can use:

- `read_csv` / `to_csv`
- `read_hdf` / `to_hdf`
- `read_json` / `to_json`
- `read_parquet` / `to_parquet`

In [15]:
mean_ipa_review.to_csv('ipa-*.csv')  # the * is replaced by the partition number

['/Users/hugobowne-anderson/Downloads/data-science-at-scale-master/ipa-0.csv']

## 6. Shut down the Dask Client

It's best practice to close the Dask Client when done.

In [16]:
client.close()
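
If you'd rather not remember to call `.close()`, a `Client` can also be used as a context manager. The sketch below is one way to do that; the arguments (`processes=False` to keep workers as threads in this process, `dashboard_address=None` to skip the dashboard) and the name `tmp_client` are choices made for this short example, not settings used elsewhere in this notebook.

```python
from dask.distributed import Client

# processes=False keeps the workers as threads inside this Python process;
# dashboard_address=None skips starting the dashboard for this short example
with Client(
    n_workers=2, threads_per_worker=1, processes=False, dashboard_address=None
) as tmp_client:
    # Hand a small task to the cluster and wait for its result
    total = tmp_client.submit(sum, [1, 2, 3]).result()

# Leaving the with-block closes the client and the local cluster it started
status_after = tmp_client.status
print(total, status_after)
```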

**Recap (of this entire notebook!):** We
* set up a local cluster on our machine by instantiating a Dask Client
* loaded a CSV file into a Dask DataFrame
* inspected that DataFrame's internals to see how it's built
* performed calculations on that Dask DataFrame just as we would with a Pandas DataFrame
* stored the results of our calculations in CSV files locally
* shut down the Dask Client when finished