Now that we've seen Dask in action, let's see how you can use it much as you would Pandas. The dataset we're using here is small, so we can focus on the Dask API for analytics.

### Let's See Some Code

Before we go any further, let's take a look at one common use case for Dask: scaling Pandas dataframes to 
* larger datasets (which don't fit in memory) and 
* multiple processes (which could be on multiple nodes)

In [1]:
from dask.distributed import Client

client = Client(n_workers=4)

client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 64063 instead


Client: Scheduler: tcp://127.0.0.1:64064, Dashboard: http://127.0.0.1:64063/status
Cluster: Workers: 4, Cores: 8, Memory: 8.59 GB


In [2]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/beer_small.csv', blocksize=6e6)

In [3]:
ddf

,Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
npartitions=4,,,,,,,,,,,,,,
,int64,int64,object,int64,float64,float64,float64,object,object,float64,float64,object,float64,int64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


### What is this Dask Dataframe?

A large, virtual dataframe divided along the index into multiple Pandas dataframes:

<img src="images/dask-dataframe.svg" width="400px">

In [4]:
# See that we actually have a collection of Pandas DataFrames
ddf.map_partitions(type).compute()

0    <class 'pandas.core.frame.DataFrame'>
1    <class 'pandas.core.frame.DataFrame'>
2    <class 'pandas.core.frame.DataFrame'>
3    <class 'pandas.core.frame.DataFrame'>
dtype: object

In [5]:
# View head of Dask DataFrame
ddf.head()

,Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,784200,952,Great Dane Pub & Brewing Company (Downtown),1136269921,4.5,4.0,4.0,dirtylou,American IPA,4.0,4.0,Texas Speedbump IPA,,11846
1,1305265,29,Anheuser-Busch,1234830966,4.5,4.0,3.0,talkinghatrack,Light Lager,3.0,4.0,Bud Light Lime,4.2,41821
2,1526298,45,Brooklyn Brewery,1078599557,4.5,4.0,4.0,PopeJonPaul,Scotch Ale / Wee Heavy,4.0,4.5,Brooklyn Heavy Scotch Ale,7.5,16355
3,450647,590,New Glarus Brewing Company,1288790879,4.5,4.5,4.5,sweemzander,American Wild Ale,4.5,4.0,R&D Bourbon Barrel Kriek,5.5,60588
4,1223094,4,Allagash Brewing Company,1295320417,4.5,4.5,4.0,Jmoore50,American Wild Ale,4.0,4.0,Allagash Victor Francenstein,9.7,56665


### Ratings as a function of beer type

In [6]:
ratings = ddf.groupby('beer_style').review_overall.mean()
ratings.compute()

beer_style
Altbier                       3.825748
American Adjunct Lager        3.011778
American Amber / Red Ale      3.779610
American Amber / Red Lager    3.598146
American Barleywine           3.889695
                                ...   
Vienna Lager                  3.725216
Weizenbock                    4.014408
Wheatwine                     3.810026
Winter Warmer                 3.711612
Witbier                       3.769504
Name: review_overall, Length: 104, dtype: float64

`compute` doesn't just run the work; it also collects the result into a single, regular Pandas object (here, a Series) in the Python process we started in.


In [7]:
ratings.compute().sort_values()

beer_style
Low Alcohol Beer                    2.551282
American Malt Liquor                2.676039
Light Lager                         2.736338
Euro Strong Lager                   2.865979
Happoshu                            2.950000
                                      ...   
American Double / Imperial Stout    4.022270
Lambic - Unblended                  4.077982
Gueuze                              4.081597
Quadrupel (Quad)                    4.091667
American Wild Ale                   4.105042
Name: review_overall, Length: 104, dtype: float64

### A Deep Dive into IPAs

In [8]:
# Check out IPAs
ddf[ddf.beer_style.str.contains('IPA')].head()

,Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,784200,952,Great Dane Pub & Brewing Company (Downtown),1136269921,4.5,4.0,4.0,dirtylou,American IPA,4.0,4.0,Texas Speedbump IPA,,11846
9,426580,666,Emerson's Brewery,1192461083,5.0,4.0,4.5,Lukie,English India Pale Ale (IPA),4.0,5.0,1812 India Pale Ale,4.7,4594
24,728901,17963,Nectar Ales,1312873910,3.5,4.0,3.5,Sensaray,American IPA,3.5,3.5,IPA Nectar,6.8,9024
26,745463,12877,NINE G Brewing Company,1189556274,4.0,4.5,4.0,Phatz,American Double / Imperial IPA,4.0,4.5,Infidel Imperial IPA,8.4,31041
28,94239,140,Sierra Nevada Brewing Co.,1269655771,4.0,4.5,4.5,CaptainIPA,American IPA,4.5,4.5,Sierra Nevada Torpedo Extra IPA,7.2,30420


In [9]:
ipa = ddf[ddf.beer_style.str.contains('IPA')]
mean_ipa_review = ipa.groupby('brewery_name').review_overall.agg(['mean','count'])
mean_ipa_review.compute()

brewery_name,mean,count
(512) Brewing Company,3.785714,7
1516 Brewing Company,4.000000,1
21st Amendment Brewery,3.923469,98
7 Seas Brewery and Taproom,4.000000,1
8 Wired Brewing Co.,4.250000,2
...,...,...
Three Needs Brewery & Taproom,4.000000,1
Thunderhead Brewing Company,4.500000,1
Tofino Brewing Company,4.500000,1
barVolo,4.000000,1


In [10]:
mean_ipa_review.nlargest(20, 'mean').compute()

brewery_name,mean,count
Elk Mountain Brewing,5.0,1
Pioneer Brewing Co.,5.0,2
Burnside Brewing Co.,5.0,1
Feral Brewing Co.,5.0,1
Flour City Brewing Co.,5.0,1
La Jolla Brew House,5.0,1
Uncle Buck's Brewery & Steakhouse,5.0,1
Crouch Vale Brewery Limited,5.0,1
Glacier Brewhouse,4.875,4
The Kernel Brewery,4.75,2


As noted above, `compute` doesn't just run the work; it also collects the result into a single, regular Pandas dataframe in the Python process we started in.

Having a local result is convenient, but if we are generating large results, we may want (or need) to write the output to the filesystem in parallel instead.

Each read method has a write counterpart that we can use:

- `read_csv` / `to_csv`
- `read_hdf` / `to_hdf`
- `read_json` / `to_json`
- `read_parquet` / `to_parquet`

In [11]:
mean_ipa_review.to_csv('ipa-*.csv')  # the * is where the partition number will go

['/Users/hugobowne-anderson/Downloads/data-science-at-scale-master/ipa-0.csv']

In [12]:
client.close()