<p float="center">
  <img src="images/horizontal.png" alt="Coiled logo" width="415" hspace="10"/>
  <img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" width="415" hspace="10" />
</p>

### Scalable DataFrames (feat. Wikimedia Traffic Data!)

We've witnessed the scaling power of Dask DataFrames. Let's do a mini-lab to get *even more comfortable* with them. Practice makes perfect.

The plan:

* Set up a Dask cluster.
* Read the Wikimedia Traffic Data into a Dask DataFrame.
* Clean up the columns.
* Run a few queries (filter, count, group by project) on the data.

*A bit about me:* I'm Hugo Bowne-Anderson, Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/). We build products that bring the power of scalable data science and machine learning to you, such as single-click hosted clusters on the cloud. We want to take the DevOps out of data science so you can get back to your real job. If you're interested in taking Coiled for a test drive, you can sign up for our [free Beta here](https://beta.coiled.io/).

## 1. Set up a Dask cluster

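First, let's create a `Client` backed by a local cluster with 4 workers. Threads and memory are left at their defaults here, which on this machine comes to 8 cores and 8.59 GB in total. To pin each worker to 1 thread and 1 GB of RAM, also pass `threads_per_worker=1` and `memory_limit='1GB'`.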

In [1]:
from dask.distributed import Client

client = Client(n_workers=4)

client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 55522 instead


* Client: Scheduler: tcp://127.0.0.1:55523, Dashboard: http://127.0.0.1:55522/status
* Cluster: Workers: 4, Cores: 8, Memory: 8.59 GB


## 2. Read in the Wikimedia Traffic Data

Now, we'll read data from the `pageviews_small.csv` file. We'll use Dask's `blocksize=` parameter to cap each partition at roughly 10 MB.

*Hint: use the `sep` parameter (passed through to pandas) to indicate that the columns are space-separated.*

In [2]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/pageviews_small.csv', sep=' ', blocksize='10MB')

ddf

| | en.m | Article_51 | 1 | 0 |
|---|---|---|---|---|
| npartitions=4 | | | | |
| | object | object | int64 | int64 |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
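
A side note on the column names in this output: `en.m`, `Article_51`, `1`, and `0` look like a data record rather than a header, which suggests the file has no header row and its first line was used as column names. The sketch below reproduces that effect with Python's standard `csv` module on a tiny in-memory sample. Only the first row is taken from the output above; the other two rows and the variable names are made up for illustration.

```python
import csv
import io

# Tiny stand-in for the space-separated file. Only the first row comes from
# the notebook output; the other two rows are hypothetical.
sample = "en.m Article_51 1 0\nen Main_Page 42 0\nfr Accueil 7 0\n"

# Default behaviour: the first line is treated as the header and lost as data.
implicit = list(csv.DictReader(io.StringIO(sample), delimiter=" "))

# Supplying the names explicitly keeps every line as a record.
names = ["project", "page", "requests", "x"]
explicit = list(
    csv.DictReader(io.StringIO(sample), fieldnames=names, delimiter=" ")
)

print(len(implicit), list(implicit[0].keys()))
print(len(explicit), explicit[0]["project"])
```

With `read_csv`, the analogous approach would be to pass `header=None` together with `names=[...]`, both of which pandas accepts. The rest of this notebook keeps the original call, so if the file really has no header, that first line is not part of the counts further down.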


## 3. Clean up the columns

Now, we'll rename the columns to `project`, `page`, `requests`, and `x`, then drop the `x` column.

In [3]:
ddf.columns = ['project', 'page', 'requests', 'x']

ddf2 = ddf.drop('x', axis=1)

ddf2

| | project | page | requests |
|---|---|---|---|
| npartitions=4 | | | |
| | object | object | int64 |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |


## 4. Filter for projects in English Wikipedia

Next, we'll filter for `project` matching "en" (English Wikipedia).

In [4]:
ddf3 = ddf2[ddf2.project == 'en']
ddf3

| | project | page | requests |
|---|---|---|---|
| npartitions=4 | | | |
| | object | object | int64 |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |


## 5. Count!

Let's count how many pages were accessed from English Wikipedia vs. all projects in this dataset. (Note: each project/page combination appears on a unique line, so this amounts to just counting records.)

In [5]:
ddf2.count().compute()  # all projects

project     1118999
page        1118988
requests    1118999
dtype: int64

In [6]:
ddf3.count().compute()  # English only

project     196882
page        196881
requests    196882
dtype: int64
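
In both results, `page` is slightly lower than `project` and `requests` (11 fewer overall, 1 fewer for English). That is expected: `count()` tallies non-missing values per column, so rows whose `page` was parsed as missing are skipped for that column only. Here is a plain-Python sketch of the same idea, using made-up rows where `None` stands in for a missing value (the helper name is hypothetical):

```python
# Hypothetical rows; None marks a value that was parsed as missing.
rows = [
    {"project": "en", "page": "Main_Page", "requests": 42},
    {"project": "en", "page": None, "requests": 3},
    {"project": "fr", "page": "Accueil", "requests": 7},
]


def count_non_missing(records):
    """Per-column count of non-missing values, mirroring what count() reports."""
    columns = records[0].keys()
    return {col: sum(r[col] is not None for r in records) for col in columns}


counts = count_non_missing(rows)
print(counts)     # {'project': 3, 'page': 2, 'requests': 3}
print(len(rows))  # 3 records in total
```

To count records regardless of missing values, `len(ddf2)` should give the number of rows directly.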

## 6. Counting other languages

What are the record counts for English (en), French (fr), Chinese (zh), and Polish (pl)?

After `groupby('project').count()`, `project` becomes the index of the result. `isin` isn't supported on the Dask DataFrame index, so we call `reset_index` to move `project` back into a "regular" column and use `isin` on that.

In [7]:
ddf4 = ddf2.groupby('project').count().reset_index()

ddf4[ddf4.project.isin(['en', 'fr', 'zh', 'pl'])].compute()

| | project | page | requests |
|---|---|---|---|
| 230 | en | 196881 | 196882 |
| 308 | fr | 33915 | 33915 |
| 742 | pl | 11931 | 11931 |
| 1079 | zh | 17577 | 17577 |
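
The query above counts every project and then keeps four of them. Since only four groups are needed, an alternative is to filter rows on the `project` column first and group afterwards, which avoids the `reset_index` step. The plain-Python sketch below, on made-up records, checks that both orders give the same per-project counts. It illustrates the logic only and is not Dask code.

```python
from collections import Counter

# Hypothetical (project, page) records standing in for the traffic data.
records = [
    ("en", "Page_A"), ("en", "Page_B"), ("fr", "Page_C"),
    ("de", "Page_D"), ("pl", "Page_E"), ("zh", "Page_F"), ("en", "Page_G"),
]
wanted = {"en", "fr", "zh", "pl"}

# Order 1: count every project, then keep the wanted ones.
all_counts = Counter(project for project, _ in records)
group_then_filter = {p: n for p, n in all_counts.items() if p in wanted}

# Order 2: keep the wanted rows first, then count per project.
filter_then_group = dict(Counter(p for p, _ in records if p in wanted))

print(group_then_filter)
print(filter_then_group)
```

In Dask, the second order should look roughly like `ddf2[ddf2.project.isin(['en', 'fr', 'zh', 'pl'])].groupby('project').count().compute()`, since `isin` is available on a regular column.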


## 7. Close the Client

As always, we make sure to close our client when done.

In [8]:
client.close()