<p align="center">
  <img src="images/horizontal.png" alt="Coiled logo" width="415" hspace="10"/>
  <img src="images/dask_horizontal_no_pad.svg" alt="Dask logo" width="200" hspace="10" />
</p>

### Scalable Data Analytics Lab: Wikimedia Traffic Data

This mini-lab gives you more practice with Dask DataFrames. You'll create a Dask cluster and run a few queries against Wikimedia traffic data.

__1. Create a Client__ and request 4 workers, each with 1 thread and 1 GB of RAM.

In [1]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, memory_limit='1GB')

client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 64333 instead


| Client | Cluster |
| --- | --- |
| Scheduler: tcp://127.0.0.1:64334 | Workers: 4 |
| Dashboard: http://127.0.0.1:64333/status | Cores: 4 |
| | Memory: 4.00 GB |


__2. Read data__ from the `pageviews_small.csv` file. Use Dask's `blocksize=` parameter to cap each partition at 10 MB.

*Hint: use pandas' `sep` parameter to indicate that the columns are space-separated.*

In [2]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/pageviews_small.csv', sep=' ', blocksize=10e6)

ddf

| | en.m | Article_51 | 1 | 0 |
| --- | --- | --- | --- | --- |
| npartitions=4 | | | | |
| | object | object | int64 | int64 |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
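As a rough mental model of how `blocksize` relates to the partition count, the sketch below divides a file size by the block size and rounds up. This is an approximation, not Dask's exact splitting logic (Dask also aligns block boundaries to line endings), and the 35 MB file size is a hypothetical value chosen for illustration. Under this model, the `npartitions=4` shown above would be consistent with a file between 30 and 40 MB.

```python
import math


def estimate_npartitions(file_size_bytes: int, blocksize: float) -> int:
    """Approximate partition count for one uncompressed file.

    Assumption: the file is cut into consecutive blocks of at most
    `blocksize` bytes, so the count is the size divided by the block
    size, rounded up.
    """
    return max(1, math.ceil(file_size_bytes / blocksize))


# Hypothetical 35 MB file with the 10 MB block size used in this lab.
print(estimate_npartitions(35_000_000, 10e6))
```

A smaller `blocksize` yields more, smaller partitions, which allows more parallelism but adds scheduling overhead per task.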


__3. Change the column names__ to `project`, `page`, `requests`, and `x`, then drop the `x` column.

In [3]:
ddf.columns = ['project', 'page', 'requests', 'x']

ddf2 = ddf.drop('x', axis=1)

ddf2

| | project | page | requests |
| --- | --- | --- | --- |
| npartitions=4 | | | |
| | object | object | int64 |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |
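The column labels in the step 2 output (`en.m`, `Article_51`, `1`, `0`) look like a data row rather than a header, which suggests the file has no header row and the first record was consumed as column names. If that is the case, an alternative is to supply the names at read time. The sketch below uses plain pandas on an in-memory string so it runs without the data file; the second and third sample rows are hypothetical. `dask.dataframe.read_csv` forwards these keyword arguments to pandas, so the same `header=None`, `names=`, and `usecols=` arguments should apply there as well.

```python
import io

import pandas as pd

# Space-separated sample in the same layout as the lab file.
# Only the first row mirrors the notebook output; the rest are made up.
sample = "en.m Article_51 1 0\nen Main_Page 42 0\nfr Accueil 7 0\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep=" ",
    header=None,  # treat the first line as data, not as column names
    names=["project", "page", "requests", "x"],
    usecols=["project", "page", "requests"],  # skip the unused column
)

print(df)
```

With this approach the first record stays in the data, so totals computed later would include it.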


__4. Filter__ for rows where `project` equals "en" (English Wikipedia).

In [4]:
ddf3 = ddf2[ddf2.project == 'en']
ddf3

| | project | page | requests |
| --- | --- | --- | --- |
| npartitions=4 | | | |
| | object | object | int64 |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |
| | ... | ... | ... |


__5. Count__ how many pages were accessed from English Wikipedia vs. all projects in this dataset. (Note: each project/page combination appears on its own line, so this amounts to counting records.)

In [5]:
ddf2.count().compute()  # all projects

project     1118999
page        1118988
requests    1118999
dtype: int64

In [6]:
ddf3.count().compute()  # English only

project     196882
page        196881
requests    196882
dtype: int64
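Note that `count()` reports non-null values per column, which is why `page` comes out slightly lower than `project` in both results; the number of records is the row count, which `len(ddf2)` would return. One plausible cause, not verified against the data file, is that a few page titles such as `NA` or `NaN` match pandas' default missing-value markers. The sketch below reproduces that effect with plain pandas on hypothetical rows.

```python
import io

import pandas as pd

# Hypothetical rows; the second page title is literally "NA".
sample = "en Main_Page 42\nen NA 3\nfr Accueil 7\n"
names = ["project", "page", "requests"]

# Default parsing: "NA" is read as a missing value.
df = pd.read_csv(io.StringIO(sample), sep=" ", header=None, names=names)
print(df.count())  # non-null values per column
print(len(df))     # number of rows, regardless of nulls

# keep_default_na=False keeps "NA" as an ordinary string.
df_raw = pd.read_csv(
    io.StringIO(sample),
    sep=" ",
    header=None,
    names=names,
    keep_default_na=False,
)
print(df_raw.count())
```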

__Extra Credit: 6. What are the record counts__ for English (en), French (fr), Chinese (zh), and Polish (pl)?

*Hint: `isin` isn't supported on the Dask DataFrame index, but you can use `reset_index` to move `project` into a "regular" column and call `isin` on that.*

In [7]:
ddf4 = ddf2.groupby('project').count().reset_index()

ddf4[ddf4.project.isin(['en', 'fr', 'zh', 'pl'])].compute()

| | project | page | requests |
| --- | --- | --- | --- |
| 230 | en | 196881 | 196882 |
| 308 | fr | 33915 | 33915 |
| 742 | pl | 11931 | 11931 |
| 1079 | zh | 17577 | 17577 |
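An alternative ordering is to filter on the `project` column first and group afterwards, which avoids `reset_index` and aggregates only the four projects of interest. The sketch below shows the idea with plain pandas on hypothetical rows; Dask DataFrame offers `Series.isin` and `groupby(...).size()` as well, so the same expression followed by `.compute()` should carry over. Note that `size()` counts rows, while `count()` counts non-null values per column.

```python
import pandas as pd

# Hypothetical data standing in for the renamed lab DataFrame.
df = pd.DataFrame(
    {
        "project": ["en", "en", "fr", "de", "pl", "zh", "zh"],
        "page": ["A", "B", "C", "D", "E", "F", "G"],
        "requests": [1, 2, 3, 4, 5, 6, 7],
    }
)

wanted = ["en", "fr", "zh", "pl"]

# Filter rows first, then count the rows in each remaining group.
counts = df[df["project"].isin(wanted)].groupby("project").size()

print(counts)
```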


In [8]:
client.close()