### Scalable Data Analytics Lab: Wikimedia Traffic Data

In this mini-lab, you'll create a Dask cluster and run a few queries on Wikimedia traffic data using Dask DataFrame.

*Hint: Copy useful code snippets from the intro notebook.*

__1. Create a Client__ and request 2 workers, each with 1 thread and 1 GB of RAM.

In [None]:
from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1, memory_limit='1GB')

client

__2. Read data__ from the `pageviews_small.csv` file. Use Dask's `blocksize=` parameter to cap each partition at 10 MB.

*Hint: Use the `sep=` parameter, which Dask passes through to pandas, to indicate that the columns are space-separated.*

In [None]:
import dask.dataframe

# blocksize accepts an int (bytes) or a string such as '10MB'
ddf = dask.dataframe.read_csv('data/pageviews_small.csv', sep=' ', blocksize='10MB')

ddf
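As a rough sanity check on `blocksize=`, the number of partitions is approximately the file size divided by the block size, rounded up. The sketch below is plain Python arithmetic, not a Dask API; the helper name `estimate_partitions` and the 25 MB file size are made up for illustration. The actual number of partitions for your file is reported by `ddf.npartitions`.

```python
import math


def estimate_partitions(file_size_bytes: int, blocksize_bytes: int) -> int:
    """Rough estimate: one partition per block, rounded up, at least one."""
    return max(1, math.ceil(file_size_bytes / blocksize_bytes))


# Hypothetical 25 MB file split into 10 MB blocks (decimal megabytes).
estimate = estimate_partitions(25_000_000, 10_000_000)
print(estimate)
```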

__3. Change the column names__ to `project`, `page`, `requests`, and `x`, then drop the `x` column.

In [None]:
ddf.columns = ['project', 'page', 'requests', 'x']

ddf2 = ddf.drop('x', axis=1)

ddf2

__4. Filter__ for rows where `project` equals "en" (English Wikipedia).

In [None]:
ddf3 = ddf2[ddf2.project == 'en']
ddf3

__5. Count__ how many pages were accessed from English Wikipedia vs. all projects in this dataset. (Note: each project/page combination appears on a unique line, so this amounts to just counting records)

In [None]:
ddf2.count().compute()  # all projects (non-null count per column)

In [None]:
ddf3.count().compute()  # English only (non-null count per column)

__Extra Credit: 6. What are the record counts__ for English (en), French (fr), Chinese (zh), and Polish (pl)?

*Hint: `isin` isn't supported on the Dask dataframe index, but you can `reset_index` to move the `project` into a "regular" column and use `isin` on that*

In [None]:
ddf4 = ddf2.groupby('project').count().reset_index()

ddf4[ddf4.project.isin(['en', 'fr', 'zh', 'pl'])].compute()

In [None]:
client.close()