# Wikipedia Clickstream

This analysis was inspired by [this Wikimedia blog post](https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/).

In [1]:
import os 
from pathlib import Path
from urllib import request

import pandas as pd

In [2]:
bundle_root = Path(os.environ['LABS_BUNDLE_ROOT'])
data_dir = bundle_root / 'data'
data_raw = data_dir / 'raw'
data_processed = data_dir / 'processed'

In [3]:
data_raw.mkdir(exist_ok=True, parents=True)
data_processed.mkdir(exist_ok=True, parents=True)

## Get data

In [5]:
url = "https://dumps.wikimedia.org/other/clickstream/2018-01/"

def maybe_download(filename):
    """Download a file if not present."""    
    dest_filename = data_raw / filename
    if not dest_filename.exists():
        print("Attempting to download:", filename)
        request.urlretrieve(url + filename, dest_filename)
        print("Download complete!")
    return dest_filename

We'll use the data for January 2018, the most recent month available at the time of writing.

The data is made up of counts of _(referer, resource)_ pairs. The resource is the Wikipedia page that the user lands on and the referer is how they got there (via a link, a search, etc.). 

The January 2018 file contains about 28.5 million of these _(referer, resource)_ pairs.

In [6]:
# Takes ~3 minutes on my machine if not already downloaded
clickstream_filename = maybe_download("clickstream-enwiki-2018-01.tsv.gz")

In [7]:
%%time 
# use na_filter=False because the entry 'NaN' refers to https://en.wikipedia.org/wiki/NaN
dat = pd.read_csv(
    clickstream_filename, sep='\t', names=['prev', 'curr', 'type', 'n'], na_filter=False)

CPU times: user 23.3 s, sys: 1.17 s, total: 24.5 s
Wall time: 24.5 s
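
To see why `na_filter=False` matters, here is a minimal sketch on a two-row, made-up sample in the same tab-separated layout. The rows and counts are invented for illustration and are not taken from the real dump; only the `NaN` title mirrors the case mentioned in the comment above.

```python
import io

import pandas as pd

# Made-up stand-in for the clickstream file: two tab-separated rows, one of
# which has the literal article title "NaN" in the `curr` column.
sample_tsv = "other-search\tNaN\texternal\t120\nMain_Page\tPython\tlink\t45\n"
columns = ['prev', 'curr', 'type', 'n']

# Default parsing treats the string "NaN" as a missing value.
default_parse = pd.read_csv(io.StringIO(sample_tsv), sep='\t', names=columns)

# With na_filter=False the title is kept as an ordinary string.
raw_parse = pd.read_csv(
    io.StringIO(sample_tsv), sep='\t', names=columns, na_filter=False)

print(default_parse['curr'].isna().sum())  # titles lost to missing-value parsing
print(raw_parse['curr'].tolist())          # titles kept as strings
```

Note that `na_filter=False` only switches off missing-value detection; the `n` column is still parsed as integers.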


In [8]:
"The data has {:,} rows.".format(len(dat))

'The data has 28,538,901 rows.'

Referers are grouped in the following way: 

- article in main [namespace](https://en.wikipedia.org/wiki/Wikipedia:Namespace) of Wikipedia --> article title
- article in other Wikimedia project --> `other-internal`
- external search engine --> `other-search`
- other external site --> `other-external`
- [empty referer](https://stackoverflow.com/a/6880668) --> `other-empty`
- anything else --> `other-other`

The data contains the following columns:

- **prev**: the referer, given either as an article title or as one of the `other-*` groups listed above
- **curr**: title of the requested article 
- **type**: one of 
    - `link`: referer and request are both articles, and referer links to request
    - `external`: referer host is not `en(.m)?.wikipedia.org`
    - `other`: referer and request are both articles, and referer does not link to request
- **n**: number of occurrences of the _(referer, resource)_ pair
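
As a rough illustration of how these four columns fit together, the sketch below builds a tiny made-up table in the same layout and aggregates it. The article titles and counts are invented; `toy`, `traffic_by_type`, and `london_inbound` are names introduced only for this example.

```python
import pandas as pd

# Made-up rows in the same four-column layout; the counts are illustrative only.
toy = pd.DataFrame({
    'prev': ['other-search', 'London', 'other-empty', 'Paris'],
    'curr': ['London', 'England', 'London', 'London'],
    'type': ['external', 'link', 'external', 'other'],
    'n': [500, 40, 300, 12],
})

# Total number of requests per type.
traffic_by_type = toy.groupby('type')['n'].sum()

# Inbound requests for a single article, broken down by referer.
london_inbound = toy.loc[toy['curr'] == 'London'].set_index('prev')['n']

print(traffic_by_type)
print(london_inbound)
```

The same two operations should carry over to `dat` unchanged, although on the full table they will take noticeably longer.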

For a more in-depth explanation of the data, see the [Wikipedia clickstream research page](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream).

## Write to disk

In [10]:
dat.to_parquet(data_processed / '2018-01.parquet')

For our analysis, we're only interested in clicks from one Wikipedia article to another, i.e., rows where the referer is itself a Wikipedia article. We drop the rows whose referer is one of the `other-*` groups and write the rest to a separate Parquet file for future analysis.

In [13]:
df = dat.loc[~dat['prev'].str.startswith('other')]

In [14]:
"The data where the referer was Wikipedia has {:,} rows.".format(len(df))

'The data where the referer was Wikipedia has 17,490,626 rows.'
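
The prefix test above relies on no article title starting with a lowercase `other`. A stricter alternative, sketched here on made-up rows, is to exclude exactly the five referer groups listed earlier. `REFERER_GROUPS` and the toy rows are assumptions for illustration, and the `otherwise_hypothetical` title is invented to show where the two filters would differ.

```python
import pandas as pd

# The five non-article referer groups described in this notebook.
REFERER_GROUPS = [
    'other-internal', 'other-search', 'other-external',
    'other-empty', 'other-other',
]

# Made-up rows; 'otherwise_hypothetical' is an invented title that merely
# starts with the same letters as the referer groups.
toy = pd.DataFrame({
    'prev': ['other-search', 'London', 'otherwise_hypothetical', 'other-empty', 'Paris'],
    'curr': ['London', 'England', 'London', 'Paris', 'France'],
    'type': ['external', 'link', 'link', 'external', 'link'],
    'n': [500, 40, 7, 300, 25],
})

# Prefix test, as used in the notebook: drops anything starting with 'other'.
prefix_mask = ~toy['prev'].str.startswith('other')

# Explicit test: drops only the five known referer groups.
explicit_mask = ~toy['prev'].isin(REFERER_GROUPS)

print(toy.loc[prefix_mask, 'prev'].tolist())
print(toy.loc[explicit_mask, 'prev'].tolist())
```

If no real title begins with a lowercase `other`, both filters select the same rows; the explicit version simply makes the intent visible.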

In [15]:
df.to_parquet(data_processed / 'wikipedia_referer.parquet')