# Wikipedia Clickstream

The idea for this notebook was inspired by [this blog post](https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/).

In [1]:
library(feather)
library(googleVis)
library(tidyverse)


## Get data

In [2]:
url <- "https://dumps.wikimedia.org/other/clickstream/2018-01/"
data_root <- "/labs/data/raw"

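# Download `filename` into `data_root` unless it is already there.
# Returns the local path either way.
maybe_download <- function(filename) {
    dest_filename <- file.path(data_root, filename)
    if (!file.exists(dest_filename)) {
        message("Attempting to download: ", filename)
        # `url` already ends with "/", so paste0() avoids a doubled slash;
        # mode = "wb" keeps the gzip archive intact on Windows.
        download.file(paste0(url, filename), dest_filename, mode = "wb")
        message("Download complete!")
    }
    dest_filename
}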

We'll use the most recent month of data available at the time of writing: January 2018.

The data is made up of counts of _(referer, resource)_ pairs. The resource is the Wikipedia page that the user lands on and the referer is how they got there (via a link, a search, etc.). 

The data contains about 28.5M of these _(referer, resource)_ pairs.
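
To make the pair structure concrete, here is a minimal sketch on made-up rows. The column names `referer`, `resource`, and `n`, and all of the counts, are hypothetical and chosen only for illustration. It sums the counts over all referers to get the total traffic arriving at each page.

```r
library(dplyr)

# Hypothetical (referer, resource) pairs with made-up counts
pairs <- data.frame(
    referer  = c("other-search", "Rabbit", "Rabbit", "other-empty"),
    resource = c("Rabbit", "Hare", "Burrow", "Hare"),
    n        = c(120L, 30L, 12L, 45L),
    stringsAsFactors = FALSE
)

# Total clicks arriving at each resource, summed over every referer
inbound <- pairs %>%
    group_by(resource) %>%
    summarise(clicks = sum(n)) %>%
    arrange(desc(clicks))

print(inbound)
```

Each row of the real data plays the same role as a row of `pairs`: one referer, one landing page, and how often that combination occurred during the month.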

In [3]:
clickstream_filename <- maybe_download("clickstream-enwiki-2018-01.tsv.gz")

In [4]:
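# col_names supplies the four column names explicitly;
# "ccci" reads three character columns and one integer count.
dat <- read_delim(
    clickstream_filename,
    delim = "\t",
    col_names = c("prev", "curr", "type", "n"),
    col_types = "ccci",
    escape_double = FALSE
)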

In [5]:
paste("The data has", prettyNum(nrow(dat), big.mark = ","), "rows.")

Referers are grouped in the following way: 

- article in main [namespace](https://en.wikipedia.org/wiki/Wikipedia:Namespace) of Wikipedia --> article title
- article in other Wikimedia project --> `other-internal`
- external search engine --> `other-search`
- other external site --> `other-external`
- [empty referer](https://stackoverflow.com/a/6880668) --> `other-empty`
- anything else --> `other-other`
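
As a rough illustration of the grouping above, the sketch below labels toy referers as either one of the `other-*` groups or a plain `article`, then totals the clicks per group. The rows and the `group` column are hypothetical; only the `other-` prefix convention comes from the data description.

```r
library(dplyr)

# Hypothetical referers with made-up counts
toy <- data.frame(
    prev = c("other-search", "other-empty", "Rabbit", "Hare", "other-search"),
    n    = c(120L, 45L, 30L, 8L, 60L),
    stringsAsFactors = FALSE
)

# Keep the `other-*` label as the group; everything else is an article title
by_group <- toy %>%
    mutate(group = if_else(startsWith(prev, "other-"), prev, "article")) %>%
    group_by(group) %>%
    summarise(clicks = sum(n))

print(by_group)
```

Relying on the `other-` prefix works here because every non-article referer group in the list starts with it.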

The data contains the following columns:

- **prev**: the referer, either an article title or one of the `other-*` groups listed above
- **curr**: title of the requested article 
- **type**: one of 
    - `link`: referer and request are both articles, and referer links to request
    - `external`: referer host is not `en(.m)?.wikipedia.org`
    - `other`: referer and request are both articles, and referer does not link to request
- **n**: number of occurrences of the _(referer, resource)_ pair

See the [Wikipedia clickstream research page](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) for a more in-depth explanation of the data.
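
One quick way to get a feel for the `type` column is to look at how the clicks split across its three values. The sketch below does this on a hypothetical four-row table that uses the same column names as `dat`; the counts are made up, so the resulting shares say nothing about the real data.

```r
library(dplyr)

# Hypothetical rows in the same shape as `dat`
toy <- data.frame(
    prev = c("Rabbit", "other-search", "Hare", "Rabbit"),
    curr = c("Hare", "Rabbit", "Main_Page", "Burrow"),
    type = c("link", "external", "other", "link"),
    n    = c(30L, 120L, 10L, 40L),
    stringsAsFactors = FALSE
)

# Clicks per type, plus each type's share of all clicks
type_share <- toy %>%
    group_by(type) %>%
    summarise(clicks = sum(n)) %>%
    mutate(share = clicks / sum(clicks))

print(type_share)
```

Weighting by `n` rather than counting rows matters: a single pair can stand for anything from a handful of clicks to many thousands.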

In [6]:
df <- dat %>%
    filter(!startsWith(prev, "other-"))

In [7]:
paste("The data where the referer was Wikipedia has", prettyNum(nrow(df), big.mark = ","), "rows.")

In [8]:
write_feather(df, "/labs/data/processed/wikipedia_referer.feather")