# ClickStream - Getting Started - Explorations

This post gives an introduction to working with the newly released [Wikipedia Clickstream](http://figshare.com/articles/Wikipedia_Clickstream/1305770) dataset. It shows how people get to a Wikipedia article and what articles they click on next. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the "London" article.

In [1]:
from IPython.display import Image
Image(filename='../images/London_Sankey.png', width=800, height=500)

[Figure: Sankey diagram of incoming and outgoing traffic for the "London" article]

The example shows that most people found the "London" page through Google Search and that only a small fraction of readers went on to another article. Before diving into some examples of working with the data, let me give a more detailed explanation of how the data was collected.

### Data Preparation

The data contains counts of `(referer, resource)` pairs extracted from the request logs of English Wikipedia. When a client requests a resource by following a link or performing a search, the URI of the webpage that linked to the resource is included with the request in an HTTP header called the "referer". This data captures 22 million `(referer, resource)` pairs from a total of 3.2 billion requests collected during the month of February 2015. 

The dataset only includes requests for articles in the [main namespace](https://en.wikipedia.org/wiki/Wikipedia:Namespace) of the desktop version of English Wikipedia.

Referers were [mapped](https://github.com/ewulczyn/wmf/blob/f76d43eee9e3e9756cd6a062a11e382fd4425edf/clickstream/oozie/hive_query.sql#L152-L203) to a fixed set of values corresponding to internal traffic or external traffic from one of the top 5 global traffic sources to English Wikipedia, based on this scheme:
    - an article in the main namespace of English Wikipedia -> the article title
    - any Wikipedia page that is not in the main namespace of English Wikipedia -> `other-wikipedia`
    - an empty referer -> `other-empty`
    - a page from any other Wikimedia project -> `other-internal`
    - Google -> `other-google`
    - Yahoo -> `other-yahoo`
    - Bing -> `other-bing`
    - Facebook -> `other-facebook`
    - Twitter -> `other-twitter`
    - anything else -> `other-other`
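
The mapping scheme above can be sketched in plain Python. This is only an illustration, not the Hive query linked above: the function name `map_referer`, the namespace prefix list, the Wikimedia domain list, and the substring tests for the external sources are all assumptions made for the example.

```python
from urllib.parse import urlparse, unquote

# Hypothetical, incomplete list of non-article namespace prefixes.
NON_ARTICLE_PREFIXES = ("Talk:", "User:", "Wikipedia:", "File:", "Special:",
                        "Help:", "Category:", "Template:", "Portal:")

# Hypothetical, incomplete list of other Wikimedia project domains.
WIKIMEDIA_DOMAINS = ("wikimedia.org", "wiktionary.org", "wikidata.org",
                     "wikibooks.org", "wikiquote.org", "wikisource.org")

# Hypothetical substring tests for the five external traffic sources.
EXTERNAL_SOURCES = [("google.", "other-google"), ("yahoo.", "other-yahoo"),
                    ("bing.", "other-bing"), ("facebook.", "other-facebook"),
                    ("twitter.", "other-twitter")]


def map_referer(referer):
    """Map a raw referer URL to an article title or one of the fixed values."""
    if not referer:
        return "other-empty"
    parsed = urlparse(referer)
    host = parsed.netloc.lower()
    if host == "en.wikipedia.org":
        if parsed.path.startswith("/wiki/"):
            title = unquote(parsed.path[len("/wiki/"):])
            # Titles without a namespace prefix are treated as articles.
            if title and not title.startswith(NON_ARTICLE_PREFIXES):
                return title
        return "other-wikipedia"
    if host.endswith(".wikipedia.org"):
        return "other-wikipedia"
    if host.endswith(WIKIMEDIA_DOMAINS):
        return "other-internal"
    for needle, label in EXTERNAL_SOURCES:
        if needle in host:
            return label
    return "other-other"


print(map_referer("https://en.wikipedia.org/wiki/London"))
print(map_referer("https://www.google.co.uk/search?q=london"))
```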

 
MediaWiki redirects are used to forward clients from one page name to another. They can be useful if a particular article is referred to by multiple names, or has alternative punctuation, capitalization or spellings. Requests for pages that get redirected were mapped to the page they redirect to. For example, requests for 'Obama' redirect to the 'Barack_Obama' page. Redirects were resolved using a snapshot of the redirects table from March 1, 2015.
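
As a rough illustration of what resolving redirects does to the counts, here is a minimal sketch. The `redirects` dictionary stands in for the redirects table snapshot, the counts are made-up toy numbers, and the sketch only follows a single redirect hop; none of this is taken from the actual pipeline.

```python
from collections import Counter

# Hypothetical stand-in for the redirects table snapshot.
redirects = {"Obama": "Barack_Obama"}


def resolve(title, redirects):
    """Return the redirect target of a title, or the title itself."""
    return redirects.get(title, title)


# Toy (referer, resource) counts before redirects are resolved.
raw_counts = Counter({
    ("other-google", "Obama"): 12,
    ("other-google", "Barack_Obama"): 30,
})

# Requests for a redirect page are added to the counts of its target.
resolved_counts = Counter()
for (referer, resource), n in raw_counts.items():
    resolved_counts[(referer, resolve(resource, redirects))] += n

print(resolved_counts)
```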

Redlinks are links to an article that does not exist. Either the article was deleted after the creation of the link or the author intended to signal the need for such an article. Requests for redlinks are included in the data.

We attempt to exclude spider traffic by classifying user agents with the [ua-parser](https://github.com/tobie/ua-parser) library and a few additional Wikipedia-specific filters. Furthermore, we attempt to filter out traffic from bots that request a page and then request all or most of the links on that page (BFS traversal) by setting a threshold on the rate at which a client can request articles with the same referer. Requests that were made at too high a rate get discarded. For the exact details, see [here](https://github.com/ewulczyn/wmf/blob/f76d43eee9e3e9756cd6a062a11e382fd4425edf/clickstream/oozie/throttle.py). The threshold is quite high to avoid excluding human readers who open tabs as they read. As a result, requests from slow-moving bots are likely to remain in the data. More sophisticated bot detection that evaluates the client's entire request graph is an avenue for future work.
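
To make the idea of a rate threshold concrete, here is a minimal sliding-window sketch. It is not the logic of the linked `throttle.py`: the function name, the tuple layout of a request, and the values of `max_requests` and `window_seconds` are assumptions chosen for the example.

```python
from collections import defaultdict, deque


def throttle(requests, max_requests=3, window_seconds=10.0):
    """Keep at most `max_requests` requests per (client, referer) in any window.

    `requests` is an iterable of (timestamp, client, referer, resource) tuples.
    Both thresholds are hypothetical values, not the ones used for the dataset.
    """
    recent = defaultdict(deque)  # (client, referer) -> timestamps of kept requests
    kept = []
    for ts, client, referer, resource in sorted(requests, key=lambda r: r[0]):
        window = recent[(client, referer)]
        # Forget kept requests that have fallen out of the time window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) < max_requests:
            window.append(ts)
            kept.append((ts, client, referer, resource))
        # Otherwise the request arrives at too high a rate and is discarded.
    return kept


toy_requests = [
    (0, "a", "London", "X1"), (1, "a", "London", "X2"),
    (2, "a", "London", "X3"), (3, "a", "London", "X4"),
    (20, "a", "London", "X5"), (1, "b", "London", "X1"),
]
print(throttle(toy_requests))
```

In this toy run, client "a" has its fourth rapid request dropped, while its later request and the single request from client "b" are kept.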

Finally, any `(referer, resource)` pair with 10 or fewer observations was removed from the dataset. 


### Format
The data includes the following 6 fields:

- **prev_id:** if the referer does not correspond to an article in the main namespace of English Wikipedia, this value will be empty. Otherwise, it contains the unique MediaWiki page ID of the article corresponding to the referer, i.e. the previous article the client was on
- **curr_id:** the MediaWiki unique page ID of the article the client requested 
- **n:** the number of occurrences of the `(referer, resource)` pair
- **prev_title:** the result of mapping the referer URL to the fixed set of values described above
- **curr_title:** the title of the article the client requested
- **type** 
    - "link" if the referer and request are both articles and the referer links to the request
    - "redlink" if the referer is an article and links to the request, but the request is not in the production enwiki.page table
    - "other" if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer
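
The three values of the `type` field can be read as a small decision rule. The sketch below assumes two hypothetical lookup sets, `links` and `pages`, standing in for the links and page tables; the function name and the toy entries are illustrative only.

```python
def classify(prev, curr, links, pages):
    """Return the `type` value for a (referer, resource) pair.

    `links` is a set of (source, target) title pairs and `pages` is a set of
    existing article titles. Both are hypothetical stand-ins for the tables.
    """
    if (prev, curr) in links:
        # The referer links to the request: a redlink if the target is missing.
        return "link" if curr in pages else "redlink"
    # No link from referer to request, e.g. search or a spoofed referer.
    return "other"


pages = {"London", "England"}
links = {("London", "England"), ("London", "Missing_Page")}

print(classify("London", "England", links, pages))
print(classify("London", "Missing_Page", links, pages))
print(classify("England", "London", links, pages))
```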

# Getting to know the Data

There are various quirks in the data due to the dynamic nature of the network of articles in English Wikipedia and the prevalence of requests from automata. The following section gives a brief overview of the data fields and caveats that need to be kept in mind.



### Loading the Data
First, let's load the data into a pandas DataFrame.

In [2]:
import pandas as pd
df = pd.read_csv("2015_02_clickstream.tsv", sep='\t', header=0)
# we won't use ids here, so let's discard them
df = df[['prev_title', 'curr_title', 'n', 'type']]
df.columns = ['prev', 'curr', 'n', 'type']

In [4]:
df[df['type'] == 'other'].head()

Unnamed: 0,prev,curr,n,type
0,other-google,!!,121,other
1,other-wikipedia,!!,93,other
2,other-empty,!!,46,other
3,other-other,!!,10,other
4,!_(disambiguation),!!,11,other


### Top articles
It has been possible to estimate which pages get the most pageviews per month from the public pageview dumps that WMF releases. Unfortunately, there is no attempt to remove spiders and bots from those dumps. This month, "Layer 2 Tunneling Protocol" was the 3rd most requested article. The logs show that this article was requested by a small number of clients hundreds of times per minute within a 4 day window. This kind of request pattern is removed from the clickstream data, which gives the following as the top 10 pages:

In [23]:
df.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:10]

Unnamed: 0_level_0,n
curr,Unnamed: 1_level_1
Main_Page,127500620
87th_Academy_Awards,2559794
Fifty_Shades_of_Grey,2326175
Alive,2244781
Chris_Kyle,1709341
Fifty_Shades_of_Grey_(film),1683892
Deaths_in_2015,1614577
Birdman_(film),1545842
Islamic_State_of_Iraq_and_the_Levant,1406530
Stephen_Hawking,1384193


The most requested pages tend to be about media that was popular in February. 

### Top Referers
The clickstream data also lets us investigate who the top referers to Wikipedia are:

In [24]:
df.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:10]

Unnamed: 0_level_0,n
prev,Unnamed: 1_level_1
other-google,1494662520
other-empty,347424627
other-wikipedia,129619543
other-other,77496915
other-bing,65895496
other-yahoo,48445941
Main_Page,29897807
other-twitter,19222486
other-facebook,2312328
87th_Academy_Awards,1680559


The top referer by a large margin is Google. Next comes refererless traffic (usually clients using HTTPS). Then come other-language Wikipedias and pages in English Wikipedia that are not in the main (i.e. article) namespace. Bing directs significantly more traffic to Wikipedia than Yahoo. Social media referrals are tiny compared to Google, with Twitter leading to roughly 8x more requests to Wikipedia than Facebook.
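
The comparisons in this paragraph can be checked directly against the table. The snippet below just copies five of the counts from the output above into a dictionary (the variable names are mine) and computes the ratios.

```python
# Counts copied from the top referers table above.
top_referers = {
    "other-google": 1494662520,
    "other-empty": 347424627,
    "other-bing": 65895496,
    "other-yahoo": 48445941,
    "other-twitter": 19222486,
    "other-facebook": 2312328,
}

google_vs_empty = top_referers["other-google"] / top_referers["other-empty"]
bing_vs_yahoo = top_referers["other-bing"] / top_referers["other-yahoo"]
twitter_vs_facebook = top_referers["other-twitter"] / top_referers["other-facebook"]

print("Google vs. empty referer: %.1fx" % google_vs_empty)
print("Bing vs. Yahoo: %.2fx" % bing_vs_yahoo)
print("Twitter vs. Facebook: %.1fx" % twitter_vs_facebook)
```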

### Trending on Social Media
Let's look at which articles were trending on Twitter:

In [25]:
df_twitter = df[df['prev'] == 'other-twitter']
df_twitter.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:5]

Unnamed: 0_level_0,n
curr,Unnamed: 1_level_1
Johnny_Knoxville,198908
Peter_Woodcock,126259
2002_Tampa_plane_crash,119906
Sơn_Đoòng_Cave,116012
The_boy_Jones,114401


I have no explanation for this, but if you find any of the tweets linking to these articles, I would be curious to see why they got so many click-throughs.

### Most Requested Missing Pages
Next, let's look at the most popular redlinks. Redlinks are links to a Wikipedia page that does not exist, either because it has been deleted, or because the author is anticipating the creation of the page. Seeing which redlinks are the most viewed is interesting because it gives some indication of the demand for missing content. Since the set of pages and links is constantly changing, the labeling of redlinks is not an exact science. In this case, I used the page and links tables from Feb 28th to mark a page as a redlink.

In [26]:
df_redlinks = df[df['type'] == 'redlink']
df_redlinks.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:5]

Unnamed: 0_level_0,n
curr,Unnamed: 1_level_1
2027_Cricket_World_Cup,6782
Rethinking,5279
Chris_Soules,5229
Anna_Lezhneva,3764
Jillie_Mack,3685


### Searching Within Wikipedia

Usually, clients navigate from one article to another by following a link. The other prominent case is search. The article from which the user searched is also passed as the referer to the found article. Hence, you will find a high count of `(Wikipedia, Chris_Kyle)` tuples: people went to the "Wikipedia" article to search for "Chris_Kyle", even though there is no link to the "Chris_Kyle" article from the "Wikipedia" article. Finally, it is possible that the client tampered with their referer header. The vast majority of requests with an internal referer correspond to a true link.

In [27]:
df_search = df[df['type'] == 'other']
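# keep only rows whose referer is an article, i.e. does not start with "other"
# (na=True also drops rows with a missing referer)
df_search = df_search[~df_search['prev'].str.startswith('other', na=True)]
print("Number of searches/ incorrect referers: %d" % df_search.n.sum())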

Number of searches/ incorrect referers: 106772349


In [28]:
df_link = df[df['type'] == 'link']
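# keep only rows whose referer is an article, i.e. does not start with "other"
# (na=True also drops rows with a missing referer)
df_link = df_link[~df_link['prev'].str.startswith('other', na=True)]
print("Number of links followed: %d" % df_link.n.sum())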

Number of links followed: 983436029
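
To put the two totals above side by side, this small calculation (variable names are mine) computes the share of requests with an article referer that followed a true link.

```python
# Totals copied from the two outputs above.
links_followed = 983436029
searches_or_incorrect_referers = 106772349

internal_total = links_followed + searches_or_incorrect_referers
link_share = links_followed / internal_total

print("Share of internal-referer requests that followed a link: %.1f%%"
      % (100 * link_share))
```

The share comes out at roughly 90%, which is what "the vast majority" refers to.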


### Inflow vs Outflow

You might be tempted to think that there can't be more traffic going out of a node than going into a node. This is not true for two reasons. First, people will follow links in multiple tabs as they read an article. Hence, a single pageview can lead to multiple records with that page as the referer. Second, the data is certain to include requests from bots that we did not correctly filter out. Bots will often follow most if not all the links in an article. Let's look at the ratio of outgoing to incoming traffic for the most requested pages.

In [31]:
df_in = df.groupby('curr')[['n']].sum()  # pageviews per article
df_in.columns = ['in_count']
df_out = df.groupby('prev')[['n']].sum()  # link clicks per article
df_out.columns = ['out_count']
df_in_out = df_in.join(df_out)
# compute the ratio of outflow to inflow
df_in_out['ratio'] = df_in_out['out_count'] / df_in_out['in_count']

In [30]:
df_in_out.sort_values('in_count', ascending=False)[:3]

Unnamed: 0_level_0,in_count,out_count,ratio
curr,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Main_Page,127500620,29897807,0.234491
87th_Academy_Awards,2559794,1680559,0.656521
Fifty_Shades_of_Grey,2326175,1146354,0.492806


Looking at the pages with the highest ratio of outgoing to incoming traffic reveals how messy the data is, even after the careful data preparation described above.

In [32]:
df_in_out.sort_values('ratio', ascending=False)[:3]

Unnamed: 0_level_0,in_count,out_count,ratio
curr,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
List_of_Major_League_Baseball_players_(H),57,1323,23.210526
2001–02_Slovak_Superliga,22,472,21.454545
Principle_of_good_enough,23,374,16.26087


All of these pages have more traversals of a single link than they have requests for the page to begin with. As a post-processing step, we might enforce that there can't be more traversals of a link than there were requests to the page. Better bot filtering should help reduce this issue in the future.

In [27]:
df_post = pd.merge(df, df_in, how='left', left_on='prev', right_index=True)
df_post['n'] = df_post[['n', 'in_count']].min(axis=1)
del df_post['in_count']
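
The effect of this post-processing step is easier to see on a toy example. The following plain-Python sketch mirrors the merge-and-min logic above on made-up edges; the function name and the numbers are illustrative. Referers that never appear as a requested page (such as `other-google`) have no inflow count, so their edges are left unchanged.

```python
from collections import Counter


def cap_by_inflow(edges, in_count):
    """Cap each edge count at the number of requests for the referring page."""
    capped = []
    for prev, curr, n in edges:
        # Referers without an inflow count keep their original edge count.
        limit = in_count.get(prev, n)
        capped.append((prev, curr, min(n, limit)))
    return capped


# Toy (prev, curr, n) edges with made-up counts.
edges = [("other-google", "A", 100), ("B", "A", 41), ("A", "B", 2144)]

# Inflow per page, computed before capping.
in_count = Counter()
for prev, curr, n in edges:
    in_count[curr] += n

print(cap_by_inflow(edges, in_count))
```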

# Simple Network Analysis

We can think of Wikipedia as a network with articles as nodes and links between articles as edges. With the clickstream data we can assign weights to the edges, which correspond to how often the edges were traversed. Remember that if an edge was traversed fewer than ten times in the month, it is not included in the data. The data set is on the large side, but with some patience, we can load it into networkx and look at some basic properties of the network.

In [28]:
import networkx as nx
clickstream = nx.DiGraph()

for i, row in df_post.iterrows():
    clickstream.add_edge(row['prev'], row['curr'], traffic = row['n'])

First, let's look at the number and distribution of sizes of [strongly connected components]().

In [29]:
# materialize the components as a list of node sets so they can be counted and reused
strongly_connected_components = list(nx.strongly_connected_components(clickstream))

In [31]:
component_sizes = []
for nodes in strongly_connected_components:
    component_sizes.append(len(nodes))

KeyboardInterrupt: 

In [90]:
component_sizes

(23141217, 4)

In [30]:
clickstream.number_of_nodes()

4200855

In [None]:
len(strongly_connected_components)
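
For readers unfamiliar with the term, a strongly connected component is a maximal set of nodes in which every node can reach every other node by following directed edges. The sketch below illustrates this on a toy graph with Kosaraju's two-pass algorithm in plain Python. It is a swapped-in illustration, not what networkx does internally, and the toy edges are made up.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the strongly connected components of a directed graph as sets."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # First pass: record nodes in order of depth-first search completion.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Second pass: walk the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("other-google", "A")]
components = strongly_connected_components(toy_edges)
print(sorted(len(c) for c in components))
```

In the toy graph, "A" and "B" link to each other and form one component, while "C" and "other-google" are each a component of size one. External referers such as `other-google` only ever have outgoing edges, so they can never be part of a larger component.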

In [76]:
pr = nx.pagerank(clickstream, alpha=0.9, weight = 'traffic')

In [13]:
df_in = df_post.groupby('curr')[['n']].sum()  # pageviews per article
df_in.columns = ['in_count']
df_out = df_post.groupby('prev')[['n']].sum()  # link clicks per article
df_out.columns = ['out_count']
df_in_out = df_in.join(df_out)
# compute the ratio of outflow to inflow
df_in_out['ratio'] = df_in_out['out_count'] / df_in_out['in_count']


In [14]:
df_in_out.sort_values('ratio', ascending=False)[:5]

Unnamed: 0_level_0,in_count,out_count,ratio
curr,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Inch,141,2144,15.205674
List_of_Polish_gminas_(B),41,447,10.902439
List_of_Polish_gminas_(A),38,342,9.0
International_League_Most_Valuable_Player_Award,33,275,8.333333
List_of_largest_known_nebulae,31,254,8.193548


In [15]:
df_post.head()

Unnamed: 0,prev,curr,n,type
0,other-empty,!!,66,other
1,other-google,!!,110,other
2,other-wikipedia,!!,31,other
3,!_(disambiguation),!!,10,other
4,other-google,!!!_(album),17,other



In [71]:
clickstream = nx.DiGraph()
for i, row in df_post.iterrows():
    clickstream.add_edge(row['prev'], row['curr'], weight = row['n'])

KeyboardInterrupt: 

In [61]:
clickstream.add_edge('a', 'v', weight = 8)

In [36]:
df_norm = pd.merge(df_post, df_out, how='left', left_on='prev', right_index=True)

In [37]:
df_norm = df_norm.sort_values(['prev', 'curr'])

In [38]:
df_norm['w'] = df_norm['n'] / df_norm['out_count']

In [40]:
df_norm.head(40)

Unnamed: 0,prev,curr,n,type,out_count,w
11512688,!!,!!!,31,link,129,0.24031
2058414,!!,Chess_annotation_symbols,31,link,129,0.24031
14500433,!!,Double_factorial,26,link,129,0.20155
14928832,!!,Exclamation_mark,18,link,129,0.139535
20209148,!!,Retroflex_clicks,12,link,129,0.093023
22442410,!!,Universal_Character_Set_characters,11,other,129,0.085271
5,!!!,!!!_(album),311,link,1847,0.168381
11512757,!!!,!_(disambiguation),14,link,1847,0.00758
13312978,!!!,Cake_(band),10,link,1847,0.005414
2596570,!!!,Dance-punk,40,link,1847,0.021657


In [10]:
node = "Alive"
n=10

In [11]:
prev = df[df['curr'] == node].sort_values(by='n', ascending=False)[:n]
tuples = [list(x) for x in prev.values]
from pprint import pprint 
pprint(tuples)

[['other-wikipedia', 'Alive', 2221485, 'other'],
 ['other-empty', 'Alive', 22109, 'other'],
 ['Main_Page', 'Alive', 540, 'other'],
 ['Alive', 'Alive', 215, 'other'],
 ['other-google', 'Alive', 140, 'other'],
 ['LAN_Chile_Flight_210', 'Alive', 98, 'other'],
 ['other-other', 'Alive', 42, 'other'],
 ['other-yahoo', 'Alive', 40, 'other'],
 ['other-bing', 'Alive', 26, 'other'],
 ['Live', 'Alive', 23, 'other']]


In [12]:
prev = df[df['prev'] == node].sort_values(by='n', ascending=False)[:n]
tuples = [list(x) for x in prev.values]
from pprint import pprint 
pprint(tuples)

[['Alive', 'Alive_(1993_film)', 1948, 'link'],
 ['Alive', 'Alive:_The_Story_of_the_Andes_Survivors', 482, 'link'],
 ['Alive', 'Alive', 215, 'other'],
 ['Alive', 'Bat_Out_of_Hell_III:_The_Monster_Is_Loose', 202, 'link'],
 ['Alive', 'Alive:_20_Years_Later', 106, 'link'],
 ['Alive', 'Alive_(Pearl_Jam_song)', 97, 'link'],
 ['Alive', 'Alive!_(Kiss_album)', 69, 'link'],
 ['Alive', 'Life', 68, 'link'],
 ['Alive', 'Alive_(2002_film)', 51, 'link'],
 ['Alive', 'Alive_(Natalie_Bassingthwaighte_song)', 48, 'link']]
