# Wikipedia Click Prediction

Given a Wikipedia dump, we extract all articles and their structured content (title, sections, subsections, paragraphs, sentences, words). Using the extracted text together with [the clickstream data](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream), we predict clicks from one article to another based on their content.

- Download dump from: https://dumps.wikimedia.org/simplewiki/ and/or https://dumps.wikimedia.org/enwiki/
- Clickstreams from: https://dumps.wikimedia.org/other/clickstream/2019-08/


### Format

The current data includes the following 4 fields:
```
    prev: the result of mapping the referer URL to the fixed set of values described in the clickstream documentation linked above
    curr: the title of the article the client requested
    type: describes (prev, curr)
        link: if the referer and request are both articles and the referer links to the request
        external: if the referer host is not en(.m)?.wikipedia.org
        other: if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer.

    n: the number of occurrences of the (referer, resource) pair
```
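
As a minimal sketch of the format above, one tab-separated line can be parsed into a typed record. The `ClickRecord` type and `parse_clickstream_line` helper are hypothetical names introduced for illustration; the sample line reuses a pair that appears in the data further down.

```python
from typing import NamedTuple, Optional


class ClickRecord(NamedTuple):
    prev: str  # referer article title or one of the fixed referer values
    curr: str  # title of the requested article
    type: str  # 'link', 'external' or 'other'
    n: int     # number of occurrences of the (prev, curr) pair


def parse_clickstream_line(line: str) -> Optional[ClickRecord]:
    """Parse one tab-separated clickstream line; return None if malformed."""
    cols = line.rstrip('\n').split('\t')
    if len(cols) != 4:
        return None
    prev, curr, link_type, n = cols
    try:
        return ClickRecord(prev, curr, link_type, int(n))
    except ValueError:
        # The count column was not an integer
        return None


record = parse_clickstream_line('African_wild_dog\tTswalu_Kalahari_Reserve\tlink\t60\n')
print(record)
```

Returning `None` for malformed lines is one possible choice; raising an error instead would surface corrupt dumps earlier.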

### Learning

```
HAN: encode(a) + encode(b)

Siamese: (a, b) => click count
Negative examples for better learning: (a, c) => 0 if no click data exists for the pair, or if c lies further away on the click path

```
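
The negative examples mentioned above could be generated roughly as follows. This is only a sketch under assumptions: `sample_negatives` is a hypothetical helper, a pair counts as negative simply because no click was observed for it, and the "further away in the click path" variant is not covered. It scans all titles per source article, so it is meant for small samples only.

```python
import random


def sample_negatives(positive_pairs, titles, k=1, seed=0):
    """For each source article, draw up to k targets with no observed click."""
    rng = random.Random(seed)
    titles = sorted(titles)
    observed = set(positive_pairs)
    negatives = []
    for prev in sorted({p for p, _ in positive_pairs}):
        # Candidates: every other article that was never clicked from `prev`
        candidates = [t for t in titles if t != prev and (prev, t) not in observed]
        for curr in rng.sample(candidates, min(k, len(candidates))):
            negatives.append((prev, curr, 0.0))  # label 0 = no click
    return negatives


positives = [('Methane', 'Natural_gas'), ('Tanzania', 'Uganda')]
titles = {'Methane', 'Natural_gas', 'Tanzania', 'Uganda', 'Moose'}
negatives = sample_negatives(positives, titles, k=2)
print(negatives)
```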

In [163]:
import logging
import re
import json
import pandas as pd
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

In [144]:
cs_dump_path = '/Volumes/data/repo/data/wikipedia/clickstream-enwiki-2019-08.tsv'
docs_path = 'simplewiki.jsonl'

In [136]:
LOG_FORMAT = '[%(asctime)s] [%(levelname)s] %(message)s (%(funcName)s@%(filename)s:%(lineno)s)'
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)


In [142]:
rows = []

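with open(cs_dump_path, 'r', encoding='utf-8') as f:
    for line in f:
        # Columns: prev, curr, type, n (tab-separated)
        cols = line.rstrip('\n').split('\t')

        if cols[2] == 'link':  # keep only article-to-article link clicks
            rows.append([
                cols[0],  # prev
                cols[1],  # current
                int(cols[3]),  # n
            ])
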
# TODO use dumps from more months
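
One way the TODO above might be approached is to sum the counts per pair across several monthly dumps. The sketch below is an assumption, not part of the notebook: `aggregate_link_counts` is a hypothetical helper, and the two in-memory lists with toy counts stand in for real monthly files.

```python
from collections import Counter


def aggregate_link_counts(monthly_lines):
    """Sum `n` per (prev, curr) pair of type 'link' across several dumps."""
    totals = Counter()
    for lines in monthly_lines:
        for line in lines:
            cols = line.rstrip('\n').split('\t')
            if len(cols) == 4 and cols[2] == 'link':
                totals[(cols[0], cols[1])] += int(cols[3])
    return totals


# Two tiny in-memory "dumps" with illustrative counts
month_a = ['Methane\tNatural_gas\tlink\t460\n', 'Moose\tMammal\tother\t20\n']
month_b = ['Methane\tNatural_gas\tlink\t40\n']

totals = aggregate_link_counts([month_a, month_b])
print(totals[('Methane', 'Natural_gas')])  # 500
```

With real files, each element of `monthly_lines` could be an open file handle, since the function only iterates over lines.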

In [139]:
cdf = pd.DataFrame(rows, columns=['prev', 'current', 'n'])
cdf.head()

Unnamed: 0,prev,current,n
0,Henrik_Purienne,Mirage_(magazine),35
1,African_wild_dog,Tswalu_Kalahari_Reserve,60
2,Common_warthog,Tswalu_Kalahari_Reserve,24
3,Greater_kudu,Tswalu_Kalahari_Reserve,14
4,Sable_antelope,Tswalu_Kalahari_Reserve,26


In [141]:
print(f'Total click pairs: {len(cdf):,}')

Total click pairs: 20,599,206


In [147]:
# Load preprocessed
title2sects = {}

with open(docs_path, 'r') as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        
        title = doc['title'].replace(' ', '_')
        title2sects[title] = [sect['paragraphs'] for sect in doc['sections']]
        
print(f'Completed after {i} lines')

Completed after 10000 lines
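
For reference, the loading code above implies a JSONL record shaped roughly like the one below. Only the keys `title`, `sections` and `paragraphs` come from that code; the sample values, and the assumption that each paragraph is a plain string, are illustrative.

```python
import json

# Hypothetical record: one JSON object per line of the .jsonl file
line = json.dumps({
    'title': 'Natural gas',
    'sections': [
        {'paragraphs': ['First paragraph.', 'Second paragraph.']},
        {'paragraphs': ['Another section.']},
    ],
})

doc = json.loads(line)
title = doc['title'].replace(' ', '_')  # match the clickstream title format
sections = [sect['paragraphs'] for sect in doc['sections']]
print(title, len(sections))
```

Replacing spaces with underscores is what lets the document titles line up with the `prev` and `curr` values of the clickstream.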


In [159]:
available_titles = set(title2sects.keys())  # a set gives fast membership tests

In [207]:
# Keep only click pairs for which both articles are available
fdf = cdf[(cdf['prev'].isin(available_titles)) & (cdf['current'].isin(available_titles))].copy()

print(f'Click pairs with articles: {len(fdf):,}')

Click pairs with articles: 144,016


In [208]:
fdf.sample(n=10)

Unnamed: 0,prev,current,n
3066709,Newfoundland_and_Labrador,Yukon,219
7342140,Chicken,Babylonia,31
14281982,Computer_hardware,Software,714
4251505,Village,Mahatma_Gandhi,13
11190678,Hardcore_punk,The_Dillinger_Escape_Plan,26
16642895,Partition_of_India,Myanmar,723
4125728,Adam_Smith,Peace,15
17456518,Brazil,Airport,10
14743035,1860,1859,120
17171706,Masturbation,Western_world,290


In [176]:
fdf['rel_n'] = 0

In [202]:
max_n = fdf.groupby(['prev']).agg({'n': 'max'})
max_n.sample(n=3)

prev,n
Sunset,471
GNU_General_Public_License,605
Polytheism,457


In [218]:
# One example
max_n.loc['Zeus', 'n']

5090

In [216]:
# Normalize each click count by the maximum count from the same source article
fdf['max_n'] = fdf.groupby('prev')['n'].transform('max')
fdf['rel_n'] = fdf['n'] / fdf['max_n']


In [217]:
fdf.sample(n=10)

Unnamed: 0,prev,current,n,max_n,rel_n
3941382,Silver,Lahn,15,927,0.016181
19692937,Tanzania,Uganda,582,1958,0.297242
1831337,Münster,Osnabrück,31,226,0.137168
7334701,Yugoslavia,Luftwaffe,13,6299,0.002064
7768008,Methane,Natural_gas,460,460,1.0
15594330,Hungary,Slovenia,346,3290,0.105167
16428285,November_3,November_5,11,125,0.088
11512521,Moose,Mammal,20,1162,0.017212
5719026,Thuringia,Friedrich_Schiller,26,1033,0.025169
310407,Cellular_respiration,Photosynthesis,50,194,0.257732
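
The normalization behind `rel_n` can be stated compactly. Writing n(a, b) for the click count from article a to article b (notation introduced here for illustration), each count is divided by the largest count leaving the same source article:

```latex
\mathrm{rel\_n}(a, b) = \frac{n(a, b)}{\max_{c} \, n(a, c)}
```

For the Tanzania to Uganda row in the sample, this gives 582 / 1958 ≈ 0.297242. The most-clicked target of each source always gets 1.0, as in the Methane to Natural_gas row.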
