# Wikipedia Click Prediction

Given a Wikipedia dump, we extract all articles and their structured content (title, sections, subsections, paragraphs, sentences, words). With the extracted text data and [the clickstream data](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream), we predict clicks from one article to another based on their content.

- Download dump from: https://dumps.wikimedia.org/simplewiki/ and/or https://dumps.wikimedia.org/enwiki/
- Clickstreams from: https://dumps.wikimedia.org/other/clickstream/2019-08/


### Format

The current data includes the following 4 fields:
```
    prev: the result of mapping the referer URL to the fixed set of values described above
    curr: the title of the article the client requested
    type: describes (prev, curr)
        link: if the referer and request are both articles and the referer links to the request
        external: if the referer host is not en(.m)?.wikipedia.org
        other: if the referer and request are both articles but the referer does not link to the request. This can happen when clients search or spoof their referer.

    n: the number of occurrences of the (referer, resource) pair
```
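
As a minimal sketch of how one line in this format can be parsed: it assumes tab-separated fields in the order `prev`, `curr`, `type`, `n`. The names `ClickRow` and `parse_clickstream_line` and the sample line are made up for illustration and are not part of the dump or of any library.

```python
from typing import NamedTuple


class ClickRow(NamedTuple):
    prev: str  # referer, mapped to an article title or a fixed value
    curr: str  # title of the requested article
    type: str  # 'link', 'external' or 'other'
    n: int     # number of occurrences of the (prev, curr) pair


def parse_clickstream_line(line: str) -> ClickRow:
    """Split one tab-separated clickstream line into typed fields."""
    prev, curr, type_, n = line.rstrip('\n').split('\t')
    return ClickRow(prev, curr, type_, int(n))


# Hypothetical sample line, not taken from the real dump
row = parse_clickstream_line('Zeus\tHera\tlink\t42\n')
print(row.prev, row.curr, row.type, row.n)
```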

### Learning

```
HAN: encode(a) + encode(b)

Siamese: (a, b) => click count
(add negative examples for better learning: (a, c) => 0 if no click data exists for the pair, or if c is even further away in the click path)

```
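
The negative-example idea could be sketched as follows. This is only one possible reading: for each source article, one article with no observed click from it is sampled and added with a count of 0. The function name `add_negative_examples`, the one-negative-per-source choice, and the toy data are assumptions for illustration. The "further away in the click path" variant is not covered here.

```python
import random

import pandas as pd


def add_negative_examples(clicks: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Append one (prev, curr, n=0) row per source article, where possible."""
    rng = random.Random(seed)
    observed = set(zip(clicks['prev'], clicks['curr']))
    titles = sorted(set(clicks['prev']) | set(clicks['curr']))

    negatives = []
    for prev in sorted(set(clicks['prev'])):
        # Candidates: articles with no observed click from `prev` (excluding itself)
        candidates = [t for t in titles if t != prev and (prev, t) not in observed]
        if candidates:
            negatives.append({'prev': prev, 'curr': rng.choice(candidates), 'n': 0})

    neg_df = pd.DataFrame(negatives, columns=['prev', 'curr', 'n'])
    return pd.concat([clicks, neg_df], ignore_index=True)


# Toy data for illustration only
clicks = pd.DataFrame({
    'prev': ['Zeus', 'Zeus', 'Hera'],
    'curr': ['Hera', 'Athena', 'Zeus'],
    'n': [30, 10, 5],
})
augmented = add_negative_examples(clicks)
print(augmented)
```

In this toy set, `Zeus` already links to every other article, so only `Hera` receives a negative pair.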

In [None]:
import logging
import re
import json
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
cs_dump_path = './data/clickstream-enwiki-2019-08.tsv'
docs_path = './data/simplewiki.jsonl'

In [None]:
LOG_FORMAT = '[%(asctime)s] [%(levelname)s] %(message)s (%(funcName)s@%(filename)s:%(lineno)s)'
logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)


In [None]:
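rows = []

# Read the start of the clickstream TSV and keep only 'link' pairs
with open(cs_dump_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        # Strip the trailing newline so the last column is clean
        cols = line.rstrip('\n').split('\t')

        if cols[2] == 'link':  # type
            rows.append([
                cols[0], # prev
                cols[1], # current
                int(cols[3]), # n
            ])

        if i == 1000:  # stop after the first 1,001 lines
            break

# TODO use dumps from more months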

In [None]:
df = pd.read_csv(cs_dump_path, sep='\t', quoting=3, header=None, nrows=100, names=['prev', 'curr', 'type', 'n'])

In [None]:
df = df[df['type'] == 'link']
df = df.drop(['type'], axis=1)

In [None]:
df.sample(n=10)

In [None]:
cdf = pd.DataFrame(rows, columns=['prev', 'current', 'n'])
cdf.head()

In [None]:
print(f'Total click pairs: {len(cdf):,}')

In [None]:
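# Load preprocessed articles: title -> list of paragraphs per section
title2sects = {}

with open(docs_path, 'r') as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        
        # Clickstream titles use underscores instead of spaces
        title = doc['title'].replace(' ', '_')
        title2sects[title] = [sect['paragraphs'] for sect in doc['sections']]
        
        break  # stop after the first document
        
print(f'Completed after {i + 1} lines')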

In [None]:
doc

In [None]:
available_titles = set(title2sects.keys())  # a set makes membership tests fast

In [None]:
# Clicks for which we have matching articles
# fdf = cdf[(cdf['prev'].isin(available_titles)) & (cdf['current'].isin(available_titles))].copy()

fdf = cdf

print(f'Click pairs with articles: {len(fdf):,}')

In [None]:
fdf.sample(n=10)

In [None]:
fdf['rel_n'] = 0

In [None]:
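# Maximum click count per source article (restricted to 'n' so other columns are not aggregated)
max_n = fdf.groupby('prev')[['n']].max()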
max_n.sample(n=3)

In [None]:
# One example
max_n['n']['Zeus']

In [None]:
fdf = fdf.merge(max_n, on='prev')

In [None]:
fdf['rel_n2'] = fdf['n_x'] / fdf['n_y']

In [None]:
count_article = fdf.groupby(['prev']).count()

In [None]:
fdf[fdf['prev'] == '1870']

In [None]:
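# Normalize click count with the max value per source article.
# After the merge above, the raw count lives in 'n_x' (there is no 'n' column any more).
# transform('max') broadcasts the per-'prev' maximum to every row, so no row loop is needed.
fdf['rel_n'] = fdf['n_x'] / fdf.groupby('prev')['n_x'].transform('max')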
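
To illustrate the normalization on a small scale, here is a sketch on toy data. The titles and counts are made up; the column names follow the clickstream format. Each count is divided by the largest count that shares the same `prev`.

```python
import pandas as pd

# Toy click pairs for illustration only
toy = pd.DataFrame({
    'prev': ['Zeus', 'Zeus', 'Hera'],
    'curr': ['Hera', 'Athena', 'Zeus'],
    'n': [30, 10, 5],
})

# Per-source maximum, aligned with the original rows
toy['max_n'] = toy.groupby('prev')['n'].transform('max')

# Relative click count in (0, 1]; the most-clicked target of each source gets 1.0
toy['rel_n'] = toy['n'] / toy['max_n']
print(toy)
```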


# Playing around with Wiki text data

In [None]:
from gensim.scripts.segment_wiki import extract_page_xmls

wiki_dump_path = './data/simplewiki-20191220-pages-articles.xml'

with open(wiki_dump_path, 'rb') as xml_fileobj:
    page_xmls = extract_page_xmls(xml_fileobj)
    
    for i, page_xml in enumerate(page_xmls):
        print(page_xml)
    
        break

In [113]:
a = pd.Series(['a', 'b', 'c', 'd'])
a1 = pd.Series(['Paris', 'Rome', 'Rio', 'Berlin'])
b = pd.Series([1, 2, 3, 4])
df1 = pd.DataFrame(zip(a, a1, b))
df1.columns = ['c1', 'c2', 'c3']

c = pd.Series(['a', 'b', 'e', 'f'])
c1 = pd.Series(['Nairobi', 'Rome', 'Rio', 'Barcelona'])
d = pd.Series([11, 12, 13, 14])
df2 = pd.DataFrame(zip(c, c1, d))
df2.columns = ['c1', 'c2', 'c3']

print(df1)
print(df2)


  c1      c2  c3
0  a   Paris   1
1  b    Rome   2
2  c     Rio   3
3  d  Berlin   4
  c1         c2  c3
0  a    Nairobi  11
1  b       Rome  12
2  e        Rio  13
3  f  Barcelona  14


In [118]:
df3 = pd.merge(df1, df2, on=['c1', 'c2'], how='right')
df3[df3['c3_x'].isna()]

  c1         c2  c3_x  c3_y
1  a    Nairobi   NaN    11
2  e        Rio   NaN    13
3  f  Barcelona   NaN    14
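
The same "rows only in the right frame" selection can also be written with the `indicator` option of `pd.merge`, which avoids relying on `NaN` in a value column. This is a sketch of an alternative on the same toy frames, not what the cell above does; the variable names `merged` and `only_in_df2` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'c1': ['a', 'b', 'c', 'd'],
    'c2': ['Paris', 'Rome', 'Rio', 'Berlin'],
    'c3': [1, 2, 3, 4],
})
df2 = pd.DataFrame({
    'c1': ['a', 'b', 'e', 'f'],
    'c2': ['Nairobi', 'Rome', 'Rio', 'Barcelona'],
    'c3': [11, 12, 13, 14],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(df1, df2, on=['c1', 'c2'], how='right', indicator=True)

# Keep the key pairs that occur in df2 but not in df1
only_in_df2 = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')
print(only_in_df2)
```

Unlike the `isna()` check, this still works when the left value column legitimately contains missing values.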


In [None]:
df