
mwaddlink

MediaWiki AddLink Extension Model and API

Introduction

This repository contains the code for the link recommendation model for Wikipedia articles. It includes code both to train the model (including all the necessary pre-processing) and to query the trained model for link recommendations for an individual article. The method is context-free and can be scaled to (virtually) any language, provided that we have enough existing links to learn from.

Querying the model

Once the model and all the utility files are computed (see "Training the model" below), they can be loaded and used to build an API that adds new links to a Wikipedia page automatically. For this we have the following utilities:

  • command-line tool:
python addlink-query_links.py -l de -p Garnet_Carter

This will return all recommended links for a given page (-p) in a given wiki (-l). You can also specify the probability threshold for a link to be recommended (-t, default=0.9); a sketch for calling the tool from Python follows the list below.

  • interactive notebook:
addlink-query_notebook.ipynb

This allows you to inspect the recommendations in a notebook.
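
If you want to drive the query tool from your own code, here is a minimal sketch. It is an illustration only: it assumes the script prints its recommendations to stdout and uses just the flags documented above.

# Hypothetical sketch: call the command-line tool from Python.
# Assumes the script prints its recommendations to stdout;
# -l, -p and -t are the flags documented above.
import subprocess

result = subprocess.run(
    ["python", "addlink-query_links.py", "-l", "de", "-p", "Garnet_Carter", "-t", "0.8"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the recommended links for the page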

Notes:

  • currently this works only on stat1008 in the analytics cluster, as the underlying data from the trained model is available only locally there

  • we need to set up a Python virtual environment:

virtualenv -p /usr/bin/python3 venv_query/
source venv_query/bin/activate
pip install -r requirements.txt

This environment contains only the packages required for querying the model and is thus lighter than the one for training the model (see below).

Training the model

In order to load, the API needs some pre-computed datasets for each target language.

It is essential to follow these steps sequentially because some scripts may require the output of previous ones.

You can run the pipeline for a given language (change the variable LANG in the script):

./run-pipeline.sh

Notes:

  • we need to set up a Python virtual environment:
virtualenv -p /usr/bin/python3 venv/
source venv/bin/activate
pip install -r requirements_train.txt
  • some parts of the script rely on the Spark cluster and a specific conda environment, and therefore have to be run from a specific stat machine (stat1008).
  • on the stat machines, make sure you have the HTTP proxy set up: https://wikitech.wikimedia.org/wiki/HTTP_proxy
  • you might have to install the following NLTK data package manually: python -m nltk.downloader punkt

Anchors Dictionary

This is the main dictionary used to find link candidates and mentions; the bigger, the better (barring memory issues). For English, this is a ~2 GB pickle file.

compute with:

PYSPARK_PYTHON=python3.7 PYSPARK_DRIVER_PYTHON=python3.7 spark2-submit --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G  ./scripts/generate_anchor_dictionary_spark.py $LANG

store in:

./data/<LANG>/<LANG>.anchors.pkl
  • link titles (e.g. first letter capitalised) and anchors (anchor string lowercased) are normalised via scripts/utils.py
  • for candidate links, we resolve redirects and only keep main-namespace articles
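
For illustration, the dictionary can be loaded and inspected as below. The nested {mention: {candidate: count}} shape shown in the comments is an assumption for this sketch; the authoritative structure is whatever scripts/generate_anchor_dictionary_spark.py writes.

# Hypothetical sketch: load and inspect the anchors dictionary.
import pickle

with open("./data/de/de.anchors.pkl", "rb") as f:
    anchors = pickle.load(f)

# assumed shape: anchors["minigolf"] -> {"Minigolf": 123, "Golf": 4, ...}
# i.e. for a lowercased mention, the candidate target pages and how often
# existing articles link that mention to each target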

The pipeline also generates the following two helper dictionaries:

./data/<LANG>/<LANG>.pageids.pkl
  • a dictionary of all main-namespace, non-redirect articles with the mapping {page_title: page_id}
./data/<LANG>/<LANG>.redirects.pkl
  • a dictionary of all main-namespace redirect pages with the mapping {page_title: page_title_rd}, where page_title_rd is the title of the redirected-to article
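
A minimal sketch of how the two helper dictionaries fit together (the file paths follow the pattern above; the example title is hypothetical):

# Hypothetical sketch: resolve a possibly redirected title to its
# canonical page and numeric page id.
import pickle

lang = "de"  # example language code
with open(f"./data/{lang}/{lang}.redirects.pkl", "rb") as f:
    redirects = pickle.load(f)
with open(f"./data/{lang}/{lang}.pageids.pkl", "rb") as f:
    pageids = pickle.load(f)

title = "Tom_Thumb_Golf"                 # hypothetical redirect title
canonical = redirects.get(title, title)  # follow the redirect if there is one
page_id = pageids.get(canonical)         # None if not a main-namespace article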

Note that the default setup uses the Spark cluster from stat1008 (in order to use the anaconda-wmf newpyter setup); this is necessary for filtering the anchor dictionary by link probability. Alternatively, one can run:

python ./scripts/generate_anchor_dictionary.py <LANG>

Wikipedia2Vec:

This models semantic relationships between articles. Get it from https://github.com/wikipedia2vec/wikipedia2vec, then run:

wikipedia2vec train --min-entity-count=0 --dim-size 50 --pool-size 10 "/mnt/data/xmldatadumps/public/"$LANG"wiki/latest/"$LANG"wiki-latest-pages-articles.xml.bz2" "./data/"$LANG"/"$LANG".w2v.bin"

store in:

./data/<LANG>/<LANG>.w2v.bin

We keep only the vectors of main-namespace articles that are not redirects by running

python filter_dict_w2v.py $LANG

and storing the resulting dictionary as a pickle

./data/<LANG>/<LANG>.w2v.filtered.pkl
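
As an illustration, assuming the filtered pickle maps page titles to their 50-dimensional vectors (the dimensionality comes from the training command above), semantic relatedness between two articles can be scored with cosine similarity:

# Hypothetical sketch: cosine similarity between two article embeddings.
# Assumes <LANG>.w2v.filtered.pkl maps page titles to numpy vectors.
import pickle
import numpy as np

with open("./data/de/de.w2v.filtered.pkl", "rb") as f:
    w2v = pickle.load(f)

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cosine(w2v["Minigolf"], w2v["Golf"])  # higher = more semantically related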

Nav2Vec:

This models how current Wikipedia readers navigate through Wikipedia.

compute via:

PYSPARK_PYTHON=python3.7 PYSPARK_DRIVER_PYTHON=python3.7 spark2-submit --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G  ./scripts/generate_features_nav2vec-01-get-sessions.py -l $LANG
  • extracts reading sessions from one week of webrequest data (this window can be changed)
python ./scripts/generate_features_nav2vec-02-train-w2v.py -l $LANG -rfin True
  • fits a word2vec-model with 50 dimensions (this and other hyperparameters can also be changed)

The second script stores the resulting embedding in

./data/<LANG>/<LANG>.nav.bin
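
Conceptually, the second script treats each reading session as a "sentence" of page titles and embeds them with word2vec. Here is a minimal sketch of that idea, assuming gensim; the actual script may use a different implementation.

# Minimal sketch of the nav2vec idea: reading sessions as "sentences"
# of page titles, embedded with word2vec. Assumes gensim.
from gensim.models import Word2Vec

sessions = [
    ["Garnet_Carter", "Minigolf", "Golf"],          # one reader's session
    ["Minigolf", "Lookout_Mountain", "Tennessee"],  # another session
]
model = Word2Vec(sentences=sessions, vector_size=50, min_count=1)
vector = model.wv["Minigolf"]  # 50-dimensional navigation embedding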

We keep only the vectors of main-namespace articles that are not redirects by running

python filter_dict_nav.py $LANG

and storing the resulting dictionary as a pickle

./data/<LANG>/<LANG>.nav.filtered.pkl

Raw datasets:

There is a backtesting dataset to a) test the accuracy of the model, and b) train the model. We mainly want to extract fully formed and linked sentences as our ultimate ground truth.
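
For intuition, a linked sentence carries its ground truth inline as wikitext. Here is a hypothetical sketch of extracting (anchor, target) pairs, not the script's actual parsing logic:

# Hypothetical sketch: pull (anchor, target) ground-truth pairs out of a
# wikitext sentence. The actual extraction lives in generate_backtesting_data.py.
import re

sentence = "He invented [[Miniature golf|Tom Thumb Golf]] on [[Lookout Mountain]]."
# [[Target|anchor]] or [[Target]]; in the second form the anchor equals the target
pairs = [
    (m.group(2) or m.group(1), m.group(1))
    for m in re.finditer(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]", sentence)
]
# pairs -> [("Tom Thumb Golf", "Miniature golf"), ("Lookout Mountain", "Lookout Mountain")]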

compute with:

python ./scripts/generate_backtesting_data.py $LANG

Datasets are then stored in:

./data/<LANG>/training/sentences_train.csv
./data/<LANG>/testing/sentences_test.csv

Feature datasets:

We need a dataset with features and training labels (true link, false link).

compute with:

python ./scripts/generate_training_data.py <LANG>

This generates a file stored here:

./data/<LANG>/training/link_train.csv
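
For illustration, a single training row could look like the following. All feature names here are invented for the sketch; the real feature set is defined in ./scripts/generate_training_data.py.

# Hypothetical illustration of one labeled training row.
row = {
    "page_title": "Garnet_Carter",  # article being linked from
    "mention": "minigolf",          # anchor text found in the article
    "candidate": "Minigolf",        # candidate link target
    "feature_link_prob": 0.87,      # e.g. derived from the anchors dictionary
    "feature_w2v_sim": 0.63,        # e.g. Wikipedia2Vec cosine similarity
    "feature_nav_sim": 0.71,        # e.g. Nav2Vec cosine similarity
    "label": 1,                     # 1 = true link, 0 = false link
}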

XGBoost Classification Model:

This is the main prediction model: it takes a (page_title, mention, candidate_link) triple and produces the probability of a link.

compute with:

python ./scripts/generate_addlink_model.py <LANG>

store in:

./data/<LANG>/<LANG>.linkmodel.bin
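
Here is a minimal sketch of what such a classifier looks like with the xgboost Python package. This is not the repository's actual training code (see ./scripts/generate_addlink_model.py for that), and the features are the hypothetical ones from the previous section.

# Minimal sketch: fit an XGBoost classifier on feature rows and get a
# link probability for one candidate. Not the repo's actual training code.
import numpy as np
import xgboost as xgb

X = np.array([[0.87, 0.63, 0.71],   # toy feature vectors, one row per triple
              [0.05, 0.12, 0.20]])
y = np.array([1, 0])                # true link / false link labels

model = xgb.XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X, y)
prob = model.predict_proba(X[:1])[0, 1]  # probability that the candidate link is correct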

Backtesting evaluation:

Evaluate the prediction algorithm on a set of sentences in the training set using micro-precision and micro-recall.
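
Micro-averaging pools the true-positive, false-positive, and false-negative counts over all sentences before computing the ratios, so every link (rather than every sentence) carries equal weight. A small sketch of the metric itself, not of the evaluation script:

# Micro-precision / micro-recall: aggregate counts over all sentences first.
def micro_precision_recall(per_sentence_counts):
    # per_sentence_counts: (true_positives, false_positives, false_negatives)
    # for each evaluated sentence
    tp = sum(c[0] for c in per_sentence_counts)
    fp = sum(c[1] for c in per_sentence_counts)
    fn = sum(c[2] for c in per_sentence_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. two sentences: (2 TP, 1 FP, 0 FN) and (1 TP, 0 FP, 2 FN)
p, r = micro_precision_recall([(2, 1, 0), (1, 0, 2)])  # p = 0.75, r = 0.6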

compute with (first 10000 sentences):

python generate_backtesting_eval.py -l $LANG -nmax 10000

store in:

./data/<LANG>/<LANG>.backtest.eval

Memory-mapping

The pickled dictionaries (anchors, pageids, redirects, w2v, nav) are converted to SQLite databases using the sqlitedict package in order to reduce the memory footprint when reading them while computing link recommendations for individual articles.
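
A minimal sketch of what that conversion looks like with sqlitedict (the script below is the authoritative version):

# Minimal sketch of the pickle -> sqlite conversion using sqlitedict.
import pickle
from sqlitedict import SqliteDict

lang = "de"  # example language code
with open(f"./data/{lang}/{lang}.anchors.pkl", "rb") as f:
    anchors = pickle.load(f)

with SqliteDict(f"./data/{lang}/{lang}.anchors.sqlite") as db:
    for key, value in anchors.items():
        db[key] = value  # each value is serialised into the sqlite table
    db.commit()

# later reads fetch individual entries from disk instead of loading
# the whole multi-gigabyte pickle into memory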

computed via

python ./scripts/generate_sqlite_data.py $LANG

stored in

./data/<LANG>/<LANG>.anchors.sqlite
./data/<LANG>/<LANG>.pageids.sqlite
./data/<LANG>/<LANG>.redirects.sqlite
./data/<LANG>/<LANG>.w2v.filtered.sqlite
./data/<LANG>/<LANG>.nav.filtered.sqlite
