
Cleansing Wikipedia Categories using Centrality

by Paolo Boldi and Corrado Monti

Laboratory for Web Algorithmics

We propose a novel general technique aimed at pruning and cleansing the Wikipedia category hierarchy, with a tunable level of aggregation. Our approach is endogenous: it does not use any information coming from Wikipedia articles, but is based solely on the (noisy) user-generated Wikipedia category folksonomy itself.
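
For intuition, here is a minimal, self-contained sketch of centrality-based category ranking on a toy graph. This is not the repository's implementation (the actual code is a full Java pipeline built for Wikipedia-scale graphs); the class name, the adjacency-list representation, and the choice of harmonic centrality are illustrative assumptions.

    import java.util.*;

    public class CategoryCentralitySketch {
        // Harmonic centrality of node v: sum of 1/d(u, v) over all nodes u
        // that can reach v. Computed with one BFS per node on the reverse graph.
        public static double[] harmonicCentrality(int n, int[][] outLinks) {
            List<List<Integer>> in = new ArrayList<>();
            for (int i = 0; i < n; i++) in.add(new ArrayList<>());
            for (int u = 0; u < n; u++)
                for (int v : outLinks[u]) in.get(v).add(u);

            double[] score = new double[n];
            for (int v = 0; v < n; v++) {
                int[] dist = new int[n];
                Arrays.fill(dist, -1);
                dist[v] = 0;
                Deque<Integer> queue = new ArrayDeque<>();
                queue.add(v);
                while (!queue.isEmpty()) {
                    int x = queue.poll();
                    if (dist[x] > 0) score[v] += 1.0 / dist[x];
                    for (int u : in.get(x))
                        if (dist[u] < 0) { dist[u] = dist[x] + 1; queue.add(u); }
                }
            }
            return score;
        }

        public static void main(String[] args) {
            // Toy hierarchy: 0 and 1 are subcategories of 2; 2 and 4 of 3.
            int[][] subcatOf = { {2}, {2}, {3}, {}, {3} };
            double[] score = harmonicCentrality(5, subcatOf);
            System.out.println(Arrays.toString(score)); // the general category 3 scores highest
        }
    }

Ranking categories by such a score and keeping only the top k (e.g., 10 000) is the kind of tunable pruning this technique enables.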

For more information, see the paper, presented at WWW 2016 (companion volume), Wiki Workshop 2016, in Montreal.

Provided dataset

We provide here a ready-to-use dataset, with a recategorization of the Wikipedia pages to a set of 10 000 categories (the most important ones according to our approach). If you wish to use a different number of categories, please run the provided code.

To download the dataset, go to Releases. There you'll find:

  • page2cat.tsv.gz is a gzipped TSV file mapping each Wikipedia page to its cleansed categories, listed from the most important to the least important.
  • ranked-categories.tsv.gz is a gzipped TSV file with every Wikipedia category and its importance score according to our approach.

We also provide the first lines of these files (page2cat-HEAD.tsv and ranked-categories-HEAD.tsv) to show what they look like once decompressed.
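
For instance, a minimal Java reader for page2cat.tsv.gz might look like the following. The exact column layout (page title first, then its categories, tab-separated) is an assumption here; verify it against the provided *-HEAD.tsv samples.

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.zip.GZIPInputStream;

    public class DatasetReader {
        public static void main(String[] args) throws IOException {
            // Stream the gzipped TSV file line by line.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("page2cat.tsv.gz")),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split("\t");
                    String page = fields[0];
                    // Remaining fields: the page's cleansed categories, from
                    // most to least important (assumed layout).
                    String[] categories = Arrays.copyOfRange(fields, 1, fields.length);
                    System.out.println(page + " -> " + Arrays.toString(categories));
                }
            }
        }
    }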

If you wish to use the dataset or the code, please cite: Paolo Boldi and Corrado Monti. "Cleansing wikipedia categories using centrality." Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 2016.


    title={Cleansing wikipedia categories using centrality},
    author={Boldi, Paolo and Monti, Corrado},
    booktitle={Proceedings of the 25th International Conference Companion on World Wide Web},
    organization={International World Wide Web Conferences Steering Committee}

PLEASE NOTE: Experiments described in the paper were run on a 2014 snapshot (enwiki-20140203-pages-articles.xml.bz2), while this dataset, to provide an updated version, refers to enwiki-20160407-pages-articles.xml.bz2.

How to run the code

Set up the environment

In order to compile the code, you'll need Java 8, Ant, and Ivy. To install them (e.g., inside a clean Vagrant box with ubuntu/trusty64), you can use the following commands:

# Refresh package lists and install tooling for managing PPAs.
sudo apt-get --yes update
sudo apt-get install -y software-properties-common python-software-properties
# Pre-accept the Oracle Java license so the installer runs unattended.
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
# Add the WebUpd8 Java PPA and install Oracle Java 8 as the default JDK.
sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
sudo apt-get --yes install oracle-java8-installer
sudo apt-get --yes install oracle-java8-set-default
# Install Ant and Ivy, and make Ivy visible to Ant.
sudo apt-get --yes install ant ivy
sudo ln -s -T /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar

Compile the code

If the environment is set up properly, install git and clone this repository with

sudo apt-get --yes install git
git clone <URL of this repository>

and then go to the java directory. There, run:

  • ant ivy-setupjars to download dependencies
  • ant to compile
  • finally, add the produced jar to the Java classpath.

Now you are ready to run the code, which assumes that the file WIKIDUMP_XML is enwiki-20160407-pages-articles.xml.bz2.