Wikigrouth - A Python tool for extracting entity mentions from a collection of Wikipedia documents.

What is it good for?

Are you working on a coreference (or named entity) resolution task? Have you already reached the point of evaluating your solution? Then you probably know that you need a ground truth data set, and it is very likely that you don't have one.

If you are looking for a large-scale, domain-independent data set, then you should look into Wikilinks.

If you are working on some domain-specific task (e.g., sports, medicine, etc.), then Wikigrouth could be your solution. It takes a file containing Wikipedia URIs as input, grabs all corresponding pages from Wikipedia, and gives you a file of all Wikipedia entity mentions in those pages.


Make sure Python 3 (or later) and pip are installed on your machine:

python --version

Now clone Wikigrouth...

git clone

... and install the Wikigrouth library:

cd wikigrouth
pip install -r requirements.txt
python setup.py install

Create a seed file (test.txt) containing a list of Wikipedia page URIs:

...and run the corpus generation tool.

wikigrouth test.txt

If you want to overwrite already existing files, use the -f option:

wikigrouth -f test.txt

Your console will then tell you what's going on.
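
The two invocations above can also be driven from a script. The following is a minimal sketch, assuming the wikigrouth command is on your PATH after installation and that the seed file holds one URI per line (the README only says "a list"). The helper names write_seed_file, build_command, and run_wikigrouth are illustrative and not part of Wikigrouth.

```python
import subprocess
from pathlib import Path


def write_seed_file(path, uris):
    """Write the seed URIs, one per line (assumed seed file layout)."""
    Path(path).write_text("\n".join(uris) + "\n", encoding="utf-8")


def build_command(seed_file, force=False):
    """Build the command line shown above; -f overwrites existing files."""
    cmd = ["wikigrouth"]
    if force:
        cmd.append("-f")
    cmd.append(str(seed_file))
    return cmd


def run_wikigrouth(seed_file, force=False):
    """Run the corpus generation tool and raise if it exits with an error."""
    subprocess.run(build_command(seed_file, force=force), check=True)
```

Calling run_wikigrouth("test.txt", force=True) would then correspond to wikigrouth -f test.txt.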

Corpus Structure

For the example above, the corpus is generated in a folder named test with the following internal file structure:

|- index.csv (*corpus index file*)
|- entities.csv (*extracted entities file*)
|- html
  |- Vienna.html (*HTML page downloaded from Wikipedia*)
  |- ...
|- text
  |- Vienna.txt (*Raw text file extracted from HTML page*)

Corpus index file fields

For the example above:


CSV field semantics:

  • doc_id: autogenerated document id
  • uri: the seed uri
  • html_file: file name of HTML page retrieved from Wikipedia
  • text_file: file name of extracted text
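
As a rough illustration of how the index could be consumed, the sketch below reads it with Python's csv module. It assumes that index.csv has no header row and that its columns appear in the order listed above; neither is confirmed here, so adjust INDEX_FIELDS if your file differs. The function name read_index is hypothetical.

```python
import csv

# Assumed column order, taken from the field list above.
INDEX_FIELDS = ["doc_id", "uri", "html_file", "text_file"]


def read_index(handle):
    """Return a dict mapping doc_id to its index row (no header row assumed)."""
    reader = csv.DictReader(handle, fieldnames=INDEX_FIELDS)
    return {int(row["doc_id"]): row for row in reader}
```

A typical call would open the file with open("test/index.csv", newline="", encoding="utf-8") and pass the handle to read_index.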

Extracted entities file fields

For the example above:

0,128,states of Austria,,0
0,244,metropolitan area,,0
1,59,capital of Germany,,0
1,97,states of Germany,,0
1,208,most populous city proper,,0
1,250,most populous urban area,,0
1,282,European Union,,0

CSV field semantics:

  • doc_id: document id (same as in corpus index file)
  • offset: character offset of entity surface form (token, term, text)
  • text: textual representation of entity
  • uri: unique entity identifier
  • in_seed: whether or not (0=no, 1=yes) entity is part of seed file


A tool for generating coreference corpora from Wikipedia



