A tool for generating coreference corpora from Wikipedia

Wikigrouth - A Python tool for extracting entity mentions from a collection of Wikipedia documents.

What is it good for?

Are you working on a coreference (or named entity) resolution task? Have you already reached the point of evaluating your solution? Then you probably know that you need a ground truth data set, and it is very likely that you don't have one.

If you are looking for a large-scale, domain-independent data set, then you should look into Wikilinks.

If you are working on a domain-specific task (e.g., sports or medicine), then Wikigrouth could be your solution. It takes a file containing Wikipedia URIs as input, fetches the corresponding pages from Wikipedia, and gives you a file of all Wikipedia entity mentions in those pages.

Usage

Make sure Python 3 (or later) and pip are installed on your machine:

python --version

Now clone Wikigrouth...

git clone https://github.com/behas/wikigrouth.git

... and install the Wikigrouth library:

cd wikigrouth
pip install -r requirements.txt
python setup.py install

Create a seed file (test.txt) containing a list of Wikipedia page URIs:

http://en.wikipedia.org/wiki/Vienna
http://en.wikipedia.org/wiki/Berlin

...and run the corpus generation tool:

wikigrouth test.txt

If you want to overwrite existing files, use:

wikigrouth -f test.txt

Your console will then tell you what's going on.
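
For longer seed lists, the seed file can also be generated from page titles. The following is a minimal sketch and not part of Wikigrouth: the helper names `seed_uris` and `write_seed_file` are hypothetical, and it assumes that Wikigrouth accepts URIs of the form shown above, with spaces in titles replaced by underscores and other special characters percent-encoded.

```python
from urllib.parse import quote

# Assumption: same URI prefix as in the seed file example above.
WIKI_BASE = "http://en.wikipedia.org/wiki/"


def seed_uris(titles):
    """Turn page titles into Wikipedia page URIs (hypothetical helper)."""
    # Wikipedia page URIs use underscores where titles have spaces;
    # quote() percent-encodes remaining non-ASCII or reserved characters.
    return [WIKI_BASE + quote(title.strip().replace(" ", "_")) for title in titles]


def write_seed_file(titles, path):
    """Write one URI per line, the layout of the seed file shown above."""
    with open(path, "w", encoding="utf-8") as handle:
        for uri in seed_uris(titles):
            handle.write(uri + "\n")
```

Calling `write_seed_file(["Vienna", "Berlin"], "test.txt")` would produce the two-line seed file used in this example.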

Corpus Structure

Using the example above, the corpus is generated in a folder named test with the following internal file structure:

|- index.csv (*corpus index file*)
|- entities.csv (*extracted entities file*)
|- html
  |- Vienna.html (*HTML page downloaded from Wikipedia*)
  |- ...
|- text
  |- Vienna.txt (*Raw text file extracted from HTML page*)

Corpus index file fields

For the example above:

doc_id,uri,html_file,text_file
0,http://en.wikipedia.org/wiki/Vienna,Vienna.html,Vienna.txt
1,http://en.wikipedia.org/wiki/Berlin,Berlin.html,Berlin.txt

CSV field semantics:

  • doc_id: autogenerated document id
  • uri: the seed URI
  • html_file: file name of HTML page retrieved from Wikipedia
  • text_file: file name of extracted text
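
As an illustration of how the index file might be consumed, here is a small sketch using Python's standard `csv` module. The function names `load_index` and `text_path` are hypothetical, and the path logic assumes the `text` subfolder from the corpus structure shown above. The sample rows are the ones from the example.

```python
import csv
import io
from pathlib import Path


def load_index(handle):
    """Read index rows into a dict keyed by integer doc_id."""
    return {int(row["doc_id"]): row for row in csv.DictReader(handle)}


def text_path(corpus_dir, row):
    """Locate a document's extracted text inside the corpus folder.

    Assumption: text files live in the 'text' subfolder of the corpus.
    """
    return Path(corpus_dir) / "text" / row["text_file"]


# In-memory copy of the example index; a real run would open test/index.csv.
sample = io.StringIO(
    "doc_id,uri,html_file,text_file\n"
    "0,http://en.wikipedia.org/wiki/Vienna,Vienna.html,Vienna.txt\n"
    "1,http://en.wikipedia.org/wiki/Berlin,Berlin.html,Berlin.txt\n"
)
index = load_index(sample)
print(index[1]["text_file"])  # Berlin.txt
```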

Extracted entities file fields

For the example above:

doc_id,offset,text,uri,in_seed
0,21,German,http://en.wikipedia.org/wiki/German_language,0
0,99,Austria,http://en.wikipedia.org/wiki/Austria,0
0,128,states of Austria,http://en.wikipedia.org/wiki/States_of_Austria,0
0,244,metropolitan area,http://en.wikipedia.org/wiki/Metropolitan_area,0
...
1,59,capital of Germany,http://en.wikipedia.org/wiki/Capital_of_Germany,0
1,97,states of Germany,http://en.wikipedia.org/wiki/States_of_Germany,0
1,208,most populous city proper,http://en.wikipedia.org/wiki/Largest_cities_of_the_European_Union_by_population_within_city_limits,0
1,250,most populous urban area,http://en.wikipedia.org/wiki/Largest_urban_areas_of_the_European_Union,0
1,282,European Union,http://en.wikipedia.org/wiki/European_Union,0
1,353,Spree,http://en.wikipedia.org/wiki/Spree,0
1,363,Havel,http://en.wikipedia.org/wiki/Havel,0

CSV field semantics:

  • doc_id: document id (same as in corpus index file)
  • offset: character offset of entity surface form (token, term, text)
  • text: textual representation of entity
  • uri: unique entity identifier
  • in_seed: whether the entity is part of the seed file (0 = no, 1 = yes)
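
Rows that share a uri point to the same entity, so grouping the entities file by uri gives a simple view of which mentions belong together. The sketch below shows one possible way to do this. The names `group_mentions` and `surface_at` are hypothetical, and `surface_at` assumes that offset is a 0-based character index into the extracted text file, which the field description above does not state explicitly.

```python
import csv
import io
from collections import defaultdict


def group_mentions(handle):
    """Map each entity URI to its (doc_id, offset, text) mentions."""
    mentions = defaultdict(list)
    for row in csv.DictReader(handle):
        mentions[row["uri"]].append(
            (int(row["doc_id"]), int(row["offset"]), row["text"])
        )
    return dict(mentions)


def surface_at(document_text, offset, surface):
    """Check that a surface form occurs at the given offset.

    Assumption: offset is a 0-based character index into the text file.
    """
    return document_text[offset:offset + len(surface)] == surface


# A few rows copied from the example above; a real run would open
# test/entities.csv instead.
sample = io.StringIO(
    "doc_id,offset,text,uri,in_seed\n"
    "0,99,Austria,http://en.wikipedia.org/wiki/Austria,0\n"
    "1,282,European Union,http://en.wikipedia.org/wiki/European_Union,0\n"
    "1,353,Spree,http://en.wikipedia.org/wiki/Spree,0\n"
)
mentions = group_mentions(sample)
print(mentions["http://en.wikipedia.org/wiki/Spree"])  # [(1, 353, 'Spree')]
```

Running `surface_at` over each row against the matching file in the text folder is a quick sanity check before using the offsets in an evaluation.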