Wikigrouth - A Python tool for extracting entity mentions from a collection of Wikipedia documents.
Are you working on a coreference (or named entity) resolution task? Have you already reached the point of evaluating your solution? Then you probably know that you need a ground truth data set, and it is very likely that you don't have one.
If you are looking for a large-scale, domain-independent data set, then you should look into Wikilinks.
If you are working on a domain-specific task (e.g., sports or medicine), then Wikigrouth could be your solution. It takes a file containing Wikipedia URIs as input, grabs all corresponding pages from Wikipedia, and gives you a file of all Wikipedia entity mentions in those pages.
Make sure Python 3 (or later) and pip are installed on your machine:
python --version
Now clone Wikigrouth...
git clone https://github.com/behas/wikigrouth.git
... and install the Wikigrouth library:
cd wikigrouth
pip install -r requirements.txt
python setup.py install
Create a seed file (test.txt) containing a list of Wikipedia page URIs:
http://en.wikipedia.org/wiki/Vienna
http://en.wikipedia.org/wiki/Berlin
...and run the corpus generation tool.
wikigrouth test.txt
If you want to overwrite existing files, use the -f flag:
wikigrouth -f test.txt
Your console will then tell you what's going on.
For the example above, the corpus is generated in a folder named test with the following internal file structure:
|- index.csv (*corpus index file*)
|- entities.csv (*extracted entities file*)
|- html
   |- Vienna.html (*HTML page downloaded from Wikipedia*)
   |- ...
|- text
   |- Vienna.txt (*Raw text file extracted from HTML page*)
Corpus index file (index.csv) for the example above:
doc_id,uri,html_file,text_file
0,http://en.wikipedia.org/wiki/Vienna,Vienna.html,Vienna.txt
1,http://en.wikipedia.org/wiki/Berlin,Berlin.html,Berlin.txt
CSV field semantics:
- doc_id: autogenerated document id
- uri: the seed uri
- html_file: file name of HTML page retrieved from Wikipedia
- text_file: file name of extracted text
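To work with a generated corpus from Python, the index file can be read with the standard csv module. The following is a minimal sketch, not part of Wikigrouth itself: the function name load_index is hypothetical, and it assumes that index.csv starts with the header row shown above, is UTF-8 encoded, and that html_file and text_file are relative to the html and text folders of the corpus.

```python
import csv
from pathlib import Path


def load_index(corpus_dir):
    """Map each doc_id to its seed URI and the paths of its HTML and text files.

    Assumes the folder layout described above: index.csv at the top level,
    downloaded pages under html/, extracted text under text/.
    """
    corpus = Path(corpus_dir)
    docs = {}
    with open(corpus / "index.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            docs[int(row["doc_id"])] = {
                "uri": row["uri"],
                "html_path": corpus / "html" / row["html_file"],
                "text_path": corpus / "text" / row["text_file"],
            }
    return docs
```

For the example corpus, load_index("test")[0]["text_path"] would point to test/text/Vienna.txt.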
Extracted entities file (entities.csv) for the example above:
doc_id,offset,text,uri,in_seed
0,21,German,http://en.wikipedia.org/wiki/German_language,0
0,99,Austria,http://en.wikipedia.org/wiki/Austria,0
0,128,states of Austria,http://en.wikipedia.org/wiki/States_of_Austria,0
0,244,metropolitan area,http://en.wikipedia.org/wiki/Metropolitan_area,0
...
1,59,capital of Germany,http://en.wikipedia.org/wiki/Capital_of_Germany,0
1,97,states of Germany,http://en.wikipedia.org/wiki/States_of_Germany,0
1,208,most populous city proper,http://en.wikipedia.org/wiki/Largest_cities_of_the_European_Union_by_population_within_city_limits,0
1,250,most populous urban area,http://en.wikipedia.org/wiki/Largest_urban_areas_of_the_European_Union,0
1,282,European Union,http://en.wikipedia.org/wiki/European_Union,0
1,353,Spree,http://en.wikipedia.org/wiki/Spree,0
1,363,Havel,http://en.wikipedia.org/wiki/Havel,0
CSV field semantics:
- doc_id: document id (same as in corpus index file)
- offset: character offset of entity surface form (token, term, text)
- text: textual representation of entity
- uri: unique entity identifier
- in_seed: whether or not (0=no, 1=yes) entity is part of seed file
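When using the entities file as ground truth, it can help to group the mentions per document and to sanity-check the offsets. The sketch below is an illustration only: load_mentions and mention_matches are hypothetical helper names, and it assumes that entities.csv is UTF-8 encoded with the header row shown above and that offset is a zero-based character position in the extracted text file of the document.

```python
import csv
from collections import defaultdict


def load_mentions(entities_csv):
    """Group the rows of entities.csv by doc_id."""
    mentions = defaultdict(list)
    with open(entities_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            mentions[int(row["doc_id"])].append({
                "offset": int(row["offset"]),
                "text": row["text"],
                "uri": row["uri"],
                "in_seed": row["in_seed"] == "1",  # 0=no, 1=yes
            })
    return mentions


def mention_matches(doc_text, mention):
    """Check that the surface form occurs in doc_text at the recorded offset.

    Assumption: offsets are zero-based character positions in the text file.
    """
    start = mention["offset"]
    return doc_text[start:start + len(mention["text"])] == mention["text"]
```

If mention_matches returns False for many mentions, the offset assumption (zero-based, relative to the text file) probably does not hold for your corpus and should be checked against the tool's source.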