
Reimplementing-TagMe

How to rebuild the Entity Linker TagMe in a few scripts, starting from a Wikipedia Dump. This work is inspired by the paper "On the Reproducibility of the TAGME Entity Linking System" [paper, code].

Pre-Processing Procedure

Process a Wikipedia Dump

Download a Wikipedia dump from here and process it with the WikiExtractor with the following command:

python WikiExtractor.py -l -s -o output/ [here put the path to the Wikipedia Dump .xml.bz2]

Note that the -s flag will keep the sections and the -l flag the links.

Extract entity, mention and ngram frequencies + entity aspects

With the Wikipedia dump processed by the WikiExtractor into the "output/" folder, the first step is to collect the set of all entity mentions in Wikipedia, so that you can later count their frequencies as ngrams. You can do this with

1-CollectAllMentions.ipynb 

which produces an all_mentions.pickle file. Note that I am using an English word tokenizer, which is the only language-dependent component of the pipeline.
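
For reference, this is roughly what the collection step looks like. A minimal sketch, not the notebook's exact code: it assumes WikiExtractor's -l flag stores internal links as HTML anchors and only gathers the set of anchor texts.

```python
import os
import pickle
import re

# Links preserved by WikiExtractor's -l flag look like:
#   <a href="Barack%20Obama">Obama</a>
LINK_RE = re.compile(r'<a href="[^"]+">([^<]+)</a>')

all_mentions = set()
for root, _, files in os.walk("output/"):
    for name in files:
        with open(os.path.join(root, name), encoding="utf-8") as f:
            for line in f:
                for mention in LINK_RE.findall(line):
                    all_mentions.add(mention.lower())

with open("all_mentions.pickle", "wb") as f:
    pickle.dump(all_mentions, f)
```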

The second step extracts mention, ngram and entity counts, as well as mention_to_entities statistics (e.g., how many times the mention "Obama" points to "Barack_Obama" and how many times to "Michelle_Obama"). Statistics are still split across the n folders constituting the output of the WikiExtractor and are saved in the "Store-Counts/" folder as .json files. The script also stores a .json file for each entity, with all its aspects (see here to learn more about Entity-Aspect Linking).

2-ExtractingFreqAndAspects.ipynb
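
A rough sketch of the kind of counting this step performs; the JSON layout and paths below are assumptions, not the notebook's actual ones:

```python
import json
import re
from collections import Counter, defaultdict
from urllib.parse import unquote

LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

mention_counts = Counter()                  # how often a string is linked
entity_counts = Counter()                   # how often an entity is linked to
mention_to_entities = defaultdict(Counter)  # mention -> {entity: count}

def process_file(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            for target, anchor in LINK_RE.findall(line):
                entity = unquote(target).replace(" ", "_")
                mention = anchor.lower()
                mention_counts[mention] += 1
                entity_counts[entity] += 1
                mention_to_entities[mention][entity] += 1

process_file("output/AA/wiki_00")  # hypothetical path into the WikiExtractor output

with open("Store-Counts/AA-counts.json", "w", encoding="utf-8") as f:
    json.dump({"mentions": mention_counts,
               "entities": entity_counts,
               "mention_to_entities": mention_to_entities}, f)
```

Counter and defaultdict are both dict subclasses, so they serialize to JSON directly.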

The final pre-processing script aggregates all the counts needed to use TagMe into single .pickle files and saves them in the "Resources/" folder. You can do this by running:

3-AggregateCounts.ipynb

Note that after processing each .json file from "Store-Counts/", the script saves an intermediate count in "Resources/". This way you can already start using TagMe with partial statistics.
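
A minimal aggregation sketch, assuming each .json file in "Store-Counts/" holds a {"mentions": {...}} mapping as in the previous sketch (the real notebook aggregates all the statistics the same way):

```python
import json
import os
import pickle
from collections import Counter

total_mentions = Counter()
for name in sorted(os.listdir("Store-Counts/")):
    with open(os.path.join("Store-Counts", name), encoding="utf-8") as f:
        total_mentions.update(Counter(json.load(f)["mentions"]))
    # Save an intermediate snapshot after each file, so TagMe can already
    # run with partial statistics while aggregation is still in progress.
    with open("Resources/mention_counts.pickle", "wb") as out:
        pickle.dump(total_mentions, out)
```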

Download resources

To use my reimplementation of TagMe directly, without processing a Wikipedia dump, you can download all the needed resources from here. You will find the five .pickle files containing the required statistics, plus a tfidf_asps.pkl file with TF-IDF statistics for the final aspect-linking step. As before, these statistics are computed on English text, but they are straightforward to produce for another language by simply changing the word tokenizer.
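
Once downloaded, the pickles can be loaded directly; a quick sketch with hypothetical file names (check the downloaded archive for the actual ones):

```python
import pickle

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical names -- substitute the actual file names from the archive.
mention_counts = load_pickle("Resources/mention_counts.pickle")
entity_counts = load_pickle("Resources/entity_counts.pickle")
mention_to_entities = load_pickle("Resources/mention_to_entities.pickle")
```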

Using TagMe

Through the script

Presentation-TagMe.ipynb 

you will get a step-by-step overview of the TagMe algorithm for entity linking. It is designed to work with RISE. The last cell in the notebook shows the potential of using aspects to add further semantics to the linking process. If you'd like to know more about this, check out the original TagMe, the work done by Hasibi et al. in assessing its reproducibility, and our recent dataset and demo of entity-aspect links.
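
As a taste of what the notebook walks through, here is a condensed sketch of two statistics at the core of TagMe's spotting and disambiguation, link probability and commonness, computed from counts like the ones above (variable names follow the earlier sketches, not the notebook):

```python
def link_probability(mention, mention_counts, ngram_counts):
    # How often the string is used as a link, out of all the times it
    # occurs as plain text (ngram_counts); used to filter implausible spots.
    return mention_counts[mention] / max(1, ngram_counts[mention])

def commonness(mention, entity, mention_to_entities):
    # Prior probability that the mention points to this entity,
    # estimated from the mention_to_entities statistics.
    candidates = mention_to_entities[mention]
    return candidates[entity] / max(1, sum(candidates.values()))
```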
