
Reimplementing-TagMe

How to rebuild the Entity Linker TagMe in a few scripts, starting from a Wikipedia Dump. This work is inspired by the paper "On the Reproducibility of the TAGME Entity Linking System" [paper, code].

Pre-Processing Procedure

Process a Wikipedia Dump

Download a Wikipedia dump from here and process it with the WikiExtractor with the following command:

python WikiExtractor.py -l -s -o output/ [here put the path to the Wikipedia Dump .xml.bz2]

Note that the -s flag will keep the sections and the -l flag the links.

Extract entity, mention and ngram frequencies + entity aspects

With the Wikipedia dump processed by the WikiExtractor into the "output/" folder, the first step is to collect the set of all entity mentions in Wikipedia, so that you can later count their frequencies as ngrams. You can do this with

1-CollectAllMentions.ipynb 

which produces an all_mentions.pickle file. Note that I am using an English word tokenizer, which is the only language-dependent component of the pipeline.
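
For reference, this is roughly what the collection step looks like. A minimal sketch, not the notebook's exact code: it assumes WikiExtractor's -l flag stores internal links as HTML anchors and only gathers the set of anchor texts.

```python
import os
import pickle
import re

# Links preserved by WikiExtractor's -l flag look like:
#   <a href="Barack%20Obama">Obama</a>
LINK_RE = re.compile(r'<a href="[^"]+">([^<]+)</a>')

all_mentions = set()
for root, _, files in os.walk("output/"):
    for name in files:
        with open(os.path.join(root, name), encoding="utf-8") as f:
            for line in f:
                for mention in LINK_RE.findall(line):
                    all_mentions.add(mention.lower())

with open("all_mentions.pickle", "wb") as f:
    pickle.dump(all_mentions, f)
```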

The second step extracts mention, ngram and entity counts, as well as mention_to_entities statistics (e.g., how many times the mention "Obama" points to "Barack_Obama" and how many times to "Michelle_Obama"). Statistics are still split across the n folders constituting the output of the WikiExtractor and are saved in the "Store-Counts/" folder as .json files. The script also stores a .json file for each entity, with all its aspects (see here to learn more about Entity-Aspect Linking).

2-ExtractingFreqAndAspects.ipynb
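
A rough sketch of the kind of counting this step performs; the JSON layout and paths below are assumptions, not the notebook's actual ones:

```python
import json
import re
from collections import Counter, defaultdict
from urllib.parse import unquote

LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

mention_counts = Counter()                  # how often a string is linked
entity_counts = Counter()                   # how often an entity is linked to
mention_to_entities = defaultdict(Counter)  # mention -> {entity: count}

def process_file(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            for target, anchor in LINK_RE.findall(line):
                entity = unquote(target).replace(" ", "_")
                mention = anchor.lower()
                mention_counts[mention] += 1
                entity_counts[entity] += 1
                mention_to_entities[mention][entity] += 1

process_file("output/AA/wiki_00")  # hypothetical path into the WikiExtractor output

with open("Store-Counts/AA-counts.json", "w", encoding="utf-8") as f:
    json.dump({"mentions": mention_counts,
               "entities": entity_counts,
               "mention_to_entities": mention_to_entities}, f)
```

Counter and defaultdict are both dict subclasses, so they serialize to JSON directly.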

The final pre-processing script aggregates all the counts needed to use TagMe into single .pickle files and saves them in the "Resources/" folder. You can do this by running:

3-AggregateCounts.ipynb

Note that after processing each .json file from "Store-Counts/", the script saves an intermediate count in "Resources/". This way you can already start using TagMe with partial statistics.
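
A minimal aggregation sketch, assuming each .json file in "Store-Counts/" holds a {"mentions": {...}} mapping as in the previous sketch (the real notebook aggregates all the statistics the same way):

```python
import json
import os
import pickle
from collections import Counter

total_mentions = Counter()
for name in sorted(os.listdir("Store-Counts/")):
    with open(os.path.join("Store-Counts", name), encoding="utf-8") as f:
        total_mentions.update(Counter(json.load(f)["mentions"]))
    # Save an intermediate snapshot after each file, so TagMe can already
    # run with partial statistics while aggregation is still in progress.
    with open("Resources/mention_counts.pickle", "wb") as out:
        pickle.dump(total_mentions, out)
```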

Download resources

To use my reimplementation of TagMe directly, without processing a Wikipedia dump, you can download all the needed resources from here. You will find the five .pickle files containing the required statistics, plus a tfidf_asps.pkl file with TF-IDF statistics for the final aspect-linking step. As before, these statistics are computed on English text, but they are straightforward to produce for another language by simply changing the word tokenizer.
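
Once downloaded, the pickles can be loaded directly; a quick sketch with hypothetical file names (check the downloaded archive for the actual ones):

```python
import pickle

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical names -- substitute the actual file names from the archive.
mention_counts = load_pickle("Resources/mention_counts.pickle")
entity_counts = load_pickle("Resources/entity_counts.pickle")
mention_to_entities = load_pickle("Resources/mention_to_entities.pickle")
```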

Using TagMe

Through the script

Presentation-TagMe.ipynb 

you will get a step-by-step overview of the TagMe algorithm for entity linking. It is designed to work with RISE. The last cell in the notebook shows the potential of using aspects to add further semantics to the linking process. If you'd like to know more about this, check out the original TagMe, the work done by Hasibi et al. in assessing its reproducibility, and our recent dataset and demo of entity-aspect links.
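
As a taste of what the notebook walks through, here is a condensed sketch of two statistics at the core of TagMe's spotting and disambiguation, link probability and commonness, computed from counts like the ones above (variable names follow the earlier sketches, not the notebook):

```python
def link_probability(mention, mention_counts, ngram_counts):
    # How often the string is used as a link, out of all the times it
    # occurs as plain text (ngram_counts); used to filter implausible spots.
    return mention_counts[mention] / max(1, ngram_counts[mention])

def commonness(mention, entity, mention_to_entities):
    # Prior probability that the mention points to this entity,
    # estimated from the mention_to_entities statistics.
    candidates = mention_to_entities[mention]
    return candidates[entity] / max(1, sum(candidates.values()))
```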
