Source code for the paper "Probabilistic Bag-Of-Hyperlinks Model for Entity Linking", http://dl.acm.org/citation.cfm?id=2882988

pboh-entity-linking

PBoH Entity Linking system.

Code: beta version

Paper: "Probabilistic Bag-Of-Hyperlinks Model for Entity Linking", Ganea O-E et al. (Proc. WWW 2016), http://dl.acm.org/citation.cfm?id=2882988

Slides, poster, online system and comparison with existing systems: http://people.inf.ethz.ch/ganeao

Newest GERBIL results:

Indexes download link: https://polybox.ethz.ch/index.php/s/IOWjGrU3mjyzDSV . The indexes are required wherever the code contains file paths with the prefix '/media/hofmann-scratch/'. Files whose names end in part* must be concatenated into a single file named without that suffix before use; for example, all files anchorsListFromEachWikiPage.txt_dev_index.part* are merged into one file called anchorsListFromEachWikiPage.txt_dev_index. The provided indexes are already in the format expected by the index loaders here: https://github.com/dalab/pboh-entity-linking/tree/master/src/main/scala/index .

For the evaluation datasets, the beginning of each file here indicates where to get the data from: https://github.com/dalab/pboh-entity-linking/tree/master/src/main/scala/eval/datasets . For the AIDA dataset, a sample of the format is shown here: https://github.com/dalab/pboh-entity-linking/issues/3 .

Please contact octavian.ganea at inf dot ethz dot ch to receive the required password and for any other questions you might have.
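
The concatenation of the part* files could be done along the following lines. This is only a minimal sketch and not part of the repository: the object name MergeIndexParts and the method mergeParts are made up for illustration, and it assumes that the part suffixes sort correctly in lexicographic order (as with .partaa, .partab, ...). If the suffixes are unpadded numbers, the sort key would need to be adapted.

```scala
import java.io.File
import java.nio.file.{Files, Path, StandardOpenOption}

// Hypothetical helper, not part of the PBoH code base.
object MergeIndexParts {
  /** Concatenates all files `<baseName>.part*` found in `dir` into `dir/<baseName>`.
    * Assumption: sorting the part file names lexicographically gives the right order.
    */
  def mergeParts(dir: Path, baseName: String): Path = {
    val prefix = baseName + ".part"
    val parts = Option(dir.toFile.listFiles())
      .getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith(prefix))
      .sortBy(_.getName)
    require(parts.nonEmpty, s"no files named $prefix* found in $dir")

    val target = dir.resolve(baseName)
    val out = Files.newOutputStream(
      target,
      StandardOpenOption.CREATE,
      StandardOpenOption.TRUNCATE_EXISTING,
      StandardOpenOption.WRITE)
    // Append the bytes of each part to the target, one part after the other.
    try parts.foreach(p => Files.copy(p.toPath, out))
    finally out.close()
    target
  }
}
```

For example, MergeIndexParts.mergeParts(Paths.get("/path/to/indexes"), "anchorsListFromEachWikiPage.txt_dev_index") would write the merged index next to its parts; the directory shown here is a placeholder.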

How to run the code:

  • Download the above indexes and update their locations inside the code. Do the same for the test sets.
  • Compile with 'mvn package'. This generates a self-contained jar called target/PBoH-1.0-SNAPSHOT-jar-with-dependencies.jar
  • Run the code to test PBoH on the datasets mentioned in the paper using the command: scala -J-Xmx90g target/PBoH-1.0-SNAPSHOT-jar-with-dependencies.jar testPBOHOnAllDatasets max-product (the -J-Xmx90g option sets the maximum JVM heap size to 90 GB).
  • To add a new dataset, write a class similar to eval/datasets/AQUAINT_MSNBC_ACE04.scala that transforms an input text file with entity annotations into an object of type Array[(String, Array[(String, Int, Array[Int])])]. Each element of this array is a pair (doc_name, doc_annotations), where doc_annotations is an array of the entity annotations from that document, each in the format (mention.toLowerCase(), entity, context). Here, context is an array of the word IDs of the words surrounding the mention in a window of fixed size. The word IDs are obtained from the word strings using the in-memory dictionary index/WordFreqDict.scala.
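
To illustrate the target data structure for a new dataset, here is a rough sketch of such a transformation. Everything beyond the type itself is an assumption made for the example: the object ToyDatasetLoader, the tab-separated input layout (doc_name, mention, entity, context words), and the wordToId function, which merely stands in for a lookup in index/WordFreqDict.scala (whose actual API is not reproduced here). The entity is taken to be a numeric ID because the type above uses Int in that position.

```scala
// Hypothetical loader, not part of the PBoH code base.
object ToyDatasetLoader {
  // (lowercased mention, entity, context word IDs)
  type Annotation = (String, Int, Array[Int])
  // (doc_name, doc_annotations)
  type Dataset = Array[(String, Array[Annotation])]

  /** Parses lines of the assumed form
    *   doc_name <TAB> mention <TAB> entity <TAB> space-separated context words
    * `wordToId` is a stand-in for the word dictionary; unknown words are dropped.
    */
  def parse(lines: Seq[String], wordToId: String => Option[Int]): Dataset = {
    val rows = lines.filter(_.trim.nonEmpty).map { line =>
      val cols = line.split("\t", -1)
      require(cols.length == 4, s"expected 4 tab-separated columns: $line")
      val context: Array[Int] = cols(3)
        .split(" ")
        .filter(_.nonEmpty)
        .map(wordToId)
        .collect { case Some(id) => id }
      (cols(0), (cols(1).toLowerCase, cols(2).toInt, context))
    }
    // Group the annotations by document, keeping first-seen document order.
    val docOrder = rows.map(_._1).distinct
    val byDoc = rows.groupBy(_._1)
    docOrder.map(doc => (doc, byDoc(doc).map(_._2).toArray)).toArray
  }
}
```

A real dataset class would read its lines from the annotated input file and extract the context window around each mention itself; the sketch only shows how the nested array of (doc_name, doc_annotations) pairs is assembled.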