Quickstarter for DBpedia Spotlight models

Update, February 2020

The DBpedia-Spotlight server downloads the most recent language models from the DBpedia Databus. The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, which are also downloaded from the DBpedia Databus.

Update, January 2016

This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Building the biggest model (English) takes around 2 hours on a single machine with around 32 GB of RAM. We recommend running this script on an SSD with around 100 GB of free space.
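As a rough pre-flight check for the disk recommendation above, the sketch below reports the free space of the filesystem that holds the current directory. The helper name `free_gb` and the variable `required_gb` are made up for this example and are not part of the repository; the 100 GB threshold is the figure recommended above.

```shell
# free_gb (hypothetical helper): print the free space, in whole GB,
# of the filesystem that holds the given path.
free_gb() {
  # df -P gives one POSIX-format line per filesystem; column 4 is the
  # available space in 1 KB blocks because of -k.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

required_gb=100   # recommendation from this README
avail_gb=$(free_gb .)
if [ "$avail_gb" -lt "$required_gb" ]; then
  echo "Warning: only ${avail_gb} GB free, ${required_gb} GB recommended" >&2
fi
```

The check only warns; a smaller language than English may well need less space.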

Requirements

Git
Maven 3
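A quick way to confirm that both requirements are on the `PATH` before starting a build is sketched below. The helper `missing_tools` is hypothetical and not part of the repository; it assumes Maven is installed under its usual command name `mvn`.

```shell
# missing_tools (hypothetical helper): print every named command
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools git mvn)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing required tools:" $missing >&2
fi
```

This only checks that the commands exist. It does not verify that the installed Maven is version 3; `mvn -v` prints the version for a manual check.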

Spotlight model creation

On the command line, run:

./mainModelBuilder.sh LANG_LOC-Stemmer

where LANG is the two-letter language code, LOC is the two-letter region code, and Stemmer is the name of the Snowball stemmer algorithm. The language and region codes correspond to those defined in BCP 47. If no stemmer algorithm is available for the language, the string None must be used, e.g., ja_JP-None for Japanese.
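To make the argument format concrete, the sketch below splits a LANG_LOC-Stemmer string into its three parts using POSIX parameter expansion. The function `parse_model_arg` is a hypothetical illustration and is not taken from mainModelBuilder.sh, which may parse its argument differently.

```shell
# parse_model_arg (hypothetical helper): split LANG_LOC-Stemmer into
# "LANG LOC Stemmer", or fail if the argument does not have that shape.
parse_model_arg() {
  arg=$1
  case "$arg" in
    # two lowercase letters, "_", two uppercase letters, "-", stemmer name
    [a-z][a-z]_[A-Z][A-Z]-?*) ;;
    *) echo "expected LANG_LOC-Stemmer, got: $arg" >&2; return 1 ;;
  esac
  locale=${arg%%-*}   # part before the first hyphen, e.g. ja_JP
  stemmer=${arg#*-}   # part after the first hyphen, e.g. None
  lang=${locale%_*}   # part before the underscore, e.g. ja
  loc=${locale#*_}    # part after the underscore, e.g. JP
  printf '%s %s %s\n' "$lang" "$loc" "$stemmer"
}

parse_model_arg "ja_JP-None"   # prints: ja JP None
```

A malformed argument such as `japanese` makes the function print an error to standard error and return a non-zero status.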

The process is divided into the following steps:

Preparing the data: Downloads the Wikipedia dump file for the specified language.
DBpedia extraction: Downloads the redirects, disambiguations, and instance-types artifacts from the DBpedia Databus.
Extracting wiki stats: Analyzes the Wikipedia dump file to extract statistical information such as the number of URIs (uriCounts), the number of times a token appears (tokenCounts), etc.
Setting up Spotlight: Clones the dbpedia-spotlight-model project into the working directory (wdir) and sets it up.
Build Spotlight model: Collects the data from the previous steps to build the corresponding language model.

Datasets

You can find pre-built language models in the DBpedia Databus.

Contribution

The DBpedia forum describes some tasks needed to improve the language model building process. The main idea is to add more language models and/or improve the available models.

Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper.

@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
