Tools and data for creating DBpedia Spotlight models.

dbpedia/model-quickstarter

 
 


Quickstarter for DBpedia Spotlight models

Update, February 2020

The DBpedia-Spotlight server downloads the most recent language models from the DBpedia Databus. The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, which are also downloaded from the DBpedia Databus.

Update, January 2016

This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Running the biggest model (English) takes around 2h on a single machine with around 32GB of RAM. We recommend running this script on an SSD with around 100GB of free space.

Requirements

  • Git
  • Maven 3
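
Before starting a build, it can help to confirm that both tools are on the PATH. The snippet below is a minimal sketch, not part of this repository: `check_tools` is a hypothetical helper name, and `mvn` is assumed to be the name of the Maven executable. It only checks that the commands exist and does not verify that Maven is version 3.

```shell
# Hypothetical helper (not part of this repository).
# Reports every command that is not on the PATH.
# Returns 0 when all tools are found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Git and Maven are the two requirements listed above.
# Assumption: the Maven executable is called "mvn".
if check_tools git mvn; then
  echo "all requirements found"
else
  echo "install the missing tools before building a model"
fi
```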

Spotlight model creation

From the command line, run:

./mainModelBuilder.sh LANG_LOC-Stemmer

where LANG is the two-letter language code, LOC is the two-letter locale (region) code, and Stemmer is the name of the Snowball stemmer algorithm. The language and locale codes correspond to those defined in BCP 47. If no stemmer algorithm is available for the language, use the string None, e.g., ja_JP-None for Japanese.
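
The argument format can be illustrated with a small shell helper that splits it into its three parts. This is only a sketch: `parse_model_arg` is a hypothetical name and is not taken from mainModelBuilder.sh, and the pattern assumes a lowercase two-letter language code followed by an uppercase two-letter locale code, as in ja_JP-None.

```shell
# Hypothetical helper (not part of mainModelBuilder.sh).
# Splits a LANG_LOC-Stemmer argument and prints "LANG LOC Stemmer".
# Returns 1 if the argument does not look like xx_YY-Name.
parse_model_arg() {
  arg="$1"
  case "$arg" in
    [a-z][a-z]_[A-Z][A-Z]-?*) ;;   # e.g. ja_JP-None
    *)
      echo "invalid argument: $arg (expected LANG_LOC-Stemmer)" >&2
      return 1
      ;;
  esac
  locale="${arg%%-*}"    # text before the first "-", e.g. ja_JP
  stemmer="${arg#*-}"    # text after the first "-", e.g. None
  lang="${locale%%_*}"   # text before "_", e.g. ja
  loc="${locale#*_}"     # text after "_", e.g. JP
  printf '%s %s %s\n' "$lang" "$loc" "$stemmer"
}

parse_model_arg ja_JP-None   # prints "ja JP None"
```

Validating the argument up front like this would catch a malformed value before any large download begins.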

The process is divided into the following steps:

  1. Preparing the data: Downloads the Wikipedia dump file for the specified language.
  2. DBpedia extraction: Downloads the redirects, disambiguations, and instance-type artifacts from the DBpedia Databus.
  3. Extracting wiki stats: Analyzes the Wikipedia dump file to extract statistical information such as the number of URIs (uriCounts), the number of times a token appears (tokenCounts), etc.
  4. Setting up Spotlight: Clones the dbpedia-spotlight-model project and sets it up in the working directory (wdir).
  5. Build Spotlight model: Collects the data from the previous steps to build the corresponding language model.

Datasets

You can find pre-built language models in the DBpedia Databus.

Contribution

The DBpedia forum describes some tasks needed to improve the language model building process. The main idea is to add more language models and/or improve the available models.

Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper:

@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
