Extraction Instructions

EK edited this page Oct 8, 2019 · 33 revisions

Requirements

  • Java 7
  • Maven 3

Download

$ git clone https://github.com/dbpedia/extraction-framework.git

Dump-Based Extraction

From the root directory of the cloned repository, run the following commands:

$ mvn clean install # Compiles the code
$ cd dump
$ ../run download <download-config-file> # Downloads the Wikipedia dumps according to your specified download-config-file 
$ ../run extraction <extraction-config-file> # Extracts triples from the downloaded dumps according to your specified extraction-config-file 

For <download-config-file> and <extraction-config-file> you can either reuse the existing files from the repository or adapt them to your needs.

Before running with an existing config file, open it in an editor and adjust it to your environment. Make sure the base-dir and languages options agree between the download and extraction configuration files.
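The alignment requirement above can be checked mechanically before a long run. The following is a minimal sketch, not part of the framework: it assumes both config files use plain `key=value` lines, and the helper names `get_prop` and `check_aligned` are made up for illustration.

```shell
# Sketch: verify that two .properties files agree on base-dir and languages.
# Assumes simple "key=value" lines; values are compared as raw strings.

# Print the value of key $2 in properties file $1 (last occurrence wins).
get_prop() {
  local file="$1" key="$2"
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Return 0 if base-dir and languages match in both files, 1 otherwise.
check_aligned() {
  local download_cfg="$1" extraction_cfg="$2" key status=0
  for key in base-dir languages; do
    if [ "$(get_prop "$download_cfg" "$key")" != "$(get_prop "$extraction_cfg" "$key")" ]; then
      echo "mismatch for '$key'" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (file names are placeholders for your own config files):
#   check_aligned my-download.properties my-extraction.properties && echo "configs aligned"
```

Because the comparison is textual, two values that differ only in trailing whitespace or in the order of the language list would be reported as a mismatch.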

For fine-grained serialization options you can read this page.

Windows instructions

Note that run is a Linux bash script. On Windows, replace the last two steps above with:

mvn scala:run "-Dlauncher=download"   "-DaddArgs=<config=download-config-file>" 
mvn scala:run "-Dlauncher=extraction" "-DaddArgs=<extraction-config-file>" 

Note: the config file names shown in angle brackets above must be replaced by actual .properties filenames, without the angle brackets.

Abstract Extraction

Note: Instead of following the instructions below, you can simply download the abstract triples from the DBpedia dumps directory. For example, the abstract files (long and short versions) for the Spanish (es) chapter are available at http://downloads.dbpedia.org/3.9/es/

Abstracts are not generated by the Simple Wiki Parser; they are produced by a local Wikipedia clone running a modified MediaWiki installation.

To generate clean abstracts from Wikipedia articles, wiki templates must be rendered as they would be in the original Wikipedia instance. The DBpedia Abstract Extractor therefore needs a running MediaWiki instance with Wikipedia data in a MySQL database.

To install and start the MySQL server, you can use dump/src/main/bash/mysql.sh. Set MYSQL_HOME to the folder where you installed the MySQL binaries.
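Before calling the script, it can help to confirm that MYSQL_HOME really points at a MySQL installation. The sketch below is only an illustration: the function name `check_mysql_home` is hypothetical, and it assumes the usual layout in which the server binary lives at `bin/mysqld` under the installation folder.

```shell
# Sketch: sanity-check a MYSQL_HOME candidate before starting the server.
# Assumption: the MySQL server binary is found at <MYSQL_HOME>/bin/mysqld.
check_mysql_home() {
  local home="$1"
  if [ -z "$home" ]; then
    echo "MYSQL_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$home/bin/mysqld" ]; then
    echo "no executable mysqld under $home/bin" >&2
    return 1
  fi
  return 0
}

# Usage (the path is a placeholder for your own installation folder):
#   export MYSQL_HOME=/path/to/mysql
#   check_mysql_home "$MYSQL_HOME" && echo "MYSQL_HOME looks usable"
```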

To import the data, you need to run the Scala 'import' launcher. First, adapt the settings for the 'import' launcher in dump/pom.xml:

<arg>/home/release/wikipedia</arg><!-- path to folder containing Wikipedia XML dumps -->
<arg>/home/release/data/projects/mediawiki/core/maintenance/tables.sql</arg><!-- file containing MediaWiki table definitions -->
<arg>localhost</arg><!-- MySQL host:port - localhost should work if you use mysql.sh -->
<arg>true</arg><!-- require-download-complete -->
<arg>10000-</arg><!-- languages and article count ranges, comma-separated, e.g. "en,de" -->

Then cd to dump/ and run:

$ ../run import

This should import all the templates into the MySQL tables.

To set up the local Wikipedia instance you have to use the modified MediaWiki code from here: https://github.com/dbpedia/extraction-framework/tree/master/dump/src/main/mediawiki and configure it to listen to the URL from here

See http://wiki.dbpedia.org/AbstractExtraction for a few more details. TODO: move content from that page here.

See https://github.com/dbpedia/extraction-framework/wiki/Dbpedia-Abstract-Extraction-step-by-step-guide for a step-by-step approach.

Core Module

Download Ontology

To download a fresh copy of the DBpedia ontology from the mappings wiki, use the following commands:

$ cd ../core
$ ../run download-ontology

Download Mappings

You can download the mappings from http://mappings.dbpedia.org for offline use and run the extraction against the local files (this is configured in the extract.config file):

$ cd ../core
$ ../run download-mappings

Generate Settings

This updates various settings for all Wikipedia language editions:

$ cd ../core
$ ../run generate-settings
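The three core launchers above can also be chained in one step. This is a sketch under assumptions rather than a documented workflow: it runs them in the order they appear on this page, stops at the first failure, and expects to be called from the core directory. The function name `run_core_launchers` and its optional runner argument are additions for illustration; the argument exists only so the loop can be dry-run with `echo`.

```shell
# Sketch: run the core launchers in sequence, stopping at the first failure.
# $1 is the runner command; it defaults to ../run (call this from core/).
run_core_launchers() {
  local runner="${1:-../run}" launcher
  for launcher in download-ontology download-mappings generate-settings; do
    "$runner" "$launcher" || { echo "launcher '$launcher' failed" >&2; return 1; }
  done
}

# Usage:
#   run_core_launchers echo   # dry run: only prints the launcher names
#   run_core_launchers        # real run via ../run
```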

Server Module

TODO

Live Module

If you want to set up a new DBpedia Live instance, use this page.

Wiktionary Module

TODO