Andre Pereira edited this page Mar 17, 2015 · 2 revisions

Local Wikipedia mirror

To set up a running DBpedia Live instance you need a local Wikipedia mirror. We use mwdumper for this job.

First, download the following files from the latest Wikipedia dump for your language:

  1. pages-articles.xml.bz2
  2. imagelinks.sql.gz
  3. image.sql.gz
  4. langlinks.sql.gz
  5. templatelinks.sql.gz

Then unzip the latest MediaWiki release into a folder visible to your Apache server.
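Wikimedia publishes these dumps under a predictable URL scheme on dumps.wikimedia.org. The sketch below only builds and prints the download URLs for a given language and dump date; the example values and the URL pattern are assumptions you should verify against the dump site before piping the list to wget.

```shell
# Hypothetical example values -- adjust to your language and dump date.
LANG_CODE=en
DATE=latest
BASE="https://dumps.wikimedia.org/${LANG_CODE}wiki/${DATE}"

# The five files needed for the mirror (same list as above).
URLS=""
for f in pages-articles.xml.bz2 imagelinks.sql.gz image.sql.gz \
         langlinks.sql.gz templatelinks.sql.gz; do
  URLS="$URLS ${BASE}/${LANG_CODE}wiki-${DATE}-${f}"
done

# Print one URL per line; download with: xargs -n1 wget < /tmp/dump-urls.txt
printf '%s\n' $URLS > /tmp/dump-urls.txt
cat /tmp/dump-urls.txt
```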

You can also use this shell script to download the latest dumps for your language.

Edit the maintenance/tables.sql file of your MediaWiki installation and append DEFAULT CHARACTER SET binary to every table definition, e.g.:

CREATE TABLE /*_*/user_former_groups (
  -- Key to user_id
  ufg_user int unsigned NOT NULL default 0,
  ufg_group varbinary(255) NOT NULL default ''
) /*$wgDBTableOptions*/ DEFAULT CHARACTER SET binary;
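Editing every table definition by hand is tedious. Since each CREATE TABLE in tables.sql ends with the `/*$wgDBTableOptions*/` marker, a single sed substitution can append the charset option everywhere at once. This is a sketch, assuming the marker appears verbatim in your MediaWiki version; it is demonstrated on a minimal sample file, and you should run the same command against maintenance/tables.sql and verify the result before loading it.

```shell
# Create a minimal sample resembling maintenance/tables.sql.
cat > /tmp/tables-sample.sql <<'EOF'
CREATE TABLE /*_*/user_former_groups (
  ufg_user int unsigned NOT NULL default 0,
  ufg_group varbinary(255) NOT NULL default ''
) /*$wgDBTableOptions*/;
EOF

# Append "DEFAULT CHARACTER SET binary" after every table-options marker.
sed -i 's|/\*\$wgDBTableOptions\*/|/*$wgDBTableOptions*/ DEFAULT CHARACTER SET binary|' \
    /tmp/tables-sample.sql

cat /tmp/tables-sample.sql
```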

Create a MySQL database, e.g. dbpedia_live, and load tables.sql into that database. Then run the following command:

$ java -jar mwdumper.jar --format=sql:1.5 <dump-lang>-pages-articles.xml.bz2 | mysql -u <username> -p -f --default-character-set=utf8 <databasename>

If the import fails due to UTF-8 errors, try to "clean" the XML dump by piping it through iconv -f utf-8 -t utf-8 -c (or try a different dump). After a successful import, load the other table dumps (image, imagelinks, langlinks, templatelinks) into the database.
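The remaining table dumps can be loaded in one loop. This is a minimal sketch, assuming the files sit in the current directory and follow the standard `<lang>wiki-<date>-<table>.sql.gz` naming; the actual mysql invocation is left commented out so you can adjust the file prefix and credentials first.

```shell
DB=dbpedia_live          # the database created earlier
PREFIX=enwiki-latest     # hypothetical dump prefix -- adjust to your files

LOADED=""
for t in image imagelinks langlinks templatelinks; do
  echo "would load ${PREFIX}-${t}.sql.gz into ${DB}"
  # zcat streams the compressed dump straight into MySQL:
  # zcat "${PREFIX}-${t}.sql.gz" | mysql -u root -p --default-character-set=utf8 "$DB"
  LOADED="$LOADED $t"
done
```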

Setup OAI

Since your Wikipedia mirror needs to keep track of the changes made to its source, you need to install the OAI extension on your clone.

$ cd /var/www/wikipedia/extensions/
$ git clone http://git.wikimedia.org/git/mediawiki/extensions/OAI.git

In your browser, go to http://localhost/wikipedia/ (or wherever you installed MediaWiki) and configure your copy of the wiki. Then download the generated LocalSettings.php and place it in your wikipedia folder.

The configuration happens in the LocalSettings.php file (adjust the credentials for the OAI source repository):

require_once("$IP/extensions/OAI/OAIHarvest.php");
$oaiSourceRepository = "http://<user>:<password>@<oaiserverurl>";
# OAI repository for update server
require_once("$IP/extensions/OAI/OAIRepo.php");
$oaiAuth = true;
$wgDebugLogGroups['oai'] = '<pathToLog>/oai.log';

MySQL part

Import the OAI extension's SQL into your Wikipedia database. Note that importing update_table.sql can take a long time and/or produce errors; if so, the most reliable approach is to run its CREATE TABLE statement from the command line first, followed by the INSERT/UPDATE statements.

$ mysql dbpedia_live -uroot -p < OAI/update_table.sql
$ mysql dbpedia_live -uroot -p < OAI/oaiuser_table.sql
$ mysql dbpedia_live -uroot -p < OAI/oaiharvest_table.sql
$ mysql dbpedia_live -uroot -p < OAI/oaiaudit_table.sql
$ echo "INSERT INTO /*$wgDBprefix*/oaiuser(ou_name, ou_password_hash) VALUES ('dbpedia', md5('<aPasswordForLocalOAI>') );" | mysql dbpedia_live -u root -p

While you are at it, create a new file pw.txt that contains your <aPasswordForLocalOAI>.
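A minimal sketch for creating pw.txt; the placeholder password must be replaced with your real one, and the restrictive file permissions are an extra precaution, not a requirement from the text above.

```shell
# Write the local OAI password to pw.txt.
# '<aPasswordForLocalOAI>' is a placeholder -- substitute your real password.
printf '%s' '<aPasswordForLocalOAI>' > pw.txt
chmod 600 pw.txt   # readable only by the owner
```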

Import the previously downloaded files into the database. If you're using phpMyAdmin you will most likely have to edit php.ini to increase the maximum POST size and my.ini to increase the maximum memory available. The BigDump script can help when importing big files.

You can also use this shell script to load all the relevant files into your database.

Running the synchronization

$ cd /var/www/wiki/extensions/OAI/ (or equivalent; this is the recommended path)
$ php oaiUpdate.php (this starts the synchronization)

Attention! The synchronization does not include any kind of delay, so it will hammer the wikiproxy. Since the wikiproxy is relatively slow, it will stop responding and oaiUpdate.php will crash with an error. (This plugin needs to be refactored to introduce better exception handling, but for now the following hack works.)

In order to introduce a delay, open the oaiHarvest.php file and look for the following line:

function fetchURL( $url, &$resultCode ) {

Insert the following line after it:

sleep(10);

(where 10 is the number of seconds to wait between update calls; a value of at least 10 is recommended).
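The same edit can be applied non-interactively. The sed command below is a sketch, assuming the fetchURL signature appears verbatim in oaiHarvest.php; it is demonstrated on a minimal sample file, and you would run the same sed against extensions/OAI/oaiHarvest.php in your installation.

```shell
# Create a minimal sample resembling the relevant part of oaiHarvest.php.
cat > /tmp/oaiHarvest-sample.php <<'EOF'
function fetchURL( $url, &$resultCode ) {
	$result = file_get_contents( $url );
}
EOF

# Append a 10-second delay right after the function's opening line
# (GNU sed 'a' append command).
sed -i '/function fetchURL(/a sleep(10);' /tmp/oaiHarvest-sample.php

cat /tmp/oaiHarvest-sample.php
```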

More information about how to install, set up, and configure the OAI extension can be found here:

Install Extraction Framework

$ git clone git://github.com/dbpedia/extraction-framework.git
$ cd extraction-framework
$ mvn clean install # Compiles the code

Adjust the settings in live.ini and live.xml according to your language and needs. Put the pw.txt file in the live folder. (Examples of preconfigured files for German can be found here)

Generate initial cache

Create a MySQL database for caching extracted triples, e.g. dbpedia_live_cache, and load the dbstructure.sql in that database.

$ mysql dbpedia_live_cache -uroot -p < dbstructure.sql 

(NOTE: if you get an error when importing the SQL file, change the "SET SESSION" entries to "SET GLOBAL" in the dbstructure.sql file.)
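That substitution can be done with one sed command. This is a sketch demonstrated on an illustrative sample file (the sample's variable name is only an example); run the same command against your dbstructure.sql.

```shell
# Create a sample file with a session-scoped setting.
cat > /tmp/dbstructure-sample.sql <<'EOF'
SET SESSION max_heap_table_size = 1024*1024*1024;
EOF

# Change every session-level setting to a global one.
sed -i 's/SET SESSION/SET GLOBAL/g' /tmp/dbstructure-sample.sql

cat /tmp/dbstructure-sample.sql
```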

Apache configuration

You must place the following in your Apache site configuration:

# Enable cross origin policy
Header set Access-Control-Allow-Origin "*"  

# Avoid opening your server to arbitrary proxying
ProxyRequests Off

# Let Apache pass the original host, not the ProxyPass one
ProxyPreserveHost On
ProxyTimeout 1200

# Virtuoso / DBpedia VAD proxying
ProxyPass        /conductor   http://localhost:XXXX/conductor
ProxyPassReverse /conductor   http://localhost:XXXX/conductor
ProxyPass        /about       http://localhost:XXXX/about
ProxyPassReverse /about       http://localhost:XXXX/about
ProxyPass        /category    http://localhost:XXXX/category
ProxyPassReverse /category    http://localhost:XXXX/category
ProxyPass        /class       http://localhost:XXXX/class
ProxyPassReverse /class       http://localhost:XXXX/class
ProxyPass        /data4       http://localhost:XXXX/data4
ProxyPassReverse /data4       http://localhost:XXXX/data4
ProxyPass        /data3       http://localhost:XXXX/data3
ProxyPassReverse /data3       http://localhost:XXXX/data3
ProxyPass        /data2       http://localhost:XXXX/data2
ProxyPassReverse /data2       http://localhost:XXXX/data2
ProxyPass        /data        http://localhost:XXXX/data
ProxyPassReverse /data        http://localhost:XXXX/data
ProxyPass        /describe    http://localhost:XXXX/describe
ProxyPassReverse /describe    http://localhost:XXXX/describe
ProxyPass        /delta.vsp   http://localhost:XXXX/delta.vsp
ProxyPassReverse /delta.vsp   http://localhost:XXXX/delta.vsp
ProxyPass        /fct         http://localhost:XXXX/fct
ProxyPassReverse /fct         http://localhost:XXXX/fct
ProxyPass        /isparql     http://localhost:XXXX/isparql
ProxyPassReverse /isparql     http://localhost:XXXX/isparql
ProxyPass        /ontology    http://localhost:XXXX/ontology
ProxyPassReverse /ontology    http://localhost:XXXX/ontology
ProxyPass        /page        http://localhost:XXXX/page
ProxyPassReverse /page        http://localhost:XXXX/page
ProxyPass        /property    http://localhost:XXXX/property
ProxyPassReverse /property    http://localhost:XXXX/property
ProxyPass        /rdfdesc     http://localhost:XXXX/rdfdesc
ProxyPassReverse /rdfdesc     http://localhost:XXXX/rdfdesc
ProxyPass        /resource    http://localhost:XXXX/resource
ProxyPassReverse /resource    http://localhost:XXXX/resource
ProxyPass        /services    http://localhost:XXXX/services
ProxyPassReverse /services    http://localhost:XXXX/services
ProxyPass        /snorql      http://localhost:XXXX/snorql
ProxyPassReverse /snorql      http://localhost:XXXX/snorql
ProxyPass        /sparql-auth http://localhost:XXXX/sparql-auth
ProxyPassReverse /sparql-auth http://localhost:XXXX/sparql-auth
ProxyPass        /sparql      http://localhost:XXXX/sparql
ProxyPassReverse /sparql      http://localhost:XXXX/sparql
ProxyPass        /statics     http://localhost:XXXX/statics
ProxyPassReverse /statics     http://localhost:XXXX/statics
ProxyPass        /void        http://localhost:XXXX/void
ProxyPassReverse /void        http://localhost:XXXX/void
ProxyPass        /wikicompany http://localhost:XXXX/wikicompany
ProxyPassReverse /wikicompany http://localhost:XXXX/wikicompany
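Since every entry follows the same ProxyPass/ProxyPassReverse pattern, the block above can be regenerated with a short loop instead of being maintained by hand. The port stays a placeholder as in the listing (Virtuoso's default HTTP port is 8890 if you need a starting value); the generated file path is hypothetical.

```shell
PORT=XXXX   # replace with your Virtuoso HTTP port (8890 by default)

# Same paths, same order as the listing above (sparql-auth before sparql,
# so the more specific prefix matches first).
PATHS="conductor about category class data4 data3 data2 data describe \
delta.vsp fct isparql ontology page property rdfdesc resource services \
snorql sparql-auth sparql statics void wikicompany"

CONF=""
for p in $PATHS; do
  CONF="${CONF}ProxyPass        /${p} http://localhost:${PORT}/${p}
ProxyPassReverse /${p} http://localhost:${PORT}/${p}
"
done
printf '%s' "$CONF" > /tmp/virtuoso-proxy.conf
cat /tmp/virtuoso-proxy.conf
```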

Virtuoso setup

Install Virtuoso and configure it as follows, where XX is the ISO 639-1 code of your clone.

DB.DBA.RDF_GRAPH_GROUP_CREATE ('http://XX.dbpedia.org',1);
DB.DBA.RDF_GRAPH_GROUP_INS ('http://XX.dbpedia.org','http://live.XX.dbpedia.org');
DB.DBA.RDF_GRAPH_GROUP_INS ('http://XX.dbpedia.org','http://static.XX.dbpedia.org');
DB.DBA.RDF_GRAPH_GROUP_INS ('http://XX.dbpedia.org','http://XX.dbpedia.org/resource/classes#');
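The statements above can be generated for your language by substituting the code once. A sketch, using `de` as a hypothetical example code and writing the result to a temporary script; how you load it into Virtuoso (e.g. via the isql client) depends on your setup, so that step is left commented out.

```shell
XX=de   # hypothetical example -- use the ISO 639-1 code of your clone

# Expand the language code into the four graph-group statements.
cat > /tmp/graph_groups.sql <<EOF
DB.DBA.RDF_GRAPH_GROUP_CREATE ('http://$XX.dbpedia.org',1);
DB.DBA.RDF_GRAPH_GROUP_INS ('http://$XX.dbpedia.org','http://live.$XX.dbpedia.org');
DB.DBA.RDF_GRAPH_GROUP_INS ('http://$XX.dbpedia.org','http://static.$XX.dbpedia.org');
DB.DBA.RDF_GRAPH_GROUP_INS ('http://$XX.dbpedia.org','http://$XX.dbpedia.org/resource/classes#');
EOF

# Then load it into Virtuoso, e.g.:
# isql 1111 dba <password> /tmp/graph_groups.sql
cat /tmp/graph_groups.sql
```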

Setup Linked Data interface

Set this up the same way as a normal xx.dbpedia.org installation. Check https://github.com/dbpedia/dbpedia-vad-i18n for details.
