This repository has been archived by the owner. It is now read-only.

Internationalization (DB backed core)

Jo Daiber edited this page Jan 30, 2016 · 59 revisions

Important note: This page explains how to create a Spotlight model on your own server. This is a detailed tutorial to explain each step and is partially outdated. A fully automated and up-to-date script for these steps can be found here.


For this part, you need Apache Hadoop and Apache Pig. If you don't have them installed, follow the official setup tutorials for Hadoop and for Apache Pig. The indexing can also be run on a single machine; in that case it is enough to download Apache Pig and run it in local mode (add "-x local" after every pig command to run locally without Hadoop).

For more details on Hadoop-based indexing, see Indexing with Pignlproc and Hadoop, which also lists all required versions.

In the following sections, we take the Dutch language as an example. If you want to run the indexing for other languages, just replace the nl (the language code for Dutch) with its corresponding language code. We also assume the default working directory is /user/hadoop in HDFS.
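The language code appears in every path on this page. As a sketch (the variable names below are our own, not used by the official scripts), switching languages only means changing the code in one place:

```shell
# Sketch: derive the paths used throughout this page from one language code.
# Variable names are illustrative, not part of any Spotlight script.
LANG_CODE="nl"                               # replace for other languages
LOCALE="${LANG_CODE}_NL"                     # the country part must be set per language
BASE_DIR="/data/spotlight/${LANG_CODE}"      # working directory
STOPWORDS="${BASE_DIR}/stopwords.${LANG_CODE}.list"
OPENNLP_DIR="${BASE_DIR}/opennlp"
echo "${BASE_DIR}"
```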

The quick way

This section shows the quick way of creating the Spotlight model by running the indexing script. Before you run it, prepare the models described in the first two steps below. The following programs must be available on your server: hadoop, pig, mvn, git, curl.

  1. Create working directory and download OpenNLP models for your language:

      $ mkdir -p /data/spotlight/nl
      $ ls /data/spotlight/nl/opennlp
      nl-chunker.bin  nl-pos-maxent.bin  nl-sent.bin  nl-token.bin
    

Note: the working directory is given as an absolute path. The OpenNLP models can be downloaded from http://opennlp.sourceforge.net/models-1.5/.
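Fetching the four models can be scripted. A sketch that only assumes the `<lang>-<model>.bin` naming shown above; the actual download is left commented out so the URLs can be inspected first:

```shell
# Sketch: build the download URLs for the four OpenNLP 1.5 models of a
# language, following the <lang>-<model>.bin naming convention shown above.
LANG_CODE="nl"
BASE_URL="http://opennlp.sourceforge.net/models-1.5"
for m in chunker pos-maxent sent token; do
  url="${BASE_URL}/${LANG_CODE}-${m}.bin"
  echo "$url"
  # curl -L -o "/data/spotlight/${LANG_CODE}/opennlp/${LANG_CODE}-${m}.bin" "$url"
done
```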

  2. Create a list of stopwords:

      $ head -n5 /data/spotlight/nl/stopwords.nl.list 
      de
      en
      van
      ik
      te
    
  3. Run the indexing script, which will create the model /data/spotlight/nl/model_nl:

      $ cd /data/spotlight/nl
      $ wget https://raw.github.com/jodaiber/dbpedia-spotlight/master/bin/index_db.sh
      $ ./index_db.sh -o  /data/spotlight/nl/opennlp /data/spotlight/nl nl_NL /data/spotlight/nl/stopwords.nl.list Dutch /data/spotlight/nl/model_nl
    

Note: start the Hadoop workers before you run the above commands; the paths in this command are absolute. You can change nl to another language code, but remember to also change Dutch to the corresponding language-specific Lucene analyzer, e.g. English for the EnglishAnalyzer.
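For readability, the positional arguments of the index_db.sh call above can be spelled out. This is the same command as above, just with each argument named (variable names are our own):

```shell
# The index_db.sh call from above, with each positional argument named.
OPENNLP_DIR=/data/spotlight/nl/opennlp          # -o: OpenNLP model directory
WDIR=/data/spotlight/nl                         # working directory
LOCALE=nl_NL                                    # language and country code
STOPWORDS=/data/spotlight/nl/stopwords.nl.list  # stopword list
ANALYZER=Dutch                                  # Lucene analyzer language
TARGET=/data/spotlight/nl/model_nl              # output model directory
echo ./index_db.sh -o "$OPENNLP_DIR" "$WDIR" "$LOCALE" "$STOPWORDS" "$ANALYZER" "$TARGET"
```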

The detailed way

This section describes the detailed steps for creating the Spotlight model. These steps are all performed by index_db.sh.

Download the data

  1. Prepare a stopword list file under /data/spotlight/nl and the OpenNLP models under /data/spotlight/nl/opennlp, as in the quick way.

  2. Download DBpedia data (see here)

      $ mkdir -p /data/spotlight/nl/processed/
      $ cd /data/spotlight/nl/processed/
      $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-redirects.ttl.gz | gzcat > redirects.nt
      $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-disambiguations.ttl.gz | gzcat > disambiguations.nt
      $ curl http://nl.dbpedia.org/downloads/nlwiki/20121003/nlwiki-20121003-instance-types.ttl.gz | gzcat > instance_types.nt
    

Note: if gzcat is not available, then replace it with gunzip -c.
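The three downloads above follow the same pattern, so they can be looped. A sketch that echoes the commands rather than executing them, so the URLs (which use the 20121003 dump date from this tutorial) can be checked first:

```shell
# Sketch: print the download commands for the three DBpedia dump files.
DUMP="http://nl.dbpedia.org/downloads/nlwiki/20121003"
for f in redirects disambiguations instance-types; do
  out="$(echo "$f" | tr '-' '_')"   # instance-types -> instance_types
  echo "curl ${DUMP}/nlwiki-20121003-${f}.ttl.gz | gunzip -c > ${out}.nt"
done
```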

  3. Download the Wikipedia dump:

      $ cd /data/spotlight/nl
      $ wget http://dumps.wikimedia.org/nlwiki/latest/nlwiki-latest-pages-articles.xml.bz2
    

Process the data

  1. Check out and build our version of pignlproc:

     $ mkdir pig
     $ cd pig
     $ git clone git://github.com/dbpedia-spotlight/pignlproc.git
    

    Note: There are redirect definitions for most languages that have a local Wikipedia. If you are unsure whether your language is among them, check that it is supported in the method getRedirectPatterns in AnnotatingMarkupParser.

     $ cd pignlproc
     $ mvn assembly:assembly -Dmaven.test.skip=true
    

    Note: if the build fails because the core-0.6.jar of org.dbpedia.spotlight is not available from the info-bliki repository, you need to prepare that jar yourself by downloading the DBpedia Spotlight code and running mvn install manually.

  2. Split the corpus in train, tune and test sets and move the training part into HDFS:

      $ cd /data/spotlight/nl
      $ bzcat nlwiki-latest-pages-articles.xml.bz2 | python pig/pignlproc/utilities/split_train_test.py 12000 /data/spotlight/nl/processed/test.txt | hadoop fs -put - nlwiki-latest-pages-articles.xml
    

Move the stopwords and tokenizer model into HDFS:

     $ hadoop fs -put /data/spotlight/nl/stopwords.nl.list stopwords.nl.list
     $ hadoop fs -put /data/spotlight/nl/opennlp/nl-token.bin nl.tokenizer_model
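An aside on the split step above: split_train_test.py holds out a test portion and streams the remaining training data to stdout (which is then put into HDFS). The stream-splitting idea can be illustrated with a toy awk function; this is not the actual script, which works on whole Wikipedia articles rather than plain lines:

```shell
# Toy illustration of a held-out split: the first n lines go to a file,
# the remainder passes through to stdout.
split_stream() {  # usage: split_stream <n> <heldout-file>
  awk -v n="$1" -v f="$2" 'NR <= n { print > f; next } { print }'
}
printf 'a\nb\nc\nd\n' | split_stream 2 /tmp/heldout.txt > /tmp/train.txt
```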
  3. Adapt examples/indexing/token_counts.pig.params and examples/indexing/names_and_entities.pig.params to your language. See the two linked files for the Dutch example.

    Note: Due to line 87 of RestrictedNGramGenerator.java, the path for the tokenizer model is fixed to ./nl.tokenizer_model. You therefore have to make your working directory the default working directory ./ in HDFS, i.e., /user/hadoop in our example. Otherwise the job fails with an error about the missing nl.tokenizer_model file.

  4. Run Apache Pig:

      $ cd /data/spotlight/nl/pig/pignlproc
      $ pig -m examples/indexing/token_counts.pig.params examples/indexing/token_counts.pig
      $ pig -m examples/indexing/names_and_entities.pig.params examples/indexing/names_and_entities.pig
    

    Note: If you get a "java.lang.OutOfMemoryError", increase the heap space as follows: 1) add the line SET mapred.child.java.opts '-Xmx2048m'; to the script; 2) comment out the line set io.sort.mb 1024 so that it reads --set io.sort.mb 1024.

  5. Move the results of both jobs into /data/spotlight/nl/processed/:

      $ cd /data/spotlight/nl/processed/
      $ hadoop fs -cat tokenCounts/tokenCounts/part* > tokenCounts
      $ hadoop fs -cat names_and_entities/pairCounts/part* > pairCounts
      $ hadoop fs -cat names_and_entities/uriCounts/part* > uriCounts
      $ hadoop fs -cat names_and_entities/sfAndTotalCounts/part* > sfAndTotalCounts
    

Then you should have the following files:

     $ ls /data/spotlight/nl/processed/
     pairCounts  sfAndTotalCounts  tokenCounts  uriCounts
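A quick sanity check on the generated files can catch truncated downloads or failed jobs early. This sketch assumes only that each line of a counts file is tab-separated (the exact column layouts differ per file):

```shell
# Sketch: report OK if every line of a counts file has at least two
# tab-separated fields, BAD otherwise.
check_counts() {
  awk -F'\t' 'NF < 2 { bad++ } END { print (bad ? "BAD" : "OK") }' "$1"
}
printf 'de\t10\nvan\t7\n' > /tmp/sample_counts   # tiny stand-in file
check_counts /tmp/sample_counts
```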

Create the Spotlight model

  1. Create the Spotlight model /data/spotlight/nl/model_nl with:

      $ java -cp dbpedia-spotlight.jar org.dbpedia.spotlight.db.CreateSpotlightModel nl_NL /data/spotlight/nl/processed/ /data/spotlight/nl/model_nl /data/spotlight/nl/opennlp /data/spotlight/nl/stopwords.nl.list None
    

    This will create the following Spotlight model folder:

      $ tree /data/spotlight/nl/model_nl/
      /data/spotlight/nl/model_nl/
      ├── model
      │   ├── candmap.mem
      │   ├── context.mem
      │   ├── res.mem
      │   ├── sf.mem
      │   └── tokens.mem
      ├── model.properties
      ├── opennlp
      │   ├── chunker.bin
      │   ├── pos-maxent.bin
      │   ├── sent.bin
      │   └── token.bin
      ├── opennlp_chunker_thresholds.txt
      └── stopwords.list
    
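Based on the tree above, a small check that a model directory contains the core files. A sketch of our own; it omits the opennlp subfolder and the chunker thresholds file, whose contents vary by language:

```shell
# Sketch: verify that a Spotlight model directory contains the core files
# shown in the tree above.
check_model() {
  for f in model/candmap.mem model/context.mem model/res.mem \
           model/sf.mem model/tokens.mem model.properties stopwords.list; do
    [ -e "$1/$f" ] || { echo "missing: $f"; return 1; }
  done
  echo "model OK"
}
```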

Run the server

That's it, you can now run the server with your newly created model.

$ java -jar dbpedia-spotlight.jar /data/spotlight/nl/model_nl http://localhost:2222/rest
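Once the server is up, you can test the model through the REST interface: the /rest/annotate endpoint accepts a text and a confidence parameter and returns JSON when requested via the Accept header. A sketch that only prints the request so it can be checked before the server is running (the example sentence and confidence value are our own):

```shell
# Sketch: print a curl command that queries the annotate endpoint of a
# running Spotlight server.
PORT=2222
REQ="curl -s -H 'Accept: application/json' \
  --data-urlencode 'text=Amsterdam is de hoofdstad van Nederland.' \
  --data 'confidence=0.4' \
  http://localhost:${PORT}/rest/annotate"
echo "$REQ"
```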

Note: If you only want to quickly run the statistical backend without creating a model yourself, pre-built models are available from the download page.