
mrlin - MapReduce processing of Linked Data

...because it's magic

The basic idea of mrlin is to enable MapReduce processing of Linked Data - hence the name. In the following, I'm first going to show you how to use HBase to store Linked Data as RDF triples, and then how to use Hadoop to run MapReduce jobs over them.

Background

Dependencies

Representing RDF triples in HBase

Learn about how mrlin represents RDF triples in HBase.
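The wiki page linked above is authoritative; as a rough illustration only, here is a speculative happybase sketch of the kind of layout the query output further down suggests: row key = subject URI, column families per triple position with numeric qualifiers. The family names 'P' and 'O' and the qualifier scheme are my assumptions, not necessarily mrlin's actual schema.

import happybase

conn = happybase.Connection('localhost', port=9191)
# guessed layout: one 'rdf' table, families 'P' (predicates) and 'O' (objects)
conn.create_table('rdf', {'P': dict(), 'O': dict()})
table = conn.table('rdf')
# row key = subject URI; matching qualifiers ('0') pair up the predicate
# and object belonging to the same triple
table.put(b'http://example.org/resource/1',
          {b'P:0': b'http://www.w3.org/2000/01/rdf-schema#label',
           b'O:0': b'"an example label"'})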

RESTful Interaction with HBase

Dig into RESTful interactions with HBase, in mrlin.
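As a generic illustration (not mrlin-specific code), the sketch below reads one row through HBase's REST gateway; it assumes you have started the REST server separately (./bin/hbase rest start, default port 8080) and that the 'rdf' table and row exist. Note that the gateway base64-encodes column names and cell values in its JSON responses.

import base64
import requests
from urllib.parse import quote

# row keys are subject URIs here, so they must be percent-encoded in the path
row_key = quote('http://dbpedia.org/resource/Galway', safe='')
resp = requests.get('http://localhost:8080/rdf/' + row_key,
                    headers={'Accept': 'application/json'})
resp.raise_for_status()
for row in resp.json()['Row']:
    for cell in row['Cell']:
        print(base64.b64decode(cell['column']), base64.b64decode(cell['$']))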

Usage

Setup

I assume you have HBase installed in some directory HBASE_HOME and mrlin in some other directory MRLIN_HOME. First, let's make sure that Happybase is installed correctly; we will use a virtualenv for this. You only need to do this once: go to MRLIN_HOME and type:

$ virtualenv hb

Time to launch HBase and the Thrift server: in the HBASE_HOME directory, type the following:

$ ./bin/start-hbase.sh 
$ ./bin/hbase thrift start -p 9191

OK, now we're ready to launch mrlin - change to the directory MRLIN_HOME and first activate the virtualenv we created earlier:

$ source hb/bin/activate

You should see the prompt change to something like (hb)michau@~/Documents/dev/mrlin$ ... which means we're good to go!
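Before importing anything, you can quickly check from Python that happybase can reach the Thrift server we started above (note the non-default port 9191 passed to hbase thrift start):

import happybase

# connects to the Thrift server started earlier on port 9191
conn = happybase.Connection('localhost', port=9191)
print(conn.tables())  # lists existing tables; raises if Thrift is unreachable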

Import RDF/NTriples

To import RDF/NTriples documents, use the mrlin import script.

First, try to import a file from the local filesystem. Note the second parameter (http://example.org/), which specifies the target graph URI to import into:

(hb)michau@~/Documents/dev/mrlin$ python mrlin_import.py data/test_0.ntriples http://example.org/

If this works, try to import directly from a URL, such as http://dbpedia.org/data/Galway.ntriples:

(hb)michau@~/Documents/dev/mrlin$ python mrlin_import.py http://dbpedia.org/data/Galway.ntriples http://dbpedia.org/
2012-10-30T08:56:21 Initialized mrlin table.
2012-10-30T08:56:31 Importing RDF/NTriples from URL http://dbpedia.org/data/Galway.ntriples into graph http://dbpedia.org/
2012-10-30T08:56:31 == STATUS ==
2012-10-30T08:56:31  Time to retrieve source: 9.83 sec
2012-10-30T08:56:31 == STATUS ==
2012-10-30T08:56:31  Time elapsed since last checkpoint:  0.07 sec
2012-10-30T08:56:31  Import speed: 1506.61 triples per sec
2012-10-30T08:56:31 == STATUS ==
2012-10-30T08:56:31  Time elapsed since last checkpoint:  0.02 sec
2012-10-30T08:56:31  Import speed: 4059.10 triples per sec
2012-10-30T08:56:31 ==========
2012-10-30T08:56:31 Imported 233 triples.

Note that you can also import an entire directory (mrlin will look for .nt and .ntriples files):

(hb)michau@~/Documents/dev/mrlin$ python mrlin_import.py data/ http://example.org/
2012-10-30T03:55:18 Importing RDF/NTriples from directory /Users/michau/Documents/dev/mrlin/data into graph http://example.org/
...

To reset the HBase table (and remove all triples from it), use the mrlin utils script like so:

(hb)michau@~/Documents/dev/mrlin$ python mrlin_utils.py clear
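For the curious, a rough happybase equivalent of dropping the table by hand might look like the following; this assumes the table name rdf seen in the query log below, and the actual script may well do more (such as recreating an empty table):

import happybase

conn = happybase.Connection('localhost', port=9191)
# an HBase table must be disabled before it can be dropped
conn.delete_table('rdf', disable=True)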

Query

In order to query the mrlin datastore in HBase, use the mrlin query script:

(hb)michau@~/Documents/dev/mrlin$ python mrlin_query.py Tribes
2012-10-30T04:01:22 Scanning table rdf with filter ValueFilter(=,'substring:Tribes')
2012-10-30T04:01:22 Key: http://dbpedia.org/resource/Galway - Value: {'O:148': 'u\'"City of the Tribes"\'', 'O:66': 'u\'"City of the Tribes"\'',  ...}
2012-10-30T04:01:22 ============
2012-10-30T04:01:22 Query took me 0.01 seconds.
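Under the hood this is a plain HBase table scan with a value filter, as the log output shows. A rough happybase equivalent of the query above (table name rdf taken from the log) might be:

import happybase

conn = happybase.Connection('localhost', port=9191)
table = conn.table('rdf')
# substring match on cell values, mirroring the filter in the log above
for key, data in table.scan(filter="ValueFilter(=,'substring:Tribes')"):
    print(key, data)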

Running MapReduce jobs

TBD

  • set up the virtualenv: source hb/bin/activate, then pip install mrjob
  • copy the job configuration into place before launching: cp .mrjob.conf ~ (repeat this whenever you change settings!)
  • run python mrlin_mr.py README.md for a standalone run; see the sketch after this list
  • set up Hadoop 1.0.4 - if unsure, follow a single-node setup tutorial
  • note all changes that were necessary in conf/core-site.xml, conf/mapred-site.xml, conf/hdfs-site.xml, and hadoop-env.sh (provide examples)
  • run python mrlin_mr.py -r hadoop README.md to run against the local Hadoop installation
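For reference, a minimal mrjob job has the following shape; this is a generic line-counting sketch under the setup above, not mrlin_mr.py's actual logic:

from mrjob.job import MRJob

class MRLineCount(MRJob):
    def mapper(self, _, line):
        # mrjob feeds the mapper one input line at a time
        if line.strip():
            yield 'lines', 1

    def reducer(self, key, counts):
        # sum the per-line counts emitted by the mappers
        yield key, sum(counts)

if __name__ == '__main__':
    MRLineCount.run()

Saved as, say, count_lines.py (a hypothetical filename), it runs standalone via python count_lines.py README.md and on Hadoop via python count_lines.py -r hadoop README.md, mirroring the two invocations above.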

Debug

  • tail the NameNode log (the filename depends on your user and hostname), for example: tail -f hadoop-michau-namenode-Michael-Hausenblas-iMac.local.log

License

All artifacts in this repository are licensed under the Apache License, Version 2.0.
