hsk: Hadoop Starter Kit for Clojure.

Based on clojure-hadoop by Stuart Sierra (https://github.com/stuartsierra/clojure-hadoop).

Uses Cascalog by Nathan Marz for logging support (https://github.com/nathanmarz/cascalog).

Example Usage

Unit Tests

$ lein clean
$ lein jar
$ lein test

There is also a Makefile: running "make test" runs the above plus Hadoop command-line shell tests. Your Hadoop install must be correctly configured for "make test" to succeed: see "Hadoop Configuration" below.

Be sure to run "lein jar" before "lein test": the tests need the jar file in order to run through the Hadoop framework.

If you run into trouble, try the "Hadoop Configuration" section below, which should give you a reliable, local-only Hadoop testing environment.

HDFS shell

$ lein repl
=> (load "hsk/shell")
=> (ns hsk.shell)
=> (shell "ls file:///tmp")
=> (shell "mkdir hdfs://localhost:9000/foo")

MapReduce jobs

The model here is a one-to-one correspondence between a Clojure namespace and a MapReduce job definition. To define a MR job, you create a namespace, define some classes using (gen-class), and define a (tool-run) function using (defn). Below we use the hsk.wordcount namespace provided with hsk as an example.
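
As a rough sketch of that shape (illustrative only: the namespace name, superclass, and function bodies below are assumptions, not hsk's exact code), such a namespace might look something like:

(ns hsk.myjob
  ;; Illustrative sketch only; the real hsk.wordcount namespace may differ.
  (:gen-class :name hsk.myjob.Tool
              :extends org.apache.hadoop.conf.Configured
              :implements [org.apache.hadoop.util.Tool]))

(defn tool-run
  "Configure and submit the MapReduce job.
   args is a list of (input-dir output-dir)."
  [tool [input-dir output-dir]]
  ;; ... set input/output paths, mapper and reducer classes, and submit ...
  0)

;; Tool interface entry point, delegating to tool-run.
(defn -run [this args]
  (tool-run this args))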

Setup

$ lein repl
=> (ns myns (:use [hsk.shell][hsk.logging][hsk.wordcount]))
=> (import '[hsk.wordcount Tool])

If running in Emacs with M-x clojure-jack-in, run:

=> (enable-logging-in-emacs)

(tool-run)

The (Tool.) constructor can now be used to create a Job, which can then be run on your Hadoop cluster using the (tool-run) function. (tool-run) takes 3 parameters:

  • A Job (created by (Tool.))
  • An input directory (file:///.., hdfs://.., ..)
  • An output directory (the same filesystem schemes apply as for the input directory).

Run in standalone mode

First, clear out any output from a previous run, using (shell):

=> (shell "rmr file:///tmp/wordcount-out")

Then:

=> (tool-run (Tool.) (list "file:///tmp/wordcount-in" "file:///tmp/wordcount-out"))
=> (shell "ls file:///tmp/wordcount-out")
=> (shell "cat file:///tmp/wordcount-out/part-00000")

Run in distributed mode

In fully distributed mode, your input and output URLs point at your cluster's HDFS, as in the following example. You will also need your Hadoop conf/ directory on the classpath: see below for more on getting a simple Hadoop configuration working.

=> (tool-run (Tool.) (list "hdfs://mynamenode:9000/wordcount-in" "hdfs://mynamenode:9000/wordcount-out"))

Building Clojure

This is only necessary so that Hadoop can find the Clojure 1.3.0 jar. If you already have the Clojure 1.3.0 jar in a known path, you can simply configure your HADOOP_CLASSPATH (see below) to point to it. Running "lein deps" in the current directory (hsk/) will attempt to fetch the Clojure jar from a remote repository and place it in lib/.

$ git clone https://github.com/clojure/clojure.git
$ cd clojure
$ git checkout clojure-1.3.0
$ mvn clean install

The last command installs the newly-built clojure jar in $HOME/.m2/repository/org/clojure/clojure/1.3.0/clojure-1.3.0.jar, which is used to configure Hadoop in the following section.

Building and Installing Hadoop

$ git clone http://github.com/apache/hadoop-common.git
$ cd hadoop-common
$ git checkout branch-1.0
$ ant clean jar
$ mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-core -Dversion=1.0.1-SNAPSHOT -Dpackaging=jar -Dfile=build/hadoop-core-1.0.1-SNAPSHOT.jar

The last command installs the newly-built Hadoop jar so that Leiningen can find it, although you will need to modify project.clj to change:

[org.apache.hadoop/hadoop-core "1.0.1"]

to:

[org.apache.hadoop/hadoop-core "1.0.1-SNAPSHOT"]

If you are developing with others, you may want to set up a Maven repository to share your snapshots. See the sample project.clj to learn how to configure project.clj to access your snapshot repository.
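
For illustration, a snapshot-repository entry might look roughly like this (the project version, repository name, and URL are placeholders; see the sample project.clj for a real configuration):

(defproject hsk "1.0.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [org.apache.hadoop/hadoop-core "1.0.1-SNAPSHOT"]]
  ;; Placeholder repository name and URL; point this at your own
  ;; shared snapshot repository.
  :repositories {"my-snapshots" "http://repo.example.com/snapshots"})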

Hadoop Configuration

Modify conf/hadoop-env.sh like so:

export HADOOP_CLASSPATH=$HOME/.m2/repository/org/clojure/clojure/1.3.0/clojure-1.3.0.jar

Modify conf/core-site.xml like so:

<configuration>
  <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Modify conf/mapred-site.xml like so:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

And format your new HDFS filesystem:

$ hadoop namenode -format

Now start all four Hadoop daemons:

$ hadoop namenode &
$ hadoop datanode &
$ hadoop jobtracker &
$ hadoop tasktracker &
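
Once the daemons are running, a quick sanity check (assuming the conf/ settings above) is to list the root of the new filesystem:

$ hadoop fs -ls hdfs://localhost:9000/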

License

Copyright (C) 2011 Eugene Koontz

Distributed under the Eclipse Public License, the same as Clojure.
