Based on clojure-hadoop by Stuart Sierra (https://github.com/stuartsierra/clojure-hadoop).
Uses Cascalog by Nathan Marz for logging support (https://github.com/nathanmarz/cascalog).
$ lein clean
$ lein jar
$ lein test
There is also a Makefile: running "make test" runs the above plus Hadoop command-line shell tests. Your Hadoop installation must be configured correctly for "make test" to succeed: see "Hadoop Configuration" below.
You must run "lein jar" before "lein test", because the tests need the jar file in order to run through the Hadoop framework.
In case of trouble, try the "Hadoop Configuration" section below, which should give you a reliable, local-only Hadoop testing environment.
$ lein repl
=> (load "hsk/shell")
=> (ns hsk.shell)
=> (shell "ls file:///tmp")
=> (shell "mkdir hdfs://localhost:9000/foo")
The model here is a one-to-one correspondence between a Clojure namespace and a MapReduce job definition. To define an MR job, you create a namespace, define some classes using (gen-class), and define a (tool-run) function using (defn). Below we use hsk's provided hsk.wordcount namespace as an example.
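The overall shape of such a namespace can be sketched as follows. This is a structural sketch only: the gen-class options and the (tool-run) body are assumptions based on the clojure-hadoop conventions this project derives from, not hsk's actual source.

```clojure
;; Structural sketch only: gen-class options and the tool-run body
;; are assumptions based on clojure-hadoop conventions, not hsk's
;; actual source.
(ns hsk.wordcount)

;; Generate a named class so the job can be driven through Hadoop's
;; Tool/ToolRunner machinery.
(gen-class
 :name hsk.wordcount.Tool
 :implements [org.apache.hadoop.util.Tool])

;; tool-run receives the Tool instance and a list of arguments:
;; the input directory and the output directory.
(defn tool-run
  [tool args]
  (let [[input-dir output-dir] args]
    ;; configure a Hadoop Job over input-dir/output-dir here,
    ;; then submit it and return an exit code
    0))
```

The REPL session below shows how such a namespace is actually loaded and run.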
$ lein repl
=> (ns myns (:use [hsk.shell] [hsk.logging] [hsk.wordcount]))
=> (import '[hsk.wordcount Tool])
If running in Emacs with M-x clojure-jack-in, run:
The (Tool.) constructor can now be used to create a Tool instance, which can then be run on your Hadoop cluster using the (tool-run) function. (tool-run) takes two parameters:
- A Tool (created by (Tool.))
- A list of arguments: an input directory (file:///.., hdfs://.., ..) and an output directory (the same filesystem-scheme options apply as with the input directory).
(First, clear out previously-run output, if any, using (shell)):
=> (shell "rmr file:///tmp/wordcount-out")
=> (tool-run (Tool.) (list "file:///tmp/wordcount-in" "file:///tmp/wordcount-out"))
=> (shell "ls file:///tmp/wordcount-out")
=> (shell "cat file:///tmp/wordcount-out/part-00000")
When running on a real cluster, your URLs will be fully distributed URLs, as in the following example. You will also need your Hadoop conf/ directory on your classpath: see below for more on getting a simple Hadoop configuration working.
=> (tool-run (Tool.) (list "hdfs://mynamenode:9000/wordcount-in" "hdfs://mynamenode:9000/wordcount-out"))
This is only necessary so that Hadoop can find the Clojure 1.3.0 jar. If you already have the Clojure 1.3.0 jar at a known path, you can simply configure your HADOOP_CLASSPATH (see below) to point to it. Running "lein deps" in the current directory (hsk/) will attempt to fetch the Clojure jar remotely and place it in lib/.
$ git clone https://github.com/clojure/clojure.git
$ cd clojure
$ git checkout clojure-1.3.0
$ mvn clean install
The last command installs the newly-built clojure jar in $HOME/.m2/repository/org/clojure/clojure/1.3.0/clojure-1.3.0.jar, which is used to configure Hadoop in the following section.
$ git clone http://github.com/apache/hadoop-common.git
$ cd hadoop-common
$ git checkout branch-1.0
$ ant clean jar
$ mvn install:install-file -DgroupId=org.apache -DartifactId=hadoop-core -Dversion=1.0.1 -Dpackaging=jar -Dfile=build/hadoop-core-1.0.1-SNAPSHOT.jar
The last command installs the newly-built Hadoop jar so that Leiningen can find it, although you will need to modify project.clj to change:
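The exact project.clj edit is not shown here; one plausible change, taking the dependency coordinates from the mvn install:install-file command above, would be:

```clojure
;; project.clj fragment (hypothetical): hadoop-core coordinates taken
;; from the mvn install:install-file command above.
:dependencies [[org.clojure/clojure "1.3.0"]
               [org.apache/hadoop-core "1.0.1"]]
```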
If you are developing with others, you may want to set up a maven repository to share your snapshots. See the sample project.clj to learn how to set project.clj to access your snapshot repository.
Modify conf/hadoop-env.sh like so:
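The snippet is missing here; given the HADOOP_CLASSPATH note above, one plausible change is to add the Clojure jar (built earlier with "mvn clean install") to Hadoop's classpath:

```shell
# conf/hadoop-env.sh (hypothetical fragment): expose the Clojure 1.3.0
# jar installed earlier by "mvn clean install" to Hadoop.
export HADOOP_CLASSPATH=$HOME/.m2/repository/org/clojure/clojure/1.3.0/clojure-1.3.0.jar
```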
Modify conf/core-site.xml like so:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Modify conf/mapred-site.xml like so:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
And format your new HDFS filesystem:
hadoop namenode -format
Now start all four Hadoop daemons:
hadoop namenode &
hadoop datanode &
hadoop jobtracker &
hadoop tasktracker &
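With the daemons running, you can sanity-check the setup from the hsk repl using the (shell) helper shown earlier (assuming the hdfs://localhost:9000 address configured above):

```clojure
;; From "lein repl", after (load "hsk/shell") and (ns hsk.shell);
;; assumes fs.default.name is hdfs://localhost:9000 as configured above.
(shell "ls hdfs://localhost:9000/")
```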
Copyright (C) 2011 Eugene Koontz
Distributed under the Eclipse Public License, the same as Clojure.