dlwh edited this page Sep 12, 2010 · 4 revisions

In progress: port to Hadoop

SMR is now trying to position itself as a layer on top of Hadoop. What’s available now is more or less a straight port of the original SMR to Hadoop. More Hadoop-like features are planned.

Using the code is very similar:

import smr.hadoop._;
import org.apache.hadoop.fs.Path;

val h = Hadoop(Array(""),new Path("output"));
println(h.distribute(1 to 1000,3) reduce ( _+_));
println(h.distribute(1 to 1000,3) map (2*)  reduce ( _+_));

where 3 is the number of shards, and 1 to 1000 is the range being distributed.
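
To make the role of the shard count concrete, here is a rough sketch in plain Scala collections, with no SMR or Hadoop involved, of what the two calls above compute. The `shards` and `shardedReduce` helpers are hypothetical and are not part of SMR's API. The sketch only assumes that `distribute` splits the input into roughly equal pieces, that each piece is reduced on its own, and that the partial results are then combined.

```scala
object ShardSketch {
  // Hypothetical helper, not part of SMR: split xs into at most n contiguous pieces.
  def shards[A](xs: Seq[A], n: Int): Seq[Seq[A]] = {
    val size = math.max(1, (xs.size + n - 1) / n) // ceiling division
    xs.grouped(size).toSeq
  }

  // Reduce each shard independently, then combine the partial results.
  // Like reduceLeft, this throws on an empty input.
  def shardedReduce[A](xs: Seq[A], n: Int)(op: (A, A) => A): A =
    shards(xs, n).map(_.reduceLeft(op)).reduceLeft(op)

  def main(args: Array[String]): Unit = {
    val plainSum   = shardedReduce(1 to 1000, 3)(_ + _)
    val doubledSum = shardedReduce((1 to 1000).map(2 * _), 3)(_ + _)
    println(plainSum)   // 500500
    println(doubledSum) // 1001000
  }
}
```

With 1000 elements and 3 shards, the pieces hold 334, 334 and 332 elements. The totals match the unsharded sums because addition is associative.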

Building is, I hope, fairly easy: you just need to set $SCALA_HOME and $HADOOP_HOME to the paths of your Scala install and your Hadoop install respectively, then run ant.

To test in non-distributed mode:

scala -cp $CLASSPATH:classes/ -Xplugin:jars/seroverride.jar scripts/hadoop.scala

Note that the interpreter and the script runner cannot be used in a distributed way, because classes defined there are held in memory and so other processes can’t access them. (Getting access to those would be a nice TODO). Also, I have not tested a distributed setup, since I don’t have a cluster set up (yet).

Welcome to the smr wiki!

Examples are in the testscripts/ directory, but here’s one. Suppose you want to add all the numbers from 1 to 100000, except for those divisible by 7, and you want to multiply them by 6 before adding (say). In plain Scala you’d write something like this:

(1 to 100000).filter(_ % 7 != 0).map(x => BigInt(x) * 6).reduceLeft(_ + _);
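
One likely reason for going through BigInt here (an assumption, since the page does not say so) is that the total is far too large for an Int. The sketch below uses only the standard library and compares the two. The object name is made up for illustration.

```scala
object OverflowSketch {
  def main(args: Array[String]): Unit = {
    val xs = (1 to 100000).filter(_ % 7 != 0)

    // Int arithmetic wraps around silently once the running total passes Int.MaxValue.
    val asInt = xs.map(_ * 6).reduceLeft(_ + _)

    // BigInt arithmetic is arbitrary precision, so the total is exact.
    val asBigInt = xs.map(x => BigInt(x) * 6).reduceLeft(_ + _)

    println(asInt)
    println(asBigInt)
    println(asBigInt > BigInt(Int.MaxValue)) // true
  }
}
```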

With smr, to parallelize it, you say:

distributor.distribute(1 to 100000).filter(_ % 7 != 0).map(x => BigInt(x) * 6).reduce(_ + _);

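And that’s it! Notice we use reduce instead of reduceLeft. This is because smr needs an operation that is fully associative, so that the pieces can be reduced separately and their results combined in any grouping. reduceLeft, by contrast, only applies the operation strictly from left to right.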
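
A small plain-Scala sketch of why associativity matters, assuming that a distributed reduce handles each piece separately and then combines the partial results. The `shardedReduce` helper is hypothetical and only stands in for that behavior. It is not SMR's implementation.

```scala
object AssociativitySketch {
  // Hypothetical stand-in for a distributed reduce: reduce each piece, then combine.
  def shardedReduce[A](xs: Seq[A], shardSize: Int)(op: (A, A) => A): A =
    xs.grouped(shardSize).map(_.reduceLeft(op)).reduceLeft(op)

  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2, 3, 4)

    // Addition is associative: both groupings agree.
    println(xs.reduceLeft(_ + _))        // ((1 + 2) + 3) + 4 = 10
    println(shardedReduce(xs, 2)(_ + _)) // (1 + 2) + (3 + 4) = 10

    // Subtraction is not associative: the grouping changes the answer.
    println(xs.reduceLeft(_ - _))        // ((1 - 2) - 3) - 4 = -8
    println(shardedReduce(xs, 2)(_ - _)) // (1 - 2) - (3 - 4) = 0
  }
}
```

An operation like subtraction is therefore safe with reduceLeft on a local collection but gives shard-dependent results when the work is split up.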