bashreduce : mapreduce in a bash script
bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:
- br somewhere handy in your path
- gnu core utils on each machine: sort, awk, grep
- netcat on each machine
If you wish, you may edit
/etc/br.hosts and enter the machines you wish to use as workers. You can also specify this at runtime:
br -m "host1 host2 host3"
To take advantage of multiple cores, repeat the host name as many times as you wish.
br < input > output
br -r "uniq -c" < input > output
count words that begin with ‘b’
br -r "grep ^b | uniq -c" < input > output
Completely spurious numbers, all this shows you is how useful br is to me :-)
I have four compute machines and I’m usually relegated to using one core on one machine to sort. How about when I use br?
|sort -k1,1 4gb_file > 4gb_file_sorted||coreutils||82m54.746s||843 kBps|
|br -i 4gb_file -o 4gb_file_sorted||coreutils||9m17.431s||7.35 MBps|
|br -i 4gb_file -o 4gb_file_sorted||coreutils, brp||6m5.968s||11.19 MBps|
When I have more time I’ll compare this apples to apples. There’s also still a ton of room for improvement.
I’m also interested in seeing how bashreduce compares to hadoop. Of course it won’t win… but how close does it come?