bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:
- br somewhere handy in your path
- gnu core utils (sort) plus awk and grep on each machine
- netcat on each machine
If you wish, you may edit /etc/br.hosts and enter the machines you wish to use as workers. You can also specify this at runtime:
br -m "host1 host2 host3"
To take advantage of multiple cores, repeat the host name as many times as you wish.
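Here is a small sketch of building that repeated host list instead of typing it by hand. `repeat_hosts` and its `host:cores` argument format are hypothetical helpers invented for this example; they are not part of br.

```shell
# Hypothetical helper (not part of br): repeat each host name once per core.
# Arguments look like host:cores, e.g. host1:2 host2:3
repeat_hosts() {
  out=""
  for spec in "$@"; do
    host=${spec%%:*}    # text before the colon
    cores=${spec##*:}   # text after the colon
    i=0
    while [ "$i" -lt "$cores" ]; do
      # Append the host, separated by a space unless the list is still empty
      out="$out${out:+ }$host"
      i=$((i + 1))
    done
  done
  printf '%s\n' "$out"
}

repeat_hosts host1:2 host2:3
# prints: host1 host1 host2 host2 host2
```

The result can be passed straight to the `-m` option shown above, for example `br -m "$(repeat_hosts host1:2 host2:3)" < input > output`.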
br < input > output
br -r "uniq -c" < input > output
br -r "grep ^b | uniq -c" < input > output
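To get a feel for what "mapreduce fashion" means for commands like these, here is a rough single-machine sketch: split the input by key, sort and reduce each part on its own, then combine the results. This is not br's actual code. `mini_reduce`, the toy hash, and the temp-file layout are all assumptions made up for illustration, and the partitions here stand in for workers.

```shell
# Local sketch of partition -> sort -> reduce -> combine.
# NOT br's implementation; mini_reduce and its hash are illustrative assumptions.
mini_reduce() {
  reducer=$1   # reducer command, e.g. "uniq -c"
  parts=$2     # number of partitions (stands in for worker count)
  tmp=$(mktemp -d) || return 1

  # Partition: pick a bucket from the first field, so equal keys
  # always land in the same bucket.
  awk -v n="$parts" -v dir="$tmp" '{
    h = 0
    for (i = 1; i <= length($1); i++)
      h = (h * 31 + index("abcdefghijklmnopqrstuvwxyz0123456789", tolower(substr($1, i, 1)))) % n
    print > (dir "/part." h)
  }'

  # Sort and reduce each bucket independently, then sort the combined
  # reducer output so the final order is deterministic.
  for f in "$tmp"/part.*; do
    [ -e "$f" ] || continue
    sort "$f" | sh -c "$reducer"
  done | LC_ALL=C sort

  rm -rf "$tmp"
}

# Count occurrences of each word using three partitions
printf 'b\na\nb\nc\na\nb\n' | mini_reduce "uniq -c" 3
```

Because every copy of a key goes to the same partition, a reducer like `uniq -c` sees all of that key's lines together, which is why the per-partition counts do not need to be added up afterwards.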
Completely spurious numbers; all this shows is how useful br is to me :-)
I have four compute machines and I’m usually relegated to using one core on one machine to sort. How about when I use br?
command | time | throughput |
---|---|---|
sort -k1,1 4gb_file > 4gb_file_sorted | 82m54.746s | 843 kBps |
br -i 4gb_file -o 4gb_file_sorted | 9m49s | 6.95 MBps |
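The throughput column can be rechecked from the times with a bit of awk. This sketch assumes the 4gb file is 4096 MB and that 1 MB is 1024 kB, which is what makes the numbers in the table line up; `throughput` is just a helper name for this example.

```shell
# throughput SIZE_MB MINUTES SECONDS -> MB/s with two decimals
# Assumes the 4gb file is 4096 MB (an assumption, not stated in the table).
throughput() {
  awk -v mb="$1" -v m="$2" -v s="$3" 'BEGIN { printf "%.2f\n", mb / (m * 60 + s) }'
}

throughput 4096 82 54.746   # plain sort, prints 0.82 (about 843 kB/s)
throughput 4096 9 49        # br, prints 6.95

# Speedup of br over plain sort, as a ratio of wall-clock times
awk 'BEGIN { printf "%.2f\n", (82 * 60 + 54.746) / (9 * 60 + 49) }'   # prints 8.45
```

So on these (spurious) numbers br comes out roughly 8.4 times faster than a single sort.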
When I have more time I’ll compare this apples to apples. There’s also still a ton of room for improvement.
I’m also interested in seeing how bashreduce compares to Hadoop. Of course it won’t win… but how close does it come?