bashreduce : mapreduce in a bash script
bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:
- br somewhere handy in your path
- gnu coreutils (for sort), plus awk and grep, on each machine
- netcat on each machine
Edit /etc/br.hosts and enter the machines you wish to use as workers. Or specify your machines at runtime:
br -m "host1 host2 host3"
To take advantage of multiple cores, repeat the host name.
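Here's a minimal sketch of building such a host list, with one entry per core. The helper name repeat_host, the host names, and the core counts are all made up for illustration; only the -m flag comes from br itself.

```shell
# repeat_host is a hypothetical helper, not part of br.
# $1 = host name, $2 = how many times to list it (e.g. its core count)
repeat_host() {
  rh_out=""
  rh_i=0
  while [ "$rh_i" -lt "$2" ]; do
    # Append the host, separated by a single space after the first entry.
    rh_out="${rh_out:+$rh_out }$1"
    rh_i=$((rh_i + 1))
  done
  printf '%s\n' "$rh_out"
}

# Assumed setup: host1 has 2 cores, host2 has 4.
hosts="$(repeat_host host1 2) $(repeat_host host2 4)"
echo "$hosts"
```

The resulting string can then be handed to br as br -m "$hosts".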
sort
br < input > output
word count
br -r "uniq -c" < input > output
count words that begin with ‘b’
br -r "grep ^b | uniq -c" < input > output
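If you want to sanity-check a reducer before involving other machines, a rough single-machine stand-in is to put sort in front of it. This rests on my assumption that the reducer sees its input sorted by key (which uniq -c needs in order to count correctly); it's an analogy, not br's actual code path. The sample words are made up.

```shell
# Sample input: one word per line (made-up data for illustration).
words='banana
apple
bob
apple
banana
bob
bob'

# Local stand-in for: br -r "uniq -c" < input
# sort groups identical words so that uniq -c can count each run.
# The trailing awk only normalizes uniq's padded output to "word count".
counts=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -c | awk '{print $2, $1}')

# Local stand-in for: br -r "grep ^b | uniq -c" < input
b_counts=$(printf '%s\n' "$words" | LC_ALL=C sort | grep '^b' | uniq -c | awk '{print $2, $1}')

printf '%s\n' "$counts"
printf '%s\n' "$b_counts"
```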
big honkin’ local machine
Let’s start with a simpler scenario: I have a machine with multiple cores and with normal unix tools I’m relegated to using just one core. How does br help us here? Here’s br on an 8-core machine, essentially operating as a poor man’s multi-core sort:
|command||using||time||rate|
|sort -k1,1 -S2G 4gb_file > 4gb_file_sorted||coreutils||30m32.078s||2.24 MBps|
|br -i 4gb_file -o 4gb_file_sorted||coreutils||11m3.111s||6.18 MBps|
|br -i 4gb_file -o 4gb_file_sorted||brp/brm||7m13.695s||9.44 MBps|
The job completely saturates i/o, but that's still a reasonable gain!
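The rates in the table above work out to file size divided by wall-clock time, if you assume 4gb_file is 4096 MB. That size is my inference from the numbers, not something stated anywhere. The rate helper below is hypothetical and just redoes the arithmetic:

```shell
# rate is a hypothetical helper: $1 = minutes, $2 = seconds.
# Prints throughput in MB/s, assuming a 4096 MB input file.
rate() {
  awk -v m="$1" -v s="$2" -v mb=4096 'BEGIN { printf "%.2f\n", mb / (m * 60 + s) }'
}

r_sort=$(rate 30 32.078)     # plain sort
r_br=$(rate 11 3.111)        # br with coreutils
r_helpers=$(rate 7 13.695)   # br with brp/brm

echo "$r_sort $r_br $r_helpers"   # prints: 2.24 6.18 9.44
```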
many cheap machines
Here lies the promise of mapreduce: rather than use my big honkin’ machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?
|command||using||time||rate|
|pv 4gb_file | sort -k1,1 -S2G > 4gb_file_sorted||coreutils||30m32.078s||2.24 MBps|
|br -i 4gb_file -o 4gb_file_sorted||coreutils||8m30.652s||8.02 MBps|
|br -i 4gb_file -o 4gb_file_sorted||brp/brm||4m7.596s||16.54 MBps|
We have a new bottleneck: we’re limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and sort -m can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it’s cheating! if you can think of a better partition/merge using core unix tools, contact me) to remove these bottlenecks.
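To make the partition/merge idea concrete, here's a single-machine sketch using only awk and sort. It is not br's actual awk script, and it's nothing like brp/brm: the hash function, the part.N file names, the partition count of 3, and the sample data are all my own assumptions. Each line is routed to a partition by a hash of its first field, each partition is sorted on its own (the step br would farm out to workers), and sort -m merges the sorted pieces.

```shell
workdir=$(mktemp -d)
printf '%s\n' pear apple fig kiwi banana cherry date > "$workdir/input"

# Partition: send each line to one of n files, picked by a toy hash of field 1.
awk -v n=3 -v dir="$workdir" '
  {
    h = 0
    for (i = 1; i <= length($1); i++)
      h = (h * 31 + index("abcdefghijklmnopqrstuvwxyz", substr($1, i, 1))) % n
    print > (dir "/part." h)
  }' "$workdir/input"

# Sort each partition independently; these sorts do not depend on each other.
for f in "$workdir"/part.*; do
  LC_ALL=C sort -k1,1 "$f" > "$f.sorted"
done

# Merge: sort -m combines already-sorted files without re-sorting them.
merged=$(LC_ALL=C sort -m -k1,1 "$workdir"/part.*.sorted)
printf '%s\n' "$merged"

rm -rf "$workdir"
```

The merged output is the same as sorting the whole input in one go; the point is that the middle step can run in parallel.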
br really doesn’t need a dfs. But you can simulate one with the current script:
br -r "> /tmp/myfile" < input
Other niceties would be to more closely mimic the options presented in sort (numeric, reverse, etc).