
bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:

  • br somewhere handy in your path
  • gnu core utils on each machine: sort, awk, grep
  • netcat on each machine
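
A quick way to confirm each worker has these tools on its path (a hedged sketch; the host names are placeholders for your own machines):

for h in host1 host2 host3; do
  ssh "$h" 'for t in sort awk grep nc; do command -v "$t" || echo "missing: $t"; done'
done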

Configuration

Edit /etc/br.hosts and enter the machines you wish to use as workers. Or specify your machines at runtime:

br -m "host1 host2 host3"

To take advantage of multiple cores, repeat the host name.
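
For example, to use four cores on host1 and two on host2 (host names are placeholders):

br -m "host1 host1 host1 host1 host2 host2"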

Examples

sorting

br < input > output

word count

br -r "uniq -c" < input > output

count words that begin with ‘b’

br -r "grep ^b | uniq -c" < input > output
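
The flags compose; as a hedged sketch (hosts are placeholders), the word count can name its workers explicitly and read/write files directly with the same -m, -r, -i, and -o options shown elsewhere in this README:

br -m "host1 host2" -r "uniq -c" -i input -o output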

Performance

big honkin’ local machine

Let’s start with a simpler scenario: I have a machine with multiple cores and with normal unix tools I’m relegated to using just one core. How does br help us here? Here’s br on an 8-core machine, essentially operating as a poor man’s multi-core sort:

command                                      using      time        rate
sort -k1,1 -S2G 4gb_file > 4gb_file_sorted   coreutils  30m32.078s  2.24 MBps
br -i 4gb_file -o 4gb_file_sorted            coreutils  11m3.111s   6.18 MBps
br -i 4gb_file -o 4gb_file_sorted            brp/brm    7m13.695s   9.44 MBps

The job completely saturates disk i/o, but it's still a reasonable gain!
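
Conceptually, the single-machine win comes from splitting the input across cores, sorting each piece in parallel, and merging the sorted pieces at the end. A rough hand-rolled sketch of that idea (not br's actual script; assumes GNU split's -n option):

# split into 8 chunks without breaking lines, sort each chunk in the background
split -n l/8 4gb_file chunk.
for f in chunk.*; do sort -k1,1 "$f" > "$f.sorted" & done
wait
# merge the already-sorted chunks into one sorted output
sort -m -k1,1 chunk.*.sorted > 4gb_file_sorted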

many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin’ machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

command                                           using      time        rate
pv 4gb_file | sort -k1,1 -S2G > 4gb_file_sorted   coreutils  30m32.078s  2.24 MBps
br -i 4gb_file -o 4gb_file_sorted                 coreutils  8m30.652s   8.02 MBps
br -i 4gb_file -o 4gb_file_sorted                 brp/brm    4m7.596s    16.54 MBps

We have a new bottleneck: we’re limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and sort -m can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it’s cheating! if you can think of a better partition/merge using core unix tools, contact me) to remove these bottlenecks.
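
To give a feel for why the partition step becomes cpu bound, here is a hedged illustration of fanning input lines out to workers with awk and netcat; the hosts, port, and round-robin scheme are illustrative, not br's actual partitioner:

# awk keeps one pipe open per distinct command string,
# so each worker gets its own netcat connection
awk -v n=4 '{ print $0 | ("nc worker" (NR % n) ".example 8192") }' 4gb_file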

Future work

br really doesn’t need a dfs. But you can simulate one with the current script:

br -r "> /tmp/myfile" < input
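
Since -r runs on the workers, this should leave each worker's share of the input in /tmp/myfile on that machine. A quick hedged way to see how the data landed (host names are placeholders):

for h in host1 host2 host3; do ssh "$h" wc -l /tmp/myfile; done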

Another nicety would be to more closely mimic the options sort offers (numeric, reverse, etc.).
