Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

mapreduce in bash

tree: d4f473061f

Fetching latest commit…

Cannot retrieve the latest commit at this time

README.textile

bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:

  • br somewhere handy in your path
  • gnu core utils on each machine: sort, awk, grep
  • netcat on each machine

Configuration

If you wish, you may edit /etc/br.hosts and enter the machines you wish to use as workers. You can also specify this at runtime:

br -m "host1 host2 host3"

To take advantage of multiple cores, repeat the host name as many times as you wish.

Examples

sorting

br < input > output

word count

br -r "uniq -c" < input > output

count words that begin with ‘b’

br -r "grep ^b | uniq -c" < input > output

Performance

Completely spurious numbers, all this shows you is how useful br is to me :-)

I have four compute machines and I’m usually relegated to using one core on one machine to sort. How about when I use br?

command using time rate
sort -k1,1 4gb_file > 4gb_file_sorted coreutils 82m54.746s 843 kBps
br -i 4gb_file -o 4gb_file_sorted coreutils 9m17.431s 7.35 MBps
br -i 4gb_file -o 4gb_file_sorted coreutils, brp 6m5.968s 11.19 MBps

When I have more time I’ll compare this apples to apples. There’s also still a ton of room for improvement.

I’m also interested in seeing how bashreduce compares to hadoop. Of course it won’t win… but how close does it come?

Something went wrong with that request. Please try again.