bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:

br somewhere handy in your path
gnu core utils on each machine: sort, awk, grep
netcat on each machine

Configuration

If you wish, you may edit /etc/br.hosts and enter the machines you wish to use as workers. You can also specify this at runtime:

br -m "host1 host2 host3"

To take advantage of multiple cores, repeat the host name as many times as you wish.

Examples

sorting

br < input > output

word count

br -r "uniq -c" < input > output

count words that begin with ‘b’

br -r "grep ^b | uniq -c" < input > output

Performance

Completely spurious numbers, all this shows you is how useful br is to me :-)

I have four compute machines and I’m usually relegated to using one core on one machine to sort. How about when I use br?

command	time	mb/s
sort -k1,1 4gb_file > 4gb_file_sorted	82m54.746s	843 kBps
br -i 4gb_file -o 4gb_file_sorted	9m49s	6.95 MBps

When I have more time I’ll compare this apples to apples. There’s also still a ton of room for improvement.

I’m also interested in seeing how bashreduce compares to hadoop. Of course it won’t win… but how close does it come?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.textile

README.textile

bashreduce : mapreduce in a bash script

Configuration

Examples

sorting

word count

count words that begin with ‘b’

Performance

Files

README.textile

Latest commit

History

README.textile

File metadata and controls

bashreduce : mapreduce in a bash script

Configuration

Examples

sorting

word count

count words that begin with ‘b’

Performance