bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:

  • br somewhere handy in your path
  • GNU core utilities on each machine: sort, awk, grep
  • netcat on each machine
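Before the first run you can sanity-check the workers with a rough sketch like the one below; it assumes passwordless ssh to each machine, and host1 through host3 are placeholders for your own hosts:

# check that each worker has sort, awk, grep, and netcat on its path
# (host names are placeholders; adjust the lists to your own setup)
for host in host1 host2 host3; do
  for tool in sort awk grep nc; do
    ssh "$host" "command -v $tool >/dev/null" || echo "$host is missing $tool"
  done
done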

Configuration

If you like, edit /etc/br.hosts and list the machines you want to use as workers. You can also specify this at runtime:

br -m "host1 host2 host3"

To take advantage of multiple cores, repeat the host name as many times as you wish.
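For example, an /etc/br.hosts for two dual-core machines might simply list each host twice (host1 and host2 are placeholders, and the whitespace-separated layout is an assumption on my part; the -m form above is the documented equivalent):

host1 host1
host2 host2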

Examples

sorting

br < input > output

word count

br -r "uniq -c" < input > output

count words that begin with ‘b’

br -r "grep ^b | uniq -c" < input > output

Performance

Completely spurious numbers; all this shows is how useful br is to me :-)

I have four compute machines and I’m usually relegated to using one core on one machine to sort. How about when I use br?

command | time | rate
sort -k1,1 4gb_file > 4gb_file_sorted | 82m54.746s | 843 kBps
br -i 4gb_file -o 4gb_file_sorted | 9m49s | 6.95 MBps

When I have more time I’ll do an apples-to-apples comparison. There’s also still a ton of room for improvement.

I’m also interested in seeing how bashreduce compares to Hadoop. Of course it won’t win… but how close does it come?