h2. bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There's no installation, administration, or distributed filesystem. You'll need:

* "br":http://github.com/erikfrey/bashreduce/blob/master/br somewhere handy in your path
* vanilla unix tools: sort, awk, ssh, netcat, pv
* password-less ssh to each machine you plan to use (see the key-setup sketch below)
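
Password-less ssh is normally a matter of key-based authentication, roughly like this (a sketch; @host1@ stands in for each of your workers):

<pre>ssh-keygen -t rsa   # accept the defaults and an empty passphrase
ssh-copy-id host1   # repeat for every worker host</pre>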

h2. Configuration

Edit @/etc/br.hosts@ and enter the machines you wish to use as workers. Or specify your machines at runtime:

<pre>br -m "host1 host2 host3"</pre>
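
The hosts file is presumably the same list in file form (format assumed from the @-m@ flag above; check your copy of @br@ if in doubt), e.g. @/etc/br.hosts@ containing:

<pre>host1
host2
host3</pre>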

To take advantage of multiple cores on a machine, repeat its host name.
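
For example, to run four workers on @host1@ and one on @host2@ (host names are placeholders):

<pre>br -m "host1 host1 host1 host1 host2"</pre>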

h2. Examples

h3. sorting

<pre>br < input > output</pre>

h3. word count

<pre>br -r "uniq -c" < input > output</pre>
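
@uniq -c@ counts adjacent duplicate lines, which works here because br sorts the stream before handing it to the reduce command. To count words rather than whole lines, split them onto their own lines first (a sketch; file names are placeholders):

<pre>tr -s ' ' '\n' < corpus.txt | br -r "uniq -c" > counts</pre>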

h3. great big join

<pre>LC_ALL='C' br -r "join - /tmp/join_data" < input > output</pre>
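
@join@ needs both of its inputs sorted in the same collation order, which is why @LC_ALL='C'@ is set, and the lookup file has to exist on every worker. One rough way to stage it (host names are placeholders; assumes the same path on every machine):

<pre>LC_ALL='C' sort -k1,1 join_data > join_data.sorted
for h in host1 host2 host3; do scp join_data.sorted $h:/tmp/join_data; done</pre>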

h2. Performance

h3. big honkin' local machine

Let's start with a simpler scenario: I have a machine with multiple cores, but ordinary unix tools will only use one of them. How does br help here? Here's br on an 8-core machine, essentially operating as a poor man's multi-core sort:

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 11m3.111s | 6.18 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 7m13.695s | 9.44 MBps |

The job completely saturates i/o, but that's still a reasonable gain!

h3. many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin' machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 8m30.652s | 8.02 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 4m7.596s | 16.54 MBps |

We have a new bottleneck: we're limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and @sort -m@ can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it's cheating! if you can think of a better partition/merge using core unix tools, contact me) to partition the data and merge it back.
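
For a feel of what the partition/merge steps look like with core unix tools alone, here's a rough sketch (not br's actual internals; worker names, the port, and the modulo "hash" are all stand-ins):

<pre># partition: route each line to one of four workers by a crude key hash
awk '{ print | "nc worker" (length($1) % 4) " 8192" }' input
# merge: combine the sorted partitions back into one sorted stream
sort -m part0 part1 part2 part3 > output</pre>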

h3. Future work

I've tested this on ubuntu/debian, but not on other distros. According to Daniel Einspanjer, netcat takes different parameters on Red Hat.

br has a poor man's dfs like so:

<pre>br -r "cat > /tmp/myfile" < input</pre>

But this breaks if you specify the same host multiple times, since all the workers on that host end up writing to the same file. Maybe some kind of very basic virtualization is in order. Maybe.

Other niceties would be to more closely mimic the options sort offers (numeric, reverse, etc.).