NAME
prll - parallelize execution of shell functions
SYNOPSIS
prll [ -b | -B ] [ -c num ] [ -q | -Q ] { -s str | funct } { -p | -0 | args }
DESCRIPTION
prll (pronounced "parallel") is a utility for use with sh-compatible
shells, such as bash(1), zsh(1) and dash(1). It provides a
convenient interface for parallelizing the execution of a single
task over multiple data files, or actually any kind of data that you
can pass as a shell function argument. It is meant to make it simple
to fully utilize a multicore/multiprocessor machine, or to just run
long running tasks in parallel. Its distinguishing feature is the
ability to run shell functions in the context of the current shell.
OPTIONS
-s str Use string str as shell code to run.
-b Disable output buffering.
-B Enable output buffering, which is the default.
Use to override the PRLL_BUFFER variable.
-p Read arguments as lines from standard input instead
of command line.
-0 Same as -p, but use the null character as delimiter
instead of newline.
-c num Set number of parallel jobs to num. This overrides
the PRLL_NRJOBS variable and disables checking
of /proc/cpuinfo.
-q Disable progress messages.
-Q Disable all messages except errors.
ENVIRONMENT
PRLL_BUFFER Set to 'no' or '0' to disable output buffering.
PRLL_NRJOBS Set to the number of parallel jobs to run. If set,
it disables checking of /proc/cpuinfo.
PRLL_NR_CPUS Deprecated in favor of PRLL_NRJOBS.
RESERVED SHELL SYMBOLS
All names beginning with 'prll_' are reserved and should not be
used. The following are intended for use in user supplied
functions:
prll_interrupt Cause prll to stop running new jobs. It will wait
for running jobs to complete and then exit.
prll_seq A simple substitute for seq(1). With one argument
prints numbers from 1 up to the argument. With two
arguments prints numbers from the first up to the
second argument.
prll_lock Acquires a lock. There are 5 locks available,
numbered from 0 to 4. If the lock is already taken,
waits until it is available. Defaults to lock 0
when no argument is given.
prll_unlock Release a lock taken with prll_lock. Defaults to
lock 0 when no argument is given.
prll_splitarg Splits a quoted argument according to shell
rules. The words are assigned to variables named
prll_arg_X, where X numbers them from 1 upwards.
prll_arg_ Variables that hold arguments as generated by
prll_splitarg.
prll_arg_num A variable containing the number of prll_arg_
variables, as generated by prll_splitarg.
prll_jobnr A variable containing the current job's number. It
starts counting from zero.
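As a sketch of how these helpers combine, assuming prll is sourced in the
current shell (the burn function and chunk file names are illustrative):

```shell
# Illustrative function: each job writes one 1 KiB file named after
# its argument.
burn() { dd if=/dev/zero of="chunk.$1" bs=1k count=1 2>/dev/null ; }
# prll_seq generates the numeric arguments themselves, so you can fan
# the function out over an index range:
# prll burn $(prll_seq 1 8)    # jobs receive arguments 1..8
```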
OPERATION
prll is designed to be used not just in shell scripts, but
especially in interactive shells. To make the latter convenient, it
is implemented as a shell function. This means that it inherits the
whole environment of your current shell. It uses helper programs
written in C. To prevent race conditions, System V Message Queues
and Semaphores are used to signal job completion. It also features
full output buffering to prevent mangling of data because of
concurrent output.
USAGE
To execute a task, create a shell function that does something to
its first argument. Pass that function to prll along with the
arguments you wish to execute it on.
As an alternative, you may pass the -s flag, followed by a
string. The string will be executed as if it were the body of a
shell function. Therefore, you may use '$1' to reference its first
(and only) argument. Be sure to quote the string properly to
prevent shell expansion.
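The quoting matters; a minimal illustration (the gzip command is an
arbitrary stand-in for a real task):

```shell
# Single quotes keep $1 literal, so the function prll builds from the
# string sees it; double quotes would let the calling shell expand it.
cmd='gzip -k -- "$1"'
printf '%s\n' "$cmd"    # prints: gzip -k -- "$1"
# The corresponding invocation would be (assuming some *.txt files):
# prll -s 'gzip -k -- "$1"' *.txt
```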
Instead of arguments, you can use options -p or -0. prll will then
take its arguments from stdin. The -p flag will make it read lines
and the -0 flag will make it read null-delimited input. This mode
emulates the xargs(1) utility a bit, but is easier for interactive
use because xargs(1) makes it hard to pass complex commands. Reading
large arguments (such as lines several megabytes long) in this
fashion is slow, however. If your data comes in such large chunks,
it is much faster to split it into several files and pass a list of
those to prll instead.
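Null-delimited input pairs naturally with find(1)'s -print0; a sketch
(gzip again stands in for the real per-file work):

```shell
# printf '%s\0' emits each argument followed by a null byte, which is
# exactly the input format prll -0 expects; tr makes it visible here.
printf '%s\0' "file one.txt" "file two.txt" | tr '\0' '\n'
# A realistic pipeline would look like:
# find . -name '*.txt' -print0 | prll -0 -s 'gzip -k -- "$1"'
```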
The -b option disables output buffering. See below for
explanation. Alternatively, buffering may be disabled by setting the
PRLL_BUFFER environment variable to 'no'. Use the -B option to
override this.
The -q and -Q options provide two levels of quietness. Both suppress
progress reports. The -Q option also disables the startup and end
messages. They both let errors emitted by your jobs through.
The number of tasks to be run in parallel is provided with the -c
option or via the PRLL_NRJOBS environment variable. If it is not
provided, prll will look into the /proc/cpuinfo file and extract the
number of CPUs in your computer.
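What that detection amounts to can be sketched in a line of shell
(Linux-specific; the fallback to 1 is this sketch's choice, not
necessarily prll's behaviour):

```shell
# Count 'processor' entries in /proc/cpuinfo; fall back to 1 on
# systems that lack that file.
njobs=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null)
[ "${njobs:-0}" -ge 1 ] 2>/dev/null || njobs=1
# prll -c "$njobs" myfn *.jpg
```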
SUSPENDING AND ABORTING
Execution can be suspended normally using Ctrl+Z. prll should be
subject to normal job control, depending on the shell.
If you need to abort execution, you can do it with the usual Ctrl+C
key combination. prll will wait for remaining jobs to complete
before exiting. If the jobs are hung and you wish to abort
immediately, use Ctrl+Z to suspend prll and then kill it using your
shell's job control.
The command prll_interrupt is available from within your
functions. It causes prll to abort execution in the same way as
Ctrl+C.
CLEANUP
prll cleans after itself, except when you force termination. If you
kill prll, jobs and stale message queues and semaphores will be left
lying around. The jobs' PIDs are printed during execution so you can
track them down and terminate them. You can list the queues and
semaphores using the ipcs(1) command and remove them with the
ipcrm(1) command. Refer to your system's documentation for
details. Be aware that other programs might (and often do) make use
of IPC facilities, so make sure you remove the correct queue or
semaphore. Their keys are printed when prll starts.
BUFFERING
Transport of data between programs is normally buffered by the
operating system. These buffers are small (e.g. 4kB on Linux), but
are enough to enhance performance. Multiple programs writing to the
same destination, as is the case with prll, is then arranged like
this:
+-----+    +-----------+
| job |--->| OS buffer |\
+-----+    +-----------+ \
                          \
+-----+    +-----------+   \+-------------+
| job |--->| OS buffer |--->| Output/File |
+-----+    +-----------+   /+-------------+
                          /
+-----+    +-----------+ /
| job |--->| OS buffer |/
+-----+    +-----------+
The output can be passed to another program, over a network or into
a file. But the jobs run in parallel, so the question is: what will
the data they produce look like at the destination when they write
it at the same time?
If a job writes less data than the size of the OS buffer, then
everything is fine: the buffer is never filled and the OS flushes it
when the job exits. All output from that job is in one piece because
the OS will flush only one buffer at a time.
If, however, a job writes more data than that, then the OS flushes
the buffer each time it is filled. Because several jobs run in
parallel, their outputs become interleaved at the destination, which
is not good.
prll does additional job output buffering by default. The actual
arrangement when running prll looks like this:
+-----+    +-----------+    +-------------+
| job |--->| OS buffer |--->| prll buffer |\
+-----+    +-----------+    +-------------+ \
                                  |          \
+-----+    +-----------+    +-------------+   \+-------------+
| job |--->| OS buffer |--->| prll buffer |--->| Output/File |
+-----+    +-----------+    +-------------+   /+-------------+
                                  |          /
+-----+    +-----------+    +-------------+ /
| job |--->| OS buffer |--->| prll buffer |/
+-----+    +-----------+    +-------------+
Note the vertical connections between prll buffers: they synchronise
so that they only write data to the destination one at a time. They
make sure that all of the output of a single job is in one piece. To
keep performance high, the jobs must keep running, therefore each
buffer must be able to keep taking in data, even if it cannot
immediately write it. To make this possible, prll buffers aren't
limited in size: they grow to accommodate all data a job produces.
This raises another concern: you need to have enough memory to
contain the data until it can be written. If your jobs produce more
data than you have memory, you need to redirect it to files. Have
each job create a file and redirect all its output to that file. You
can do that however you want, but there should be a helpful utility
available on your system: mktemp(1). It is dedicated to creating
files with unique names. The prll_jobnr variable can also be used.
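A sketch of that pattern, assuming mktemp(1) is available (tr stands in
for the real per-job work):

```shell
# Each job writes to its own uniquely-named file instead of stdout,
# so prll's in-memory buffering never sees the bulk data.
myfn() {
    out=$(mktemp "job.$prll_jobnr.XXXXXX") || return 1
    tr a-z A-Z < "$1" > "$out"    # stand-in for the real processing
}
# prll myfn *.txt
```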
As noted in the usage instructions, prll's additional buffering can
be disabled. It is not necessary to do this when each job writes to
its own file. It is meant to be used as a safety measure. prll was
written with interactive use in mind, and when writing functions on
the fly, it can easily happen that an error creeps in. If an error
causes spurious output (e.g. if the function gets stuck in an
infinite loop) it can easily waste a lot of memory. The option to
disable buffering is meant to be used when you believe that your
jobs should only produce a small amount of data, but aren't sure
that they actually will.
It should be noted that buffering only applies to standard
output. The OS buffers standard error differently (i.e. line by
line) and prll does nothing to change that.
EXAMPLES
Suppose you have a set of photos that you wish to process using the
mogrify(1) utility. Simply do
myfn() { mogrify -flip "$1" ; }
prll myfn *.jpg
This will run mogrify on each jpg file in the current directory. If
your computer has 4 processors, but you wish to run only 3 tasks at
once, you should use
prll -c 3 myfn *.jpg
Or, to make it permanent in the current shell, do
PRLL_NRJOBS=3
on a line of its own. You don't need to export the variable because
prll automatically has access to everything your shell can see.
All examples here are very short. Unless you need it later, it is
quicker to pass such a short function on the command line directly:
prll -s 'mogrify -flip $1' *.jpg
prll now automatically wraps the code in an internal function so you
don't have to. Don't forget about the single quotes, or the shell
will expand $1 before prll is run.
If you have a more complicated function that has to take more than
one argument, you can use a trick: combine multiple arguments into
one when passing them to prll, then split them again inside your
function. You can use shell quoting to achieve that. Inside your
function, prll_splitarg is available to take the single argument
apart again, i.e.
myfn() {
prll_splitarg
process "$prll_arg_1"
compute "$prll_arg_2"
mangle "$prll_arg_3"
}
prll myfn 'a1 b1 c1' 'a2 b2 c2' 'a3 b3 c3' ...
If you have even more complex requirements, you can use the '-0'
option and pipe null-delimited data into prll, then split it any way
you want. Modern shells have powerful read(1) builtins.
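One way to take such a combined argument apart inside the function is
the read builtin; a sketch with tab-separated fields (the field names
src and dst are illustrative):

```shell
# Split a tab-separated argument into two fields with read, then act
# on them; cp stands in for the real operation.
myfn() {
    IFS=$(printf '\t') read -r src dst <<EOF
$1
EOF
    cp -- "$src" "$dst"
}
```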
You may wish to abort execution if one of the results is wrong. In
that case, use something like this:
myfn() { compute "$1"; [ "$result" = "wrong" ] && prll_interrupt; }
This is useful also when doing anything similar to a parallel
search: abort execution when the result is found.
If you have many arguments to process, it might be easier to pipe
them to standard input. Suppose each line of a file is an argument
of its own. Simply pipe the file into prll:
myfn() { some; processing | goes && on; here; }
cat file_with_arguments | prll myfn -p > results
Remember that it's not just CPU-intensive tasks that benefit from
parallel execution. You may have many files to download from several
slow servers, in which case, the following might be useful:
prll -c 10 -s 'wget -nv "$1"' -p < links.txt
You may wish to observe prll's progress in your terminal, but
collect your jobs' non-data output in a single file. Since opening a
file from multiple parallel jobs is unsafe, you should protect it
with a lock, i.e.
myfn() {
compute_stuff "$1"
prll_lock 0
echo "Job $prll_jobnr report:" >> jobreports.txt
write_report >> jobreports.txt
echo "-----------------------" >> jobreports.txt
prll_unlock 0
return $jobstatus
}
This function uses lock number 0 to protect the jobreports.txt file
so that it is written by only one job at a time. There are several
locks available. Be sure to release them, otherwise the jobs will
hang waiting for one another. The prll_jobnr variable is used to
label each report.
BUGS
This section describes issues and bugs that were known at the time
of release. Check the homepage for more current information.
Known issues:
- In zsh, the Ctrl+C combination forces prll into the background.
- The return value of prll itself is not useful, and cannot easily
be made useful because it depends on the shell's behaviour.
- User should be able to limit buffer memory usage, but still use
buffering without loss of data. Is this possible to solve
elegantly?
- The test suite should be expanded. Specifically, termination
behaviour on external interrupt signal currently has to be checked
manually. Also, checking of stderr output is not done.
- Cross-compilation should be documented and made easier.
- Shell's job table becomes saturated with a large number of jobs.
This is not really an issue, since it happens when the number of
jobs is above 500 or so. Nevertheless, it might be possible to
disown jobs if such a large number of them should be required.
SEE ALSO
sh(1), xargs(1), mktemp(1), ipcs(1), ipcrm(1), svipc(7)
Homepage: https://github.com/exzombie/prll
AUTHOR
Jure Varlec <jure@varlec.si>