NAME
prll - parallelize execution of shell functions
SYNOPSIS
prll [ -b | -B ] [ -c num ] [ -q | -Q ] { -s str | funct } { -p | -0 | args }
DESCRIPTION
prll (pronounced "parallel") is a utility for use with sh-compatible
shells, such as bash(1), zsh(1) and dash(1). It provides a
convenient interface for parallelizing the execution of a single
task over multiple data files, or actually any kind of data that you
can pass as a shell function argument. It is meant to make it simple
to fully utilize a multicore/multiprocessor machine, or to just run
long running tasks in parallel. Its distinguishing feature is the
ability to run shell functions in the context of the current shell.
OPTIONS
-s str Use string str as shell code to run.
-b Disable output buffering.
-B Enable output buffering, which is the default.
Use to override the PRLL_BUFFER variable.
-p Read arguments as lines from standard input instead
of command line.
-0 Same as -p, but use the null character as delimiter
instead of newline.
-c num Set number of parallel jobs to num. This overrides
the PRLL_NRJOBS variable and disables checking
of /proc/cpuinfo.
-q Disable progress messages.
-Q Disable all messages except errors.
ENVIRONMENT
PRLL_BUFFER Set to 'no' or '0' to disable output buffering.
PRLL_NRJOBS Set to the number of parallel jobs to run. If set,
it disables checking of /proc/cpuinfo.
PRLL_NR_CPUS Deprecated in favor of PRLL_NRJOBS.
RESERVED SHELL SYMBOLS
All names beginning with 'prll_' are reserved and should not be
used. The following are intended for use in user supplied
functions:
prll_interrupt Cause prll to stop running new jobs. It will wait
for running jobs to complete and then exit.
prll_seq A simple substitute for seq(1). With one argument
prints numbers from 1 up to the argument. With two
arguments prints numbers from the first up to the
second argument.
prll_lock Acquires a lock. There are 5 locks available,
numbered from 0 to 4. If the lock is already taken,
waits until it is available. Defaults to lock 0
when no argument is given.
prll_unlock Release a lock taken with prll_lock. Defaults to
lock 0 when no argument is given.
prll_splitarg Splits a quoted argument according to shell
rules. The words are assigned to variables named
prll_arg_X, where X numbers them from 1 upwards.
prll_arg_ Variables that hold arguments as generated by
prll_splitarg.
prll_arg_num A variable containing the number of prll_arg_
variables, as generated by prll_splitarg.
prll_jobnr A variable containing the current job's number. It
starts counting from zero.
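As a sketch of how these helpers combine, assuming prll is sourced in the
current shell (the burn function and chunk file names are illustrative):

```shell
# Illustrative function: each job writes one 1 KiB file named after
# its argument.
burn() { dd if=/dev/zero of="chunk.$1" bs=1k count=1 2>/dev/null ; }
# prll_seq generates the numeric arguments themselves, so you can fan
# the function out over an index range:
# prll burn $(prll_seq 1 8)    # jobs receive arguments 1..8
```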
OPERATION
prll is designed to be used not just in shell scripts, but
especially in interactive shells. To make the latter convenient, it
is implemented as a shell function. This means that it inherits the
whole environment of your current shell. It uses helper programs
written in C. To prevent race conditions, System V Message Queues
and Semaphores are used to signal job completion. It also features
full output buffering to prevent mangling of data because of
concurrent output.
USAGE
To execute a task, create a shell function that does something to
its first argument. Pass that function to prll along with the
arguments you wish to execute it on.
As an alternative, you may pass the -s flag, followed by a
string. The string will be executed as if it were the body of a
shell function. Therefore, you may use '$1' to reference its first
(and only) argument. Be sure to quote the string properly to
prevent shell expansion.
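The quoting matters; a minimal illustration (the gzip command is an
arbitrary stand-in for a real task):

```shell
# Single quotes keep $1 literal, so the function prll builds from the
# string sees it; double quotes would let the calling shell expand it.
cmd='gzip -k -- "$1"'
printf '%s\n' "$cmd"    # prints: gzip -k -- "$1"
# The corresponding invocation would be (assuming some *.txt files):
# prll -s 'gzip -k -- "$1"' *.txt
```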
Instead of arguments, you can use options -p or -0. prll will then
take its arguments from stdin. The -p flag will make it read lines
and the -0 flag will make it read null-delimited input. This mode
emulates the xargs(1) utility a bit, but is easier for interactive
use because xargs(1) makes it hard to pass complex commands. Reading
large arguments (such as lines several megabytes long) in this
fashion is slow, however. If your data comes in such large chunks,
it is much faster to split it into several files and pass a list of
those to prll instead.
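Null-delimited input pairs naturally with find(1)'s -print0; a sketch
(gzip again stands in for the real per-file work):

```shell
# printf '%s\0' emits each argument followed by a null byte, which is
# exactly the input format prll -0 expects; tr makes it visible here.
printf '%s\0' "file one.txt" "file two.txt" | tr '\0' '\n'
# A realistic pipeline would look like:
# find . -name '*.txt' -print0 | prll -0 -s 'gzip -k -- "$1"'
```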
The -b option disables output buffering. See below for
explanation. Alternatively, buffering may be disabled by setting the
PRLL_BUFFER environment variable to 'no'. Use the -B option to
override this.
The -q and -Q options provide two levels of quietness. Both suppress
progress reports. The -Q option also disables the startup and end
messages. They both let errors emitted by your jobs through.
The number of tasks to be run in parallel is provided with the -c
option or via the PRLL_NRJOBS environment variable. If it is not
provided, prll will look into the /proc/cpuinfo file and extract the
number of CPUs in your computer.
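What that detection amounts to can be sketched in a line of shell
(Linux-specific; the fallback to 1 is this sketch's choice, not
necessarily prll's behaviour):

```shell
# Count 'processor' entries in /proc/cpuinfo; fall back to 1 on
# systems that lack that file.
njobs=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null)
[ "${njobs:-0}" -ge 1 ] 2>/dev/null || njobs=1
# prll -c "$njobs" myfn *.jpg
```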
SUSPENDING AND ABORTING
Execution can be suspended normally using Ctrl+Z. prll should be
subject to normal job control, depending on the shell.
If you need to abort execution, you can do it with the usual Ctrl+C
key combination. prll will wait for remaining jobs to complete
before exiting. If the jobs are hung and you wish to abort
immediately, use Ctrl+Z to suspend prll and then kill it using your
shell's job control.
The command prll_interrupt is available from within your
functions. It causes prll to abort execution in the same way as
Ctrl+C.
CLEANUP
prll cleans after itself, except when you force termination. If you
kill prll, jobs and stale message queues and semaphores will be left
lying around. The jobs' PIDs are printed during execution so you can
track them down and terminate them. You can list the queues and
semaphores using the ipcs(1) command and remove them with the
ipcrm(1) command. Refer to your system's documentation for
details. Be aware that other programs might (and often do) make use
of IPC facilities, so make sure you remove the correct queue or
semaphore. Their keys are printed when prll starts.
BUFFERING
Transport of data between programs is normally buffered by the
operating system. These buffers are small (e.g. 4kB on Linux), but
are enough to enhance performance. Multiple programs writing to the
same destination, as is the case with prll, is then arranged like
this:
+-----+    +-----------+
| job |--->| OS buffer |\
+-----+    +-----------+ \
                          \
+-----+    +-----------+   \+-------------+
| job |--->| OS buffer |--->| Output/File |
+-----+    +-----------+   /+-------------+
                          /
+-----+    +-----------+ /
| job |--->| OS buffer |/
+-----+    +-----------+
The output can be passed to another program, over a network or into
a file. But the jobs run in parallel, so the question is: what will
the data they produce look like at the destination when they write
it at the same time?
If a job writes less data than the size of the OS buffer, then
everything is fine: the buffer is never filled and the OS flushes it
when the job exits. All output from that job is in one piece because
the OS will flush only one buffer at a time.
If, however, a job writes more data than that, then the OS flushes
the buffer each time it is filled. Because several jobs run in
parallel, their outputs become interleaved at the destination, which
is not good.
prll does additional job output buffering by default. The actual
arrangement when running prll looks like this:
+-----+    +-----------+    +-------------+
| job |--->| OS buffer |--->| prll buffer |\
+-----+    +-----------+    +-------------+ \
                                  |          \
+-----+    +-----------+    +-------------+   \+-------------+
| job |--->| OS buffer |--->| prll buffer |--->| Output/File |
+-----+    +-----------+    +-------------+   /+-------------+
                                  |          /
+-----+    +-----------+    +-------------+ /
| job |--->| OS buffer |--->| prll buffer |/
+-----+    +-----------+    +-------------+
Note the vertical connections between prll buffers: they synchronise
so that they only write data to the destination one at a time. They
make sure that all of the output of a single job is in one piece. To
keep performance high, the jobs must keep running, therefore each
buffer must be able to keep taking in data, even if it cannot
immediately write it. To make this possible, prll buffers aren't
limited in size: they grow to accommodate all data a job produces.
This raises another concern: you need to have enough memory to
contain the data until it can be written. If your jobs produce more
data than you have memory, you need to redirect it to files. Have
each job create a file and redirect all its output to that file. You
can do that however you want, but there should be a helpful utility
available on your system: mktemp(1). It is dedicated to creating
files with unique names. The prll_jobnr variable can also be used.
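A sketch of that pattern, assuming mktemp(1) is available (tr stands in
for the real per-job work):

```shell
# Each job writes to its own uniquely-named file instead of stdout,
# so prll's in-memory buffering never sees the bulk data.
myfn() {
    out=$(mktemp "job.$prll_jobnr.XXXXXX") || return 1
    tr a-z A-Z < "$1" > "$out"    # stand-in for the real processing
}
# prll myfn *.txt
```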
As noted in the usage instructions, prll's additional buffering can
be disabled. It is not necessary to do this when each job writes to
its own file. It is meant to be used as a safety measure. prll was
written with interactive use in mind, and when writing functions on
the fly, it can easily happen that an error creeps in. If an error
causes spurious output (e.g. if the function gets stuck in an
infinite loop) it can easily waste a lot of memory. The option to
disable buffering is meant to be used when you believe that your
jobs should only produce a small amount of data, but aren't sure
that they actually will.
It should be noted that buffering only applies to standard
output. The OS buffers standard error differently (i.e. line by
line) and prll does nothing to change that.
EXAMPLES
Suppose you have a set of photos that you wish to process using the
mogrify(1) utility. Simply do
myfn() { mogrify -flip "$1" ; }
prll myfn *.jpg
This will run mogrify on each jpg file in the current directory. If
your computer has 4 processors, but you wish to run only 3 tasks at
once, you should use
prll -c 3 myfn *.jpg
Or, to make it permanent in the current shell, do
PRLL_NRJOBS=3
on a line of its own. You don't need to export the variable because
prll automatically has access to everything your shell can see.
All examples here are very short. Unless you need it later, it is
quicker to pass such a short function on the command line directly:
prll -s 'mogrify -flip $1' *.jpg
prll now automatically wraps the code in an internal function so you
don't have to. Don't forget about the single quotes, or the shell
will expand $1 before prll is run.
If you have a more complicated function that has to take more than
one argument, you can use a trick: combine multiple arguments into
one when passing them to prll, then split them again inside your
function. You can use shell quoting to achieve that. Inside your
function, prll_splitarg is available to take the single argument
apart again, i.e.
myfn() {
prll_splitarg
process "$prll_arg_1"
compute "$prll_arg_2"
mangle "$prll_arg_3"
}
prll myfn 'a1 b1 c1' 'a2 b2 c2' 'a3 b3 c3' ...
If you have even more complex requirements, you can use the '-0'
option and pipe null-delimited data into prll, then split it any way
you want. Modern shells have powerful read(1) builtins.
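One way to take such a combined argument apart inside the function is
the read builtin; a sketch with tab-separated fields (the field names
src and dst are illustrative):

```shell
# Split a tab-separated argument into two fields with read, then act
# on them; cp stands in for the real operation.
myfn() {
    IFS=$(printf '\t') read -r src dst <<EOF
$1
EOF
    cp -- "$src" "$dst"
}
```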
You may wish to abort execution if one of the results is wrong. In
that case, use something like this:
myfn() { compute "$1"; [ "$result" = "wrong" ] && prll_interrupt; }
This is useful also when doing anything similar to a parallel
search: abort execution when the result is found.
If you have many arguments to process, it might be easier to pipe
them to standard input. Suppose each line of a file is an argument
of its own. Simply pipe the file into prll:
myfn() { some; processing | goes && on; here; }
cat file_with_arguments | prll myfn -p > results
Remember that it's not just CPU-intensive tasks that benefit from
parallel execution. You may have many files to download from several
slow servers, in which case, the following might be useful:
prll -c 10 -s 'wget -nv "$1"' -p < links.txt
You may wish to observe prll's progress in your terminal, but
collect your jobs' non-data output in a single file. Since opening a
file from multiple parallel jobs is unsafe, you should protect it
with a lock, i.e.
myfn() {
compute_stuff "$1"
prll_lock 0
echo "Job $prll_jobnr report:" >> jobreports.txt
write_report >> jobreports.txt
echo "-----------------------" >> jobreports.txt
prll_unlock 0
return $jobstatus
}
This function uses lock number 0 to protect the jobreports.txt file
so that it is written by only one job at a time. There are several
locks available. Be sure to release them, otherwise the jobs will
hang waiting for one another. The prll_jobnr variable is used to
label each report.
BUGS
This section describes issues and bugs that were known at the time
of release. Check the homepage for more current information.
Known issues:
- In zsh, the Ctrl+C combination forces prll into the background.
- The return value of prll itself is not useful, and cannot easily
be made useful because it depends on the shell's behaviour.
- User should be able to limit buffer memory usage, but still use
buffering without loss of data. Is this possible to solve
elegantly?
- The test suite should be expanded. Specifically, termination
behaviour on external interrupt signal currently has to be checked
manually. Also, checking of stderr output is not done.
- Cross-compilation should be documented and made easier.
- Shell's job table becomes saturated with a large number of jobs.
This is not really an issue, since it happens when the number of
jobs is above 500 or so. Nevertheless, it might be possible to
disown jobs if such a large number of them should be required.
SEE ALSO
sh(1), xargs(1), mktemp(1), ipcs(1), ipcrm(1), svipc(7)
Homepage: https://github.com/exzombie/prll
AUTHOR
Jure Varlec <jure@varlec.si>