simplify setup/bootstrap cmds/scripts/actions #206

davidmarin opened this Issue Sep 5, 2011 · 44 comments



3 participants


mrjob has a lot of options that are designed to upload a file, run a command, or add something to the environment:

  • bootstrap_actions
  • bootstrap_cmds
  • bootstrap_files
  • bootstrap_mrjob
  • bootstrap_python_packages
  • bootstrap_scripts
  • file_upload_args (not available from the command line)
  • python_archives
  • setup_cmds
  • setup_scripts
  • upload_archives
  • upload_files

There are two main problems with the way things are now:

  • It's confusing
  • Options don't always run in the order you want. For example, if you upgrade Python in bootstrap_cmds, bootstrap_python_packages becomes useless because it runs first (so it'll install packages for the old version of Python).

I don't have a complete solution, but I imagine something where you simply specify the commands you want to run, possibly referencing local paths or S3/HDFS URIs, and mrjob just does the right thing. We just need a clean way of disambiguating local and remote files.

We should aim to make this solution the canonical way of doing things in mrjob v0.4.


I think the way I want to specify a local file that should be copied is by putting a hash after it (e.g. /path/to/file#). This matches the syntax for Hadoop's -file arguments.

/path/to/file#remote_name and /path/to/*.files# should be allowed too (we'd have to resolve the glob ourselves, but we already have the framework to do that).

irskep commented Mar 25, 2012

Here's my current thought about how this would look in the config.

    bootstrap_actions: []
      - sudo apt-get install cowsay  # has spaces, so it's a command
      - pyyaml.tar.gz # detect and install it
      - data.bz2  # detect no python files, just expand
      - # maybe detect #!/sh at the top, or executable permissions to use as a bootstrap_script?
    task_setup: [] # same stuff but done before each task instead of at bootstrap time

Then have --bootstrap-item and --task-setup-item to replace the myriad options we have now.

irskep commented Mar 25, 2012

We could also try something like:

- command sudo apt-get install cowsay
- install-python pyyaml.tar.gz
- upload data.bz2
- run

But that doesn't seem clean.


I'm iffy about auto-installing tarballs for a couple of reasons:

  • I'm not willing to give away that part of the namespace yet; I'd want to wait and see.
  • I'd actually rather the config be verbose, so people reading it can tell what's going on.

I think for now I'd rather have recipes in the documentation like:

- for i in *.tar.gz#; do tar xfz $i; cd ${i/.tar.gz/}; python install; cd -; done

and see what people come up with.

It should also be possible to write a script and do:

- ./ *.tar.gz#

I'm realizing we also need a convention for distinguishing environment variables to be resolved before the script is run from ones to resolve remotely during the script.

Probably something like $LOCAL_ENV_VAR# would do it. sigh.

irskep commented Mar 26, 2012

What if we provided them as tools?

- python -m *.tar.gz

(Also, can you link me to the behavior of the # with no local name following it?)


# with no name following it just means, I don't care, pick any name. In practice, we try to pick the same name as the file, but if there are multiple files with the same name, we prepend a random string or something.
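A rough sketch of that naming rule, for illustration only. `assign_names` is a hypothetical helper, not mrjob's actual code, and it prepends a counter instead of a random string so the result is predictable:

```python
import itertools
import os


def assign_names(paths):
    """Map each local path to a unique remote name (hypothetical helper).

    Prefer the file's own basename; on a collision, prepend a counter.
    (The thread suggests a random string; a counter is used here only
    to keep the output deterministic.)
    """
    names = {}
    taken = set()
    for path in paths:
        base = os.path.basename(path)
        name = base
        for i in itertools.count(1):
            if name not in taken:
                break
            name = '%d-%s' % (i, base)
        taken.add(name)
        names[path] = name
    return names


print(assign_names(['/a/garply.deb', '/b/garply.deb']))
```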

I don't really want to have to support giving arbitrary names to globs (e.g. packages/*.tar.gz#packages-*.tar.gz) but we could probably ticket that as a possible future feature.


Oh, and I really like the idea of providing scripts in place of bootstrap options. :)

irskep commented Mar 26, 2012

Perhaps under mrjob.bootstrap instead of to keep our namespace flat and separate ad hoc tools from bootstrap scripts.


Probably Would like to keep all the scripts in tools.

irskep commented May 15, 2012

So, to sum up:

  • Add scripts in to expand tarballs, install Python packages, etc
  • Deprecate python_archives, setup_scripts, upload_archives, bootstrap_python_packages, and bootstrap_scripts
  • Leave alone bootstrap_cmds, setup_cmds, bootstrap_files, upload_files, file_upload_args, and bootstrap_mrjob
  • Support bare # on all *_files options

I say leave bootstrap_mrjob alone because having to put it in a bootstrap command will probably confuse a lot of people.

irskep commented May 15, 2012

Edited above for accuracy.

How do we deal with upload_files vs bootstrap_files right now?

irskep commented May 15, 2012

It looks like the difference between bootstrap_files and upload_files is that bootstrap_files does the work of downloading the files itself, while upload_files has Hadoop do it via arguments to the Hadoop Streaming jar.

So we should probably keep that distinction. No need to e.g. install a SQLite file at both bootstrap time and task setup time.

irskep commented May 15, 2012

Oh, and Hadoop also takes care of un-archiving upload_archives. So we would actually have to move that step from Hadoop into mrjob as a setup cmd. It will probably be negligibly slower. (see EMRJobRunner._cache_args())


Okay, so here's what I've got.

We add two new options, bootstrap and setup, both of which take a list of strings containing shell commands. For example:

--bootstrap "sudo apt-get install -y cython"

If you want a file on the local filesystem or available via a URI to be available at setup/bootstrap time, you append a # to it (this matches Hadoop's syntax for uploading files). For example:

--bootstrap "sudo dpkg -i garply-latest.deb#"

If you care deeply about what the name of the file is inside EMR/Hadoop, you can put the desired filename after the # (again, Hadoop syntax):

--bootstrap "sudo dpkg -i garply-latest.deb#garply.deb"

If you want to refer to a local environment variable, just use it and the shell will expand it; whereas if you want to refer to an environment variable set at the time the script runs, you'll need to use \$:

# local environment variable
--bootstrap "sudo install $MY_DEBIAN_PKGS/garply.deb#"
# remote environment variable
--bootstrap 'ln -s \$HADOOP_HOME hadoop-land'

which would look like this in mrjob.conf

        - sudo install $MY_DEBIAN_PKGS/garply.deb#
        - ln -s \$HADOOP_HOME hadoop-land

mrjob should automatically detect archives (e.g. tarballs, jars) based on the filename, so that there isn't a distinction between uploading files or uploading archives. So, for example, to install a python package from a tarball:

--bootstrap "cd py-garply-latest.tar.gz#/garply; sudo python install"

The mapper/reducer script should be the last line of a wrapper shell script, so it should be possible to set environment variables merely by using export:

--setup "export PYTHONPATH=\$PYTHONPATH:my-nifty-code.tar.gz#"

You should be able to upload all files matching a glob (but not specify their names on EMR/Hadoop). For example:

--setup "for i in $MY_PYTHON_ARCHIVES/*.tar.gz#; do ..."

Finally, bootstrap_python_packages and python_archives (and probably other conveniences) should be made available as shell functions. So people could still do pretty much the same thing in their mrjob.conf:

- sudo-install-python-packages $MY_PACKAGES/*.tar.gz#

We still need to auto-bootstrap mrjob in most cases. Probably that should happen after other shell commands (in case people upgrade Python). If people want things to happen in a different order, they can turn off auto-bootstrapping and install it from a tarball.

I think we can totally eliminate (well, deprecate) the remaining command-line options. If someone really needs to upload a file/archive into the local directory but not do anything in particular with it, there's always --setup "true /path/to/file#".

The syntax for parsing bootstrap/setup commands gets a little tricky, but I think it comes down to:

  • parse with shlex.split()
  • expand environment variables with os.path.expandvars
  • match files to upload with (\(\w+:/\)?[^:$\\]+)#([^:;/\s\\]*)
  • expand globs for local file names
  • replace \$ with $

There's some tension between using : in setting path variables (which shouldn't be part of the "local" filename), and using : in URIs (which should be included). I can't think of any examples that are actually ambiguous to a human, but we should come up with some clear rules that make sense.

I'm also not sure whether we want to automatically mark files as executable if they look like they're going to be run. I can see that people would expect:


to work, but at the same time, it's not so hard to do:


Anyways, those are my thoughts. This system sounds pretty elaborate, but it mostly does what you expect, it would make upgrading to other versions of Python easy, and allows us to introduce other setup/bootstrap conveniences without having to worry about running them in the right order.

What do you think?


Actually, we might need a better solution for local/remote environment variables. The problem is that the shell is going to behave differently from mrjob.conf, depending on whether you use single or double quotes on the command line:

# acceptable
--setup 'echo $FOO' # mrjob resolves $FOO
--setup 'echo \$FOO' # mrjob passes $FOO through to remote script
--setup "echo $FOO" # shell resolves $FOO

# counterintuitive
--setup "echo \$FOO" # shell resolves \$ to $, mrjob resolves $FOO
--setup "echo \\$FOO" # shell resolves \\ to \ and resolves $FOO
--setup "echo \\\$FOO" # mrjob passes $FOO through to remote script. But huh?

We could also only have mrjob resolve environment variables that are part of paths. That's the main use case anyway.


Also, we probably want to come up with a better "file-locking" solution to prevent multiple task wrapper (setup) scripts from executing simultaneously.

You can use flock with bash, but not with dash. I'd really prefer something that works with dash.

Also, we should use redirects (1>&2) to prevent setup scripts from inadvertently outputting "data" to stdout.


Okay, this is getting a little crazy. The main problem comes from trying to find the beginning and end of local paths and environment variables.

I think we should require local paths to be separate tokens of the form local_path#optional_remote_filename, and provide a magic concatenation operator +# to concatenate them into other strings (e.g. PYTHONPATH). We'll resolve environment variables and globs in local_path, but otherwise, I don't think we need a mechanism for passing local environment variables through to remote scripts, so we can just use $ for environment variables on the remote system.

So, some examples:

# same as before
--bootstrap "sudo dpkg -i garply-latest.deb#"
--bootstrap 'sudo install $MY_DEBIAN_PKGS/garply.deb#'
# "remote" environment variable
--bootstrap 'ln -s $HADOOP_HOME hadoop-land'
# concatenation
--bootstrap 'cd py-garply-latest.tar.gz# +# /garply; sudo python install'
--setup 'export PYTHONPATH=$PYTHONPATH: +# my-nifty-code.tar.gz#'

I think all of our examples that use shell commands should use single quotes. Here's how you'd do two of the examples above with double quotes:

# this does the same thing (though now it's the shell resolving $MY_DEBIAN_PKGS):
--bootstrap "sudo install $MY_DEBIAN_PKGS/garply.deb#"
# we mean $HADOOP_HOME on the system where the script is run;
# don't resolve it now.
--bootstrap "ln -s \$HADOOP_HOME hadoop-land"

By the way, strings ending in # appear to be safe to use unquoted in mrjob.conf; the Python yaml module expects a space before the # in comments.


Oh, we should be smart enough to separate out semicolons at the end of a local file token (i.e. not consider them part of the remote file name). So this:

--bootstrap 'cd py-garply-latest.tar.gz# +# /garply; sudo python install'

could be just:

--bootstrap 'cd py-garply-latest.tar.gz#; cd garply; sudo python install'

Maybe auto-detecting archives is a bit of a stretch. We could make it explicit by adding a / at the end:

--bootstrap 'cd py-garply-latest.tar.gz#/; cd garply; sudo python install'

Also, was thinking that each --bootstrap should fire off its own bootstrap action so that it's easier to see what's going on with elastic-mapreduce --describe. You get 16; the first would be taken up by the file uploader script, and the last would be used for bootstrapping mrjob, so that leaves 14. People probably aren't using more than 14 bootstrap actions now, and any number of actions could be reduced to a tarball and a script (which could be inside the tarball).

I'm leaning towards saving the file uploader for just before the first bootstrap action that needs a local file uploaded, just in case there's a bug in the AMI that affects downloading files.


shlex.split() doesn't handle semicolons properly (it considers them part of tokens). But you can get them to parse correctly if you instantiate a lexer like this: lexer = shlex.shlex(instream='...', posix=True), and call list(lexer).

This allows us to, for example, distinguish "foo"; bar from "foo;" bar.
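To make the difference concrete, here is a small sketch comparing the two tokenizers. `tokenize` is just an illustrative wrapper name; the shlex calls themselves are standard library:

```python
import shlex


def tokenize(cmd):
    """Tokenize with a plain POSIX-mode lexer (no whitespace_split),
    so that an unquoted ';' comes out as its own token."""
    lexer = shlex.shlex(instream=cmd, posix=True)
    return list(lexer)


for cmd in ('"foo"; bar', '"foo;" bar'):
    print(repr(cmd), '->', tokenize(cmd), 'vs split():', shlex.split(cmd))
```

With shlex.split() both inputs come out as the same token list, so the semicolon's role is lost; the plain lexer keeps them apart.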


Also, bootstrap actions may not actually run shell commands, so we may have to resort to a script and/or wrap every bootstrap action in sh -c "...".


Actually, shlex isn't really the right tool for arbitrary shell commands with pipes, redirects, etc. We just need something that can identify which extents of the string are local files and the concatenator (+#) and leave the rest of it unchanged.

irskep commented Jul 6, 2012

Longest ticket evar! I have a new idea.

Perhaps we could think of bootstrap actions not so much as commands that require files, but units of work that take parameters which may be uploadable files. This concept is best illustrated using a series of examples.

In code:

class AptGetAction(BootstrapAction):

    NAME = 'apt_get_install'

    def __init__(self, packages):
        super(AptGetAction, self).__init__()
        self.packages = packages

    # for bootstrap script
    def render_bootstrap_script(self):
        return "sudo apt-get install %s" % ' '.join(self.packages)

class UploadAndUntarAction(UploadAction):

    NAME = 'upload_and_untar'

    # superclass already knows how to expand globs and upload files
    # self.files is a dict with keys local_path, remote_path, unarchive_to
    # __init__ would set unarchive_to, didn't think about how that would work

    def render_bootstrap_script(self):
        # can also be a generator?
        for f in self.files:
            yield 'tar -xf %s -C %s' % (f['remote_path'], f['unarchive_to'])

class PythonArchivesAction(UploadAndUntarAction):

    NAME = 'python_archives'

    # file is already uploaded and unarchived

    # for wrapper script
    def render_wrapper_script(self):
        return 'export PYTHONPATH=%s' % ':'.join(f['unarchive_to'] for f in self.files)

class ScriptAction(UploadAction):

    NAME = 'script'

    def __init__(self, path):
        self.script_path = path

    def render_bootstrap_script(self):
        with open(self.script_path) as f:
            return f.read()  # assumed: the original snippet is cut off here

In config:

      - action: script
      - action: apt_get_install
        packages: [cowsay]
      - action: python_archives
        files: pycowsay.tar.gz

See what I'm getting at? They're easy to read, compose, refactor, and test. An InstallPython3 action is simple to write and distribute, though I didn't include it here.

irskep commented Jul 6, 2012

Edited for a couple of minor errors.


Yeah. I like that the actions "know" whether their arguments are local files, remote files, or something else.

I dislike that we'd essentially be creating a new standard that only exists inside mrjob, to do something that's tangential to the core functionality of mrjob. The shell syntax does this too, but only insofar as it's a novel combination of shell and Hadoop distributed cache syntax.

As an aside, I think we should have a built-in action called upload that's basically a no-op; it just gives you something to hang files you want to upload on. So instead of this:

--bootstrap "sudo dpkg -i /path/to/garply-latest.deb#"

You could do this:

--bootstrap "upload /path/to/garply-latest.deb#garply.deb"
--bootstrap "sudo dpkg -i garply.deb"

Oh, also, shell-syntax recipes are easier to email around and install than Python-object-based ones; getting python_archives to work is farther up the learning curve than dumping stuff into mrjob.conf.

I'm imagining a world where someone wants to use Python 2.7 on EMR, and they copy some shell script off a wiki or email thread and add it to their job's command line, without ever thinking about how or why it works.

irskep commented Jul 6, 2012

What about defining commands my way and using them in configs your way? The shlexd arguments are just passed to __init__ as *args, and the Python object is then kept around and used to generate the necessary scripts. So it retains the syntactic simplicity while also solving the python_archives problem.

If we don't have a Python-defined action for something, we can just render it as a regular command, with your set of Hadoop-conventioned rules documented in an obvious place.
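A minimal sketch of how that dispatch might look, assuming a registry keyed on NAME. Everything here (BootstrapAction, RawCommand, ACTIONS, action_from_string) is hypothetical and only illustrates the idea; none of it is existing mrjob API:

```python
import shlex


class BootstrapAction(object):
    """Hypothetical base class: holds the shlex'd arguments."""
    NAME = None

    def __init__(self, *args):
        self.args = args

    def render_bootstrap_script(self):
        raise NotImplementedError


class AptGetAction(BootstrapAction):
    NAME = 'apt_get_install'

    def render_bootstrap_script(self):
        return 'sudo apt-get install %s' % ' '.join(self.args)


class RawCommand(BootstrapAction):
    """Fallback: no Python-defined action, so pass the command through."""

    def __init__(self, cmd):
        self.cmd = cmd

    def render_bootstrap_script(self):
        return self.cmd


# registry of known actions, keyed on the first word of the command
ACTIONS = {cls.NAME: cls for cls in (AptGetAction,)}


def action_from_string(cmd):
    """Pick an action class from the first word, else fall back."""
    words = shlex.split(cmd)
    cls = ACTIONS.get(words[0]) if words else None
    if cls is None:
        return RawCommand(cmd)
    return cls(*words[1:])


print(action_from_string('apt_get_install cowsay').render_bootstrap_script())
print(action_from_string('sudo dpkg -i garply.deb').render_bootstrap_script())
```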

Re-summary of the main advantages I see of using Python objects:

  • One action knows how to interact with different steps - wrapper vs bootstrap, etc. - and there's no ambiguity to paths or environment variables
  • Easy to inherit and compose
  • Default action can be 'regular command with the ability to introspect and upload its files'

So in the case of Python 2.7, it could start as an emailed command line script but eventually be merged as a more readable and usable Python action. From this in 0.4.100:


to this in 0.4.101:


or even provide a mr3po-like external library.

Python bootstrap objects could also be aliases to S3-hosted bootstrap actions like configure-hadoop and install-ganglia without making the user remember and type those horrendously long paths.

irskep commented Jul 6, 2012

In fact, I definitely agree that command line syntax is the way to go for user-facing configuration. It just makes sense.

In retrospect, my idea was more about internal refactoring. It gives us the ability to encapsulate relatively complex sequences of actions into small objects while providing a more conventional interface to the user. It also lets us break more messy EMR setup code into a separate module.

Easy things should be easy. Hard things should be possible.


It'd be really nice to have an install-ganglia function. I think that'd be do-able as a shell script; something like this:

function install-ganglia () {
    local GANGLIA_SCRIPT=$(tempfile)
    hadoop fs -copyToLocal s3://bucket/path/to/ $GANGLIA_SCRIPT

irskep commented Jul 6, 2012

install-ganglia is already available as a bootstrap action with --bootstrap-action="s3://some-aws-bucket/blah/install-ganglia".

I would do this:

class InstallGanglia(BootstrapAction):

    def render_bootstrap_action(self):
        return 's3://some-aws-bucket/blah/install-ganglia'

My core thesis is that internally, we should organize bootstrapping logic this way, which provides opportunities for us to write convenient aliases for things in well-documented ways. We should absolutely encourage users to share shell scripts that do useful things and are well supported by mrjob.

I don't think our proposals are mutually exclusive at all.

irskep commented Jul 6, 2012

(Except for the YAML config thing, which I now backpedal from.)

@davidmarin davidmarin was assigned Aug 1, 2012

Returned to this thread to find the spec to work from, only to realize it's mostly still in my head (and some in mrjob/

My basic idea is that the bootstrap and setup scripts will just be ordinary shell scripts, and each --bootstrap/--setup command will correspond to a line in the shell script.

Just as it is currently, the setup script will actually be a wrapper that runs python [args] on the last line. This means if you want to pass an environment variable to your script, you just need to use export. (The setup script will also need a different way of locking other wrapper scripts from running at the same time since shell script doesn't have flock(), but that's do-able).

Before bootstrapping/setup, mrjob will scan your setup strings for references to local files, which look like local_file#remote_file. local_file cannot contain : (to make it easy to add to PATH environment variables). remote_file is optional, and cannot contain / or :. Neither may contain whitespace unless you escape it with backslashes. Everything that's not a reference to a local file will be left exactly as-is. We won't even resolve quotes and backslash expressions; all we'll do is identify them so we can find the boundaries of the file references.

For archives, just put a / after the #: local_archive.tar.gz#/remote_dir.

If local_file contains shell variables (e.g. $MY_PKGS/boto.tar.gz#/), they'll be resolved from the local environment. Shell variables elsewhere in the setup string will be left as-is (they'll be resolved when the script is run remotely).

If local_file contains * or other glob characters, we'll resolve them before writing out the script. If you use globs, you can't set remote_file (since you're actually referring to multiple files).

We'll eventually want an option to refer to URIs as well; these will be downloaded on the remote end using hadoop fs -get. URIs will of course be allowed to contain :; we'll be able to pick them out because they'll contain ://. We can even combine globs (well, *, at least) with URIs if we use the runner's Filesystem rather than the glob module.

There won't be a magic concatenation character; in shell you concatenate things like""this. For more info on the (very simple) rules for shell escaping, see

The groundwork for keeping track of file references and assigning them unique names is already done in mrjob/ All we're going to do is take the setup expression (a string), and turn it into a list of strings and path dictionaries, and then pop the path dictionaries into an instance of the appropriate Manager class.
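As a rough illustration of that last step, here is a sketch that splits a setup string into plain strings and path dictionaries. It is a simplification under stated assumptions, not the actual mrjob parser: FILE_REF_RE and parse_setup_cmd are made-up names, quotes and backslash escapes are not tracked, globs are not expanded, and excluding = from local paths is my own assumption so that VAR=file# works:

```python
import os
import re

# Simplified pattern for local_file#remote_file (hypothetical, see caveats):
FILE_REF_RE = re.compile(
    r"(?P<path>(?:\w+://)?[^\s:;=#'\"]+)"  # local path, or URI with ://
    r"#(?P<archive>/)?"                     # '#'; a following '/' marks an archive
    r"(?P<name>[^\s/:;#'\"]*)"              # optional remote name (no / or :)
)


def parse_setup_cmd(cmd):
    """Turn a setup string into a list of strings and path dicts."""
    tokens = []
    pos = 0
    for m in FILE_REF_RE.finditer(cmd):
        if m.start() > pos:
            tokens.append(cmd[pos:m.start()])  # left exactly as-is
        tokens.append({
            'type': 'archive' if m.group('archive') else 'file',
            # only variables inside the local path are resolved locally
            'path': os.path.expandvars(m.group('path')),
            'name': m.group('name') or None,
        })
        pos = m.end()
    if pos < len(cmd):
        tokens.append(cmd[pos:])
    return tokens


print(parse_setup_cmd('export PYTHONPATH=$PYTHONPATH:foo.tar.gz#/'))
print(parse_setup_cmd('sudo dpkg -i garply-latest.deb#garply.deb'))
```

Note that $PYTHONPATH in the first example stays in the plain-string part, so it would be resolved remotely when the script runs.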

irskep commented Feb 21, 2013

What about "local archive"#/"remote dir"? Quotes too complicated?

URIs should be detected by whether urlparse will parse them, not searching for ://. But that's a debatable nitpick.

👍 👍 👍


Oh, that's an important point. I was only going to look for file references outside of quoted strings.

I don't think I can fold urlparse into my simple lexer. It's really up to hadoop to determine what's a valid URL; I just need to be able to tell whether to start after the : or include it.


By the way, here's a safe, non-racy way to keep two copies of the setup wrapper script from running simultaneously:

# create lockfile and point file descriptor 9 at it
exec 9>/tmp/lockfilename
# get an exclusive lock
# (use python since we don't know for sure that flock(1) is installed)
python -c 'import fcntl; fcntl.flock(9, fcntl.LOCK_EX)'

# critical section: setup commands go here

# close file descriptor 9 to release the lock
exec 9>&-

# run the mapper/reducer

See for details.
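For anyone more comfortable reading Python than shell, the same locking pattern can be sketched with fcntl directly. This is only an illustration of the idea (Unix only); `exclusive_lock` is a hypothetical helper and the lockfile path is made up:

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive flock on path for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock


lock_path = os.path.join(tempfile.mkdtemp(), 'lockfilename')
with exclusive_lock(lock_path):
    pass  # critical section: setup commands would go here
```

As in the shell version, the lock is tied to the open file descriptor, so it goes away when the descriptor is closed or the process dies.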


Another important point; when we refer to uploaded files in the bootstrap/setup scripts, we should use their absolute paths, so that we don't have to worry so much about which directory we're in. We should also run the mapper/reducer from the original directory (this is important for file options).

For example, if we had an archive foo.tar.gz#/, that we wanted to cd into, make, and add to $PYTHONPATH, we could run our script with:

--setup 'cd foo.tar.gz#/'
--setup 'make'
--setup 'export PYTHONPATH=$PYTHONPATH:foo.tar.gz#/'

and our generated wrapper script would look something like this:


exec 9>/tmp/__mrjob_lockfilename
python -c 'import fcntl; fcntl.flock(9, fcntl.LOCK_EX)'

cd $__mrjob_PWD/foo/
export PYTHONPATH=$PYTHONPATH:$__mrjob_PWD/foo/

exec 9>&-

cd $__mrjob_PWD
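A sketch of how such a wrapper could be assembled once the file references have been replaced with absolute paths. `render_wrapper_script` is a hypothetical function, the `__mrjob_PWD=$PWD` line is an assumption about how that variable gets set, and the task command in the usage line is a placeholder:

```python
def render_wrapper_script(setup_lines, task_cmd,
                          lockfile='/tmp/__mrjob_lockfilename'):
    """Build the text of a setup wrapper script (hypothetical sketch).

    setup_lines: setup commands, already rewritten to absolute paths
    task_cmd: the mapper/reducer command to run on the last line
    """
    lines = [
        '__mrjob_PWD=$PWD',  # assumption: remember the task's directory
        '',
        'exec 9>%s' % lockfile,
        "python -c 'import fcntl; fcntl.flock(9, fcntl.LOCK_EX)'",
        '',
    ]
    lines.extend(setup_lines)  # critical section
    lines += [
        '',
        'exec 9>&-',         # release the lock
        '',
        'cd $__mrjob_PWD',   # run the task from the original directory
        task_cmd,
    ]
    return '\n'.join(lines) + '\n'


print(render_wrapper_script(
    ['cd $__mrjob_PWD/foo/',
     'make',
     'export PYTHONPATH=$PYTHONPATH:$__mrjob_PWD/foo/'],
    'python my_job.py --mapper'))
```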

Dang, looks like there are no regression tests for the setup options. Guess I'll start there!


#206 (comment)

@davidmarin could you please explain what are the scenarios in which we need to be concerned about multiple invocations of the wrapper script?


Invoking make on a cacheArchive was (and presumably, still is) Yelp's main use case. Hadoop runs multiple tasks on the same task node, and although each task gets its own working directory, they merely get symlinks to the unpacked archive. This makes sense if you think about what is most efficient, but I wouldn't say it's intuitive (nor are the resulting sporadic errors).

I imagine make is but one of many non-reentrant utilities one might want to run during setup.


moving this issue to the next release

@davidmarin davidmarin was assigned Aug 28, 2013
@scottknight scottknight added a commit to timtadh/mrjob that referenced this issue Oct 10, 2013
@scottknight scottknight Merge tag 'tags/v0.4.1' into features/update_mobile_defense_production
secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (#240)
     * see mrjob/examples/
   * can now override jobconf() again (#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * (two step job)
     * (SORT_VALUES example)
 * runners:
   * All runners:
     * single --setup option works but is not yet documented (#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (#633)
     * Throws IOError if output path already exists (#634)
     * Gracefully handles SSL cert issues (#621, #706)
     * Automatically infers EMR/S3 endpoints from region (#658)
     * ls() supports s3n:// schema (#672)
     * Fixed log parsing crash on JarSteps (#645)
     * visible_to_all_users works with boto <2.8.0 (#701)
     * must use --interpreter with non-Python scripts (#683)
     * cat() can decompress gzipped data (#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (#583)
     * cat() can decompress gzipped data (#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (#747)
   * built-in protocols must be instances (#488)
@irskep irskep closed this in #806 Nov 5, 2013
This was referenced Nov 27, 2013

Noting that I never actually implemented globs (for local or remote files). #938 is now the ticket for this.

@davidmarin davidmarin removed their assignment Jul 8, 2015