Optional host prefix for get #79

bitprophet opened this Issue Aug 19, 2011 · 12 comments


None yet

1 participant



Currently, when the file is downloaded using get(), it'll add host prefix to the downloaded file - but only when using multiple hosts. It also means that in the code you always need to re-check the name of the downloaded file. I found it both unnecessary and error-prone, and therefore suggest:

  • Make the suffix optional, for example with param: get(remote, local, host_prefix=True)
  • Return the path of the local, downloaded file as a result of get()

Here's what I (Jeff) see as the primary use cases for get that we need to consider in our solution, and what I think the default local results should be. Note: using trailing slashes for readability, they're not required; all local paths are relative for brevity. Multi-file examples are written with Erich's current host-suffix-based implementation first and a host-directory-based one in parens.

  • File to directory: get('foo.txt', 'files/')
    • Single: files/foo.txt
    • Multi: files/foo.txt.<host> (or files/<host>/foo.txt; or, maybe better, <host>/files/foo.txt)
  • File to file: get('foo.txt', 'bar.txt')
    • Single: bar.txt
    • Multi: bar.txt.<host> (or <host>/bar.txt)
  • (Recursive) directory to directory: get('.vim/', 'dotfiles/')
    • Single: dotfiles/.vim/<files>
    • Multi: dotfiles/<host>/.vim/<files> (or <host>/dotfiles/.vim/<files> ?)
  • (Recursive) directory to file: get('.vim/', '.vimrc')
    • No-op (if .vimrc was a local, existing file -- otherwise it'll be treated as a dir and created, thus falling under the previous bullet point)

There are other, more extreme use cases that should probably be left up to the user to deal with, such as multiple files with identical basenames, or multiple runs of the same task in a row, both of which will clobber to various extents. Exposing a paramaterized string is probably a good way to go about this, as per Max's comment and other supporting commentary on IRC.

Originally submitted by **** (jmu) on 2009-11-04 at 01:49pm EST


  • Duplicated by #233: Make get() return retrieved file path
  • Related to #274: See if it makes sense to have get/put return values
  • Related to #279: Prune recursive kwarg from get()/put()
  • Related to #140: Recursive put() and get()

Closed as Done on 2011-02-15 at 01:06am EST

@bitprophet bitprophet was assigned Aug 19, 2011

Max Arnold (LwarX) posted:

I think host prefix should be completely optional and customizable. Probably this can be done with string interpolation:

without prefix:
get("/path/to/source/file.txt", "/tmp/%(filename)s")

with prefix:
get("/path/to/source/file.txt", "/tmp/%(host)s/%(filename)s")
get("/path/to/source/file.txt", "/tmp/%(filename)s.%(host)s")

To do this fabric can populate special dictionary with several variables for each get run ('filename', 'host', etc...)

on 2010-08-16 at 03:21am EDT


Jeff Forcier (bitprophet) posted:

Making this 1.0, it should be considered along with #140 and friends. Note that the implementation Erich wrote for #140 does not address this problem other than making sure that the existing behavior (create per-host files if running on more than one host) applies cleanly to folders as well as files.

Thus we still have to figure out how this problem is best solved, and make sure we preserve the folder/recursive behavior too.

on 2010-09-17 at 03:29pm EDT


Jeff Forcier (bitprophet) posted:

So speaking specifically about this ticket, I'm seeing the following solutions:

  1. Always use a host suffix/folder, regardless of the size of the host list.
    • Plus: works well even when used as a library (where all_hosts will not be usable)
    • Plus: simple to implement
    • Plus: won't ever cause people to lose data due to overwrites (multiple runs of your fab task in a row notwithstanding)
    • Minus: base case users downloading one file might expect the local file to lack the suffix and be a bit confused at first
  2. Always use a suffix by default, but allow users to deactivate the suffix.
    • Plus: more control.
    • Minus: adds a new kwarg for arguably little gain (is it really that important to be able to turn off the suffix when you know you're only downloading a file from one host?)
    • Minus: allows less experienced users to shoot themselves in the foot if they turn the suffix off, then run against multiple hosts.
  3. Continue using a suffix conditionally based on all_hosts, but return the filename gotten.
    • Plus: allows users to figure out what file was downloaded.
    • Minus: doesn't play very well with recursive use (return an iterable? a multi-layered dict? ew.)
    • Minus: still breaks for library users
  4. Allow full user control over suffixing format, as per Max's comment.
    • Plus: lets power users do what works best for them.
    • Minus: doesn't mesh well with recursive use case (i.e. needs to apply equally well to files and/or directories).

Out of all these, I actually think the first one is possibly the best, all things considered; unless somebody has objections I haven't thought of yet (quite possible). Having recursive behavior definitely complicates things (insofar as you will sometimes end up with local files named <hostname>/filename.txt and other times filename.txt.<hostname>) but I don't see any easy way around that part of things.

on 2010-09-17 at 04:17pm EDT


Erich Heine (sophacles) posted:

I think that in all cases, a directory should be made for each host in the target directory.

so with the fabfile:

@hosts("foo", "bar")
def getfiles():
    get('/file/path.file', 'target/dir')

@hosts("foo", "bar")
def getglob():
    get("pathname/*", 'target/dir2')

$ fab getfiles
results in:



$fab getglob

results in:


It is simple, consistent and easily scriptable for later stuff.

I also think that there should be some customization allowed, in the form of the dirname. I think there should be a way to look up a representation on a per host basis. This last bit tho, could probably wait until host objects are done.

on 2010-09-17 at 05:33pm EDT


Jeff Forcier (bitprophet) posted:

I do like the "just always use per-host directories" approach as it could simplify things (basically a cleaner version of option 1 above).

I also wonder if it might not be even better to make the "target" dir/filename optional, and default to using the remote file path. E.g. connecting to fabfile.org as jforcier and calling get('TODO.txt') would result in a local file/folder structure of $CWD/fabfile.org/home/jforcier/TODO.txt. And get('/var/www/.htaccess') would result in $CWD/fabfile.org/var/www/.htaccess. Etc.

I'd like to think this approach would solve any/all collision problems, at least across the dimensions of multiple hosts or multiple files with the same basename (e.g. get('/home/jforcier/.bashrc') ; get('/home/deploy/.bashrc') would result in file overwriting under any other implementation).

It introduces one minor extra complexity on top of the overall "what's my local filename?" issue that all solutions have, namely the implicit remote $HOME in relative remote paths. However I think this is small enough that the benefits of this approach outweigh it.

Unfortunately, we'd almost definitely still have to allow for users manually specifying the target local path, which means we still have to handle all the messy file-vs-directory crap anyways. Ugh. Will have to think on this some more, or maybe we ought to approach this from the use-case angle.

on 2010-09-19 at 01:41pm EDT


Jeff Forcier (bitprophet) posted:

After updating the description and chatting in IRC I think we should use a combined approach that focuses on the format-string solution.

We should use the following components:

  • host - Connection hostname, e.g. website.com or
  • dirname - File dirname, e.g. the doc in doc/TODO.txt (or the project/doc in project/doc/TODO.txt -- i.e. the same as Unix basename)
  • basename - File basename, e.g. the TODO.txt in same
  • filename (or fullname perhaps) - Conjunction of dirname and basename, so doc/TODO.txt
    • Having dirname and basename distinct allows users to set up formats like mydir/%(dirname)s/%(host)s/%(basename)s if they so desire.

This format string applies to resulting individual files; doesn't matter if recursion is used. This threw me for a conceptual loop earlier on, I didn't see how this could work in tandem with the recursive behavior. If you just think about remote_dir/subdir/ as turning into a list of eg ['remote_dir/subdir/file1.txt', 'remote_dir/subdir/subdir2/file2.txt', ...], then it makes sense.

The default format string should be %(host)s/%(filename)s, so the single vs multi host situation has a single default. We should also return a list of the final file paths; a nested structure is too complicated, but we might as well return something instead of nothing.

Note that this means the only required argument would now be the source/remote filename.

The only problem I see with this approach is that of backwards compatibility, and/or users simply giving a non-format string in the 2nd argument position. The only solution to that I can see is to use the following heuristics:

  • If given path does not exist, use it as a prefix directory, so that the format string used is %(local_path)s/%(host)s/%(filename)s. This may not be intuitive, but at this point we've decided to move to a power user friendly solution, so if a newbie isn't reading the docs (or isn't realizing they are reading an old tutorial while using a newer version) then there's nothing we can really do.
    • Printing a warning may go a long way towards staving off spurious help requests, though!
  • If the path exists and is a directory, see above.
  • If the path exists and is a file, abort? (Or do we test to see if the paths being downloaded is only one single file, and then overwrite? That would be slightly more user friendly but strikes me as an oddball corner case.)

on 2010-09-20 at 04:55pm EDT


Jeff Forcier (bitprophet) posted:

I think the bullet points above are too much effort given the tradeoff. Instead, I think this is what I'll do:

  • Existing behavior, re: local_path argument, stays exactly the same for single-host runs. Single file vs directory vs existing vs non-existing, all the same, not changing anything;
  • For multi-host runs, drop the existing suffix behavior. If you specify an explicit local_path and don't use the new interpolation variables, too bad, your file is getting overwritten each time;
  • Interpolate local_path with the new variables as stated above, so power users may make use of them;
  • An empty or missing local_path argument will now be cwd + host + filename, also as per the above comment.
  • Probably start pitching the "default" use of get() to be get(remote_path) (utilizing the new "safe"/consistent default) instead of get(remote_path, local_path) (implying you always need to specify something for the local side).

The benefits of this are A) it's less overall work, we're simply dropping one old chunk of code and adding one unrelated new one, and B) it's heavily backwards compatible for all the more common use cases (and also the way we use it in our test suite, to be honest).

on 2011-01-19 at 01:21pm EST


Jeff Forcier (bitprophet) posted:

OK, I think I'm pretty much done. All the older tests (as in, the ones put in prior to this ticket's work) still pass, except for a few that no longer apply and were removed; some were mutated to fit the new behavior; and another one or two were added.

Merging into the put-get integration branch and continuing...

on 2011-01-21 at 04:15pm EST


Jeff Forcier (bitprophet) posted:

Found a case not tested which is, naturally, broken -- using format strings plus recursion. Derp.

Think I may end up merging sftp.py's get/get_dir split, but have to look at it a bit more.

on 2011-01-23 at 11:01am EST


Jeff Forcier (bitprophet) posted:

The core problem with this edge case (applying format strings to recursive=True) seems to be that it forms a different behavior from what has already been implemented, re: how local_path needs to be treated in the simple (no format strings) case:

  • get('/path/to/remote_file', 'local_file') writes to local_file. The local path is not modified in any way.
  • get('/path/to/remote_file', 'local_existing_dir') writes to local_existing_dir/remote_file. Also pretty straightforward: we notice that the local path already exists and is a directory, so we place the downloaded file inside of it with its remote basename. It's just an append.
  • get('/path/to/remote_directory/*', 'local_existing_dir', recursive=False) is simply a multi-call case of the previous bullet point.

However! When we need to get recursive, the expectation is that remote_path and its subdirectories will be created locally, mirroring the remote end. Thus, given a remote filesystem containing remote_directory/subdir/file1, calling get('remote_directory', 'local_directory', recursive=True) would not write out local_directory/file1, but would write local_directory/remote_directory/subdir/file1 instead. And thus get() needs to tack on the remote directory hierarchy to the end of local_path.

In the above bullet-point cases, this is not a problem, because we're only getting the "leaf" file: even if one calls get('big/long/hierarchy/file.txt', 'local_dir'), we still only write out local_dir/file.txt. This isn't feasible for the recursive case -- flattening just won't work -- thus the recreation of the remote hierarchy.

So! What happens if local_path contains format strings such as %(dirname)s or %(path)s? The simple cases work fine, because they will still map to a single local file path. But in recursion, appending the remote directory structure would result in local paths like %(dirname)s/remote_directory/subdir/file1. Expand this and it becomes remote_directory/subdir/remote_directory/subdir/file1 -- which is incorrect.

I don't have a ready answer for this offhand, but it took a while just to get a grasp on why this was such a problem, so wanted to write it down here. Will have to think a bit. Suggestions welcome.

on 2011-02-13 at 10:10pm EST


Jeff Forcier (bitprophet) posted:

More thoughts:

  • The real issue here is that this is the only situation where we need to know whether the user specified a format string (or rather, whether they specified %(dirname)s and/or %(path)s, since those "use up" the directory structure we'd need to add ourselves otherwise) or gave a regular string.
  • While I feel like it's messy and shouldn't be done, we can probably just test local_path early on in the process to see if it contains any literal format string specifiers.
    • If nothing else springs to mind, this is probably what I'll do since I've wasted enough time running in circles about this.

on 2011-02-13 at 10:29pm EST


Jeff Forcier (bitprophet) posted:

Went with the above, plus a handful of other changes I noticed while implementing/testing.

Most notably, making it behave like scp in that %(dirname)s/%(path)s reflect the relative remote path instead of the absolute one. I.e. get('/var/log/*', '%(path)s', recursive=True) was, earlier, creating $CWD/var/log/apache2/access.log and so forth. It should be cutting things off after /var/log, and creating $CWD/apache2/access.log etc.

After all this, the code is pretty messy and the docstring is absolutely gargantuan; I'm unhappy with how this has all turned out. However, it's definitely not worth holding off the 1.0 release to make this "perfect". And barring any more unfound bugs, it's much more powerful/flexible than the 0.9 incarnation, while (I think) still almost as usable in the base case, and almost entirely backwards compatible to boot.

Marking this done for reals this time, can reopen if any prerelease sprints/testing finds more bugs.

on 2011-02-15 at 01:06am EST

@bitprophet bitprophet closed this Aug 19, 2011
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment