ENH: Add --input flag #2464

kyleam · 2018-04-30T17:56:47Z

This is a first pass at allowing inputs to be specified when calling datalad run.

WIP because

I haven't tested it too thoroughly
I haven't yet tried to add --output, but we might want to do that here (as opposed to a separate PR)
I'm not sure how we should handle glob expansion.

If a glob is given as the input value, it is expanded at the time of the run. As an example, if '*' was used to indicate all annex files, a rerun would consider the inputs to be all of the annex files at the time of the initial run.

Expansion during the initial run seems appropriate because most of the time re-executing a command should only require those inputs. However, it's possible that the caller would want to rerun a command that uses globs itself to get its input files, in which case files that were added after the initial run wouldn't be retrieved.

Perhaps we could also store the unexpanded inputs and add a rerun flag that says to use the unexpanded set.

Update: I'm still not sure, but the current approach is to store unexpanded unless the --expand option is used.

Note: #2432 mentioned --input=. as meaning all annex files should be present, but I didn't end up adding that because it's possible with globbing (--input='*').

re: #2158, #2432

We'd need something like feature described in dataladgh-1424 to address this.

The next commit will add a `get` call to run, and `get` sets the message to a tuple.

Allow the caller to specify files that we should obtain with 'git annex get' before the run. In addition to data files, this input could be a singularity container (as described in dataladgh-2158). There are at least a couple of decisions here that should be given more thought: * The path to --input must be relative to the root of the dataset. * If a glob is given as the input value, it is expanded at the time of the run. As an example, if '*' was used to indicate all annex files, a rerun would consider the inputs to be all of the annex files at the time of the initial run. Expansion during the initial run seems appropriate because most of the time re-executing a command should only require those inputs. However, it's possible that the caller would want to rerun a command that uses globs itself to get its input files, in which case files that were added after the initial run wouldn't be retrieved. Perhaps we could also store the unexpanded inputs and add a rerun flag that says to use the unexpanded set. Re: datalad#2432

codecov · 2018-04-30T19:13:42Z

Codecov Report

Merging #2464 into master will decrease coverage by 8.32%.
The diff coverage is 52.87%.

@@            Coverage Diff             @@
##           master    #2464      +/-   ##
==========================================
- Coverage   88.99%   80.66%   -8.33%     
==========================================
  Files         283      240      -43     
  Lines       33146    29629    -3517     
==========================================
- Hits        29499    23901    -5598     
- Misses       3647     5728    +2081

Impacted Files	Coverage Δ
datalad/tests/test_utils.py	`95.9% <100%> (-0.13%)`	⬇️
datalad/utils.py	`87.1% <100%> (-1.66%)`	⬇️
datalad/support/annexrepo.py	`86.95% <100%> (-2.38%)`	⬇️
datalad/interface/rerun.py	`86.39% <100%> (-6.02%)`	⬇️
datalad/support/tests/test_annexrepo.py	`95.31% <100%> (-1.74%)`	⬇️
datalad/interface/tests/test_run.py	`58.76% <20.48%> (-40.95%)`	⬇️
datalad/interface/run.py	`87.87% <63.63%> (-12.13%)`	⬇️
datalad/tests/utils_testdatasets.py	`11.76% <0%> (-88.24%)`	⬇️
datalad/interface/tests/test_annotate_paths.py	`24% <0%> (-76%)`	⬇️
datalad/metadata/tests/test_aggregation.py	`25.69% <0%> (-74.31%)`	⬇️
... and 143 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9017b45...c663104. Read the comment docs.

mih · 2018-05-07T13:33:14Z

Thanks! First comment:

I think we should stick to the expansion style that annex is using, and that we are relying on elsewhere too. Otherwise we break the paradigm (for no obvious reason). Hence . would refer to any file/directory in the entire repo (but not subdatasets). With globbing it would be hard to say: all this, but only until you hit a dataset boundary.

Also we should be careful with expanding the inputs. . would mean 'everything', but expanded it would mean 'everything that was at that time' -- which would have implications on the semantics of rerun --onto....

kyleam · 2018-05-07T14:59:24Z

I think we should stick to the expansion style that annex is using, and that we are relying on elsewhere too. Otherwise we break the paradigm (for no obvious reason).

My reason is that the ability to glob was requested in gh-2432. What's currently possible is to glob via annex's --include. I'm not against adding "--input=. -> get('.')", but I viewed it as redundant because, if we have globbing, we have --input='*'. As long as --input is simply mapping to git annex get (or more generally, just annexed files), "." and --include='*' are equivalent, I think.

Hence . would refer to any file/directory in the entire repo (but not subdatasets). With globbing it would be hard to say: all this, but only until you hit a dataset boundary.

The globbing is documented as "what would git annex find --include" do, so by that definition it doesn't cross dataset boundaries. Perhaps we shouldn't limit globs to the current dataset, but given that ATM rerunning with anything other than HEAD^ doesn't work (in the sense that the subdataset state is not adjusted using something like gh-1424), I don't view making globs work across subdatasets as a priority.

At any rate, I'll plan to add --input=. for consistency.

Also we should be careful with expanding the inputs. . would mean 'everything', but expanded it would mean 'everything that was at that time' -- which would have implications on the semantics of rerun --onto....

Yes, I agree. For globbing in general, I tried to touch on this in my opening comment (and the message of d5d00f2). As I mentioned there, I see the stricter "expand at the initial run time" as the better default, but I can imagine situations where the caller would want the "expand on each rerun" behavior.

So

should --input=. remain unexpanded while globs are expanded?
should there be an additional flag to say "don't expand my globs"?
something else?

We'll use this for --output as well.

An upcoming commit will want lists from partition again, so let's avoid repeating this pattern.

This doesn't matter for inputs, but for outputs we'll need to do this check before unlocking them, so we might as well move the inputs block to keep inputs/outputs processing close together.

This also avoids the false warning of no matching files on rerun.

Right now, we always expand at run time, so there is no point in calling _resolve_files.

While --input maps to 'git annex get', --output maps to 'git annex unlock' or 'git rm', depending on whether content is present. We can't unconditionally unlock because that leads to an error if content isn't present (dataladgh-2432). We're expanding the globs on the initial run in the same way we do with --input. As discussed in d5d00f2, this should be revisited because there are cases for both --input and --output where the caller would want to store the unexpanded patterns and re-glob on each rerun. Re: datalad#2432

Re: datalad#2464 (comment)

Unlike the previous version, this will work with annex files in subdatasets, and it's simpler because we let `remove` handle the annex checks.

This pattern is no longer used in multiple places.

When content isn't present for a file that was added/modified in a revision, it'd be better to remove the file rather than trying to unlock it, which fails. But if we do that, we should make sure we process inputs first since that step may retrieve an added/modified file (perhaps via --input=.). To do both these things, piggyback on run's outputs argument.

mih · 2018-05-12T06:55:30Z

Playing with the beast. Observations:

In a plain Git repo:

% datalad run --input . bash -c "ls -l dicoms > this"
get(notneeded): . [no dataset annex, content already present]
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): this (file)
save(ok): /tmp/func (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
(datalad3-dev) mih@meiner /tmp/func (git)-[master] % datalad run --input '*' bash -c "ls -l dicoms > that"
[ERROR  ] 'GitRepo' object has no attribute 'is_under_annex' [run.py:<lambda>:153] (AttributeError)

The latter code path assumes the presence of an AnnexRepo.

Continuing:

% git annex init
init  ok
(recording state in git...)
(datalad3-dev) mih@meiner /tmp/func (git)-[master] % datalad run --input '*' bash -c "ls -l dicoms > that"
[WARNING] No matching files found for --input

All files are tracked in Git directly, hence nothing is there that concerns annex, but still the message is misleading in this case.

Here I have a folder will 5461 files under annex:

% datalad run --input annexdicoms bash -c "ls -l annexdicoms > that3"
[ERROR  ] AssertionError() [annexrepo.py:info:2518] (AssertionError) 
datalad run --input annexdicoms bash -c "ls -l annexdicoms > that3"  8.92s user 0.99s system 107% cpu 9.258 total

It aborts before anything happens (10s for checking something). Repo stays clean, no added commit, no untracked stuff. The corresponding fire-and-forget get call:

% datalad get annexdicoms
get(notneeded): /tmp/func/annexdicoms (directory) [nothing to get from /tmp/func/annexdicoms]
datalad get annexdicoms  4.82s user 0.43s system 107% cpu 4.872 total

That is half the time (and probably still slower than what it could be doing). Now the run call with --dbg:

% datalad --dbg run --input annexdicoms bash -c "ls -l annexdicoms > that3"
Traceback (most recent call last):
  File "/home/mih/env/datalad3-dev/bin/datalad", line 8, in <module>
    main()
  File "/home/mih/hacking/datalad/git/datalad/cmdline/main.py", line 378, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/mih/hacking/datalad/git/datalad/interface/base.py", line 420, in call_from_parser
    ret = list(ret)
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 414, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 482, in _process_results
    for res in results:
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 142, in __call__
    message=message):
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 209, in run_command
    inputs = _resolve_files(ds, inputs)
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 154, in _resolve_files
    globs, files = list(globs), list(files)
  File "/home/mih/hacking/datalad/git/datalad/utils.py", line 669, in <genexpr>
    return ((item for pred, item in a if not pred),
  File "/home/mih/hacking/datalad/git/datalad/utils.py", line 668, in <genexpr>
    a, b = tee((predicate(item), item) for item in items)
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 153, in <lambda>
    lambda f: dset.repo.is_under_annex([f], batch=True)[0])
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 300, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/annexrepo.py", line 1865, in is_under_annex
    info = self.info(files, normalize_paths=False, batch=batch)
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 300, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/annexrepo.py", line 2518, in info
    assert(j.pop('file') == f)
AssertionError

> /home/mih/hacking/datalad/git/datalad/support/annexrepo.py(2518)info()
-> assert(j.pop('file') == f)
(Pdb) print(j)
{'local annex size': '115317918', 'size of annexed files in working tree': '115317918', 'numcopies stats': [['+0', 5460]], 'combined size of repositories containing these files': '115317918', 'command': 'info annexdicoms', 'success': True, 'local annex keys': 5460, 'annexed files in working tree': 5460, 'repositories containing these files': [{'here': True, 'size': '115317918', 'uuid': '104deefb-ace4-4bf4-9b1a-3fc451200ec7', 'description': 'mih@meiner:/tmp/func'}], 'directory': 'annexdicoms'}

So it wants 'annexdicoms/*' (which is OK, but it makes me think that supporting . and not this is inconsistent. So we might want to drop . again (but see below). Another attempt:

% datalad run --input 'annexdicoms/*' bash -c "ls -l annexdicoms > that3"

<snip 5461 get(notneeded) results that we might want to filter out>

[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
[ERROR  ] [Errno 7] Argument list too long: 'git' [subprocess.py:_execute_child:1344] (OSError)                                                                                              
datalad run --input 'annexdicoms/*' bash -c "ls -l annexdicoms > that3"  10.68s user 0.90s system 114% cpu 10.119 total

I guess argument expansion hit.

% git st
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   that3

It managed get and staged the result, but then breaks at:

['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'commit', '-m', '[DATALAD RUNCMD] bash -c ls -l annexdicoms > that4\n\n=== Do not change lines below ===\n{\n "cmd": [\n  "bash",\n  "-c",\n  "ls -l annexdicoms > that4"\n ],\n "exit": 0,\n "chain": [],\n "inputs": [\n  "annexdicoms/MR.1.3.46.670589.11.38317.5.0.4476.2014042516042547586",\n  "annexdicoms/MR.1.3.46.670589.11.38317.5.0.4476.2014042516042548588",\n 

<5k more lines>

So no argument expansion per se, but if we want to go down this path, we would need to write the commit message into a file first and then ingest it via git commit -F.

But I am not sure at all if that is a good idea. I have real datasets with >200k annex files that are all input into an analysis. In my DICOM case they maybe be spit into ~50k chunks per directory (one acquisition session). So I would routinely end up with 50k line commit messages for saying "I need this directory" -- if aggregated metadata hits such size, we are already putting it in annex ;-)

Thanks @kyleam for pushing this forward!

Re: datalad#2464 (comment)

This test (introduced by this series) will fail when this branch is merged to master because dcdbf65 removes add's commit argument.

kyleam · 2018-05-12T20:52:26Z

Thanks for giving this a test drive @mih. I'll need to digest/work through your comments, but there are clearly issues with my initial attempt at implementing this :/

In a plain git repo, this avoids (1) displaying a misleading "No matching files..." warning when a glob is passed to --input and (2) running an unnecessary 'datalad get .' when '.' is passed to --input. Re: datalad#2464 (comment)

We want to be able to give a directory to --input. Doing so allows a command to depend on a potentially large number of files without depending on everything with "." or globbing with --input='subdir/*', which stores all the subdirectory's file names in the commit message (which isn't pleasant to look at in the log and will currently fail if the message reaches a certain length). AnnexRepo.is_under_annex, however, fails when passed a directory because it calls AnnexRepo.info, which raises an AssertionError. It's not clear what "under annex" should mean for a directory (contains one or more annexed files?), but at the least AnnexRepo.info should probably return information for the directory rather than raising an AssertionError. So avoid passing directories to is_under_annex. If is_under_annex / info are adjusted to handle directories, it might make sense to modify _resolve_files to return a directory only when it contains an annexed file. Re: datalad#2464 (comment)

kyleam · 2018-05-14T15:14:10Z

OK, so the issues I took away from @mih's feedback were

--input fails in plain git repos
the "no matching" warning is misleading in plain git repos
passing a directory to --input fails
globbing can lead to commit messages that exceed the accepted command line length

The first three should be fixed. For (3), passing a directory to --input should now translate directly to datalad get <directory> in the same way that --input=. maps to datalad get ..

The fourth should be less of an issue now because, for a subdirectory with lots of files, the subdirectory name itself can now be passed. But we should probably still address it by either using git commit -F if needed or by just giving a more informative error message.

mih · 2018-05-15T18:55:20Z

Thanks @kyleam !

I'll try to explore it a little more over the next two days and get back to you shortly.

The main motivation for this change is that (1) globbing isn't restricted to the current dataset, (2) passing the pattern to glob.glob respects the current working directory rather than the top-level of the repo, and (3) we might want to make --input/--output about more than just getting/unlocking files, so we may not want to couple the globbing to 'git annex find'.

Globbing can quickly result in really long commit messages, so store the unexpanded form by default. The next commit will add an option to store the expanded globs.

kyleam · 2018-05-16T04:16:15Z

The latest push is a bit rough and a few tests are failing for reasons I don't yet understand. I'll will dig into it more tomorrow.

When reruning, `outputs` is used to unlock the added/modified files. But with --onto, these might not be present. Don't issue a warning if the glob doesn't match in this case because it's expected and the output isn't a user-supplied output. One issue with this solution is that we no longer warn if user-supplied output globs don't match on rerun. In order to do this, we'd need to decouple automated and user-supplied outputs.

If a dataset is given on the command line (or when run is called as a dataset method), expand globs relative to the top-level of the dataset. This follows the convention that datalad uses for path handling. The large test diff is due to an indentation change; we no longer need a chpwd block.

kyleam · 2018-05-16T15:19:45Z

The latest push is a bit rough and a few tests are failing for reasons I don't yet understand.

Should now be addressed.

The main changes in the recent pushes

globbing is now done with python's glob.glob rather than git annex find --include=*
inputs and outputs are stored unexpanded in the log message unless run's --expand option is used.

mih · 2018-05-17T19:27:30Z

Feel free to merge this any time.

kyleam · 2018-05-17T19:34:04Z

I gonna go with "ballerina crossword", I guess?

mih · 2018-05-18T14:48:37Z

BAMM!

kyleam added 7 commits April 30, 2018 08:18

ENH: utils: Add partition helper

eb771e3

TST: annexrepo: Test get_annexed_files directly

8152ab3

DOC: annexrepo: Extend get_annexed_files's docstring

7f6321c

ENH: get_annexed_files: Support --include=glob

d5afee8

DOC: rerun: Add note about --onto/submodule limitation

d6e88f4

We'd need something like feature described in dataladgh-1424 to address this.

TST: run: Fix incorrect assumption about results' messages

a1ca3f2

The next commit will add a `get` call to run, and `get` sets the message to a tuple.

kyleam added the WIP work in progress label Apr 30, 2018

kyleam added 13 commits May 7, 2018 11:15

DOC: run: Clarify --input "note"

b2f9686

RF: run/rerun: Add more keywords to the run_command call

57987f5

RF: run: Rename _resolve_inputs to _resolve_files

d209bc2

We'll use this for --output as well.

RF: run: Add wrapper around partition that returns lists

a6608ed

An upcoming commit will want lists from partition again, so let's avoid repeating this pattern.

RF: run: Check for modifications before processing inputs

7d43fcf

This doesn't matter for inputs, but for outputs we'll need to do this check before unlocking them, so we might as well move the inputs block to keep inputs/outputs processing close together.

BF: run: Don't needlessly process empty inputs on rerun

dd27870

This also avoids the false warning of no matching files on rerun.

RF: run: Don't bother trying to resolve files on rerun

81ec17e

Right now, we always expand at run time, so there is no point in calling _resolve_files.

RF: run: Don't call "get" with an empty list of files

ae128ba

ENH: run: Add support for --input=.

33cfe22

Re: datalad#2464 (comment)

BF: run: Rewrite "unlock/remove" bits to work with subdatasets

bbeaa75

Unlike the previous version, this will work with annex files in subdatasets, and it's simpler because we let `remove` handle the annex checks.

RF: run: Remove partition helper

1ce28ac

This pattern is no longer used in multiple places.

kyleam added 2 commits May 12, 2018 14:33

BF: run: Prevent --input/--output from failing on non-annex repos

fb44a98

Re: datalad#2464 (comment)

RF+TST: annexrepo: Don't use add(..., commit=True)

21db2ee

This test (introduced by this series) will fail when this branch is merged to master because dcdbf65 removes add's commit argument.

kyleam added 2 commits May 14, 2018 09:36

DOC: run: Expand _resolve_files's docstring

288a46d

RF: run: Move "is annex?" check upstream

bc3eb79

In a plain git repo, this avoids (1) displaying a misleading "No matching files..." warning when a glob is passed to --input and (2) running an unnecessary 'datalad get .' when '.' is passed to --input. Re: datalad#2464 (comment)

mih mentioned this pull request May 15, 2018

Outline scope of extension (in docs) datalad/datalad-container#4

Closed

kyleam added 3 commits May 15, 2018 23:56

RF: run: Store unexpanded inputs/outputs

13b4f2e

Globbing can quickly result in really long commit messages, so store the unexpanded form by default. The next commit will add an option to store the expanded globs.

ENH: run: Add flag to store expanded --input/--output values

37af359

kyleam added 3 commits May 16, 2018 10:44

BF: run: Don't abort if --input is outside of the repo

c663104

mih approved these changes May 17, 2018

View reviewed changes

mih mentioned this pull request May 18, 2018

Implement a run command with container support datalad/datalad-container#10

Closed

kyleam removed the WIP work in progress label May 18, 2018

mih merged commit da2c167 into datalad:master May 18, 2018

This was referenced May 18, 2018

Add --inputs/--env to run #2432

Closed

Enhance run/rerun with the ability to record containers required for execution #2158

Closed

kyleam deleted the enh-run-inputs branch May 18, 2018 15:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add --input flag #2464

ENH: Add --input flag #2464

kyleam commented Apr 30, 2018 •

edited

codecov bot commented Apr 30, 2018 •

edited

mih commented May 7, 2018 •

edited

kyleam commented May 7, 2018 •

edited

mih commented May 12, 2018 •

edited

kyleam commented May 12, 2018 •

edited

kyleam commented May 14, 2018 •

edited

mih commented May 15, 2018

kyleam commented May 16, 2018

kyleam commented May 16, 2018

mih commented May 17, 2018

kyleam commented May 17, 2018

mih commented May 18, 2018

ENH: Add --input flag #2464

ENH: Add --input flag #2464

Conversation

kyleam commented Apr 30, 2018 • edited

codecov bot commented Apr 30, 2018 • edited

Codecov Report

mih commented May 7, 2018 • edited

kyleam commented May 7, 2018 • edited

mih commented May 12, 2018 • edited

kyleam commented May 12, 2018 • edited

kyleam commented May 14, 2018 • edited

mih commented May 15, 2018

kyleam commented May 16, 2018

kyleam commented May 16, 2018

mih commented May 17, 2018

kyleam commented May 17, 2018

mih commented May 18, 2018

kyleam commented Apr 30, 2018 •

edited

codecov bot commented Apr 30, 2018 •

edited

mih commented May 7, 2018 •

edited

kyleam commented May 7, 2018 •

edited

mih commented May 12, 2018 •

edited

kyleam commented May 12, 2018 •

edited

kyleam commented May 14, 2018 •

edited