Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add --input flag #2464

Merged
merged 31 commits into from May 18, 2018
Merged

ENH: Add --input flag #2464

merged 31 commits into from May 18, 2018

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 30, 2018

This is a first pass at allowing inputs to be specified when calling datalad run.

WIP because

  • I haven't tested it too thoroughly

  • I haven't yet tried to add --output, but we might want to do that here (as opposed to a separate PR)

  • I'm not sure how we should handle glob expansion.

    If a glob is given as the input value, it is expanded at the time of the run. As an example, if '*' was used to indicate all annex files, a rerun would consider the inputs to be all of the annex files at the time of the initial run.

    Expansion during the initial run seems appropriate because most of the time re-executing a command should only require those inputs. However, it's possible that the caller would want to rerun a command that uses globs itself to get its input files, in which case files that were added after the initial run wouldn't be retrieved.

    Perhaps we could also store the unexpanded inputs and add a rerun flag that says to use the unexpanded set.

    Update: I'm still not sure, but the current approach is to store unexpanded unless the --expand option is used.

Note: #2432 mentioned --input=. as meaning all annex files should be present, but I didn't end up adding that because it's possible with globbing (--input='*').

re: #2158, #2432

We'd need something like feature described in dataladgh-1424 to address this.
The next commit will add a `get` call to run, and `get` sets the
message to a tuple.
Allow the caller to specify files that we should obtain with 'git
annex get' before the run.  In addition to data files, this input
could be a singularity container (as described in dataladgh-2158).

There are at least a couple of decisions here that should be given
more thought:

  * The path to --input must be relative to the root of the dataset.

  * If a glob is given as the input value, it is expanded at the time
    of the run.  As an example, if '*' was used to indicate all annex
    files, a rerun would consider the inputs to be all of the annex
    files at the time of the initial run.

    Expansion during the initial run seems appropriate because most of
    the time re-executing a command should only require those inputs.
    However, it's possible that the caller would want to rerun a
    command that uses globs itself to get its input files, in which
    case files that were added after the initial run wouldn't be
    retrieved.

    Perhaps we could also store the unexpanded inputs and add a rerun
    flag that says to use the unexpanded set.

Re: datalad#2432
@kyleam kyleam added the WIP work in progress label Apr 30, 2018
@codecov
Copy link

codecov bot commented Apr 30, 2018

Codecov Report

Merging #2464 into master will decrease coverage by 8.32%.
The diff coverage is 52.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2464      +/-   ##
==========================================
- Coverage   88.99%   80.66%   -8.33%     
==========================================
  Files         283      240      -43     
  Lines       33146    29629    -3517     
==========================================
- Hits        29499    23901    -5598     
- Misses       3647     5728    +2081
Impacted Files Coverage Δ
datalad/tests/test_utils.py 95.9% <100%> (-0.13%) ⬇️
datalad/utils.py 87.1% <100%> (-1.66%) ⬇️
datalad/support/annexrepo.py 86.95% <100%> (-2.38%) ⬇️
datalad/interface/rerun.py 86.39% <100%> (-6.02%) ⬇️
datalad/support/tests/test_annexrepo.py 95.31% <100%> (-1.74%) ⬇️
datalad/interface/tests/test_run.py 58.76% <20.48%> (-40.95%) ⬇️
datalad/interface/run.py 87.87% <63.63%> (-12.13%) ⬇️
datalad/tests/utils_testdatasets.py 11.76% <0%> (-88.24%) ⬇️
datalad/interface/tests/test_annotate_paths.py 24% <0%> (-76%) ⬇️
datalad/metadata/tests/test_aggregation.py 25.69% <0%> (-74.31%) ⬇️
... and 143 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9017b45...c663104. Read the comment docs.

@mih
Copy link
Member

mih commented May 7, 2018

Thanks! First comment:

I think we should stick to the expansion style that annex is using, and that we are relying on elsewhere too. Otherwise we break the paradigm (for no obvious reason). Hence . would refer to any file/directory in the entire repo (but not subdatasets). With globbing it would be hard to say: all this, but only until you hit a dataset boundary.

Also we should be careful with expanding the inputs. . would mean 'everything', but expanded it would mean 'everything that was at that time' -- which would have implications on the semantics of rerun --onto....

@kyleam
Copy link
Contributor Author

kyleam commented May 7, 2018

I think we should stick to the expansion style that annex is using, and that we are relying on elsewhere too. Otherwise we break the paradigm (for no obvious reason).

My reason is that the ability to glob was requested in gh-2432. What's currently possible is to glob via annex's --include. I'm not against adding "--input=. -> get('.')", but I viewed it as redundant because, if we have globbing, we have --input='*'. As long as --input is simply mapping to git annex get (or more generally, just annexed files), "." and --include='*' are equivalent, I think.

Hence . would refer to any file/directory in the entire repo (but not subdatasets). With globbing it would be hard to say: all this, but only until you hit a dataset boundary.

The globbing is documented as "what would git annex find --include" do, so by that definition it doesn't cross dataset boundaries. Perhaps we shouldn't limit globs to the current dataset, but given that ATM rerunning with anything other than HEAD^ doesn't work (in the sense that the subdataset state is not adjusted using something like gh-1424), I don't view making globs work across subdatasets as a priority.

At any rate, I'll plan to add --input=. for consistency.

Also we should be careful with expanding the inputs. . would mean 'everything', but expanded it would mean 'everything that was at that time' -- which would have implications on the semantics of rerun --onto....

Yes, I agree. For globbing in general, I tried to touch on this in my opening comment (and the message of d5d00f2). As I mentioned there, I see the stricter "expand at the initial run time" as the better default, but I can imagine situations where the caller would want the "expand on each rerun" behavior.

So

  • should --input=. remain unexpanded while globs are expanded?

  • should there be an additional flag to say "don't expand my globs"?

  • something else?

kyleam added 13 commits May 7, 2018 11:15
We'll use this for --output as well.
An upcoming commit will want lists from partition again, so let's
avoid repeating this pattern.
This doesn't matter for inputs, but for outputs we'll need to do this
check before unlocking them, so we might as well move the inputs block
to keep inputs/outputs processing close together.
This also avoids the false warning of no matching files on rerun.
Right now, we always expand at run time, so there is no point in
calling _resolve_files.
While --input maps to 'git annex get', --output maps to 'git annex
unlock' or 'git rm', depending on whether content is present.  We
can't unconditionally unlock because that leads to an error if content
isn't present (dataladgh-2432).

We're expanding the globs on the initial run in the same way we do
with --input.  As discussed in d5d00f2, this should be revisited
because there are cases for both --input and --output where the caller
would want to store the unexpanded patterns and re-glob on each rerun.

Re: datalad#2432
Unlike the previous version, this will work with annex files in
subdatasets, and it's simpler because we let `remove` handle the annex
checks.
This pattern is no longer used in multiple places.
When content isn't present for a file that was added/modified in a
revision, it'd be better to remove the file rather than trying to
unlock it, which fails.  But if we do that, we should make sure we
process inputs first since that step may retrieve an added/modified
file (perhaps via --input=.).  To do both these things, piggyback on
run's outputs argument.
@mih
Copy link
Member

mih commented May 12, 2018

Playing with the beast. Observations:

In a plain Git repo:

% datalad run --input . bash -c "ls -l dicoms > this"
get(notneeded): . [no dataset annex, content already present]
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): this (file)
save(ok): /tmp/func (dataset)
action summary:
  add (ok: 1)
  get (notneeded: 1)
  save (ok: 1)
(datalad3-dev) mih@meiner /tmp/func (git)-[master] % datalad run --input '*' bash -c "ls -l dicoms > that"
[ERROR  ] 'GitRepo' object has no attribute 'is_under_annex' [run.py:<lambda>:153] (AttributeError) 

The latter code path assumes the presence of an AnnexRepo.

Continuing:

% git annex init
init  ok
(recording state in git...)
(datalad3-dev) mih@meiner /tmp/func (git)-[master] % datalad run --input '*' bash -c "ls -l dicoms > that"
[WARNING] No matching files found for --input 

All files are tracked in Git directly, hence nothing is there that concerns annex, but still the message is misleading in this case.

Here I have a folder will 5461 files under annex:

% datalad run --input annexdicoms bash -c "ls -l annexdicoms > that3"
[ERROR  ] AssertionError() [annexrepo.py:info:2518] (AssertionError) 
datalad run --input annexdicoms bash -c "ls -l annexdicoms > that3"  8.92s user 0.99s system 107% cpu 9.258 total

It aborts before anything happens (10s for checking something). Repo stays clean, no added commit, no untracked stuff. The corresponding fire-and-forget get call:

% datalad get annexdicoms
get(notneeded): /tmp/func/annexdicoms (directory) [nothing to get from /tmp/func/annexdicoms]
datalad get annexdicoms  4.82s user 0.43s system 107% cpu 4.872 total

That is half the time (and probably still slower than what it could be doing). Now the run call with --dbg:

% datalad --dbg run --input annexdicoms bash -c "ls -l annexdicoms > that3"
Traceback (most recent call last):
  File "/home/mih/env/datalad3-dev/bin/datalad", line 8, in <module>
    main()
  File "/home/mih/hacking/datalad/git/datalad/cmdline/main.py", line 378, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/mih/hacking/datalad/git/datalad/interface/base.py", line 420, in call_from_parser
    ret = list(ret)
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 414, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/mih/hacking/datalad/git/datalad/interface/utils.py", line 482, in _process_results
    for res in results:
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 142, in __call__
    message=message):
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 209, in run_command
    inputs = _resolve_files(ds, inputs)
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 154, in _resolve_files
    globs, files = list(globs), list(files)
  File "/home/mih/hacking/datalad/git/datalad/utils.py", line 669, in <genexpr>
    return ((item for pred, item in a if not pred),
  File "/home/mih/hacking/datalad/git/datalad/utils.py", line 668, in <genexpr>
    a, b = tee((predicate(item), item) for item in items)
  File "/home/mih/hacking/datalad/git/datalad/interface/run.py", line 153, in <lambda>
    lambda f: dset.repo.is_under_annex([f], batch=True)[0])
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 300, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/annexrepo.py", line 1865, in is_under_annex
    info = self.info(files, normalize_paths=False, batch=batch)
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 300, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/annexrepo.py", line 2518, in info
    assert(j.pop('file') == f)
AssertionError

> /home/mih/hacking/datalad/git/datalad/support/annexrepo.py(2518)info()
-> assert(j.pop('file') == f)
(Pdb) print(j)
{'local annex size': '115317918', 'size of annexed files in working tree': '115317918', 'numcopies stats': [['+0', 5460]], 'combined size of repositories containing these files': '115317918', 'command': 'info annexdicoms', 'success': True, 'local annex keys': 5460, 'annexed files in working tree': 5460, 'repositories containing these files': [{'here': True, 'size': '115317918', 'uuid': '104deefb-ace4-4bf4-9b1a-3fc451200ec7', 'description': 'mih@meiner:/tmp/func'}], 'directory': 'annexdicoms'}

So it wants 'annexdicoms/*' (which is OK, but it makes me think that supporting . and not this is inconsistent. So we might want to drop . again (but see below). Another attempt:

% datalad run --input 'annexdicoms/*' bash -c "ls -l annexdicoms > that3"

<snip 5461 get(notneeded) results that we might want to filter out>

[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
[ERROR  ] [Errno 7] Argument list too long: 'git' [subprocess.py:_execute_child:1344] (OSError)                                                                                              
datalad run --input 'annexdicoms/*' bash -c "ls -l annexdicoms > that3"  10.68s user 0.90s system 114% cpu 10.119 total

I guess argument expansion hit.

% git st
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   that3

It managed get and staged the result, but then breaks at:

['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'commit', '-m', '[DATALAD RUNCMD] bash -c ls -l annexdicoms > that4\n\n=== Do not change lines below ===\n{\n "cmd": [\n  "bash",\n  "-c",\n  "ls -l annexdicoms > that4"\n ],\n "exit": 0,\n "chain": [],\n "inputs": [\n  "annexdicoms/MR.1.3.46.670589.11.38317.5.0.4476.2014042516042547586",\n  "annexdicoms/MR.1.3.46.670589.11.38317.5.0.4476.2014042516042548588",\n 

<5k more lines>

So no argument expansion per se, but if we want to go down this path, we would need to write the commit message into a file first and then ingest it via git commit -F.

But I am not sure at all if that is a good idea. I have real datasets with >200k annex files that are all input into an analysis. In my DICOM case they maybe be spit into ~50k chunks per directory (one acquisition session). So I would routinely end up with 50k line commit messages for saying "I need this directory" -- if aggregated metadata hits such size, we are already putting it in annex ;-)

Thanks @kyleam for pushing this forward!

This test (introduced by this series) will fail when this branch is
merged to master because dcdbf65 removes add's commit argument.
@kyleam
Copy link
Contributor Author

kyleam commented May 12, 2018

Thanks for giving this a test drive @mih. I'll need to digest/work through your comments, but there are clearly issues with my initial attempt at implementing this :/

In a plain git repo, this avoids (1) displaying a misleading "No
matching files..."  warning when a glob is passed to --input and (2)
running an unnecessary 'datalad get .' when '.' is passed to --input.

Re: datalad#2464 (comment)
We want to be able to give a directory to --input.  Doing so allows a
command to depend on a potentially large number of files without
depending on everything with "." or globbing with --input='subdir/*',
which stores all the subdirectory's file names in the commit
message (which isn't pleasant to look at in the log and will currently
fail if the message reaches a certain length).

AnnexRepo.is_under_annex, however, fails when passed a directory
because it calls AnnexRepo.info, which raises an AssertionError.  It's
not clear what "under annex" should mean for a directory (contains one
or more annexed files?), but at the least AnnexRepo.info should
probably return information for the directory rather than raising an
AssertionError.

So avoid passing directories to is_under_annex.  If is_under_annex /
info are adjusted to handle directories, it might make sense to modify
_resolve_files to return a directory only when it contains an annexed
file.

Re: datalad#2464 (comment)
@kyleam
Copy link
Contributor Author

kyleam commented May 14, 2018

OK, so the issues I took away from @mih's feedback were

  1. --input fails in plain git repos

  2. the "no matching" warning is misleading in plain git repos

  3. passing a directory to --input fails

  4. globbing can lead to commit messages that exceed the accepted command line length

The first three should be fixed. For (3), passing a directory to --input should now translate directly to datalad get <directory> in the same way that --input=. maps to datalad get ..

The fourth should be less of an issue now because, for a subdirectory with lots of files, the subdirectory name itself can now be passed. But we should probably still address it by either using git commit -F if needed or by just giving a more informative error message.

@mih
Copy link
Member

mih commented May 15, 2018

Thanks @kyleam !

I'll try to explore it a little more over the next two days and get back to you shortly.

The main motivation for this change is that (1) globbing isn't
restricted to the current dataset, (2) passing the pattern to
glob.glob respects the current working directory rather than the
top-level of the repo, and (3) we might want to make --input/--output
about more than just getting/unlocking files, so we may not want to
couple the globbing to 'git annex find'.
Globbing can quickly result in really long commit messages, so store
the unexpanded form by default.  The next commit will add an option to
store the expanded globs.
@kyleam
Copy link
Contributor Author

kyleam commented May 16, 2018

The latest push is a bit rough and a few tests are failing for reasons I don't yet understand. I'll will dig into it more tomorrow.

When reruning, `outputs` is used to unlock the added/modified files.
But with --onto, these might not be present.  Don't issue a warning if
the glob doesn't match in this case because it's expected and the
output isn't a user-supplied output.

One issue with this solution is that we no longer warn if
user-supplied output globs don't match on rerun.  In order to do this,
we'd need to decouple automated and user-supplied outputs.
If a dataset is given on the command line (or when run is called as a
dataset method), expand globs relative to the top-level of the
dataset.  This follows the convention that datalad uses for path
handling.

The large test diff is due to an indentation change; we no longer need
a chpwd block.
@kyleam
Copy link
Contributor Author

kyleam commented May 16, 2018

The latest push is a bit rough and a few tests are failing for reasons I don't yet understand.

Should now be addressed.


The main changes in the recent pushes

  • globbing is now done with python's glob.glob rather than git annex find --include=*

  • inputs and outputs are stored unexpanded in the log message unless run's --expand option is used.

@mih
Copy link
Member

mih commented May 17, 2018

image

Feel free to merge this any time.

@kyleam
Copy link
Contributor Author

kyleam commented May 17, 2018

I gonna go with "ballerina crossword", I guess?

@kyleam kyleam removed the WIP work in progress label May 18, 2018
@mih mih merged commit da2c167 into datalad:master May 18, 2018
@mih
Copy link
Member

mih commented May 18, 2018

BAMM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants