New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH+BF: run-procedure #2905

Merged
merged 19 commits into from Oct 16, 2018

Conversation

Projects
None yet
4 participants
@bpoldrack
Member

bpoldrack commented Oct 8, 2018

Fixes and additions to run-procedure:

  • run-procedure required the -d parameter to discover procedures within a dataset even if called from within the dataset. That's not intuitive and I changed that. When called from within a dataset it will use it (except when passing in sth else via -d of course).
  • BF: files that are neither executable nor end with .py or .sh but are in a configured look up location for procedures caused the command to crash. Fixed that.
  • procedures can now recursively be discovered in subdatasets as well. The uppermost has highest priority
  • added two configs for procedures: datalad.procedures.NAME.call-format and datalad.procedures.NAME.help. The former replaces (if given) the default, which is based on discovered file type. This allows not only for changing order or even execute different scripts than just python and bash, but also for usage of run substitutions.
    The latter is queried by datalad run-procedure --help-proc NAME, which yields the value of this config as message. So we can provide description, usage info etc. on a procedure.
  • proper 'impossible' result when procedure not found (instead of ValueError)

Please have a look @datalad/developers

bpoldrack added some commits Oct 8, 2018

BF: run_procedure on datasets
- discovery failed when procedure location contains additional files
- discovery used to need -d parameter even when called from within a
dataset. Otherwise it would ignore that very dataset's procedures

@bpoldrack bpoldrack force-pushed the bpoldrack:enh-run-procedure branch from 3a70275 to fdcf482 Oct 8, 2018

@codecov

This comment has been minimized.

codecov bot commented Oct 8, 2018

Codecov Report

Merging #2905 into master will decrease coverage by <.01%.
The diff coverage is 85.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2905      +/-   ##
==========================================
- Coverage   90.29%   90.28%   -0.01%     
==========================================
  Files         246      246              
  Lines       31907    32013     +106     
==========================================
+ Hits        28809    28904      +95     
- Misses       3098     3109      +11
Impacted Files Coverage Δ
datalad/interface/tests/test_run_procedure.py 100% <100%> (ø) ⬆️
datalad/interface/run_procedure.py 84.72% <73.68%> (-3.7%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ee236c...a392291. Read the comment docs.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Oct 8, 2018

Description is quite non descriptive...

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 8, 2018

There's more to come. I'll add a description.

@bpoldrack bpoldrack force-pushed the bpoldrack:enh-run-procedure branch from 7d32688 to 1f83770 Oct 10, 2018

bpoldrack added some commits Oct 10, 2018

@bpoldrack bpoldrack force-pushed the bpoldrack:enh-run-procedure branch from cf2aaf2 to 1013139 Oct 11, 2018

bpoldrack added some commits Oct 11, 2018

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 11, 2018

FTR: For some reason the creation of the test dataset (all three tests) fails to correctly commit if on windows and in direct mode. Works in direct mode on linux and in v6 on windows though. Changed the skipping accordingly.

@kyleam : There's one thing I'm currently not sure about. Should we expose inputs and outputs for run-procedure and pass it on to run or do you see any reason to not do that? While a procedure can do some (un-)locking and getting from within, I wonder whether we should allow to use the mechanism that is already there.

@mih: Anything in mind we need, that I may have missed?

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 11, 2018

@kyleam

This comment has been minimized.

Member

kyleam commented Oct 11, 2018

@kyleam : There's one thing I'm currently not sure about. Should we expose inputs and outputs for run-procedure and pass it on to run or do you see any reason to not do that? While a procedure can do some (un-)locking and getting from within, I wonder whether we should allow to use the mechanism that is already there.

How would the inputs and outputs of a procedure be specified?

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 11, 2018

How would the inputs and outputs of a procedure be specified?

Just the same way. I thought about providing those parameters for run-procedure as well and just pass them on to run. This would lack a possibility to define them via that call format string, but at least allow to use them with the actual call to run-procedure and possibly use {{inputs}} and {{outputs}} in the format string. But I may miss something here.

@bpoldrack bpoldrack force-pushed the bpoldrack:enh-run-procedure branch 2 times, most recently from 83b97e4 to bbc820a Oct 11, 2018

@kyleam

This comment has been minimized.

Member

kyleam commented Oct 11, 2018

Just the same way. I thought about providing those parameters for run-procedure as well and just pass them on to run.

OK.

This would lack a possibility to define them via that call format string,

Yeah, that's one reason why I'm wondering what specific mechanism you have in mind. Also, there's the interaction with proc_{pre,post}.

but at least allow to use them with the actual call to run-procedure and possibly use {{inputs}} and {{outputs}} in the format string. But I may miss something here.

It's quite possible I'm the one missing something :] Here's the main thing I'm confused about: I think of procedures as these sort of canned routines that the caller doesn't need to know much about aside from their name and general purpose. Ignoring proc_{pre,post}, it seems like run_procedure callers would need to have an understanding of a procedure's innards in order to specify the inputs and outputs.

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 11, 2018

Also, there's the interaction with proc_{pre,post}.

it seems like run_procedure callers would need to have an understanding of a procedure's innards in order to specify the inputs and outputs.

Actually, those are two good arguments to not make things more confusing without an actual necessity. Initially I thought: Well, if it doesn't really cost something, why hiding it away? But then this might be confusing and not perceived as an additional option but the standard way to go (which it shouldn't be). And it might indeed lead to a mess with proc_{pre, post}, where we can't easily decide about a paradigm on what to record in what commit under what circumstances.
Long story short: Thinking about it again, I guess we should not do that as long as we don't have a case where it's actually needed.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Oct 11, 2018

re travis failure

reading sources... [100%] usecases/simple_provenance_tracking                   
Warning, treated as error:
/home/travis/build/datalad/datalad/datalad/interface/run_procedure.py:docstring of datalad.api.run_procedure:66:Unexpected indentation.
make: *** [html] Error 2
make: Leaving directory `/home/travis/build/datalad/datalad/docs'
The command "if [ "${DATALAD_LOG_LEVEL:-}" != 2 -a -z "${BUILD_DATALAD_EXTENSION:-}" ]; then PYTHONPATH=$PWD $NOSE_WRAPPER make -C docs html doctest; fi" exited with 2.

@bpoldrack bpoldrack force-pushed the bpoldrack:enh-run-procedure branch from bbc820a to 5dbeb3c Oct 11, 2018

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 11, 2018

Yes, I noticed. I'm amending like hell to figure what fx$!% whitespace crashes the beast. ;-)

bpoldrack added some commits Oct 11, 2018

@bpoldrack bpoldrack changed the title from [WIP] ENH+BF: run-procedure to ENH+BF: run-procedure Oct 12, 2018

@kyleam kyleam self-requested a review Oct 12, 2018

@mih mih requested review from mih and kyleam and removed request for kyleam Oct 12, 2018

@mih

Looks and feels good.

--discover is not robust to dangling symlinks. Here is what happens with an annexed procedure executable. Those should be reported as missing or not reported at all,IMHO.

% datalad run-procedure --discover
[ERROR  ] [Errno 2] No such file or directory: '/tmp/sup/toolbox/code/dummy.sh' [run_procedure.py:_guess_exec:133] (FileNotFoundError) 
(datalad3-dev) mih@meiner /tmp/sup (git)-[master] % ll toolbox/code
total 4.0K
lrwxrwxrwx 1 mih mih 117 Oct 12 16:09 dummy.sh -> ../.git/annex/objects/Kz/JP/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.sh/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.sh

It is not just --discover that crashes, but also --help-proc.

Other than that is works as advertised. I focused on the "toolbox subdataset" use case in my manual exporations.

@kyleam

Works nicely from my interactive testing. Thanks.

Left minor comments/requests.

if isinstance(v, tuple):
# If multiple values are defined (at the same level), ConfigManager
# returns a tuple. Go with the last one.
# TODO: Should we issue a warning or fail entirely?

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

I think ConfigManager should be updated to return a single value. I suspect few if any of the existing callers handle a tuple. For this PR, I think the current handling is fine.

This comment has been minimized.

@bpoldrack

bpoldrack Oct 12, 2018

Member

I don't think always returning a single value will be the solution. ConfigManager is explicitly made to be able to deal with multiple definitions of the same config item at the same level. But we probably should somehow be able to tell ConfigManager that for a particular item we only accept one value. However, I agree: This exceeds this PR.

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

I don't think always returning a single value will be the solution

Yes, you're right, things aren't so simple and some callers do expect multiple values.

But we probably should somehow be able to tell ConfigManager that for a particular item we only accept one value.

Sounds good. Based on frequency of use by our current callers, I wonder if "single value" should be the default and we should have a separate method for multiple items.

Anyway, I do have a question related to this PR: Your comment says "If multiple values are defined (at the same level), ConfigManager returns a tuple". I'm confused about what the "level" is here. I don't think it is related to config location (global, .git/config, .datalad/config,...) because settings don't need to be at the same level to be represented as a tuple:

$ git config --global datalad.testconfig false                                                  
$ git config datalad.testconfig false                                                          
$ python -c "import datalad.api as dl; print(dl.Dataset('.').config.get('datalad.testconfig'))"
('false', 'false')                 

So the only way I can think of to interpret "at the same level" as meaning "this level in the dataset hierarchy".

P.S. I like this example because a naive caller would get say if config.get(...) and unexpectedly get something that evaluates to true even though both the global and local setting is false. getbool also fails to handle it:

$ python -c "import datalad.api as dl; print(dl.Dataset('.').config.getbool('datalad', 'testconfig'))"
Traceback (most recent call last):                                                                                                     
  File "<string>", line 1, in <module>
  File "/home/kyle/src/python/datalad/datalad/config.py", line 473, in getbool
    return anything2bool(val)
  File "/home/kyle/src/python/datalad/datalad/config.py", line 113, in anything2bool
    % repr(val))
TypeError: Got value ('false', 'false') which could not be interpreted as a boolean

This comment has been minimized.

@bpoldrack

bpoldrack Oct 12, 2018

Member

I don't think it is related to config location (global, .git/config, .datalad/config,...)

It is related. What I mean by "level" is the hierarchy of things we look at to find the (or a) procedure. In your case it's both "user level": First goes into ~/.gitconfig, second into .git/config. Other levels would be "system", "dataset" (.datalad/config, which comes with the dataset in opposition to .git/config), datalad itself and last any datalad extension.

Re your PS:
That's a good catch and definitely sth to fix!

This comment has been minimized.

@bpoldrack

bpoldrack Oct 12, 2018

Member

Just for clarification: Those are the levels to be searched for procedures by run-procedure and also the levels of consideration by ConfigManager when figuring what to return. And this again is related to the change of order you noticed. They are consistent now. (But I still should write down the rational for that particular order)

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

Thanks for clarifying. I think your mapping between "procedure levels" and "config levels" breaks down because AFAICS ConfigManager (as a getter) only distinguishes two levels: dataset (".datalad/config") versus "everything else git config will report".

If I have a setting in /etc/gitconfig, ~/.gitconfig, and $REPO/.git/config, ConfigManager reports

$ python -c "import datalad.api as dl; print(dl.Dataset('.').config.get('datalad.test'))"
('etc-config', 'home-config', 'repo-config')

By procedure level standards, that includes both "system" and "user" levels.

This comment has been minimized.

@bpoldrack

bpoldrack Oct 15, 2018

Member

True, but this reporting by git-config still happens in a particular order (more general to more specific). So, this is the reason I went with the last entry in case of multiple values returned.

I'll try to articulate this idea more clearly in comment/docstring

Show resolved Hide resolved datalad/interface/run_procedure.py
Show resolved Hide resolved datalad/interface/tests/test_run_procedure.py
ds.add('.')
# V6FACT: but this finally commits it
ds.save()
# TODO remove above two lines

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

Again, I know this is just copied, and I don't think it needs to be addressed in this PR, but based on quick testing, v6 mode now appears to behave as expected without these lines (presumably either due to an update in annex or a fix on our end).

This comment has been minimized.

@bpoldrack

bpoldrack Oct 12, 2018

Member

Well, I should probably try to remove it within this PR. It will hang around there for quite some while otherwise, I guess.

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

It will hang around there for quite some while otherwise, I guess.

I've noted it, so I would look into it once this PR lands, but of course I don't mind you looking into it now.

@@ -130,6 +170,14 @@ class RunProcedure(Interface):
and subsequently in the 'resources/procedures/' directories of any
installed extension, and, lastly, of the DataLad installation itself.
Please note, that a dataset that defines

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

s/,//

Show resolved Hide resolved datalad/interface/run_procedure.py
yield m
for m, n in _get_file_match(op.join(ds.path, dir), name):
yield (m,) + _get_proc_config(n, ds=ds)
# 1.1. check subdatasets recursively

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

s/1.1/2.1/ ?

This comment has been minimized.

@bpoldrack

bpoldrack Oct 12, 2018

Member

Right. Thanks for spotting.

Show resolved Hide resolved datalad/interface/run_procedure.py
Show resolved Hide resolved datalad/interface/run_procedure.py
args=('--help-proc',),
action='store_true',
doc="""if given, get a help message for procedure NAME from config
setting datalad.procedures.NAME.help. [CMD: Note, that this

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

The CMD note is true for other options too and isn't worth mentioning IMO.

This comment has been minimized.

@bpoldrack

bpoldrack Oct 12, 2018

Member

Yes and no. While it's true for -d and --discover as well, -d probably is hardly ever used and it's kind of intuitive to use --discover only without a spec argument (and therefore not run into the issue). run-procedure XXX --help-proc however looks perfectly reasonable but might lead to a completely unexpected and non-telling failure (or even not a failure at all depending on what XXX does with that argument). So, if anything we should mention it for the options, too. It might not be that obvious at a first glance that the procedures name will eat up everything following, although it should be clear once you actually think about it.

This comment has been minimized.

@kyleam

kyleam Oct 12, 2018

Member

None of that convinces me it is worth mentioning, but of course not thinking it's worth mentioning also means I don't particularly care if it is mentioned :]

BF: Don't crash on broken symlinks
When called with --discover or --help, don't fail because a potential
procedure is a broken symlink. Report it absent instead and report what
we still can figure.
@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 15, 2018

Re broken symlinks:
@mih: Please have a look at 1ff10fe

bpoldrack added some commits Oct 15, 2018

@bpoldrack bpoldrack force-pushed the bpoldrack:enh-run-procedure branch from 7c106c9 to 655d804 Oct 15, 2018

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 15, 2018

@kyleam : I hope changes of today clarify things sufficiently. Otherwise don't hesitate to further complain. ;-)

@kyleam

kyleam approved these changes Oct 15, 2018

Thanks for your updates, @bpoldrack

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 15, 2018

Thanks for having a closer look, @kyleam!

@mih

mih approved these changes Oct 15, 2018

@bpoldrack bpoldrack merged commit 56fca93 into datalad:master Oct 16, 2018

8 checks passed

WIP ready for review
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
datalad-pr-dl-osx-64 DEV build done.
Details
datalad-pr-docker-dl-nd14_04 DEV build done.
Details
datalad-pr-docker-dl-nd16_04 DEV build done.
Details
datalad-pr-docker-dl-nd80 DEV build done.
Details
datalad-pr-docker-dl-nd90 DEV build done.
Details

@yarikoptic yarikoptic modified the milestone: Release 0.10.4 Oct 22, 2018

yarikoptic added a commit that referenced this pull request Oct 24, 2018

Merge tag '0.11.0' into debian
 Upgrade of [git-annex] to the most recent available to your release is
 advisable since a number of issues were resolved at that level.

 ### Major refactoring and deprecations

 - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor
   of `datalad.locations.default-dataset` [configuration] variable
   ([#2835])

 ### Minor refactoring

 - `"notneeded"` messages are no longer reported by default results
   renderer
 - [run] no longer shows commit instructions upon command failure when
   `explicit` is true and no outputs are specified ([#2922])
 - `get_git_dir` moved into GitRepo ([#2886])
 - `_gitpy_custom_call` removed from GitRepo ([#2894])
 - `GitRepo.get_merge_base` argument is now called `commitishes` instead
   of `treeishes` ([#2903])

 ### Fixes

 - [update] should not leave the dataset in non-clean state ([#2858])
   and some other enhancements ([#2859])
 - Fixed chunking of the long command lines to account for decorators
   and other arguments ([#2864])
 - Progress bar should not crash the process on some missing progress
   information ([#2891])
 - Default value for `jobs` set to be `"auto"` (not `None`) to take
   advantage of possible parallel get if in `-g` mode ([#2861])
 - [wtf] must not crash if `git-annex` is not installed etc ([#2865]),
   ([#2865]), ([#2918]), ([#2917])
 - Fixed paths (with spaces etc) handling while reporting annex error
   output ([#2892]), ([#2893])
 - `__del__` should not access `.repo` but `._repo` to avoid attempts
   for reinstantiation etc ([#2901])
 - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid
   added submodules being non git-annex friendly ([#2909]), ([#2904])
 - [run-procedure] ([#2905])
   - now will provide dataset into the procedure if called within dataset
   - will not crash if procedure is an executable without `.py` or `.sh`
     suffixes
 - Use centralized `.gitattributes` handling while setting annex backend
   ([#2912])
 - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative
    paths when called more than once ([#2921])

 ### Enhancements and new features

 - Report progress on [clone] when installing from "smart" git servers
   ([#2876])
 - Stale/unused `sth_like_file_has_content` was removed ([#2860])
 - Enhancements to [search] to operate on "improved" metadata layouts
   ([#2878])
 - Output of `git annex init` operation is now logged ([#2881])
 - New
   - `GitRepo.cherry_pick` ([#2900])
   - `GitRepo.format_commit` ([#2902])
 - [run-procedure] ([#2905])
   - procedures can now recursively be discovered in subdatasets as well.
     The uppermost has highest priority
   - Procedures in user and system locations now take precedence over
     those in datasets.

* tag '0.11.0':
  Make it a 0.11.0 release since there were some API RFings
  REL: slight tune up to Changelog following the advices
  DOC: v0.10.4: Mention change in procedure precedence (a0cbcba)
  DOC: v0.10.4: Fix description of db715b7
  DOC: v0.10.4: Improve description of 6f615a4
  DOC: v0.10.4: Remove duplicate word

yarikoptic added a commit that referenced this pull request Oct 24, 2018

Merge tag '0.11.0' into debian
 ## 0.11.0 (Oct 23, 2018) -- Soon-to-be-perfect

 [git-annex] 6.20180913 (or later) is now required - provides a number of
 fixes for v6 mode operations etc.

 ### Major refactoring and deprecations

 - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor
   of `datalad.locations.default-dataset` [configuration] variable
   ([#2835])

 ### Minor refactoring

 - `"notneeded"` messages are no longer reported by default results
   renderer
 - [run] no longer shows commit instructions upon command failure when
   `explicit` is true and no outputs are specified ([#2922])
 - `get_git_dir` moved into GitRepo ([#2886])
 - `_gitpy_custom_call` removed from GitRepo ([#2894])
 - `GitRepo.get_merge_base` argument is now called `commitishes` instead
   of `treeishes` ([#2903])

 ### Fixes

 - [update] should not leave the dataset in non-clean state ([#2858])
   and some other enhancements ([#2859])
 - Fixed chunking of the long command lines to account for decorators
   and other arguments ([#2864])
 - Progress bar should not crash the process on some missing progress
   information ([#2891])
 - Default value for `jobs` set to be `"auto"` (not `None`) to take
   advantage of possible parallel get if in `-g` mode ([#2861])
 - [wtf] must not crash if `git-annex` is not installed etc ([#2865]),
   ([#2865]), ([#2918]), ([#2917])
 - Fixed paths (with spaces etc) handling while reporting annex error
   output ([#2892]), ([#2893])
 - `__del__` should not access `.repo` but `._repo` to avoid attempts
   for reinstantiation etc ([#2901])
 - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid
   added submodules being non git-annex friendly ([#2909]), ([#2904])
 - [run-procedure] ([#2905])
   - now will provide dataset into the procedure if called within dataset
   - will not crash if procedure is an executable without `.py` or `.sh`
     suffixes
 - Use centralized `.gitattributes` handling while setting annex backend
   ([#2912])
 - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative
    paths when called more than once ([#2921])

 ### Enhancements and new features

 - Report progress on [clone] when installing from "smart" git servers
   ([#2876])
 - Stale/unused `sth_like_file_has_content` was removed ([#2860])
 - Enhancements to [search] to operate on "improved" metadata layouts
   ([#2878])
 - Output of `git annex init` operation is now logged ([#2881])
 - New
   - `GitRepo.cherry_pick` ([#2900])
   - `GitRepo.format_commit` ([#2902])
 - [run-procedure] ([#2905])
   - procedures can now recursively be discovered in subdatasets as well.
     The uppermost has highest priority
   - Procedures in user and system locations now take precedence over
     those in datasets.

* tag '0.11.0':
  CHANGELOG adjusted to reflect new minimal version of git-annex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment