
RF: get() without annotate_paths() #3746

Merged (9 commits) on Oct 9, 2019

Conversation

@mih (Member) commented Oct 2, 2019

This tries to be a gentle RF that doesn't change (much) of the original
behavior (see test diff). However, a change in-line with #3742 is
unavoidable (see removed test).

This is bringing path semantics in line with status/subdatasets, and is
a precondition to fixing #3469

Ping #3368

@yarikoptic (Member) commented:

Removed test is test_autoresolve_multiple_datasets. So it will not be possible to get sub*/anat if every subject is a separate sub dataset?

@mih (Member, Author) commented Oct 2, 2019

> Removed test is test_autoresolve_multiple_datasets. So it will not be possible to get sub*/anat if every subject is a separate sub dataset?

It is possible and tested, as long as it is running in a dataset that is parent to all.

Comment on lines -251 to +263
-    def subds_result_filter(res):
-        return res.get('status') == 'ok' and res.get('type') == 'dataset'
-
     # figuring out what dataset to start with, --contains limits --recursive
     # to visit only subdataset on the trajectory to the target path
     subds_trail = ds.subdatasets(contains=path, recursive=True,
                                  on_failure="ignore",
-                                 result_filter=subds_result_filter)
+                                 result_filter=is_ok_dataset)
Contributor:

When adding f634bea (TST+BF: Update subdatasets(contains=...) calls for previous commit, 2019-10-01), I thought a function like is_ok_dataset might exist, but failed to find it. Thanks for cleaning this up.
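For readers of the diff, the behavior of `is_ok_dataset` can be reconstructed from the inline filter it replaces. This is an assumption about the helper (its actual location and definition in DataLad are not shown here), sketched for reference:

```python
def is_ok_dataset(res):
    """Result filter: keep only 'ok' results that refer to a dataset.

    Reconstructed from the inline subds_result_filter removed in the
    diff above; the real helper's home in DataLad is assumed, not shown.
    """
    return res.get('status') == 'ok' and res.get('type') == 'dataset'
```

A result record like `{'status': 'ok', 'type': 'dataset', ...}` passes; any other status or type is filtered out.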

@codecov bot commented Oct 3, 2019

Codecov Report

Merging #3746 into master will decrease coverage by 7.49%.
The diff coverage is 74.22%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #3746     +/-   ##
=========================================
- Coverage   82.94%   75.45%   -7.5%     
=========================================
  Files         273      273             
  Lines       35943    35895     -48     
=========================================
- Hits        29813    27083   -2730     
- Misses       6130     8812   +2682
Impacted Files Coverage Δ
datalad/metadata/aggregate.py 58.16% <ø> (ø) ⬆️
datalad/core/local/run.py 60.57% <0%> (-34.14%) ⬇️
datalad/interface/results.py 85.24% <0%> (-9.02%) ⬇️
datalad/interface/tests/test_rerun.py 80.36% <100%> (-19.64%) ⬇️
datalad/core/local/tests/test_run.py 85.99% <100%> (-13.53%) ⬇️
datalad/metadata/metadata.py 87.81% <100%> (-0.28%) ⬇️
datalad/distribution/tests/test_get.py 100% <100%> (ø) ⬆️
datalad/distribution/tests/test_uninstall.py 99.68% <100%> (-0.01%) ⬇️
datalad/interface/tests/test_annotate_paths.py 100% <100%> (ø) ⬆️
datalad/distribution/tests/test_install.py 99.6% <100%> (-0.2%) ⬇️
... and 81 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c598c1...a43da55. Read the comment docs.

@yarikoptic (Member) commented Oct 3, 2019

>> Removed test is test_autoresolve_multiple_datasets. So it will not be possible to get sub*/anat if every subject is a separate sub dataset?
>
> It is possible and tested, as long as it is running in a dataset that is parent to all.

Gotcha 1: considers current path (repository) even though abs path(s) point to another dataset

  • does not happen with current master (0.12.0rc5-175-g13ee49884)
  • affects install as well
example 1 - running within some other unrelated repo
/home/yoh/proj/datalad/datalad-master > datalad get ~/datalad/openfmri/ds*/sub-01/anat
[ERROR  ] /home/yoh/proj/datalad/datalad-master/.git [fun.py:is_git_dir:47] (WorkTreeRepositoryUnsupported) 

exit:1 /home/yoh/proj/datalad/datalad-master > git co master
Switched to branch 'master'
Your branch is behind 'origin/master' by 73 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
changes on filesystem:                                                                                                                      
 .asv | 0

/home/yoh/proj/datalad/datalad-master > datalad get ~/datalad/openfmri/ds*/sub-01/anat
Total:   0%|                                                                                                    | 0.00/6.38M [00:00<?, ?B/s]ERROR:                                                                                                                                      
Interrupted by user while doing magic: KeyboardInterrupt() [cmd.py:_process_one_line:354]

/home/yoh/proj/datalad/datalad-master > datalad install ~/datalad/openfmri/ds*/sub-01/anat
[ERROR  ] /home/yoh/proj/datalad/datalad-master/.git [fun.py:is_git_dir:47] (WorkTreeRepositoryUnsupported) 
example 2 -- running from a pure `/tmp` -- NoDatasetArgumentFound
/tmp > datalad get ~/datalad/openfmri/ds*/sub-01/anat
[ERROR  ] No dataset found [dataset.py:require_dataset:594] (NoDatasetArgumentFound) 
usage: datalad get [-h] [-s LABEL] [-d PATH] [-r] [-R LEVELS] [-n]
                   [-D DESCRIPTION] [--reckless] [-J NJOBS]
                   [PATH [PATH ...]]

I think it should not consider or be affected by the CWD (which could be unavailable for many other reasons) whenever full paths are provided.

Gotcha 2: it does not get if superdataset is specified as the dataset

even if I do specify top (superdataset) dataset (`~/datalad`)
exit:1 /tmp > datalad get -d ~/datalad ~/datalad/openfmri/ds*/sub-01/anat 
action summary:
  get (notneeded: 7)

/tmp > du -scmL ~/datalad/openfmri/ds*/sub-01/anat                
1	/home/yoh/datalad/openfmri/ds000001/sub-01/anat
4	/home/yoh/datalad/openfmri/ds000002/sub-01/anat
du: cannot access '/home/yoh/datalad/openfmri/ds000003/sub-01/anat/sub-01_T1w.nii.gz'
du: cannot access '/home/yoh/datalad/openfmri/ds000003/sub-01/anat/sub-01_inplaneT2.nii.gz'
0	/home/yoh/datalad/openfmri/ds000003/sub-01/anat
du: cannot access '/home/yoh/datalad/openfmri/ds000011/sub-01/anat/sub-01_T1w.nii.gz'
du: cannot access '/home/yoh/datalad/openfmri/ds000011/sub-01/anat/sub-01_inplaneT2.nii.gz'
0	/home/yoh/datalad/openfmri/ds000011/sub-01/anat
du: cannot access '/home/yoh/datalad/openfmri/ds000109/sub-01/anat/sub-01_T1w.nii.gz'
0	/home/yoh/datalad/openfmri/ds000109/sub-01/anat
du: cannot access '/home/yoh/datalad/openfmri/ds000216/sub-01/anat/sub-01_T1w.nii.gz'
0	/home/yoh/datalad/openfmri/ds000216/sub-01/anat
du: cannot access '/home/yoh/datalad/openfmri/ds000241/sub-01/anat/sub-01_T1w.nii.gz'
0	/home/yoh/datalad/openfmri/ds000241/sub-01/anat
4	total
seems to work ok in master
/tmp > datalad get -d ~/datalad ~/datalad/openfmri/ds*/sub-01/anat
[INFO   ] To obtain some keys we need to fetch an archive of size 412.8 MB                                                                  
                                                                                                                                           ^C[WARNING] Still have 1 active progress bars when stopping                                                      | 0.00/5.71M [00:00<?, ?B/s]
ERROR:                                                                                                                                      
Interrupted by user while doing magic: KeyboardInterrupt() [cmd.py:_process_one_line:354]
works if I point directly to the subdataset containing those from which to get (`~/datalad/openfmri`)
/tmp > datalad get -d ~/datalad/openfmri ~/datalad/openfmri/ds*/sub-01/anat
get(ok): ds000011/sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from web...]                                                                  
get(ok): ds000011/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]                                                                        
get(ok): ds000011/sub-01/anat (directory)                                                                       | 0.00/3.54M [00:00<?, ?B/s]
action summary:
  get (notneeded: 6, ok: 3)
works with `-r`
/tmp > datalad get -d ~/datalad -r ~/datalad/openfmri/ds*/sub-01/anat
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000001/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000002/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000003/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000011/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000109/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000216/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri> underneath /home/yoh/datalad/openfmri/ds000241/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000001> underneath /home/yoh/datalad/openfmri/ds000001/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000002> underneath /home/yoh/datalad/openfmri/ds000002/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000003> underneath /home/yoh/datalad/openfmri/ds000003/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000011> underneath /home/yoh/datalad/openfmri/ds000011/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000109> underneath /home/yoh/datalad/openfmri/ds000109/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000216> underneath /home/yoh/datalad/openfmri/ds000216/sub-01/anat recursively 
[INFO   ] Installing <Dataset path=/home/yoh/datalad/openfmri/ds000241> underneath /home/yoh/datalad/openfmri/ds000241/sub-01/anat recursively 
get(ok): openfmri/ds000216/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
get(ok): openfmri/ds000216/sub-01/anat (directory)
sub-01/anat/sub-01_T1w.nii.gz:  69%|██████████████████████████████████████████████▏                    | 4.56M/6.62M [00:01<00:00, 3.15MB/s]^C[WARNING] Still have 1 active progress bars when stopping 
ERROR:                                                                                                                                      
Interrupted by user while doing magic: KeyboardInterrupt() [cmd.py:_process_one_line:354]

Those INFO messages happen with the master version as well, so they are not specific to this PR.

that was version 0.12.0rc5-179-g84ecda578

@mih (Member, Author) commented Oct 3, 2019

> I think it should not consider or be affected by the CWD (which could be unavailable for many other reasons) whenever full paths are provided.

This code is intentionally incapable of doing that. The dataset-unbound behavior is THE reason why nobody can or will fix annotate_paths().

@mih (Member, Author) commented Oct 4, 2019

@kyleam This run test failure is due to get() now following the present path interpretation conventions. Here is the test:

    # ... producing output file in specified dataset and passing output file as
    # relative to current directory
    with chpwd(ds1_subdir):
        out = op.join(ds0.path, "three")
        run("cd .> {}".format(out), dataset=ds0.path, explicit=True,
            outputs=[op.relpath(out, ds1_subdir)])

Leading to

> /home/mih/hacking/datalad/git/datalad/core/local/run.py(256)_install_and_reglob()
-> on_failure="ignore"):
(Pdb) dirs_new
['../../ds0']
(Pdb) l
251  
252         dirs, dirs_new = [], glob_dirs()
253         while dirs != dirs_new:
254             for res in dset.install(dirs_new,
255                                     result_xfm=None, return_type='generator',
256  ->                                 on_failure="ignore"):
257                 if res.get("state") == "absent":
258                     lgr.debug("Skipping install of non-existent path: %s",
259                               res["path"])
260                 else:
261                     yield res

So the incoming arg is given unmodified to install() (and underneath, get()), but it was originally handed to an unbound run(), while internally a bound `install` is called, hence invalidating the semantics.
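Concretely, the `outputs` value the test computes is a CWD-relative up-level path, which the bound call then reinterprets against the wrong base. A standalone illustration (`/ds0` and `/ds1/subdir` are hypothetical stand-ins for the test's temp directories):

```python
import os.path as op

out = "/ds0/three"       # hypothetical output path inside dataset ds0
subdir = "/ds1/subdir"   # hypothetical current working directory

# Mirrors op.relpath(out, ds1_subdir) in the failing test: the result
# only makes sense when resolved against the CWD, not a dataset root.
rel = op.relpath(out, subdir)
print(rel)  # ../../ds0/three
```

A dataset-bound install() treats such a path as relative to the dataset root rather than the CWD, so it points at a non-existent location.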

In general I believe that there are two major paradigms to path handling.

  1. Resolve to absolute paths at the start of a command
  2. Make at all internal dataset method calls unbound, using the original dataset argument verbatim

In run's case (1) can be rather difficult or impossible (globs etc.), so I guess (2) is the way to go. This would require some RF'ing in order to get the dataset arg to the places it needs to be passed. What do you think?
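A toy resolver makes the distinction concrete (illustrative functions only, not DataLad's API): under paradigm (1) paths are absolutized once against the CWD; under paradigm (2) a relative path keeps meaning "relative to the dataset I am eventually bound to".

```python
import os
import os.path as op

def resolve(path, dataset=None, cwd=None):
    """Toy model of the path rule discussed above (not DataLad code).

    A relative path is interpreted against the dataset root when a
    dataset is given (a 'bound' call), and against the current working
    directory otherwise (an 'unbound' call).
    """
    if op.isabs(path):
        return op.normpath(path)
    base = dataset if dataset is not None else (cwd or os.getcwd())
    return op.normpath(op.join(base, path))

# The run() failure in a nutshell: a path computed relative to the CWD
# (hypothetical /tmp/ds1/subdir) is later resolved by a bound call
# against the dataset root instead, landing one level too high.
print(resolve("../../ds0/three", cwd="/tmp/ds1/subdir"))  # /tmp/ds0/three
print(resolve("../../ds0/three", dataset="/tmp/ds0"))     # /ds0/three
```

The second call silently yields a wrong (though sometimes accidentally existing) path, which is why mixing the two paradigms inside one command is hazardous.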

This other failure seems to be related to glob expansion not working on windows. Do you consider this as "should work", given that this is windows?

@mih (Member, Author) commented Oct 4, 2019

@yarikoptic re gotcha2: I cannot replicate this:

(datalad3-dev) mih@meiner /tmp % dl install -s /// ts
[INFO   ] Cloning http://datasets.datalad.org/ [1 other candidates] into '/tmp/ts' 
install(ok): /tmp/ts (dataset)
(datalad3-dev) mih@meiner /tmp % dl get -d /tmp/ts /tmp/ts/openfmri/ds000001/sub-01/anat
[INFO   ] Cloning http://datasets.datalad.org/openfmri/.git into '/tmp/ts/openfmri' 
install(ok): openfmri (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000001/sub-01/anat]
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000001/.git into '/tmp/ts/openfmri/ds000001' 
install(ok): openfmri/ds000001 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000001/sub-01/anat]
get(ok): openfmri/ds000001/sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from web...]          
get(ok): openfmri/ds000001/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]                
get(ok): openfmri/ds000001/sub-01/anat (directory)        | 223k/5.66M [00:00<00:12, 447kB/s]
action summary:
  get (ok: 3)
  install (ok: 2)

also multi-paths are fine

(datalad3-dev) 1 mih@meiner /tmp % dl get -d /tmp/ts /tmp/ts/openfmri/ds00000{2,3}/sub-01/anat
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000002/.git into '/tmp/ts/openfmri/ds000002' 
install(ok): openfmri/ds000002 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000002/sub-01/anat]
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000003/.git into '/tmp/ts/openfmri/ds000003' 
install(ok): openfmri/ds000003 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000003/sub-01/anat]
get(ok): openfmri/ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]                
get(ok): openfmri/ds000002/sub-01/anat (directory)
get(ok): openfmri/ds000003/sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from web...]          
get(ok): openfmri/ds000003/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]                
get(ok): openfmri/ds000003/sub-01/anat (directory)      | 1.41M/5.71M [00:00<00:03, 1.13MB/s]
action summary:
  get (ok: 5)
  install (ok: 2)
% git describe
0.12.0rc5-179-g84ecda578

@yarikoptic (Member) commented:

> dl install -s /// ts

Please either use the full "standard" name (datalad) or we should centrally deploy datalad as dl. Reproducing your snippets otherwise requires additional tuning, and I don't want to breed custom dl aliases in bashrc (e.g., I never post uses of my shortcuts dg, d+, etc.).

@yarikoptic (Member) commented:

Getting bitten by gotcha 1. If this is intended, I am not sure I like the new behavior: it breaks our original promise that datalad can be used from outside the target dataset, and it might introduce side effects from taking the CWD dataset's configuration and applying it to an operation on the target path/dataset.
$> sudo rm -rf /tmp/ts && datalad install -s /// /tmp/ts && datalad install /tmp/ts/openfmri/ds00000{2,3} && datalad get -d /tmp/ts /tmp/ts/openfmri/ds0000*/sub-01/anat
[INFO   ] Cloning http://datasets.datalad.org/ [1 other candidates] into '/tmp/ts' 
install(ok): /tmp/ts (dataset)
[ERROR  ] No dataset found [dataset.py:require_dataset:594] (NoDatasetArgumentFound) 
usage: datalad install [-h] [-s SOURCE] [-d DATASET] [-g] [-D DESCRIPTION]
                       [-r] [-R LEVELS] [--nosave] [--reckless] [-J NJOBS]
                       [PATH [PATH ...]]

In your replication attempt of gotcha 2 you didn't replicate the situation: I already had those subdatasets installed. Here is a complete minimal example where I first install those subdatasets:

$> sudo rm -rf /tmp/ts && datalad install -s /// /tmp/ts && builtin cd /tmp/ts && datalad install openfmri/ds00000{2,3} && builtin cd /tmp && datalad get -d /tmp/ts /tmp/ts/openfmri/ds0000*/sub-01/anat       
[INFO   ] Cloning http://datasets.datalad.org/ [1 other candidates] into '/tmp/ts' 
install(ok): /tmp/ts (dataset)
[INFO   ] Cloning http://datasets.datalad.org/openfmri/.git into '/tmp/ts/openfmri' 
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000002/.git into '/tmp/ts/openfmri/ds000002'                                       
install(ok): /tmp/ts/openfmri/ds000002 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000002]                           
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000003/.git into '/tmp/ts/openfmri/ds000003' 
install(ok): /tmp/ts/openfmri/ds000003 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000003]                           
action summary:
  install (ok: 3)
datalad install openfmri/ds00000{2,3}  5.31s user 1.98s system 74% cpu 9.740 total
action summary:
  get (notneeded: 2)

and it works with the openfmri subdataset being specified:

$> datalad get -d /tmp/ts/openfmri /tmp/ts/openfmri/ds0000*/sub-01/anat 
get(ok): ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]                                                                        
get(ok): ds000002/sub-01/anat (directory)
get(ok): ds000003/sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from web...]                                                                  
get(ok): ds000003/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]                                                                        
get(ok): ds000003/sub-01/anat (directory)  

@mih (Member, Author) commented Oct 4, 2019

> getting bitten by gotcha 1

I don't mind install() doing all kinds of stunts. From my POV it is already an outlier in terms of behavior, so I don't mind making it fit this (or other) use cases. install() is merely a loop around clone()/get(), so we can easily treat every single input argument as independent. Performance will suffer, but I don't think performance is a concern for install() in general.

re gotcha2: thanks for the pointer. Found it, and pushed a fix.

@kyleam (Contributor) commented Oct 4, 2019

@mih re run test issues: These failures are related to gh-3551. I'd suggest marking them as known failures so that this PR isn't blocked (this PR brings the underlying issues to the forefront but doesn't introduce them), and I'll work on fixing them.


More details: I started looking at this a while ago. One approach to the install issue is to expand those glob results to absolute paths.

Here's the patch I had:

Subject: [PATCH] BF: run: Pass absolute paths to install

We do a series of globs and installs to make sure the required
datasets are installed.  During this procedure, we expand the globs
with relative paths (full=False).  This is incorrect in two ways.
First, we expect calling dirname() on these paths to give something
useful, but that can easily result in an empty string when the paths
are relative.  Second, we pass the relative to a dataset-bound install
call, which considers the paths as relative to the dataset, leading to
incorrectly specified paths if the call is occurring from a
subdirectory [*].

Expand the globs to absolute paths instead.

Re: #3551

[*] ... though these incorrect calls may still work.  See gh-3650.
---
 datalad/core/local/run.py             | 2 +-
 datalad/interface/tests/test_rerun.py | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/datalad/core/local/run.py b/datalad/core/local/run.py
index e95473248..3d05b21ee 100644
--- a/datalad/core/local/run.py
+++ b/datalad/core/local/run.py
@@ -247,7 +247,7 @@ def _install_and_reglob(dset, gpaths):
     Generator with the results of the `install` calls.
     """
     def glob_dirs():
-        return list(map(op.dirname, gpaths.expand(refresh=True)))
+        return list(map(op.dirname, gpaths.expand(full=True, refresh=True)))
 
     dirs, dirs_new = [], glob_dirs()
     while dirs != dirs_new:
diff --git a/datalad/interface/tests/test_rerun.py b/datalad/interface/tests/test_rerun.py
index 7305f119c..0b089411c 100644
--- a/datalad/interface/tests/test_rerun.py
+++ b/datalad/interface/tests/test_rerun.py
@@ -593,7 +593,8 @@ def test_rerun_script(path):
                                         "b.txt": "b"}},
                         "s1_1": {"s2": {"c.dat": "c",
                                         "d.txt": "d"}},
-                        "ss": {"e.dat": "e"}}})
+                        "ss": {"e.dat": "e"}},
+                 "subdir": {"f.dat": "f"}})
 @with_tempfile(mkdir=True)
 def test_run_inputs_outputs(src, path):
     for subds in [("s0", "s1_0", "s2"),
-- 
2.23.0

The problem is that this exposes a core incompatibility between how run, from the very beginning, has handled calls from subdirectories and the kinds of absolute paths that rev_resolve_path will take:

def rev_resolve_path(path, ds=None):
    """Resolve a path specification (against a Dataset location)

    Any path is returned as an absolute path. If, and only if, a dataset
    object instance is given as `ds`, relative paths are interpreted as
    relative to the given dataset. In all other cases, relative paths are
    treated as relative to the current working directory.

    Note however, that this function is not able to resolve arbitrarily
    obfuscated path specifications. All operations are purely lexical, and no
    actual path resolution against the filesystem content is performed.
    Consequently, common relative path arguments like '../something' (relative
    to PWD) can be handled properly, but things like 'down/../under' cannot, as
    resolving this path properly depends on the actual target of any
    (potential) symlink leading up to '..'.
    """
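The lexical-only limitation the docstring describes can be demonstrated with nothing but the standard library (independent of DataLad): `os.path.normpath` cancels `link/..` textually, while `os.path.realpath` consults the filesystem first, and the two disagree as soon as a symlink is involved.

```python
import os
import tempfile

# Layout: <base>/a/b exists, and <base>/link is a symlink to <base>/a/b.
base = os.path.realpath(tempfile.mkdtemp())
os.makedirs(os.path.join(base, "a", "b"))
os.symlink(os.path.join(base, "a", "b"), os.path.join(base, "link"))

p = os.path.join(base, "link", "..", "c")

lexical = os.path.normpath(p)   # "link/.." cancels textually -> <base>/c
actual = os.path.realpath(p)    # follows the symlink to a/b first -> <base>/a/c

print(lexical == os.path.join(base, "c"))       # True
print(actual == os.path.join(base, "a", "c"))   # True
```

This is exactly why a purely lexical resolver must refuse `down/../under`-style inputs rather than guess.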

I'll revisit that issue today.

There's also a tangentially related issue that run's (or really the GlobbedPaths helper's) iterative globbing of subdatasets only works in a restricted set of cases. That needs to be fixed (perhaps _install_necessary_subdatasets can help), but I'll punt on that for now unless fixing the path handling issues above requires it to be addressed.

kyleam added a commit to kyleam/datalad that referenced this pull request Oct 4, 2019
As described here [1], there are two approaches to dealing with the
current path rules: (1) resolve all paths to absolute paths and use
bound dataset methods or (2) use unbound dataset methods and relative
paths.

run() has mostly taken the first approach.  There is one spot that
incorrectly passes relative paths to install() [2].  If this is
changed to use absolute paths, it reveals a deeper incompatibility
with the new path handling on master.  rev_resolve_path() isn't happy
with absolute paths that look like "/go/../upstairs" (see its
docstring and code comments for why).  run() constructs paths like
these when the run is occurring in a subdirectory and there are inputs or
outputs upstairs (see the examples in the added test).

So instead let's switch to the second approach.

Note that the compatibility kludge around the dataset handling is
mainly for the sake of datalad-htcondor.

Also note that needing to use chpwd() in several spots is ugly.  We do
this because we need to make sure we're in the correct directory when
rerunning.  We might want to make rerun() responsible for calling
chpwd() around its call of run_command(), but that would take a bit
more refactoring.

Fixes datalad#3551.

[1] datalad#3746 (comment)
[2] ... though these incorrect calls still work in at least some cases. See
    gh-3650.
if res.get("action") == "get" and \
        res.get("status") == "impossible" and \
        res.get("message") == "path does not exist":
    # MIH why just a warning if given inputs are not valid?
Contributor:

I don't spot any obvious conceptual reasons why we couldn't be stricter here, and I'd be fine with switching to an error. However, quickly testing that out, we would need to rework things due to a non-obvious interaction with GlobbedPaths (shown by the resulting failure in test_run_inputs_no_annex_repo). I'm going to open an issue about run's installation of missing subdatasets only working in a restricted set of cases (the same issue mentioned elsewhere in this thread). It looks like fixing that will require a core change to the same part of GlobbedPaths, so I'll make sure to note that we should keep this "warning -> error" switch in mind.

@kyleam (Contributor) commented Oct 4, 2019

> This other failure seems to be related to glob expansion not working on windows. Do you consider this as "should work", given that this is windows?

I don't feel confident about saying anything should work on windows, but for this particular test the glob expansion doesn't get a hit on other systems either. Because GlobbedPaths hangs onto the globs when there are no matches^, get() receives a literal "*" as the path, and it gives back a result saying that the path doesn't exist.

With the gh-3747 changes, the call is equivalent to get(dataset=ds.path, path=['*']), and that has the same failure:

https://github.com/datalad/datalad/pull/3747/checks?check_run_id=248506600#step:8:374

^ I'd need to review that code to recall why, but that's probably another aspect that needs to be reworked along with the changes mentioned here.
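The "hangs onto the globs" behavior described above can be modeled in a few lines (a toy stand-in, not GlobbedPaths' actual implementation): when a pattern matches nothing, the literal pattern survives and is what downstream code such as get() ends up receiving.

```python
import glob
import os

def expand_keep_unmatched(patterns, cwd):
    """Toy model of the unmatched-glob fallback described above."""
    out = []
    old = os.getcwd()
    os.chdir(cwd)
    try:
        for pat in patterns:
            hits = sorted(glob.glob(pat))
            # No match: keep the pattern verbatim, as GlobbedPaths is
            # described to do, so '*' reaches get() as a literal path.
            out.extend(hits if hits else [pat])
    finally:
        os.chdir(old)
    return out
```

In an empty directory, `expand_keep_unmatched(['*'], d)` returns `['*']` — the literal glob that get() then reports as a non-existent path.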

mih and others added 9 commits October 6, 2019 11:23
This tries to be a gentle RF that doesn't change (much) of the original
behavior (see test diff). However, a change in-line with datalad#3742 is
unavoidable (see removed test).

This is bringing path semantics in line with status/subdatasets, and is
a precondition to fixing datalad#3469
Also update that filter to match the new conventions
This test leads to run() calling `get(dataset=ds.path, path=['*'])`.
On non-Windows systems, the '*' is considered a literal path and
reported to be absent, as expected.  On Windows, the underlying
os.stat() call fails [0] with

    OSError: [WinError 123] The filename, directory name, or volume
    label syntax is incorrect:
    'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\datalad_temp_5m0lh8wt\\*'

As discussed here [1], we might want to rework run() and/or
GlobbedPaths to avoid hanging on to unmatched globs and feeding
them in as literal paths.  That might get rid of the failure for this
particular test, though it would just paper over the core Windows path
handling problem shown by this failure.

[0]: https://github.com/datalad/datalad/pull/3746/checks?check_run_id=248850583#step:8:417
[1]: datalad#3746 (comment)
@kyleam (Contributor) commented Oct 6, 2019

I've rebased this to resolve conflicts with master, and I've marked test_run_inputs_no_annex_repo as a known windows failure.

range-diff
 1:  6c1ac2c6f =  1:  6060043e9 RF: get() without annotate_paths()
 2:  6158dd3b8 =  2:  35bb3fcaf BF: Adjust results to get install() result filter working
 3:  5b166d443 =  3:  096ea9aa3 BF: Not useful to send Path objects in messages
 4:  6c3ce76ba !  4:  e7242c343 TST: Adjust tests to match new calling conventions for get()
    @@ datalad/distribution/tests/test_uninstall.py: def test_remove_recursive_2(tdir):
     -        install('labs/tarr/face_place')
     +        with chpwd('labs'):
     +            install('tarr/face_place')
    -         remove('labs', recursive=True)
    +         remove(dataset='labs', recursive=True)
      
      
     
 5:  ab984551e =  5:  01029df1a RF: Adjust run for new get() error reporting
 6:  84ecda578 =  6:  2fe9d5d3a TST: No longer a pointless toplevel 'notneeded' result
 7:  89467b9c4 =  7:  276655687 BF: Fix up subdataset recursion logic in get()
 8:  c14e2acb1 =  8:  acfc6f145 TST: Mark known failures as suggested by @kyleam
 -:  --------- >  9:  a43da55f2 TST: Mark test_run_inputs_no_annex_repo as known Windows failure

kyleam added a commit to kyleam/datalad that referenced this pull request Oct 6, 2019
As described here [1], there are two approaches to dealing with the
current path rules: (1) resolve all paths to absolute paths and use
bound dataset methods or (2) use unbound dataset methods and relative
paths.

run() has mostly taken the first approach.  There is one spot that
incorrectly passes relative paths to install() [2].  If this is
changed to use absolute paths, it reveals a deeper incompatibility
with the new path handling on master.  rev_resolve_path() isn't happy
with absolute paths that look like "/go/../upstairs" (see its
docstring and code comments for why).  run() constructs paths like
these when the run is happening in a subdirectory and there are inputs
or outputs upstairs (see the examples in the added test).

So instead let's switch to the second approach.

Note that the compatibility kludge around the dataset handling is
mainly for the sake of datalad-htcondor.

Also note that needing to use chpwd() in several spots is ugly.  We do
this because we need to make sure we're in the correct directory when
rerunning.  We might want to make rerun() responsible for calling
chpwd() around its call of run_command(), but that would take a bit
more refactoring.

Fixes datalad#3551.
Re: datalad#3746

[1] datalad#3746 (comment)
[2] ... though these incorrect calls still work in at least some cases. See
    gh-3650.
@kyleam (Contributor) commented Oct 6, 2019

The remaining failure, datalad_container.tests.test_run.test_custom_call_fmt, is resolved by gh-3747.

@mih (Member, Author) commented Oct 6, 2019

Thx much @kyleam ! I'd be happy to close this one in favor of gh-3747

@yarikoptic (Member) commented:

I have created a dedicated issue to discuss and possibly mitigate the removal of a feature with these changes: #3759.
I will proceed with merging this one and Kyle's #3747 separately (so we have clear, distinct merges); will do it locally.

@yarikoptic yarikoptic merged commit a43da55 into datalad:master Oct 9, 2019
@mih mih deleted the rf-get-annotate branch October 10, 2019 06:16