RF: get() without annotate_paths() #3746
Conversation
Removed test is

It is possible and tested, as long as it is running in a dataset that is parent to all.
```diff
-def subds_result_filter(res):
-    return res.get('status') == 'ok' and res.get('type') == 'dataset'
-
 # figuring out what dataset to start with, --contains limits --recursive
 # to visit only subdataset on the trajectory to the target path
 subds_trail = ds.subdatasets(contains=path, recursive=True,
                              on_failure="ignore",
-                             result_filter=subds_result_filter)
+                             result_filter=is_ok_dataset)
```
When adding f634bea (TST+BF: Update subdatasets(contains=...) calls for previous commit, 2019-10-01), I thought a function like is_ok_dataset
might exist, but failed to find it. Thanks for cleaning this up.
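For context, the shared helper discussed above would presumably encode the same predicate as the removed inline filter. Here is a minimal sketch of such a result filter, assuming datalad's convention of passing result records as plain dicts; this is an illustration, not the actual `datalad.interface.results` implementation:

```python
def is_ok_dataset(res, **kwargs):
    """Result filter: keep only records describing a dataset in 'ok' status.

    `res` is a datalad-style result record (a plain dict). Extra keyword
    arguments are accepted because datalad may pass the original call's
    kwargs through to result filters.
    """
    return res.get('status') == 'ok' and res.get('type') == 'dataset'
```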
Codecov Report

```
@@            Coverage Diff            @@
##           master    #3746     +/-  ##
=========================================
- Coverage   82.94%   75.45%    -7.5%
=========================================
  Files         273      273
  Lines       35943    35895      -48
=========================================
- Hits        29813    27083    -2730
- Misses       6130     8812    +2682
```

Continue to review full report at Codecov.
Gotcha 1: considers the current path (repository) even though the abs path(s) point to another dataset

example 1 -- running within some other unrelated repo:

```
/home/yoh/proj/datalad/datalad-master > datalad get ~/datalad/openfmri/ds*/sub-01/anat
[ERROR  ] /home/yoh/proj/datalad/datalad-master/.git [fun.py:is_git_dir:47] (WorkTreeRepositoryUnsupported)
exit:1 /home/yoh/proj/datalad/datalad-master > git co master
Switched to branch 'master'
Your branch is behind 'origin/master' by 73 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
changes on filesystem:
 .asv | 0
/home/yoh/proj/datalad/datalad-master > datalad get ~/datalad/openfmri/ds*/sub-01/anat
Total:   0%|          | 0.00/6.38M [00:00<?, ?B/s]ERROR:
Interrupted by user while doing magic: KeyboardInterrupt() [cmd.py:_process_one_line:354]
/home/yoh/proj/datalad/datalad-master > datalad install ~/datalad/openfmri/ds*/sub-01/anat
[ERROR  ] /home/yoh/proj/datalad/datalad-master/.git [fun.py:is_git_dir:47] (WorkTreeRepositoryUnsupported)
```

example 2 -- running from a pure `/tmp` -- NoDatasetArgumentFound:

```
/tmp > datalad get ~/datalad/openfmri/ds*/sub-01/anat
[ERROR  ] No dataset found [dataset.py:require_dataset:594] (NoDatasetArgumentFound)
usage: datalad get [-h] [-s LABEL] [-d PATH] [-r] [-R LEVELS] [-n]
                   [-D DESCRIPTION] [--reckless] [-J NJOBS]
                   [PATH [PATH ...]]
```

I think it should not consider or be affected by CWD (which could be unavailable for many other reasons) whenever full paths are provided.

Gotcha 2: it does not
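The CWD-independent behavior argued for here can be sketched as: when the target path is absolute, locate its containing dataset by walking up from the path itself, never consulting the working directory. A toy illustration follows; the function name and the `.datalad` marker check are assumptions for the sketch, not datalad's actual resolution logic:

```python
from pathlib import Path

def containing_dataset(target):
    """Walk up from an absolute `target` path to find the nearest enclosing
    directory that looks like a dataset (here: one containing a .datalad dir).

    Deliberately ignores os.getcwd(): the dataset is derived from the target
    path alone, so running from an unrelated repo (or an unavailable CWD)
    cannot affect the result.
    """
    p = Path(target).absolute()
    for candidate in (p, *p.parents):
        if (candidate / ".datalad").is_dir():
            return candidate
    return None  # no enclosing dataset; the caller can then error out
```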
This code is intentionally incapable of doing that. The dataset-unbound behavior is THE reason why nobody can or will fix annotate_paths().
@kyleam This run test failure is due to

Leading to

So the incoming arg is given unmodified to

In general I believe that there are two major paradigms to path handling.

In

This other failure seems to be related to glob expansion not working on Windows. Do you consider this as "should work", given that this is Windows?
@yarikoptic re gotcha2: I cannot replicate this:

also multi-paths are fine
Either please use full "standard" name (
getting bitten by gotcha1 (if intended -- not sure if I like that new intended behavior -- it breaks our original promise of being able to use datalad from outside of the target dataset, and might introduce side effects from taking CWD dataset configuration and applying it to an operation on the target path/dataset, etc.):

```
$> sudo rm -rf /tmp/ts && datalad install -s /// /tmp/ts && datalad install /tmp/ts/openfmri/ds00000{2,3} && datalad get -d /tmp/ts /tmp/ts/openfmri/ds0000*/sub-01/anat
[INFO   ] Cloning http://datasets.datalad.org/ [1 other candidates] into '/tmp/ts'
install(ok): /tmp/ts (dataset)
[ERROR  ] No dataset found [dataset.py:require_dataset:594] (NoDatasetArgumentFound)
usage: datalad install [-h] [-s SOURCE] [-d DATASET] [-g] [-D DESCRIPTION]
                       [-r] [-R LEVELS] [--nosave] [--reckless] [-J NJOBS]
                       [PATH [PATH ...]]
```

In your replication attempt of gotcha2 you didn't replicate the situation -- I already had those subdatasets installed. Here is a complete minimal example where I first install those subdatasets:

```
$> sudo rm -rf /tmp/ts && datalad install -s /// /tmp/ts && builtin cd /tmp/ts && datalad install openfmri/ds00000{2,3} && builtin cd /tmp && datalad get -d /tmp/ts /tmp/ts/openfmri/ds0000*/sub-01/anat
[INFO   ] Cloning http://datasets.datalad.org/ [1 other candidates] into '/tmp/ts'
install(ok): /tmp/ts (dataset)
[INFO   ] Cloning http://datasets.datalad.org/openfmri/.git into '/tmp/ts/openfmri'
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000002/.git into '/tmp/ts/openfmri/ds000002'
install(ok): /tmp/ts/openfmri/ds000002 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000002]
[INFO   ] Cloning http://datasets.datalad.org/openfmri/ds000003/.git into '/tmp/ts/openfmri/ds000003'
install(ok): /tmp/ts/openfmri/ds000003 (dataset) [Installed subdataset in order to get /tmp/ts/openfmri/ds000003]
action summary:
  install (ok: 3)
datalad install openfmri/ds00000{2,3}  5.31s user 1.98s system 74% cpu 9.740 total
action summary:
  get (notneeded: 2)
```

and it works with `-d /tmp/ts/openfmri`:

```
$> datalad get -d /tmp/ts/openfmri /tmp/ts/openfmri/ds0000*/sub-01/anat
get(ok): ds000002/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
get(ok): ds000002/sub-01/anat (directory)
get(ok): ds000003/sub-01/anat/sub-01_inplaneT2.nii.gz (file) [from web...]
get(ok): ds000003/sub-01/anat/sub-01_T1w.nii.gz (file) [from web...]
get(ok): ds000003/sub-01/anat (directory)
```
I don't mind

re gotcha2: thanks for the pointer. Found it, and pushed a fix.
@mih re run test issues: These failures are related to gh-3551. I'd suggest marking them as known failures so that this PR isn't blocked (this PR brings the underlying issues to the forefront but doesn't introduce them), and I'll work on fixing them.

More details: I started looking at this a while ago. One approach to the install issue is to expand those return values. Here's the patch I had:

The problem is that this exposes a core incompatibility between how

datalad/datalad/distribution/dataset.py, lines 612 to 626 in 13ee498

I'll revisit that issue today. There's also a tangentially related issue that run's (or really the GlobbedPaths helper's) iterative globbing of subdatasets only works in a restricted set of cases. That needs to be fixed (perhaps
As described here [1], there are two approaches to dealing with the current path rules: (1) resolve all paths to absolute paths and use bound dataset methods or (2) use unbound dataset methods and relative paths. run() has mostly taken the first approach. There is one spot that incorrectly passes relative paths to install() [2]. If this is changed to use absolute paths, it reveals a deeper incompatibility with the new path handling on master. rev_resolve_path() isn't happy with absolute paths that look like "/go/../upstairs" (see its docstring and code comments for why). run() constructs paths like these when the run is happening in a subdirectory and there are inputs or outputs upstairs (see the examples in the added test). So instead let's switch to the second approach.

Note that the compatibility kludge around the dataset handling is mainly for the sake of datalad-htcondor. Also note that needing to use chpwd() in several spots is ugly. We do this because we need to make sure we're in the correct directory when rerunning. We might want to make rerun() responsible for calling chpwd() around its call of run_command(), but that would take a bit more refactoring.

Fixes datalad#3551.

[1] datalad#3746 (comment)
[2] ... though these incorrect paths still work in at least some cases. See gh-3650.
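The trouble with paths like "/go/../upstairs" is that '..' cannot be collapsed lexically without assuming 'go' is an ordinary directory rather than a symlink. A small stdlib demonstration of the two behaviors (illustrative only, not datalad code):

```python
import posixpath
from pathlib import PurePosixPath

# posixpath.normpath collapses '..' purely lexically: it assumes 'go' is a
# real directory. If 'go' were a symlink elsewhere, this answer would be
# wrong on the actual filesystem.
print(posixpath.normpath('/go/../upstairs'))   # -> /upstairs

# pathlib's pure paths refuse to make that assumption and keep '..' as-is,
# which is the same caution that makes rev_resolve_path() unhappy with
# such inputs.
print(PurePosixPath('/go/../upstairs'))        # -> /go/../upstairs
```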
```python
if res.get("action") == "get" and \
   res.get("status") == "impossible" and \
   res.get("message") == "path does not exist":
    # MIH why just a warning if given inputs are not valid?
```
I don't spot any obvious conceptual reasons why we couldn't be stricter here, and I'd be fine with switching to an error. However, quickly testing that out, we would need to rework things due to a non-obvious interaction with GlobbedPaths (shown by the resulting failure in test_run_inputs_no_annex_repo). I'm going to open an issue about run's installation of missing subdatasets only working in a restricted set of cases (same issue mentioned elsewhere in this thread). It looks like fixing that will require a core change to the same part of GlobbedPaths, so I'll make sure to note that we should keep this "warning -> error" switch in mind.

I don't feel confident about saying anything should work on Windows, but for this particular test the glob expansion isn't a hit on other systems either. Because GlobbedPaths hangs onto the globs when there are no matches^, get() receives a literal "*" as the path, and it gives back a result saying that the path doesn't exist. With the gh-3747 changes, the call is equivalent to

https://github.com/datalad/datalad/pull/3747/checks?check_run_id=248506600#step:8:374

^ I'd need to review that code to recall why, but that's probably another aspect that needs to be reworked along with the changes mentioned here.
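The behavior described above, where GlobbedPaths hangs onto an unmatched glob so that get() later sees the literal '*', can be mimicked in a few lines. This mirrors a POSIX shell without nullglob and is an illustration, not the actual helper:

```python
import glob

def expand_patterns(patterns):
    """Expand glob patterns; keep a pattern verbatim when nothing matches.

    Keeping unmatched patterns is what lets a literal '*' leak through to
    get() when a run command's inputs match no files.
    """
    expanded = []
    for pat in patterns:
        matches = sorted(glob.glob(pat))
        expanded.extend(matches if matches else [pat])
    return expanded
```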
This tries to be a gentle RF that doesn't change (much) of the original behavior (see test diff). However, a change in line with datalad#3742 is unavoidable (see removed test). This is bringing path semantics in line with status/subdatasets, and is a precondition to fixing datalad#3469
Also update that filter to match the new conventions
Associated PR in progress: datalad#3747
This test leads to run() calling `get(dataset=ds.path, path=['*'])`. On non-Windows systems, the '*' is considered a literal path and reported to be absent, as expected. On Windows, the underlying os.stat() call fails [0] with

    OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\datalad_temp_5m0lh8wt\\*'

As discussed here [1], we might want to rework run() and/or GlobbedPaths to avoid hanging onto the unmatched globs and feeding them in as literal paths. That might get rid of the failure for this particular test, though it would just paper over the core Windows path handling problem shown by this failure.

[0]: https://github.com/datalad/datalad/pull/3746/checks?check_run_id=248850583#step:8:417
[1]: datalad#3746 (comment)
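The WinError 123 above shows that on Windows os.stat() raises for paths containing glob metacharacters instead of reporting them as absent. A defensive existence check that tolerates this (essentially what os.path.exists() already does internally, shown here as a sketch) looks like:

```python
import os

def lenient_exists(path):
    """Return True if `path` exists; treat any OSError (including Windows'
    'invalid syntax' WinError 123 for paths containing '*') as 'does not
    exist' instead of letting it propagate to the caller.
    """
    try:
        os.stat(path)
    except (OSError, ValueError):
        return False
    return True
```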
force-pushed from c14e2ac to a43da55
I've rebased this to resolve conflicts with master, and I've marked test_run_inputs_no_annex_repo as a known Windows failure.
As described here [1], there are two approaches to dealing with the current path rules: (1) resolve all paths to absolute paths and use bound dataset methods or (2) use unbound dataset methods and relative paths. run() has mostly taken the first approach. There is one spot that incorrectly passes relative paths to install() [2]. If this is changed to use absolute paths, it reveals a deeper incompatibility with the new path handling on master. rev_resolve_path() isn't happy with absolute paths that look like "/go/../upstairs" (see its docstring and code comments for why). run() constructs paths like these when the run is happening in a subdirectory and there are inputs or outputs upstairs (see the examples in the added test). So instead let's switch to the second approach.

Note that the compatibility kludge around the dataset handling is mainly for the sake of datalad-htcondor. Also note that needing to use chpwd() in several spots is ugly. We do this because we need to make sure we're in the correct directory when rerunning. We might want to make rerun() responsible for calling chpwd() around its call of run_command(), but that would take a bit more refactoring.

Fixes datalad#3551. Re: datalad#3746

[1] datalad#3746 (comment)
[2] ... though these incorrect paths still work in at least some cases. See gh-3650.
The remaining failure, datalad_container.tests.test_run.test_custom_call_fmt, is resolved by gh-3747.
This tries to be a gentle RF that doesn't change (much) of the original
behavior (see test diff). However, a change in-line with #3742 is
unavoidable (see removed test).
This is bringing path semantics in line with status/subdatasets, and is
a precondition to fixing #3469
Ping #3368