Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run: Two --input/--output glob fixes #5594

merged 13 commits into from Apr 26, 2021


Copy link

@kyleam kyleam commented Apr 21, 2021

This series fixes two glob-related issues with run:

The commits in between are mostly focused on preparing the GlobbedPaths helper and tests for the last commit.

To unlock output files, run() expands globs prior to running a
command.  For the default invocation, there's no reason to do more
because all changes are saved and the outputs are stored in the commit
message unexpanded.  However, when --explicit or
--expand={both,outputs} is specified, re-globbing is needed to ensure
that an up-to-date list of outputs is captured.

Note that for --explicit, only a subset of cases are affected.  If
there are no matches, GlobbedPaths returns the original pattern.  That
pattern gets passed to GitRepo.get_content_info(), where ls-files
expands the glob.  That is, the unexpected behavior surfaces only
when, before the command is executed, an --output matches some files
and the execution produces additional matching files.

Closes datalad#3448.
The "input/output doesn't exist" case is already covered by other
spots (e.g., test_run_explicit and `rerun --onto` tests), so there's
no point in checking it again, especially given that the rerun tests
are already slow.
test_run_inputs_no_annex_repo() was added as a regression test to
ensure that run() doesn't crash when inputs are specified for a plain
git repo.  (At the time, globbing went through git-annex and worked
only for annexed files.)

With an upcoming change, this test will fail for an unrelated reason:
inputs that don't match will lead to an error result.  Add a file to
the working tree to avoid that.
Most of test_globbedpaths() uses os.path.join or os.sep, but a couple
of spots hard code "/".  Update those, and drop the
known_failure_windows from test_globbedpaths() in hopes that the test
now works on Windows.
test_globbedpaths() points GlobbedPaths to "$test_directory/subdir/",
and then tests globbing _within_ "subdir/".  One of the test cases is
"subdir/", which doesn't have a hit and shows up in the results
because GlobbedPaths (intentionally, though confusingly) returns the
original pattern when there are no hits.

Pick a name that makes it clear that the result is due to there being
no file system matches.
GlobbedPaths._get_sub_patterns() caches its results per-instance, but
the function performs a pure analysis of a string, so it's safe to
cache globally.  Do so to potentially save some cycles and simplify
the code.
GlobbedPaths stores some values in its _paths dictionary that are only
updated when expand() is called with refresh=True.  In addition it
stores the original patterns there, but this is determined when the
class is created and never updated.

Extract "patterns" out into its own variable, and rename "_paths" to
"_cache" to make it clearer what its purpose is.
GlobbedPaths keeps a per-instance cache of some values.  This cache is
ignored when expand() is called with refresh=True.  Several spots in
expand() condition on `refresh`, and upcoming changes would add more

Instead clear _cache upfront if refresh is True so that multiple sites
(and the reader) don't have to worry about `refresh`.
GlobbedPaths.__bool__() returns true if the class was instantiated
with `patterns`.  Aside from checking for the special-cased ".", it
does that by calling expand().  That's doing unnecessary work, though,
and the answer can be found more directly by looking at the stored
"patterns" value.
If the globbing for an --input or --output value doesn't have any
hits, the pattern is left as is in the results and is accessible
through the {inputs} and {outputs} placeholders.  That's odd and
confusing, but it's not clear to me that the alternative of filtering
them out is better.

For inputs, an upcoming commit will error when a pattern has no hits,
so the specific behavior here doesn't matter too much.  For outputs,
unmatched non-globs need to be available in the placeholders because
they name files that should exist after the command is executed.
Perhaps there should be special treatment of patterns that match

For now, add a test to document this confusing area.
When calling glob.glob() with a pattern that doesn't match,
GlobbedPaths chops off the tail and looks for the first sub-pattern
that matches.  It does this so that run() can repeatedly glob and
install the needed subdatasets [1].  expand() returns results that
include these partial patterns.  And if the above doesn't work, the
results include the original pattern, since run() needs the value for
the common case of an --output file that doesn't yet exist.

This behavior doesn't allow run() to figure out what paths are
missing.  Extend GlobbedPaths to expose this information.  Ideally the
new strict mode would be the default mode of operation, but retain the
old behavior as the default to avoid breakage in the unlikely case
that third-party code uses GlobbedPaths.

[1] 2cf27db (BF: run: Support globbing in uninstalled datasets,
When a pattern given as --input doesn't match an existing file, run()
gives a warning and proceeds.  This is problematic because a
command-line caller can't control it with --on-failure, and a Python
caller can't easily react to the failure.  Using the GlobbedPaths
functionality added in the previous commit, teach run() to yield an
error result if an --input pattern doesn't have any matches.

The treatment of outputs stays that same, but this opens up the
possibility of later giving a warning/error if an --output pattern
doesn't match an existing file after the command is executed.

Note that using a catch-all "*" as the input in a working tree where
glob.glob("*") returns no matches will now error (as shown in added
test).  That seems acceptable given that this case is likely to be
very rare and that the special-cased "." can be used instead.

Also, note that in order to abort the rest of the run() call, the
caller will need to use --on-failure=stop, just as they would if `get`
fails to download a file [1].  We may want to move to making run()
abort in both cases.

Closes datalad#5583.
[1] datalad/datalad-container#147
Copy link

codecov bot commented Apr 21, 2021

Codecov Report

Merging #5594 (91d63ef) into maint (1878d0d) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #5594      +/-   ##
+ Coverage   90.18%   90.24%   +0.06%     
  Files         299      299              
  Lines       42117    42224     +107     
+ Hits        37984    38106     +122     
+ Misses       4133     4118      -15     
Impacted Files Coverage Δ
datalad/core/local/ 99.11% <100.00%> (+0.01%) ⬆️
datalad/core/local/tests/ 99.63% <100.00%> (+0.04%) ⬆️
datalad/interface/tests/ 99.60% <100.00%> (-0.01%) ⬇️
datalad/support/ 100.00% <100.00%> (+1.25%) ⬆️
datalad/support/tests/ 100.00% <100.00%> (ø)
datalad/downloaders/ 78.24% <0.00%> (+0.35%) ⬆️
datalad/interface/ 95.22% <0.00%> (+0.36%) ⬆️
datalad/downloaders/tests/ 91.92% <0.00%> (+2.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1878d0d...91d63ef. Read the comment docs.

# won't bother the "install, reglob" routine.
return expanded
Copy link

@bpoldrack bpoldrack Apr 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be me, but I'm confused by the indentation of that else. Where is the if it belongs to?
I suppose there should be no else at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the if it belongs to?

It belongs to the for. I can switch this to using a sentinel value to register the partial match, if you prefer.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof. Didn't know that syntax. WTF? :-D

I was legitimately confused since only the inner if seemed to make sense, but then this shouldn't work. So - execute always after loop? But how does that work out?
Anyways - now I know and with that it's clear. Thx for enlightening me!

Copy link

Overall: Didn't check the tests yet, but otherwise looks nice!

Copy link
Contributor Author

kyleam commented Apr 22, 2021

Didn't check the tests yet, but otherwise looks nice!

Great, thanks for taking a look @bpoldrack

Copy link

Thank you @kyleam ! I haven't spotted anything worth commenting on my cursory review. Let's proceed.

@yarikoptic yarikoptic merged commit d55866d into datalad:maint Apr 26, 2021
@kyleam kyleam deleted the run-glob-fixes branch April 26, 2021 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
None yet

Successfully merging this pull request may close these issues.

None yet

3 participants