Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run: Don't remove leading empty directory when removing outputs #5492

Merged
merged 3 commits into from
Mar 16, 2021

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Mar 12, 2021

If outputs aren't present (and thus can't be unlocked), run() removes
them so that the command doesn't need to provide special handling of
annex links. This is done with datalad remove, which calls git-rm
underneath. git rm path/to/file deletes the leading directories when
they are otherwise empty.

That behavior can be surprising for commands that don't ensure
outputs' directories exist. Instead unlink the file directly so
leading empty directories stick around. Unlike git-rm, this doesn't
remove the file from the index, but the end result is the same after
the save (even if the command ends up not recreating the output).

Note, however, that ideally commands should be written to not assume
that the output directory exists in HEAD's tree. This makes them
better suited to be used outside the context of the current tree
(e.g., rerun --onto), where an output's directory of course may be
pruned because the checked out tree doesn't include any tracked files
under an output's directory.

Closes #5486.


Opened against master due to the result record changes.

When looking over this code, I was sure what the behavior was.  Note
it to save future readers from wondering the same thing.

(A dedicated issue will be opened later.)
If outputs aren't present (and thus can't be unlocked), run() removes
them so that the command doesn't need to provide special handling of
annex links.  This is done with `datalad remove`, which calls git-rm
underneath. `git rm path/to/file` deletes the leading directories when
they are otherwise empty.

That behavior can be surprising for commands that don't ensure
outputs' directories exist.  Instead unlink the file directly so
leading empty directories stick around.  Unlike git-rm, this doesn't
remove the file from the index, but the end result is the same after
the save (even if the command ends up not recreating the output).

Note, however, that ideally commands should be written to not assume
that the output directory exists in HEAD's tree.  This makes them
better suited to be used outside the context of the current tree
(e.g., `rerun --onto`), where an output's directory of course may be
pruned because the checked out tree doesn't include any tracked files
under an output's directory.

Closes datalad#5486.
@codecov
Copy link

codecov bot commented Mar 12, 2021

Codecov Report

Merging #5492 (5941b7d) into master (d50e613) will decrease coverage by 18.96%.
The diff coverage is 92.59%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5492       +/-   ##
===========================================
- Coverage   90.10%   71.14%   -18.97%     
===========================================
  Files         296      293        -3     
  Lines       42103    42089       -14     
===========================================
- Hits        37937    29944     -7993     
- Misses       4166    12145     +7979     
Impacted Files Coverage Δ
datalad/core/local/run.py 98.26% <83.33%> (-0.85%) ⬇️
datalad/core/local/tests/test_run.py 99.30% <100.00%> (-0.33%) ⬇️
datalad/support/tests/utils.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/cmdline/tests/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/cmdline/tests/test_helpers.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/distribution/tests/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/distribution/tests/test_get.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/distribution/tests/test_drop.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/distribution/tests/test_utils.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/distribution/tests/test_update.py 0.00% <0.00%> (-100.00%) ⬇️
... and 159 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d50e613...822cab2. Read the comment docs.

yield dict(res, action="run-remove", status="error",
message=("Removing file failed: %s", exc_str(exc)))
else:
yield dict(res, action="run-remove", status="ok",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just an idea/observation, feel free to ignore:
our "action" values are indeed nowhere "standardized" although could be relied upon in the logic "upstairs". Typically they correspond to the actual "full" action (like run here)

here is what I see values ATM
$> git grep "\<action=[\"']" | grep -v 'form action' | sed -e "s,.*\<action=['\"]\([^\"']*\)['\"].*,\1,g" | sort -n | uniq -c | sort -n
      1 add-sibling
      1 addurl
      1 bar
      1 clean
      1 create_sibling
      1 delete
      1 discover_procedure
      1 dummy
      1 enable
      1 enable-sibling
      1 export_archive
      1 export-archive-ora
      1 export_to_figshare
      1 fetch
      1 funky
      1 no_annex
      1 push
      1 query-sibling
      1 remove-sibling
      1 run_procedure
      1 store
      1 subdataset
      1 test
      1 version
      2 add_readme
      2 add_submodule
      2 copy_file
      2 foo
      2 procedure_help
      2 uninstall
      2 wtf
      3 create
      3 drop
      3 search
      4 query
      4 remove
      4 run
      5 aggregate_metadata
      5 configure-sibling
      6 create_sibling_gitlab
      7 status
      7 unlock
      8 create-sibling-ria
      8 merge
      9 addurls
      9 metadata
     11 store_false
     12 install
     13 add
     13 save
     18 get
     21 append
     22 publish
     29 update
     32 copy
     34 diff
     69 store_true
so typically correspond to the actual user facing "action" and typically in Python syntax (with `_` as separator), although have `-` in some cases (`export-archive-ora`) and also freeform new minted action names like `query-sibling` similar to here `run-remove` which do not correspond to any command. So adding a new one here does not really break any consistency any longer, but may be it could be a good time to introduce some notion of "subaction". I.e. here it could be `run.remove` or `run:remove` to signal that it is `remove` step of `run` action, so that later code upstairs could start using such syntax to separate out action (`run`) from subaction (`remove`) in a consistent manner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically they correspond to the actual "full" action (like run here)

Right. I went with "run" first, but switched to something more specific because I was worried it'd cause confusion.

So adding a new one here does not really break any consistency any longer, but may be it could be a good time to introduce some notion of "subaction". I.e. here it could be run.remove or run:remove to signal that it is remove step of run action, so that later code upstairs could start using such syntax to separate out action (run) from subaction (remove) in a consistent manner?

I think both your suggestions ("run.remove" and "run:remove") are better than what I chose. I'll switch it over to "run.remove".

if "cannot unlock" in res["message"]:
for rem_res in remove(dataset=dset_path,
path=res["path"],
check=False, save=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note (to myself) -- check=False implies that no files are dropped (see e.g. #1823) and thus replacement with plain rm should not introduce change in behavior of absent dropping

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with only cursory analysis, seems to be ok. Left some possibly inconsequential comments

Make it clearer that "remove" is a subaction of the `run` command
rather than a command itself.

Thanks to @yarikoptic for the suggestion.

datalad#5492 (comment)
@kyleam kyleam merged commit bbbf197 into datalad:master Mar 16, 2021
@kyleam kyleam deleted the rerun-empty-dir-workaround branch March 16, 2021 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rerun: do not delete parent directories of outputs if those are otherwise empty
2 participants