
Reimplementation of drop/(uninstall)/remove #6111

Merged: 51 commits from nf-dropng into datalad:master on Nov 5, 2021

Conversation

mih
Member

@mih mih commented Oct 25, 2021

Sits on top of #6110 and will grow further. For now just a demo of the state for the curious.

Please help with naming! We need labels for

TODO:

  • drop-in replacement for the previous drop implementation
  • remove old drop implementation
  • 95+% test coverage, using a new set of tests that are confirmed to work cross-platform
  • support passing --jobs to git annex drop (-J support for datalad drop #1953); see the usage sketch after this list
  • support dropping all keys from all branches (--all (or just --git-annex-options?) for datalad drop #2328) -- a critical precondition for safely uninstalling datasets
  • drop-in replacement for uninstall
  • decide whether, with drop --what all --reckless availability, we should even attempt to drop all keys. We could simply wipe out the repo and be done (I feel like we discussed this before). Already possible to bypass with kill
  • should the outcome of git annex dead here be pushed to the default remote (if any), or all remotes? What if one of them fails? What if it needs a merge first?
    No! If, and only if, the existence of a local annex was ever communicated elsewhere (for most it will not have been), an error message must inform users what to do to communicate its pending death. But not automatically inside drop.
  • check for potential undead annex records (remove() should announce annex dead #3887)
  • same for all branches: should they be pushed to the default remote (if any), or all remotes? What if one of them fails? What if it needs a merge first?
    Same answer: no! An error message should inform about this and instruct what to do (i.e., datalad push), but nothing should happen automatically inside drop
  • verify branch state availability before dropping an entire dataset
  • can drop top-level datasets (Make uninstall uninstall top-level datasets #2967)
  • go through use case list at http://docs.datalad.org/en/stable/design/drop.html and confirm/deny support for each
  • document commands
  • decide on "unpushed" detection with adjusted branches. I think we should localsync() before testing this, and then only report the corresponding branches. The adjusted ones are not meant to be pushed anyways.
  • bring back kill
  • uninstall is a thin shim of a command around drop; old tests pass with minimal adjustments for increased safety
  • rewrite remove as a thin wrapper around drop and save, and remove all old helper code.
  • made the last usage of path annotation obsolete and dropped the entire code collection that was required to do this previously
  • rename remove(what=...) to remove(drop=...)
  • PR against -deprecated to move AnnotatePaths there (publish is built on it): Import AnnotatePaths datalad-deprecated#46
  • reckless=kill should maybe be kill-dataset; when given a file path to drop, it won't actually kill anything, nor actually affect file dropping
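
For orientation, here is a minimal usage sketch of the interface the list above is aiming at, expressed via the Python API. Parameter spellings mirror the CLI flags mentioned in the items (--what, --reckless, --jobs) and are assumptions rather than the final API; the paths are made up.

import datalad.api as dl

# drop file content only, with the default availability check
dl.drop('data/file.dat')

# uninstall a subdataset; availability of its revisions is verified first
dl.drop('subds', dataset='.', what='datasets')

# drop everything, bypassing the availability check
dl.drop('.', what='all', reckless='availability')

# parallelize the underlying `git annex drop` calls
dl.drop('bigdir', jobs=4)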

@mih mih changed the title  Reimplementation of drop according to design proposal Reimplementation of drop according to design proposal Oct 25, 2021
@mih mih added the semver-minor Increment the minor version when merged label Oct 25, 2021
@mih mih linked an issue Oct 25, 2021 that may be closed by this pull request
@mih mih added the cmd-drop label Oct 25, 2021
@codecov

codecov bot commented Oct 25, 2021

Codecov Report

Merging #6111 (ccadd0c) into master (f7e89af) will decrease coverage by 35.74%.
The diff coverage is 87.56%.

❗ Current head ccadd0c differs from pull request most recent head d3e67f8. Consider uploading reports for the commit d3e67f8 to get more accurate results

@@             Coverage Diff             @@
##           master    #6111       +/-   ##
===========================================
- Coverage   89.82%   54.07%   -35.75%     
===========================================
  Files         317      318        +1     
  Lines       42417    41835      -582     
===========================================
- Hits        38100    22624    -15476     
- Misses       4317    19211    +14894     
Impacted Files Coverage Δ
datalad/core/distributed/tests/test_clone.py 0.00% <0.00%> (-97.44%) ⬇️
datalad/core/distributed/tests/test_push.py 0.00% <0.00%> (-98.94%) ⬇️
datalad/core/local/create.py 87.50% <ø> (-10.30%) ⬇️
datalad/core/local/tests/test_create.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/core/local/tests/test_run.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/core/local/tests/test_save.py 0.00% <0.00%> (-98.03%) ⬇️
datalad/core/local/tests/test_status.py 0.00% <ø> (-98.49%) ⬇️
datalad/distribution/create_sibling.py 15.88% <0.00%> (-49.99%) ⬇️
datalad/distribution/dataset.py 85.29% <ø> (-11.35%) ⬇️
datalad/distribution/drop.py 0.00% <0.00%> (-96.52%) ⬇️
... and 241 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f7e89af...d3e67f8.


@mih mih changed the title from "Reimplementation of drop according to design proposal" to "Reimplementation of drop/(uninstall)/remove" Oct 29, 2021
@mih mih linked an issue Oct 29, 2021 that may be closed by this pull request
@mih
Member Author

mih commented Nov 5, 2021

Good FS:

% python -m nose -s -v datalad.local.tests.test_remove datalad.distributed.tests.test_drop datalad.distribution.tests.test_uninstall  --with-coverage --cover-erase --cover-package datalad.local.remove --cover-package datalad.distributed.drop
Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      8    96%   189-190, 324-331, 403, 562, 607, 716
datalad/local/remove.py          56      0   100%
-----------------------------------------------------------
TOTAL                           265      8    97%
----------------------------------------------------------------------
Ran 30 tests in 109.508s

OK

Crippled FS (hence by-passing the often inappropriate if-on-windows conditionals):

======================================================================
FAIL: datalad.distribution.tests.test_uninstall.test_remove_file_handle_only
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/hacking/datalad/git/datalad/distribution/tests/test_uninstall.py", line 292, in test_remove_file_handle_only
    eq_(rpath_one, (ds.pathobj / 'two').resolve())
AssertionError: PosixPath('/media/mih/Samsung_T5/tmp/datalad_temp_tree_test_remove_file_handle_onlyonew6nnh/one') != PosixPath('/media/mih/Samsung_T5/tmp/datalad_temp_tree_test_remove_file_handle_onlyonew6nnh/two')

======================================================================
FAIL: datalad.distribution.tests.test_uninstall.test_uninstall_multiple_paths
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/hacking/datalad/git/datalad/distribution/tests/test_uninstall.py", line 248, in test_uninstall_multiple_paths
    ok_(all([f.endswith('keep') for f in files_left if exists(f) and not isdir(f)]))
AssertionError: None

======================================================================
FAIL: datalad.distribution.tests.test_uninstall.test_uninstall_recursive
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/hacking/datalad/git/datalad/distribution/tests/test_uninstall.py", line 339, in test_uninstall_recursive
    ok_(not exists(lname))
AssertionError: None

Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      7    97%   189-190, 324-331, 403, 562, 716
datalad/local/remove.py          56      0   100%
-----------------------------------------------------------
TOTAL                           265      7    97%
----------------------------------------------------------------------
Ran 30 tests in 317.609s

FAILED (failures=3)

3x slower, some test assumptions broken.

Merge individual still-unique aspects into the new tests.
This type of subdataset removal is tested in
`test_remove_subdataset_nomethod`, where it does not require the network.
RF all usage of deprecated commands and arguments.
@mih
Member Author

mih commented Nov 5, 2021

Consolidated the tests and ported them for cross-platform compatibility. Now approximately the same speed and identical, near-complete coverage:

Good FS:

Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      5    98%   197, 324-331, 403, 607
datalad/local/remove.py          56      1    98%   142
-----------------------------------------------------------
TOTAL                           265      6    98%
----------------------------------------------------------------------
Ran 22 tests in 56.195s

OK

Crippled FS:

Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      5    98%   197, 324-331, 403, 607
datalad/local/remove.py          56      1    98%   142
-----------------------------------------------------------
TOTAL                           265      6    98%
----------------------------------------------------------------------
Ran 22 tests in 77.139s

OK

Co-authored-by: Adina Wagner <adina.wagner@t-online.de>
@bpoldrack
Member

Not a huge issue, since it's not destroying data, but I wonder whether the availability check for revisions is ideal.
Example: I have a dataset w/ a subdataset. I clone that thing and in the clone I remove origin from the subdataset. If I now want to drop that subds from the clone, drop is complaining about no known remote. On the one hand, that's technically right, since that dataset doesn't know about a remote. On the other hand: If I drop it with --reckless availability, I can get it again w/o issues. At some level we know that it's fine. Looks to me as if the check is running on the knowledge of the to-be-dropped dataset only, while the reverse operation (get) would use the info as available from the superdataset.

Don't immediately see a better solution, but someone else might.

(datalad-dev) ben@tree in /tmp/test_drop_clone/subds on git:master
❱ git remote remove origin
(datalad-dev) ben@tree in /tmp/test_drop_clone/subds on git:master
❱ cd ..
(datalad-dev) ben@tree in /tmp/test_drop_clone on git:master
❱ datalad -f json_pp drop -d. subds --what datasets                                                                                                                                                            1 !
{
  "action": "uninstall",
  "message": [
    "to-be-dropped dataset has revisions that are not available at any known sibling. Use `datalad push --to ...` to push these before dropping the local dataset, or ignore via `--reckless availability`. Unique revisions: %s",
    [
      "master"
    ]
  ],
  "path": "/tmp/test_drop_clone/subds",
  "refds": "/tmp/test_drop_clone",
  "status": "error",
  "type": "dataset"
}
(datalad-dev) ben@tree in /tmp/test_drop_clone on git:master
❱ datalad -f json_pp drop -d. subds --what datasets --reckless availability                                                                                                                                    1 !
{
  "action": "uninstall",
  "path": "/tmp/test_drop_clone/subds",
  "refds": "/tmp/test_drop_clone",
  "status": "ok",
  "type": "dataset"
}
(datalad-dev) ben@tree in /tmp/test_drop_clone on git:master
❱ datalad get subds
[INFO   ] scanning for unlocked files (this may take some time)                                                                                                                                                    
install(ok): /tmp/test_drop_clone/subds (dataset) [Installed subdataset in order to get /tmp/test_drop_clone/subds]
get(ok): subds/another (file) [from origin...]                                                                                                                                                                     
action summary:                                                                                                                                                                                                    
  get (ok: 1)
  install (ok: 1)

@bpoldrack
Member

Dropping a regular subdirectory w/ --what datasets returns absolutely nothing. No notneeded, no error - just silence.

❱ datalad -f json_pp drop somedir --what datasets

@bpoldrack
Member

bpoldrack commented Nov 5, 2021

Similar: dropping an already dropped subds doesn't return anything either. Shouldn't that be a notneeded?

Technically likely the same thing:

❱ datalad -f json_pp drop file.txt --what datasets 
❱

@mih
Member Author

mih commented Nov 5, 2021

Dropping a regular subdirectory w/ --what datasets returns absolutely nothing. No notneeded, no error - just silence.

Doing this for the simple case is easy; for the general case it is not. How would you determine for any given input path whether it had absolutely no consequences -- keeping in mind:

  • recursive operation
  • multiple path arguments
  • multiple path arguments pointing to different datasets

The desire for this type of reporting originally led to AnnotatePaths. We should keep in mind that notneeded results are inconsequential in general, and typically not even displayed. I doubt that it is worth complicating the code for it, or even investing runtime to determine whether this is happening.

Same here ...

Similar: dropping an already dropped subds doesn't return anything either. Shouldn't that be a notneeded?

How much time should we invest? An empty dir could be an uninstalled subdataset, or just an empty dir. A non-existing path could be an uninstalled subdataset, or not exist at all. Do we report notneeded for first-level uninstalled datasets? But why not for the next level? -- they also do not exist. So generating these inconsequential results always requires inspection, slows normal operations, and complicates the code, because we need to track numerous combinations of when a report qualifies as "having said something already". The current implementation reports notneeded whenever it is cheap, or when required for backward compatibility. But it tries hard to avoid runtime costs for beautification.
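
For concreteness, such a notneeded result record would have the same shape as the JSON result records shown earlier in this thread (action, path, status, type, message); the values below are made up for illustration:

# hypothetical notneeded record; field names follow the results shown
# elsewhere in this thread, values are invented
dict(action='drop',
     path='/tmp/some/empty-dir',
     status='notneeded',
     type='directory',
     message='nothing to drop from here')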

Example: I have a dataset w/ a subdataset. I clone that thing and in the clone I remove origin from the subdataset. If I now want to drop that subds from the clone, drop is complaining about no known remote. On the one hand, that's technically right, since that dataset doesn't know about a remote. On the other hand: If I drop it with --reckless availability, I can get it again w/o issues. At some level we know that it's fine. Looks to me as if the check is running on the knowledge of the to-be-dropped dataset only, while the reverse operation (get) would use the info as available from the superdataset.

What are you proposing? That on-drop the superdataset is inspected for whether it can generate any candidate URL from which a repository can be cloned, which is then inspected for whether it has all the commits needed by the to-be-dropped subdataset?

@loj
Collaborator

loj commented Nov 5, 2021

This feels good! I tried it out for a couple of different workflows/situations in which I commonly used drop/uninstall/remove, and I have no complaints. :-)

I just found the small typo in the docs for --reckless undead.

@bpoldrack
Member

bpoldrack commented Nov 5, 2021

How would you determine for any given input path whether it had absolutely no consequences

I might be wrong, but shouldn't such a case be discoverable right after determining paths_by_ds? A given path pointing to a regular subdir should be associated with a dataset, rather than generate its own entry, right? So, _drop_dataset should be able to tell. In fact, I think right here:

https://github.com/datalad/datalad/pull/6111/files#diff-f0ac845bc6391c474eb58b0c1448b7f6bc00cb5585dab58777d8d9512f99c06bR371

(link doesn't work properly. Line 371 in that diff's drop.py)

The comment right above that return basically acknowledges this. We have a path of which we know that at most keys could be dropped, but what was datasets, so we don't do anything. Fine. Why not make that a notneeded result with a respective message?

Edit: Arguably, it's even impossible by the assessment in that code block.

# so we have paths constraints that prevent dropping the full dataset
# there is nothing to do here, but to drop keys, which we must not
# done
return
Member

@mih

Something along these lines is what I mean. When do we end up here? AFAICS, when we have path(s) that are part of THIS dataset (as opposed to a subdataset) and we have what='datasets'. So, phrasing it like the comment ("constraining the drop of THIS dataset") is one way, but "trying to drop subdataset(s) where there are none" is another.

Maybe I just don't see how we can end up here in a way that this message would be misleading. I don't know.

Suggested change
-    return
+    for p in paths:
+        yield dict(action="drop",
+                   path=p,
+                   status="impossible",
+                   message=f"what=datasets was given, but no dataset can be dropped from {p}",
+                   type="file")
+    return

Member Author

I am afraid this is not correct. It is perfectly possible that the paths you are reporting on have been originally used to drop datasets underneath it. impossible would be wrong, because something might have been done already, and file is a guess.

Member

It is perfectly possible that the paths you are reporting on have been originally used to drop datasets underneath it.

Hm. I fail to produce that, but thanks for the hint. Trying to figure that out.

@bpoldrack
Member

FTR: I'm testing my suggestion.

As for this:

What are you proposing? That on-drop the superdataset is inspected for whether it can generate any candidate URL from which a repository can be cloned, which is then inspected for whether it has all the commits needed by the to-be-dropped subdataset?

No - too expensive, obviously. As I said: No immediate idea. Just stumbled upon it. If you don't see a solution either - it's fine with me as is. I agree it's not worth making things more expensive.

We should keep in mind that notneeded results are inconsequential in general, and typically not even displayed. I doubt that it is worth complicating the code for it, or even investing runtime to determine whether this is happening.

Agree with the latter. Wouldn't phrase the former exactly like that. Via the Python interface I'd consider notneeded results much nicer to deal with than having to account for results sometimes, exceptions in other cases, and absolutely nothing as yet another way datalad commands can behave. So - not worth paying a significant cost, but worth a second thought from my POV.
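
To illustrate the point, a minimal sketch (assuming the drop signature discussed in this PR; on_failure is a standard datalad call option): with explicit result records, every outcome -- ok, notneeded, impossible, error -- can be handled uniformly in one loop, whereas "absolutely nothing" needs its own special case.

import datalad.api as dl

res = dl.drop('somedir', what='datasets', on_failure='ignore')
if not res:
    # the behavior discussed above: no result records at all
    print('drop returned no results')
for r in res:
    print(r['status'], r['path'], r.get('message'))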

@mih
Member Author

mih commented Nov 5, 2021

Conclusion from chat: Let's put the exploration and improvements possibly coming from @bpoldrack's comments in a separate PR. This one is already huge, and the suggested behavior was also not exhibited by the implementation that is being replaced here.

@mih mih merged commit 3cbc5cd into datalad:master Nov 5, 2021
@mih mih deleted the nf-dropng branch November 5, 2021 14:12
@mih mih mentioned this pull request Mar 17, 2022