
Reimplementation of drop/(uninstall)/remove #6111

Merged: 51 commits from nf-dropng into datalad:master on Nov 5, 2021

Conversation

mih
Member

@mih mih commented Oct 25, 2021

Sits on top of #6110 and will grow further. For now just a demo of the state for the curious.

Please help with naming! We need labels for

TODO:

  • drop-in replacement for the previous drop implementation
  • remove old drop implementation
  • 95+% test coverage, using a new set of tests that are confirmed to work cross-platform
  • support passing --jobs to git annex drop (-J support for datalad drop #1953); see the usage sketch after this list
  • support dropping all keys from all branches (--all (or just --git-annex-options?) for datalad drop #2328) -- a critical precondition for safely uninstalling datasets
  • drop-in replacement for uninstall
  • decide whether, with drop --what all --reckless availability, we should even attempt to drop all keys. We could simply wipe out the repo and be done (I feel like we discussed this before). Already possible to bypass with kill
  • should the outcome of git annex dead here be pushed to the default remote (if any), or all remotes? What if one of them fails? What if it needs a merge first?
    No! If, and only if, the existence of a local annex was ever communicated elsewhere (for most it will not have been), an error message must inform users what to do to communicate its pending death. But not automatically inside drop.
  • check for potential undead annex records (remove() should announce annex dead #3887)
  • same for all branches: should they be pushed to the default remote (if any), or all remotes? What if one of them fails? What if it needs a merge first?
    Same answer: no! An error message should inform about this and instruct what to do (i.e., datalad push), but nothing should happen automatically inside drop
  • verify branch state availability before dropping an entire dataset
  • can drop top-level datasets (Make uninstall uninstall top-level datasets #2967)
  • go through use case list at http://docs.datalad.org/en/stable/design/drop.html and confirm/deny support for each
  • document commands
  • decide on "unpushed" detection with adjusted branches. I think we should localsync() before testing this, and then only report the corresponding branches. The adjusted ones are not meant to be pushed anyways.
  • bring back kill
  • uninstall is a thin shim of a command around drop; old tests pass with minimal adjustments for increased safety
  • rewrite remove as a thin wrapper around drop and save, and remove all old helper code.
  • made the last usage of path annotation obsolete and dropped the entire code collection that was required to do this previously
  • rename remove(what=...) to remove(drop=...)
  • PR against -deprecated to move AnnotatePaths there (publish is built on it): Import AnnotatePaths datalad-deprecated#46
  • reckless=kill should maybe be kill-dataset; when given a file path to drop, it won't actually kill anything, nor actually affect file dropping
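
For orientation, here is a minimal usage sketch of the interface the list above is aiming at, expressed via the Python API. Parameter spellings mirror the CLI flags mentioned in the items (--what, --reckless, --jobs) and are assumptions rather than the final API; the paths are made up.

import datalad.api as dl

# drop file content only, with the default availability check
dl.drop('data/file.dat')

# uninstall a subdataset; availability of its revisions is verified first
dl.drop('subds', dataset='.', what='datasets')

# drop everything, bypassing the availability check
dl.drop('.', what='all', reckless='availability')

# parallelize the underlying `git annex drop` calls
dl.drop('bigdir', jobs=4)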

@mih mih changed the title  Reimplementation of drop according to design proposal Reimplementation of drop according to design proposal Oct 25, 2021
@mih mih added the semver-minor Increment the minor version when merged label Oct 25, 2021
@mih mih linked an issue Oct 25, 2021 that may be closed by this pull request
@mih mih added the cmd-drop label Oct 25, 2021
@codecov

codecov bot commented Oct 25, 2021

Codecov Report

Merging #6111 (ccadd0c) into master (f7e89af) will decrease coverage by 35.74%.
The diff coverage is 87.56%.

❗ Current head ccadd0c differs from pull request most recent head d3e67f8. Consider uploading reports for the commit d3e67f8 to get more accurate results

@@             Coverage Diff             @@
##           master    #6111       +/-   ##
===========================================
- Coverage   89.82%   54.07%   -35.75%     
===========================================
  Files         317      318        +1     
  Lines       42417    41835      -582     
===========================================
- Hits        38100    22624    -15476     
- Misses       4317    19211    +14894     
Impacted Files Coverage Δ
datalad/core/distributed/tests/test_clone.py 0.00% <0.00%> (-97.44%) ⬇️
datalad/core/distributed/tests/test_push.py 0.00% <0.00%> (-98.94%) ⬇️
datalad/core/local/create.py 87.50% <ø> (-10.30%) ⬇️
datalad/core/local/tests/test_create.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/core/local/tests/test_run.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/core/local/tests/test_save.py 0.00% <0.00%> (-98.03%) ⬇️
datalad/core/local/tests/test_status.py 0.00% <ø> (-98.49%) ⬇️
datalad/distribution/create_sibling.py 15.88% <0.00%> (-49.99%) ⬇️
datalad/distribution/dataset.py 85.29% <ø> (-11.35%) ⬇️
datalad/distribution/drop.py 0.00% <0.00%> (-96.52%) ⬇️
... and 241 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f7e89af...d3e67f8.


@mih mih changed the title from "Reimplementation of drop according to design proposal" to "Reimplementation of drop/(uninstall)/remove" Oct 29, 2021
@mih mih linked an issue Oct 29, 2021 that may be closed by this pull request
@mih
Member Author

mih commented Nov 5, 2021

Good FS:

% python -m nose -s -v datalad.local.tests.test_remove datalad.distributed.tests.test_drop datalad.distribution.tests.test_uninstall  --with-coverage --cover-erase --cover-package datalad.local.remove --cover-package datalad.distributed.drop
Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      8    96%   189-190, 324-331, 403, 562, 607, 716
datalad/local/remove.py          56      0   100%
-----------------------------------------------------------
TOTAL                           265      8    97%
----------------------------------------------------------------------
Ran 30 tests in 109.508s

OK

Crippled FS (hence by-passing the often inappropriate if-on-windows conditionals):

======================================================================
FAIL: datalad.distribution.tests.test_uninstall.test_remove_file_handle_only
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/hacking/datalad/git/datalad/distribution/tests/test_uninstall.py", line 292, in test_remove_file_handle_only
    eq_(rpath_one, (ds.pathobj / 'two').resolve())
AssertionError: PosixPath('/media/mih/Samsung_T5/tmp/datalad_temp_tree_test_remove_file_handle_onlyonew6nnh/one') != PosixPath('/media/mih/Samsung_T5/tmp/datalad_temp_tree_test_remove_file_handle_onlyonew6nnh/two')

======================================================================
FAIL: datalad.distribution.tests.test_uninstall.test_uninstall_multiple_paths
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/hacking/datalad/git/datalad/distribution/tests/test_uninstall.py", line 248, in test_uninstall_multiple_paths
    ok_(all([f.endswith('keep') for f in files_left if exists(f) and not isdir(f)]))
AssertionError: None

======================================================================
FAIL: datalad.distribution.tests.test_uninstall.test_uninstall_recursive
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mih/hacking/datalad/git/datalad/distribution/tests/test_uninstall.py", line 339, in test_uninstall_recursive
    ok_(not exists(lname))
AssertionError: None

Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      7    97%   189-190, 324-331, 403, 562, 716
datalad/local/remove.py          56      0   100%
-----------------------------------------------------------
TOTAL                           265      7    97%
----------------------------------------------------------------------
Ran 30 tests in 317.609s

FAILED (failures=3)

3x slower, some test assumptions broken.

Merge individual still-unique aspects into the new tests.
This type of subdataset removal is tested in
`test_remove_subdataset_nomethod`, where it does not require the network.
RF all usage of deprecated commands and arguments.
@mih
Member Author

mih commented Nov 5, 2021

Consolidated the tests and ported them for cross-platform compatibility. Now approximately the same speed and identical, near-complete coverage:

Good FS:

Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      5    98%   197, 324-331, 403, 607
datalad/local/remove.py          56      1    98%   142
-----------------------------------------------------------
TOTAL                           265      6    98%
----------------------------------------------------------------------
Ran 22 tests in 56.195s

OK

Crippled FS:

Name                          Stmts   Miss  Cover   Missing
-----------------------------------------------------------
datalad/distributed/drop.py     209      5    98%   197, 324-331, 403, 607
datalad/local/remove.py          56      1    98%   142
-----------------------------------------------------------
TOTAL                           265      6    98%
----------------------------------------------------------------------
Ran 22 tests in 77.139s

OK

Co-authored-by: Adina Wagner <adina.wagner@t-online.de>
@bpoldrack
Member

Not a huge issue, since it's not destroying data, but I wonder whether the availability check for revisions is ideal.
Example: I have a dataset w/ a subdataset. I clone that thing and in the clone I remove origin from the subdataset. If I now want to drop that subds from the clone, drop is complaining about no known remote. On the one hand, that's technically right, since that dataset doesn't know about a remote. On the other hand: If I drop it with --reckless availability, I can get it again w/o issues. At some level we know that it's fine. Looks to me as if the check is running on the knowledge of the to-be-dropped dataset only, while the reverse operation (get) would use the info as available from the superdataset.

Don't immediately see a better solution, but someone else might.

(datalad-dev) ben@tree in /tmp/test_drop_clone/subds on git:master
❱ git remote remove origin
(datalad-dev) ben@tree in /tmp/test_drop_clone/subds on git:master
❱ cd ..
(datalad-dev) ben@tree in /tmp/test_drop_clone on git:master
❱ datalad -f json_pp drop -d. subds --what datasets                                                                                                                                                            1 !
{
  "action": "uninstall",
  "message": [
    "to-be-dropped dataset has revisions that are not available at any known sibling. Use `datalad push --to ...` to push these before dropping the local dataset, or ignore via `--reckless availability`. Unique revisions: %s",
    [
      "master"
    ]
  ],
  "path": "/tmp/test_drop_clone/subds",
  "refds": "/tmp/test_drop_clone",
  "status": "error",
  "type": "dataset"
}
(datalad-dev) ben@tree in /tmp/test_drop_clone on git:master
❱ datalad -f json_pp drop -d. subds --what datasets --reckless availability                                                                                                                                    1 !
{
  "action": "uninstall",
  "path": "/tmp/test_drop_clone/subds",
  "refds": "/tmp/test_drop_clone",
  "status": "ok",
  "type": "dataset"
}
(datalad-dev) ben@tree in /tmp/test_drop_clone on git:master
❱ datalad get subds
[INFO   ] scanning for unlocked files (this may take some time)                                                                                                                                                    
install(ok): /tmp/test_drop_clone/subds (dataset) [Installed subdataset in order to get /tmp/test_drop_clone/subds]
get(ok): subds/another (file) [from origin...]                                                                                                                                                                     
action summary:                                                                                                                                                                                                    
  get (ok: 1)
  install (ok: 1)

@bpoldrack
Member

Dropping a regular subdirectory w/ --what datasets returns absolutely nothing. No notneeded, no error - just silence.

❱ datalad -f json_pp drop somedir --what datasets

@bpoldrack
Member

bpoldrack commented Nov 5, 2021

Similar: dropping an already dropped subds doesn't return anything either. Shouldn't that be a notneeded?

Technically likely the same thing:

❱ datalad -f json_pp drop file.txt --what datasets 
❱

@mih
Member Author

mih commented Nov 5, 2021

Dropping a regular subdirectory w/ --what datasets returns absolutely nothing. No notneeded, no error - just silence.

Doing this for the simple case is easy; for the general case it is not. How would you determine for any given input path whether it had absolutely no consequences -- keeping in mind:

  • recursive operation
  • multiple path arguments
  • multiple path arguments pointing to different datasets

The desire for this type of reporting originally led to AnnotatePaths. We should keep in mind that notneeded results are inconsequential in general, and typically not even displayed. I doubt that it is worth complicating the code for it, or even investing runtime to determine whether this is happening.

Same here ...

Similar: dropping an already dropped subds doesn't return anything either. Shouldn't that be a notneeded?

How much time should we invest? An empty dir could be an uninstalled subdataset, or just an empty dir. A non-existing path could be an uninstalled subdataset, or not exist at all. Do we report notneeded for first-level uninstalled datasets? But why not for the next level? -- they also do not exist. So generating these inconsequential results always requires inspection, slows normal operations, and complicates the code, because we need to track numerous combinations of when a report qualifies as "having said something already". The current implementation reports notneeded whenever it is cheap, or when required for backward compatibility. But it tries hard to avoid runtime costs for beautification.
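
For concreteness, such a notneeded result record would have the same shape as the JSON result records shown earlier in this thread (action, path, status, type, message); the values below are made up for illustration:

# hypothetical notneeded record; field names follow the results shown
# elsewhere in this thread, values are invented
dict(action='drop',
     path='/tmp/some/empty-dir',
     status='notneeded',
     type='directory',
     message='nothing to drop from here')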

Example: I have a dataset w/ a subdataset. I clone that thing and in the clone I remove origin from the subdataset. If I now want to drop that subds from the clone, drop is complaining about no known remote. On the one hand, that's technically right, since that dataset doesn't know about a remote. On the other hand: If I drop it with --reckless availability, I can get it again w/o issues. At some level we know that it's fine. Looks to me as if the check is running on the knowledge of the to-be-dropped dataset only, while the reverse operation (get) would use the info as available from the superdataset.

What are you proposing? That on-drop the superdataset is inspected for whether it can generate any candidate URL from which a repository can be cloned, which is then inspected for whether it has all the commits needed by the to-be-dropped subdataset?

@loj
Collaborator

loj commented Nov 5, 2021

This feels good! I tried it out for a couple of different workflows/situations in which I commonly used drop/uninstall/remove, and I have no complaints. :-)

I just found the small typo in the docs for --reckless undead.

@bpoldrack
Member

bpoldrack commented Nov 5, 2021

How would you determine for any given input path whether it had absolutely no consequences

I might be wrong, but shouldn't such a case be discoverable right after determining paths_by_ds? A given path pointing to a regular subdir should be associated with a dataset, rather than generate its own entry, right? So, _drop_dataset should be able to tell. In fact, I think right here:

https://github.com/datalad/datalad/pull/6111/files#diff-f0ac845bc6391c474eb58b0c1448b7f6bc00cb5585dab58777d8d9512f99c06bR371

(link doesn't work properly. Line 371 in that diff's drop.py)

The comment right above that return basically acknowledges this. We have a path of which we know that at most keys could be dropped, but what was datasets, so we don't do anything. Fine. Why not make that a notneeded result with a respective message?

Edit: Arguably, it's even impossible by the assessment in that code block.

# so we have paths constraints that prevent dropping the full dataset
# there is nothing to do here, but to drop keys, which we must not
# done
return
Member

@mih

Something along these lines is what I mean. When do we end up here? AFAICS, when we have path(s) that are part of THIS dataset (as opposed to a subdataset) and we have what='datasets'. So, phrasing it like the comment ("constraining the drop of THIS dataset") is one way, but "trying to drop subdataset(s) where there are none" is another.

Maybe I just don't see how we can end up here in a way that this message would be misleading. I don't know.

Suggested change
-    return
+    for p in paths:
+        yield dict(action="drop",
+                   path=p,
+                   status="impossible",
+                   message=f"what=datasets was given, but no dataset can be dropped from {p}",
+                   type="file")
+    return

Member Author

I am afraid this is not correct. It is perfectly possible that the paths you are reporting on have been originally used to drop datasets underneath it. impossible would be wrong, because something might have been done already, and file is a guess.

Member

It is perfectly possible that the paths you are reporting on have been originally used to drop datasets underneath it.

Hm. I fail to produce that, but thanks for the hint. Trying to figure that out.

@bpoldrack
Member

FTR: I'm testing my suggestion.

As for this:

What are you proposing? That on-drop the superdataset is inspected for whether it can generate any candidate URL from which a repository can be cloned, which is then inspected for whether it has all the commits needed by the to-be-dropped subdataset?

No - too expensive, obviously. As I said: No immediate idea. Just stumbled upon it. If you don't see a solution either - it's fine with me as is. I agree it's not worth making things more expensive.

We should keep in mind that notneeded results are inconsequential in general, and typically not even displayed. I doubt that it is worth complicating the code for it, or even investing runtime to determine whether this is happening.

Agree with the latter. Wouldn't phrase the former exactly like that. Via the Python interface I'd consider notneeded results much nicer to deal with than having to account for results sometimes, exceptions in other cases, and absolutely nothing as yet another way datalad commands can behave. So - not worth paying a significant cost, but worth a second thought from my POV.
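
To illustrate the point, a minimal sketch (assuming the drop signature discussed in this PR; on_failure is a standard datalad call option): with explicit result records, every outcome -- ok, notneeded, impossible, error -- can be handled uniformly in one loop, whereas "absolutely nothing" needs its own special case.

import datalad.api as dl

res = dl.drop('somedir', what='datasets', on_failure='ignore')
if not res:
    # the behavior discussed above: no result records at all
    print('drop returned no results')
for r in res:
    print(r['status'], r['path'], r.get('message'))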

@mih
Member Author

mih commented Nov 5, 2021

Conclusion from chat: Let's put the exploration and improvements possibly coming from @bpoldrack's comments in a separate PR. This one is already huge, and the suggested behavior was also not exhibited by the implementation that is being replaced here.

@mih mih merged commit 3cbc5cd into datalad:master Nov 5, 2021
@mih mih deleted the nf-dropng branch November 5, 2021 14:12
@mih mih mentioned this pull request Mar 17, 2022