Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dry run option for clean command #5738

Merged
merged 4 commits into from
Aug 5, 2021
Merged

Conversation

bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Jun 11, 2021

There are several places where temporary files could be left by datalad operations. datalad clean now comes with a --dry-run option to just report on what could be cleaned.
Furthermore, the datalad-archives special remote now issues an INFO message to annex when it's actually extracting an archive in order to let a user know they may want to call datalad clean afterwards. This should make the presence of an extraction cache less surprising to users and make them aware of the clean command.

The issue leading to this PR and the discussion therein is #5519, although this is not actually solving it. However, at least this is an improvement of the UX aspect of the issue.

Closes #5834

@bpoldrack
Copy link
Member Author

bpoldrack commented Jun 11, 2021

Note, that this is for now replacing PR #5730, as an actual solution turns out to be more of an effort than I can currently afford. So, not closing #5519 really, but at least let the user know what's going on and how to remove that cache.

@mih : Guess, with that the issue can be regarded as addressed within context of https://github.com/datalad/datalad/projects/13, since the "real solution" is actually significantly more than a "simple UX issue".

Edit:
Or may be in addition I should look into your idea of letting status report on the existence of the cache.

@codecov
Copy link

codecov bot commented Jun 11, 2021

Codecov Report

Merging #5738 (efe24aa) into master (88a0546) will decrease coverage by 2.09%.
The diff coverage is 86.40%.

❗ Current head efe24aa differs from pull request most recent head 2a46f3d. Consider uploading reports for the commit 2a46f3d to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5738      +/-   ##
==========================================
- Coverage   90.64%   88.54%   -2.10%     
==========================================
  Files         309      307       -2     
  Lines       42393    42513     +120     
==========================================
- Hits        38426    37643     -783     
- Misses       3967     4870     +903     
Impacted Files Coverage Δ
datalad/core/local/tests/test_diff.py 99.53% <ø> (ø)
datalad/core/local/tests/test_run.py 100.00% <ø> (ø)
datalad/core/local/tests/test_save.py 98.02% <ø> (ø)
datalad/customremotes/datalad.py 39.34% <ø> (-0.98%) ⬇️
datalad/dataset/repo.py 95.38% <ø> (-0.14%) ⬇️
datalad/distributed/create_sibling_ria.py 82.65% <ø> (-0.10%) ⬇️
datalad/distributed/export_to_figshare.py 26.41% <ø> (+0.32%) ⬆️
...talad/distributed/tests/test_export_to_figshare.py 100.00% <ø> (ø)
datalad/distribution/dataset.py 94.42% <ø> (-0.03%) ⬇️
datalad/distribution/get.py 97.89% <ø> (-0.02%) ⬇️
... and 158 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 88a0546...2a46f3d. Read the comment docs.

@bpoldrack
Copy link
Member Author

bpoldrack commented Jun 11, 2021

Looking into more reporting on this. Thus proposing --info for clean, which in turn could be called by status.
Proposal in f2624dc WDYT, @datalad/developers ?

Yes - needs tests. I'm aware. :)

@bpoldrack bpoldrack changed the title UX: Let datalad-archives hint to datalad clean UX: Give hints to datalad's clean command Jun 11, 2021
Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, I think linking this into status is a reasonable way to give this aspect the visibility it needs in order to avoid accumulation of cruft.

That being said: I personally see the reporting on cleanable locations not as a core function of status, hence I do not think it should actually yield such results (unconditionally). A programmatic user that is concerned with cleaning things up, can, will, and has to make a dedicated call to clean. With the present change, clean will be called twice for them (once with --info and once to actually clean). -- and once for everybody else, unconditionally.

I think it would make sense to move the call to clean into custom_result_summary_renderer() and only report on cleanable locations, if there are any. This setup ensures that the cycles necessary are only spent when there is a chance that a (human) consumer catches the notification, and, e.g. not in pipes with datalad -f json.

@bpoldrack
Copy link
Member Author

I think it would make sense to move the call to clean into custom_result_summary_renderer() and only report on cleanable locations, if there are any.

Makes sense to me. Thx.

@bpoldrack
Copy link
Member Author

only report on cleanable locations, if there are any.

FTR: That was the case already.

However, changed to what you suggested, I think, @mih. This now includes a commit to pass original kwargs to custom summary renderers (as we already do for custom result renderers).

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As always, the more one thinks, the more thoughts come out ;-)

Below are a bunch of additional comments. Also this PR has already shed some light for me personally. I am afraid, this kind of output will be with us frequently:

% datalad clone https://github.com/psychoinformatics-de/studyforrest-data-structural
install(ok): /tmp/studyforrest-data-structural (dataset)
% cd studyforrest-data-structural
studyforrest-data-structural (git)-[master] % datalad
studyforrest-data-structural (git)-[master] % datalad get sub-01/anat/sub-01_T1w.nii.gz
get(ok): sub-01/anat/sub-01_T1w.nii.gz (file) [from mddatasrc...]                                                          
studyforrest-data-structural (git)-[master] % datalad status
.git/annex/transfer: Discovered 1 annex temporary transfer directory: download
Note: You can remove temporary files anytime by using datalad's 'clean' command.

datalad/core/local/status.py Outdated Show resolved Hide resolved
datalad/interface/clean.py Outdated Show resolved Hide resolved
datalad/interface/clean.py Outdated Show resolved Hide resolved
datalad/interface/utils.py Outdated Show resolved Hide resolved
datalad/core/local/status.py Outdated Show resolved Hide resolved
datalad/core/local/status.py Outdated Show resolved Hide resolved
Comment on lines 401 to 403
self.info("%s special remote is using an extraction cache "
"under %s. You may want to call 'datalad clean' "
"to remove it afterwards." %
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.info("%s special remote is using an extraction cache "
"under %s. You may want to call 'datalad clean' "
"to remove it afterwards." %
self.info("%s special remote created an extraction cache "
"under %s. Remove it with DataLad's 'clean' command "
"to save disk space." %

There can be performance loss when it is removed. I would prefer concrete advice when to do what, rather than a more nebulous statement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see that. However, this is the special remote itself. Technically it's kinda wrong to point to datalad commands to begin with + long message would be there all the time. Wondering: We usually call annex with --json by now, giving us an info object that we decide to spit out (rather than annex itself). So, we could use that for a short message that we catch in datalad and can potentially decide to enhance or swallow from within a context where we know much more about the situation than within the special remote. Not too clear to me at this point, just thinking about principles how to deal with messaging better in general.

Special remotes issues an INFO message to annex when it's actually
extracting an archive in order to let a user know they may want to call
`datalad clean` afterwards. This doesn't resolve the underlying issue of
keeping an extraction cache around (see dataladgh-5519), but at least improves
on UX for now.
With this switch `clean` reports on existing temporary locations, that
it is able to clean up. This modes yields result records with
'action="clean [dry-run]"' and thereby is inline with `run --dry-run`.
@bpoldrack
Copy link
Member Author

bpoldrack commented Jul 19, 2021

@mih @yarikoptic :

Since I didn't see a good solution to performance issues as discussed above, changed to the following: Simply let status display a hint to clean in the summary by default - no actual probing is performed. Hint can be disabled by a config. This should address the UX aspect, as users become aware that way.

If that's fine with you guys, will edit PR+first post accordingly.
I would not close the original issue, since autocleaning would still be a desirable option, but just take out the UX label and consider this done for the UX project. Agree, @mih?

Otherwise one question remains for me: datalad.status.hints.clean may not be ideal. If we want to generalize that approach, we should decide on the config "hierarchy": Are such flags to be collected per command (as I did here), or should we go for a broader collection like datalad.hints.status-clean? Or no referenz to a particular command at all, since there may be hints, that could appear in different spots if enabled, so just datalad.hints.use-clean?

@mih
Copy link
Member

mih commented Jul 20, 2021

I cannot get behind some of the rational that led to the current state of the PR, but in its present form I would prefer to not merge it. From my POV it adds to the problem of silently growing repositories the problem of output noise:

% datalad create sofresh
[INFO   ] Creating a new annex repo at /tmp/sofresh 
[INFO   ] scanning for unlocked files (this may take some time) 
create(ok): /tmp/sofresh (dataset)

% datalad -C sofresh status
nothing to save, working tree clean
Hint: Run 'datalad clean [--dry-run]' to find and remove temporary files used by datalad and git-annex.
Disable 'datalad.status.hints.clean' to not display this message anymore.

If we ignore that the hint is visually masking the important information, users will see this unconditionally every time they execute status (which is arguably one, if not the, most frequently executed command). They will have learned to ignore it (or disable it), before they actually get into the situation that datalad clean --dry-run would report anything cleanable.

Given that the PR does not longer address the original issue, I'd say that the modification of status should be reverted, and only the updates to clean be kept. I wanted to look into how the messaging of the new archive remote behaves, but I tripped over #5800

@bpoldrack
Copy link
Member Author

So, just clean --dry-run and a message from datalad-archives special remote when it creates an extraction cache.
The latter looks like this ATM:

❱ datalad get .
[INFO   ] To obtain some keys we need to fetch an archive of size 2.6 kB                                                                         
[INFO   ] datalad-archives special remote is using an extraction cache under /tmp/clone/.git/datalad/tmp/archives/34d49c45cd. Remove it with DataLad's 'clean' command to save disk space. 
get(ok): .ssh/id_appveyor_rsa (file) [from datalad-archives...]
get(ok): .ssh/id_appveyor_rsa.pub (file) [from datalad-archives...]
get(ok): appveyor_key.tar.gz (file) [from origin...]
action summary:
  get (ok: 3)

@bpoldrack bpoldrack changed the title UX: Give hints to datalad's clean command Dry run option for clean command Jul 28, 2021
Copy link
Member

@adswa adswa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked into this in detail, but came here to say that I like the UX message where it is put now much more than unconditionally within status.
I also like the addition of a dry-run flag.

There were a few things I noticed:
When I run it in a random directory without a cache, it just returns nothing. What do you think: would it make sense to report back a "cacheless" state, e.g., "No temporary clutter found"? This behavior would somehow match that of status (which reports nothing to save, working tree clean) and rerun (which warn No command given when nothing would be run). But it would be different from, e.g. git clean -dfn, which also returns nothing if nothing would be cleaned away.

I tried to trigger the hint by downloading and extracting an archive, but didn't see the hint even though there is a temporary download directory left behind:

(handbook2) adina@muninn in /tmp/abcd on git:master
❱ datalad download-url --archive -m test https://www.7-zip.org/a/7za920.zip
[INFO   ] Downloading 'https://www.7-zip.org/a/7za920.zip' into '/tmp/abcd/' 
download_url(ok): /tmp/abcd/7za920.zip (file)                                   
add(ok): 7za920.zip (file)                                                      
save(ok): . (dataset)                                                           
[INFO   ] Adding content of the archive /tmp/abcd/7za920.zip into annex AnnexRepo(/tmp/abcd) 
[INFO   ] Initiating special remote datalad-archives 
[INFO   ] Finished adding /tmp/abcd/7za920.zip: Files processed: 4, +annex: 4 
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)
(handbook2) adina@muninn in /tmp/abcd on git:master
❱ datalad clean --dry-run
.git/annex/transfer: Discovered 1 annex temporary transfer directory: download

It may make sense to add it here, too?

I think it would be nice to add --dry-run to the examples of the command, even if it is kinda straightforward.

@bpoldrack
Copy link
Member Author

bpoldrack commented Jul 29, 2021

I tried to trigger the hint by downloading and extracting an archive, but didn't see the hint even though there is a temporary download directory left behind

That's a different thing, not the extraction cache of the special remote, though.
.git/annex/transfer/... is created by git-annex, not by datalad-archives. So, we can't really issue such a hint when it's happening. We'd need to investigate whether annex left something behind, thus running into similar issues as with the original approach here - probing can be costly.

Re no result:

  1. FTR - There should be notneeded results (which aren't rendered by default)
  2. I guess I'm with you - should behave like status. Will do.

@mih
Copy link
Member

mih commented Jul 29, 2021

That's a different thing, not the extraction cache of the special remote, though.

I see

% tree -d .git/datalad
.git/datalad
└── tmp
    └── archives

after download-url, but no report on it, and clean not acting on it without --dry-run.

@bpoldrack
Copy link
Member Author

I see

% tree -d .git/datalad
.git/datalad
└── tmp
    └── archives

@mih : I see that, too. However, it's empty. So, there's nothing to clean.

@bpoldrack
Copy link
Member Author

@mih , @adswa :

Added a "nothing-to-do" message, added examples, clean now deals with empty directories as well and switched to make use of Path objects instead.

/tmp/some on git:master❱ datalad clean --dry-run
.git/datalad/tmp/archives: Discovered empty temporary archive directory
.git/annex/tmp: Discovered empty temporary annex directory
.git/annex/transfer: Discovered 1 annex temporary transfer directory: download
/tmp/some on git:master❱ datalad clean
clean(ok): .git/datalad/tmp/archives (directory) [Removed empty temporary archive directory]
clean(ok): .git/annex/tmp (directory) [Removed empty temporary annex directory]
clean(ok): .git/annex/transfer (directory) [Removed 1 annex temporary transfer directory: download]
/tmp/some on git:master❱ datalad clean --dry-run
nothing to clean, no temporary locations present.

@adswa
Copy link
Member

adswa commented Aug 3, 2021

I like it!
But one of your tests fails on Windows:

======================================================================
ERROR: datalad.interface.tests.test_clean.test_clean
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python39-x64\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Python39-x64\lib\site-packages\datalad\tests\utils.py", line 739, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "C:\Python39-x64\lib\site-packages\datalad\interface\tests\test_clean.py", line 55, in test_clean
    res = clean(dataset=ds, return_type='item-or-list',
  File "C:\Python39-x64\lib\site-packages\datalad\interface\utils.py", line 483, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "C:\Python39-x64\lib\site-packages\datalad\interface\utils.py", line 476, in return_func
    results = list(results)
  File "C:\Python39-x64\lib\site-packages\datalad\interface\utils.py", line 396, in generator_func
    for r in _process_results(
  File "C:\Python39-x64\lib\site-packages\datalad\interface\utils.py", line 574, in _process_results
    for res in results:
  File "C:\Python39-x64\lib\site-packages\datalad\interface\clean.py", line 173, in __call__
    rmtree(topdir)
  File "C:\Python39-x64\lib\site-packages\datalad\utils.py", line 476, in rmtree
    path = r'\\?\ '.strip() + path
TypeError: can only concatenate str (not "WindowsPath") to str

Edit: I tried it on Windows, and clean fails with this message whenenver it finds something to remove

- output a message when there's nothing to do / to report on, since
  'notneeded' results aren't rendered
- remove and report on empty temporary directories as well
- add examples to the doc
- switch implementation to `Path` objects
- enhance test (didn't even test whether something was removed but only
  the result records)
@bpoldrack
Copy link
Member Author

Ah - stupid me. Force-pushed.

@adswa
Copy link
Member

adswa commented Aug 3, 2021

Cool, works for me locally on Windows now!

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

I resolved half a dozen conversation on topics that are no longer associated with the changes proposed in here. It would probably have been better to close this PR and start a new one, instead. IMHO it is needlessly hard to review these winding stories.

Given that this PR closes #5834 I think a key change is the special remote info message. However, there is no test for it actually showing up in user output, and I cannot make it show manually either.

% datalad create archive
[INFO   ] Creating a new annex repo at /tmp/archive 
[INFO   ] scanning for unlocked files (this may take some time) 
create(ok): /tmp/archive (dataset)
% cd archive
% datalad download-url --archive -m test https://www.7-zip.org/a/7za920.zip
[INFO   ] Downloading 'https://www.7-zip.org/a/7za920.zip' into '/tmp/archive/' 
download_url(ok): /tmp/archive/7za920.zip (file)                                           
add(ok): 7za920.zip (file)                                                                 
save(ok): . (dataset)                                                                      
[INFO   ] Adding content of the archive /tmp/archive/7za920.zip into annex AnnexRepo(/tmp/archive) 
[INFO   ] Initiating special remote datalad-archives 
[INFO   ] Finished adding /tmp/archive/7za920.zip: Files processed: 4, +annex: 4 
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)
(datalad-dev) mih@meiner /tmp/archive (git)-[master] % git annex info
...
        c04eb54b-4b4e-5755-8436-866b043170fa -- [datalad-archives]

Is it not working? Or is it not supposed to be working in this case?

@bpoldrack
Copy link
Member Author

bpoldrack commented Aug 3, 2021

Is it not working? Or is it not supposed to be working in this case?

Not supposed. download_url doesn't leave an extraction cache. However, if you drop the extracted content, and then get the content again, the cache should be established and the message be issued.

That's because the first time the content is extracted into the worktree, since it's supposed to be added.

❱ datalad download-url --archive -m test https://www.7-zip.org/a/7za920.zip
[INFO   ] Downloading 'https://www.7-zip.org/a/7za920.zip' into '/tmp/some/' 
download_url(ok): /tmp/some/7za920.zip (file)                                                                                                                                                      
add(ok): 7za920.zip (file)                                                                                                                                                                         
save(ok): . (dataset)                                                                                                                                                                              
[INFO   ] Adding content of the archive /tmp/some/7za920.zip into annex AnnexRepo(/tmp/some)                                                                                                       
[INFO   ] Initiating special remote datalad-archives 
[INFO   ] Finished adding /tmp/some/7za920.zip: Files processed: 4, +annex: 4 
action summary:
  add (ok: 1)
  download_url (ok: 1)
  save (ok: 1)
❱ datalad drop .
drop(ok): /tmp/some/7-zip.chm (file)                                                                                                                                                               
drop(ok): /tmp/some/7za.exe (file)
drop(ok): /tmp/some/license.txt (file)
drop(ok): /tmp/some/readme.txt (file)
action summary:
  drop (notneeded: 3, ok: 4)
❱ datalad get .
[INFO   ] datalad-archives special remote is using an extraction cache under /tmp/some/.git/datalad/tmp/archives/118110afdf. Remove it with DataLad's 'clean' command to save disk space.          
get(ok): 7-zip.chm (file) [from datalad-archives...]
get(ok): 7za.exe (file) [from datalad-archives...]
get(ok): license.txt (file) [from datalad-archives...]
get(ok): readme.txt (file) [from datalad-archives...]
action summary:
  get (ok: 4)

@mih
Copy link
Member

mih commented Aug 3, 2021

Not supposed. download_url doesn't leave an extraction cache. However, if you drop the extracted content, clone that ds and get the content from origin, the cache should be established and the message be issued.

Thanks for the clarification. I can confirm that it does not need a clone to see it working:

(continued from above)

(datalad-dev) mih@meiner /tmp/archive (git)-[master] % datalad drop license.txt
drop(ok): /tmp/archive/license.txt (file)                                                  
(datalad-dev) mih@meiner /tmp/archive (git)-[master] % datalad get license.txt
[INFO   ] datalad-archives special remote is using an extraction cache under /tmp/archive/.git/datalad/tmp/archives/5ca17613fa. Remove it with DataLad's 'clean' command to save disk space. 
get(ok): license.txt (file) [from datalad-archives...]

Add a test assertion to make sure the datalad-archives special remote is
giving a hint to its extraction cache and to `datalad clean`.
@bpoldrack
Copy link
Member Author

I can confirm that it does not need a clone to see it working:

Yeah, just reread my own comment and noticed the cloning was nonsense.

Pushed a test enhancement.

@bpoldrack bpoldrack added the semver-minor Increment the minor version when merged label Aug 4, 2021
Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx! From my POV this is ready.

@bpoldrack bpoldrack merged commit cd62e06 into datalad:master Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make users aware of (the need for) datalad clean
4 participants