ENH: download_url: Use trailing separator to signal directory target #3854
f38d72c (BF: download_url: Update for new path resolution logic, 2019-06-03) didn't properly adjust path handling for downstream code that feeds the paths into AnnexRepo methods. We give these methods paths that are relative to the current directory when the dataset argument is not a Dataset instance, but these methods still expect paths to be either relative to the dataset or full paths. Pass AnnexRepo methods paths that are relative to the dataset. Fixes datalad#3847.
Use the resolve_path() helper rather than custom logic to resolve paths against the dataset. Using centralized logic helps avoid inconsistent behavior and allows us to take advantage of the non-trivial logic in resolve_path(). In particular, we avoid the use of normpath(), which is problematic for the reason mentioned in resolve_path's docstring and comments. Here's the pathlib documentation that resolve_path() references: "Spurious slashes and single dots are collapsed, but double dots ('..') are not, since this would change the meaning of a path in the face of symbolic links: [...] (a naïve approach would make PurePosixPath('foo/../bar') equivalent to PurePosixPath('bar'), which is wrong if foo is a symbolic link to another directory)". Re: datalad#3643 (comment)
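To make the normpath() concern concrete, here is a small standard-library sketch. It only contrasts pathlib with posixpath.normpath(); it does not call or reproduce DataLad's resolve_path(), and the path strings are made up for illustration.

```python
import posixpath
from pathlib import PurePosixPath

# Illustrative path only; imagine "foo" is a symlink to another directory.
raw = "foo/../bar"

# pathlib keeps the double dots, so the path still goes through "foo".
kept = str(PurePosixPath(raw))

# normpath() collapses "foo/.." lexically, which silently changes the
# meaning of the path when "foo" is a symlink.
collapsed = posixpath.normpath(raw)

# Spurious slashes and single dots are safe to drop, and pathlib does so.
tidied = str(PurePosixPath("foo//./bar"))

print(kept, collapsed, tidied)
```

The first value keeps its `..` component, while the normpath() result no longer mentions foo at all.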
On both master and 0.11.x, there isn't an attempt to identify the dataset from the --path argument. For example, if outside of the </path/to/ds/> dataset, running $ datalad download-url --path /path/to/ds/fname https://www.datalad.org/img/logo/studyforrest.png downloads the file to </path/to/ds/fname>, but it does not perform any of the dataset-dependent functionality (e.g., saving). Looking at 98153ec (ENH: download_url: Optionally add file to dataset, 2018-05-17), it appears that this functionality was never supported and that the description was thoughtlessly copied from an existing --dataset description.
We've now dropped Python 2 support, so follow the suggestion of the deleted comment.
As of a570fcb (ENH: downloaders: Ensure directories for target exist, 2019-09-02), download() creates leading directories if it is given a path that does not exist for _non-directory_ targets. A directory target is supported, but it must exist. Move the "make directories if needed" logic early so that we can handle directory targets as well.
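A rough sketch of what "make directories if needed" can look like when it runs before the download and covers both kinds of target. The helper name ensure_target_dirs() is hypothetical, and treating a trailing separator as the directory marker is an assumption borrowed from this PR's proposal; this is not the code in download() or download_url().

```python
import os


def ensure_target_dirs(path: str) -> None:
    """Create the directories a download target needs (hypothetical helper)."""
    if path.endswith(os.sep):
        # Directory target: the path itself must exist as a directory.
        needed = path
    else:
        # Non-directory target: only the leading directories are required.
        needed = os.path.dirname(path)
    if needed:
        # exist_ok=True makes the call a no-op for existing directories.
        os.makedirs(needed, exist_ok=True)
```

Calling it with "sub/dir/" would create sub/dir, while "sub/file.txt" would create only sub and leave the file itself to the downloader.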
I've pushed an update with more tests and tweaked handling of the "path without slash points to existing directory" case. I'll take this out of draft mode, but label it with "do not merge" because it sits on top of gh-3850. range-diff
If the --path argument points to an existing directory, download_url() will dump content to files within that directory. The only way we know that the user wants a directory is that one exists. As a consequence, if there's a typo (as described in dataladgh-3484), download_url() can't be aware that a directory was intended and goes with the non-directory treatment. When combined with --archive, this can lead to a large number of files in a location the user didn't intend, typically the top-level directory of the repository. To improve this situation, require the user to tack on a trailing separator to indicate that they want the directory treatment. If the user has a typo in the directory name, at least the content goes into a misnamed subdirectory. And because download_url() knows what the user wanted regardless of whether the directory exists, download_url() can now support creating directories when they don't exist, which underneath is already supported by download(). Closes datalad#3848.
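The rule described above can be sketched in a few lines. The function names wants_directory() and target_for() are hypothetical and exist only for this illustration; the actual download_url() code may differ in how it detects separators and derives file names.

```python
import os
import posixpath


def wants_directory(path: str) -> bool:
    """Return True if `path` ends with a path separator."""
    seps = (os.sep,) + ((os.altsep,) if os.altsep else ())
    return path.endswith(seps)


def target_for(path: str, url: str) -> str:
    """Pick the file a download would be written to (illustration only)."""
    if wants_directory(path):
        # Directory treatment: keep the URL's file name inside `path`,
        # whether or not the directory exists yet.
        return os.path.join(path, posixpath.basename(url))
    # Non-directory treatment: `path` itself names the file.
    return path
```

With this rule, a typo such as "dta/" still sends content into a (misnamed) subdirectory, whereas "dta" without the separator is taken as a file name.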
Codecov Report
@@ Coverage Diff @@
## master #3854 +/- ##
===========================================
+ Coverage 46.56% 80.73% +34.16%
===========================================
Files 270 273 +3
Lines 36006 36058 +52
===========================================
+ Hits 16767 29112 +12345
+ Misses 19239 6946 -12293
Continue to review full report at Codecov.
Jeez... Apologies for not having responded in a month...
This implements the trailing slash solution to gh-3848. It sits on top of the unmerged gh-3850. I'm marking it as a draft because more extensive testing should be added to test_download_url.py.