Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: download_url: Use trailing separator to signal directory target #3854

Merged
merged 6 commits into from Dec 9, 2019

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Nov 1, 2019

This implements the trailing slash solution to gh-3848. It sits on top of the unmerged gh-3850. I'm marking it as a draft because more extensive testing should be added to test_download_url.py.

kyleam added 5 commits Nov 1, 2019
f38d72c (BF: download_url: Update for new path resolution logic,
2019-06-03) didn't properly adjust path handling for downstream code
that feeds the paths into AnnexRepo methods.  We give these methods
paths that are relative to the current directory when a dataset is not
an instance, but these methods still expect paths to be either
relative to the dataset or full paths.

Pass AnnexRepo methods paths that are relative to the dataset.

Fixes datalad#3847.
Use the resolve_path() helper rather than custom logic to resolve
paths against the dataset.  Using centralized logic helps avoid
inconsistent behavior and allows us to take advantage of the
non-trivial logic in resolve_path().  In particular, we avoid the use
of normpath(), which is problematic for the reason mentioned in
resolve_path's docstring and comments.  Here's the pathlib
documentation that resolve_path() references:

    Spurious slashes and single dots are collapsed, but double dots ('..')
    are not, since this would change the meaning of a path in the face of
    symbolic links:
    [...]
    (a naïve approach would make PurePosixPath('foo/../bar')
    equivalent to PurePosixPath('bar'), which is wrong if foo is a
    symbolic link to another directory) , which is problematic for the
    reasons mentioned in

Re: datalad#3643 (comment)
On both master and 0.11x, there isn't an attempt to identify the
dataset from --path argument.  For example, if outside of the
</path/to/ds/> dataset, running

  $ datalad download-url --path /path/to/ds/fname
    https://www.datalad.org/img/logo/studyforrest.png

downloads the file to </path/to/ds/fname>, but it does not perform any
of the dataset-dependent functionality (e.g., saving).

Looking at 98153ec (ENH: download_url: Optionally add file to
dataset, 2018-05-17), it appears that functionality was never
supported and that this description was thoughtlessly copied from an
existing --dataset description.
We've now dropped Python 2 support, so follow the suggestion of the
deleted commented.
As of a570fcb (ENH: downloaders: Ensure directories for target
exist, 2019-09-02), download() creates leading directories if it is
given a path that does not exist for _non-directory_ targets.  A
directory target is supported, but it must exist.

Move the "make directories if needed" logic early so that we can
handle directory targets as well.
@kyleam kyleam force-pushed the download-url-o-slash branch from 3af188e to 87f5fc8 Compare Nov 4, 2019
@kyleam
Copy link
Contributor Author

kyleam commented Nov 4, 2019

I've pushed an update with more tests and tweaked handling of the "path without slash points to existing directory" case. I'll take this out of draft mode, but label it with "do not merge" because it sits on top of gh-3850.

range-diff
1:  813fc456f = 1:  813fc456f DOC: download_url: Correct --dataset's description
2:  535c8730b = 2:  535c8730b ENH: downloaders: Use makedirs's exist_ok to avoid race
3:  18973ea02 = 3:  18973ea02 RF: downloaders: Support non-existing directory target
4:  3af188e4e ! 4:  87f5fc8ac ENH: download_url: Use trailing separator to signal directory target
    @@ datalad/interface/download_url.py: def __call__(urls, dataset=None, path=None, o
              urls = assure_list_from_str(urls)
      
     -        if len(urls) > 1 and path and not op.isdir(path):
    -+        if len(urls) > 1 and not dir_is_target:
    -             yield get_status_dict(
    -                 status="error",
    -                 message=(
    -                     "When specifying multiple urls, --path should point to "
    +-            yield get_status_dict(
    +-                status="error",
    +-                message=(
    +-                    "When specifying multiple urls, --path should point to "
     -                    "an existing directory. Got %r", path),
    -+                    "a directory target (with a trailing separator). Got %r",
    -+                    path),
    -                 type="file",
    -                 path=path,
    -                 **common_report)
    +-                type="file",
    +-                path=path,
    +-                **common_report)
    +-            return
    ++        if not dir_is_target:
    ++            if len(urls) > 1:
    ++                yield get_status_dict(
    ++                    status="error",
    ++                    message=(
    ++                        "When specifying multiple urls, --path should point to "
    ++                        "a directory target (with a trailing separator). Got %r",
    ++                        path),
    ++                    type="file",
    ++                    path=path,
    ++                    **common_report)
    ++                return
    ++            # download() would be fine getting an existing directory and
    ++            # downloading the URL underneath it, but let's enforce a trailing
    ++            # slash here for consistency.
    ++            if op.isdir(path):
    ++                yield get_status_dict(
    ++                    status="error",
    ++                    message=(
    ++                        "Non-directory path given (no trailing separator) "
    ++                        "but a directory with that name exists"),
    ++                    type="file",
    ++                    path=path,
    ++                    **common_report)
    ++                return
    + 
    +         # TODO setup fancy ui.progressbars doing this in parallel and reporting overall progress
    +         # in % of urls which were already downloaded
     
      ## datalad/interface/tests/test_download_url.py ##
     @@ datalad/interface/tests/test_download_url.py: def test_download_url_exceptions():
    @@ datalad/interface/tests/test_download_url.py: def test_download_url_exceptions()
                         res0)
      
          res1 = download_url('http://example.com/bogus',
    +@@ datalad/interface/tests/test_download_url.py: def test_download_url_exceptions():
    +         assert_in('http://example.com/bogus', msg)
    + 
    + 
    ++@with_tree(tree={"dir": {}})
    ++def test_download_url_existing_dir_no_slash_exception(path):
    ++    with chpwd(path):
    ++        res = download_url('url', path="dir", save=False, on_failure='ignore')
    ++        assert_result_count(res, 1, status='error')
    ++        assert_message("Non-directory path given (no trailing separator) "
    ++                       "but a directory with that name exists",
    ++                       res)
    ++
    ++
    + @known_failure_githubci_win
    + @assert_cwd_unchanged
    + @with_tree(tree=[
     @@ datalad/interface/tests/test_download_url.py: def test_download_url_exceptions():
      @serve_path_via_http
      @with_tempfile(mkdir=True)
    @@ datalad/interface/tests/test_download_url.py: def test_download_url_archive(topp
          os.mkdir(subdir_path)
          with chpwd(subdir_path):
              download_url([topurl + "archive.tar.gz"], archive=True)
    +     ok_(ds.repo.file_has_content(opj("subdir", "archive", "file1.txt")))
    ++
    ++
    ++@known_failure_githubci_win
    ++@with_tree(tree={"a0.tar.gz": {'f0.txt': 'abc'},
    ++                 "a1.tar.gz": {'f1.txt': 'def'}})
    ++@serve_path_via_http
    ++@with_tempfile(mkdir=True)
    ++def test_download_url_archive_trailing_separator(toppath, topurl, path):
    ++    ds = Dataset(path).create()
    ++    # Archives will be extracted in the specified subdirectory, which doesn't
    ++    # need to exist.
    ++    ds.download_url([topurl + "a0.tar.gz"], path=opj("with-slash", ""),
    ++                    archive=True)
    ++    ok_(ds.repo.file_has_content(opj("with-slash", "a0", "f0.txt")))
    ++    # But if the path doesn't have a trailing separator, it will not be
    ++    # considered a directory. The archive will be downloaded to that path and
    ++    # then extracted in the top-level of the dataset.
    ++    ds.download_url([topurl + "a1.tar.gz"], path="no-slash",
    ++                    archive=True)
    ++    ok_(ds.repo.file_has_content(opj("a1", "f1.txt")))

@kyleam kyleam marked this pull request as ready for review Nov 4, 2019
@kyleam kyleam added the do not merge label Nov 4, 2019
If the --path argument points to an existing directory, download_url()
will dump content to files within that directory.  The only way we
know that the user wants a directory is that one exists.  As a
consequence, if there's a typo (as described in dataladgh-3484),
download_url() can't be aware that a directory was intended and goes
with the non-directory treatment.  When combined with --archive, this
can lead to a large number of files in a location the user didn't
intend, typically the top-level directory of the repository.

To improve this situation, require the user to tack on a trailing
separator to indicate that they want the directory treatment.  If the
user has a typo in the directory name, at least the content goes into
a misnamed subdirectory.  And because download_url() knows what the
user wanted regardless of whether the directory exists, download_url()
can now support creating directories when they don't exist, which
underneath is already supported by download().

Closes datalad#3848.
@kyleam kyleam force-pushed the download-url-o-slash branch from 87f5fc8 to de9d692 Compare Nov 5, 2019
@codecov
Copy link

codecov bot commented Nov 5, 2019

Codecov Report

Merging #3854 into master will increase coverage by 34.16%.
The diff coverage is 61.4%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #3854       +/-   ##
===========================================
+ Coverage   46.56%   80.73%   +34.16%     
===========================================
  Files         270      273        +3     
  Lines       36006    36058       +52     
===========================================
+ Hits        16767    29112    +12345     
+ Misses      19239     6946    -12293
Impacted Files Coverage Δ
datalad/interface/tests/test_download_url.py 100% <100%> (+100%) ⬆️
datalad/downloaders/base.py 56.29% <100%> (+22.22%) ⬆️
datalad/downloaders/tests/test_http.py 58.39% <100%> (+26.28%) ⬆️
datalad/interface/download_url.py 35.22% <8.33%> (-6.21%) ⬇️
tools/coverage-bin/git-annex-remote-datalad 100% <0%> (ø)
tools/coverage-bin/datalad 100% <0%> (ø)
...ols/coverage-bin/git-annex-remote-datalad-archives 100% <0%> (ø)
datalad/config.py 98.85% <0%> (+0.38%) ⬆️
datalad/core/local/tests/test_diff.py 99.4% <0%> (+0.59%) ⬆️
datalad/support/gitrepo.py 83.46% <0%> (+0.63%) ⬆️
... and 143 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 728f997...de9d692. Read the comment docs.

@kyleam kyleam removed the do not merge label Dec 2, 2019
@mih
Copy link
Member

mih commented Dec 9, 2019

Jeez... Apologies for not having responded in a month... 🤦‍♂️

@mih mih merged commit a55d62c into datalad:master Dec 9, 2019
17 checks passed
@kyleam kyleam deleted the download-url-o-slash branch Dec 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants