addurls: Provide better handling of invalid and empty input #3579

Merged
kyleam merged 4 commits into datalad:0.11.x from kyleam:addurls-norows on Aug 2, 2019

@kyleam
Member

commented Aug 1, 2019

Fixes gh-3577 and another gh-3577-inspired issue.

DOC: addurls: Update stale return value
This should have been updated in 5415ce9 (BF: addurls: Process
datasets in a stable, breadth-first order, 2019-07-24).

@kyleam changed the title from "addurls: Provide better Handling of invalid and empty input" to "addurls: Provide better handling of invalid and empty input" on Aug 1, 2019

@yarikoptic
Member

left a comment

Just a minor recommendation


in_file = op.join(path, "in")
with open(in_file, "w") as fh:
    fh.write("")

@yarikoptic

yarikoptic Aug 1, 2019

Member

Could have just used with_tree to get a temporary directory with a file with content. IMHO it would have been more descriptive and shorter.

@kyleam

kyleam Aug 1, 2019

Author Member

Yeah, I like {with,create}_tree too. I'm not sure why I didn't use it here. Updated.
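
For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of a `with_tree`-style decorator built only on the standard library. It is not DataLad's implementation: the name `with_tree_sketch` is hypothetical, and it handles only a flat `{filename: content}` mapping. It just illustrates why the decorator form is shorter than writing each file by hand inside the test.

```python
import functools
import os
import os.path as op
import shutil
import tempfile


def with_tree_sketch(tree):
    """Hypothetical stand-in for a with_tree-style decorator (not DataLad's code).

    Creates a temporary directory, writes each {filename: content} entry
    of `tree` into it, and passes the directory path to the decorated
    function as its last positional argument.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = tempfile.mkdtemp()
            try:
                for name, content in tree.items():
                    with open(op.join(path, name), "w") as fh:
                        fh.write(content)
                return func(*args, path, **kwargs)
            finally:
                # Clean up the temporary tree even if the test fails.
                shutil.rmtree(path, ignore_errors=True)
        return wrapper
    return decorator


@with_tree_sketch({"in.csv": "url,name,subdir", "in.json": "[]"})
def list_inputs(path):
    # The files already exist when the body runs.
    return sorted(os.listdir(path))
```

Calling `list_inputs()` runs the body against a directory that already contains both input files, which is the setup the updated tests rely on.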

range-diff
1:  d853daad3 = 1:  d853daad3 DOC: addurls: Update stale return value
2:  cc54a8f1a ! 2:  8c8f2a9a2 BF: addurls: Provide better error for invalid input stream
    @@ -48,19 +48,23 @@
      
      from datalad.api import addurls, Dataset, subdatasets
      import datalad.plugin.addurls as au
    +@@
    + from datalad.tests.utils import assert_dict_equal
    + from datalad.tests.utils import eq_, ok_exists
    + from datalad.tests.utils import create_tree, with_tempfile, HTTPPath
    ++from datalad.tests.utils import with_tree
    + from datalad.utils import get_tempfile_kwargs, rmtemp
    + 
    + 
     @@
                   op.join("bar", "adir", "bar-again", "other-ds")})
              ok_exists(os.path.join(
                  ds.path, "foo", "adir", "foo-again", "other-ds", "bdir", "a"))
     +
    -+    @with_tempfile(mkdir=True)
    ++    @with_tree({"in": ""})
     +    def test_addurls_invalid_input(self, path):
    -+        ds = Dataset(path).create()
    -+
    ++        ds = Dataset(path).create(force=True)
     +        in_file = op.join(path, "in")
    -+        with open(in_file, "w") as fh:
    -+            fh.write("")
    -+
     +        for in_type in ["csv", "json"]:
     +            with assert_raises(IncompleteResultsError) as exc:
     +                ds.addurls(in_file, "{url}", "{name}", input_type=in_type)
3:  ae358d7b5 ! 3:  de086917b BF: addurls: Don't assume that the stream has rows
    @@ -32,19 +32,13 @@
                      ds.addurls(in_file, "{url}", "{name}", input_type=in_type)
                  assert_in("Failed to read", text_type(exc.exception))
     +
    -+    @with_tempfile(mkdir=True)
    ++    @with_tree({"in.csv": "url,name,subdir",
    ++                "in.json": "[]"})
     +    def test_addurls_no_rows(self, path):
    -+        ds = Dataset(path).create()
    -+
    -+        in_csv = op.join(path, "in.csv")
    -+        with open(in_csv, "w") as fh:
    -+            fh.write("url,name,subdir")
    -+
    -+        in_json = op.join(path, "in.json")
    -+        with open(in_json, "w") as fh:
    -+            fh.write("[]")
    -+
    -+        for fname in [in_csv, in_json]:
    ++        ds = Dataset(path).create(force=True)
    ++        for fname in ["in.csv", "in.json"]:
    ++            # TODO: This op.join() can be dropped once gh-3580 is fixed.
    ++            fname = op.join(path, fname)
     +            with swallow_logs(new_level=logging.WARNING) as cml:
     +                ds.addurls(fname, "{url}", "{name}")
     +                cml.assert_logged("No rows", regex=False)
4:  355dc8a72 ! 4:  281512ca1 RF: addurls: Return early if there are no rows to process
    @@ -27,8 +27,8 @@
      --- a/datalad/plugin/tests/test_addurls.py
      +++ b/datalad/plugin/tests/test_addurls.py
     @@
    - 
    -         for fname in [in_csv, in_json]:
    +             # TODO: This op.join() can be dropped once gh-3580 is fixed.
    +             fname = op.join(path, fname)
                  with swallow_logs(new_level=logging.WARNING) as cml:
     -                ds.addurls(fname, "{url}", "{name}")
     +                assert_in_results(

in_csv = op.join(path, "in.csv")
with open(in_csv, "w") as fh:
    fh.write("url,name,subdir")

@yarikoptic

yarikoptic Aug 1, 2019

Member

Same here

kyleam added 3 commits Aug 1, 2019
BF: addurls: Provide better error for invalid input stream
If the content we get from the stream is invalid (for JSON we fail to
decode or for CSV we don't get a header), tell the user that we
couldn't read data from the stream rather than letting the underlying
exception bubble up.
RF: addurls: Return early if there are no rows to process
The downstream code won't fail if `rows` is empty, but it's also
pointless to execute.
BF: addurls: Don't assume that the stream has rows
extract() fails with an index error if the stream is valid JSON or CSV
but lacks any rows (i.e. for JSON an empty list or for CSV a
header-only file).  Update extract() to issue a warning and return an
empty list of rows.

Re: templateflow/datalad-osf#1
Closes #3577.
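
The behavior described in the three commit messages above can be sketched as follows. This is an illustration under stated assumptions, not the code in `datalad/plugin/addurls.py`: the helper names `read_rows` and `add_urls_sketch`, the logger name, and the exact message wording are all hypothetical.

```python
import csv
import json
import logging

lgr = logging.getLogger("addurls_sketch")  # hypothetical logger name


def read_rows(stream, input_type):
    """Return a list of rows from `stream` (hypothetical helper).

    Invalid content raises ValueError with a uniform "Failed to read"
    message; valid content without rows logs a warning and returns [].
    """
    if input_type not in ("csv", "json"):
        raise ValueError("unknown input type: {}".format(input_type))
    try:
        if input_type == "json":
            # json.JSONDecodeError is a subclass of ValueError.
            rows = json.load(stream)
        else:
            reader = csv.DictReader(stream)
            if reader.fieldnames is None:
                # An empty stream has no header line at all.
                raise ValueError("no header line")
            rows = list(reader)
    except (ValueError, csv.Error) as exc:
        raise ValueError(
            "Failed to read {} data: {}".format(input_type, exc))
    if not rows:
        lgr.warning("No rows found in input")
        return []
    return rows


def add_urls_sketch(stream, input_type):
    """Collect the "url" field of each row, returning early if there are none."""
    rows = read_rows(stream, input_type)
    if not rows:
        return []  # nothing to process, so skip the downstream work
    return [row["url"] for row in rows]
```

An empty file fails with the "Failed to read" error for both input types, while a header-only CSV or a JSON `[]` produces a warning and an empty result rather than an index error.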

@kyleam kyleam force-pushed the kyleam:addurls-norows branch from 355dc8a to 281512c Aug 1, 2019

@kyleam kyleam merged commit 281512c into datalad:0.11.x Aug 2, 2019

2 of 3 checks passed

continuous-integration/appveyor/pr: AppVeyor build failed
WIP: Ready for review
continuous-integration/travis-ci/pr: The Travis CI build passed
kyleam added a commit that referenced this pull request Aug 2, 2019

@kyleam kyleam deleted the kyleam:addurls-norows branch Aug 2, 2019

yarikoptic added a commit that referenced this pull request Sep 6, 2019
Merge tag '0.11.7' into debian
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-...

Primarily bugfixes with some optimizations and refactorings.

 Fixes

- [addurls][]
  - now provides better handling when the URL file isn't in the
    expected format.  ([#3579][])
  - always considered a relative file for the URL file argument as
    relative to the current working directory, which goes against the
    convention used by other commands of taking relative paths as
    relative to the dataset argument.  ([#3582][])

- [run-procedure][]
  - hard coded "python" when formatting the command for non-executable
    procedures ending with ".py".  `sys.executable` is now used.
    ([#3624][])
  - failed if arguments needed more complicated quoting than simply
    surrounding the value with double quotes.  This has been resolved
    for systems that support `shlex.quote`, but note that on Windows
    values are left unquoted. ([#3626][])

- [siblings][] now displays an informative error message if a local
  path is given to `--url` but `--name` isn't specified.  ([#3555][])

- [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't
  support all the parameters that Git expects it to.  ([#3616][])

- Fixed a number of Unicode py2-compatibility issues. ([#3597][])

 Enhancements and new features

- The [annotate-paths][] helper now caches subdatasets it has seen to
  avoid unnecessary calls.  ([#3570][])

- A repeated configuration query has been dropped from the handling of
  `--proc-pre` and `--proc-post`.  ([#3576][])

- Calls to `git annex find` now use `--in=.` instead of the alias
  `--in=here` to take advantage of an optimization that git-annex (as
  of the current release, 7.20190730) applies only to the
  former. ([#3574][])

- [addurls][] now suggests close matches when the URL or file format
  contains an unknown field.  ([#3594][])

- Shared logic used in the setup.py files of Datalad and its
  extensions has been moved to modules in the _datalad_build_support/
  directory.  ([#3600][])

- Get ready for upcoming git-annex dropping support for direct mode
  ([#3631][])

* tag '0.11.7': (87 commits)
  DOC: Added an entry to changelogn on merged 3631
  ENH: finalizing changelog for 0.11.7
  TST: Update tests for a git-annex without direct mode
  TST: utils: Add decorator that skips when direct mode is unsupported
  ENH: annexrepo: Refuse to initialize in direct mode if unsupported
  ENH: annexrepo: Add check_direct_mode_support method
  BF+TST: Avoid leaking patched git-annex version
  TST+RF: test_annexrepo: Split up a test
  CHANGELOG.md: Second batch for 0.11.7
  TST: run_procedure: Mark test_spaces() as known Windows failure
  TST: run_procedure: Mark test_quoting as known windows failure
  TST: run_procedure: Test more arguments that need quoting
  BF(py2): run_procedure: Avoid encoding error in log message
  TST: add run_procedure test with spaces in file name
  TST/RF: non-hardcoded Python executable
  RF: newline at end of file
  RF: helper instead of conditional
  RF: remove superfluous imports
  BF/TST: remove quoting
  ENH: replace conditionals with helper function
  ...
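
As an aside on the run-procedure fixes in the changelog above, the two ideas (using `sys.executable` instead of a hard-coded "python", and quoting with `shlex.quote`) can be sketched as below. The function name `format_python_command` is hypothetical and this is not the run-procedure code itself. Note that `shlex.quote` targets POSIX shells, which is consistent with the changelog's remark that values are left unquoted on Windows.

```python
import shlex
import sys


def format_python_command(script, args):
    """Build a shell command that runs `script` with the current interpreter.

    Hypothetical helper: every part is passed through shlex.quote so that
    spaces and quote characters survive a POSIX shell.
    """
    parts = [sys.executable, script] + list(args)
    return " ".join(shlex.quote(part) for part in parts)
```

Splitting the result with `shlex.split` gives back the original argument list, which is a convenient way to check that the quoting is lossless.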
yarikoptic added a commit that referenced this pull request Sep 6, 2019
Merge tag '0.11.7' into debian
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-...

Primarily bugfixes with some optimizations and refactorings.

 Fixes

- [addurls][]
  - now provides better handling when the URL file isn't in the
    expected format.  ([#3579][])
  - always considered a relative file for the URL file argument as
    relative to the current working directory, which goes against the
    convention used by other commands of taking relative paths as
    relative to the dataset argument.  ([#3582][])

- [run-procedure][]
  - hard coded "python" when formatting the command for non-executable
    procedures ending with ".py".  `sys.executable` is now used.
    ([#3624][])
  - failed if arguments needed more complicated quoting than simply
    surrounding the value with double quotes.  This has been resolved
    for systems that support `shlex.quote`, but note that on Windows
    values are left unquoted. ([#3626][])

- [siblings][] now displays an informative error message if a local
  path is given to `--url` but `--name` isn't specified.  ([#3555][])

- [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't
  support all the parameters that Git expects it to.  ([#3616][])

- Fixed a number of Unicode py2-compatibility issues. ([#3597][])

- [download-url][] now will create leading directories of the output path
  if they do not exist ([#3646][])

 Enhancements and new features

- The [annotate-paths][] helper now caches subdatasets it has seen to
  avoid unnecessary calls.  ([#3570][])

- A repeated configuration query has been dropped from the handling of
  `--proc-pre` and `--proc-post`.  ([#3576][])

- Calls to `git annex find` now use `--in=.` instead of the alias
  `--in=here` to take advantage of an optimization that git-annex (as
  of the current release, 7.20190730) applies only to the
  former. ([#3574][])

- [addurls][] now suggests close matches when the URL or file format
  contains an unknown field.  ([#3594][])

- Shared logic used in the setup.py files of Datalad and its
  extensions has been moved to modules in the _datalad_build_support/
  directory.  ([#3600][])

- Get ready for upcoming git-annex dropping support for direct mode
  ([#3631][])

* tag '0.11.7':
  Changelog entry for download-url paths handling
  ENH: downloaders: Ensure directories for target exist
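
The download-url entry in this changelog (creating leading directories of the output path) corresponds to a common standard-library idiom, sketched here. The helper name `ensure_parent_dir` is hypothetical and the snippet is not taken from DataLad's downloaders.

```python
import os
import os.path as op


def ensure_parent_dir(target):
    """Create the leading directories of `target` if they do not exist.

    Hypothetical helper; returns `target` unchanged so it can be used inline.
    """
    parent = op.dirname(target)
    if parent:
        # exist_ok=True makes repeated calls harmless.
        os.makedirs(parent, exist_ok=True)
    return target
```

A bare file name has no parent component, so the helper leaves the file system untouched in that case.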