Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: pass values through shlex_quote, fixes #3625 #3626

Merged
merged 15 commits into from Aug 29, 2019

Conversation

@adswa
Copy link
Contributor

commented Aug 27, 2019

In this PR I have

  • removed quoted placeholders from the template in _guess_exec in run_procedure.py and
  • instead shlex_quoted() the values supplied in __call__, (if we are not on Windows)
However, locally, this crashes two tests:
======================================================================
FAIL: datalad.core.local.tests.test_create.test_create_withprocedure
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/adina/env/handbook/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 607, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/adina/repos/datalad/datalad/core/local/tests/test_create.py", line 406, in test_create_withprocedure
    eq_(ds.config['datalad.metadata.nativetype'], ('xmp', 'datacite'))
AssertionError: '"xmp" "datacite"' != ('xmp', 'datacite')

======================================================================
FAIL: datalad.interface.tests.test_run_procedure.test_configs
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/adina/env/handbook/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 895, in newfunc
    return func(*args, **kwargs)
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 434, in newfunc
    return t(*(arg + (d,)), **kw)
  File "/home/adina/repos/datalad/datalad/interface/tests/test_run_procedure.py", line 228, in test_configs
    ok_file_has_content(op.join(ds.path, 'fromproc.txt'), 'some_arg\n')
  File "/home/adina/repos/datalad/datalad/tests/utils.py", line 417, in ok_file_has_content
    assert_equal(content, file_content, **kwargs)
AssertionError: 'some_arg\n' != '"some_arg"\n'
- some_arg
+ "some_arg"
? +        +

because multiple arguments are now returned as a single string. I lack the judgment to decide whether this is indeed problematic, and at which point I can do adjustments.

Can someone give me a pointer?

(I put the commits on top of those in #3624 to prevent merge conflicts, let me know if that's okay)

script=procedure_file if on_windows else shlex_quote(procedure_file),
ds='' if not ds else (ds.path if on_windows else shlex_quote(ds.path)),
args=shlex_quote(u' '.join(u'"{}"'.format(a) for a in args) if args else '') if not on_windows
else u' '.join(u'"{}"'.format(a) for a in args) if args else '')

This comment has been minimized.

Copy link
@mih

mih Aug 27, 2019

Member

shelx_quote() would need to happen on each a inside the format call AFAICS:

u' '.join(u'"{}"'.format(a if on_windows else shlex_quote(a)) for a in args) if args else '')

This comment has been minimized.

Copy link
@adswa

adswa Aug 27, 2019

Author Contributor

oh, thx!

This comment has been minimized.

Copy link
@mih

mih Aug 27, 2019

Member

i think the entire u'"{}".fornat can go and a or quoted a passed directly.

@kyleam

This comment has been minimized.

Copy link
Member

commented Aug 27, 2019

because multiple arguments are now returned as a single string. I lack the judgment to decide whether this is indeed problematic, and at which point I can do adjustments.

Can someone give me a pointer?

You explained it pretty well yourself :] As @mih suggests, you should map shlex_quote over each argument.

(I put the commits on top of those in #3624 to prevent merge conflicts, let me know if that's okay)

That's good.

  • instead shlex_quoted() the values supplied in __call__, (if we are not on Windows)

I suppose every shlex_quote() call should get the "not on windows" condition, so it seems like we should add a helper (probably to datalad.utils) that handles this.

Also, while I personally don't like our "BF, ENH" subject convention, it is pretty consistently used in this code base, so please prepend those :]

@codecov

This comment has been minimized.

Copy link

commented Aug 27, 2019

Codecov Report

Merging #3626 into 0.11.x will decrease coverage by 0.12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           0.11.x    #3626      +/-   ##
==========================================
- Coverage   81.44%   81.31%   -0.13%     
==========================================
  Files         256      256              
  Lines       33823    33851      +28     
==========================================
- Hits        27547    27527      -20     
- Misses       6276     6324      +48
Impacted Files Coverage Δ
datalad/utils.py 64.87% <100%> (-0.1%) ⬇️
datalad/interface/run_procedure.py 68.32% <100%> (-19.95%) ⬇️
datalad/interface/tests/test_run_procedure.py 100% <100%> (ø) ⬆️
datalad/interface/unlock.py 64.19% <0%> (-13.59%) ⬇️
datalad/downloaders/tests/test_http.py 88.86% <0%> (-2.23%) ⬇️
datalad/support/annexrepo.py 58.52% <0%> (-0.52%) ⬇️
datalad/interface/run.py 70.64% <0%> (+1.49%) ⬆️
datalad/cmdline/main.py 66.37% <0%> (+4.36%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de6dbb0...d805fbe. Read the comment docs.

@adswa adswa changed the title WIP: pass values through shlex_quote, fixes #3625 WIP ENH: pass values through shlex_quote, fixes #3625 Aug 27, 2019

@kyleam
Copy link
Member

left a comment

Thanks for working on this.

I left a few comments. The only things that I think really need to be addressed are

This should also get a regression test, but I can add that on top if you'd like.

return {'type': u'python_script',
'template': u'python "{script}" "{ds}" {args}',
'template': u'%s {script} {ds} {args}' % ex,

This comment has been minimized.

Copy link
@kyleam

kyleam Aug 27, 2019

Member

These changes are OK as is, but just a note that essentially the same change is happening three places. This is a good indication that the common bits could have been pulled out into a variable and then the change could have been made to just one spot.

script=procedure_file if on_windows else shlex_quote(procedure_file),
ds='' if not ds else (ds.path if on_windows else shlex_quote(ds.path)),
args=(u' '.join(shlex_quote(a) for a in args) if args else '') if not on_windows
else u' '.join(str(a) for a in args) if args else '')

This comment has been minimized.

Copy link
@kyleam

kyleam Aug 27, 2019

Member

The changes to all these arguments could be simplified by adding a shlex_quote helper that returns the value as is on windows. Even if you prefer not to add a general helper right now, you could add a local one. Something like (untested)

diff --git a/datalad/interface/run_procedure.py b/datalad/interface/run_procedure.py
index 9d2f8e4a3..048f8690f 100644
--- a/datalad/interface/run_procedure.py
+++ b/datalad/interface/run_procedure.py
@@ -448,11 +448,11 @@ def __call__(
             raise ValueError("No idea how to execute procedure %s. "
                              "Missing 'execute' permissions?" % procedure_file)
 
+        maybe_shlex_quote = lambda x: x if on_windows else shlex_quote
         cmd = ex['template'].format(
-            script=procedure_file if on_windows else shlex_quote(procedure_file),
-            ds='' if not ds else (ds.path if on_windows else shlex_quote(ds.path)),
-            args=(u' '.join(shlex_quote(a) for a in args) if args else '') if not on_windows
-                    else u' '.join(str(a) for a in args) if args else '')
+            script=maybe_shlex_quote(procedure_file),
+            ds=maybe_shlex_quote(ds.path) if ds else '',
+            args=u' '.join(maybe_shlex_quote(a) for a in args))
         lgr.debug('Attempt to run procedure {} as: {}'.format(
             name,
             cmd))

In addition to moving the condition to one place, this takes advantage of ' '.join([]) => '' to drop an unnecessary check for the empty args case.

And one comment that isn't stylistic/subjective: the str() call in u' '.join(str(a) for a in args) can cause unicode issues on python 2. six.text_type is one way you can use unicode/str only py2/py3, but you actually don't need it here because you can just do away with the list comprehension and give args directly to join(). (Taking a quick glance, there would still be upstream issues with unicode in py2, but the code here wouldn't be contributing to the issue.)

@kyleam

This comment has been minimized.

Copy link
Member

commented Aug 27, 2019

Also, while I personally don't like our "BF, ENH" subject convention, it is pretty consistently used in this code base, so please prepend those :]

adswa changed the title [...]

Oops, sorry, I should've made it clear that I was talking about the commit message subjects.

@mih

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

FTR: The test might be fixed like this:

diff --git a/datalad/interface/tests/test_run_procedure.py b/datalad/interface/tests/test_run_procedure.py
index d9a76919e..4a17b7765 100644
--- a/datalad/interface/tests/test_run_procedure.py
+++ b/datalad/interface/tests/test_run_procedure.py
@@ -238,7 +238,7 @@ def test_configs(path):
     # for run:
     ds.config.add(
         'datalad.procedures.datalad_test_proc.call-format',
-        'python "{script}" "{ds}" {{mysub}} {args}',
+        'python {script} {ds} {{mysub}} {args}',
         where='dataset'
     )
     ds.config.add(
@@ -258,7 +258,7 @@ def test_configs(path):
     # config on dataset level:
     ds.config.add(
         'datalad.procedures.datalad_test_proc.call-format',
-        'python "{script}" "{ds}" local {args}',
+        'python {script} {ds} local {args}',
         where='local'
     )
     ds.unlock("fromproc.txt")

I could not test it though, because the current PR is against 0.11, rebase wasn't straightforward and with all the extensions my dev environment cannot easily handle this. Sorry.

@adswa

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

@mih this makes it work for me, I've pushed those changes. Thanks!

@mih mih changed the title WIP ENH: pass values through shlex_quote, fixes #3625 ENH: pass values through shlex_quote, fixes #3625 Aug 28, 2019


lgr.log(5, "Done importing datalad.utils")
lgr.log(5, "Done importing datalad.utils")

This comment has been minimized.

Copy link
@mih

mih Aug 28, 2019

Member

I personally prefer line endings all the way to the last line, but we are not agreement within the project already.

This comment has been minimized.

Copy link
@adswa

adswa Aug 28, 2019

Author Contributor

done.

@@ -15,7 +15,9 @@

from glob import iglob
from argparse import REMAINDER
from six.moves import shlex_quote

This comment has been minimized.

Copy link
@mih

mih Aug 28, 2019

Member

This is no longer needed with maybe_shlex_quote

@@ -33,6 +35,7 @@
from datalad.distribution.dataset import datasetmethod
from datalad.support.exceptions import InsufficientArgumentsError
from datalad.support.exceptions import NoDatasetArgumentFound
from datalad.utils import on_windows, maybe_shlex_quote

This comment has been minimized.

Copy link
@mih

mih Aug 28, 2019

Member

It is not used consistently in the code base (yet), but we now do (for any import with more than a single item)

from datalad.utils import (
    on_windows,
    maybe_shlex_quot,
)

i.e. one per line, each line (incl last one has a trailing comma) -- this minimizes the diffs and make them quicker to grasp.

That being said on_windows is no longer needed, see below.

This comment has been minimized.

Copy link
@adswa

adswa Aug 28, 2019

Author Contributor

Ah yes, I missed this. I removed all superfluous imports now.

'state': state}
elif script_file.endswith('.py'):
ex = sys.executable if on_windows else shlex_quote(sys.executable)

This comment has been minimized.

Copy link
@mih

mih Aug 28, 2019

Member

Can also be maybe_shlex_quote(sys.executable) now

This comment has been minimized.

Copy link
@adswa

adswa Aug 28, 2019

Author Contributor

good point, thx.

@adswa adswa force-pushed the adswa:shlex-quote branch from a152555 to 62b14ce Aug 28, 2019

@mih

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Tests did not run (apt issue). Restarted.

@mih
Copy link
Member

left a comment

Ah, sorry, didn't spot this one until now.

@@ -258,7 +258,7 @@ def test_configs(path):
# config on dataset level:
ds.config.add(
'datalad.procedures.datalad_test_proc.call-format',
'python "{script}" "{ds}" local {args}',
'python {script} {ds} local {args}',

This comment has been minimized.

Copy link
@mih

mih Aug 28, 2019

Member

Here and above we should also use sys.executable not python.

This comment has been minimized.

Copy link
@adswa

adswa Aug 28, 2019

Author Contributor

oh right. Thanks!

@adswa

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

Hi @kyleam, I have tried to write a regression test, but that ended up being a test with args that contain spaces -- which is a test that I think cannot fail currently (although there currently is no test that explicitly gives args with spaces, so I pushed it anyway). I wasn't able to come up with any test that can trigger a failure in the case of quoted placeholders... If you have an idea how to write a test for this, you can let me know or add one yourself, if you like.

kyleam added 2 commits Aug 28, 2019
BF(py2): run_procedure: Avoid encoding error in log message
This can fail if the command has non-ascii characters, as it will in
the next commit.

Note that this will conflict with changes on master, which is a good
thing because a similar fix needs to be applied there.
TST: run_procedure: Test more arguments that need quoting
Unlike the spaces check in the last commit, this would fail before the
recent the shlex_quote() changes because the arguments contain quotes.

This very likely needs to be skipped on windows, where our quoting is
a no-op, but I'm leaving it as is in this commit to see how the
AppVeyor run fails.
@kyleam

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Hi @kyleam, I have tried to write a regression test, but that ended up being a test with args that contain spaces

Thanks!

-- which is a test that I think cannot fail currently

Right, it passes without the changes from this PR (i.e. on top of 0.11.x).

(although there currently is no test that explicitly gives args with spaces, so I pushed it anyway).

Sounds good.

I wasn't able to come up with any test that can trigger a failure in the case of quoted placeholders... If you have an idea how to write a test for this, you can let me know or add one yourself, if you like.

I've pushed a test. It should fail on AppVeyor, but I'd like to see that before marking it as a known windows failure. Feel free to mark it once the test run is over. (edit: already failed and already marked)

TST: run_procedure: Mark test_quoting as known windows failure
This test fails on AppVeyor [0].  run_procedure() uses a no-op
shlex_quote() on Windows [1], so this is expected.

[0]: https://ci.appveyor.com/project/mih/datalad/builds/27022010/job/jq3lr977ht5vrmpj#L1109
[1]: #3624 (comment)
@kyleam
kyleam approved these changes Aug 28, 2019
Copy link
Member

left a comment

Assuming the currently running tests pass, I consider this is good to go. Thanks @adswa.

@kyleam

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

test_spaces() fails on AppVeyor. While run_procedure's previous approach of simply surrounding the argument with quotes is incorrect, it would have worked on windows for the simple case of a file name with spaces (I think, though of course we didn't have this test). Perhaps we should make the windows branch of maybe_shlex_quote() smarter so that it can handle simple cases like this, but how we handle shlex_quote on windows needs more work and thought in general. For the purposes of this PR I think we should switch that test's "skip if indirect" check to a full "skip on windows".

TST: run_procedure: Mark test_spaces() as known Windows failure
run_procedure() can't use shlex_quote() on Windows [0], so an argument
with spaces will not be handled correctly [1].  We should probably
teach the windows branch of maybe_shlex_quote() to handle simple cases
like a name with spaces.

[0]: #3624 (comment)
[1]: https://ci.appveyor.com/project/mih/datalad/builds/27022362/job/xoum5f43c135bfgd#L1125
@mih
mih approved these changes Aug 29, 2019
@mih

This comment has been minimized.

Copy link
Member

commented Aug 29, 2019

Thx @adswa and @kyleam !

@mih mih merged commit d1261bb into datalad:0.11.x Aug 29, 2019

3 checks passed

WIP Ready for review
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
yarikoptic added a commit that referenced this pull request Sep 6, 2019
Merge tag '0.11.7' into debian
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-...

Primarily bugfixes with some optimizations and refactorings.

 Fixes

- [addurls][]
  - now provides better handling when the URL file isn't in the
    expected format.  ([#3579][])
  - always considered a relative file for the URL file argument as
    relative to the current working directory, which goes against the
    convention used by other commands of taking relative paths as
    relative to the dataset argument.  ([#3582][])

- [run-procedure][]
  - hard coded "python" when formatting the command for non-executable
    procedures ending with ".py".  `sys.executable` is now used.
    ([#3624][])
  - failed if arguments needed more complicated quoting than simply
    surrounding the value with double quotes.  This has been resolved
    for systems that support `shlex.quote`, but note that on Windows
    values are left unquoted. ([#3626][])

- [siblings][] now displays an informative error message if a local
  path is given to `--url` but `--name` isn't specified.  ([#3555][])

- [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't
  support all the parameters that Git expects it to.  ([#3616][])

- Fixed a number of Unicode py2-compatibility issues. ([#3597][])

 Enhancements and new features

- The [annotate-paths][] helper now caches subdatasets it has seen to
  avoid unnecessary calls.  ([#3570][])

- A repeated configuration query has been dropped from the handling of
  `--proc-pre` and `--proc-post`.  ([#3576][])

- Calls to `git annex find` now use `--in=.` instead of the alias
  `--in=here` to take advantage of an optimization that git-annex (as
  of the current release, 7.20190730) applies only to the
  former. ([#3574][])

- [addurls][] now suggests close matches when the URL or file format
  contains an unknown field.  ([#3594][])

- Shared logic used in the setup.py files of Datalad and its
  extensions has been moved to modules in the _datalad_build_support/
  directory.  ([#3600][])

- Get ready for upcoming git-annex dropping support for direct mode
  ([#3631][])

* tag '0.11.7': (87 commits)
  DOC: Added an entry to changelogn on merged 3631
  ENH: finalizing changelog for 0.11.7
  TST: Update tests for a git-annex without direct mode
  TST: utils: Add decorator that skips when direct mode is unsupported
  ENH: annexrepo: Refuse to initialize in direct mode if unsupported
  ENH: annexrepo: Add check_direct_mode_support method
  BF+TST: Avoid leaking patched git-annex version
  TST+RF: test_annexrepo: Split up a test
  CHANGELOG.md: Second batch for 0.11.7
  TST: run_procedure: Mark test_spaces() as known Windows failure
  TST: run_procedure: Mark test_quoting as known windows failure
  TST: run_procedure: Test more arguments that need quoting
  BF(py2): run_procedure: Avoid encoding error in log message
  TST: add run_procedure test with spaces in file name
  TST/RF: non-hardcoded Python executable
  RF: newline at end of file
  RF: helper instead of conditional
  RF: remove superfluous imports
  BF/TST: remove quoting
  ENH: replace conditionals with helper function
  ...
yarikoptic added a commit that referenced this pull request Sep 6, 2019
Merge tag '0.11.7' into debian
0.11.7 (Sep 02, 2019) -- python2-we-still-love-you-but-...

Primarily bugfixes with some optimizations and refactorings.

 Fixes

- [addurls][]
  - now provides better handling when the URL file isn't in the
    expected format.  ([#3579][])
  - always considered a relative file for the URL file argument as
    relative to the current working directory, which goes against the
    convention used by other commands of taking relative paths as
    relative to the dataset argument.  ([#3582][])

- [run-procedure][]
  - hard coded "python" when formatting the command for non-executable
    procedures ending with ".py".  `sys.executable` is now used.
    ([#3624][])
  - failed if arguments needed more complicated quoting than simply
    surrounding the value with double quotes.  This has been resolved
    for systems that support `shlex.quote`, but note that on Windows
    values are left unquoted. ([#3626][])

- [siblings][] now displays an informative error message if a local
  path is given to `--url` but `--name` isn't specified.  ([#3555][])

- [sshrun][], the command DataLad uses for `GIT_SSH_COMMAND`, didn't
  support all the parameters that Git expects it to.  ([#3616][])

- Fixed a number of Unicode py2-compatibility issues. ([#3597][])

- [download-url][] now will create leading directories of the output path
  if they do not exist ([#3646][])

 Enhancements and new features

- The [annotate-paths][] helper now caches subdatasets it has seen to
  avoid unnecessary calls.  ([#3570][])

- A repeated configuration query has been dropped from the handling of
  `--proc-pre` and `--proc-post`.  ([#3576][])

- Calls to `git annex find` now use `--in=.` instead of the alias
  `--in=here` to take advantage of an optimization that git-annex (as
  of the current release, 7.20190730) applies only to the
  former. ([#3574][])

- [addurls][] now suggests close matches when the URL or file format
  contains an unknown field.  ([#3594][])

- Shared logic used in the setup.py files of Datalad and its
  extensions has been moved to modules in the _datalad_build_support/
  directory.  ([#3600][])

- Get ready for upcoming git-annex dropping support for direct mode
  ([#3631][])

* tag '0.11.7':
  Changelog entry for download-url paths handling
  ENH: downloaders: Ensure directories for target exist
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.