Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: fix execution of patool commands #3176

Merged
merged 6 commits into from Feb 25, 2019

Conversation

yarikoptic
Copy link
Member

It is an incomplete fix, which would still cause test failure in the crawler if no p7zip is installed in crawler due to #3175 , but wanted to see if anything breaks here before proceeding further

BK since would stall if gzip or pigz is available, and not p7zip
patool invokes with shell=True and behavior when cmd is provided as list
and shell is True is not what you would expect - only the first list member
is provided to the shell and the rest as additional arguments to the shell,
not as a part of the command. That leads to incorrect execution of the line
@codecov
Copy link

codecov bot commented Feb 23, 2019

Codecov Report

Merging #3176 into 0.11.x will decrease coverage by 32.82%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           0.11.x    #3176       +/-   ##
===========================================
- Coverage   90.83%   58.01%   -32.83%     
===========================================
  Files         249       92      -157     
  Lines       32783    14652    -18131     
===========================================
- Hits        29780     8500    -21280     
- Misses       3003     6152     +3149
Impacted Files Coverage Δ
datalad/plugin/addurls.py 18.32% <0%> (-81.37%) ⬇️
datalad/support/globbedpaths.py 21.25% <0%> (-78.75%) ⬇️
datalad/interface/rerun.py 18.48% <0%> (-77.73%) ⬇️
datalad/interface/run.py 24% <0%> (-76%) ⬇️
datalad/plugin/export_archive.py 24.65% <0%> (-75.35%) ⬇️
datalad/interface/add_archive_content.py 19.23% <0%> (-71.16%) ⬇️
datalad/plugin/add_readme.py 29.31% <0%> (-63.8%) ⬇️
datalad/plugin/wtf.py 19.17% <0%> (-62.56%) ⬇️
datalad/metadata/metadata.py 23.39% <0%> (-62.12%) ⬇️
datalad/metadata/aggregate.py 14.24% <0%> (-60.24%) ⬇️
... and 212 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9059059...06ab803. Read the comment docs.

@yarikoptic
Copy link
Member Author

grr... Not sure which way to proceed here. So

  • so far support for .gz relied pretty much on the uniform implementation across all archives, where we just extract an archive under a directory and look at the files extracted
  • so it was inline with how p7zip handles all of them including .gz and p7zip does use stored in the .gz header original filename
  • not sure yet what is the behavior if no filename was stored in the header

I thought that may be we could provide our custom gunzip functionality which would respect that filename and thus would mitigate discrepancy, but

  • gzip/gunzip itself does not care about possibly stored original filename in the gzip header (unless given explicitly --name option)
  • Python's gzip library explicitly reads and discards (does not even store anywhere, bleh) possibly present in the header filename: https://github.com/python/cpython/blob/master/Lib/gzip.py#L427

so I do not see an easy way to figure out that filename in the header. But we can't easily switch to "just use original filename stripped of .gz" since

  • we might be operating already on a annex key (so no filename pointing to it)
  • would require implementing .gz-special handling of datalad-archive urls. It would actually be the only proper solution probably since that stored filename is not really something to rely on (later on could be extended to other .bz2, .xz etc; I don't think xz even stores the original filename). But once again -- currently datalad-archive relies on the handling for those as for any other archive. So we had urls like dl+archive:MD5E-s24--5590a3cbf10c77d61fb4058b741760d9.gz#path=123&size=0 not dl+archive:MD5E-s24--5590a3cbf10c77d61fb4058b741760d9.gz#path=MD5E-s24--5590a3cbf10c77d61fb4058b741760d9&size=0 or even more adequate here just dl+archive:MD5E-s24--5590a3cbf10c77d61fb4058b741760d9.gz#size=0 (without any path within)
  • so if we were to provide "proper" treatment/url we might need to provide compatibility layer just in case there is some datalad dataset with pointers to previously .gz'ed files. I.e. if there is a path for any of those .gz, we pretty much ignore it and verify that only a single file is extracted and take it.

This test passes whenever p7zip-full is installed
@yarikoptic
Copy link
Member Author

ok -- for now just decided to blow in no 7z is installed.

yarikoptic added a commit to yarikoptic/datalad-crawler that referenced this pull request Feb 24, 2019
… files

DataLad core has a number of improvements in handling of .gz files in
datalad/datalad#3176
but overall proper solution requires RFing of how all the .gz should be handled
including datalad-archives special remote, so they do not rely on any filename
in the header.  That is for now postponed.
Installing p7zip-full should mitigate the issue for now
Proper solution is a bit too involved, so postponed, and decided just to blow
with an informative message for now
@yarikoptic yarikoptic changed the base branch from master to 0.11.x February 25, 2019 14:22
@yarikoptic
Copy link
Member Author

oops -- forgot about our bright future, changed base to be 0.11.x

@kyleam
Copy link
Contributor

kyleam commented Feb 25, 2019

Thanks for summarizing the results of all your digging.

ok -- for now just decided to blow in no 7z is installed.

Sounds sensible.

@yarikoptic
Copy link
Member Author

with @kyleam 's blessing will merge for now

@yarikoptic yarikoptic merged commit 78b1011 into datalad:0.11.x Feb 25, 2019
@yarikoptic yarikoptic added this to the Release 0.11.4 milestone Feb 26, 2019
#
if isinstance(cmd, (list, tuple)) and kwargs.get('shell'):
# patool (as far as I see it) takes care about quoting args
cmd = ' '.join(cmd)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we might want to consider moving this into the runner itself. I didn't spot any place where we use shell and pass the list, so it shouldn't change any other behavior but might prevent us from hitting it again

yarikoptic added a commit that referenced this pull request Apr 6, 2019
    ## 0.11.4 (Mar 18, 2019) -- get-ready

    Largely a bug fix release with a few enhancements

    ### Important

    - 0.11.x series will be the last one with support for direct mode of [git-annex][]
      which is used on crippled (no symlinks and no locking) filesystems.
      v7 repositories should be used instead.

    ### Fixes

    - Extraction of .gz files is broken without p7zip installed.  We now
      abort with an informative error in this situation.  ([#3176][])

    - Committing failed in some cases because we didn't ensure that the
      path passed to `git read-tree --index-output=...` resided on the
      same filesystem as the repository.  ([#3181][])

    - Some pointless warnings during metadata aggregation have been
      eliminated.  ([#3186][])

    - With Python 3 the LORIS token authenticator did not properly decode
      a response ([#3205][]).

    - With Python 3 downloaders unnecessarily decoded the response when
      getting the status, leading to an encoding error.  ([#3210][])

    - In some cases, our internal command Runner did not adjust the
      environment's `PWD` to match the current working directory specified
      with the `cwd` parameter.  ([#3215][])

    - The specification of the pyliblzma dependency was broken.  ([#3220][])

    - [search] displayed an uninformative blank log message in some
      cases.  ([#3222][])

    - The logic for finding the location of the aggregate metadata DB
      anchored the search path incorrectly, leading to a spurious warning.
      ([#3241][])

    - Some progress bars were still displayed when stdout and stderr were
      not attached to a tty.  ([#3281][])

    - Check for stdin/out/err to not be closed before checking for `.isatty`.
      ([#3268][])

    ### Enhancements and new features

    - Creating a new repository now aborts if any of the files in the
      directory are tracked by a repository in a parent directory.
      ([#3211][])

    - [run] learned to replace the `{tmpdir}` placeholder in commands with
      a temporary directory.  ([#3223][])

    - [duecredit][] support has been added for citing DataLad itself as
      well as datasets that an analysis uses.  ([#3184][])

    - The `eval_results` interface helper unintentionally modified one of
      its arguments.  ([#3249][])

    - A few DataLad constants have been added, changed, or renamed ([#3250][]):
      - `HANDLE_META_DIR` is now `DATALAD_DOTDIR`.  The old name should be
         considered deprecated.
      - `METADATA_DIR` now refers to `DATALAD_DOTDIR/metadata` rather than
        `DATALAD_DOTDIR/meta` (which is still available as
        `OLDMETADATA_DIR`).
      - The new `DATASET_METADATA_FILE` refers to `METADATA_DIR/dataset.json`.
      - The new `DATASET_CONFIG_FILE` refers to `DATALAD_DOTDIR/config`.
      - `METADATA_FILENAME` has been renamed to `OLDMETADATA_FILENAME`.

* tag '0.11.4': (82 commits)
  Updated CHANGELOG.md for having merged check for not being closed
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.4 entries
  CHANGELOG.md: Update 0.11.4 entries
  CHANGELOG.md: Adjust the format of a link
  BF: split lines on spaces and commas befoe doing sed capture of #issue
  RF: adjust tools/link_issues_CHANGELOG to have MD links as [#issue][]
  BF+DOC: Fix all links to mark them valid markdown
  BF: Fix markup bug that prevents sphinx-build from succeeding
  DOC: extend coumentation of is_interactive
  BF: guard .istty with try/except
  ENH: ui: Drop progress bars if not attached to tty
  ENH: ui: Add another name for SilentConsoleLog
  BF: check for stdin/out/err to not be closed before checking for .isatty
  RF: HANDLE_META_DIR -> DATALAD_DOTDIR (but keeping an alias for compatibility)
  RF: define/use consts DATASET_CONFIG_FILE, DATASET_METADATA_FILE, METADATA_DIR
  RF: METADATA_DIR/FILE -> OLDMETADATA_DIR/FILE
  BF: do not cause a side-effect on kwargs in @eval_results
  BF: join with ds.path when trying to see if any other metadata file is available
  RF: Move duecredit into its own section so it is not a part of install_requires
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.4 entries
  ...
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Apr 6, 2019
    ## 0.11.4 (Mar 18, 2019) -- get-ready

    Largely a bug fix release with a few enhancements

    ### Important

    - 0.11.x series will be the last one with support for direct mode of [git-annex][]
      which is used on crippled (no symlinks and no locking) filesystems.
      v7 repositories should be used instead.

    ### Fixes

    - Extraction of .gz files is broken without p7zip installed.  We now
      abort with an informative error in this situation.  ([datalad#3176][])

    - Committing failed in some cases because we didn't ensure that the
      path passed to `git read-tree --index-output=...` resided on the
      same filesystem as the repository.  ([datalad#3181][])

    - Some pointless warnings during metadata aggregation have been
      eliminated.  ([datalad#3186][])

    - With Python 3 the LORIS token authenticator did not properly decode
      a response ([datalad#3205][]).

    - With Python 3 downloaders unnecessarily decoded the response when
      getting the status, leading to an encoding error.  ([datalad#3210][])

    - In some cases, our internal command Runner did not adjust the
      environment's `PWD` to match the current working directory specified
      with the `cwd` parameter.  ([datalad#3215][])

    - The specification of the pyliblzma dependency was broken.  ([datalad#3220][])

    - [search] displayed an uninformative blank log message in some
      cases.  ([datalad#3222][])

    - The logic for finding the location of the aggregate metadata DB
      anchored the search path incorrectly, leading to a spurious warning.
      ([datalad#3241][])

    - Some progress bars were still displayed when stdout and stderr were
      not attached to a tty.  ([datalad#3281][])

    - Check for stdin/out/err to not be closed before checking for `.isatty`.
      ([datalad#3268][])

    ### Enhancements and new features

    - Creating a new repository now aborts if any of the files in the
      directory are tracked by a repository in a parent directory.
      ([datalad#3211][])

    - [run] learned to replace the `{tmpdir}` placeholder in commands with
      a temporary directory.  ([datalad#3223][])

    - [duecredit][] support has been added for citing DataLad itself as
      well as datasets that an analysis uses.  ([datalad#3184][])

    - The `eval_results` interface helper unintentionally modified one of
      its arguments.  ([datalad#3249][])

    - A few DataLad constants have been added, changed, or renamed ([datalad#3250][]):
      - `HANDLE_META_DIR` is now `DATALAD_DOTDIR`.  The old name should be
         considered deprecated.
      - `METADATA_DIR` now refers to `DATALAD_DOTDIR/metadata` rather than
        `DATALAD_DOTDIR/meta` (which is still available as
        `OLDMETADATA_DIR`).
      - The new `DATASET_METADATA_FILE` refers to `METADATA_DIR/dataset.json`.
      - The new `DATASET_CONFIG_FILE` refers to `DATALAD_DOTDIR/config`.
      - `METADATA_FILENAME` has been renamed to `OLDMETADATA_FILENAME`.

* tag '0.11.4':
  Updated CHANGELOG.md for having merged check for not being closed
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.4 entries
  CHANGELOG.md: Update 0.11.4 entries
  CHANGELOG.md: Adjust the format of a link
  BF: split lines on spaces and commas befoe doing sed capture of #issue
  RF: adjust tools/link_issues_CHANGELOG to have MD links as [#issue][]
  BF+DOC: Fix all links to mark them valid markdown
  BF: Fix markup bug that prevents sphinx-build from succeeding
  DOC: extend coumentation of is_interactive
  BF: guard .istty with try/except
  BF: check for stdin/out/err to not be closed before checking for .isatty
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.4 entries
  CHANGELOG: Add entries for 0.11.4
  ENH: version boost and initial changes to CHANGELOG
@yarikoptic yarikoptic deleted the bf-gz-extract branch April 23, 2019 20:36
kyleam added a commit to kyleam/datalad that referenced this pull request May 11, 2020
All of the checks of the test generator fail on a system without
p7zip.  Before dataladgh-4041 (in particular, before beb5d04), the subset
of checks that existed at that time was skipped with (see dataladgh-3176):

    SKIP: cmd:7z is missing. (Not) Funny enough but ATM we need p7zip
    installation to handle .gz files extraction 'correctly'

However, with the change to the archive file name in 05e9353 (TST:
Expand test coverage to additional relevant compression formats,
2020-01-17), the MissingExternalDependency exception that led to the
skip for gzip formats is no longer triggered.  And even if it were,
the other file formats tested would fail.

Mark these tests as known failures when p7zip isn't available.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants