Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7z-based (optional) replacement for patool #4041

Merged
merged 17 commits into from Jan 21, 2020
Merged

Conversation

mih
Copy link
Member

@mih mih commented Jan 17, 2020

Sitting on top of #4039

This is a complete draft implementation of a 7z-based file compression/decompression that should be as capable as the patool-based one.

TODO

  • confirm runs on windows (main objective here)
  • expand tests for additional archive formats and compressors (prev. tests were fairly minimal)
    • support 7z archives
    • support xz and tar.xz (required to pull in xz-utils on travis
    • run entirety of test_archives.py on all platforms
  • implement fall-back to patool on non-windows platforms
    • requestable via boolean datalad.runtime.use-patoolconfig switch
    • transparent fall-back when 7z is not found/supported (as requested by @yarikoptic) although now we need close inspection of what was actually executed whenever something archive-related fails.
  • wtf reports 7z version
  • RF if Replace shlex.quote() with quote_cmdlinearg() #4050 should land before this or vice versa

Further related TODOs (for subsequent PRs) surfaced by this PR:

@mih
Copy link
Member Author

mih commented Jan 17, 2020

Grrrr, this was supposed to be in draft mode. I see no way to achieve that post-factum.

@mih mih force-pushed the patoolreplacement branch 3 times, most recently from cd028b7 to d5c1939 Compare Jan 17, 2020
Copy link
Member

@yarikoptic yarikoptic left a comment

Automagically fallback to patool if 7z command is not available. I would have even may be wrapped *comparess_file so if exception would happen, announce in the log (before reraising) to try an alternative method by changing that config setting. Otherwise it is burried and nobody would discover that there is a config setting they could tune (until they ask)

datalad/support/archives.py Outdated Show resolved Hide resolved
datalad/support/archives.py Outdated Show resolved Hide resolved
@yarikoptic
Copy link
Member

yarikoptic commented Jan 17, 2020

  • expand tests for additional archive formats and compressors

would also be nice to

  • make one of the travis runs use alternative (non default) archives handler (i.e. patool instead of 7z) by providing config var in the env.

@mih
Copy link
Member Author

mih commented Jan 17, 2020

Holy cow, this is a rabbit hole... I am so deep in add_archive_content.py and I don't even know why...

@codecov
Copy link

codecov bot commented Jan 17, 2020

Codecov Report

Merging #4041 into master will decrease coverage by 0.08%.
The diff coverage is 89.9%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #4041      +/-   ##
=========================================
- Coverage   89.79%   89.7%   -0.09%     
=========================================
  Files         273     274       +1     
  Lines       36577   36620      +43     
=========================================
+ Hits        32843   32851       +8     
- Misses       3734    3769      +35
Impacted Files Coverage Δ
datalad/interface/tests/test_download_url.py 100% <ø> (ø) ⬆️
datalad/tests/test_archives.py 94.92% <100%> (+4.21%) ⬆️
datalad/interface/download_url.py 97.89% <100%> (+0.16%) ⬆️
datalad/support/external_versions.py 91.03% <58.33%> (-2.96%) ⬇️
datalad/support/archives.py 89.47% <84.21%> (+0.05%) ⬆️
datalad/support/archive_utils_7z.py 88% <88%> (ø)
datalad/support/archive_utils_patool.py 40.25% <0%> (-42.86%) ⬇️
datalad/utils.py 87.26% <0%> (ø) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fca2866...25d7c34. Read the comment docs.

@mih
Copy link
Member Author

mih commented Jan 17, 2020

Automagically fallback to patool if 7z command is not available.

I am reluctant to do this. We are pretty much in the same situation with patool. We can pull patool in via dependency, but we cannot pull in its dependencies (tar, zip, ...). Moreover, the reason why I am doing this to begin with, is that this code is broken, and it should be removed, or fixed. Hence, I went for a dedicated switch.

@yarikoptic
Copy link
Member

yarikoptic commented Jan 17, 2020

We are pretty much in the same situation with patool. We can pull patool in via dependency, but we cannot pull in its dependencies (tar, zip, ...).

patool has fallbacks from 7z to e.g. zip, jar, etc, so it could work on systems without 7z. You might fix some bugs (on windows) but would introduce datalad crashes on others (without 7z, which is generally not installed by default, whenever zip etc are more known and more likely to be installed). so I would strongly appreciate (could buy a bottle) if we could avoid such crashes with a cost of a single if condition.

@mih mih added the merge-if-ok label Jan 19, 2020
@mih mih added this to Major issues in Windows Jan 19, 2020
@mih
Copy link
Member Author

mih commented Jan 19, 2020

I am planning to merge this relatively quickly: On Monday, or whenever the test pass (whichever happens last). It is the foundation for more work on windows, which will go into different directions, hence should not be added here.

make one of the travis runs use alternative (non default) archives handler (i.e. patool instead of 7z) by providing config var in the env.

I would leave this out here, for now, and as a topic for further debate. I think we should not test this code for every PR. Maybe a cron job would suffice.

mih added 12 commits Jan 20, 2020
…taladgh-4019)

Keep the old implementation around (by request of @yarikoptic), but
require a switch (datalad.runtime.use-patool) to turn it on.
The reason is the same as what made us use MD5E instead MD5 as the
default backend. If we purposefully remove an extension, we have
mimetype inspection left is the only option to decided on some
further processing (e.g. archive type). Possible, but needlessly
complicated.

An alternative fix would have been to not purposefully use an
extension-less archive name in the associated tests.
Via @yarikoptic:

> would remain a preferred way to simply reraise exception, since
  otherwise it would use this line as the source of exception while also
  mentioning original one (in python3 only)
Now required on import.
@mih mih merged commit 790ab49 into datalad:master Jan 21, 2020
17 checks passed
@mih mih deleted the patoolreplacement branch Jan 21, 2020
@mih mih moved this from Major issues to Done in Windows Mar 12, 2020
kyleam added a commit to kyleam/datalad that referenced this issue May 11, 2020
All of the checks of the test generator fail on a system without
p7zip.  Before dataladgh-4041 (in particular, before beb5d04), the subset
of checks that existed at that time was skipped with (see dataladgh-3176):

    SKIP: cmd:7z is missing. (Not) Funny enough but ATM we need p7zip
    installation to handle .gz files extraction 'correctly'

However, with the change to the archive file name in 05e9353 (TST:
Expand test coverage to additional relevant compression formats,
2020-01-17), the MissingExternalDependency exception that led to the
skip for gzip formats is no longer triggered.  And even if it were,
the other file formats tested would fail.

Mark these tests as known failures when p7zip isn't available.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-if-ok
Projects
No open projects
Windows
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants