Compress-555 allow reading of stored entries in zips by default #137

tbentleypfpt · 2020-09-10T17:10:59Z

We are currently using tika for text extraction which uses commons-compress for handling zips. Currently some sites are returning zips that have entries with stored data descriptors which fail to extract due to the ZipArchiveInputStream defaulting to false for 'allowStoredEntriesWithDataDescriptor'.

This PR adjusts the ZipArchiveInputStream to allow reading of stored entries with data descriptor by default.

https://issues.apache.org/jira/browse/COMPRESS-555

…efault

garydgregory · 2020-09-10T17:19:21Z

Sounds like tika needs to call the other constructor.

coveralls · 2020-09-10T17:23:46Z

Coverage increased (+0.01%) to 87.231% when pulling 3c0ec25 on tbentleypfpt:COMPRESS-555-allow-stored-entries-in-zips-by-default into 550e525 on apache:master.

tbentleypfpt · 2020-09-10T17:29:35Z

Sounds like tika needs to call the other constructor.

Well it goes through the ArchiveStreamFactory in commons-compress and there is no way of passing in from tika whether we want to enable that feature or not since it's specific to ZIP.

@Override public ArchiveInputStream createArchiveInputStream(final String archiverName, final InputStream in, final String actualEncoding) throws ArchiveException { ... if (ZIP.equalsIgnoreCase(archiverName)) { if (actualEncoding != null) { return new ZipArchiveInputStream(in, actualEncoding); } return new ZipArchiveInputStream(in); } ...

garydgregory · 2020-09-10T18:47:31Z

Good find. Let's get some feedback from the community to learn if there are side-effects to worry about here.

PeterAlfredLee · 2020-09-11T02:29:00Z

Some explanizations for the allowStoredEntriesWithDataDescriptor:

An entry using STORED method means it would not use any compression - it's just a memcopy. For most of time, the size and compressed size(they are equal in the case of STORED) are stored in the Local File Header - which locates before the file raw data. So we can directly read the accurate amout of bytes, because we already know the amout of data before we start to extract the file.

This changes if the entry is using STORED and data descriptor at the same time. The size and compressed size is stored in the data descriptor, and the data descriptor locates after the raw file data. With data descriptor, we do not know the size of the file before we start to extract the file. We do not know the accurate size of data before we start extracting file, so we need to check if a signature(signature of Local File Header, Central Directory Record or Data Descriptor) is met byte by byte.

Obviously, it is slower if the data descriptor is used for STORED. What's worse, there are some cases lead to a error extracting : some bytes in the raw file data may equal to signature, and compress will stop reading at a wrong location. This is mostly happened in a zip archive that contains another zip archive. And wo do not have any workaround here.

In short words, setting allowStoredEntriesWithDataDescriptor to be true for STROED entries would lead to a slower extraction, and for some times it would lead to a failed extraction(e.g. zip archive in zip).

That's why we use allowStoredEntriesWithDataDescriptor and the default value is false. This may be a little complicated, but I hope it helps.

tbentleypfpt · 2020-09-11T16:17:16Z

Ah that explanation helps a lot @PeterAlfredLee. Sounds like we need to look at other options for handling this. Either in this project or in tika.

tbentleypfpt · 2020-09-11T17:07:21Z

After receiving the additional information about STORED entries, this is not an appropriate change. An update will instead be made to Tika.
https://issues.apache.org/jira/browse/TIKA-3196

tbentleypfpt · 2020-09-11T17:48:36Z

Tika Update: apache/tika#356

tbentleypfpt added 2 commits September 8, 2020 15:58

ZipArchiveInputStream allows stored entries with data descriptor by d…

b66ecaf

…efault

Revert tests

3c0ec25

tbentleypfpt closed this Sep 11, 2020

tballison mentioned this pull request Sep 18, 2020

Attempt to read zips with STORED data descriptors apache/tika#356

Closed

PeterAlfredLee mentioned this pull request Sep 25, 2020

Fix TIKA-3196 apache/tika#364

Merged

looly mentioned this pull request Nov 16, 2022

UnsupportedZipFeatureException: Unsupported feature data descriptor used in entry dromara/hutool#2736

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compress-555 allow reading of stored entries in zips by default #137

Compress-555 allow reading of stored entries in zips by default #137

tbentleypfpt commented Sep 10, 2020

garydgregory commented Sep 10, 2020

coveralls commented Sep 10, 2020

tbentleypfpt commented Sep 10, 2020

garydgregory commented Sep 10, 2020

PeterAlfredLee commented Sep 11, 2020

tbentleypfpt commented Sep 11, 2020

tbentleypfpt commented Sep 11, 2020

tbentleypfpt commented Sep 11, 2020

Compress-555 allow reading of stored entries in zips by default #137

Compress-555 allow reading of stored entries in zips by default #137

Conversation

tbentleypfpt commented Sep 10, 2020

garydgregory commented Sep 10, 2020

coveralls commented Sep 10, 2020

tbentleypfpt commented Sep 10, 2020

garydgregory commented Sep 10, 2020

PeterAlfredLee commented Sep 11, 2020

tbentleypfpt commented Sep 11, 2020

tbentleypfpt commented Sep 11, 2020

tbentleypfpt commented Sep 11, 2020