Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compress-555 allow reading of stored entries in zips by default #137

Closed
wants to merge 2 commits into from
Closed

Compress-555 allow reading of stored entries in zips by default #137

wants to merge 2 commits into from

Conversation

tbentleypfpt
Copy link

We are currently using tika for text extraction which uses commons-compress for handling zips. Currently some sites are returning zips that have entries with stored data descriptors which fail to extract due to the ZipArchiveInputStream defaulting to false for 'allowStoredEntriesWithDataDescriptor'.

This PR adjusts the ZipArchiveInputStream to allow reading of stored entries with data descriptor by default.

https://issues.apache.org/jira/browse/COMPRESS-555

@garydgregory
Copy link
Member

Sounds like tika needs to call the other constructor.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 87.231% when pulling 3c0ec25 on tbentleypfpt:COMPRESS-555-allow-stored-entries-in-zips-by-default into 550e525 on apache:master.

@tbentleypfpt
Copy link
Author

Sounds like tika needs to call the other constructor.

Well it goes through the ArchiveStreamFactory in commons-compress and there is no way of passing in from tika whether we want to enable that feature or not since it's specific to ZIP.

@Override public ArchiveInputStream createArchiveInputStream(final String archiverName, final InputStream in, final String actualEncoding) throws ArchiveException { ... if (ZIP.equalsIgnoreCase(archiverName)) { if (actualEncoding != null) { return new ZipArchiveInputStream(in, actualEncoding); } return new ZipArchiveInputStream(in); } ...

@garydgregory
Copy link
Member

Good find. Let's get some feedback from the community to learn if there are side-effects to worry about here.

@PeterAlfredLee
Copy link
Member

Some explanizations for the allowStoredEntriesWithDataDescriptor:

An entry using STORED method means it would not use any compression - it's just a memcopy. For most of time, the size and compressed size(they are equal in the case of STORED) are stored in the Local File Header - which locates before the file raw data. So we can directly read the accurate amout of bytes, because we already know the amout of data before we start to extract the file.

This changes if the entry is using STORED and data descriptor at the same time. The size and compressed size is stored in the data descriptor, and the data descriptor locates after the raw file data. With data descriptor, we do not know the size of the file before we start to extract the file. We do not know the accurate size of data before we start extracting file, so we need to check if a signature(signature of Local File Header, Central Directory Record or Data Descriptor) is met byte by byte.

Obviously, it is slower if the data descriptor is used for STORED. What's worse, there are some cases lead to a error extracting : some bytes in the raw file data may equal to signature, and compress will stop reading at a wrong location. This is mostly happened in a zip archive that contains another zip archive. And wo do not have any workaround here.

In short words, setting allowStoredEntriesWithDataDescriptor to be true for STROED entries would lead to a slower extraction, and for some times it would lead to a failed extraction(e.g. zip archive in zip).

That's why we use allowStoredEntriesWithDataDescriptor and the default value is false. This may be a little complicated, but I hope it helps.

@tbentleypfpt
Copy link
Author

Ah that explanation helps a lot @PeterAlfredLee. Sounds like we need to look at other options for handling this. Either in this project or in tika.

@tbentleypfpt
Copy link
Author

After receiving the additional information about STORED entries, this is not an appropriate change. An update will instead be made to Tika.
https://issues.apache.org/jira/browse/TIKA-3196

@tbentleypfpt
Copy link
Author

Tika Update: apache/tika#356

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants