Compress-555 allow reading of stored entries in zips by default #137
Conversation
Sounds like Tika needs to call the other constructor.
Well, it goes through the ArchiveStreamFactory in commons-compress, and there is no way for Tika to indicate whether it wants that feature enabled, since the option is specific to ZIP.
Good find. Let's get some feedback from the community to learn whether there are side effects to worry about here.
Some explanation of the STORED method: an entry using STORED is not compressed at all; extraction is just a memory copy. Most of the time the size and the compressed size (they are equal in the case of STORED) are recorded in the Local File Header, which is located before the raw file data. So we can read exactly the right number of bytes, because we already know the amount of data before we start extracting the file. This changes when an entry uses STORED together with a data descriptor. The size and compressed size are then stored in the data descriptor, which is located after the raw file data. With a data descriptor we do not know the size of the file before we start extracting it, so we have to check byte by byte for a signature (the signature of a Local File Header, Central Directory Record, or Data Descriptor) to find where the entry ends. Obviously, extraction is slower when a data descriptor is used with STORED. Worse, there are cases that lead to an extraction error: some bytes inside the raw file data may happen to equal a signature, and Compress will stop reading at the wrong location. This mostly happens when a zip archive contains another zip archive, and we do not have any workaround for it. In short, enabling this option turns on a slow and potentially incorrect scan. That's why we default `allowStoredEntriesWithDataDescriptor` to false.
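To make the first half of that explanation concrete, here is a minimal, self-contained sketch using only `java.util.zip` (the class and method names are illustrative, not from this PR). Writing a STORED entry forces the caller to supply the size and CRC-32 up front, which is exactly what lets them be written into the Local File Header — and why a reader can then know the entry's size before consuming its data:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StoredEntryDemo {

    public static String buildAndRead() throws IOException {
        byte[] data = "hello stored entry".getBytes(StandardCharsets.UTF_8);

        // For a STORED entry, size, compressed size and CRC-32 must be
        // known before writing, so they can go into the Local File Header.
        CRC32 crc = new CRC32();
        crc.update(data);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            ZipEntry entry = new ZipEntry("file.txt");
            entry.setMethod(ZipEntry.STORED);     // no compression: a plain copy
            entry.setSize(data.length);
            entry.setCompressedSize(data.length); // equal to size for STORED
            entry.setCrc(crc.getValue());
            zos.putNextEntry(entry);
            zos.write(data);
            zos.closeEntry();
        }

        // Reading back: because the sizes sit in the Local File Header,
        // the entry's size is available before any data is consumed.
        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            ZipEntry e = zis.getNextEntry();
            return e.getName() + " " + e.getSize();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(buildAndRead());
    }
}
```

The problematic case described above is the opposite: a streaming writer that does not know the size in advance leaves the Local File Header fields empty and appends a data descriptor after the payload, which is what forces the byte-by-byte signature scan.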
Ah, that explanation helps a lot @PeterAlfredLee. Sounds like we need to look at other options for handling this, either in this project or in Tika.
Given the additional information about STORED entries, this is not an appropriate change. An update will instead be made to Tika.
Tika update: apache/tika#356
We are currently using Tika for text extraction, which uses commons-compress for handling zips. Some sites are returning zips that contain STORED entries with data descriptors, and these fail to extract because ZipArchiveInputStream defaults 'allowStoredEntriesWithDataDescriptor' to false.
This PR adjusts ZipArchiveInputStream to allow reading of stored entries with data descriptors by default.
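For callers who construct the stream themselves (rather than going through ArchiveStreamFactory), the flag can already be opted into via ZipArchiveInputStream's four-argument constructor, whose final boolean is `allowStoredEntriesWithDataDescriptor`. A small sketch, assuming commons-compress is on the classpath; the class name and the in-memory sample zip are illustrative only:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;

public class ReadStoredZip {

    // Build a tiny zip in memory so the example is self-contained.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hi".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static String readFirstEntryName(byte[] zip) throws IOException {
        // Constructor arguments: stream, encoding, useUnicodeExtraFields,
        // allowStoredEntriesWithDataDescriptor. The final `true` is the
        // flag this PR proposed flipping on by default.
        try (ZipArchiveInputStream zin = new ZipArchiveInputStream(
                new ByteArrayInputStream(zip), "UTF8", true, true)) {
            ZipArchiveEntry e = zin.getNextZipEntry();
            return e.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readFirstEntryName(sampleZip()));
    }
}
```

The difficulty raised in the conversation is that Tika never calls this constructor directly: ArchiveStreamFactory creates the stream internally and exposes no way to pass ZIP-specific options through.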
https://issues.apache.org/jira/browse/COMPRESS-555