New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zip64 #211

Merged
merged 6 commits into from Mar 13, 2017

Conversation

Projects
None yet
2 participants
@kenkendk
Contributor

kenkendk commented Mar 9, 2017

This adds zip64 writing support.

The way zip64 is implemented is by appending a set of "extra" values to the header.
In the zip file there are two headers for each file: local (before the stream) and central (end of file).

The central header is simple enough, and most implementations simply use this and mostly ignore the other header. Once we are writing the central header, we have all the information required, so we can just write it.

For the local header, there is a tradeoff. The "extra" bytes take up 2+2+8+8=20 bytes pr. entry. This header is only required if either stream size (compressed and uncompressed) exceeds uint.MaxValue, but we do not know the length of the stream before writing.

The dilemma is: do we write it for all files, in case one is too long? Or do we not write it and risk overflowing the values?

Since the header is "mostly ignored" we could live with this being broken. On the other hand, if we can use it we should.

I have added a hint value to the ZipWriteEntryOptions that can force this header on and off. I have also added a flag in ZipWriterOptions that enable the extra fields for all entries by default. If the caller does not set any flags, as I would assume most would, I use a threshold of 2GiB to toggle zip64 headers. If the underlying stream is already 2GiB or more, zip64 is automatically enabled in the local header.
This is not a perfect solution, but the I figure most users would write smaller zip files and never notices. Larger zip files are not really impacted by 20 bytes here and there. You can of course defeat the scheme by writing a 1.9GiB file, and then a +4GiB file, thus hitting the limitations before the automatic upgrade kicks in.

If the stream is non-seekable, we have another issue, namely that the file would normally set a flag and then write the Crc, uncompressed size, and compressed size in a special trailing header. This trailing header has not been updated to use zip64, so we cannot write the correct values in it. We can also not use both trailing headers and "extra" data. This was clarified from PKWare: https://blogs.oracle.com/xuemingshen/entry/is_zipinput_outputstream_handling_of

In the case of streaming, the local headers are written with the trailing data, which may overflow. But the headers do contain the crc32 value, and may contain correct data if the sizes are less than uint.MaxValue. Again, the central directory header has the correct values.

Not sure how to deal with testing, as it requires files +4GiB to hit the limitations.

@adamhathcock

This comment has been minimized.

Show comment
Hide comment
@adamhathcock

adamhathcock Mar 10, 2017

Owner

So if I understand, there's no post data descriptor for zip64 but the central directory has the values?

Then you can't use streaming reads on zip64?

This pretty shitty. I guess they updated APPNOTE.txt:

4.3.9.5 An extensible data descriptor will be released in a 
      future version of this APPNOTE.  This new record is intended to
      resolve conflicts with the use of this record going forward,
      and to provide better support for streamed file processing.

      4.3.9.6 When the Central Directory Encryption method is used, 
      the data descriptor record is not required, but MAY be used.  
      If present, and bit 3 of the general purpose bit field is set to 
      indicate its presence, the values in fields of the data descriptor
      record MUST be set to binary zeros.  See the section on the Strong 
      Encryption Specification for information. Refer to the section in 
      this document entitled "Incorporating PKWARE Proprietary Technology 
      into Your Product" for more information.

So there's currently no way to forward-only read entries out of a zip64.

Would a good test to just be forcing zip64 on a small file? The values getting overloaded shouldn't really matter.

Owner

adamhathcock commented Mar 10, 2017

So if I understand, there's no post data descriptor for zip64 but the central directory has the values?

Then you can't use streaming reads on zip64?

This pretty shitty. I guess they updated APPNOTE.txt:

4.3.9.5 An extensible data descriptor will be released in a 
      future version of this APPNOTE.  This new record is intended to
      resolve conflicts with the use of this record going forward,
      and to provide better support for streamed file processing.

      4.3.9.6 When the Central Directory Encryption method is used, 
      the data descriptor record is not required, but MAY be used.  
      If present, and bit 3 of the general purpose bit field is set to 
      indicate its presence, the values in fields of the data descriptor
      record MUST be set to binary zeros.  See the section on the Strong 
      Encryption Specification for information. Refer to the section in 
      this document entitled "Incorporating PKWARE Proprietary Technology 
      into Your Product" for more information.

So there's currently no way to forward-only read entries out of a zip64.

Would a good test to just be forcing zip64 on a small file? The values getting overloaded shouldn't really matter.

@kenkendk

This comment has been minimized.

Show comment
Hide comment
@kenkendk

kenkendk Mar 10, 2017

Contributor

Streaming

Yes, that is my understanding: streaming cannot write 64bit values in the post-data.

I know they previously said that the post-data header should have 8-byte values for zip64, but they forgot to mention how to figure out if the file is zip64 when reading forward-only. The update you mention is probably a thought that there should be some kind of identifier in the post-data region.

You can still use forward-only reading, but you cannot verify if the lengths are correct, if the stream is larger than 4GiB.

Edit: This problem is worst if you write the archive forward-only AND read the archive in forward-only. If the stream is seekable, it is possible to forgo the post-data and use the zip64 extras, such that forward-only reading is possible.

Testing

You can force zip64 to have the extra headers, but I would expect errors to show up when values are hitting the unit.MaxValue limit.

Misc

One detail I forgot to mention in the initial text is that the list of files report the Count as int, so there is an overflow there if you have more than int.MaxValue files in the archive. SharpCompress does not use this value for anything, but other libraries may.

Contributor

kenkendk commented Mar 10, 2017

Streaming

Yes, that is my understanding: streaming cannot write 64bit values in the post-data.

I know they previously said that the post-data header should have 8-byte values for zip64, but they forgot to mention how to figure out if the file is zip64 when reading forward-only. The update you mention is probably a thought that there should be some kind of identifier in the post-data region.

You can still use forward-only reading, but you cannot verify if the lengths are correct, if the stream is larger than 4GiB.

Edit: This problem is worst if you write the archive forward-only AND read the archive in forward-only. If the stream is seekable, it is possible to forgo the post-data and use the zip64 extras, such that forward-only reading is possible.

Testing

You can force zip64 to have the extra headers, but I would expect errors to show up when values are hitting the unit.MaxValue limit.

Misc

One detail I forgot to mention in the initial text is that the list of files report the Count as int, so there is an overflow there if you have more than int.MaxValue files in the archive. SharpCompress does not use this value for anything, but other libraries may.

@adamhathcock

This comment has been minimized.

Show comment
Hide comment
@adamhathcock

adamhathcock Mar 10, 2017

Owner

You could always just say upfront to write everything as zip64 and deal with streaming that way. Then write post descriptors with 8 bytes. I guess that would be incompatible with other implementations?

I guess if the code allows forward-only reading but you'd have to read the central directory for the valid length that's something. However, that's now how the reader works now. The central directory isn't used with readers, only archives. Ugh.

I guess the count should be 'ulong' like everything else.

This code really needs some refactor lovin too I think.

Owner

adamhathcock commented Mar 10, 2017

You could always just say upfront to write everything as zip64 and deal with streaming that way. Then write post descriptors with 8 bytes. I guess that would be incompatible with other implementations?

I guess if the code allows forward-only reading but you'd have to read the central directory for the valid length that's something. However, that's now how the reader works now. The central directory isn't used with readers, only archives. Ugh.

I guess the count should be 'ulong' like everything else.

This code really needs some refactor lovin too I think.

@kenkendk

This comment has been minimized.

Show comment
Hide comment
@kenkendk

kenkendk Mar 10, 2017

Contributor

Streaming

I guess we can look at "what others do", but that is likely to cause errors. If the readers expect 4-byte values, there should at least be some indicator somewhere that the values are not 4-byte.

Count overflow

Yes, but that would mean replacing the system List<T>, which is a rather intrusive change.

Contributor

kenkendk commented Mar 10, 2017

Streaming

I guess we can look at "what others do", but that is likely to cause errors. If the readers expect 4-byte values, there should at least be some indicator somewhere that the values are not 4-byte.

Count overflow

Yes, but that would mean replacing the system List<T>, which is a rather intrusive change.

@adamhathcock

This comment has been minimized.

Show comment
Hide comment
@adamhathcock

adamhathcock Mar 10, 2017

Owner

I usually test with WinRAR and 7Zip myself.

I guess I see the point about the collections. Let's save that for a rainy day.

Owner

adamhathcock commented Mar 10, 2017

I usually test with WinRAR and 7Zip myself.

I guess I see the point about the collections. Let's save that for a rainy day.

@kenkendk

This comment has been minimized.

Show comment
Hide comment
@kenkendk

kenkendk Mar 10, 2017

Contributor

Maybe a better approach is to simply throw an exception if that case is hit?

If the stream is non-seekable, and the input data is larger than uint.MaxValue, we could throw an exception.

Contributor

kenkendk commented Mar 10, 2017

Maybe a better approach is to simply throw an exception if that case is hit?

If the stream is non-seekable, and the input data is larger than uint.MaxValue, we could throw an exception.

@adamhathcock

This comment has been minimized.

Show comment
Hide comment
@adamhathcock

adamhathcock Mar 10, 2017

Owner

Yes. Explicit exceptions are good if there's no fix. At least to let people know's happening.

Thanks again!

Owner

adamhathcock commented Mar 10, 2017

Yes. Explicit exceptions are good if there's no fix. At least to let people know's happening.

Thanks again!

@kenkendk

This comment has been minimized.

Show comment
Hide comment
@kenkendk

kenkendk Mar 10, 2017

Contributor

Should we do the same for the case where we use a seek-able stream, and do not have space for the zip64 extra fields?

Maybe we could then choose to not make space for it by default (remove my 2GiB threshold) and make the caller set the zip64 flag when creating the archive or stream exceeds 4GiB ?

Contributor

kenkendk commented Mar 10, 2017

Should we do the same for the case where we use a seek-able stream, and do not have space for the zip64 extra fields?

Maybe we could then choose to not make space for it by default (remove my 2GiB threshold) and make the caller set the zip64 flag when creating the archive or stream exceeds 4GiB ?

@adamhathcock

This comment has been minimized.

Show comment
Hide comment
@adamhathcock

adamhathcock Mar 10, 2017

Owner

I would rather be explicit in these cases and force the user to know what they're doing. Especially when getting into the zip64 file size area.

I guess just measure then throw exceptions if it's not zip64 mode.

Owner

adamhathcock commented Mar 10, 2017

I would rather be explicit in these cases and force the user to know what they're doing. Especially when getting into the zip64 file size area.

I guess just measure then throw exceptions if it's not zip64 mode.

@kenkendk

This comment has been minimized.

Show comment
Hide comment
@kenkendk

kenkendk Mar 10, 2017

Contributor

Ok, I will update the code to require explicit zip64 activation when writing streams larger than 4GiB. And disallow writing streams larger than 4GiB in forward-only mode.

Archives larger than 4GiB can be transparently supported, as long as all streams inside are less than 4GiB individually.

Contributor

kenkendk commented Mar 10, 2017

Ok, I will update the code to require explicit zip64 activation when writing streams larger than 4GiB. And disallow writing streams larger than 4GiB in forward-only mode.

Archives larger than 4GiB can be transparently supported, as long as all streams inside are less than 4GiB individually.

kenkendk added some commits Mar 10, 2017

Fixed an error in the zip64 central end of header: the signature + le…
…ngth (12 bytes) are not included in the reported length.
Changed the logic to throw exceptions when sizes exceed the zip archi…
…ve limits, and zip64 is not enabled.

This changes the logic, such that archives larger than 4GiB are still automatically written correct (only the central header is special).
Archives with individual streams larger than 4 GiB must set the zip64 flag, either on the archive or the individual streams.
@kenkendk

This comment has been minimized.

Show comment
Hide comment
@kenkendk

kenkendk Mar 11, 2017

Contributor

I have updated the logic to require setting the zip64 flag before writing a stream larger than 4GiB.
I have also disabled zip64 support for writing non-seekable streams.

I have added the check to the Write method of the ZipWritingStream to avoid throwing exceptions in the Dispose method.

Not sure how you prefer this to work, but we cannot know in advance if the write expands the file beyond 4GiB, so I have made a pre-check and a post-check. If we catch it in the pre-check, the file is correct (but with a shorter stream), if we catch it in the post check, invalid data is now in the archive.
Maybe the post-check is sufficient? Is it acceptable to close the base stream if the write has produced an invalid file?

I have also added a unittest for trying out various parts of the zip64 features.

Contributor

kenkendk commented Mar 11, 2017

I have updated the logic to require setting the zip64 flag before writing a stream larger than 4GiB.
I have also disabled zip64 support for writing non-seekable streams.

I have added the check to the Write method of the ZipWritingStream to avoid throwing exceptions in the Dispose method.

Not sure how you prefer this to work, but we cannot know in advance if the write expands the file beyond 4GiB, so I have made a pre-check and a post-check. If we catch it in the pre-check, the file is correct (but with a shorter stream), if we catch it in the post check, invalid data is now in the archive.
Maybe the post-check is sufficient? Is it acceptable to close the base stream if the write has produced an invalid file?

I have also added a unittest for trying out various parts of the zip64 features.

@adamhathcock

This comment has been minimized.

Show comment
Hide comment
@adamhathcock

adamhathcock Mar 13, 2017

Owner

It looks like you throw an exception when the file produced will be invalid so that seems sufficient.

Owner

adamhathcock commented Mar 13, 2017

It looks like you throw an exception when the file produced will be invalid so that seems sufficient.

@adamhathcock adamhathcock merged commit f853840 into adamhathcock:master Mar 13, 2017

1 check passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment