Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Flush to ZipArchive #24149

Open
Tracked by #62658
twsouthwick opened this issue Nov 15, 2017 · 19 comments
Open
Tracked by #62658

Add Flush to ZipArchive #24149

twsouthwick opened this issue Nov 15, 2017 · 19 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.IO.Compression needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@twsouthwick
Copy link
Member

When creating a package on .NET Core, you have to call Dispose or Close to get the contents written out to the underlying stream. This works on .NET Framework, but fails on .NET Core:

[Fact]
public void FlushSavesPackageContents()
{
   var uri = new Uri("/something", UriKind.Relative);
   var contentType = "something/other";

   byte[] CreatePackage()
   {
       using (var ms = new MemoryStream())
       {
           using (var package = Package.Open(ms, FileMode.Create))
           {
               package.CreatePart(uri, contentType);

               Assert.Empty(ms.ToArray());

               package.Flush();

               // All data should be written after calling flush
               return ms.ToArray();
           }
       }
   }

   var bytes = CreatePackage();

   Assert.NotEmpty(bytes);

   using (var ms = new MemoryStream(bytes))
   using (var package = Package.Open(ms))
   {
       Assert.True(package.PartExists(uri));
       Assert.Equal(contentType, package.GetPart(uri).ContentType);
   }
}
@twsouthwick
Copy link
Member Author

It appears that this is due to not flushing the underlying zip. In .NET Framework it is flushed, but not in .NET Core. I'll try adding that in and see if it works.

@twsouthwick
Copy link
Member Author

@ianhays ZipArchive does not have a Flush method. Is there a way to flush the current archive state to a stream, or is this part of the issue to get streaming support for compression?

@ianhays
Copy link
Contributor

ianhays commented Nov 15, 2017

I'm not sure why ZipArchive in netcore doesn't have a flush method. It was ported to CoreFX without one. It could be that it was removed since it was internal-only in netfx and not used in netcore.

We could probably add it as public, at least for ZipArchiveMode.Create. I'm not sure how that would affect being able to open some entries though. We would need to add lots of tests to first verify the netfx Flushing behavior - which may be tricky since it's not exposed through ZipArchive - then try to match that in core. At that point a streaming API may be the better solution.

@twsouthwick
Copy link
Member Author

Yeah - I took a shot at just calling the WriteFile method (which is what the Dispose method calls) in lieu of a flush and it caused a lot of problems. It seems to close streams, so it expects to be called in the dispose. Exposing it would be nice, but I agree that there will be a need to add a bunch of tests for it.

We seem to be finding many issues with System.IO.Packaging in the DocumentFormat.OpenXml library that would be fixed with the streaming API...

@daniel-white
Copy link

Any progress on this? we use DocumentFormat.OpenXml and it has intentional bugs that keep parity between .net fx and .net core. see dotnet/Open-XML-SDK#380

@ianhays ianhays changed the title Calling Flush on System.IO.Packaging.Package does not flush contents Add Flush to ZipArchive Jan 18, 2018
@ianhays
Copy link
Contributor

ianhays commented Jan 18, 2018

There hasn't been any progress that I'm aware of, no. I've updated the title and the tags with the current state of the issue and marked it as up-for-grabs.

The work item here is to plan an API addition of Flush to ZipArchive. To do that, a formal proposal will have to be made and the design questions in the comments above should be answered.

@ghost
Copy link

ghost commented Jul 21, 2018

So I have just hit this issue too. Porting some code from FX to .Net. Need ZipPackage to be able to flush the stream. Is there any workaround ? Else, I can take on adding the API to ZipArchive.

@odhanson
Copy link
Member

Sorry, replaced my github account. So the message above is from me.

I have made the changes to add a Flush API and added tests to verify. I will follow through the procedures for getting the API and change approved.

@SinxHe
Copy link

SinxHe commented Aug 10, 2018

when will the flush function will be added?

@odhanson
Copy link
Member

@SinxHe , unfortunately fixing this issue is a bit more tricky than I initially anticipated. As I have managed to work around this issue for now, I am not sure when I will be able to get back to work on a fix for this.

@twsouthwick
Copy link
Member Author

@odhanson Thanks for taking a look at this. Can you share a fork with your current progress so someone else can build off of what you've tried?

@odhanson
Copy link
Member

odhanson commented Aug 20, 2018

@twsouthwick

Sure. I have made a fork under my account:
https://github.com/odhanson/corefx/tree/dev/Flush

There are two commits, the first adds the API and some tests. The second implements the API, and adds some tests that failed my implementation :(. So the good news is, that we do have a bunch of tests to base off of.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@carlossanlop carlossanlop modified the milestones: 5.0.0, Future Jun 18, 2020
@MathewRimmington
Copy link

Hi, I'm wondering if there has been any progress on this issue? I'm seeing an out of memory exception when trying to create a huge spreadsheet. As the OpenXmlWriter is writing contents the memory usage continues to creep up, but there is currently no way to flush the data so the entire contents of the sheet remain in memory.

I'm also curious to know what the work around is.

@carlossanlop
Copy link
Member

carlossanlop commented Nov 18, 2021

Thank you all for the patience. We are planning work for .NET 7 and fixing the bugs affecting System.IO.Packaging is most likely going to be among our top priorities for the Compression area specifically.

@odhanson Thank you for the code you provided: https://github.com/odhanson/corefx/tree/dev/Flush . It was extremely helpful!

I debugged it to understand the problem better (I adapted them to dotnet/runtime of course) and I would like to share the following feedback:

  • The Flush method is writing the entries into the target stream via WriteFile(), just like Dispose, but without disposing the internal streams afterwards. This means we can call Flush inside Dispose to avoid the code duplication. It's a simple enough change.
  • I noticed that you added a line to clear the entries at the end of WriteFile. I assume you added this after running the unit tests and noticing that, after calling Flush(), the file was getting the entries written twice.
  • To fix this problem, we need to add logic to detect flushing without disposal separately for Update and for Create. This wouldn't be a trivial change.

The logic to handle creation and update is too convoluted and intermixed. So before making any Flush-related changes, I'm thinking we could potentially benefit from refactoring the code using the strategy pattern, similarly to what we did for FileStream in .NET 6.

After we refactor the code, we could implement Flush more easily, and have clear and separate logic for creation and for updating.

I also need to keep in mind Ian's concern shared above:

We could probably add it as public, at least for ZipArchiveMode.Create. I'm not sure how that would affect being able to open some entries though. We would need to add lots of tests to first verify the netfx Flushing behavior - which may be tricky since it's not exposed through ZipArchive - then try to match that in core. At that point a streaming API may be the better solution.

@twsouthwick I'll continue investigating how feasible it is to add Flush and will let you know if/when I have an API proposal.

@carlossanlop
Copy link
Member

carlossanlop commented Nov 21, 2021

@twsouthwick (and/or anyone else monitoring this issue),

I have some questions that arised after I read this:
https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure

Let's ignore for a moment what .NET Framework is doing. What behavior do you expect from Flush when...

  1. The stream is non-seekable?
  2. The stream is seekable?

Non-seekable stream

Let's say you're working with a stream that is being sent to you from the network.

Deleting
  • If we delete an entry, then call Flush, we can't really delete the file, so will simply append an updated Central Directory at the end of the stream that would exclude the "deleted" file from the list.
  • Alternative: we save the location where the old CD starts, then we overwrite the CD with the new one, which would exclude the "deleted" entry from its existing list.

Note: We currently cannot open a ZipArchive with ZipArchiveMode.Update if the stream is non-seekable, but I'm exploring the option of removing the restriction if it's feasible.

Adding
  • If we add a new file, then call Flush, we append the new entry at the end of the file, then immediately afterwards write an updated Central Directory that contains the new file.
  • Alternative: we save the location where the old CD starts, then we write the new entry right where the old CD starts (overwriting it, obviously) then we would have to write a new CD after the file.

Seekable stream

Let's say you're working with a zip file loaded from disk using a FileStream:

Deleting
  • If we delete a file, then call Flush, we could in theory rewrite the whole archive from start to end, ensuring we skip the deleted file and updating the CD list at the end to exclude the deleted file.
  • Alternative (more complicated): We iterate through the list of entries, find the first one marked as deleted, located the first byte of its header in the archive stream, then start overwriting it with the subsequent entries that are not marked as deleted, until all are rewritten, and finally write the updated CD.
Adding

I think the behavior would be the same as described for non-seekable.

Updating a file entry's stream
  • Similar to deleting, we mark it as a file whose contents changed, then we either rewrite the whole archive from start to end, or
  • We find the first modified entry in the CD list and start overwriting everything after that file, including it, then write the updated CD with the new sizes.

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 21, 2021
@jeffhandley jeffhandley modified the milestones: Future, 7.0.0 Jan 9, 2022
@jeffhandley jeffhandley modified the milestones: 7.0.0, 8.0.0 Jul 9, 2022
@ashahabov
Copy link

If I understood correctly, the fix is planned for .NET 8 time?😥

@exelsior
Copy link

exelsior commented Dec 5, 2023

Any progress on this?

@exelsior
Copy link

exelsior commented Dec 7, 2023

Hello, could you please recommend third party package with zip flush support?

@edwardneal
Copy link
Contributor

I recently submitted #102704, which implements the logic @carlossanlop describes in Flush for seekable streams (although I've admittedly only just seen the comment - wish I'd seen it earlier!) Hopefully this disposes of the more complex case.

The only ZipArchiveMode which can modify a non-seekable stream is Create. The first Flush here is simple, and I'm tempted say we should throw an exception for any subsequent Flush calls:

  1. At that point, we're no longer Creating an archive, we're modifying it
  2. Any modifications we perform are going to be technically valid but strange. Adding a new archive entry is broadly fine, but modifying an archive entry will mean that we abandon the original copy and re-append it again to the file with the updated CD. This'll result in archives which are much larger than they need to be, and which will return a different set of results when streamed (because the abandoned entry will be detected until the CD is parsed.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.IO.Compression needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Development

No branches or pull requests