-
Notifications
You must be signed in to change notification settings - Fork 4.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314
Comments
Tagging subscribers to this area: @dotnet/area-system-formats-tar |
But why? If you need a stream for the file, |
The motivating scenario is that we need to handle potentially very large tar balls in Azure Blob Storage. The TarReader has each entry sharing the same super stream. We can't do concurrent downloads for large tar balls in storage with that structure. Multiple threads can't concurrently share the super stream. We would like to only use the TarReader to get the sizes, names, and offsets into the super stream for each entry and use our own stream handling to download the entry data. Note we currently use a self implemented tar reader that enables this. But it only supports the ustar format with its 8GB entry size limit. We need at least PAX support and are looking for a supported solution. Our needs predated .NET TAR support. |
I'm curious. Why is reading with multiple readers at different offsets better than reading a blob sequentially with a single reader? Does Azure limit bandwidth per reader? Or is it some other reason? |
For very large tar balls, reading sequentially isn't performant. Reading with multiple TarReaders isn't always performant either given the cost of traversing the entries. With stores that support it, such as Azure Blob Storage, concurrent reading using the Blob store API, for example, is substantially faster. We just need the super stream off set to support that. |
I was curious to learn why it is faster. If Azure limits bandwidth per reader, then it's clear why more readers can achieve a higher bandwidth. An other reason may be that a single reader has idle time while it asks for new data. The higher the round trip time, the higher the impact on bandwidth. An additional reader can make use of that bandwidth. I assume your tarball has large files in it, otherwise the base reader wouldn't be able to skip anything and cause the tarball to get downloaded twice. |
Is the solution here to just make TarEntry able to open multiple streams? Exposing the format of the file feels wrong. |
Making the TarEntry able to open multiple streams strikes me as difficult to do. If it can be done cleanly it would be great. But it means the TarReader, rather than being initialized with a stream for sequential access, would need to be given a callback or some other mechanism with which to create additional streams to the same data source. |
wrt the performance of multiple BlobClients reading from a common single blob. We found performance improvements of up to a factor of two for our scenarios. From a code perspective, the ability to customize our own downloading adds welcome flexibility. The need to lock or post to queues or otherwise serialize access to the TarEntries is an issue. It makes the Tar API awkward to use in non client scenarios. |
Thanks for creating this issue. This is an interesting scenario. Let me summarize to confirm I got it right:
Additionally, I also checked, and it is also impossible to get the start of the data stream by reading the position of the main stream right after calling I think it's acceptable to expose the start of the data stream as a value that is relative to the main stream. It should be a read-only property, and it will throw an exception if the main stream is not seekable. Keep in mind that if your intention is to use this in PAX entries, you'll have to be careful to skip extended attributes and global extended attributes entries, as those only contain additional metadata for the real actual entry. I think the name of the property should be different than proposed. It should indicate that it is the position of the data stream relative to the main stream. But this can be discussed and decided in the API proposal meeting. public abstract partial class TarEntry
{
+ public long PositionOfDataInArchiveStream { get; }
} If there are no additional comments, objections or concerns, I can mark this as ready for review. Let me know what you think. |
Yes your summary of the issue is correct. And your proposal sounds great. To change my earlier comment. I agree this doesn't make sense if the enclosing stream is not seekable. There isn't clear scenario for exposing the property if it isn't. It can't hurt to allow the property for non seekable streams anyway (perhaps for consistency). But there isn't a scenario for it that I can come up with. I think a case could be made that it would reduce the complexity of consuming code by not prompting a try catch around the property get in some scenarios. |
Marking this as ready for review. |
It would be preferable if the user should not care about the tar format. Because the intended use-case is to get the position to file data, the property could return |
Background and motivation
Make the position or offset of the data in the enclosing stream for a System.Formats.Tar.TarEntry object a public property.
The TarEntry objects returned by TarReader.GetNextEntryAsync() contain a DataStream member that encapsulates the section of the enclosing super stream containing the entry data. We need the offset into the enclosing super stream so we can process the entry data without having to use DataStream member.
This would enable more flexible use of tar balls in data stores that support features like concurrent access. Azure Blob Storage is an example. An example of a current limitation that would be overcome by this would be the concurrent uploading of entry data from tar balls stored in Azure Blob Storage.
The current implementation has this info in a private member of the internal stream type. We need this to be made public.
runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/SubReadStream.cs
Line 18 in b067ce0
The motivating scenario is that we need to handle potentially very large tar balls in Azure Blob Storage. The TarReader has each entry sharing the same super stream. We can't do concurrent downloads for large tar balls in storage with that structure. Multiple threads can't concurrently share the super stream. We would like to only use the TarReader to get the sizes, names, and offsets into the super stream for each entry and use our own stream handling to download the entry data.
API Proposal
TarEntry.OffsetInSuperStream
API Usage
Alternative Designs
No response
Risks
No response
The text was updated successfully, but these errors were encountered: