Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314

Open
martinpf opened this issue Apr 19, 2024 · 13 comments
Labels
api-ready-for-review API is ready for review, it is NOT ready for implementation area-System.Formats.Tar

Comments

@martinpf
Copy link

martinpf commented Apr 19, 2024

Background and motivation

Make the position or offset of the data in the enclosing stream for a System.Formats.Tar.TarEntry object a public property.

The TarEntry objects returned by TarReader.GetNextEntryAsync() contain a DataStream member that encapsulates the section of the enclosing super stream containing the entry data. We need the offset into the enclosing super stream so we can process the entry data without having to use DataStream member.

This would enable more flexible use of tar balls in data stores that support features like concurrent access. Azure Blob Storage is an example. An example of a current limitation that would be overcome by this would be the concurrent uploading of entry data from tar balls stored in Azure Blob Storage.

The current implementation has this info in a private member of the internal stream type. We need this to be made public.

protected readonly long _startInSuperStream;

The motivating scenario is that we need to handle potentially very large tar balls in Azure Blob Storage. The TarReader has each entry sharing the same super stream. We can't do concurrent downloads for large tar balls in storage with that structure. Multiple threads can't concurrently share the super stream. We would like to only use the TarReader to get the sizes, names, and offsets into the super stream for each entry and use our own stream handling to download the entry data.

API Proposal

TarEntry.OffsetInSuperStream

API Usage

        // Create stream for tar ball data in Azure Blob Storage
        var blobClient = Azure.Storage.Blobs.BlobClient(....);
        var blobClientStream = await blobClient.OpenReadAsync(...);

       // Create TarReader for the stream and get a TarEntry
        var tarReader = new System.Formats.Tar.TarReader(blobClientStream);
        var tarEntry = await tarReader.GetNextEntryAsync();

       // get position of TarEntry data in blob stream
        var entryOffsetInBlobStream = tarEntry.OffsetInSuperStream;
        var entryLength =  tarEntry.Length;

        // create a separate stream
        var newBlobClientStream = await TarBlob.OpenReadAsync(...);
        newBlobClientStream.Seek(entryOffsetInBlobStream, SeekOrigin.Begin);

        // read tar ball content from separate BlobClient stream
        var bytes = new byte[length];
        await tarBlobStream.ReadAsync(bytes, 0, (int)entryLength);
          

Alternative Designs

No response

Risks

No response

@martinpf martinpf added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 19, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 19, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

@colejohnson66
Copy link

But why? If you need a stream for the file, TarEntry always provides it. Why would using your own stream-like type provide any benefit?

@martinpf
Copy link
Author

martinpf commented Apr 20, 2024

The motivating scenario is that we need to handle potentially very large tar balls in Azure Blob Storage. The TarReader has each entry sharing the same super stream. We can't do concurrent downloads for large tar balls in storage with that structure. Multiple threads can't concurrently share the super stream. We would like to only use the TarReader to get the sizes, names, and offsets into the super stream for each entry and use our own stream handling to download the entry data.

Note we currently use a self implemented tar reader that enables this. But it only supports the ustar format with its 8GB entry size limit. We need at least PAX support and are looking for a supported solution. Our needs predated .NET TAR support.

@tmds
Copy link
Member

tmds commented Apr 23, 2024

I'm curious. Why is reading with multiple readers at different offsets better than reading a blob sequentially with a single reader? Does Azure limit bandwidth per reader? Or is it some other reason?

@martinpf
Copy link
Author

For very large tar balls, reading sequentially isn't performant. Reading with multiple TarReaders isn't always performant either given the cost of traversing the entries. With stores that support it, such as Azure Blob Storage, concurrent reading using the Blob store API, for example, is substantially faster. We just need the super stream off set to support that.

@tmds
Copy link
Member

tmds commented Apr 23, 2024

I was curious to learn why it is faster.

If Azure limits bandwidth per reader, then it's clear why more readers can achieve a higher bandwidth.

An other reason may be that a single reader has idle time while it asks for new data. The higher the round trip time, the higher the impact on bandwidth. An additional reader can make use of that bandwidth.

I assume your tarball has large files in it, otherwise the base reader wouldn't be able to skip anything and cause the tarball to get downloaded twice.

@colejohnson66
Copy link

Is the solution here to just make TarEntry able to open multiple streams? Exposing the format of the file feels wrong.

@martinpf
Copy link
Author

Making the TarEntry able to open multiple streams strikes me as difficult to do. If it can be done cleanly it would be great. But it means the TarReader, rather than being initialized with a stream for sequential access, would need to be given a callback or some other mechanism with which to create additional streams to the same data source.

@martinpf
Copy link
Author

wrt the performance of multiple BlobClients reading from a common single blob. We found performance improvements of up to a factor of two for our scenarios. From a code perspective, the ability to customize our own downloading adds welcome flexibility. The need to lock or post to queues or otherwise serialize access to the TarEntries is an issue. It makes the Tar API awkward to use in non client scenarios.

@carlossanlop
Copy link
Member

carlossanlop commented Apr 25, 2024

Thanks for creating this issue. This is an interesting scenario. Let me summarize to confirm I got it right:

  • You wish to retrieve the position and length of each tar entry data section for future usage.
  • You are using a seekable stream (mandatory prerequisite for this all to work).
  • It is currently impossible to get the position of the data stream start relative to the main stream. The DataStream will return a value of 0. The real value is tracked internally by that sub stream via _startInSuperStream.
  • You are proposing to make this start position public by exposing the value of _startInSuperStream.

Additionally, I also checked, and it is also impossible to get the start of the data stream by reading the position of the main stream right after calling GetNextEntry/GetNextEntryAsync, because internally, TryGetNextHeader will read the metadata fields, and immediately after will call ProcessDataBlock, which, when done, will cause the main stream position to point at the beginning of the next tar entry (or the end of the archive, if that was the last entry).

I think it's acceptable to expose the start of the data stream as a value that is relative to the main stream. It should be a read-only property, and it will throw an exception if the main stream is not seekable.

Keep in mind that if your intention is to use this in PAX entries, you'll have to be careful to skip extended attributes and global extended attributes entries, as those only contain additional metadata for the real actual entry.

I think the name of the property should be different than proposed. It should indicate that it is the position of the data stream relative to the main stream. But this can be discussed and decided in the API proposal meeting.

public abstract partial class TarEntry
{
+        public long PositionOfDataInArchiveStream { get; }
}

If there are no additional comments, objections or concerns, I can mark this as ready for review. Let me know what you think.

@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label Apr 25, 2024
@martinpf
Copy link
Author

martinpf commented Apr 26, 2024

Yes your summary of the issue is correct. And your proposal sounds great.

To change my earlier comment. I agree this doesn't make sense if the enclosing stream is not seekable. There isn't clear scenario for exposing the property if it isn't. It can't hurt to allow the property for non seekable streams anyway (perhaps for consistency). But there isn't a scenario for it that I can come up with. I think a case could be made that it would reduce the complexity of consuming code by not prompting a try catch around the property get in some scenarios.

@carlossanlop
Copy link
Member

Marking this as ready for review.
I made a small change in the proposed API name, as we need to make sure the user knows the pointer is referring to the archive stream, not to the data stream of the entry.

@carlossanlop carlossanlop added the api-ready-for-review API is ready for review, it is NOT ready for implementation label May 7, 2024
@tmds
Copy link
Member

tmds commented May 8, 2024

Keep in mind that if your intention is to use this in PAX entries, you'll have to be careful to skip extended attributes and global extended attributes entries, as those only contain additional metadata for the real actual entry.

It would be preferable if the user should not care about the tar format.

Because the intended use-case is to get the position to file data, the property could return -1/throw for non-regular file entry types.

@teo-tsirpanis teo-tsirpanis removed the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-ready-for-review API is ready for review, it is NOT ready for implementation area-System.Formats.Tar
Projects
None yet
Development

No branches or pull requests

5 participants