[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314

martinpf · 2024-04-19T21:03:29Z

Background and motivation

Make the position or offset of the data in the enclosing stream for a System.Formats.Tar.TarEntry object a public property.

The TarEntry objects returned by TarReader.GetNextEntryAsync() contain a DataStream member that encapsulates the section of the enclosing super stream containing the entry data. We need the offset into the enclosing super stream so we can process the entry data without having to use DataStream member.

This would enable more flexible use of tar balls in data stores that support features like concurrent access. Azure Blob Storage is an example. An example of a current limitation that would be overcome by this would be the concurrent uploading of entry data from tar balls stored in Azure Blob Storage.

The current implementation has this info in a private member of the internal stream type. We need this to be made public.

runtime/src/libraries/System.Formats.Tar/src/System/Formats/Tar/SubReadStream.cs

Line 18 in b067ce0

protected readonly long _startInSuperStream;

The motivating scenario is that we need to handle potentially very large tar balls in Azure Blob Storage. The TarReader has each entry sharing the same super stream. We can't do concurrent downloads for large tar balls in storage with that structure. Multiple threads can't concurrently share the super stream. We would like to only use the TarReader to get the sizes, names, and offsets into the super stream for each entry and use our own stream handling to download the entry data.

API Proposal

TarEntry.OffsetInSuperStream

API Usage

        // Create stream for tar ball data in Azure Blob Storage
        var blobClient = Azure.Storage.Blobs.BlobClient(....);
        var blobClientStream = await blobClient.OpenReadAsync(...);

       // Create TarReader for the stream and get a TarEntry
        var tarReader = new System.Formats.Tar.TarReader(blobClientStream);
        var tarEntry = await tarReader.GetNextEntryAsync();

       // get position of TarEntry data in blob stream
        var entryOffsetInBlobStream = tarEntry.OffsetInSuperStream;
        var entryLength =  tarEntry.Length;

        // create a separate stream
        var newBlobClientStream = await TarBlob.OpenReadAsync(...);
        newBlobClientStream.Seek(entryOffsetInBlobStream, SeekOrigin.Begin);

        // read tar ball content from separate BlobClient stream
        var bytes = new byte[length];
        await tarBlobStream.ReadAsync(bytes, 0, (int)entryLength);

Alternative Designs

No response

Risks

No response

dotnet-policy-service · 2024-04-19T21:03:53Z

Tagging subscribers to this area: @dotnet/area-system-formats-tar
See info in area-owners.md if you want to be subscribed.

colejohnson66 · 2024-04-19T21:20:10Z

But why? If you need a stream for the file, TarEntry always provides it. Why would using your own stream-like type provide any benefit?

martinpf · 2024-04-20T00:00:18Z

The motivating scenario is that we need to handle potentially very large tar balls in Azure Blob Storage. The TarReader has each entry sharing the same super stream. We can't do concurrent downloads for large tar balls in storage with that structure. Multiple threads can't concurrently share the super stream. We would like to only use the TarReader to get the sizes, names, and offsets into the super stream for each entry and use our own stream handling to download the entry data.

Note we currently use a self implemented tar reader that enables this. But it only supports the ustar format with its 8GB entry size limit. We need at least PAX support and are looking for a supported solution. Our needs predated .NET TAR support.

tmds · 2024-04-23T12:22:12Z

I'm curious. Why is reading with multiple readers at different offsets better than reading a blob sequentially with a single reader? Does Azure limit bandwidth per reader? Or is it some other reason?

martinpf · 2024-04-23T15:26:26Z

For very large tar balls, reading sequentially isn't performant. Reading with multiple TarReaders isn't always performant either given the cost of traversing the entries. With stores that support it, such as Azure Blob Storage, concurrent reading using the Blob store API, for example, is substantially faster. We just need the super stream off set to support that.

tmds · 2024-04-23T18:51:52Z

I was curious to learn why it is faster.

If Azure limits bandwidth per reader, then it's clear why more readers can achieve a higher bandwidth.

An other reason may be that a single reader has idle time while it asks for new data. The higher the round trip time, the higher the impact on bandwidth. An additional reader can make use of that bandwidth.

I assume your tarball has large files in it, otherwise the base reader wouldn't be able to skip anything and cause the tarball to get downloaded twice.

colejohnson66 · 2024-04-23T21:04:48Z

Is the solution here to just make TarEntry able to open multiple streams? Exposing the format of the file feels wrong.

martinpf · 2024-04-24T16:33:22Z

Making the TarEntry able to open multiple streams strikes me as difficult to do. If it can be done cleanly it would be great. But it means the TarReader, rather than being initialized with a stream for sequential access, would need to be given a callback or some other mechanism with which to create additional streams to the same data source.

martinpf · 2024-04-24T16:47:01Z

wrt the performance of multiple BlobClients reading from a common single blob. We found performance improvements of up to a factor of two for our scenarios. From a code perspective, the ability to customize our own downloading adds welcome flexibility. The need to lock or post to queues or otherwise serialize access to the TarEntries is an issue. It makes the Tar API awkward to use in non client scenarios.

carlossanlop · 2024-04-25T04:09:30Z

Thanks for creating this issue. This is an interesting scenario. Let me summarize to confirm I got it right:

You wish to retrieve the position and length of each tar entry data section for future usage.
You are using a seekable stream (mandatory prerequisite for this all to work).
It is currently impossible to get the position of the data stream start relative to the main stream. The DataStream will return a value of 0. The real value is tracked internally by that sub stream via _startInSuperStream.
You are proposing to make this start position public by exposing the value of _startInSuperStream.

Additionally, I also checked, and it is also impossible to get the start of the data stream by reading the position of the main stream right after calling GetNextEntry/GetNextEntryAsync, because internally, TryGetNextHeader will read the metadata fields, and immediately after will call ProcessDataBlock, which, when done, will cause the main stream position to point at the beginning of the next tar entry (or the end of the archive, if that was the last entry).

I think it's acceptable to expose the start of the data stream as a value that is relative to the main stream. It should be a read-only property, and it will throw an exception if the main stream is not seekable.

Keep in mind that if your intention is to use this in PAX entries, you'll have to be careful to skip extended attributes and global extended attributes entries, as those only contain additional metadata for the real actual entry.

I think the name of the property should be different than proposed. It should indicate that it is the position of the data stream relative to the main stream. But this can be discussed and decided in the API proposal meeting.

public abstract partial class TarEntry
{
+        public long PositionOfDataInArchiveStream { get; }
}

If there are no additional comments, objections or concerns, I can mark this as ready for review. Let me know what you think.

martinpf · 2024-04-26T22:47:29Z

Yes your summary of the issue is correct. And your proposal sounds great.

To change my earlier comment. I agree this doesn't make sense if the enclosing stream is not seekable. There isn't clear scenario for exposing the property if it isn't. It can't hurt to allow the property for non seekable streams anyway (perhaps for consistency). But there isn't a scenario for it that I can come up with. I think a case could be made that it would reduce the complexity of consuming code by not prompting a try catch around the property get in some scenarios.

carlossanlop · 2024-05-07T20:07:32Z

Marking this as ready for review.
I made a small change in the proposed API name, as we need to make sure the user knows the pointer is referring to the archive stream, not to the data stream of the entry.

tmds · 2024-05-08T08:06:01Z

Keep in mind that if your intention is to use this in PAX entries, you'll have to be careful to skip extended attributes and global extended attributes entries, as those only contain additional metadata for the real actual entry.

It would be preferable if the user should not care about the tar format.

Because the intended use-case is to get the position to file data, the property could return -1/throw for non-regular file entry types.

martinpf added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 19, 2024

dotnet-issue-labeler bot added the area-System.Formats.Tar label Apr 19, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 19, 2024

carlossanlop removed the untriaged New issue has not been triaged by the area owner label Apr 25, 2024

carlossanlop added the api-ready-for-review API is ready for review, it is NOT ready for implementation label May 7, 2024

teo-tsirpanis removed the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314

[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314

martinpf commented Apr 19, 2024 •

edited

dotnet-policy-service bot commented Apr 19, 2024

colejohnson66 commented Apr 19, 2024

martinpf commented Apr 20, 2024 •

edited

tmds commented Apr 23, 2024 •

edited

martinpf commented Apr 23, 2024

tmds commented Apr 23, 2024

colejohnson66 commented Apr 23, 2024

martinpf commented Apr 24, 2024

martinpf commented Apr 24, 2024

carlossanlop commented Apr 25, 2024 •

edited

martinpf commented Apr 26, 2024 •

edited

carlossanlop commented May 7, 2024

tmds commented May 8, 2024

[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314

[API Proposal]: In TarEntry objects emitted by TarReader, make the offset to the entry data in enclosing stream a public property #101314

Comments

martinpf commented Apr 19, 2024 • edited

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

dotnet-policy-service bot commented Apr 19, 2024

colejohnson66 commented Apr 19, 2024

martinpf commented Apr 20, 2024 • edited

tmds commented Apr 23, 2024 • edited

martinpf commented Apr 23, 2024

tmds commented Apr 23, 2024

colejohnson66 commented Apr 23, 2024

martinpf commented Apr 24, 2024

martinpf commented Apr 24, 2024

carlossanlop commented Apr 25, 2024 • edited

martinpf commented Apr 26, 2024 • edited

carlossanlop commented May 7, 2024

tmds commented May 8, 2024

martinpf commented Apr 19, 2024 •

edited

martinpf commented Apr 20, 2024 •

edited

tmds commented Apr 23, 2024 •

edited

carlossanlop commented Apr 25, 2024 •

edited

martinpf commented Apr 26, 2024 •

edited