Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Add Method FromStream(Async) to BinaryData that does not copy #89804

Open
dersia opened this issue Aug 1, 2023 · 7 comments
Open
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@dersia
Copy link

dersia commented Aug 1, 2023

Background and motivation

Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs.

We use BlobClient.Upload and we also have to call update on the properties to set the content-type of the blob.
With the new Media-Type on BinaryData (#89605) we would like to take advantage of this feature so don't have to make multiple calls on the BlobClient, and in general I think there is benefit in using BinaryData overall.
But as of today BinaryData.FromStream() copies the stream as you can see here

private static async Task<BinaryData> FromStreamAsync(Stream stream, bool async, CancellationToken cancellationToken = default)
{
const int CopyToBufferSize = 81920; // the default used by Stream.CopyToAsync
int bufferSize = CopyToBufferSize;
MemoryStream memoryStream;
if (stream.CanSeek)
{
long longLength = stream.Length - stream.Position;
if (longLength > int.MaxValue || longLength < 0)
{
throw new ArgumentOutOfRangeException(nameof(stream), SR.ArgumentOutOfRange_StreamLengthMustBeNonNegativeInt32);
}
// choose a minimum valid (non-zero) buffer size.
bufferSize = longLength == 0 ? 1 : Math.Min((int)longLength, CopyToBufferSize);
memoryStream = new MemoryStream((int)longLength);
}
else
{
memoryStream = new MemoryStream();
}
using (memoryStream)
{
if (async)
{
await stream.CopyToAsync(memoryStream, bufferSize, cancellationToken).ConfigureAwait(false);
}
else
{
stream.CopyTo(memoryStream, bufferSize);
}
return new BinaryData(memoryStream.GetBuffer().AsMemory(0, (int)memoryStream.Position));
}
}

Because of this BinaryData is not useable for us. So I would like to suggest an API that would not copy the stream.

This makes BinaryData convenient to use and avoid extra allocation.

API Proposal

namespace System;

public class BinaryData
{
    // existing methods with https://github.com/dotnet/runtime/pull/89605
    public static BinaryData FromStream(Stream stream);
    public static BinaryData FromStream(Stream stream, string? mediaType);
    public static Task<BinaryData> FromStreamAsync(Stream stream, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, CancellationToken cancellationToken = default);

    // new api
    public static BinaryData FromStream(Stream stream, bool copyContent = true);
    public static BinaryData FromStream(Stream stream, string? mediaType, bool copyContent = true);
    public static Task<BinaryData> FromStreamAsync(Stream stream, bool copyContent = true, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, bool copyContent = true, CancellationToken cancellationToken = default);
}

API Usage

public async Task UploadBlob(Stream fileStream, string mediaType, CancellationToken cancellationToken = default)
{
    var data = BinrayData.FromString(fileStream, mediaType, false);
    await blobClient.UploadAsync(data, stream, true, cancellationToken).ConfigureAwait(false);
}

Alternative Designs

Alternatively FromStream could be overloaded but I think this would be a source breaking change.
Another alternative would be to make FromStream not to copy at all, but I think this will be a breaking behavior and might lead to bugs, where it "just" works for users even if the underlying stream was closed in the meantime.

Risks

If called with copyContent false, this could lead to an exception, if the underlying stream is closed, but I think this is something that users are familiar with and with the bool parameter explicitly set to false, users are opting into this behavior and therefor it should be fine, if an exception is thrown.

@dersia dersia added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Aug 1, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 1, 2023
@ghost
Copy link

ghost commented Aug 1, 2023

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs.

We use BlobClient.Upload and we also have to call update on the properties to set the content-type of the blob.
With the new Media-Type on BinaryData (#89605) we would like to take advantage of this feature so don't have to make multiple calls on the BlobClient, and in general I think there is benefit in using BinaryData overall.
But as of today BinaryData.FromStream() copies the stream as you can see here

private static async Task<BinaryData> FromStreamAsync(Stream stream, bool async, CancellationToken cancellationToken = default)

Because of this BinaryData is not useable for us. So I would like to suggest an API that would not copy the stream.

This makes BinaryData convenient to use and avoid extra allocation.

API Proposal

namespace System;

public class BinaryData
{
    // existing methods with https://github.com/dotnet/runtime/pull/89605
    public static BinaryData FromStream(Stream stream);
    public static BinaryData FromStream(Stream stream, string? mediaType);
    public static Task<BinaryData> FromStreamAsync(Stream stream, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, CancellationToken cancellationToken = default);

    // new api
    public static BinaryData FromStream(Stream stream, bool copyContent = true);
    public static BinaryData FromStream(Stream stream, string? mediaType, bool copyContent = true);
    public static Task<BinaryData> FromStreamAsync(Stream stream, bool copyContent = true, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, bool copyContent = true, CancellationToken cancellationToken = default);
}

API Usage

public async Task UploadBlob(Stream fileStream, string mediaType, CancellationToken cancellationToken = default)
{
    var data = BinrayData.FromString(fileStream, mediaType, false);
    await blobClient.UploadAsync(data, stream, true, cancellationToken).ConfigureAwait(false);
}

Alternative Designs

Alternatively FromStream could be overloaded but I think this would be a source breaking change.
Another alternative would be to make FromStream not to copy at all, but I think this will be a breaking behavior and might lead to bugs, where it "just" works for users even if the underlying stream was closed in the meantime.

Risks

If called with copyContent false, this could lead to an exception, if the underlying stream is closed, but I think this is something that users are familiar with and with the bool parameter explicitly set to false, users are opting into this behavior and therefor it should be fine, if an exception is thrown.

Author: dersia
Assignees: -
Labels:

api-suggestion, area-System.IO, untriaged

Milestone: -

@ghost
Copy link

ghost commented Aug 2, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs.

We use BlobClient.Upload and we also have to call update on the properties to set the content-type of the blob.
With the new Media-Type on BinaryData (#89605) we would like to take advantage of this feature so don't have to make multiple calls on the BlobClient, and in general I think there is benefit in using BinaryData overall.
But as of today BinaryData.FromStream() copies the stream as you can see here

private static async Task<BinaryData> FromStreamAsync(Stream stream, bool async, CancellationToken cancellationToken = default)
{
const int CopyToBufferSize = 81920; // the default used by Stream.CopyToAsync
int bufferSize = CopyToBufferSize;
MemoryStream memoryStream;
if (stream.CanSeek)
{
long longLength = stream.Length - stream.Position;
if (longLength > int.MaxValue || longLength < 0)
{
throw new ArgumentOutOfRangeException(nameof(stream), SR.ArgumentOutOfRange_StreamLengthMustBeNonNegativeInt32);
}
// choose a minimum valid (non-zero) buffer size.
bufferSize = longLength == 0 ? 1 : Math.Min((int)longLength, CopyToBufferSize);
memoryStream = new MemoryStream((int)longLength);
}
else
{
memoryStream = new MemoryStream();
}
using (memoryStream)
{
if (async)
{
await stream.CopyToAsync(memoryStream, bufferSize, cancellationToken).ConfigureAwait(false);
}
else
{
stream.CopyTo(memoryStream, bufferSize);
}
return new BinaryData(memoryStream.GetBuffer().AsMemory(0, (int)memoryStream.Position));
}
}

Because of this BinaryData is not useable for us. So I would like to suggest an API that would not copy the stream.

This makes BinaryData convenient to use and avoid extra allocation.

API Proposal

namespace System;

public class BinaryData
{
    // existing methods with https://github.com/dotnet/runtime/pull/89605
    public static BinaryData FromStream(Stream stream);
    public static BinaryData FromStream(Stream stream, string? mediaType);
    public static Task<BinaryData> FromStreamAsync(Stream stream, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, CancellationToken cancellationToken = default);

    // new api
    public static BinaryData FromStream(Stream stream, bool copyContent = true);
    public static BinaryData FromStream(Stream stream, string? mediaType, bool copyContent = true);
    public static Task<BinaryData> FromStreamAsync(Stream stream, bool copyContent = true, CancellationToken cancellationToken = default);
    public static Task<BinaryData> FromStreamAsync(Stream stream, string? mediaType, bool copyContent = true, CancellationToken cancellationToken = default);
}

API Usage

public async Task UploadBlob(Stream fileStream, string mediaType, CancellationToken cancellationToken = default)
{
    var data = BinrayData.FromString(fileStream, mediaType, false);
    await blobClient.UploadAsync(data, stream, true, cancellationToken).ConfigureAwait(false);
}

Alternative Designs

Alternatively FromStream could be overloaded but I think this would be a source breaking change.
Another alternative would be to make FromStream not to copy at all, but I think this will be a breaking behavior and might lead to bugs, where it "just" works for users even if the underlying stream was closed in the meantime.

Risks

If called with copyContent false, this could lead to an exception, if the underlying stream is closed, but I think this is something that users are familiar with and with the bool parameter explicitly set to false, users are opting into this behavior and therefor it should be fine, if an exception is thrown.

Author: dersia
Assignees: -
Labels:

api-suggestion, area-System.Memory, untriaged

Milestone: -

@stephentoub
Copy link
Member

I don't understand the proposal. What is the behavior when not copying? Are you proposing storing the Stream object and then reading from it later when the BinaryData is consumed? That seems like a non-starter, as for example it would push a lot of possibly asynchronous and exceptional work into existing members, like the implicit cast to span.

@dersia
Copy link
Author

dersia commented Jan 30, 2024

@stephentoub Sorry for the late reply.

Basically yes, I am suggesting to store the Stream object and probably with an additional leaveOpen argument.
I would like to clarify why I am suggesting this.

The problem that we are having is that we have an http endpoint that takes files. these files need to be uploaded to azure blob-storage and we also need to set the ContentType of those blobs. The Azure SDKs that do upload to Azure Storage Account have multiple overloads, for example one that takes a stream and also one that takes BinaryData. The feature that was added in #89605 allows for BinaryData to also set the ContentType during upload and we want to make use of this feature. When using the Stream overload on the BlobClient, we would also need to make another call to update the Blob Properties so we can set the ContentType. Having said that, there is already an overload on BinaryData which takes a Stream as an argument, the problem with that overload is, that it copies the original stream and this does not work with lager files and we ran into memory exceptions.

As far as I understand the Runtime has already APIs that take Streams and handle correct behavior with streams, so I don't quite understand what would make this different or exceptional in that regard, but then again, you are the expert and I hope you can enlighten me here.
Also the overload that takes BinaryData with a stream argument as of today is internally creates a MemoryStream internally and copys the original stream and hase to then work with that stream, which is converted to a span so I don't quite understand why it couldn't do the same with the original stream?

Sorry again for the late reply and thank you,
Sia

@tannergooding tannergooding added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jun 24, 2024
@stephentoub
Copy link
Member

As far as I understand the Runtime has already APIs that take Streams and handle correct behavior with streams, so I don't quite understand what would make this different or exceptional in that regard

The problem is the rest of the BinaryData surface area assumes the type stores all the data, which then means that methods like ToArray, ToMemory, and the implicit cast to ReadOnlyMemory<T> would all start doing I/O, not only significantly changing their performance profile, but also leading to I/O that likely should have been done asynchronously now either being done synchronously, or worse, being done asynchronously but blocking (sync over async).

@stephentoub stephentoub added this to the Future milestone Jul 19, 2024
@julealgon
Copy link

Our aspnet web api backend takes files from the frontend and uploads those to Azure Storage Blobs.

Have you considered using pre-signed blob upload links, send those to the frontend, and have it upload directly to Azure through them?

I know this doesn't directly answer your proposal, but figured I'd mention anyways in case you were not aware of this option which tends to be substantially more efficient than roundtripping all the data twice via an indirection (your API).

@KrzysztofCwalina
Copy link
Member

cc: @seanmcc-msft

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Projects
None yet
Development

No branches or pull requests

6 participants