Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide an in-box implementation of IBufferWriter<byte> backed by a resizable byte array #28538

Closed
ahsonkhan opened this issue Jan 28, 2019 · 18 comments · Fixed by dotnet/corefx#36961
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Milestone

Comments

@ahsonkhan
Copy link
Member

Motivation

The Utf8JsonWriter is a ref struct which accepts IBufferWriter<byte> for synchronous writing. Today, we provide one implementation of this interface, PipeWriter. If the user wants to write to an array or stream, or wants to write asynchronously to an output sink other than PipeWriter, they end up implementing their own custom IBufferWriter<byte>. To enable writing JSON to an array or stream (sync/async) using the new Utf8JsonWriter, we need to provide a built-in implementation of IBufferWriter<byte> so that every caller doesn't have to write their own.

Proposed API

namespace System.Buffers
{
    // MemoryBufferWriter? BinaryBufferWriter?
    // Is it fine to leave it non-generic (as bytes), given we need to write to the stream?
    public class ArrayBufferWriter : IBufferWriter<byte>, IDisposable {

        // struct or class? support custom ArrayPools? For the JSON scenario, it can't be a struct
        public ArrayBufferWriter(int initialCapacity = 256);

        // Implements IBufferWriter<byte>
        public void Advance(int count);
        public Memory<byte> GetMemory(int sizeHint = 0);
        public Span<byte> GetSpan(int sizeHint = 0);

        // Implements IDisposable
        public void Dispose();
        
        public long BytesCommitted { get; } // Do we need this?
        public int BytesWritten { get; } // Should this be int64 too, like BytesCommitted?

        // What has been written to the output so far.
        public Memory<byte> OutputAsMemory { get; } // Written? Formatted? SoFar suffix?
        public Span<byte> OutputAsSpan { get; }
        
        // Resets state but doesn't return the underlying buffer back to the pool
        public void Clear();

        public void CopyTo(Stream stream);
        public Task CopyToAsync(Stream stream, CancellationToken cancellationToken = default(CancellationToken));
    }
}

Sample Usage

private static void WriteJsonToFile()
{
    using (var fileStream = File.Create("MyJsonFile.json"))
    using (var output = new ArrayBufferWriter())
    {
        var state = new JsonWriterState(options: new JsonWriterOptions { Indented = true });
        var json = new Utf8JsonWriter(output, state);
        json.WriteStartObject();
        json.WriteString("message", "Hello, World!", escape: true);
        json.WriteEndObject();
        json.Flush(isFinalBlock: true);
        output.CopyTo(fileStream);
    }
}
const int IterationCount = 10_000_000;
const int SyncWriteThreshold = 1_000_000;

private static async Task WriteJsonToFileAsync()
{
    using (var fileStream = File.Create("MyJsonFile.json"))
    using (var output = new ArrayBufferWriter())
    {
        var state = new JsonWriterState(options: new JsonWriterOptions { Indented = true });
        long written = 0;

        for (int j = 0; j < IterationCount; j++)
        {
            state = WriteJson(output, j, state);

            written += state.BytesWritten;
            if (written > SyncWriteThreshold)
            {
                await output.CopyToAsync(fileStream);
                written = 0;
            }
        }

        if (written != 0)
        {
            await output.CopyToAsync(fileStream);
            written = 0;
        }
    }
}

private static JsonWriterState WriteJson(ArrayBufferWriter output, int j, JsonWriterState state)
{
    var json = new Utf8JsonWriter(output, state);

    if (j == 0)
    {
        json.WriteStartObject();
    }

    json.WriteString("message", "Hello, World!", escape: true);

    if (j < IterationCount - 1)
    {
        json.Flush(isFinalBlock: false);
    }
    else
    {
        json.WriteEndObject();
        json.Flush(isFinalBlock: true);
    }

    return json.GetCurrentState();
}

Microsoft.Extensions.DependencyModel:
https://github.com/dotnet/core-setup/blob/c304bb38c1462f0f1633fc7ed2c317dcc5c72c9c/src/managed/Microsoft.Extensions.DependencyModel/DependencyContextWriter.Utf8JsonWriter.cs#L15-L48

public void Write(DependencyContext context, Stream stream)
{
    if (context == null)
    {
        throw new ArgumentNullException(nameof(context));
    }
    if (stream == null)
    {
        throw new ArgumentNullException(nameof(stream));
    }
    using (var bufferWriter = new ArrayBufferWriter())
    {
        Write(context, bufferWriter);
        bufferWriter.CopyTo(stream);
    }
}

private void Write(DependencyContext context, ArrayBufferWriter bufferWriter)
{
    var state = new JsonWriterState(options: new JsonWriterOptions { Indented = true });
    var jsonWriter = new Utf8JsonWriter(bufferWriter, state);

    jsonWriter.WriteStartObject();
    WriteRuntimeTargetInfo(context, ref jsonWriter);
    WriteCompilationOptions(context.CompilationOptions, ref jsonWriter);
    WriteTargets(context, ref jsonWriter);
    WriteLibraries(context, ref jsonWriter);
    if (context.RuntimeGraph.Any())
    {
        WriteRuntimeGraph(context, ref jsonWriter);
    }
    jsonWriter.WriteEndObject();
    jsonWriter.Flush();
}

Sample implementation of ArrayBufferWriter:
https://gist.github.com/ahsonkhan/1275dd34aaff4239cfdb7b213aa4ebe4

Open Questions

  • Are we fine with the type name ArrayBufferWriter?
    • Can it continue to implement IBufferWriter<byte> specifically rather than be generic? It implements IBufferWriter<byte> since that's what the scenario requires (writing the underlying bytes to the stream).
  • Can we keep it as a class or should it be a struct? The Utf8JsonWriter accepts IBufferWriter<byte> so, as a struct, it would get boxed and the state changes won't be visible to the caller.
  • Should it support custom array pools that the caller passes in?
  • Should we remove BytesCommitted or is it useful?
    • If we are keeping it, should we keep BytesWritten as Int64 or is keeping it as Int32 fine? BytesWritten will never be > int.MaxValue.
  • Anyone have suggestions for better names for OutputAsMemory/OutputAsSpan?

cc @terrajobst, @davidfowl, @stephentoub, @KrzysztofCwalina, @steveharter, @bartonjs, @joshfree, @benaadams, @JeremyKuhne, @pakrym, @eerhardt

@ahsonkhan ahsonkhan self-assigned this Jan 28, 2019
@omariom
Copy link
Contributor

omariom commented Jan 28, 2019

ValueStringBuilder has interesting ability to use whatever initial span it is provided and then switch to ArrayPool.
The span is usually stack allocated but it doesn't have to be.
https://github.com/dotnet/corefx/blob/90e11f77895a5c5363dd818b5c89c6f811eef2e6/src/Common/src/CoreLib/System/Text/ValueStringBuilder.cs#L12

@ahsonkhan
Copy link
Member Author

ValueStringBuilder has interesting ability to use whatever initial span it is provided and then switch to ArrayPool.

Yep, but that forces the type to be a ref struct which doesn't work for this scenario.
We could consider accepting an initial array though but not sure how useful that is.

@pakrym
Copy link
Contributor

pakrym commented Jan 28, 2019

Should this be a ReadOnlySequenceBuilder kind of thing? Would allow to reduced copies and use IMemoryPool?

@ahsonkhan
Copy link
Member Author

Should this be a ReadOnlySequenceBuilder kind of thing? Would allow to reduced copies and use IMemoryPool?

My goal here is to provide support for writing to an array and stream. How would this help for that case, and how would it reduce copies?

@pakrym
Copy link
Contributor

pakrym commented Jan 28, 2019

To use it with a stream you can write every segment of ROS via Stream.WriteAsync. An array can be created by calling .ToArray().

There would be way less array resizing and data copies if we can just append segments to a linked list for new data. Would also cause less memory fragmentation for large payloads because it avoids allocating arrays of larger and larger size.

@ahsonkhan
Copy link
Member Author

ahsonkhan commented Jan 29, 2019

Going through Memory.Span each time you write a JSON element isn't great for performance and the call to .ToArray() ends up allocating. I don't think it is a beneficial trade-off to append ReadOnlySequenceSegment from MemoryPool to a linked list in order to avoid the intermittent data copies every time we have to resize (which should be rare), which we have to eventually walk/copy to write to the stream anyway. Such an implementation ends up being slower.

Here is a sample implementation that uses BufferSegment:
https://gist.github.com/ahsonkhan/a212a2c690773538715ca147139ef0b3
If you have some suggestions or feedback to improve it, I would love to hear them. I tried enumerating the ReadOnlySequence and copying to an array from the pool rather than calling ToArray() and it helped a bit (reduced the allocs), but not enough.

@pakrym
Copy link
Contributor

pakrym commented Jan 29, 2019

the call to .ToArray() ends up allocating.

What scenarios do you need .ToArray in? In case of ArrayPool-backed implementation do you returned rented array? When do you return it to the pool?

we have to resize (which should be rare)

How is "rare" determined? If 256 used as default size and 2x as resizing algorithm you'll go through 256-512-1024-2048-4096 to write some non primitive JSON output.

@ahsonkhan
Copy link
Member Author

What scenarios do you need .ToArray in?

Copying the data that has been written so far to a stream.

public async Task CopyToAsync(Stream stream, CancellationToken cancellationToken = default)
{
    CheckIfDisposed();

    if (stream == null)
        throw new ArgumentNullException(nameof(stream));

    var sequence = new ReadOnlySequence<byte>(_writeStart, 0, _writeHead, _writeHead.End);

    int dataLength = (int)sequence.Length;
    byte[] pooledBuffer = ArrayPool<byte>.Shared.Rent(dataLength);
    sequence.CopyTo(pooledBuffer);

    await stream.WriteAsync(pooledBuffer.AsMemory(0, dataLength), cancellationToken).ConfigureAwait(false);
    _committed += _written;

    ArrayPool<byte>.Shared.Return(pooledBuffer);
    ClearHelper();
}

In case of ArrayPool-backed implementation do you returned rented array? When do you return it to the pool?

Yes, I return them. I return them on Dispose() or whenever we resize.

How is "rare" determined? If 256 used as default size and 2x as resizing algorithm you'll go through 256-512-1024-2048-4096 to write some non primitive JSON output.

We can modify the growth scheme based on perf measurements, but that is an implementation detail. My point was, we resize only after some amount of writes, which means the cost is amortized. We could jump from 256 directly to 4096 and then start doubling, if it helps. The MemoryPool-based impl that I shared, results in a regression for every write (since GetSpan() slows down). There is a regression even if we never call CopyTo (i.e. never create a ReadOnlySequence from the segments or call ToArray() on it).

Writing HelloWorld goes from ~220 ns to ~300ns. Larger payloads regress by 7-10%.

@pakrym
Copy link
Contributor

pakrym commented Jan 29, 2019

Copying the data that has been written so far to a stream.

It's not exactly required in that scenario. You can just foreach over ReadOnlySequence.

We can modify the growth scheme based on perf measurements, but that is an implementation detail. My point was, we resize only after some amount of writes, which means the cost is amortized. We could jump from 256 directly to 4096 and then start doubling, if it helps. The MemoryPool-based impl that I shared, results in a regression for every write (since GetSpan() slows down). There is a regression even if we never call CopyTo (i.e. never create a ReadOnlySequence from the segments or call ToArray() on it).
Writing HelloWorld goes from ~220 ns to ~300ns. Larger payloads regress by 7-10%.

What benchmarks are you using? I thought we are caching span inside Utf8JsonWriter so why does it matter so much how fast GetSpan is?

@pakrym
Copy link
Contributor

pakrym commented Jan 29, 2019

Both implementations are very similar in all but couple ways:

Sequence:

  1. No additional copies
  2. Extra allocation per block (segment)
  3. More expensive GetSpan
  4. Unability to get result as single array without allocation

Array:

  1. Extra copies per resize
  2. Possible memory fragmentation because of resizing and danger of getting into LOH for large payloads
  3. Faster GetSpan

No-one says we can't have both but we need to figure out what performs better in real world scenarios (rendering reasonably sized JSON in kestrel for example)

@layomia layomia closed this as completed Jan 31, 2019
@layomia layomia reopened this Jan 31, 2019
@ahsonkhan
Copy link
Member Author

Approved with feedback: dotnet/apireviews#88

@ahsonkhan
Copy link
Member Author

Resolving the feedback from dotnet/apireviews#88 & dotnet/corefx#35094, here is the final API shape:

namespace System.Buffers
{
    public sealed partial class ArrayBufferWriter<T> : IBufferWriter<T>, IDisposable
    {
        public ArrayBufferWriter() { }
        public ArrayBufferWriter(int initialCapacity) { }
        public int AvailableSpace { get { throw null; } } // == Capacity - CurrentIndex
        public int Capacity { get { throw null; } }
        public int CurrentIndex { get { throw null; } }
        public ReadOnlyMemory<T> OutputAsMemory { get { throw null; } }
        public ReadOnlySpan<T> OutputAsSpan { get { throw null; } }
        public void Clear() { }

        // Implements IDisposable
        public void Dispose() { }

        // Implements IBufferWriter<T>
        public void Advance(int count) { }
        public Memory<T> GetMemory(int sizeHint = 0) { throw null; }
        public Span<T> GetSpan(int sizeHint = 0) { throw null; }
    }
}

@benaadams
Copy link
Member

benaadams commented Feb 6, 2019

  1. Could you return array to pool; but then reuse object with current design
  2. Could you reset AvailableSpace == Capacity without returning pool with current design

For 2. I assume that's Clear()
For 1. I assume Dispose() returns array; could it then be reused, followed by second Dispose()?

@ahsonkhan
Copy link
Member Author

For 2. I assume that's Clear()

Yes.

For 1. I assume Dispose() returns array; could it then be reused, followed by second Dispose()?

No, you can't reuse the object after the first dispose.

Are you asking because you want to avoid allocating multiple instances of ArrayBufferWriter?

@benaadams
Copy link
Member

Are you asking because you want to avoid allocating multiple instances of ArrayBufferWriter?

Yes, so its an allocation to decorate an array with the IBufferWriter interface (plus auto resizing); but if you were pooling the instance you shouldn't hang on to the actual ArrayPool's array.

However would be ok, to go custom for this; as, looking at the implementation, also might not want to force clearing the data either before returning.

@wli3
Copy link

wli3 commented Apr 15, 2019

any ETA on this issue?

no reply from dotnet/corefx#35094

@ahsonkhan
Copy link
Member Author

ahsonkhan commented Apr 15, 2019

I'll submit a PR in the next couple of days (working on it atm). Thanks for the ping.

@Timovzl
Copy link

Timovzl commented Sep 30, 2019

It seems that the array option was chosen in the end, and that the class was not made disposable. Would it not be more favorable to make the class disposable, and use ArrayPool arrays instead, avoiding the allocations?

I could not extract this from the discussion above, but let me know if the answer was already implied somewhere.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants