
Avoid unnecessary allocations when using FileStream #15088

Closed
ayende opened this issue Aug 25, 2015 · 17 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.IO blocking Marks issues that we want to fast track in order to unblock other important work enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Milestone

Comments

@ayende
Contributor

ayende commented Aug 25, 2015

This was originally a PR (dotnet/coreclr#1429), turned into an issue as a result of the comments there.

The idea is to avoid 4KB allocation for buffer whenever we need to work with large number of files.
Consider the following code:

foreach (var file in System.IO.Directory.GetFiles(dirToCheck, "*.dat"))
{
    using (var fs = new FileStream(file, FileMode.Open))
    {
        // do something to read from the stream
    }
}

The problem is that each instance of FileStream will allocate an independent buffer. If we are reading 10,000 files, that will result in 40MB(!) being allocated, even if we are very careful about allocations in general.

See also: dotnet/corefx#2929

The major problem is that FileStream will allocate its own buffer(s) and provide no way to really manage that. Creating a large number of FileStream instances, or doing big writes using WriteAsync, will allocate a lot of temporary buffers and generate a lot of GC pressure.

As I see it, there are a few options here:

  • Add a constructor that will take an external buffer to use. This will be the sole buffer used, and if a bigger buffer is required, it will throw instead of allocating a new one.
  • Add a pool of buffers that will be used. Something like the following code:
  [ThreadStatic] private static Stack<byte[]>[] _buffersBySize;
  
  private static byte[] GetBuffer(int requestedSize)
  {
      if(_buffersBySize == null)
          _buffersBySize = new Stack<byte[]>[32];
  
      var actualSize = PowerOfTwo(requestedSize);
      var pos = MostSignificantBit(actualSize);
  
      if(_buffersBySize[pos] == null)
          _buffersBySize[pos] = new Stack<byte[]>();
  
      if(_buffersBySize[pos].Count == 0)
          return new byte[actualSize];
  
      return _buffersBySize[pos].Pop();
  }
  
  private static void ReturnBuffer(byte[] buffer)
  {
      var actualSize = PowerOfTwo(buffer.Length);
      if(actualSize != buffer.Length)
          return; // can't pool a buffer of unexpected size (probably an error)
  
      if(_buffersBySize == null)
          _buffersBySize = new Stack<byte[]>[32];
  
      var pos = MostSignificantBit(actualSize);
  
      if(_buffersBySize[pos] == null)
          _buffersBySize[pos] = new Stack<byte[]>();
  
      _buffersBySize[pos].Push(buffer);
  }

The idea here is that each thread has its own set of buffers, and we'll take the buffers from there. The Dispose method will return them to the thread buffer. Note that there is no requirement to use the same thread for creation / disposal. (Although to be fair, we'll probably need to handle a case where a disposal thread is used and all streams are disposed on it).

The benefit here is that this isn't going to impact the external API, whereas accepting an external buffer would make the mechanism visible in the public API.
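The PowerOfTwo and MostSignificantBit helpers in the sketch above are left undefined. A minimal implementation, assuming .NET 6's System.Numerics.BitOperations is available (BufferPoolHelpers is a hypothetical name), could look like this:

```csharp
using System.Numerics;

internal static class BufferPoolHelpers
{
    // Round a requested size up to the next power of two so that
    // buffers fall into a small number of size classes.
    public static int PowerOfTwo(int requestedSize)
        => (int)BitOperations.RoundUpToPowerOf2((uint)requestedSize);

    // Index of the highest set bit; each power-of-two size maps to a
    // unique slot in the 32-entry stack array.
    public static int MostSignificantBit(int size)
        => BitOperations.Log2((uint)size);
}
```

For example, PowerOfTwo(5000) yields 8192, and MostSignificantBit(8192) is 13, so a 5000-byte request would be served from slot 13.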

@stephentoub
Member

cc: @KrzysztofCwalina

@KrzysztofCwalina
Member

I do think this is a very good issue. If we do the buffer pool option, we first need to design a good general purpose buffer pool. We have made many attempts at it in the past, and none of them turned out to be truly general purpose (i.e. such that the pool works for many different scenarios). But I think the time is ripe for this; we need a good buffer pool built in the platform.

@ayende
Contributor Author

ayende commented Aug 25, 2015

A general-purpose buffer pool would be wonderful; there are quite a few places that need it: FileStream, WebSockets, etc.
But that is a really hard problem to solve properly. The code above, for example, can cause starvation if you have some threads doing disposal and some doing creation (common in message passing systems).
More complex buffer pools also require non-trivial synchronization.

At the same time, I don't think that IBufferPool is a good idea, either.

@rynowak
Member

rynowak commented Jul 20, 2016

If we do the buffer pool option, we first need to design a good general purpose buffer pool

Hopefully we'd all agree that we did this 👍

We were looking at addressing this issue pretty recently in ASP.NET, and our options are really limited unless the fix comes from CoreFx.

  1. Write our own FileStream (ugh) - that's the worst option for all the obvious reasons
  2. Ask for a new API that allows us to pass a buffer - this doesn't work immediately for us; we'd have to take a dependency on a newer netstandard to leverage it
  3. Fix it inside CoreFx by using the buffer pool
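For reference, the general-purpose pool that eventually shipped is System.Buffers.ArrayPool<T>. A minimal sketch of the renting pattern from the caller's side, using a pooled buffer for the read instead of letting each stream allocate one (the temp-file setup is only for illustration):

```csharp
using System;
using System.Buffers;
using System.IO;

string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[] { 1, 2, 3 });

// Rent may hand back a larger array than requested; only the first
// 'read' bytes are meaningful, and the array must be returned.
byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
try
{
    using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
    int read = fs.Read(buffer, 0, buffer.Length);
    Console.WriteLine($"read {read} bytes"); // read 3 bytes
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
    File.Delete(path);
}
```

The try/finally matters: a rented array that is never returned is the "exhausting the ArrayPool" misuse discussed later in this thread.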

@stephentoub
Member

stephentoub commented Jul 20, 2016

Fix it inside CoreFx by using the buffer pool

We can revisit it, but see the discussion here dotnet/corefx#5954 (comment), then dotnet/corefx#6473.

cc: @jkotas, @socket, @KrzysztofCwalina

@rynowak
Member

rynowak commented Jul 20, 2016

Thanks @stephentoub - perhaps it's my unfamiliarity with the details, but from looking at the API definitions here, it seems that the only API that could leak a reference to the buffer is the CopyTo[Async] family. Both Read(...) and Write(...) use a caller-provided buffer and potentially a buffer allocated by the stream. Is CopyTo[Async] the sticking point here?

Do you think we'd solve any of the objections by using a pooled buffer inside FileStream for Read(...) and Write(...) and then implementing our own CopyToAsync, or perhaps adding an implementation that accepts a buffer as a parameter? (I haven't really done any research into whether or not we'd want to do this yet.)

As I understand it, the objection to using pooling in CopyToAsync is that a misbehaving destination stream could corrupt any future consumers of the buffer by holding on to it and manipulating it after it's been returned to the pool. Currently, if you are faced with a misbehaving destination stream, it will only corrupt the internal state of the FileStream. Is this accurate?

@stephentoub
Member

it seems that the only API that could leak a reference to the buffer is the CopyTo[Async] family

Not really. It's also about the internal buffer used by FileStream (FileStream doesn't actually override CopyTo{Async}). Let's say FileStream grabs a buffer from the pool when it's constructed and returns it when it's Dispose'd. What happens if misuse of the stream causes it to be Dispose'd while a ReadAsync operation is in flight? With the current implementation, we'd end up putting a buffer back into the pool and then potentially still writing into it as part of the in-flight ReadAsync operation. We could add synchronization (at a run-time cost) to only return the buffer to the pool in Dispose if there aren't any async operations in flight, and that would address this particular case. But depending on to what degree we care about corruption, there's still the case that something else in the process could put a buffer erroneously back into the pool, FileStream could use that buffer for reads/writes, but the original holder of the buffer could still be using it. There's nothing we can do about that, and we'd end up in a situation where corrupted data was being read or written in the file. The concern here is that we'd be introducing the potential for non-local corruption where it never existed before; something elsewhere in the process completely unrelated to a particular FileStream instance could end up corrupting that instance. Is that a security issue? Maybe, maybe not. Is it difficult to debug? Almost certainly.

@ayende
Contributor Author

ayende commented Jul 20, 2016

What about ref counting the buffers? So if there are outstanding operations, it is only returned when they are all completed, even if disposed midway through?

@stephentoub
Member

stephentoub commented Jul 20, 2016

What about ref counting the buffers?

That's what I was referring to with "We could add synchronization (at a run-time cost) to only return the buffer to the pool in Dispose if there aren't any async operations in flight". I think you're suggesting on top of that we could delay the return of the buffer until the operation completed, whereas I was suggesting we simply wouldn't return the buffer in that case. I don't think it's worth optimizing for cases of misuse (it's considered misuse to Dispose of a FileStream while operations are still in flight).
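A rough sketch of the synchronization being described here, not FileStream's actual implementation: an in-flight counter that defers returning the pooled buffer until the last outstanding operation finishes (PooledBuffer and its members are hypothetical names, and the sketch ignores some races that only arise under the misuse being discussed):

```csharp
using System.Buffers;
using System.Threading;

internal sealed class PooledBuffer
{
    private byte[]? _buffer = ArrayPool<byte>.Shared.Rent(4096);
    private int _inFlight;   // outstanding (async) operations
    private bool _disposed;

    public bool HasBuffer => Volatile.Read(ref _buffer) != null;

    public void BeginOperation() => Interlocked.Increment(ref _inFlight);

    public void EndOperation()
    {
        // The last operation out returns the buffer if Dispose already ran.
        if (Interlocked.Decrement(ref _inFlight) == 0 && Volatile.Read(ref _disposed))
            ReturnBuffer();
    }

    public void Dispose()
    {
        Volatile.Write(ref _disposed, true);
        if (Volatile.Read(ref _inFlight) == 0)
            ReturnBuffer();
    }

    private void ReturnBuffer()
    {
        // Exchange guards against returning the same array twice.
        byte[]? b = Interlocked.Exchange(ref _buffer, null);
        if (b != null)
            ArrayPool<byte>.Shared.Return(b);
    }
}
```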

@rynowak
Member

rynowak commented Jul 20, 2016

Thanks for the summary Stephen.

But depending on to what degree we care about corruption, there's still the case that something else in the process could put a buffer erroneously back into the pool, FileStream could use that buffer for reads/writes,

This really seems more like a discussion of principles the runtime wants to follow than whether or not we can solve the issues related to FileStream. Should every framework component behave as much like a 'clean room' as possible?

I think the logical conclusion of this is that there ends up being a 'framework only' instance of the pool, or no pooling at all in corefx. Every other mitigation will have an Achilles' heel, and there would still be cases in existing BCL APIs (like CopyToAsync) where we can't use the 'framework only' pool because it could allow aliasing. Of course, we never wanted to build a 'framework only' pool because it leads to suboptimal reuse.

The escape hatch would be to provide a constructor or method overload that accepts a caller-provided buffer. This way FileStream is as 'pure' a system as it can be (while still touching the file system 😆 ). This of course means that it requires us to wait until we're ready to adopt the next version of netstandard as our minimum requirement, so if we're going to do that, we have the possibility to get even more creative.

I'm comfortable waiting a while to resolve exactly what we (ASP.NET) want to do, because we don't yet have much data about the scenario in question (serving static files).

I think in an ideal world, I'd have the ability to write more unsafe code to solve IO problems using stack-allocated or manually managed memory. This isn't compatible with a lot of existing APIs of course which is why we aren't just doing that 😆

@ayende
Contributor Author

ayende commented Jul 20, 2016

In general, having some way to get a Stream over byte* would be pretty great. Right now we have to copy data from unmanaged to managed memory just to be able to pass the right thing into the Stream call.
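For the record, UnmanagedMemoryStream does give a Stream view over a byte* without copying; a minimal sketch (requires compiling with unsafe code enabled):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

IntPtr native = Marshal.AllocHGlobal(16);
try
{
    unsafe
    {
        byte* p = (byte*)native;
        for (int i = 0; i < 16; i++) p[i] = (byte)i;

        // Wrap the unmanaged block in a Stream without copying it
        // into a managed byte[] first.
        using var stream = new UnmanagedMemoryStream(p, 16);
        int first = stream.ReadByte(); // reads p[0], i.e. 0
    }
}
finally
{
    Marshal.FreeHGlobal(native);
}
```

This covers reading from native memory directly; it does not help with APIs that insist on a caller-supplied byte[], which is the copy the comment above is complaining about.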

@JeremyKuhne JeremyKuhne removed their assignment Jan 14, 2020
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@JeremyKuhne JeremyKuhne removed the untriaged New issue has not been triaged by the area owner label Mar 3, 2020
@carlossanlop carlossanlop added this to To do in System.IO - FileStream via automation Mar 6, 2020
@adamsitnik adamsitnik modified the milestones: Future, 6.0.0 Feb 1, 2021
@adamsitnik
Member

Background and Motivation

We have recently got rid of all managed allocations for FileStream.ReadAsync and FileStream.WriteAsync and the remaining _buffer allocation:

Interlocked.CompareExchange(ref _buffer, GC.AllocateUninitializedArray<byte>(_bufferSize), null);

is the last allocation that could be avoided.
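That line is the lock-free lazy-initialization idiom: allocate only on first use, and if two threads race, exactly one array is published while the loser's allocation is discarded. A stand-alone sketch of the same pattern (LazyBuffer is a hypothetical name):

```csharp
using System.Threading;

internal sealed class LazyBuffer
{
    private byte[]? _buffer;
    private readonly int _bufferSize;

    public LazyBuffer(int bufferSize) => _bufferSize = bufferSize;

    public byte[] Buffer
    {
        get
        {
            // Allocate only when first requested. CompareExchange only
            // stores the new array if the field is still null, so racing
            // threads all end up observing the same instance.
            if (_buffer is null)
                Interlocked.CompareExchange(ref _buffer, new byte[_bufferSize], null);
            return _buffer;
        }
    }
}
```

The point of the proposal below is that even this single deferred allocation per stream adds up across many FileStream instances.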

We can do that by either:

  • allowing the users to pass the buffer to FileStream ctor. It could be an array rented from ArrayPool or unmanaged memory.
  • allowing the users to specify that they want us to pool the buffer.

Proposed API

namespace System.IO
{
    public sealed class FileStreamOptions
    {
        public FileStreamOptions();
        public FileMode Mode { get; set; }
        public FileAccess Access { get; set; } = FileAccess.Read;
        public FileShare Share { get; set; } = FileShare.Read;
        public FileOptions Options { get; set; }
        public long PreallocationSize { get; set; }
+       public Memory<byte>? Buffer { get; set; } // default value == null => use default buffer size and allocate the buffer (current behaviour)
    }
}

Usage Examples

byte[] array = ArrayPool<byte>.Shared.Rent(16_000);
try
{
    var advanced = new FileStreamOptions
    {
        Mode = FileMode.CreateNew,
        Access = FileAccess.Write,
        Options = FileOptions.Asynchronous | FileOptions.WriteThrough,
        Buffer = array
    };
    using (FileStream fileStream = new FileStream(advanced))
    {
        // use FileStream
    }
}
finally
{
    // return the array only after the stream has been disposed,
    // so it can't still be in use as the stream's buffer
    ArrayPool<byte>.Shared.Return(array);
}

To disable the buffering, users would have to pass a default or empty Memory<byte>:

var noBuffering = new FileStreamOptions
{
    Buffer = default(Memory<byte>) // Array.Empty<byte>() would also work
};

Alternative Designs

Don't let the user provide the buffer (to minimize risk of misuse), but instead provide bufferSize and extend FileOptions with PoolBuffer:

namespace System.IO
{
    public sealed class FileStreamOptions
    {
        public FileStreamOptions();
        public FileMode Mode { get; set; }
        public FileAccess Access { get; set; } = FileAccess.Read;
        public FileShare Share { get; set; } = FileShare.Read;
        public FileOptions Options { get; set; }
        public long PreallocationSize { get; set; }
+       public int BufferSize { get; set; }
    }
    
    public enum FileOptions
    {
        WriteThrough,
        None,
        Encrypted,
        DeleteOnClose,
        SequentialScan,
        RandomAccess,
        Asynchronous,
+       PoolBuffer // new option
    }
}

Risks

Allowing the users to pass the buffer creates the risk of the user misusing it:

  • not returning a rented array to the pool and exhausting the ArrayPool
  • freeing the native memory that Memory<byte> wraps when it's still being used by FileStream

@adamsitnik adamsitnik self-assigned this May 17, 2021
@adamsitnik adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels May 17, 2021
@stephentoub
Member

stephentoub commented May 17, 2021

We already support creating the FileStream unbuffered, at which point the consumer is fully in control of buffering via the buffers they pass to read/write. I'd rather we just stick with that rather than exposing this scheme, which is yet another way to shoot oneself in the foot with pooling and yet another scheme in FileStream for letting the user control buffering. This also ends up being FileStream-specific... if we really think this internal buffer needs to be configurable further, we should think it through for what pattern should be used across all streams/writers/readers. And on top of that, this prescribed pattern ends up forcing a buffer to be rented/allocated in case it might be needed even if access patterns are such that it would never otherwise be reified.
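The unbuffered mode referred to here is selected through the existing bufferSize constructor argument; a sketch, assuming a bufferSize of 1 (0 also works on recent runtimes) disables the internal buffer:

```csharp
using System.IO;

string path = Path.GetTempFileName();

// bufferSize: 1 disables FileStream's internal buffer, so reads and
// writes go straight to the OS using only the caller-provided arrays.
using (var fs = new FileStream(path, FileMode.Create, FileAccess.ReadWrite,
                               FileShare.None, bufferSize: 1))
{
    fs.Write(new byte[] { 1, 2, 3 }, 0, 3);
    fs.Position = 0;

    byte[] callerBuffer = new byte[3];
    int read = fs.Read(callerBuffer, 0, callerBuffer.Length); // read == 3
}
File.Delete(path);
```

In this mode the caller controls all buffering, which is the alternative being argued for.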

@sakno
Contributor

sakno commented May 17, 2021

We already support creating the FileStream unbuffered, at which point the consumer is fully in control of buffering via the buffers they pass to read/write.

Buffer management is not restricted to allocation. The current implementation controls Position and flushes the stream when needed; Seek, Position, and other members take the state of the buffer into account. As a user, I don't want to re-implement all of that; I just want to override buffer allocation and nothing more. I could use BufferedStream as a wrapper for FileStream, but it also doesn't support a custom Memory<byte> as its buffer. Moreover, it's an extra level of indirection: BufferedStream -> FileStream -> FileStreamStrategy. From my point of view, adding support for a custom buffer to the .NET class library is cheaper than writing another level of abstraction on top of FileStream.

which is yet another way to shoot oneself in the foot

Because this is the responsibility of the user. The same applies to the Unsafe class, pointers, and many other things in .NET/C#. I accept the risk in exchange for the ability to solve my tasks efficiently. Flexible file I/O with small development effort is what I expect to see out of the box.

@ayende
Contributor Author

ayende commented May 18, 2021

Just to add my 2 cents, having the buffer management (position, etc) in FileStream will help, yes. I want to control the buffer allocation, but I don't actually care about how it is used.

@bartonjs
Member

bartonjs commented May 18, 2021

Video

@tannergooding mentioned that there may be a more generalized allocator/management feature in the works, so rather than accepting a buffer now as well as an allocator "soon", we feel that the right answer for now is just to take the buffer size, not a user provided buffer.

namespace System.IO
{
    partial class FileStreamOptions
    {
       public int BufferSize { get; set; }
    }
}
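As it shipped in .NET 6, the approved shape is used like this; setting BufferSize to 0 disables the internal buffer entirely (the temp-file path is only for illustration):

```csharp
using System.IO;

string path = Path.GetTempFileName();

var options = new FileStreamOptions
{
    Mode = FileMode.Create,
    Access = FileAccess.Write,
    BufferSize = 0 // 0 disables FileStream's internal buffering
};

using (var fs = new FileStream(path, options))
{
    fs.Write(new byte[] { 1, 2, 3 }); // goes straight to the OS
}
File.Delete(path);
```

This keeps the user in control of how much (if any) internal buffer FileStream allocates, without ever handing FileStream a user-owned array.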

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 18, 2021
@stephentoub
Member

Implemented by #52928

System.IO - FileStream automation moved this from To do to Done May 28, 2021
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 27, 2021

10 participants