
Avoid unnecessary allocations when using FileStream #15088

Closed
ayende opened this issue Aug 25, 2015 · 17 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.IO blocking Marks issues that we want to fast track in order to unblock other important work enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Milestone

Comments

@ayende
Contributor

ayende commented Aug 25, 2015

This was originally a PR (dotnet/coreclr#1429), turned into an issue as a result of the comments there.

The idea is to avoid 4KB allocation for buffer whenever we need to work with large number of files.
Consider the following code:

foreach (var file in System.IO.Directory.GetFiles(dirToCheck, "*.dat"))
{
    using (var fs = new FileStream(file, FileMode.Open))
    {
        // do something to read from the stream
    }
}

The problem is that each instance of FileStream will allocate an independent buffer. If we are reading 10,000 files, that will result in 40MB(!) being allocated, even if we are very careful about allocations in general.

See also: dotnet/corefx#2929

The major problem is that FileStream will allocate its own buffer(s) and provide no way to really manage that. Creating a large number of FileStream instances, or doing big writes using WriteAsync, will allocate a lot of temporary buffers and generate a lot of GC pressure.

As I see it, there are a few options here:

  • Add a constructor that will take an external buffer to use. This will be the sole buffer used, and if a bigger buffer is required, it will throw instead of allocating a new one.
  • Add a pool of buffers that will be used. Something like the following code:
  [ThreadStatic] private static Stack<byte[]>[] _buffersBySize;
  
  private static byte[] GetBuffer(int requestedSize)
  {
      if(_buffersBySize == null)
          _buffersBySize = new Stack<byte[]>[32];
  
      var actualSize = PowerOfTwo(requestedSize);
      var pos = MostSignificantBit(actualSize);
  
      if(_buffersBySize[pos] == null)
          _buffersBySize[pos] = new Stack<byte[]>();
  
      if(_buffersBySize[pos].Count == 0)
          return new byte[actualSize];
  
      return _buffersBySize[pos].Pop();
  }
  
  private static void ReturnBuffer(byte[] buffer)
  {
      var actualSize = PowerOfTwo(buffer.Length);
      if(actualSize != buffer.Length)
          return; // can't pool a buffer of unexpected size (probably an error)
  
      if(_buffersBySize == null)
          _buffersBySize = new Stack<byte[]>[32];
  
      var pos = MostSignificantBit(actualSize);
  
      if(_buffersBySize[pos] == null)
          _buffersBySize[pos] = new Stack<byte[]>();
  
      _buffersBySize[pos].Push(buffer);
  }

The idea here is that each thread has its own set of buffers, and we'll take the buffers from there. The Dispose method will return them to the thread buffer. Note that there is no requirement to use the same thread for creation / disposal. (Although to be fair, we'll probably need to handle a case where a disposal thread is used and all streams are disposed on it).

The benefit here is that this isn't going to impact the external API, whereas accepting an external buffer would make the mechanism visible in the public API.
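The PowerOfTwo and MostSignificantBit helpers in the sketch above are left undefined. A minimal implementation, assuming .NET 6's System.Numerics.BitOperations is available (BufferPoolHelpers is a hypothetical name), could look like this:

```csharp
using System.Numerics;

internal static class BufferPoolHelpers
{
    // Round a requested size up to the next power of two so that
    // buffers fall into a small number of size classes.
    public static int PowerOfTwo(int requestedSize)
        => (int)BitOperations.RoundUpToPowerOf2((uint)requestedSize);

    // Index of the highest set bit; each power-of-two size maps to a
    // unique slot in the 32-entry stack array.
    public static int MostSignificantBit(int size)
        => BitOperations.Log2((uint)size);
}
```

For example, PowerOfTwo(5000) yields 8192, and MostSignificantBit(8192) is 13, so a 5000-byte request would be served from slot 13.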

@stephentoub
Member

cc: @KrzysztofCwalina

@KrzysztofCwalina
Member

I do think this is a very good issue. If we do the buffer pool option, we first need to design a good general purpose buffer pool. We have made many attempts at it in the past, and none of them turned out to be truly general purpose (i.e. such that the pool works for many different scenarios). But I think the time is ripe for this; we need a good buffer pool built in the platform.

@ayende
Contributor Author

ayende commented Aug 25, 2015

A general-purpose buffer pool would be wonderful; there are quite a few places that need it: FileStream, WebSockets, etc.
But that is a really hard problem to solve properly. The code above, for example, can cause starvation if you have some threads doing disposal and some doing creation (common in message passing systems).
More complex buffer pools also require non-trivial synchronization.

At the same time, I don't think that IBufferPool is a good idea, either.

@rynowak
Member

rynowak commented Jul 20, 2016

If we do the buffer pool option, we first need to design a good general purpose buffer pool

Hopefully we'd all agree that we did this 👍

We were looking at addressing this issue pretty recently in ASP.NET, and our options are really limited unless the fix comes from CoreFx.

  1. Write our own FileStream (ugh) - that's the worst option for all the obvious reasons
  2. Ask for a new API that allows us to pass a buffer - this doesn't work immediately for us; we'd have to take a dependency on a newer netstandard to leverage it
  3. Fix it inside CoreFx by using the buffer pool
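For reference, the general-purpose pool that eventually shipped is System.Buffers.ArrayPool<T>. A minimal sketch of the renting pattern from the caller's side, using a pooled buffer for the read instead of letting each stream allocate one (the temp-file setup is only for illustration):

```csharp
using System;
using System.Buffers;
using System.IO;

string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[] { 1, 2, 3 });

// Rent may hand back a larger array than requested; only the first
// 'read' bytes are meaningful, and the array must be returned.
byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
try
{
    using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
    int read = fs.Read(buffer, 0, buffer.Length);
    Console.WriteLine($"read {read} bytes"); // read 3 bytes
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
    File.Delete(path);
}
```

The try/finally matters: a rented array that is never returned is the "exhausting the ArrayPool" misuse discussed later in this thread.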

@stephentoub
Member

stephentoub commented Jul 20, 2016

Fix it inside CoreFx by using the buffer pool

We can revisit it, but see the discussion here dotnet/corefx#5954 (comment), then dotnet/corefx#6473.

cc: @jkotas, @socket, @KrzysztofCwalina

@rynowak
Member

rynowak commented Jul 20, 2016

Thanks @stephentoub - perhaps it's my unfamiliarity with the details, but from looking at the API definitions here, it seems that the only API that could leak a reference to the buffer is the CopyTo[Async] family. Both Read(...) and Write(...) use a caller-provided buffer and potentially a buffer allocated by the stream. Is CopyTo[Async] the sticking point here?

Do you think we'd solve any of the objections by using a pooled buffer inside FileStream for Read(...) and Write(...) and then implementing our own CopyToAsync, or perhaps adding an implementation that accepts a buffer as a parameter? (I haven't really done any research into whether or not we'd want to do this yet.)

As I understand it, the objection to using pooling in CopyToAsync is that a misbehaving destination stream could corrupt any future consumers of the buffer by holding on to it and manipulating it after it's been returned to the pool. Currently, if you are faced with a misbehaving destination stream, it will only corrupt the internal state of the FileStream. Is this accurate?

@stephentoub
Member

it seems that the only API that could leak a reference to the buffer is the CopyTo[Async] family

Not really. It's also about the internal buffer used by FileStream (FileStream doesn't actually override CopyTo{Async}). Let's say FileStream grabs a buffer from the pool when it's constructed and returns it when it's Dispose'd. What happens if misuse of the stream causes it to be Dispose'd while a ReadAsync operation is in flight? With the current implementation, we'd end up putting a buffer back into the pool and then potentially still writing into it as part of the in-flight ReadAsync operation. We could add synchronization (at a run-time cost) to only return the buffer to the pool in Dispose if there aren't any async operations in flight, and that would address this particular case. But depending on to what degree we care about corruption, there's still the case that something else in the process could put a buffer erroneously back into the pool, FileStream could use that buffer for reads/writes, but the original holder of the buffer could still be using it. There's nothing we can do about that, and we'd end up in a situation where corrupted data was being read or written in the file. The concern here is that we'd be introducing the potential for non-local corruption where it never existed before; something elsewhere in the process completely unrelated to a particular FileStream instance could end up corrupting that instance. Is that a security issue? Maybe, maybe not. Is it difficult to debug? Almost certainly.

@ayende
Contributor Author

ayende commented Jul 20, 2016

What about ref counting the buffers? So if there are outstanding operations, it is only returned when they are all completed, even if disposed midway through?

@stephentoub
Member

stephentoub commented Jul 20, 2016

What about ref counting the buffers?

That's what I was referring to with "We could add synchronization (at a run-time cost) to only return the buffer to the pool in Dispose if there aren't any async operations in flight". I think you're suggesting on top of that we could delay the return of the buffer until the operation completed, whereas I was suggesting we simply wouldn't return the buffer in that case. I don't think it's worth optimizing for cases of misuse (it's considered misuse to Dispose of a FileStream while operations are still in flight).
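A rough sketch of the synchronization being described here, not FileStream's actual implementation: an in-flight counter that defers returning the pooled buffer until the last outstanding operation finishes (PooledBuffer and its members are hypothetical names, and the sketch ignores some races that only arise under the misuse being discussed):

```csharp
using System.Buffers;
using System.Threading;

internal sealed class PooledBuffer
{
    private byte[]? _buffer = ArrayPool<byte>.Shared.Rent(4096);
    private int _inFlight;   // outstanding (async) operations
    private bool _disposed;

    public bool HasBuffer => Volatile.Read(ref _buffer) != null;

    public void BeginOperation() => Interlocked.Increment(ref _inFlight);

    public void EndOperation()
    {
        // The last operation out returns the buffer if Dispose already ran.
        if (Interlocked.Decrement(ref _inFlight) == 0 && Volatile.Read(ref _disposed))
            ReturnBuffer();
    }

    public void Dispose()
    {
        Volatile.Write(ref _disposed, true);
        if (Volatile.Read(ref _inFlight) == 0)
            ReturnBuffer();
    }

    private void ReturnBuffer()
    {
        // Exchange guards against returning the same array twice.
        byte[]? b = Interlocked.Exchange(ref _buffer, null);
        if (b != null)
            ArrayPool<byte>.Shared.Return(b);
    }
}
```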

@rynowak
Member

rynowak commented Jul 20, 2016

Thanks for the summary Stephen.

But depending on to what degree we care about corruption, there's still the case that something else in the process could put a buffer erroneously back into the pool, FileStream could use that buffer for reads/writes,

This really seems more like a discussion of principles the runtime wants to follow than whether or not we can solve the issues related to FileStream. Should every framework component behave as much like a 'clean room' as possible?

I think the logical conclusion of this is that there ends up being a 'framework only' instance of the pool, or no pooling at all in corefx. Every other mitigation will have an Achilles' heel, and there would still be cases in existing BCL APIs (like CopyToAsync) where we can't use the 'framework only' pool because it could allow aliasing. Of course, we never wanted to build a 'framework only' pool because it leads to suboptimal reuse.

The escape hatch would be to provide a constructor or method overload that accepts a caller-provided buffer. This way FileStream is as 'pure' a system as it can be (while still touching the file system 😆 ). This of course means that it requires us to wait until we're ready to adopt the next version of netstandard as our minimum requirement, so if we're going to do that, we have the possibility to get even more creative.

I'm comfortable waiting a while to resolve exactly what we (ASP.NET) want to do, because we don't yet have much data about the scenario in question (serving static files).

I think in an ideal world, I'd have the ability to write more unsafe code to solve IO problems using stack-allocated or manually managed memory. This isn't compatible with a lot of existing APIs of course which is why we aren't just doing that 😆

@ayende
Contributor Author

ayende commented Jul 20, 2016

In general, having some way to get a Stream over byte* would be pretty great. Right now we have to copy data from unmanaged to managed memory just to be able to pass the right thing into the Stream call.
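For the record, UnmanagedMemoryStream does give a Stream view over a byte* without copying; a minimal sketch (requires compiling with unsafe code enabled):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

IntPtr native = Marshal.AllocHGlobal(16);
try
{
    unsafe
    {
        byte* p = (byte*)native;
        for (int i = 0; i < 16; i++) p[i] = (byte)i;

        // Wrap the unmanaged block in a Stream without copying it
        // into a managed byte[] first.
        using var stream = new UnmanagedMemoryStream(p, 16);
        int first = stream.ReadByte(); // reads p[0], i.e. 0
    }
}
finally
{
    Marshal.FreeHGlobal(native);
}
```

This covers reading from native memory directly; it does not help with APIs that insist on a caller-supplied byte[], which is the copy the comment above is complaining about.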

@JeremyKuhne JeremyKuhne removed their assignment Jan 14, 2020
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@JeremyKuhne JeremyKuhne removed the untriaged New issue has not been triaged by the area owner label Mar 3, 2020
@carlossanlop carlossanlop added this to To do in System.IO - FileStream via automation Mar 6, 2020
@adamsitnik adamsitnik modified the milestones: Future, 6.0.0 Feb 1, 2021
@adamsitnik
Member

Background and Motivation

We have recently got rid of all managed allocations for FileStream.ReadAsync and FileStream.WriteAsync and the remaining _buffer allocation:

Interlocked.CompareExchange(ref _buffer, GC.AllocateUninitializedArray<byte>(_bufferSize), null);

is the last allocation that could be avoided.
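That line is the lock-free lazy-initialization idiom: allocate only on first use, and if two threads race, exactly one array is published while the loser's allocation is discarded. A stand-alone sketch of the same pattern (LazyBuffer is a hypothetical name):

```csharp
using System.Threading;

internal sealed class LazyBuffer
{
    private byte[]? _buffer;
    private readonly int _bufferSize;

    public LazyBuffer(int bufferSize) => _bufferSize = bufferSize;

    public byte[] Buffer
    {
        get
        {
            // Allocate only when first requested. CompareExchange only
            // stores the new array if the field is still null, so racing
            // threads all end up observing the same instance.
            if (_buffer is null)
                Interlocked.CompareExchange(ref _buffer, new byte[_bufferSize], null);
            return _buffer;
        }
    }
}
```

The point of the proposal below is that even this single deferred allocation per stream adds up across many FileStream instances.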

We can do that by either:

  • allowing the users to pass the buffer to FileStream ctor. It could be an array rented from ArrayPool or unmanaged memory.
  • allowing the users to specify that they want us to pool the buffer.

Proposed API

namespace System.IO
{
    public sealed class FileStreamOptions
    {
        public FileStreamOptions();
        public FileMode Mode { get; set; }
        public FileAccess Access { get; set; } = FileAccess.Read;
        public FileShare Share { get; set; } = FileShare.Read;
        public FileOptions Options { get; set; }
        public long PreallocationSize { get; set; }
+       public Memory<byte>? Buffer { get; set; } // default value == null => use default buffer size and allocate the buffer (current behaviour)
    }
}

Usage Examples

byte[] array = ArrayPool<byte>.Shared.Rent(16_000);
try
{
    var advanced = new FileStreamOptions
    {
        Mode = FileMode.CreateNew,
        Access = FileAccess.Write,
        Options = FileOptions.Asynchronous | FileOptions.WriteThrough,
        Buffer = array
    };
    using (FileStream fileStream = new FileStream(advanced))
    {
        // use FileStream
    }
}
finally
{
    // return the array only after the stream has been disposed,
    // so it can't still be in use as the stream's buffer
    ArrayPool<byte>.Shared.Return(array);
}

To disable the buffering, users would have to pass a default or empty Memory<byte>:

var noBuffering = new FileStreamOptions
{
    Buffer = default(Memory<byte>) // Array.Empty<byte>() would also work
};

Alternative Designs

Don't let the user provide the buffer (to minimize risk of misuse), but instead provide bufferSize and extend FileOptions with PoolBuffer:

namespace System.IO
{
    public sealed class FileStreamOptions
    {
        public FileStreamOptions();
        public FileMode Mode { get; set; }
        public FileAccess Access { get; set; } = FileAccess.Read;
        public FileShare Share { get; set; } = FileShare.Read;
        public FileOptions Options { get; set; }
        public long PreallocationSize { get; set; }
+       public int BufferSize { get; set; }
    }
    
    public enum FileOptions
    {
        WriteThrough,
        None,
        Encrypted,
        DeleteOnClose,
        SequentialScan,
        RandomAccess,
        Asynchronous,
+       PoolBuffer // new option
    }
}

Risks

Allowing the users to pass the buffer creates the risk of the user misusing it:

  • not returning a rented array to the pool and exhausting the ArrayPool
  • freeing the native memory that Memory<byte> wraps when it's still being used by FileStream

@adamsitnik adamsitnik self-assigned this May 17, 2021
@adamsitnik adamsitnik added api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels May 17, 2021
@stephentoub
Member

stephentoub commented May 17, 2021

We already support creating the FileStream unbuffered, at which point the consumer is fully in control of buffering via the buffers they pass to read/write. I'd rather we just stick with that rather than exposing this scheme, which is yet another way to shoot oneself in the foot with pooling and yet another scheme in FileStream for letting the user control buffering. This also ends up being FileStream-specific... if we really think this internal buffer needs to be configurable further, we should think it through for what pattern should be used across all streams/writers/readers. And on top of that, this prescribed pattern ends up forcing a buffer to be rented/allocated in case it might be needed even if access patterns are such that it would never otherwise be reified.
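The unbuffered mode referred to here is selected through the existing bufferSize constructor argument; a sketch, assuming a bufferSize of 1 (0 also works on recent runtimes) disables the internal buffer:

```csharp
using System.IO;

string path = Path.GetTempFileName();

// bufferSize: 1 disables FileStream's internal buffer, so reads and
// writes go straight to the OS using only the caller-provided arrays.
using (var fs = new FileStream(path, FileMode.Create, FileAccess.ReadWrite,
                               FileShare.None, bufferSize: 1))
{
    fs.Write(new byte[] { 1, 2, 3 }, 0, 3);
    fs.Position = 0;

    byte[] callerBuffer = new byte[3];
    int read = fs.Read(callerBuffer, 0, callerBuffer.Length); // read == 3
}
File.Delete(path);
```

In this mode the caller controls all buffering, which is the alternative being argued for.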

@sakno
Contributor

sakno commented May 17, 2021

We already support creating the FileStream unbuffered, at which point the consumer is fully in control of buffering via the buffers they pass to read/write.

Buffer management is not restricted to allocation. The current implementation controls Position and flushes the stream when needed; Seek, Position, and other members take the state of the buffer into account. As a user, I don't want to re-implement all of that; I just want to override buffer allocation and nothing more. I could use BufferedStream as a wrapper for FileStream, but it also doesn't support a custom Memory<byte> as its buffer. Moreover, it's an extra level of indirection: BufferedStream -> FileStream -> FileStreamStrategy. From my point of view, adding support for a custom buffer to the .NET class library is cheaper than writing another level of abstraction on top of FileStream.

which is yet another way to shoot oneself in the foot

Because this is the responsibility of the user. The same applies to the Unsafe class, pointers, and many other things in .NET/C#. I accept the risk in exchange for the ability to solve my tasks efficiently. Flexible file I/O with small development effort is what I expect to see out of the box.

@ayende
Contributor Author

ayende commented May 18, 2021

Just to add my 2 cents, having the buffer management (position, etc) in FileStream will help, yes. I want to control the buffer allocation, but I don't actually care about how it is used.

@bartonjs
Member

bartonjs commented May 18, 2021

Video

@tannergooding mentioned that there may be a more generalized allocator/management feature in the works, so rather than accepting a buffer now as well as an allocator "soon", we feel that the right answer for now is just to take the buffer size, not a user provided buffer.

namespace System.IO
{
    partial class FileStreamOptions
    {
       public int BufferSize { get; set; }
    }
}
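As it shipped in .NET 6, the approved shape is used like this; setting BufferSize to 0 disables the internal buffer entirely (the temp-file path is only for illustration):

```csharp
using System.IO;

string path = Path.GetTempFileName();

var options = new FileStreamOptions
{
    Mode = FileMode.Create,
    Access = FileAccess.Write,
    BufferSize = 0 // 0 disables FileStream's internal buffering
};

using (var fs = new FileStream(path, options))
{
    fs.Write(new byte[] { 1, 2, 3 }); // goes straight to the OS
}
File.Delete(path);
```

This keeps the user in control of how much (if any) internal buffer FileStream allocates, without ever handing FileStream a user-owned array.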

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 18, 2021
@stephentoub
Member

Implemented by #52928

System.IO - FileStream automation moved this from To do to Done May 28, 2021
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 27, 2021

10 participants