Unnecessary Slow Aligned Memory Reallocation (Multiple of System Page Size) #110225

Open
PavelCibulka opened this issue Nov 27, 2024 · 7 comments
Labels
area-System.Runtime.InteropServices tenet-performance Performance related issue untriaged New issue has not been triaged by the area owner

Comments

@PavelCibulka

I've been experimenting with resizing allocated aligned memory. I believe that increasing or decreasing memory by multiples of the system page size should be almost instantaneous.

The system seems capable of this when tested with NativeMemory.Realloc, which completes in around 1ms. However, NativeMemory.Realloc doesn't guarantee alignment preservation.

    public unsafe void Alloc() {
        long size = 4L * 1024 * 1024 * 1024; // 4 GiB
        void* mem = NativeMemory.Alloc((nuint)size);
        // Grow by one page, then shrink back to the original size.
        void* mem2 = NativeMemory.Realloc(mem, (nuint)(size + Environment.SystemPageSize));
        void* mem3 = NativeMemory.Realloc(mem2, (nuint)size);
        NativeMemory.Free(mem3);
    }

When I perform the same test with NativeMemory.AlignedRealloc, it takes several seconds to complete. It should be as fast as NativeMemory.Realloc when the requested alignment remains unchanged and the memory is resized by multiples of the system page size.

    public unsafe void AlignedAlloc() {
        long size = 4L * 1024 * 1024 * 1024; // 4 GiB
        void* mem = NativeMemory.AlignedAlloc((nuint)size, 64);
        // Grow by one page, then shrink back, keeping the 64-byte alignment.
        void* mem2 = NativeMemory.AlignedRealloc(mem, (nuint)(size + Environment.SystemPageSize), 64);
        void* mem3 = NativeMemory.AlignedRealloc(mem2, (nuint)size, 64);
        NativeMemory.AlignedFree(mem3);
    }

I'm unsure whether this issue should be reported to the .NET team or the operating system developers. I'm using Ubuntu 24.04 with kernel 6.10.14.

@PavelCibulka PavelCibulka added the tenet-performance Performance related issue label Nov 27, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Nov 27, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Contributor

Tagging subscribers to this area: @dotnet/interop-contrib
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

For the most part, the NativeMemory APIs are just thin wrappers over the underlying C runtime.

For example:

  • NativeMemory.Alloc - thin wrapper over malloc
  • NativeMemory.AllocZeroed - thin wrapper over calloc
  • NativeMemory.Free - thin wrapper over free
  • NativeMemory.Realloc - thin wrapper over realloc

It's a little bit more flexible with the aligned variants, as C didn't standardize them until more recently:

  • NativeMemory.AlignedAlloc - thin wrapper over aligned_alloc (if available) or _aligned_malloc (Win32) -or- equivalent API on non Win32 systems without the C API
  • NativeMemory.AlignedFree - thin wrapper over free (if using aligned_alloc) or _aligned_free (Win32) -or- equivalent API that pairs with the aligned alloc API on non Win32 systems without the C API

NativeMemory.AlignedRealloc then ends up deferring to _aligned_realloc on Win32. There is no equivalent C API, however. If the memory was allocated with the C API you can use realloc, but it doesn't allow changing the alignment (it should preserve it, however, as it's meant to be aware of that scenario). It's also worth noting that while realloc can theoretically grow an existing allocation and avoid the copy, that's fairly uncommon in practice and depends on many other factors. In many cases it is functionally malloc+copy+free, and that is correspondingly what the underlying fallback implementation does on systems where aligned_alloc is available.

Notably, "changing" the alignment is technically undefined behavior for the C API, so it is technically possible for us to ignore the alignment input and use realloc for the underlying implementation in that scenario, which might improve performance on non-Windows systems. But such a change likely needs deeper discussion.
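
To make the malloc+copy+free fallback described above concrete, here is a minimal C sketch. It is not the actual .NET runtime code: `aligned_realloc_copy`, `round_up`, and the explicit `old_size` parameter are assumptions made for illustration (standard C has no portable way to query the size of an existing block), and treating a zero-byte request as one byte is a choice of this sketch only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round n up to a multiple of align (a power of two). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical fallback: always allocate a new aligned block, copy, free.
   The caller supplies old_size because standard C cannot query it. */
static void *aligned_realloc_copy(void *old, size_t old_size,
                                  size_t new_size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    if (new_size == 0) new_size = 1;   /* sketch-only choice for empty requests */

    /* Round the size up, since older C standards require a multiple of align. */
    void *fresh = aligned_alloc(align, round_up(new_size, align));
    if (fresh == NULL) return NULL;    /* the old block is left untouched */

    if (old != NULL) {
        /* Copy whichever is smaller: the old payload or the new capacity. */
        memcpy(fresh, old, old_size < new_size ? old_size : new_size);
        free(old);
    }
    return fresh;
}
```

Every call in this sketch allocates a fresh block and copies the payload, so its cost grows with the buffer size even when the size changes by a single page.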

@PavelCibulka
Author

Thank you for the very detailed information.

If I understand correctly:

  • You can use:
    - NativeMemory.Alloc and calculate alignment manually.
    - NativeMemory.AllocZeroed and calculate alignment manually.
    - NativeMemory.AlignedAlloc.
  • Map Span to the allocated memory.
  • Change the size using NativeMemory.Realloc (alignment will be preserved even if allocated with NativeMemory.Alloc).
  • Calculate the position of Span using the same offset as in the previous memory location.
  • Map Span to the new memory.
  • Use Vector512 without encountering any errors.
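
As a rough illustration of the manual-alignment approach in the list above, the following C sketch over-allocates with malloc and tracks the offset to the aligned payload. All names (`aligned_buf`, `offset_for`, `buf_alloc`, `buf_resize`, `buf_data`) are hypothetical. One assumption worth stating: realloc only promises the default alignment for the block it returns, so the sketch recomputes the offset after resizing instead of reusing the old one, and shifts the payload when the offset changed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bookkeeping for an over-allocated block. */
typedef struct {
    unsigned char *base;  /* pointer returned by malloc/realloc */
    size_t offset;        /* bytes from base to the aligned payload */
    size_t size;          /* payload size in bytes */
    size_t align;         /* requested alignment (power of two) */
} aligned_buf;

/* Distance from base to the next address that is a multiple of align. */
static size_t offset_for(const void *base, size_t align) {
    uintptr_t addr = (uintptr_t)base;
    return (size_t)(((addr + align - 1) & ~(uintptr_t)(align - 1)) - addr);
}

static void *buf_data(const aligned_buf *b) {
    return b->base + b->offset;
}

static int buf_alloc(aligned_buf *b, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    b->base = malloc(size + align - 1);  /* slack so an aligned start fits */
    if (b->base == NULL) return -1;
    b->offset = offset_for(b->base, align);
    b->size = size;
    b->align = align;
    return 0;
}

static int buf_resize(aligned_buf *b, size_t new_size) {
    unsigned char *nb = realloc(b->base, new_size + b->align - 1);
    if (nb == NULL) return -1;           /* the old block is still valid */

    size_t new_offset = offset_for(nb, b->align);
    size_t keep = b->size < new_size ? b->size : new_size;
    /* realloc kept the payload at the OLD offset; move it if that changed. */
    if (new_offset != b->offset) memmove(nb + new_offset, nb + b->offset, keep);

    b->base = nb;
    b->offset = new_offset;
    b->size = new_size;
    return 0;
}
```

The block must be released with `free(b.base)`, not with the aligned payload pointer. Any Span-like view over the payload has to be rebuilt from `buf_data` after each resize.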

If so, what is the maximum alignment that NativeMemory.Realloc would maintain? Would it be 64 bytes, system page size, or another value?

Can we include this information in the NativeMemory.Realloc documentation?

Is the only purpose of NativeMemory.AlignedRealloc for situations when you want to change alignment?

@tannergooding
Member

> Change the size using NativeMemory.Realloc (alignment will be preserved even if allocated with NativeMemory.Alloc).

From the point of view of .NET's general public contract, the NativeMemory APIs Alloc/Realloc/Free are guaranteed to work together, and AlignedAlloc/AlignedRealloc/AlignedFree are guaranteed to work together. Other mixes, such as Realloc/Free with AlignedAlloc, are not guaranteed to work. Mixing APIs can therefore lead to undefined behavior.

This nuance exists because in some scenarios, such as if we used certain POSIX APIs, or on Windows where we need to defer to _aligned_malloc, the allocations are strictly incompatible with the C runtime APIs realloc/free and can only be used with the corresponding native APIs (such as _aligned_realloc/_aligned_free).

While we currently use the underlying C runtime API on systems that provide it (currently all officially supported Linux systems), we don't surface that detail publicly, so there's no way to query it. If such a detail were surfaced (or you were willing to rely on a point-in-time implementation detail), then mixing NativeMemory.Realloc + NativeMemory.AlignedAlloc would be safe in those specific scenarios due to the underlying guarantees of the C runtime itself, which is that aligned_alloc is paired with free and realloc (there is no aligned_free prior to C23 or aligned_realloc in general). The C runtime in particular remembers the original user-specified alignment passed into aligned_alloc and preserves that if it needs to allocate a new buffer as part of realloc. For alloc+realloc, it only preserves the system default alignment (typically 16 bytes on 64-bit systems).
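
For reference, the "system default alignment" mentioned above can be inspected from C11 code. The sketch below is only an illustration; `default_alignment` and `is_aligned` are hypothetical helper names, and the sketch assumes that `alignof(max_align_t)` is the alignment plain malloc/realloc honor for ordinary allocations.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment that plain malloc/realloc are expected to honor (assumption:
   this matches alignof(max_align_t) on the target platform). */
static size_t default_alignment(void) {
    return alignof(max_align_t);
}

/* Returns nonzero when p is a multiple of align (align is a power of two). */
static int is_aligned(const void *p, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A check such as `is_aligned(ptr, 64)` after a plain realloc is the only portable way to learn whether a stricter alignment happened to survive.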

> If so, what is the maximum alignment that NativeMemory.Realloc would maintain?

It depends on the underlying system. The C runtime doesn't guarantee a range of values that aligned_alloc must support, only that it must support all fundamental alignments (typically this will be all powers of 2 up to sizeof(void*)). In practice, most implementations support at least up to the size of a page, and many support larger alignments as well.

> Is the only purpose of NativeMemory.AlignedRealloc for situations when you want to change alignment?

Changing alignment isn't strictly guaranteed to work as some underlying realloc functions, such as _aligned_realloc on Windows, require it to match the original alignment passed into the aligned allocation function. It exists to pair with AlignedAlloc and provide a function that will definitively work.

One bit I was trying to say in my previous message was that the .NET team could, with a bit more discussion, simplify our own implementation and just call realloc on Linux, rather than manually doing an aligned_alloc+memcpy+free chain. This would fix the performance issue you're seeing without needing users to rely on implementation details.

> Can we include this information in the NativeMemory.Realloc documentation?

I think there are a few clarifying remarks we can add to improve things here, yes. Particularly in terms of what may be undefined behavior across platforms.

@jkotas
Member

jkotas commented Dec 1, 2024

> The C runtime in particular remembers the original user specified alignment passed into aligned_alloc and preserves that if it needs to allocate a new buffer as part of realloc.

Is that documented somewhere? I do not see it mentioned in any documentation, and it does not appear to be the case based on my ad-hoc testing. For example, this is going to reliably show that realloc does not preserve 64kB alignment on Ubuntu 24.04:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
   size_t blockSize = 65536;

   /* Allocate one 64kB block aligned to 64kB. */
   void* p = aligned_alloc(blockSize, blockSize);
   printf("%p %s\n", p, (((uintptr_t)p % blockSize) == 0) ? "aligned" : "NOT ALIGNED!");

   /* Grow it with plain realloc and check the alignment again. */
   void* p2 = realloc(p, 2*blockSize);
   printf("%p %s\n", p2, (((uintptr_t)p2 % blockSize) == 0) ? "aligned" : "NOT ALIGNED!");

   free(p2);
   return 0;
}

The exact conditions where realloc happens to preserve alignment vary between C runtime flavors (e.g. glibc vs. musl). I do not think that it is something one can reasonably depend on.
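
Since realloc may or may not keep the alignment, code that wants the cheap path would have to check the result and fall back when the check fails. The following C sketch shows one way that could look. It is purely illustrative and not something proposed in this thread or used by .NET: `resize_aligned` and `round_up` are hypothetical names, and the sketch assumes a C runtime where memory from aligned_alloc may be passed to realloc and free (as on the Linux systems discussed here).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round n up to a multiple of align (a power of two). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical resize: try realloc first, verify the alignment, and only
   fall back to aligned_alloc + memcpy + free when realloc lost it.
   Returns 0 on success and -1 on failure. After a failure *block is still
   a valid pointer that can be freed, but it may no longer be aligned. */
static int resize_aligned(void **block, size_t new_size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    if (new_size == 0) new_size = 1;   /* sketch-only choice for empty requests */

    void *p = realloc(*block, new_size);
    if (p == NULL) return -1;          /* *block is unchanged */
    *block = p;

    /* Fast path: realloc happened to return a suitably aligned block. */
    if (((uintptr_t)p & (align - 1)) == 0) return 0;

    /* Slow path: the data now lives in a misaligned block, so copy it again. */
    void *fresh = aligned_alloc(align, round_up(new_size, align));
    if (fresh == NULL) return -1;
    memcpy(fresh, p, new_size);
    free(p);
    *block = fresh;
    return 0;
}
```

The trade-off is that the slow path can copy the data twice (once inside realloc, once in the fallback), so whether this wins overall depends on how often the C runtime in question keeps the alignment.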

@tannergooding
Member

> Is that documented somewhere?

Hmmm, I thought it had been in the C17 or C23 spec; but after having re-read the relevant portions it isn't explicitly called out.

Given that, it would indeed depend on the underlying implementation, which may not preserve the alignment in all cases.
