
Add 64 bits support to Array underlying storage #12221

Open · GPSnoopy opened this issue Mar 8, 2019 · 102 comments
Labels: api-needs-work, area-System.Runtime, needs-further-triage
Milestone: Future

Comments

@GPSnoopy commented Mar 8, 2019

While the System.Array API supports LongLength and the this[long i] indexer, the CLR does not allow arrays to be allocated with more than 2^31-1 elements (int.MaxValue).

This limitation has become a daily annoyance when working with HPC or big data. We frequently hit this limit.

Why this matters

  • .NET arrays are the de facto storage unit of many APIs and collections.
  • Currently one has to allocate native memory instead, which does not interoperate with managed code (Span could have helped, but it is limited to int32 lengths as well); a sketch of this workaround follows the list.
  • HPC frameworks expect data to be contiguous in memory (i.e. the data cannot be split into separate arrays), e.g. BLAS libraries.
  • It requires applications to be coded and designed differently depending on whether they handle fewer or more than 2G elements, i.e. it does not scale.
  • It's an arbitrary limitation with little value to the user and application:
    • On a desktop with 64GB of RAM, why can one only allocate 3.1% of its memory in one go?
    • On a data centre machine with 1,536GB of RAM, why can one only allocate 0.1% of its memory in one go?

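For illustration, the native-memory workaround mentioned in the second bullet looks roughly like this (a sketch; the 16GB size and the double element type are arbitrary):

// Allocate 16 GB outside the GC heap and access it through raw pointers,
// because no managed double[] can hold that many elements.
long byteCount = 16L * 1024 * 1024 * 1024;
IntPtr buffer = Marshal.AllocHGlobal((IntPtr)byteCount);
try
{
    unsafe
    {
        double* data = (double*)buffer;
        data[2_000_000_000L] = 1.0; // works, but no bounds checks and no GC tracking
        // new Span<double>(data, length) cannot cover it all: the length parameter is an int
    }
}
finally
{
    Marshal.FreeHGlobal(buffer);
}
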
In C++ this is solved with std::size_t (whose typedef changes depending on the target platform). Ideally, .NET would have taken the same route when designing System.Array. Why it did not is a mystery, given that AMD64 and the .NET Framework appeared around the same time.

Proposal
I suggest that when the CLR/JIT runs a .NET application in x64, it should allow the array long constructor to allocate more than int.MaxValue items (a sketch of the resulting behaviour follows the list):

  • Indexing the array with operator this[long i] should work as expected and give access to the entire array.
  • Indexing the array with operator this[int i] should work as expected but implicitly limit the access to only the first int.MaxValue elements.
  • The LongLength property should return the total number of elements in the array.
  • The Length property should return the total number of elements in the array, or throw OverflowException if there are more than int.MaxValue elements (this matches the current behaviour of multi-dimensional arrays with more than int.MaxValue elements).

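To make the proposed semantics concrete, here is a sketch (today the allocation below fails at runtime because of the 2^31-1 element cap; under the proposal it would succeed on x64):

long count = 3_000_000_000;
var data = new double[count];   // today: throws at runtime; proposed: succeeds in an x64 process
data[2_999_999_999L] = 1.0;     // this[long] reaches every element
long total = data.LongLength;   // 3,000,000,000
int length = data.Length;       // proposed: throws OverflowException, as multi-dimensional arrays already do
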
I naively believe that the above should not break any existing application.

Bonus points for extending 64-bit support to Span and ReadOnlySpan.

@tannergooding (Member) commented Mar 8, 2019

I naively believe that the above should not break any existing application.

One of the simplest breaks is any program doing the following:

for (int i = 0; i < array.Length; i++)
{
}

This works fine for any existing program, since CoreCLR actually defines a limit that is just under int.MaxValue. Allowing lengths greater than int.MaxValue would cause the int index to overflow to -2147483648 and either cause unexpected behavior or cause an IndexOutOfRangeException to be thrown.
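
Such loops would have to be rewritten to index with a 64-bit type, for example (a sketch using the existing LongLength property):

for (long i = 0; i < array.LongLength; i++)
{
    // i cannot wrap to a negative value, even for an array with more than int.MaxValue elements
}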

@giuliojiang commented Mar 8, 2019

I frequently run into this limit when processing large chunks of data from network streams into byte arrays, and always need to implement chunking logic in order to be able to process the data.
It would be great to be able to use long indexes on arrays.
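
The chunking workaround described above looks roughly like this (a sketch; the 1GB chunk size is arbitrary):

// Read an arbitrarily large stream into a list of bounded byte[] chunks,
// because a single byte[] cannot hold more than int.MaxValue bytes.
static List<byte[]> ReadAllChunked(Stream stream, int chunkSize = 1 << 30)
{
    var chunks = new List<byte[]>();
    int read;
    do
    {
        byte[] buffer = new byte[chunkSize];
        read = 0;
        int n;
        while (read < chunkSize && (n = stream.Read(buffer, read, chunkSize - read)) > 0)
            read += n;
        if (read > 0)
        {
            Array.Resize(ref buffer, read); // no-op for full chunks; trims the final partial one
            chunks.Add(buffer);
        }
    } while (read == chunkSize);
    return chunks;
}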

@GPSnoopy (Author) commented Mar 8, 2019

@tannergooding That's why I proposed above to keep the existing behaviour of throwing OverflowException on Length when there are more than int.MaxValue elements. Nothing changes there.

I'm suggesting that changing the CLR implementation as proposed above would allow applications and people who want to use large arrays to do so without breaking existing applications. You are right that simply passing a large array into a library that does not support it will break, but at least this will give us a choice. We need to start somewhere, and .NET cannot keep ignoring this problem.

@philjdf commented Mar 8, 2019

Yesterday I happily created an array in Python which contained more than two billion elements. When can I do the same in .NET?

Currently we get an exception if we try to construct an array with more than 2B elements. What's wrong with deferring that exception until something calls the Length property which can no longer return a valid int? @tannergooding's example wouldn't cause problems. Are there other examples which break?

@GrabYourPitchforks (Member) commented Mar 21, 2019

This is an interesting idea, but I wonder if it would be better to have a LargeArray<T> or similar class that has these semantics rather than try to shoehorn it into the existing Array class. The reason I suggest this course of action is that the GC currently has a hard dependency on the element count of any variable-length managed object fitting into a 32-bit signed integer. Changing the normal Array class to have a 64-bit length property would at minimum also affect the String type, and it may have other unintended consequences throughout the GC that would hinder its efficiency, even when collecting normal fixed-size objects. Additionally, accessing the existing Array.Length property would no longer be a simple one-instruction dereference; it'd now be an overflow check with associated branching.

If we had a theoretical LargeArray<T> class, it could be created from the beginning with a "correct" API surface, including even using nuint instead of long for the indexer. If it allowed T to be a reference type, we could also eliminate the weird pseudo-covariance / contravariance behavior that existing Array instances have, which would make writing to a LargeArray<T> potentially cheaper than writing to a normal Array.
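
A hypothetical shape for such a type, purely to illustrate the idea (nothing below exists today and the names are illustrative):

namespace System
{
    // Hypothetical: a from-scratch large array with a native-sized length,
    // a nuint indexer, and no array covariance.
    public sealed class LargeArray<T>
    {
        public LargeArray(nuint length);
        public nuint Length { get; }
        public ref T this[nuint index] { get; }
    }
}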

@GPSnoopy changed the title from "Add 64-bit support to Array underlying storage" to "Add 64 bits support to Array underlying storage" on Apr 4, 2019
@GPSnoopy (Author) commented Apr 4, 2019

@GrabYourPitchforks interesting facts about the GC. I wasn't aware of such limitations.

A LargeArray<T> class could be a temporary solution. My main concern is that it would stay a very niche class with no interoperability with the rest of the .NET ecosystem, and ultimately would be an evolutionary dead end. I do like the idea of nuint/nint though.

My gut feeling is that Array, String and the GC are ripe for a 64-bit overhaul. We should bite the bullet and do it. So far I've been quite impressed by the .NET Core team's willingness to revisit old decisions and areas that Microsoft had kept shut in the past (e.g. SIMD/platform-specific instructions, float.ToString() roundtrip fixes, bug fixes that change backward compatibility, etc).

@juliusfriedman (Contributor) commented Apr 5, 2019

I guess I could palate a LargeArray, but if that's going to be implemented then I hope it doesn't create a new type which is not actually an Array. IMHO it would have been much easier to address this if we had created another subtype of Array internally instead of inventing Span; however, Span also solves other problems...

@GSPP commented Apr 22, 2019

These days 2GB arrays are barely enough for many applications to run reliably. RAM prices have stagnated for a few years now, but surely the industry will resolve that sooner or later. As RAM amounts resume increasing at a Moore's-law rate, this 2GB array issue will become very commonplace.

A LargeArray<T> type might be a good medium term solution. But will 2GB arrays not be very commonplace 10 years from now? Do we then want to litter application code and API surfaces with LargeArray<T>? It would often be a hard choice whether to go for LargeArray<T> or T[].

Thinking in the very long term it seems far better to find a way to fix T[].

@GSPP commented Apr 22, 2019

If 64 bit support is implemented there could be a tool that analyzes your code for legacy patterns (e.g. usage of int Length or the typical for loop for (int i = 0; i < array.Length; i++)). The tool should then be able to mass upgrade the source code. This could be a Roslyn analyzer.

@GrabYourPitchforks (Member) commented Apr 24, 2019

Since this would be such a massive ecosystem-breaking change, one other thing you'd probably have to do is analyze the entire graph of all dependencies your application consumes. If the change were made to T[] directly (rather than LargeArray<T> or similar), assemblies would need to mark themselves with something indicating "I support this concept!", and the loader would probably want to block / warn when such assemblies are loaded. Otherwise you could end up in a scenario where two different assemblies loaded into the same application have different views of what an array is, which would result in a never-ending bug farm.

@juliusfriedman (Contributor) commented Apr 26, 2019

Not if LargeArray were an Array, i.e. derived from it (if only internally, perhaps), like I suggested back in the Span threads.

@GrabYourPitchforks (Member) commented Apr 27, 2019

If large array is a glorified array, then you could pass a large array into an existing API that accepts a normal array, and you'd end up right back with the problems as originally described in this thread.

@GrabYourPitchforks (Member) commented Apr 27, 2019

Furthermore, I'm not sure I buy the argument that adding large array would bifurcate the ecosystem. The scenario for large array is that you're operating with enormous data sets (potentially over 2bn elements). By definition you wouldn't be passing this data to legacy APIs anyway since those APIs wouldn't know what to do with that amount of data. Since this scenario is so specialized it almost assumes that you've already accepted that you're limited to calling APIs which have been enlightened.

@juliusfriedman (Contributor) commented Apr 27, 2019

You have LongLength on Array already.

The only fundamental difference is that one lives on the LOH and one does not.

By the same token, Span wouldn't be able to hold more than that either, so a large Span would be needed as well...

@GPSnoopy (Author) commented Oct 2, 2019

From what I can gather and summarise from the above, there are two pieces of work.

  1. Update the CLR so that Array can work with 64-bit lengths and indices. This includes changes to the Array implementation itself but, as comments above have pointed out, also to System.String and the Garbage Collector. It is likely to be relatively easy to come up with a fork of coreclr that can achieve this, as a proof of concept with no regard for backward compatibility.

  2. Find a realistic way to achieve backward compatibility. This is the hard part. I think this is unlikely to succeed without compromising some aspect of the CLR. Whether it is Length throwing on overflow, or awkwardly introducing new specific classes like LargeArray.

But the more I think about it, the more I think this issue is missing the point and ultimately the real problem with .NET as it stands. Even if the initial proposal was to be implemented, it would only fix the immediate 64-bit issue with Array but still leave collections and Span with the same indexing and length limitations.

I've started reading The Rust Programming Language (kind of felt overdue) and it struck me that Rust also mimics C++ size_t and ssize_t with usize and isize. C#, on the other hand, somehow decided not to expose this CPU architectural detail and forces everyone to the lowest common denominator for most of its API: a 32-bit CPU with 32-bit addressing.

I'd like to emphasise that the 32-bit limitation is purely arbitrary from a user point of view. There is no such thing as a small array and a big array; an image application should not have to be implemented differently depending on whether it works with 2,147,483,647 pixels or 2,147,483,648 pixels. Especially when it's data driven and the application has little control over what the user is up to. Even more frustrating if the hardware has long been capable of it. If you do not believe me or think I'm talking nonsense, I invite you to learn how to program for MS-DOS 16-bit with NEAR and FAR pointers (hint: there is a reason why Doom required a 386 32-bit CPU).

Instead of tinkering around the edges, what is the general appetite for a more ambitious approach to fix this limitation?

Here is a controversial suggestion, a bit of a kick in the nest:

  • Take on @GrabYourPitchforks' idea and introduce nint and nuint (I also like size and ssize, but I can imagine a lot of clashes with existing code).
  • Allow implicit conversion from int to nint (and uint to nuint) but not the reverse.
  • Change all Length properties on Array, String, Span and collections to return nint (leaving aside the debate of signed vs unsigned and keeping the .NET convention of signed lengths).
  • Change all indexing operators on Array, String, Span and collections to only take nint.
  • Remove the LongLength properties and the old indexing operators.
  • Take a page from the Nullable Reference Types book and allow this to be an optional compilation feature (this is where it hurts).
  • Only allow an nint assembly to depend on another assembly if it's also using the new length types.
  • But allow a global or per-reference "I know what I'm doing" override, in which case calling the old int32 Length property on an Array is undefined (or just wraps around the 64-bit value).
  • Spend a decade refactoring for loops and var i = 0.

I understand this is far from ideal and can create uncomfortable ecosystem situations (Python 2 vs Python 3, anyone?). Open to suggestions on how to introduce size types in .NET in a way that doesn't leave .NET and C# increasingly irrelevant on modern hardware.

@MichalStrehovsky (Member) commented Oct 2, 2019

If we can solve the issue of the GC not tolerating variable-length objects bigger than 2 GB, a couple of things might make a LargeArray<T> with a native-word-sized Length more palatable:

  • Array length is already represented as a pointer-sized integer in the memory layout of arrays in .NET. On 64-bit platforms, the extra bytes serve as padding and are always zero.
  • If we were to introduce a LargeArray<T>, we can give it the same memory layout as existing arrays. The only difference is that the bytes that serve as padding for normal arrays would have a meaning for LargeArray<T>.
  • If the memory layout is the same, code that operates on LargeArray<T> can also operate on normal arrays (and for a smaller LargeArray<T>, vice-versa)
  • We can enable casting between LargeArray<T> and normal arrays
  • Casting from an array to LargeArray<T> is easy - we just follow the normal array casting rules (we could get rid of the covariance though, as @GrabYourPitchforks calls out). If we know the element types, it's basically a no-op.
  • When casting from LargeArray<T> to normal arrays, we would additionally check the size and throw InvalidCastException if the LargeArray<T> instance is too long.
  • Similar rules would apply when casting to collection types (e.g. you can cast LargeArray<T> to ICollection<T> only if the element count is less than MaxInt).
  • Existing code operating on arrays doesn't have to worry about Length being longer than MaxInt. Casting rules guarantee this can't happen.
  • LargeArray<T> would not derive from System.Array, but one could explicitly cast to it (the length-check throwing InvalidCastException would happen in that case). The same for non-generic collection types (e.g. casting to ICollection that has the Int32 Count property would only be allowed for a small LargeArray<T>).

This scheme would allow LargeArray<T> and normal arrays to co-exist pretty seamlessly.
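
Expressed as code, the casting rules above would behave roughly as follows (pseudo-code; LargeArray<T> is hypothetical):

LargeArray<int> big = new LargeArray<int>(3_000_000_000);   // more than int.MaxValue elements
int[] asArray = (int[])big;                                 // throws InvalidCastException: instance is too long
ICollection<int> asCollection = (ICollection<int>)big;      // same length check, also throws

LargeArray<int> small = new LargeArray<int>(100);
int[] fits = (int[])small;                                  // succeeds: same memory layout, length fits in Int32
LargeArray<int> back = (LargeArray<int>)fits;               // array -> LargeArray<T> always succeeds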

@kasthack commented Oct 2, 2019

@MichalStrehovsky

Similar rules would apply when casting to collection types (e.g. you can cast LargeArray to ICollection only if the element count is less than MaxInt).

This would make LargeArray<T> incompatible with a lot of older code in these cases:

  • Methods that currently accept a more specific type than they actually need (like ICollection<T> instead of IEnumerable<T> when the code just enumerates the values).

  • Probably, some cases where ICollection<T> isn't used directly but inherited from. For instance, methods that accept ISet/IList/IDictionary<...> (I assume those interfaces and their implementations would eventually be updated for 64-bit lengths as well), which inherit from ICollection<T>.

I would go with overflow checks when .Count is called to keep compatibility, and add a .LongCount property with a default implementation to the old interfaces.
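
A sketch of that suggestion using C# 8 default interface implementations (illustrative only; ICollection<T> has no LongCount member today):

public interface ICollection<T> : IEnumerable<T>
{
    int Count { get; }            // would throw on overflow for oversized collections
    long LongCount => Count;      // default implementation; large collections override it
    // ... remaining members unchanged
}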

@MichalStrehovsky (Member) commented Oct 2, 2019

@kasthack I was trying to come up with a compatible solution where one never has to worry about getting OverflowException in the middle of a NuGet package that one doesn't have the source for. Allowing cast to ICollection<T> to succeed no matter the size is really no different from just allowing arrays to be longer than 2 GB (no need for a LargeArray<T> type). Some code will work, some won't.

With explicit casting, it's clear that we're crossing a boundary into "legacy land" and we need to make sure "legacy land" can do everything it would reasonably be able to do with a normal ICollection<T> before we do the cast.

methods that accept ISet/IList/IDictionary<...>

Arrays don't implement ISet/IDictionary so a cast to these would never succeed. For IList, the same rules would apply as for ICollection (ICollection was just an example above).

@philjdf commented Oct 3, 2019

@GPSnoopy's post makes me wonder whether the following variation might make sense:

  1. Introduce new nint and nuint types, but don't change the signatures of anything to use them. Nothing breaks.
  2. Introduce new array types (with fixed covariance), new span types, etc., which use nint and nuint. Keep the old ones and don't touch them. Make it fast and easy to convert between old and new versions of these types (with an exception if your 64-bit value is too big to fit into the 32-bit counterpart), but conversion should be explicit. Nothing breaks, type safety and all that.
  3. Add a C# compiler switch /HeyEveryoneIts2019 so that when you write double[] you get the new type of array instead of the old one, everything's nint and nuint, and the compiler adds conservative/obvious conversions to/from old-style arrays when you call outside assemblies which want old-style arrays. This way, if it gets through the conversion without an exception, you won't break any old referenced code.

@GSPP commented Oct 3, 2019

It has been proposed that we could make Array.Length and array indices native-sized (nint or IntPtr).

This would be a portability issue. Code would need to be tested on both bitnesses which currently is rarely required for most codebases. Code on the internet would be subtly broken all the time because developers would only test their own bitness.

Likely, there will be language level awkwardness when nint and int come into contact. This awkwardness is a main reason unsigned types are not generally used.

In C languages the zoo of integer types with their loose size guarantees is a pain point.

I don't think we want to routinely use variable-length types in normal code. If nint is introduced as a type it should be for special situations. Likely, it is most useful as a performance optimization or when interoperating with native code.


All arrays should transparently support large sizes and large indexing. There should be no LargeArray<T> and no LargeSpan<T> so that we don't bifurcate the type system. This would entail an enormous duplication of APIs that operate on arrays and spans.

If the object size increase on 32 bit is considered a problem (it might well be) this could be behind a config switch.


Code that cares about large arrays needs to switch to long.

Likely, it will be fine even in the very long term to keep most code on int. In my experience, over all the code I ever worked on, it is quite rare to have large collections. Most collections are somehow related to a concept that is inherently fairly limited. For example, a list of customers will not have billions of items in it except if you work for one of 10 companies in the entire world. They can use long. Luckily for us, our reality is structured so that most types of objects do not exist in amounts of billions.

I see no realistic way to upgrade the existing collection system to 64 bit indexes. It would create unacceptable compatibility issues. For example, if ICollection<T>.Count becomes 64 bit, all calling code is broken (all arithmetic but also storing indexes somewhere). This must be opt-in.

It would be nicer to have a more fundamental and more elegant solution. But I think this is the best tradeoff that we can achieve.

@FraserWaters-GR commented Oct 8, 2019

Just note there is already a case today where Length can throw. Multi-dimensional arrays can be large enough that the total number of elements is greater than Int32.MaxValue. I think this is why LongLength was originally added.
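
For example (a sketch; requires a 64-bit process with gcAllowVeryLargeObjects enabled):

var big = new byte[46_341, 46_341];   // 2,147,488,281 elements in total, just over Int32.MaxValue
long total = big.LongLength;          // fine
int length = big.Length;              // throws OverflowException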

@huoyaoyuan (Contributor) commented Oct 8, 2019

Note: currently, when you write foreach over an array, the C# (and also VB) compiler actually generates a for loop and stores the index in 32 bits. This means that existing code would break with arrays of more than 2G elements, or at least a recompile would be required.

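For reference, the lowering looks roughly like this (a sketch of what the compiler emits for arrays today; Process is a placeholder):

foreach (byte b in buffer) { Process(b); }

// is compiled to approximately:
for (int i = 0; i < buffer.Length; i++)
{
    byte b = buffer[i];
    Process(b);
}
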
I really hope all size-like parameters across the ecosystem end up using nuint (which avoids checking for >= 0).
There could be a [SizeAttribute] on all such parameters, and the JIT could generate the positive guard and widening conversion to allow existing int-size-compiled assemblies to run against a native-sized corelib.

@msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
@msftgits added this to the Future milestone on Jan 31, 2020
@maryamariyan added the untriaged label on Feb 26, 2020
@lostmsu commented Jun 20, 2020

One option is to create a NuGet package for LargeArray<T>, and polish it in practical use. Once polished, make it part of the standard. This is what C++ did with parts of Boost.

But CLR should be ready by then.

@GPSnoopy (Author) commented Jul 2, 2020

But CLR should be ready by then.

@lostmsu Are you aware of ongoing work to fix this in CLR that I'm not aware of?

@lostmsu commented Jul 2, 2020

@GPSnoopy nope, that was an argument to start the CLR work ahead of time, before the BCL, so that the BCL could catch up.

@joperezr removed the untriaged label on Jul 2, 2020
@Frassle (Contributor) commented Mar 21, 2021

There has been discussion on that in this thread.

Yes, I saw some of it, but ideas of a LargeArray type that was segmented or used native memory seemed like no-gos to me because you can't interop them with existing Array-based code. #12221 (comment) is a good description of what could work for arrays (plus let's fix variance while we're at it; we can always add ReadOnlyArray for safe variance, like what's done with ImmutableArray).

Applications which deal with large data sets could use the native array / native span types.

This point I'm not so sure about. It's looking like the Array type might have to be duplicated, but I'm not sure that means we should duplicate all the other types. Consider that we probably want to apply this to the interfaces that arrays inherit from, like ICollection and IList, and once you have a large-sized array you could have a large-sized List and Stack. For those types where we can change the Length/Count property, I think it would be better to do that than to duplicate the types.

Take an example of an API that currently looks like void DoIt(Span<T> a, Span<T> b). If you want to use that API with a large span allocation, you either see a runtime error when calling it, raise a bug, and get the implementation fixed to access NativeLength instead of Length; or you get a compiler error, notice that a cast from LargeSpan to Span would throw (because you know your large spans are actually large), raise a bug, and get an overload added: void DoIt(LargeSpan<T> a, LargeSpan<T> b). For external libraries that's probably not too bad; they can generally be a bit more aggressive about trimming old functions and could delete the Span overload after a few releases, but applying this to the BCL leads to a lot of overloads hanging around.

Tradeoffs to think about. I think I'd prefer the duplication of methods and properties that are int-based when they need to be nint-based, rather than duplicating every type that has one of those methods or properties and then duplicating every method or property that uses the original type.

@GrabYourPitchforks (Member) commented Mar 21, 2021

TBH I think many of these concerns about "but now you have to enlighten everything about LargeArray / LargeSpan!" are overblown.

This is not much different from the situation a few years ago where all we had were array-based overloads of various APIs, and then we went and added Span-based overloads as priority and time permitted. The world didn't end, the ecosystem didn't become bifurcated, and applications moved on to the new APIs only if they derived significant value from doing so. We didn't bother updating many of the old collection types or interfaces because it was a paradigm shift.

There's no reason we couldn't follow the same playbook here. Not every single aspect of the runtime and library ecosystem needs to move forward simultaneously. It's possible to deliver a minimum viable product including an exchange type and basic (common) API support, then continue to improve the ecosystem over future releases as priority dictates and time permits.

@Joe4evr commented Mar 21, 2021

Take an example of an API that currently looks like void DoIt(Span<T> a, Span<T> b). If you want to use that API with a large span allocation, you either see a runtime error when calling it, raise a bug, and get the implementation fixed to access NativeLength instead of Length.

I imagine LargeSpan<T> would have some mitigation API on it like Span<T> GetSubSpan(nint start, int length) (note length here is specifically not nint) precisely for consuming APIs that wouldn't (yet) know about the new stuff.
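
For instance, a caller could feed a large span to a legacy Span-based API in bounded chunks (a sketch; LargeSpan<T> and GetSubSpan are hypothetical):

// Hypothetical: write a LargeSpan<byte> to a Stream using the existing Span-based overload.
static void WriteAll(Stream destination, LargeSpan<byte> data)
{
    nint offset = 0;
    while (offset < data.Length)                                 // Length is nint here
    {
        int chunk = (int)Math.Min(int.MaxValue, (long)(data.Length - offset));
        destination.Write(data.GetSubSpan(offset, chunk));       // ordinary Span<byte>, legacy API is none the wiser
        offset += chunk;
    }
}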

@Frassle (Contributor) commented Mar 22, 2021

So would one of the following proposals possibly get traction?

A) Add NativeLength

Background and Motivation

See #12221 for background discussion on this API.

Proposed API

 namespace System
 {
	public readonly ref struct Span<T>
	{
-		private readonly int _length;
+		private readonly nint _length;

+		public Span(void* pointer, nint length);
+		public nint NativeLength { get; }
+ 		public ref T this[nint index] {get; }
+		public Span<T> Slice(nint start, nint length);

 		public int Length
 		{
			[NonVersionable]
-			get => _length;
+			get {
+				if (_length > Int32.MaxValue) throw new InvalidOperationException();
+				return (int)_length;
+			}
 		}
	}
 }

Similar API changes would be made to ReadOnlySpan, Memory, and ReadOnlyMemory.

Usage Examples

This could be used with native memory allocations to get .NET-style views on large data sets.
Once libraries are updated to use NativeLength rather than Length, this would allow large data sets to make use of the normal .NET ecosystem.
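
For example (a sketch using the proposed members; the nint-length constructor, the this[nint] indexer and NativeLength below do not exist today):

// Wrap an 8 GB native allocation in a single Span<byte> using the proposed API.
long byteCount = 8L * 1024 * 1024 * 1024;
IntPtr memory = Marshal.AllocHGlobal((IntPtr)byteCount);
unsafe
{
    var view = new Span<byte>((void*)memory, (nint)byteCount);  // proposed constructor
    view[(nint)byteCount - 1] = 0xFF;                           // proposed nint indexer
    nint total = view.NativeLength;                             // full length, no overflow
}
Marshal.FreeHGlobal(memory);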

Risks

This incurs an overflow check on every access to Span.Length, which for a high-performance type is an unwanted cost.
It's not clear from the type system whether APIs are safe to call with Spans that are longer than Int32.MaxValue, thus risking unclear runtime exceptions.

B) Add NativeSpan

Background and Motivation

See #12221 for background discussion on this API.

Proposed API

namespace System
{
	public readonly ref struct NativeSpan<T>
	{
		public NativeSpan(void* pointer, nint length);
		public ref T this[nint index] { get; }
		public nint Length { get; }
		public NativeSpan<T> Slice(nint start);
		public NativeSpan<T> Slice(nint start, nint length);
		
		public static implicit operator NativeSpan<T>(Span<T> span);
		public static explicit operator Span<T>(NativeSpan<T> span);

		// Plus other methods that are present on Span<T>
	}
}

Similar classes would be made for NativeReadOnlySpan, NativeMemory, and NativeReadOnlyMemory.

Usage Examples

This could be used with native memory allocations to get .NET-style views on large data sets.
Once libraries are updated to use NativeSpan rather than Span, this would allow large data sets to make use of the normal .NET ecosystem.

Risks

This will cause noticeable API surface duplication as methods are slowly moved to use NativeSpan instead of Span.
The BCL will have to maintain both overloads for a significant time. This is even more noticeable for methods/properties that currently return Span (for example, MemoryExtensions.AsSpan() will also need MemoryExtensions.AsNativeSpan()).

Both

Either way, thinking about this made me wonder whether nint is actually the right type to use here. C++/Rust would use an unsigned native int (size_t in C++, usize in Rust). Historically the BCL would not use unsigned types on such a major API surface because they're not CLS compliant; is that still a going concern? If we do stick with nint, it does mean that on 32-bit systems a Span would still be limited to 2GB, despite most OSes being able to allocate a larger buffer than that on a modern 32-bit system. Also, there are always some groups looking into C# for kernel work, and a Span covering the entire address space could be nice.

@GrabYourPitchforks (Member) commented Mar 22, 2021

@Frassle Thanks for posting that. I think NativeSpan is more likely to get traction than Span.NativeLength, especially since as you called out Span is supposed to be a low-level, high-performance API. It wouldn't be acceptable for the Length property accessor to incur a check on every call. It also introduces confusion for consumers: if I'm calling an API that takes Span<T>, how am I supposed to know whether the API has been long-span enlightened? Presumably the docs would mention this, but do I now need to read the docs for every single API I call? That puts significant burden on callers.

For NativeSpan, we can improve the ecosystem piecemeal. I honestly think the set of NativeSpan overloads we add within the runtime and libraries would be fairly small and that it won't lead to overload explosion like some on this thread fear. We're talking overloading really low-level APIs like MemoryExtensions.IndexOf(this ReadOnlyNativeSpan<T>, ...), plus maybe some things on MemoryMarshal. You're not going to see APIs like Stream.Write(RONS<byte>), MemoryExtensions.ToUpperInvariant(RONS<char>), NativeSpan<T>.ToArray(), etc.

To all: If we had a NativeSpan<T> or a ReadOnlyNativeSpan<T>, what overloads would you want to see?

(Also, I'd prefer NativeSpan<T>.Length to be typed as nuint instead of nint, but that's just me. It has implications for the signature of MemoryExtensions.IndexOf and friends, but we can cross that bridge when we come to it.)

Edit: I'm not sold on NativeMemory or ReadOnlyNativeMemory yet. Those both have significant implications for the runtime, GC, and friends. They also have implications for the normal Memory<T> and ReadOnlyMemory<T> types. NativeSpan and ReadOnlyNativeSpan have much reduced impact, since their changes are limited to the exchange types themselves, whatever overloads we add, and some JIT optimizations.

@jkotas (Member) commented Mar 22, 2021

if I'm calling an API that takes Span, how am I supposed to know whether the API has been long-span enlightened?

This could be solved using analyzers, e.g. add a [NativeSpanEnlightened] attribute and analyzers that disallow calling Length in any [NativeSpanEnlightened] context.

@GrabYourPitchforks (Member) commented Mar 22, 2021

@jkotas That runs into the same problems as nullability. It introduces a marker attribute which serves as an API contract. Anything that performs code generation or dynamic dispatch would need to be updated to account for it, just as if they were running a light form of the Roslyn analyzer at runtime. Though to be honest I don't think we have a significant number of components which perform codegen over span-consuming methods.

It also doesn't solve the problem of somebody passing you (perhaps as a return value) a span whose length exceeds Int32.MaxValue, which means that your own code could encounter an exception in a place where you're not expecting such exceptions to occur. Could solve it by making this attribute viral and apply both to input and output / return parameters, just like nullability. And by crafting special compiler support for it. :)

@jkotas (Member) commented Mar 23, 2021

It also doesn't solve the problem of somebody passing you (perhaps as a return value) a span whose length exceeds Int32.MaxValue

This part can be solved by having long Span support as an optional runtime mode. Span.Length would either throw unconditionally, or be slower and throw on overflow, in this mode (i.e. there can be two variants of this mode).

It is similar to how we are dealing with linkability. We are marking APIs and patterns that are linker friendly. If your app uses linker-unfriendly APIs or patterns, it has unspecified behavior. You have to fix the code if you want to be sure that it works well.

I understand that the analyzer approach is not without issues, but I think it is worth having it on the table. If we went with the NativeSpan approach, we would have to introduce NativeSpan overloads for all APIs that take Span in the limit. It feels like too high a price to pay.

@GrabYourPitchforks (Member) commented Mar 23, 2021

If we went with the NativeSpan approach, we would have to introduce NativeSpan overloads for all APIs that take Span in the limit. It feels like too high a price to pay.

That concern is understandable, but I don't think it's likely to come to fruition in practice. I spent a few minutes looking through System.IO, System.Runtime, and other ref assemblies for public APIs which take [ReadOnly]Span<T>, and I just don't see the value in adding NativeSpan overloads of most of them. I don't see us adding NativeSpan overloads for any of our TryParse or TryFormat methods, for instance, and those alone seemingly account for over half of all spanified APIs. To be honest, I'd be surprised if we end up enlightening more than ~5% of all span APIs. To me, that seems like an acceptable price to pay to avoid complicating normal Span<T>.

All of that said, I think network-style scenarios are going to be more common than big data scenarios. And network-style scenarios may involve lots of data being sharded across multiple buffers via ReadOnlySequence<T>. We still don't enjoy widespread ecosystem support for that type across our spanified API surface. And the fact that we haven't created many overloads for this type tells me that we in practice aren't going to create many overloads for native span.

If we were to identify which APIs we wanted to create ReadOnlySequence<T> overloads for, that would allow us to focus on both the network and the big data problem, as you can always wrap a ReadOnlySequence<T> around an arbitrarily large native memory buffer.

@Joe4evr commented Mar 23, 2021

If we went with the NativeSpan approach, we would have to introduce NativeSpan overloads for all APIs that take Span in the limit.

As I mentioned previously, that could be covered by giving NativeSpan a mitigation API that creates (up to) Int32.MaxValue-sized slices. I believe that can keep most existing APIs none the wiser, and only a scant few may need new overloads because they can be slightly more optimized.

@GPSnoopy (Author) commented Mar 23, 2021

A few colleagues and I had some thoughts about how we could come up with a reasonable proposal along the lines of nullables (which seemed to be the favoured approach in our small group) and that could still be implemented as a PR by one or two developers. Since there have been many different ideas flying around in the comments, I've put this into a separate draft document.

Latest version of the proposal can be found at https://github.com/GPSnoopy/csharplang/blob/64bit-array-proposal/proposals/csharp-10.0/64bit-array-and-span.md

Expand arrays, strings and spans to native-int lengths

Summary

Expand arrays (e.g. float[]), string, Span<T> and ReadOnlySpan<T> to have 64-bit lengths on 64-bit platforms (also known as native lengths).

Motivation

Currently in C#, arrays (and the other aforementioned types) are limited to storing and addressing up to 2^31 (roughly 2 billion) elements. This limits the amount of data that can be contiguously stored in memory and forces the user to spend time and effort in implementing custom non-standard alternatives like native memory allocation or jagged arrays. Other languages such as C++ or Python do not suffer from this limitation.

With every passing year, due to the ever-continuing increase in RAM capacity, this limitation is becoming evident to an ever-increasing number of users. For example, machine learning frameworks often deal with large amounts of data and dispatch their implementation to native libraries that expect a contiguous memory layout.

At the time of writing (2021-03-23), the latest high-end smartphones have 8GB-12GB of RAM while the latest desktop CPUs can have up to 128GB of RAM. Yet when using float[], C# can only allocate up to 8GB of contiguous memory.

Proposal

The proposed solution is to change the signature of arrays, string, Span<T> and ReadOnlySpan<T> to have an nint Length property and an nint indexing operator (similar to what C++ does with ssize_t), respectively superseding and replacing the int Length and int indexing operator.

As this would break existing C# application compilation, an opt-in mechanism similar to what is done with C# 8.0 nullables is proposed. By default, assemblies are compiled with legacy lengths, but new assemblies (or projects that have been ported to the new native lengths) can opt in to being compiled with native lengths support.

Future Improvements

This proposal limits itself to arrays, string, Span<T> and ReadOnlySpan<T>. The next logical step is to tackle all the standard containers such as List<T>, Dictionary<T>, etc.

Language Impacts

When an assembly is compiled with native lengths, the change of type for lengths and indexing from int to nint means that existing code will need to be ported to properly support this new feature. Specifically the following constructs will need updating.

for (int i = 0; i < array.Length; ++i) { /* ... */ } // If the length is greater than what's representable with int, this will loop forever.

for (var i = 0; i < array.Length; ++i) { /* ... */ } // Same as above. A lot of code is written like this rather than using an explicit int type.

The correct version should be the following.

for (nint i = 0; i < array.Length; ++i) { /* .... */ }

Unfortunately, due to implicit conversion the C# 9.0 compiler does not complain when comparing an int to an nint. It is therefore proposed that the compiler warns when an implicit conversion occurs from int to nint for the following operations: a < b, a <= b, a > b, a >= b and a != b.

Runtime Impacts

In order to accommodate these changes, various parts of the dotnet 64-bit runtime need to be modified. Using C# nint and C++ ssize_t, these modifications are in theory backward compatible with the existing 32-bit runtime.

The proposal assumes that these are always enabled, irrespective of whether any assembly is marked as legacy lengths or native lengths.

  • The C++ runtime implementation of .NET arrays and strings need to use ssize_t to store their length (TODO I believe this is already the case, verify it).
  • The .NET runtime implementation of Span<T> and ReadOnlySpan<T> need to internally store their length as nint.
  • The .NET runtime implementation of System.Array needs to be updated to reflect the same type changes.
  • The JIT needs to be aware of native lengths when generating code that accesses arrays and strings, for both the Length property and the indexing operator.
  • The C# compiler and JIT compiler need to generate foreach code that is nint aware for arrays, string, Span<T> and ReadOnlySpan<T>.
  • The GC implementation currently assumes that containers and arrays have up to 2^31 elements and outgoing references. This limitation needs to be lifted.

Boundaries Between Assemblies

Calling Legacy Lengths Assembly From Native Lengths Assembly

Without any further change, a native lengths assembly passing an array (or any other type covered by this proposal) with a large length (i.e. greater than 2^31) to a legacy lengths assembly would result in undefined behaviour. The most likely outcome would be the JIT truncating the Length property to its lowest 32 bits, causing the callee assembly to only loop over a subset of the given array.

The proposed solution is to follow C# 8.0 nullables compiler guards and warn/error when such a boundary is crossed in an incompatible way. As with nullables, the user can override this warning/error (TODO exact syntax TBD, can we reuse nullables exclamation mark? Or do we stick to nullable-like preprocessor directives? IMHO the latter is probably enough).

In practice, applications rarely need to pass large amounts of data across their entire source code and dependency graph. We foresee that most applications will only need to make a small part of their source code and dependencies native-lengths compatible, enabling a smoother transition to a future native-lengths-only version of C#.

Calling Native Lengths Assembly From Legacy Lengths Assembly

The proposed aforementioned runtime changes mean this should work as expected.

@GrabYourPitchforks (Member) commented Mar 23, 2021

@GPSnoopy That proposal seems to capture things fairly well. A few additions / changes I'd recommend:

  • Currently strings store their backing length as uint32, not as size_t. Changing the backing length to be native word sized will increase the size of all string instances by 4 bytes on 64-bit platforms. It also may necessitate changes to debugger tooling like sos, which I believe currently hard-code the layout of string instances.
  • Do you see changing [ReadOnly]Memory<T> at the same time as array, string, span?

Additionally, do you have a proposal for what we do with publicly-exposed System.* APIs? Consider the following code sample.

byte[] bytes = GetSomeBytes();
byte[] moreBytes = GetMoreBytes();
byte[] concat = bytes.Concat(moreBytes).ToArray(); // System.Linq

Presumably runtime-provided methods like Enumerable.ToArray<T>(this IEnumerable<T> @this) should be callable both from a legacy-length assembly and a native-length assembly. But the immediate caller would likely expect the runtime behavior to depend on the compilation mode of the caller. Otherwise you could end up with a native-length caller invoking this API and getting an improper OutOfMemoryException or a legacy-length caller invoking this API and getting back a too-large array. Do we need to make a modification to your proposal to account for this scenario?

@GSPP commented Mar 24, 2021

I honestly think the set of NativeSpan overloads we add within the runtime and libraries would be fairly small

I believe that is true. Currently, most collections in practical applications contain far less than 2 billion elements. Why is that? It's not because of technology limitations. It's because IT applications usually model real world concepts and the real world very rarely has more than 2 billion items for any type of object. Even Facebook has only 1 billion users.

I find it remarkable that the need for more than 2 billion elements is so exceedingly rare.

If .NET adds support for large arrays, then the resulting collections and algorithms should be considered to be rarely used, not mainstream. Large arrays are an enabling feature so that more advanced libraries can be built on top. Large arrays should not be seen as a centerpiece of the platform permeating everything.

@GPSnoopy (Author) commented Mar 24, 2021

@GrabYourPitchforks Thank you for the feedback, you raise valid points. I've tried to address them, or at least highlight the gaps in https://github.com/GPSnoopy/csharplang/commit/830a3d8d3898b9a26066eee09e3493f9691d2edf (and a small addendum in https://github.com/GPSnoopy/csharplang/commit/804df79a564e3515cda862a4afea5f513fde4d5a).

  • I've checked the CLR source code. You are correct. I've highlighted that the proposal is okay with extending the memory used by strings, but marked it in such a way that this is not a clear-cut decision and therefore might need to be revisited. Also added the impact on debuggers and other external tools.
  • [ReadOnly]Memory<T> is an oversight on my part. I've added them to the proposal.
  • Good point about System.Linq. I realise that the proposal only considered passing arrays & co as function arguments, but not when returning them. For the moment I suggest a runtime check when crossing boundaries in that case.

@GrabYourPitchforks (Member) commented Mar 24, 2021

It is therefore proposed that the JIT compiler adds an automatic runtime length check whenever a method ... crosses the boundary from a native lengths assembly to a legacy lengths assembly.

This would not be sufficient. The problem is indirection, such as through interfaces and delegate dispatch. It's not always possible to know what behavioral qualities the caller / target of an indirection possesses. (It's one of the reasons .NET Framework's CAS had so many holes in it. Indirection was a favorite way of bypassing the system.)

In order to account for this, you'd need to insert the check at entry to every single legacy method, and on the return back to any legacy caller. It cannot be done only at legacy / native boundaries.

Edit: It also didn't address the problem I mentioned with LINQ and other array-returning methods. The caller might expect the runtime method implementation to have a specific behavior, such as succeeding or throwing, independent of any checks that occur when boundaries are crossed. For example, MemoryMarshal.AsBytes is guaranteed to throw OverflowException if the resulting span would have more than 2bn elements. Array.CreateInstance is guaranteed to throw ArgumentOutOfRangeException or OutOfMemoryException. And so on. Making all of these methods native-aware and having the check performed at the return would change the caller-observable behavior when these methods are invoked by legacy applications.

Given the above paragraph, it might be appropriate to say "this new runtime capability changes contracts for existing APIs, even if no user assembly in the application is native-aware." It's fine to call out stuff like that in the doc, since it contributes to the trade-off decisions we need to make.

Edit x2: Oh, and fields and ref arguments! Those have visible side-effects even in the middle of method execution. (Example: the ref is to a field of an object, and that field is visible to other threads running concurrently.) Detecting whether a native assembly is about to set a field in use by a legacy assembly isn't something the JIT can know at method entry or method exit. This limitation should also be captured, as it influences whether the Array.Length property could be a simple narrowing cast or whether we'd need to insert "if out of bounds" logic into it.

@tannergooding (Member) commented Mar 24, 2021

While I agree that most of the ecosystem likely won't receive large data as inputs, I think it's important that the ecosystem be encouraged to simply handle it.

When you are talking about large data, there isn't really a difference between naively handling 500 million elements vs 2 billion elements vs 1 trillion or more elements. I'd expect most code paths are likely optimized for a few hundred to a few thousand inputs at best, with a few special cases being optimized for millions of inputs.

I'd think you'd then have a few cases:

  • Trivial for loops which are easy to analyze and recognize. These can simply be updated to use .NativeLength rather than .Length and to ensure x cmp y is nint cmp nint
  • foreach loops likely need little to no change unless you are explicitly tracking a count
  • LINQ methods where you use things like .Count or .ToList. These are already potentially problematic because nothing requires an enumerable to be less than 2 billion elements or even finite in size. LINQ already handles this by having a checked context around the fallback counter.

There are also two distinct cases for moving to NativeLength. One of which is on managed types (like T[] or List<T>) where it is largely a potential perf optimization (the JIT actually needs changes here for this to work as expected) and future proofing and the other of which is Span<T> (and possibly Memory<T>) where there is an immediate use case with native data.

Without the GC revving, there is actually no change to user code allocating managed arrays. Just a recommendation to switch to NativeLength for forward compatibility. Therefore, there can be no perf hit "today" for users working with Array.Length until said changes happen.

While with Span<T> as soon as we expose it (assuming we also expose a constructor that takes nint), users can start creating incompatible versions. This means Span<T> paths need to update or may end up having a checked downcast from nint inserted into their code.

I would imagine any checking we do could be public int Length => checked((int)(NativeLength)) and/or inserted by the JIT as part of bounds checking, which, just like bounds checks, could be elided in many common scenarios.

@GPSnoopy (Author) commented Mar 26, 2021

@GrabYourPitchforks Added your concerns verbatim to the proposal document.

Personally I would like to see if we can avoid adding such a check absolutely everywhere (e.g. what about non-virtual method calls?). Or find a better alternative.

@GrabYourPitchforks (Member) commented Mar 26, 2021

Personally I would like to see if we can avoid adding such a check absolutely everywhere (e.g. what about non-virtual method calls?).

Reflection and delegate dispatch may involve indirection to non-virtual methods. And it's not always possible for the runtime to reason about the caller or the target of such indirection. (See also my earlier comments on CAS and why it was so broken.)

@tfenise commented Mar 28, 2021

It's not just about compatibility; that is, it's not as if no problems would arise if we were to redesign everything from scratch. To use large arrays, indices also need to be extended to native size (64 bits), and that introduces problems.

verelpode has already pointed out:

Always? That's controversial. For example, if 64-bit nint is used inside a Dictionary<int,int> for ordinals/indexes (hash buckets), then its RAM consumption would increase to 160% of its current consumption, also leading to increased runtime duration and processor L1/L2 cache line misses. Likewise for HashSet<T> (150-160%).

Combined with apps needing multiple instances of these collection objects, this is a significant performance degradation in exchange for zero benefit for most programs (normal programs that don't process big-data).

How do we deal with this possible performance degradation in such collection types if ordinary arrays were allowed to be very large?

  1. Just extend the indices to native size (64 bits). Ignore everyday programmers' complaints. This doesn't sound very good.
  2. Do not extend. Keep those collections as they are, and state that they do not work with too many elements. The problem is that all code that uses those collections may not work if given large array inputs. For example, suppose I as a library developer write:
public long[] Distinct(long[] array)
{
    HashSet<long> set = new();
    foreach (long x in array)
    {
        set.Add(x);
    }
    return set.ToArray();
}

This does not work if array contains too many distinct elements. To make this code large-array-aware, I need to know that HashSet<> does not support that many elements and correct for it, probably implementing my own large-array-aware MyHashSet<>. To make sure the corrected code is really large-array-aware, I need to write tests against large array inputs. Such a large long[] is at least 16GB, and I need a high-end computer with 32+GB of RAM to run such tests. Probably I would decide that large arrays are so rare that I'd feel lazy and simply drop large-array support.
  3. Make the collections smart in a way that they use 32-bit indices when there are not so many elements, and automatically switch to 64-bit indices when there are. This may still degrade performance, and adds complexity.

Even if the BCL took the third approach, third-party collection developers may still feel lazy and choose other approaches instead, using either int or nint for indices. Arrays being indexed through nint, and the inability to index arrays with int without warnings, may be enough to make them choose nint and thus avoid the second approach (which looks the worst), but not always. There are non-array-based collections such as binary trees, and a binary tree implementation may happen to use int for indices and counts while getting no warning, which amounts to the second and worst approach.

All of these "support-large-arrays-or-not" troubles are not present currently, because int is the single most "natural" integral type in .NET/C#, so code probably works given inputs that are not too big relative to int. If nint became the recommended way to index arrays, then there would be two "natural" integral types in .NET/C#: int and nint. There would be no such guarantee that code works given inputs that are not too big relative to nint.

Personally I'd vote for introducing new types like LargeArray<T> or NativeSpan<T>, rather than enabling the existing T[] to be large. As I said, enabling the existing T[] to be large introduces a second "natural" integral type in .NET/C#, and there would be no guarantee that code works given inputs that are not too big relative to nint. Keeping a single array type makes no difference if there is code that supports large arrays and code that does not. Introducing new types makes it explicit on the API surface whether a library supports large arrays. What's more, large arrays are not common, and may need special (e.g. vectorized, memory-saving) measures to be dealt with efficiently anyway.

@hxmcn commented Nov 21, 2021

My suggestion: the next version of .NET (.NET 7, or .NET 8 for LTS) should be 64-bit only, and make everything 64-bit as far as possible.
And there should be no property named LongSize/LongLength, just Size/Length, but with Int64 as the data type. Sure, that's a breaking change, but from a long-term perspective such a change is acceptable and intuitive.

@msftbot (bot) added the needs-further-triage label on Nov 21, 2021
@SupinePandora43 commented Mar 14, 2022

I kinda support breaking backwards compatibility, as it complicates writing modern code.
But I think it's possible to keep binary compatibility while using native-sized integers for Length properties.
Here's my idea: compiler support to automatically replace Length calls with a hidden UnsignedNativeLength.
So the compiler would compile:

public ref struct Span<T> {
    private readonly nuint length;

    public nuint Length { get => length; }
}

to

public ref struct Span<T> {
    private readonly nuint length;

    // Compiler visible
    public nuint UnsignedNativeLength { get => length; }
    // Binary visible only
    public int Length { get => (int) Math.Min(int.MaxValue, UnsignedNativeLength); } // use some tricks here
}

And

nuint length = span.Length;

to

nuint length = span.UnsignedNativeLength;

Pros

  • Binary compatible
  • Length returns nuint

Cons

Performance

  • Cast performance (nuint -> int, int -> nuint)

Compatibility

  • All previously-written code won't be able to use more than int.MaxValue items unless it uses nuint and is recompiled.
  • All previously-written code will get "nuint Length -> int cast" warnings.

IL

  • Strange nuint Length to nuint UnsignedNativeLength conversion
  • Compiler and IDE support to report Length as nuint type

(You can replace nuint in my code with nint, but I think we should use nuint because Length can't be less than 0.)

This could be prototyped for research on some Roslyn branch.

@hez2010 (Contributor) commented Mar 18, 2022

I think this is becoming increasingly urgent.
We even hit this issue in the compiler: #66787
