Add 64 bits support to Array underlying storage #12221
Comments
One of the simplest breaks becomes any program that is doing the following:

```csharp
for (int i = 0; i < array.Length; i++)
{
}
```

This works fine for any existing program, since CoreCLR actually defines a limit that is just under int.MaxValue.
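As a point of comparison, here is a sketch of the same loop written against the existing 64-bit-aware members. It changes nothing today (allocations are still capped), but it is the shape of code that would survive larger arrays; the method name is made up for illustration.

```csharp
static long CountNonZero(byte[] array)
{
    long count = 0;
    // Compare against LongLength and index with a 64-bit counter;
    // C# array indexers already accept long indices.
    for (long i = 0; i < array.LongLength; i++)
    {
        if (array[i] != 0) count++;
    }
    return count;
}
```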
I frequently run into this limit when processing large chunks of data from network streams into byte arrays, and always need to implement chunking logic in order to be able to process the data.
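For reference, the kind of chunking workaround being described might look like the following minimal sketch (the method name and chunk size are illustrative, not taken from the comment above).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkedReader
{
    // Reads an arbitrarily large stream into a list of fixed-size chunks,
    // because a single byte[] cannot exceed roughly 2^31 elements.
    public static List<byte[]> ReadAllChunks(Stream source, int chunkSize = 256 * 1024 * 1024)
    {
        var chunks = new List<byte[]>();
        while (true)
        {
            byte[] chunk = new byte[chunkSize];
            int filled = 0;
            while (filled < chunk.Length)
            {
                int read = source.Read(chunk, filled, chunk.Length - filled);
                if (read == 0) break;
                filled += read;
            }
            if (filled == 0) break;                        // nothing left in the stream
            if (filled < chunk.Length) Array.Resize(ref chunk, filled);
            chunks.Add(chunk);
            if (filled < chunkSize) break;                 // short chunk means end of stream
        }
        return chunks;
    }
}
```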
@tannergooding That's why I proposed above to keep the existing behaviour of throwing OverflowException on Length when there are more than int.MaxValue elements. Nothing changes there. I'm suggesting that changing the CLR implementation as proposed above would allow applications and people who want to use large arrays to do so without breaking existing applications. You are right that simply passing a large array into a library that does not support it will break, but at least this will give us a choice. We need to start somewhere, and .NET cannot keep ignoring this problem.
Yesterday I happily created an array in Python which contained more than two billion elements. When can I do the same in .NET? Currently we get an exception if we try to construct an array with more than 2B elements. What's wrong with deferring that exception until something calls the int-returning Length property?
This is an interesting idea, but I wonder if it would be better to have a dedicated LargeArray<T> type instead. If we had a theoretical LargeArray<T>, …
@GrabYourPitchforks Interesting facts about the GC; I wasn't aware of such limitations. My gut feeling is that a LargeArray<T> …
I guess I could stomach a LargeArray, but if that's going to be implemented then I hope it doesn't create a new type which is not actually an Array. IMHO it would have been much easier to address this if we had created another subtype of Array internally instead of inventing Span; however, Span also solves other problems....
These days 2GB arrays are barely enough for many applications to run reliably. RAM prices have stagnated for a few years now, but surely the industry will resolve this problem sooner or later. As RAM amounts resume increasing at a Moore's-law rate, this 2GB array issue will become very commonplace. Thinking in the very long term, it seems far better to find a way to fix Array itself.
If 64-bit support is implemented, there could be a tool that analyzes your code for legacy patterns (e.g. usage of int loop counters over array.Length).
Since this would be such a massive ecosystem-breaking change, one other thing you'd probably have to do is analyze the entire graph of all dependencies your application consumes. If the change were made to Array itself, …
Not if large array were an Array... i.e. derived from it, if only internally perhaps, like I suggested back in the Span threads.
If large array is a glorified array, then you could pass a large array into an existing API that accepts a normal array, and you'd end up right back with the problems as originally described in this thread.
Furthermore, I'm not sure I buy the argument that adding large array would bifurcate the ecosystem. The scenario for large array is that you're operating with enormous data sets (potentially over 2bn elements). By definition you wouldn't be passing this data to legacy APIs anyway since those APIs wouldn't know what to do with that amount of data. Since this scenario is so specialized it almost assumes that you've already accepted that you're limited to calling APIs which have been enlightened.
You have LongLength on the Array. The only fundamental difference is that one is on the LOH and one is not. By virtue of the same fact, a Span wouldn't be able to hold more than that either, so a large span must be needed as well...
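A small sketch of the current state on a 64-bit runtime: the 64-bit LongLength property already exists, but the allocation itself is still rejected (the exact exception type depends on the runtime and the size requested, hence the broad catch).

```csharp
using System;

class Demo
{
    static void Main()
    {
        // The Array API already exposes a 64-bit length...
        var small = new byte[16];
        Console.WriteLine(small.LongLength);      // 16, typed as long

        // ...but the runtime still refuses to allocate more than ~2^31 elements.
        try
        {
            var big = new byte[3_000_000_000L];   // > int.MaxValue elements
            Console.WriteLine(big.LongLength);
        }
        catch (Exception e)                       // OverflowException or OutOfMemoryException
        {
            Console.WriteLine(e.GetType().Name);
        }
    }
}
```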
From what I can gather and summarise from the above, there are two pieces of work.
But the more I think about it, the more I think this issue is missing the point and ultimately the real problem with .NET as it stands. Even if the initial proposal were to be implemented, it would only fix the immediate 64-bit issue with Array but still leave collections and Span with the same indexing and length limitations. I've started reading The Rust Programming Language (kind of felt overdue) and it struck me that Rust also mimics C++ size_t and ssize_t with usize and isize. C#, on the other hand, somehow decided not to expose this CPU architectural detail and forces everyone to the lowest common denominator for most of its API: a 32-bit CPU with 32-bit addressing. I'd like to emphasise that the 32-bit limitation is purely arbitrary from a user point of view. There is no such thing as a small array and a big array; an image application should not have to be implemented differently depending on whether it works with 2,147,483,647 pixels or 2,147,483,648 pixels. Especially when it's data driven and the application has little control over what the user is up to. It's even more frustrating given that the hardware has long been capable of it. If you do not believe me or think I'm talking nonsense, I invite you to learn how to program for MS-DOS 16-bit with NEAR and FAR pointers (hint: there is a reason why Doom required a 386 32-bit CPU). Instead of tinkering around the edges, what is the general appetite for a more ambitious approach to fix this limitation? Here is a controversial suggestion, bit of a kick in the nest:
…
I understand this is far from ideal and can create uncomfortable ecosystem situations (Python 2 vs Python 3, anyone?). Open to suggestions on how to introduce size types in .NET in a way that doesn't leave .NET and C# more irrelevant on modern hardware each year.
If we can solve the issue with the GC not being tolerant of variable-length objects bigger than 2 GB, …
This scheme would allow …
This would make …
I would go with overflow checks when …
@kasthack I was trying to come up with a compatible solution where one never has to worry about getting … With explicit casting, it's clear that we're crossing a boundary into "legacy land" and we need to make sure "legacy land" can do everything it would reasonably be able to do with a normal array.
Arrays don't implement ISet/IDictionary, so a cast to these would never succeed. For IList, the same rules would apply as for ICollection (ICollection was just an example above).
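In isolation, the overflow-check behaviour being discussed boils down to a checked narrowing whenever legacy, int-based code asks for a length or count. A minimal sketch, not tied to any specific API (the helper name is made up):

```csharp
using System;

static class LegacyInterop
{
    // Converting a 64-bit length for an int-based ("legacy") consumer either
    // succeeds or fails loudly with OverflowException -- it never silently truncates.
    public static int ToLegacyLength(long length) => checked((int)length);
}

// Usage sketch:
// int count = LegacyInterop.ToLegacyLength(someLargeCollectionLength);
```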
@GPSnoopy's post makes me wonder whether the following variation might make sense: …
It has been proposed that we could make lengths and indices native-sized (nint). This would be a portability issue. Code would need to be tested on both bitnesses, which currently is rarely required for most codebases. Code on the internet would be subtly broken all the time because developers would only test their own bitness. Likely, there would also be language-level awkwardness when nint and int are mixed. In C languages the zoo of integer types with their loose size guarantees is a pain point. I don't think we want to routinely use variable-length types in normal code.

All arrays should transparently support large sizes and large indexing. There should be no separate large array type. If the object size increase on 32 bit is considered a problem (it might well be), this could be behind a config switch. Code that cares about large arrays needs to switch to 64-bit indexing. Likely, it will be fine even in the very long term to keep most code on int.

I see no realistic way to upgrade the existing collection system to 64-bit indexes. It would create unacceptable compatibility issues. For example, if … It would be nicer to have a more fundamental and more elegant solution. But I think this is the best tradeoff that we can achieve.
Just note there is already a case today where Length can throw: multi-dimensional arrays can be large enough that the total number of elements is greater than int.MaxValue. I think this is why LongCount was originally added.
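That existing corner case can be reproduced directly (assuming a 64-bit runtime with very large objects enabled and roughly 3 GB of free memory; purely illustrative):

```csharp
using System;

class MultiDimDemo
{
    static void Main()
    {
        // ~3 billion elements in total, spread over two dimensions.
        var huge = new byte[3, 1_000_000_000];

        Console.WriteLine(huge.LongLength);   // 3000000000

        try
        {
            Console.WriteLine(huge.Length);   // the int-returning property
        }
        catch (OverflowException)
        {
            // Documented behaviour: Length throws when a multi-dimensional array
            // contains more than int.MaxValue elements.
            Console.WriteLine("Length overflowed");
        }
    }
}
```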
Note: currently, when you write … I really hope all size-like parameters across the ecosystem end up using …
One option is to create a NuGet package for a LargeArray<T> type first. But the CLR should be ready by then.
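For a sense of what such a package could ship, here is a minimal sketch; the name LargeArray<T> comes from the discussion above, but the shape and the chunked backing store are made up for illustration (a real implementation would want runtime/GC support for a single contiguous object).

```csharp
using System;

public sealed class LargeArray<T>
{
    private const int ChunkSize = 1 << 30;    // elements per backing chunk
    private readonly T[][] _chunks;

    public long Length { get; }

    public LargeArray(long length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        Length = length;
        _chunks = new T[(length + ChunkSize - 1) / ChunkSize][];
        long remaining = length;
        for (int i = 0; i < _chunks.Length; i++, remaining -= ChunkSize)
        {
            _chunks[i] = new T[Math.Min(remaining, ChunkSize)];
        }
    }

    public ref T this[long index]
    {
        get
        {
            if ((ulong)index >= (ulong)Length) throw new IndexOutOfRangeException();
            return ref _chunks[index / ChunkSize][index % ChunkSize];
        }
    }
}
```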
@lostmsu Are you aware of ongoing work to fix this in CLR that I'm not aware of?
@GPSnoopy nope, that was an argument to start the CLR work ahead of time before BCL, so that BCL could catch up.
Yes, I saw some of it, but ideas of a LargeArray type that was segmented, or that used native memory, seemed like no-gos to me because you can't interop them with existing Array-based code. #12221 (comment) is a good description of what could work for Array (+ let's fix variance while we're at it; we can always add ReadOnlyArray for safe variance, like what's done in ImmutableArray)
This point I'm not so sure about. It's looking like the Array type might have to be duplicated, but I'm not sure that means we should duplicate all the other types. Consider that we probably want to apply this to the interfaces that arrays inherit from, like ICollection and IList, and once you have a large-sized array you could have a large-sized List and Stack. For these types, where we can change the Length/Count property, I think it would be better to do that than to duplicate the types. Take an example of an API that currently looks like the one sketched below; there are tradeoffs to think about. I think I'd prefer the duplication of methods and properties that are int-based when they need to be nint-based, rather than duplicating every type that has one of those methods or properties and then duplicating every method or property that uses the original type.
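A hypothetical illustration of that trade-off (the interface names and members below are invented for this sketch, not existing BCL APIs):

```csharp
// Today: an int-based API surface.
public interface IBuffer<T>
{
    int Count { get; }
    T this[int index] { get; set; }
}

// Option 1 (preferred above): keep the type and add native-sized members alongside the int ones.
public interface IBufferOption1<T>
{
    // Existing members, kept for compatibility.
    int Count { get; }
    T this[int index] { get; set; }

    // New native-sized members added alongside them.
    nint NativeCount { get; }
    T this[nint index] { get; set; }
}

// Option 2: duplicate the type itself, which then forces duplicating every method
// or property elsewhere that takes or returns the original type.
public interface ILargeBuffer<T>
{
    nint Count { get; }
    T this[nint index] { get; set; }
}
```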
TBH I think many of these concerns about "but now you have to enlighten everything about LargeArray / LargeSpan!" are overblown. This is not much different from the situation a few years ago where all we had were array-based overloads of various APIs, and then we went and added Span-based overloads as priority and time permitted. The world didn't end, the ecosystem didn't become bifurcated, and applications moved on to the new APIs only if they derived significant value from doing so. We didn't bother updating many of the old collection types or interfaces because it was a paradigm shift. There's no reason we couldn't follow the same playbook here. Not every single aspect of the runtime and library ecosystem needs to move forward simultaneously. It's possible to deliver a minimum viable product including an exchange type and basic (common) API support, then continue to improve the ecosystem over future releases as priority dictates and time permits.
I imagine …
So would one of the following proposals possibly get traction?

A) Add NativeLength

Background and Motivation

See #12221 for background discussion on this API.

Proposed API

```diff
namespace System
{
    public readonly ref struct Span<T>
    {
-       private readonly int _length;
+       private readonly nint _length;

+       public Span(void* pointer, nint length);
+       public nint NativeLength { get; }
+       public ref T this[nint index] { get; }
+       public Span<T> Slice(nint start, nint length);

        public int Length
        {
            [NonVersionable]
-           get => _length;
+           get
+           {
+               if (_length > Int32.MaxValue) throw new InvalidOperationException();
+               return (int)_length;
+           }
        }
    }
}
```

Similar API changes would be made to ReadOnlySpan, Memory, and ReadOnlyMemory.

Usage Examples

This could be used with native memory allocations to get .NET-style views on large data sets.

Risks

This incurs an extra check on every access to Span's Length, which, for a high-performance type, is an unwanted cost.

B) Add NativeSpan

Background and Motivation

See #12221 for background discussion on this API.

Proposed API

```csharp
namespace System
{
    public readonly ref struct NativeSpan<T>
    {
        public NativeSpan(void* pointer, nint length);
        public ref T this[nint index] { get; }
        public nint Length { get; }
        public NativeSpan<T> Slice(nint start);
        public NativeSpan<T> Slice(nint start, nint length);
        public static implicit operator NativeSpan<T>(Span<T> span);
        public static explicit operator Span<T>(NativeSpan<T> span);
        // Plus other methods that are present on Span<T>
    }
}
```

Similar types would be added for NativeReadOnlySpan, NativeMemory, and NativeReadOnlyMemory.

Usage Examples

This could be used with native memory allocations to get .NET-style views on large data sets.

Risks

This will cause noticeable API surface duplication as methods are slowly moved to use NativeSpan instead of Span.

Both

Either way, thinking about this did make me wonder whether nint is actually the right type to use here. C++ and Rust would use an unsigned native int (size_t in C++, usize in Rust). Historically the BCL would not use unsigned types on such a major API surface because they're not CLS compliant; is that still a concern? If we do stick with nint, it does mean that on 32-bit systems a Span would still be limited to 2 GB, despite most modern 32-bit OSes being able to allocate a larger buffer than that. Also, there are always some groups looking into C# for kernel work, and a Span covering the entire address space could be nice.
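For illustration, a rough sketch of how option B might be used over a large native allocation. NativeSpan<T> is the proposed (hypothetical) type from the comment above, not an existing API; NativeMemory.Alloc/Free do exist in System.Runtime.InteropServices on .NET 6+.

```csharp
using System;
using System.Runtime.InteropServices;

class Example
{
    static unsafe void Main()
    {
        // Allocate 8 GB of native memory -- more than int.MaxValue bytes.
        nuint byteCount = 8UL * 1024 * 1024 * 1024;
        void* buffer = NativeMemory.Alloc(byteCount);
        try
        {
            // NativeSpan<T> is the proposed type from option B above (hypothetical).
            var span = new NativeSpan<byte>(buffer, (nint)byteCount);
            span[(nint)5_000_000_000] = 0xFF;                    // index beyond int.MaxValue
            NativeSpan<byte> tail = span.Slice((nint)byteCount - 16);
        }
        finally
        {
            NativeMemory.Free(buffer);
        }
    }
}
```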
@Frassle Thanks for posting that. I think … For … To all: if we had a … (Also, I'd prefer …) Edit: I'm not sold on …
This can be solved using analyzers, e.g. by adding a marker attribute that records whether an API supports large spans.
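The kind of marker being discussed might look like the following; the attribute name and the analyzer behaviour are hypothetical, invented purely for illustration.

```csharp
using System;

// Hypothetical marker: an analyzer (or an optional runtime mode) could warn when a span
// that may exceed int.MaxValue elements is passed to a method carrying this attribute.
[AttributeUsage(AttributeTargets.Method)]
public sealed class DoesNotSupportLargeSpansAttribute : Attribute
{
}

public static class LegacyParser
{
    // Declares that this method still assumes span.Length fits in an int.
    [DoesNotSupportLargeSpans]
    public static int CountLines(ReadOnlySpan<byte> utf8Text)
    {
        int count = 0;
        for (int i = 0; i < utf8Text.Length; i++)
        {
            if (utf8Text[i] == (byte)'\n') count++;
        }
        return count;
    }
}
```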
@jkotas That runs into the same problems as nullability. It introduces a marker attribute which serves as an API contract. Anything that performs code generation or dynamic dispatch would need to be updated to account for it, just as if it were running a light form of the Roslyn analyzer at runtime. Though to be honest I don't think we have a significant number of components which perform codegen over span-consuming methods. It also doesn't solve the problem of somebody passing you (perhaps as a return value) a span whose length exceeds int.MaxValue.
This part can be solved by having long Span support as an optional runtime mode. It is similar to how we are dealing with linkability: we are marking APIs and patterns that are linker friendly. If your app uses linker-unfriendly APIs or patterns, it has unspecified behavior; you have to fix the code if you want to be sure that it works well. I understand that the analyzer approach is not without issues, but I think it is worth having it on the table. If we went with the NativeSpan approach, we would have to introduce NativeSpan overloads for all APIs that take Span in the limit. It feels like too high a price to pay.
That concern is understandable, but I don't think it's likely to come to fruition in practice. I spent a few minutes looking through System.IO, System.Runtime, and other ref assemblies for public APIs which take Span<T> or ReadOnlySpan<T> …

All of that said, I think network-style scenarios are going to be more common than big data scenarios. And network-style scenarios may involve lots of data being sharded across multiple buffers via …

If we were to identify which APIs we wanted to create large-span overloads for, …
As I mentioned previously, that could be covered by giving …
A few colleagues and I had some thoughts about how we could come up with a reasonable proposal along the lines of nullables (which seemed to be the favoured approach in our small group) and that could still be implemented as a PR by one or two developers. Since there have been many different ideas flying around in the comments, I've put this into a separate draft document. The latest version of the proposal can be found at https://github.com/GPSnoopy/csharplang/blob/64bit-array-proposal/proposals/csharp-10.0/64bit-array-and-span.md

Expand arrays, strings and spans to native-int lengths

Summary

Expand arrays (e.g. T[] and System.Array), strings and spans to support native-sized (nint) lengths and indexing.

Motivation

Currently in C#, arrays (and the other aforementioned types) are limited to storing and addressing up to 2^31 (roughly 2 billion) elements. This limits the amount of data that can be contiguously stored in memory and forces the user to spend time and effort implementing custom non-standard alternatives like native memory allocation or jagged arrays. Other languages such as C++ or Python do not suffer from this limitation. With every passing year, due to the ever-continuing increase in RAM capacity, this limitation is becoming evident to an ever-increasing number of users. For example, machine learning frameworks often deal with large amounts of data and dispatch their implementation to native libraries that expect a contiguous memory layout. At the time of writing this (2021-03-23), the latest high-end smartphones have 8GB-12GB of RAM while the latest desktop CPUs can have up to 128GB of RAM. Yet when using …

Proposal

The proposed solution is to change the signature of arrays, strings and spans so that lengths and indices are native-sized integers (nint). As this would break existing C# application compilation, an opt-in mechanism similar to what is done with C# 8.0 nullables is proposed. By default, assemblies are compiled with legacy lengths, but new assemblies (or projects that have been ported to the new native lengths) can opt in to be compiled with native lengths support.

Future Improvements

This proposal limits itself to arrays, strings and spans; …

Language Impacts

When an assembly is compiled with native lengths, the change of type for lengths and indexing from int to nint silently changes the behaviour of common loop patterns:

```csharp
for (int i = 0; i < array.Length; ++i) { /* ... */ } // If the length is greater than what's representable with int, this will loop forever.
for (var i = 0; i < array.Length; ++i) { /* ... */ } // Same as above. A lot of code is written like this rather than using an explicit int type.
```

The correct version should be the following.

```csharp
for (nint i = 0; i < array.Length; ++i) { /* ... */ }
```

Unfortunately, due to implicit conversion, the C# 9.0 compiler does not complain when comparing an int with an nint.

Runtime Impacts

In order to accommodate these changes, various parts of the dotnet 64-bit runtime need to be modified. … The proposal assumes that these are always enabled, irrespective of whether any assembly is marked as legacy lengths or native lengths.

Boundaries Between Assemblies

Calling Legacy Lengths Assembly From Native Lengths Assembly

Without any further change, a native lengths assembly passing an array (or any other type covered by this proposal) with a large length (i.e. greater than 2^31) to a legacy lengths assembly would result in undefined behaviour. The most likely outcome would be the JIT truncating the length. The proposed solution is to follow C# 8.0 nullables compiler guards and warn/error when such a boundary is crossed in an incompatible way. As with nullables, the user can override this warning/error (TODO: exact syntax TBD; can we reuse nullables' exclamation mark, or do we stick to nullable-like preprocessor directives? IMHO the latter is probably enough). In practice, applications rarely need to pass large amounts of data across their entire source code and dependency graph. We foresee that most applications will only need to make a small part of their source code and dependencies native lengths compatible, thus enabling a smoother transition to a native lengths-only C# future version.

Calling Native Lengths Assembly From Legacy Lengths Assembly

The proposed aforementioned runtime changes mean this should work as expected.
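The comparison pitfall called out above can already be seen with a plain nint local today, without any array changes. A self-contained sketch of the buggy pattern (not code you would want to run to completion):

```csharp
// int converts implicitly to nint, so this compiles without any warning today.
// Assumes a 64-bit process so that 3 billion fits in nint.
nint length = (nint)3_000_000_000;

for (int i = 0; i < length; i++)
{
    // When i reaches int.MaxValue, i++ wraps to int.MinValue (unchecked by default),
    // which is still < length, so this loop never terminates.
}
```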
@GPSnoopy That proposal seems to capture things fairly well. A few additions / changes I'd recommend: …
Additionally, do you have a proposal for what we do with publicly-exposed System.* APIs? Consider the following code sample.

```csharp
byte[] bytes = GetSomeBytes();
byte[] moreBytes = GetMoreBytes();
byte[] concat = bytes.Concat(moreBytes).ToArray(); // System.Linq
```

Presumably runtime-provided methods like Enumerable.ToArray …
I believe that is true. Currently, most collections in practical applications contain far fewer than 2 billion elements. Why is that? It's not because of technology limitations. It's because IT applications usually model real-world concepts and the real world very rarely has more than 2 billion items for any type of object. Even Facebook has only 1 billion users. I find it remarkable that the need for more than 2 billion elements is so exceedingly rare. If .NET adds support for large arrays, then the resulting collections and algorithms should be considered to be rarely used, not mainstream. Large arrays are an enabling feature so that more advanced libraries can be built on top. Large arrays should not be seen as a centerpiece of the platform permeating everything.
@GrabYourPitchforks Thank you for the feedback, you raise valid points. I've tried to address them, or at least highlight the gaps in https://github.com/GPSnoopy/csharplang/commit/830a3d8d3898b9a26066eee09e3493f9691d2edf (and a small addendum in https://github.com/GPSnoopy/csharplang/commit/804df79a564e3515cda862a4afea5f513fde4d5a).
This would not be sufficient. The problem is indirection, such as through interfaces and delegate dispatch. It's not always possible to know what behavioral qualities the caller / target of an indirection possesses. (It's one of the reasons .NET Framework's CAS had so many holes in it. Indirection was a favorite way of bypassing the system.) In order to account for this, you'd need to insert the check at entry to every single legacy method, and on the return back to any legacy caller. It cannot be done only at legacy / native boundaries.

Edit: It also didn't address the problem I mentioned with LINQ and other array-returning methods. The caller might expect the runtime method implementation to have a specific behavior, such as succeeding or throwing, independent of any checks that occur when boundaries are crossed. For example, … Given the above paragraph, it might be appropriate to say "this new runtime capability changes contracts for existing APIs, even if no user assembly in the application is native-aware." It's fine to call out stuff like that in the doc, since it contributes to the trade-off decisions we need to make.

Edit x2: Oh, and fields and ref arguments! Those have visible side-effects even in the middle of method execution. (Example: the ref is to a field of an object, and that field is visible to other threads running concurrently.) Detecting whether a native assembly is about to set a field in use by a legacy assembly isn't something the JIT can know at method entry or method exit. This limitation should also be captured, as it influences whether the …
While I agree that most of the ecosystem likely won't receive large data as inputs, I think it's important that the ecosystem be encouraged to move to simply handle it. When you are talking about large data, there isn't really a difference between naively handling 500 million elements vs 2 billion elements vs 1 trillion or more elements. I'd expect most code paths are likely optimized for a few hundred to a few thousand inputs at best, with a few special cases being optimized for millions of inputs. I'd think you'd then have a few cases: …
There are also two distinct cases for moving to native-sized lengths: with and without the GC itself revving to support larger objects. Without the GC revving, there is actually no change to user code allocating managed arrays, just a recommendation to switch to 64-bit-friendly indexing. With the GC revving, … I would imagine any checking we do could be …
@GrabYourPitchforks Added your concerns verbatim to the proposal document. Personally I would like to see if we can avoid adding such a check absolutely everywhere (e.g. what about non-virtual method calls?). Or find a better alternative.
Reflection and delegate dispatch may involve indirection to non-virtual methods. And it's not always possible for the runtime to reason about the caller or the target of such indirection. (See also my earlier comments on CAS and why it was so broken.)
It's not just about compatibility, that is, no problem would arise if we were to redesign everything from scratch. To use large arrays, indices also need to be extended to native size (64 bits), and that introduces problems. verelpode has already pointed out:
How do we deal with this possible performance degradation in such collection types if ordinary arrays were allowed to be very large?
This does not work if … Even if the BCL took the third approach, third-party collection developers may still feel lazy and choose other approaches instead. They either use … All of these "support-large-arrays-or-not" troubles are not present currently, because an ordinary array can never hold more than int.MaxValue elements anyway. Personally I'd vote for introducing new types like LargeArray<T> …
My suggestion: the next version of .NET (.NET 7, or .NET 8 for LTS) should be 64-bit only, and should make everything 64-bit as far as possible.
I kinda support breaking backwards compatibility, as keeping it complicates writing modern code. Change Span<T> from

```csharp
public ref struct Span<T> {
    private readonly nuint length;
    public nuint Length { get => length; }
}
```

to

```csharp
public ref struct Span<T> {
    private readonly nuint length;

    // Compiler visible
    public nuint UnsignedNativeLength { get => length; }

    // Binary visible only
    public int Length { get => (int)Math.Min((nuint)int.MaxValue, UnsignedNativeLength); } // use some tricks here
}
```

and

```csharp
nuint length = span.Length;
```

to

```csharp
nuint length = span.UnsignedNativeLength;
```

Pros
…

Cons

Performance
…

Compatibility
…

IL
…

(You can replace …) This can be implemented for research on some Roslyn branch.
I think this is becoming increasingly urgent.
While the System.Array API supports LongLength and operator this[long i], the CLR does not allow arrays to be allocated with more than 2^31-1 elements (int.MaxValue).
This limitation has become a daily annoyance when working with HPC or big data. We frequently hit this limit.
Why this matters
In C++ this is solved with std::size_t (whose typedef changes depending on the target platform). Ideally, .NET would have taken the same route when designing System.Array. Why they haven't is a mystery, given that AMD64 and .NET Framework appeared around the same time.
Proposal
I suggest that when the CLR/JIT runs the .NET application in x64, it allows the array long constructor to allocate more than int.MaxValue items:
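The issue's original code sample is not reproduced above; as a rough illustration, the kind of allocation the proposal would permit might look like this (today the allocation line throws instead of succeeding):

```csharp
using System;

class Program
{
    static void Main()
    {
        long count = 5_000_000_000;             // > int.MaxValue elements (~20 GB of floats)
        var samples = new float[count];         // today this throws; the proposal would let it succeed on x64
        samples[3_000_000_000] = 1.0f;          // C# array indexers already accept long indices
        Console.WriteLine(samples.LongLength);  // and LongLength already returns a long
    }
}
```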
I naively believe that the above should not break any existing application.
Bonus points for extending 64-bit support to Span and ReadOnlySpan.