Proposal: Slicing #120

Open
stephentoub opened this Issue Jan 28, 2015 · 97 comments

Projects

None yet
@stephentoub
Member

(Note: this proposal was briefly discussed in #98, the C# design notes for Jan 21, 2015. It has not been updated based on the discussion that's already occurred on that thread.)

Background

Arrays are extremely prevalent in C# code, as they are in most programming languages, and it’s very common to hand arrays around from one method to another.

Problem

However, it’s also very common to only want to share a portion of an array. This is typically achieved either by copying that portion out into its own array, or by passing around the array along with range indicators for which portion of the array is intended to be used. The former can lead to inefficiencies due to unnecessary copies of non-trivial amounts of data, and the latter can lead both to more complicated code as well as to lack of trust that the intended subset is the only subset that’s actually going to being used.

Solution: Slice<T>

To address this common need, .NET and C# should support "slices." A slice, represented by the Slice<T> value type, is a subset of an array or other contiguous region of memory, including both unmanaged memory and other slices. The act of creating such a slice is referred to as "slicing," and beyond the support on the Slice<T>, the C# language would include language syntax for declaring slices, slicing off pieces of arrays or other slices, and reading from and writing to them.

An array is represented using array brackets:

int[] array = …;

Similarly, a slice would be represented using square brackets that contain a colon between them:

int[:] slice = …; // same as "Slice<int> slice = ..."

The presence of the colon maps to the syntax for creating slices, which would use an inclusive 'from' index before the colon and an exclusive 'to' index after the colon to indicate the range that should be sliced (omission of either index would simply imply the start of the array or the end of the array, respectively, and omission of both would mean the entire array):

int[] primes = new int[] { 2, 3, 5, 7, 9, 11, 13 };
int item = primes[1];   // Regular array access, producing the value 3
int[:] a = primes[0:3]; // A slice with elements {2, 3, 5} 
int[:] b = primes[1:2]; // A slice with elements {3} 
int[:] c = primes[:5];  // A slice with elements {2, 3, 5, 7, 9} 
int[:] d = primes[2:];  // A slice with elements {5, 7, 9, 11, 13} 
int[:] e = primes[:];   // A slice with elements {2, 3, 5, 7, 9, 11, 13} 
int[:] f = a[1:2];      // A slice with elements {3}

Arrays could also be implicitly converted to slices (via an implicit conversion operator on the slice type), with the resulting slice representing the entire array, as if both 'from' and 'to' indices had been omitted from the slicing operation:

int[:] g = primes[:];   // A slice with elements {2, 3, 5, 7, 9, 11, 13} 
int[:] h = primes;      // A slice with elements {2, 3, 5, 7, 9, 11, 13}
int[:] i = h[:];        // A slice with elements {2, 3, 5, 7, 9, 11, 13}

A slice could also be used in a similar manner to arrays, reading from and writing to them via indexing:

int[:] somePrimes = primes[1:3];  // A slice with elements { 3, 5 }

Debug.Assert(primes is Array);// true
Debug.Assert(somePrimes is Slice<int>);   // true

Debug.Assert(somePrimes.Length == 2);     // true
Debug.Assert(somePrimes[0] == primes[1]); // true
Debug.Assert(somePrimes[1] == primes[2]); // true
somePrimes[0] = 17;
Debug.Assert(primes[1] == 17);            // true

As demonstrated in this code example, slicing wouldn’t make a copy of the original data; rather, it would simply create an alias for a particular region of the larger range. This allows for efficient referencing and handing around of a sub-portion of an array without necessitating inefficient copying of data. However, if a copy is required, the ToArray method of Slice<T> could be used to forcibly introduce such a copy, which could then be stored as either an array or as a slice (since arrays implicitly convert to slices):

int[:] aliased = primes[1:3];        // Alias of a portion of the original array
int[:] copied  = primes[1:3].Copy(); // Copy  of a portion of the original array

This gives developers the flexibility as to whether they want the recipient of the slice to be working with the original array or not, minimizing unnecessary copies and ensuring that only the appropriate areas of the larger region are used (by design, there would be no way through the public surface area of Slice<T> nor through the C# language syntax to get back from a slice to the larger entity from which it was sliced).

As creating slices would be very efficient, methods that would otherwise be defined to take an array, an offset, and a count can then be defined to just take a slice.

Solution: ReadOnlySlice<T>

In addition to Slice<T>, the .NET Framework could also includes a ReadOnlySlice<T> type, which would be almost identical to Slice<T> except that it would not provide any way for writing to the slice. A Slice<T> would be implicitly convertible to a ReadOnlySlice<T>, but not the other way around.

As with slicing an array, creation of a ReadOnlySlice<T> wouldn’t copy data, but rather would create a read-only alias to the original data; this means that while you couldn’t change the contents of a ReadOnlySlice<T> through it, if you had a writable reference to the underlying data, you could still manipulate it:

int[]  primes= new int[] { 2, 3, 5, 7, 9, 11, 13 };
int[:] a = primes[1:3];     // A slice with elements {3, 5}
ReadOnlySlice<int> b = a;   // A read-only slice with elements {3, 5}
Debug.Assert(a[0] == 3);    // true
Debug.Assert(b[0] == 3);    // true
b[0] = 42;                  // Error: no set accessor available
a[0] = 42;                  // Ok
Debug.Assert(b[0] == 42);   // true

While C# would not have special syntax to represent a ReadOnlySlice<T>, it could still have knowledge of the type. In particular, there is a very commonly-used type in C# that behaves like an array but that’s immutable: string. It’s very common for developers to want to slice off substrings from strings, and historically this has been a relatively expensive operation, as it involves allocating a new string object and copying the string data to it. With ReadOnlySlice<T>, the compiler could provide built-in support for slicing off substrings represented as ReadOnlySlice<char>. This could be done using the same slicing syntax as exists for arrays.

string helloWorld = "hello, world";
ReadOnlySlice<char> hello = helloWorld[0:5];

This would allow for substrings to be taken and handed around in a very efficient manner. In addition to new methods on String like Slice (a call to which is what the slicing syntax on strings would compile down to), String would also support an explicit conversion from a ReadOnlySlice<char> back to a string. This would enable developers to work with substrings efficiently, and then only create a copy as a string when actually needed.

Further, just as the C# compiler today has support for concatenating strings and switching on strings, it could also have support for concatenating ReadOnlySlice<char> and switching on ReadOnlySlice<char>:

string helloWorld = "hello, world";
ReadOnlySlice<char> hello = helloWorld[:5];
ReadOnlySlice<char> world = helloWorld[7:];
switch(hello) { // no allocation necessary to switch on a ReadOnlySlice<T>
    case "hello": Hello(); break;
    case "world": World(); break;
}
Debug.Assert(hello + world == "helloworld"); // only a single allocation needed for the concatenation
@mikedn
mikedn commented Jan 28, 2015

How do you make ReadOnlySlice work with both arrays and strings? Access the array/string via IList<T>?

@stephentoub
Member

@mikedn, in this proposal, slices would support operating over any region of memory, whether it was from an array or a native pointer or the char* to data in a string. Its implementation would require interacting with internals in the runtime, rather than operating over a publicly-exposed abstraction like IList<T>... you could of course do the latter, but at a non-trivial performance cost for certain scenarios.

@theoy theoy added the Language-C# label Jan 28, 2015
@omariom
omariom commented Jan 28, 2015

Wow! Roslyn starts yielding it fruits!
It is a very welcomed feature.
I see its usage in API for batched processing. And many other places of course.

@Porges
Porges commented Jan 28, 2015

I'd much prefer that existing BCL classes that 'only take a T[]' were extended to support IList<T> (or IReadOnlyList<T> as the case may be), then we don't need additional CLR magic. Under this model {ReadOnly}Slice<T> are just wrappers around I{ReadOnly}List<T> with constrained offset/length (much like ArraySegment<T>). Copy() etc can be supported too.

There's a sort-of tangential issue around being able to treat unmanaged memory as T[] which the proposal mentions. I'd like to be able to do this for (e.g.) passing byte* to Streams (without first copying it into a byte[]), but this could probably be handled as (another!) Stream method.

@omariom
omariom commented Jan 28, 2015

@Porges, I think it wouldn't provide efficiency of raw arrays.

@HaloFour

This is one of those things that I'd really prefer could be handled by the runtime itself (with C# support in conjunction, of course). By that I mean have the ability directly in the runtime to define an array that is a range within another array where the runtime would manage the appropriate offset and bound checking. I know that ArraySegment<T> exists and can be used as an IList<T> but if you have a method that accepts only arrays that doesn't help much.

To keep within the same syntax:

byte[] b1 = new byte[500];
byte[] b2 = b1[10:10];
b2[0] = 123;
Debug.Assert(b1[10] == 123);
b1[11] = 234;
Debug.Assert(b2[1] == 234);
b2[-1] = 123; // throws IndexOutOfRangeException();
b2[10] = 123; // throws IndexOutOfRangeException();

A similar mechanism would be useful for substrings, where instead of actually copying the portion of the original string into a new string the substring would retain a reference to the original string with an offset and length:

string s1 = "Hello World!";
string s2 = s1[6:5];
Debug.Assert(s2 == "World");

The one disadvantage to both being that it keeps a root reference to the original array or string around for the lifetime of the slice.

@redknightlois

Having done some work already on Array Slices (https://github.com/Codealike/arrayslice) I will share some of the gotchas I had to deal with...

If slices are implemented in C# as a native construct understood by the compiler, you will make math oriented programmer like myself pretty happy. Which will be probably the ones that are seriously interested in having such a construct for performance reasons. IEnumerable, IList, etc have such a big performance impact that they are provided for convenience and/or interop with application code only. (see at the arrayslice link the performance impact).

While the implementation details with structs or classes, readonly or not are very important at the language design level the biggest issue is behind the language surface.

Today as it stands I know of 3 ways to handle this:

  • Implement an Slice class which just "overrides" the index. (our implementation does that using IL rewriting for performance, Roslyn will just generate the proper IL).
  • Implement it as some kind of IEnumerable (skip + take)
  • Do it properly where slices are actually arrays at the runtime level.

The first has a very important drawback, if your code doesn't support Slice you are screwed. Implicit converting to an array would not work... you have to pass the whole array (defeating the purpose of the Slice) or copy the array, a no-no for the audience that would really use it...

The second is clear, performance... again a no-no for the intended audience.

The third, AFAIK no support at the runtime level to actually create an array with "shared" memory. If there is IL to be able to do that, I am more than interested to know how :) ... therefore unless we can allow that at the runtime level slices will be useless or clash with code already written.

Needless to say, that where this is going is that there is a serious need of a generic numeric constraints too (both fixed and floating) to really make C# shine for performance math code. Traits if implemented properly would work for that.

I look forward to have an experience akind to what you have in Matlab or Python in terms of flexibility (not in syntax :D)

Federico

@gafter gafter added the 1 - Planning label Feb 2, 2015
@MadsTorgersen MadsTorgersen was assigned by gafter Feb 2, 2015
@Miista
Miista commented Feb 12, 2015

In what way is a slice different from an array? Couldn't the type simply be int[] slice = …?
It's still just an array.

Also I believe I would be more clear if you used the range operator (..) to mark a slice

int[] array = …;
int[] slice = array[0..2];
int[] head = array[..2];
int[] tail = array[2..];
@redknightlois

@miista AFAIK arrays in the CLR are not just a bunch of memory, the GC have to track it down so there should be a descriptor somewhere, etc. Therefore, the slice part of that bunch of memory is not exactly an array. The CLR/Roslyn guys surely could give a more detailed answer as I am interested into knowing that too. :)

@prasannavl

Awesome feature. Been waiting for this since the inception of C# itself 🎱

@stephentoub - While its use is numerous, I'm curious of how this is going to be used practically in case of strings.

In your example, you used the simplest case of switch,

string helloWorld = "hello, world";
ReadOnlySlice<char> hello = helloWorld[:5];
ReadOnlySlice<char> world = helloWorld[7:];
switch(hello) { // no allocation necessary to switch on a ReadOnlySlice<T>
    case "hello": Hello(); break;
    case "world": World(); break;
}
Debug.Assert(hello + world == "helloworld");

I'm curious here as to how a string is compared to a ReadOnlySlice<char>, since switch case requires compile time constant values of the same type (without additional compiler support.)

That being aside, for any practical advantage in efficiency while dealing with strings, the compiler needs to support allocation free representation of strings, since you almost always have to recreate a string from the ReadOnlySlice<char> again (which triggers allocation) to do anything useful from it, other than the switch case (even which I still don't see how, without compiler tweaks).

Unless, a String.FromSlice or something of that nature is provided, which internally creates a string that represents the same area of memory, I see the string slicing to be quite-pointless.

Now, considering, a String.FormSlice, or a Slice.ToSourceFormat, or anything of that nature is provided, can this not be directly simplified to directly providing the type itself, with controlled mutability, than a new type called Slice?

Example,

string helloWorld = "hello, world";

 // Internally built from the memory representation 
 // i.e, only the 'string' type is allocated (which acts a wrapper itself to the chars),
 // but simply representing the same area of memory
 // Conceptual pseudo: (String.FromSlice(String.Slice(helloWorld, 0, 5))
 // But can be efficiently done without the middle conversions directly.
string hello = helloWorld[:5];

string world = helloWorld[7:]; // Internally built from the memory representation again
switch(hello) { 
    case "hello": Hello(); break;
    case "world": World(); break;
}

Now, my point being, instead of creating a new Type called Slice, or ReadOnlySlice, since this will anyway require a reverse conversion at some-point reducing the potential gain of efficiency, why not directly return the arrays, and simply provide a direct way to create an array from an existing representation of another underlying memory of the array of the same type?

int[] x = {1, 2, 3, 4, 5};

// It returns a new array that internally maps directly to the array of x.
// Again, the returned int array type implicitly represents the same area of memory.
int[] slice = x[1:4];

// Alternatively, 
int[] slice = Array.Slice(x, 1, 4);

// Readonly version: (Reuse existing types)
ImmutableArray[] slice = Array.ImmutableSlice(x, 1, 4);

This ensures compatibility with all existing APIs, and no requirement for the API to be dealing with slices differently. IMO, API shouldn't have to think about where its a slice or an array. As far as they are concerned, they are getting a unit of data to operate on. The sender can decide whether its a slice that operates directly (conceptually similar to refs), or a copy.

Does this not make sense, as I really see no practical benefit, and use case in separating it as a brand new type - Only more potential decisions to dealt with, polluting the APIs with another set of overloads.

@jdh30
jdh30 commented Apr 16, 2015

F# already has slices for both arrays and strings. Sadly, they deep copy which makes them too slow for many applications (I only use them in code golf). Aliasing is definitely the way to go. Provided the slice supports stride it could also help when hoisting bounds checks.

I wish .NET provided overloads for functions like System.Double.Parse that accepted string, start index and length rather than just string. I often find my parsing code is much slower than necessary because this API design incurs huge allocation rates from unnecessary objects.

@jnm2
jnm2 commented Apr 18, 2015

Is there any chance we could get string slices at the same time? Not just ReadOnlySlice<char>?

@gafter
Member
gafter commented Apr 18, 2015

@jnm2 Yes, that would be part of the point. string.Substring would be more efficient than today.

@jdh30
jdh30 commented Apr 18, 2015

@gafter: I don't think you would want to break backward compatibility as Java has had some trouble with string slices keeping large strings reachable too long, i.e. memory leaks.

@HaloFour

@gafter Sounds like there is possibly movement on allowing a string to represent a range within another string? That would be awesome as it would make slicing both performant and usable within all existing API. I kind of agree with @jdh30 though that maybe it should be supported through a new member of string rather than string.Substring as some code might not expect the much larger original string to retain a root reference.

Would the same be possible with arrays?

@Przemyslaw-W

If this comes with proper GC integration, then there will be no need for
new API. GC just needs to deep understand slices and when original string
(or array) is no longer referenced other than via slice, then parts which
are not referenced by any slice can be collected.

2015-04-18 17:10 GMT+02:00 HaloFour notifications@github.com:

@gafter https://github.com/gafter Sounds like there is possibly
movement on allowing a string to represent a range within another string?
That would be awesome as it would make slicing both performant and usable
within all existing API. I kind of agree with @jdh30
https://github.com/jdh30 though that maybe it should be supported
through a new member of string rather than string.Substring as some code
might not expect the much larger original string to retain a root
reference.

Would the same be possible with arrays?


Reply to this email directly or view it on GitHub
#120 (comment).

@HaloFour

@Przemyslaw-W

If the GC could pull off being able to collect a large string from which at least one slice was taken then that would be great and does allay our concerns. My concern is that the slices would be treated as having references back to the parent string and thus keep it from being eligible for collection.

Another (tiny) reason to have a separate method is that we could establish a convention through which any type can be sliced. If a slice operation could function against any type that had a resolvable Slice(int,int) method (instance or extension) then the functionality could be provided to additional types. Off of the top of my head I could see slicing benefiting strings, arrays, any form of indexable collection, IEnumerable (via Skip+Take) and tuples.

@Przemyslaw-W

Yeah, such open convention would be really great. And I think such API need
to be introduced anyway, as arrays do not have "SubArray" method now. But
still, we can have cake and eat it too. If proper GC update comes together,
then Substring can be rewritten to internally use slicing. However, If GC
won't play together, then I agree it would be better to leave current
Substring implementation as is.

2015-04-18 21:38 GMT+02:00 HaloFour notifications@github.com:

@Przemyslaw-W https://github.com/Przemyslaw-W

If the GC could pull off being able to collect a large string from which
at least one slice was taken then that would be great and does allay our
concerns. My concern is that the slices would be treated as having
references back to the parent string and thus keep it from being eligible
for collection.

Another (tiny) reason to have a separate method is that we could establish
a convention through which any type can be sliced. If a slice operation
could function against any type that had a resolvable Slice(int,int)
method (instance or extension) then the functionality could be provided to
additional types. Off of the top of my head I could see slicing benefiting
strings, arrays, any form of indexable collection, IEnumerable (via
Skip+Take) and tuples.


Reply to this email directly or view it on GitHub
#120 (comment).

@JamesNK
JamesNK commented Apr 20, 2015

If slice is added, will it work with IList? IList is much more commonly used than raw arrays. A nice syntax for getting ranges of data should work with the most commonly used data structure.

@xen2
xen2 commented Apr 20, 2015

Ideally it would be great if such slice would not require allocation (i.e. be encoded in a struct that can be passed by ref/copy).

If not, it would result in two allocation and two indirections most of the time (and increased object number for GC).

Rust is doing something similar already: https://doc.rust-lang.org/std/slice/

Of course, if runtime can be modified, other options might be possible too.

@prasannavl

I don't understand why there are many comments about slicing in IEnumerable or IList. It simply doesn't make sense, since they aren't a contiguous representation of memory. They aren't even a direct representation of memory. They are very high level structures. The conceptual slicing of them is already possible, and is no different from using Skip, Take, and their relatives. We're talking about efficient referencing to existing memory, which really, only applies to arrays, or be extended to objects overall - in which case the garbage collector itself has to be tweaked, which changes a lot more dynamics, bringing the whole language closer to C/C++. If this indeed is a proposal, it seems completely out of scope of this thread.

I think the focus here should be only on arrays. If array are accomplished the right way, ILists can easily be extended, by perhaps another interface, that allows access to IList's source array, which in turn can be sliced.

@xen2
xen2 commented Apr 23, 2015

@weitzhandler Probably don't need syntactic sugar for a simple Skip+Take. I agree with @prasannavl that arrays should be the priority here.

@AdamsLair

I'm from this CoreCLR issue and as @jkotas noted, the issue seems to be related to this one and could be implemented as an addition to the proposed slices.

In short: It's about creating a "view" of an array in order to wrap the same block of data in different element types in order to satisfy the needs of different APIs or libraries. Feel free to join the discussion or move it here alltogether, in case this would be a suitable extension to slices.

@JeffreySax

There are really two separate issues here:

  1. Adding slicing syntax to the language.
  2. Implementing slices for strings, arrays, and other objects.

For many scenarios, especially in technical computing, just allowing the syntax solves the most painful issue of readability. Libraries can take care of the implementation for specific types. This way, the language side of the feature can be fully designed without having to handle all the complications of an efficient implementation. Significantly, and unlike what @MadsTorgersen writes about @stephentoub 's proposal in #2136, it does not require any changes in the CLR.

In fact, this can all be rather simple if you define a slice to be a built-in type with a specific construction syntax: from:to = new Slice(from, to) and, hopefully, from:stride:to = new Slice(from, to, stride). So a 'slice' here is the set of integer indexes, not the actual view of the larger object.

'Open' slices (where the from and/or to part is omitted) are only allowed as arguments in an indexer property. In this case, the missing value is replaced with a suitable bound:

  • For the lower bound, if the instance has a method or extension method GetLowerBound(dimension), the lower bound is set to a call to this method; otherwise, the lower bound is 0.
  • For the upper bound, if the instance has a method or extension method GetUpperBound(dimension), the upper bound is set to a call to this method; otherwise, if the instance has a Length or Count property, the value of this property is used; otherwise, if the instance implements ICollection or IList<T>, the corresponding Count property is used; otherwise a compile-time error is generated.

Indexing with a slice then just maps to an indexer call. Alternatively, it can map to an extension method (GetSlice or SetSlice) with the same signature. This allows libraries to add slicing support to types like strings and arrays after the fact.

This solution works for 1D objects like arrays, lists, and strings, but also for 2D and higher-dimensional arrays. It doesn't fix the type of the slice/view. It doesn't make any assumptions about how slicing is implemented, or how slices would be used.

@prasannavl

@JeffreySax, I'm guessing you completely misunderstood what slicing really means here. Please read the last few comments, and dotnet/coreclr#1015.

PS: It does require language support. Syntax and API is completely useless without a way to get a consistent view of a certain memory area from CLR.

@redknightlois

@JeffreySax Take a look at the link I posted on my comment at the start of the thread. The implementation using common constructs, with no CLR support for views of memory, as @prasannavl explains it, defeat the purpose of Slices by themselves.

The difference in performance is embarrasing, to give a hint with a microbenchmark (which can be totally flawed but its the best we have) we have something like this:

Access via delimited array: 258ms. (this uses indexers)
Access via array segment: 68ms.
Access via inline no checks delimited array: 45ms. (this use forced inlining)
Access without offset: 38ms. (this uses explicit code offsets)
Access via array slice: 38ms. (this uses IL manipulation)

Source: https://github.com/Codealike/arrayslice

As you can see, even paying 5% for Slices defeat the purpose of its use. Slices is no different to a custom IList if you dont care about performance, it is performance what makes Slices attractive and useful as a feature.

@benaadams

As an aside its a bit like TypedArrays in js?

Specifically the new TypedArray(buffer [, byteOffset [, length]]); constructor.

@JeffreySax

@redknightlois I strongly disagree that "it is performance what makes Slices attractive and useful as a feature." The principal motivation for adding slicing syntax is code clarity. Yes, it would be nice to have CLR-backed slices of strings and arrays, but it is not necessary to derive lots of benefit from the feature. It is common for numerical code to have 3-4 slicing operations in a single expression. Such code would benefit immensely from having slicing syntax.

F# has had slicing syntax from the beginning. Row and column slices of 2D collections were added in version 3.1. Clearly there was a demand for it. A similar demand exists for C#/VB.

How slices are implemented is not a C# language issue. If a specific implementation is particularly challenging, that should not keep the C# team from moving forward with such a useful feature.

@prasannavl

@JeffreySax,

In a hypothetical case, where I agree with you - C# already has slices. x.Skip(2).Take(10); If you want a refined way - Create a small wrapper class. Still not enough? Then, what you're looking for is sugar for the iterator syntax. Now, I don't think this is the best thread to ask for that.

Why, you ask?

  • Here's the shamelessly lifted problem statement from the first post with the large title "Problem":

However, it’s also very common to only want to share a portion of an array. This is typically achieved either by copying that portion out into its own array, or by passing around the array along with range indicators for which portion of the array is intended to be used. The former can lead to inefficiencies due to unnecessary copies of non-trivial amounts of data, and the latter can lead both to more complicated code as well as to lack of trust that the intended subset is the only subset that’s actually going to being used.

@redknightlois

@JeffreySax Not even an average F# developer here, so I could be completely wrong.

AFAIK on F# after slicing they become sort of sequences which for all intended purposes they are "custom lists" or "views". In essense that is only syntactic sugar (which is better than the current state of affairs) but quoting from the problem statement by @stephentoub

Arrays are extremely prevalent in C# code, as they are in most programming languages, and it’s very common to hand arrays around from one method to another.

EDIT: That the same syntactic sugar can be applied here is an entirely different matter.

@ssylvan
ssylvan commented Jul 21, 2015

This is great, but there should be a way to get slices from List, stack allocated arrays, fixed arrays, unsafe arrays (pointers), etc. This way you can use slices as an input argument without screwing the callers over if they don't have a plain array. Note, this needs to happen without sacrificing performance. So for List the type itself has to selectively expose its internal array, and for stack allocated arrays (and pointers, etc.) the slice needs to be invented in a stack-local struct that points to the memory.

@HaloFour

@ssylvan I agree but there's only so far that the CLR would allow that to go. For example, List<T> can't "selectively" expose it's internal array to arbitrary consumers. Doing so wouldn't be practical anyway. An operation that would cause the List<T> to expand its buffer would leave anyone holding onto a reference of the array with a stale copy. Furthermore, having a reference to that array doesn't give you something that you could then pass as a List<T> or even a sliced array to any arbitrary bit of code.

I agree that it would be useful to be able to obtain a slice of a List<T>, but in order to be able to treat it as a List<T> it has to be a new instance allocated on the heap. There's no way to define a struct which would represent a slice of a List<T> that can also be passed around like a List<T>. At best you could obtain a struct that implements IList<T>, which could not be passed to any existing methods expecting List<T>, and you'd incur the penalty of boxing in order to pass to any existing method expecting an IList<T>. The same is true of arrays and strings. You can't avoid that additional allocation, you need a reference to a new instance of the type, which must be modified to be aware of its own offset and potentially imposed length.

As for pointers, you can already obtain a new pointer at a given offset and there is no concept of a length so there is really no benefit to slicing.

@ssylvan
ssylvan commented Jul 21, 2015

@HaloFour C# has GC, so I don't see what the problem is for List. The List would implement some magic interface that takes the slice bounds and returns a slice. It just needs to add the internal array as a reference or something to keep it from being reclaimed. This may make the slice a bit more magic than you'd like. It would be a struct that contains:

  1. A reference to the array, or a pointer to raw memory (for unsafe pointers, or fixed buffers, etc.).
  2. A start index
  3. A count
  4. A bit somewhere that indicates if 1 is a legit reference or a pointer.

Furthermore, the GC would have to treat slices specially by examining the bit (4 above) to follow the reference only if it's a real reference. So for a List, the Slice would keep the full reference (to the internal array) alive which seems to jive with what you'd expect it to do. Maybe the GC could (as a further improvement) track the union of slices for the array and free up regions that aren't needed.

@HaloFour

@ssylvan

The issue isn't a question of GC. If you want to pass this slice to a method that accepts List<T> then the slice itself needs to be an instance of List<T>. If you want to pass an array slice to a method that accepts an array then the slice itself needs to be an instance of an array. That heap allocation cannot be avoided, and the array/List<T> types themselves both need to be aware of the concept of slices.

Unless you're arguing for some mechanism to allow some struct to be considered equivalent to List<T>, but that sounds like a very aggressive CLR change.

@ssylvan
ssylvan commented Jul 21, 2015

@HaloFour

Well who cares about that? Slices are meant to be a cheaper and better abstraction over "array-like" things. Of course there will be a cost if you want to go back to a more specific type (you can't convert an IEnumerable to a List without an allocation, for example, I don't know why anyone would expect to be able to).

The point is that future methods will accept slices instead of arrays or IEnumerables, which means that they would enable callers to use them in more efficient ways. In order for that to pan out, it needs to be flexible enough (while still being a struct and thus cheap) to support many "array like" things. If I can't call a method that takes a slice because I used a stack allocated array then the performance benefits are moot in that case.

@HaloFour

@ssylvan

That is the original proposal on this thread. Seems that it was considered that slicing would be quite a bit more useful if it would enjoy some degree of CLR support and coexist more peacefully with the existing ecosystem. Does anyone expect that all new methods will be written/overloaded to accept some newfangled slice struct rather than just an array or a string? I don't.

@ssylvan
ssylvan commented Jul 21, 2015

@HaloFour Maybe not, but if your'e right the proposal should just be killed then. There's very little purpose to it unless you care about performance. And if you do care about performance, you would obviously be prepared to do a tiny amount of work to enable huge performance wins. The BCL would clearly get new overloads for methods that take arrays, or which currently take IEnumerables but could be more efficient with a slice. And any libraries or middleware that is performance sensitive (think Unity, and the like) would almost certainly be willing to enable much more efficient code at such a low cost. At a first approximation, replace any methods that take an array with one that takes a slice with unchanged method body. Done.

@HaloFour

@ssylvan I think that depends on what could be accomplished through the CLR and what the metrics of those changes would yield. If the CLR offered a way to obtain a portion of an array from an existing array at the cost of a single heap allocation I think it would be a worthwhile gain. Same with String and possibly List<T>.

Beyond that I am of the opinion that slicing should just be another overloadable operator (or potentially extension method) that can return any type.

@prasannavl

@HaloFour I agree. We require changes to the CLR to provide a view of the representation of memory. I'm curious to see if this makes sense to you: #120 (comment)?

@HaloFour

@prasannavl I think that would be great to have it both ways like that, even if the compiler didn't completely hide the detail of an intermediate struct representing the view. But would that work if slicing were permitted on multiple types, such as strings and arrays? Would a ReadOnlySlice<char> from an array be sufficient and safe to pass to String.FromSlice()? Would this struct have to keep a reference internally so that the root object(s) aren't collected?

@prasannavl

You are correct to ask those questions. I don't think its something that can be achieved without changes to the CLR. But for slicing to be done right I think this may be a necessary change.

We need some support from the CLR for providing a view of a memory of an object and ensure that they aren't collected.

While, in the above example, as you mentioned the struct itself can hold an internal reference, it would quickly become a dirty way of doing things - It would probably be a much better idea to implement a CLR specification for handling slicing properly rather than a partial slicing implementation, as given in the proposal.

While the proposal still provides efficiency in many scenarios, I'm concerned that it may proliferate across APIs with multiple overloads, making the whole environment cumbersome to work with.

@GSPP
GSPP commented Aug 20, 2015

A Slice<T> very much looks like an IList<T>. Maybe slices should not exist as a user-exposed concept. Rather, they should be internal subclasses of IList<T>. The slice syntax [:] could map to IList<T> as well.

IList<T> is capable of exposing parts of an array, of a List<T>, unmanaged memory, anything. Slicing does not need to be a concept that is user-visible.

Rather, we need JIT support for performance reasons. The JIT should have the capability to devirtualize virtual, interface and delegate calls. The techniques to do this are well known and field tested by the JVM. For example, if the code is:

for (int i = 0; i < list.Count; i++) { ... }

That loop would be compiled as a bailout initially (recompile()). The first time the loop is hit it is specialized to the exact runtime type found in list:

if (list.GetType() == typeof(ArraySlice<int>)) {
 for (int i = 0; i < list.Count; i++) {
  // access the array directly without a vcall.
  // the exact type of list is known here.
  // we can generate perfect as-if hand written code.
  // we can inline everything.
 }
}
else { recompile(); }

If the recompile() bailout is hit again some time in the future the method is re-jitted and another specialization is added. And so on.

This allows for a perfectly specialized, tight, vectorized, unrolled loop without range checks and other runtime tests. As if you had written C code.

This should put this slicing proposal on par or better than the initial proposal when it comes to performance. Also, this optimization benefits many other pieces of .NET code. In practice, any kind of vcall would no longer be an issue most of the time.

We already have many forms of "list" in .NET: Arrays, List<T>, IList<T>, ArraySegment<T>, pointers, and other collection types. Their usage is mixed and inconsistent. Let's not add another one. Rather, let's strive to unify them through a fast IList<T>.

@stephentoub does this approach cover the goals you wanted to achieve with this proposal? I think it does but without adding another way to describe what is essentially a list.

@stephentoub
Member

@stephentoub does this approach cover the goals you wanted to achieve with this proposal? I think it does

It'd be fine to have whatever framework type represents this also implement IList<T>, allowing a developer to choose to pass the slice into a function that accepted an IList<T> and pay the associated costs, but from my perspective that shouldn't be the primary/only access mechanism.

Even if the backend supported aggressive devirtualization, potentially with profile-guided optimizations to help pick the right types, representing this as an IList<T> has at least two issues: 1) it'd typically require either a reference type or a box, either of which requires an allocation, and 2) it wouldn't suffice for unmanaged memory, e.g. being handed a pointer to some data read off the wire and wanting to be able to process it with the same functions used to process managed data without first copying it to a managed array or wrapping it with some managed object. It's possible you could get some of the value for (1) if the backend also had extremely good support for escape analysis and stack allocation, but even if it did, that wouldn't address (2) (and it's unclear to me to what extent it would addres (1)). For (2) you'd need some way to express slicing a pointer, which I don't believe is worth the language/compiler complexity (but is an important scenario to support for performance for the folks that need it).

I think having a public value type to represent the concept is required, even if you provide additional options for how it's consumed, e.g. having it implement whatever interfaces are relevant, somehow enabling it to be "boxed" into a T[](which would require an allocation but avoid a memory copy), etc.

@GSPP
GSPP commented Aug 20, 2015

Regarding (2):

There could be an IList<T> that is backed by a base pointer and a length. After devirtualization that should collapse to the optimal code that one would hope for.

Creating a pointer-based slice would be as easy as new UnmanagedMemoryList<byte>(ptr, length).

You can slice to subregions by saying mySlice.Subslice(a, b). Maybe we indeed need a new interface on top of IList that allows for subslicing. Valid point, though.

If you care about not allocating on the heap you could make the various slice types public structs and statically type the variables with those struct types. That does not allow a function to take any kind of slice, though, which is another valid point.

You've found some valid issues.

@GrabYourPitchforks

If you do continue to investigate Slice and ReadonlySlice, please ensure that we can still get ref values and unsafe pointers to the underlying data. This is necessary for volatile memory access and interop scenarios. You may need to introduce the concept of "readonly ref" in order to satisfy immutability constraints.

@jods4
jods4 commented Sep 22, 2015

This is interesting because in some domains, using substrings or subarrays is very common and either (a) awkward (manually passing start index and length, not compatible with many existing methods/libs) or (b) costly (making new copies each time).

But I think that to be truly successful, the new Slice<T> type should be magically a subclass of T[] and string, etc. You need to be able to pass a Slice<int> to any method that takes an int[]. Conceptually that's not a problem as they operate on the same things: .Length, indexer...
Another option could be an implicit no-copy conversion.

If you don't do that slices will miss out on many existing libraries that don't handle them, including the BCL unless it's completely updated, which makes them glorified ArraySegment<T> and that's it.

In a world where you don't do that, maybe all libs will start accepting Slice<T> and only Slice<T> since they can be cheaply and implicitly created from T[] and they are more flexible... proving that the equivalence was needed!

@HaloFour

@omariom Unless I'm missing something it still looks like a surrogate struct making it unusable with the existing framework. I don't think we need another ArraySegment<T>.

@ssylvan
ssylvan commented Nov 10, 2015

@HaloFour No, we need a better ArraySegment. Plus, it implemente IEnumerable right? So you can still use it with legacy APIs (at the cost of a box). That said, it would obviously be far more useful if overloads were added to the BCL to accept slices directly (and avoid the allocation).

@omariom
omariom commented Nov 10, 2015

@HaloFour
According to Joe Duffy they are "still working out whether Slice will be a runtime feature, purely done as a library, and, related, what syntax to surface in the language."

@HaloFour

@ssylvan ArraySegment<T> is already IEnumerable<T> and IList<T>, hasn't helped it at all. You can't use it with any of the existing APIs that accept array or string. Most people don't treat strings as IEnumerable<char> and treating an array as IEnumerable<T> or IList<T> yields awful performance.

@omariom Understood. I am firmly in the camp that it's usefulness is tied directly in whether or not the slice can be implicitly used as an array or string with little/no performance penalty.

@ssylvan
ssylvan commented Nov 10, 2015

@HaloFour there are many reasons ArraySegment failed. Lack of BCL support is one of them, but then again the lack of BCL support is probably because ArraySegment is so gimped. I'd totally agree that in order for this to be useful anything that currently takes an IList or IEnumerable needs to get an overload that supports a slice (and not just by boxing it up and calling an existing implementation).

I don't think using it implicitly in as an array makes a lot of sense. The point of a slice is to be a lowest common denominator for a bunch of higher level concepts. Going the other way would be dicey.

I wish it would support List as well, though.. Maybe even generic ILists, but that would require some kind of extra reference in the struct, most like.

@HaloFour

@ssylvan Like anything else having to support ArraySegment<T> or Span<T> or Slice<T> increases the complexity considerably. As you mentioned, it's not a straight overload, and you immediately lose all of the supporting methods that exist around strings or arrays. And even then in the best of scenarios it will still be significantly slower than the overload designed to work with a string or an array. And the cases where you'd want a substring would most likely be very different from the cases where you'd want an array segment, so being a lowest common denominator makes little sense.

In my opinion I think that a slice of an array should produce something that is indistinguishable from an array as far as the framework is concerned. The CLR should provide an array instance that internally represents that view into a parent array, with the GC doing the appropriate book keeping. Ditto with strings.

@ssylvan
ssylvan commented Nov 10, 2015

@HaloFour I sympathize with your point of view, I just think it would be a tricky proposition to make all arrays and strings bigger/slower in order to make them representable as Spans. Now everyone pays a cost, even if they're not using Spans.

The idea is that a Span is the thing that represents generic data underneath, making strings and arrays into their own Span-like constructs seems messy. It sucks that it comes in so late and there's work to update APIs, but I think the better approach is to just take the hit and be very proactive in updating at least the internal APIs.

@HaloFour

@ssylvan

Now everyone pays a cost, even if they're not using Spans.

Indeed, there would be some additional overhead with strings and arrays across the board but I wouldn't expect it to be more than negligible.

It sucks that it comes in so late and there's work to update APIs, but I think the better approach is to just take the hit and be very proactive in updating at least the internal APIs.

Here is where the real cost lies. The cost in man-hours in updating the BCL in order to make Span<T> a first-class citizen is enormous, particularly writing the overloads as to not be garbage performance-wise. Then you have the rest of the libraries offered by Microsoft. Then you have the ecosystem as a whole. And every time you run into any scenario where Span<T> hasn't been specifically accommodated the entire house of cards comes crashing down.

If I were the C# designers I'd stay far, far away from a slicing syntax based on top of Span<T> until it can be demonstrated that said struct won't end up in the dustbin next to ArraySegment<T>. I'm sorry but I have exceedingly little faith that this won't be the case.

@ssylvan
ssylvan commented Nov 10, 2015

@HaloFour I think you're overstating things. The entire "house of cards" won't come crashing down because you can still always go back to passing it as an IEnumerable. You'd take a perf hit in those cases but it's now worse than it is now. The issues would be localized to just those problematic interfaces.

Yes it would be a pain to change APIs to use this, but it's not that big of a deal. It's mostly mechanical updates, and in some cases you could even use extension methods (e.g. an AddRange method that takes a span can be implemented pretty efficiently with public APIs). It's effort, but it's far less effort than adding generics, or async, was.

@HaloFour

@ssylvan

I think you're overstating things.

No doubt. 😀

The entire "house of cards" won't come crashing down because you can still always go back to passing it as an IEnumerable.

That assumes that such an overload would exist. My experience is that overloads that accept IEnumerable<char> instead of string are not very common.

It's effort, but it's far less effort than adding generics, or async, was.

You think? It did take generics some time to completely supplant the .NET 1.x collection types and the ecosystem was considerably smaller then. I'd say async/Task was similar in that it was largely replacing the less-frequently-employed Begin*/End*/IAsyncResult model. Passing around arrays/strings is quite pervasive across the runtime and ecosystem.

I would absolutely love to be wrong, though.

@prasannavl

I completely agree with @HaloFour - It has to be a data-structure that works with existing ArraySegment<T> or pure arrays, although I'm a little more biased that it should work with pure arrays.

Perhaps a different approach would be to let the runtime handle an Array<T> itself as a different type when something like Array.FromSlice or Array.ToSlice is used.. (Which in turns returns a pure array on both occasions.

If this can be done, we also avoid the overheads when dealing with normal arrays, and at the same time, we don't get an extra type so everything just works with all the existing libraries.

@jods4
jods4 commented Nov 11, 2015

I also completely agree with @HaloFour. A slice of string at least has to be usable like a string, so that the crazy amount of things working with strings continues to.

And it makes perfect sense. When you extract the text between chars 12 and 27 of a large body of text you're parsing (e.g.: an identifier in a C# source file), what do you expect the result will be? What will you use it for? Answer: it is logically a string. Not a weird Slice<string> or ArraySegment<T> that you can't use with all apis you're used to use for strings.

@ssylvan it does not necessarily have to incur a cost. A string is mostly made of a length and a pointer to the start of the char buffer. A slice of a string is basically the same thing, with the buffer pointing in the middle of another string's buffer. So at the binary level it could be the same and I don't see why there would be a runtime cost for using it.

The big problem with this vision is the memory leak / GC thing. Holding onto a tiny substring of a very large string would keep the large one in memory. Something not desirable and I think Java hit the problem when they tried to make substring work that way.

I don't know how/if the GC could be intelligent enough to decide it's time to extract a copy of a substring and release the larger one. Or maybe it's a dev responsability and this internal behaviour could be a documented difference between say string Substring() and string Slice().

Heck, this fact could be even more visible by returning two different types (string vs Slice<string>), as long as a Slice<string> can be used everywhere a string can.

@prasannavl

@jods4, very well put - my thoughts since the beginning exactly.

Heck, this fact could be even more visible by returning two different types (string vs Slice), as long as a Slice can be used everywhere a string can.

I agree completely with the semantics here. But I think this will open up the ecosystem to weird characteristics, having two different types be interchangeable. Perhaps, a derivative or a more abstract string. i.e, StringSlice : String, or String: StringSlice whichever is more appropriate depending on the implementation. And similar to be applied for Array<T> as well.

But I do think the better approach would be to figure a way like pinning.

Say,

var a = [1, 2, 3, 4, 5];
using (var x = Array.Slice(a, 0, 3)) // Another pure array referring internally to [1, 2, 3], with array a `pinned`
{
  // Use array x as normal here.
} // Unpin inside the GC. 

The same is applicable to strings.

@jods4
jods4 commented Nov 11, 2015

@prasannavl at the language level, "interchangeable" could also be done with an implicit conversion from slice to string.

Using a block is a bit constraining as it means your slice x cannot escape the method where it is built...

@prasannavl

@jods4, on the interchangeable front, implicit conversion will have to take a performance hit, unless more special compiler semantics come into play. Perhaps, it can be done cleanly, but looks like a rabbit hole.

The block really is just a way to constrain and simplify the GC mechanism. If not, I believe, the GC will have to go through a lot of complicated changes to figure out how to do it if we are to avoid extra types, but if it can be done, well I think most would agree that would be the best approach forward.

@redknightlois

Completely agree with almost everything discussed. If there is no runtime support for "type substitution" at least for my use cases it is useless because of the allocations and performance profile.

@ssylvan
ssylvan commented Nov 11, 2015

@jods4 If you want a string to be able to represent both a real string, and one of this fake "facades" then there at least needs to be some kind of flag and a branch each time you want to do anything with it. I.e. you'd need to know that the "buffer" inside the string isn't a real GC buffer, but rather a slice (which can be a slice of an array, or a native buffer, etc.). So at least this would make the string object bigger, and would also incur some overhead when using it (since there are no multiple possible kinds of sources of the underlying data).

If you made some kind of special "string slice" that only worked for strings (which would mistake, IMO), you would only ever have one "kind" of data source, but you'd stlil have to add an "offset" variable for the underlying data, which at least adds a word and some extra book-keeping for some operations.

If you make the runtime representation identical (some low level representation like a pointer and an length) you'd still need some kind of flag to tell the GC that this is the true owner of the data, vs being a slice (and then the GC would need to figure out what allocation the pointer is pointing to, sine it can be anywhere within it, and keep the parent allocation alive). This would be the cheapest option, but would still have some kind of cost. Presumably you'd hide the "is slice" bit in the length or something, but the resulting masking/shifting isn't free. Or maybe you just have a set of hard coded checks in the GC (like "is this a string? If so, the pointer may point to the middle of an allocation so go to the heap meta data to find the true pointer").

The only other alternative would be to make strings virtual everywhere so that the representation can change, but that would obviously be crazy town (allocations, terrible performance, etc.).

@prasannavl

@ssylvan, when we are dealing with the compiler level semantics, why are we concerned about making strings "virtual" and having book keeping flags?

We're are free to make the compiler plug and play the implementations, as long as the compiler can properly detect usage. But that's the key - for the compiler to be able to understand "slices" so that it appropriately plug the implementation. So, if we could focus on that, everything that you mention will get implicitly taken care of (even if not in the immediate v0 implementation)

To be more clear, I don't disagree with your concerns, but I think if we follow the discussion path above, and as clearly elaborated by @HaloFour and @jods4, I feel it would actually end up being addressed automatically.

@ssylvan
ssylvan commented Nov 11, 2015

@prasannavl If you change the CLR, and the compiler, then the slice type can be some low level type (pointer + length), and you'd have special magic for arrays, strings, List and so on to generate that. The key problem is to make the GC and allocator aware of these things because pointers are no longer necessarily pointing to header words, they could be pointing to the middle of some allocation, and you'd have to do some kind of data structure lookup to figure out what allocation it's really pointing to. This may need some major redesign of the allocator (not sure if the current one could do that) and GC.

@jnm2
jnm2 commented Nov 11, 2015

There's also the problem that strings are null-terminated. A memory slice
will not be zero terminated.

On Wed, Nov 11, 2015, 2:06 PM ssylvan notifications@github.com wrote:

@prasannavl https://github.com/prasannavl If you change the CLR, and
the compiler, then the slice type can be some low level type (pointer +
length), and you'd have special magic for arrays, strings, List and so on
to generate that. The key problem is to make the GC and allocator aware of
these things because pointers are no longer necessarily pointing to header
words, they could be pointing to the middle of some allocation, and you'd
have to do some kind of data structure lookup to figure out what allocation
it's really pointing to. This may need some major redesign of the
allocator (not sure if the current one could do that) and GC.


Reply to this email directly or view it on GitHub
#120 (comment).

@prasannavl

Just a tad bit off-topic - I haven't looked into the implementation of Golang.. But I'm curious as to how they do it, since slicing and memory pinning are a very idiomatic way to write Go code.

Perhaps, a thing or two can be learnt from there, especially considering now their GC has a 10ms guarantee.

@jods4
jods4 commented Nov 11, 2015

@jnm2 I think .net strings are actually only 0 terminated for compatibility when using Interop. The internal .net way of manipulating string is length + buffer. So the implication is mostly that marshalling a slice absolutely requires a copy of the string to add a null terminator, but this is not impacting managed code.

@jods4
jods4 commented Nov 18, 2015

Heard on the keynote today: C# 7 does not require any CLR changes, so you can use it on .NET 4.5 today.

What should we understand from this announcement?

  1. She was talking about the C# 7 preview only? (Everything is still possible)
  2. Slices are more of a CLR / BCL thing than language, so they will be available on vNext but not 4.5?
  3. Slices won't be in C# 7?
  4. Slices will be in C# 7 but they will be done purely as a library thing on top of existing CLR / BCL?

Personally, I hope for 1. or 2., because a library-based slice is mostly ArraySegment<T> and will fail. I believe slices of strings or arrays need to be compatible with all apis taking strings and arrays to be truly useful/successful and I don't think it can happen without some kind of CLR support.

@redknightlois

Hoping 1 or 2 too.

@gafter
Member
gafter commented Nov 18, 2015

I think she was talking about (1).

@gafter gafter removed the 1 - Planning label Nov 20, 2015
@gafter gafter assigned jaredpar and unassigned MadsTorgersen Nov 20, 2015
@gafter gafter added the 2 - Ready label Nov 20, 2015
@omariom
omariom commented Jan 3, 2016

@jods4 @HaloFour
2 points:

  • Arrays and strings are reference types. Do you want another allocation when creating a slice?
  • Arrays and strings have specific memory layouts on which JITed code relies. They include the data in themselves and at runtime can't be represented uniformly with slices.

They could make all the strings and arrays to be slices underneath and make CLR understand that but it will change allocation profile significantly - instead of one ref type object a string or an array will be represented with a frontend slice (of reference type) and a backend holding actual data (a ref type too). Too many objects, too much memory allocated.

@omariom
omariom commented Jan 3, 2016

Since 1.0 we have a lot of API accepting and return arrays instead of enumerables and readonly wrappers. We have to live with it.

Imo, a good trade off would be if slices implemented all the relevant methods of array and strings (with the same efficiency) and BCL classes gradually implemented overrides of the popular methods like TryParse etc.

@HaloFour
HaloFour commented Jan 3, 2016

@omariom

Do you want another allocation when creating a slice?

Want? No, but it's probably unavoidable to at least require an object allocation if the result were to be a string or array, short of some severe CLR/GC shenanigans. But I'd prefer a slightly-more-expensive version rather than a largely-useless-due-to-lack-of-support-throughout-the-ecosystem version.

@omariom
omariom commented Jan 3, 2016

Just checked how it is in other langs.
In Go, Rust and D slices are value types and have separate syntax from arrays and srtings.

@omariom
omariom commented Jan 3, 2016

I would prefer to have slices now for my own work rather than waiting till runtime is ready to implement them natively, uniformly nd most efficiently - it could be done in the next version.

@omariom
omariom commented Jan 3, 2016

@stephentoub
What will the Slice's indexer be returning? Reference or value?

@stephentoub
Member

@stephentoub What will the Slice's indexer be returning?

My hope would be a ref return, but we'll see. You can see current experimentation at https://github.com/dotnet/corefxlab/tree/master/src/System.Slices.

@omariom
omariom commented Jan 4, 2016

@stephentoub I've already sent a couple of pull requests there )

@jods4
jods4 commented Jan 5, 2016

@omariom

TL;DR sorry that comment ended up very long. Most interesting idea is probably point 3 at the end.

You raised some good points that got me into thinking.
Making slices behave like strings / array might be harder than I first imagined.

Let's talk about strings (arbitrarily, I think everything applies equally to arrays).

Why I still believe that efficient interop with string is required
Playing the devil's advocate: if a splice is a struct that is not compatible with strings, than you already have it today. If you are concerned with perfs in your string handling, most of the core string functions today accept a string + an offset + a length. Having a struct wrapping this info is more convenient than having 3 variables, but it's quite the same in the end.

In particular it creates two worlds: the (few) functions that have optimized versions for people who needs perf, and the rest of the world who uses string. The problem with that vision is that sometimes people who need perf also need more advanced functions and re-implementing everything is just plain wrong. The other problem is that often at some points both worlds collide.

One example and then I move on to implementation ideas:
Say you want to write an efficient XML parser. It does not take long to understand that calling Substring() on every syntactical piece is going to allocate lots of objects. So say you use the new slices for that.
When the user processes the file and wants to know each tag name, what do you return? A string probably. So a copy has to be made.
But say that you are able to read an attribute value as a slice (yay for perf). If you know this attribute is a number and want to parse it, do you have a int.Parse overload that takes a slice? Probably not so you need to make a copy at that point (and your parser is not as efficient as it could have been), or you implement your own int.Parse accepting slices (very wrong).

Can string/slice compatibility even work?
That problem sure is hard. You are right that a true string is more than a length and a pointer to a char buffer. It also starts with the syncblock and the vtable pointer. These can't be removed from a reference type, so to be 100% compatible with a string, a slice should have them as well... maybe?

Here are all the solutions that I can think of:

  1. Let's do the opposite! Implement everything as Slice<char> and make string implicitly convertible to Slice<char> (which is easy to do and very cheap, copy the length and pointer in a struct).
    Great idea if we started today. Maybe MS can pull this off in the BCL but there are 14 years worth of existing libraries out there :(
    Crazy idea: the JIT could do that? Convert any method that takes a string as a method that takes a Slice<char> and modify calling site accordingly. That seems a bit crazy to me but hey...
    Of course the main issue is that some methods can't be converted, e.g. if they lock the string, or use it as an object or interface, etc. In those cases the JIT should convert Slice<char> parameters to proper strings instead...
  2. Let's make Slice<char> a reference type. I think that easily solves most issues, as we could adopt exactly the same layout as string.
    It would mean heap allocation, something we strive to avoid in perf critical code. This is truly a huge drawback, probably not acceptable.
    But what if... .NET could allocate references on the stack? This would mostly solve the issue here and boost performance of many other cases. Doing this is a hard problem and requires careful escape analysis, but if it could work even just in basic cases, there could be lots of benefits in terms of perf.
  3. Make Slice<char> implicitly and efficiently (no copy) convertible to string.
    That's rather easy: allocate one new string ref and make its char buffer point in the middle of the existing slice target.
    That's a bit of a compromise: we incur one allocation, but we don't copy the buffer and we can use any existing method or return to any user code with a plain string.
    In the end, because of the "reference type" baggage of string, 0 allocation is probably not achievable anyway.

One big issue that remains is that if you can return a "slice" string, it can live longer than the underlying buffer, which might cause trouble to GC (keeping a huge buffer alive for a tiny substring).

@Thaina
Thaina commented Jan 16, 2016

I am going against your proposal because we already have ArraySegment and ReadOnlyCollection and also IEnumerable. The thing we really lack is the functionality of working with it like the actual array

Which is, instead, the feature return by ref. So we should put you slicer back into ArraySegment instead

@alrz
Contributor
alrz commented Jan 22, 2016

It would be nice if we could capture slices within array patterns,

swich(array) {
  case { var first, int[:] slice , var last }: ...
  // or perhaps
  case { var first, var slice.. , var last }: ...
}

or something like that.

@jods4
jods4 commented Jan 22, 2016

@alrz I think I would prefer a third syntax, similar to your second: { var first, var ...slice, var last}
I think it's more consistent with other langages (e.g. destructuring in ES6) and it doesn't preclude var usage (unlike your first suggestion). In langages that supports this, that syntax is usually consistent with spread operator (should C# ever get that?), and params arguments (although C# has a different take on this one).

@alrz
Contributor
alrz commented Jan 22, 2016

@jods4 I'm agree the first one is ambiguous, but what do you mean by "more consistent with other languages"?

@jods4
jods4 commented Jan 22, 2016

@alrz I was thinking about destructuring arrays in other languages, which is quite similar to pattern matching (albeit unconditionally).
But to be honest my impression wasn't correct about that. After doing some actual research there are as many variations as there are languages. A few examples:
ES6 does it like I suggested: let [first, ...middle, last] = array.
Coffeescript does the opposite (your way): [first, middle..., last] = array.
Ruby uses a star (splat): first, *rest = [1, 2, 3]
Clojure uses ampersand: let [[first & rest] vector]

So... forget about that comment! Altough I still like ES6/TS syntax ;)

@DerpMcDerp

Having a count is far more common than having the index of the end element so it would be more convenient to programmers if the syntax foo[i:n] meant [i, i+n) rather than [i, n).

Or an alternative is that current proposal could be kept intact but an implicit variable $i could be created within the scope of the rhs representing the value computed on the lhs:

foo[a + b:$i + n]; // $i == a + b
foo[a:/* $i's scope begins here */ $i + 2 /* $i's scope ends here */];
@ilexp
ilexp commented Apr 4, 2016

I haven't followed through the entire discussion, but what's the current idea on having "array views" / slices that have a different type than the original array they share data with, i.e. CoreClr issue 1015? Will this be possible within reasonable constraints?

General idea:

byte[] someRawData = /*...*/

// Create a different view on the same data without copying 
// (for performance and library communication reasons)
MyStruct[] interpretedBlittableData = Array.CreateView<MyStruct>(someRawData, /*...*/);
// (The above throws an exception if MyStruct isn't considered safe for this)

Could probably be considered a slicing addon.

@omariom
omariom commented Apr 12, 2016

@ilexp
In the prototype it is possible with Cast method.

@gminorcoles

I see a slice as an array of indexes, not as a subset of the original array-like object. Regardless of how you support this concept, I believe that this is key to usability.

@choikwa
choikwa commented Jun 18, 2016 edited

Couple of things that triggered me

  • Slice as its own type as opposed to Slice returning child array type
  • Slice as mutable reference to original array by default
  • Lack of mentioning aliasing analysis work for overlapping references

The easiest solution is to deepcopy everywhere and make everything mutable and let CLR/RyuJIT deal with it, but essentially those are what kills performance. What gives performance is constant/immutable references and copy-as-needed per mutation. String is already immutable which should make it easy to slice. I'm neither aliasing nor GC expert, so I can't comment on the problem of 'small ref holding onto large superset'. I'm hoping this is a problem experts have previously tackled and have mitigation strategies.

@ilexp
ilexp commented Jun 18, 2016

I see a slice as an array of indexes, not as a subset of the original array-like object. Regardless of how you support this concept, I believe that this is key to usability.

That sounds like it wouldn't provide the same performance as accessing a regular array, which would defeat the purpose of one of the use cases for slicing: Providing high-performance view access to an array / memory block without copying.

What gives performance is constant/immutable references and copy-as-needed per mutation.

If you're going to mention performance, I believe that argument is in favor of mutability, rather than immutability. How is it more performant if you have to copy all the data on every mutation? Slices being mutable allows lightweight, high-performance array / memory views for both read access and mutation. If they're immutable, you'll get only half the use cases and have to use costly copy operations for the other half. You can still decide to copy a mutable array / slice if you want to - or pass it around as an IReadOnlyList<T> if you don't want others to mutate it.

@cesarsouza
cesarsouza commented Jun 18, 2016 edited

+1 against having slice as an array of indices. It would be better to learn with frameworks/libraries that got it right, like for example Python's NumPy. Slices should represent views of the original array and are interpretable by ordinary functions just like ordinal arrays. It should be totally transparent for called functions whether they are processing an int[] or an int[5:10] or however an slice should be defined.

To be honest, I couldn't completely understand from the above discussion why sometimes touching the compiler is seem as something to be avoided. In my view, this is a critical feature for #10378 that cannot be left half-baked (such as for example having only a pure BCL solution). Also, deprecating ArraySegment, and re-implementing it in terms of array slices should also be considered as an option. It is not like there weren't any breaking changes since .NET 1.0.

Array slices (or more generally, safe memory views) are absolutely necessary for the success of C# as a language for high-performance computing. Right now, Python is taken way more seriously for high-performance computing than C#, and this really shouldn't have been the case (Python is a fine language though, but it was C# that initially proposed the non-compromise solution of handling unsafe contexts for more performant code, for example - as such, the fact that we are not being able to fulfill one of the first premises of the language might be a sign that even large or possibly breaking changes should be considered at this point).

@ilexp ilexp referenced this issue in dotnet/coreclr Jun 21, 2016
Open

Span<T> #5851

3 of 13 tasks complete
@choikwa
choikwa commented Jun 27, 2016 edited

@cesarsouza Python's Achilles heel is the GIL, and Python devs have the penchant for single-threaded performance over asynchronous, multithreaded operations. It is fine as a scripting language for synchronous tasks.

If you're going to mention performance, I believe that argument is in favor of mutability, rather than immutability.

@ilexp I'm going to go ahead and say "It depends on the context" and whichever choice roslyn gets means backend will have to adjust their heuristics to catch the other case. Immutability is a very important property for compiler optimizations.

@GeirGrusom

@ilexp I'm going to go ahead and say "It depends on the context" and whichever choice roslyn gets means backend will have to adjust their heuristics to catch the other case. Immutability is a very important property for compiler optimizations.

Arrays are mutable... making a immutable slice doesn't magically change that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment